snputils.processing.maasmds API documentation

maasMDS(snpobj, laiobj, labels_file, ancestry, is_masked: bool = True, prob_thresh: float = 0, average_strands: bool = False, is_weighted: bool = False, groups_to_remove: Dict[int, List[str]] = {}, min_percent_snps: float = 4, save_masks: bool = False, load_masks: bool = False, masks_file: Union[str, pathlib.Path] = 'masks.npz', distance_type: str = 'AP', n_components: int = 2, rsid_or_chrompos: int = 2)

    def __init__(
            self,
            snpobj,
            laiobj,
            labels_file,
            ancestry,
            is_masked: bool = True,
            prob_thresh: float = 0,
            average_strands: bool = False,
            is_weighted: bool = False,
            groups_to_remove: Dict[int, List[str]] = {},
            min_percent_snps: float = 4,
            save_masks: bool = False,
            load_masks: bool = False,
            masks_file: Union[str, pathlib.Path] = 'masks.npz',
            distance_type: str = 'AP',
            n_components: int = 2,
            rsid_or_chrompos: int = 2
        ):
        """
        Args:
            snpobj (SNPObject, optional):
                A SNPObject instance.
            laiobj (LocalAncestryObject, optional):
                A LocalAncestryObject instance.
            labels_file (str, optional):
                Path to the labels file in .tsv format. The first column, `indID`, contains the individual identifiers, and the second
                column, `label`, specifies the groups for all individuals. If `is_weighted=True`, a `weight` column with individual
                weights is required. Optionally, `combination` and `combination_weight` columns can specify sets of individuals to be
                combined into groups, with respective weights.
            ancestry (str, optional):
                Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at `0`.
            is_masked (bool, default=True):
                True if an ancestry file is passed for ancestry-specific masking, or False otherwise.
            prob_thresh (float, default=0.0):
                Minimum probability threshold for a SNP to belong to an ancestry.
            average_strands (bool, default=False):
                True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.
            is_weighted (bool, default=False):
                True if weights are provided in the labels file, or False otherwise.
            groups_to_remove (dict of int to list of str, default={}):
                Dictionary specifying groups to exclude from analysis. Keys are array numbers, and values are
                lists of groups to remove for each array.
                Example: `{1: ['group1', 'group2'], 2: [], 3: ['group3']}`.
            min_percent_snps (float, default=4.0):
                Minimum percentage of SNPs that must be known (unmasked) in an individual for that individual to be
                included in the analysis. Individuals with a lower percentage of unmasked SNPs are excluded.
            save_masks (bool, default=False):
                True if the masked matrices are to be saved in a `.npz` file, or False otherwise.
            load_masks (bool, default=False):
                True if the masked matrices are to be loaded from a pre-existing `.npz` file specified by `masks_file`, or False otherwise.
            masks_file (str or pathlib.Path, default='masks.npz'):
                Path to the `.npz` file used for saving/loading masked matrices.
            distance_type (str, default='AP'):
                Distance metric to use. Options to choose from are: 'Manhattan', 'RMS' (Root Mean Square), 'AP' (Average Pairwise).
                If `average_strands=True`, use `distance_type='AP'`.
            n_components (int, default=2):
                The number of principal components.
            rsid_or_chrompos (int, default=2):
                Format indicator for SNP IDs in the SNP data. Use 1 for `rsID` format or 2 for `chromosome_position`.
        """
        self.__snpobj = snpobj
        self.__laiobj = laiobj
        self.__labels_file = labels_file
        self.__ancestry = ancestry
        self.__is_masked = is_masked
        self.__prob_thresh = prob_thresh
        self.__average_strands = average_strands
        self.__groups_to_remove = groups_to_remove
        self.__min_percent_snps = min_percent_snps
        self.__is_weighted = is_weighted
        self.__save_masks = save_masks
        self.__load_masks = load_masks
        self.__masks_file = masks_file
        self.__distance_type = distance_type
        self.__n_components = n_components
        self.__rsid_or_chrompos = rsid_or_chrompos
        self.__X_new_ = None  # Store transformed SNP data
        self.__haplotypes_ = None  # Store haplotypes after filtering if min_percent_snps > 0
        self.__samples_ = None  # Store samples after filtering if min_percent_snps > 0

        # Fit and transform if a `snpobj`, `laiobj`, `labels_file`, and `ancestry` are provided
        if self.snpobj is not None and self.laiobj is not None and self.labels_file is not None and self.ancestry is not None:
            self.fit_transform(snpobj, laiobj, labels_file, ancestry)

Arguments:

snpobj (SNPObject, optional): A SNPObject instance.
laiobj (LocalAncestryObject, optional): A LocalAncestryObject instance.
labels_file (str, optional): Path to the labels file in .tsv format. The first column, indID, contains the individual identifiers, and the second column, label, specifies the groups for all individuals. If is_weighted=True, a weight column with individual weights is required. Optionally, combination and combination_weight columns can specify sets of individuals to be combined into groups, with respective weights.
ancestry (str, optional): Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0.
is_masked (bool, default=True): True if an ancestry file is passed for ancestry-specific masking, or False otherwise.
prob_thresh (float, default=0.0): Minimum probability threshold for a SNP to belong to an ancestry.
average_strands (bool, default=False): True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.
is_weighted (bool, default=False): True if weights are provided in the labels file, or False otherwise.
groups_to_remove (dict of int to list of str, default={}): Dictionary specifying groups to exclude from analysis. Keys are array numbers, and values are lists of groups to remove for each array. Example: {1: ['group1', 'group2'], 2: [], 3: ['group3']}.
min_percent_snps (float, default=4.0): Minimum percentage of SNPs that must be known (unmasked) in an individual for that individual to be included in the analysis. Individuals with a lower percentage of unmasked SNPs are excluded.
save_masks (bool, default=False): True if the masked matrices are to be saved in a .npz file, or False otherwise.
load_masks (bool, default=False): True if the masked matrices are to be loaded from a pre-existing .npz file specified by masks_file, or False otherwise.
masks_file (str or pathlib.Path, default='masks.npz'): Path to the .npz file used for saving/loading masked matrices.
distance_type (str, default='AP'): Distance metric to use. Options to choose from are: 'Manhattan', 'RMS' (Root Mean Square), 'AP' (Average Pairwise). If average_strands=True, use distance_type='AP'.
n_components (int, default=2): The number of principal components.
rsid_or_chrompos (int, default=2): Format indicator for SNP IDs in the SNP data. Use 1 for rsID format or 2 for chromosome_position.
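
The labels file described for labels_file can be produced with the standard library alone. The following is a minimal sketch: the helper name `write_labels_file`, the individual identifiers, and the group names are hypothetical, and only the column names (`indID`, `label`, `weight`) come from the argument description above.

```python
import csv
import tempfile
from pathlib import Path


def write_labels_file(path, rows, weighted=False):
    """Write a tab-separated labels file with `indID`, `label` and, optionally, `weight` columns."""
    fieldnames = ["indID", "label"] + (["weight"] if weighted else [])
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in rows:
            # Keep only the documented columns, in the documented order.
            writer.writerow({name: row[name] for name in fieldnames})


# Hypothetical individuals, groups and weights, for illustration only.
example_rows = [
    {"indID": "ind1", "label": "groupA", "weight": 1.0},
    {"indID": "ind2", "label": "groupA", "weight": 0.5},
    {"indID": "ind3", "label": "groupB", "weight": 1.0},
]

labels_path = Path(tempfile.mkdtemp()) / "labels.tsv"
write_labels_file(labels_path, example_rows, weighted=True)
print(labels_path.read_text())
```

The resulting path is what would be passed as labels_file, with is_weighted=True because a weight column is present.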

def copy(self) -> maasMDS:

    def copy(self) -> 'maasMDS':
        """
        Create and return a shallow copy of `self`.

        Returns:
            **maasMDS:**
                A new instance of the current object.
        """
        # `copy.copy` is shallow: mutable attributes are shared with the original
        return copy.copy(self)

Create and return a shallow copy of self.

Returns:

maasMDS: A new instance of the current object.
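
Because the method delegates to `copy.copy`, the copy is shallow. The sketch below shows what that means for a mutable attribute such as a `groups_to_remove` dictionary; the `Settings` class is a hypothetical stand-in, not part of snputils.

```python
import copy


class Settings:
    """Hypothetical stand-in for an object that holds a mutable attribute."""

    def __init__(self, groups_to_remove):
        self.groups_to_remove = groups_to_remove


original = Settings({1: ["group1"]})
shallow = copy.copy(original)      # new object, same underlying dict
deep = copy.deepcopy(original)     # new object, independent dict

# Mutating through the shallow copy is visible through the original too.
shallow.groups_to_remove[1].append("group2")

print(original.groups_to_remove)   # {1: ['group1', 'group2']}
print(deep.groups_to_remove)       # {1: ['group1']}
```

If fully independent state is needed, `copy.deepcopy` on the instance is the usual alternative.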

snpobj: Optional[snputils.snp.genobj.SNPObject]

    @property
    def snpobj(self) -> Optional['SNPObject']:
        """
        Retrieve `snpobj`.

        Returns:
            **SNPObject:** A SNPObject instance.
        """
        return self.__snpobj

Retrieve snpobj.

Returns:

SNPObject: A SNPObject instance.

laiobj: Optional[snputils.ancestry.genobj.LocalAncestryObject]

    @property
    def laiobj(self) -> Optional['LocalAncestryObject']:
        """
        Retrieve `laiobj`.

        Returns:
            **LocalAncestryObject:** A LocalAncestryObject instance.
        """
        return self.__laiobj

Retrieve laiobj.

Returns:

LocalAncestryObject: A LocalAncestryObject instance.

labels_file: Optional[str]

    @property
    def labels_file(self) -> Optional[str]:
        """
        Retrieve `labels_file`.

        Returns:
            **str:**
                Path to the labels file in `.tsv` format.
        """
        return self.__labels_file

Retrieve labels_file.

Returns:

str: Path to the labels file in .tsv format.

ancestry: Optional[str]

    @property
    def ancestry(self) -> Optional[str]:
        """
        Retrieve `ancestry`.

        Returns:
            **str:** Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at `0`.
        """
        return self.__ancestry

Retrieve ancestry.

Returns:

str: Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0.

is_masked: bool

    @property
    def is_masked(self) -> bool:
        """
        Retrieve `is_masked`.

        Returns:
            **bool:** True if an ancestry file is passed for ancestry-specific masking, or False otherwise.
        """
        return self.__is_masked

Retrieve is_masked.

Returns:

bool: True if an ancestry file is passed for ancestry-specific masking, or False otherwise.

prob_thresh: float

    @property
    def prob_thresh(self) -> float:
        """
        Retrieve `prob_thresh`.

        Returns:
            **float:** Minimum probability threshold for a SNP to belong to an ancestry.
        """
        return self.__prob_thresh

Retrieve prob_thresh.

Returns:

float: Minimum probability threshold for a SNP to belong to an ancestry.

average_strands: bool

    @property
    def average_strands(self) -> bool:
        """
        Retrieve `average_strands`.

        Returns:
            **bool:** True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.
        """
        return self.__average_strands

Retrieve average_strands.

Returns:

bool: True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.

is_weighted: bool

    @property
    def is_weighted(self) -> bool:
        """
        Retrieve `is_weighted`.

        Returns:
            **bool:** True if weights are provided in the labels file, or False otherwise.
        """
        return self.__is_weighted

Retrieve is_weighted.

Returns:

bool: True if weights are provided in the labels file, or False otherwise.

groups_to_remove: Dict[int, List[str]]

    @property
    def groups_to_remove(self) -> Dict[int, List[str]]:
        """
        Retrieve `groups_to_remove`.

        Returns:
            **dict of int to list of str:** Dictionary specifying groups to exclude from analysis. Keys are array numbers, and values are
                lists of groups to remove for each array. Example: `{1: ['group1', 'group2'], 2: [], 3: ['group3']}`.
        """
        return self.__groups_to_remove

Retrieve groups_to_remove.

Returns:

dict of int to list of str: Dictionary specifying groups to exclude from analysis. Keys are array numbers, and values are lists of groups to remove for each array. Example: {1: ['group1', 'group2'], 2: [], 3: ['group3']}.

min_percent_snps: float

    @property
    def min_percent_snps(self) -> float:
        """
        Retrieve `min_percent_snps`.

        Returns:
            **float:**
                Minimum percentage of SNPs that must be known (unmasked) in an individual for that individual to be
                included in the analysis. Individuals with a lower percentage of unmasked SNPs are excluded.
        """
        return self.__min_percent_snps

Retrieve min_percent_snps.

Returns:

float: Minimum percentage of SNPs that must be known (unmasked) in an individual for that individual to be included in the analysis. Individuals with a lower percentage of unmasked SNPs are excluded.
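
The thresholding rule can be illustrated with NumPy. This is a sketch under stated assumptions rather than the library's own filtering code: it assumes masked SNPs are stored as NaN in a matrix with one row per haplotype (or individual), and the helper name `rows_with_enough_snps` is hypothetical.

```python
import numpy as np


def rows_with_enough_snps(matrix, min_percent_snps):
    """Return a boolean array marking rows whose share of known (non-NaN) SNPs meets the threshold."""
    percent_known = 100.0 * np.mean(~np.isnan(matrix), axis=1)
    # Rows strictly below the threshold are excluded, so the comparison is >=.
    return percent_known >= min_percent_snps


# Hypothetical masked matrix: rows are haplotypes, columns are SNPs, NaN marks a masked SNP.
masked = np.array([
    [0.0, 1.0, np.nan, 1.0],            # 75% known
    [np.nan, np.nan, np.nan, np.nan],   # 0% known
    [1.0, np.nan, np.nan, np.nan],      # 25% known
])

keep = rows_with_enough_snps(masked, min_percent_snps=4)
print(keep.tolist())  # [True, False, True]
```

With the default of 4, only the fully masked row is dropped here; raising the threshold to 50 would also drop the third row.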

save_masks: bool

    @property
    def save_masks(self) -> bool:
        """
        Retrieve `save_masks`.

        Returns:
            **bool:** True if the masked matrices are to be saved in a `.npz` file, or False otherwise.
        """
        return self.__save_masks

Retrieve save_masks.

Returns:

bool: True if the masked matrices are to be saved in a .npz file, or False otherwise.

load_masks: bool

    @property
    def load_masks(self) -> bool:
        """
        Retrieve `load_masks`.

        Returns:
            **bool:**
                True if the masked matrices are to be loaded from a pre-existing `.npz` file specified
                by `masks_file`, or False otherwise.
        """
        return self.__load_masks

Retrieve load_masks.

Returns:

bool: True if the masked matrices are to be loaded from a pre-existing .npz file specified by masks_file, or False otherwise.

masks_file: Union[str, pathlib.Path]

    @property
    def masks_file(self) -> Union[str, pathlib.Path]:
        """
        Retrieve `masks_file`.

        Returns:
            **str or pathlib.Path:** Path to the `.npz` file used for saving/loading masked matrices.
        """
        return self.__masks_file

Retrieve masks_file.

Returns:

str or pathlib.Path: Path to the .npz file used for saving/loading masked matrices.

distance_type: str

    @property
    def distance_type(self) -> str:
        """
        Retrieve `distance_type`.

        Returns:
            **str:**
                Distance metric to use. Options to choose from are: 'Manhattan', 'RMS' (Root Mean Square), 'AP' (Average Pairwise).
                If `average_strands=True`, use `distance_type='AP'`.
        """
        return self.__distance_type

Retrieve distance_type.

Returns:

str: Distance metric to use. Options to choose from are: 'Manhattan', 'RMS' (Root Mean Square), 'AP' (Average Pairwise). If average_strands=True, use distance_type='AP'.
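
To give a feel for distances on masked data, the sketch below computes Manhattan-style and RMS distances between two rows using only the SNPs known in both. It is an illustration under assumptions, not the library's `distance_mat`: masked SNPs are assumed to be NaN, both distances are normalised by the number of shared positions (so the Manhattan variant is a mean absolute difference), the helper name `masked_distance` is hypothetical, and 'AP' is left out because its exact definition is not given on this page.

```python
import numpy as np


def masked_distance(a, b, kind="Manhattan"):
    """Distance between two rows, using only positions that are known (non-NaN) in both."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return float("nan")  # no overlapping SNPs, so the distance is undefined
    diff = a[shared] - b[shared]
    if kind == "Manhattan":
        # Assumption: normalise by the number of shared positions.
        return float(np.mean(np.abs(diff)))
    if kind == "RMS":
        return float(np.sqrt(np.mean(diff ** 2)))
    raise ValueError(f"Unsupported kind: {kind!r}")


# Hypothetical haplotype rows; only the first two SNPs are known in both.
row_a = np.array([0.0, 1.0, np.nan, 1.0])
row_b = np.array([1.0, 1.0, 0.0, np.nan])

manhattan = masked_distance(row_a, row_b, kind="Manhattan")
rms = masked_distance(row_a, row_b, kind="RMS")
print(manhattan, rms)
```

Normalising by the overlap keeps distances comparable between pairs that share many SNPs and pairs that share few, which matters when masking removes different positions in each row.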

n_components: int

    @property
    def n_components(self) -> int:
        """
        Retrieve `n_components`.

        Returns:
            **int:** The number of principal components.
        """
        return self.__n_components

Retrieve n_components.

Returns:

int: The number of principal components.

rsid_or_chrompos: int

    @property
    def rsid_or_chrompos(self) -> int:
        """
        Retrieve `rsid_or_chrompos`.

        Returns:
            **int:** Format indicator for SNP IDs in the SNP data. Use 1 for `rsID` format or 2 for `chromosome_position`.
        """
        return self.__rsid_or_chrompos

Retrieve rsid_or_chrompos.

Returns:

int: Format indicator for SNP IDs in the SNP data. Use 1 for rsID format or 2 for chromosome_position.

X_new_: Optional[numpy.ndarray]

    @property
    def X_new_(self) -> Optional[np.ndarray]:
        """
        Retrieve `X_new_`.

        Returns:
            **array of shape (n_haplotypes_, n_components):**
                The transformed SNP data projected onto the `n_components` principal components.
                n_haplotypes_ is the number of haplotypes, potentially reduced if filtering is applied
                (`min_percent_snps > 0`). For diploid individuals without filtering, the shape is
                `(n_samples * 2, n_components)`.
        """
        return self.__X_new_

Retrieve X_new_.

Returns:

array of shape (n_haplotypes_, n_components): The transformed SNP data projected onto the n_components principal components. n_haplotypes_ is the number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0). For diploid individuals without filtering, the shape is (n_samples * 2, n_components).

haplotypes_: Optional[List[str]]

    @property
    def haplotypes_(self) -> Optional[List[str]]:
        """
        Retrieve `haplotypes_`.

        Returns:
            list of str:
                A list of unique haplotype identifiers.
        """
        if isinstance(self.__haplotypes_, np.ndarray):
            return self.__haplotypes_.ravel().tolist()  # Flatten and convert NumPy array to a list
        elif isinstance(self.__haplotypes_, list):
            if len(self.__haplotypes_) == 1 and isinstance(self.__haplotypes_[0], np.ndarray):
                return self.__haplotypes_[0].ravel().tolist()  # Handle list containing a single array
            return self.__haplotypes_  # Already a flat list
        elif self.__haplotypes_ is None:
            return None  # If no haplotypes are set
        else:
            raise TypeError("`haplotypes_` must be a list or a NumPy array.")

Retrieve haplotypes_.

Returns:

list of str: A list of unique haplotype identifiers.

samples_: Optional[List[str]]

    @property
    def samples_(self) -> Optional[List[str]]:
        """
        Retrieve `samples_`.

        Returns:
            list of str:
                A list of sample identifiers based on `haplotypes_` and `average_strands`.
        """
        haplotypes = self.haplotypes_
        if haplotypes is None:
            return None
        if self.__average_strands:
            return haplotypes
        else:
            # Drop the trailing two-character suffix from each haplotype identifier
            return [x[:-2] for x in haplotypes]

Retrieve samples_.

Returns:

list of str: A list of sample identifiers based on haplotypes_ and average_strands.
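
The mapping from haplotype identifiers to sample identifiers can be reproduced outside the class. The sketch below mirrors the logic of the property; the function name `samples_from_haplotypes` is hypothetical, and the `_A`/`_B` suffix format in the example is an assumption, since the property only relies on the suffix being two characters long.

```python
def samples_from_haplotypes(haplotypes, average_strands):
    """Map haplotype identifiers to sample identifiers, mirroring the `samples_` property."""
    if haplotypes is None:
        return None
    if average_strands:
        # One row per individual: identifiers are already sample IDs.
        return list(haplotypes)
    # One row per strand: drop the two-character strand suffix.
    return [hap[:-2] for hap in haplotypes]


# Hypothetical identifiers; the "_A"/"_B" suffix format is an assumption.
haps = ["HG001_A", "HG001_B", "HG002_A"]
samples = samples_from_haplotypes(haps, average_strands=False)

print(samples)            # ['HG001', 'HG001', 'HG002']
print(len(set(samples)))  # 2
```

The result keeps one entry per haplotype, so a sample appears twice when both of its strands pass filtering; counting distinct samples requires deduplication.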

n_haplotypes: Optional[int]

    @property
    def n_haplotypes(self) -> Optional[int]:
        """
        Retrieve `n_haplotypes`.

        Returns:
            **int:**
                The total number of haplotypes, potentially reduced if filtering is applied
                (`min_percent_snps > 0`), or None if no haplotypes are set.
        """
        # Use the normalized (flattened) list and guard against the unset case
        haplotypes = self.haplotypes_
        return None if haplotypes is None else len(haplotypes)

Retrieve n_haplotypes.

Returns:

int: The total number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0).

n_samples: Optional[int]

    @property
    def n_samples(self) -> Optional[int]:
        """
        Retrieve `n_samples`.

        Returns:
            **int:**
                The total number of samples, potentially reduced if filtering is applied
                (`min_percent_snps > 0`), or None if no samples are set.
        """
        # Guard against the unset case before counting distinct sample identifiers
        samples = self.samples_
        return None if samples is None else len(np.unique(samples))

Retrieve n_samples.

Returns:

int: The total number of samples, potentially reduced if filtering is applied (min_percent_snps > 0).

def fit_transform(self, snpobj: Optional[snputils.snp.genobj.SNPObject] = None, laiobj: Optional[snputils.ancestry.genobj.LocalAncestryObject] = None, labels_file: Optional[str] = None, ancestry: Optional[str] = None, average_strands: Optional[bool] = None) -> numpy.ndarray:

    def fit_transform(
            self,
            snpobj: Optional['SNPObject'] = None,
            laiobj: Optional['LocalAncestryObject'] = None,
            labels_file: Optional[str] = None,
            ancestry: Optional[str] = None,
            average_strands: Optional[bool] = None
        ) -> np.ndarray:
        """
        Fit the model to the SNP data stored in the provided `snpobj` and apply the dimensionality reduction on the same SNP data.

        Args:
            snpobj (SNPObject, optional):
                A SNPObject instance.
            laiobj (LocalAncestryObject, optional):
                A LocalAncestryObject instance.
            labels_file (str, optional):
                Path to the labels file in .tsv format. The first column, `indID`, contains the individual identifiers, and the second
                column, `label`, specifies the groups for all individuals. If `is_weighted=True`, a `weight` column with individual
                weights is required. Optionally, `combination` and `combination_weight` columns can specify sets of individuals to be
                combined into groups, with respective weights.
            ancestry (str, optional):
                Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0.
            average_strands (bool, optional):
                True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.
                If None, defaults to `self.average_strands`.

        Returns:
            **array of shape (n_haplotypes_, n_components):**
                The transformed SNP data projected onto the `n_components` principal components, stored in `self.X_new_`.
        """
        # Fall back to the values stored on the instance for any argument left as None
        if snpobj is None:
            snpobj = self.snpobj
        if laiobj is None:
            laiobj = self.laiobj
        if labels_file is None:
            labels_file = self.labels_file
        if ancestry is None:
            ancestry = self.ancestry
        if average_strands is None:
            average_strands = self.average_strands

        # NOTE: the calls below read the instance attributes (`self.snpobj`, `self.laiobj`, ...),
        # not the local values resolved above
        if not self.is_masked:
            self.ancestry = '1'
        if self.load_masks:
            # Reuse masked matrices saved by a previous run
            masks, rs_ID_list, ind_ID_list, groups, weights = self._load_masks_file(self.masks_file)
        else:
            # Build the masked matrices from the SNP and local-ancestry data
            masks, rs_ID_list, ind_ID_list = array_process(
                self.snpobj,
                self.laiobj,
                self.average_strands,
                self.prob_thresh,
                self.is_masked,
                self.rsid_or_chrompos
            )

            # Attach group labels and weights, applying `min_percent_snps` and `groups_to_remove`
            masks, ind_ID_list, groups, weights = process_labels_weights(
                self.labels_file,
                masks,
                rs_ID_list,
                ind_ID_list,
                self.average_strands,
                self.ancestry,
                self.min_percent_snps,
                self.groups_to_remove,
                self.is_weighted,
                self.save_masks,
                self.masks_file
            )

        # Pairwise distances for the selected ancestry
        distance_list = [[distance_mat(first=masks[0][self.ancestry], dist_func=self.distance_type)]]

        # Embed the distances with MDS and keep the matching identifiers
        self.X_new_ = mds_transform(distance_list, groups, weights, ind_ID_list, self.n_components)
        self.haplotypes_ = ind_ID_list

        # Return the embedding, as documented in the docstring and the return annotation
        return self.X_new_

Fit the model to the SNP data stored in the provided snpobj and apply the dimensionality reduction on the same SNP data.

Arguments:

snpobj (SNPObject, optional): A SNPObject instance.
laiobj (LocalAncestryObject, optional): A LocalAncestryObject instance.
labels_file (str, optional): Path to the labels file in .tsv format. The first column, indID, contains the individual identifiers, and the second column, label, specifies the groups for all individuals. If is_weighted=True, a weight column with individual weights is required. Optionally, combination and combination_weight columns can specify sets of individuals to be combined into groups, with respective weights.
ancestry (str, optional): Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0.
average_strands (bool, optional): True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise. If None, defaults to self.average_strands.

Returns:

array of shape (n_haplotypes_, n_components): The transformed SNP data projected onto the n_components principal components, stored in self.X_new_. When average_strands=True there is one row per individual rather than one per haplotype.
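
For readers unfamiliar with the final embedding step, the sketch below shows classical (Torgerson) MDS on a complete distance matrix using NumPy. It is an illustration only: `classical_mds` is a hypothetical helper, not the library's `mds_transform`, and it ignores the missing distances, groups, and weights that the method above passes along.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Embed points from a symmetric distance matrix using classical (Torgerson) MDS."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram (inner-product) matrix.
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by round-off before taking the square root.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Hypothetical example: four points on a line, described only by their pairwise distances.
points = np.array([0.0, 1.0, 3.0, 6.0])
distances = np.abs(points[:, None] - points[None, :])

embedding = classical_mds(distances, n_components=2)
print(embedding.shape)  # (4, 2)
```

The output has one row per input and `n_components` columns, matching the layout described for X_new_. The embedding is only defined up to rotation and reflection, so the coordinates themselves may differ in sign between runs or libraries while the pairwise distances are preserved.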