Processing

Dimensionality-reduction and ancestry-aware processing classes.

class snputils.processing.PCA(snpobj=None, backend='sklearn', n_components=2, fitting='exact', device='cpu', average_strands=True, samples_subset=None, snps_subset=None)[source]

Bases: object

Principal Component Analysis (PCA) for SNP data.

This class wraps either sklearn.decomposition.PCA or the custom TorchPCA backend.

The fitting parameter selects exact vs approximate SVD on both backends (see __init__).

The class supports separate or averaged strand processing. If snpobj is provided at construction time, fit_transform is called automatically.

Parameters:
  • snpobj (SNPObject, optional) – A SNPObject instance.

  • backend (str, default='sklearn') – The backend to use (‘sklearn’ or ‘pytorch’). Default is ‘sklearn’.

  • n_components (int, default=2) – The number of principal components. Default is 2.

  • fitting (str, default='exact') – SVD mode for both backends. Use 'exact' for standard decomposition (PyTorch torch.linalg.svd or sklearn svd_solver='full'), or 'lowrank' for faster approximate decomposition (PyTorch torch.svd_lowrank or sklearn svd_solver='randomized'). Default is 'exact'.

  • device (str, default='cpu') – Device to use (‘cpu’, ‘gpu’, ‘cuda’, or ‘cuda:<index>’). Default is ‘cpu’.

  • average_strands (bool, default=True) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.

  • samples_subset (int or list of int, optional) – Subset of samples to include, as an integer for the first samples or a list of sample indices.

  • snps_subset (int or list of int, optional) – Subset of SNPs to include, as an integer for the first SNPs or a list of SNP indices.
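The effect of average_strands, samples_subset, and snps_subset can be sketched with plain NumPy. This is only an illustration, not the library's implementation: the helper prepare_matrix is hypothetical, and the (n_snps, n_samples, 2) genotype layout is an assumption.

```python
import numpy as np

def prepare_matrix(genotypes, average_strands=True, samples_subset=None, snps_subset=None):
    """Build a (rows, n_snps) matrix from an assumed (n_snps, n_samples, 2) genotype array."""
    g = np.asarray(genotypes, dtype=float)

    # An integer keeps the first k entries; a list selects explicit indices.
    if snps_subset is not None:
        idx = np.arange(snps_subset) if isinstance(snps_subset, int) else np.asarray(snps_subset)
        g = g[idx]
    if samples_subset is not None:
        idx = np.arange(samples_subset) if isinstance(samples_subset, int) else np.asarray(samples_subset)
        g = g[:, idx]

    if average_strands:
        # One row per individual: the mean of the two haplotypes.
        return g.mean(axis=2).T
    # One row per haplotype: both strands of each individual are kept separately.
    n_snps, n_samples, _ = g.shape
    return g.reshape(n_snps, n_samples * 2).T

# 2 SNPs, 3 samples, 2 strands per sample.
genotypes = np.array([[[0, 1], [1, 1], [0, 0]],
                      [[1, 1], [0, 0], [0, 1]]])
averaged = prepare_matrix(genotypes)                                            # shape (3, 2)
separate = prepare_matrix(genotypes, average_strands=False, samples_subset=2)   # shape (4, 2)
```

With averaging, the matrix has one row per individual; without it, each individual contributes two rows.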

property snpobj

Retrieve snpobj.

Returns:

SNPObject – A SNPObject instance.

property backend

Retrieve backend.

Returns:

str – The backend to use (‘sklearn’ or ‘pytorch’).

property n_components

Retrieve n_components.

Returns:

int – The number of principal components.

property fitting

Retrieve fitting.

Returns:

str – 'exact' or 'lowrank' (same meaning for the sklearn and PyTorch backends).

property device

Retrieve device.

Returns:

torch.device – Device to use (‘cpu’, ‘gpu’, ‘cuda’, or ‘cuda:<index>’).

property average_strands

Retrieve average_strands.

Returns:

bool – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.

property samples_subset

Retrieve samples_subset.

Returns:

int or list of int – Subset of samples to include, as an integer for the first samples or a list of sample indices.

property snps_subset

Retrieve snps_subset.

Returns:

int or list of int – Subset of SNPs to include, as an integer for the first SNPs or a list of SNP indices.

property n_components_

Retrieve n_components_.

Returns:

int – The effective number of components retained after fitting, calculated as min(self.n_components, min(n_samples, n_snps)).
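As a small worked example of the formula above, the following sketch computes the effective number of components for a few input shapes. The function name effective_n_components is hypothetical and only mirrors the expression given in the description.

```python
def effective_n_components(n_components, n_samples, n_snps):
    """Return min(n_components, min(n_samples, n_snps))."""
    return min(n_components, min(n_samples, n_snps))

# Requesting 10 components from 4 samples and 1000 SNPs yields only 4.
few_samples = effective_n_components(10, 4, 1000)
# Requesting 2 components from 100 samples and 1000 SNPs yields 2.
plenty = effective_n_components(2, 100, 1000)
```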

property components_

Retrieve components_.

Returns:

tensor or array of shape (n_components_, n_snps) – Matrix of principal components, where each row is a principal component vector.

property mean_

Retrieve mean_.

Returns:

tensor or array of shape (n_snps,) – Per-feature mean vector of the input data used for centering.

property X_

Retrieve X_.

Returns:

tensor or array of shape (n_samples, n_snps) – The SNP data matrix used to fit the model.

property X_new_

Retrieve X_new_.

Returns:

tensor or array of shape (n_samples, n_components_) – The transformed SNP data projected onto the n_components_ principal components.

copy()[source]

Create and return a copy of self.

Returns:

PCA – A new instance of the current object.

fit(snpobj=None, average_strands=None, samples_subset=None, snps_subset=None)[source]

Fit the model to the input SNP data stored in the provided snpobj.

Parameters:
  • snpobj (SNPObject, optional) – A SNPObject instance. If None, defaults to self.snpobj.

  • average_strands (bool, optional) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise. If None, defaults to self.average_strands.

  • samples_subset (int or list of int, optional) – Subset of samples to include, as an integer for the first samples or a list of sample indices. If None, defaults to self.samples_subset.

  • snps_subset (int or list of int, optional) – Subset of SNPs to include, as an integer for the first SNPs or a list of SNP indices. If None, defaults to self.snps_subset.

Returns:

PCA – The fitted instance of self.

transform(snpobj=None, average_strands=None, samples_subset=None, snps_subset=None)[source]

Apply dimensionality reduction to the input SNP data stored in the provided snpobj using the fitted model.

Parameters:
  • snpobj (SNPObject, optional) – A SNPObject instance. If None, defaults to self.snpobj.

  • average_strands (bool, optional) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise. If None, defaults to self.average_strands.

  • samples_subset (int or list of int, optional) – Subset of samples to include, as an integer for the first samples or a list of sample indices. If None, defaults to self.samples_subset.

  • snps_subset (int or list of int, optional) – Subset of SNPs to include, as an integer for the first SNPs or a list of SNP indices. If None, defaults to self.snps_subset.

Returns:

tensor or array of shape (n_samples, n_components_) – The transformed SNP data projected onto the n_components_ principal components, stored in self.X_new_.

fit_transform(snpobj=None, average_strands=None, samples_subset=None, snps_subset=None)[source]

Fit the model to the SNP data stored in the provided snpobj and apply the dimensionality reduction on the same SNP data.

Parameters:
  • snpobj (SNPObject, optional) – A SNPObject instance. If None, defaults to self.snpobj.

  • average_strands (bool, optional) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise. If None, defaults to self.average_strands.

  • samples_subset (int or list of int, optional) – Subset of samples to include, as an integer for the first samples or a list of sample indices. If None, defaults to self.samples_subset.

  • snps_subset (int or list of int, optional) – Subset of SNPs to include, as an integer for the first SNPs or a list of SNP indices. If None, defaults to self.snps_subset.

Returns:

tensor or array of shape (n_samples, n_components_) – The transformed SNP data projected onto the n_components_ principal components, stored in self.X_new_.

class snputils.processing.TorchPCA(n_components=2, fitting='exact')[source]

Bases: object

GPU-based Principal Component Analysis (PCA) using PyTorch tensors.

This implementation supports exact and approximate SVD fitting modes and is intended for accelerated execution on CUDA-capable hardware.

Parameters:
  • n_components (int, default=2) – The number of principal components. If None, defaults to the minimum of n_samples and n_snps.

  • fitting (str, default='exact') – SVD mode for PCA. Use 'exact' for economy SVD via torch.linalg.svd (full_matrices=False), or 'lowrank' for faster approximate SVD via torch.svd_lowrank.
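The 'exact' mode can be illustrated with a minimal NumPy sketch of PCA via economy SVD. NumPy stands in for PyTorch here so the example runs without a GPU; np.linalg.svd(full_matrices=False) plays the role that the description assigns to torch.linalg.svd. The function fit_pca_exact is hypothetical and is not the TorchPCA code.

```python
import numpy as np

def fit_pca_exact(X, n_components=2):
    """PCA by economy SVD of the centered data matrix."""
    X = np.asarray(X, dtype=float)
    n_samples, n_snps = X.shape
    k = min(n_components, min(n_samples, n_snps))  # effective number of components

    mean = X.mean(axis=0)                          # per-SNP mean used for centering
    # Economy SVD: Vt has shape (min(n_samples, n_snps), n_snps).
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

    components = Vt[:k]                            # each row is one principal axis
    X_new = (X - mean) @ components.T              # projection onto the top-k axes
    return mean, components, X_new

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 20)).astype(float)  # 6 samples, 20 SNPs
mean, components, X_new = fit_pca_exact(X, n_components=2)
```

The rows of components are orthonormal, and X_new has one row per sample and one column per retained component.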

property n_components

Retrieve n_components.

Returns:

int – The number of principal components.

property fitting

Retrieve fitting.

Returns:

str – 'exact' (economy SVD) or 'lowrank' (approximate).

property n_components_

Retrieve n_components_.

Returns:

int – The effective number of components retained after fitting, calculated as min(self.n_components, min(n_samples, n_snps)).

property components_

Retrieve components_.

Returns:

tensor of shape (n_components_, n_snps) – Matrix of principal components, where each row is a principal component vector.

property mean_

Retrieve mean_.

Returns:

tensor of shape (n_snps,) – Per-feature mean vector of the input data used for centering.

property X_new_

Retrieve X_new_.

Returns:

tensor of shape (n_samples, n_components_) – The transformed SNP data projected onto the n_components_ principal components.

copy()[source]

Create and return a copy of self.

Returns:

TorchPCA – A new instance of the current object.

fit(X)[source]

Fit the model to the input SNP data.

Parameters:

X (tensor of shape (n_samples, n_snps)) – The SNP data matrix to fit the model.

Returns:

TorchPCA – The fitted instance of self.

transform(X)[source]

Apply dimensionality reduction to the input SNP data using the fitted model.

Parameters:

X (tensor of shape (n_samples, n_snps)) – The SNP data matrix to be transformed.

Returns:

tensor of shape (n_samples, n_components_) – The transformed SNP data projected onto the n_components_ principal components, stored in self.X_new_.
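A fitted PCA model usually projects new data with the mean and components learned during fit, not with statistics of the new data. The sketch below shows that step in NumPy; the function project and the toy values are hypothetical, and the source does not confirm that TorchPCA computes the projection in exactly this way.

```python
import numpy as np

def project(X, mean, components):
    """Center X with the stored mean, then project onto the stored components."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

# Stand-ins for the mean_ and components_ attributes of a fitted model.
mean = np.array([1.0, 0.5, 0.0])
components = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])

X = np.array([[2.0, 0.5, 1.0],
              [0.0, 1.5, 0.0]])
projected = project(X, mean, components)  # shape (2, 2)
```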

fit_transform(X)[source]

Fit the model to the SNP data and apply dimensionality reduction on the same SNP data.

Parameters:

X (tensor of shape (n_samples, n_snps)) – The SNP data matrix used for both fitting and transformation.

Returns:

tensor of shape (n_samples, n_components_) – The transformed SNP data projected onto the n_components_ principal components, stored in self.X_new_.

class snputils.processing.mdPCA(method='weighted_cov_pca', snpobj=None, laiobj=None, labels_file=None, ancestry=None, is_masked=True, average_strands=False, force_nan_incomplete_strands=False, is_weighted=False, groups_to_remove=None, min_percent_snps=4, group_snp_frequencies_only=True, save_masks=False, load_masks=False, masks_file='masks.npz', output_file='output.tsv', covariance_matrix_file=None, n_components=2, rsid_or_chrompos=2, percent_vals_masked=0)[source]

Bases: object

A class for performing missing data principal component analysis (mdPCA) on SNP data.

The mdPCA class focuses on genotype segments from the ancestry of interest when the is_masked flag is set to True. It offers flexible processing options, allowing either separate handling of masked haplotype strands or combining (averaging) strands into a single composite representation for each individual. Moreover, the analysis can be performed on individual-level data, group-level SNP frequencies, or a combination of both.

If the snpobj, laiobj, labels_file, and ancestry parameters are all provided during instantiation, the fit_transform method will be automatically called, applying the specified mdPCA method to transform the data upon instantiation.

Parameters:
  • method (str, default='weighted_cov_pca') –

    The PCA method to use for dimensionality reduction. Options include:

    • ‘weighted_cov_pca’:

      Simple covariance-based PCA, weighted by sample strengths.

    • ‘regularized_optimization_ils’:

      Regularized optimization followed by iterative, weighted (via the strengths) least squares projection of missing samples using the original covariance matrix (considering only relevant elements not missing in the original covariance matrix for those samples).

    • ‘cov_matrix_imputation’:

      Eigen-decomposition of the covariance matrix after first imputing the covariance matrix missing values using the Iterative SVD imputation method.

    • ‘cov_matrix_imputation_ils’:

      The method of ‘cov_matrix_imputation’, but where afterwards missing samples are re-projected onto the space given by ‘cov_matrix_imputation’ using the same iterative method on the original covariance matrix just as done in ‘regularized_optimization_ils’.

    • ‘nonmissing_pca_ils’:

      The method of ‘weighted_cov_pca’ on the non-missing samples, followed by the projection of missing samples onto the space given by ‘weighted_cov_pca’ using the same iterative method on the original covariance matrix just as done in ‘regularized_optimization_ils’.

  • snpobj (SNPObject, optional) – A SNPObject instance.

  • laiobj (LAIObject, optional) – A LAIObject instance.

  • labels_file (str, optional) – Path to a .tsv file with metadata on individuals, including population labels, optional weights, and groupings. The indID column must contain unique individual identifiers matching those in laiobj and snpobj for proper alignment. The label column assigns population groups. If is_weighted=True, a weight column must be provided, assigning a weight to each individual, where those with a weight of zero are removed. Optional columns include combination and combination_weight to aggregate individuals into combined groups, where SNP frequencies represent their sequences. The combination column assigns each individual to a specific group (0 for no combination, 1 for the first group, 2 for the second, etc.). All members of a group must share the same label and combination_weight. If combination_weight column is not provided, the combinations are assigned a default weight of 1. Individuals excluded via groups_to_remove or those falling below min_percent_snps are removed from the analysis.

  • ancestry (int or str, optional) –

    Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0. The ancestry input can be:

    • An integer (e.g., 0, 1, 2).

    • A string representation of an integer (e.g., ‘0’, ‘1’).

    • A string matching one of the ancestry map values (e.g., ‘Africa’).

  • is_masked (bool, default=True) – If True, applies ancestry-specific masking to the genotype matrix, retaining only genotype data corresponding to the specified ancestry. If False, uses the full, unmasked genotype matrix.

  • average_strands (bool, default=False) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.

  • force_nan_incomplete_strands (bool, default=False) – If True, sets the result to NaN if either haplotype in a pair is NaN. Otherwise, computes the mean while ignoring NaNs (e.g., 0|NaN -> 0, 1|NaN -> 1).

  • is_weighted (bool, default=False) – If True, assigns individual weights from the weight column in labels_file. Otherwise, all individuals have equal weight of 1.

  • groups_to_remove (list of str, optional) – List with groups to exclude from analysis. Example: [‘group1’, ‘group2’].

  • min_percent_snps (float, default=4) – Minimum percentage of SNPs that must be known for an individual and belong to the ancestry of interest for that individual to be included in the analysis. Individuals with a lower percentage of unmasked SNPs than this threshold are excluded.

  • group_snp_frequencies_only (bool, default=True) – If True, mdPCA is performed exclusively on group-level SNP frequencies, ignoring individual-level data. This applies when is_weighted is set to True and a combination column is provided in the labels_file, meaning individuals are aggregated into groups based on their assigned labels. If False, mdPCA is performed on individual-level SNP data alone or on both individual-level and group-level SNP frequencies when is_weighted is True and a combination column is provided.

  • save_masks (bool, default=False) – True if the masked matrices are to be saved in a .npz file, or False otherwise.

  • load_masks (bool, default=False) – True if the masked matrices are to be loaded from a pre-existing .npz file specified by masks_file, or False otherwise.

  • masks_file (str or pathlib.Path, default='masks.npz') – Path to the .npz file used for saving/loading masked matrices.

  • output_file (str or pathlib.Path, default='output.tsv') – Path to the output .tsv file where mdPCA results are saved.

  • covariance_matrix_file (str, optional) – Path to save the covariance matrix file in .npy format. If None, the covariance matrix is not saved. Default is None.

  • n_components (int, default=2) – The number of principal components.

  • rsid_or_chrompos (int, default=2) – Format indicator for SNP IDs in the SNP data. Use 1 for rsID format or 2 for chromosome_position.

  • percent_vals_masked (float, default=0) – Percentage of values in the covariance matrix to be masked and then imputed. Only applicable if method is ‘cov_matrix_imputation’ or ‘cov_matrix_imputation_ils’.

property method

Retrieve method.

Returns:

str – The PCA method to use for dimensionality reduction.

property snpobj

Retrieve snpobj.

Returns:

SNPObject – A SNPObject instance.

property laiobj

Retrieve laiobj.

Returns:

LocalAncestryObject – A LocalAncestryObject instance.

property labels_file

Retrieve labels_file.

Returns:

str – Path to the labels file in .tsv format.
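To make the labels file layout concrete, here is a minimal sketch that parses a small in-memory .tsv with the columns described for labels_file. The sample rows and the helper read_labels are hypothetical; the sketch only follows the stated rules that zero-weight individuals are dropped and that a missing combination_weight defaults to 1.

```python
import csv
import io

# Hypothetical labels file content (tab-separated).
LABELS_TSV = (
    "indID\tlabel\tweight\tcombination\tcombination_weight\n"
    "ind1\tGroupA\t1\t0\t1\n"
    "ind2\tGroupA\t0\t0\t1\n"
    "ind3\tGroupB\t2\t1\t3\n"
    "ind4\tGroupB\t2\t1\t3\n"
)

def read_labels(handle, is_weighted=True):
    """Parse label rows, dropping individuals whose weight is zero."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Without weighting, every individual gets an equal weight of 1.
        weight = float(row["weight"]) if is_weighted else 1.0
        if weight == 0:
            continue
        rows.append({
            "indID": row["indID"],
            "label": row["label"],
            "weight": weight,
            # 0 means "not part of any combination".
            "combination": int(row.get("combination") or 0),
            # Combinations default to a weight of 1 when the column is absent.
            "combination_weight": float(row.get("combination_weight") or 1),
        })
    return rows

labels = read_labels(io.StringIO(LABELS_TSV))
kept_ids = [row["indID"] for row in labels]
```

Here ind2 is removed because its weight is zero, while ind3 and ind4 share combination group 1 with the same label and combination_weight.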

property ancestry

Retrieve ancestry.

Returns:

str – Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0.
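The three accepted forms of the ancestry input (integer, integer-like string, or ancestry name) can be normalized as in the sketch below. The helper resolve_ancestry and the layout of ancestry_map (index strings mapped to names) are assumptions for illustration, not the library's code.

```python
def resolve_ancestry(ancestry, ancestry_map):
    """Return the 0-based ancestry index for an int, a digit string, or a map value."""
    if isinstance(ancestry, int):
        return ancestry
    text = str(ancestry).strip()
    if text.isdigit():
        return int(text)
    # Fall back to looking the name up among the ancestry map values.
    for index, name in ancestry_map.items():
        if name == text:
            return int(index)
    raise ValueError(f"Unknown ancestry: {ancestry!r}")

# Hypothetical ancestry map: index string -> ancestry name.
ancestry_map = {"0": "Africa", "1": "Europe"}

from_int = resolve_ancestry(1, ancestry_map)
from_digit_string = resolve_ancestry("0", ancestry_map)
from_name = resolve_ancestry("Europe", ancestry_map)
```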

property is_masked

Retrieve is_masked.

Returns:

bool – True if an ancestry file is passed for ancestry-specific masking, or False otherwise.
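Ancestry-specific masking can be pictured as replacing every genotype entry that is not assigned to the ancestry of interest with NaN. The sketch below assumes per-SNP ancestry calls already aligned with the haplotype matrix, which is a simplification; mask_by_ancestry is a hypothetical helper.

```python
import numpy as np

def mask_by_ancestry(haplotypes, ancestry_calls, ancestry):
    """Keep alleles called as `ancestry`; set every other entry to NaN."""
    haplotypes = np.asarray(haplotypes, dtype=float)
    return np.where(np.asarray(ancestry_calls) == ancestry, haplotypes, np.nan)

# Rows are haplotypes, columns are SNPs.
haplotypes = np.array([[0, 1, 1],
                       [1, 0, 1]])
# Ancestry index assigned to each entry (assumed to be aligned with `haplotypes`).
ancestry_calls = np.array([[0, 0, 1],
                           [1, 0, 0]])

masked = mask_by_ancestry(haplotypes, ancestry_calls, ancestry=0)
```

The result keeps only the entries assigned to ancestry 0, giving the incomplete matrix that the missing-data methods then operate on.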

property average_strands

Retrieve average_strands.

Returns:

bool – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.

property force_nan_incomplete_strands

Retrieve force_nan_incomplete_strands.

Returns:

bool – If True, sets the result to NaN if either haplotype in a pair is NaN. Otherwise, computes the mean while ignoring NaNs (e.g., 0|NaN -> 0, 1|NaN -> 1).
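The two averaging behaviors can be compared directly in NumPy. This is a sketch of the rule as described (0|NaN -> 0, 1|NaN -> 1 when the flag is off), with a hypothetical helper name; it is not taken from the library source.

```python
import numpy as np

def average_strand_pair(maternal, paternal, force_nan_incomplete_strands=False):
    """Average two haplotype vectors, with strict or lenient NaN handling."""
    pair = np.stack([np.asarray(maternal, dtype=float),
                     np.asarray(paternal, dtype=float)])
    if force_nan_incomplete_strands:
        # Plain mean: a NaN on either strand makes the result NaN.
        return pair.mean(axis=0)
    # Lenient mean: average only the observed strands at each SNP.
    observed = ~np.isnan(pair)
    counts = observed.sum(axis=0)
    totals = np.where(observed, pair, 0.0).sum(axis=0)
    # If both strands are missing, the result stays NaN.
    return np.where(counts > 0, totals / np.maximum(counts, 1), np.nan)

maternal = [0.0, 1.0, np.nan, 1.0]
paternal = [np.nan, np.nan, np.nan, 0.0]

lenient = average_strand_pair(maternal, paternal)
strict = average_strand_pair(maternal, paternal, force_nan_incomplete_strands=True)
```

In the lenient case the first two SNPs take the value of the single observed strand; in the strict case they become NaN.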

property is_weighted

Retrieve is_weighted.

Returns:

bool – True if weights are provided in the labels file, or False otherwise.

property groups_to_remove

Retrieve groups_to_remove.

Returns:

list of str – List of groups to exclude from analysis. Example: [‘group1’, ‘group2’].

property min_percent_snps

Retrieve min_percent_snps.

Returns:

float – Minimum percentage of SNPs that must be known for an individual to be included in the analysis. Individuals with a lower percentage of unmasked SNPs than this threshold are excluded.
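The threshold can be read as a per-row coverage filter on the masked matrix. The following sketch assumes that masked entries are stored as NaN and that rows correspond to individuals or haplotypes; filter_by_coverage is a hypothetical helper.

```python
import numpy as np

def filter_by_coverage(masked, min_percent_snps=4):
    """Drop rows whose percentage of non-NaN SNPs is below the threshold."""
    masked = np.asarray(masked, dtype=float)
    percent_known = 100.0 * (~np.isnan(masked)).mean(axis=1)
    keep = percent_known >= min_percent_snps
    return masked[keep], keep

masked = np.array([
    [0.0, 1.0, 1.0, 0.0],                  # 100% of SNPs known
    [np.nan, 1.0, np.nan, np.nan],         # 25% known
    [np.nan, np.nan, np.nan, np.nan],      # 0% known
])

kept_default, keep_default = filter_by_coverage(masked)                      # threshold 4%
kept_strict, keep_strict = filter_by_coverage(masked, min_percent_snps=50)  # threshold 50%
```

With the default threshold only the fully masked row is dropped; raising the threshold to 50 also removes the row with 25% coverage.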

property group_snp_frequencies_only

Retrieve group_snp_frequencies_only.

Returns:

bool – If True, mdPCA is performed exclusively on group-level SNP frequencies, ignoring individual-level data. This applies when is_weighted is set to True and a combination column is provided in the labels_file, meaning individuals are aggregated into groups based on their assigned labels. If False, mdPCA is performed on individual-level SNP data alone or on both individual-level and group-level SNP frequencies when is_weighted is True and a combination column is provided.

property save_masks

Retrieve save_masks.

Returns:

bool – True if the masked matrices are to be saved in a .npz file, or False otherwise.

property load_masks

Retrieve load_masks.

Returns:

bool – True if the masked matrices are to be loaded from a pre-existing .npz file specified by masks_file, or False otherwise.

property masks_file

Retrieve masks_file.

Returns:

str or pathlib.Path – Path to the .npz file used for saving/loading masked matrices.

property output_file

Retrieve output_file.

Returns:

str or pathlib.Path – Path to the output .tsv file where mdPCA results are saved.

property covariance_matrix_file

Retrieve covariance_matrix_file.

Returns:

str – Path to save the covariance matrix file in .npy format.

property n_components

Retrieve n_components.

Returns:

int – The number of principal components.

property rsid_or_chrompos

Retrieve rsid_or_chrompos.

Returns:

int – Format indicator for SNP IDs in the SNP data. Use 1 for rsID format or 2 for chromosome_position.

property percent_vals_masked

Retrieve percent_vals_masked.

Returns:

float – Percentage of values in the covariance matrix to be masked and then imputed. Only applicable if method is ‘cov_matrix_imputation’ or ‘cov_matrix_imputation_ils’.

property X_new_

Retrieve X_new_.

Returns:

array of shape (n_haplotypes_, n_components) – The transformed SNP data projected onto the n_components principal components. n_haplotypes_ is the number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0). For diploid individuals without filtering, the shape is (n_samples * 2, n_components).

property haplotypes_

Retrieve haplotypes_.

Returns:

list of str – A list of unique haplotype identifiers.

property samples_

Retrieve samples_.

Returns:

list of str – A list of sample identifiers based on haplotypes_ and average_strands.

property variants_id_

Retrieve variants_id_.

Returns:

array of shape (n_snps,) – An array containing unique identifiers (IDs) for each SNP, potentially reduced if there are SNPs not present in the laiobj. The format depends on rsid_or_chrompos.

property n_haplotypes

Retrieve n_haplotypes.

Returns:

int – The total number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0).

property n_samples

Retrieve n_samples.

Returns:

int – The total number of samples, potentially reduced if filtering is applied (min_percent_snps > 0).

copy()[source]

Create and return a copy of self.

Returns:

mdPCA – A new instance of the current object.

fit_transform(snpobj=None, laiobj=None, labels_file=None, ancestry=None, average_strands=None)[source]

Fit the model to the SNP data stored in the provided snpobj and apply the dimensionality reduction on the same SNP data.

This method starts by loading or updating SNP and ancestry data. Then, it manages missing values by applying masks based on ancestry, either by loading a pre-existing mask or generating new ones. After processing these masks to produce an incomplete SNP data matrix, it applies the chosen mdPCA method to reduce dimensionality while handling missing data as specified.

Parameters:
  • snpobj (SNPObject, optional) – A SNPObject instance.

  • laiobj (LAIObject, optional) – A LAIObject instance.

  • labels_file (str, optional) – Path to the labels file in .tsv format. The first column, indID, contains the individual identifiers, and the second column, label, specifies the groups for all individuals. If is_weighted=True, a weight column with individual weights is required. Optionally, combination and combination_weight columns can specify sets of individuals to be combined into groups, with respective weights.

  • ancestry (str, optional) – Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0.

  • average_strands (bool, optional) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise. If None, defaults to self.average_strands.

Returns:

array of shape (n_samples, n_components) – The transformed SNP data projected onto the n_components principal components, stored in self.X_new_.

class snputils.processing.maasMDS(snpobj=None, laiobj=None, labels_file=None, ancestry=None, is_masked=True, average_strands=False, force_nan_incomplete_strands=False, is_weighted=False, groups_to_remove={}, min_percent_snps=4, group_snp_frequencies_only=True, save_masks=False, load_masks=False, masks_file='masks.npz', distance_type='AP', n_components=2, rsid_or_chrompos=2)[source]

Bases: object

A class for performing multiple array ancestry-specific multidimensional scaling (maasMDS) on SNP data.

The maasMDS class focuses on genotype segments from the ancestry of interest when the is_masked flag is set to True. It offers flexible processing options, allowing either separate handling of masked haplotype strands or combining (averaging) strands into a single composite representation for each individual. Moreover, the analysis can be performed on individual-level data, group-level SNP frequencies, or a combination of both.

If the snpobj, laiobj, labels_file, and ancestry parameters are all provided during instantiation, the fit_transform method is called automatically, applying maasMDS to the data upon instantiation.

Parameters:
  • snpobj (SNPObject, optional) – A SNPObject instance.

  • laiobj (LAIObject, optional) – A LAIObject instance.

  • labels_file (str, optional) – Path to the labels file in .tsv format. The first column, indID, contains the individual identifiers, and the second column, label, specifies the groups for all individuals. If is_weighted=True, a weight column with individual weights is required. Optionally, combination and combination_weight columns can specify sets of individuals to be combined into groups, with respective weights.

  • ancestry (int or str, optional) –

    Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0. The ancestry input can be:

    • An integer (e.g., 0, 1, 2).

    • A string representation of an integer (e.g., ‘0’, ‘1’).

    • A string matching one of the ancestry map values (e.g., ‘Africa’).

  • is_masked (bool, default=True) – If True, applies ancestry-specific masking to the genotype matrix, retaining only genotype data corresponding to the specified ancestry. If False, uses the full, unmasked genotype matrix.

  • average_strands (bool, default=False) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.

  • force_nan_incomplete_strands (bool, default=False) – If True, sets the result to NaN if either haplotype in a pair is NaN. Otherwise, computes the mean while ignoring NaNs (e.g., 0|NaN -> 0, 1|NaN -> 1).

  • is_weighted (bool, default=False) – True if weights are provided in the labels file, or False otherwise.

  • groups_to_remove (dict of int to list of str, default={}) – Dictionary specifying groups to exclude from analysis. Keys are array numbers, and values are lists of groups to remove for each array. Example: {1: [‘group1’, ‘group2’], 2: [], 3: [‘group3’]}.

  • min_percent_snps (float, default=4.0) – Minimum percentage of SNPs that must be known for an individual to be included in the analysis. Individuals with a lower percentage of unmasked SNPs than this threshold are excluded.

  • group_snp_frequencies_only (bool, default=True) – If True, maasMDS is performed exclusively on group-level SNP frequencies, ignoring individual-level data. This applies when is_weighted is set to True and a combination column is provided in the labels_file, meaning individuals are aggregated into groups based on their assigned labels. If False, maasMDS is performed on individual-level SNP data alone or on both individual-level and group-level SNP frequencies when is_weighted is True and a combination column is provided.

  • save_masks (bool, default=False) – True if the masked matrices are to be saved in a .npz file, or False otherwise.

  • load_masks (bool, default=False) – True if the masked matrices are to be loaded from a pre-existing .npz file specified by masks_file, or False otherwise.

  • masks_file (str or pathlib.Path, default='masks.npz') – Path to the .npz file used for saving/loading masked matrices.

  • distance_type (str, default='AP') – Distance metric to use. Options are ‘Manhattan’, ‘RMS’ (Root Mean Square), and ‘AP’ (Average Pairwise). If average_strands=True, use distance_type=‘AP’.

  • n_components (int, default=2) – The number of principal components.

  • rsid_or_chrompos (int, default=2) – Format indicator for SNP IDs in the SNP data. Use 1 for rsID format or 2 for chromosome_position.

copy()[source]

Create and return a copy of self.

Returns:

maasMDS – A new instance of the current object.

property snpobj

Retrieve snpobj.

Returns:

SNPObject – A SNPObject instance.

property laiobj

Retrieve laiobj.

Returns:

LocalAncestryObject – A LocalAncestryObject instance.

property labels_file

Retrieve labels_file.

Returns:

str – Path to the labels file in .tsv format.

property ancestry

Retrieve ancestry.

Returns:

int – Ancestry index for which dimensionality reduction is to be performed. Ancestry counter starts at 0.

property is_masked

Retrieve is_masked.

Returns:

bool – True if an ancestry file is passed for ancestry-specific masking, or False otherwise.

property average_strands

Retrieve average_strands.

Returns:

bool – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.

property force_nan_incomplete_strands

Retrieve force_nan_incomplete_strands.

Returns:

bool – If True, sets the result to NaN if either haplotype in a pair is NaN. Otherwise, computes the mean while ignoring NaNs (e.g., 0|NaN -> 0, 1|NaN -> 1).

property is_weighted

Retrieve is_weighted.

Returns:

bool – True if weights are provided in the labels file, or False otherwise.

property groups_to_remove

Retrieve groups_to_remove.

Returns:

dict of int to list of str – Dictionary specifying groups to exclude from analysis. Keys are array numbers, and values are lists of groups to remove for each array. Example: {1: [‘group1’, ‘group2’], 2: [], 3: [‘group3’]}.

property min_percent_snps

Retrieve min_percent_snps.

Returns:

float – Minimum percentage of SNPs that must be known for an individual to be included in the analysis. Individuals with a lower percentage of unmasked SNPs than this threshold are excluded.

property group_snp_frequencies_only

Retrieve group_snp_frequencies_only.

Returns:

bool – If True, maasMDS is performed exclusively on group-level SNP frequencies, ignoring individual-level data. This applies when is_weighted is set to True and a combination column is provided in the labels_file, meaning individuals are aggregated into groups based on their assigned labels. If False, maasMDS is performed on individual-level SNP data alone or on both individual-level and group-level SNP frequencies when is_weighted is True and a combination column is provided.

property save_masks

Retrieve save_masks.

Returns:

bool – True if the masked matrices are to be saved in a .npz file, or False otherwise.

property load_masks

Retrieve load_masks.

Returns:

bool – True if the masked matrices are to be loaded from a pre-existing .npz file specified by masks_file, or False otherwise.

property masks_file

Retrieve masks_file.

Returns:

str or pathlib.Path – Path to the .npz file used for saving/loading masked matrices.

property distance_type

Retrieve distance_type.

Returns:

str – Distance metric to use. Options are ‘Manhattan’, ‘RMS’ (Root Mean Square), and ‘AP’ (Average Pairwise). If average_strands=True, use distance_type=‘AP’.
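As a rough illustration of distance computation on masked data, the sketch below evaluates ‘Manhattan’ and ‘RMS’ distances between rows using only the SNPs observed in both rows. Averaging over the shared SNPs is an assumption made here so that pairs with different overlap stay comparable; the library's exact definitions, including that of ‘AP’, are not given in this reference and are not reproduced. The function masked_distances is hypothetical.

```python
import numpy as np

def masked_distances(X, distance_type="Manhattan"):
    """Pairwise row distances over jointly observed (non-NaN) SNPs."""
    if distance_type not in ("Manhattan", "RMS"):
        raise ValueError("this sketch only covers 'Manhattan' and 'RMS'")
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.full((n, n), np.nan)  # pairs without shared SNPs stay NaN
    for i in range(n):
        for j in range(n):
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if not shared.any():
                continue
            diff = X[i, shared] - X[j, shared]
            if distance_type == "Manhattan":
                D[i, j] = np.abs(diff).mean()
            else:
                D[i, j] = np.sqrt((diff ** 2).mean())
    return D

X = np.array([
    [0.0, 1.0, np.nan, 1.0],
    [1.0, 1.0, 0.0, np.nan],
    [0.0, 0.0, 0.0, 0.0],
])
D_manhattan = masked_distances(X, "Manhattan")
D_rms = masked_distances(X, "RMS")
```

Rows 0 and 1 share only the first two SNPs, so their distance is computed from those two positions alone.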

property n_components

Retrieve n_components.

Returns:

int – The number of principal components.

property rsid_or_chrompos

Retrieve rsid_or_chrompos.

Returns:

int – Format indicator for SNP IDs in the SNP data. Use 1 for rsID format or 2 for chromosome_position.

property X_new_

Retrieve X_new_.

Returns:

array of shape (n_haplotypes_, n_components) – The transformed SNP data projected onto the n_components principal components. n_haplotypes_ is the number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0). For diploid individuals without filtering, the shape is (n_samples * 2, n_components).

property haplotypes_

Retrieve haplotypes_.

Returns:

list of str – A list of unique haplotype identifiers.

property samples_

Retrieve samples_.

Returns:

list of str – A list of sample identifiers based on haplotypes_ and average_strands.

property variants_id_

Retrieve variants_id_.

Returns:

array of shape (n_snps,) – An array containing unique identifiers (IDs) for each SNP, potentially reduced if there are SNPs not present in the laiobj. The format depends on rsid_or_chrompos.

property n_haplotypes

Retrieve n_haplotypes.

Returns:

int – The total number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0).

property n_samples

Retrieve n_samples.

Returns:

int – The total number of samples, potentially reduced if filtering is applied (min_percent_snps > 0).

fit_transform(snpobj=None, laiobj=None, labels_file=None, ancestry=None, average_strands=None)[source]

Fit the model to the SNP data stored in the provided snpobj and apply the dimensionality reduction on the same SNP data.

Parameters:
  • snpobj (SNPObject, optional) – A SNPObject instance.

  • laiobj (LAIObject, optional) – A LAIObject instance.

  • labels_file (str, optional) – Path to the labels file in .tsv format. The first column, indID, contains the individual identifiers, and the second column, label, specifies the groups for all individuals. If is_weighted=True, a weight column with individual weights is required. Optionally, combination and combination_weight columns can specify sets of individuals to be combined into groups, with respective weights.

  • ancestry (str, optional) – Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0.

  • average_strands (bool, optional) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise. If None, defaults to self.average_strands.

Returns:

array of shape (n_samples, n_components) – The transformed SNP data projected onto the n_components principal components, stored in self.X_new_.
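For intuition about how a distance matrix becomes low-dimensional coordinates, here is a minimal sketch of classical (Torgerson) MDS in NumPy. This reference does not state which MDS variant maasMDS uses or how it treats missing distances, so the code illustrates the general technique only; classical_mds is a hypothetical function that expects a complete distance matrix.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Embed points from a complete distance matrix via double centering."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of centered points
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Four points on a 3-by-4 rectangle; their Euclidean distances are the input.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None] - points[None, :], axis=2)

embedding = classical_mds(D, n_components=2)
D_hat = np.linalg.norm(embedding[:, None] - embedding[None, :], axis=2)
```

Because the input distances are Euclidean in two dimensions, the two-component embedding reproduces them up to rotation and reflection.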