Processing¶
Classes for PCA, missing-data PCA with local-ancestry masking, and multi-array ancestry-specific MDS. All three accept SNPObject inputs directly and expose X_new_ after fitting. Embedding tables can be exported with the helper functions below.
- class snputils.processing.pca.PCA(snpobj=None, backend='sklearn', n_components=2, fitting='exact', device='cpu', average_strands=True, samples_subset=None, snps_subset=None, embedding_table_path=None)[source]¶
Bases: object
Principal Component Analysis (PCA) for SNP genotype matrices.
PCA converts a SNPObject into a sample-by-variant matrix, centers each variant, computes principal axes (eigenvectors/loadings), and projects samples onto the requested principal components.
Two computational backends are available:
- backend="sklearn" uses sklearn.decomposition.PCA on CPU.
- backend="pytorch" uses TorchPCA and can run on CPU or CUDA.
The fitting parameter selects exact SVD ("exact") or approximate low-rank SVD ("lowrank") for both backends. Diploid/two-strand genotype arrays can be averaged into one row per sample or expanded into one row per strand with average_strands.
If snpobj is passed to the constructor, PCA is performed immediately by calling fit_transform().
- Parameters:
snpobj (SNPObject, optional) – Genotype data used to perform PCA immediately. If supplied, fit_transform() is called during initialization.
backend (str, default='sklearn') – Computational backend: 'sklearn' for scikit-learn on CPU or 'pytorch' for the PyTorch implementation.
n_components (int, default=2) – Number of principal components to compute and return.
fitting (str, default='exact') – SVD mode for both backends. Use 'exact' for standard decomposition (PyTorch torch.linalg.svd or sklearn svd_solver='full'), or 'lowrank' for faster approximate decomposition (PyTorch torch.svd_lowrank or sklearn svd_solver='randomized'). Default is 'exact'.
device (str, default='cpu') – Device for the PyTorch backend. Accepted values are 'cpu', 'gpu', 'cuda', or 'cuda:<index>'. Ignored by the scikit-learn backend.
average_strands (bool, default=True) – If True, average the two genotype strands into one dosage row per sample. If False, treat each strand as a separate row.
samples_subset (int or list of int, optional) – Samples to include before PCA. An integer selects the first n samples; a list selects explicit sample indices.
snps_subset (int or list of int, optional) – Variants to include before PCA. An integer selects the first n variants; a list selects explicit variant indices.
embedding_table_path (path, optional) – Optional TSV/CSV path written by fit_transform() with row identifiers and projected PC coordinates.
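The steps described for this class (average or expand strands, center each variant, decompose, project) can be sketched in plain NumPy. This is only an illustration on a toy array and does not call snputils; the (n_snps, n_samples, 2) genotype layout, the strand row order, and the helper name toy_pca are assumptions made for the example.

```python
import numpy as np

def toy_pca(gt, n_components=2, average_strands=True):
    """PCA on a toy genotype array of shape (n_snps, n_samples, 2) (assumed layout)."""
    if average_strands:
        # One dosage row per sample: mean of the two strands.
        X = gt.mean(axis=2).T                                   # (n_samples, n_snps)
    else:
        # One row per strand; the two strands of a sample are adjacent (assumed order).
        n_snps, n_samples, _ = gt.shape
        X = gt.transpose(1, 2, 0).reshape(n_samples * 2, n_snps)
    mean = X.mean(axis=0)                                       # per-variant centering values
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    k = min(n_components, min(X.shape))
    components = Vt[:k]                                         # rows are principal axes
    return (X - mean) @ components.T, components, mean

rng = np.random.default_rng(0)
gt = rng.integers(0, 2, size=(50, 6, 2)).astype(float)          # 50 SNPs, 6 samples
coords, components, mean = toy_pca(gt, n_components=2)
coords_strands, _, _ = toy_pca(gt, n_components=2, average_strands=False)
print(coords.shape, coords_strands.shape)
```

With average_strands=True the toy data yields one row per sample; with False it yields two rows per sample.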
- property snpobj¶
Retrieve snpobj.
- Returns:
SNPObject – A SNPObject instance.
- property backend¶
Retrieve backend.
- Returns:
str – The backend to use (‘sklearn’ or ‘pytorch’).
- property n_components¶
Retrieve n_components.
- Returns:
int – The number of principal components.
- property fitting¶
Retrieve fitting.
- Returns:
str – 'exact' or 'lowrank' (same meaning for sklearn and PyTorch backends).
- property device¶
Retrieve device.
- Returns:
torch.device – Device to use (‘cpu’, ‘gpu’, ‘cuda’, or ‘cuda:<index>’).
- property average_strands¶
Retrieve average_strands.
- Returns:
bool – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.
- property samples_subset¶
Retrieve samples_subset.
- Returns:
int or list of int – Subset of samples to include, as an integer for the first samples or a list of sample indices.
- property snps_subset¶
Retrieve snps_subset.
- Returns:
int or list of int – Subset of SNPs to include, as an integer for the first SNPs or a list of SNP indices.
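The integer-versus-list convention shared by samples_subset and snps_subset can be pictured with a small helper. The function name resolve_subset is hypothetical, and clamping an oversized integer to the available count is an assumption rather than documented behavior.

```python
def resolve_subset(subset, total):
    """Turn an int-or-list subset spec into explicit indices (hypothetical helper)."""
    if subset is None:
        return list(range(total))                 # no subsetting
    if isinstance(subset, int):
        return list(range(min(subset, total)))    # first n items (clamping is assumed)
    return list(subset)                           # explicit indices, kept in given order

first_three = resolve_subset(3, 10)
explicit = resolve_subset([0, 4, 7], 10)
print(first_three, explicit)
```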
- property n_components_¶
Retrieve n_components_.
- Returns:
int – The effective number of components retained after fitting, calculated as min(self.n_components, min(n_samples, n_snps)).
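As a quick worked example of the formula above, the following standalone function (a hypothetical name, not part of the library) shows how a request for more components than the data can support is capped.

```python
def effective_n_components(n_components, n_samples, n_snps):
    """Cap the requested component count by the smaller matrix dimension."""
    return min(n_components, min(n_samples, n_snps))

few_samples = effective_n_components(10, n_samples=4, n_snps=5000)
plenty = effective_n_components(2, n_samples=100, n_snps=5000)
print(few_samples, plenty)
```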
- property components_¶
Retrieve components_.
- Returns:
tensor or array – Matrix of principal components, where each row is a principal component vector.
- property mean_¶
Retrieve mean_.
- Returns:
tensor or array – Per-feature mean vector of the input data used for centering.
- property X_¶
Retrieve X_.
- Returns:
tensor or array – The SNP data matrix used to fit the model.
- property X_new_¶
Retrieve X_new_.
- Returns:
tensor or array – The transformed SNP data projected onto the n_components_ principal components.
- property embedding_table_path¶
Optional path written by fit_transform() with the embedding table (TSV/CSV).
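To make the idea of an embedding table concrete, here is a minimal standard-library sketch that writes row identifiers next to PC coordinates as TSV. The column names (id, PC1, PC2, ...) and the helper name are assumptions for illustration; the layout actually produced by the library may differ.

```python
import csv
import io

def write_embedding_table(handle, row_ids, coords, delimiter="\t"):
    """Write one line per projected row: identifier, then PC1..PCk (assumed layout)."""
    n_components = len(coords[0]) if coords else 0
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["id"] + [f"PC{i + 1}" for i in range(n_components)])
    for row_id, row in zip(row_ids, coords):
        writer.writerow([row_id] + [f"{value:.6g}" for value in row])

buffer = io.StringIO()
write_embedding_table(buffer, ["s1", "s2"], [[0.5, -1.25], [-0.5, 1.25]])
table_text = buffer.getvalue()
print(table_text)
```

Passing delimiter="," to the same helper would give the CSV variant.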
- property haplotypes_¶
Per-row identifiers aligned with X_new_ after fit_transform().
When average_strands is False and genotypes are diploid/two-strand 3D, values look like indID|0 and indID|1 for the two expanded rows per sample.
- property samples_¶
Sample identifiers per projection row (same length as X_new_ when set).
With expanded strands, entries repeat per sample (derived from haplotypes_).
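The relationship between haplotypes_ and samples_ might look like the following sketch. The helper name is hypothetical, and both the row order (the two strands of a sample adjacent) and the use of plain sample IDs when strands are averaged are assumptions, not documented behavior.

```python
def expand_row_ids(sample_ids, average_strands):
    """Return (haplotype_ids, sample_ids_per_row) for the projection rows."""
    if average_strands:
        # One row per sample; identifiers are the sample IDs themselves (assumed).
        return list(sample_ids), list(sample_ids)
    # Two rows per sample, tagged with the strand index.
    haplotypes = [f"{s}|{strand}" for s in sample_ids for strand in (0, 1)]
    # Derive the repeated sample IDs back from the haplotype IDs.
    samples = [h.rsplit("|", 1)[0] for h in haplotypes]
    return haplotypes, samples

haplotypes, samples = expand_row_ids(["A", "B"], average_strands=False)
print(haplotypes, samples)
```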
- copy()[source]¶
Create and return a copy of self.
- Returns:
PCA – A new instance of the current object.
- fit(snpobj=None, average_strands=None, samples_subset=None, snps_subset=None)[source]¶
Compute PCA eigenvectors from SNP genotype data.
This prepares a sample-by-variant matrix from snpobj, centers each variant, and computes the principal axes. After fitting, the learned eigenvectors/loadings are available in components_ and the per-variant centering values are available in mean_. No PC coordinates are returned by this method; call transform() to project data onto the learned PCs, or fit_transform() to do both steps at once.
- Parameters:
snpobj (SNPObject, optional) – Genotype data used to compute the PCA axes. If None, defaults to self.snpobj.
average_strands (bool, optional) – Whether to average two-strand genotypes into one dosage row per sample. If None, defaults to self.average_strands.
samples_subset (int or list of int, optional) – Samples to use when computing PCs. An integer selects the first n samples; a list selects explicit sample indices. If None, defaults to self.samples_subset.
snps_subset (int or list of int, optional) – Variants to use when computing PCs. An integer selects the first n variants; a list selects explicit variant indices. If None, defaults to self.snps_subset.
- Returns:
PCA – The same PCA instance, with n_components_, components_, and mean_ populated.
- transform(snpobj=None, average_strands=None, samples_subset=None, snps_subset=None)[source]¶
Project SNP genotype data onto previously computed principal components.
Call fit() before calling this method. The input data are prepared with the same row/column conventions as in fit(), centered using the fitted mean_, and multiplied by the fitted components_. This is useful when PCA axes are computed on one dataset and another dataset needs to be projected into the same PC space.
- Parameters:
snpobj (SNPObject, optional) – Genotype data to project. If None and a prepared matrix is already stored in X_, that matrix is reused; otherwise self.snpobj is used.
average_strands (bool, optional) – Whether to average two-strand genotypes into one dosage row per sample. If None, defaults to self.average_strands.
samples_subset (int or list of int, optional) – Samples to project. An integer selects the first n samples; a list selects explicit sample indices. If None, defaults to self.samples_subset.
snps_subset (int or list of int, optional) – Variants to use for projection. This must match the variant set used during fitting. If None, defaults to self.snps_subset.
- Returns:
tensor or array – PC coordinates with one row per projected sample or strand and one column per component. The coordinates are also stored in X_new_.
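The fit-then-project workflow described for transform() reduces to two matrix operations. The NumPy sketch below uses synthetic matrices in place of SNPObject data; the local names mean_ and components_ mirror the attributes above, but nothing here calls the library.

```python
import numpy as np

rng = np.random.default_rng(1)
reference = rng.normal(size=(20, 8))     # rows: samples, columns: variants
new_data = rng.normal(size=(5, 8))       # same 8 variants, different samples

# "fit": per-variant mean and top-2 principal axes from the reference panel.
mean_ = reference.mean(axis=0)
_, _, Vt = np.linalg.svd(reference - mean_, full_matrices=False)
components_ = Vt[:2]

# "transform": center with the fitted mean, then multiply by the fitted axes.
projected = (new_data - mean_) @ components_.T
reference_coords = (reference - mean_) @ components_.T
print(projected.shape, reference_coords.shape)
```

The new samples are centered with the reference mean rather than their own, which is why the variant set has to match the one used during fitting.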
- fit_transform(snpobj=None, average_strands=None, samples_subset=None, snps_subset=None)[source]¶
Compute PCA eigenvectors and project the same data onto them.
This is the common one-step PCA workflow. It prepares the genotype matrix, computes the principal axes, and returns the projected PC coordinates for the same rows used to compute the axes. The fitted eigenvectors/loadings are stored in components_ and the projected coordinates are stored in X_new_.
- Parameters:
snpobj (SNPObject, optional) – Genotype data used to compute PCs and projected coordinates. If None, defaults to self.snpobj.
average_strands (bool, optional) – Whether to average two-strand genotypes into one dosage row per sample. If None, defaults to self.average_strands.
samples_subset (int or list of int, optional) – Samples to include. An integer selects the first n samples; a list selects explicit sample indices. If None, defaults to self.samples_subset.
snps_subset (int or list of int, optional) – Variants to include. An integer selects the first n variants; a list selects explicit variant indices. If None, defaults to self.snps_subset.
- Returns:
tensor or array – PC coordinates with one row per sample or strand and one column per component.
- class snputils.processing.pca.TorchPCA(n_components=2, fitting='exact')[source]¶
Bases: object
GPU-based Principal Component Analysis (PCA) using PyTorch tensors.
This implementation supports exact and approximate SVD fitting modes and is intended for accelerated execution on CUDA-capable hardware.
- Parameters:
n_components (int, default=2) – The number of principal components. If None, defaults to the minimum of n_samples and n_snps.
fitting (str, default='exact') – SVD mode for PCA. Use 'exact' for economy SVD via torch.linalg.svd(full_matrices=False), or 'lowrank' for faster approximate SVD via torch.svd_lowrank.
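To give a feel for the difference between the two fitting modes, the sketch below compares an economy SVD with a generic randomized range-finder approximation in NumPy. It is not the code behind torch.svd_lowrank or scikit-learn's randomized solver; the function randomized_svd and its oversample parameter are illustrative assumptions.

```python
import numpy as np

def randomized_svd(X, rank, oversample=5, seed=0):
    """Approximate top-`rank` SVD via a random range finder (generic sketch)."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(size=(X.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(X @ omega)            # orthonormal basis for the sampled range
    U_small, S, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return (Q @ U_small)[:, :rank], S[:rank], Vt[:rank]

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 200))   # rank-3 toy matrix
X = X - X.mean(axis=0)                                      # center each column

_, S_exact, _ = np.linalg.svd(X, full_matrices=False)       # economy ("exact") SVD
U_approx, S_approx, Vt_approx = randomized_svd(X, rank=3)   # low-rank approximation
print(S_exact[:3], S_approx)
```

On a matrix that really is low rank the two agree closely; on full-rank genotype data the approximate mode trades some accuracy for speed.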
- property n_components¶
Retrieve n_components.
- Returns:
int – The number of principal components.
- property fitting¶
Retrieve fitting.
- Returns:
str – 'exact' (economy SVD) or 'lowrank' (approximate).
- property n_components_¶
Retrieve n_components_.
- Returns:
int – The effective number of components retained after fitting, calculated as min(self.n_components, min(n_samples, n_snps)).
- property components_¶
Retrieve components_.
- Returns:
tensor – Matrix of principal components, where each row is a principal component vector.
- property mean_¶
Retrieve mean_.
- Returns:
tensor – Per-feature mean vector of the input data used for centering.
- property X_new_¶
Retrieve X_new_.
- Returns:
tensor – The transformed SNP data projected onto the n_components_ principal components.
- copy()[source]¶
Create and return a copy of self.
- Returns:
TorchPCA – A new instance of the current object.
- fit(X)[source]¶
Fit the model to the input SNP data.
- Parameters:
X (tensor) – The SNP data matrix to fit the model.
- Returns:
TorchPCA – The fitted instance of self.
- transform(X)[source]¶
Apply dimensionality reduction to the input SNP data using the fitted model.
- Parameters:
X (tensor) – The SNP data matrix to be transformed.
- Returns:
tensor – The transformed SNP data projected onto the n_components_ principal components, stored in self.X_new_.
- fit_transform(X)[source]¶
Fit the model to the SNP data and apply dimensionality reduction on the same SNP data.
- Parameters:
X (tensor) – The SNP data matrix used for both fitting and transformation.
- Returns:
tensor – The transformed SNP data projected onto the n_components_ principal components, stored in self.X_new_.
- class snputils.processing.mdpca.mdPCA(snpobj=None, laiobj=None, labels_file=None, *, labels=None, ancestry=None, method='weighted_cov_pca', is_masked=True, average_strands=False, force_nan_incomplete_strands=False, is_weighted=False, groups_to_remove=None, min_percent_snps=4, group_snp_frequencies_only=True, save_masks=False, load_masks=False, masks_file='masks.npz', embedding_table_path=None, output_file=None, covariance_matrix_file=None, n_components=2, rsid_or_chrompos=2, percent_vals_masked=0)[source]¶
Bases: object
A class for performing missing data principal component analysis (mdPCA) on SNP data.
The mdPCA class focuses on genotype segments from the ancestry of interest when the is_masked flag is set to True. It offers flexible processing options, allowing either separate handling of masked haplotype strands or combining (averaging) strands into a single composite representation for each individual. Moreover, the analysis can be performed on individual-level data, group-level SNP frequencies, or a combination of both.
If snpobj, laiobj, labels_file, and ancestry are all provided during instantiation, fit_transform() is called automatically, applying the specified mdPCA method to the data.
- Parameters:
method (str, default='weighted_cov_pca') –
The PCA method to use for dimensionality reduction. Options include:
- ’weighted_cov_pca’:
Simple covariance-based PCA, weighted by sample strengths.
- ’regularized_optimization_ils’:
Regularized optimization followed by iterative, weighted (via the strengths) least squares projection of missing samples using the original covariance matrix (considering only relevant elements not missing in the original covariance matrix for those samples).
- ’cov_matrix_imputation’:
Eigen-decomposition of the covariance matrix after first imputing the covariance matrix missing values using the Iterative SVD imputation method.
- ’cov_matrix_imputation_ils’:
The method of ‘cov_matrix_imputation’, but where afterwards missing samples are re-projected onto the space given by ‘cov_matrix_imputation’ using the same iterative method on the original covariance matrix just as done in ‘regularized_optimization_ils’.
- ’nonmissing_pca_ils’:
The method of ‘weighted_cov_pca’ on the non-missing samples, followed by the projection of missing samples onto the space given by ‘weighted_cov_pca’ using the same iterative method on the original covariance matrix just as done in ‘regularized_optimization_ils’.
snpobj (SNPObject, optional) – A SNPObject instance.
laiobj (LocalAncestryObject, optional) – A LocalAncestryObject instance.
labels_file (str, optional) – Path to a .tsv file with metadata on individuals, including population labels, optional weights, and groupings. The indID column must contain unique individual identifiers matching those in laiobj and snpobj for proper alignment. The label column assigns population groups. If is_weighted=True, a weight column must be provided, assigning a weight to each individual, where those with a weight of zero are removed. Optional columns include combination and combination_weight to aggregate individuals into combined groups, where SNP frequencies represent their sequences. The combination column assigns each individual to a specific group (0 for no combination, 1 for the first group, 2 for the second, etc.). All members of a group must share the same label and combination_weight. If combination_weight column is not provided, the combinations are assigned a default weight of 1. Individuals excluded via groups_to_remove or those falling below min_percent_snps are removed from the analysis.
labels (pandas.DataFrame or str, optional) – In-memory labels table with the same columns as labels_file, or a path. Pass only one of labels and labels_file.
ancestry (int or str, optional) – Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0. The ancestry input can be:
- An integer (e.g., 0, 1, 2).
- A string representation of an integer (e.g., ‘0’, ‘1’).
- A string matching one of the ancestry map values (e.g., ‘Africa’).
is_masked (bool, default=True) – If True, applies ancestry-specific masking to the genotype matrix, retaining only genotype data corresponding to the specified ancestry. If False, uses the full, unmasked genotype matrix.
average_strands (bool, default=False) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.
force_nan_incomplete_strands (bool, default=False) – If True, sets the result to NaN if either haplotype in a pair is NaN. Otherwise, computes the mean while ignoring NaNs (e.g., 0|NaN -> 0, 1|NaN -> 1).
is_weighted (bool, default=False) – If True, assigns individual weights from the weight column in labels_file. Otherwise, all individuals have equal weight of 1.
groups_to_remove (list of str, optional) – List with groups to exclude from analysis. Example: [‘group1’, ‘group2’].
min_percent_snps (float, default=4) – Minimum percentage of SNPs that must be known for an individual and of the ancestry of interest to be included in the analysis. All individuals with fewer percent of unmasked SNPs than this threshold will be excluded.
group_snp_frequencies_only (bool, default=True) – If True, mdPCA is performed exclusively on group-level SNP frequencies, ignoring individual-level data. This applies when is_weighted is set to True and a combination column is provided in the labels_file, meaning individuals are aggregated into groups based on their assigned labels. If False, mdPCA is performed on individual-level SNP data alone or on both individual-level and group-level SNP frequencies when is_weighted is True and a combination column is provided.
save_masks (bool, default=False) – True if the masked matrices are to be saved in a .npz file, or False otherwise.
load_masks (bool, default=False) – True if the masked matrices are to be loaded from a pre-existing .npz file specified by masks_file, or False otherwise.
masks_file (str or pathlib.Path, default='masks.npz') – Path to the .npz file used for saving/loading masked matrices.
embedding_table_path (str or pathlib.Path, optional) – If set, fit_transform() writes the projection (X_new_) to this path as TSV/CSV with per-row IDs; see snputils.processing.dimred_tabular. Default is None (no table written).
output_file (str or pathlib.Path, optional) – Deprecated synonym for embedding_table_path. If both are given, a ValueError is raised.
covariance_matrix_file (str, optional) – Path to save the covariance matrix file in .npy format. If None, the covariance matrix is not saved. Default is None.
n_components (int, default=2) – The number of principal components.
rsid_or_chrompos (int, default=2) – Format indicator for SNP IDs in the SNP data. Use 1 for rsID format or 2 for chromosome_position.
percent_vals_masked (float, default=0) – Percentage of values in the covariance matrix to be masked and then imputed. Only applicable if method is ‘cov_matrix_imputation’ or ‘cov_matrix_imputation_ils’.
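The interplay of is_masked and min_percent_snps described in the parameters above can be illustrated on toy arrays. The (n_haplotypes, n_snps) layout, the helper name, and the use of NaN as the missing marker are assumptions for this sketch, which does not call snputils.

```python
import numpy as np

def mask_and_filter(haplotypes, lai, ancestry, min_percent_snps=4):
    """Keep genotypes of one ancestry, then drop rows with too few known SNPs."""
    # Entries whose local ancestry differs from the target become missing.
    masked = np.where(lai == ancestry, haplotypes.astype(float), np.nan)
    # Percentage of SNPs that remain known for each row.
    percent_known = 100.0 * np.mean(~np.isnan(masked), axis=1)
    keep = percent_known >= min_percent_snps
    return masked[keep], keep

haplotypes = np.array([[0, 1, 1, 0],
                       [1, 1, 0, 0],
                       [0, 0, 1, 1]])
lai = np.array([[0, 0, 1, 1],      # half of the SNPs assigned to ancestry 0
                [1, 1, 1, 1],      # no SNPs assigned to ancestry 0
                [0, 0, 0, 0]])     # all SNPs assigned to ancestry 0
masked, keep = mask_and_filter(haplotypes, lai, ancestry=0, min_percent_snps=4)
print(keep)
print(masked)
```

The second row has no SNPs of the target ancestry, so it falls below the 4% threshold and is removed.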
- property method¶
Retrieve method.
- Returns:
str – The PCA method to use for dimensionality reduction.
- property snpobj¶
Retrieve snpobj.
- Returns:
SNPObject – A SNPObject instance.
- property laiobj¶
Retrieve laiobj.
- Returns:
LocalAncestryObject – A LocalAncestryObject instance.
- property labels_file¶
Retrieve labels_file.
- Returns:
str – Path to the labels file in .tsv format.
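A labels table following the column rules described for labels_file could be checked with a few lines of standard-library code. The sample rows and the helper check_labels are made up for illustration; the checks cover unique indID values, consistent label and combination_weight within a combination group, and removal of zero-weight individuals.

```python
import csv
import io

LABELS_TSV = (
    "indID\tlabel\tweight\tcombination\tcombination_weight\n"
    "ind1\tpopA\t1\t0\t1\n"
    "ind2\tpopB\t1\t1\t2\n"
    "ind3\tpopB\t1\t1\t2\n"
    "ind4\tpopA\t0\t0\t1\n"
)

def check_labels(text):
    """Validate a labels table and return the rows that would be kept."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    ids = [r["indID"] for r in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("indID values must be unique")
    groups = {}
    for r in rows:
        if r["combination"] != "0":            # 0 means "not part of a combination"
            key = (r["label"], r["combination_weight"])
            if groups.setdefault(r["combination"], key) != key:
                raise ValueError(f"combination {r['combination']} mixes labels or weights")
    # Individuals with a weight of zero are removed.
    return [r for r in rows if float(r["weight"]) != 0]

kept = check_labels(LABELS_TSV)
print([r["indID"] for r in kept])
```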
- property ancestry¶
Retrieve ancestry.
- Returns:
str – Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0.
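The three accepted forms of the ancestry argument (integer, numeric string, ancestry name) suggest a resolution step along these lines. The helper name and the shape of ancestry_map (code strings mapped to names) are assumptions; how the library stores the resolved value is not shown here.

```python
def resolve_ancestry(ancestry, ancestry_map):
    """Map an int, numeric string, or ancestry name to a 0-based index (sketch)."""
    if isinstance(ancestry, int):
        return ancestry
    text = str(ancestry).strip()
    if text.isdigit():
        return int(text)
    for code, name in ancestry_map.items():
        if name == text:
            return int(code)
    raise ValueError(f"unknown ancestry: {ancestry!r}")

ancestry_map = {"0": "Africa", "1": "Europe"}   # hypothetical map for the example
resolved = [resolve_ancestry(a, ancestry_map) for a in (1, "0", "Africa")]
print(resolved)
```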
- property is_masked¶
Retrieve is_masked.
- Returns:
bool – True if ancestry-specific masking is applied to the genotype matrix, or False if the full, unmasked matrix is used.
- property average_strands¶
Retrieve average_strands.
- Returns:
bool – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.
- property force_nan_incomplete_strands¶
Retrieve force_nan_incomplete_strands.
- Returns:
bool – If True, sets the result to NaN if either haplotype in a pair is NaN. Otherwise, computes the mean while ignoring NaNs (e.g., 0|NaN -> 0, 1|NaN -> 1).
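The two averaging behaviors can be written out for a single pair of strand values. This is a minimal sketch of the rule as stated, with a hypothetical function name; returning NaN when both strands are missing is an assumption.

```python
import math

def average_pair(a, b, force_nan_incomplete_strands=False):
    """Average two strand values, handling NaN according to the flag."""
    a_missing, b_missing = math.isnan(a), math.isnan(b)
    if a_missing and b_missing:
        return math.nan                      # nothing to average (assumed behavior)
    if a_missing or b_missing:
        if force_nan_incomplete_strands:
            return math.nan                  # incomplete pair becomes missing
        return b if a_missing else a         # ignore the NaN: 0|NaN -> 0, 1|NaN -> 1
    return (a + b) / 2

lenient = average_pair(1, math.nan)
strict = average_pair(1, math.nan, force_nan_incomplete_strands=True)
print(lenient, strict)
```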
- property is_weighted¶
Retrieve is_weighted.
- Returns:
bool – True if weights are provided in the labels file, or False otherwise.
- property groups_to_remove¶
Retrieve groups_to_remove.
- Returns:
dict of int to list of str – Dictionary specifying groups to exclude from analysis. Keys are array numbers, and values are lists of groups to remove for each array. Example: {1: [‘group1’, ‘group2’], 2: [], 3: [‘group3’]}.
- property min_percent_snps¶
Retrieve min_percent_snps.
- Returns:
float – Minimum percentage of SNPs that must be known for an individual to be included in the analysis. All individuals with fewer percent of unmasked SNPs than this threshold will be excluded.
- property group_snp_frequencies_only¶
Retrieve group_snp_frequencies_only.
- Returns:
bool – If True, mdPCA is performed exclusively on group-level SNP frequencies, ignoring individual-level data. This applies when is_weighted is set to True and a combination column is provided in the labels_file, meaning individuals are aggregated into groups based on their assigned labels. If False, mdPCA is performed on individual-level SNP data alone or on both individual-level and group-level SNP frequencies when is_weighted is True and a combination column is provided.
- property save_masks¶
Retrieve save_masks.
- Returns:
bool – True if the masked matrices are to be saved in a .npz file, or False otherwise.
- property load_masks¶
Retrieve load_masks.
- Returns:
bool – True if the masked matrices are to be loaded from a pre-existing .npz file specified by masks_file, or False otherwise.
- property masks_file¶
Retrieve masks_file.
- Returns:
str or pathlib.Path – Path to the .npz file used for saving/loading masked matrices.
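Saving and reloading arrays through an .npz archive, as save_masks and load_masks do with masks_file, follows the usual NumPy pattern. The key names masks and haplotype_ids below are assumptions; the keys the library actually writes are not listed in this section.

```python
import os
import tempfile
import numpy as np

masks = np.array([[0.0, np.nan], [1.0, 1.0]])    # toy masked matrix
ids = np.array(["ind1|0", "ind1|1"])             # toy row identifiers

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "masks.npz")
    # Key names are illustrative assumptions.
    np.savez_compressed(path, masks=masks, haplotype_ids=ids)
    with np.load(path) as archive:
        loaded_masks = archive["masks"]
        loaded_ids = archive["haplotype_ids"]

print(loaded_masks, loaded_ids)
```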
- property embedding_table_path¶
If set, fit_transform() writes a TSV/CSV embedding table to this path (see snputils.processing.dimred_tabular).
- property output_file¶
Deprecated alias for embedding_table_path (no warning on access).
- property covariance_matrix_file¶
Retrieve covariance_matrix_file.
- Returns:
str – Path to save the covariance matrix file in .npy format.
- property n_components¶
Retrieve n_components.
- Returns:
int – The number of principal components.
- property rsid_or_chrompos¶
Retrieve rsid_or_chrompos.
- Returns:
int – Format indicator for SNP IDs in the SNP data. Use 1 for rsID format or 2 for chromosome_position.
- property percent_vals_masked¶
Retrieve percent_vals_masked.
- Returns:
float – Percentage of values in the covariance matrix to be masked and then imputed. Only applicable if method is ‘cov_matrix_imputation’ or ‘cov_matrix_imputation_ils’.
- property X_new_¶
Retrieve X_new_.
- Returns:
array – The transformed SNP data projected onto the n_components principal components.
n_haplotypes_ is the number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0). For diploid individuals without filtering, the shape is (n_samples * 2, n_components).
- property haplotypes_¶
Retrieve haplotypes_.
- Returns:
list of str – A list of unique haplotype identifiers.
- property samples_¶
Retrieve samples_.
- Returns:
list of str – A list of sample identifiers based on haplotypes_ and average_strands.
- property variants_id_¶
Retrieve variants_id_.
- Returns:
array – An array containing unique identifiers (IDs) for each SNP, potentially reduced if there are SNPs not present in the laiobj. The format will depend on rsid_or_chrompos.
- property n_haplotypes¶
Retrieve n_haplotypes.
- Returns:
int – The total number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0).
- property n_samples¶
Retrieve n_samples.
- Returns:
int – The total number of samples, potentially reduced if filtering is applied (min_percent_snps > 0).
- copy()[source]¶
Create and return a copy of self.
- Returns:
mdPCA – A new instance of the current object.
- fit_transform(snpobj=None, laiobj=None, labels_file=None, ancestry=None, *, labels=None, average_strands=None)[source]¶
Fit the model to the SNP data stored in the provided snpobj and apply the dimensionality reduction on the same SNP data.
This method starts by loading or updating SNP and ancestry data. Then, it manages missing values by applying masks based on ancestry, either by loading a pre-existing mask or generating new ones. After processing these masks to produce an incomplete SNP data matrix, it applies the chosen mdPCA method to reduce dimensionality while handling missing data as specified.
- Parameters:
snpobj (SNPObject, optional) – A SNPObject instance.
laiobj (LocalAncestryObject, optional) – A LocalAncestryObject instance.
labels_file (str or pandas.DataFrame, optional) – Path to the labels file in .tsv format, or an in-memory labels DataFrame. The first column, indID, contains the individual identifiers, and the second column, label, specifies the groups for all individuals. If is_weighted=True, a weight column with individual weights is required. Optionally, combination and combination_weight columns can specify sets of individuals to be combined into groups, with respective weights.
labels (pandas.DataFrame or str, optional) – Alias for labels_file. Pass only one of labels and labels_file.
ancestry (str, optional) – Ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0.
average_strands (bool, optional) – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise. If None, defaults to self.average_strands.
- Returns:
array – The transformed SNP data projected onto the n_components principal components, stored in self.X_new_.
- class snputils.processing.maasmds.maasMDS(snpobj=None, laiobj=None, labels_file=None, ancestry=None, is_masked=True, average_strands=False, force_nan_incomplete_strands=False, is_weighted=False, groups_to_remove=None, min_percent_snps=4, group_snp_frequencies_only=True, save_masks=False, load_masks=False, masks_file='masks.npz', distance_type='AP', n_components=2, rsid_or_chrompos=2, embedding_table_path=None, labels=None)[source]¶
Bases: object
Multi-array, ancestry-specific multidimensional scaling (maasMDS) on SNP data.
When is_masked is True, genotype entries not attributed to the chosen ancestry are set to missing so distances reflect the ancestry segment of interest. You can keep haplotypes separate or average parental strands (see average_strands). The workflow supports individual-level genotypes, group-level allele frequencies, or a mixture when weighting and combination columns are used in the labels file.
Multiple arrays. Pass a sequence of SNPObject and a parallel sequence of LocalAncestryObject instances (one LAI object per array). Genotypes are harmonized to a shared reference allele across arrays where possible; pairwise distances between arrays use overlapping SNPs and linear calibration so embeddings are comparable. After fit_transform(), array_labels_ holds the array index for each row of X_new_.
If snpobj, laiobj, labels_file, and ancestry are all provided at construction time, fit_transform() runs immediately.
- Parameters:
snpobj (SNPObject or sequence of SNPObject, optional) – One SNP object or a list/tuple of objects, one per genotyping array.
laiobj (LocalAncestryObject or sequence of LocalAncestryObject, optional) – Local ancestry object(s) parallel to snpobj when is_masked is True.
labels_file (str, optional) – Path to a TSV with at least columns indID and label. If is_weighted is True, a weight column is required. Optional combination and combination_weight columns define merged groups.
labels (pandas.DataFrame or str, optional) – In-memory labels table with the same columns as labels_file, or a path. Pass only one of labels and labels_file.
ancestry (int or str, optional) – Target ancestry index or name. Indices start at 0. Accepts an int, a numeric string (e.g. "0"), or a string equal to a value in the LAI ancestry map.
is_masked (bool, optional) – If True (default), keep only genotypes assigned to ancestry; otherwise use the full matrix.
average_strands (bool, optional) – If True, average the two haplotypes per individual.
force_nan_incomplete_strands (bool, optional) – If True, strand pairs with any missing value become NaN; if False, average while ignoring NaNs (e.g. 0 with NaN yields 0).
is_weighted (bool, optional) – If True, read per-individual weights from the labels file.
groups_to_remove (optional) – Labels to drop before analysis: None; a dict mapping 1-based array index to a list of labels; a single list of labels applied to every array; or a sequence of length num_arrays with one list of labels per array.
min_percent_snps (float, optional) – Minimum percentage of non-missing SNPs per individual (default 4 means 4%).
group_snp_frequencies_only (bool, optional) – If True, use only group-level frequencies when combinations are defined; if False, keep individual-level (and optionally group-level) inputs.
save_masks (bool, optional) – If True, write masks and sidecar arrays to masks_file.
load_masks (bool, optional) – If True, read precomputed masks from masks_file instead of genotypes.
masks_file (str or pathlib.Path, optional) – Path for the compressed .npz mask archive.
distance_type (str, optional) – "Manhattan", "RMS", or "AP" (average pairwise). With average_strands=True, "AP" is appropriate.
n_components (int, optional) – Embedding dimension (default 2).
rsid_or_chrompos (int, optional) – 1 for rsID-style IDs, 2 for chromosome/position encoding (default 2).
embedding_table_path (path, optional) – If set, fit_transform() writes X_new_ with row metadata to this TSV/CSV path (see snputils.processing.dimred_tabular).
- copy()[source]¶
Create and return a copy of self.
- Returns:
maasMDS – A new instance of the current object.
- property snpobj¶
Retrieve snpobj.
- Returns:
SNPObject – A SNPObject instance.
- property laiobj¶
Retrieve laiobj.
- Returns:
LocalAncestryObject or sequence thereof – Local ancestry object(s) for masking.
- property labels_file¶
Retrieve labels_file.
- Returns:
str – Path to the labels file in .tsv format.
- property ancestry¶
Retrieve ancestry.
- Returns:
int – Ancestry index for which dimensionality reduction is to be performed. Ancestry counter starts at 0.
- property is_masked¶
Retrieve is_masked.
- Returns:
bool – True if ancestry-specific masking is applied to the genotype matrix, or False if the full, unmasked matrix is used.
- property average_strands¶
Retrieve average_strands.
- Returns:
bool – True if the haplotypes from the two parents are to be combined (averaged) for each individual, or False otherwise.
- property force_nan_incomplete_strands¶
Retrieve force_nan_incomplete_strands.
- Returns:
bool – If True, sets the result to NaN if either haplotype in a pair is NaN. Otherwise, computes the mean while ignoring NaNs (e.g., 0|NaN -> 0, 1|NaN -> 1).
- property is_weighted¶
Retrieve is_weighted.
- Returns:
bool – True if weights are provided in the labels file, or False otherwise.
- property groups_to_remove¶
Retrieve groups_to_remove.
- Returns:
A flat removal list or a per-array mapping of labels to remove.
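The accepted shapes of groups_to_remove (None, a dict keyed by 1-based array index, one flat list, or one list per array) can all be reduced to a single per-array mapping. The helper below is a hypothetical sketch of such a normalization, not the library's own code.

```python
def normalize_groups_to_remove(groups_to_remove, num_arrays):
    """Return {1-based array index: [labels]} for every array (hypothetical helper)."""
    indices = range(1, num_arrays + 1)
    if groups_to_remove is None:
        return {i: [] for i in indices}
    if isinstance(groups_to_remove, dict):
        # Arrays missing from the dict get an empty removal list.
        return {i: list(groups_to_remove.get(i, [])) for i in indices}
    items = list(groups_to_remove)
    if all(isinstance(item, str) for item in items):
        # One flat list of labels: apply it to every array.
        return {i: list(items) for i in indices}
    if len(items) != num_arrays:
        raise ValueError("expected one list of labels per array")
    return {i: list(labels) for i, labels in enumerate(items, start=1)}

flat = normalize_groups_to_remove(["x"], 2)
per_array = normalize_groups_to_remove([["a"], []], 2)
from_dict = normalize_groups_to_remove({2: ["b"]}, 2)
print(flat, per_array, from_dict)
```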
- property min_percent_snps¶
Retrieve min_percent_snps.
- Returns:
float – Minimum percentage of SNPs to be known in an individual for an individual to be included in the analysis. All individuals with fewer percent of unmasked SNPs than this threshold will be excluded.
- property group_snp_frequencies_only¶
Retrieve group_snp_frequencies_only.
- Returns:
bool – If True, maasMDS is performed exclusively on group-level SNP frequencies, ignoring individual-level data. This applies when is_weighted is set to True and a combination column is provided in the labels_file, meaning individuals are aggregated into groups based on their assigned labels. If False, maasMDS is performed on individual-level SNP data alone or on both individual-level and group-level SNP frequencies when is_weighted is True and a combination column is provided.
- property save_masks¶
Retrieve save_masks.
- Returns:
bool – True if the masked matrices are to be saved in a .npz file, or False otherwise.
- property load_masks¶
Retrieve load_masks.
- Returns:
bool – True if the masked matrices are to be loaded from a pre-existing .npz file specified by masks_file, or False otherwise.
- property masks_file¶
Retrieve masks_file.
- Returns:
str or pathlib.Path – Path to the .npz file used for saving/loading masked matrices.
- property distance_type¶
Retrieve distance_type.
- Returns:
str – Distance metric to use. Options are ‘Manhattan’, ‘RMS’ (root mean square), and ‘AP’ (average pairwise). If average_strands=True, use distance_type='AP'.
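With masked genotypes, a distance between two rows can only use SNPs observed in both. The sketch below shows one plausible reading of the ‘Manhattan’ and ‘RMS’ options under that constraint. Normalizing by the number of overlapping SNPs is an assumption, and ‘AP’ is left out because its exact definition is not given in this section.

```python
import numpy as np

def masked_distance(x, y, distance_type="Manhattan"):
    """Distance over SNPs observed in both rows; NaN when nothing overlaps."""
    both = ~np.isnan(x) & ~np.isnan(y)
    if not both.any():
        return np.nan
    diff = x[both] - y[both]
    if distance_type == "Manhattan":
        return float(np.mean(np.abs(diff)))        # mean absolute difference (assumed scaling)
    if distance_type == "RMS":
        return float(np.sqrt(np.mean(diff ** 2)))  # root mean square difference
    raise ValueError(f"unsupported distance_type: {distance_type}")

x = np.array([0.0, 1.0, np.nan, 1.0])
y = np.array([1.0, 1.0, 0.0, np.nan])
manhattan = masked_distance(x, y, "Manhattan")
rms = masked_distance(x, y, "RMS")
print(manhattan, rms)
```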
- property n_components¶
Retrieve n_components.
- Returns:
int – The number of embedding dimensions.
- property rsid_or_chrompos¶
Retrieve rsid_or_chrompos.
- Returns:
int – Format indicator for SNP IDs in the SNP data. Use 1 for rsID format or 2 for chromosome_position.
- property embedding_table_path¶
Optional path for the tabular embedding written by fit_transform().
- property X_new_¶
Retrieve X_new_.
- Returns:
array – The transformed SNP data projected onto the n_components principal components. n_haplotypes_ is the number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0). For diploid individuals without filtering, the shape is (n_samples * 2, n_components).
- property haplotypes_¶
Retrieve haplotypes_.
- Returns:
list of str – A list of unique haplotype identifiers.
- property samples_¶
Retrieve samples_.
- Returns:
list of str – A list of sample identifiers based on haplotypes_ and average_strands.
- property variants_id_¶
Retrieve variants_id_.
- Returns:
numpy.ndarray or list of numpy.ndarray – Per-SNP identifiers after LAI alignment. For a single array this is a 1-D array; for multiple arrays, a list with one array of IDs per array (overlap sets can differ). Interpretation follows rsid_or_chrompos.
- property n_haplotypes¶
Retrieve n_haplotypes.
- Returns:
int or None – The total number of haplotypes, potentially reduced if filtering is applied (min_percent_snps > 0). None before fit_transform() has been called.
- property n_samples¶
Retrieve n_samples.
- Returns:
int or None – The total number of samples, potentially reduced if filtering is applied (min_percent_snps > 0). None before fit_transform() has been called.
- fit_transform(snpobj=None, laiobj=None, labels_file=None, ancestry=None, average_strands=None, *, labels=None)[source]¶
Estimate the MDS embedding and store it on the instance.
Omitted arguments fall back to attributes set on the object or in __init__.
- Parameters:
snpobj (SNPObject or sequence of SNPObject, optional) – Input genotype container(s).
laiobj (LocalAncestryObject or sequence of LocalAncestryObject, optional) – Matching LAI object(s) when masking is enabled.
labels_file (str or pandas.DataFrame, optional) – TSV path or in-memory DataFrame with indID / label (and optional weight / combination columns).
labels (pandas.DataFrame or str, optional) – Alias for labels_file. Pass only one of labels and labels_file.
ancestry (int or str, optional) – Same conventions as in __init__.
average_strands (bool, optional) – If omitted, uses self.average_strands.
- Returns:
numpy.ndarray – Embedding of shape (n_rows, n_components) with n_rows equal to the number of haplotypes (or samples if strands are averaged) after min_percent_snps filtering. Also assigned to X_new_; row-wise array indices are in array_labels_ when multiple arrays are combined.
Embedding Utilities¶
- snputils.processing.build_embedding_dataframe(X_new, *, ind_ids, haplotype_ids=None, array_index=None, method, component_style='PC', component_names=None)[source]¶
Build a table of identifiers plus embedding columns.
- Parameters:
X_new – Matrix of shape (n_rows, n_components) (numpy or torch).
ind_ids – Per-row individual / sample identifiers (length n_rows).
haplotype_ids – Optional per-row haplotype or replicate IDs; omitted from the frame when it would duplicate ind_ids on every row.
array_index – Optional per-row genotyping array index (multi-array maasMDS).
method – Short name stored in method column (e.g. "pca", "mdpca", "maasmds").
component_style – "PC" (PC1, …) or "MDS" (MDS1, …).
component_names – If set, column names for coordinates; length must match n_components.
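To make the resulting table layout concrete, here is a minimal pandas sketch of the behaviour described above. It is not the library's code: `sketch_embedding_frame` is a hypothetical stand-in, and the column order (identifiers, then method, then components) is an assumption.

```python
import numpy as np
import pandas as pd

def sketch_embedding_frame(X_new, ind_ids, haplotype_ids=None,
                           method="pca", component_style="PC"):
    """Assemble identifiers, a method column, and 1-based component columns."""
    X = np.asarray(X_new)
    frame = pd.DataFrame({"indID": list(ind_ids)})
    # Keep haplotype IDs only when they add information beyond ind_ids.
    if haplotype_ids is not None and list(haplotype_ids) != list(ind_ids):
        frame["haplotype_id"] = list(haplotype_ids)
    frame["method"] = method
    for j in range(X.shape[1]):
        frame[f"{component_style}{j + 1}"] = X[:, j]
    return frame

X = np.array([[0.1, -0.2], [0.3, 0.4]])
per_strand = sketch_embedding_frame(X, ["s1", "s1"], haplotype_ids=["s1|0", "s1|1"])
per_sample = sketch_embedding_frame(X, ["s1", "s2"], haplotype_ids=["s1", "s2"],
                                    component_style="MDS")
```

In `per_sample` the haplotype IDs repeat the individual IDs on every row, so the sketch drops that column, mirroring the rule stated for haplotype_ids.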
- snputils.processing.embedding_dataframe_from_model(obj, *, metadata=None, metadata_id_col='sample', metadata_join_col='indID', require_metadata_match=False)[source]¶
Build a coordinate table from a fitted dimensionality-reduction model.
This is the in-memory counterpart to save_embedding_table_from_model(). It reads obj.X_new_ and row identifiers from fitted PCA, mdPCA, and maasMDS objects, creates component columns such as PC1/PC2, and can join sample-level metadata for plotting or downstream analysis.
- Parameters:
obj – Fitted dimensionality-reduction model with X_new_ and row identifiers.
metadata – Optional sample metadata table to join to the coordinates.
metadata_id_col – Column in metadata containing sample IDs.
metadata_join_col – Coordinate-table column to join against. Use "indID" for sample-level metadata; use "haplotype_id" only for metadata keyed by expanded haplotype rows.
require_metadata_match – If True, raise when any coordinate row lacks matching metadata.
- Returns:
DataFrame containing identifiers, method, component coordinates, and optional metadata columns.
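The metadata join can be pictured as a left merge on the coordinate table. The sketch below uses only pandas and hand-made tables; `join_metadata` is a hypothetical helper, and raising ValueError on unmatched rows is an assumption about how a strict match might be enforced.

```python
import pandas as pd

coords = pd.DataFrame({
    "indID": ["s1", "s2", "s3"],
    "method": "pca",
    "PC1": [0.1, 0.2, 0.3],
    "PC2": [1.0, 2.0, 3.0],
})
metadata = pd.DataFrame({"sample": ["s1", "s2"], "population": ["A", "B"]})

def join_metadata(coords, metadata, metadata_id_col="sample",
                  metadata_join_col="indID", require_metadata_match=False):
    """Left-join metadata onto coordinates, optionally requiring every row to match."""
    merged = coords.merge(metadata, how="left", left_on=metadata_join_col,
                          right_on=metadata_id_col, indicator=True)
    unmatched = merged["_merge"] == "left_only"
    if require_metadata_match and unmatched.any():
        missing = merged.loc[unmatched, metadata_join_col].tolist()
        raise ValueError(f"No metadata for: {missing}")
    drop = ["_merge"]
    if metadata_id_col != metadata_join_col:
        drop.append(metadata_id_col)  # avoid a redundant second ID column
    return merged.drop(columns=drop)

joined = join_metadata(coords, metadata)
```

Here `s3` has no metadata, so its `population` is missing in `joined`; calling the helper with require_metadata_match=True raises instead.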
- snputils.processing.save_embedding_table(path, X_new, *, ind_ids, haplotype_ids=None, array_index=None, method='dimred', component_style='PC', component_names=None, sep='\t', float_format='%.8g')[source]¶
Write embedding coordinates and identifiers to a CSV/TSV file on disk.
Compression is inferred from the file suffix (e.g. .gz) via pandas.DataFrame.to_csv(). Tab separation is used by default; use sep="," for CSV.
- Returns:
Resolved path that was written.
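The suffix-based compression comes from pandas itself, so it can be demonstrated without the library. The sketch below writes a small hand-made table with the same default sep and float_format; passing index=False is an assumption about how the table is written.

```python
import tempfile
from pathlib import Path

import pandas as pd

frame = pd.DataFrame({
    "indID": ["s1", "s2"],
    "method": ["pca", "pca"],
    "PC1": [0.123456789012, -1.5],
    "PC2": [2.0, 0.25],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embedding.tsv.gz"
    # pandas infers gzip compression from the ".gz" suffix.
    frame.to_csv(path, sep="\t", float_format="%.8g", index=False)
    magic = path.read_bytes()[:2]             # gzip files start with 0x1f 0x8b
    round_trip = pd.read_csv(path, sep="\t")  # decompression is inferred as well
```

Note that float_format='%.8g' keeps eight significant digits, so coordinates read back from disk can differ slightly from the in-memory values.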
- snputils.processing.save_embedding_table_from_model(obj, path, *, sep='\t', float_format='%.8g')[source]¶
Write obj.X_new_ using identifiers from a fitted dimensionality-reduction object.
Expects X_new_ and either samples_ or haplotypes_ (as produced by mdPCA / maasMDS / PCA after fit_transform()). Includes array_index when array_labels_ is present (maasMDS).
- Parameters:
obj – A fitted PCA, mdPCA, or maasMDS instance.
path – Output path (.tsv, .csv, or compressed variants).
- snputils.processing.try_save_embedding_table(obj, path, *, sep='\t', float_format='%.8g')[source]¶
Call save_embedding_table_from_model() when path is not None.
Use this from fit_transform implementations so writing is a no-op unless a path was configured on the object.
- snputils.processing.embedding_column_names(n_components, style='PC')[source]¶
Return column names for embedding coordinates (1-based, e.g. PC1).
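The naming scheme is simple enough to state in one line. The function below is a hypothetical equivalent written for illustration, not the library's source.

```python
def sketch_column_names(n_components, style="PC"):
    """Return 1-based coordinate column names such as PC1, PC2, ..."""
    return [f"{style}{i}" for i in range(1, n_components + 1)]

pc_names = sketch_column_names(3)
mds_names = sketch_column_names(2, style="MDS")
```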
- snputils.processing.pca_row_haplotype_ids(snpobj, average_strands, samples_subset=None)[source]¶
Row identifiers aligned with snputils.processing.pca.PCA.fit_transform() output rows.
Each string uniquely identifies one row of the projection (one haplotype row when average_strands is False). Format uses "indID|strand" with strand 0 or 1 when two strands are expanded; use pca_row_individual_ids() for sample IDs alone.
- Raises:
ValueError – If snpobj.samples is missing while IDs are required for export.
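A short sketch of the "indID|strand" format follows. It works on a plain list of sample IDs rather than an SNPObject, `sketch_row_ids` is a hypothetical name, and the row order shown (both strands of one sample adjacent) is an assumption that may not match the library's actual row layout.

```python
def sketch_row_ids(samples, average_strands):
    """Build one ID per output row: sample IDs, or 'indID|strand' per expanded strand."""
    if samples is None:
        raise ValueError("Sample IDs are required to build row identifiers.")
    if average_strands:
        return [str(s) for s in samples]
    # Assumed ordering: strand 0 then strand 1 for each sample in turn.
    return [f"{s}|{strand}" for s in samples for strand in (0, 1)]

averaged_ids = sketch_row_ids(["HG001", "HG002"], average_strands=True)
expanded_ids = sketch_row_ids(["HG001", "HG002"], average_strands=False)
```

Splitting an expanded ID on "|" recovers the sample ID, which is useful when joining haplotype-level rows back to sample-level metadata.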