SNP Data
Objects, readers, writers, and convenience functions for genotype data.
Objects
- class snputils.SNPObject(calldata_gt=None, samples=None, variants_ref=None, variants_alt=None, variants_chrom=None, variants_cm=None, variants_filter_pass=None, variants_id=None, variants_pos=None, variants_qual=None, variants_info=None, calldata_lai=None, ancestry_map=None, sample_fid=None, sample_sex=None)[source]¶
Bases: object
A class for Single Nucleotide Polymorphism (SNP) data, with optional support for SNP-level Local Ancestry Information (LAI).
- Parameters:
calldata_gt (array, optional) – An array containing genotype data for each sample. This array can be either 2D with shape (n_snps, n_samples) if the paternal and maternal strands are summed, or 3D with shape (n_snps, n_samples, 2) if the strands are kept separate.
samples (array of shape (n_samples,), optional) – An array containing unique sample identifiers.
sample_fid (array of shape (n_samples,), optional) – PLINK-style family ID per sample (same order as samples), for example population labels in column 1 of a .fam file. When present and not identical to samples, f-statistics use these values as default group labels if sample_labels is omitted.
sample_sex (array of shape (n_samples,), optional) – PLINK-style sex code per sample, aligned with samples.
variants_ref (array of shape (n_snps,), optional) – An array containing the reference allele for each SNP.
variants_alt (array of shape (n_snps,), optional) – An array containing the alternate allele for each SNP.
variants_chrom (array of shape (n_snps,), optional) – An array containing the chromosome for each SNP.
variants_cm (array of shape (n_snps,), optional) – An array containing the genetic-map position in centimorgans for each SNP.
variants_filter_pass (array of shape (n_snps,), optional) – An array indicating whether each SNP passed control checks.
variants_id (array of shape (n_snps,), optional) – An array containing unique identifiers (IDs) for each SNP.
variants_pos (array of shape (n_snps,), optional) – An array containing the chromosomal positions for each SNP.
variants_qual (array of shape (n_snps,), optional) – An array containing the Phred-scaled quality score for each SNP.
variants_info (array of shape (n_snps,), optional) – An array containing VCF/PVAR INFO column values for each SNP.
calldata_lai (array, optional) – An array containing the ancestry for each SNP. This array can be either 2D with shape (n_snps, n_samples*2), or 3D with shape (n_snps, n_samples, 2).
ancestry_map (dict of str to str, optional) – A dictionary mapping ancestry codes to region names.
- property calldata_gt¶
Retrieve calldata_gt.
- Returns:
array – An array containing genotype data for each sample. This array can be either 2D with shape (n_snps, n_samples) if the paternal and maternal strands are summed, or 3D with shape (n_snps, n_samples, 2) if the strands are kept separate.
- property samples¶
Retrieve samples.
- Returns:
array of shape (n_samples,) – An array containing unique sample identifiers.
- property sample_fid¶
PLINK Family ID (FID) per sample, aligned with samples.
- property sample_sex¶
PLINK sex code per sample, aligned with samples.
- property variants_ref¶
Retrieve variants_ref.
- Returns:
array of shape (n_snps,) – An array containing the reference allele for each SNP.
- property variants_alt¶
Retrieve variants_alt.
- Returns:
array of shape (n_snps,) – An array containing the alternate allele for each SNP.
- property variants_chrom¶
Retrieve variants_chrom.
- Returns:
array of shape (n_snps,) – An array containing the chromosome for each SNP.
- property variants_cm¶
Retrieve variants_cm.
- Returns:
array of shape (n_snps,) – An array containing the genetic-map position in centimorgans for each SNP.
- property variants_filter_pass¶
Retrieve variants_filter_pass.
- Returns:
array of shape (n_snps,) – An array indicating whether each SNP passed control checks.
- property variants_id¶
Retrieve variants_id.
- Returns:
array of shape (n_snps,) – An array containing unique identifiers (IDs) for each SNP.
- property variants_pos¶
Retrieve variants_pos.
- Returns:
array of shape (n_snps,) – An array containing the chromosomal positions for each SNP.
- property variants_qual¶
Retrieve variants_qual.
- Returns:
array of shape (n_snps,) – An array containing the Phred-scaled quality score for each SNP.
- property variants_info¶
Retrieve variants_info.
- Returns:
array of shape (n_snps,) – An array containing VCF/PVAR INFO column values for each SNP.
- property calldata_lai¶
Retrieve calldata_lai.
- Returns:
array – An array containing the ancestry for each SNP. This array can be either 2D with shape (n_snps, n_samples*2), or 3D with shape (n_snps, n_samples, 2).
- property ancestry_map¶
Retrieve ancestry_map.
- Returns:
dict of str to str – A dictionary mapping ancestry codes to region names.
- property n_samples¶
Retrieve n_samples.
- Returns:
int – The total number of samples.
- property n_snps¶
Retrieve n_snps.
- Returns:
int – The total number of SNPs.
- property n_chrom¶
Retrieve n_chrom.
- Returns:
int – The total number of unique chromosomes in variants_chrom.
- property n_ancestries¶
Retrieve n_ancestries.
- Returns:
int – The total number of unique ancestries.
- property shape¶
Retrieve the primary data shape.
- Returns:
tuple – The shape of calldata_gt when present, otherwise the shape of calldata_lai. If only metadata is available, returns (n_snps, n_samples) with unknown dimensions represented as None.
- property unique_chrom¶
Retrieve unique_chrom.
- Returns:
array – The unique chromosome names in variants_chrom, preserving their order of appearance.
- property are_strands_summed¶
Retrieve are_strands_summed.
- Returns:
bool – True if the maternal and paternal strands have been summed together, which is indicated by calldata_gt having shape (n_snps, n_samples). False if the strands are stored separately, indicated by calldata_gt having shape (n_snps, n_samples, 2).
- copy()[source]¶
Create and return a copy of self.
- Returns:
SNPObject – A new instance of the current object.
- keys()[source]¶
Retrieve a list of public attribute names for self.
- Returns:
list of str – A list of attribute names, with internal name-mangling removed, for easier reference to public attributes in the instance.
- allele_freq(sample_labels=None, ancestry=None, laiobj=None, pseudohaploid=False, return_counts=False, as_dataframe=False)[source]¶
Compute per-SNP alternate allele frequencies from calldata_gt.
- Parameters:
sample_labels (sequence, optional) – Population label per sample. If None, computes cohort-level frequencies.
ancestry (str or int, optional) – If provided, compute ancestry-masked frequencies using SNP-level LAI.
laiobj (LocalAncestryObject, optional) – Optional LAI object used when self.calldata_lai is not set.
pseudohaploid (bool or int, default=False) – If True, detects pseudo-haploid samples (samples with no heterozygotes in the first 1000 SNPs) and treats them as haploid. If an integer n is provided, checks the first n SNPs. If False, treats all samples as diploid.
return_counts (bool, default=False) – If True, also return called-allele counts with the same shape as frequencies.
as_dataframe (bool, default=False) – If True, return pandas DataFrame output.
- Returns:
Frequencies as a NumPy array (or DataFrame if as_dataframe=True). If return_counts=True, returns (freq, counts).
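As a rough illustration of the simplest case (sample_labels=None, no ancestry masking, all samples treated as diploid), the sketch below computes per-SNP alternate allele frequencies from a summed-strand genotype matrix in plain Python. The helper name alt_allele_freq and the treatment of negative values as missing calls are assumptions made for illustration; this is not the library's implementation.

```python
def alt_allele_freq(calldata_gt):
    """Per-SNP alternate allele frequency for a 2D dosage matrix.

    calldata_gt: list of rows, one per SNP, each holding one dosage
    (0, 1, or 2) per diploid sample; negative values mark missing calls.
    Returns (freqs, counts), where counts is the number of called alleles.
    """
    freqs, counts = [], []
    for row in calldata_gt:
        called = [g for g in row if g >= 0]
        n_alleles = 2 * len(called)  # two alleles per called diploid sample
        counts.append(n_alleles)
        freqs.append(sum(called) / n_alleles if n_alleles else float("nan"))
    return freqs, counts


gt = [
    [0, 1, 2, -1],  # one missing call: 3 alternate alleles out of 6 called
    [2, 2, 2, 2],   # every sample is homozygous for the alternate allele
]
freqs, counts = alt_allele_freq(gt)
```

Missing calls shrink the denominator rather than counting as reference alleles, which is why counts is returned alongside the frequencies here.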
- sum_strands(inplace=False)[source]¶
Sum paternal and maternal strands.
- Parameters:
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with summed strands. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with summed strands if inplace=False. If inplace=True, modifies self in place and returns None.
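To make the shape change concrete, here is a minimal sketch on nested lists rather than arrays. It collapses a phased (n_snps, n_samples, 2) structure into (n_snps, n_samples) dosages. The function name sum_strands_sketch is hypothetical, and missing calls are not handled.

```python
def sum_strands_sketch(calldata_gt):
    """Collapse (n_snps, n_samples, 2) genotypes to (n_snps, n_samples)."""
    return [[first + second for first, second in snp] for snp in calldata_gt]


phased = [
    [[0, 1], [1, 1]],  # SNP 0: sample 0 is 0|1, sample 1 is 1|1
    [[0, 0], [1, 0]],  # SNP 1: sample 0 is 0|0, sample 1 is 1|0
]
summed = sum_strands_sketch(phased)
```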
- filter_variants(chrom=None, pos=None, indexes=None, mask=None, include=True, inplace=False)[source]¶
Filter variants based on chromosome names, variant positions, indexes, or a boolean mask.
This method updates the calldata_gt, variants_ref, variants_alt, variants_chrom, variants_filter_pass, variants_id, variants_pos, variants_qual, and calldata_lai attributes to include or exclude the specified variants. The filtering criteria can be based on chromosome names, variant positions, indexes, or a boolean mask. If multiple criteria are provided, their union is used for filtering. The order of the variants is preserved.
Negative indexes are supported and follow [NumPy’s indexing conventions](https://numpy.org/doc/stable/user/basics.indexing.html).
- Parameters:
chrom (str or array_like of str, optional) – Chromosome(s) to filter variants by. Can be a single chromosome as a string or a sequence of chromosomes. If both chrom and pos are provided, they must either have matching lengths (pairing each chromosome with a position) or chrom should be a single value that applies to all positions in pos. Default is None.
pos (int or array_like of int, optional) – Position(s) to filter variants by. Can be a single position as an integer or a sequence of positions. If chrom is also provided, pos should either match chrom in length or chrom should be a single value. Default is None.
indexes (int or array_like of int, optional) – Index(es) of the variants to include or exclude. Can be a single index or a sequence of indexes. Negative indexes are supported. Default is None.
mask (array_like of bool, optional) – Boolean mask aligned to the SNP axis. If provided with other criteria, the union of all selected variants is used before applying include.
include (bool, default=True) – If True, includes only the specified variants. If False, excludes the specified variants. Default is True.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with the variants filtered. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with the specified variants filtered if inplace=False. If inplace=True, modifies self in place and returns None.
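The union-then-include behavior described above can be sketched in plain Python. The helper select_variants is hypothetical and covers only the indexes and mask criteria; it assumes indexes is a list, not a single integer.

```python
def select_variants(n_snps, indexes=None, mask=None, include=True):
    """Return the kept variant indexes, in their original order."""
    selected = [False] * n_snps
    for i in indexes or []:
        selected[i] = True  # negative values count from the end, as in NumPy
    if mask is not None:
        # Union of the index-based selection and the boolean mask.
        selected = [s or m for s, m in zip(selected, mask)]
    # include=True keeps the selected variants; include=False drops them.
    return [i for i, s in enumerate(selected) if s == include]


kept = select_variants(5, indexes=[-1, 0], mask=[False, True, False, False, False])
dropped = select_variants(5, indexes=[-1, 0], include=False)
```

Because the result is built by scanning positions in ascending order, the original variant order is preserved regardless of the order in which indexes were given.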
- filter_biallelic_variants(snv_only=True, inplace=False)[source]¶
Keep variants with exactly one alternate allele.
- Parameters:
- Returns:
Optional[SNPObject] – A filtered SNPObject if inplace=False; otherwise modifies self and returns None.
- filter_complete_genotypes(inplace=False)[source]¶
Keep variants with no missing genotype calls across all samples.
Missing calls are represented as negative values in calldata_gt.
- filter_polymorphic_variants(inplace=False)[source]¶
Keep variants with at least two observed genotype dosages among called samples.
- filter_samples(samples=None, indexes=None, include=True, reorder=False, inplace=False)[source]¶
Filter samples based on specified names or indexes.
This method updates the samples and calldata_gt attributes to include or exclude the specified samples. The order of the samples is preserved. Set reorder=True to match the ordering of the provided samples and/or indexes lists when including.
If both samples and indexes are provided, their union is used: any sample matching either a name in samples or an index in indexes is included or excluded. Negative indexes are supported and follow [NumPy’s indexing conventions](https://numpy.org/doc/stable/user/basics.indexing.html).
- Parameters:
samples (str or array_like of str, optional) – Name(s) of the samples to include or exclude. Can be a single sample name or a sequence of sample names. Default is None.
indexes (int or array_like of int, optional) – Index(es) of the samples to include or exclude. Can be a single index or a sequence of indexes. Negative indexes are supported. Default is None.
include (bool, default=True) – If True, includes only the specified samples. If False, excludes the specified samples. Default is True.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with the samples filtered. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with the specified samples filtered if inplace=False. If inplace=True, modifies self in place and returns None.
- detect_chromosome_format()[source]¶
Detect the chromosome naming convention in variants_chrom based on the prefix of the first chromosome identifier in unique_chrom.
Recognized formats:
‘chr’: Format with ‘chr’ prefix, e.g., ‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’, ‘chrM’.
‘chm’: Format with ‘chm’ prefix, e.g., ‘chm1’, ‘chm2’, …, ‘chmX’, ‘chmY’, ‘chmM’.
‘chrom’: Format with ‘chrom’ prefix, e.g., ‘chrom1’, ‘chrom2’, …, ‘chromX’, ‘chromY’, ‘chromM’.
‘plain’: Plain format without a prefix, e.g., ‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’.
If the format does not match any recognized pattern, ‘Unknown format’ is returned.
- Returns:
str – A string indicating the detected chromosome format (‘chr’, ‘chm’, ‘chrom’, or ‘plain’). If no recognized format is matched, returns ‘Unknown format’.
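One way to picture the prefix check is the sketch below, which classifies a single chromosome name. The function name detect_format and the accepted suffixes (digits, X, Y, M) are assumptions based on the examples listed above, not the library's actual rules.

```python
import re

SUFFIX = r"(\d+|X|Y|M)"


def detect_format(first_chrom):
    """Classify one chromosome name by its prefix."""
    for prefix in ("chrom", "chr", "chm"):
        if re.fullmatch(prefix + SUFFIX, first_chrom):
            return prefix
    if re.fullmatch(SUFFIX, first_chrom):
        return "plain"
    return "Unknown format"


formats = [detect_format(c) for c in ["chr1", "chromX", "chm22", "M", "scaffold_7"]]
```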
- convert_chromosome_format(from_format, to_format, inplace=False)[source]¶
Convert the chromosome format from one naming convention to another in variants_chrom.
Supported formats:
‘chr’: Format with ‘chr’ prefix, e.g., ‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’, ‘chrM’.
‘chm’: Format with ‘chm’ prefix, e.g., ‘chm1’, ‘chm2’, …, ‘chmX’, ‘chmY’, ‘chmM’.
‘chrom’: Format with ‘chrom’ prefix, e.g., ‘chrom1’, ‘chrom2’, …, ‘chromX’, ‘chromY’, ‘chromM’.
‘plain’: Plain format without a prefix, e.g., ‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’.
- Parameters:
from_format (str) – The current chromosome format. Acceptable values are ‘chr’, ‘chm’, ‘chrom’, or ‘plain’.
to_format (str) – The target format for chromosome data conversion. Acceptable values match from_format options.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with the converted format. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with the converted chromosome format if inplace=False. If inplace=True, modifies self in place and returns None.
- match_chromosome_format(snpobj, inplace=False)[source]¶
Convert the chromosome format in variants_chrom from self to match the format of a reference snpobj.
Recognized formats:
‘chr’: Format with ‘chr’ prefix, e.g., ‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’, ‘chrM’.
‘chm’: Format with ‘chm’ prefix, e.g., ‘chm1’, ‘chm2’, …, ‘chmX’, ‘chmY’, ‘chmM’.
‘chrom’: Format with ‘chrom’ prefix, e.g., ‘chrom1’, ‘chrom2’, …, ‘chromX’, ‘chromY’, ‘chromM’.
‘plain’: Plain format without a prefix, e.g., ‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’.
- Parameters:
- Returns:
Optional[SNPObject] – A new SNPObject with matched chromosome format if inplace=False. If inplace=True, modifies self in place and returns None.
- rename_chrom(to_replace={'^([0-9]+)$': 'chr\\1', '^chr([0-9]+)$': '\\1'}, value=None, regex=True, inplace=False)[source]¶
Replace chromosome values in variants_chrom using patterns or exact matches.
This method allows flexible chromosome replacements, using regex or exact matches, useful for non-standard chromosome formats. For standard conversions (e.g., ‘chr1’ to ‘1’), consider convert_chromosome_format.
- Parameters:
to_replace (dict, str, or list of str) – Pattern(s) or exact value(s) to be replaced in chromosome names. Default behavior transforms <chrom_num> to chr<chrom_num> or vice versa. Non-matching values remain unchanged. If str or list of str: matches are replaced with value. If regex is True: any regex matches are replaced with value. If dict: keys define the values to replace, with the corresponding replacements as values.
value (str or list of str, optional) – Replacement value(s) if to_replace is a string or list. Ignored if to_replace is a dictionary.
regex (bool, default=True) – If True, interprets to_replace keys as regex patterns.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with the chromosomes renamed. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with the renamed chromosome format if inplace=False. If inplace=True, modifies self in place and returns None.
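The default to_replace mapping can be tried out with the standard re module. The sketch below applies only the first pattern that matches each name, which is an assumption chosen so that a name converted by one pattern is not converted back by the other; the helper rename_one is hypothetical.

```python
import re

# The default patterns from the signature above.
DEFAULT_TO_REPLACE = {r"^([0-9]+)$": r"chr\1", r"^chr([0-9]+)$": r"\1"}


def rename_one(name, to_replace=DEFAULT_TO_REPLACE):
    """Apply the first pattern that matches; leave other names unchanged."""
    for pattern, replacement in to_replace.items():
        if re.search(pattern, name):
            return re.sub(pattern, replacement, name)
    return name


renamed = [rename_one(c) for c in ["1", "chr2", "chrX", "22"]]
```

Note that 'chrX' is left untouched, because both default patterns match numeric chromosome names only.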
- rename_missings(before=-1, after='.', inplace=False)[source]¶
Replace missing values in the calldata_gt attribute.
This method identifies missing values in ‘calldata_gt’ and replaces them with a specified value. By default, it replaces occurrences of -1 (often used to signify missing data) with ‘.’.
- Parameters:
before (int, float, or str, default=-1) – The current representation of missing values in calldata_gt. Common values might be -1, ‘.’, or NaN. Default is -1.
after (int, float, or str, default='.') – The value that will replace before. Default is ‘.’.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with the applied replacements. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with the renamed missing values if inplace=False. If inplace=True, modifies self in place and returns None.
- get_common_variants_intersection(snpobj, index_by='pos')[source]¶
Identify common variants between self and the snpobj instance based on the specified index_by criterion, which may match based on chromosome and position (variants_chrom, variants_pos), ID (variants_id), or both.
This method returns the identifiers of common variants and their corresponding indices in both objects.
- Parameters:
snpobj (SNPObject) – The reference SNPObject to compare against.
index_by (str, default='pos') – Criteria for matching variants. Options: - ‘pos’: Matches by chromosome and position (variants_chrom, variants_pos), e.g., ‘chr1-12345’. - ‘id’: Matches by variant ID alone (variants_id), e.g., ‘rs123’. - ‘pos+id’: Matches by chromosome, position, and ID (variants_chrom, variants_pos, variants_id), e.g., ‘chr1-12345-rs123’. Default is ‘pos’.
- Returns:
Tuple containing –
list of str: A list of common variant identifiers (as strings).
array: An array of indices in self where common variants are located.
array: An array of indices in snpobj where common variants are located.
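For index_by='pos', the matching can be pictured as building 'chrom-pos' keys for both objects and intersecting them. The sketch below is a plain-Python illustration: the helper name, the ordering of results (order of appearance in the first object), and the handling of duplicate keys (first occurrence wins) are all assumptions.

```python
def common_variants(chrom_a, pos_a, chrom_b, pos_b):
    """Return (ids, idx_a, idx_b) for variants shared by chromosome and position."""
    keys_a = [f"{c}-{p}" for c, p in zip(chrom_a, pos_a)]
    keys_b = [f"{c}-{p}" for c, p in zip(chrom_b, pos_b)]
    first_in_b = {}
    for j, key in enumerate(keys_b):
        first_in_b.setdefault(key, j)  # keep the first occurrence of each key
    ids, idx_a, idx_b = [], [], []
    for i, key in enumerate(keys_a):
        if key in first_in_b:
            ids.append(key)
            idx_a.append(i)
            idx_b.append(first_in_b[key])
    return ids, idx_a, idx_b


ids, idx_a, idx_b = common_variants(
    ["chr1", "chr1", "chr2"], [100, 200, 300],
    ["chr1", "chr2", "chr2"], [200, 300, 400],
)
```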
- get_common_markers_intersection(snpobj)[source]¶
Identify common markers between self and the snpobj instance. Common markers are identified based on matching chromosome (variants_chrom), position (variants_pos), reference (variants_ref), and alternate (variants_alt) alleles.
This method returns the identifiers of common markers and their corresponding indices in both objects.
- Parameters:
snpobj (SNPObject) – The reference SNPObject to compare against.
- Returns:
Tuple containing –
list of str: A list of common variant identifiers (as strings).
array: An array of indices in self where common variants are located.
array: An array of indices in snpobj where common variants are located.
- subset_to_common_variants(snpobj, index_by='pos', common_variants_intersection=None, inplace=False)[source]¶
Subset self to include only the common variants with a reference snpobj based on the specified index_by criterion, which may match based on chromosome and position (variants_chrom, variants_pos), ID (variants_id), or both.
- Parameters:
snpobj (SNPObject) – The reference SNPObject to compare against.
index_by (str, default='pos') – Criteria for matching variants. Options: - ‘pos’: Matches by chromosome and position (variants_chrom, variants_pos), e.g., ‘chr1-12345’. - ‘id’: Matches by variant ID alone (variants_id), e.g., ‘rs123’. - ‘pos+id’: Matches by chromosome, position, and ID (variants_chrom, variants_pos, variants_id), e.g., ‘chr1-12345-rs123’. Default is ‘pos’.
common_variants_intersection (Tuple[np.ndarray, np.ndarray], optional) – Precomputed indices of common variants between self and snpobj. If None, intersection is computed within the function.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with the common variants subsetted. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with the common variants subsetted if inplace=False. If inplace=True, modifies self in place and returns None.
- subset_to_common_markers(snpobj, common_markers_intersection=None, inplace=False)[source]¶
Subset self to include only the common markers with a reference snpobj. Common markers are identified based on matching chromosome (variants_chrom), position (variants_pos), reference (variants_ref), and alternate (variants_alt) alleles.
- Parameters:
snpobj (SNPObject) – The reference SNPObject to compare against.
common_markers_intersection (tuple of arrays, optional) – Precomputed indices of common markers between self and snpobj. If None, intersection is computed within the function.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with the common markers subsetted. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with the common markers subsetted if inplace=False. If inplace=True, modifies self in place and returns None.
- merge(snpobj, force_samples=False, prefix='2', inplace=False)[source]¶
Merge self with snpobj along the sample axis.
This method expects both SNPObjects to contain the same set of SNPs in the same order, then combines their genotype (calldata_gt) and LAI (calldata_lai) arrays by concatenating the sample dimension. Samples from snpobj are appended to those in self.
- Parameters:
snpobj (SNPObject) – The SNPObject to merge samples with.
force_samples (bool, default=False) – If True, duplicate sample names are resolved by prepending the prefix to duplicate sample names in snpobj. Otherwise, merging fails when duplicate sample names are found. Default is False.
prefix (str, default='2') – A string prepended to duplicate sample names in snpobj when force_samples=True. Duplicates are renamed from <sample_name> to <prefix>:<sample_name>. For instance, if prefix=’2’ and there is a conflict with a sample called “sample_1”, it becomes “2:sample_1”.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with the merged samples. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject containing the merged sample data.
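The renaming rule for duplicate sample names can be sketched as follows. The helper merge_sample_names is hypothetical, and raising ValueError when force_samples=False is an assumption; the text above only says that merging fails.

```python
def merge_sample_names(samples_self, samples_other, force_samples=False, prefix="2"):
    """Append samples_other to samples_self, renaming duplicates if allowed."""
    existing = set(samples_self)
    merged = list(samples_self)
    for name in samples_other:
        if name in existing:
            if not force_samples:
                raise ValueError(f"Duplicate sample name: {name}")
            name = f"{prefix}:{name}"  # e.g. 'sample_1' becomes '2:sample_1'
        merged.append(name)
    return merged


merged = merge_sample_names(
    ["sample_1", "sample_2"], ["sample_1", "sample_3"], force_samples=True
)
```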
- concat(snpobj, inplace=False)[source]¶
Concatenate self with snpobj along the SNP axis.
This method expects both SNPObjects to contain the same set of samples in the same order, and that the chromosome(s) in snpobj follow (i.e. have higher numeric identifiers than) those in self.
- Parameters:
- Returns:
Optional[SNPObject] – A new SNPObject containing the concatenated SNP data.
- classmethod concat_variants(snpobjs)[source]¶
Concatenate multiple SNPObjects along the SNP axis.
All objects must have the same sample order and genotype strand representation.
- remove_strand_ambiguous_variants(inplace=False)[source]¶
Remove strand-ambiguous variants from self. A strand-ambiguous variant has reference (variants_ref) and alternate (variants_alt) alleles in the pairs A/T, T/A, C/G, or G/C, where both alleles are complementary and thus indistinguishable in terms of strand orientation.
- Parameters:
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with the strand-ambiguous variants removed. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with non-ambiguous variants only if inplace=False. If inplace=True, modifies self in place and returns None.
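The ambiguity test itself is a simple membership check on the (ref, alt) pair. The sketch below returns the indexes of variants that would be kept; the helper name is hypothetical and it assumes single-base alleles.

```python
AMBIGUOUS_PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def non_ambiguous_indexes(variants_ref, variants_alt):
    """Indexes of variants whose (ref, alt) pair is not strand-ambiguous."""
    pairs = zip(variants_ref, variants_alt)
    return [i for i, pair in enumerate(pairs) if pair not in AMBIGUOUS_PAIRS]


keep = non_ambiguous_indexes(["A", "A", "C", "G"], ["T", "G", "G", "T"])
```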
- correct_flipped_variants(snpobj, check_complement=True, index_by='pos', common_variants_intersection=None, log_stats=True, inplace=False)[source]¶
Correct flipped variants between self and a reference snpobj, where reference (variants_ref) and alternate (variants_alt) alleles are swapped.
Flip Detection Based on check_complement:
- If check_complement=False, only direct allele swaps are considered:
Direct Swap: self.variants_ref == snpobj.variants_alt and self.variants_alt == snpobj.variants_ref.
- If check_complement=True, both direct and complementary swaps are considered, with four possible cases:
Direct Swap: self.variants_ref == snpobj.variants_alt and self.variants_alt == snpobj.variants_ref.
Complement Swap of Ref: complement(self.variants_ref) == snpobj.variants_alt and self.variants_alt == snpobj.variants_ref.
Complement Swap of Alt: self.variants_ref == snpobj.variants_alt and complement(self.variants_alt) == snpobj.variants_ref.
Complement Swap of both Ref and Alt: complement(self.variants_ref) == snpobj.variants_alt and complement(self.variants_alt) == snpobj.variants_ref.
Note: Variants where self.variants_ref == self.variants_alt are ignored as they are ambiguous.
Correction Process:
- Swaps variants_ref and variants_alt alleles in self to align with snpobj.
- Flips calldata_gt values (0 becomes 1, and 1 becomes 0) to match the updated allele configuration.
- Parameters:
snpobj (SNPObject) – The reference SNPObject to compare against.
check_complement (bool, default=True) – If True, also checks for complementary base pairs (A/T, T/A, C/G, and G/C) when identifying swapped variants. Default is True.
index_by (str, default='pos') – Criteria for matching variants. Options: - ‘pos’: Matches by chromosome and position (variants_chrom, variants_pos), e.g., ‘chr1-12345’. - ‘id’: Matches by variant ID alone (variants_id), e.g., ‘rs123’. - ‘pos+id’: Matches by chromosome, position, and ID (variants_chrom, variants_pos, variants_id), e.g., ‘chr1-12345-rs123’. Default is ‘pos’.
common_variants_intersection (tuple of arrays, optional) – Precomputed indices of common variants between self and snpobj. If None, intersection is computed within the function.
log_stats (bool, default=True) – If True, logs statistical information about matching and ambiguous alleles. Default is True.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with corrected flips. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with corrected flips if inplace=False. If inplace=True, modifies self in place and returns None.
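The four swap cases listed above reduce to two membership checks once each allele is allowed to appear either as itself or as its complement. The sketch below illustrates this for a single variant, together with the 0/1 flip applied to per-strand calls. Function names are hypothetical, and leaving values other than 0 and 1 (such as missing calls) unchanged is an assumption.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_flipped(ref, alt, ref_other, alt_other, check_complement=True):
    """True if (ref, alt) is a swapped version of (ref_other, alt_other)."""
    if ref == alt:
        return False  # ambiguous, ignored
    ref_options, alt_options = {ref}, {alt}
    if check_complement:
        ref_options.add(COMPLEMENT.get(ref, ref))
        alt_options.add(COMPLEMENT.get(alt, alt))
    # A swap means our ref matches their alt and our alt matches their ref.
    return alt_other in ref_options and ref_other in alt_options


def flip_genotypes(calls):
    """Swap 0 and 1 in per-strand calls; other values are left as they are."""
    return [1 - g if g in (0, 1) else g for g in calls]


direct = is_flipped("A", "G", "G", "A")
complement_ref = is_flipped("T", "G", "G", "A")
complement_off = is_flipped("T", "G", "G", "A", check_complement=False)
same_orientation = is_flipped("A", "G", "A", "G")
flipped_calls = flip_genotypes([0, 1, -1])
```

Under these rules a strand-ambiguous pair such as A/T also matches its own complement swap, which is one reason such variants are often removed before correcting flips.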
- remove_mismatching_variants(snpobj, index_by='pos', common_variants_intersection=None, inplace=False)[source]¶
Remove variants from self, where reference (variants_ref) and/or alternate (variants_alt) alleles do not match with a reference snpobj.
- Parameters:
snpobj (SNPObject) – The reference SNPObject to compare against.
index_by (str, default='pos') – Criteria for matching variants. Options: - ‘pos’: Matches by chromosome and position (variants_chrom, variants_pos), e.g., ‘chr1-12345’. - ‘id’: Matches by variant ID alone (variants_id), e.g., ‘rs123’. - ‘pos+id’: Matches by chromosome, position, and ID (variants_chrom, variants_pos, variants_id), e.g., ‘chr1-12345-rs123’. Default is ‘pos’.
common_variants_intersection (tuple of arrays, optional) – Precomputed indices of common variants between self and the reference snpobj. If None, the intersection is computed within the function.
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject without mismatching variants. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject without mismatching variants if inplace=False. If inplace=True, modifies self in place and returns None.
- shuffle_variants(inplace=False)[source]¶
Randomly shuffle the positions of variants in the SNPObject, ensuring that all associated data (e.g., calldata_gt and variant-specific attributes) remain aligned.
- Parameters:
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with shuffled variants. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with shuffled variant positions if inplace=False. If inplace=True, modifies self in place and returns None.
- set_empty_to_missing(inplace=False)[source]¶
Replace empty strings ‘’ with missing values ‘.’ in attributes of self.
- Parameters:
inplace (bool, default=False) – If True, modifies self in place. If False, returns a new SNPObject with empty strings ‘’ replaced by missing values ‘.’. Default is False.
- Returns:
Optional[SNPObject] – A new SNPObject with empty strings replaced if inplace=False. If inplace=True, modifies self in place and returns None.
- convert_to_window_level(window_size=None, physical_pos=None, chromosomes=None, window_sizes=None, laiobj=None)[source]¶
Aggregate the calldata_lai attribute into genomic windows within a snputils.ancestry.genobj.LocalAncestryObject.
Window definitions are resolved in this precedence order: fixed window size via window_size; explicit boundaries via physical_pos (optionally with chromosomes and window_sizes); or reuse of existing window metadata from laiobj.
- Parameters:
window_size (int, optional) – Number of SNPs in each window if defining fixed-size windows. If the total number of SNPs in a chromosome is not evenly divisible by the window size, the last window on that chromosome will include all remaining SNPs and therefore be larger than the specified size.
physical_pos (array of shape (n_windows, 2), optional) – A 2D array containing the start and end physical positions for each window.
chromosomes (array of shape (n_windows,), optional) – An array with chromosome numbers corresponding to each genomic window.
window_sizes (array of shape (n_windows,), optional) – An array specifying the number of SNPs in each genomic window.
laiobj (LocalAncestryObject, optional) – A reference LocalAncestryObject from which to copy existing window definitions.
- Returns:
LocalAncestryObject – A LocalAncestryObject containing window-level ancestry data.
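The fixed-size windowing rule for window_size, where the last window on a chromosome absorbs any remainder, can be sketched for a single chromosome as below. The helper fixed_windows is hypothetical and uses half-open (start, end) SNP index pairs. Returning one short window when the chromosome has fewer SNPs than window_size is an assumption. How ancestry calls are aggregated within each window is not shown.

```python
def fixed_windows(n_snps, window_size):
    """Return (start, end) SNP index pairs for one chromosome."""
    n_windows = max(n_snps // window_size, 1)
    bounds = [(w * window_size, (w + 1) * window_size) for w in range(n_windows)]
    # The last window is stretched (or trimmed) to end at the final SNP.
    bounds[-1] = (bounds[-1][0], n_snps)
    return bounds


windows = fixed_windows(10, 3)
```

With 10 SNPs and window_size=3 this yields three windows, the last of which holds 4 SNPs rather than forming a fourth window of size 1.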
- save(file)[source]¶
Save the data stored in self to a specified file.
The format of the saved file is determined by the file extension provided in the file argument.
Supported formats:
.bed: Binary PED (Plink) format.
.pgen: Plink2 binary genotype format.
.vcf: Variant Call Format.
.pkl: Pickle format for saving self in serialized form.
- Parameters:
file (str or pathlib.Path) – Path to the file where the data will be saved. The extension of the file determines the save format. Supported extensions: .bed, .pgen, .vcf, .pkl.
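Extension-based dispatch of this kind is commonly written as a lookup table keyed on the file suffix. The sketch below maps each supported extension to the name of the corresponding save method documented in this section. The helper pick_saver, the case-insensitive comparison, and the ValueError for unsupported extensions are assumptions, not documented behavior.

```python
from pathlib import Path

SAVERS = {
    ".bed": "save_bed",
    ".pgen": "save_pgen",
    ".vcf": "save_vcf",
    ".pkl": "save_pickle",
}


def pick_saver(file):
    """Return the name of the save method matching the file extension."""
    suffix = Path(file).suffix.lower()
    if suffix not in SAVERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return SAVERS[suffix]


vcf_saver = pick_saver("out/cohort.vcf")
pkl_saver = pick_saver(Path("cohort.pkl"))
```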
- save_bed(file, rename_missing_values=True, before=-1, after='.', sample_phenotype=None)[source]¶
Save the data stored in self to a .bed file.
- Parameters:
file (str or pathlib.Path) – Path to the file where the data will be saved. It should end with .bed. If the provided path does not have this extension, it will be appended.
rename_missing_values (bool, optional) – If True, renames potential missing values in calldata_gt before writing.
before (int, float, or str, default=-1) – The current representation of missing values in calldata_gt.
after (int, float, or str, default='.') – The value that will replace before.
sample_phenotype (optional) – PLINK phenotype value per sample, or a scalar used for all samples.
- save_pgen(file)[source]¶
Save the data stored in self to a .pgen file.
- Parameters:
file (str or pathlib.Path) – Path to the file where the data will be saved. It should end with .pgen. If the provided path does not have this extension, it will be appended.
- save_vcf(file)[source]¶
Save the data stored in self to a .vcf file.
- Parameters:
file (str or pathlib.Path) – Path to the file where the data will be saved. It should end with .vcf. If the provided path does not have this extension, it will be appended.
- save_pickle(file)[source]¶
Save self in serialized form to a .pkl file.
- Parameters:
file (str or pathlib.Path) – Path to the file where the data will be saved. It should end with .pkl. If the provided path does not have this extension, it will be appended.
- class snputils.GRGObject(calldata_gt=None, filename=None, mutable=None)[source]¶
Bases: object
A class for Single Nucleotide Polymorphism (SNP) data.
- Parameters:
calldata_gt (GRG | MutableGRG, optional) – A Genotype Representation Graph containing genotype data for each sample.
filename (str, optional) – File storing the GRG.
- property calldata_gt¶
Retrieve calldata_gt.
- Returns:
GRG | MutableGRG – A GRG containing genotype data for all samples.
- property filename¶
Retrieve filename.
- Returns:
str – A string containing the file name.
- property shape¶
Retrieve the graph genotype shape as (n_mutations, n_haplotypes).
- to_snpobject(sum_strands=False, chrom='.', sample_prefix='sample')[source]¶
Convert the GRG to a dense SNPObject.
Notes
This materializes the full genotype matrix, so memory usage scales with num_mutations * num_samples.
For diploid GRGs and sum_strands=False, output has shape (n_snps, n_samples, 2).
For sum_strands=True, output has shape (n_snps, n_samples) with per-individual allele counts.
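The two output shapes and the memory note can be illustrated without the library. The sketch below uses nested Python lists in place of arrays and assumes one byte per stored call (int8), which this page does not state for GRG conversion. Both function names are hypothetical.

```python
def sum_strands(calldata_gt):
    """Collapse (n_snps, n_samples, 2) allele calls to (n_snps, n_samples) counts."""
    return [[a + b for a, b in row] for row in calldata_gt]


def dense_int8_bytes(n_snps, n_samples, summed=False):
    """Bytes needed for a dense matrix, assuming one byte per stored value."""
    strands = 1 if summed else 2
    return n_snps * n_samples * strands


# Two SNPs, three diploid samples, one pair of alleles per sample.
separate = [
    [(0, 1), (1, 1), (0, 0)],
    [(1, 0), (0, 0), (1, 1)],
]
print(sum_strands(separate))                # [[1, 2, 0], [1, 0, 2]]
print(dense_int8_bytes(1_000_000, 10_000))  # 20000000000
```

Under these assumptions, one million variants for ten thousand diploid samples need roughly 20 GB with strands kept separate, so it can be worth estimating the size before converting a large graph.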
Readers¶
- class snputils.SNPReader(filename, vcf_backend='default')[source]¶
Bases: object
Automatically detect the SNP file format from the file extension, and return its corresponding reader.
- Parameters:
filename – Filename of the file to read.
vcf_backend – Backend to use for reading the VCF file. Options are ‘default’ or ‘polars’. Default is ‘default’.
- Raises:
ValueError – If the filename does not have an extension or the extension is not supported.
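Extension-based detection of this kind can be sketched as follows. The extension table is an assumption made for illustration, and the set of extensions the real reader accepts may differ. `detect_format` is a hypothetical name.

```python
from pathlib import Path

# Assumed extension-to-format table; the real reader may accept more or fewer.
FORMATS = {
    ".bed": "BED", ".bim": "BED", ".fam": "BED",
    ".pgen": "PGEN", ".pvar": "PGEN", ".psam": "PGEN",
    ".vcf": "VCF",
}


def detect_format(filename):
    """Return a format label for `filename`, or raise ValueError."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    if not suffixes:
        raise ValueError(f"{filename!r} has no extension")
    # Look through a trailing compression suffix such as .gz.
    if suffixes[-1] == ".gz" and len(suffixes) > 1:
        suffixes.pop()
    ext = suffixes[-1]
    if ext not in FORMATS:
        raise ValueError(f"unsupported extension {ext!r}")
    return FORMATS[ext]


print(detect_format("cohort.vcf.gz"))  # VCF
```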
- class snputils.BEDReader(filename)[source]¶
Bases: SNPBaseReader
Initialize the SNPBaseReader.
- Parameters:
filename – The path to the file storing SNP data.
- read(fields=None, exclude_fields=None, sample_ids=None, sample_idxs=None, variant_ids=None, variant_idxs=None, sum_strands=False, separator=None)[source]¶
Read a bed fileset (bed, bim, fam) into a SNPObject.
- Parameters:
fields (str, None, or list of str, optional) – Fields to extract data for that should be included in the returned SNPObject. Available fields are ‘GT’, ‘IID’, ‘REF’, ‘ALT’, ‘#CHROM’, ‘CM’, ‘ID’, ‘POS’. To extract all fields, set fields to None. Defaults to None.
exclude_fields (str, None, or list of str, optional) – Fields to exclude from the returned SNPObject. Available fields are ‘GT’, ‘IID’, ‘REF’, ‘ALT’, ‘#CHROM’, ‘CM’, ‘ID’, ‘POS’. To exclude no fields, set exclude_fields to None. Defaults to None.
sample_ids – List of sample IDs to read. If None and sample_idxs is None, all samples are read.
sample_idxs – List of sample indices to read. If None and sample_ids is None, all samples are read.
variant_ids – List of variant IDs to read. If None and variant_idxs is None, all variants are read.
variant_idxs – List of variant indices to read. If None and variant_ids is None, all variants are read.
sum_strands – If True, maternal and paternal strands are combined into a single int8 array with values {0, 1, 2}. If False, strands are stored separately as an int8 array with values {0, 1} for each strand. Note: With the pgenlib backend, False uses a temporary int32 allele buffer.
separator – Separator used in the pvar file. If None, the separator is automatically detected. If the automatic detection fails, please specify the separator manually.
- Returns:
SNPObject – A SNPObject instance.
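One plausible reading of how fields and exclude_fields combine is sketched below. The field names come from the parameter list above. The resolution rules (None means all fields, exclusions applied last, unknown names rejected) are assumptions, and `resolve_fields` is a hypothetical helper.

```python
BED_FIELDS = ("GT", "IID", "REF", "ALT", "#CHROM", "CM", "ID", "POS")


def resolve_fields(fields=None, exclude_fields=None, available=BED_FIELDS):
    """Return the fields to load, in the order given by `available`."""
    def as_list(value):
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    # None (or nothing requested) is treated as "all available fields".
    wanted = as_list(fields) or list(available)
    excluded = set(as_list(exclude_fields))
    unknown = (set(wanted) | excluded) - set(available)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return [f for f in available if f in wanted and f not in excluded]


print(resolve_fields(fields=["POS", "GT"]))          # ['GT', 'POS']
print(resolve_fields(exclude_fields=["CM", "IID"]))
```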
- class snputils.PGENReader(filename)[source]¶
Bases: SNPBaseReader
Initialize the SNPBaseReader.
- Parameters:
filename – The path to the file storing SNP data.
- read(fields=None, exclude_fields=None, sample_ids=None, sample_idxs=None, variant_ids=None, variant_idxs=None, sum_strands=False, separator=None)[source]¶
Read a pgen fileset (pgen, psam, pvar) into a SNPObject.
- Parameters:
fields (str, None, or list of str, optional) – Fields to extract data for that should be included in the returned SNPObject. Available fields are ‘GT’, ‘IID’, ‘REF’, ‘ALT’, ‘#CHROM’, ‘CM’, ‘ID’, ‘POS’, ‘FILTER’, ‘QUAL’, ‘INFO’. To extract all fields, set fields to None. Defaults to None.
exclude_fields (str, None, or list of str, optional) – Fields to exclude from the returned SNPObject. Available fields are ‘GT’, ‘IID’, ‘REF’, ‘ALT’, ‘#CHROM’, ‘CM’, ‘ID’, ‘POS’, ‘FILTER’, ‘QUAL’, ‘INFO’. To exclude no fields, set exclude_fields to None. Defaults to None.
sample_ids – List of sample IDs to read. If None and sample_idxs is None, all samples are read.
sample_idxs – List of sample indices to read. If None and sample_ids is None, all samples are read.
variant_ids – List of variant IDs to read. If None and variant_idxs is None, all variants are read.
variant_idxs – List of variant indices to read. If None and variant_ids is None, all variants are read.
sum_strands – If True, maternal and paternal strands are combined into a single int8 array with values {0, 1, 2}. If False, strands are stored separately as an int8 array with values {0, 1} for each strand. Note: With the pgenlib backend, False uses a temporary int32 allele buffer.
separator – Separator used in the pvar file. If None, the separator is automatically detected. If the automatic detection fails, please specify the separator manually.
- Returns:
SNPObject – A SNPObject instance.
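Selecting samples by ID or by index amounts to resolving both forms to row positions. The sketch below shows one way to do that. Rejecting calls that pass both arguments is an assumption, since this page only describes what happens when each is None. `select_sample_indices` is a hypothetical helper.

```python
def select_sample_indices(all_ids, sample_ids=None, sample_idxs=None):
    """Resolve a sample selection to a list of zero-based positions."""
    if sample_ids is not None and sample_idxs is not None:
        # Assumption: the two selectors are mutually exclusive.
        raise ValueError("pass sample_ids or sample_idxs, not both")
    if sample_ids is not None:
        position = {sid: i for i, sid in enumerate(all_ids)}
        missing = [sid for sid in sample_ids if sid not in position]
        if missing:
            raise KeyError(f"unknown sample IDs: {missing}")
        return [position[sid] for sid in sample_ids]
    if sample_idxs is not None:
        return list(sample_idxs)
    # Neither selector given: read every sample.
    return list(range(len(all_ids)))


ids = ["HG001", "HG002", "HG003"]
print(select_sample_indices(ids, sample_ids=["HG003", "HG001"]))  # [2, 0]
print(select_sample_indices(ids))                                 # [0, 1, 2]
```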
- class snputils.VCFReader(filename)[source]¶
Bases: SNPBaseReader
Reads VCF files into an SNPObject with a NumPy parser optimized for GT columns.
.vcf and .vcf.gz files with GT-only sample fields use a block parser that avoids materializing genotype strings in a DataFrame. Simple diploid FORMAT layouts such as GT:DP and DP:GT use a streaming byte parser. Other supported VCF layouts fall back to a pandas chunked parser. By default it reads the core variant fields CHROM, POS, ID, REF, ALT, QUAL, and FILTER; pass fields="*" or include "INFO" when the INFO column is required.
Initialize the SNPBaseReader.
- Parameters:
filename – The path to the file storing SNP data.
- read(fields=None, exclude_fields=None, region=None, samples=None, sum_strands=False, separator=None)[source]¶
Read a VCF file into an SNPObject.
By default, the reader loads the core VCF variant columns CHROM, POS, ID, REF, ALT, QUAL, and FILTER, plus all sample genotype columns. Genotypes are read from the GT FORMAT field and returned as an int8 array. With sum_strands=False, calldata_gt has shape (n_variants, n_samples, 2); with sum_strands=True, the two alleles are summed into shape (n_variants, n_samples).
- Parameters:
fields – VCF fixed columns to include, such as ["CHROM", "POS", "ID"]. Use "*" to include all fixed VCF columns, including INFO and FORMAT. If None, the default core fields are used.
exclude_fields – Fixed VCF columns to exclude. This is mainly useful with fields="*"; when fields is None, it excludes columns from the default core field set.
region – Optional genomic region to read. Accepts chromosome-only values such as "22" or inclusive 1-based intervals such as "22:100000-200000". Records are included when their POS is within the requested interval.
samples – Optional sample subset. Provide sample IDs or zero-based sample indexes. If omitted, all samples are read; pass an empty sequence to read variant metadata without genotypes.
sum_strands – If True, sum the two diploid alleles per sample and return dosages in calldata_gt. If False, keep the two allele columns separate.
separator – Optional column separator. If omitted, the separator is detected from the VCF header. Tab-delimited files use optimized byte parsers when possible; other separators use the pandas chunked parser.
- Returns:
SNPObject – Object containing selected genotype, sample, and variant fields.
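The region semantics described for this method (a chromosome-only value, or an inclusive 1-based interval) can be sketched as a small filter. `parse_region` and `in_region` are hypothetical names, and the parsing here handles only the two documented forms.

```python
def parse_region(region):
    """Split "22" or "22:100000-200000" into (chrom, start, end)."""
    chrom, sep, span = region.partition(":")
    if not sep:
        # Chromosome-only region: no positional bounds.
        return chrom, None, None
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


def in_region(chrom, pos, region):
    """True when a record at (chrom, pos) falls inside `region`."""
    r_chrom, start, end = parse_region(region)
    if chrom != r_chrom:
        return False
    # Both interval ends are inclusive and 1-based.
    return start is None or start <= pos <= end


print(in_region("22", 200000, "22:100000-200000"))  # True
print(in_region("22", 200001, "22:100000-200000"))  # False
```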
- class snputils.snp.io.read.vcf.VCFReaderPolars(filename)[source]¶
Bases: SNPBaseReader
Reads a VCF file and processes it into a SNPObject.
Initialize the SNPBaseReader.
- Parameters:
filename – The path to the file storing SNP data.
- read(fields=None, exclude_fields=None, region=None, samples=None, sum_strands=False, separator=None)[source]¶
Read a VCF file into a SNPObject.
- Parameters:
fields – Fields to extract data for. This parameter specifies which data fields from the VCF file should be included in the result. Available options include ‘CHROM’/‘#CHROM’, ‘POS’, ‘ID’, ‘REF’, ‘ALT’, ‘QUAL’, ‘FILTER’, ‘INFO’, and ‘FORMAT’. To extract all fields, provide just the string ‘*’ or the default None.
exclude_fields – Fields to exclude for use in combination with fields=‘*’. Available options include ‘CHROM’/‘#CHROM’, ‘POS’, ‘ID’, ‘REF’, ‘ALT’, ‘QUAL’, ‘FILTER’, ‘INFO’, and ‘FORMAT’.
region – Genomic region to extract variants for. If provided, it should be a tabix-style region string, specifying a chromosome name and optionally beginning and end coordinates (e.g., ‘2L:100000-200000’).
samples – Selection of samples to extract calldata for. If provided, should be a list of strings giving sample identifiers. May also be a list of integers giving indices of selected samples. If an empty list is provided, no samples are extracted.
sum_strands – True if the maternal and paternal strands are to be summed together, False if the strands are to be stored separately.
separator – Separator used in the VCF file. If None, the separator is automatically detected. If the automatic detection fails, please specify the separator manually.
- Returns:
SNPObject – Object containing the data from the VCF file. The format and content of this object depend on the specified parameters and the content of the VCF file.
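The effect of sum_strands on a single diploid call can be sketched on raw GT strings. The encoding of a missing allele as -1 and the rule that a call with any missing allele stays missing after summing are both assumptions, since this page does not document how the readers treat missing calls. The function names are hypothetical.

```python
def parse_gt(gt):
    """Parse a diploid GT string such as "0|1" or "./." into two ints (-1 = missing)."""
    sep = "|" if "|" in gt else "/"
    alleles = gt.split(sep)
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {gt!r}")
    return tuple(-1 if a == "." else int(a) for a in alleles)


def genotype(gt, sum_strands=False):
    """Return (a, b) with strands separate, or a single count when summed."""
    a, b = parse_gt(gt)
    if not sum_strands:
        return (a, b)
    # Assumption: a call with any missing allele stays missing after summing.
    return -1 if -1 in (a, b) else a + b


print(genotype("0|1"))                    # (0, 1)
print(genotype("1/1", sum_strands=True))  # 2
```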
Read Functions¶
- snputils.read_snp(filename, **kwargs)[source]¶
Automatically detect the file format and read it into a SNPObject.
- Parameters:
filename – Filename of the file to read.
**kwargs – Additional arguments passed to the reader method.
- Raises:
ValueError – If the filename does not have an extension or the extension is not supported.
- snputils.read_bed(filename, **kwargs)[source]¶
Read a BED fileset into a SNPObject.
- Parameters:
filename – Filename of the BED fileset to read.
**kwargs – Additional arguments passed to the reader method. See snputils.snp.io.read.bed.BEDReader for possible parameters.
- snputils.read_pgen(filename, **kwargs)[source]¶
Read a PGEN fileset into a SNPObject.
- Parameters:
filename – Filename of the PGEN fileset to read.
**kwargs – Additional arguments passed to the reader method. See snputils.snp.io.read.pgen.PGENReader for possible parameters.
- snputils.read_vcf(filename, backend='default', **kwargs)[source]¶
Read a VCF file into a SNPObject.
- Parameters:
filename – Filename of the VCF file to read.
backend – Backend to use for reading the VCF file. Options are ‘default’ or ‘polars’.
**kwargs – Additional arguments passed to the reader method. See snputils.snp.io.read.vcf.VCFReader for possible parameters.
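The backend argument and the forwarded keyword arguments suggest a simple dispatch pattern. The sketch below uses stand-in classes instead of the real readers so that it runs without any files. The stub classes, the `BACKENDS` table, and `read_vcf_sketch` are hypothetical, and the real function's error handling for unknown backends is not documented here.

```python
class DefaultBackendStub:
    """Stand-in for the default VCF reader; records what it was asked to do."""

    name = "default"

    def __init__(self, filename):
        self.filename = filename

    def read(self, **kwargs):
        return {"backend": self.name, "filename": self.filename, "read_kwargs": kwargs}


class PolarsBackendStub(DefaultBackendStub):
    """Stand-in for the polars-based reader."""

    name = "polars"


BACKENDS = {"default": DefaultBackendStub, "polars": PolarsBackendStub}


def read_vcf_sketch(filename, backend="default", **kwargs):
    """Pick a reader class by name and forward the remaining kwargs to read()."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}")
    return BACKENDS[backend](filename).read(**kwargs)


result = read_vcf_sketch("cohort.vcf", backend="polars", sum_strands=True)
print(result["backend"], result["read_kwargs"])  # polars {'sum_strands': True}
```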
Writers¶
- class snputils.BEDWriter(snpobj, filename)[source]¶
Bases: object
Writes an object in bed/bim/fam formats in the specified output path.
- Parameters:
snpobj – The SNPObject to be written.
filename – The output file path.
- write(rename_missing_values=True, before=-1, after='.', sample_phenotype=None)[source]¶
Writes the SNPObject to bed/bim/fam formats.
- Parameters:
rename_missing_values (bool, optional) – If True, renames potential missing values in snpobj.calldata_gt before writing. Defaults to True.
before (int, float, or str, default=-1) – The current representation of missing values in calldata_gt. Common values might be -1, ‘.’, or NaN. Default is -1.
after (int, float, or str, default='.') – The value that will replace before. Default is ‘.’.
sample_phenotype (optional) – PLINK phenotype value per sample, or a scalar used for all samples. Defaults to -9 for all samples.
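The sample_phenotype argument accepts either one value per sample or a single scalar, with -9 used when it is omitted. A minimal sketch of how such an argument could be expanded to one value per sample follows. `resolve_phenotype` is a hypothetical helper, and the length check is an assumption.

```python
def resolve_phenotype(sample_phenotype, n_samples, default=-9):
    """Expand a scalar or per-sample phenotype into a list of length n_samples."""
    if sample_phenotype is None:
        # Nothing given: use the default value for every sample.
        return [default] * n_samples
    if isinstance(sample_phenotype, (str, int, float)):
        # Scalar: repeat it for every sample.
        return [sample_phenotype] * n_samples
    values = list(sample_phenotype)
    if len(values) != n_samples:
        raise ValueError(f"expected {n_samples} phenotype values, got {len(values)}")
    return values


print(resolve_phenotype(None, 3))       # [-9, -9, -9]
print(resolve_phenotype(2, 3))          # [2, 2, 2]
print(resolve_phenotype([1, 2, 1], 3))  # [1, 2, 1]
```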
- class snputils.PGENWriter(snpobj, filename)[source]¶
Bases: object
Writes a genotype object in PGEN format (.pgen, .psam, and .pvar files) in the specified output path.
Initializes the PGENWriter instance.
- Parameters:
- write(vzs=False, rename_missing_values=True, before=-1, after='.')[source]¶
Writes the SNPObject data to .pgen, .psam, and .pvar files.
- Parameters:
vzs (bool, optional) – If True, compresses the .pvar file using zstd and saves it as .pvar.zst. Defaults to False.
rename_missing_values (bool, optional) – If True, renames potential missing values in snpobj.calldata_gt before writing. Defaults to True.
before (int, float, or str, default=-1) – The current representation of missing values in calldata_gt. Common values might be -1, ‘.’, or NaN. Default is -1.
after (int, float, or str, default='.') – The value that will replace before. Default is ‘.’.
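The before and after arguments describe a plain value substitution on the genotype calls. A minimal sketch on nested lists follows, with explicit handling for NaN because NaN never compares equal to itself. `rename_missing` is a hypothetical helper, and the writers' real implementation may differ.

```python
import math


def rename_missing(calldata_gt, before=-1, after="."):
    """Return a copy of a 2D genotype table with `before` replaced by `after`."""
    before_is_nan = isinstance(before, float) and math.isnan(before)

    def swap(value):
        if before_is_nan:
            # NaN != NaN, so equality cannot be used to find it.
            is_nan = isinstance(value, float) and math.isnan(value)
            return after if is_nan else value
        return after if value == before else value

    return [[swap(v) for v in row] for row in calldata_gt]


print(rename_missing([[0, -1], [2, 1]]))  # [[0, '.'], [2, 1]]
```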
- class snputils.VCFWriter(snpobj, filename, n_jobs=-1, phased=False)[source]¶
Bases: object
A writer class for exporting SNP data from a snputils.snp.genobj.SNPObject into a .vcf file.
- Parameters:
snpobj (SNPObject) – A SNPObject instance.
filename (str or pathlib.Path) – Path to the file where the data will be saved. It should end with .vcf. If the provided path does not have this extension, the .vcf extension will be appended.
n_jobs – Number of jobs to run in parallel. None uses 1 job unless within a joblib.parallel_backend context; -1 uses all available processors; any other integer uses that number of jobs.
phased – If True, genotype data is written in “maternal|paternal” format. If False, genotype data is written in “maternal/paternal” format.
- write(chrom_partition=False, rename_missing_values=True, before=-1, after='.', variants_info=None)[source]¶
Writes the SNP data to VCF file(s).
- Parameters:
chrom_partition (bool, optional) – If True, individual VCF files are generated for each chromosome. If False, a single VCF file containing data for all chromosomes is created. Defaults to False.
rename_missing_values (bool, optional) – If True, renames potential missing values in snpobj.calldata_gt before writing. Defaults to True.
before (int, float, or str, default=-1) – The current representation of missing values in calldata_gt. Common values might be -1, ‘.’, or NaN. Default is -1.
after (int, float, or str, default='.') – The value that will replace before. Default is ‘.’.
variants_info (sequence of str, optional) – Per-variant INFO column values (e.g. ["END=2000", "END=3000"]). Length must match variant count. When provided, a ##INFO header line for END is written if any value contains END=.
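Two of the rules described for this writer are easy to show on their own: the genotype separator chosen by phased, and the condition under which an END header line is needed. The sketch below assumes missing alleles are encoded as before and written as after. Both function names are hypothetical, and the exact header text the writer emits is not shown on this page.

```python
def format_gt(maternal, paternal, phased=False, before=-1, after="."):
    """Render one diploid call as a VCF GT string."""
    sep = "|" if phased else "/"
    a, b = (after if allele == before else str(allele) for allele in (maternal, paternal))
    return f"{a}{sep}{b}"


def needs_end_header(variants_info):
    """True when any per-variant INFO value carries an END= entry."""
    return variants_info is not None and any("END=" in value for value in variants_info)


print(format_gt(0, 1, phased=True))                # 0|1
print(format_gt(-1, -1))                           # ./.
print(needs_end_header(["END=2000", "END=3000"]))  # True
```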