Statistics¶
Allele-frequency and population-genetic statistics.
- snputils.stats.allele_freq_stream(data, *, chunk_size=10000, sample_labels=None, ancestry=None, laiobj=None, lai_chunk_size=1024, lai_iter_kwargs=None, pseudohaploid=False, return_counts=False, as_dataframe=False, **iter_kwargs)[source]¶
Compute allele frequencies in variant chunks.
- Parameters:
data –
One of:
in-memory SNPObject
a reader implementing iter_read(…)
an iterable yielding SNPObject chunks
chunk_size – Number of SNPs per chunk.
sample_labels – Population label per sample. If None, computes cohort-level frequencies.
ancestry – Optional ancestry code for ancestry-specific masking.
laiobj –
Optional LAI source used to derive SNP-level LAI when missing in chunks. Supported inputs:
in-memory LocalAncestryObject
LAI reader implementing iter_windows(…)
path to an LAI file understood by snputils.ancestry.io.local.read.LAIReader
lai_chunk_size – Number of LAI windows per chunk when laiobj is a streaming LAI reader/path.
lai_iter_kwargs – Optional kwargs forwarded to laiobj.iter_windows(…).
pseudohaploid – If True, detects pseudo-haploid samples (samples with no heterozygotes in the first 1000 SNPs) and treats them as haploid. If an integer n is provided, checks the first n SNPs. If False, treats all samples as diploid.
return_counts – If True, also return called haplotype counts.
as_dataframe – If True, return pandas DataFrames.
**iter_kwargs – Additional args forwarded to iter_read(…) when data is a reader.
- snputils.stats.f2(data, pop1=None, pop2=None, sample_labels=None, apply_correction=True, block_size=5000, blocks=None, ancestry=None, laiobj=None, include_self=False, pseudohaploid=False)[source]¶
Compute f2-statistics with block-jackknife standard errors.
- Parameters:
data – Either a SNPObject or a tuple (afs, counts, pops), where afs and counts are arrays of shape (n_snps, n_pops). If a SNPObject is provided, sample_labels are used to aggregate samples to populations.
pop1 – Populations to compute f2 for. If both are None, include_self=False computes only off-diagonal pairs i<j; include_self=True computes all pairs including diagonals i<=j.
pop2 – Populations to compute f2 for. If both are None, include_self=False computes only off-diagonal pairs i<j; include_self=True computes all pairs including diagonals i<=j.
sample_labels – Population label per sample (aligned with SNPObject.samples) when data is a SNPObject. If omitted, labels are inferred from
SNPObject.sample_fidwhen it differs fromsamples(PLINK FID); otherwise eachsamplesvalue is used as its own group label.apply_correction – Apply small-sample correction p*(1-p)/(n-1) per population. When True, SNPs with n<=1 in either population are excluded at that SNP.
block_size – Number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
blocks – Optional explicit block id per SNP. If provided, overrides block_size.
ancestry – Optional ancestry code to mask genotypes to a specific ancestry before aggregation. Requires LAI.
laiobj – Optional LocalAncestryObject used to derive SNP-level LAI if it is not already present on the SNPObject.
pseudohaploid – If True, detects pseudo-haploid samples (samples with no heterozygotes in the first 1000 SNPs) and treats them as haploid. If an integer n is provided, checks the first n SNPs. If False, treats all samples as diploid.
- Returns:
Pandas DataFrame with columns – pop1, pop2, est, se, z, p, n_blocks, n_snps
- snputils.stats.f3(data, target=None, ref1=None, ref2=None, sample_labels=None, apply_correction=True, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]¶
Compute f3-statistics f3(target; ref1, ref2) with block-jackknife SE.
block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
If target, ref1, and ref2 are all None, compute f3 for all combinations where each role can be any population.
If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.
- If apply_correction is True, subtract the finite sample term p_t*(1-p_t)/(n_t-1) from the per-SNP product.
When True, SNPs with n_t<=1 are excluded.
If sample_labels is omitted for a SNPObject, defaults match
f2(PLINKsample_fidwhen FID differs from IID).pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.
- snputils.stats.f4(data, a=None, b=None, c=None, d=None, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]¶
Compute f4-statistics f4(a, b; c, d) with block-jackknife SE.
block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.
pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.
- snputils.stats.d_stat(data, a=None, b=None, c=None, d=None, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]¶
Compute D-statistics D(a, b; c, d) as ratio of sums:
D = sum_l (A-B)(C-D) / sum_l (A+B-2AB)(C+D-2CD)
with delete-one block jackknife SE.
block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.
pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.
- snputils.stats.f4_ratio(data, num, den, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]¶
Compute f4-ratio statistics as ratio of two f4-statistics with block-jackknife SE.
- Parameters:
num – Sequence of quadruples (a, b, c, d) for numerator f4(a, b; c, d)
den – Sequence of quadruples (a, b, c, d) for denominator f4(a, b; c, d)
Notes
block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.
pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.
- snputils.stats.fst(data, pop1=None, pop2=None, *, method='hudson', sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, include_self=False, pseudohaploid=False)[source]¶
Pairwise F_ST with delete-one block jackknife SE.
- `hudson`
Ratio-of-averages following Hudson 1992 / Bhatia 2013. Uses num = d_xy - 0.5*(pi_x + pi_y) and den = d_xy, where d_xy = p_x*(1-p_y) + p_y*(1-p_x) and pi_x = 2*p_x*(1-p_x)*n_x/(n_x-1).
- `weir_cockerham`
Weir and Cockerham’s theta for two populations. Computes per-SNP variance components a, b, and c, then uses a ratio-of-sums jackknife with num = a and den = a + b + c.
Notes
Inputs are the same as f2/f3/f4: either SNPObject or (afs, counts, pops).
For WC we use expected heterozygosity h_i = 2 p_i (1 - p_i) from allele freqs.
SNPs with n<=1 in either pop or with invalid denominators are ignored.
pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.