Statistics¶
Functions for allele-frequency aggregation and population-genetic tests. Inputs may be a SNPObject with sample labels or a precomputed allele-frequency tuple from allele_freq_stream(). Block jackknife standard errors are computed by default on f-statistics. Ancestry masking requires a LocalAncestryObject when the ancestry argument is set.
- snputils.stats.allele_freq_stream(data, *, chunk_size=10000, sample_labels=None, ancestry=None, laiobj=None, lai_chunk_size=1024, lai_iter_kwargs=None, pseudohaploid=False, return_counts=False, as_dataframe=False, **iter_kwargs)[source]¶
Compute allele frequencies in variant chunks.
- Parameters:
data –
One of:
in-memory SNPObject
a reader implementing iter_read(…)
an iterable yielding SNPObject chunks
chunk_size – Number of SNPs per chunk.
sample_labels – Population label per sample. If None, computes cohort-level frequencies.
ancestry – Optional ancestry code for ancestry-specific masking.
laiobj –
Optional LAI source used to derive SNP-level LAI when missing in chunks. Supported inputs:
in-memory LocalAncestryObject
LAI reader implementing iter_windows(…)
path to an LAI file understood by snputils.ancestry.io.local.read.LAIReader
lai_chunk_size – Number of LAI windows per chunk when laiobj is a streaming LAI reader/path.
lai_iter_kwargs – Optional kwargs forwarded to laiobj.iter_windows(…).
pseudohaploid – If True, detects pseudo-haploid samples (samples with no heterozygotes in the first 1000 SNPs) and treats them as haploid. If an integer n is provided, checks the first n SNPs. If False, treats all samples as diploid.
return_counts – If True, also return called haplotype counts.
as_dataframe – If True, return pandas DataFrames.
**iter_kwargs – Additional args forwarded to iter_read(…) when data is a reader.
- snputils.stats.f2(data, pop1=None, pop2=None, sample_labels=None, apply_correction=True, block_size=5000, blocks=None, ancestry=None, laiobj=None, include_self=False, pseudohaploid=False)[source]¶
Compute f2-statistics with block-jackknife standard errors.
- Parameters:
data – Either a SNPObject or a tuple (afs, counts, pops), where afs and counts are arrays of shape (n_snps, n_pops). If a SNPObject is provided, sample_labels are used to aggregate samples to populations.
pop1 – Populations to compute f2 for. If both are None, include_self=False computes only off-diagonal pairs i<j; include_self=True computes all pairs including diagonals i<=j.
pop2 – Populations to compute f2 for. If both are None, include_self=False computes only off-diagonal pairs i<j; include_self=True computes all pairs including diagonals i<=j.
sample_labels – Population label per sample (aligned with SNPObject.samples) when data is a SNPObject. If omitted, labels are inferred from
SNPObject.sample_fidwhen it differs fromsamples(PLINK FID); otherwise eachsamplesvalue is used as its own group label.apply_correction – Apply small-sample correction p*(1-p)/(n-1) per population. When True, SNPs with n<=1 in either population are excluded at that SNP.
block_size – Number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
blocks – Optional explicit block id per SNP. If provided, overrides block_size.
ancestry – Optional ancestry code to mask genotypes to a specific ancestry before aggregation. Requires LAI.
laiobj – Optional LocalAncestryObject used to derive SNP-level LAI if it is not already present on the SNPObject.
pseudohaploid – If True, detects pseudo-haploid samples (samples with no heterozygotes in the first 1000 SNPs) and treats them as haploid. If an integer n is provided, checks the first n SNPs. If False, treats all samples as diploid.
- Returns:
Pandas DataFrame with columns – pop1, pop2, est, se, z, p, n_blocks, n_snps
- snputils.stats.f3(data, target=None, ref1=None, ref2=None, sample_labels=None, apply_correction=True, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]¶
Compute f3-statistics f3(target; ref1, ref2) with block-jackknife SE.
block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
If target, ref1, and ref2 are all None, compute f3 for all combinations where each role can be any population.
If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.
- If apply_correction is True, subtract the finite sample term p_t*(1-p_t)/(n_t-1) from the per-SNP product.
When True, SNPs with n_t<=1 are excluded.
If sample_labels is omitted for a SNPObject, defaults match
f2(PLINKsample_fidwhen FID differs from IID).pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.
- snputils.stats.f4(data, a=None, b=None, c=None, d=None, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]¶
Compute f4-statistics f4(a, b; c, d) with block-jackknife SE.
block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.
pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.
- snputils.stats.d_stat(data, a=None, b=None, c=None, d=None, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]¶
Compute D-statistics D(a, b; c, d) as ratio of sums:
D = sum_l (A-B)(C-D) / sum_l (A+B-2AB)(C+D-2CD)
with delete-one block jackknife SE.
block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.
pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.
- snputils.stats.f4_ratio(data, num, den, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]¶
Compute f4-ratio statistics as ratio of two f4-statistics with block-jackknife SE.
- Parameters:
num – Sequence of quadruples (a, b, c, d) for numerator f4(a, b; c, d)
den – Sequence of quadruples (a, b, c, d) for denominator f4(a, b; c, d)
Notes
block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.
If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.
pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.
- snputils.stats.fst(data, pop1=None, pop2=None, *, method='hudson', q=2.0, tsallis_weights='equal', sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, include_self=False, pseudohaploid=False)[source]¶
Pairwise F_ST with delete-one block jackknife SE.
- - ``hudson``
Ratio-of-averages following Hudson 1992 / Bhatia 2013. Uses
num = d_xy - 0.5*(pi_x + pi_y)andden = d_xy, whered_xy = p_x*(1-p_y) + p_y*(1-p_x)andpi_x = 2*p_x*(1-p_x)*n_x/(n_x-1).
- - ``weir_cockerham``
Weir and Cockerham’s theta for two populations. Computes per-SNP variance components
a,b, andc, then uses a ratio-of-sums jackknife withnum = aandden = a + b + c.
- - ``tsallis``
Tsallis q-entropy F-statistic. For two populations, computes per-SNP total entropy S_q(Bern(p_bar)) and within entropy w1*S_q(Bern(p1)) + w2*S_q(Bern(p2)), then returns the genome-wide micro-average:
sum_l [S_total(l) - S_within(l)] / sum_l S_total(l)
With q=2, this equals the classical heterozygosity-based F_ST:
(H_T - H_S) / H_T
With q=1, this is the Shannon entropy / normalized mutual information analogue.
tsallis_weights="equal"uses w1=w2=0.5, for OVR equal-group weighting.tsallis_weights="sample_size"uses per-SNP haplotype count weights.
Notes
Inputs are the same as f2/f3/f4: either SNPObject or (afs, counts, pops).
For WC we use expected heterozygosity h_i = 2 p_i (1 - p_i) from allele freqs.
SNPs with n<=1 in either pop or with invalid denominators are ignored.
pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.
- snputils.stats.export_qp(data, outdir, *, sample_labels=None, populations=None, tools=('qpAdm', 'qpGraph', 'qpWave'), block_size=5000, blocks=None, block_size_cm=None, block_size_bp=None, chrom=None, pos=None, cm=None, apply_correction=True, overwrite=False, ancestry=None, laiobj=None, pseudohaploid=False)[source]¶
Export blocked pairwise statistics for qpAdm, qpGraph, and qpWave.
The output directory contains one file per population pair plus block-length files. qpGraph consumes the blocked
f2files. qpAdm and qpWave consume the allele-product blocks (ap) and form their f4 matrices from them.- Parameters:
data – Either a
SNPObjector a tuple(afs, counts, pops)whereafsandcountshave shape(n_snps, n_pops).outdir – Destination directory for the blocked-statistics files.
sample_labels – Population label per sample when
datais aSNPObject. If omitted, the same defaults asf2are used.populations – Optional subset/order of populations to export. Defaults to all populations in the prepared allele-frequency table.
tools – Any subset of
{"qpAdm", "qpGraph", "qpWave"}. The default writes everything needed by all three tools.block_size – Number of SNPs per jackknife block. Ignored if
blocks,block_size_cm, orblock_size_bpis provided.blocks – Optional explicit block label per SNP.
block_size_cm – Optional genetic-distance block size in centimorgans. Uses
cmif provided, otherwiseSNPObject.variants_cm.block_size_bp – Optional physical-distance block size in base pairs. Uses
posif provided, otherwiseSNPObject.variants_pos.chrom – Optional variant metadata for distance-based blocking.
chromis required when usingblock_size_cmorblock_size_bpunless it is present on aSNPObject.pos – Optional variant metadata for distance-based blocking.
chromis required when usingblock_size_cmorblock_size_bpunless it is present on aSNPObject.cm – Optional variant metadata for distance-based blocking.
chromis required when usingblock_size_cmorblock_size_bpunless it is present on aSNPObject.apply_correction – Apply the small-sample f2 correction. The correction denominator is clamped to at least 1, so count-1 observations are retained when allele frequencies are finite.
overwrite – Replace existing files if
True. Existing target files raiseFileExistsErrorby default.ancestry – Passed through to the allele-frequency aggregation used by the native f-statistics implementation.
laiobj – Passed through to the allele-frequency aggregation used by the native f-statistics implementation.
pseudohaploid – Passed through to the allele-frequency aggregation used by the native f-statistics implementation.
- Returns:
QPExportResultwith the exported populations, statistics, and files.