Statistics

Functions for allele-frequency aggregation and population-genetic tests. Inputs may be a SNPObject with sample labels or a precomputed allele-frequency tuple from allele_freq_stream(). Block jackknife standard errors are computed by default on f-statistics. Ancestry masking requires a LocalAncestryObject when the ancestry argument is set.

snputils.stats.allele_freq_stream(data, *, chunk_size=10000, sample_labels=None, ancestry=None, laiobj=None, lai_chunk_size=1024, lai_iter_kwargs=None, pseudohaploid=False, return_counts=False, as_dataframe=False, **iter_kwargs)[source]

Compute allele frequencies in variant chunks.

Parameters:
  • data

    One of:

    • in-memory SNPObject

    • a reader implementing iter_read(…)

    • an iterable yielding SNPObject chunks

  • chunk_size – Number of SNPs per chunk.

  • sample_labels – Population label per sample. If None, computes cohort-level frequencies.

  • ancestry – Optional ancestry code for ancestry-specific masking.

  • laiobj

    Optional LAI source used to derive SNP-level LAI when missing in chunks. Supported inputs:

    • in-memory LocalAncestryObject

    • LAI reader implementing iter_windows(…)

    • path to an LAI file understood by snputils.ancestry.io.local.read.LAIReader

  • lai_chunk_size – Number of LAI windows per chunk when laiobj is a streaming LAI reader/path.

  • lai_iter_kwargs – Optional kwargs forwarded to laiobj.iter_windows(…).

  • pseudohaploid – If True, detects pseudo-haploid samples (samples with no heterozygotes in the first 1000 SNPs) and treats them as haploid. If an integer n is provided, checks the first n SNPs. If False, treats all samples as diploid.

  • return_counts – If True, also return called haplotype counts.

  • as_dataframe – If True, return pandas DataFrames.

  • **iter_kwargs – Additional args forwarded to iter_read(…) when data is a reader.

snputils.stats.f2(data, pop1=None, pop2=None, sample_labels=None, apply_correction=True, block_size=5000, blocks=None, ancestry=None, laiobj=None, include_self=False, pseudohaploid=False)[source]

Compute f2-statistics with block-jackknife standard errors.

Parameters:
  • data – Either a SNPObject or a tuple (afs, counts, pops), where afs and counts are arrays of shape (n_snps, n_pops). If a SNPObject is provided, sample_labels are used to aggregate samples to populations.

  • pop1 – Populations to compute f2 for. If both are None, include_self=False computes only off-diagonal pairs i<j; include_self=True computes all pairs including diagonals i<=j.

  • pop2 – Populations to compute f2 for. If both are None, include_self=False computes only off-diagonal pairs i<j; include_self=True computes all pairs including diagonals i<=j.

  • sample_labels – Population label per sample (aligned with SNPObject.samples) when data is a SNPObject. If omitted, labels are inferred from SNPObject.sample_fid when it differs from samples (PLINK FID); otherwise each samples value is used as its own group label.

  • apply_correction – Apply small-sample correction p*(1-p)/(n-1) per population. When True, SNPs with n<=1 in either population are excluded at that SNP.

  • block_size – Number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • blocks – Optional explicit block id per SNP. If provided, overrides block_size.

  • ancestry – Optional ancestry code to mask genotypes to a specific ancestry before aggregation. Requires LAI.

  • laiobj – Optional LocalAncestryObject used to derive SNP-level LAI if it is not already present on the SNPObject.

  • pseudohaploid – If True, detects pseudo-haploid samples (samples with no heterozygotes in the first 1000 SNPs) and treats them as haploid. If an integer n is provided, checks the first n SNPs. If False, treats all samples as diploid.

Returns:

Pandas DataFrame with columns – pop1, pop2, est, se, z, p, n_blocks, n_snps

snputils.stats.f3(data, target=None, ref1=None, ref2=None, sample_labels=None, apply_correction=True, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]

Compute f3-statistics f3(target; ref1, ref2) with block-jackknife SE.

  • block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • If target, ref1, and ref2 are all None, compute f3 for all combinations where each role can be any population.

  • If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.

  • If apply_correction is True, subtract the finite sample term p_t*(1-p_t)/(n_t-1) from the per-SNP product.

    When True, SNPs with n_t<=1 are excluded.

  • If sample_labels is omitted for a SNPObject, defaults match f2 (PLINK sample_fid when FID differs from IID).

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.

snputils.stats.f4(data, a=None, b=None, c=None, d=None, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]

Compute f4-statistics f4(a, b; c, d) with block-jackknife SE.

  • block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.

snputils.stats.d_stat(data, a=None, b=None, c=None, d=None, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]

Compute D-statistics D(a, b; c, d) as ratio of sums:

D = sum_l (A-B)(C-D) / sum_l (A+B-2AB)(C+D-2CD)

with delete-one block jackknife SE.

  • block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.

snputils.stats.f4_ratio(data, num, den, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]

Compute f4-ratio statistics as ratio of two f4-statistics with block-jackknife SE.

Parameters:
  • num – Sequence of quadruples (a, b, c, d) for numerator f4(a, b; c, d)

  • den – Sequence of quadruples (a, b, c, d) for denominator f4(a, b; c, d)

Notes

  • block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.

snputils.stats.fst(data, pop1=None, pop2=None, *, method='hudson', q=2.0, tsallis_weights='equal', sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, include_self=False, pseudohaploid=False)[source]

Pairwise F_ST with delete-one block jackknife SE.

- ``hudson``

Ratio-of-averages following Hudson 1992 / Bhatia 2013. Uses num = d_xy - 0.5*(pi_x + pi_y) and den = d_xy, where d_xy = p_x*(1-p_y) + p_y*(1-p_x) and pi_x = 2*p_x*(1-p_x)*n_x/(n_x-1).

- ``weir_cockerham``

Weir and Cockerham’s theta for two populations. Computes per-SNP variance components a, b, and c, then uses a ratio-of-sums jackknife with num = a and den = a + b + c.

- ``tsallis``

Tsallis q-entropy F-statistic. For two populations, computes per-SNP total entropy S_q(Bern(p_bar)) and within entropy w1*S_q(Bern(p1)) + w2*S_q(Bern(p2)), then returns the genome-wide micro-average:

sum_l [S_total(l) - S_within(l)] / sum_l S_total(l)

With q=2, this equals the classical heterozygosity-based F_ST:

(H_T - H_S) / H_T

With q=1, this is the Shannon entropy / normalized mutual information analogue.

tsallis_weights="equal" uses w1=w2=0.5, for OVR equal-group weighting. tsallis_weights="sample_size" uses per-SNP haplotype count weights.

Notes

  • Inputs are the same as f2/f3/f4: either SNPObject or (afs, counts, pops).

  • For WC we use expected heterozygosity h_i = 2 p_i (1 - p_i) from allele freqs.

  • SNPs with n<=1 in either pop or with invalid denominators are ignored.

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.

snputils.stats.export_qp(data, outdir, *, sample_labels=None, populations=None, tools=('qpAdm', 'qpGraph', 'qpWave'), block_size=5000, blocks=None, block_size_cm=None, block_size_bp=None, chrom=None, pos=None, cm=None, apply_correction=True, overwrite=False, ancestry=None, laiobj=None, pseudohaploid=False)[source]

Export blocked pairwise statistics for qpAdm, qpGraph, and qpWave.

The output directory contains one file per population pair plus block-length files. qpGraph consumes the blocked f2 files. qpAdm and qpWave consume the allele-product blocks (ap) and form their f4 matrices from them.

Parameters:
  • data – Either a SNPObject or a tuple (afs, counts, pops) where afs and counts have shape (n_snps, n_pops).

  • outdir – Destination directory for the blocked-statistics files.

  • sample_labels – Population label per sample when data is a SNPObject. If omitted, the same defaults as f2 are used.

  • populations – Optional subset/order of populations to export. Defaults to all populations in the prepared allele-frequency table.

  • tools – Any subset of {"qpAdm", "qpGraph", "qpWave"}. The default writes everything needed by all three tools.

  • block_size – Number of SNPs per jackknife block. Ignored if blocks, block_size_cm, or block_size_bp is provided.

  • blocks – Optional explicit block label per SNP.

  • block_size_cm – Optional genetic-distance block size in centimorgans. Uses cm if provided, otherwise SNPObject.variants_cm.

  • block_size_bp – Optional physical-distance block size in base pairs. Uses pos if provided, otherwise SNPObject.variants_pos.

  • chrom – Optional variant metadata for distance-based blocking. chrom is required when using block_size_cm or block_size_bp unless it is present on a SNPObject.

  • pos – Optional variant metadata for distance-based blocking. chrom is required when using block_size_cm or block_size_bp unless it is present on a SNPObject.

  • cm – Optional variant metadata for distance-based blocking. chrom is required when using block_size_cm or block_size_bp unless it is present on a SNPObject.

  • apply_correction – Apply the small-sample f2 correction. The correction denominator is clamped to at least 1, so count-1 observations are retained when allele frequencies are finite.

  • overwrite – Replace existing files if True. Existing target files raise FileExistsError by default.

  • ancestry – Passed through to the allele-frequency aggregation used by the native f-statistics implementation.

  • laiobj – Passed through to the allele-frequency aggregation used by the native f-statistics implementation.

  • pseudohaploid – Passed through to the allele-frequency aggregation used by the native f-statistics implementation.

Returns:

QPExportResult with the exported populations, statistics, and files.