Statistics

Allele-frequency and population-genetic statistics.

snputils.stats.allele_freq_stream(data, *, chunk_size=10000, sample_labels=None, ancestry=None, laiobj=None, lai_chunk_size=1024, lai_iter_kwargs=None, pseudohaploid=False, return_counts=False, as_dataframe=False, **iter_kwargs)[source]

Compute allele frequencies in variant chunks.

Parameters:
  • data

    One of:

    • in-memory SNPObject

    • a reader implementing iter_read(…)

    • an iterable yielding SNPObject chunks

  • chunk_size – Number of SNPs per chunk.

  • sample_labels – Population label per sample. If None, computes cohort-level frequencies.

  • ancestry – Optional ancestry code for ancestry-specific masking.

  • laiobj

    Optional LAI source used to derive SNP-level LAI when missing in chunks. Supported inputs:

    • in-memory LocalAncestryObject

    • LAI reader implementing iter_windows(…)

    • path to an LAI file understood by snputils.ancestry.io.local.read.LAIReader

  • lai_chunk_size – Number of LAI windows per chunk when laiobj is a streaming LAI reader/path.

  • lai_iter_kwargs – Optional kwargs forwarded to laiobj.iter_windows(…).

  • pseudohaploid – If True, detects pseudo-haploid samples (samples with no heterozygotes in the first 1000 SNPs) and treats them as haploid. If an integer n is provided, checks the first n SNPs. If False, treats all samples as diploid.

  • return_counts – If True, also return called haplotype counts.

  • as_dataframe – If True, return pandas DataFrames.

  • **iter_kwargs – Additional args forwarded to iter_read(…) when data is a reader.

snputils.stats.f2(data, pop1=None, pop2=None, sample_labels=None, apply_correction=True, block_size=5000, blocks=None, ancestry=None, laiobj=None, include_self=False, pseudohaploid=False)[source]

Compute f2-statistics with block-jackknife standard errors.

Parameters:
  • data – Either a SNPObject or a tuple (afs, counts, pops), where afs and counts are arrays of shape (n_snps, n_pops). If a SNPObject is provided, sample_labels are used to aggregate samples to populations.

  • pop1 – Populations to compute f2 for. If both are None, include_self=False computes only off-diagonal pairs i<j; include_self=True computes all pairs including diagonals i<=j.

  • pop2 – Populations to compute f2 for. If both are None, include_self=False computes only off-diagonal pairs i<j; include_self=True computes all pairs including diagonals i<=j.

  • sample_labels – Population label per sample (aligned with SNPObject.samples) when data is a SNPObject. If omitted, labels are inferred from SNPObject.sample_fid when it differs from samples (PLINK FID); otherwise each samples value is used as its own group label.

  • apply_correction – Apply small-sample correction p*(1-p)/(n-1) per population. When True, SNPs with n<=1 in either population are excluded at that SNP.

  • block_size – Number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • blocks – Optional explicit block id per SNP. If provided, overrides block_size.

  • ancestry – Optional ancestry code to mask genotypes to a specific ancestry before aggregation. Requires LAI.

  • laiobj – Optional LocalAncestryObject used to derive SNP-level LAI if it is not already present on the SNPObject.

  • pseudohaploid – If True, detects pseudo-haploid samples (samples with no heterozygotes in the first 1000 SNPs) and treats them as haploid. If an integer n is provided, checks the first n SNPs. If False, treats all samples as diploid.

Returns:

Pandas DataFrame with columns – pop1, pop2, est, se, z, p, n_blocks, n_snps

snputils.stats.f3(data, target=None, ref1=None, ref2=None, sample_labels=None, apply_correction=True, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]

Compute f3-statistics f3(target; ref1, ref2) with block-jackknife SE.

  • block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • If target, ref1, and ref2 are all None, compute f3 for all combinations where each role can be any population.

  • If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.

  • If apply_correction is True, subtract the finite sample term p_t*(1-p_t)/(n_t-1) from the per-SNP product.

    When True, SNPs with n_t<=1 are excluded.

  • If sample_labels is omitted for a SNPObject, defaults match f2 (PLINK sample_fid when FID differs from IID).

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.

snputils.stats.f4(data, a=None, b=None, c=None, d=None, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]

Compute f4-statistics f4(a, b; c, d) with block-jackknife SE.

  • block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.

snputils.stats.d_stat(data, a=None, b=None, c=None, d=None, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]

Compute D-statistics D(a, b; c, d) as ratio of sums:

D = sum_l (A-B)(C-D) / sum_l (A+B-2AB)(C+D-2CD)

with delete-one block jackknife SE.

  • block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.

snputils.stats.f4_ratio(data, num, den, sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, pseudohaploid=False)[source]

Compute f4-ratio statistics as ratio of two f4-statistics with block-jackknife SE.

Parameters:
  • num – Sequence of quadruples (a, b, c, d) for numerator f4(a, b; c, d)

  • den – Sequence of quadruples (a, b, c, d) for denominator f4(a, b; c, d)

Notes

  • block_size is the number of SNPs per jackknife block (default 5000 SNPs). Ignored if blocks is provided.

  • If ancestry is provided, genotypes will be masked to the specified ancestry using LAI before aggregation.

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.

snputils.stats.fst(data, pop1=None, pop2=None, *, method='hudson', sample_labels=None, block_size=5000, blocks=None, ancestry=None, laiobj=None, include_self=False, pseudohaploid=False)[source]

Pairwise F_ST with delete-one block jackknife SE.

`hudson`

Ratio-of-averages following Hudson 1992 / Bhatia 2013. Uses num = d_xy - 0.5*(pi_x + pi_y) and den = d_xy, where d_xy = p_x*(1-p_y) + p_y*(1-p_x) and pi_x = 2*p_x*(1-p_x)*n_x/(n_x-1).

`weir_cockerham`

Weir and Cockerham’s theta for two populations. Computes per-SNP variance components a, b, and c, then uses a ratio-of-sums jackknife with num = a and den = a + b + c.

Notes

  • Inputs are the same as f2/f3/f4: either SNPObject or (afs, counts, pops).

  • For WC we use expected heterozygosity h_i = 2 p_i (1 - p_i) from allele freqs.

  • SNPs with n<=1 in either pop or with invalid denominators are ignored.

  • pseudohaploid: If True, detects and treats pseudo-haploid samples as haploid. If int n, checks first n SNPs. If False, treats all as diploid.