F-statistics and population statistics

F-statistics summarize allele-frequency covariance among populations. They are useful for checking population separation, testing whether a group is consistent with a mixture of two references, and measuring whether two population pairs share excess genetic drift.

This tutorial shows how to:

  • Load a 1000 Genomes Project subset with population metadata

  • Compute f2, f3, f4, D-statistics, f4-ratios, and pairwise \(F_{ST}\)

  • Use SNP-count jackknife blocks

  • Reuse streamed allele frequencies as input to the same statistic functions

  • Export blocked pair statistics for downstream model-fitting workflows

  • Handle pseudo-haploid samples when aggregating allele frequencies

from pathlib import Path
import os
import tempfile

import numpy as np
import pandas as pd
from IPython.display import display

from snputils.datasets import load_dataset
from snputils.snp.genobj.snpobj import SNPObject
from snputils.stats import (
    allele_freq_stream,
    d_stat,
    export_qp,
    f2,
    f3,
    f4,
    f4_ratio,
    fst,
)

STAT_FORMAT = {
    "est": "{:.5f}",
    "se": "{:.5f}",
    "z": "{:.2f}",
    "p": "{:.2e}",
}


def display_stats(df):
    formatters = {column: fmt for column, fmt in STAT_FORMAT.items() if column in df.columns}
    display(df.style.format(formatters))

Load a 1000 Genomes subset

The subset below keeps population labels aligned to snpobj.samples in snpobj.sample_fid. The populations are chosen to make the checks below interpretable: close within-region pairs, a more distant European/East Asian pair, two African-European admixed groups, and American/South Asian groups for ratio examples.

Set SNPUTILS_1KGP_DIR if the 1000 Genomes files are already available outside data/1kgp.

DATA_DIR = Path(os.environ.get("SNPUTILS_1KGP_DIR", "data/1kgp"))

populations = [
    "YRI", "MSL",      # West African
    "CEU", "GBR",      # European
    "CHB", "JPT",      # East Asian
    "ASW", "ACB",      # African-European admixed groups
    "PEL", "MXL",      # American admixed groups
    "GIH",              # South Asian
]

snpobj = load_dataset(
    "1kgp",
    download_genotypes=True,
    resource="phase3",
    output_dir=DATA_DIR,
    populations=populations,
    samples_per_population=40,
    max_variants=100_000,
    require_biallelic=True,
    require_complete=True,
    require_polymorphic=True,
    snv_only=True,
    sum_strands=False,
    verbose=False,
)

print(f"Loaded {snpobj.n_samples:,} samples and {snpobj.n_snps:,} variants")
print(f"Genotype array shape: {snpobj.genotypes.shape}")
Loaded 440 samples and 100,000 variants
Genotype array shape: (100000, 440, 2)

Create a compact sample table and confirm that each selected population contributes the requested number of samples.

sample_metadata = pd.DataFrame({
    "sample": snpobj.samples,
    "population": snpobj.sample_fid,
    "sex": snpobj.sample_sex,
})

pop_counts = sample_metadata["population"].value_counts().reindex(populations)
display(pop_counts.to_frame("n_samples"))
n_samples
population
YRI 40
MSL 40
CEU 40
GBR 40
CHB 40
JPT 40
ASW 40
ACB 40
PEL 40
MXL 40
GIH 40

Jackknife blocks

The statistic functions use delete-one block jackknife standard errors. Passing block_size=5000 creates one block per 5,000 variants.

This tutorial uses 100,000 SNPs and 20 jackknife blocks to demonstrate expected behavior. The standard errors and p-values are useful for checking the workflow, but they should not be treated as strong genome-wide inference.

BLOCK_SIZE = 5_000

print(f"SNP-count blocks: {int(np.ceil(snpobj.n_snps / BLOCK_SIZE))}")
SNP-count blocks: 20

Pairwise f2 and \(F_{ST}\)

f2 measures squared allele-frequency distance. Pairwise \(F_{ST}\) rescales population differentiation; here we use the default Hudson estimator. Close population pairs should be much smaller than a continental-scale contrast such as CEU versus CHB.

Because these are sample-corrected unbiased estimates, very small values can be slightly negative. Interpret those as no detectable differentiation rather than biological negative distance.

pair_pop1 = ["YRI", "CEU", "CHB", "CEU"]
pair_pop2 = ["MSL", "GBR", "JPT", "CHB"]

f2_result = f2(
    snpobj,
    pop1=pair_pop1,
    pop2=pair_pop2,
    block_size=BLOCK_SIZE,
)

fst_result = fst(
    snpobj,
    pop1=pair_pop1,
    pop2=pair_pop2,
    method="hudson",
    block_size=BLOCK_SIZE,
)

display_stats(f2_result)
display_stats(fst_result)
  pop1 pop2 est se z p n_blocks n_snps
0 YRI MSL 0.00024 0.00011 2.16 3.09e-02 20 100000
1 CEU GBR -0.00007 0.00006 -1.16 2.44e-01 20 100000
2 CHB JPT 0.00021 0.00009 2.41 1.62e-02 20 100000
3 CEU CHB 0.00545 0.00077 7.06 1.65e-12 20 100000
  pop1 pop2 method est se z p n_blocks n_snps
0 YRI MSL hudson 0.00288 0.00105 2.75 6.04e-03 20 100000
1 CEU GBR hudson -0.00105 0.00103 -1.01 3.10e-01 20 100000
2 CHB JPT hudson 0.00380 0.00143 2.66 7.91e-03 20 100000
3 CEU CHB hudson 0.08365 0.00848 9.87 5.82e-23 20 100000

f3 statistics

f3(target; ref1, ref2) is negative when the target has allele-frequency covariance consistent with mixture between the two references. ASW and ACB are expected to show this pattern with West African and European references. A non-admixed reference such as CEU should be positive with distant references.

f3_result = f3(
    snpobj,
    target=["ASW", "ACB", "CEU", "PEL", "MXL", "GIH"],
    ref1=["YRI", "YRI", "YRI", "CEU", "CEU", "CEU"],
    ref2=["CEU", "CEU", "CHB", "CHB", "CHB", "CHB"],
    block_size=BLOCK_SIZE,
)

display_stats(f3_result)
  target ref1 ref2 est se z p n_blocks n_snps
0 ASW YRI CEU -0.00164 0.00018 -8.94 4.00e-19 20 100000
1 ACB YRI CEU -0.00103 0.00018 -5.71 1.13e-08 20 100000
2 CEU YRI CHB 0.00185 0.00052 3.55 3.92e-04 20 100000
3 PEL CEU CHB 0.00081 0.00038 2.12 3.37e-02 20 100000
4 MXL CEU CHB -0.00006 0.00029 -0.21 8.36e-01 20 100000
5 GIH CEU CHB -0.00022 0.00019 -1.19 2.35e-01 20 100000

f4 and D-statistics

f4(a, b; c, d) is an allele-frequency covariance. Swapping either side flips the sign, and values near zero mean the two population pairs are similarly related. The D-statistic uses the same numerator as f4 but normalizes it by heterozygosity terms, so its sign should agree with f4 for the same quartet.

quartets = {
    "a": ["YRI", "YRI", "YRI", "YRI", "CEU"],
    "b": ["CEU", "CEU", "CEU", "CEU", "GBR"],
    "c": ["CHB", "ASW", "ACB", "MXL", "CHB"],
    "d": ["JPT", "YRI", "YRI", "CHB", "JPT"],
}

f4_result = f4(snpobj, **quartets, block_size=BLOCK_SIZE)
d_result = d_stat(
    snpobj,
    a=quartets["a"][:4],
    b=quartets["b"][:4],
    c=quartets["c"][:4],
    d=quartets["d"][:4],
    block_size=BLOCK_SIZE,
)

display_stats(f4_result)
display_stats(d_result)
  a b c d est se z p n_blocks n_snps
0 YRI CEU CHB JPT 0.00001 0.00017 0.08 9.35e-01 20 100000
1 YRI CEU ASW YRI -0.00268 0.00025 -10.54 5.46e-26 20 100000
2 YRI CEU ACB YRI -0.00167 0.00023 -7.11 1.14e-12 20 100000
3 YRI CEU MXL CHB -0.00044 0.00036 -1.24 2.15e-01 20 100000
4 CEU GBR CHB JPT -0.00007 0.00004 -2.01 4.49e-02 20 100000
  a b c d est se z p n_blocks n_snps
0 YRI CEU CHB JPT 0.00070 0.00851 0.08 9.35e-01 20 100000
1 YRI CEU ASW YRI -0.10566 0.00925 -11.42 3.17e-30 20 100000
2 YRI CEU ACB YRI -0.06689 0.01005 -6.66 2.79e-11 20 100000
3 YRI CEU MXL CHB -0.02009 0.01617 -1.24 2.14e-01 20 100000

For algebra checks, f4(A, B; C, D) should equal -f4(B, A; C, D) up to numerical precision.

base = f4(
    snpobj,
    a=["YRI"], b=["CEU"], c=["ASW"], d=["YRI"],
    block_size=BLOCK_SIZE,
).est.iloc[0]
flipped = f4(
    snpobj,
    a=["CEU"], b=["YRI"], c=["ASW"], d=["YRI"],
    block_size=BLOCK_SIZE,
).est.iloc[0]

print(f"base:    {base:.6f}")
print(f"flipped: {flipped:.6f}")
print(f"sum:     {base + flipped:.3e}")
base:    -0.002680
flipped: 0.002680
sum:     0.000e+00

f4-ratios

An f4-ratio divides two f4 statistics and uses the same block jackknife machinery. The numerator and denominator should be chosen so that one reference endpoint makes the numerator vanish.

These ratios use imperfect population proxies and a small SNP subset, so they should be read as reference-coordinate sanity checks, not as estimates of true ancestry proportions.

ratio_result = f4_ratio(
    snpobj,
    num=[
        ("JPT", "CEU", "ASW", "YRI"),
        ("JPT", "CEU", "ACB", "YRI"),
        ("YRI", "JPT", "PEL", "CEU"),
        ("YRI", "JPT", "MXL", "CEU"),
    ],
    den=[
        ("JPT", "CEU", "CEU", "YRI"),
        ("JPT", "CEU", "CEU", "YRI"),
        ("YRI", "JPT", "CHB", "CEU"),
        ("YRI", "JPT", "CHB", "CEU"),
    ],
    block_size=BLOCK_SIZE,
)

display_stats(ratio_result)
  num den est se z p n_blocks n_snps
0 (JPT,CEU;ASW,YRI) (JPT,CEU;CEU,YRI) 0.22542 0.04592 4.91 9.16e-07 20 100000
1 (JPT,CEU;ACB,YRI) (JPT,CEU;CEU,YRI) 0.16228 0.03772 4.30 1.69e-05 20 100000
2 (YRI,JPT;PEL,CEU) (YRI,JPT;CHB,CEU) 0.32436 0.15325 2.12 3.43e-02 20 100000
3 (YRI,JPT;MXL,CEU) (YRI,JPT;CHB,CEU) 0.07644 0.11734 0.65 5.15e-01 20 100000

Reuse streamed allele frequencies

All f-statistic functions accept either a SNPObject or a tuple (allele_frequencies, counts, populations). This is useful when allele frequencies have already been computed in chunks.

af_df, count_df = allele_freq_stream(
    snpobj,
    chunk_size=25_000,
    sample_labels=snpobj.sample_fid,
    return_counts=True,
    as_dataframe=True,
)

stream_data = (af_df.to_numpy(), count_df.to_numpy(), af_df.columns.tolist())
stream_f2 = f2(
    stream_data,
    pop1=["YRI", "CEU"],
    pop2=["MSL", "CHB"],
    block_size=BLOCK_SIZE,
)

print(af_df.shape)
display_stats(stream_f2)
(100000, 11)
  pop1 pop2 est se z p n_blocks n_snps
0 YRI MSL 0.00024 0.00011 2.16 3.09e-02 20 100000
1 CEU CHB 0.00545 0.00077 7.06 1.65e-12 20 100000

Export blocked pair statistics for qpAdm, qpWave, and qpGraph

export_qp writes blocked pair-statistic files for a selected population set to be able to be used with qpAdm, qpWave, and qpGraph. The result object records the populations, statistics, block count, SNP count, and written paths.

with tempfile.TemporaryDirectory() as tmpdir:
    export_result = export_qp(
        snpobj,
        outdir=Path(tmpdir) / "blocked_stats",
        populations=["YRI", "CEU", "CHB"],
        tools=("qpGraph",),
        block_size=BLOCK_SIZE,
        overwrite=True,
    )

    print("populations:", export_result.populations)
    print("statistics:", export_result.statistics)
    print("blocks:", export_result.n_blocks)
    print("SNPs:", export_result.n_snps)
    print("files written:", len(export_result.files))
    print("first files:")
    for path in export_result.files[:5]:
        print("  ", path.relative_to(export_result.outdir))
populations: ('YRI', 'CEU', 'CHB')
statistics: ('f2',)
blocks: 20
SNPs: 100000
files written: 8
first files:
   block_lengths_f2.rds
   YRI/YRI_f2.rds
   CEU/YRI_f2.rds
   CHB/YRI_f2.rds
   CEU/CEU_f2.rds

Pseudo-haploid samples

When pseudohaploid=True, samples with no heterozygote calls in the first checked variants are treated as haploid during allele-frequency aggregation. This changes finite-sample corrections in f2 and \(F_{ST}\) because the called haplotype counts are different.

The toy example below is deliberately tiny and artificial. Negative values here only show how the finite-sample correction changes when pseudo-haploid calls are handled as haploid; they are not biological results.

pseudo_gt = np.array([
    # A1   A2   B1   B2
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 2.0, 0.0],
    [2.0, 1.0, 2.0, 2.0],
    [0.0, 0.0, 0.0, 2.0],
])

pseudo_obj = SNPObject(
    genotypes=pseudo_gt,
    samples=np.array(["A1", "A2", "B1", "B2"]),
)
pseudo_labels = np.array(["A", "A", "B", "B"])

pseudo_results = pd.concat(
    [
        f2(
            pseudo_obj,
            pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
            block_size=1, pseudohaploid=False,
        ).assign(stat="f2", pseudohaploid=False),
        f2(
            pseudo_obj,
            pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
            block_size=1, pseudohaploid=True,
        ).assign(stat="f2", pseudohaploid=True),
        fst(
            pseudo_obj,
            pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
            block_size=1, pseudohaploid=False,
        ).assign(stat="Hudson F_ST", pseudohaploid=False),
        fst(
            pseudo_obj,
            pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
            block_size=1, pseudohaploid=True,
        ).assign(stat="Hudson F_ST", pseudohaploid=True),
    ],
    ignore_index=True,
)

pseudo_summary = pseudo_results[
    ["stat", "pseudohaploid", "pop1", "pop2", "est", "se", "n_blocks", "n_snps"]
]
display_stats(pseudo_summary)
  stat pseudohaploid pop1 pop2 est se n_blocks n_snps
0 f2 False A B 0.00000 0.05893 4 4
1 f2 True A B -0.12500 0.07217 4 4
2 Hudson F_ST False A B 0.00000 0.11785 4 4
3 Hudson F_ST True A B -0.25000 0.14434 4 4