F-statistics and population statistics¶

F-statistics summarize allele-frequency covariance among populations. They are useful for checking population separation, testing whether a group is consistent with a mixture of two references, and measuring whether two population pairs share excess genetic drift.

This tutorial shows how to:

Load a 1000 Genomes Project subset with population metadata
Compute f2, f3, f4, D-statistics, f4-ratios, and pairwise \(F_{ST}\)
Use SNP-count jackknife blocks
Reuse streamed allele frequencies as input to the same statistic functions
Export blocked pair statistics for downstream model-fitting workflows
Handle pseudo-haploid samples when aggregating allele frequencies

from pathlib import Path
import os
import tempfile

import numpy as np
import pandas as pd
from IPython.display import display

from snputils.datasets import load_dataset
from snputils.snp.genobj.snpobj import SNPObject
from snputils.stats import (
    allele_freq_stream,
    d_stat,
    export_qp,
    f2,
    f3,
    f4,
    f4_ratio,
    fst,
)

STAT_FORMAT = {
    "est": "{:.5f}",
    "se": "{:.5f}",
    "z": "{:.2f}",
    "p": "{:.2e}",
}


def display_stats(df):
    formatters = {column: fmt for column, fmt in STAT_FORMAT.items() if column in df.columns}
    display(df.style.format(formatters))

Load a 1000 Genomes subset¶

The subset below keeps population labels aligned to snpobj.samples in snpobj.sample_fid. The populations are chosen to make the checks below interpretable: close within-region pairs, a more distant European/East Asian pair, two African-European admixed groups, and American/South Asian groups for ratio examples.

Set SNPUTILS_1KGP_DIR if the 1000 Genomes files are already available outside data/1kgp.

DATA_DIR = Path(os.environ.get("SNPUTILS_1KGP_DIR", "data/1kgp"))

populations = [
    "YRI", "MSL",      # West African
    "CEU", "GBR",      # European
    "CHB", "JPT",      # East Asian
    "ASW", "ACB",      # African-European admixed groups
    "PEL", "MXL",      # American admixed groups
    "GIH",              # South Asian
]

snpobj = load_dataset(
    "1kgp",
    download_genotypes=True,
    resource="phase3",
    output_dir=DATA_DIR,
    populations=populations,
    samples_per_population=40,
    max_variants=100_000,
    require_biallelic=True,
    require_complete=True,
    require_polymorphic=True,
    snv_only=True,
    sum_strands=False,
    verbose=False,
)

print(f"Loaded {snpobj.n_samples:,} samples and {snpobj.n_snps:,} variants")
print(f"Genotype array shape: {snpobj.genotypes.shape}")

Loaded 440 samples and 100,000 variants
Genotype array shape: (100000, 440, 2)

Create a compact sample table and confirm that each selected population contributes the requested number of samples.

sample_metadata = pd.DataFrame({
    "sample": snpobj.samples,
    "population": snpobj.sample_fid,
    "sex": snpobj.sample_sex,
})

pop_counts = sample_metadata["population"].value_counts().reindex(populations)
display(pop_counts.to_frame("n_samples"))

	n_samples
population
YRI	40
MSL	40
CEU	40
GBR	40
CHB	40
JPT	40
ASW	40
ACB	40
PEL	40
MXL	40
GIH	40

Jackknife blocks¶

The statistic functions use delete-one block jackknife standard errors. Passing block_size=5000 creates one block per 5,000 variants.

This tutorial uses 100,000 SNPs and 20 jackknife blocks to demonstrate expected behavior. The standard errors and p-values are useful for checking the workflow, but they should not be treated as strong genome-wide inference.

BLOCK_SIZE = 5_000

print(f"SNP-count blocks: {int(np.ceil(snpobj.n_snps / BLOCK_SIZE))}")

SNP-count blocks: 20

Pairwise f2 and \(F_{ST}\)¶

f2 measures squared allele-frequency distance. Pairwise \(F_{ST}\) rescales population differentiation; here we use the default Hudson estimator. Close population pairs should be much smaller than a continental-scale contrast such as CEU versus CHB.

Because these are sample-corrected unbiased estimates, very small values can be slightly negative. Interpret those as no detectable differentiation rather than biological negative distance.

pair_pop1 = ["YRI", "CEU", "CHB", "CEU"]
pair_pop2 = ["MSL", "GBR", "JPT", "CHB"]

f2_result = f2(
    snpobj,
    pop1=pair_pop1,
    pop2=pair_pop2,
    block_size=BLOCK_SIZE,
)

fst_result = fst(
    snpobj,
    pop1=pair_pop1,
    pop2=pair_pop2,
    method="hudson",
    block_size=BLOCK_SIZE,
)

display_stats(f2_result)
display_stats(fst_result)

	pop1	pop2	est	se	z	p	n_blocks	n_snps
0	YRI	MSL	0.00024	0.00011	2.16	3.09e-02	20	100000
1	CEU	GBR	-0.00007	0.00006	-1.16	2.44e-01	20	100000
2	CHB	JPT	0.00021	0.00009	2.41	1.62e-02	20	100000
3	CEU	CHB	0.00545	0.00077	7.06	1.65e-12	20	100000

	pop1	pop2	method	est	se	z	p	n_blocks	n_snps
0	YRI	MSL	hudson	0.00288	0.00105	2.75	6.04e-03	20	100000
1	CEU	GBR	hudson	-0.00105	0.00103	-1.01	3.10e-01	20	100000
2	CHB	JPT	hudson	0.00380	0.00143	2.66	7.91e-03	20	100000
3	CEU	CHB	hudson	0.08365	0.00848	9.87	5.82e-23	20	100000

f3 statistics¶

f3(target; ref1, ref2) is negative when the target has allele-frequency covariance consistent with mixture between the two references. ASW and ACB are expected to show this pattern with West African and European references. A non-admixed reference such as CEU should be positive with distant references.

f3_result = f3(
    snpobj,
    target=["ASW", "ACB", "CEU", "PEL", "MXL", "GIH"],
    ref1=["YRI", "YRI", "YRI", "CEU", "CEU", "CEU"],
    ref2=["CEU", "CEU", "CHB", "CHB", "CHB", "CHB"],
    block_size=BLOCK_SIZE,
)

display_stats(f3_result)

	target	ref1	ref2	est	se	z	p	n_blocks	n_snps
0	ASW	YRI	CEU	-0.00164	0.00018	-8.94	4.00e-19	20	100000
1	ACB	YRI	CEU	-0.00103	0.00018	-5.71	1.13e-08	20	100000
2	CEU	YRI	CHB	0.00185	0.00052	3.55	3.92e-04	20	100000
3	PEL	CEU	CHB	0.00081	0.00038	2.12	3.37e-02	20	100000
4	MXL	CEU	CHB	-0.00006	0.00029	-0.21	8.36e-01	20	100000
5	GIH	CEU	CHB	-0.00022	0.00019	-1.19	2.35e-01	20	100000

f4 and D-statistics¶

f4(a, b; c, d) is an allele-frequency covariance. Swapping either side flips the sign, and values near zero mean the two population pairs are similarly related. The D-statistic uses the same numerator as f4 but normalizes it by heterozygosity terms, so its sign should agree with f4 for the same quartet.

quartets = {
    "a": ["YRI", "YRI", "YRI", "YRI", "CEU"],
    "b": ["CEU", "CEU", "CEU", "CEU", "GBR"],
    "c": ["CHB", "ASW", "ACB", "MXL", "CHB"],
    "d": ["JPT", "YRI", "YRI", "CHB", "JPT"],
}

f4_result = f4(snpobj, **quartets, block_size=BLOCK_SIZE)
d_result = d_stat(
    snpobj,
    a=quartets["a"][:4],
    b=quartets["b"][:4],
    c=quartets["c"][:4],
    d=quartets["d"][:4],
    block_size=BLOCK_SIZE,
)

display_stats(f4_result)
display_stats(d_result)

	a	b	c	d	est	se	z	p	n_blocks	n_snps
0	YRI	CEU	CHB	JPT	0.00001	0.00017	0.08	9.35e-01	20	100000
1	YRI	CEU	ASW	YRI	-0.00268	0.00025	-10.54	5.46e-26	20	100000
2	YRI	CEU	ACB	YRI	-0.00167	0.00023	-7.11	1.14e-12	20	100000
3	YRI	CEU	MXL	CHB	-0.00044	0.00036	-1.24	2.15e-01	20	100000
4	CEU	GBR	CHB	JPT	-0.00007	0.00004	-2.01	4.49e-02	20	100000

	a	b	c	d	est	se	z	p	n_blocks	n_snps
0	YRI	CEU	CHB	JPT	0.00070	0.00851	0.08	9.35e-01	20	100000
1	YRI	CEU	ASW	YRI	-0.10566	0.00925	-11.42	3.17e-30	20	100000
2	YRI	CEU	ACB	YRI	-0.06689	0.01005	-6.66	2.79e-11	20	100000
3	YRI	CEU	MXL	CHB	-0.02009	0.01617	-1.24	2.14e-01	20	100000

For algebra checks, f4(A, B; C, D) should equal -f4(B, A; C, D) up to numerical precision.

base = f4(
    snpobj,
    a=["YRI"], b=["CEU"], c=["ASW"], d=["YRI"],
    block_size=BLOCK_SIZE,
).est.iloc[0]
flipped = f4(
    snpobj,
    a=["CEU"], b=["YRI"], c=["ASW"], d=["YRI"],
    block_size=BLOCK_SIZE,
).est.iloc[0]

print(f"base:    {base:.6f}")
print(f"flipped: {flipped:.6f}")
print(f"sum:     {base + flipped:.3e}")

base:    -0.002680
flipped: 0.002680
sum:     0.000e+00

f4-ratios¶

An f4-ratio divides two f4 statistics and uses the same block jackknife machinery. The numerator and denominator should be chosen so that one reference endpoint makes the numerator vanish.

These ratios use imperfect population proxies and a small SNP subset, so they should be read as reference-coordinate sanity checks, not as estimates of true ancestry proportions.

ratio_result = f4_ratio(
    snpobj,
    num=[
        ("JPT", "CEU", "ASW", "YRI"),
        ("JPT", "CEU", "ACB", "YRI"),
        ("YRI", "JPT", "PEL", "CEU"),
        ("YRI", "JPT", "MXL", "CEU"),
    ],
    den=[
        ("JPT", "CEU", "CEU", "YRI"),
        ("JPT", "CEU", "CEU", "YRI"),
        ("YRI", "JPT", "CHB", "CEU"),
        ("YRI", "JPT", "CHB", "CEU"),
    ],
    block_size=BLOCK_SIZE,
)

display_stats(ratio_result)

	num	den	est	se	z	p	n_blocks	n_snps
0	(JPT,CEU;ASW,YRI)	(JPT,CEU;CEU,YRI)	0.22542	0.04592	4.91	9.16e-07	20	100000
1	(JPT,CEU;ACB,YRI)	(JPT,CEU;CEU,YRI)	0.16228	0.03772	4.30	1.69e-05	20	100000
2	(YRI,JPT;PEL,CEU)	(YRI,JPT;CHB,CEU)	0.32436	0.15325	2.12	3.43e-02	20	100000
3	(YRI,JPT;MXL,CEU)	(YRI,JPT;CHB,CEU)	0.07644	0.11734	0.65	5.15e-01	20	100000

Reuse streamed allele frequencies¶

All f-statistic functions accept either a SNPObject or a tuple (allele_frequencies, counts, populations). This is useful when allele frequencies have already been computed in chunks.

af_df, count_df = allele_freq_stream(
    snpobj,
    chunk_size=25_000,
    sample_labels=snpobj.sample_fid,
    return_counts=True,
    as_dataframe=True,
)

stream_data = (af_df.to_numpy(), count_df.to_numpy(), af_df.columns.tolist())
stream_f2 = f2(
    stream_data,
    pop1=["YRI", "CEU"],
    pop2=["MSL", "CHB"],
    block_size=BLOCK_SIZE,
)

print(af_df.shape)
display_stats(stream_f2)

(100000, 11)

	pop1	pop2	est	se	z	p	n_blocks	n_snps
0	YRI	MSL	0.00024	0.00011	2.16	3.09e-02	20	100000
1	CEU	CHB	0.00545	0.00077	7.06	1.65e-12	20	100000

Export blocked pair statistics for qpAdm, qpWave, and qpGraph¶

export_qp writes blocked pair-statistic files for a selected population set to be able to be used with qpAdm, qpWave, and qpGraph. The result object records the populations, statistics, block count, SNP count, and written paths.

with tempfile.TemporaryDirectory() as tmpdir:
    export_result = export_qp(
        snpobj,
        outdir=Path(tmpdir) / "blocked_stats",
        populations=["YRI", "CEU", "CHB"],
        tools=("qpGraph",),
        block_size=BLOCK_SIZE,
        overwrite=True,
    )

    print("populations:", export_result.populations)
    print("statistics:", export_result.statistics)
    print("blocks:", export_result.n_blocks)
    print("SNPs:", export_result.n_snps)
    print("files written:", len(export_result.files))
    print("first files:")
    for path in export_result.files[:5]:
        print("  ", path.relative_to(export_result.outdir))

populations: ('YRI', 'CEU', 'CHB')
statistics: ('f2',)
blocks: 20
SNPs: 100000
files written: 8
first files:
   block_lengths_f2.rds
   YRI/YRI_f2.rds
   CEU/YRI_f2.rds
   CHB/YRI_f2.rds
   CEU/CEU_f2.rds

Pseudo-haploid samples¶

When pseudohaploid=True, samples with no heterozygote calls in the first checked variants are treated as haploid during allele-frequency aggregation. This changes finite-sample corrections in f2 and \(F_{ST}\) because the called haplotype counts are different.

The toy example below is deliberately tiny and artificial. Negative values here only show how the finite-sample correction changes when pseudo-haploid calls are handled as haploid; they are not biological results.

pseudo_gt = np.array([
    # A1   A2   B1   B2
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 2.0, 0.0],
    [2.0, 1.0, 2.0, 2.0],
    [0.0, 0.0, 0.0, 2.0],
])

pseudo_obj = SNPObject(
    genotypes=pseudo_gt,
    samples=np.array(["A1", "A2", "B1", "B2"]),
)
pseudo_labels = np.array(["A", "A", "B", "B"])

pseudo_results = pd.concat(
    [
        f2(
            pseudo_obj,
            pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
            block_size=1, pseudohaploid=False,
        ).assign(stat="f2", pseudohaploid=False),
        f2(
            pseudo_obj,
            pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
            block_size=1, pseudohaploid=True,
        ).assign(stat="f2", pseudohaploid=True),
        fst(
            pseudo_obj,
            pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
            block_size=1, pseudohaploid=False,
        ).assign(stat="Hudson F_ST", pseudohaploid=False),
        fst(
            pseudo_obj,
            pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
            block_size=1, pseudohaploid=True,
        ).assign(stat="Hudson F_ST", pseudohaploid=True),
    ],
    ignore_index=True,
)

pseudo_summary = pseudo_results[
    ["stat", "pseudohaploid", "pop1", "pop2", "est", "se", "n_blocks", "n_snps"]
]
display_stats(pseudo_summary)

	stat	pseudohaploid	pop1	pop2	est	se	n_blocks	n_snps
0	f2	False	A	B	0.00000	0.05893	4	4
1	f2	True	A	B	-0.12500	0.07217	4	4
2	Hudson F_ST	False	A	B	0.00000	0.11785	4	4
3	Hudson F_ST	True	A	B	-0.25000	0.14434	4	4