F-statistics and population statistics¶
F-statistics summarize allele-frequency covariance among populations. They are useful for checking population separation, testing whether a group is consistent with a mixture of two references, and measuring whether two population pairs share excess genetic drift.
This tutorial shows how to:
Load a 1000 Genomes Project subset with population metadata
Compute
f2,f3,f4, D-statistics, f4-ratios, and pairwise \(F_{ST}\)Use SNP-count jackknife blocks
Reuse streamed allele frequencies as input to the same statistic functions
Export blocked pair statistics for downstream model-fitting workflows
Handle pseudo-haploid samples when aggregating allele frequencies
from pathlib import Path
import os
import tempfile
import numpy as np
import pandas as pd
from IPython.display import display
from snputils.datasets import load_dataset
from snputils.snp.genobj.snpobj import SNPObject
from snputils.stats import (
allele_freq_stream,
d_stat,
export_qp,
f2,
f3,
f4,
f4_ratio,
fst,
)
STAT_FORMAT = {
"est": "{:.5f}",
"se": "{:.5f}",
"z": "{:.2f}",
"p": "{:.2e}",
}
def display_stats(df):
formatters = {column: fmt for column, fmt in STAT_FORMAT.items() if column in df.columns}
display(df.style.format(formatters))
Load a 1000 Genomes subset¶
The subset below keeps population labels aligned to snpobj.samples in snpobj.sample_fid. The populations are chosen to make the checks below interpretable: close within-region pairs, a more distant European/East Asian pair, two African-European admixed groups, and American/South Asian groups for ratio examples.
Set SNPUTILS_1KGP_DIR if the 1000 Genomes files are already available outside data/1kgp.
DATA_DIR = Path(os.environ.get("SNPUTILS_1KGP_DIR", "data/1kgp"))
populations = [
"YRI", "MSL", # West African
"CEU", "GBR", # European
"CHB", "JPT", # East Asian
"ASW", "ACB", # African-European admixed groups
"PEL", "MXL", # American admixed groups
"GIH", # South Asian
]
snpobj = load_dataset(
"1kgp",
download_genotypes=True,
resource="phase3",
output_dir=DATA_DIR,
populations=populations,
samples_per_population=40,
max_variants=100_000,
require_biallelic=True,
require_complete=True,
require_polymorphic=True,
snv_only=True,
sum_strands=False,
verbose=False,
)
print(f"Loaded {snpobj.n_samples:,} samples and {snpobj.n_snps:,} variants")
print(f"Genotype array shape: {snpobj.genotypes.shape}")
Loaded 440 samples and 100,000 variants
Genotype array shape: (100000, 440, 2)
Create a compact sample table and confirm that each selected population contributes the requested number of samples.
sample_metadata = pd.DataFrame({
"sample": snpobj.samples,
"population": snpobj.sample_fid,
"sex": snpobj.sample_sex,
})
pop_counts = sample_metadata["population"].value_counts().reindex(populations)
display(pop_counts.to_frame("n_samples"))
| n_samples | |
|---|---|
| population | |
| YRI | 40 |
| MSL | 40 |
| CEU | 40 |
| GBR | 40 |
| CHB | 40 |
| JPT | 40 |
| ASW | 40 |
| ACB | 40 |
| PEL | 40 |
| MXL | 40 |
| GIH | 40 |
Jackknife blocks¶
The statistic functions use delete-one block jackknife standard errors. Passing block_size=5000 creates one block per 5,000 variants.
This tutorial uses 100,000 SNPs and 20 jackknife blocks to demonstrate expected behavior. The standard errors and p-values are useful for checking the workflow, but they should not be treated as strong genome-wide inference.
BLOCK_SIZE = 5_000
print(f"SNP-count blocks: {int(np.ceil(snpobj.n_snps / BLOCK_SIZE))}")
SNP-count blocks: 20
Pairwise f2 and \(F_{ST}\)¶
f2 measures squared allele-frequency distance. Pairwise \(F_{ST}\) rescales population differentiation; here we use the default Hudson estimator. Close population pairs should be much smaller than a continental-scale contrast such as CEU versus CHB.
Because these are sample-corrected unbiased estimates, very small values can be slightly negative. Interpret those as no detectable differentiation rather than biological negative distance.
pair_pop1 = ["YRI", "CEU", "CHB", "CEU"]
pair_pop2 = ["MSL", "GBR", "JPT", "CHB"]
f2_result = f2(
snpobj,
pop1=pair_pop1,
pop2=pair_pop2,
block_size=BLOCK_SIZE,
)
fst_result = fst(
snpobj,
pop1=pair_pop1,
pop2=pair_pop2,
method="hudson",
block_size=BLOCK_SIZE,
)
display_stats(f2_result)
display_stats(fst_result)
| pop1 | pop2 | est | se | z | p | n_blocks | n_snps | |
|---|---|---|---|---|---|---|---|---|
| 0 | YRI | MSL | 0.00024 | 0.00011 | 2.16 | 3.09e-02 | 20 | 100000 |
| 1 | CEU | GBR | -0.00007 | 0.00006 | -1.16 | 2.44e-01 | 20 | 100000 |
| 2 | CHB | JPT | 0.00021 | 0.00009 | 2.41 | 1.62e-02 | 20 | 100000 |
| 3 | CEU | CHB | 0.00545 | 0.00077 | 7.06 | 1.65e-12 | 20 | 100000 |
| pop1 | pop2 | method | est | se | z | p | n_blocks | n_snps | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | YRI | MSL | hudson | 0.00288 | 0.00105 | 2.75 | 6.04e-03 | 20 | 100000 |
| 1 | CEU | GBR | hudson | -0.00105 | 0.00103 | -1.01 | 3.10e-01 | 20 | 100000 |
| 2 | CHB | JPT | hudson | 0.00380 | 0.00143 | 2.66 | 7.91e-03 | 20 | 100000 |
| 3 | CEU | CHB | hudson | 0.08365 | 0.00848 | 9.87 | 5.82e-23 | 20 | 100000 |
f3 statistics¶
f3(target; ref1, ref2) is negative when the target has allele-frequency covariance consistent with mixture between the two references. ASW and ACB are expected to show this pattern with West African and European references. A non-admixed reference such as CEU should be positive with distant references.
f3_result = f3(
snpobj,
target=["ASW", "ACB", "CEU", "PEL", "MXL", "GIH"],
ref1=["YRI", "YRI", "YRI", "CEU", "CEU", "CEU"],
ref2=["CEU", "CEU", "CHB", "CHB", "CHB", "CHB"],
block_size=BLOCK_SIZE,
)
display_stats(f3_result)
| target | ref1 | ref2 | est | se | z | p | n_blocks | n_snps | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ASW | YRI | CEU | -0.00164 | 0.00018 | -8.94 | 4.00e-19 | 20 | 100000 |
| 1 | ACB | YRI | CEU | -0.00103 | 0.00018 | -5.71 | 1.13e-08 | 20 | 100000 |
| 2 | CEU | YRI | CHB | 0.00185 | 0.00052 | 3.55 | 3.92e-04 | 20 | 100000 |
| 3 | PEL | CEU | CHB | 0.00081 | 0.00038 | 2.12 | 3.37e-02 | 20 | 100000 |
| 4 | MXL | CEU | CHB | -0.00006 | 0.00029 | -0.21 | 8.36e-01 | 20 | 100000 |
| 5 | GIH | CEU | CHB | -0.00022 | 0.00019 | -1.19 | 2.35e-01 | 20 | 100000 |
f4 and D-statistics¶
f4(a, b; c, d) is an allele-frequency covariance. Swapping either side flips the sign, and values near zero mean the two population pairs are similarly related. The D-statistic uses the same numerator as f4 but normalizes it by heterozygosity terms, so its sign should agree with f4 for the same quartet.
quartets = {
"a": ["YRI", "YRI", "YRI", "YRI", "CEU"],
"b": ["CEU", "CEU", "CEU", "CEU", "GBR"],
"c": ["CHB", "ASW", "ACB", "MXL", "CHB"],
"d": ["JPT", "YRI", "YRI", "CHB", "JPT"],
}
f4_result = f4(snpobj, **quartets, block_size=BLOCK_SIZE)
d_result = d_stat(
snpobj,
a=quartets["a"][:4],
b=quartets["b"][:4],
c=quartets["c"][:4],
d=quartets["d"][:4],
block_size=BLOCK_SIZE,
)
display_stats(f4_result)
display_stats(d_result)
| a | b | c | d | est | se | z | p | n_blocks | n_snps | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | YRI | CEU | CHB | JPT | 0.00001 | 0.00017 | 0.08 | 9.35e-01 | 20 | 100000 |
| 1 | YRI | CEU | ASW | YRI | -0.00268 | 0.00025 | -10.54 | 5.46e-26 | 20 | 100000 |
| 2 | YRI | CEU | ACB | YRI | -0.00167 | 0.00023 | -7.11 | 1.14e-12 | 20 | 100000 |
| 3 | YRI | CEU | MXL | CHB | -0.00044 | 0.00036 | -1.24 | 2.15e-01 | 20 | 100000 |
| 4 | CEU | GBR | CHB | JPT | -0.00007 | 0.00004 | -2.01 | 4.49e-02 | 20 | 100000 |
| a | b | c | d | est | se | z | p | n_blocks | n_snps | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | YRI | CEU | CHB | JPT | 0.00070 | 0.00851 | 0.08 | 9.35e-01 | 20 | 100000 |
| 1 | YRI | CEU | ASW | YRI | -0.10566 | 0.00925 | -11.42 | 3.17e-30 | 20 | 100000 |
| 2 | YRI | CEU | ACB | YRI | -0.06689 | 0.01005 | -6.66 | 2.79e-11 | 20 | 100000 |
| 3 | YRI | CEU | MXL | CHB | -0.02009 | 0.01617 | -1.24 | 2.14e-01 | 20 | 100000 |
For algebra checks, f4(A, B; C, D) should equal -f4(B, A; C, D) up to numerical precision.
base = f4(
snpobj,
a=["YRI"], b=["CEU"], c=["ASW"], d=["YRI"],
block_size=BLOCK_SIZE,
).est.iloc[0]
flipped = f4(
snpobj,
a=["CEU"], b=["YRI"], c=["ASW"], d=["YRI"],
block_size=BLOCK_SIZE,
).est.iloc[0]
print(f"base: {base:.6f}")
print(f"flipped: {flipped:.6f}")
print(f"sum: {base + flipped:.3e}")
base: -0.002680
flipped: 0.002680
sum: 0.000e+00
f4-ratios¶
An f4-ratio divides two f4 statistics and uses the same block jackknife machinery. The numerator and denominator should be chosen so that one reference endpoint makes the numerator vanish.
These ratios use imperfect population proxies and a small SNP subset, so they should be read as reference-coordinate sanity checks, not as estimates of true ancestry proportions.
ratio_result = f4_ratio(
snpobj,
num=[
("JPT", "CEU", "ASW", "YRI"),
("JPT", "CEU", "ACB", "YRI"),
("YRI", "JPT", "PEL", "CEU"),
("YRI", "JPT", "MXL", "CEU"),
],
den=[
("JPT", "CEU", "CEU", "YRI"),
("JPT", "CEU", "CEU", "YRI"),
("YRI", "JPT", "CHB", "CEU"),
("YRI", "JPT", "CHB", "CEU"),
],
block_size=BLOCK_SIZE,
)
display_stats(ratio_result)
| num | den | est | se | z | p | n_blocks | n_snps | |
|---|---|---|---|---|---|---|---|---|
| 0 | (JPT,CEU;ASW,YRI) | (JPT,CEU;CEU,YRI) | 0.22542 | 0.04592 | 4.91 | 9.16e-07 | 20 | 100000 |
| 1 | (JPT,CEU;ACB,YRI) | (JPT,CEU;CEU,YRI) | 0.16228 | 0.03772 | 4.30 | 1.69e-05 | 20 | 100000 |
| 2 | (YRI,JPT;PEL,CEU) | (YRI,JPT;CHB,CEU) | 0.32436 | 0.15325 | 2.12 | 3.43e-02 | 20 | 100000 |
| 3 | (YRI,JPT;MXL,CEU) | (YRI,JPT;CHB,CEU) | 0.07644 | 0.11734 | 0.65 | 5.15e-01 | 20 | 100000 |
Reuse streamed allele frequencies¶
All f-statistic functions accept either a SNPObject or a tuple (allele_frequencies, counts, populations). This is useful when allele frequencies have already been computed in chunks.
af_df, count_df = allele_freq_stream(
snpobj,
chunk_size=25_000,
sample_labels=snpobj.sample_fid,
return_counts=True,
as_dataframe=True,
)
stream_data = (af_df.to_numpy(), count_df.to_numpy(), af_df.columns.tolist())
stream_f2 = f2(
stream_data,
pop1=["YRI", "CEU"],
pop2=["MSL", "CHB"],
block_size=BLOCK_SIZE,
)
print(af_df.shape)
display_stats(stream_f2)
(100000, 11)
| pop1 | pop2 | est | se | z | p | n_blocks | n_snps | |
|---|---|---|---|---|---|---|---|---|
| 0 | YRI | MSL | 0.00024 | 0.00011 | 2.16 | 3.09e-02 | 20 | 100000 |
| 1 | CEU | CHB | 0.00545 | 0.00077 | 7.06 | 1.65e-12 | 20 | 100000 |
Export blocked pair statistics for qpAdm, qpWave, and qpGraph¶
export_qp writes blocked pair-statistic files for a selected population set to be able to be used with qpAdm, qpWave, and qpGraph. The result object records the populations, statistics, block count, SNP count, and written paths.
with tempfile.TemporaryDirectory() as tmpdir:
export_result = export_qp(
snpobj,
outdir=Path(tmpdir) / "blocked_stats",
populations=["YRI", "CEU", "CHB"],
tools=("qpGraph",),
block_size=BLOCK_SIZE,
overwrite=True,
)
print("populations:", export_result.populations)
print("statistics:", export_result.statistics)
print("blocks:", export_result.n_blocks)
print("SNPs:", export_result.n_snps)
print("files written:", len(export_result.files))
print("first files:")
for path in export_result.files[:5]:
print(" ", path.relative_to(export_result.outdir))
populations: ('YRI', 'CEU', 'CHB')
statistics: ('f2',)
blocks: 20
SNPs: 100000
files written: 8
first files:
block_lengths_f2.rds
YRI/YRI_f2.rds
CEU/YRI_f2.rds
CHB/YRI_f2.rds
CEU/CEU_f2.rds
Pseudo-haploid samples¶
When pseudohaploid=True, samples with no heterozygote calls in the first checked variants are treated as haploid during allele-frequency aggregation. This changes finite-sample corrections in f2 and \(F_{ST}\) because the called haplotype counts are different.
The toy example below is deliberately tiny and artificial. Negative values here only show how the finite-sample correction changes when pseudo-haploid calls are handled as haploid; they are not biological results.
pseudo_gt = np.array([
# A1 A2 B1 B2
[1.0, 0.0, 0.0, 2.0],
[0.0, 1.0, 2.0, 0.0],
[2.0, 1.0, 2.0, 2.0],
[0.0, 0.0, 0.0, 2.0],
])
pseudo_obj = SNPObject(
genotypes=pseudo_gt,
samples=np.array(["A1", "A2", "B1", "B2"]),
)
pseudo_labels = np.array(["A", "A", "B", "B"])
pseudo_results = pd.concat(
[
f2(
pseudo_obj,
pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
block_size=1, pseudohaploid=False,
).assign(stat="f2", pseudohaploid=False),
f2(
pseudo_obj,
pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
block_size=1, pseudohaploid=True,
).assign(stat="f2", pseudohaploid=True),
fst(
pseudo_obj,
pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
block_size=1, pseudohaploid=False,
).assign(stat="Hudson F_ST", pseudohaploid=False),
fst(
pseudo_obj,
pop1=["A"], pop2=["B"], sample_labels=pseudo_labels,
block_size=1, pseudohaploid=True,
).assign(stat="Hudson F_ST", pseudohaploid=True),
],
ignore_index=True,
)
pseudo_summary = pseudo_results[
["stat", "pseudohaploid", "pop1", "pop2", "est", "se", "n_blocks", "n_snps"]
]
display_stats(pseudo_summary)
| stat | pseudohaploid | pop1 | pop2 | est | se | n_blocks | n_snps | |
|---|---|---|---|---|---|---|---|---|
| 0 | f2 | False | A | B | 0.00000 | 0.05893 | 4 | 4 |
| 1 | f2 | True | A | B | -0.12500 | 0.07217 | 4 | 4 |
| 2 | Hudson F_ST | False | A | B | 0.00000 | 0.11785 | 4 | 4 |
| 3 | Hudson F_ST | True | A | B | -0.25000 | 0.14434 | 4 | 4 |