SNP Allele Frequency¶
This tutorial walks through allele-frequency calculations from a SNPObject: whole-cohort frequencies, grouped frequencies, ancestry-masked frequencies, pseudohaploid handling, and a streaming version for larger data.
By the end, you will know how to:
compute alternate-allele frequency using called haplotypes as the denominator
aggregate frequencies by population or any other sample grouping
compare ancestry-masked frequencies to cohort-wide frequencies
use allele-frequency tables for quick QC or exploratory analysis
validate a streaming workflow against in-memory results
The examples use a synthetic dataset so the notebook is fully self-contained.
from pathlib import Path
import numpy as np
import pandas as pd
from IPython.display import display
from snputils import allele_freq_stream
from snputils.datasets import build_synthetic_snp_dataset
from snputils.snp.genobj.snpobj import SNPObject
RESULTS_DIR = Path("results/tutorials")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
SEED = 20240521
rng = np.random.default_rng(SEED)
snp = build_synthetic_snp_dataset(
n_samples=12,
n_snps=16,
seed=SEED,
n_populations=3,
missing_rate=0.03,
phased=True,
chromosome="7",
variant_prefix="rs_af",
)
labels = snp.sample_fid.astype(str)
samples = snp.samples.astype(str)
pop_a = labels == "POP_A"
pop_b = labels == "POP_B"
pop_c = labels == "POP_C"
# Add a few hand-crafted variants so the examples show clear patterns.
snp.genotypes[1, :, :] = 0
snp.genotypes[1, pop_a, :] = 1
snp.genotypes[4, :, :] = 0
snp.genotypes[4, pop_b, :] = 1
snp.genotypes[4, pop_c, :] = -1
snp.genotypes[7, :, :] = 0
snp.genotypes[7, 0, 0] = 1
snp.genotypes[10, :, :] = 0
snp.genotypes[10, [2, 5, 8], :] = -1
snp.genotypes[10, [1, 6], :] = 1
snp.genotypes[13, :, :] = 0
snp.genotypes[13, pop_c, :] = 1
lai = rng.integers(0, 3, size=snp.genotypes.shape, dtype=np.int8)
lai[1, :, :] = 0
lai[1, pop_a, :] = 1
lai[4, :, :] = 0
lai[4, pop_b, :] = 2
lai[13, :, :] = 0
lai[13, pop_c, :] = 1
snp.calldata_lai = lai
snp.ancestry_map = {"0": "ANC0", "1": "ANC1", "2": "ANC2"}
sample_metadata = pd.DataFrame(
{
"sample": samples,
"population": labels,
"sex_code": snp.sample_sex,
}
)
variant_notes = pd.DataFrame(
{
"variant": snp.variants_id[[1, 4, 7, 10, 13]],
"why_it_is_interesting": [
"POP_A-private high-frequency alternate allele",
"POP_B-enriched allele with all POP_C calls missing",
"single alternate haplotype (rare variant)",
"segregating variant with reduced call count",
"POP_C-private high-frequency alternate allele",
],
}
)
display(sample_metadata)
display(variant_notes)
| sample | population | sex_code | |
|---|---|---|---|
| 0 | S01 | POP_A | 1 |
| 1 | S02 | POP_B | 2 |
| 2 | S03 | POP_C | 1 |
| 3 | S04 | POP_A | 2 |
| 4 | S05 | POP_B | 1 |
| 5 | S06 | POP_C | 2 |
| 6 | S07 | POP_A | 1 |
| 7 | S08 | POP_B | 2 |
| 8 | S09 | POP_C | 1 |
| 9 | S10 | POP_A | 2 |
| 10 | S11 | POP_B | 1 |
| 11 | S12 | POP_C | 2 |
| variant | why_it_is_interesting | |
|---|---|---|
| 0 | rs_af_00002 | POP_A-private high-frequency alternate allele |
| 1 | rs_af_00005 | POP_B-enriched allele with all POP_C calls mis... |
| 2 | rs_af_00008 | single alternate haplotype (rare variant) |
| 3 | rs_af_00011 | segregating variant with reduced call count |
| 4 | rs_af_00014 | POP_C-private high-frequency alternate allele |
Cohort allele frequency¶
At the cohort level, allele_freq() counts alternate alleles across all called haplotypes with respect to variants_alt. Missing calls are excluded from both the numerator and the denominator. This is not necessarily the minor-allele frequency, we derive MAF separately below with min(AF, 1 - AF).
For phased diploid genotypes, the denominator is usually 2 * n_samples, but it becomes smaller when alleles are missing. Looking at the called-allele count beside the frequency is therefore good practice.
cohort_af, cohort_called = snp.allele_freq(return_counts=True)
cohort_summary = pd.DataFrame(
{
"variant": snp.variants_id,
"chrom": snp.variants_chrom,
"pos": snp.variants_pos,
"ref": snp.variants_ref,
"alt": snp.variants_alt,
"cohort_af": cohort_af,
"called_alleles": cohort_called,
}
)
cohort_summary["call_rate"] = cohort_summary["called_alleles"] / (2 * snp.n_samples)
cohort_summary["maf"] = np.minimum(cohort_summary["cohort_af"], 1.0 - cohort_summary["cohort_af"])
cohort_summary["qc_flag"] = np.select(
[cohort_summary["call_rate"] < 0.85, cohort_summary["maf"] < 0.10],
["low_call_rate", "rare_variant"],
default="ok",
)
cohort_summary.round({"cohort_af": 3, "call_rate": 3, "maf": 3})
| variant | chrom | pos | ref | alt | cohort_af | called_alleles | call_rate | maf | qc_flag | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rs_af_00001 | 7 | 1000 | A | G | 0.875 | 24 | 1.000 | 0.125 | ok |
| 1 | rs_af_00002 | 7 | 2000 | C | T | 0.333 | 24 | 1.000 | 0.333 | ok |
| 2 | rs_af_00003 | 7 | 3000 | G | A | 0.174 | 23 | 0.958 | 0.174 | ok |
| 3 | rs_af_00004 | 7 | 4000 | T | C | 0.739 | 23 | 0.958 | 0.261 | ok |
| 4 | rs_af_00005 | 7 | 5000 | A | G | 0.500 | 16 | 0.667 | 0.500 | low_call_rate |
| 5 | rs_af_00006 | 7 | 6000 | C | T | 0.478 | 23 | 0.958 | 0.478 | ok |
| 6 | rs_af_00007 | 7 | 7000 | G | A | 0.182 | 22 | 0.917 | 0.182 | ok |
| 7 | rs_af_00008 | 7 | 8000 | T | C | 0.042 | 24 | 1.000 | 0.042 | rare_variant |
| 8 | rs_af_00009 | 7 | 9000 | A | G | 0.708 | 24 | 1.000 | 0.292 | ok |
| 9 | rs_af_00010 | 7 | 10000 | C | T | 0.348 | 23 | 0.958 | 0.348 | ok |
| 10 | rs_af_00011 | 7 | 11000 | G | A | 0.222 | 18 | 0.750 | 0.222 | low_call_rate |
| 11 | rs_af_00012 | 7 | 12000 | T | C | 0.818 | 22 | 0.917 | 0.182 | ok |
| 12 | rs_af_00013 | 7 | 13000 | A | G | 0.208 | 24 | 1.000 | 0.208 | ok |
| 13 | rs_af_00014 | 7 | 14000 | C | T | 0.333 | 24 | 1.000 | 0.333 | ok |
| 14 | rs_af_00015 | 7 | 15000 | G | A | 0.250 | 24 | 1.000 | 0.250 | ok |
| 15 | rs_af_00016 | 7 | 16000 | T | C | 0.167 | 24 | 1.000 | 0.167 | ok |
The maf and call_rate columns are simple but useful downstream summaries you can derive immediately from the frequency and count outputs.
This is often the first pass for QC: variants with low call rate or extremely small minor-allele frequency deserve a closer look before you use them.
cohort_summary.loc[cohort_summary["qc_flag"] != "ok", [
"variant",
"cohort_af",
"called_alleles",
"call_rate",
"maf",
"qc_flag",
]].round({"cohort_af": 3, "call_rate": 3, "maf": 3})
| variant | cohort_af | called_alleles | call_rate | maf | qc_flag | |
|---|---|---|---|---|---|---|
| 4 | rs_af_00005 | 0.500 | 16 | 0.667 | 0.500 | low_call_rate |
| 7 | rs_af_00008 | 0.042 | 24 | 1.000 | 0.042 | rare_variant |
| 10 | rs_af_00011 | 0.222 | 18 | 0.750 | 0.222 | low_call_rate |
variant_idx = 4
gt = snp.genotypes[variant_idx].astype(float)
gt[gt < 0] = np.nan
manual_alt_count = np.nansum(gt)
manual_called_alleles = np.sum(~np.isnan(gt))
manual_af = manual_alt_count / manual_called_alleles
pd.DataFrame(
{
"variant": [snp.variants_id[variant_idx]],
"manual_alt_count": [manual_alt_count],
"manual_called_alleles": [manual_called_alleles],
"manual_af": [manual_af],
"allele_freq_output": [cohort_af[variant_idx]],
"called_output": [cohort_called[variant_idx]],
}
).round(3)
| variant | manual_alt_count | manual_called_alleles | manual_af | allele_freq_output | called_output | |
|---|---|---|---|---|---|---|
| 0 | rs_af_00005 | 8.0 | 16 | 0.5 | 0.5 | 16 |
This check is useful because allele frequency is not just a mean over samples. For phased genotypes, the denominator is the number of called haplotypes.
Grouped allele frequency¶
Pass sample_labels to estimate one frequency column per group. The group labels can be populations, sequencing centers, phenotype strata, batches, or any other sample-level annotation aligned with snp.samples.
A useful trick is to rank variants by between-group spread. That quickly surfaces alleles that are nearly private to one group or that look suspiciously batch-specific.
group_af, group_called = snp.allele_freq(
sample_labels=labels,
return_counts=True,
as_dataframe=True,
)
group_summary = pd.concat(
[
cohort_summary[["variant", "cohort_af", "called_alleles"]].rename(
columns={"called_alleles": "cohort_called_alleles"}
),
group_af.add_prefix("af_"),
group_called.add_prefix("called_"),
],
axis=1,
)
freq_cols = [col for col in group_summary.columns if col.startswith("af_")]
group_summary["spread"] = group_summary[freq_cols].max(axis=1) - group_summary[freq_cols].min(axis=1)
group_summary["most_enriched_group"] = group_summary[freq_cols].idxmax(axis=1).str.removeprefix("af_")
group_summary.sort_values(["spread", "variant"], ascending=[False, True]).head(8).round(3)
| variant | cohort_af | cohort_called_alleles | af_POP_A | af_POP_B | af_POP_C | called_POP_A | called_POP_B | called_POP_C | spread | most_enriched_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rs_af_00002 | 0.333 | 24 | 1.000 | 0.00 | 0.000 | 8 | 8 | 8 | 1.000 | POP_A |
| 4 | rs_af_00005 | 0.500 | 16 | 0.000 | 1.00 | NaN | 8 | 8 | 0 | 1.000 | POP_B |
| 13 | rs_af_00014 | 0.333 | 24 | 0.000 | 0.00 | 1.000 | 8 | 8 | 8 | 1.000 | POP_C |
| 3 | rs_af_00004 | 0.739 | 23 | 0.286 | 1.00 | 0.875 | 7 | 8 | 8 | 0.714 | POP_B |
| 8 | rs_af_00009 | 0.708 | 24 | 0.500 | 1.00 | 0.625 | 8 | 8 | 8 | 0.500 | POP_B |
| 0 | rs_af_00001 | 0.875 | 24 | 1.000 | 1.00 | 0.625 | 8 | 8 | 8 | 0.375 | POP_A |
| 12 | rs_af_00013 | 0.208 | 24 | 0.375 | 0.25 | 0.000 | 8 | 8 | 8 | 0.375 | POP_A |
| 14 | rs_af_00015 | 0.250 | 24 | 0.375 | 0.00 | 0.375 | 8 | 8 | 8 | 0.375 | POP_A |
Two details matter when interpreting grouped results:
A large frequency difference is only convincing if the corresponding
called_*counts are also healthy.Missingness can be group-specific, so comparing only frequencies without counts can be misleading.
interesting_grouped = group_summary.loc[group_summary["spread"] >= 0.50, [
"variant",
"cohort_af",
"af_POP_A",
"af_POP_B",
"af_POP_C",
"called_POP_A",
"called_POP_B",
"called_POP_C",
"spread",
"most_enriched_group",
],
].sort_values(["spread", "variant"], ascending=[False, True])
interesting_grouped.round(3)
| variant | cohort_af | af_POP_A | af_POP_B | af_POP_C | called_POP_A | called_POP_B | called_POP_C | spread | most_enriched_group | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rs_af_00002 | 0.333 | 1.000 | 0.0 | 0.000 | 8 | 8 | 8 | 1.000 | POP_A |
| 4 | rs_af_00005 | 0.500 | 0.000 | 1.0 | NaN | 8 | 8 | 0 | 1.000 | POP_B |
| 13 | rs_af_00014 | 0.333 | 0.000 | 0.0 | 1.000 | 8 | 8 | 8 | 1.000 | POP_C |
| 3 | rs_af_00004 | 0.739 | 0.286 | 1.0 | 0.875 | 7 | 8 | 8 | 0.714 | POP_B |
| 8 | rs_af_00009 | 0.708 | 0.500 | 1.0 | 0.625 | 8 | 8 | 8 | 0.500 | POP_B |
Ancestry-masked allele frequency¶
If SNP-level local ancestry is available, passing ancestry= restricts the calculation to haplotypes assigned to that ancestry.
This is different from grouping samples by population. Population labels split individuals into bins; ancestry masking splits haplotypes within individuals. In admixed cohorts, the two summaries answer different biological questions.
ancestry_tables = []
for anc_code, anc_name in snp.ancestry_map.items():
anc_af, anc_called = snp.allele_freq(ancestry=int(anc_code), return_counts=True)
ancestry_tables.append(
pd.DataFrame(
{
"variant": snp.variants_id,
f"{anc_name}_af": anc_af,
f"{anc_name}_called": anc_called,
}
)
)
ancestry_summary = cohort_summary[["variant", "cohort_af", "called_alleles"]].copy()
for table in ancestry_tables:
ancestry_summary = ancestry_summary.merge(table, on="variant")
anc_freq_cols = [col for col in ancestry_summary.columns if col.endswith("_af") and col != "cohort_af"]
ancestry_summary["max_ancestry_shift"] = ancestry_summary[anc_freq_cols].sub(
ancestry_summary["cohort_af"], axis=0
).abs().max(axis=1)
ancestry_summary.sort_values("max_ancestry_shift", ascending=False).head(8).round(3)
| variant | cohort_af | called_alleles | ANC0_af | ANC0_called | ANC1_af | ANC1_called | ANC2_af | ANC2_called | max_ancestry_shift | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rs_af_00002 | 0.333 | 24 | 0.000 | 16 | 1.000 | 8 | NaN | 0 | 0.667 |
| 13 | rs_af_00014 | 0.333 | 24 | 0.000 | 16 | 1.000 | 8 | NaN | 0 | 0.667 |
| 4 | rs_af_00005 | 0.500 | 16 | 0.000 | 8 | NaN | 0 | 1.00 | 8 | 0.500 |
| 8 | rs_af_00009 | 0.708 | 24 | 0.667 | 12 | 0.571 | 7 | 1.00 | 5 | 0.292 |
| 14 | rs_af_00015 | 0.250 | 24 | 0.000 | 5 | 0.444 | 9 | 0.20 | 10 | 0.250 |
| 2 | rs_af_00003 | 0.174 | 23 | 0.400 | 5 | 0.000 | 10 | 0.25 | 8 | 0.226 |
| 10 | rs_af_00011 | 0.222 | 18 | 0.000 | 2 | 0.250 | 8 | 0.25 | 8 | 0.222 |
| 6 | rs_af_00007 | 0.182 | 22 | 0.167 | 12 | 0.400 | 5 | 0.00 | 5 | 0.218 |
The ancestry-specific called-allele counts are often much smaller than the cohort-wide counts because each ancestry mask keeps only a subset of haplotypes. That count reduction is expected, not an error.
When a variant looks extreme in one ancestry but is based on very few called haplotypes, treat it as a clue rather than a conclusion.
ancestry_summary.loc[ancestry_summary["max_ancestry_shift"] >= 0.35, [
"variant",
"cohort_af",
"ANC0_af",
"ANC1_af",
"ANC2_af",
"ANC0_called",
"ANC1_called",
"ANC2_called",
"max_ancestry_shift",
]].round(3)
| variant | cohort_af | ANC0_af | ANC1_af | ANC2_af | ANC0_called | ANC1_called | ANC2_called | max_ancestry_shift | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | rs_af_00002 | 0.333 | 0.0 | 1.0 | NaN | 16 | 8 | 0 | 0.667 |
| 4 | rs_af_00005 | 0.500 | 0.0 | NaN | 1.0 | 8 | 0 | 8 | 0.500 |
| 13 | rs_af_00014 | 0.333 | 0.0 | 1.0 | NaN | 16 | 8 | 0 | 0.667 |
Pseudo-haploid example¶
Ancient-DNA and low-coverage workflows often use pseudohaploid calls, where each sample contributes one observed allele at a site even if the genotype is encoded as 0 or 2. So the denominator should increase by 1 rather than 2 for those samples.
allele_freq() can detect samples without heterozygotes and down-weight their denominator with pseudohaploid=True.
pseudo_gt = np.array(
[
[1.0, 0.0],
[0.0, 2.0],
[1.0, 0.0],
[2.0, 2.0],
]
)
pseudo = SNPObject(
genotypes=pseudo_gt,
samples=np.array(["diploid_sample", "pseudo_sample"], dtype=object),
variants_id=np.array([f"pseudo_rs_{i}" for i in range(1, 5)], dtype=object),
)
freq_diploid, counts_diploid = pseudo.allele_freq(pseudohaploid=False, return_counts=True)
freq_pseudo, counts_pseudo = pseudo.allele_freq(pseudohaploid=True, return_counts=True)
pd.DataFrame(
{
"variant": pseudo.variants_id,
"freq_default": freq_diploid,
"called_default": counts_diploid,
"freq_pseudohaploid": freq_pseudo,
"called_pseudohaploid": counts_pseudo,
}
).round(3)
| variant | freq_default | called_default | freq_pseudohaploid | called_pseudohaploid | |
|---|---|---|---|---|---|
| 0 | pseudo_rs_1 | 0.25 | 4 | 0.333 | 3 |
| 1 | pseudo_rs_2 | 0.50 | 4 | 0.333 | 3 |
| 2 | pseudo_rs_3 | 0.25 | 4 | 0.333 | 3 |
| 3 | pseudo_rs_4 | 1.00 | 4 | 1.000 | 3 |
Streaming allele frequency¶
allele_freq_stream() processes the data in variant chunks and returns the same result as eager in-memory computation. The streaming API is useful when your source data comes from a reader or when the full genotype matrix is too large to materialize comfortably.
Here we validate the streaming path against the in-memory path on the same grouped, ancestry-masked query.
stream_af, stream_called = allele_freq_stream(
snp,
chunk_size=5,
sample_labels=labels,
ancestry=1,
return_counts=True,
as_dataframe=True,
)
eager_af, eager_called = snp.allele_freq(
sample_labels=labels,
ancestry=1,
return_counts=True,
as_dataframe=True,
)
comparison = pd.DataFrame(
{
"freq_match": [np.allclose(stream_af.to_numpy(), eager_af.to_numpy(), equal_nan=True)],
"count_match": [stream_called.equals(eager_called)],
"n_variants": [snp.n_snps],
"chunk_size": [5],
}
)
display(comparison)
stream_af.head().round(3)
| freq_match | count_match | n_variants | chunk_size | |
|---|---|---|---|---|
| 0 | True | True | 16 | 5 |
| POP_A | POP_B | POP_C | |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 1.0 |
| 4 | NaN | NaN | NaN |
Takeaways¶
Always inspect frequency estimates together with called-allele counts.
Grouped allele frequencies are useful for detecting differentiated or suspicious variants.
Ancestry-masked allele frequencies answer a different question than population-grouped frequencies.
Use
pseudohaploid=Truewhen pseudohaploid 0/2 calls are mixed with ordinary diploid genotypes.allele_freq_stream()is the scalable path, and it should agree with eager results on the same input.