SNP Allele Frequency

This tutorial walks through allele-frequency calculations from a SNPObject: whole-cohort frequencies, grouped frequencies, ancestry-masked frequencies, pseudohaploid handling, and a streaming version for larger data.

By the end, you will know how to:

  • compute alternate-allele frequency using called haplotypes as the denominator

  • aggregate frequencies by population or any other sample grouping

  • compare ancestry-masked frequencies to cohort-wide frequencies

  • use allele-frequency tables for quick QC or exploratory analysis

  • validate a streaming workflow against in-memory results

The examples use a synthetic dataset so the notebook is fully self-contained.

from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display

from snputils import allele_freq_stream
from snputils.datasets import build_synthetic_snp_dataset
from snputils.snp.genobj.snpobj import SNPObject

RESULTS_DIR = Path("results/tutorials")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
SEED = 20240521
rng = np.random.default_rng(SEED)

snp = build_synthetic_snp_dataset(
    n_samples=12,
    n_snps=16,
    seed=SEED,
    n_populations=3,
    missing_rate=0.03,
    phased=True,
    chromosome="7",
    variant_prefix="rs_af",
)

labels = snp.sample_fid.astype(str)
samples = snp.samples.astype(str)
pop_a = labels == "POP_A"
pop_b = labels == "POP_B"
pop_c = labels == "POP_C"

# Add a few hand-crafted variants so the examples show clear patterns.
snp.genotypes[1, :, :] = 0
snp.genotypes[1, pop_a, :] = 1

snp.genotypes[4, :, :] = 0
snp.genotypes[4, pop_b, :] = 1
snp.genotypes[4, pop_c, :] = -1

snp.genotypes[7, :, :] = 0
snp.genotypes[7, 0, 0] = 1

snp.genotypes[10, :, :] = 0
snp.genotypes[10, [2, 5, 8], :] = -1
snp.genotypes[10, [1, 6], :] = 1

snp.genotypes[13, :, :] = 0
snp.genotypes[13, pop_c, :] = 1

lai = rng.integers(0, 3, size=snp.genotypes.shape, dtype=np.int8)
lai[1, :, :] = 0
lai[1, pop_a, :] = 1
lai[4, :, :] = 0
lai[4, pop_b, :] = 2
lai[13, :, :] = 0
lai[13, pop_c, :] = 1
snp.calldata_lai = lai
snp.ancestry_map = {"0": "ANC0", "1": "ANC1", "2": "ANC2"}

sample_metadata = pd.DataFrame(
    {
        "sample": samples,
        "population": labels,
        "sex_code": snp.sample_sex,
    }
)
variant_notes = pd.DataFrame(
    {
        "variant": snp.variants_id[[1, 4, 7, 10, 13]],
        "why_it_is_interesting": [
            "POP_A-private high-frequency alternate allele",
            "POP_B-enriched allele with all POP_C calls missing",
            "single alternate haplotype (rare variant)",
            "segregating variant with reduced call count",
            "POP_C-private high-frequency alternate allele",
        ],
    }
)

display(sample_metadata)
display(variant_notes)
sample population sex_code
0 S01 POP_A 1
1 S02 POP_B 2
2 S03 POP_C 1
3 S04 POP_A 2
4 S05 POP_B 1
5 S06 POP_C 2
6 S07 POP_A 1
7 S08 POP_B 2
8 S09 POP_C 1
9 S10 POP_A 2
10 S11 POP_B 1
11 S12 POP_C 2
variant why_it_is_interesting
0 rs_af_00002 POP_A-private high-frequency alternate allele
1 rs_af_00005 POP_B-enriched allele with all POP_C calls mis...
2 rs_af_00008 single alternate haplotype (rare variant)
3 rs_af_00011 segregating variant with reduced call count
4 rs_af_00014 POP_C-private high-frequency alternate allele

Cohort allele frequency

At the cohort level, allele_freq() counts alternate alleles across all called haplotypes with respect to variants_alt. Missing calls are excluded from both the numerator and the denominator. This is not necessarily the minor-allele frequency, we derive MAF separately below with min(AF, 1 - AF).

For phased diploid genotypes, the denominator is usually 2 * n_samples, but it becomes smaller when alleles are missing. Looking at the called-allele count beside the frequency is therefore good practice.

cohort_af, cohort_called = snp.allele_freq(return_counts=True)

cohort_summary = pd.DataFrame(
    {
        "variant": snp.variants_id,
        "chrom": snp.variants_chrom,
        "pos": snp.variants_pos,
        "ref": snp.variants_ref,
        "alt": snp.variants_alt,
        "cohort_af": cohort_af,
        "called_alleles": cohort_called,
    }
)
cohort_summary["call_rate"] = cohort_summary["called_alleles"] / (2 * snp.n_samples)
cohort_summary["maf"] = np.minimum(cohort_summary["cohort_af"], 1.0 - cohort_summary["cohort_af"])
cohort_summary["qc_flag"] = np.select(
    [cohort_summary["call_rate"] < 0.85, cohort_summary["maf"] < 0.10],
    ["low_call_rate", "rare_variant"],
    default="ok",
)

cohort_summary.round({"cohort_af": 3, "call_rate": 3, "maf": 3})
variant chrom pos ref alt cohort_af called_alleles call_rate maf qc_flag
0 rs_af_00001 7 1000 A G 0.875 24 1.000 0.125 ok
1 rs_af_00002 7 2000 C T 0.333 24 1.000 0.333 ok
2 rs_af_00003 7 3000 G A 0.174 23 0.958 0.174 ok
3 rs_af_00004 7 4000 T C 0.739 23 0.958 0.261 ok
4 rs_af_00005 7 5000 A G 0.500 16 0.667 0.500 low_call_rate
5 rs_af_00006 7 6000 C T 0.478 23 0.958 0.478 ok
6 rs_af_00007 7 7000 G A 0.182 22 0.917 0.182 ok
7 rs_af_00008 7 8000 T C 0.042 24 1.000 0.042 rare_variant
8 rs_af_00009 7 9000 A G 0.708 24 1.000 0.292 ok
9 rs_af_00010 7 10000 C T 0.348 23 0.958 0.348 ok
10 rs_af_00011 7 11000 G A 0.222 18 0.750 0.222 low_call_rate
11 rs_af_00012 7 12000 T C 0.818 22 0.917 0.182 ok
12 rs_af_00013 7 13000 A G 0.208 24 1.000 0.208 ok
13 rs_af_00014 7 14000 C T 0.333 24 1.000 0.333 ok
14 rs_af_00015 7 15000 G A 0.250 24 1.000 0.250 ok
15 rs_af_00016 7 16000 T C 0.167 24 1.000 0.167 ok

The maf and call_rate columns are simple but useful downstream summaries you can derive immediately from the frequency and count outputs.

This is often the first pass for QC: variants with low call rate or extremely small minor-allele frequency deserve a closer look before you use them.

cohort_summary.loc[cohort_summary["qc_flag"] != "ok", [
    "variant",
    "cohort_af",
    "called_alleles",
    "call_rate",
    "maf",
    "qc_flag",
]].round({"cohort_af": 3, "call_rate": 3, "maf": 3})
variant cohort_af called_alleles call_rate maf qc_flag
4 rs_af_00005 0.500 16 0.667 0.500 low_call_rate
7 rs_af_00008 0.042 24 1.000 0.042 rare_variant
10 rs_af_00011 0.222 18 0.750 0.222 low_call_rate
variant_idx = 4

gt = snp.genotypes[variant_idx].astype(float)
gt[gt < 0] = np.nan

manual_alt_count = np.nansum(gt)
manual_called_alleles = np.sum(~np.isnan(gt))
manual_af = manual_alt_count / manual_called_alleles

pd.DataFrame(
    {
        "variant": [snp.variants_id[variant_idx]],
        "manual_alt_count": [manual_alt_count],
        "manual_called_alleles": [manual_called_alleles],
        "manual_af": [manual_af],
        "allele_freq_output": [cohort_af[variant_idx]],
        "called_output": [cohort_called[variant_idx]],
    }
).round(3)
variant manual_alt_count manual_called_alleles manual_af allele_freq_output called_output
0 rs_af_00005 8.0 16 0.5 0.5 16

This check is useful because allele frequency is not just a mean over samples. For phased genotypes, the denominator is the number of called haplotypes.

Grouped allele frequency

Pass sample_labels to estimate one frequency column per group. The group labels can be populations, sequencing centers, phenotype strata, batches, or any other sample-level annotation aligned with snp.samples.

A useful trick is to rank variants by between-group spread. That quickly surfaces alleles that are nearly private to one group or that look suspiciously batch-specific.

group_af, group_called = snp.allele_freq(
    sample_labels=labels,
    return_counts=True,
    as_dataframe=True,
)

group_summary = pd.concat(
    [
        cohort_summary[["variant", "cohort_af", "called_alleles"]].rename(
            columns={"called_alleles": "cohort_called_alleles"}
        ),
        group_af.add_prefix("af_"),
        group_called.add_prefix("called_"),
    ],
    axis=1,
)

freq_cols = [col for col in group_summary.columns if col.startswith("af_")]
group_summary["spread"] = group_summary[freq_cols].max(axis=1) - group_summary[freq_cols].min(axis=1)
group_summary["most_enriched_group"] = group_summary[freq_cols].idxmax(axis=1).str.removeprefix("af_")

group_summary.sort_values(["spread", "variant"], ascending=[False, True]).head(8).round(3)
variant cohort_af cohort_called_alleles af_POP_A af_POP_B af_POP_C called_POP_A called_POP_B called_POP_C spread most_enriched_group
1 rs_af_00002 0.333 24 1.000 0.00 0.000 8 8 8 1.000 POP_A
4 rs_af_00005 0.500 16 0.000 1.00 NaN 8 8 0 1.000 POP_B
13 rs_af_00014 0.333 24 0.000 0.00 1.000 8 8 8 1.000 POP_C
3 rs_af_00004 0.739 23 0.286 1.00 0.875 7 8 8 0.714 POP_B
8 rs_af_00009 0.708 24 0.500 1.00 0.625 8 8 8 0.500 POP_B
0 rs_af_00001 0.875 24 1.000 1.00 0.625 8 8 8 0.375 POP_A
12 rs_af_00013 0.208 24 0.375 0.25 0.000 8 8 8 0.375 POP_A
14 rs_af_00015 0.250 24 0.375 0.00 0.375 8 8 8 0.375 POP_A

Two details matter when interpreting grouped results:

  1. A large frequency difference is only convincing if the corresponding called_* counts are also healthy.

  2. Missingness can be group-specific, so comparing only frequencies without counts can be misleading.

interesting_grouped = group_summary.loc[group_summary["spread"] >= 0.50, [
    "variant",
    "cohort_af",
    "af_POP_A",
    "af_POP_B",
    "af_POP_C",
    "called_POP_A",
    "called_POP_B",
    "called_POP_C",
    "spread",
    "most_enriched_group",
],
].sort_values(["spread", "variant"], ascending=[False, True])
interesting_grouped.round(3)
variant cohort_af af_POP_A af_POP_B af_POP_C called_POP_A called_POP_B called_POP_C spread most_enriched_group
1 rs_af_00002 0.333 1.000 0.0 0.000 8 8 8 1.000 POP_A
4 rs_af_00005 0.500 0.000 1.0 NaN 8 8 0 1.000 POP_B
13 rs_af_00014 0.333 0.000 0.0 1.000 8 8 8 1.000 POP_C
3 rs_af_00004 0.739 0.286 1.0 0.875 7 8 8 0.714 POP_B
8 rs_af_00009 0.708 0.500 1.0 0.625 8 8 8 0.500 POP_B

Ancestry-masked allele frequency

If SNP-level local ancestry is available, passing ancestry= restricts the calculation to haplotypes assigned to that ancestry.

This is different from grouping samples by population. Population labels split individuals into bins; ancestry masking splits haplotypes within individuals. In admixed cohorts, the two summaries answer different biological questions.

ancestry_tables = []
for anc_code, anc_name in snp.ancestry_map.items():
    anc_af, anc_called = snp.allele_freq(ancestry=int(anc_code), return_counts=True)
    ancestry_tables.append(
        pd.DataFrame(
            {
                "variant": snp.variants_id,
                f"{anc_name}_af": anc_af,
                f"{anc_name}_called": anc_called,
            }
        )
    )

ancestry_summary = cohort_summary[["variant", "cohort_af", "called_alleles"]].copy()
for table in ancestry_tables:
    ancestry_summary = ancestry_summary.merge(table, on="variant")

anc_freq_cols = [col for col in ancestry_summary.columns if col.endswith("_af") and col != "cohort_af"]
ancestry_summary["max_ancestry_shift"] = ancestry_summary[anc_freq_cols].sub(
    ancestry_summary["cohort_af"], axis=0
).abs().max(axis=1)

ancestry_summary.sort_values("max_ancestry_shift", ascending=False).head(8).round(3)
variant cohort_af called_alleles ANC0_af ANC0_called ANC1_af ANC1_called ANC2_af ANC2_called max_ancestry_shift
1 rs_af_00002 0.333 24 0.000 16 1.000 8 NaN 0 0.667
13 rs_af_00014 0.333 24 0.000 16 1.000 8 NaN 0 0.667
4 rs_af_00005 0.500 16 0.000 8 NaN 0 1.00 8 0.500
8 rs_af_00009 0.708 24 0.667 12 0.571 7 1.00 5 0.292
14 rs_af_00015 0.250 24 0.000 5 0.444 9 0.20 10 0.250
2 rs_af_00003 0.174 23 0.400 5 0.000 10 0.25 8 0.226
10 rs_af_00011 0.222 18 0.000 2 0.250 8 0.25 8 0.222
6 rs_af_00007 0.182 22 0.167 12 0.400 5 0.00 5 0.218

The ancestry-specific called-allele counts are often much smaller than the cohort-wide counts because each ancestry mask keeps only a subset of haplotypes. That count reduction is expected, not an error.

When a variant looks extreme in one ancestry but is based on very few called haplotypes, treat it as a clue rather than a conclusion.

ancestry_summary.loc[ancestry_summary["max_ancestry_shift"] >= 0.35, [
    "variant",
    "cohort_af",
    "ANC0_af",
    "ANC1_af",
    "ANC2_af",
    "ANC0_called",
    "ANC1_called",
    "ANC2_called",
    "max_ancestry_shift",
]].round(3)
variant cohort_af ANC0_af ANC1_af ANC2_af ANC0_called ANC1_called ANC2_called max_ancestry_shift
1 rs_af_00002 0.333 0.0 1.0 NaN 16 8 0 0.667
4 rs_af_00005 0.500 0.0 NaN 1.0 8 0 8 0.500
13 rs_af_00014 0.333 0.0 1.0 NaN 16 8 0 0.667

Pseudo-haploid example

Ancient-DNA and low-coverage workflows often use pseudohaploid calls, where each sample contributes one observed allele at a site even if the genotype is encoded as 0 or 2. So the denominator should increase by 1 rather than 2 for those samples.

allele_freq() can detect samples without heterozygotes and down-weight their denominator with pseudohaploid=True.

pseudo_gt = np.array(
    [
        [1.0, 0.0],
        [0.0, 2.0],
        [1.0, 0.0],
        [2.0, 2.0],
    ]
)
pseudo = SNPObject(
    genotypes=pseudo_gt,
    samples=np.array(["diploid_sample", "pseudo_sample"], dtype=object),
    variants_id=np.array([f"pseudo_rs_{i}" for i in range(1, 5)], dtype=object),
)

freq_diploid, counts_diploid = pseudo.allele_freq(pseudohaploid=False, return_counts=True)
freq_pseudo, counts_pseudo = pseudo.allele_freq(pseudohaploid=True, return_counts=True)

pd.DataFrame(
    {
        "variant": pseudo.variants_id,
        "freq_default": freq_diploid,
        "called_default": counts_diploid,
        "freq_pseudohaploid": freq_pseudo,
        "called_pseudohaploid": counts_pseudo,
    }
).round(3)
variant freq_default called_default freq_pseudohaploid called_pseudohaploid
0 pseudo_rs_1 0.25 4 0.333 3
1 pseudo_rs_2 0.50 4 0.333 3
2 pseudo_rs_3 0.25 4 0.333 3
3 pseudo_rs_4 1.00 4 1.000 3

Streaming allele frequency

allele_freq_stream() processes the data in variant chunks and returns the same result as eager in-memory computation. The streaming API is useful when your source data comes from a reader or when the full genotype matrix is too large to materialize comfortably.

Here we validate the streaming path against the in-memory path on the same grouped, ancestry-masked query.

stream_af, stream_called = allele_freq_stream(
    snp,
    chunk_size=5,
    sample_labels=labels,
    ancestry=1,
    return_counts=True,
    as_dataframe=True,
)

eager_af, eager_called = snp.allele_freq(
    sample_labels=labels,
    ancestry=1,
    return_counts=True,
    as_dataframe=True,
)

comparison = pd.DataFrame(
    {
        "freq_match": [np.allclose(stream_af.to_numpy(), eager_af.to_numpy(), equal_nan=True)],
        "count_match": [stream_called.equals(eager_called)],
        "n_variants": [snp.n_snps],
        "chunk_size": [5],
    }
)

display(comparison)
stream_af.head().round(3)
freq_match count_match n_variants chunk_size
0 True True 16 5
POP_A POP_B POP_C
0 1.0 1.0 1.0
1 1.0 NaN NaN
2 0.0 0.0 0.0
3 0.0 1.0 1.0
4 NaN NaN NaN

Takeaways

  • Always inspect frequency estimates together with called-allele counts.

  • Grouped allele frequencies are useful for detecting differentiated or suspicious variants.

  • Ancestry-masked allele frequencies answer a different question than population-grouped frequencies.

  • Use pseudohaploid=True when pseudohaploid 0/2 calls are mixed with ordinary diploid genotypes.

  • allele_freq_stream() is the scalable path, and it should agree with eager results on the same input.