SNPObject

SNPObject is the main object you will use to hold genotype data in snputils. It stores genotype calls together with sample metadata, variant metadata, and optional SNP-level local ancestry.

This notebook walks through a small SNPObject from inspection to filtering, sample alignment, allele frequencies, and local ancestry. We will cover how to:

  • inspect the structure of a phased SNPObject

  • make dosage and QC tables from phased genotypes

  • subset variants and samples for downstream analysis

  • realign samples to an external table

  • normalize chromosome naming and prepare human-readable genotype table views

  • attach local ancestry and compute ancestry-aware summaries

The examples use a synthetic dataset so the notebook runs without external files.

import numpy as np
import pandas as pd
from IPython.display import display

from snputils.datasets import build_synthetic_snp_dataset

SEED = 20240521
rng = np.random.default_rng(SEED)

snpobj = build_synthetic_snp_dataset(
    n_samples=8,
    n_snps=18,
    seed=SEED,
    n_populations=3,
    missing_rate=0.0,
    phased=True,
    chromosome="chr7",
    variant_prefix="rs_obj",
)

# Add a few hand-crafted variants so later examples are interpretable.
snpobj.genotypes[2, :, :] = 0
snpobj.genotypes[5, [1, 6], :] = -1
snpobj.genotypes[7, [2, 7], :] = -1
snpobj.genotypes[11, :, :] = 0
snpobj.genotypes[11, [0, 3, 4], :] = 1
snpobj.genotypes[14, [2, 4], :] = -1

sample_table = pd.DataFrame(
    {
        "sample": snpobj.samples.astype(str),
        "population": snpobj.sample_fid.astype(str),
        "sex_code": snpobj.sample_sex,
        "study_arm": np.where(np.arange(snpobj.n_samples) % 2 == 0, "arm_A", "arm_B"),
    }
)

variant_table = pd.DataFrame(
    {
        "variant": snpobj.variants_id.astype(str),
        "chrom": snpobj.variants_chrom.astype(str),
        "pos": snpobj.variants_pos,
        "ref": snpobj.variants_ref.astype(str),
        "alt": snpobj.variants_alt.astype(str),
    }
)

display(sample_table)
display(variant_table)
sample population sex_code study_arm
0 S01 POP_A 1 arm_A
1 S02 POP_B 2 arm_B
2 S03 POP_C 1 arm_A
3 S04 POP_A 2 arm_B
4 S05 POP_B 1 arm_A
5 S06 POP_C 2 arm_B
6 S07 POP_A 1 arm_A
7 S08 POP_B 2 arm_B
variant chrom pos ref alt
0 rs_obj_00001 chr7 1000 A G
1 rs_obj_00002 chr7 2000 C T
2 rs_obj_00003 chr7 3000 G A
3 rs_obj_00004 chr7 4000 T C
4 rs_obj_00005 chr7 5000 A G
5 rs_obj_00006 chr7 6000 C T
6 rs_obj_00007 chr7 7000 G A
7 rs_obj_00008 chr7 8000 T C
8 rs_obj_00009 chr7 9000 A G
9 rs_obj_00010 chr7 10000 C T
10 rs_obj_00011 chr7 11000 G A
11 rs_obj_00012 chr7 12000 T C
12 rs_obj_00013 chr7 13000 A G
13 rs_obj_00014 chr7 14000 C T
14 rs_obj_00015 chr7 15000 G A
15 rs_obj_00016 chr7 16000 T C
16 rs_obj_00017 chr7 17000 A G
17 rs_obj_00018 chr7 18000 C T

Inspect the container

A phased SNPObject stores genotypes with shape (n_snps, n_samples, 2), where the last axis is the two haplotypes. The object also keeps sample-level and variant-level arrays aligned to those axes.

print(snpobj)
print("shape:", snpobj.shape)
print("n_samples:", snpobj.n_samples)
print("n_snps:", snpobj.n_snps)
print("unique chromosomes:", snpobj.unique_chrom.tolist())
print("are strands summed?:", snpobj.are_strands_summed)
print("available keys:", snpobj.keys())
SNPObject(shape=(18, 8, 2), n_snps=18, n_samples=8, genotypes_shape=(18, 8, 2), calldata_lai_shape=None, calldata_gp_shape=None, has_variant_metadata=True, has_ancestry_map=False)
shape: (18, 8, 2)
n_samples: 8
n_snps: 18
unique chromosomes: ['chr7']
are strands summed?: False
available keys: ['genotypes', 'samples', 'variants_ref', 'variants_alt', 'variants_chrom', 'variants_cm', 'variants_filter_pass', 'variants_id', 'variants_pos', 'variants_qual', 'variants_info', 'calldata_lai', 'calldata_gp', 'ancestry_map', 'sample_fid', 'sample_sex']

The two most important alignment rules are:

  1. arrays like samples, sample_fid, and sample_sex must stay aligned to the sample axis;

  2. arrays like variants_id, variants_pos, and variants_chrom must stay aligned to the SNP axis.

Dictionary-style access is convenient for programmatic workflows, while property access is the usual style in most analysis code. When you edit metadata, it is usually safer to work on a copy.

print("original sample IDs:", snpobj.samples[:4])
print("first variant IDs via dictionary-style access:", snpobj["variants_id"][:4])

renamed = snpobj.copy()
renamed.samples = np.array([f"analysis_{i:02d}" for i in range(renamed.n_samples)], dtype=object)
renamed["sample_fid"] = np.array([f"cluster_{i % 2}" for i in range(renamed.n_samples)], dtype=object)

print("copied sample IDs:", renamed.samples[:4])
print("original sample IDs still intact:", snpobj.samples[:4])
print("copied sample labels:", renamed.sample_fid[:4])
original sample IDs: ['S01' 'S02' 'S03' 'S04']
first variant IDs via dictionary-style access: ['rs_obj_00001' 'rs_obj_00002' 'rs_obj_00003' 'rs_obj_00004']
copied sample IDs: ['analysis_00' 'analysis_01' 'analysis_02' 'analysis_03']
original sample IDs still intact: ['S01' 'S02' 'S03' 'S04']
copied sample labels: ['cluster_0' 'cluster_1' 'cluster_0' 'cluster_1']

Dosage-style summaries from phased genotypes

Many downstream methods want alternate-allele dosage rather than two separate haplotypes. dosage("ALT") returns a 2D dosage matrix, and sum_strands() converts the whole object to summed genotypes. sum_strands() just sums the two haplotypes. If there are missing calls, clean those up before using the summed matrix for QC or modeling.

For phased genotype calls, we mask any sample with a missing haplotype before computing QC summaries. Before modeling, you can make a small variant-level QC table from the dosage matrix.

raw_dosage = snpobj.dosage("ALT").astype(float)
summed = snpobj.sum_strands()

if snpobj.genotypes.ndim == 3:
    missing_call = np.any(snpobj.genotypes < 0, axis=2)
else:
    missing_call = snpobj.genotypes < 0

alt_dosage = raw_dosage.copy()
alt_dosage[missing_call] = np.nan
called_samples = np.sum(~np.isnan(alt_dosage), axis=1)

variant_qc = variant_table.copy()
variant_qc["called_samples"] = called_samples
variant_qc["call_rate"] = called_samples / snpobj.n_samples
variant_qc["mean_alt_dosage"] = np.divide(
    np.nansum(alt_dosage, axis=1),
    called_samples,
    out=np.full(snpobj.n_snps, np.nan),
    where=called_samples > 0,
)
variant_qc["is_monomorphic_among_called"] = np.array(
    [
        called.size > 0 and np.unique(called).size <= 1
        for called in (row[~np.isnan(row)] for row in alt_dosage)
    ]
)

print("ALT dosage shape:", alt_dosage.shape)
print("Summed genotype shape:", summed.genotypes.shape)
display(pd.DataFrame(alt_dosage, index=snpobj.variants_id, columns=snpobj.samples))
variant_qc.round({"call_rate": 3, "mean_alt_dosage": 3})
ALT dosage shape: (18, 8)
Summed genotype shape: (18, 8)
S01 S02 S03 S04 S05 S06 S07 S08
rs_obj_00001 2.0 2.0 1.0 2.0 1.0 0.0 2.0 2.0
rs_obj_00002 2.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0
rs_obj_00003 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
rs_obj_00004 2.0 1.0 2.0 2.0 1.0 2.0 0.0 1.0
rs_obj_00005 1.0 2.0 1.0 2.0 0.0 1.0 0.0 1.0
rs_obj_00006 2.0 NaN 1.0 1.0 2.0 1.0 NaN 2.0
rs_obj_00007 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
rs_obj_00008 0.0 2.0 NaN 2.0 1.0 2.0 1.0 NaN
rs_obj_00009 2.0 1.0 1.0 2.0 0.0 1.0 2.0 1.0
rs_obj_00010 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
rs_obj_00011 1.0 0.0 1.0 1.0 1.0 2.0 2.0 2.0
rs_obj_00012 2.0 0.0 0.0 2.0 2.0 0.0 0.0 0.0
rs_obj_00013 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0
rs_obj_00014 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0
rs_obj_00015 0.0 1.0 NaN 1.0 NaN 0.0 0.0 2.0
rs_obj_00016 0.0 0.0 0.0 1.0 0.0 0.0 1.0 2.0
rs_obj_00017 1.0 0.0 1.0 0.0 1.0 1.0 2.0 2.0
rs_obj_00018 2.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0
variant chrom pos ref alt called_samples call_rate mean_alt_dosage is_monomorphic_among_called
0 rs_obj_00001 chr7 1000 A G 8 1.00 1.500 False
1 rs_obj_00002 chr7 2000 C T 8 1.00 1.875 False
2 rs_obj_00003 chr7 3000 G A 8 1.00 0.000 True
3 rs_obj_00004 chr7 4000 T C 8 1.00 1.375 False
4 rs_obj_00005 chr7 5000 A G 8 1.00 1.000 False
5 rs_obj_00006 chr7 6000 C T 6 0.75 1.500 False
6 rs_obj_00007 chr7 7000 G A 8 1.00 0.000 True
7 rs_obj_00008 chr7 8000 T C 6 0.75 1.333 False
8 rs_obj_00009 chr7 9000 A G 8 1.00 1.250 False
9 rs_obj_00010 chr7 10000 C T 8 1.00 0.250 False
10 rs_obj_00011 chr7 11000 G A 8 1.00 1.250 False
11 rs_obj_00012 chr7 12000 T C 8 1.00 0.750 False
12 rs_obj_00013 chr7 13000 A G 8 1.00 0.500 False
13 rs_obj_00014 chr7 14000 C T 8 1.00 0.625 False
14 rs_obj_00015 chr7 15000 G A 6 0.75 0.667 False
15 rs_obj_00016 chr7 16000 T C 8 1.00 0.500 False
16 rs_obj_00017 chr7 17000 A G 8 1.00 1.000 False
17 rs_obj_00018 chr7 18000 C T 8 1.00 0.625 False

Variant filtering as a workflow

In practice, it is easier to filter in small, explicit steps. Here we:

  • take a positional slice of the chromosome;

  • remove variants with any missing genotypes;

  • remove monomorphic variants that carry no variation signal.

region_subset = snpobj.filter_variants(
    chrom="chr7",
    pos=snpobj.variants_pos[2:12],
)
complete_subset = region_subset.filter_complete_genotypes()
analysis_subset = complete_subset.filter_polymorphic_variants()

summary = pd.DataFrame(
    {
        "step": ["original", "regional subset", "complete calls", "polymorphic + complete"],
        "n_snps": [snpobj.n_snps, region_subset.n_snps, complete_subset.n_snps, analysis_subset.n_snps],
        "n_samples": [snpobj.n_samples, region_subset.n_samples, complete_subset.n_samples, analysis_subset.n_samples],
    }
)

display(summary)
analysis_subset.variants_id
step n_snps n_samples
0 original 18 8
1 regional subset 10 8
2 complete calls 8 8
3 polymorphic + complete 6 8
array(['rs_obj_00004', 'rs_obj_00005', 'rs_obj_00009', 'rs_obj_00010',
       'rs_obj_00011', 'rs_obj_00012'], dtype=object)

Reorder samples to match an external table

A common preprocessing task is to realign genotype samples to a phenotype or covariate table. reorder=True makes the SNPObject follow the same sample order as the table.

phenotype_table = pd.DataFrame(
    {
        "sample": ["S05", "S02", "S08", "S01"],
        "trait": [2.3, -0.7, 1.1, 0.4],
    }
)

aligned = snpobj.filter_samples(samples=phenotype_table["sample"], reorder=True)

alignment_check = pd.DataFrame(
    {
        "phenotype_order": phenotype_table["sample"],
        "aligned_samples": aligned.samples,
        "aligned_population": aligned.sample_fid,
    }
)

alignment_check
phenotype_order aligned_samples aligned_population
0 S05 S05 POP_B
1 S02 S02 POP_B
2 S08 S08 POP_B
3 S01 S01 POP_A

Normalize chromosome naming and prepare a human-readable genotype table

Mixed chromosome naming conventions such as chr7 versus 7 are easy to overlook and can break joins or merges. For standard chromosome labels such as chr7 and 7, convert_chromosome_format() can normalize naming. For non-standard contigs or non-human assemblies, use pattern-based renaming with rename_chrom().

Another practical transformation is to make a display-friendly genotype table with missing-aware summed genotypes and "." as the missing-value token.

plain_chrom = snpobj.convert_chromosome_format("chr", "plain")

summed_gt = snpobj.genotypes.sum(axis=2)
missing_call = np.any(snpobj.genotypes < 0, axis=2)
summed_gt[missing_call] = -1

table_gt = summed_gt.astype(object)
table_gt[missing_call] = "."

genotype_table = pd.DataFrame(
    table_gt,
    index=snpobj.variants_id,
    columns=snpobj.samples,
)

print("chromosome format before:", snpobj.detect_chromosome_format(), snpobj.variants_chrom[:3])
print("chromosome format after:", plain_chrom.detect_chromosome_format(), plain_chrom.variants_chrom[:3])
display(genotype_table)
chromosome format before: chr ['chr7' 'chr7' 'chr7']
chromosome format after: plain ['7' '7' '7']
S01 S02 S03 S04 S05 S06 S07 S08
rs_obj_00001 2 2 1 2 1 0 2 2
rs_obj_00002 2 2 1 2 2 2 2 2
rs_obj_00003 0 0 0 0 0 0 0 0
rs_obj_00004 2 1 2 2 1 2 0 1
rs_obj_00005 1 2 1 2 0 1 0 1
rs_obj_00006 2 . 1 1 2 1 . 2
rs_obj_00007 0 0 0 0 0 0 0 0
rs_obj_00008 0 2 . 2 1 2 1 .
rs_obj_00009 2 1 1 2 0 1 2 1
rs_obj_00010 1 0 0 0 1 0 0 0
rs_obj_00011 1 0 1 1 1 2 2 2
rs_obj_00012 2 0 0 2 2 0 0 0
rs_obj_00013 0 0 1 1 0 0 1 1
rs_obj_00014 0 1 1 0 0 1 1 1
rs_obj_00015 0 1 . 1 . 0 0 2
rs_obj_00016 0 0 0 1 0 0 1 2
rs_obj_00017 1 0 1 0 1 1 2 2
rs_obj_00018 2 0 0 1 0 0 2 0

Grouped allele frequencies from object metadata

Because SNPObject already carries sample-level metadata, you can compute grouped allele frequencies directly from those labels. Here we compare two grouping schemes:

  • sample_fid, which acts like a population label in this synthetic dataset; in real PLINK-style data, FID may instead represent family ID;

  • study_arm, a toy grouping variable that could be replaced by batch, site, or case-control status in real data.

pop_af, pop_called = snpobj.allele_freq(
    sample_labels=snpobj.sample_fid,
    return_counts=True,
    as_dataframe=True,
)
arm_labels = sample_table["study_arm"].to_numpy()
arm_af, arm_called = snpobj.allele_freq(
    sample_labels=arm_labels,
    return_counts=True,
    as_dataframe=True,
)

pop_summary = pd.concat(
    [variant_table[["variant", "pos"]], pop_af.add_prefix("af_"), pop_called.add_prefix("called_")],
    axis=1,
)
pop_summary["population_spread"] = pop_af.max(axis=1) - pop_af.min(axis=1)

arm_summary = pd.concat(
    [variant_table[["variant", "pos"]], arm_af.add_prefix("af_"), arm_called.add_prefix("called_")],
    axis=1,
)
arm_summary["arm_difference"] = (arm_af["arm_A"] - arm_af["arm_B"]).abs()

display(pop_summary.sort_values("population_spread", ascending=False).round(3))
arm_summary.sort_values("arm_difference", ascending=False).round(3)
variant pos af_POP_A af_POP_B af_POP_C called_POP_A called_POP_B called_POP_C population_spread
17 rs_obj_00018 18000 0.833 0.000 0.00 6 6 4 0.833
0 rs_obj_00001 1000 1.000 0.833 0.25 6 6 4 0.750
14 rs_obj_00015 15000 0.167 0.750 0.00 6 4 2 0.750
8 rs_obj_00009 9000 1.000 0.333 0.50 6 6 4 0.667
11 rs_obj_00012 12000 0.667 0.333 0.00 6 6 4 0.667
7 rs_obj_00008 8000 0.500 0.750 1.00 6 4 2 0.500
5 rs_obj_00006 6000 0.750 1.000 0.50 4 4 4 0.500
3 rs_obj_00004 4000 0.667 0.500 1.00 6 6 4 0.500
13 rs_obj_00014 14000 0.167 0.333 0.50 6 6 4 0.333
15 rs_obj_00016 16000 0.333 0.333 0.00 6 6 4 0.333
1 rs_obj_00002 2000 1.000 1.000 0.75 6 6 4 0.250
10 rs_obj_00011 11000 0.667 0.500 0.75 6 6 4 0.250
12 rs_obj_00013 13000 0.333 0.167 0.25 6 6 4 0.167
9 rs_obj_00010 10000 0.167 0.167 0.00 6 6 4 0.167
6 rs_obj_00007 7000 0.000 0.000 0.00 6 6 4 0.000
4 rs_obj_00005 5000 0.500 0.500 0.50 6 6 4 0.000
2 rs_obj_00003 3000 0.000 0.000 0.00 6 6 4 0.000
16 rs_obj_00017 17000 0.500 0.500 0.50 6 6 4 0.000
variant pos af_arm_A af_arm_B called_arm_A called_arm_B arm_difference
7 rs_obj_00008 8000 0.333 1.000 6 6 0.667
4 rs_obj_00005 5000 0.250 0.750 8 8 0.500
14 rs_obj_00015 15000 0.000 0.500 4 8 0.500
17 rs_obj_00018 18000 0.500 0.125 8 8 0.375
15 rs_obj_00016 16000 0.125 0.375 8 8 0.250
11 rs_obj_00012 12000 0.500 0.250 8 8 0.250
9 rs_obj_00010 10000 0.250 0.000 8 8 0.250
16 rs_obj_00017 17000 0.625 0.375 8 8 0.250
5 rs_obj_00006 6000 0.833 0.667 6 6 0.167
1 rs_obj_00002 2000 0.875 1.000 8 8 0.125
3 rs_obj_00004 4000 0.625 0.750 8 8 0.125
13 rs_obj_00014 14000 0.250 0.375 8 8 0.125
0 rs_obj_00001 1000 0.750 0.750 8 8 0.000
2 rs_obj_00003 3000 0.000 0.000 8 8 0.000
8 rs_obj_00009 9000 0.625 0.625 8 8 0.000
6 rs_obj_00007 7000 0.000 0.000 8 8 0.000
12 rs_obj_00013 13000 0.250 0.250 8 8 0.000
10 rs_obj_00011 11000 0.625 0.625 8 8 0.000

Attach local ancestry directly to the object

SNPObject can also carry SNP-level local ancestry in calldata_lai. In this example, we attach SNP-level local ancestry with the same (n_snps, n_samples, 2) layout as the phased genotypes. SNPObject can also represent SNP-level LAI in a flattened haplotype layout.

Ancestry masking can be done without leaving the object. In the ancestry-specific example below, ancestry=1 selects numeric ancestry code 1, which is labeled "ANC1" by ancestry_map.

snp_with_lai = snpobj.copy()
lai = rng.integers(0, 3, size=snp_with_lai.genotypes.shape, dtype=np.int8)
lai[11, :, :] = 0
lai[11, [0, 3, 4], :] = 1
snp_with_lai.calldata_lai = lai
snp_with_lai.ancestry_map = {"0": "ANC0", "1": "ANC1", "2": "ANC2"}

cohort_af, cohort_called = snp_with_lai.allele_freq(return_counts=True)
anc1_af, anc1_called = snp_with_lai.allele_freq(ancestry=1, return_counts=True)

lai_summary = pd.DataFrame(
    {
        "variant": snp_with_lai.variants_id,
        "cohort_af": cohort_af,
        "cohort_called": cohort_called,
        "ANC1_af": anc1_af,
        "ANC1_called": anc1_called,
        "abs_shift": np.abs(anc1_af - cohort_af),
    }
)

lai_summary.sort_values("abs_shift", ascending=False).head(8).round(3)
variant cohort_af cohort_called ANC1_af ANC1_called abs_shift
11 rs_obj_00012 0.375 16 1.000 6 0.625
7 rs_obj_00008 0.667 12 1.000 4 0.333
13 rs_obj_00014 0.312 16 0.000 3 0.312
16 rs_obj_00017 0.500 16 0.800 5 0.300
17 rs_obj_00018 0.312 16 0.500 6 0.188
3 rs_obj_00004 0.688 16 0.857 7 0.170
5 rs_obj_00006 0.750 12 0.600 5 0.150
9 rs_obj_00010 0.125 16 0.000 3 0.125

Takeaways

  • SNPObject is most useful when treated as an analysis container, not just a genotype array.

  • Working on copies makes metadata edits and export transforms safer.

  • dosage() and missing-aware summed genotypes are the bridge from phased haplotypes to matrix-style downstream workflows.

  • Filtering and sample reordering are easier to check when each step creates a clear new object.

  • Once metadata and LAI are attached, the same object can answer whole-cohort, group-level, and ancestry-masked questions.