SNPObject¶

SNPObject is the main object you will use to hold genotype data in snputils. It stores genotype calls together with sample metadata, variant metadata, and optional SNP-level local ancestry.

This notebook walks through a small SNPObject from inspection to filtering, sample alignment, allele frequencies, and local ancestry. We will cover how to:

inspect the structure of a phased SNPObject
make dosage and QC tables from phased genotypes
subset variants and samples for downstream analysis
realign samples to an external table
normalize chromosome naming and prepare human-readable genotype table views
attach local ancestry and compute ancestry-aware summaries

The examples use a synthetic dataset so the notebook runs without external files.

import numpy as np
import pandas as pd
from IPython.display import display

from snputils.datasets import build_synthetic_snp_dataset

SEED = 20240521
rng = np.random.default_rng(SEED)

snpobj = build_synthetic_snp_dataset(
    n_samples=8,
    n_snps=18,
    seed=SEED,
    n_populations=3,
    missing_rate=0.0,
    phased=True,
    chromosome="chr7",
    variant_prefix="rs_obj",
)

# Add a few hand-crafted variants so later examples are interpretable.
snpobj.genotypes[2, :, :] = 0
snpobj.genotypes[5, [1, 6], :] = -1
snpobj.genotypes[7, [2, 7], :] = -1
snpobj.genotypes[11, :, :] = 0
snpobj.genotypes[11, [0, 3, 4], :] = 1
snpobj.genotypes[14, [2, 4], :] = -1

sample_table = pd.DataFrame(
    {
        "sample": snpobj.samples.astype(str),
        "population": snpobj.sample_fid.astype(str),
        "sex_code": snpobj.sample_sex,
        "study_arm": np.where(np.arange(snpobj.n_samples) % 2 == 0, "arm_A", "arm_B"),
    }
)

variant_table = pd.DataFrame(
    {
        "variant": snpobj.variants_id.astype(str),
        "chrom": snpobj.variants_chrom.astype(str),
        "pos": snpobj.variants_pos,
        "ref": snpobj.variants_ref.astype(str),
        "alt": snpobj.variants_alt.astype(str),
    }
)

display(sample_table)
display(variant_table)

	sample	population	sex_code	study_arm
0	S01	POP_A	1	arm_A
1	S02	POP_B	2	arm_B
2	S03	POP_C	1	arm_A
3	S04	POP_A	2	arm_B
4	S05	POP_B	1	arm_A
5	S06	POP_C	2	arm_B
6	S07	POP_A	1	arm_A
7	S08	POP_B	2	arm_B

	variant	chrom	pos	ref	alt
0	rs_obj_00001	chr7	1000	A	G
1	rs_obj_00002	chr7	2000	C	T
2	rs_obj_00003	chr7	3000	G	A
3	rs_obj_00004	chr7	4000	T	C
4	rs_obj_00005	chr7	5000	A	G
5	rs_obj_00006	chr7	6000	C	T
6	rs_obj_00007	chr7	7000	G	A
7	rs_obj_00008	chr7	8000	T	C
8	rs_obj_00009	chr7	9000	A	G
9	rs_obj_00010	chr7	10000	C	T
10	rs_obj_00011	chr7	11000	G	A
11	rs_obj_00012	chr7	12000	T	C
12	rs_obj_00013	chr7	13000	A	G
13	rs_obj_00014	chr7	14000	C	T
14	rs_obj_00015	chr7	15000	G	A
15	rs_obj_00016	chr7	16000	T	C
16	rs_obj_00017	chr7	17000	A	G
17	rs_obj_00018	chr7	18000	C	T

Inspect the container¶

A phased SNPObject stores genotypes with shape (n_snps, n_samples, 2), where the last axis is the two haplotypes. The object also keeps sample-level and variant-level arrays aligned to those axes.

print(snpobj)
print("shape:", snpobj.shape)
print("n_samples:", snpobj.n_samples)
print("n_snps:", snpobj.n_snps)
print("unique chromosomes:", snpobj.unique_chrom.tolist())
print("are strands summed?:", snpobj.are_strands_summed)
print("available keys:", snpobj.keys())

SNPObject(shape=(18, 8, 2), n_snps=18, n_samples=8, genotypes_shape=(18, 8, 2), calldata_lai_shape=None, calldata_gp_shape=None, has_variant_metadata=True, has_ancestry_map=False)
shape: (18, 8, 2)
n_samples: 8
n_snps: 18
unique chromosomes: ['chr7']
are strands summed?: False
available keys: ['genotypes', 'samples', 'variants_ref', 'variants_alt', 'variants_chrom', 'variants_cm', 'variants_filter_pass', 'variants_id', 'variants_pos', 'variants_qual', 'variants_info', 'calldata_lai', 'calldata_gp', 'ancestry_map', 'sample_fid', 'sample_sex']

The two most important alignment rules are:

arrays like samples, sample_fid, and sample_sex must stay aligned to the sample axis;
arrays like variants_id, variants_pos, and variants_chrom must stay aligned to the SNP axis.

Dictionary-style access is convenient for programmatic workflows, while property access is the usual style in most analysis code. When you edit metadata, it is usually safer to work on a copy.

print("original sample IDs:", snpobj.samples[:4])
print("first variant IDs via dictionary-style access:", snpobj["variants_id"][:4])

renamed = snpobj.copy()
renamed.samples = np.array([f"analysis_{i:02d}" for i in range(renamed.n_samples)], dtype=object)
renamed["sample_fid"] = np.array([f"cluster_{i % 2}" for i in range(renamed.n_samples)], dtype=object)

print("copied sample IDs:", renamed.samples[:4])
print("original sample IDs still intact:", snpobj.samples[:4])
print("copied sample labels:", renamed.sample_fid[:4])

original sample IDs: ['S01' 'S02' 'S03' 'S04']
first variant IDs via dictionary-style access: ['rs_obj_00001' 'rs_obj_00002' 'rs_obj_00003' 'rs_obj_00004']
copied sample IDs: ['analysis_00' 'analysis_01' 'analysis_02' 'analysis_03']
original sample IDs still intact: ['S01' 'S02' 'S03' 'S04']
copied sample labels: ['cluster_0' 'cluster_1' 'cluster_0' 'cluster_1']

Dosage-style summaries from phased genotypes¶

Many downstream methods want alternate-allele dosage rather than two separate haplotypes. dosage("ALT") returns a 2D dosage matrix, and sum_strands() converts the whole object to summed genotypes. sum_strands() just sums the two haplotypes. If there are missing calls, clean those up before using the summed matrix for QC or modeling.

For phased genotype calls, we mask any sample with a missing haplotype before computing QC summaries. Before modeling, you can make a small variant-level QC table from the dosage matrix.

raw_dosage = snpobj.dosage("ALT").astype(float)
summed = snpobj.sum_strands()

if snpobj.genotypes.ndim == 3:
    missing_call = np.any(snpobj.genotypes < 0, axis=2)
else:
    missing_call = snpobj.genotypes < 0

alt_dosage = raw_dosage.copy()
alt_dosage[missing_call] = np.nan
called_samples = np.sum(~np.isnan(alt_dosage), axis=1)

variant_qc = variant_table.copy()
variant_qc["called_samples"] = called_samples
variant_qc["call_rate"] = called_samples / snpobj.n_samples
variant_qc["mean_alt_dosage"] = np.divide(
    np.nansum(alt_dosage, axis=1),
    called_samples,
    out=np.full(snpobj.n_snps, np.nan),
    where=called_samples > 0,
)
variant_qc["is_monomorphic_among_called"] = np.array(
    [
        called.size > 0 and np.unique(called).size <= 1
        for called in (row[~np.isnan(row)] for row in alt_dosage)
    ]
)

print("ALT dosage shape:", alt_dosage.shape)
print("Summed genotype shape:", summed.genotypes.shape)
display(pd.DataFrame(alt_dosage, index=snpobj.variants_id, columns=snpobj.samples))
variant_qc.round({"call_rate": 3, "mean_alt_dosage": 3})

ALT dosage shape: (18, 8)
Summed genotype shape: (18, 8)

	S01	S02	S03	S04	S05	S06	S07	S08
rs_obj_00001	2.0	2.0	1.0	2.0	1.0	0.0	2.0	2.0
rs_obj_00002	2.0	2.0	1.0	2.0	2.0	2.0	2.0	2.0
rs_obj_00003	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
rs_obj_00004	2.0	1.0	2.0	2.0	1.0	2.0	0.0	1.0
rs_obj_00005	1.0	2.0	1.0	2.0	0.0	1.0	0.0	1.0
rs_obj_00006	2.0	NaN	1.0	1.0	2.0	1.0	NaN	2.0
rs_obj_00007	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
rs_obj_00008	0.0	2.0	NaN	2.0	1.0	2.0	1.0	NaN
rs_obj_00009	2.0	1.0	1.0	2.0	0.0	1.0	2.0	1.0
rs_obj_00010	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0
rs_obj_00011	1.0	0.0	1.0	1.0	1.0	2.0	2.0	2.0
rs_obj_00012	2.0	0.0	0.0	2.0	2.0	0.0	0.0	0.0
rs_obj_00013	0.0	0.0	1.0	1.0	0.0	0.0	1.0	1.0
rs_obj_00014	0.0	1.0	1.0	0.0	0.0	1.0	1.0	1.0
rs_obj_00015	0.0	1.0	NaN	1.0	NaN	0.0	0.0	2.0
rs_obj_00016	0.0	0.0	0.0	1.0	0.0	0.0	1.0	2.0
rs_obj_00017	1.0	0.0	1.0	0.0	1.0	1.0	2.0	2.0
rs_obj_00018	2.0	0.0	0.0	1.0	0.0	0.0	2.0	0.0

	variant	chrom	pos	ref	alt	called_samples	call_rate	mean_alt_dosage	is_monomorphic_among_called
0	rs_obj_00001	chr7	1000	A	G	8	1.00	1.500	False
1	rs_obj_00002	chr7	2000	C	T	8	1.00	1.875	False
2	rs_obj_00003	chr7	3000	G	A	8	1.00	0.000	True
3	rs_obj_00004	chr7	4000	T	C	8	1.00	1.375	False
4	rs_obj_00005	chr7	5000	A	G	8	1.00	1.000	False
5	rs_obj_00006	chr7	6000	C	T	6	0.75	1.500	False
6	rs_obj_00007	chr7	7000	G	A	8	1.00	0.000	True
7	rs_obj_00008	chr7	8000	T	C	6	0.75	1.333	False
8	rs_obj_00009	chr7	9000	A	G	8	1.00	1.250	False
9	rs_obj_00010	chr7	10000	C	T	8	1.00	0.250	False
10	rs_obj_00011	chr7	11000	G	A	8	1.00	1.250	False
11	rs_obj_00012	chr7	12000	T	C	8	1.00	0.750	False
12	rs_obj_00013	chr7	13000	A	G	8	1.00	0.500	False
13	rs_obj_00014	chr7	14000	C	T	8	1.00	0.625	False
14	rs_obj_00015	chr7	15000	G	A	6	0.75	0.667	False
15	rs_obj_00016	chr7	16000	T	C	8	1.00	0.500	False
16	rs_obj_00017	chr7	17000	A	G	8	1.00	1.000	False
17	rs_obj_00018	chr7	18000	C	T	8	1.00	0.625	False

Variant filtering as a workflow¶

In practice, it is easier to filter in small, explicit steps. Here we:

take a positional slice of the chromosome;
remove variants with any missing genotypes;
remove monomorphic variants that carry no variation signal.

region_subset = snpobj.filter_variants(
    chrom="chr7",
    pos=snpobj.variants_pos[2:12],
)
complete_subset = region_subset.filter_complete_genotypes()
analysis_subset = complete_subset.filter_polymorphic_variants()

summary = pd.DataFrame(
    {
        "step": ["original", "regional subset", "complete calls", "polymorphic + complete"],
        "n_snps": [snpobj.n_snps, region_subset.n_snps, complete_subset.n_snps, analysis_subset.n_snps],
        "n_samples": [snpobj.n_samples, region_subset.n_samples, complete_subset.n_samples, analysis_subset.n_samples],
    }
)

display(summary)
analysis_subset.variants_id

	step	n_snps	n_samples
0	original	18	8
1	regional subset	10	8
2	complete calls	8	8
3	polymorphic + complete	6	8

array(['rs_obj_00004', 'rs_obj_00005', 'rs_obj_00009', 'rs_obj_00010',
       'rs_obj_00011', 'rs_obj_00012'], dtype=object)

Reorder samples to match an external table¶

A common preprocessing task is to realign genotype samples to a phenotype or covariate table. reorder=True makes the SNPObject follow the same sample order as the table.

phenotype_table = pd.DataFrame(
    {
        "sample": ["S05", "S02", "S08", "S01"],
        "trait": [2.3, -0.7, 1.1, 0.4],
    }
)

aligned = snpobj.filter_samples(samples=phenotype_table["sample"], reorder=True)

alignment_check = pd.DataFrame(
    {
        "phenotype_order": phenotype_table["sample"],
        "aligned_samples": aligned.samples,
        "aligned_population": aligned.sample_fid,
    }
)

alignment_check

	phenotype_order	aligned_samples	aligned_population
0	S05	S05	POP_B
1	S02	S02	POP_B
2	S08	S08	POP_B
3	S01	S01	POP_A

Normalize chromosome naming and prepare a human-readable genotype table¶

Mixed chromosome naming conventions such as chr7 versus 7 are easy to overlook and can break joins or merges. For standard chromosome labels such as chr7 and 7, convert_chromosome_format() can normalize naming. For non-standard contigs or non-human assemblies, use pattern-based renaming with rename_chrom().

Another practical transformation is to make a display-friendly genotype table with missing-aware summed genotypes and "." as the missing-value token.

plain_chrom = snpobj.convert_chromosome_format("chr", "plain")

summed_gt = snpobj.genotypes.sum(axis=2)
missing_call = np.any(snpobj.genotypes < 0, axis=2)
summed_gt[missing_call] = -1

table_gt = summed_gt.astype(object)
table_gt[missing_call] = "."

genotype_table = pd.DataFrame(
    table_gt,
    index=snpobj.variants_id,
    columns=snpobj.samples,
)

print("chromosome format before:", snpobj.detect_chromosome_format(), snpobj.variants_chrom[:3])
print("chromosome format after:", plain_chrom.detect_chromosome_format(), plain_chrom.variants_chrom[:3])
display(genotype_table)

chromosome format before: chr ['chr7' 'chr7' 'chr7']
chromosome format after: plain ['7' '7' '7']

	S01	S02	S03	S04	S05	S06	S07	S08
rs_obj_00001	2	2	1	2	1	0	2	2
rs_obj_00002	2	2	1	2	2	2	2	2
rs_obj_00003	0	0	0	0	0	0	0	0
rs_obj_00004	2	1	2	2	1	2	0	1
rs_obj_00005	1	2	1	2	0	1	0	1
rs_obj_00006	2	.	1	1	2	1	.	2
rs_obj_00007	0	0	0	0	0	0	0	0
rs_obj_00008	0	2	.	2	1	2	1	.
rs_obj_00009	2	1	1	2	0	1	2	1
rs_obj_00010	1	0	0	0	1	0	0	0
rs_obj_00011	1	0	1	1	1	2	2	2
rs_obj_00012	2	0	0	2	2	0	0	0
rs_obj_00013	0	0	1	1	0	0	1	1
rs_obj_00014	0	1	1	0	0	1	1	1
rs_obj_00015	0	1	.	1	.	0	0	2
rs_obj_00016	0	0	0	1	0	0	1	2
rs_obj_00017	1	0	1	0	1	1	2	2
rs_obj_00018	2	0	0	1	0	0	2	0

Grouped allele frequencies from object metadata¶

Because SNPObject already carries sample-level metadata, you can compute grouped allele frequencies directly from those labels. Here we compare two grouping schemes:

sample_fid, which acts like a population label in this synthetic dataset; in real PLINK-style data, FID may instead represent family ID;
study_arm, a toy grouping variable that could be replaced by batch, site, or case-control status in real data.

pop_af, pop_called = snpobj.allele_freq(
    sample_labels=snpobj.sample_fid,
    return_counts=True,
    as_dataframe=True,
)
arm_labels = sample_table["study_arm"].to_numpy()
arm_af, arm_called = snpobj.allele_freq(
    sample_labels=arm_labels,
    return_counts=True,
    as_dataframe=True,
)

pop_summary = pd.concat(
    [variant_table[["variant", "pos"]], pop_af.add_prefix("af_"), pop_called.add_prefix("called_")],
    axis=1,
)
pop_summary["population_spread"] = pop_af.max(axis=1) - pop_af.min(axis=1)

arm_summary = pd.concat(
    [variant_table[["variant", "pos"]], arm_af.add_prefix("af_"), arm_called.add_prefix("called_")],
    axis=1,
)
arm_summary["arm_difference"] = (arm_af["arm_A"] - arm_af["arm_B"]).abs()

display(pop_summary.sort_values("population_spread", ascending=False).round(3))
arm_summary.sort_values("arm_difference", ascending=False).round(3)

	variant	pos	af_POP_A	af_POP_B	af_POP_C	called_POP_A	called_POP_B	called_POP_C	population_spread
17	rs_obj_00018	18000	0.833	0.000	0.00	6	6	4	0.833
0	rs_obj_00001	1000	1.000	0.833	0.25	6	6	4	0.750
14	rs_obj_00015	15000	0.167	0.750	0.00	6	4	2	0.750
8	rs_obj_00009	9000	1.000	0.333	0.50	6	6	4	0.667
11	rs_obj_00012	12000	0.667	0.333	0.00	6	6	4	0.667
7	rs_obj_00008	8000	0.500	0.750	1.00	6	4	2	0.500
5	rs_obj_00006	6000	0.750	1.000	0.50	4	4	4	0.500
3	rs_obj_00004	4000	0.667	0.500	1.00	6	6	4	0.500
13	rs_obj_00014	14000	0.167	0.333	0.50	6	6	4	0.333
15	rs_obj_00016	16000	0.333	0.333	0.00	6	6	4	0.333
1	rs_obj_00002	2000	1.000	1.000	0.75	6	6	4	0.250
10	rs_obj_00011	11000	0.667	0.500	0.75	6	6	4	0.250
12	rs_obj_00013	13000	0.333	0.167	0.25	6	6	4	0.167
9	rs_obj_00010	10000	0.167	0.167	0.00	6	6	4	0.167
6	rs_obj_00007	7000	0.000	0.000	0.00	6	6	4	0.000
4	rs_obj_00005	5000	0.500	0.500	0.50	6	6	4	0.000
2	rs_obj_00003	3000	0.000	0.000	0.00	6	6	4	0.000
16	rs_obj_00017	17000	0.500	0.500	0.50	6	6	4	0.000

	variant	pos	af_arm_A	af_arm_B	called_arm_A	called_arm_B	arm_difference
7	rs_obj_00008	8000	0.333	1.000	6	6	0.667
4	rs_obj_00005	5000	0.250	0.750	8	8	0.500
14	rs_obj_00015	15000	0.000	0.500	4	8	0.500
17	rs_obj_00018	18000	0.500	0.125	8	8	0.375
15	rs_obj_00016	16000	0.125	0.375	8	8	0.250
11	rs_obj_00012	12000	0.500	0.250	8	8	0.250
9	rs_obj_00010	10000	0.250	0.000	8	8	0.250
16	rs_obj_00017	17000	0.625	0.375	8	8	0.250
5	rs_obj_00006	6000	0.833	0.667	6	6	0.167
1	rs_obj_00002	2000	0.875	1.000	8	8	0.125
3	rs_obj_00004	4000	0.625	0.750	8	8	0.125
13	rs_obj_00014	14000	0.250	0.375	8	8	0.125
0	rs_obj_00001	1000	0.750	0.750	8	8	0.000
2	rs_obj_00003	3000	0.000	0.000	8	8	0.000
8	rs_obj_00009	9000	0.625	0.625	8	8	0.000
6	rs_obj_00007	7000	0.000	0.000	8	8	0.000
12	rs_obj_00013	13000	0.250	0.250	8	8	0.000
10	rs_obj_00011	11000	0.625	0.625	8	8	0.000

Attach local ancestry directly to the object¶

SNPObject can also carry SNP-level local ancestry in calldata_lai. In this example, we attach SNP-level local ancestry with the same (n_snps, n_samples, 2) layout as the phased genotypes. SNPObject can also represent SNP-level LAI in a flattened haplotype layout.

Ancestry masking can be done without leaving the object. In the ancestry-specific example below, ancestry=1 selects numeric ancestry code 1, which is labeled "ANC1" by ancestry_map.

snp_with_lai = snpobj.copy()
lai = rng.integers(0, 3, size=snp_with_lai.genotypes.shape, dtype=np.int8)
lai[11, :, :] = 0
lai[11, [0, 3, 4], :] = 1
snp_with_lai.calldata_lai = lai
snp_with_lai.ancestry_map = {"0": "ANC0", "1": "ANC1", "2": "ANC2"}

cohort_af, cohort_called = snp_with_lai.allele_freq(return_counts=True)
anc1_af, anc1_called = snp_with_lai.allele_freq(ancestry=1, return_counts=True)

lai_summary = pd.DataFrame(
    {
        "variant": snp_with_lai.variants_id,
        "cohort_af": cohort_af,
        "cohort_called": cohort_called,
        "ANC1_af": anc1_af,
        "ANC1_called": anc1_called,
        "abs_shift": np.abs(anc1_af - cohort_af),
    }
)

lai_summary.sort_values("abs_shift", ascending=False).head(8).round(3)

	variant	cohort_af	cohort_called	ANC1_af	ANC1_called	abs_shift
11	rs_obj_00012	0.375	16	1.000	6	0.625
7	rs_obj_00008	0.667	12	1.000	4	0.333
13	rs_obj_00014	0.312	16	0.000	3	0.312
16	rs_obj_00017	0.500	16	0.800	5	0.300
17	rs_obj_00018	0.312	16	0.500	6	0.188
3	rs_obj_00004	0.688	16	0.857	7	0.170
5	rs_obj_00006	0.750	12	0.600	5	0.150
9	rs_obj_00010	0.125	16	0.000	3	0.125

Takeaways¶

SNPObject is most useful when treated as an analysis container, not just a genotype array.
Working on copies makes metadata edits and export transforms safer.
dosage() and missing-aware summed genotypes are the bridge from phased haplotypes to matrix-style downstream workflows.
Filtering and sample reordering are easier to check when each step creates a clear new object.
Once metadata and LAI are attached, the same object can answer whole-cohort, group-level, and ancestry-masked questions.