SNPObject¶
SNPObject is the main object you will use to hold genotype data in snputils. It stores genotype calls together with sample metadata, variant metadata, and optional SNP-level local ancestry.
This notebook walks through a small SNPObject from inspection to filtering, sample alignment, allele frequencies, and local ancestry. We will cover how to:
inspect the structure of a phased
SNPObjectmake dosage and QC tables from phased genotypes
subset variants and samples for downstream analysis
realign samples to an external table
normalize chromosome naming and prepare human-readable genotype table views
attach local ancestry and compute ancestry-aware summaries
The examples use a synthetic dataset so the notebook runs without external files.
import numpy as np
import pandas as pd
from IPython.display import display
from snputils.datasets import build_synthetic_snp_dataset
SEED = 20240521
rng = np.random.default_rng(SEED)
snpobj = build_synthetic_snp_dataset(
n_samples=8,
n_snps=18,
seed=SEED,
n_populations=3,
missing_rate=0.0,
phased=True,
chromosome="chr7",
variant_prefix="rs_obj",
)
# Add a few hand-crafted variants so later examples are interpretable.
snpobj.genotypes[2, :, :] = 0
snpobj.genotypes[5, [1, 6], :] = -1
snpobj.genotypes[7, [2, 7], :] = -1
snpobj.genotypes[11, :, :] = 0
snpobj.genotypes[11, [0, 3, 4], :] = 1
snpobj.genotypes[14, [2, 4], :] = -1
sample_table = pd.DataFrame(
{
"sample": snpobj.samples.astype(str),
"population": snpobj.sample_fid.astype(str),
"sex_code": snpobj.sample_sex,
"study_arm": np.where(np.arange(snpobj.n_samples) % 2 == 0, "arm_A", "arm_B"),
}
)
variant_table = pd.DataFrame(
{
"variant": snpobj.variants_id.astype(str),
"chrom": snpobj.variants_chrom.astype(str),
"pos": snpobj.variants_pos,
"ref": snpobj.variants_ref.astype(str),
"alt": snpobj.variants_alt.astype(str),
}
)
display(sample_table)
display(variant_table)
| sample | population | sex_code | study_arm | |
|---|---|---|---|---|
| 0 | S01 | POP_A | 1 | arm_A |
| 1 | S02 | POP_B | 2 | arm_B |
| 2 | S03 | POP_C | 1 | arm_A |
| 3 | S04 | POP_A | 2 | arm_B |
| 4 | S05 | POP_B | 1 | arm_A |
| 5 | S06 | POP_C | 2 | arm_B |
| 6 | S07 | POP_A | 1 | arm_A |
| 7 | S08 | POP_B | 2 | arm_B |
| variant | chrom | pos | ref | alt | |
|---|---|---|---|---|---|
| 0 | rs_obj_00001 | chr7 | 1000 | A | G |
| 1 | rs_obj_00002 | chr7 | 2000 | C | T |
| 2 | rs_obj_00003 | chr7 | 3000 | G | A |
| 3 | rs_obj_00004 | chr7 | 4000 | T | C |
| 4 | rs_obj_00005 | chr7 | 5000 | A | G |
| 5 | rs_obj_00006 | chr7 | 6000 | C | T |
| 6 | rs_obj_00007 | chr7 | 7000 | G | A |
| 7 | rs_obj_00008 | chr7 | 8000 | T | C |
| 8 | rs_obj_00009 | chr7 | 9000 | A | G |
| 9 | rs_obj_00010 | chr7 | 10000 | C | T |
| 10 | rs_obj_00011 | chr7 | 11000 | G | A |
| 11 | rs_obj_00012 | chr7 | 12000 | T | C |
| 12 | rs_obj_00013 | chr7 | 13000 | A | G |
| 13 | rs_obj_00014 | chr7 | 14000 | C | T |
| 14 | rs_obj_00015 | chr7 | 15000 | G | A |
| 15 | rs_obj_00016 | chr7 | 16000 | T | C |
| 16 | rs_obj_00017 | chr7 | 17000 | A | G |
| 17 | rs_obj_00018 | chr7 | 18000 | C | T |
Inspect the container¶
A phased SNPObject stores genotypes with shape (n_snps, n_samples, 2), where the last axis is the two haplotypes. The object also keeps sample-level and variant-level arrays aligned to those axes.
print(snpobj)
print("shape:", snpobj.shape)
print("n_samples:", snpobj.n_samples)
print("n_snps:", snpobj.n_snps)
print("unique chromosomes:", snpobj.unique_chrom.tolist())
print("are strands summed?:", snpobj.are_strands_summed)
print("available keys:", snpobj.keys())
SNPObject(shape=(18, 8, 2), n_snps=18, n_samples=8, genotypes_shape=(18, 8, 2), calldata_lai_shape=None, calldata_gp_shape=None, has_variant_metadata=True, has_ancestry_map=False)
shape: (18, 8, 2)
n_samples: 8
n_snps: 18
unique chromosomes: ['chr7']
are strands summed?: False
available keys: ['genotypes', 'samples', 'variants_ref', 'variants_alt', 'variants_chrom', 'variants_cm', 'variants_filter_pass', 'variants_id', 'variants_pos', 'variants_qual', 'variants_info', 'calldata_lai', 'calldata_gp', 'ancestry_map', 'sample_fid', 'sample_sex']
The two most important alignment rules are:
arrays like
samples,sample_fid, andsample_sexmust stay aligned to the sample axis;arrays like
variants_id,variants_pos, andvariants_chrommust stay aligned to the SNP axis.
Dictionary-style access is convenient for programmatic workflows, while property access is the usual style in most analysis code. When you edit metadata, it is usually safer to work on a copy.
print("original sample IDs:", snpobj.samples[:4])
print("first variant IDs via dictionary-style access:", snpobj["variants_id"][:4])
renamed = snpobj.copy()
renamed.samples = np.array([f"analysis_{i:02d}" for i in range(renamed.n_samples)], dtype=object)
renamed["sample_fid"] = np.array([f"cluster_{i % 2}" for i in range(renamed.n_samples)], dtype=object)
print("copied sample IDs:", renamed.samples[:4])
print("original sample IDs still intact:", snpobj.samples[:4])
print("copied sample labels:", renamed.sample_fid[:4])
original sample IDs: ['S01' 'S02' 'S03' 'S04']
first variant IDs via dictionary-style access: ['rs_obj_00001' 'rs_obj_00002' 'rs_obj_00003' 'rs_obj_00004']
copied sample IDs: ['analysis_00' 'analysis_01' 'analysis_02' 'analysis_03']
original sample IDs still intact: ['S01' 'S02' 'S03' 'S04']
copied sample labels: ['cluster_0' 'cluster_1' 'cluster_0' 'cluster_1']
Dosage-style summaries from phased genotypes¶
Many downstream methods want alternate-allele dosage rather than two separate haplotypes. dosage("ALT") returns a 2D dosage matrix, and sum_strands() converts the whole object to summed genotypes. sum_strands() just sums the two haplotypes. If there are missing calls, clean those up before using the summed matrix for QC or modeling.
For phased genotype calls, we mask any sample with a missing haplotype before computing QC summaries. Before modeling, you can make a small variant-level QC table from the dosage matrix.
raw_dosage = snpobj.dosage("ALT").astype(float)
summed = snpobj.sum_strands()
if snpobj.genotypes.ndim == 3:
missing_call = np.any(snpobj.genotypes < 0, axis=2)
else:
missing_call = snpobj.genotypes < 0
alt_dosage = raw_dosage.copy()
alt_dosage[missing_call] = np.nan
called_samples = np.sum(~np.isnan(alt_dosage), axis=1)
variant_qc = variant_table.copy()
variant_qc["called_samples"] = called_samples
variant_qc["call_rate"] = called_samples / snpobj.n_samples
variant_qc["mean_alt_dosage"] = np.divide(
np.nansum(alt_dosage, axis=1),
called_samples,
out=np.full(snpobj.n_snps, np.nan),
where=called_samples > 0,
)
variant_qc["is_monomorphic_among_called"] = np.array(
[
called.size > 0 and np.unique(called).size <= 1
for called in (row[~np.isnan(row)] for row in alt_dosage)
]
)
print("ALT dosage shape:", alt_dosage.shape)
print("Summed genotype shape:", summed.genotypes.shape)
display(pd.DataFrame(alt_dosage, index=snpobj.variants_id, columns=snpobj.samples))
variant_qc.round({"call_rate": 3, "mean_alt_dosage": 3})
ALT dosage shape: (18, 8)
Summed genotype shape: (18, 8)
| S01 | S02 | S03 | S04 | S05 | S06 | S07 | S08 | |
|---|---|---|---|---|---|---|---|---|
| rs_obj_00001 | 2.0 | 2.0 | 1.0 | 2.0 | 1.0 | 0.0 | 2.0 | 2.0 |
| rs_obj_00002 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| rs_obj_00003 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| rs_obj_00004 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 0.0 | 1.0 |
| rs_obj_00005 | 1.0 | 2.0 | 1.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| rs_obj_00006 | 2.0 | NaN | 1.0 | 1.0 | 2.0 | 1.0 | NaN | 2.0 |
| rs_obj_00007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| rs_obj_00008 | 0.0 | 2.0 | NaN | 2.0 | 1.0 | 2.0 | 1.0 | NaN |
| rs_obj_00009 | 2.0 | 1.0 | 1.0 | 2.0 | 0.0 | 1.0 | 2.0 | 1.0 |
| rs_obj_00010 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| rs_obj_00011 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
| rs_obj_00012 | 2.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| rs_obj_00013 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| rs_obj_00014 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| rs_obj_00015 | 0.0 | 1.0 | NaN | 1.0 | NaN | 0.0 | 0.0 | 2.0 |
| rs_obj_00016 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| rs_obj_00017 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 |
| rs_obj_00018 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| variant | chrom | pos | ref | alt | called_samples | call_rate | mean_alt_dosage | is_monomorphic_among_called | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | rs_obj_00001 | chr7 | 1000 | A | G | 8 | 1.00 | 1.500 | False |
| 1 | rs_obj_00002 | chr7 | 2000 | C | T | 8 | 1.00 | 1.875 | False |
| 2 | rs_obj_00003 | chr7 | 3000 | G | A | 8 | 1.00 | 0.000 | True |
| 3 | rs_obj_00004 | chr7 | 4000 | T | C | 8 | 1.00 | 1.375 | False |
| 4 | rs_obj_00005 | chr7 | 5000 | A | G | 8 | 1.00 | 1.000 | False |
| 5 | rs_obj_00006 | chr7 | 6000 | C | T | 6 | 0.75 | 1.500 | False |
| 6 | rs_obj_00007 | chr7 | 7000 | G | A | 8 | 1.00 | 0.000 | True |
| 7 | rs_obj_00008 | chr7 | 8000 | T | C | 6 | 0.75 | 1.333 | False |
| 8 | rs_obj_00009 | chr7 | 9000 | A | G | 8 | 1.00 | 1.250 | False |
| 9 | rs_obj_00010 | chr7 | 10000 | C | T | 8 | 1.00 | 0.250 | False |
| 10 | rs_obj_00011 | chr7 | 11000 | G | A | 8 | 1.00 | 1.250 | False |
| 11 | rs_obj_00012 | chr7 | 12000 | T | C | 8 | 1.00 | 0.750 | False |
| 12 | rs_obj_00013 | chr7 | 13000 | A | G | 8 | 1.00 | 0.500 | False |
| 13 | rs_obj_00014 | chr7 | 14000 | C | T | 8 | 1.00 | 0.625 | False |
| 14 | rs_obj_00015 | chr7 | 15000 | G | A | 6 | 0.75 | 0.667 | False |
| 15 | rs_obj_00016 | chr7 | 16000 | T | C | 8 | 1.00 | 0.500 | False |
| 16 | rs_obj_00017 | chr7 | 17000 | A | G | 8 | 1.00 | 1.000 | False |
| 17 | rs_obj_00018 | chr7 | 18000 | C | T | 8 | 1.00 | 0.625 | False |
Variant filtering as a workflow¶
In practice, it is easier to filter in small, explicit steps. Here we:
take a positional slice of the chromosome;
remove variants with any missing genotypes;
remove monomorphic variants that carry no variation signal.
region_subset = snpobj.filter_variants(
chrom="chr7",
pos=snpobj.variants_pos[2:12],
)
complete_subset = region_subset.filter_complete_genotypes()
analysis_subset = complete_subset.filter_polymorphic_variants()
summary = pd.DataFrame(
{
"step": ["original", "regional subset", "complete calls", "polymorphic + complete"],
"n_snps": [snpobj.n_snps, region_subset.n_snps, complete_subset.n_snps, analysis_subset.n_snps],
"n_samples": [snpobj.n_samples, region_subset.n_samples, complete_subset.n_samples, analysis_subset.n_samples],
}
)
display(summary)
analysis_subset.variants_id
| step | n_snps | n_samples | |
|---|---|---|---|
| 0 | original | 18 | 8 |
| 1 | regional subset | 10 | 8 |
| 2 | complete calls | 8 | 8 |
| 3 | polymorphic + complete | 6 | 8 |
array(['rs_obj_00004', 'rs_obj_00005', 'rs_obj_00009', 'rs_obj_00010',
'rs_obj_00011', 'rs_obj_00012'], dtype=object)
Reorder samples to match an external table¶
A common preprocessing task is to realign genotype samples to a phenotype or covariate table. reorder=True makes the SNPObject follow the same sample order as the table.
phenotype_table = pd.DataFrame(
{
"sample": ["S05", "S02", "S08", "S01"],
"trait": [2.3, -0.7, 1.1, 0.4],
}
)
aligned = snpobj.filter_samples(samples=phenotype_table["sample"], reorder=True)
alignment_check = pd.DataFrame(
{
"phenotype_order": phenotype_table["sample"],
"aligned_samples": aligned.samples,
"aligned_population": aligned.sample_fid,
}
)
alignment_check
| phenotype_order | aligned_samples | aligned_population | |
|---|---|---|---|
| 0 | S05 | S05 | POP_B |
| 1 | S02 | S02 | POP_B |
| 2 | S08 | S08 | POP_B |
| 3 | S01 | S01 | POP_A |
Normalize chromosome naming and prepare a human-readable genotype table¶
Mixed chromosome naming conventions such as chr7 versus 7 are easy to overlook and can break joins or merges. For standard chromosome labels such as chr7 and 7, convert_chromosome_format() can normalize naming. For non-standard contigs or non-human assemblies, use pattern-based renaming with rename_chrom().
Another practical transformation is to make a display-friendly genotype table with missing-aware summed genotypes and "." as the missing-value token.
plain_chrom = snpobj.convert_chromosome_format("chr", "plain")
summed_gt = snpobj.genotypes.sum(axis=2)
missing_call = np.any(snpobj.genotypes < 0, axis=2)
summed_gt[missing_call] = -1
table_gt = summed_gt.astype(object)
table_gt[missing_call] = "."
genotype_table = pd.DataFrame(
table_gt,
index=snpobj.variants_id,
columns=snpobj.samples,
)
print("chromosome format before:", snpobj.detect_chromosome_format(), snpobj.variants_chrom[:3])
print("chromosome format after:", plain_chrom.detect_chromosome_format(), plain_chrom.variants_chrom[:3])
display(genotype_table)
chromosome format before: chr ['chr7' 'chr7' 'chr7']
chromosome format after: plain ['7' '7' '7']
| S01 | S02 | S03 | S04 | S05 | S06 | S07 | S08 | |
|---|---|---|---|---|---|---|---|---|
| rs_obj_00001 | 2 | 2 | 1 | 2 | 1 | 0 | 2 | 2 |
| rs_obj_00002 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 |
| rs_obj_00003 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| rs_obj_00004 | 2 | 1 | 2 | 2 | 1 | 2 | 0 | 1 |
| rs_obj_00005 | 1 | 2 | 1 | 2 | 0 | 1 | 0 | 1 |
| rs_obj_00006 | 2 | . | 1 | 1 | 2 | 1 | . | 2 |
| rs_obj_00007 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| rs_obj_00008 | 0 | 2 | . | 2 | 1 | 2 | 1 | . |
| rs_obj_00009 | 2 | 1 | 1 | 2 | 0 | 1 | 2 | 1 |
| rs_obj_00010 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| rs_obj_00011 | 1 | 0 | 1 | 1 | 1 | 2 | 2 | 2 |
| rs_obj_00012 | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 0 |
| rs_obj_00013 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| rs_obj_00014 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| rs_obj_00015 | 0 | 1 | . | 1 | . | 0 | 0 | 2 |
| rs_obj_00016 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 |
| rs_obj_00017 | 1 | 0 | 1 | 0 | 1 | 1 | 2 | 2 |
| rs_obj_00018 | 2 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
Grouped allele frequencies from object metadata¶
Because SNPObject already carries sample-level metadata, you can compute grouped allele frequencies directly from those labels. Here we compare two grouping schemes:
sample_fid, which acts like a population label in this synthetic dataset; in real PLINK-style data, FID may instead represent family ID;study_arm, a toy grouping variable that could be replaced by batch, site, or case-control status in real data.
pop_af, pop_called = snpobj.allele_freq(
sample_labels=snpobj.sample_fid,
return_counts=True,
as_dataframe=True,
)
arm_labels = sample_table["study_arm"].to_numpy()
arm_af, arm_called = snpobj.allele_freq(
sample_labels=arm_labels,
return_counts=True,
as_dataframe=True,
)
pop_summary = pd.concat(
[variant_table[["variant", "pos"]], pop_af.add_prefix("af_"), pop_called.add_prefix("called_")],
axis=1,
)
pop_summary["population_spread"] = pop_af.max(axis=1) - pop_af.min(axis=1)
arm_summary = pd.concat(
[variant_table[["variant", "pos"]], arm_af.add_prefix("af_"), arm_called.add_prefix("called_")],
axis=1,
)
arm_summary["arm_difference"] = (arm_af["arm_A"] - arm_af["arm_B"]).abs()
display(pop_summary.sort_values("population_spread", ascending=False).round(3))
arm_summary.sort_values("arm_difference", ascending=False).round(3)
| variant | pos | af_POP_A | af_POP_B | af_POP_C | called_POP_A | called_POP_B | called_POP_C | population_spread | |
|---|---|---|---|---|---|---|---|---|---|
| 17 | rs_obj_00018 | 18000 | 0.833 | 0.000 | 0.00 | 6 | 6 | 4 | 0.833 |
| 0 | rs_obj_00001 | 1000 | 1.000 | 0.833 | 0.25 | 6 | 6 | 4 | 0.750 |
| 14 | rs_obj_00015 | 15000 | 0.167 | 0.750 | 0.00 | 6 | 4 | 2 | 0.750 |
| 8 | rs_obj_00009 | 9000 | 1.000 | 0.333 | 0.50 | 6 | 6 | 4 | 0.667 |
| 11 | rs_obj_00012 | 12000 | 0.667 | 0.333 | 0.00 | 6 | 6 | 4 | 0.667 |
| 7 | rs_obj_00008 | 8000 | 0.500 | 0.750 | 1.00 | 6 | 4 | 2 | 0.500 |
| 5 | rs_obj_00006 | 6000 | 0.750 | 1.000 | 0.50 | 4 | 4 | 4 | 0.500 |
| 3 | rs_obj_00004 | 4000 | 0.667 | 0.500 | 1.00 | 6 | 6 | 4 | 0.500 |
| 13 | rs_obj_00014 | 14000 | 0.167 | 0.333 | 0.50 | 6 | 6 | 4 | 0.333 |
| 15 | rs_obj_00016 | 16000 | 0.333 | 0.333 | 0.00 | 6 | 6 | 4 | 0.333 |
| 1 | rs_obj_00002 | 2000 | 1.000 | 1.000 | 0.75 | 6 | 6 | 4 | 0.250 |
| 10 | rs_obj_00011 | 11000 | 0.667 | 0.500 | 0.75 | 6 | 6 | 4 | 0.250 |
| 12 | rs_obj_00013 | 13000 | 0.333 | 0.167 | 0.25 | 6 | 6 | 4 | 0.167 |
| 9 | rs_obj_00010 | 10000 | 0.167 | 0.167 | 0.00 | 6 | 6 | 4 | 0.167 |
| 6 | rs_obj_00007 | 7000 | 0.000 | 0.000 | 0.00 | 6 | 6 | 4 | 0.000 |
| 4 | rs_obj_00005 | 5000 | 0.500 | 0.500 | 0.50 | 6 | 6 | 4 | 0.000 |
| 2 | rs_obj_00003 | 3000 | 0.000 | 0.000 | 0.00 | 6 | 6 | 4 | 0.000 |
| 16 | rs_obj_00017 | 17000 | 0.500 | 0.500 | 0.50 | 6 | 6 | 4 | 0.000 |
| variant | pos | af_arm_A | af_arm_B | called_arm_A | called_arm_B | arm_difference | |
|---|---|---|---|---|---|---|---|
| 7 | rs_obj_00008 | 8000 | 0.333 | 1.000 | 6 | 6 | 0.667 |
| 4 | rs_obj_00005 | 5000 | 0.250 | 0.750 | 8 | 8 | 0.500 |
| 14 | rs_obj_00015 | 15000 | 0.000 | 0.500 | 4 | 8 | 0.500 |
| 17 | rs_obj_00018 | 18000 | 0.500 | 0.125 | 8 | 8 | 0.375 |
| 15 | rs_obj_00016 | 16000 | 0.125 | 0.375 | 8 | 8 | 0.250 |
| 11 | rs_obj_00012 | 12000 | 0.500 | 0.250 | 8 | 8 | 0.250 |
| 9 | rs_obj_00010 | 10000 | 0.250 | 0.000 | 8 | 8 | 0.250 |
| 16 | rs_obj_00017 | 17000 | 0.625 | 0.375 | 8 | 8 | 0.250 |
| 5 | rs_obj_00006 | 6000 | 0.833 | 0.667 | 6 | 6 | 0.167 |
| 1 | rs_obj_00002 | 2000 | 0.875 | 1.000 | 8 | 8 | 0.125 |
| 3 | rs_obj_00004 | 4000 | 0.625 | 0.750 | 8 | 8 | 0.125 |
| 13 | rs_obj_00014 | 14000 | 0.250 | 0.375 | 8 | 8 | 0.125 |
| 0 | rs_obj_00001 | 1000 | 0.750 | 0.750 | 8 | 8 | 0.000 |
| 2 | rs_obj_00003 | 3000 | 0.000 | 0.000 | 8 | 8 | 0.000 |
| 8 | rs_obj_00009 | 9000 | 0.625 | 0.625 | 8 | 8 | 0.000 |
| 6 | rs_obj_00007 | 7000 | 0.000 | 0.000 | 8 | 8 | 0.000 |
| 12 | rs_obj_00013 | 13000 | 0.250 | 0.250 | 8 | 8 | 0.000 |
| 10 | rs_obj_00011 | 11000 | 0.625 | 0.625 | 8 | 8 | 0.000 |
Attach local ancestry directly to the object¶
SNPObject can also carry SNP-level local ancestry in calldata_lai. In this example, we attach SNP-level local ancestry with the same (n_snps, n_samples, 2) layout as the phased genotypes. SNPObject can also represent SNP-level LAI in a flattened haplotype layout.
Ancestry masking can be done without leaving the object. In the ancestry-specific example below, ancestry=1 selects numeric ancestry code 1, which is labeled "ANC1" by ancestry_map.
snp_with_lai = snpobj.copy()
lai = rng.integers(0, 3, size=snp_with_lai.genotypes.shape, dtype=np.int8)
lai[11, :, :] = 0
lai[11, [0, 3, 4], :] = 1
snp_with_lai.calldata_lai = lai
snp_with_lai.ancestry_map = {"0": "ANC0", "1": "ANC1", "2": "ANC2"}
cohort_af, cohort_called = snp_with_lai.allele_freq(return_counts=True)
anc1_af, anc1_called = snp_with_lai.allele_freq(ancestry=1, return_counts=True)
lai_summary = pd.DataFrame(
{
"variant": snp_with_lai.variants_id,
"cohort_af": cohort_af,
"cohort_called": cohort_called,
"ANC1_af": anc1_af,
"ANC1_called": anc1_called,
"abs_shift": np.abs(anc1_af - cohort_af),
}
)
lai_summary.sort_values("abs_shift", ascending=False).head(8).round(3)
| variant | cohort_af | cohort_called | ANC1_af | ANC1_called | abs_shift | |
|---|---|---|---|---|---|---|
| 11 | rs_obj_00012 | 0.375 | 16 | 1.000 | 6 | 0.625 |
| 7 | rs_obj_00008 | 0.667 | 12 | 1.000 | 4 | 0.333 |
| 13 | rs_obj_00014 | 0.312 | 16 | 0.000 | 3 | 0.312 |
| 16 | rs_obj_00017 | 0.500 | 16 | 0.800 | 5 | 0.300 |
| 17 | rs_obj_00018 | 0.312 | 16 | 0.500 | 6 | 0.188 |
| 3 | rs_obj_00004 | 0.688 | 16 | 0.857 | 7 | 0.170 |
| 5 | rs_obj_00006 | 0.750 | 12 | 0.600 | 5 | 0.150 |
| 9 | rs_obj_00010 | 0.125 | 16 | 0.000 | 3 | 0.125 |
Takeaways¶
SNPObjectis most useful when treated as an analysis container, not just a genotype array.Working on copies makes metadata edits and export transforms safer.
dosage()and missing-aware summed genotypes are the bridge from phased haplotypes to matrix-style downstream workflows.Filtering and sample reordering are easier to check when each step creates a clear new object.
Once metadata and LAI are attached, the same object can answer whole-cohort, group-level, and ancestry-masked questions.