Admixture Mapping¶

Admixture mapping tests whether local ancestry at a genomic region is associated with a phenotype in recently admixed individuals. Instead of asking whether a specific SNP allele is associated with a trait, we ask whether carrying ancestry from a source population at a local genomic window changes trait values after accounting for background ancestry and other covariates.

In this tutorial, we build a synthetic local ancestry dataset with a quantitative trait, run snputils.tools.run_admixture_mapping, and make a Manhattan plot from the resulting association table.

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from snputils.datasets import build_synthetic_admixture_dataset
from snputils.phenotype.genobj import CovariateObject, PhenotypeObject
from snputils.tools import run_admixture_mapping
from snputils.visualization import manhattan_plot, qq_plot

What the input represents¶

A local ancestry inference method usually produces one ancestry label per haplotype per genomic window. For a diploid sample, each window therefore has two ancestry labels: one for each haplotype. snputils stores that information in a LocalAncestryObject.

For each ancestry and each window, admixture mapping converts those two haplotype labels into an ancestry dosage: 0, 1, or 2 copies of the requested ancestry. The association model then tests whether that dosage is related to the phenotype.

Build a synthetic dataset¶

The synthetic trait below has a few sparse local ancestry effects plus global ancestry-like covariates, so the strongest signals should appear near the simulated causal windows while most tests remain null.

build_synthetic_admixture_dataset lives in snputils.datasets, and returns:

laiobj: local ancestry calls for each haplotype and window.
phenotype: a continuous trait generated from a few local ancestry effects, covariates, and noise.
covar_matrix: covariates that include genome-wide ancestry-like summaries.
effect_windows, effect_ancestries, and effect_sizes: the simulated ground truth used only for interpretation.

RESULTS_DIR = Path("results/admixture_mapping")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

PHENO_NAME = "TRAIT"

synthetic = build_synthetic_admixture_dataset(
    n_samples=2000,
    n_windows=22000,
    n_covariates=3,
    seed=42,
)

laiobj = synthetic["laiobj"]
sample_ids = synthetic["sample_ids"]
phenotype_values = synthetic["phenotype"]
covar_names = synthetic["covar_names"]
covar_matrix = synthetic["covar_matrix"]

print(laiobj)
print(f"Phenotype shape: {phenotype_values.shape}")
print(f"Covariate matrix shape: {covar_matrix.shape}")

LocalAncestryObject(shape=(22000, 4000), n_windows=22000, n_samples=2000, n_haplotypes=4000, n_ancestries=3, has_window_metadata=True, has_ancestry_map=True)
Phenotype shape: (2000,)
Covariate matrix shape: (2000, 3)

The simulated causal windows are shown below. In real data these columns do not exist; they are included here so we can sanity-check whether the analysis recovers the planted signal.

truth = pd.DataFrame({
    "window_index_0_based": synthetic["effect_windows"],
    "result_window_index_1_based": synthetic["effect_windows"] + 1,
    "chromosome": laiobj.chromosomes[synthetic["effect_windows"]],
    "start": laiobj.physical_pos[synthetic["effect_windows"], 0],
    "end": laiobj.physical_pos[synthetic["effect_windows"], 1],
    "ancestry_code": synthetic["effect_ancestries"],
    "ancestry": [laiobj.ancestry_map[str(code)] for code in synthetic["effect_ancestries"]],
    "effect_size": synthetic["effect_sizes"],
})
truth["expected_result_id"] = [
    f"w{window_idx}_{ancestry}"
    for window_idx, ancestry in zip(truth["result_window_index_1_based"], truth["ancestry"])
]
truth

	window_index_0_based	result_window_index_1_based	chromosome	start	end	ancestry_code	ancestry	effect_size	expected_result_id
0	2200	2201	3	5000001	5004999	0	AFR	0.30	w2201_AFR
1	12099	12100	13	2475001	2479999	1	EUR	-0.26	w12100_EUR
2	21999	22000	22	24975001	24979999	2	AMR	0.22	w22000_AMR

Create phenotype and covariate inputs¶

run_admixture_mapping accepts file paths or in-memory snputils objects. Here we use an in-memory PhenotypeObject, LocalAncestryObject, and CovariateObject.

Covariates are important in real admixture mapping. Local ancestry is correlated across the genome and with global ancestry; without adjustment, a local window can look associated because it tags broader ancestry structure rather than a region-specific effect.

phenotype = PhenotypeObject(
    samples=sample_ids,
    values=phenotype_values,
    phenotype_name=PHENO_NAME,
    quantitative=True,
)

covariates = CovariateObject(
    samples=sample_ids,
    values=covar_matrix,
    covariate_names=covar_names,
)

pd.DataFrame(covariates.values, index=covariates.samples, columns=covariates.covariate_names)

	COV1	COV2	COV3
A00000	-0.330546	0.634283	0.236212
A00001	-0.651705	1.062672	0.336365
A00002	-0.275401	-0.115268	-0.084457
A00003	-0.753692	1.544846	0.866777
A00004	0.401226	0.367596	-0.915346
...	...	...	...
A01995	-0.921008	0.516885	0.657766
A01996	-1.158979	2.452996	2.187145
A01997	1.127358	-1.241836	0.632129
A01998	-0.219785	-0.030625	0.777080
A01999	0.640294	0.621354	0.687112

2000 rows × 3 columns

Run admixture mapping¶

The API uses lai_source for either an MSP path or a LocalAncestryObject.

For a quantitative trait, snputils fits a linear model per window and ancestry. For a binary trait, it fits a logistic model. The returned table includes p-values, effect estimates, standard errors, test type, ancestry labels, chromosome, position, and multiple-testing adjusted p-values when adjust=True.

results_path = RESULTS_DIR / "synthetic_admixture_mapping.tsv.gz"

results = run_admixture_mapping(
    phe_path=phenotype,
    lai_source=laiobj,
    results_path=results_path,
    batch_size=256,
    quantitative=True,
    covar=covariates,
    covar_variance_standardize=True,
    ci=0.95,
    adjust=True,
    keep_hla=True,
    return_results=True,
    verbose=True,
)

results

Reading LAI source (LocalAncestryObject)...
  Chunk 0: 256 windows (total so far: 256)
  Done. Processed 22,000 windows.

	#CHROM	POS	END	ID	REF	ALT	A1	ANCESTRY	TEST	OBS_CT	BETA	SE	T_STAT	P	L95	U95	BONF	FDR_BH	ERRCODE
0	1	1	4999	w1_AFR	N	AFR	AFR	AFR	LINEAR	2000	-0.171130	0.087126	-1.964177	4.964817e-02	-0.341997	-0.000263	1.000000	0.949899	.
1	1	25001	29999	w2_AFR	N	AFR	AFR	AFR	LINEAR	2000	-0.214909	0.086942	-2.471884	1.352322e-02	-0.385415	-0.044404	1.000000	0.869918	.
2	1	50001	54999	w3_AFR	N	AFR	AFR	AFR	LINEAR	2000	-0.223219	0.086726	-2.573841	1.012925e-02	-0.393302	-0.053136	1.000000	0.852834	.
3	1	75001	79999	w4_AFR	N	AFR	AFR	AFR	LINEAR	2000	-0.237153	0.086625	-2.737707	6.241845e-03	-0.407038	-0.067269	1.000000	0.761482	.
4	1	100001	104999	w5_AFR	N	AFR	AFR	AFR	LINEAR	2000	-0.233500	0.086671	-2.694094	7.117167e-03	-0.403476	-0.063525	1.000000	0.792420	.
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
65995	22	24875001	24879999	w21996_AMR	N	AMR	AMR	AMR	LINEAR	2000	0.442430	0.098918	4.472673	8.159219e-06	0.248436	0.636424	0.538508	0.024478	.
65996	22	24900001	24904999	w21997_AMR	N	AMR	AMR	AMR	LINEAR	2000	0.490311	0.100017	4.902289	1.023791e-06	0.294163	0.686459	0.067570	0.006757	.
65997	22	24925001	24929999	w21998_AMR	N	AMR	AMR	AMR	LINEAR	2000	0.483034	0.099555	4.851932	1.317446e-06	0.287791	0.678277	0.086951	0.007723	.
65998	22	24950001	24954999	w21999_AMR	N	AMR	AMR	AMR	LINEAR	2000	0.509389	0.100400	5.073602	4.265379e-07	0.312489	0.706288	0.028152	0.004685	.
65999	22	24975001	24979999	w22000_AMR	N	AMR	AMR	AMR	LINEAR	2000	0.534291	0.099897	5.348402	9.891250e-08	0.338377	0.730205	0.006528	0.003264	.

66000 rows × 19 columns

Inspect the result table¶

Each row is one ancestry-specific test at one local ancestry window. P is the nominal association p-value. BETA is the estimated change in the quantitative trait per additional copy of that local ancestry, after covariate adjustment.

result_cols = [
    "#CHROM", "POS", "ID", "ANCESTRY", "TEST", "OBS_CT", "BETA", "SE", "P", "ERRCODE"
]
optional_cols = ["L95", "U95", "BONF", "FDR_BH"]
show_cols = [col for col in result_cols + optional_cols if col in results.columns]

results[show_cols].sort_values("P").head(10)

	#CHROM	POS	ID	ANCESTRY	TEST	OBS_CT	BETA	SE	P	ERRCODE	L95	U95	BONF	FDR_BH
6296	3	5000001	w2201_AFR	AFR	LINEAR	2000	0.464202	0.082196	1.861255e-08	.	0.303003	0.625400	0.001228	0.001228
65999	22	24975001	w22000_AMR	AMR	LINEAR	2000	0.534291	0.099897	9.891250e-08	.	0.338377	0.730205	0.006528	0.003264
6299	3	5075001	w2204_AFR	AFR	LINEAR	2000	0.423219	0.082736	3.432297e-07	.	0.260962	0.585477	0.022653	0.004685
6297	3	5025001	w2202_AFR	AFR	LINEAR	2000	0.420844	0.082709	3.952853e-07	.	0.258639	0.583049	0.026089	0.004685
65998	22	24950001	w21999_AMR	AMR	LINEAR	2000	0.509389	0.100400	4.265379e-07	.	0.312489	0.706288	0.028152	0.004685
6305	3	5225001	w2210_AFR	AFR	LINEAR	2000	0.422560	0.083664	4.803442e-07	.	0.258482	0.586638	0.031703	0.004685
6295	3	4975001	w2200_AFR	AFR	LINEAR	2000	0.414107	0.082419	5.499524e-07	.	0.252471	0.575743	0.036297	0.004685
6302	3	5150001	w2207_AFR	AFR	LINEAR	2000	0.416082	0.082915	5.678927e-07	.	0.253473	0.578690	0.037481	0.004685
6306	3	5250001	w2211_AFR	AFR	LINEAR	2000	0.414821	0.083673	7.738177e-07	.	0.250726	0.578917	0.051072	0.005675
65996	22	24900001	w21997_AMR	AMR	LINEAR	2000	0.490311	0.100017	1.023791e-06	.	0.294163	0.686459	0.067570	0.006757

Manhattan plot¶

A Manhattan plot places association strength along genomic position. Each point is one local ancestry window for a single ancestry-specific test. Taller points have smaller p-values. The dashed line is a Bonferroni threshold, a strict correction that divides the chosen significance level by the number of tests in that ancestry-specific plot.

plot_df = results.loc[results["P"].notna() & (results["P"] > 0), ["#CHROM", "POS", "P", "ANCESTRY"]].copy()

for ancestry in sorted(plot_df["ANCESTRY"].unique()):
    ancestry_df = plot_df.loc[plot_df["ANCESTRY"] == ancestry, ["#CHROM", "POS", "P"]].copy()
    manhattan_plot(ancestry_df, title=f"{ancestry} local ancestry")

../_images/c1e7e8d4d1c69872244dbeab83f3efa81b902634ca74a7e1acafcaaff52e2237.png

../_images/358d687e91370b95a6d1b3bf5578aff5b1031dac7374c6848b7f568d6e133ba2.png

../_images/d36eb1a7bed7158eec56682ab3c4740ce2580abbf4138453c51bd1f2379a1ffb.png

Q-Q plot¶

Q-Q plots help check whether the p-value distribution looks well calibrated under the null and highlight departures consistent with true signal or residual confounding.

for ancestry in sorted(plot_df["ANCESTRY"].unique()):
    ancestry_df = plot_df.loc[plot_df["ANCESTRY"] == ancestry, ["P"]].copy()
    qq_plot(ancestry_df, title=f"{ancestry} local ancestry")

../_images/32e6cadc047ed363237600168e01d844c820a89984595382db206962c4f60a8e.png

../_images/e5850b1ffc11da299b4d2286c7a7267fcb31e759df3fc6f728dc47493d99fe0c.png

../_images/e7a9c006d83a14bfdd9d7e27fa380f7ce4a1ff10367f502dc2de2618630ac267.png