Analysis¶

Dimensionality Reduction¶

PCA¶

Standard PCA backed by scikit-learn or PyTorch. Accepts a SNPObject directly.

from snputils.processing import PCA, TorchPCA

pca = PCA(n_components=10)
pca.fit_transform(snpobj)

pca.X_new_    # (n_samples, n_components) embedding
pca.samples_  # sample IDs aligned with embedding rows

# GPU-accelerated via PyTorch (requires torch)
torch_pca = TorchPCA(n_components=10, device="cuda")
torch_pca.fit_transform(snpobj)

mdPCA¶

PCA with missing-data support; ancestry-specific masking replaces non-target LAI segments with NaN.

from snputils.processing import mdPCA

mdpca = mdPCA(
    snpobj=snpobj,
    laiobj=laiobj,
    labels_file="labels.tsv",  # columns: indID, label
    ancestry="AFR",
    is_masked=True,
    average_strands=False,
    n_components=2,
    embedding_table_path="mdpca_coords.tsv",  # optional TSV export
)
# fit_transform runs automatically when all four arguments are passed at construction

mdpca.fit_transform(snpobj, laiobj, "labels.tsv", ancestry="AFR")
mdpca.X_new_   # embedding

maasMDS¶

Multi-array, ancestry-specific multidimensional scaling. Distances are computed on ancestry-masked genotypes; multiple SNP arrays are harmonized to a shared reference and linearly calibrated so embeddings are comparable.

from snputils.processing import maasMDS

mds = maasMDS(
    snpobj=[snpobj_array1, snpobj_array2],   # single object or list
    laiobj=[laiobj_array1, laiobj_array2],
    labels_file="labels.tsv",
    ancestry=0,
    is_masked=True,
    n_components=2,
)

mds.X_new_         # (n_samples, n_components)
mds.array_labels_  # array index per row

Saving Embeddings¶

from snputils.processing import save_embedding_table, embedding_dataframe_from_model

df = embedding_dataframe_from_model(pca)         # returns a DataFrame
save_embedding_table(pca, "coords.tsv")          # writes TSV

Allele Frequencies¶

import numpy as np
from snputils.stats import allele_freq_stream

# Streaming computation — does not load all genotypes at once
for chunk in allele_freq_stream("cohort.pgen", chunk_size=50_000):
    freqs = chunk   # (chunk_size, n_pops)

# Ancestry-specific frequencies
all_freqs = []
for chunk in allele_freq_stream(
    snpobj,
    sample_labels=labels,
    ancestry="EUR",
    laiobj=laiobj,
    chunk_size=50_000,
):
    all_freqs.append(chunk)
freqs = np.concatenate(all_freqs, axis=0)

F-Statistics¶

All functions accept either a SNPObject or a pre-computed (afs, counts, pops) tuple. Block-jackknife standard errors are computed by default.

from snputils.stats import f2, f3, f4, f4_ratio, d_stat, fst

# f2: pairwise branch length between populations
df = f2(snpobj, sample_labels=labels)

# f3: test whether C is admixed between A and B  (f3(A, B; C))
df = f3(snpobj, a=["EUR"], b=["AFR"], c=["AMR"], sample_labels=labels)

# f4: test for gene flow  (f4(A, B; C, D))
df = f4(snpobj, a=["EUR"], b=["AFR"], c=["EAS"], d=["AMR"], sample_labels=labels)

# f4-ratio: admixture proportion estimate
df = f4_ratio(snpobj, sample_labels=labels)

# D-statistic
df = d_stat(snpobj, a=["EUR"], b=["AFR"], c=["EAS"], d=["AMR"], sample_labels=labels)

# FST (Hudson, Weir-Cockerham, or Tsallis)
df = fst(snpobj, method="hudson", sample_labels=labels)
df = fst(snpobj, method="weir_cockerham", sample_labels=labels)

All statistics support ancestry-specific computation:

df = f2(snpobj, sample_labels=labels, ancestry="AFR", laiobj=laiobj)

GWAS and Admixture Mapping¶

import snputils as su

# GWAS: phenotype + genotype paths or in-memory objects
results = su.run_gwas(phen, snpobj)

# Admixture mapping: phenotype + local ancestry path or object
results = su.run_admixture_mapping(phen, laiobj)

Optional covariates go on the covar argument as a file path or CovariateObject. Covariate files are whitespace-separated tables with an IID column and numeric columns after it (for example age, sex). A column named SEX accepts M/F or MALE/FEMALE and is coded as 1/2.

For typical GWAS and admixture-mapping models, build covariates from PCs, global ancestry proportions, and clinical variables, then merge:

pc_covar = su.CovariateObject.from_embedding(pca, n_components=10)
anc_covar = su.CovariateObject.from_global_ancestry(admobj)
clinical = su.CovariateObject.from_file("covariates.txt")
covar = su.CovariateObject.merge(pc_covar, clinical)

results = su.run_gwas(phen, snpobj, covar=covar)
results = su.run_admixture_mapping(phen, laiobj, covar=covar)

from_global_ancestry drops the last ancestry column by default. from_embedding requires sample-level coordinates (average_strands=True on phased PCA). The same blocks can be composed with build_association_covariates().

File-only or manual construction still works:

results = su.run_gwas(phen, snpobj, covar="covariates.txt")

Both return a DataFrame with per-variant statistics (p-values, effect sizes) suitable for Manhattan/QQ plotting. See Association and Utilities for full signatures and Command-Line Interface for file-backed CLI equivalents.

Simulation¶

OnlineSimulator generates admixed haplotypes from real reference panels using crossover models.

from snputils.simulation.simulator import OnlineSimulator
import pandas as pd

meta = pd.read_csv("metadata.tsv", sep="\t")   # columns: Sample, Population, Latitude, Longitude
sim = OnlineSimulator(
    snp_data=snpobj,
    meta=meta,
    genetic_map=gmap_df,   # columns: chm, pos, cM  (optional)
    window_size=1000,
    make_haploid=True,
)

snps, labels_d, labels_c, changepoints = sim.simulate(
    batch_size=256,
    num_generation_max=10,
    device="cuda",         # or 'cpu'
)
# snps:       (batch_size, n_snps)  tensor
# labels_d:   (batch_size, n_windows)  discrete population labels
# labels_c:   (batch_size, n_windows, 3)  continuous lat/lon n-vectors
# changepoints: (batch_size, n_windows)  ancestry change-point mask

Via CLI:

snputils simulate \
    --snp cohort.pgen \
    --metadata metadata.tsv \
    --genetic-map gmap.tsv \
    --output-dir sim_batches/ \
    --batch-size 256 \
    --n-batches 10 \
    --num-generations 10

Command-Line Interface¶

The snputils CLI mirrors these analyses for file-backed workflows. Subcommands include pca, mdpca, maasmds, gwas, admixture-map, simulate, plot-manhattan, and plot-qq.

snputils gwas \
    --phe-id trait \
    --phe-path phenotypes.tsv \
    --snp-path cohort.pgen \
    --results-path gwas.tsv.gz

snputils mdpca \
    --snp-path cohort.pgen \
    --lai-path local_ancestry.msp \
    --labels-file labels.tsv \
    --ancestry AFR \
    --coords mdpca_coords.tsv \
    --plot mdpca.pdf

Full subcommand reference and additional examples: Command-Line Interface.