Missing-data PCA (mdPCA)

Missing-data PCA is useful when genotype matrices are too incomplete for ordinary PCA assumptions. Ancient DNA is a common example: every individual can have sparse coverage, and there may be no fully observed reference individuals. This notebook masks 70% of genotypes at random for every individual, then compares ordinary PCA fallbacks with mdPCA.

Build a structured genotype dataset

The synthetic dataset has 200 individuals and 1,000 SNPs. The population labels are geography-like source groups; all samples, including the reference-like groups, will receive the same random missingness process.

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA as SklearnPCA
from sklearn.metrics import silhouette_score

from snputils import mdPCA
from snputils.datasets import build_synthetic_mdpca_dataset
from snputils.visualization import plot_embedding, scatter

RESULTS_DIR = Path("results/tutorials/mdpca")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
SEED = 20240520

dataset = build_synthetic_mdpca_dataset(
    n_samples=200,
    n_snps=1_000,
    seed=SEED,
)
snpobj = dataset["snpobj"]
laiobj = dataset["laiobj"]
labels = dataset["labels"]

print(snpobj)
labels["label"].value_counts().rename_axis("label").reset_index(name="n_samples")
SNPObject(shape=(1000, 200, 2), n_snps=1000, n_samples=200, genotypes_shape=(1000, 200, 2), calldata_lai_shape=(1000, 200, 2), calldata_gp_shape=None, has_variant_metadata=True, has_ancestry_map=True)
label n_samples
0 AFR_ref 40
1 EUR_ref 40
2 EAS_ref 40
3 ADMIXED_AFR_EUR 40
4 ADMIXED_EUR_EAS 40

Mask 70% of every individual

This is missing completely at random across SNPs, samples, and haplotypes.

rng = np.random.default_rng(SEED + 1)
missing_rate = 0.70

incomplete = snpobj.copy()
genotypes = snpobj.genotypes.copy()
missing = rng.random(genotypes.shape) < missing_rate
genotypes[missing] = -1
incomplete.genotypes = genotypes

observed_fraction = (genotypes >= 0).mean(axis=(0, 2))
complete_variant_mask = np.all(genotypes >= 0, axis=(1, 2))

print(f"Mean observed genotype fraction per individual: {observed_fraction.mean():.3f}")
print(f"Range across individuals: {observed_fraction.min():.3f} - {observed_fraction.max():.3f}")
print(f"Complete-case SNPs left for ordinary PCA: {complete_variant_mask.sum()} of {snpobj.n_snps}")
Mean observed genotype fraction per individual: 0.298
Range across individuals: 0.271 - 0.327
Complete-case SNPs left for ordinary PCA: 0 of 1000

Ordinary PCA fallback: zero-filled missing calls

Complete-case PCA has no SNPs left. A common but poor fallback is to fill missing genotypes with a fixed value and run ordinary PCA. Here missing diploid dosages are filled with 0, which strongly weakens the population structure.

def dosage_matrix(snp):
    gt = snp.genotypes
    dosage = gt.sum(axis=2).astype(float)
    dosage[np.any(gt < 0, axis=2)] = np.nan
    return dosage.T  # samples x SNPs

X_missing = dosage_matrix(incomplete)
X_zero_filled = np.nan_to_num(X_missing, nan=0.0)

ordinary_coords = SklearnPCA(n_components=2, random_state=SEED).fit_transform(X_zero_filled)
ordinary_df = labels.copy()
ordinary_df["PC1"] = ordinary_coords[:, 0]
ordinary_df["PC2"] = ordinary_coords[:, 1]

ordinary_silhouette = silhouette_score(ordinary_coords, ordinary_df["label"])
print(f"Population-label silhouette after zero-filled PCA: {ordinary_silhouette:.3f}")

plot_embedding(ordinary_df, hue="label", title="Ordinary PCA after zero-filling 70% missing genotypes")
plt.show()
Population-label silhouette after zero-filled PCA: 0.003
../_images/a1e9ed5f36a50de270ad3866905e8f278f0b8ba4549319c4949619342daa6c55.png

mdPCA on the same incomplete matrix

mdPCA computes the covariance structure while respecting missing values. Here is_masked=False means the example is about missing genotypes themselves, not ancestry-specific masking. Every individual still has about 70% missing calls.

mdpca = mdPCA(
    snpobj=incomplete,
    laiobj=laiobj,
    labels=labels,
    ancestry=1,
    is_masked=False,
    average_strands=True,
    min_percent_snps=1,
    group_snp_frequencies_only=False,
    n_components=2,
)

mdpca_silhouette = silhouette_score(mdpca.X_new_, labels["label"])
print(f"Population-label silhouette after mdPCA: {mdpca_silhouette:.3f}")
print("mdPCA embedding shape:", mdpca.X_new_.shape)
Population-label silhouette after mdPCA: 0.675
mdPCA embedding shape: (200, 2)
scatter(mdpca, labels)
plt.show()
../_images/81745048657f23e138bc367e325b967c2312b5c67957d3f173615e193a24be41.png