Principal Component Analysis (PCA)¶

Principal Component Analysis (PCA) summarizes genome-wide genotype variation into a small number of axes called principal components (PCs). In population genetics, the first few PCs often capture broad ancestry structure, sample outliers, or batch effects.

This tutorial shows how to:

Load a representative subset of 1000 Genomes Project samples with population metadata
Perform PCA with the scikit-learn CPU backend using low-rank SVD
Perform PCA with the PyTorch backend using exact SVD on a GPU when available
Calculate PC coordinates in one step or compute eigenvectors first and then project samples
Visualize PC coordinates and color samples by population metadata

from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

from snputils.datasets import load_dataset
from snputils.processing import PCA, embedding_dataframe_from_model
from snputils.visualization import plot_embedding

Load a 1000 Genomes subset¶

The code below loads 10 populations from the 1000 Genomes Project Phase 3 panel, taking up to 50 samples per population and 100,000 SNPs.

The selected population labels are stored on the returned SNPObject as sample_fid, aligned to snpobj.samples. The genotype array is kept as two strands per individual (sum_strands=False); the PCA examples below average those strands into one genotype dosage per sample.

DATA_DIR = Path("data/1kgp")

populations = [
    "YRI", "MSL", "GWD", "CEU", "GBR", 
    "IBS", "CHB", "JPT", "GIH", "PEL",
]

# Once the data is downloaded in DATA_DIR, loading it should only take less than 2 minutes.
snpobj = load_dataset(
    "1kgp",
    download_genotypes=True,
    resource="phase3",
    output_dir=DATA_DIR,
    populations=populations,
    samples_per_population=50,
    max_variants=100_000,
    require_biallelic=True,
    require_complete=True,
    require_polymorphic=True,
    snv_only=True,
    sum_strands=False,
)

print(f"Loaded {snpobj.n_samples:,} samples and {snpobj.n_snps:,} variants")
print(f"Genotype array shape: {snpobj.genotypes.shape}")

Loaded 500 samples and 100,000 variants
Genotype array shape: (100000, 500, 2)

Create a metadata table that will be reused when plotting PC coordinates. Each row represents one individual.

sample_metadata = pd.DataFrame({
    "sample": snpobj.samples,
    "population": snpobj.sample_fid,
    "sex": snpobj.sample_sex,
})

sample_metadata

	sample	population	sex
0	HG00096	GBR	M
1	HG00097	GBR	F
2	HG00099	GBR	F
3	HG00100	GBR	F
4	HG00101	GBR	M
...	...	...	...
495	NA20902	GIH	F
496	NA20903	GIH	M
497	NA20904	GIH	M
498	NA20905	GIH	M
499	NA20906	GIH	F

500 rows × 3 columns

Build tables and plots with snputils utilities¶

PCA.fit_transform() and PCA.transform() store projected coordinates. snputils provides embedding_dataframe_from_model() to convert those coordinates into a table with PC1, PC2, etc., and sample identifiers, and plot_embedding() to make metadata-colored scatter plots from that table.

PCA on CPU with low-rank SVD¶

The sklearn backend always runs on CPU. Setting fitting="lowrank" uses a randomized low-rank SVD, which is often a practical choice when the goal is to calculate the first few PCs from a large genotype matrix.

Here fit_transform() computes the eigenvectors and immediately projects the same samples onto the requested PCs.

%%time
pca_cpu = PCA(
    backend="sklearn",
    n_components=10,
    fitting="lowrank",
    average_strands=True,
)

coords_cpu = pca_cpu.fit_transform(snpobj)
coords_cpu_df = embedding_dataframe_from_model(pca_cpu, metadata=sample_metadata)
coords_cpu_df

CPU times: user 3min 35s, sys: 252 ms, total: 3min 35s
Wall time: 16.3 s

	indID	method	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10	population	sex
0	HG00096	pca	-5.445668	-13.814376	-2.301814	-1.244111	-6.680630	-3.717565	-0.317790	-1.471880	5.344683	-3.964763	GBR	M
1	HG00097	pca	-9.220740	-10.856742	-1.274107	2.944276	8.447960	-4.819750	-8.208855	-5.760562	-2.643495	2.121280	GBR	F
2	HG00099	pca	-10.290048	-12.828224	5.822768	-7.014384	-6.646201	-3.529025	-5.580506	-0.436918	10.338888	2.512376	GBR	F
3	HG00100	pca	-8.672443	-11.465423	-1.752690	-3.504632	0.428147	6.366216	4.551703	2.242306	-2.266150	2.665156	GBR	F
4	HG00101	pca	-10.551410	-7.101886	1.762160	-4.699048	-2.456534	-9.979330	3.919668	-4.201518	-0.775524	-2.900219	GBR	M
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
495	NA20902	pca	-9.311613	2.114745	-4.715104	-2.613610	14.008947	-5.364798	6.982087	2.001102	-2.000129	12.701165	GIH	F
496	NA20903	pca	-6.888755	-8.312722	-4.942655	0.221704	2.450688	4.242626	-5.060560	4.646278	6.622268	-2.037378	GIH	M
497	NA20904	pca	-9.226455	-4.702215	-1.251491	-4.237715	-0.598488	4.642883	-1.589584	-0.740125	4.006232	1.004840	GIH	M
498	NA20905	pca	-8.176906	-1.238093	-3.162771	-5.514153	5.755089	3.472307	5.553517	-4.196957	7.590596	10.594611	GIH	M
499	NA20906	pca	-7.501036	0.409011	-3.043834	-8.882437	5.112177	9.716062	-3.419470	-12.553329	-5.402270	8.496523	GIH	F

500 rows × 14 columns

plot_embedding(coords_cpu_df, hue="population")
plt.show()

../_images/2b40d65732bb63eae0d8b5423c6dd7122fa04f511b576ef9e9df5127b20f4963.png

Compute eigenvectors, then project samples¶

Sometimes it is useful to separate the two steps:

fit() computes the PCA eigenvectors from a set of samples and variants.
transform() projects genotype data onto those PCs, and you can also project new data which was not used to compute the eigenvectors.

The example below computes eigenvectors from the loaded dataset and then projects the samples.

pca_two_step = PCA(
    backend="sklearn",
    n_components=4,
    fitting="lowrank",
    average_strands=True,
)

pca_two_step.fit(snpobj)
coords_projected = pca_two_step.transform(snpobj)
coords_projected_df = embedding_dataframe_from_model(pca_two_step, metadata=sample_metadata)

coords_projected_df

	indID	method	PC1	PC2	PC3	PC4	population	sex
0	HG00096	pca	-5.445668	-13.814338	-2.253621	-1.215476	GBR	M
1	HG00097	pca	-9.220740	-10.856771	-1.279904	2.900660	GBR	F
2	HG00099	pca	-10.290048	-12.828216	5.841178	-7.012531	GBR	F
3	HG00100	pca	-8.672443	-11.465432	-1.751666	-3.533308	GBR	F
4	HG00101	pca	-10.551410	-7.101853	1.827255	-4.676792	GBR	M
...	...	...	...	...	...	...	...	...
495	NA20902	pca	-9.311613	2.114709	-4.768888	-2.651646	GIH	F
496	NA20903	pca	-6.888755	-8.312735	-4.990218	0.221016	GIH	M
497	NA20904	pca	-9.226455	-4.702213	-1.256812	-4.232951	GIH	M
498	NA20905	pca	-8.176906	-1.238123	-3.207110	-5.533290	GIH	M
499	NA20906	pca	-7.501036	0.409055	-3.051526	-8.825791	GIH	F

500 rows × 8 columns

After fit(), the principal axes are available in components_. Each row is one eigenvector, and each column corresponds to one variant used in the PCA calculation.

print(f"Number of PCs computed: {pca_two_step.n_components_}")
print(f"Eigenvector matrix shape: {pca_two_step.components_.shape}")

Number of PCs computed: 4
Eigenvector matrix shape: (4, 100000)

PCA on GPU with exact SVD¶

The pytorch backend can run PCA on a CUDA GPU. Setting fitting="exact" uses an exact SVD. If CUDA is unavailable, device="cuda:0" falls back to CPU with a warning, so the code remains runnable on a laptop while still showing the GPU configuration.

Exact SVD can be useful when you want deterministic full-SVD behavior for a selected number of PCs and have enough GPU memory for the selected genotype matrix.

%%time
pca_gpu = PCA(
    backend="pytorch",
    n_components=10,
    fitting="exact",
    device="cuda:0",
    average_strands=True,
)

coords_gpu = pca_gpu.fit_transform(snpobj)
coords_gpu_df = embedding_dataframe_from_model(pca_gpu, metadata=sample_metadata)

print(f"PyTorch PCA device: {pca_gpu.device}")
coords_gpu_df

Converting data to PyTorch tensor on device cuda:0
PyTorch PCA device: cuda:0
CPU times: user 3.74 s, sys: 681 ms, total: 4.42 s
Wall time: 4.23 s

	indID	method	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10	population	sex
0	HG00096	pca	-5.445668	-13.814376	-2.302600	1.240161	-6.667645	-3.794875	-0.166289	1.317343	5.105976	-3.670369	GBR	M
1	HG00097	pca	-9.220740	-10.856746	-1.269752	-2.940157	8.468375	-4.897522	-7.814991	5.632787	-2.985812	2.463613	GBR	F
2	HG00099	pca	-10.290048	-12.828221	5.820676	7.014180	-6.644298	-3.812626	-5.505681	0.458062	10.347920	2.704303	GBR	F
3	HG00100	pca	-8.672443	-11.465423	-1.750988	3.513406	0.381993	6.500455	4.294257	-2.217475	-2.083600	2.720650	GBR	F
4	HG00101	pca	-10.551410	-7.101874	1.757486	4.687729	-2.473735	-10.157563	4.045995	4.234613	-0.680950	-2.863122	GBR	M
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
495	NA20902	pca	-9.311613	2.114750	-4.714011	2.615123	14.015300	-5.178370	7.097056	-1.831070	-1.849865	12.583123	GIH	F
496	NA20903	pca	-6.888755	-8.312712	-4.947268	-0.228888	2.459371	3.590458	-5.735509	-4.652354	6.701974	-2.289744	GIH	M
497	NA20904	pca	-9.226455	-4.702216	-1.251564	4.228626	-0.535143	4.553773	-1.794397	0.759041	4.000025	0.816303	GIH	M
498	NA20905	pca	-8.176906	-1.238097	-3.161562	5.514553	5.749235	3.756734	5.473292	4.269050	7.260847	10.549923	GIH	M
499	NA20906	pca	-7.501036	0.409024	-3.048236	8.867490	5.108280	9.334359	-3.876448	12.744956	-5.340466	8.151290	GIH	F

500 rows × 14 columns

plot_embedding(coords_gpu_df, hue="population")
plt.show()

Color PCA Plots by population groups¶

The population metadata can also be collapsed into broader groups for a less crowded plot. The grouping below is only for visualization; the PCA coordinates are the same.

population_group = {
    "YRI": "Africa", "LWK": "Africa", "MSL": "Africa", "GWD": "Africa",
    "CEU": "Europe", "FIN": "Europe", "GBR": "Europe", "IBS": "Europe",
    "CHB": "East Asia", "JPT": "East Asia",
    "GIH": "South Asia",
    "PEL": "Americas",
}

coords_cpu_df["population_group"] = coords_cpu_df["population"].map(population_group)
plot_embedding(coords_cpu_df, hue="population_group")
plt.show()

../_images/f12b31f77e26fcf2b8633bd52f4a1855ab791e7cf03559824ebfeeae89608d4b.png

Using sample and variant subsets¶

PCA can select rows and columns before computing PCs. samples_subset chooses samples by row index, and snps_subset chooses variants by column index. Passing an integer uses the first n rows or columns; passing a list uses the specified indices.

This is helpful for quick exploratory PCA calculations before scaling up to the full selected dataset.

%%time
pca_subset = PCA(
    backend="sklearn",
    n_components=4,
    fitting="lowrank",
    average_strands=True,
    samples_subset=200,
    snps_subset=20_000,
)

coords_subset = pca_subset.fit_transform(snpobj)
coords_subset_df = embedding_dataframe_from_model(pca_subset, metadata=sample_metadata)

print(f"Subset PC coordinate shape: {coords_subset.shape}")
plot_embedding(coords_subset_df, hue="population")
plt.show()

Subset PC coordinate shape: (200, 4)
CPU times: user 36.1 s, sys: 144 ms, total: 36.2 s
Wall time: 3.36 s

../_images/a9392c062357636c2b39aaed15eab5668559934be855b1201b9850d3120416e6.png

Save PC coordinates¶

If embedding_table_path is set, fit_transform() writes the projected PC coordinates to a TSV or CSV file. This is useful when PC coordinates will be used by another analysis or plotting script.

output_path = "pca_coordinates.tsv"

pca_export = PCA(
    backend="sklearn",
    n_components=4,
    fitting="lowrank",
    average_strands=True,
    embedding_table_path=output_path,
)
_ = pca_export.fit_transform(snpobj)

print(f"Wrote PCA coordinates to {output_path}")

Wrote PCA coordinates to pca_coordinates.tsv