Phenotype Objects
snputils uses three phenotype-side containers:
- MultiPhenotypeObject for a table keyed by IID plus several trait columns.
- PhenotypeObject for one trait aligned to sample IDs.
- CovariateObject for numeric covariates aligned to the same samples.
This notebook uses the synthetic dataset helpers so every example is reproducible and can run without external files.
from pathlib import Path
import pandas as pd
from snputils.datasets import build_synthetic_phenotype_dataset
from snputils.phenotype import MultiPhenReader, PhenotypeObject, read_pheno
from snputils.tools import run_gwas
RESULTS_DIR = Path("results/tutorials")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
SEED = 20240520
Create a small aligned cohort
build_synthetic_phenotype_dataset() returns an SNPObject, a phenotype table, single-trait phenotype objects, and covariates that all use the same sample IDs. The table already uses the same IID convention as the phenotype readers, and the synthetic SNP data use complete diploid dosages so the examples can also be passed directly to run_gwas().
dataset = build_synthetic_phenotype_dataset(n_samples=48, n_snps=120, seed=SEED)
snpobj = dataset["snpobj"]
phen_df = dataset["phen_df"]
multi = dataset["multi_phenotype"]
quantitative = dataset["quantitative"]
binary = dataset["binary"]
covariates = dataset["covariates"]
shuffled_phen_df = dataset["shuffled_phen_df"]
print(multi)
print(binary)
print(quantitative)
print("effect variants:", dataset["effect_variant_ids"])
MultiPhenotypeObject(shape=(48, 7), n_samples=48, n_phenotypes=6, sample_column='IID')
PhenotypeObject(phenotype_name='TRAIT_BIN', shape=(48,), n_samples=48, trait_type='binary', n_cases=24, n_controls=24)
PhenotypeObject(phenotype_name='TRAIT_Q', shape=(48,), n_samples=48, trait_type='quantitative')
effect variants: ['rs_syn_00001', 'rs_syn_00060', 'rs_syn_00120']
phen_df.head()
| | IID | trait_quantitative | trait_binary_01 | trait_binary_12 | age | batch | sex |
|---|---|---|---|---|---|---|---|
| 0 | S01 | 0.4818 | 0 | 1 | 58 | 1 | 1 |
| 1 | S02 | -1.0665 | 1 | 2 | 49 | 2 | 2 |
| 2 | S03 | -0.4797 | 0 | 1 | 29 | 1 | 1 |
| 3 | S04 | -0.8982 | 0 | 1 | 30 | 2 | 2 |
| 4 | S05 | 0.2875 | 0 | 1 | 63 | 1 | 1 |
Multi-trait tables
MultiPhenotypeObject is a validated sample-aligned table. It requires a unique sample-ID column plus at least one phenotype column, and exposes the parsed sample IDs through samples.
This object is useful when you want to keep several columns together while selecting or reordering samples. Downstream association tools still expect a single PhenotypeObject, so you typically extract one trait column when you are ready to run an analysis.
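The two requirements above (a unique sample-ID column and at least one phenotype column) can be sketched in plain Python. This is only an illustration of the checks as described, not the snputils implementation; the helper name `validate_phenotype_table` and the list-of-dicts table layout are assumptions made for the example.

```python
def validate_phenotype_table(rows, sample_column="IID"):
    """Check a list-of-dicts table against the rules described in the text.

    Hypothetical helper for illustration; not part of snputils.
    """
    if not rows:
        raise ValueError("table is empty")
    columns = list(rows[0].keys())
    if sample_column not in columns:
        raise ValueError(f"missing sample column {sample_column!r}")
    # Every column other than the sample column counts as a phenotype column.
    phenotype_columns = [c for c in columns if c != sample_column]
    if not phenotype_columns:
        raise ValueError("need at least one phenotype column")
    samples = [str(row[sample_column]) for row in rows]
    if len(set(samples)) != len(samples):
        raise ValueError("sample IDs must be unique")
    return samples, phenotype_columns


table = [
    {"IID": "S01", "trait_quantitative": 0.4818, "age": 58},
    {"IID": "S02", "trait_quantitative": -1.0665, "age": 49},
]
samples, phenotype_columns = validate_phenotype_table(table)
print(samples)            # ['S01', 'S02']
print(phenotype_columns)  # ['trait_quantitative', 'age']
```

A table with a repeated IID, or with only the IID column, raises ValueError in this sketch.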
print("sample column:", multi.sample_column)
print("phenotype columns:", multi.phenotype_names)
selected = multi.filter_samples(samples=["S12", "S03", "S01"], reorder=True)
excluded = multi.filter_samples(indexes=[0, -1], include=False)
display(selected.phen_df)
display(excluded.phen_df.head())
sample column: IID
phenotype columns: ['trait_quantitative', 'trait_binary_01', 'trait_binary_12', 'age', 'batch', 'sex']
| | IID | trait_quantitative | trait_binary_01 | trait_binary_12 | age | batch | sex |
|---|---|---|---|---|---|---|---|
| 0 | S12 | 0.5951 | 1 | 2 | 62 | 2 | 2 |
| 1 | S03 | -0.4797 | 0 | 1 | 29 | 1 | 1 |
| 2 | S01 | 0.4818 | 0 | 1 | 58 | 1 | 1 |

| | IID | trait_quantitative | trait_binary_01 | trait_binary_12 | age | batch | sex |
|---|---|---|---|---|---|---|---|
| 0 | S02 | -1.0665 | 1 | 2 | 49 | 2 | 2 |
| 1 | S03 | -0.4797 | 0 | 1 | 29 | 1 | 1 |
| 2 | S04 | -0.8982 | 0 | 1 | 30 | 2 | 2 |
| 3 | S05 | 0.2875 | 0 | 1 | 63 | 1 | 1 |
| 4 | S06 | 0.7950 | 0 | 1 | 50 | 2 | 2 |
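
The two outputs above suggest how the selection arguments behave: reorder=True returns rows in the requested order, and include=False with indexes=[0, -1] drops the first and last rows. The sketch below mimics that behavior on a plain list of sample IDs. The helper `select_positions` is hypothetical, and the branch for include=True without reorder (keep the original table order) is an assumption that the outputs above do not confirm.

```python
def select_positions(all_samples, samples=None, indexes=None,
                     include=True, reorder=False):
    """Return row positions to keep. Hypothetical helper, not snputils code."""
    n = len(all_samples)
    if samples is not None:
        position = {s: i for i, s in enumerate(all_samples)}
        chosen = [position[s] for s in samples]
    elif indexes is not None:
        # Resolve negative indexes the way Python sequences do.
        chosen = [i if i >= 0 else n + i for i in indexes]
    else:
        raise ValueError("pass either samples= or indexes=")
    if not include:
        dropped = set(chosen)
        return [i for i in range(n) if i not in dropped]
    if reorder:
        return chosen  # keep the caller's order
    kept = set(chosen)
    return [i for i in range(n) if i in kept]  # assumed: original order


cohort = [f"S{i:02d}" for i in range(1, 49)]  # S01 ... S48
selected = [cohort[i] for i in select_positions(
    cohort, samples=["S12", "S03", "S01"], reorder=True)]
excluded = [cohort[i] for i in select_positions(
    cohort, indexes=[0, -1], include=False)]
print(selected)       # ['S12', 'S03', 'S01']
print(excluded[:5])   # ['S02', 'S03', 'S04', 'S05', 'S06']
print(len(excluded))  # 46
```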
Single-trait phenotype objects
PhenotypeObject stores one numeric vector plus sample IDs. The constructor enforces unique sample IDs, finite numeric values, and one of two trait modes:
- Quantitative traits keep their numeric values and must have non-zero variance.
- Binary traits must use exactly two levels encoded as {0, 1} or {1, 2}. Internally they are normalized to 0/1.
If quantitative=None, snputils infers the trait type from the values.
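One way to picture the inference and the 0/1 normalization is the sketch below. It is a guess at the rule based only on the description above, not the actual snputils logic: the helper name `infer_trait` is hypothetical, and treating any vector whose levels are exactly {0, 1} or {1, 2} as binary is an assumption.

```python
import math


def infer_trait(values):
    """Classify a trait and normalize binary codes to 0/1.

    Hypothetical helper; the real inference rule in snputils may differ.
    """
    values = [float(v) for v in values]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("values must be finite")
    levels = set(values)
    if levels == {0.0, 1.0}:
        return "binary", [int(v) for v in values]
    if levels == {1.0, 2.0}:
        # Shift the {1, 2} coding down to {0, 1}.
        return "binary", [int(v) - 1 for v in values]
    if len(levels) < 2:
        raise ValueError("quantitative traits need non-zero variance")
    return "quantitative", values


print(infer_trait([1, 2, 1, 1, 2]))
# ('binary', [0, 1, 0, 0, 1])
print(infer_trait([0.4818, -1.0665, -0.4797])[0])
# quantitative
```

Under this rule a constant vector is rejected, which matches the non-zero-variance requirement stated above.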
binary_from_12 = PhenotypeObject(
    samples=phen_df["IID"],
    values=phen_df["trait_binary_12"],
    phenotype_name="TRAIT_BIN_FROM_12",
    quantitative=None,
)
quantitative_from_table = PhenotypeObject(
    samples=phen_df["IID"],
    values=phen_df["trait_quantitative"],
    phenotype_name="TRAIT_Q_FROM_TABLE",
    quantitative=None,
)
print(binary_from_12)
print("unique stored values:", sorted(set(binary_from_12.values.tolist())))
print("first five cases:", binary_from_12.cases[:5])
print(quantitative_from_table)
PhenotypeObject(phenotype_name='TRAIT_BIN_FROM_12', shape=(48,), n_samples=48, trait_type='binary', n_cases=24, n_controls=24)
unique stored values: [0, 1]
first five cases: ['S02', 'S07', 'S08', 'S09', 'S12']
PhenotypeObject(phenotype_name='TRAIT_Q_FROM_TABLE', shape=(48,), n_samples=48, trait_type='quantitative')
Covariates
CovariateObject stores a numeric sample-by-covariate matrix. It is stricter than MultiPhenotypeObject: covariates must be numeric, finite, and have explicit names. This is the object expected by run_gwas() and run_admixture_mapping() when you want to pass in-memory covariates.
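The stricter rules can be illustrated with a small stdlib-only check. This is a sketch of the constraints as described (numeric, finite, explicitly named), not the CovariateObject constructor. The helper `validate_covariates` is hypothetical, and requiring unique covariate names is an added assumption.

```python
import math


def validate_covariates(samples, names, matrix):
    """Return the matrix as floats after checking the described constraints.

    Hypothetical helper; not part of snputils.
    """
    if not names or any(not str(name).strip() for name in names):
        raise ValueError("every covariate needs an explicit name")
    if len(set(names)) != len(names):
        raise ValueError("covariate names must be unique")  # assumption
    if len(matrix) != len(samples):
        raise ValueError("one row per sample is required")
    cleaned = []
    for row in matrix:
        if len(row) != len(names):
            raise ValueError("row length must match the number of covariates")
        floats = [float(v) for v in row]  # non-numeric input raises here
        if not all(math.isfinite(v) for v in floats):
            raise ValueError("covariates must be finite")
        cleaned.append(floats)
    return cleaned


checked = validate_covariates(
    samples=["S01", "S02"],
    names=["age", "batch", "sex"],
    matrix=[[58, 1, 1], [49, 2, 2]],
)
print(checked)  # [[58.0, 1.0, 1.0], [49.0, 2.0, 2.0]]
```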
print(covariates)
pd.DataFrame(
    covariates.values,
    index=covariates.samples,
    columns=covariates.covariate_names,
).head()
CovariateObject(shape=(48, 3), n_samples=48, n_covariates=3)
| | age | batch | sex |
|---|---|---|---|
| S01 | 58.0 | 1.0 | 1.0 |
| S02 | 49.0 | 2.0 | 2.0 |
| S03 | 29.0 | 1.0 | 1.0 |
| S04 | 30.0 | 2.0 | 2.0 |
| S05 | 63.0 | 1.0 | 1.0 |
Read the same data from files
The reader APIs use the same IID convention.
- MultiPhenReader loads a whole table into MultiPhenotypeObject.
- read_pheno() reads one phenotype column into PhenotypeObject.
For single-trait phenotype files, the header must include IID, and if multiple phenotype columns appear after IID you must select one explicitly with col=. FID may be present before IID, but it is ignored by the phenotype objects.
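The header rules can be sketched with the standard csv module. This is not how read_pheno() is implemented; the helper `read_single_trait` is hypothetical, and selecting the column automatically when exactly one follows IID is an assumption drawn from the wording above.

```python
import csv
import io


def read_single_trait(handle, col=None, sep="\t"):
    """Read one phenotype column from a delimited file with an IID header.

    Hypothetical helper for illustration; not the snputils reader.
    """
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)
    if "IID" not in header:
        raise ValueError("header must include IID")
    iid_pos = header.index("IID")
    # Columns before IID (such as FID) are ignored.
    candidates = header[iid_pos + 1:]
    if col is None:
        if len(candidates) != 1:
            raise ValueError("several phenotype columns found; pass col=")
        col = candidates[0]
    elif col not in candidates:
        raise ValueError(f"column {col!r} not found after IID")
    col_pos = iid_pos + 1 + candidates.index(col)
    samples, values = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        samples.append(row[iid_pos])
        values.append(float(row[col_pos]))
    return col, samples, values


text = (
    "FID\tIID\ttrait_quantitative\ttrait_binary_12\n"
    "S01\tS01\t0.4818\t1\n"
    "S02\tS02\t-1.0665\t2\n"
)
name, ids, vals = read_single_trait(io.StringIO(text), col="trait_binary_12")
print(name, ids, vals)  # trait_binary_12 ['S01', 'S02'] [1.0, 2.0]
```

Calling the sketch on the same text without col= raises ValueError, because two phenotype columns follow IID.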
phen_path = RESULTS_DIR / "synthetic_phenotypes.tsv"
phen_file_df = phen_df.copy()
phen_file_df.insert(0, "FID", phen_file_df["IID"])
phen_file_df.to_csv(phen_path, sep="\t", index=False)
multi_from_file = MultiPhenReader(phen_path).read()
binary_from_file = read_pheno(phen_path, col="trait_binary_12")
quantitative_from_file = read_pheno(phen_path, col="trait_quantitative", quantitative=True)
display(multi_from_file.phen_df.head())
print(binary_from_file)
print(quantitative_from_file)
| | IID | trait_quantitative | trait_binary_01 | trait_binary_12 | age | batch | sex |
|---|---|---|---|---|---|---|---|
| 0 | S01 | 0.4818 | 0 | 1 | 58 | 1 | 1 |
| 1 | S02 | -1.0665 | 1 | 2 | 49 | 2 | 2 |
| 2 | S03 | -0.4797 | 0 | 1 | 29 | 1 | 1 |
| 3 | S04 | -0.8982 | 0 | 1 | 30 | 2 | 2 |
| 4 | S05 | 0.2875 | 0 | 1 | 63 | 1 | 1 |
PhenotypeObject(phenotype_name='trait_binary_12', shape=(48,), n_samples=48, trait_type='binary', n_cases=24, n_controls=24)
PhenotypeObject(phenotype_name='trait_quantitative', shape=(48,), n_samples=48, trait_type='quantitative')