Phenotype Objects

snputils uses three phenotype-side containers:

MultiPhenotypeObject for a table keyed by IID plus several trait columns.
PhenotypeObject for one trait aligned to sample IDs.
CovariateObject for numeric covariates aligned to the same samples.

This notebook uses the synthetic dataset helpers so every example is reproducible and can run without external files.

from pathlib import Path

import pandas as pd

from snputils.datasets import build_synthetic_phenotype_dataset
from snputils.phenotype import MultiPhenReader, PhenotypeObject, read_pheno
from snputils.tools import run_gwas

RESULTS_DIR = Path("results/tutorials")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
SEED = 20240520

Create a small aligned cohort

build_synthetic_phenotype_dataset() returns an SNPObject, a phenotype table, single-trait phenotype objects, and covariates that all use the same sample IDs. The table already uses the same IID convention as the phenotype readers. The synthetic SNP data use complete diploid dosages, so these objects can also be passed directly to run_gwas().

dataset = build_synthetic_phenotype_dataset(n_samples=48, n_snps=120, seed=SEED)

snpobj = dataset["snpobj"]
phen_df = dataset["phen_df"]
multi = dataset["multi_phenotype"]
quantitative = dataset["quantitative"]
binary = dataset["binary"]
covariates = dataset["covariates"]
shuffled_phen_df = dataset["shuffled_phen_df"]

print(multi)
print(binary)
print(quantitative)
print("effect variants:", dataset["effect_variant_ids"])

MultiPhenotypeObject(shape=(48, 7), n_samples=48, n_phenotypes=6, sample_column='IID')
PhenotypeObject(phenotype_name='TRAIT_BIN', shape=(48,), n_samples=48, trait_type='binary', n_cases=24, n_controls=24)
PhenotypeObject(phenotype_name='TRAIT_Q', shape=(48,), n_samples=48, trait_type='quantitative')
effect variants: ['rs_syn_00001', 'rs_syn_00060', 'rs_syn_00120']

phen_df.head()

	IID	trait_quantitative	trait_binary_01	trait_binary_12	age	batch	sex
0	S01	0.4818	0	1	58	1	1
1	S02	-1.0665	1	2	49	2	2
2	S03	-0.4797	0	1	29	1	1
3	S04	-0.8982	0	1	30	2	2
4	S05	0.2875	0	1	63	1	1
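
The dataset also carries shuffled_phen_df, which this section does not describe further. Assuming it holds the same rows in a different order, the following minimal sketch shows how such a table could be realigned to a reference sample order with plain pandas. The three-row frame and the helper name align_to_samples are illustrative assumptions, not part of snputils.

```python
import pandas as pd


def align_to_samples(table: pd.DataFrame, samples: list, id_col: str = "IID") -> pd.DataFrame:
    """Hypothetical helper: return the rows of `table` in the order given by `samples`."""
    if table[id_col].duplicated().any():
        raise ValueError("sample IDs must be unique")
    indexed = table.set_index(id_col)
    missing = [s for s in samples if s not in indexed.index]
    if missing:
        raise KeyError(f"samples missing from table: {missing}")
    # .loc with a list keeps the order of the list, not the order of the table
    return indexed.loc[samples].reset_index()


# Toy stand-in for a shuffled phenotype table (values copied from the preview above)
reference_order = ["S01", "S02", "S03"]
shuffled = pd.DataFrame(
    {"IID": ["S03", "S01", "S02"], "trait_quantitative": [-0.4797, 0.4818, -1.0665]}
)

aligned = align_to_samples(shuffled, reference_order)
print(aligned)
```

Checking for duplicates and missing IDs before reindexing keeps a silent misalignment from propagating into a later association test.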

Multi-trait tables

MultiPhenotypeObject is a validated sample-aligned table. It requires a unique sample-ID column plus at least one phenotype column, and exposes the parsed sample IDs through samples.

This object is useful when you want to keep several columns together while selecting or reordering samples. Downstream association tools still expect a single PhenotypeObject, so you typically extract one trait column when you are ready to run an analysis.

print("sample column:", multi.sample_column)
print("phenotype columns:", multi.phenotype_names)

selected = multi.filter_samples(samples=["S12", "S03", "S01"], reorder=True)
excluded = multi.filter_samples(indexes=[0, -1], include=False)

display(selected.phen_df)
display(excluded.phen_df.head())

sample column: IID
phenotype columns: ['trait_quantitative', 'trait_binary_01', 'trait_binary_12', 'age', 'batch', 'sex']

	IID	trait_quantitative	trait_binary_01	trait_binary_12	age	batch	sex
0	S12	0.5951	1	2	62	2	2
1	S03	-0.4797	0	1	29	1	1
2	S01	0.4818	0	1	58	1	1

	IID	trait_quantitative	trait_binary_01	trait_binary_12	age	batch	sex
0	S02	-1.0665	1	2	49	2	2
1	S03	-0.4797	0	1	29	1	1
2	S04	-0.8982	0	1	30	2	2
3	S05	0.2875	0	1	63	1	1
4	S06	0.7950	0	1	50	2	2
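
The two filter_samples() calls above behave like an ordered selection by ID and an exclusion by row position. A rough pandas equivalent, written only to make those semantics concrete, might look like the sketch below. It is not the snputils implementation, and the four-row table is made up for the example.

```python
import pandas as pd

table = pd.DataFrame({"IID": ["S01", "S02", "S03", "S04"], "age": [58, 49, 29, 30]})

# Include by sample ID and keep the requested order (like samples=..., reorder=True)
wanted = ["S04", "S01"]
selected = table.set_index("IID").loc[wanted].reset_index()

# Exclude by row position (like indexes=[0, -1], include=False);
# negative positions count from the end, so -1 is the last row
drop_positions = [0, -1]
n_rows = len(table)
to_drop = {pos % n_rows for pos in drop_positions}
keep = [i for i in range(n_rows) if i not in to_drop]
excluded = table.iloc[keep].reset_index(drop=True)

print(selected)
print(excluded)
```

Both results get a fresh 0-based row index, which matches the tables displayed above.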

Single-trait phenotype objects

PhenotypeObject stores one numeric vector plus sample IDs. The constructor enforces unique sample IDs, finite numeric values, and one of two trait modes:

Quantitative traits keep their numeric values and must have non-zero variance.
Binary traits must use exactly two levels encoded as {0, 1} or {1, 2}. Internally they are normalized to 0/1.

If quantitative=None, snputils infers the trait type from the values.

binary_from_12 = PhenotypeObject(
    samples=phen_df["IID"],
    values=phen_df["trait_binary_12"],
    phenotype_name="TRAIT_BIN_FROM_12",
    quantitative=None,
)

quantitative_from_table = PhenotypeObject(
    samples=phen_df["IID"],
    values=phen_df["trait_quantitative"],
    phenotype_name="TRAIT_Q_FROM_TABLE",
    quantitative=None,
)

print(binary_from_12)
print("unique stored values:", sorted(set(binary_from_12.values.tolist())))
print("first five cases:", binary_from_12.cases[:5])
print(quantitative_from_table)

PhenotypeObject(phenotype_name='TRAIT_BIN_FROM_12', shape=(48,), n_samples=48, trait_type='binary', n_cases=24, n_controls=24)
unique stored values: [0, 1]
first five cases: ['S02', 'S07', 'S08', 'S09', 'S12']
PhenotypeObject(phenotype_name='TRAIT_Q_FROM_TABLE', shape=(48,), n_samples=48, trait_type='quantitative')
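
The validation and normalization rules described in this section can be sketched with NumPy. This is an illustrative reading of those rules, not the actual PhenotypeObject code: the function name infer_and_normalize is hypothetical, and classifying every two-level {0, 1} or {1, 2} vector as binary is an assumption about how the inference works.

```python
import numpy as np


def infer_and_normalize(values):
    """Hypothetical helper: return (trait_type, stored_values) for one trait vector."""
    arr = np.asarray(values, dtype=float)
    if not np.isfinite(arr).all():
        raise ValueError("phenotype values must be finite")

    levels = set(np.unique(arr).tolist())
    if levels == {0.0, 1.0}:
        return "binary", arr.astype(int)
    if levels == {1.0, 2.0}:
        # Shift the {1, 2} coding down so that 1 -> 0 (control) and 2 -> 1 (case)
        return "binary", (arr - 1).astype(int)

    if arr.max() == arr.min():
        raise ValueError("quantitative traits must have non-zero variance")
    return "quantitative", arr


kind_12, stored_12 = infer_and_normalize([1, 2, 1, 1, 2])
kind_q, stored_q = infer_and_normalize([0.4818, -1.0665, -0.4797])

print(kind_12, stored_12.tolist())
print(kind_q)
```

Under this reading, a quantitative trait that happens to take only the values 0 and 1 would be inferred as binary, which is one reason to pass quantitative=True or quantitative=False explicitly when the type is known.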

Covariates

CovariateObject stores a numeric sample-by-covariate matrix. It is stricter than MultiPhenotypeObject: covariates must be numeric, finite, and have explicit names. This is the object expected by run_gwas() and run_admixture_mapping() when you want to pass in-memory covariates.

print(covariates)
pd.DataFrame(
    covariates.values,
    index=covariates.samples,
    columns=covariates.covariate_names,
).head()

CovariateObject(shape=(48, 3), n_samples=48, n_covariates=3)

	age	batch	sex
S01	58.0	1.0	1.0
S02	49.0	2.0	2.0
S03	29.0	1.0	1.0
S04	30.0	2.0	2.0
S05	63.0	1.0	1.0
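
To make the three requirements (numeric, finite, explicitly named) concrete, here is a small pandas and NumPy sketch of the kind of checks a covariate table would have to pass. The validator check_covariates is a hypothetical stand-in written for this example; the real CovariateObject may check more or report errors differently.

```python
import numpy as np
import pandas as pd


def check_covariates(frame: pd.DataFrame) -> np.ndarray:
    """Hypothetical validator: return a float matrix, or raise on invalid input."""
    names = [str(c).strip() for c in frame.columns]
    if "" in names or len(set(names)) != len(names):
        raise ValueError("covariates need unique, non-empty names")
    if not frame.index.is_unique:
        raise ValueError("sample IDs must be unique")

    non_numeric = [c for c in frame.columns if not pd.api.types.is_numeric_dtype(frame[c])]
    if non_numeric:
        raise TypeError(f"non-numeric covariates: {non_numeric}")

    matrix = frame.to_numpy(dtype=float)
    if not np.isfinite(matrix).all():
        raise ValueError("covariates must be finite (no NaN or inf)")
    return matrix


# Toy covariates indexed by sample ID (values copied from the table above)
cov = pd.DataFrame(
    {"age": [58, 49, 29], "batch": [1, 2, 1], "sex": [1, 2, 1]},
    index=["S01", "S02", "S03"],
)
matrix = check_covariates(cov)
print(matrix.shape, matrix.dtype)
```

Converting to a float matrix up front mirrors the float values shown in the displayed covariate table.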

Read the same data from files

The reader APIs follow the same IID convention as the in-memory objects.

MultiPhenReader loads a whole table into MultiPhenotypeObject.
read_pheno() reads one phenotype column into PhenotypeObject.

For single-trait phenotype files, the header must include IID. If more than one phenotype column follows IID, you must select one explicitly with col=. An FID column may precede IID, but the phenotype objects ignore it.

phen_path = RESULTS_DIR / "synthetic_phenotypes.tsv"

phen_file_df = phen_df.copy()
phen_file_df.insert(0, "FID", phen_file_df["IID"])
phen_file_df.to_csv(phen_path, sep="\t", index=False)

multi_from_file = MultiPhenReader(phen_path).read()
binary_from_file = read_pheno(phen_path, col="trait_binary_12")
quantitative_from_file = read_pheno(phen_path, col="trait_quantitative", quantitative=True)

display(multi_from_file.phen_df.head())
print(binary_from_file)
print(quantitative_from_file)

	IID	trait_quantitative	trait_binary_01	trait_binary_12	age	batch	sex
0	S01	0.4818	0	1	58	1	1
1	S02	-1.0665	1	2	49	2	2
2	S03	-0.4797	0	1	29	1	1
3	S04	-0.8982	0	1	30	2	2
4	S05	0.2875	0	1	63	1	1

PhenotypeObject(phenotype_name='trait_binary_12', shape=(48,), n_samples=48, trait_type='binary', n_cases=24, n_controls=24)
PhenotypeObject(phenotype_name='trait_quantitative', shape=(48,), n_samples=48, trait_type='quantitative')
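
For readers who want to see the file convention without the readers, the sketch below parses an equivalent two-sample TSV with plain pandas: it requires IID, drops FID if present, and picks one trait column by name. It only illustrates the layout described above and does not reproduce what MultiPhenReader or read_pheno() do internally; the in-memory TSV string is made up for the example.

```python
import io

import pandas as pd

# Tiny in-memory file with the same layout as synthetic_phenotypes.tsv
tsv = (
    "FID\tIID\ttrait_quantitative\ttrait_binary_12\n"
    "S01\tS01\t0.4818\t1\n"
    "S02\tS02\t-1.0665\t2\n"
)

raw = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"FID": str, "IID": str})
if "IID" not in raw.columns:
    raise ValueError("header must include IID")

# FID carries no information for the phenotype objects, so drop it when present
table = raw.drop(columns=["FID"], errors="ignore")

# Everything after IID is a candidate trait; with several, one must be named
trait_columns = [c for c in table.columns if c != "IID"]
col = "trait_binary_12"
trait = table.set_index("IID")[col]

print(trait_columns)
print(trait.to_dict())
```

Reading FID and IID as strings avoids pandas turning purely numeric sample IDs into integers, which would break matching against string IDs elsewhere.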