Phenotype Objects

snputils uses three phenotype-side containers:

  • MultiPhenotypeObject for a table keyed by IID with several trait columns.

  • PhenotypeObject for one trait aligned to sample IDs.

  • CovariateObject for numeric covariates aligned to the same samples.

This notebook uses the synthetic dataset helpers, so every example is reproducible and runs without external files.

from pathlib import Path

import pandas as pd

from snputils.datasets import build_synthetic_phenotype_dataset
from snputils.phenotype import MultiPhenReader, PhenotypeObject, read_pheno
from snputils.tools import run_gwas

RESULTS_DIR = Path("results/tutorials")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
SEED = 20240520

Create a small aligned cohort

build_synthetic_phenotype_dataset() returns a dictionary-style result holding an SNPObject, a phenotype table, a MultiPhenotypeObject, single-trait phenotype objects, and covariates, all keyed to the same sample IDs. The table follows the same IID convention as the phenotype readers. The synthetic SNP data use complete diploid dosages, so the examples can also be passed directly to run_gwas().

dataset = build_synthetic_phenotype_dataset(n_samples=48, n_snps=120, seed=SEED)

snpobj = dataset["snpobj"]
phen_df = dataset["phen_df"]
multi = dataset["multi_phenotype"]
quantitative = dataset["quantitative"]
binary = dataset["binary"]
covariates = dataset["covariates"]
shuffled_phen_df = dataset["shuffled_phen_df"]

print(multi)
print(binary)
print(quantitative)
print("effect variants:", dataset["effect_variant_ids"])
MultiPhenotypeObject(shape=(48, 7), n_samples=48, n_phenotypes=6, sample_column='IID')
PhenotypeObject(phenotype_name='TRAIT_BIN', shape=(48,), n_samples=48, trait_type='binary', n_cases=24, n_controls=24)
PhenotypeObject(phenotype_name='TRAIT_Q', shape=(48,), n_samples=48, trait_type='quantitative')
effect variants: ['rs_syn_00001', 'rs_syn_00060', 'rs_syn_00120']
phen_df.head()
   IID  trait_quantitative  trait_binary_01  trait_binary_12  age  batch  sex
0  S01              0.4818                0                1   58      1    1
1  S02             -1.0665                1                2   49      2    2
2  S03             -0.4797                0                1   29      1    1
3  S04             -0.8982                0                1   30      2    2
4  S05              0.2875                0                1   63      1    1

Multi-trait tables

MultiPhenotypeObject is a validated sample-aligned table. It requires a unique sample-ID column plus at least one phenotype column, and exposes the parsed sample IDs through samples.

This object is useful when you want to keep several columns together while selecting or reordering samples. Downstream association tools still expect a single PhenotypeObject, so you typically extract one trait column when you are ready to run an analysis.

print("sample column:", multi.sample_column)
print("phenotype columns:", multi.phenotype_names)

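# Keep three samples, returned in the requested order.
selected = multi.filter_samples(samples=["S12", "S03", "S01"], reorder=True)
# Drop the first and last rows; every other sample is kept.
excluded = multi.filter_samples(indexes=[0, -1], include=False)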

display(selected.phen_df)
display(excluded.phen_df.head())
sample column: IID
phenotype columns: ['trait_quantitative', 'trait_binary_01', 'trait_binary_12', 'age', 'batch', 'sex']
   IID  trait_quantitative  trait_binary_01  trait_binary_12  age  batch  sex
0  S12              0.5951                1                2   62      2    2
1  S03             -0.4797                0                1   29      1    1
2  S01              0.4818                0                1   58      1    1

   IID  trait_quantitative  trait_binary_01  trait_binary_12  age  batch  sex
0  S02             -1.0665                1                2   49      2    2
1  S03             -0.4797                0                1   29      1    1
2  S04             -0.8982                0                1   30      2    2
3  S05              0.2875                0                1   63      1    1
4  S06              0.7950                0                1   50      2    2
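
The selection semantics seen in the two filter_samples() calls above can be sketched in plain Python. This is not the snputils implementation: select_rows is a hypothetical helper, the sample IDs S01 to S48 are assumed from the cohort size, and the behaviour without reorder (kept rows stay in table order) is an assumption.

```python
from typing import List, Optional, Sequence


def select_rows(
    sample_ids: Sequence[str],
    samples: Optional[Sequence[str]] = None,
    indexes: Optional[Sequence[int]] = None,
    include: bool = True,
    reorder: bool = False,
) -> List[int]:
    """Return row positions kept by a sample filter (hypothetical helper, not snputils code)."""
    n = len(sample_ids)
    position = {sid: i for i, sid in enumerate(sample_ids)}

    # Gather the requested rows in request order.
    requested: List[int] = []
    for sid in samples or []:
        if sid not in position:
            raise KeyError(f"unknown sample ID: {sid}")
        requested.append(position[sid])
    for i in indexes or []:
        if not -n <= i < n:
            raise IndexError(f"row index out of range: {i}")
        requested.append(i % n)  # negative indexes count from the end

    chosen = set(requested)
    if not include:
        # Exclusion keeps every other row in table order.
        return [i for i in range(n) if i not in chosen]
    if reorder:
        # Keep request order, dropping repeated requests.
        return list(dict.fromkeys(requested))
    # Assumption: without reorder, kept rows stay in table order.
    return [i for i in range(n) if i in chosen]


ids = [f"S{i:02d}" for i in range(1, 49)]  # assumed IDs S01 ... S48

kept = select_rows(ids, samples=["S12", "S03", "S01"], reorder=True)
print([ids[i] for i in kept])  # ['S12', 'S03', 'S01']

rest = select_rows(ids, indexes=[0, -1], include=False)
print(len(rest), ids[rest[0]], ids[rest[-1]])  # 46 S02 S47
```

Returning positions rather than rows keeps the sketch independent of how the table is stored; the same positions could index a DataFrame with iloc.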

Single-trait phenotype objects

PhenotypeObject stores one numeric vector plus sample IDs. The constructor enforces unique sample IDs, finite numeric values, and one of two trait modes:

  • Quantitative traits keep their numeric values and must have non-zero variance.

  • Binary traits must use exactly two levels encoded as {0, 1} or {1, 2}. Internally they are normalized to 0/1.

If quantitative=None, snputils infers the trait type from the values.

# quantitative=None asks snputils to infer the trait type from the values.
binary_from_12 = PhenotypeObject(
    samples=phen_df["IID"],
    values=phen_df["trait_binary_12"],
    phenotype_name="TRAIT_BIN_FROM_12",
    quantitative=None,
)

quantitative_from_table = PhenotypeObject(
    samples=phen_df["IID"],
    values=phen_df["trait_quantitative"],
    phenotype_name="TRAIT_Q_FROM_TABLE",
    quantitative=None,
)

print(binary_from_12)
print("unique stored values:", sorted(set(binary_from_12.values.tolist())))
print("first five cases:", binary_from_12.cases[:5])
print(quantitative_from_table)
PhenotypeObject(phenotype_name='TRAIT_BIN_FROM_12', shape=(48,), n_samples=48, trait_type='binary', n_cases=24, n_controls=24)
unique stored values: [0, 1]
first five cases: ['S02', 'S07', 'S08', 'S09', 'S12']
PhenotypeObject(phenotype_name='TRAIT_Q_FROM_TABLE', shape=(48,), n_samples=48, trait_type='quantitative')
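
The validation and normalization rules listed in this section can be illustrated with a small standalone function. It is a sketch under stated assumptions, not the PhenotypeObject constructor: classify_trait is a hypothetical name, and the inference rule used when quantitative is None (treat {0, 1} or {1, 2} coding as binary, anything else as quantitative) is a guess at how inference might work.

```python
import math
from typing import Optional, Sequence, Tuple


def classify_trait(
    values: Sequence[float], quantitative: Optional[bool] = None
) -> Tuple[str, list]:
    """Validate one trait vector and return (trait_type, stored_values). Hypothetical helper."""
    vals = [float(v) for v in values]
    if not all(math.isfinite(v) for v in vals):
        raise ValueError("phenotype values must be finite")

    levels = set(vals)
    looks_binary = levels in ({0.0, 1.0}, {1.0, 2.0})

    if quantitative is None:
        # Assumed inference rule: two-level 0/1 or 1/2 coding means binary.
        quantitative = not looks_binary

    if quantitative:
        if len(levels) < 2:
            raise ValueError("quantitative traits need non-zero variance")
        return "quantitative", vals

    if not looks_binary:
        raise ValueError("binary traits must be coded {0, 1} or {1, 2}")
    offset = min(levels)  # 0.0 for {0, 1} coding, 1.0 for {1, 2} coding
    return "binary", [int(v - offset) for v in vals]


# trait_binary_12 for S01..S05, taken from the table shown earlier
print(classify_trait([1, 2, 1, 1, 1]))  # ('binary', [0, 1, 0, 0, 0])
print(classify_trait([0.4818, -1.0665, -0.4797])[0])  # quantitative
```

Subtracting the smaller level maps both accepted codings onto 0/1, which matches the stored values [0, 1] printed for binary_from_12.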

Covariates

CovariateObject stores a numeric sample-by-covariate matrix. It is stricter than MultiPhenotypeObject: covariates must be numeric, finite, and have explicit names. This is the object expected by run_gwas() and run_admixture_mapping() when you want to pass in-memory covariates.

print(covariates)
pd.DataFrame(
    covariates.values,
    index=covariates.samples,
    columns=covariates.covariate_names,
).head()
CovariateObject(shape=(48, 3), n_samples=48, n_covariates=3)
      age  batch  sex
S01  58.0    1.0  1.0
S02  49.0    2.0  2.0
S03  29.0    1.0  1.0
S04  30.0    2.0  2.0
S05  63.0    1.0  1.0
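
The covariate requirements (numeric, finite, explicitly named) amount to a few checks on a sample-by-covariate matrix. The sketch below uses a hypothetical validate_covariates function and plain lists; it illustrates the stated rules only and does not reproduce how CovariateObject is implemented.

```python
import math
from typing import List, Sequence


def validate_covariates(
    samples: Sequence[str],
    names: Sequence[str],
    rows: Sequence[Sequence[object]],
) -> List[List[float]]:
    """Check a sample-by-covariate matrix and return it as floats. Hypothetical helper."""
    if not names or any(not str(name).strip() for name in names):
        raise ValueError("every covariate needs an explicit, non-empty name")
    if len(rows) != len(samples):
        raise ValueError("need exactly one covariate row per sample")

    matrix: List[List[float]] = []
    for sid, row in zip(samples, rows):
        if len(row) != len(names):
            raise ValueError(f"{sid}: expected {len(names)} covariates, got {len(row)}")
        try:
            converted = [float(x) for x in row]
        except (TypeError, ValueError):
            raise ValueError(f"{sid}: covariates must be numeric") from None
        if not all(math.isfinite(x) for x in converted):
            raise ValueError(f"{sid}: covariates must be finite")
        matrix.append(converted)
    return matrix


# First two rows of the covariate table shown above
matrix = validate_covariates(
    ["S01", "S02"], ["age", "batch", "sex"], [[58, 1, 1], [49, 2, 2]]
)
print(matrix)  # [[58.0, 1.0, 1.0], [49.0, 2.0, 2.0]]
```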

Read the same data from files

The reader APIs follow the same IID convention as the in-memory objects:

  • MultiPhenReader loads a whole table into MultiPhenotypeObject.

  • read_pheno() reads one phenotype column into PhenotypeObject.

For single-trait phenotype files, the header must include IID. If more than one phenotype column follows IID, you must select one explicitly with col=. An FID column may precede IID, but the phenotype objects ignore it.

phen_path = RESULTS_DIR / "synthetic_phenotypes.tsv"

phen_file_df = phen_df.copy()
# Prepend an FID column; the phenotype readers ignore it.
phen_file_df.insert(0, "FID", phen_file_df["IID"])
phen_file_df.to_csv(phen_path, sep="\t", index=False)

multi_from_file = MultiPhenReader(phen_path).read()
binary_from_file = read_pheno(phen_path, col="trait_binary_12")
quantitative_from_file = read_pheno(phen_path, col="trait_quantitative", quantitative=True)

display(multi_from_file.phen_df.head())
print(binary_from_file)
print(quantitative_from_file)
   IID  trait_quantitative  trait_binary_01  trait_binary_12  age  batch  sex
0  S01              0.4818                0                1   58      1    1
1  S02             -1.0665                1                2   49      2    2
2  S03             -0.4797                0                1   29      1    1
3  S04             -0.8982                0                1   30      2    2
4  S05              0.2875                0                1   63      1    1
PhenotypeObject(phenotype_name='trait_binary_12', shape=(48,), n_samples=48, trait_type='binary', n_cases=24, n_controls=24)
PhenotypeObject(phenotype_name='trait_quantitative', shape=(48,), n_samples=48, trait_type='quantitative')
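
The header rules for single-trait files (IID required, FID ignored, col= needed when several phenotype columns follow IID) can be sketched with the standard csv module. read_single_trait is a hypothetical stand-in, not read_pheno(); it assumes a tab-separated file like the one written above and treats every column after IID as a phenotype column.

```python
import csv
import io
from typing import List, Optional, TextIO, Tuple


def read_single_trait(
    handle: TextIO, col: Optional[str] = None
) -> Tuple[str, List[str], List[float]]:
    """Read one phenotype column from a tab-separated table. Hypothetical helper."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if "IID" not in header:
        raise ValueError("header must include an IID column")

    iid_at = header.index("IID")
    trait_cols = header[iid_at + 1:]  # columns before IID (such as FID) are ignored

    if col is None:
        if len(trait_cols) != 1:
            raise ValueError("several phenotype columns found; select one with col=")
        col = trait_cols[0]
    elif col not in trait_cols:
        raise ValueError(f"unknown phenotype column: {col}")
    col_at = iid_at + 1 + trait_cols.index(col)

    samples: List[str] = []
    values: List[float] = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        samples.append(row[iid_at])
        values.append(float(row[col_at]))
    return col, samples, values


# Two rows from the synthetic table, with FID prepended as in the file above
text = (
    "FID\tIID\ttrait_quantitative\ttrait_binary_12\n"
    "S01\tS01\t0.4818\t1\n"
    "S02\tS02\t-1.0665\t2\n"
)
print(read_single_trait(io.StringIO(text), col="trait_binary_12"))
# ('trait_binary_12', ['S01', 'S02'], [1.0, 2.0])
```

Calling the sketch on this text without col raises an error, which mirrors the requirement to choose a column explicitly when more than one is present.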