File I/O¶
High-level readers dispatch to the right implementation based on file extension.
import snputils as su
snpobj = su.read_snp("cohort.vcf.gz") # auto-detects format
SNP Formats¶
Format |
Read |
Write |
|---|---|---|
PLINK BED ( |
|
|
PLINK2 PGEN ( |
|
|
VCF / VCF.gz |
|
|
BCF ( |
|
|
BGEN |
|
|
GRG |
|
|
from snputils import read_bcf, read_bed, read_bgen, read_pgen, read_vcf
bcf = read_bcf("cohort.bcf")
bed = read_bed("cohort.bed")
bgen = read_bgen("cohort.bgen")
pgen = read_pgen("cohort.pgen")
vcf = read_vcf("cohort.vcf.gz")
Reader options:
from snputils import BCFReader, PGENReader, BGENReader, VCFReader
# PGEN: read only specific samples or chromosomes
pgen = PGENReader("cohort.pgen").read(
samples=["sample1", "sample2"],
variants_chrom=["1", "2"],
sum_strands=False,
)
# VCF: read only phased GT, specific region
vcf = VCFReader("cohort.vcf.gz").read(
region="chr1:1000000-2000000",
sum_strands=False,
)
# BCF: read only specific samples and variants
bcf = BCFReader("cohort.bcf").read(
sample_ids=["sample1", "sample2"],
variant_ids=["chr1:1000000", "rs123"],
)
BCF notes: BCF reads use a native snputils parser over BGZF-compressed BCF2.2 records. Genotypes are stored on SNPObject.genotypes just like VCF input. region=... works without an index by scanning and filtering matching records.
BGEN notes: Genotype probabilities are stored on SNPObject.calldata_gp with shape (n_snps, n_samples, n_probs). Hard calls are not inferred during reads. Mixed-width BGEN records are padded with NaN columns.
Sampleless/Annotation-Only VCF notes: VCFReader supports reading sampleless VCFs (VCFs containing variant metadata but no sample columns). In this case, the returned SNPObject will have an empty genotypes array with a variant axis (shape (n_snps, 0) or (n_snps, 0, 2)). Correspondingly, VCFWriter can write sampleless VCFs from such SNPObject instances, provided their genotypes array has a variant axis of size matching the variant count.
Writer options:
from snputils import BEDWriter, PGENWriter, VCFWriter, BCFWriter, BGENWriter
BEDWriter(snpobj, "out.bed").write()
PGENWriter(snpobj, "out.pgen").write(vzs=True) # compressed .pvar.zst
VCFWriter(snpobj, "out.vcf", phased=True).write(
chrom_partition=True # one file per chromosome
)
BCFWriter(snpobj, "out.bcf", phased=True).write(
chrom_partition=True # one file per chromosome
)
BGENWriter(snpobj, "out.bgen").write(
compression="zstd", layout=2, bit_depth=16, phased=True
)
Local Ancestry Formats¶
Format |
Read |
Write |
|---|---|---|
MSP ( |
|
|
FLARE ( |
|
|
msp = su.read_lai("local_ancestry.msp")
flare = su.read_lai("flare.out.anc.vcf.gz")
su.MSPWriter(laiobj, "out.msp").write()
# FLARE VCF output requires matching SNP data (GT + variant metadata)
su.FLAREWriter(laiobj, "out.anc.vcf", snpobj=snpobj).write()
Global Ancestry / ADMIXTURE¶
admobj = su.read_admixture("run_prefix") # reads run_prefix.Q and run_prefix.P
su.AdmixtureWriter(admobj, "out_prefix").write()
Admixture Mapping Output¶
# Write a VCF for downstream admixture-mapping tools
su.AdmixtureMappingVCFWriter(laiobj, snpobj, "admix_map.vcf").write()
IBD Formats¶
hapibd = su.read_ibd("segments.hapibd") # hap-IBD
ancibd = su.AncIBDReader("ch_all.tsv").read(
include_segment_types=["IBD1", "IBD2"] # filter by segment type
)
Phenotype Formats¶
phen = su.read_pheno("pheno.tsv")
mphen = su.MultiPhenReader("multi.tsv").read()
Covariate Formats¶
Covariate files feed run_gwas() and run_admixture_mapping() through the covar argument. Use whitespace-separated text with IID in the header and numeric columns after it. An optional #FID column before IID is allowed.
#FID IID age sex
sample1 sample1 58 1
sample2 sample2 49 2
In Python you can pass the same data as a CovariateObject instead of a path.
Datasets¶
Built-in datasets and synthetic data builders for quick experimentation:
su.available_datasets_list() # list bundled datasets
ds = su.load_dataset("1kgp") # returns a SNPObject
snpobj = su.build_synthetic_snp_dataset()
laiobj = su.build_synthetic_chromosome_painting_dataset()
admobj = su.build_synthetic_admixture_dataset()
phen = su.build_synthetic_phenotype_dataset()
grg = su.build_synthetic_grg()
mdpca_ds = su.build_synthetic_mdpca_dataset()
maas_ds = su.build_synthetic_maasmds_dataset()