Datasets

Helpers for discovering bundled example datasets and generating synthetic data for tests, tutorials, and quick experiments. Synthetic builders return fully populated objects matching the shapes described in Data Model.

Example Datasets

The 1000 Genomes Project dataset is registered as 1kgp. Its resource options include grch38_biallelic, phase3, and high_coverage_2022. The high_coverage_2022 resource corresponds to the phased 30x high-coverage SNV/INDEL/SV panel from Byrska-Bishop et al., Cell 2022, “High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios”. Because these VCFs are large, load_dataset("1kgp", resource="high_coverage_2022") defaults to chromosome 1; pass chromosomes= explicitly to load additional chromosomes.
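As a rough illustration of overriding the chromosome 1 default, the sketch below collects keyword arguments from the load_dataset signature documented in this section. The helper name load_1kgp_subset, the chosen chromosomes, and the population codes are assumptions made for the example. Whether these codes match the labels in the dataset metadata has not been verified here.

```python
# Keyword arguments taken from the load_dataset signature documented below.
# The values are illustrative choices, not recommended defaults.
LOAD_KWARGS = {
    "resource": "high_coverage_2022",
    "chromosomes": [21, 22],        # explicit list overrides the chromosome 1 default
    "populations": ["YRI", "CEU"],  # assumed to match labels in the population metadata
    "samples_per_population": 10,
}


def load_1kgp_subset():
    """Hypothetical helper: load a small 1kgp subset (may download large files)."""
    import snputils  # imported lazily so this sketch can be read without the package

    return snputils.load_dataset("1kgp", **LOAD_KWARGS)


# Usage (left commented out because it triggers downloads):
# snpobj = load_1kgp_subset()
# print(snpobj.sample_fid[:5])  # population labels, per the Returns description
```

Keeping the arguments in a dictionary makes it easy to reuse the same selection across notebooks and tests.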

snputils.available_datasets_list()

Get the list of available dataset registry names.

snputils.load_dataset(name, chromosomes=None, variants_ids=None, sample_ids=None, resource=None, data_home=None, output_dir=None, genotype_sources=None, download_genotypes=True, populations=None, samples_per_population=None, max_variants=None, require_biallelic=False, require_complete=False, require_polymorphic=False, snv_only=False, metadata_path=None, metadata_url=None, panel_path=None, panel_url=None, verbose=True, **read_kwargs)

Load a genome dataset.

Parameters:
  • name (str) – Name of the dataset to load. Call available_datasets_list() to get the list of available datasets.

  • chromosomes (List[str] | List[int] | str | int) – Chromosomes to load.

  • variants_ids (List[str]) – List of variant IDs to load.

  • sample_ids (List[str]) – List of sample IDs to load.

  • resource (str) – Optional dataset genotype resource name. If omitted, the dataset default is used. For name="1kgp", available resources include "grch38_biallelic", "phase3", and "high_coverage_2022".

  • data_home (Path | str) – Optional dataset cache directory root.

  • output_dir (Path | str) – Optional directory for downloaded source files and intermediate files.

  • genotype_sources – Optional local paths or URLs to use instead of a registry chromosome resource.

  • download_genotypes – Whether remote genotype sources should be downloaded.

  • populations – Optional population labels to select from dataset metadata.

  • samples_per_population – Optional number of samples to take from each selected population.

  • max_variants – Optional maximum number of variants to read by streaming source files directly.

  • require_biallelic – When max_variants is set and source files are streamed, keep only variants with exactly one REF allele and one ALT allele.

  • require_complete – When max_variants is set and source files are streamed, keep only variants with no missing genotype calls across the selected samples.

  • require_polymorphic – When max_variants is set and source files are streamed, keep only variants that are polymorphic among the selected samples after any sample filtering.

  • snv_only – When max_variants is set and source files are streamed, keep only biallelic single-nucleotide variants. This implies the same biallelic filter as require_biallelic=True and additionally removes multi-base substitutions, indels, and symbolic alleles.

  • metadata_path – Optional local population metadata path.

  • metadata_url – Optional population metadata URL.

  • panel_path – Backward-compatible alias for metadata_path.

  • panel_url – Backward-compatible alias for metadata_url.

  • verbose (bool) – Whether to show progress.

  • **read_kwargs – Keyword arguments to pass to PGENReader.read().

Returns:

SNPObject – SNPObject containing the loaded dataset. If population metadata is used, population labels are stored in sample_fid, sex labels are stored in sample_sex, and the aligned metadata table is attached as sample_metadata.
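The streaming filters described for require_biallelic, require_complete, require_polymorphic, and snv_only can be pictured with the plain-Python sketch below. It is not the library's implementation: the record layout (ref, alts, calls), the helper names, and the assumption that max_variants caps the number of variants kept after filtering are all hypothetical.

```python
from itertools import islice

BASES = {"A", "C", "G", "T"}


def is_biallelic(variant):
    # Exactly one REF allele and one ALT allele.
    return len(variant["alts"]) == 1


def is_complete(variant):
    # No missing allele (None) in any selected sample's call.
    return all(a is not None for call in variant["calls"] for a in call)


def is_polymorphic(variant):
    # More than one distinct allele observed among non-missing calls.
    observed = {a for call in variant["calls"] for a in call if a is not None}
    return len(observed) > 1


def is_snv(variant):
    # Biallelic with single-base REF and ALT; this rejects indels,
    # multi-base substitutions, and symbolic alleles such as "<DEL>".
    return (
        is_biallelic(variant)
        and variant["ref"] in BASES
        and variant["alts"][0] in BASES
    )


def stream_filtered(variants, max_variants, *, require_biallelic=False,
                    require_complete=False, require_polymorphic=False,
                    snv_only=False):
    checks = []
    if require_biallelic:
        checks.append(is_biallelic)
    if require_complete:
        checks.append(is_complete)
    if require_polymorphic:
        checks.append(is_polymorphic)
    if snv_only:
        checks.append(is_snv)
    kept = (v for v in variants if all(check(v) for check in checks))
    # Assumption: the cap applies to variants that pass the filters.
    return list(islice(kept, max_variants))


VARIANTS = [
    {"id": "v1", "ref": "A", "alts": ["G"], "calls": [(0, 1), (0, 0)]},
    {"id": "v2", "ref": "A", "alts": ["G", "T"], "calls": [(0, 1), (2, 0)]},
    {"id": "v3", "ref": "AT", "alts": ["A"], "calls": [(0, 1), (1, 1)]},
    {"id": "v4", "ref": "C", "alts": ["<DEL>"], "calls": [(0, 1), (0, 0)]},
    {"id": "v5", "ref": "C", "alts": ["T"], "calls": [(0, None), (0, 1)]},
    {"id": "v6", "ref": "G", "alts": ["A"], "calls": [(0, 0), (0, 0)]},
    {"id": "v7", "ref": "T", "alts": ["C"], "calls": [(1, 1), (0, 1)]},
]

strict = stream_filtered(
    VARIANTS, 10, require_complete=True, require_polymorphic=True, snv_only=True
)
print([v["id"] for v in strict])
```

In this toy input only v1 and v7 survive all filters: v2 is multiallelic, v3 is an indel, v4 has a symbolic ALT, v5 has a missing call, and v6 is monomorphic.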

Synthetic Data Builders

snputils.build_synthetic_snp_dataset(n_samples=12, n_snps=100, seed=0, *, n_populations=3, missing_rate=0.0, phased=True, chromosome='1', sample_prefix='S', variant_prefix='rs_syn')

Build a small SNPObject with realistic metadata and population structure.

Population labels are stored in sample_fid.

snputils.build_synthetic_admixture_dataset(n_samples, n_windows, n_covariates, seed, *, ancestry_map=None)

Create a local-ancestry dataset with sparse quantitative trait effects.

The generated data are intentionally small and self-contained, making them suitable for tutorials, tests, and benchmarks that need realistic-looking admixture mapping inputs without downloading external files.

snputils.build_synthetic_chromosome_painting_dataset(n_samples=3, windows_per_chromosome=60, seed=42, *, build='hg38', chromosomes=None, ancestry_map=None, male_samples=None)

Build synthetic LAI data for chromosome painting examples.

By default this simulates diploid local ancestry across chromosomes 1-22 and X. The default sample metadata marks every sample as female so the reported sex is consistent with the diploid-X simulation used for painting.

snputils.build_synthetic_grg()

Build a tiny deterministic GRGObject.

snputils.build_synthetic_maasmds_dataset(n_samples_per_array=200, n_snps_per_array=1000, n_arrays=3, seed=0, *, ancestry_map=None, triple_shared_fraction=0.25, pair_shared_fraction=0.2)

Build multi-array SNP, LAI, and labels inputs for maasMDS examples.

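With the default three arrays and 1,000 SNPs per array, each array has 250 SNPs shared across all three arrays, 200 SNPs shared with each of the two other arrays (400 in total), and 350 array-specific SNPs, so that 250 + 400 + 350 = 1,000. Overlap therefore decays from the full 1,000 SNPs within an array, to 450 SNPs in each pairwise intersection (250 + 200), to 250 SNPs in the three-way intersection.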
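The default counts can be reproduced with a few lines of arithmetic. The function below is a hypothetical helper for the three-array case only; it assumes the counts are obtained by rounding fraction times n_snps_per_array and that pair-shared SNPs are distinct from the SNPs shared by all arrays, which is what the stated totals imply.

```python
def overlap_counts(n_snps_per_array=1000,
                   triple_shared_fraction=0.25,
                   pair_shared_fraction=0.2):
    """Per-array SNP overlap counts for exactly three arrays (illustrative)."""
    shared_by_all = round(n_snps_per_array * triple_shared_fraction)
    shared_per_pair = round(n_snps_per_array * pair_shared_fraction)
    # Each array has two partner arrays, so pair-shared SNPs count twice.
    array_specific = n_snps_per_array - shared_by_all - 2 * shared_per_pair
    return {
        "shared_by_all": shared_by_all,
        "shared_per_pair": shared_per_pair,
        "array_specific": array_specific,
        # SNPs present on both arrays of a pair, including the three-way set.
        "pairwise_intersection": shared_by_all + shared_per_pair,
    }


counts = overlap_counts()
print(counts)
```

With the defaults this gives 250 SNPs shared by all arrays, 200 per pair, 350 array-specific, and a pairwise intersection of 450.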

Genotypes are sampled haplotype-by-haplotype from local ancestry states. Within each continental ancestry we simulate multiple latent sources so ancestry-masked analyses still contain population-specific structure.

snputils.build_synthetic_mdpca_dataset(n_samples=200, n_snps=1000, seed=0, *, ancestry_map=None)

Build one-array SNP, LAI, and labels inputs for mdPCA examples.

Notes

  • This generator is intended for missing-data mdPCA tutorials.

  • Genotypes are generated from sample-level label structure, while LAI is generated separately. For samples with admixed labels, haplotype ancestry states therefore do not drive ancestry-specific allele frequencies.

  • For ancestry-specific masking demos, prefer build_synthetic_maasmds_dataset(), which couples haplotype genotypes to local ancestry states.
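The distinction drawn in these notes can be illustrated with a toy simulation. This is a conceptual sketch, not the code of either builder: the function names, the nested-list layout, and the extreme allele frequencies (0.0 and 1.0, chosen so the outcome is deterministic) are assumptions for the example.

```python
import random


def sample_coupled(lai, freqs, rng):
    """Draw each allele using the frequency of that haplotype's local ancestry.

    lai[s][h][j] is the ancestry index of sample s, haplotype h, SNP j;
    freqs[a][j] is the ALT allele frequency of ancestry a at SNP j.
    """
    return [
        [[int(rng.random() < freqs[anc][j]) for j, anc in enumerate(hap)]
         for hap in sample]
        for sample in lai
    ]


def sample_label_driven(labels, freqs, n_snps, rng):
    """Draw both haplotypes from the sample's label frequencies, ignoring LAI."""
    return [
        [[int(rng.random() < freqs[label][j]) for j in range(n_snps)]
         for _ in range(2)]
        for label in labels
    ]


rng = random.Random(0)
n_snps = 4
# Ancestry 0 never carries ALT, ancestry 1 always does.
freqs = [[0.0] * n_snps, [1.0] * n_snps]
lai = [
    [[0, 0, 1, 1], [1, 1, 0, 0]],
    [[1, 0, 1, 0], [0, 0, 0, 0]],
]
labels = [0, 1]

coupled = sample_coupled(lai, freqs, rng)
label_driven = sample_label_driven(labels, freqs, n_snps, rng)
print(coupled)
print(label_driven)
```

With these extreme frequencies the coupled genotypes mirror the LAI matrix exactly, while the label-driven genotypes depend only on each sample's label. That is why masking by ancestry is informative only in the coupled case.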

snputils.build_synthetic_phenotype_dataset(n_samples=24, n_snps=200, seed=0, *, snpobj=None, missing_rate=0.0)

Build a sample-aligned SNP/phenotype/covariate cohort for tutorials.

The returned objects are intentionally small and self-contained so docs and tests can demonstrate phenotype handling, file readers, and association workflows without downloading external data.