Datasets

Helpers for discovering and loading example datasets.

snputils.available_datasets_list()

Get the list of available dataset registry names.
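A minimal sketch of validating a dataset name against the registry before attempting a load. The helper `check_dataset_name` and its `list_fn` parameter are hypothetical and not part of snputils; `list_fn` exists only so the check can be exercised without the package installed or with a fixed list of names.

```python
def check_dataset_name(name, list_fn=None):
    """Return `name` if the registry lists it, otherwise raise ValueError.

    `list_fn` is a hypothetical hook that supplies the available names.
    When omitted, the helper falls back to snputils.available_datasets_list().
    """
    if list_fn is None:
        import snputils  # imported lazily so the helper works without it in tests

        list_fn = snputils.available_datasets_list
    available = list(list_fn())
    if name not in available:
        raise ValueError(
            f"Unknown dataset {name!r}; available: {', '.join(sorted(available))}"
        )
    return name
```

With snputils installed, calling `check_dataset_name(name)` without `list_fn` consults the real registry, so a typo in the name fails early with the valid choices listed.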

snputils.load_dataset(name, chromosomes=None, variants_ids=None, sample_ids=None, resource=None, data_home=None, output_dir=None, genotype_sources=None, download_genotypes=True, populations=None, samples_per_population=None, max_variants=None, require_biallelic=False, require_complete=False, require_polymorphic=False, snv_only=False, metadata_path=None, metadata_url=None, panel_path=None, panel_url=None, verbose=True, **read_kwargs)

Load a genome dataset.

Parameters:

name (str) – Name of the dataset to load. Call available_datasets_list() to get the list of available datasets.
chromosomes (List[str] | List[int] | str | int) – Chromosomes to load.
variants_ids (List[str]) – List of variant IDs to load.
sample_ids (List[str]) – List of sample IDs to load.
resource (str) – Optional dataset genotype resource name. If omitted, the dataset default is used.
data_home (Path | str) – Optional dataset cache directory root.
output_dir (Path | str) – Optional directory for downloaded source files and intermediate files.
genotype_sources – Optional local paths or URLs to use instead of a registry chromosome resource.
download_genotypes – Whether remote genotype sources should be downloaded.
populations – Optional population labels to select from dataset metadata.
samples_per_population – Optional number of samples to take from each selected population.
max_variants – Optional maximum number of variants to read by streaming source files directly.
require_biallelic – If streaming source files, keep only biallelic variants.
require_complete – If streaming source files, keep only variants with complete genotypes.
require_polymorphic – If streaming source files, keep only polymorphic variants.
snv_only – If streaming source files, keep only biallelic SNVs.
metadata_path – Optional local population metadata path.
metadata_url – Optional population metadata URL.
panel_path – Backward-compatible alias for metadata_path.
panel_url – Backward-compatible alias for metadata_url.
verbose (bool) – Whether to show progress.
**read_kwargs – Keyword arguments to pass to PGENReader.read().
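The four streaming filters (require_biallelic, require_complete, require_polymorphic, snv_only) can be read as per-variant predicates. The sketch below illustrates them on a toy record using conventional definitions. It is not the snputils implementation, and the library may differ in details such as how missing calls are handled. The record layout (a REF string, a list of ALT strings, and genotypes as tuples of allele indices with `None` for a missing call) and all function names are assumptions made for illustration.

```python
MISSING = None  # assumed marker for a missing allele call in this sketch


def is_biallelic(alts):
    """Exactly one alternate allele."""
    return len(alts) == 1


def is_snv(ref, alts):
    """Biallelic, with REF and ALT each a single nucleotide."""
    return (
        is_biallelic(alts)
        and len(ref) == 1
        and len(alts[0]) == 1
        and ref in "ACGT"
        and alts[0] in "ACGT"
    )


def is_complete(genotypes):
    """No sample has a missing allele call."""
    return all(allele is not MISSING for gt in genotypes for allele in gt)


def is_polymorphic(genotypes):
    """At least two distinct alleles are observed among the called alleles."""
    observed = {allele for gt in genotypes for allele in gt if allele is not MISSING}
    return len(observed) > 1


def keep_variant(ref, alts, genotypes, *, require_biallelic=False,
                 require_complete=False, require_polymorphic=False,
                 snv_only=False):
    """Apply only the filters that are switched on; keep the variant if all pass."""
    if require_biallelic and not is_biallelic(alts):
        return False
    if require_complete and not is_complete(genotypes):
        return False
    if require_polymorphic and not is_polymorphic(genotypes):
        return False
    if snv_only and not is_snv(ref, alts):
        return False
    return True
```

With every flag left at its default of False, `keep_variant` keeps all records, which mirrors the defaults in the signature above.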

Returns:

SNPObject – SNPObject containing the loaded dataset. If population metadata is used, population labels are stored in sample_fid and sex labels are stored in sample_sex.
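As a hedged usage sketch, the wrapper below forwards a population-based selection to load_dataset and then tallies the labels stored in sample_fid. The helper names and the `loader` parameter are hypothetical; `loader` is only a hook so the call pattern can be tried with a stand-in function. Dataset names and population labels depend on the registry and its metadata, so none are hard-coded here.

```python
from collections import Counter


def load_population_subset(name, populations, samples_per_population,
                           loader=None, **extra):
    """Call load_dataset with a population selection and progress output off.

    `loader` is a hypothetical hook; by default snputils.load_dataset is used.
    `extra` is forwarded unchanged (for example chromosomes or max_variants).
    """
    if loader is None:
        import snputils  # imported lazily so the wrapper can be tested without it

        loader = snputils.load_dataset
    return loader(
        name,
        populations=populations,
        samples_per_population=samples_per_population,
        verbose=False,
        **extra,
    )


def samples_per_label(snpobj):
    """Count samples per population label, read from snpobj.sample_fid."""
    return Counter(str(label) for label in snpobj.sample_fid)
```

Comparing the counts from `samples_per_label` with the requested `samples_per_population` is a quick way to confirm that the population metadata was applied as expected.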