Datasets

Helpers for discovering and loading example datasets.

snputils.available_datasets_list()

Return the names of the datasets available in the registry.
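
The sketch below shows one way to check a name against the registry before loading. It assumes only that available_datasets_list() returns an iterable of name strings. The helpers resolve_dataset_name and resolve_from_registry are hypothetical and not part of snputils.

```python
from difflib import get_close_matches
from typing import Iterable


def resolve_dataset_name(requested: str, available: Iterable[str]) -> str:
    """Return `requested` if it is a registered name, else raise with suggestions.

    Hypothetical helper for illustration; not part of snputils.
    """
    names = list(available)
    if requested in names:
        return requested
    # Offer up to three similar names to help with typos.
    suggestions = get_close_matches(requested, names, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown dataset {requested!r}.{hint}")


def resolve_from_registry(requested: str) -> str:
    """Validate `requested` against the real registry (requires snputils)."""
    import snputils  # imported lazily so the helper above works without it

    return resolve_dataset_name(requested, snputils.available_datasets_list())
```

Validating the name first means a typo fails quickly with a readable message, before any download or cache lookup is attempted.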

snputils.load_dataset(name, chromosomes=None, variants_ids=None, sample_ids=None, resource=None, data_home=None, output_dir=None, genotype_sources=None, download_genotypes=True, populations=None, samples_per_population=None, max_variants=None, require_biallelic=False, require_complete=False, require_polymorphic=False, snv_only=False, metadata_path=None, metadata_url=None, panel_path=None, panel_url=None, verbose=True, **read_kwargs)

Load a genome dataset.

Parameters:
  • name (str) – Name of the dataset to load. Call available_datasets_list() to get the list of available datasets.

  • chromosomes (List[str] | List[int] | str | int) – Chromosomes to load.

  • variants_ids (List[str]) – List of variant IDs to load.

  • sample_ids (List[str]) – List of sample IDs to load.

  • resource (str) – Optional dataset genotype resource name. If omitted, the dataset default is used.

  • data_home (Path | str) – Optional dataset cache directory root.

  • output_dir (Path | str) – Optional directory for downloaded source files and intermediate files.

  • genotype_sources – Optional local paths or URLs to use instead of a registry chromosome resource.

  • download_genotypes – Whether remote genotype sources should be downloaded.

  • populations – Optional population labels to select from dataset metadata.

  • samples_per_population – Optional number of samples to take from each selected population.

  • max_variants – Optional maximum number of variants to read by streaming source files directly.

  • require_biallelic – If streaming source files, keep only biallelic variants.

  • require_complete – If streaming source files, keep only variants with complete genotypes.

  • require_polymorphic – If streaming source files, keep only polymorphic variants.

  • snv_only – If streaming source files, keep only biallelic SNVs.

  • metadata_path – Optional local population metadata path.

  • metadata_url – Optional population metadata URL.

  • panel_path – Backward-compatible alias for metadata_path.

  • panel_url – Backward-compatible alias for metadata_url.

  • verbose (bool) – Whether to show progress.

  • **read_kwargs – Keyword arguments to pass to PGENReader.read().

Returns:

SNPObject – SNPObject containing the loaded dataset. If population metadata is used, population labels are stored in sample_fid and sex labels are stored in sample_sex.
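
As a rough illustration of how the parameters above combine, the following sketch collects a few of them into keyword arguments and forwards them to load_dataset. The helpers build_load_kwargs and load_subset are hypothetical, only the parameter names come from the signature above, and any dataset name or population label passed in must be one the registry and its metadata actually provide.

```python
from typing import Any, Dict, List, Optional, Union


def build_load_kwargs(
    chromosomes: Union[List[str], List[int], str, int, None] = None,
    populations: Optional[List[str]] = None,
    samples_per_population: Optional[int] = None,
    max_variants: Optional[int] = None,
    snv_only: bool = False,
) -> Dict[str, Any]:
    """Collect a subset of load_dataset keyword arguments.

    Hypothetical helper; not part of snputils. Values left at None or
    False match the documented defaults, so they are dropped.
    """
    kwargs = {
        "chromosomes": chromosomes,
        "populations": populations,
        "samples_per_population": samples_per_population,
        "max_variants": max_variants,
        "snv_only": snv_only,
    }
    return {k: v for k, v in kwargs.items() if v is not None and v is not False}


def load_subset(name: str, **kwargs: Any):
    """Call snputils.load_dataset with the given arguments (requires snputils)."""
    import snputils  # imported lazily so build_load_kwargs works without it

    return snputils.load_dataset(name, verbose=False, **kwargs)
```

A call such as load_subset(name, **build_load_kwargs(chromosomes=22, max_variants=1000, snv_only=True)) would then request at most 1000 biallelic SNVs from chromosome 22. When population metadata is used, the returned SNPObject carries the population labels in sample_fid, as noted under Returns.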