【Zenodo】Optimized SNP Reference VCFs for FACETS Analysis (hg38/GRCh38)【Is this useful?】 #208

@kojix2

Hello.
Yesterday, I noticed the following project had been uploaded to Zenodo.
I would appreciate hearing the opinions of the FACETS community regarding whether this dataset is actually useful.
I have absolutely no connection to the author of this project and am merely copying and pasting the README.

https://zenodo.org/records/17226698


Title: Optimized SNP Reference VCFs for FACETS Analysis (hg38/GRCh38)

1. Summary

This deposit contains two reference VCF files for the human genome (hg38/GRCh38). They list common Single Nucleotide Polymorphisms (SNPs) and are specifically designed for use with allele-specific copy number analysis tools, such as FACETS.

The main feature of these files is a uniform SNP density of approximately 1 SNP per kilobase (kb), which is intended to speed up the pre-processing step and make segmentation more robust.

2. Rationale

Official SNP files (e.g., from dbSNP) present two challenges for FACETS-like analyses:

  • Excessive file size: They contain millions of rare variants, which considerably slows down the pre-processing step (snp-pileup).
  • Non-uniform density: SNP 'hotspots' with very high density can introduce bias into segmentation algorithms.

These optimized reference files solve both problems by providing a lightweight, clean, and evenly distributed set of markers.

3. Contents of the Deposit

  • facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz
    • The SNP reference for use with BAM files where chromosomes are named 'chr1', 'chr2', etc.
  • facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz.tbi
    • The Tabix index for the above file.
  • facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz
    • The SNP reference for use with BAM files where chromosomes are named '1', '2', etc.
  • facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz.tbi
    • The Tabix index for the above file.
  • uniformize_vcf_density.py
    • The Python script used to generate these reference files.
  • README.md
    • This information file.

4. Generation Workflow (Transparency)

The generation process for these files is fully reproducible:

  1. Primary Source: Data was derived from the official dbSNP b157 VCF for the GRCh38/hg38 assembly (GCF_000001405.40.gz), downloaded from the NCBI FTP server.
  2. Primary Chromosome Filtering: The VCF was first filtered to retain only the primary assembly chromosomes (1-22, X, Y, M), excluding alternate haplotypes and unplaced contigs.
  3. Initial SNP Filtering: The source VCF was subsequently filtered using bcftools to retain only common (INFO/COMMON=1), bi-allelic SNPs.
  4. Chromosome Renaming: NCBI-style chromosome names (NC_...) were converted to the two standard nomenclatures (chr and no-chr) using the official GRCh38.p14 assembly report.
  5. Density Uniformization: The uniformize_vcf_density.py script was run on the filtered files to select the most informative SNP (allele frequency closest to 0.5) within each 1 kb window.
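The density uniformization in step 5 can be illustrated with a minimal Python sketch. This is not the deposited uniformize_vcf_density.py, whose code is not reproduced here. It assumes the alternate allele frequency has already been extracted for each SNP, that windows are fixed 1 kb bins counted from the start of each chromosome, and that ties are resolved in favor of the first SNP seen. The names Snp and pick_one_per_window are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Snp(NamedTuple):
    chrom: str
    pos: int    # 1-based position, as in a VCF record
    af: float   # alternate allele frequency (assumed already extracted)


def pick_one_per_window(snps: Iterable[Snp], window: int = 1000) -> List[Snp]:
    """Keep, for each (chromosome, window) bin, the SNP with AF closest to 0.5."""
    best: Dict[Tuple[str, int], Snp] = {}
    for snp in snps:
        # Positions 1..1000 fall in bin 0, 1001..2000 in bin 1, and so on.
        key = (snp.chrom, (snp.pos - 1) // window)
        current = best.get(key)
        # Strict "<" means the first SNP seen wins a tie.
        if current is None or abs(snp.af - 0.5) < abs(current.af - 0.5):
            best[key] = snp
    # Dicts preserve insertion order, so bins are returned in first-seen order.
    return list(best.values())


example = [
    Snp("chr1", 100, 0.05),
    Snp("chr1", 900, 0.45),
    Snp("chr1", 1000, 0.30),
    Snp("chr1", 1001, 0.10),
    Snp("chr2", 500, 0.90),
]
selected = pick_one_per_window(example)
```

In this toy input the first three SNPs share one window on chr1, so only the one at position 900 (AF 0.45) is kept, alongside the single SNPs in the other two bins.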

5. Recommended Usage

  1. Download the VCF file (and its .tbi index) that matches the chromosome naming style of your BAM files.
  2. Provide this VCF file as the SNP reference to the snp-pileup tool or a Galaxy wrapper for FACETS.

Example command:

snp-pileup -g -q15 -Q20 facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz pileup.csv.gz normal.bam tumor.bam
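Choosing the file that matches the BAM naming style (step 1 of the usage notes) can be checked from the BAM header, for example the text printed by `samtools view -H normal.bam`. The sketch below is an illustration only: the function names naming_style and reference_vcf_for are hypothetical, and it assumes the header lists chromosome 1 as either 'chr1' or '1'.

```python
def naming_style(sam_header: str) -> str:
    """Return 'chr_style' or 'no_chr_style' based on the @SQ lines of a SAM header."""
    names = set()
    for line in sam_header.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    if "chr1" in names:
        return "chr_style"
    if "1" in names:
        return "no_chr_style"
    raise ValueError("header names chromosome 1 as neither 'chr1' nor '1'")


def reference_vcf_for(sam_header: str) -> str:
    """Build the name of the deposited VCF that matches the header's naming style."""
    return f"facets_reference_snps_hg38_uniform_1kb.{naming_style(sam_header)}.vcf.gz"


# Tiny hand-written header for illustration; a real one has many more @SQ lines.
header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
vcf_name = reference_vcf_for(header)
```

A mismatch between the VCF and BAM naming styles is worth ruling out before a long pileup run, since the SNP positions would not line up with any contig in the BAM files.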
