Hello.
Yesterday, I noticed the following project had been uploaded to Zenodo.
I would appreciate hearing the opinions of the FACETS community regarding whether this dataset is actually useful.
I have absolutely no connection to the author of this project and am merely copying and pasting the README.
https://zenodo.org/records/17226698
Title: Optimized SNP Reference VCFs for FACETS Analysis (hg38/GRCh38)
1. Summary
This deposit contains two reference VCF files for the human genome (hg38/GRCh38). They list common Single Nucleotide Polymorphisms (SNPs) and are specifically designed for use with allele-specific copy number analysis tools, such as FACETS.
The main feature of these files is a uniform SNP density of approximately 1 SNP per kilobase (kb), which significantly improves the performance and robustness of the analysis.
2. Rationale
Official SNP files (e.g., from dbSNP) present two challenges for FACETS-like analyses:
- Excessive file size: They contain millions of rare variants, which considerably slows down the pre-processing step (snp-pileup).
- Non-uniform density: SNP 'hotspots' with very high density can introduce bias into segmentation algorithms.
These optimized reference files solve both problems by providing a lightweight, clean, and evenly distributed set of markers.
3. Contents of the Deposit
facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz
- The SNP reference for use with BAM files where chromosomes are named 'chr1', 'chr2', etc.
facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz.tbi
- The Tabix index for the above file.
facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz
- The SNP reference for use with BAM files where chromosomes are named '1', '2', etc.
facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz.tbi
- The Tabix index for the above file.
uniformize_vcf_density.py
- The Python script used to generate these reference files.
README.md
4. Generation Workflow (Transparency)
The generation process for these files is fully reproducible:
- Primary Source: Data was derived from the official dbSNP b157 VCF for the GRCh38/hg38 assembly (GCF_000001405.40.gz), downloaded from the NCBI FTP server.
- Primary Chromosome Filtering: The VCF was first filtered to retain only the primary assembly chromosomes (1-22, X, Y, M), excluding alternate haplotypes and unplaced contigs.
- Initial SNP Filtering: The source VCF was subsequently filtered using bcftools to retain only common (INFO/COMMON=1), bi-allelic SNPs.
- Chromosome Renaming: NCBI-style chromosome names (NC_...) were converted to the two standard nomenclatures (chr and no-chr) using the official GRCh38.p14 assembly report.
- Density Uniformization: The uniformize_vcf_density.py script was run on the filtered files to select the most informative SNP (allele frequency closest to 0.5) within each 1 kb window.
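To make the density uniformization step easier to discuss, here is a minimal Python sketch of the selection rule as the README describes it: within each 1 kb window, keep the SNP whose allele frequency is closest to 0.5. This is not the deposit's uniformize_vcf_density.py. The Snp record, the uniformize function, the fixed (non-sliding) window boundaries, and the assumption that a single allele frequency per SNP is already available are all illustrative choices. The README does not say how the real script reads allele frequencies or breaks ties.

```python
from typing import Iterable, NamedTuple


class Snp(NamedTuple):
    """Hypothetical minimal record; a real VCF line carries many more fields."""
    chrom: str
    pos: int    # 1-based position, as in the VCF POS column
    af: float   # alternate allele frequency (assumed to be already extracted)


def uniformize(snps: Iterable[Snp], window: int = 1000) -> list[Snp]:
    """Keep one SNP per chromosome and fixed-size window: the one with AF closest to 0.5."""
    best: dict[tuple[str, int], Snp] = {}
    for snp in snps:
        # Positions 1..1000 fall in window 0, 1001..2000 in window 1, and so on.
        key = (snp.chrom, (snp.pos - 1) // window)
        current = best.get(key)
        # Strict comparison: on a tie, the SNP seen first in the window is kept.
        if current is None or abs(snp.af - 0.5) < abs(current.af - 0.5):
            best[key] = snp
    # Dicts keep first-insertion order of keys, so position-sorted input
    # yields position-sorted output.
    return list(best.values())


if __name__ == "__main__":
    example = [
        Snp("chr1", 100, 0.10),
        Snp("chr1", 900, 0.48),
        Snp("chr1", 1500, 0.55),
    ]
    for snp in uniformize(example):
        print(snp.chrom, snp.pos, snp.af)
```

One consequence worth noting for the discussion: a rule like this caps the density at one SNP per window but cannot add markers to windows that contain no common SNP, so the resulting density is at most, not exactly, 1 SNP per kb.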
5. Recommended Usage
- Download the VCF file (and its .tbi index) that matches the chromosome naming style of your BAM files.
- Provide this VCF file as the SNP reference to the snp-pileup tool or a Galaxy wrapper for FACETS.
Example command:
snp-pileup -q15 -Q20 facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz normal.bam tumor.bam | gzip > pileup.csv.gz
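The first usage step, matching the VCF to the chromosome naming style of the BAM files, could be sketched in Python as below. The helper name choose_reference_vcf is hypothetical and not part of the deposit. It assumes the contig names have already been collected, for example from the SN fields of the @SQ lines printed by samtools view -H, and it only checks for 'chr1' versus '1'.

```python
CHR_STYLE_VCF = "facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz"
NO_CHR_STYLE_VCF = "facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz"


def choose_reference_vcf(contig_names):
    """Return the deposit file whose naming style matches the given contig names.

    Hypothetical helper: contig_names is any iterable of sequence names taken
    from a BAM header.
    """
    names = set(contig_names)
    if "chr1" in names:
        return CHR_STYLE_VCF
    if "1" in names:
        return NO_CHR_STYLE_VCF
    raise ValueError("neither 'chr1' nor '1' found among the contig names")


if __name__ == "__main__":
    print(choose_reference_vcf(["chr1", "chr2", "chrX"]))
```

Checking both the normal and the tumor BAM this way before running the pileup should catch a naming mismatch early.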