【Zenodo】Optimized SNP Reference VCFs for FACETS Analysis (hg38/GRCh38)【Is this useful?】

Hello.
Yesterday I noticed that the following project had been uploaded to Zenodo.
I would appreciate the FACETS community's opinion on whether this dataset is actually useful.
**I have absolutely no connection to the author of this project and am merely copying and pasting the README.**

https://zenodo.org/records/17226698


---

#### Title: Optimized SNP Reference VCFs for FACETS Analysis (hg38/GRCh38)

#### 1. Summary

This deposit contains two reference VCF files for the human genome (hg38/GRCh38). They list common Single Nucleotide Polymorphisms (SNPs) and are specifically designed for use with allele-specific copy number analysis tools, such as **FACETS**.

The main feature of these files is a **uniform SNP density of approximately 1 SNP per kilobase (kb)**, which significantly improves the performance and robustness of the analysis.

#### 2. Rationale

Official SNP files (e.g., from dbSNP) present two challenges for FACETS-like analyses:
* **Excessive file size:** They contain millions of rare variants, which considerably slows down the pre-processing step (`snp-pileup`).
* **Non-uniform density:** SNP 'hotspots' with very high density can introduce bias into segmentation algorithms.

These optimized reference files solve both problems by providing a lightweight, clean, and evenly distributed set of markers.

#### 3. Contents of the Deposit

* `facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz`
    * The SNP reference for use with BAM files where chromosomes are named **'chr1', 'chr2', etc.**
* `facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz.tbi`
    * The Tabix index for the above file.
* `facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz`
    * The SNP reference for use with BAM files where chromosomes are named **'1', '2', etc.**
* `facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz.tbi`
    * The Tabix index for the above file.
* `uniformize_vcf_density.py`
    * The Python script used to generate these reference files.
* `README.md`
    * This information file.

#### 4. Generation Workflow (Transparency)

The generation process for these files is fully reproducible:
1.  **Primary Source:** Data was derived from the official **dbSNP b157** VCF for the **GRCh38/hg38** assembly (`GCF_000001405.40.gz`), downloaded from the NCBI FTP server.
2.  **Primary Chromosome Filtering:** The VCF was first filtered to retain only the primary assembly chromosomes (1-22, X, Y, M), excluding alternate haplotypes and unplaced contigs.
3.  **Initial SNP Filtering:** The source VCF was subsequently filtered using `bcftools` to retain only **common (INFO/COMMON=1), bi-allelic SNPs**.
4.  **Chromosome Renaming:** NCBI-style chromosome names (`NC_...`) were converted to the two standard nomenclatures (`chr` and `no-chr`) using the official GRCh38.p14 assembly report.
5.  **Density Uniformization:** The `uniformize_vcf_density.py` script was run on the filtered files to select the most informative SNP (allele frequency closest to 0.5) within each **1 kb** window.
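
The `uniformize_vcf_density.py` script itself is not reproduced here, so the following is only a minimal sketch of the selection rule described in step 5, not the author's implementation. The `Snp` record, the `uniformize` function, and the fixed, non-overlapping window scheme are all assumptions made for illustration. The input is assumed to be sorted by chromosome and position, as in an indexed VCF.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Snp(NamedTuple):
    """Hypothetical minimal SNP record; not a type from the deposit."""
    chrom: str
    pos: int    # 1-based position, as in a VCF
    af: float   # alternate-allele frequency in [0, 1]


def uniformize(snps: Iterable[Snp], window: int = 1000) -> Iterator[Snp]:
    """Yield one SNP per (chromosome, window): the one with AF closest to 0.5.

    Assumes the input is sorted by chromosome and position.
    Ties keep the first SNP encountered in the window.
    """
    best: Optional[Snp] = None
    best_key: Optional[Tuple[str, int]] = None
    for snp in snps:
        # Fixed, non-overlapping windows: positions 1..1000 fall in window 0.
        key = (snp.chrom, (snp.pos - 1) // window)
        if key != best_key:
            # A new window starts: emit the winner of the previous one.
            if best is not None:
                yield best
            best, best_key = snp, key
        elif abs(snp.af - 0.5) < abs(best.af - 0.5):
            best = snp
    if best is not None:
        yield best


example = [
    Snp("chr1", 100, 0.10),
    Snp("chr1", 900, 0.45),
    Snp("chr1", 1000, 0.30),
    Snp("chr1", 1001, 0.90),
    Snp("chr2", 5, 0.50),
]
selected = list(uniformize(example))
```

With this rule, windows that contain no common SNP simply contribute no marker, so the resulting density is at most, not exactly, 1 SNP per kb.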

#### 5. Recommended Usage

1.  Download the VCF file (and its `.tbi` index) that matches the chromosome naming style of your BAM files.
2.  Provide this VCF file as the SNP reference to the `snp-pileup` tool or a Galaxy wrapper for FACETS.
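
For step 1, one way to check the naming style is to look at the `@SQ` lines of the BAM header (for example, the text printed by `samtools view -H`). The sketch below is an illustration only: `contigs_from_header` and `pick_reference` are hypothetical helper names, and the header text is assumed to have been obtained separately.

```python
def contigs_from_header(header_text):
    """Collect contig names from the SN: fields of SAM/BAM @SQ header lines."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def pick_reference(contigs):
    """Return the deposit file name matching the contig naming style."""
    names = set(contigs)
    if "chr1" in names:
        style = "chr_style"
    elif "1" in names:
        style = "no_chr_style"
    else:
        raise ValueError("could not find 'chr1' or '1' among the contig names")
    return f"facets_reference_snps_hg38_uniform_1kb.{style}.vcf.gz"


# Made-up header text for illustration.
header = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chr2\tLN:242193529\n"
reference = pick_reference(contigs_from_header(header))
```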

*Example command:*
```bash
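# snp-pileup takes the output file as a positional argument (VCF, output, then BAMs);
# -g gzips the output, so no pipe to gzip is needed.
snp-pileup -g -q15 -Q20 facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz pileup.csv.gz normal.bam tumor.bam
```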