fix: quote-aware VCF header parser preserves '=' in Description values (#80, #89)#93

Open
jmg421 wants to merge 7 commits into Bioconductor:devel from jmg421:devel

@jmg421 jmg421 commented Jun 15, 2026

htslib/scanBcfHeader splits structured header fields on = without respecting double-quoted values, silently truncating Description strings that contain = (e.g. VRS version=2.0.1).

Adds .parseVcfHeaderBody() and .parseRawVcfHeader(), which re-parse the raw header text and patch each header DataFrame with the correct values. Includes a regression test.

Fixes #80, fixes #89.
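
To make the failure mode concrete, here is a minimal base-R sketch of quote-aware splitting for the body of one structured header line. It is not the PR's `.parseVcfHeaderBody()`: the function name `parse_header_body_sketch` and the sample `ID`, `Number`, and `Type` values are assumptions for illustration, and backslash-escaped quotes are deliberately not handled.

```r
# Hypothetical helper (NOT the PR's .parseVcfHeaderBody): split the text
# between '<' and '>' of a structured header line into key/value pairs,
# treating ',' and '=' inside double quotes as ordinary characters.
parse_header_body_sketch <- function(body) {
  chars <- strsplit(body, "", fixed = TRUE)[[1]]
  in_quotes <- FALSE
  fields <- character(0)
  current <- ""
  for (ch in chars) {
    if (ch == "\"") {
      in_quotes <- !in_quotes        # toggle state and drop the quote itself
    } else if (ch == "," && !in_quotes) {
      fields <- c(fields, current)   # an unquoted comma ends the field
      current <- ""
    } else {
      current <- paste0(current, ch)
    }
  }
  fields <- c(fields, current)
  # Split each field on its first '=' only, so later '=' stay in the value.
  eq <- regexpr("=", fields, fixed = TRUE)
  values <- substring(fields, eq + 1L)
  names(values) <- substring(fields, 1L, eq - 1L)
  values
}

# Sample body: only the Description text comes from the PR description;
# the other keys are made up for the example.
body <- 'ID=VRS,Number=1,Type=String,Description="VRS version=2.0.1"'
parsed <- parse_header_body_sketch(body)
parsed[["Description"]]
```

Splitting on the first `=` of each field, after the quote-aware comma split, is what keeps `version=2.0.1` inside the Description value.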

jmg421 added 7 commits June 12, 2026 10:19
Bioconductor#86)

When query has no overlap with the CDS, .localCoordinates() returns a
zero-length GRanges. Previously an early return on length(txlocal)==0
caused REFAA and VARAA to be absent from mcols(), returning NULL instead
of empty AAStringSet objects. This breaks downstream operations like
reverse() and subseq() on the result columns.

Fix:
- Remove early return so the full mcols-building code runs even when
  txlocal is empty, naturally producing zero-length AAStringSet columns
- Fix GENEID=NA_character_ -> rep(NA_character_, length(txlocal)) so
  DataFrame() construction works correctly at zero length

Test: extend test_predictCoding_empty to assert REFAA and VARAA are
AAStringSet with length 0.
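
The zero-length issue with a scalar NA can be illustrated in base R. This is an analogy only: `data.frame` stands in for `DataFrame()`, and the column `QUERYID` and the variable `n` are placeholders, not the package's code.

```r
# Base-R analogy: data.frame stands in for S4Vectors::DataFrame.
n <- 0L  # stands in for length(txlocal) when the query misses the CDS

# A length-1 NA cannot be combined with zero-length columns.
scalar_na <- tryCatch(
  data.frame(QUERYID = integer(n), GENEID = NA_character_),
  error = function(e) conditionMessage(e)
)

# Sizing the NA vector to n works for n == 0 as well as n > 0.
sized_na <- data.frame(QUERYID = integer(n), GENEID = rep(NA_character_, n))
nrow(sized_na)
```
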
…lassification

Multi-nucleotide variants (MNVs/DBS) can produce VARAA strings like 'P*'
or '*W', which an exact match with %in% '*' misses. Switch to
grepl('*', ..., fixed=TRUE) so any VARAA containing a stop codon is
classified as 'nonsense' rather than 'nonsynonymous'.

Fixes Bioconductor#86. Adds unit test test_predictCoding_nonsense_DBS covering
a DBS that introduces a stop at a codon boundary.
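
The difference between exact and substring matching can be checked on plain character vectors. A minimal sketch: the vector `varaa` is illustrative (the package works on VARAA columns, not this object), and the labels mirror the two classes named in the commit message.

```r
# Illustrative amino-acid strings; 'P*' and '*W' are the MNV/DBS cases.
varaa <- c("P*", "*W", "*", "PW")

# Exact matching only catches a VARAA that is exactly "*".
exact <- varaa %in% "*"

# With fixed = TRUE the pattern is a literal character, so no regex
# escaping is needed and any string containing "*" matches.
contains <- grepl("*", varaa, fixed = TRUE)

consequence <- ifelse(contains, "nonsense", "nonsynonymous")
```
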
mcols(rdexp) <- NULL unconditionally erased all user-added metadata
columns from rowRanges during CollapsedVCF expansion. Fix: compute
the set of non-VCF-fixed columns (anything not in REF/ALT/QUAL/
FILTER/paramRangeID) and retain them in the expanded object; the
fixed columns are dropped as before since they are rebuilt from fexp.

Fixes Bioconductor#85.
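
The column selection described above amounts to a `setdiff()` against the fixed names. A rough base-R sketch, assuming a plain `data.frame` in place of the rowRanges metadata and a made-up user column `myScore`:

```r
# Columns that are rebuilt during expansion, as listed in the commit message.
fixed_cols <- c("REF", "ALT", "QUAL", "FILTER", "paramRangeID")

# Stand-in for mcols(rowRanges); 'myScore' is a hypothetical user column.
mc <- data.frame(
  paramRangeID = NA, REF = "A", ALT = "T", QUAL = 50, FILTER = "PASS",
  myScore = 0.9, stringsAsFactors = FALSE
)

# Keep everything the user added; drop = FALSE preserves the table shape
# even when only one column survives.
keep <- setdiff(colnames(mc), fixed_cols)
user_cols <- mc[, keep, drop = FALSE]
```
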
… all-NA seqinfo (Bioconductor#78)

- .contigsFromSeqinfo() now returns character(0) when all seqlengths and
  genome are NA, avoiding noisy '##contig=<ID=x>' placeholder lines
- .formatHeader() single-value branch no longer overwrites an existing
  fileDate with today's date; the original value is preserved
- META branch likewise only adds fileDate when absent
- Add regression test test_predictCoding_exon_intron_boundary (Bioconductor#83)
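
The "only add fileDate when absent" rule can be sketched on a named list. The helper `add_file_date_if_absent` and the list representation are assumptions for illustration and do not reflect how `.formatHeader()` stores metadata.

```r
# Hypothetical helper: set fileDate only if the header metadata lacks one.
add_file_date_if_absent <- function(meta, today = format(Sys.Date(), "%Y%m%d")) {
  if (!"fileDate" %in% names(meta)) {
    meta[["fileDate"]] <- today
  }
  meta
}

existing <- add_file_date_if_absent(list(fileformat = "VCFv4.2", fileDate = "20200101"))
missing  <- add_file_date_if_absent(list(fileformat = "VCFv4.2"), today = "20260612")
```
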
Bioconductor#80, Bioconductor#89)

htslib/scanBcfHeader splits structured header fields on '=' without
respecting double-quoted values, silently truncating Description strings
that contain '=' (e.g. VRS version=2.0.1). Add .parseVcfHeaderBody()
and .parseRawVcfHeader() to re-parse raw header text and patch each
DataFrame back to the correct values.

Fixes Bioconductor#80, fixes Bioconductor#89

Development

Successfully merging this pull request may close these issues.

- possible problem parsing INFO description field
- readVcf doesn't handle quoted = in hapmap_exome_chr22.vcf.gz correctly
