Skip to content

Bug in windows version of Rsamtools Fasta sequence retrieval #14

@dnieuw

Description

@dnieuw

I discovered a bug in the windows version of Rsamtools, which does not occur in the same version on Ubuntu 18.04 or CentOS 7. It may be related to #5 and #12, but the examples I have might help with solving the issue.

I'm trying to extract sequences from a large reference database of virus references found here:
ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrl*.seq.gz (about 4.2GB in total) (my current database may differ a bit from the most recent database though, let me know if you would like to have the exact same database).

I'm using these same commands on the 3 systems, not all cases go wrong, but I have isolated a case in idx nr 2621935 which gives different results on windows as compared to Ubuntu and CentOS.

> gbvrl_fa <- open(FaFile("gbvrl.nt.fasta"))
> idx <- scanFaIndex(gbvrl_fa)
> sequence <- scanFa(gbvrl_fa, idx[2621935])

On windows I get this result:

  A DNAStringSet instance of length 1
    width seq                                                                                   names               
[1]  4979 CATGGGTATGGCATCAGTCCTAGATAAAGGGACTGGCAAGT...TTTAACTGTCGTATCTCTTGTCTGTGGTATACTTAGTCTAG EU851411.1

On the other two platforms I get this result (which is the correct sequence, the sequence on windows is incorrect):

A DNAStringSet instance of length 1
    width seq                                                                                   names
[1]  4979 TTGAACAAGTAACCAGTCGTAAG...CTTACGACTGGTTACTTGTTCAA EU851411.1

The accession numbers match, but the sequences don't...

A second hint is the last sequence in the index, which the 2 non-windows platforms retrieve just fine, but the windows version gives an error:

Windows:

> scanFa(gbvrl_fa, idx[length(idx)])
Error in value[[3L]](cond) : 
   record 1 (FN398484.1:1-903) contains invalid DNA letters
  file: gbvrl.nt.fasta

Other two:
image

The last hint is an error claiming that the reference is not present in the current index, which happens with some sequences:

Windows:

> scanFa(gbvrl_fa, idx[1400001])
Error in value[[3L]](cond) :  record 1 (KY058615.1:1-1166) failed
  file: gbvrl.nt.fasta

Other two:
image

Sequences in the index before around 1,300,000 seem to succeed at retrieving the correct sequence and sequences after around 1,400,000 seem to fail, I don't know if that makes sense.

I've checked the Rsamtools and dependencies's versions and they match. Also the md5sum of all the fasta and fasta.fai files match.

I hope that these examples shine some light on the issues.

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] shinydashboard_0.7.1  rbokeh_0.5.0          shinyWidgets_0.4.8    shinycssloaders_0.2.0 stringr_1.4.0        
 [6] ggplot2_3.2.1         data.table_1.12.2     shiny_1.3.2           Rsamtools_2.0.3       Biostrings_2.52.0    
[11] XVector_0.24.0        GenomicRanges_1.36.1  GenomeInfoDb_1.20.0   IRanges_2.18.3        S4Vectors_0.22.0     
[16] BiocGenerics_0.30.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2             lattice_0.20-38        assertthat_0.2.1       digest_0.6.20          mime_0.7              
 [6] R6_2.4.0               evaluate_0.14          httr_1.4.1             pillar_1.4.2           zlibbioc_1.30.0       
[11] rlang_0.4.0            lazyeval_0.2.2         rstudioapi_0.10        hexbin_1.27.3          DT_0.9                
[16] rmarkdown_1.16         BiocParallel_1.18.1    htmlwidgets_1.5.1      RCurl_1.95-4.12        munsell_0.5.0         
[21] compiler_3.6.1         httpuv_1.5.1           xfun_0.10              gistr_0.4.2            pkgconfig_2.0.3       
[26] htmltools_0.3.6        tidyselect_0.2.5       tibble_2.1.3           GenomeInfoDbData_1.2.1 codetools_0.2-16      
[31] crayon_1.3.4           dplyr_0.8.3            withr_2.1.2            later_0.8.0            bitops_1.0-6          
[36] grid_3.6.1             jsonlite_1.6           xtable_1.8-4           gtable_0.3.0           magrittr_1.5          
[41] scales_1.0.0           stringi_1.4.3          pryr_0.1.4             promises_1.0.1         tools_3.6.1           
[46] glue_1.3.1             purrr_0.3.2            crosstalk_1.0.0        maps_3.3.0             yaml_2.2.0            
[51] colorspace_1.4-1       BiocManager_1.30.8     knitr_1.25

CentOS 7
image

Ubuntu 18.04
image

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions