dnaglider

Command-line utility to compute sliding-window genome statistics from a FASTA file.

Installation:

If Go is installed on the machine, the program can be built from source using:

go get -u github.com/cmdoret/dnaglider/dnaglider

Otherwise, binaries can be downloaded from the GitHub repository's releases page.

Usage:

dnaglider only requires a genome. You can also select a window size and which metrics to compute. For example, to compute GC content and GC skew on 8 threads:

dnaglider -window 1000 -threads 8 -fields "GC,GCSKEW" -fasta ./mygenome.fasta -out gc_stats.tsv

By default, the program reads from stdin and writes to stdout instead of working with input / output files, so it can be used in a pipeline:

some command genome.fa | dnaglider -fields "GC,GCSKEW" | grep "chr10" > gc_stats_chr10.tsv

Note: Streaming genomes through stdin doesn't work when using the KMER field, as computing k-mer divergence requires a 2-pass scan of the genome. When working with k-mers, specify the genome file using -fasta instead.

Usage of dnaglider:
  -fasta string
        Input genome. '-' reads from stdin. (default "-")
  -fields string
        Statistics to report in output fields. Multiple comma-separated values can be provided.
        Valid fields are: 
                GC: GC content (0 to 1)
                GCSKEW: G/C skew (-1 to 1)
                ATSKEW: A/T skew (-1 to 1)
                ENTRO: Information entropy of the sequence (0 to 1)
                KMER: K-mer divergence from the reference (euclidean distance)
         (default "GC")
  -kmers string
        Report k-mer divergence from the genome for the following k-mer lengths. Multiple comma separated values can be provided. This only has an effect if KMER is specified in -fields. (default "4")
  -out string
        Path to output file. '-' writes to stdout. (default "-")
  -stride int
        Step between windows. (default 100)
  -threads int
        Number of CPU threads (default 1)
  -version
        Version
  -window int
        Size of the sliding window. (default 100)
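
As a rough illustration of how -window and -stride interact with the GC and GCSKEW fields, the Go sketch below slides a window along a sequence and computes both statistics. It is not taken from dnaglider: it assumes GC content is (G+C)/(A+C+G+T) and GC skew is (G-C)/(G+C), which match the ranges listed above, and it assumes the last window is truncated at the end of the sequence. The names windowStats and slideGC are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// windowStats holds the statistics for one window (hypothetical type).
type windowStats struct {
	Start, End int     // 1-based, inclusive coordinates
	GC         float64 // (G+C) / (A+C+G+T), from 0 to 1
	GCSkew     float64 // (G-C) / (G+C), from -1 to 1
}

// slideGC computes GC content and GC skew in windows of `window` bases,
// advancing by `stride` bases. The final window is truncated at the end
// of the sequence (an assumption, not necessarily dnaglider's behavior).
func slideGC(seq string, window, stride int) []windowStats {
	if window <= 0 || stride <= 0 {
		return nil
	}
	seq = strings.ToUpper(seq)
	var out []windowStats
	for start := 0; start < len(seq); start += stride {
		end := start + window
		if end > len(seq) {
			end = len(seq)
		}
		var a, c, g, t float64
		for i := start; i < end; i++ {
			switch seq[i] {
			case 'A':
				a++
			case 'C':
				c++
			case 'G':
				g++
			case 'T':
				t++
			}
		}
		ws := windowStats{Start: start + 1, End: end}
		if n := a + c + g + t; n > 0 {
			ws.GC = (g + c) / n
		}
		if g+c > 0 {
			ws.GCSkew = (g - c) / (g + c)
		}
		out = append(out, ws)
		if end == len(seq) {
			break // the sequence end has been reached
		}
	}
	return out
}

func main() {
	for _, w := range slideGC("GGGCATATATCCCG", 4, 2) {
		fmt.Printf("%d\t%d\t%.2f\t%.2f\n", w.Start, w.End, w.GC, w.GCSkew)
	}
}
```

With a stride smaller than the window, consecutive windows overlap; with the defaults shown above (window 100, stride 100) they tile the sequence without overlap.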

Output:

The output files are tab-separated text files with one row per window. The first 3 columns indicate 1-based genomic coordinates and the following columns contain the statistics computed on each window.