Key Parameters
The node penalty threshold (--penalty-th) controls the sensitivity and specificity of the reported signatures.
Higher threshold values generally allow Seqwin to report longer and/or more signatures, but may reduce sensitivity, specificity, or both.
If --penalty-th is not specified, Seqwin estimates it automatically from k-mer sketches. By default, it uses MinHash sketches computed with Mash. If --no-mash is provided, Seqwin uses minimizer sketches instead. This is faster, but the estimate may be more biased.
You can tune the automatically estimated threshold with --stringency (-s), which ranges from 0 to 10 (default: 5). Higher stringency results in a lower estimated threshold.
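To make the term "MinHash sketch" concrete, here is a minimal sketch of the general idea in Python. It is not Seqwin's or Mash's implementation: the function names, the BLAKE2 hash, and the toy sequences are assumptions made for illustration, and the code does not show how Seqwin turns sketches into a penalty threshold.

```python
import hashlib

def kmer_set(seq, k):
    """Collect the distinct k-mers of a sequence (no reverse complements)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(kmers, size):
    """Bottom-`size` sketch: the `size` smallest hash values of the k-mer set."""
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers
    )
    return hashes[:size]

def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-`size` sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

# Toy sequences that differ only in the last base (illustrative data)
a = kmer_set("ACGTACGTTGCAAGCTTAGGCTAACG", 5)
b = kmer_set("ACGTACGTTGCAAGCTTAGGCTAACC", 5)
print(jaccard_estimate(minhash_sketch(a, 8), minhash_sketch(b, 8), 8))
```

A sketch keeps only a small, fixed number of hash values per genome, which is why sketch-based estimates are cheap compared with comparing full k-mer sets.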
By default, Seqwin evaluates the reported signatures with BLAST:
- against target genomes to assess sensitivity (conservation)
- against non-target genomes to assess specificity (divergence and f_neg_hits)
These metrics are written to signatures.csv, and signatures are sorted in descending order by conservation + divergence.
Evaluation can be disabled with --no-blast to reduce runtime. In that case, the reported signatures are still expected to be informative, but no BLAST-based validation is performed, and BLAST-derived columns in signatures.csv will be empty.
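The ranking described above can be reproduced on a loaded table with a few lines of Python. This is a sketch under assumptions: only the conservation and divergence column names come from this README, while the id column, the toy values, and the helper name rank_signatures are hypothetical. Rows with empty BLAST-derived columns (as after a --no-blast run) are kept at the end in their original order.

```python
import csv
import io

def rank_signatures(rows):
    """Sort rows by conservation + divergence, highest first.

    Rows whose BLAST-derived columns are missing or empty cannot be
    scored and are appended afterwards in their original order.
    """
    def score(row):
        try:
            return float(row["conservation"]) + float(row["divergence"])
        except (KeyError, ValueError, TypeError):
            return None

    scored = [(score(r), r) for r in rows]
    ranked = sorted(
        (pair for pair in scored if pair[0] is not None),
        key=lambda pair: pair[0],
        reverse=True,
    )
    unscored = [r for s, r in scored if s is None]
    return [r for _, r in ranked] + unscored

# Toy stand-in for signatures.csv (made-up ids and values)
example = io.StringIO(
    "id,conservation,divergence\n"
    "sig_a,0.90,0.80\n"
    "sig_b,1.00,0.95\n"
    "sig_c,,\n"
)
rows = rank_signatures(csv.DictReader(example))
print([r["id"] for r in rows])  # ['sig_b', 'sig_a', 'sig_c']
```

To use it on a real run, replace the io.StringIO object with an open file handle for signatures.csv.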
--kmerlen, -k (default: 21)
Length of k-mers used in the sketch.
Shorter k-mers may help for genomes with higher sequence variation. For example, a k-mer length of 17 may work better for some viral genomes.
--windowsize, -w (default: 200)
Window size used for minimizer selection.
Smaller window sizes generate more minimizers and increase resolution, which may help recover shorter signatures. The tradeoff is increased runtime and memory usage.
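The effect of the window size can be seen in a small, generic minimizer sketch. It is not Seqwin's code and rests on two assumptions: the window is counted in consecutive k-mers, and k-mers are ordered lexicographically (real tools typically order them by a hash value). The function name and the toy sequence are illustrative.

```python
def minimizers(seq, k, w):
    """Return the set of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (the leftmost one on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return picked

# Toy sequence (illustrative data)
seq = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGATTACGCATGCAAGT"
small_w = minimizers(seq, k=5, w=4)
large_w = minimizers(seq, k=5, w=16)
print(len(small_w), len(large_w))
```

With this selection rule, every minimizer picked with the larger window is also picked with the smaller one, so shrinking the window can only add sampled positions. That is where both the extra resolution and the extra runtime and memory come from.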
Use --threads (-p) to enable parallel processing across multiple CPU cores (default: 4).
For large genome sets, the fastest configuration is usually to combine --no-mash and --no-blast.
- Mash scales quadratically with the number of genomes. With --no-mash, Seqwin uses minimizer sketches instead to estimate the node penalty threshold.
- BLAST-based evaluation requires building a BLAST database from the input genomes. With --no-blast, this evaluation step is skipped, and BLAST-derived metrics such as conservation and divergence will not be computed in signatures.csv.
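As a small illustration, the speed-oriented options above could be assembled for a scripted run as follows. The executable name seqwin and the helper name are assumptions, and the options that select the input genomes are deliberately left out because they are not covered in this section.

```python
import shlex

def fast_seqwin_args(threads=4):
    """Assemble the speed-oriented options described above.

    Assumption: the executable is called "seqwin". Input-selection
    options are omitted and must be added before running the command.
    """
    return [
        "seqwin",
        "--no-mash",   # estimate the threshold from minimizer sketches
        "--no-blast",  # skip BLAST-based evaluation
        "--threads", str(threads),
    ]

print(shlex.join(fast_seqwin_args(threads=8)))
# seqwin --no-mash --no-blast --threads 8
```

Passing the arguments as a list (for example to subprocess.run) avoids shell quoting issues once file paths are added.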