
Key Parameters

mxwang66 edited this page Mar 14, 2026 · 1 revision

Node penalty threshold

The node penalty threshold (--penalty-th) controls the sensitivity and specificity of the reported signatures.

Higher threshold values generally allow Seqwin to report longer and/or more signatures, but may reduce sensitivity, specificity, or both.

If --penalty-th is not specified, Seqwin estimates it automatically from k-mer sketches. By default, it uses MinHash sketches computed with Mash. If --no-mash is provided, Seqwin uses minimizer sketches instead. This is faster, but the estimate may be more biased.
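For readers unfamiliar with MinHash, the sketch below illustrates the general idea behind a Mash-style sketch: keep only the smallest k-mer hashes of each genome and estimate similarity from them. It is a toy illustration, not Seqwin's or Mash's code. The function names, the hash function, and the sketch size are all assumptions made for this example, and the page does not describe how Seqwin turns sketches into a penalty threshold.

```python
import hashlib
import random


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (hash choice is illustrative)."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def bottom_sketch(seq, k, size):
    """Keep the `size` smallest k-mer hashes: a bottom-s MinHash sketch."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from the smallest hashes of the merged sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Two made-up sequences that share their first half.
rng = random.Random(1)
common = "".join(rng.choice("ACGT") for _ in range(1000))
seq_a = common + "".join(rng.choice("ACGT") for _ in range(1000))
seq_b = common + "".join(rng.choice("ACGT") for _ in range(1000))

sk_a = bottom_sketch(seq_a, k=21, size=200)
sk_b = bottom_sketch(seq_b, k=21, size=200)
print(jaccard_estimate(sk_a, sk_b, size=200))
```

Because only a fixed number of hashes is kept per genome, comparing two sketches is cheap regardless of genome length.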

You can tune the automatically estimated threshold with --stringency (-s), which ranges from 0 to 10 (default: 5). Higher stringency results in a lower estimated threshold.

Signature evaluation

By default, Seqwin evaluates the reported signatures with BLAST:

  • against target genomes to assess sensitivity (conservation)
  • against non-target genomes to assess specificity (divergence and f_neg_hits)

These metrics are written to signatures.csv, and signatures are sorted in descending order by conservation + divergence.
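If you want to re-rank or post-filter the results yourself, the ordering described above can be reproduced with a few lines of Python. This is a minimal sketch: it assumes the columns are named exactly conservation and divergence, and it uses a made-up in-memory table (including a hypothetical id column and invented values) in place of a real signatures.csv.

```python
import csv
import io

# Illustrative stand-in for signatures.csv. The "id" column and all values
# are invented for this example; only the metric names come from the page.
sample = io.StringIO(
    "id,conservation,divergence\n"
    "sig_a,0.98,0.40\n"
    "sig_b,0.90,0.75\n"
    "sig_c,0.99,0.10\n"
)


def rank_signatures(handle):
    """Sort rows in descending order by conservation + divergence."""
    rows = list(csv.DictReader(handle))
    return sorted(
        rows,
        key=lambda r: float(r["conservation"]) + float(r["divergence"]),
        reverse=True,
    )


ranked = rank_signatures(sample)
for row in ranked:
    print(row["id"], row["conservation"], row["divergence"])
```

To use it on real output, replace the in-memory table with an open file handle for signatures.csv.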

Evaluation can be disabled with --no-blast to reduce runtime. In that case, the reported signatures are still expected to be informative, but no BLAST-based validation is performed, and BLAST-derived columns in signatures.csv will be empty.

Minimizer sketch parameters

--kmerlen, -k (default: 21)

Length of k-mers used in the sketch.

Smaller k-mers may be helpful for genomes with higher sequence variation. For example, a shorter k-mer length such as 17 may work better for some viral genomes.

--windowsize, -w (default: 200)

Window size used for minimizer selection.

Smaller window sizes generate more minimizers and increase resolution, which may help recover shorter signatures. The tradeoff is increased runtime and memory usage.
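The effect of the window size can be seen with a toy minimizer selector. The sketch below is not Seqwin's implementation: it assumes a window of w consecutive k-mers, picks the lexicographically smallest k-mer in each window (leftmost on ties), and ignores reverse complements and hashing, which real tools typically use. The function name and the random test sequence are made up for illustration.

```python
import random


def minimizers(seq, k, w):
    """Return sorted start positions of the minimizers of seq.

    Each window covers w consecutive k-mers; the smallest k-mer in the
    window is selected, with ties going to the leftmost one.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first smallest item, so ties go to the leftmost k-mer.
        offset = min(range(w), key=window.__getitem__)
        picked.add(start + offset)
    return sorted(picked)


# Deterministic made-up sequence for the comparison.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(2000))

dense = minimizers(seq, k=21, w=50)    # smaller window
sparse = minimizers(seq, k=21, w=200)  # default window size from this page
print(len(dense), len(sparse))
```

Under this selection rule, every minimizer chosen with the larger window is also chosen with the smaller one, so shrinking w can only add sampled positions. That is the source of both the higher resolution and the extra runtime and memory.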

Performance tuning

Use --threads (-p) to set the number of CPU cores used for parallel processing (default: 4).

For large genome sets, the fastest configuration is usually to combine --no-mash and --no-blast.

  • Mash scales quadratically with the number of genomes. With --no-mash, Seqwin uses minimizer sketches instead to estimate the node penalty threshold.
  • BLAST-based evaluation requires building a BLAST database from the input genomes. With --no-blast, this evaluation step is skipped, so BLAST-derived metrics such as conservation and divergence are not computed and the corresponding columns in signatures.csv will be empty.
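To get a feel for what quadratic scaling means in practice, the sketch below counts genome pairs for a few input sizes. It assumes the quadratic cost comes from all-against-all pairwise comparisons, which is an assumption made here and is not stated on this page. The function name is hypothetical.

```python
from math import comb


def pairwise_comparisons(n_genomes):
    """Number of unordered genome pairs in an all-against-all comparison."""
    return comb(n_genomes, 2)


for n in (100, 1_000, 10_000):
    print(n, pairwise_comparisons(n))
```

Under that assumption, a tenfold increase in the number of genomes means roughly a hundredfold increase in comparisons, which is why skipping Mash matters most for large inputs.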
