Description of Outputs

mxwang66 edited this page Apr 20, 2026 · 2 revisions

Output files

Seqwin creates the following files/directories inside the directory specified by --title or -o (default: seqwin-out/):

| Name | Description |
| --- | --- |
| signatures.fasta | Signature sequences |
| signatures.csv | Tabulated metrics for each signature |
| assemblies.csv | Mapping of internal genome IDs to file paths |
| blastdb/ | BLAST database built from all input genomes |
| assemblies/ | Genomes downloaded from NCBI |
| results.seqwin | Serialized run snapshot (Python pickle) |
| config.json | Full run configuration |
| seqwin.log | Execution log |

File contents

Example output files can be found here.

signatures.fasta

Each FASTA record represents one signature sequence.

The FASTA header has the following format:

>{genome_ID}-{sequence_ID}-{start}:{stop}

where:

genome_ID: A unique integer assigned by Seqwin to each input genome. This ID can be matched to the corresponding genome in assemblies.csv.

sequence_ID: The FASTA record ID from the source genome. It is expected to be unique within that genome.

start and stop: Start and end coordinates of the signature in the source sequence, using Python-style indexing: zero-based, with start inclusive and stop exclusive.
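As a rough illustration of the header format and coordinate convention above, the following sketch parses a header and slices the signature out of its source sequence. The names `SignatureHeader`, `parse_signature_header`, and `extract_signature` are hypothetical helpers, not part of Seqwin. The parser assumes the genome ID never contains a hyphen (it is an integer) and that the coordinates follow the last hyphen, so a sequence ID that itself contains hyphens is still recovered.

```python
from typing import NamedTuple


class SignatureHeader(NamedTuple):
    """Fields of a header of the form {genome_ID}-{sequence_ID}-{start}:{stop}."""
    genome_id: int
    sequence_id: str
    start: int
    stop: int


def parse_signature_header(header: str) -> SignatureHeader:
    """Parse a signature header; a leading '>' is optional."""
    text = header.lstrip(">").strip()
    # genome_ID is an integer, so the first hyphen always ends it.
    genome_id, rest = text.split("-", 1)
    # Coordinates follow the last hyphen; the sequence ID is everything in between.
    sequence_id, coords = rest.rsplit("-", 1)
    start, stop = coords.split(":")
    return SignatureHeader(int(genome_id), sequence_id, int(start), int(stop))


def extract_signature(source_sequence: str, header: SignatureHeader) -> str:
    """Zero-based, start inclusive, stop exclusive maps directly onto a Python slice."""
    return source_sequence[header.start:header.stop]
```

With this convention the signature length is simply `stop - start`, which should agree with the `length` column in signatures.csv.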


signatures.csv

Each row represents one signature. The columns are described below.

Some columns will be empty if BLAST-based evaluation is disabled with --no-blast.

If BLAST-based evaluation is enabled, signatures are sorted in descending order by conservation + divergence.

fasta_header: The FASTA header of the signature in signatures.fasta.

length: Length of the signature sequence.

conservation: Average fraction of identical bases between the signature and the target genomes (0-1).

f_tar_hits: Fraction of target genomes with a BLAST hit to the signature (0-1).

divergence: Average fraction of mismatches and gaps between the signature and the non-target genomes (0-1).

Mismatches and gaps are only counted for genomes with a BLAST hit. Genomes with no BLAST hit do NOT contribute to divergence.

See "How to evaluate signature specificity" for more details.
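The effect of excluding genomes without a BLAST hit can be sketched with a small example. This is not Seqwin's implementation: the helper names are hypothetical, the per-genome values are made up for illustration, and returning `None` when no genome has a hit is an assumption.

```python
from typing import Optional, Sequence


def mean_over_hits(per_genome: Sequence[Optional[float]]) -> Optional[float]:
    """Average per-genome values, skipping genomes with no BLAST hit (None)."""
    hits = [value for value in per_genome if value is not None]
    if not hits:
        return None  # assumption: with no hits there is nothing to average
    return sum(hits) / len(hits)


def fraction_with_hits(per_genome: Sequence[Optional[float]]) -> float:
    """Fraction of genomes that have a BLAST hit (the analogue of f_neg_hits)."""
    if not per_genome:
        return 0.0
    return sum(value is not None for value in per_genome) / len(per_genome)


# Illustrative per-genome divergence values for four non-target genomes.
# None marks a genome with no BLAST hit to the signature.
one_hit = [0.5, None, None, None]
two_hits = [0.25, 0.75, None, None]

print(mean_over_hits(one_hit), fraction_with_hits(one_hit))
print(mean_over_hits(two_hits), fraction_with_hits(two_hits))
```

Both lists average to the same divergence even though the signature hits a different share of the non-target genomes, which suggests reading divergence together with f_neg_hits.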

f_neg_hits: Fraction of non-target genomes with a BLAST hit to the signature (0-1).

avg_repeats_tar: Average number of repeats of the signature in target genomes.

Repeats may be partial, meaning only part of the signature is repeated.

avg_pident_tar: Average percent identity across all repeats in target genomes, expressed as a fraction (0-1).

Because repeats may be partial, this value can be substantially lower than 1.

avg_repeats_neg: Average number of repeats of the signature in non-target genomes.

Repeats may be partial, meaning only part of the signature is repeated.

avg_pident_neg: Average percent identity across all repeats in non-target genomes, expressed as a fraction (0-1).

Because repeats may be partial, this value can be substantially lower than 1.

rep_ratio: Fraction of target genomes that have the same k-mer ordering as the representative sequence.

n_nodes: Number of nodes (k-mers) in the representative sequence.
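As one possible way to work with these columns, the sketch below reads signatures.csv with Python's standard `csv` module and keeps signatures that hit every target genome and no non-target genome. The function names and default thresholds are hypothetical choices for illustration, not Seqwin recommendations. Empty cells (as produced with --no-blast) are assumed to be empty strings and are treated as missing.

```python
import csv
from typing import Iterable, Optional


def to_float(value: Optional[str]) -> Optional[float]:
    """Convert a CSV cell to float, treating empty cells as missing."""
    if value is None or value.strip() == "":
        return None
    return float(value)


def read_signature_rows(lines: Iterable[str]) -> list[dict]:
    """Read signatures.csv rows from an open file (or any iterable of lines)."""
    return list(csv.DictReader(lines))


def select_specific(rows: list[dict],
                    min_f_tar_hits: float = 1.0,
                    max_f_neg_hits: float = 0.0) -> list[dict]:
    """Keep rows meeting the (hypothetical) hit-fraction thresholds."""
    kept = []
    for row in rows:
        f_tar = to_float(row.get("f_tar_hits"))
        f_neg = to_float(row.get("f_neg_hits"))
        if f_tar is None or f_neg is None:
            continue  # BLAST metrics missing, e.g. run with --no-blast
        if f_tar >= min_f_tar_hits and f_neg <= max_f_neg_hits:
            kept.append(row)
    return kept
```

A typical call would open the file with `open("seqwin-out/signatures.csv", newline="")` and pass the handle to `read_signature_rows`. Input order is preserved, so the selected rows keep the sort order of the file.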


assemblies.csv

Each row represents one input genome. The first column contains the internal genome IDs used in signatures.fasta. The remaining columns are described below.

path: Absolute path to the genome file.

is_target: True if the genome is a target genome, and False if it is a non-target genome.
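The mapping from internal genome IDs to files can be loaded along the following lines. This is a sketch under assumptions: `load_assemblies` is a hypothetical helper, the ID column is read by position because its header name is not specified here, and `is_target` is assumed to be stored as the literal text `True` or `False`.

```python
import csv
from typing import Iterable


def load_assemblies(lines: Iterable[str]) -> dict[int, tuple[str, bool]]:
    """Map internal genome ID -> (path, is_target) from assemblies.csv lines."""
    reader = csv.reader(lines)
    header = next(reader)
    path_col = header.index("path")
    target_col = header.index("is_target")
    assemblies = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        genome_id = int(row[0])  # first column holds the internal genome ID
        is_target = row[target_col].strip() == "True"
        assemblies[genome_id] = (row[path_col], is_target)
    return assemblies
```

Given the genome ID taken from a signature's FASTA header, `load_assemblies(handle)[genome_id]` would then return the path of the genome the signature was taken from and whether that genome is a target.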
