Description of Outputs
Seqwin creates the following files/directories inside the directory specified by --title or -o (default: seqwin-out/):
| Name | Description |
|---|---|
| signatures.fasta | Signature sequences |
| signatures.csv | Tabulated metrics for each signature |
| assemblies.csv | Mapping of internal genome IDs to file paths |
| blastdb/ | BLAST database built from all input genomes |
| assemblies/ | Genomes downloaded from NCBI |
| results.seqwin | Serialized run snapshot (Python pickle) |
| config.json | Full run configuration |
| seqwin.log | Execution log |
Example output files can be found here.
Each FASTA record in signatures.fasta represents one signature sequence.
The FASTA header has the following format:
>{genome_ID}-{sequence_ID}-{start}:{stop}
where:
genome_ID:
A unique integer assigned by Seqwin to each input genome. This ID can be matched to the corresponding genome in assemblies.csv.
sequence_ID:
The FASTA record ID from the source genome. It is expected to be unique within that genome.
start and stop:
Start and end coordinates of the signature in the source sequence, using Python-style indexing: zero-based, with start inclusive and stop exclusive.
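The header format above can be split back into its fields with a few lines of Python. This is a minimal sketch, not part of Seqwin: the names `SignatureHeader` and `parse_signature_header` are hypothetical, the example header is made up, and the code assumes that `sequence_ID` may itself contain hyphens, so it takes `genome_ID` from the left and the coordinates from the right.

```python
from typing import NamedTuple


class SignatureHeader(NamedTuple):
    genome_id: int
    sequence_id: str
    start: int  # zero-based, inclusive
    stop: int   # zero-based, exclusive


def parse_signature_header(header: str) -> SignatureHeader:
    """Parse '{genome_ID}-{sequence_ID}-{start}:{stop}' (leading '>' optional)."""
    text = header.strip().lstrip(">")
    # genome_ID is an integer, so the first hyphen ends it.
    genome_id, rest = text.split("-", 1)
    # The coordinates follow the last hyphen; sequence_ID is whatever lies between.
    sequence_id, coords = rest.rsplit("-", 1)
    start, stop = coords.split(":")
    return SignatureHeader(int(genome_id), sequence_id, int(start), int(stop))


# Hypothetical header, for illustration only.
hit = parse_signature_header(">3-NZ_CP000001.1-1200:1450")
signature_length = hit.stop - hit.start  # half-open interval, so no +1
```

Because the coordinates follow Python slicing conventions, a source record held as a string `seq` yields the signature as `seq[hit.start:hit.stop]`.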
Each row of signatures.csv represents one signature. The columns are described below.
Some columns will be empty if BLAST-based evaluation is disabled with --no-blast.
If BLAST-based evaluation is enabled, signatures are sorted in descending order by the sum of conservation and divergence.
fasta_header:
The FASTA header of the signature in signatures.fasta.
length:
Length of the signature sequence.
conservation:
Average fraction of identical bases between the signature and the target genomes (0-1).
f_tar_hits:
Fraction of target genomes with a BLAST hit to the signature (0-1).
divergence:
Average fraction of mismatches and gaps between the signature and the non-target genomes (0-1).
Mismatches and gaps are only counted for genomes with a BLAST hit. Genomes with no BLAST hit do NOT contribute to divergence.
See "How to evaluate signature specificity" for more details.
f_neg_hits:
Fraction of non-target genomes with a BLAST hit to the signature (0-1).
avg_repeats_tar:
Average number of repeats of the signature in target genomes.
Repeats may be partial, meaning only part of the signature is repeated.
avg_pident_tar:
Average identity across all repeats in target genomes, expressed as a fraction (0-1).
Because repeats may be partial, this value can be substantially lower than 1.
avg_repeats_neg:
Average number of repeats of the signature in non-target genomes.
Repeats may be partial, meaning only part of the signature is repeated.
avg_pident_neg:
Average identity across all repeats in non-target genomes, expressed as a fraction (0-1).
Because repeats may be partial, this value can be substantially lower than 1.
rep_ratio:
Fraction of target genomes that have the same k-mer ordering as the representative sequence.
n_nodes:
Number of nodes (k-mers) in the representative sequence.
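The columns above can be read with Python's standard `csv` module. The sketch below is illustrative only: it assumes the CSV header row uses the column names listed above, the function names `to_float` and `rank_signatures` are hypothetical, and the sample rows are made up. It recomputes the conservation + divergence ordering and sets aside rows whose BLAST metrics are empty, which is a choice of this sketch rather than Seqwin behavior.

```python
import csv
import io


def to_float(value: str):
    """Return a float, or None for an empty cell (e.g. after --no-blast)."""
    return float(value) if value.strip() else None


def rank_signatures(handle):
    """Read signature rows and order them by conservation + divergence."""
    scored = []
    for row in csv.DictReader(handle):
        cons = to_float(row["conservation"])
        div = to_float(row["divergence"])
        if cons is None or div is None:
            continue  # BLAST metrics unavailable for this row
        scored.append((cons + div, row["fasta_header"]))
    # Highest combined score first, matching the documented sort order.
    return sorted(scored, reverse=True)


# Made-up rows for illustration; a real file has more columns.
example = io.StringIO(
    "fasta_header,length,conservation,divergence\n"
    "1-seqA-0:200,200,0.99,0.30\n"
    "2-seqB-50:300,250,0.95,0.10\n"
    "3-seqC-10:110,100,,\n"
)
ranking = rank_signatures(example)
```

For a real run, the same function can be given a file handle opened with `open(path, newline="")`, where `path` points to signatures.csv in the output directory.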
Each row of assemblies.csv represents one input genome. The first column contains the internal genome IDs used in signatures.fasta. The remaining columns are described below.
path:
Absolute path to the genome file.
is_target:
True if the genome is a target genome, and False if it is a non-target genome.
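To look up the genome behind a signature, the genome IDs in assemblies.csv can be loaded into a dictionary. This is a rough sketch under stated assumptions: the function name `load_assemblies` is hypothetical, the first column is read by position because its header name is not given here, `is_target` is assumed to be stored as the literal text `True` or `False`, and the sample paths are invented.

```python
import csv
import io


def load_assemblies(handle):
    """Map genome ID -> (path, is_target) from assemblies.csv-style data."""
    reader = csv.reader(handle)
    header = next(reader)
    path_col = header.index("path")
    target_col = header.index("is_target")
    genomes = {}
    for row in reader:
        genome_id = int(row[0])  # first column holds the internal genome ID
        is_target = row[target_col].strip() == "True"  # assumed text encoding
        genomes[genome_id] = (row[path_col], is_target)
    return genomes


# Made-up rows; "genome_id" is a placeholder name for the first column.
example = io.StringIO(
    "genome_id,path,is_target\n"
    "0,/data/targets/a.fna,True\n"
    "1,/data/non_targets/b.fna,False\n"
)
genomes = load_assemblies(example)
target_ids = [gid for gid, (_, is_target) in genomes.items() if is_target]
```

The integer keys of `genomes` correspond to the `genome_ID` field at the start of each header in signatures.fasta.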