Description of Outputs

Output files

Seqwin creates the following files/directories inside the directory specified by `--title` or `-o` (default: `seqwin-out/`):

| Name | Description |
| --- | --- |
| `signatures.fasta` | Signature sequences |
| `signatures.csv` | Tabulated metrics for each signature |
| `assemblies.csv` | Mapping of internal genome IDs to file paths |
| `blastdb/` | BLAST database built from all input genomes |
| `assemblies/` | Genomes downloaded from NCBI |
| `results.seqwin` | Serialized run snapshot (Python pickle) |
| `config.json` | Full run configuration |
| `seqwin.log` | Execution log |
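
As a rough sketch of how the two machine-readable run files might be opened from Python: `config.json` is read with the standard `json` module and `results.seqwin` with `pickle`. The helper name `load_run` and the placeholder key `example_key` are made up for illustration, and it is an assumption that the snapshot can be unpickled this way (it will most likely need Seqwin installed in the same environment).

```python
import json
import pickle
import tempfile
from pathlib import Path


def load_run(out_dir):
    """Load config.json and, if present, results.seqwin from an output directory.

    Hypothetical helper, not part of Seqwin.
    """
    out_dir = Path(out_dir)
    with open(out_dir / "config.json") as f:
        config = json.load(f)

    snapshot = None
    snapshot_path = out_dir / "results.seqwin"
    if snapshot_path.exists():
        # Only unpickle files you produced yourself: pickle can execute code.
        with open(snapshot_path, "rb") as f:
            snapshot = pickle.load(f)
    return config, snapshot


# Demo with a throwaway directory standing in for seqwin-out/.
# "example_key" is a placeholder, not a real Seqwin configuration field.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "config.json").write_text(json.dumps({"example_key": 1}))
    demo_config, demo_snapshot = load_run(tmp)

print(demo_config, demo_snapshot)
```

For a real run, the call would be `load_run("seqwin-out")` or whichever output directory was chosen.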

File contents

Example output files can be found here.

`signatures.fasta`

Each FASTA record represents one signature sequence.

The FASTA header has the following format:

`>{genome_ID}-{sequence_ID}-{start}:{stop}`

where:

genome_ID: A unique integer assigned by Seqwin to each input genome. This ID can be matched to the corresponding genome in assemblies.csv.

sequence_ID: The FASTA record ID from the source genome. It is expected to be unique within that genome.

start and stop: Start and end coordinates of the signature in the source sequence, using Python-style indexing: zero-based, with start inclusive and stop exclusive.
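
The header format above can be split back into its fields with a few lines of Python. This is a minimal sketch, not Seqwin code: the names `SignatureHeader` and `parse_header` are hypothetical, and the example header is made up. Because a source `sequence_ID` may itself contain hyphens, the sketch splits on the first `-` for the genome ID and on the last `-` for the coordinates.

```python
from typing import NamedTuple


class SignatureHeader(NamedTuple):
    genome_id: int
    sequence_id: str
    start: int  # zero-based, inclusive
    stop: int   # zero-based, exclusive


def parse_header(header: str) -> SignatureHeader:
    """Parse '{genome_ID}-{sequence_ID}-{start}:{stop}' (leading '>' optional)."""
    text = header.strip().lstrip(">")
    # genome_ID is an integer, so the first '-' ends it; sequence_ID may
    # contain '-', so the coordinates are taken from after the last '-'.
    genome_id, rest = text.split("-", 1)
    sequence_id, coords = rest.rsplit("-", 1)
    start, stop = coords.split(":")
    return SignatureHeader(int(genome_id), sequence_id, int(start), int(stop))


h = parse_header(">3-contig-7-100:350")  # made-up header
print(h)
print(h.stop - h.start)  # signature length under half-open coordinates
```

With these coordinates, a source sequence held in a Python string `seq` yields the signature as `seq[h.start:h.stop]`, with no off-by-one adjustment.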

`signatures.csv`

Each row represents one signature. The columns are described below.

Some columns will be empty if BLAST-based evaluation is disabled with --no-blast.

If BLAST-based evaluation is enabled, signatures are sorted in descending order by conservation + divergence.

fasta_header: The FASTA header of the signature in signatures.fasta.

length: Length of the signature sequence.

conservation: Average fraction of identical bases between the signature and the target genomes (0-1).

f_tar_hits: Fraction of target genomes with a BLAST hit to the signature (0-1).

divergence: Average fraction of mismatches and gaps between the signature and the non-target genomes (0-1).

Mismatches and gaps are only counted for genomes with a BLAST hit. Genomes with no BLAST hit do NOT contribute to divergence.

See "How to evaluate signature specificity" for more details.

f_neg_hits: Fraction of non-target genomes with a BLAST hit to the signature (0-1).

avg_repeats_tar: Average number of repeats of the signature in target genomes.

Repeats may be partial, meaning only part of the signature is repeated.

avg_pident_tar: Average percent identity across all repeats in target genomes, expressed as a fraction (0-1).

Because repeats may be partial, this value can be substantially lower than 1.

avg_repeats_neg: Average number of repeats of the signature in non-target genomes.

Repeats may be partial, meaning only part of the signature is repeated.

avg_pident_neg: Average percent identity across all repeats in non-target genomes, expressed as a fraction (0-1).

Because repeats may be partial, this value can be substantially lower than 1.

rep_ratio: Fraction of target genomes that have the same k-mer ordering as the representative sequence.

n_nodes: Number of nodes (k-mers) in the representative sequence.
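
The sort order described above (descending `conservation + divergence`) can be reproduced when post-processing the table, for example after filtering rows. The sketch below uses only the standard library and assumes the file is comma-separated with a header row whose names match the columns listed here. `rank_signatures` is a hypothetical helper and the rows are made up. Rows with empty metrics, as produced with `--no-blast`, are skipped.

```python
import csv
import io


def rank_signatures(csv_file):
    """Return rows sorted by conservation + divergence, highest first."""
    scored = []
    for row in csv.DictReader(csv_file):
        cons, div = row.get("conservation"), row.get("divergence")
        if not cons or not div:
            continue  # metrics are empty when BLAST evaluation was disabled
        scored.append((float(cons) + float(div), row))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored]


# Made-up rows for illustration only.
example = io.StringIO(
    "fasta_header,length,conservation,divergence\n"
    "1-seqA-0:200,200,0.99,0.10\n"
    "1-seqA-500:800,300,0.98,0.35\n"
    "2-seqB-10:110,100,,\n"
)
for row in rank_signatures(example):
    print(row["fasta_header"], row["conservation"], row["divergence"])
```

To run it on real output, pass an open file instead, e.g. `open("seqwin-out/signatures.csv", newline="")`. Since genomes without a BLAST hit do not contribute to `divergence`, it may be worth reading the score alongside `f_neg_hits` rather than on its own.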

`assemblies.csv`

Each row represents one input genome. The first column contains the internal genome IDs used in signatures.fasta. The remaining columns are described below.

path: Absolute path to the genome file.

is_target: True if the genome is a target genome, and False if it is a non-target genome.
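
To trace a signature back to its source genome, the genome ID at the start of a FASTA header can be looked up in this table. The following is a sketch under assumptions: the file is taken to be comma-separated with a header row, the ID column is read by position because its name is not documented here, and `is_target` is taken to be stored as the text `True`/`False`. `load_assemblies` is a hypothetical helper and the paths are made up.

```python
import csv
import io


def load_assemblies(csv_file):
    """Map genome ID -> (path, is_target) from an assemblies.csv-like file."""
    reader = csv.reader(csv_file)
    header = next(reader)
    path_col = header.index("path")
    target_col = header.index("is_target")
    genomes = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # First column holds the internal genome ID (read by position).
        genomes[int(row[0])] = (row[path_col], row[target_col].strip() == "True")
    return genomes


# Made-up table for illustration only.
example = io.StringIO(
    ",path,is_target\n"
    "0,/data/genomes/a.fna,True\n"
    "1,/data/genomes/b.fna,False\n"
)
genomes = load_assemblies(example)

# The genome ID is the part of the signature header before the first '-'.
genome_id = int("1-seqA-500:800".split("-", 1)[0])
print(genomes[genome_id])
```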
