Cyanobacterial Genome Assembly

Scripts and workflows for the assembly, polishing, quality assessment, and visualization of complete cyanobacterial genomes.

Overview

This repository contains step-by-step protocols and scripts for generating high-quality cyanobacterial genome assemblies from sequencing data. The workflow covers hybrid assembly, contig circularization, quality evaluation, and multiple visualization approaches commonly used in comparative and structural genomics.

Workflow Structure

Each chapter corresponds to one logical step in the genome assembly and evaluation pipeline. Chapters are designed to be followed sequentially, but individual steps can also be used independently.

Chapters

01. Genome Assembly

Hybrid genome assembly

02. Contig Processing

Contig circularization

03. Assembly Quality Assessment

What the .qv file format means

The Merqury QV output file has five columns:

<label>   <err_kmers>   <total_kmers>   <QV>   <error_rate>

Column descriptions

Column	Meaning
label	Assembly name (or combined assemblies)
err_kmers	Number of assembly k-mers not found in the reads
total_kmers	Total number of assembly k-mers
QV	Phred-scaled quality value
error_rate	Estimated base error rate

RESULTS FOR BACTERIAL GENOME

<label>                         <err_kmers>   <total_kmers>   <QV>     <error_rate>
BacterialGenome_Bactopia        0             4536393        +inf    0
PlasmidGenome_Bactopia          0             4018           +inf    0
both                            0             4540411        +inf    0

Notes:

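A +inf QV means that every assembly k-mer was also found in the reads (err_kmers = 0), so the Phred-scaled value has no finite upper bound.
An error_rate of 0 means that this k-mer-based metric detected no base errors; errors that do not produce an unsupported k-mer would not be counted.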
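
The relationship between the columns can be checked with a short calculation. The sketch below uses the formula published for Merqury, `error_rate = 1 - (1 - err_kmers/total_kmers)^(1/k)` and `QV = -10 * log10(error_rate)`. The k-mer size `k=21` and the function name `merqury_qv` are assumptions for illustration, because the k used for these results is not stated here.

```python
import math

def merqury_qv(err_kmers: int, total_kmers: int, k: int = 21) -> tuple[float, float]:
    """Return (QV, error_rate) from k-mer counts; k=21 is an assumed k-mer size."""
    # Fraction of assembly k-mers that are supported by the reads
    shared_fraction = 1.0 - err_kmers / total_kmers
    # A k-mer is correct only if all k of its bases are correct
    error_rate = 1.0 - shared_fraction ** (1.0 / k)
    # Zero errors cannot be expressed on a finite Phred scale, hence +inf
    qv = math.inf if error_rate == 0 else -10.0 * math.log10(error_rate)
    return qv, error_rate

# Counts from the "both" row of the results table
print(merqury_qv(0, 4540411))
# Hypothetical non-zero case for comparison (not a result from this project)
print(merqury_qv(10, 4540411))
```

With zero error k-mers the function returns an infinite QV and an error rate of 0, which matches the three rows reported above.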

TO DO: Run both quality checks (QUAST, Merqury) for the plasmid genome as well

04. k-mer Spectrum Analysis

Visualization of k-mer spectra via Jellyfish

TO DO: Extend the x-axis in this figure.

Visualization of k-mer spectra via Merqury

TO DO: Create the same figure based on the Nanopore data; maybe create a facet plot for both.

05. Whole-Genome Alignment

Visualization of MUMmer4 results as dotplots

Inference of inversions

Limnothrix_sp_BL_A_16_CP166615	82994	113322	30329
Limnothrix_sp_BL_A_16_CP166615	592628	736045	143418
Limnothrix_sp_BL_A_16_CP166615	1756173	1817785	61613
Limnothrix_sp_BL_A_16_CP166615	2674925	2733309	58385
Limnothrix_sp_BL_A_16_CP166615	3085082	3131342	46261
Limnothrix_sp_BL_A_16_CP166615	4038055	4191007	152953
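
The table above has no header. The fourth column appears to be the interval length under 1-based, inclusive coordinates (end - start + 1). The sketch below checks this reading for every row; the column labels (`seq_id`, `start`, `end`, `length`) and the helper name are assumptions, not names taken from the workflow.

```python
# Rows copied from the table above; the column meaning is an assumption
inversions = [
    ("Limnothrix_sp_BL_A_16_CP166615", 82994, 113322, 30329),
    ("Limnothrix_sp_BL_A_16_CP166615", 592628, 736045, 143418),
    ("Limnothrix_sp_BL_A_16_CP166615", 1756173, 1817785, 61613),
    ("Limnothrix_sp_BL_A_16_CP166615", 2674925, 2733309, 58385),
    ("Limnothrix_sp_BL_A_16_CP166615", 3085082, 3131342, 46261),
    ("Limnothrix_sp_BL_A_16_CP166615", 4038055, 4191007, 152953),
]

def inclusive_length(start: int, end: int) -> int:
    """Length of a 1-based interval that includes both end points."""
    return end - start + 1

# Confirm that the reported length matches the coordinates in every row
for seq_id, start, end, length in inversions:
    assert inclusive_length(start, end) == length, (seq_id, start, end)

# Total number of bases covered by the listed intervals
total_inverted = sum(length for _, _, _, length in inversions)
print(f"{len(inversions)} intervals, {total_inverted} bp in total")
```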

06. Coverage Visualization

Visualization of sequencing coverage via Circleator

TO DO: Do coverage visualization also for plasmid genome

07. Synteny and Structural Analysis

Show inversions within the assembly using Circos
Show synteny and collinearity between Limnothrix B-16 and the assembly using Circos

08. Evaluation of genome annotations

Evaluate if the reading frames of the genes of the genome are intact

# Bacterial genome
python Step1__PYSCRIPT_Evaluate_reading_frames_of_genes.py -i Limnothrix_sp_HT2024_Bactopia.gb -o Limnothrix_sp_HT2024_Bactopia_ANNOTATION-INFO
# Bacterial plasmid
python Step1__PYSCRIPT_Evaluate_reading_frames_of_genes.py -i Limnothrix_sp_HT2024_plasmid.gb -o Limnothrix_sp_HT2024_plasmid_ANNOTATION-INFO
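
What counts as "intact" is defined by the script itself. As a rough illustration, a check of this kind might test that a coding sequence has a length divisible by three, begins with a start codon, ends with a stop codon, and contains no internal stop. The sketch below is not the code of `Step1__PYSCRIPT_Evaluate_reading_frames_of_genes.py`; the function name and the codon sets (taken from bacterial translation table 11) are assumptions.

```python
# Assumed codon sets for bacterial translation table 11
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # the most common start codons only

def reading_frame_problems(cds: str) -> list[str]:
    """Return a list of problems found in one coding sequence (empty if none)."""
    seq = cds.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("no start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems

print(reading_frame_problems("ATGAAATAA"))     # toy CDS with an intact frame
print(reading_frame_problems("ATGTAAAAATAA"))  # toy CDS with a premature stop
```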

Compare the gene sets of two input genomes by gene name and start-position proximity

# Bacterial genome
python Step2__PYSCRIPT_Compare_genes_by_name_and_position.py Limnothrix_sp_HT2024_Bactopia.gb Limnothrix_sp_HT2024_bacass.gb --max-start-diff 500
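
The matching idea can be illustrated as follows: two genes are treated as the same if they share a name and their start positions differ by no more than the threshold passed as `--max-start-diff`. This is a simplified sketch and not the logic of `Step2__PYSCRIPT_Compare_genes_by_name_and_position.py`; the function name, the input dictionaries, and the toy coordinates are made up for illustration.

```python
def match_genes(genes_a: dict, genes_b: dict, max_start_diff: int = 500):
    """Pair genes by name and start proximity.

    genes_a and genes_b map a gene name to a list of start positions.
    Returns (matched, only_in_a).
    """
    matched, only_in_a = [], []
    for name, starts_a in genes_a.items():
        # Copy so that each start in genome B is used at most once
        remaining = list(genes_b.get(name, []))
        for start_a in starts_a:
            # Closest still-unused start position with the same gene name
            best = min(remaining, key=lambda s: abs(s - start_a), default=None)
            if best is not None and abs(best - start_a) <= max_start_diff:
                matched.append((name, start_a, best))
                remaining.remove(best)
            else:
                only_in_a.append((name, start_a))
    return matched, only_in_a

# Toy input: gene names with hypothetical start coordinates
genome_a = {"psbA": [1000, 250000], "rbcL": [5000]}
genome_b = {"psbA": [1200], "rbcL": [9000]}
matched, only_in_a = match_genes(genome_a, genome_b, max_start_diff=500)
print(matched)
print(only_in_a)
```

Matching on the name alone would merge multi-copy genes; the position threshold keeps copies at distant loci apart.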

Standardize the annotations of a bacterial genome

This script ensures that every CDS and every gene annotation contains at least a gene tag and a product tag. The gene tag holds the four-letter gene abbreviation. The full behavior of the script is summarized in the Behavior Table and Priority Order sections that follow the commands.

# Bacterial genome
python Step3__PYSCRIPT_Standardize_annotations_of_bacterial_genome.py input.gb output.gb
grep -v '/gene="unknown_gene"' output.gb | grep -v "/locus_tag=" > output_final.gb
# Bacterial plasmid
python Step3__PYSCRIPT_Standardize_annotations_of_bacterial_genome.py Limnothrix_sp_HT2024_plasmid.gb Limnothrix_sp_HT2024_plasmid_TMP.gb
grep -v '/gene="unknown_gene"' Limnothrix_sp_HT2024_plasmid_TMP.gb | grep -v "/locus_tag=" > Limnothrix_sp_HT2024_plasmid_FINAL.gb

Behavior Table

Situation	Action	Log style
CDS already has valid `gene` and `product`	Copy missing values to the paired `gene` feature; CDS remains authoritative	White if nothing changes, yellow if synchronization changes something
Only the `gene` feature has valid `gene` and `product`	Copy missing values to the paired `CDS` feature	Yellow if this changes the CDS, otherwise white
`gene` and `CDS` disagree	Prefer CDS values and record the conflict in the report	Yellow if resolved successfully, red if still unresolved
Valid `product` exists but `gene` is missing	Try local mapping first; otherwise query UniProt cyanobacteria by product to infer the most common gene abbreviation	Yellow on successful resolution, red on failure
Valid `gene` exists but `product` is missing	Try local mapping first; otherwise query UniProt cyanobacteria by gene to infer the most common product description	Yellow on successful resolution, red on failure
`standard_name` is present	Use it as supporting information during local resolution and as the displayed name in logs	Shown as `standard_name` instead of `locus_tag`
`standard_name` is `hypothetical protein CDS` or `hypothetical protein gene`	Do not query UniProt and do not log the annotation; still standardize `/product` to `hypothetical protein`	No log output
Product is already `hypothetical protein` for one of those hypothetical-standard-name annotations	Leave it as `hypothetical protein` or rewrite it to the same standard form	No log output
Nothing reliable can be inferred	Fall back to unresolved values such as `unknown_gene` or remaining missing information, and record it in the report	Red
Annotation does not change	Keep existing values as they are	White

Priority Order

Priority	Rule
1	Prefer existing CDS qualifiers over gene qualifiers
2	Copy missing values across paired `gene` and `CDS` features
3	Apply the local mapping table
4	Query UniProt cyanobacteria by product if `gene` is missing
5	Query UniProt cyanobacteria by gene if `product` is missing
6	Skip logging and UniProt lookup for annotations whose `standard_name` is `hypothetical protein CDS` or `hypothetical protein gene`, but still standardize their product to `hypothetical protein`
7	Record conflicts and unresolved cases in the report
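
The first rules of the priority order can be sketched for a single paired `gene`/`CDS` feature. This is an illustration under assumptions, not the code of `Step3__PYSCRIPT_Standardize_annotations_of_bacterial_genome.py`: the function `resolve_gene`, the single entry in `LOCAL_GENE_BY_PRODUCT`, and the `lookup` callback (standing in for the UniProt query) are hypothetical, and rule 6 on hypothetical proteins is left out.

```python
from typing import Callable, Optional

# Hypothetical local mapping table: product description -> gene abbreviation
LOCAL_GENE_BY_PRODUCT = {"photosystem II protein D1": "psbA"}

def resolve_gene(
    cds: dict,
    gene: dict,
    lookup: Optional[Callable[[str], Optional[str]]] = None,
) -> tuple[str, str]:
    """Return (gene name, log colour) for one paired gene/CDS feature."""
    cds_gene, gene_gene = cds.get("gene"), gene.get("gene")
    if cds_gene:
        # Rule 1: the CDS qualifier wins; yellow if the gene feature must change
        return cds_gene, "white" if gene_gene == cds_gene else "yellow"
    if gene_gene:
        # Rule 2: copy the missing value from the gene feature to the CDS
        return gene_gene, "yellow"
    product = cds.get("product") or gene.get("product")
    if product in LOCAL_GENE_BY_PRODUCT:
        # Rule 3: local mapping table
        return LOCAL_GENE_BY_PRODUCT[product], "yellow"
    if product and lookup is not None:
        # Rule 4: external lookup by product (placeholder for the UniProt query)
        found = lookup(product)
        if found:
            return found, "yellow"
    # Rule 7: nothing reliable could be inferred
    return "unknown_gene", "red"

print(resolve_gene({"gene": "psbA"}, {"gene": "psbA"}))
print(resolve_gene({"product": "photosystem II protein D1"}, {}))
print(resolve_gene({}, {}))
```

Passing the lookup as a callback keeps the sketch runnable offline; a real query would need network access and rate limiting.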

Run PAGP Foo bar baz

Foo bar baz

09. Gene-Level Visualization

TO DO: Do GenoVi visualization also for plasmid genome

10. Metagenomic analysis of 16S rRNA amplicon data using QIIME2

Metric	HorseThief_lake_water_2	HorseThief_in_cultivation
Raw reads	54,376	91,377
ASV method	DADA2	DADA2
Number of ASVs	201	272
Number of genera	53	67
Number of families	40	46
Shannon diversity	6.570	6.341
Pielou evenness	0.859	0.784
Simpson diversity	0.983	0.972
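
The diversity values in the table are mutually consistent if Shannon diversity is expressed in bits (log base 2) and Pielou evenness is Shannon diversity divided by log2 of the number of ASVs. That both conventions apply here is an inference from the numbers, not something stated in the workflow; the function name is illustrative.

```python
import math

def pielou_evenness(shannon_bits: float, n_features: int) -> float:
    """Pielou evenness J = H / log2(S), assuming H was computed with log base 2."""
    return shannon_bits / math.log2(n_features)

# (Shannon diversity, number of ASVs) taken from the table above
samples = {
    "HorseThief_lake_water_2": (6.570, 201),
    "HorseThief_in_cultivation": (6.341, 272),
}

for name, (shannon, n_asvs) in samples.items():
    print(name, round(pielou_evenness(shannon, n_asvs), 3))
```

Rounded to three decimals, both results agree with the Pielou evenness row of the table (0.859 and 0.784).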

Notes

Image files are stored in the corresponding data_STEPXX__* directories.
All workflows assume Linux environments and standard bioinformatics toolchains.
Individual steps can be adapted to other bacterial genomes with minimal modification.

