michaelgruenstaeudl/Cyanobacterial_Genome_Assembly

Cyanobacterial Genome Assembly

Scripts and workflows for the assembly, polishing, quality assessment, and visualization of complete cyanobacterial genomes.


Overview

This repository contains step-by-step protocols and scripts for generating high-quality cyanobacterial genome assemblies from sequencing data. The workflow covers hybrid assembly, contig circularization, quality evaluation, and multiple visualization approaches commonly used in comparative and structural genomics.


Workflow Structure

Each chapter corresponds to one logical step in the genome assembly and evaluation pipeline. Chapters are designed to be followed sequentially, but individual steps can also be used independently.


Chapters

01. Genome Assembly


02. Contig Processing


03. Assembly Quality Assessment

What the .qv file format means

The Merqury QV output file has five columns:

<label>   <err_kmers>   <total_kmers>   <QV>   <error_rate>

Column descriptions

| Column | Meaning |
|---|---|
| label | Assembly name (or combined assemblies) |
| err_kmers | Number of assembly k-mers not found in the reads |
| total_kmers | Total number of assembly k-mers |
| QV | Phred-scaled quality value |
| error_rate | Estimated base error rate |

Results for the bacterial genome

<label>                         <err_kmers>   <total_kmers>   <QV>     <error_rate>
BacterialGenome_Bactopia        0             4536393        +inf    0
PlasmidGenome_Bactopia          0             4018           +inf    0
both                            0             4540411        +inf    0

Notes:

  • A +inf QV indicates zero observed error k-mers relative to the reads.
  • An error_rate of 0 reflects no estimated base errors under this metric.
TO DO: Also run both quality checks (QUAST, Merqury) for the plasmid genome.
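
The relationship between the columns of the .qv file can be sketched in a few lines of Python. The sketch follows the published Merqury definition: the per-base probability of being correct is the k-th root of the fraction of assembly k-mers that are also found in the reads, and QV is the Phred-scaled error rate. The function names and the k-mer size `ASSUMED_K = 17` are illustrative assumptions, not values taken from this repository. With zero error k-mers the result is +inf regardless of k.

```python
import math

ASSUMED_K = 17  # hypothetical k-mer size; use the k chosen for your Merqury run


def merqury_qv(err_kmers: int, total_kmers: int, k: int) -> tuple[float, float]:
    """Return (QV, error_rate) from k-mer counts, following the Merqury definition."""
    if total_kmers <= 0 or not 0 <= err_kmers <= total_kmers:
        raise ValueError("k-mer counts are inconsistent")
    shared_fraction = (total_kmers - err_kmers) / total_kmers
    # Per-base probability of being correct is the k-th root of the shared fraction.
    error_rate = 1.0 - shared_fraction ** (1.0 / k)
    qv = math.inf if error_rate == 0 else -10.0 * math.log10(error_rate)
    return qv, error_rate


def parse_qv_line(line: str) -> tuple[str, int, int, float, float]:
    """Split one whitespace-separated .qv row into typed fields."""
    label, err, total, qv, rate = line.split()
    return label, int(err), int(total), float(qv), float(rate)


ROWS = [
    "BacterialGenome_Bactopia 0 4536393 +inf 0",
    "PlasmidGenome_Bactopia 0 4018 +inf 0",
    "both 0 4540411 +inf 0",
]

for row in ROWS:
    label, err, total, reported_qv, reported_rate = parse_qv_line(row)
    qv, rate = merqury_qv(err, total, ASSUMED_K)
    print(label, qv, rate)
```

Recomputing the values this way is a quick sanity check that a +inf QV comes from a true zero in err_kmers rather than from a parsing problem.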

04. k-mer Spectrum Analysis

TO DO: Extend the x-axis in this figure.

TO DO: Create the same figure based on the Nanopore data; maybe create a facet plot for both.

05. Whole-Genome Alignment

Limnothrix_sp_BL_A_16_CP166615	82994	113322	30329
Limnothrix_sp_BL_A_16_CP166615	592628	736045	143418
Limnothrix_sp_BL_A_16_CP166615	1756173	1817785	61613
Limnothrix_sp_BL_A_16_CP166615	2674925	2733309	58385
Limnothrix_sp_BL_A_16_CP166615	3085082	3131342	46261
Limnothrix_sp_BL_A_16_CP166615	4038055	4191007	152953
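
The rows above have four whitespace-separated fields. Reading them as sequence name, start, end, and length is an assumption based on the values themselves: in every row the fourth field equals end minus start plus one, which matches 1-based inclusive coordinates. A minimal sketch that checks this and sums the region lengths (function names are hypothetical):

```python
REGION_TEXT = """\
Limnothrix_sp_BL_A_16_CP166615	82994	113322	30329
Limnothrix_sp_BL_A_16_CP166615	592628	736045	143418
Limnothrix_sp_BL_A_16_CP166615	1756173	1817785	61613
Limnothrix_sp_BL_A_16_CP166615	2674925	2733309	58385
Limnothrix_sp_BL_A_16_CP166615	3085082	3131342	46261
Limnothrix_sp_BL_A_16_CP166615	4038055	4191007	152953
"""


def parse_regions(text: str) -> list[tuple[str, int, int, int]]:
    """Parse rows of: sequence name, start, end, length."""
    regions = []
    for line in text.strip().splitlines():
        seqid, start, end, length = line.split()
        regions.append((seqid, int(start), int(end), int(length)))
    return regions


def lengths_are_consistent(regions: list[tuple[str, int, int, int]]) -> bool:
    """True if every length equals end - start + 1 (1-based, inclusive)."""
    return all(end - start + 1 == length for _, start, end, length in regions)


regions = parse_regions(REGION_TEXT)
total_length = sum(length for _, _, _, length in regions)
print(lengths_are_consistent(regions), total_length)
```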

06. Coverage Visualization

TO DO: Also run the coverage visualization for the plasmid genome.

07. Synteny and Structural Analysis


08. Evaluation of genome annotations

# Bacterial genome
python Step1__PYSCRIPT_Evaluate_reading_frames_of_genes.py -i Limnothrix_sp_HT2024_Bactopia.gb -o Limnothrix_sp_HT2024_Bactopia_ANNOTATION-INFO
# Bacterial plasmid
python Step1__PYSCRIPT_Evaluate_reading_frames_of_genes.py -i Limnothrix_sp_HT2024_plasmid.gb -o Limnothrix_sp_HT2024_plasmid_ANNOTATION-INFO
# Bacterial genome
python Step2__PYSCRIPT_Compare_genes_by_name_and_position.py Limnothrix_sp_HT2024_Bactopia.gb Limnothrix_sp_HT2024_bacass.gb --max-start-diff 500
  • Standardize the annotations of a bacterial genome: This script ensures that every CDS and every gene annotation contains at least a gene tag and a product tag. The gene tag holds the four-letter gene abbreviation. The full behaviour of the script is summarized in the Behavior Table and Priority Order sections that follow the commands.
# Bacterial genome
python Step3__PYSCRIPT_Standardize_annotations_of_bacterial_genome.py input.gb output.gb
grep -v '/gene="unknown_gene"' output.gb | grep -v "/locus_tag=" > output_final.gb
# Bacterial plasmid
python Step3__PYSCRIPT_Standardize_annotations_of_bacterial_genome.py Limnothrix_sp_HT2024_plasmid.gb Limnothrix_sp_HT2024_plasmid_TMP.gb
grep -v '/gene="unknown_gene"' Limnothrix_sp_HT2024_plasmid_TMP.gb | grep -v "/locus_tag=" > Limnothrix_sp_HT2024_plasmid_FINAL.gb

Behavior Table

| Situation | Action | Log style |
|---|---|---|
| CDS already has valid gene and product | Copy missing values to the paired gene feature; CDS remains authoritative | White if nothing changes, yellow if synchronization changes something |
| Only the gene feature has valid gene and product | Copy missing values to the paired CDS feature | Yellow if this changes the CDS, otherwise white |
| gene and CDS disagree | Prefer CDS values and record the conflict in the report | Yellow if resolved successfully, red if still unresolved |
| Valid product exists but gene is missing | Try local mapping first; otherwise query UniProt cyanobacteria by product to infer the most common gene abbreviation | Yellow on successful resolution, red on failure |
| Valid gene exists but product is missing | Try local mapping first; otherwise query UniProt cyanobacteria by gene to infer the most common product description | Yellow on successful resolution, red on failure |
| standard_name is present | Use it as supporting information during local resolution and as the displayed name in logs | Shown as standard_name instead of locus_tag |
| standard_name is hypothetical protein CDS or hypothetical protein gene | Do not query UniProt and do not log the annotation; still standardize /product to hypothetical protein | No log output |
| Product is already hypothetical protein for one of those hypothetical-standard-name annotations | Leave it as hypothetical protein or rewrite it to the same standard form | No log output |
| Nothing reliable can be inferred | Fall back to unresolved values such as unknown_gene or remaining missing information, and record it in the report | Red |
| Annotation does not change | Keep existing values as they are | White |

Priority Order

| Priority | Rule |
|---|---|
| 1 | Prefer existing CDS qualifiers over gene qualifiers |
| 2 | Copy missing values across paired gene and CDS features |
| 3 | Apply the local mapping table |
| 4 | Query UniProt cyanobacteria by product if gene is missing |
| 5 | Query UniProt cyanobacteria by gene if product is missing |
| 6 | Skip logging and UniProt lookup for annotations whose standard_name is hypothetical protein CDS or hypothetical protein gene, but still standardize their product to hypothetical protein |
| 7 | Record conflicts and unresolved cases in the report |
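
The priority order can be illustrated with a small, self-contained Python sketch. This is not the code of Step3__PYSCRIPT_Standardize_annotations_of_bacterial_genome.py; it is a simplified model under stated assumptions: features are plain dictionaries of qualifiers, the local mapping table is a dictionary from gene abbreviation to product, and the UniProt queries are passed in as optional callables (`gene_by_product`, `product_by_gene`) so that no network access is needed. All function and variable names are hypothetical.

```python
from typing import Callable, Optional

HYPOTHETICAL_STANDARD_NAMES = {"hypothetical protein CDS", "hypothetical protein gene"}
Lookup = Callable[[str], Optional[str]]


def _valid(value: Optional[str]) -> bool:
    return bool(value) and value != "unknown_gene"


def resolve_pair(cds: dict, gene_feat: dict, local_map: dict,
                 gene_by_product: Optional[Lookup] = None,
                 product_by_gene: Optional[Lookup] = None) -> dict:
    """Resolve gene/product for one paired gene and CDS feature."""
    # Rule 6: hypothetical annotations are standardized silently, without lookup.
    names = {cds.get("standard_name"), gene_feat.get("standard_name")}
    if names & HYPOTHETICAL_STANDARD_NAMES:
        return {"gene": cds.get("gene") or gene_feat.get("gene"),
                "product": "hypothetical protein", "color": None, "conflict": False}

    def pick(key: str):
        # Rules 1 and 2: the CDS value wins; otherwise copy from the gene feature.
        a, b = cds.get(key), gene_feat.get(key)
        conflict = _valid(a) and _valid(b) and a != b
        return (a if _valid(a) else b if _valid(b) else None), conflict

    gene, gene_conflict = pick("gene")
    product, product_conflict = pick("product")

    if gene is None and product is not None:
        # Rules 3 and 4: local mapping first, then the external lookup.
        inverse = {p: g for g, p in local_map.items()}
        gene = inverse.get(product) or (gene_by_product(product) if gene_by_product else None)
    elif product is None and gene is not None:
        # Rules 3 and 5.
        product = local_map.get(gene) or (product_by_gene(gene) if product_by_gene else None)

    unresolved = gene is None or product is None
    changed = any(f.get("gene") != gene or f.get("product") != product
                  for f in (cds, gene_feat))
    color = "red" if unresolved else "yellow" if changed else "white"
    # Rule 7: conflicts and unresolved cases are returned for the report.
    return {"gene": gene or "unknown_gene", "product": product,
            "color": color, "conflict": gene_conflict or product_conflict}


LOCAL_MAP = {"psbA": "photosystem II protein D1"}  # illustrative entry only
print(resolve_pair({"product": "photosystem II protein D1"}, {}, LOCAL_MAP))
```

Passing the lookups as callables keeps the priority logic testable offline; a real implementation would also need rate limiting and caching for the UniProt queries.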
  • Run PAGP
TO DO: Replace the placeholder description of this step.

09. Gene-Level Visualization

TO DO: Also run the GenoVi visualization for the plasmid genome.


10. Metagenomic analysis of 16S rRNA amplicon data using QIIME2

| Metric | HorseThief_lake_water_2 | HorseThief_in_cultivation |
|---|---|---|
| Raw reads | 54,376 | 91,377 |
| ASV method | DADA2 | DADA2 |
| Number of ASVs | 201 | 272 |
| Number of genera | 53 | 67 |
| Number of families | 40 | 46 |
| Shannon diversity | 6.570 | 6.341 |
| Pielou evenness | 0.859 | 0.784 |
| Simpson diversity | 0.983 | 0.972 |
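
The three diversity metrics in the table are related, and the relationship can be checked with a short Python sketch. It assumes that the Shannon index is reported in bits (log base 2), that Pielou evenness is the Shannon index divided by log2 of the number of ASVs, and that Simpson diversity is 1 minus the sum of squared relative abundances. These definitions are assumptions about how the values were computed; the reported Shannon and Pielou values are consistent with them. The function names are hypothetical, and the per-ASV counts needed for `shannon` and `simpson` are not part of this README.

```python
import math


def shannon(counts: list[int], base: float = 2.0) -> float:
    """Shannon index of a vector of ASV counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)


def simpson(counts: list[int]) -> float:
    """Simpson diversity as 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def pielou(shannon_index: float, richness: int, base: float = 2.0) -> float:
    """Pielou evenness: Shannon index divided by its maximum, log(richness)."""
    return shannon_index / math.log(richness, base)


# (number of ASVs, Shannon diversity, Pielou evenness) as reported in the table
REPORTED = {
    "HorseThief_lake_water_2": (201, 6.570, 0.859),
    "HorseThief_in_cultivation": (272, 6.341, 0.784),
}

for sample, (n_asvs, h, j) in REPORTED.items():
    print(sample, round(pielou(h, n_asvs), 3), j)
```

Under these assumptions, the lower evenness of the cultivated sample means its reads are concentrated in fewer ASVs, even though it contains more ASVs overall.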

Notes

  • Image files are stored in the corresponding data_STEPXX__* directories.
  • All workflows assume Linux environments and standard bioinformatics toolchains.
  • Individual steps can be adapted to other bacterial genomes with minimal modification.

About

Scripts for the assembly and annotation of complete cyanobacterial genomes
