Skip to content

Inferring resolved species tree using fastoma

Sina Majidian edited this page Feb 15, 2025 · 2 revisions

Consider a scenario that NCBI tree was used with FastOMA. Since such tree is not resolved (binary), we can use gene markers (Orthologous Groups) found with FastOMA to infer a high quality species tree.

This is based on the F1000 paper here and mostly codes from here. Updated codes are available in here.

step1: run fastoma

After running FasTOMA, you should see a folder OrthologousGroupsFasta in the output folder.

step2: multiple sequence alignment

Usually there many small OrthologousGroups (OGs), using the code filter_groups_fastoma_v2.py from here we can filter and select big ones (with more species).

You need to change min_species=50 in the following bash script, the goal is to have >100 or >1000 gene markers in the output folder OrthologousGroupFasta_gt*. One suggestion is to use 0.8*proteomes.

cp -r ../out/OrthologousGroupsFasta .
cd OrthologousGroupsFasta
gunzip -k * 
cd ..

min_species=64 
python filter_groups_fastoma_v2.py --min-nr-species ${min_species} --input OrthologousGroupsFasta --output OrthologousGroupFasta_gt${min_species}


cd OrthologousGroupFasta_gt${min_species}
mkdir ../aligned/

thread=20

for i in $(ls -1 *.fa); do 
mafft --maxiterate 1000 --localpair --thread ${thread} ${i}  > ../aligned/${i}.aln
done

Note that in code here I changed the function get_species_from_header to get fastoma species names in ||speciesNames||

concatenate MSA per gene marker (supermatrix) and trim it

cd ..
python concat_alignments_v2.py  aligned/*aln --format-output fasta > concat.fasta
trimal -in  concat.fasta -out concat_gappyout.fasta -gappyout

run iqtree

thread=48
iqtree  -T ${thread}  -s concat_gappyout.fasta -bb 1000 #  -m LG+I+G 

Clone this wiki locally