- Running Strainify without a precomputed variant matrix
- Running Strainify with a precomputed variant matrix
- Example Output
- Clustering input reference genomes
- Unknown strains in metagenomic samples
Strainify includes example input data to help you get started quickly. To run the pipeline on the example data:
- Unzip the compressed FASTQ files. On Linux systems, you can use the following commands:
gunzip example/fastqs/paired/*.gz
gunzip example/fastqs/single/*.gz
- Make sure your config.yaml is set like this:
genome_folder: example/genomes
fastq_folder: example/fastqs/paired
output_dir: example/results
read_type: paired
modify_windows: --window_size 500 --window_overlap 0
weight_by_entropy: false
use_precomputed_variants: false
- Run Strainify:
./strainify run --cores 12 --configfile config.yaml
The output will be written to the example/results directory.
If you are running Strainify again on the same set of genomes, you can use the precomputed variant matrix by doing the following:
- In the config.yaml file, set use_precomputed_variants to true and precomputed_output_dir to your desired directory for storing the new output. If you do not provide a path for precomputed_output_dir, your output directory will be automatically set to output_dir/precomputed_results. Make sure you set output_dir to the directory that contains the precomputed variant matrix. For example, to reuse the variant matrix from the previous example, the config.yaml file would look like this:
genome_folder: example/genomes
fastq_folder: example/fastqs/single
output_dir: example/results
read_type: single
modify_windows: --window_size 500 --window_overlap 0
weight_by_entropy: false
use_precomputed_variants: true
precomputed_output_dir: example/new_output
- Run Strainify:
./strainify run --cores 12 --configfile config.yaml
The output will be written to the example/new_output directory.
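The rule for where results end up can be easy to misread, so here is a minimal Python sketch of it. This is not Strainify's own code: the function name resolve_results_dir is hypothetical, and the logic only restates the behaviour described above (a custom precomputed_output_dir if given, otherwise output_dir/precomputed_results).

```python
from pathlib import PurePosixPath


def resolve_results_dir(config):
    """Return the directory where results are expected for a given config.

    Illustrative only: mirrors the rule described in this README,
    not Strainify's internal implementation.
    """
    output_dir = PurePosixPath(config["output_dir"])

    # Without a precomputed variant matrix, results go straight to output_dir.
    if not config.get("use_precomputed_variants", False):
        return output_dir

    # With a precomputed matrix, an explicit precomputed_output_dir wins.
    custom = config.get("precomputed_output_dir")
    if custom:
        return PurePosixPath(custom)

    # Otherwise fall back to the documented default subdirectory.
    return output_dir / "precomputed_results"


config = {
    "output_dir": "example/results",
    "use_precomputed_variants": True,
    "precomputed_output_dir": "example/new_output",
}
print(resolve_results_dir(config))
```

Dropping precomputed_output_dir from the dictionary would make the sketch return example/results/precomputed_results instead.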
The abundance estimates are stored in a CSV file named abundance_estimates_combined.csv.
- Each row corresponds to a strain.
- Each column (after the first) corresponds to a sample.
- Values represent the estimated relative abundances.
Example CSV output:
strain name,10x_ratio_2
E24377A,23.7987
H10407,24.8376
Sakai,24.9406
UTI89,26.423
Note: these numbers are percentages.
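If you want to post-process the estimates, the table is straightforward to read with Python's standard csv module. The sketch below assumes only the layout described above (first column is the strain name, every other column is a sample); the function name read_abundances is hypothetical, and the inline text is the example output shown above rather than a real file.

```python
import csv
import io


def read_abundances(handle):
    """Parse an abundance table into {sample: {strain: percentage}}."""
    reader = csv.reader(handle)
    header = next(reader)
    samples = header[1:]  # first column holds the strain names
    table = {sample: {} for sample in samples}
    for row in reader:
        if not row:
            continue  # skip blank lines
        strain = row[0]
        for sample, value in zip(samples, row[1:]):
            table[sample][strain] = float(value)
    return table


# The example output from above, inlined so the sketch is self-contained.
example = io.StringIO(
    "strain name,10x_ratio_2\n"
    "E24377A,23.7987\n"
    "H10407,24.8376\n"
    "Sakai,24.9406\n"
    "UTI89,26.423\n"
)

abundances = read_abundances(example)
for sample, by_strain in abundances.items():
    total = sum(by_strain.values())
    dominant = max(by_strain, key=by_strain.get)
    print(f"{sample}: total={total:.4f}%, most abundant strain={dominant}")
```

To read a real result, replace the StringIO object with open("example/results/abundance_estimates_combined.csv", newline=""), adjusting the path to your output directory.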
Other important output files:
- sites.txt contains a list of variant positions that passed the filter. Read counts supporting the allele and reference base at these positions are then obtained and used as input to the MLE model.
- filtered_variant_matrix.csv contains the filtered variant matrix. Confounding variants (potential recombination sites) have been removed. For metagenomic samples that share the same set of strains (i.e. query genomes), this file can be reused to avoid rerunning the genome alignment and variant filtering steps. For more details, see instructions above for running Strainify with a precomputed variant matrix.
- significantly_enriched_windows.tsv contains the start and end coordinates of windows that are flagged as potential recombination sites. The z-score and p-values of each window are also shown. Variants in these windows are removed from downstream analysis (i.e. excluded from the filtered variant matrix).
- abundance_estimates_bootstrap_CIs.csv (if --bootstrap is applied) contains the 95% confidence intervals for each estimated relative abundance value.
By default, Strainify defines each input reference genome as a strain, and it is able to operate with strains that are highly similar provided a reference genome is available for each one. However, strains can also be defined at a less granular level by clustering similar reference genomes and selecting a representative sequence from each cluster as input for Strainify.
A simple tool that can be used for clustering is TreeCluster. You can first run Parsnp (included in Strainify's dependencies, no need to install separately) with your reference genomes to obtain a tree (one of Parsnp's default outputs). For example:
parsnp -r reference_genome.fna -d path/to/genome/folder/*.fna
Tip: if you would like Parsnp to randomly select a genome from the input as the reference for alignment, you can simply replace reference_genome.fna with !.
Look for parsnp.tree in Parsnp's output, and provide it to TreeCluster. Here is an example command:
TreeCluster.py -i parsnp.tree -m max_clade -t 0.001 -o clusters.tsv
You can then select a representative genome from each cluster and provide them to Strainify. Strainify will then estimate relative abundances at the cluster level.
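Selecting representatives can be scripted. The following Python sketch assumes that clusters.tsv is a tab-delimited file with a SequenceName and a ClusterNumber column, and that unclustered sequences are given cluster number -1; please check this against the file your TreeCluster version produces. The function name, the genome names, and the choice of the alphabetically first member as the representative are all illustrative, not a recommendation from Strainify.

```python
import csv
import io
from collections import defaultdict


def pick_representatives(handle):
    """Return one genome name per cluster, plus every unclustered genome."""
    reader = csv.DictReader(handle, delimiter="\t")
    clusters = defaultdict(list)
    singletons = []
    for row in reader:
        name = row["SequenceName"]
        cluster = row["ClusterNumber"]
        if cluster == "-1":
            # Assumed convention: -1 marks a genome that is its own cluster.
            singletons.append(name)
        else:
            clusters[cluster].append(name)

    # Arbitrary but reproducible choice: alphabetically first member.
    representatives = [min(members) for members in clusters.values()]
    return sorted(representatives + singletons)


# Hypothetical cluster assignments, inlined so the sketch is self-contained.
example = io.StringIO(
    "SequenceName\tClusterNumber\n"
    "genome_A.fna\t1\n"
    "genome_B.fna\t1\n"
    "genome_C.fna\t2\n"
    "genome_D.fna\t2\n"
    "genome_E.fna\t-1\n"
)

selected = pick_representatives(example)
print(selected)
```

In practice you may prefer a different criterion, such as the most complete assembly in each cluster. Copy the selected genomes into the folder you set as genome_folder.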
When the strains in a metagenomic sample are unknown (i.e. reference genomes are not available), Strainify can be paired with an upstream strain identification tool such as StrainGST. Run the upstream tool, collect the genomes that it identifies in your metagenomic samples, and then run Strainify with these genomes as references. For longitudinal/time-series studies, we recommend running the upstream tool on metagenomic samples from all timepoints and providing all of the identified genomes to Strainify.
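For a time series, the reference set is simply the union of the genomes identified at each timepoint. The Python sketch below illustrates this; it assumes you have already extracted the genome identifiers from the upstream tool's output yourself, and the function name, timepoint labels, and strain names are all hypothetical.

```python
def collect_reference_genomes(genomes_by_timepoint):
    """Return the sorted union of genomes identified across all timepoints."""
    combined = set()
    for genomes in genomes_by_timepoint.values():
        combined.update(genomes)
    return sorted(combined)


# Hypothetical identifications; a strain may appear at several timepoints.
identified = {
    "day_0": ["strain_A", "strain_B"],
    "day_7": ["strain_B", "strain_C"],
    "day_14": ["strain_A"],
}

references = collect_reference_genomes(identified)
print(references)
```

Using the same combined reference set for every timepoint keeps the abundance estimates comparable across samples, and lets you reuse the precomputed variant matrix as described above.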