
feat: avoid freebayes running forever on high coverage low-complexity regions (and parallelize more smoothly) #352

@dlaehnemann

Description


We regularly run into situations where freebayes does not parallelize particularly well: some genomic regions take forever to process, while all the others have long since finished.

I think this usually comes down to the situation described in this issue on the freebayes repository. We use the snakemake wrapper for freebayes, which in turn uses the fasta_generate_regions.py script to split the reference genome into equally sized stretches that are processed in parallel. But low-complexity stretches of the genome -- in our current case a G-only repeat in this region on chromosome 5 of the human genome (I think it was this) -- can accumulate very high coverage, even in targeted sequencing data. freebayes can then run on such a region for a very long time (as in days) on a single CPU. And if the snakemake workflow reserved many more cores for that freebayes job, say 32, the other 31 reserved CPUs sit idle once all other genomic regions are done. If you then kill the job to avoid wasting the reserved CPU time, you lose all the work this freebayes job has already done.

Thus, I see three possible things we could address here:

  1. Set a limit on the number of alleles to consider at any site.
    This is recommended for performance tuning and can be done via the --use-best-n-alleles command line argument. The default is to consider all alleles, which I guess can easily balloon in low-complexity regions. So I am assuming that even a rather permissive value would be fine here, just to cut off the extreme cases. Where exactly the sweet spot lies, I am unsure -- it could be 20, or 12, or 8, or... I'll try to distill a test case from our example above and measure run times with different values.
  2. Improve the generation of regions to process in parallel, most importantly making them contain roughly equal amounts of sequencing data (as opposed to reference genome stretches of equal length).
    Both the issue already mentioned above and another issue on long freebayes run times due to ultra-high coverage point to a clear solution: using the coverage_to_regions.py script to generate regions with more even coverage. We could, for example, add this functionality to the snakemake wrapper for freebayes, with a params: number_of_regions= configuration analogous to the existing params: chunksize= for fasta_generate_regions.py. Alternatively, we could create a separate snakemake wrapper that generates such regions beforehand, possibly using different tools than those mentioned in the issues.
  3. Avoid losing all completed work if we nevertheless hit a hard-to-process region.
    This is really annoying, especially for whole genome sequencing data, where freebayes can run for days even with heavy parallelization. It would thus make sense to split the execution of freebayes into distinct jobs even for a single sample group, so that snakemake takes care of the parallelization (instead of the GNU parallel-based freebayes-parallel script). Each job would then run on a single thread, independently of all other jobs, and we would need a gather rule to aggregate the resulting VCF/BCF output. For a compute cluster with a batch submission system, it would then make sense to group jobs per sample, as they share the same input files, or to group them with the gather/aggregate rule. This is especially important when input and output data are transferred into and out of a compute node via a storage plugin.
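The scatter/gather layout in item 3 could look roughly like this in a Snakefile. This is only a sketch: the rule names, file paths, the placeholder REGIONS list, and the use of bcftools for conversion and gathering are my assumptions, not the current wrapper's implementation:

```python
# REGIONS would come from fasta_generate_regions.py or coverage_to_regions.py
REGIONS = ["1:0-5000000", "1:5000000-10000000"]  # placeholder values

rule freebayes_scatter:
    input:
        ref="genome.fa",
        bam="mapped/{sample}.bam",
    output:
        "calls/{sample}/{region}.bcf",
    threads: 1  # one region per single-threaded job; snakemake parallelizes
    group: "freebayes"  # keep per-sample jobs together on one cluster node
    shell:
        "freebayes --fasta-reference {input.ref} --region {wildcards.region}"
        " {input.bam} | bcftools view -Ob -o {output}"

rule freebayes_gather:
    input:
        expand("calls/{{sample}}/{region}.bcf", region=REGIONS),
    output:
        "calls/{sample}.bcf",
    group: "freebayes"
    shell:
        "bcftools concat -Ob -o {output} {input}"
```

The point of this layout is that if one hard region still blows up, only that single job needs to be killed and rerun; the per-region results of all other jobs are kept.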
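To make item 1 a bit more concrete, here is a minimal Python sketch of how a use_best_n_alleles parameter could be threaded through to the freebayes command line. The helper name, its signature, and the value 12 are my own picks for illustration, not anything the wrapper currently does; only the --use-best-n-alleles flag itself is actual freebayes:

```python
from typing import Optional


def freebayes_cmd(ref: str, bam: str, region: str,
                  use_best_n_alleles: Optional[int] = None) -> list[str]:
    """Build a freebayes command line for one region.

    use_best_n_alleles caps the number of alleles evaluated per site.
    By default freebayes considers all alleles, which is what can
    balloon in high-coverage low-complexity regions.
    """
    cmd = ["freebayes", "--fasta-reference", ref, "--region", region]
    if use_best_n_alleles is not None:
        cmd += ["--use-best-n-alleles", str(use_best_n_alleles)]
    cmd.append(bam)
    return cmd


# e.g. start with a permissive cap and tighten it if run times stay long
cmd = freebayes_cmd("genome.fa", "sample.bam", "5:1-10000000",
                    use_best_n_alleles=12)
```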
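Item 2 boils down to partitioning the genome by cumulative coverage rather than by base pairs. A toy sketch of that idea, assuming an in-memory per-base coverage array (coverage_to_regions.py itself works from a coverage stream; this simplified stand-in just shows why the boundaries end up where they do):

```python
def equal_coverage_regions(coverage: list[int],
                           n_regions: int) -> list[tuple[int, int]]:
    """Split positions 0..len(coverage) into up to n_regions half-open
    intervals with roughly equal summed coverage.

    Boundaries sit at quantiles of cumulative coverage, so a
    high-coverage low-complexity stretch falls into short regions
    instead of dominating one huge equal-length chunk.
    """
    total = sum(coverage)
    target = total / n_regions
    regions, start, cut, acc = [], 0, 1, 0
    for pos, depth in enumerate(coverage):
        acc += depth
        # close the current region once its cumulative share is reached
        if acc >= cut * target and cut < n_regions:
            regions.append((start, pos + 1))
            start = pos + 1
            cut += 1
    if start < len(coverage):  # remaining tail becomes the last region
        regions.append((start, len(coverage)))
    return regions
```

With uniform coverage this reproduces equal-length chunks; with a coverage spike, the spike gets isolated into its own short region.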
