
feat: avoid freebayes running forever on high coverage low-complexity regions (and parallelize more smoothly) #352

@dlaehnemann

Description


We regularly run into situations where freebayes does not parallelize particularly well: some genomic regions take forever to process, while all the others have long since finished.

I think this usually comes down to the situation described in this issue on the freebayes repository. We use the snakemake wrapper for freebayes, which in turn uses the fasta_generate_regions.py script to split the reference genome into equally sized stretches that are processed in parallel. But low-complexity stretches of the genome -- in our current case a G-only repeat in this region on chromosome 5 of the human genome (I think it was this) -- can accumulate very high coverage, even in targeted sequencing data. freebayes can then run on such a region for a very long time (as in days) on a single CPU. And if the snakemake workflow reserved many more cores for that freebayes job, say 32, the other 31 reserved CPUs sit idle once all other genomic regions are done. If you then kill the job to avoid wasting the reserved CPU time, you lose all the work this freebayes job has already done.

Thus, I see three possible things we could address here:

  1. Set a limit on the number of alleles to consider at any site.
    This is recommended for performance tuning and can be done via the --use-best-n-alleles command line argument. The default is to consider all alleles, which I guess can easily balloon in low-complexity regions. So I am assuming that even a rather permissive value would be fine here, just to cut off the extreme cases. Where exactly the sweet spot lies, I am unsure -- it could be 20, or 12, or 8, or... I'll try to distill a test case from our example above and measure run times with different values.
  2. Improve the generation of regions to process in parallel, most importantly making them contain roughly equal amounts of sequencing data (as opposed to reference genome stretches of equal length).
    Both the issue already mentioned above and another issue on long freebayes run times due to ultra-high coverage point to a clear solution: using the coverage_to_regions.py script to generate regions with more even coverage. We could, for example, add this functionality to the snakemake wrapper for freebayes, with a params: number_of_regions= configuration analogous to the existing params: chunksize= for fasta_generate_regions.py. Alternatively, we could create a separate snakemake wrapper that generates such regions beforehand, possibly using different tools than those mentioned in the issues.
  3. Avoid losing all completed work if we nevertheless hit a hard-to-process region.
    This is really annoying, especially for whole genome sequencing data, where freebayes can run for days even with heavy parallelization. It would thus make sense to split the execution of freebayes into distinct jobs even for a single sample group, so that snakemake takes care of the parallelization (instead of the GNU parallel-based freebayes-parallel script). Each job would then run on a single thread, independently of all other jobs, and we would need a gather rule to aggregate the resulting VCF/BCF output. For a compute cluster with a batch submission system, it would then make sense to group jobs per sample, as they share the same input files, or to group them with the gather/aggregate rule. This is especially important when input and output data are transferred into and out of a compute node via a storage plugin.
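The scatter/gather layout in item 3 could look roughly like this in a Snakefile. This is only a sketch: the rule names, file paths, the placeholder REGIONS list, and the use of bcftools for conversion and gathering are my assumptions, not the current wrapper's implementation:

```python
# REGIONS would come from fasta_generate_regions.py or coverage_to_regions.py
REGIONS = ["1:0-5000000", "1:5000000-10000000"]  # placeholder values

rule freebayes_scatter:
    input:
        ref="genome.fa",
        bam="mapped/{sample}.bam",
    output:
        "calls/{sample}/{region}.bcf",
    threads: 1  # one region per single-threaded job; snakemake parallelizes
    group: "freebayes"  # keep per-sample jobs together on one cluster node
    shell:
        "freebayes --fasta-reference {input.ref} --region {wildcards.region}"
        " {input.bam} | bcftools view -Ob -o {output}"

rule freebayes_gather:
    input:
        expand("calls/{{sample}}/{region}.bcf", region=REGIONS),
    output:
        "calls/{sample}.bcf",
    group: "freebayes"
    shell:
        "bcftools concat -Ob -o {output} {input}"
```

The point of this layout is that if one hard region still blows up, only that single job needs to be killed and rerun; the per-region results of all other jobs are kept.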
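To make item 1 a bit more concrete, here is a minimal Python sketch of how a use_best_n_alleles parameter could be threaded through to the freebayes command line. The helper name, its signature, and the value 12 are my own picks for illustration, not anything the wrapper currently does; only the --use-best-n-alleles flag itself is actual freebayes:

```python
from typing import Optional


def freebayes_cmd(ref: str, bam: str, region: str,
                  use_best_n_alleles: Optional[int] = None) -> list[str]:
    """Build a freebayes command line for one region.

    use_best_n_alleles caps the number of alleles evaluated per site.
    By default freebayes considers all alleles, which is what can
    balloon in high-coverage low-complexity regions.
    """
    cmd = ["freebayes", "--fasta-reference", ref, "--region", region]
    if use_best_n_alleles is not None:
        cmd += ["--use-best-n-alleles", str(use_best_n_alleles)]
    cmd.append(bam)
    return cmd


# e.g. start with a permissive cap and tighten it if run times stay long
cmd = freebayes_cmd("genome.fa", "sample.bam", "5:1-10000000",
                    use_best_n_alleles=12)
```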
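Item 2 boils down to partitioning the genome by cumulative coverage rather than by base pairs. A toy sketch of that idea, assuming an in-memory per-base coverage array (coverage_to_regions.py itself works from a coverage stream; this simplified stand-in just shows why the boundaries end up where they do):

```python
def equal_coverage_regions(coverage: list[int],
                           n_regions: int) -> list[tuple[int, int]]:
    """Split positions 0..len(coverage) into up to n_regions half-open
    intervals with roughly equal summed coverage.

    Boundaries sit at quantiles of cumulative coverage, so a
    high-coverage low-complexity stretch falls into short regions
    instead of dominating one huge equal-length chunk.
    """
    total = sum(coverage)
    target = total / n_regions
    regions, start, cut, acc = [], 0, 1, 0
    for pos, depth in enumerate(coverage):
        acc += depth
        # close the current region once its cumulative share is reached
        if acc >= cut * target and cut < n_regions:
            regions.append((start, pos + 1))
            start = pos + 1
            cut += 1
    if start < len(coverage):  # remaining tail becomes the last region
        regions.append((start, len(coverage)))
    return regions
```

With uniform coverage this reproduces equal-length chunks; with a coverage spike, the spike gets isolated into its own short region.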
