High-throughput bioinformatics analyses of next generation sequencing (NGS) data often require challenging pipeline optimization. The key problem is choosing appropriate tools and selecting the best parameters for optimal precision and recall. NGS is becoming the method of choice for an ever-growing number of applications in both research and clinical settings. However, obtaining unbiased and accurate NGS analysis results usually requires a complex multi-step processing pipeline, specifically tailored to the data and experimental design. In the case of variant detection from DNA sequencing data, the analytical pipeline includes pre-processing, read alignment and variant calling. Multiple tools are available for each of these steps, each with its own set of modifiable parameters, creating a vast number of possible distinct pipelines that vary greatly in the variants they call. Selecting an adequate pipeline is a daunting task for a non-professional, and even an experienced bioinformatician needs to test many configurations in order to optimize the analysis.
Here we introduce ToTem, a tool for automated pipeline optimization. ToTem is a stand-alone web application with a comprehensive graphical user interface (GUI).
The core principle of pipeline optimization in ToTem is to automatically test pipeline performance for all parameter combinations in a user-defined range. Pipelines are defined through consecutively linked “processes”, where each process can execute one or more tools, functions or pieces of code. ToTem is optimized to test pipelines represented as linear sequences of commands, but it also supports branching at the level of tested processes, e.g. to simultaneously optimize two variant callers in one pipeline. To facilitate pipeline definition, common steps shared by multiple pipelines can be easily copied or moved using the drag-and-drop function.
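The idea of testing every parameter combination in a user-defined range can be illustrated with a minimal Python sketch. It is not ToTem's implementation; the helper `expand_grid` and the parameter names are hypothetical and are not tied to any specific tool.

```python
from itertools import product


def expand_grid(param_ranges):
    """Yield one dict per combination of the supplied parameter values."""
    names = list(param_ranges)
    for values in product(*(param_ranges[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical parameter ranges, chosen only for illustration.
param_ranges = {
    "min_base_quality": [10, 20, 30],
    "min_allele_fraction": [0.01, 0.05],
}

configs = list(expand_grid(param_ranges))
print(len(configs))  # 3 values x 2 values = 6 pipeline configurations
for config in configs:
    print(config)
```

Each resulting dictionary corresponds to one pipeline configuration to be run and benchmarked. The number of configurations grows multiplicatively with each added parameter, which is why automating this step matters.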
The results are presented as interactive graphs and tables, allowing an optimal pipeline to be selected based on the user’s priorities. Using ToTem, we were able to optimize somatic variant calling from ultra-deep targeted gene sequencing (TGS) data.
The benchmarking of each pipeline is done using ground truth data and is based on an evaluation of true positives, false positives and false negatives, and of the performance quality metrics derived from them. Ground truth data generally consists of raw sequencing data or alignments and an associated set of validated variants. ToTem provides two benchmarking approaches, each focusing on different applications and having different advantages:
The first approach uses ToTem’s filtering tool to filter (stratified) performance reports generated by external benchmarking tools, which are incorporated as the final part of the tested analytical pipelines. This allows many parameter combinations to be evaluated and makes it simple to select the settings that produce the best results considering, e.g., quality metrics, variant type and region of interest (the available variables depend on the report). This approach is particularly useful for optimizing pipelines for whole genome sequencing (WGS) or whole exome sequencing (WES), and also for TGS.
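As a rough illustration of filtering a stratified report, the following Python sketch keeps the rows for one variant type and region and ranks the tested settings by a chosen metric. The report layout, column names and values are hypothetical and do not reproduce the output format of any particular benchmarking tool or of ToTem itself.

```python
# Hypothetical stratified report: one row per (setting, variant type, region).
report = [
    {"setting": "A", "variant_type": "SNV", "region": "exome",
     "precision": 0.98, "recall": 0.95},
    {"setting": "B", "variant_type": "SNV", "region": "exome",
     "precision": 0.96, "recall": 0.97},
    {"setting": "A", "variant_type": "INDEL", "region": "exome",
     "precision": 0.90, "recall": 0.80},
    {"setting": "B", "variant_type": "INDEL", "region": "exome",
     "precision": 0.85, "recall": 0.88},
]


def rank_settings(rows, variant_type, region, metric):
    """Keep the rows of one stratum and sort them by the metric, best first."""
    stratum = [
        row for row in rows
        if row["variant_type"] == variant_type and row["region"] == region
    ]
    return sorted(stratum, key=lambda row: row[metric], reverse=True)


best = rank_settings(report, "SNV", "exome", "recall")[0]
print(best["setting"], best["recall"])
```

Which setting ranks first depends on the stratum and the metric chosen, which reflects the point that the optimal pipeline depends on the user's priorities.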
The second approach, Little Profet (LP), is ToTem’s own benchmarking method, which compares variant calls generated by the tested pipelines to the gold standard variant call set. LP calculates standard quality metrics (precision, recall and F-measure) and, most importantly, the reproducibility of each quality metric, which is its main advantage over the standard Genome in a Bottle (GIAB) approach. ToTem thus allows the best pipelines to be selected considering the chosen quality metrics and their consistency over multiple data subsets. The LP approach is designed primarily for TGS data harboring a limited number of sequence variants and suffering from a high risk of pipeline over-fitting.
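The metrics named above follow from counts of true positives, false positives and false negatives. The Python sketch below computes them by comparing a call set with a truth set and then summarizes each metric across data subsets. It is not LP's actual implementation: the text does not specify how reproducibility is quantified, so the standard deviation across subsets is used here purely as an assumed stand-in, and the variants and helper names are hypothetical.

```python
from statistics import mean, pstdev


def quality_metrics(called, truth):
    """Compute precision, recall and F-measure from two sets of variants."""
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # present in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


def summarize(per_subset):
    """Return (mean, standard deviation) of each metric across subsets.

    Assumption: the spread across subsets serves as a proxy for reproducibility.
    """
    return {
        name: (mean([m[name] for m in per_subset]),
               pstdev([m[name] for m in per_subset]))
        for name in per_subset[0]
    }


# Hypothetical variants keyed as (chromosome, position, ref, alt).
truth = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"),
         ("chr2", 50, "C", "T"), ("chr2", 75, "T", "G")}
subset_calls = [
    {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"),
     ("chr2", 50, "C", "T"), ("chr3", 10, "A", "G")},  # one miss, one false call
    set(truth),                                        # perfect agreement
]

per_subset = [quality_metrics(calls, truth) for calls in subset_calls]
summary = summarize(per_subset)
print(summary)
```

Under this assumption, a pipeline with a high mean and a small spread would be preferred over one that scores well on a single subset only, which is the over-fitting risk mentioned above.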