Fastq analysis: Quality Control, filtering and preprocessing

In recent years, high-throughput sequencers have dramatically increased their throughput while new applications of sequencing continue to emerge. In particular, genome or targeted exome resequencing is gaining popularity because it has been applied successfully to discover new disease genes, greatly reducing the time the process takes. The growing use of resequencing will produce an avalanche of data that must be processed and analyzed in a reasonable time. This analysis is done in a five-stage NGS pipeline whose phases are: Fastq analysis, Mapping, SAM/BAM analysis, Variant Calling and VCF analysis.

Fastq analysis is divided into three steps that can be performed separately or sequentially. If performed sequentially, the order of execution is: Preprocessing, Filtering and Quality Control. First, reads are trimmed by removing the leading and trailing nucleotides that do not meet quality thresholds. Next, filters are applied to validate the reads, and valid and invalid reads are stored in separate files. Finally, Quality Control information is computed separately for valid and invalid reads to facilitate the analysis of the sequencing process.

Preprocessing is the first step of the three-stage Fastq analysis. It screens the first and/or last nucleotides of each fastq read to determine whether quality thresholds are met. If they are not, those nucleotides are cut. The number of leading and trailing nucleotides to screen can be set on the command line. This trims the read ends, which often fall below quality thresholds.
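
The trimming idea can be sketched in a few lines of Python. This is only an illustration, not the Fastq-GPU-Tool implementation: the function name, the parameter names, the default values and the per-base cutting rule are all assumptions, and qualities are assumed to be Phred+33 encoded.

```python
from typing import Tuple


def trim_read(seq: str, qual: str, min_quality: int = 20,
              max_left: int = 5, max_right: int = 5) -> Tuple[str, str]:
    """Cut low-quality bases from the screened ends of one read.

    Hypothetical rule: inside the screened windows (max_left leading bases,
    max_right trailing bases) a base is cut while its Phred score is below
    min_quality. Phred+33 encoding is assumed.
    """
    scores = [ord(c) - 33 for c in qual]

    # Screen the leading bases, never looking past the left window.
    start = 0
    while start < min(max_left, len(seq)) and scores[start] < min_quality:
        start += 1

    # Screen the trailing bases, never looking past the right window.
    end = len(seq)
    while (end > start and len(seq) - end < max_right
           and scores[end - 1] < min_quality):
        end -= 1

    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    # '!' encodes Phred 0 and 'I' encodes Phred 40 in Phred+33.
    print(trim_read("ACGTACGT", "!!IIII!!"))
```

Limiting the cut to a screened window keeps a single poor base in the middle of a read from truncating it; whether the real tool cuts per base or per window is not stated here.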

more info about preprocessing

Filtering is the second step of the Fastq analysis. It applies a filter based on quality thresholds to the fastq reads, classifying them as valid or invalid. Only reads that meet the configured quality thresholds are valid; the rest are written to a separate file as invalid. This way the mapping step can be optimized by mapping only reads of a certain quality. For paired-end reads, both reads of the pair must meet the quality thresholds.
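
A minimal sketch of the paired-end rule follows. The mean-quality criterion, the threshold value and all names are assumptions chosen for illustration; the actual filter criteria of the tool may differ. Reads are represented as (sequence, quality) tuples with Phred+33 qualities.

```python
from typing import List, Tuple

Read = Tuple[str, str]    # (sequence, Phred+33 quality string)
Pair = Tuple[Read, Read]  # the two reads of a paired-end fragment


def mean_quality(qual: str) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - 33 for c in qual) / len(qual)


def read_passes(read: Read, min_mean_quality: float = 20.0) -> bool:
    """Hypothetical validity test: mean quality at or above the threshold."""
    return mean_quality(read[1]) >= min_mean_quality


def split_pairs(pairs: List[Pair],
                min_mean_quality: float = 20.0) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (valid, invalid); a pair is valid only if both reads pass."""
    valid, invalid = [], []
    for pair in pairs:
        ok = all(read_passes(read, min_mean_quality) for read in pair)
        (valid if ok else invalid).append(pair)
    return valid, invalid


if __name__ == "__main__":
    good = ("ACGT", "IIII")  # mean Phred 40
    bad = ("ACGT", "!!!!")   # mean Phred 0
    valid, invalid = split_pairs([(good, good), (good, bad)])
    print(len(valid), "valid pair(s),", len(invalid), "invalid pair(s)")
```

Keeping or rejecting the pair as a unit preserves the pairing between the two output files, which paired-end mappers generally rely on.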

more info about filtering

Quality Control (QC from here on) is the third step of the three-stage Fastq analysis. Its aim is to obtain useful feedback from the sequencing process that can later be used to optimize the fastq preprocessing step.
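
One statistic that could feed back into preprocessing is the mean quality at each read position, since positions with a low mean suggest how many leading or trailing nucleotides to screen. The sketch below is an assumption about what such a statistic might look like, not a description of the tool's QC output; the function name is hypothetical and Phred+33 encoding is assumed.

```python
from collections import defaultdict
from typing import Iterable, List


def per_position_mean_quality(quals: Iterable[str]) -> List[float]:
    """Mean Phred score at each read position across all reads.

    Reads may have different lengths; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for qual in quals:
        for position, char in enumerate(qual):
            totals[position] += ord(char) - 33
            counts[position] += 1
    # Positions are contiguous from 0, so len(counts) is the longest read length.
    return [totals[i] / counts[i] for i in range(len(counts))]


if __name__ == "__main__":
    # '!' encodes Phred 0 and 'I' encodes Phred 40 in Phred+33.
    print(per_position_mean_quality(["II!", "I!"]))
```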

more info about QC

All this work is done by a custom-developed software tool: Fastq-GPU-Tool


SAM/BAM analysis: Quality Control, filtering, sorting and conversion

Sorting is the first step of the BAM analysis. It can be run standalone or together with the Quality Control calculations. The aim of sorting is to shorten the later processes, mainly BAM file Quality Control and Variant Calling. Alignments are sorted in ascending order by chromosome number and then by alignment coordinate.
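
The sort key can be illustrated as follows. This is a sketch under assumptions: alignments are reduced to (chromosome, position) tuples, chromosome names may carry a "chr" prefix, and non-numeric chromosomes such as X or Y are placed after the numbered ones. Coordinate-sorted BAM files normally follow the reference order declared in the file header, so the numeric ordering here is only one possible reading of "chromosome number".

```python
from typing import List, Tuple

Alignment = Tuple[str, int]  # (chromosome name, 1-based start coordinate)


def chromosome_rank(chrom: str) -> Tuple[int, int, str]:
    """Rank numbered chromosomes numerically, then all others by name."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def sort_alignments(alignments: List[Alignment]) -> List[Alignment]:
    """Ascending sort by chromosome rank, then by coordinate."""
    return sorted(alignments, key=lambda a: (chromosome_rank(a[0]), a[1]))


if __name__ == "__main__":
    unsorted = [("chr2", 100), ("chr10", 5), ("chr1", 300), ("chrX", 1), ("chr1", 20)]
    for alignment in sort_alignments(unsorted):
        print(alignment)
```

Ranking numerically avoids the lexicographic trap in which "chr10" sorts before "chr2".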

more info about sorting

Quality Control computes useful statistics about the alignments in a BAM file. This feedback can be of great value when analyzing the mapping process. The objective of QC over a BAM file is to characterize the quality of the mapping process and to detect deviations from standard behaviour. With QC information, later processes can be better planned and evaluated.
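
As an illustration of the kind of statistics involved, the sketch below counts mapped and unmapped reads and averages the mapping quality over SAM-formatted text lines (the text equivalent of BAM records). The choice of statistics and the function name are assumptions; the field positions and the 0x4 "unmapped" flag bit follow the SAM format.

```python
from typing import Dict, Iterable, Union


def alignment_stats(sam_lines: Iterable[str]) -> Dict[str, Union[int, float]]:
    """Count mapped/unmapped reads and average MAPQ over mapped reads."""
    mapped = unmapped = mapq_total = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2: bitwise FLAG
        mapq = int(fields[4])  # column 5: mapping quality
        if flag & 0x4:         # bit 0x4 set means the read is unmapped
            unmapped += 1
        else:
            mapped += 1
            mapq_total += mapq
    return {
        "total": mapped + unmapped,
        "mapped": mapped,
        "unmapped": unmapped,
        "mean_mapq": mapq_total / mapped if mapped else 0.0,
    }


if __name__ == "__main__":
    records = [
        "@HD\tVN:1.6\tSO:coordinate",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t16\tchr1\t200\t20\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(alignment_stats(records))
```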

more info about QC

Filtering produces an output BAM file from a given input BAM file, including only those reads that meet the given parameters. The filter parameters can describe target regions or thresholds based on the Quality Control variables. The aim of filtering is to separate alignments of different quality into different files in order to facilitate later processes.
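
A region-plus-threshold filter might look like the sketch below. The `Alignment` record, the parameter names and the use of mapping quality as the threshold variable are all hypothetical; as a simplification, an alignment is considered inside a region when its start coordinate falls in it, ignoring the alignment's length.

```python
from typing import List, NamedTuple, Optional, Tuple


class Alignment(NamedTuple):
    name: str
    chrom: str
    pos: int   # 1-based leftmost coordinate
    mapq: int  # mapping quality


Region = Tuple[str, int, int]  # (chromosome, start, end), both ends inclusive


def filter_alignments(alignments: List[Alignment],
                      region: Optional[Region] = None,
                      min_mapq: int = 0) -> Tuple[List[Alignment], List[Alignment]]:
    """Split alignments into (kept, rejected) by target region and MAPQ."""
    kept, rejected = [], []
    for aln in alignments:
        in_region = region is None or (
            aln.chrom == region[0] and region[1] <= aln.pos <= region[2]
        )
        if in_region and aln.mapq >= min_mapq:
            kept.append(aln)
        else:
            rejected.append(aln)
    return kept, rejected


if __name__ == "__main__":
    alignments = [
        Alignment("r1", "chr1", 150, 60),
        Alignment("r2", "chr1", 500, 60),
        Alignment("r3", "chr1", 160, 10),
    ]
    kept, rejected = filter_alignments(alignments, region=("chr1", 100, 200), min_mapq=30)
    print([a.name for a in kept], [a.name for a in rejected])
```

Returning both lists mirrors the stated aim of separating alignments into different files rather than discarding the ones that fail.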

more info about filtering

The diff operation collects the alignments common to several BAM files into a single BAM file. Its most valuable use is the cross-validation (or intersection) of alignments obtained with different mapping methods or with different quality thresholds.
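
The intersection can be sketched as below. What makes two alignments "the same" is not specified here, so the key used (read name, chromosome, start coordinate) is an assumption, as are the function names and the tuple representation.

```python
from typing import List, Tuple

Alignment = Tuple[str, str, int]  # (read name, chromosome, 1-based start)


def alignment_key(aln: Alignment) -> Tuple[str, str, int]:
    """Hypothetical identity of an alignment: same read placed at the same locus."""
    return (aln[0], aln[1], aln[2])


def common_alignments(first: List[Alignment],
                      *others: List[Alignment]) -> List[Alignment]:
    """Alignments of `first` that also appear in every other collection.

    The order of `first` is preserved, so a sorted input stays sorted.
    """
    other_keys = [{alignment_key(a) for a in other} for other in others]
    return [a for a in first
            if all(alignment_key(a) in keys for keys in other_keys)]


if __name__ == "__main__":
    mapper_a = [("r1", "chr1", 100), ("r2", "chr1", 250), ("r3", "chr2", 40)]
    mapper_b = [("r1", "chr1", 100), ("r2", "chr1", 900), ("r3", "chr2", 40)]
    print(common_alignments(mapper_a, mapper_b))
```

Here read r2 is dropped because the two mappers place it at different coordinates, which is the disagreement cross-validation is meant to expose.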

more info about diff


some notes about samtools installation and integration

Variant Calling

VCF analysis: Quality Control, filtering and preprocessing

Alex, take a look: