Fastq analysis: Quality Control, filtering and preprocessing

In recent years, high-throughput sequencers have dramatically increased their throughput while new applications of sequencing have emerged. In particular, genome and targeted exome resequencing are gaining popularity, as they have been successfully applied to discover new disease genes while greatly reducing the time the process takes. The growing use of resequencing will produce an avalanche of data that must be processed and analyzed in a reasonable time. This analysis is done in a five-stage NGS pipeline whose phases are: Fastq analysis, Mapping, SAM/BAM analysis, Variant Calling and VCF analysis.

Fastq analysis is divided into three steps that can be performed separately or sequentially. If performed sequentially, the order of execution is: Preprocessing, Filtering and Quality Control. First, reads are trimmed by removing the first and last nucleotides that do not meet the quality thresholds. Next, filters are applied to validate the reads, and valid and invalid reads are stored in separate files. Finally, Quality Control information is obtained separately for valid and invalid reads to facilitate the analysis of the sequencing process.

Quality Control (hereafter QC) is the third step of the three-stage Fastq analysis. Its aim is to obtain feedback from the sequencing process that can later be used to tune the Fastq preprocessing step.

more info about QC
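
As an illustration of the kind of feedback QC can provide, the following minimal Python sketch computes the mean Phred quality at each read position. It is not the tool's implementation: the functions `parse_fastq` and `mean_quality_per_position` are hypothetical names, the Phred+33 quality encoding is an assumption, and the statistic itself was chosen only as a common example of a QC metric.

```python
from collections import defaultdict

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of Fastq lines."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between or after records
        seq = next(it, "")
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "")
        yield header, seq, qual


def mean_quality_per_position(records):
    """Return the mean Phred score at each read position (hypothetical QC metric)."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for _, _, qual in records:
        for pos, ch in enumerate(qual):
            totals[pos] += ord(ch) - PHRED_OFFSET
            counts[pos] += 1
    return [totals[pos] / counts[pos] for pos in sorted(counts)]
```

A drop in the mean quality towards the read ends would suggest which first and last nucleotides are worth screening in the preprocessing step.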

Preprocessing is the first step of the three-stage Fastq analysis. It screens the first and/or last nucleotides of each Fastq read to determine whether the quality thresholds are met. If not, these first and/or last nucleotides are cut. The first and last nucleotides to screen can be specified on the command line. This trims the fragments of reads that often fall below the quality thresholds.

more info about preprocessing
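
The trimming described above can be sketched as follows. This is a hedged illustration, not the tool's code: `trim_read` and its parameters `n_first`, `n_last` and `min_quality` are hypothetical, the Phred+33 encoding is assumed, and using the mean quality of the screened window as the criterion is an assumption, since the text does not state how the threshold is evaluated.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def phred_scores(qual):
    """Convert a Fastq quality string into a list of Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


def trim_read(seq, qual, n_first, n_last, min_quality):
    """Cut the screened first/last windows whose mean quality is below min_quality.

    n_first and n_last are the numbers of leading and trailing nucleotides
    to screen (hypothetical parameters, not the tool's options).
    """
    scores = phred_scores(qual)
    start, end = 0, len(seq)

    if n_first > 0:
        head = scores[:n_first]
        if head and sum(head) / len(head) < min_quality:
            start = min(n_first, end)

    if n_last > 0:
        # Never screen nucleotides that were already cut from the start.
        tail_start = max(start, end - n_last)
        tail = scores[tail_start:end]
        if tail and sum(tail) / len(tail) < min_quality:
            end = tail_start

    return seq[start:end], qual[start:end]
```

For example, with `n_first=2`, `n_last=2` and `min_quality=20`, a read whose first and last two bases have quality `!` (Phred 0) loses both ends, while a read with uniformly high quality is returned unchanged.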

Filtering is the second step of the Fastq analysis. It applies a filter based on quality thresholds to the Fastq reads to separate valid from invalid reads. Only reads that meet the configured quality thresholds are valid; the others are marked as invalid and written to a separate file. This way the mapping step can be optimized by mapping only reads of sufficient quality. For paired-end reads, both reads of a pair must meet the quality levels.

more info about filtering
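
The paired-end rule can be illustrated with a short sketch. The names `is_valid` and `split_pairs` are hypothetical, reads are represented as `(header, sequence, quality)` tuples, Phred+33 encoding is assumed, and a minimum mean quality stands in for whatever thresholds the tool actually applies.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def mean_quality(qual):
    """Mean Phred score of a quality string (0.0 for an empty read)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - PHRED_OFFSET for ch in qual) / len(qual)


def is_valid(qual, min_mean_quality):
    """A read passes when its mean quality reaches the threshold (assumed criterion)."""
    return mean_quality(qual) >= min_mean_quality


def split_pairs(pairs, min_mean_quality):
    """Split paired-end reads into valid and invalid lists.

    Each element of pairs is (read1, read2), where a read is a
    (header, sequence, quality) tuple. A pair is valid only if both reads pass.
    """
    valid, invalid = [], []
    for read1, read2 in pairs:
        both_pass = (is_valid(read1[2], min_mean_quality)
                     and is_valid(read2[2], min_mean_quality))
        (valid if both_pass else invalid).append((read1, read2))
    return valid, invalid
```

In a real run the two lists would be written to separate output files, so that only the valid pairs are passed on to mapping.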

All of this work is done by a custom-developed tool: fastq-hpc-tool

Execution and examples

  • Example command line:

./bin/fastq-hpc-tools --qc --fq1 1M_reads_pe_1.fastq --fq2 1M_reads_pe_2.fastq --outdir /tmp --batch-list-size 4 --batch-size 50000000 --kmers --cpu-num-threads 4 --t

  • Example Fastq files:

  • Genomic signature files:


  • Binary and sources: