Quality Control (QC) documentation

Introduction

Quality Control (hereafter QC) is the third stage of this three-stage FASTQ analysis. Its aim is to extract feedback from the sequencing process that can later be used to optimize the FASTQ file preprocessing step.

Implementation

We propose a new approach that takes advantage of new GPU processors. QC operations have been implemented on GPUs and combined with some additional calculations on CPUs. The QC step produces the following metrics, graphics and data:

  • General statistics
    • Number of reads
    • Min. read length
    • Max. read length
    • Mean read length
    • Mean read quality
  • Analysis of nucleotides
    • GC content
    • Mean quality per nucleotide position
    • 5-mer count (full count, sorted from highest to lowest count)
  • Graphics
    • Quality per nucleotide position
    • Per sequence quality scores
    • Per base sequence content
    • Per base GC content
    • Per sequence GC content
    • %N per nucleotide position
    • Sequence length distribution
    • 5-mer count per nucleotide position (full count)
  • Data files (CSV format)
    • Information by nucleotide position (%A, %C, %G, %T, %N, %GC, read length histogram, quality per position)
    • Quality level histogram
    • %GC histogram
    • Total 5-mer count
    • Total 5-mer per position

Note: for paired-end data, the information above is reported separately for each end.
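
To make a few of the listed metrics concrete, here is a minimal CPU-only sketch in Python. It is not the GPU implementation described in this document. The function names `general_stats` and `count_kmers` are hypothetical, and several definitions are assumptions: mean read quality is averaged over all bases, GC content is the percentage of G and C over all bases, and k-mers containing N are skipped.

```python
from collections import Counter


def general_stats(reads, phred_offset=33):
    """Compute general statistics for an iterable of (sequence, quality) pairs.

    Assumption: mean quality and GC content are computed over all bases.
    """
    lengths = []
    gc_bases = 0
    quality_sum = 0
    for seq, qual in reads:
        lengths.append(len(seq))
        gc_bases += sum(1 for base in seq.upper() if base in "GC")
        # Decode each quality character with the chosen phred offset.
        quality_sum += sum(ord(ch) - phred_offset for ch in qual)
    if not lengths:
        return {"reads": 0}
    total_bases = sum(lengths)
    return {
        "reads": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": total_bases / len(lengths),
        "mean_quality": quality_sum / total_bases if total_bases else 0.0,
        "gc_percent": 100.0 * gc_bases / total_bases if total_bases else 0.0,
    }


def count_kmers(sequences, k=5):
    """Return (k-mers sorted from highest to lowest count, counts per position).

    Assumption: k-mers that contain an N are not counted.
    """
    total = Counter()
    per_position = {}
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if "N" in kmer:
                continue
            total[kmer] += 1
            per_position.setdefault(start, Counter())[kmer] += 1
    return total.most_common(), per_position
```

For paired-end data, a caller would run these functions once per end so that the two sets of results stay separate.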

We implement the Quality Control on GPUs using CUDA (http://developer.NVidia.com/object/cuda.html). In order to take advantage of the parallelism provided by the segmentation, the QC process has been split into several steps (read, load to GPU, quality and general computations, copy results from GPU, calculate final results and write report). Although CUDA threads can manage multiple GPUs simultaneously, they lack many features needed for concurrent execution and process synchronization. For this reason we have implemented our solution using pthreads. In particular, the QC implementation uses three kinds of concurrent CPU threads:

  • the read threads, which handle disk input by loading blocks of reads; one thread per fastq file is used
  • the gpu threads, which copy fastq read information to the GPU, launch the QC kernel on each GPU and return the results to main memory; one thread per GPU is used
  • the results thread, a single thread that accumulates the partial GPU results to obtain the final metrics and writes the report to disk

Three limited-size lists are used by the threads:
  • The first to store fastq reads grouped in batches
  • The second to store GPU quality control results
  • The third to store read status (left-trim, right-trim, both-side-trim) in order to select the right values depending on the state of the read

This setup allows new reads to be loaded from disk while QC data is calculated simultaneously on multiple GPU cards and CPU cores. This approach has proved highly efficient and makes good use of the full hybrid CPU-GPU computational power and the disk I/O capacity.
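
The thread layout above can be illustrated with a small Python sketch that uses the standard `threading` and `queue` modules instead of pthreads. It is a simplified model under stated assumptions, not the actual implementation: there is a single reader rather than one per fastq file, the worker threads do a trivial CPU computation where the real gpu threads would launch a CUDA kernel, and only two of the three limited-size lists are modelled (the read status list is omitted). All function names are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def reader(batches, batch_q, n_workers):
    """Read thread: pushes batches; put() blocks while the list is full."""
    for batch in batches:
        batch_q.put(batch)
    for _ in range(n_workers):
        batch_q.put(_DONE)  # one end marker per worker


def worker(batch_q, result_q):
    """Stand-in for a gpu thread: take a batch, compute a partial result."""
    while True:
        batch = batch_q.get()
        if batch is _DONE:
            result_q.put(_DONE)
            return
        # Placeholder for: copy to GPU, launch QC kernel, copy results back.
        result_q.put((len(batch), sum(len(read) for read in batch)))


def collector(result_q, n_workers, totals):
    """Results thread: accumulates partial results into the final totals."""
    remaining = n_workers
    while remaining:
        item = result_q.get()
        if item is _DONE:
            remaining -= 1
            continue
        totals["reads"] += item[0]
        totals["bases"] += item[1]


def run_pipeline(batches, n_workers=2, list_size=10):
    """Run reader, workers and collector concurrently over bounded queues."""
    batch_q = queue.Queue(maxsize=list_size)
    result_q = queue.Queue(maxsize=list_size)
    totals = {"reads": 0, "bases": 0}
    threads = [threading.Thread(target=reader, args=(batches, batch_q, n_workers))]
    threads += [
        threading.Thread(target=worker, args=(batch_q, result_q))
        for _ in range(n_workers)
    ]
    threads.append(
        threading.Thread(target=collector, args=(result_q, n_workers, totals))
    )
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return totals
```

Because both queues are bounded, the reader stalls when the workers fall behind and the workers stall when the collector falls behind, so memory use stays limited while disk input and computation overlap.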

System Requirements

  • Hardware:
    • 64-bit x86-64 CPUs
    • NVidia CUDA-enabled card with compute capability 2.0 or higher
  • Software:
    • 64-bit Linux system
    • NVidia driver supporting CUDA 4.0 or higher

Installation

  1. Download the tar file ngs-gpu.tgz from the Files menu
  2. In the Linux console, type:
    1. tar xvfz ngs-gpu.tgz
    2. cd ngs-gpu/fastq-hpc-tools/src
    3. make fastq-hpc-tools
The directory ngs-gpu/bin will then contain one executable file:
  • fastq-hpc-tools, the binary used for preprocessing, filtering and quality control

From here on we focus only on the use of fastq-hpc-tools for quality control.

Command line options

fastq-hpc-tools

 ngs-cpu/fastq-hpc-tools/bin/fastq-hpc-tools --qc --outdir [--batch-size] [--batch-list-size] [--fastq | --fq]|[--fastq1 | --fq1 --fastq2| --fq2] [--phred-quality] [--kmers] [--conf] [--t | --time]

 --qc, flag for quality control, optional. Note: if no flag is given at all, only quality control is performed

 --outdir, directory where report and image files will be stored, mandatory

 --batch-size, size in bytes of fastq batches, optional (default 500000)

 --batch-list-size, maximum length of the list that stores fastq read batches (each of size batch-size), optional (default 10)

 --fastq, synonym of --fq, input file in FASTQ format for single end, mandatory for single end

 --fastq1, synonym of --fq1, input file in FASTQ format for paired end (pair 1), mandatory for paired end

 --fastq2, synonym of --fq2, input file in FASTQ format for paired end (pair 2), mandatory for paired end

 --phred-quality, phred quality scale to determine base quality, accepted values: 33, 64, sanger (=33), solexa (=64); optional (default 33)

 --kmers, flag for kmers calculation, optional (default no kmers)

 --conf, path to a file with launch options, optional. Each line of the file must be written exactly as on the command line: parameter=value or flag. If set, the options in the file override the command line options

 --t, synonym of --time, activates timing when set, optional (default no timing)
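
The --conf behaviour described above (one parameter=value or flag per line, with file options taking precedence) can be sketched in Python as follows. This only illustrates the documented rules and is not the tool's parser. The function names are hypothetical, and the handling of leading dashes and blank lines is an assumption.

```python
def parse_conf(text):
    """Parse launch options, one 'parameter=value' or bare flag per line.

    Assumptions: blank lines are ignored and leading dashes are stripped
    from option names. Flags are stored with the value True.
    """
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        name, separator, value = line.partition("=")
        name = name.strip().lstrip("-")
        options[name] = value.strip() if separator else True
    return options


def effective_options(cli_options, conf_options):
    """Merge options so that values from the conf file win over the CLI."""
    merged = dict(cli_options)
    merged.update(conf_options)
    return merged
```

With this sketch, an option given both on the command line and in the file ends up with the file's value, while options given only on the command line are kept.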

Common options (hpc and log options)

Additional optional settings to configure hpc and log behaviour.

 ngs-cpu/fastq-hpc-tools/bin/fastq-hpc-tools [--cpu-num-threads] [--gpu-num-threads] [--log-file] [--log-level] [--v | --verbose]

 --cpu-num-threads, number of cpu threads launched with OpenMP in QC calculations, optional (default value: number of cpu cores minus two)

 --gpu-num-threads, number of threads per block launched in GPU calculations, optional (default value: depends on compute capability, 512 threads on 2.0)

 --log-file, path for log filename, optional (default value: no log file)

 --log-level, log level, optional (default level 2, info). Scale: 1 (debug), 2 (info), 3 (warn), 4 (error) and 5 (fatal)

 --v, synonym of --verbose, enables or disables the console log, optional (default true)

Web references

Downloads

  • Binary and sources: