Filtering documentation

Introduction

Filtering is the second step of the Fastq analysis. It applies a filter based on quality thresholds to the fastq reads and classifies each read as valid or invalid. Only reads that meet all the configured thresholds are valid; the others are marked as invalid and written to a separate file. This allows the mapping step to be optimized by mapping only the reads that reach a certain level of quality. For paired-end data, both reads of a pair must meet the quality thresholds. The filtering thresholds that can be set are:
  • Max. and min. read length
  • Quality thresholds (minimum and maximum accepted quality). These thresholds are checked against:
    • Mean read quality
    • Median read quality
    • First nucleotides of the read (if the option is set). Reads that do not pass this validation can be trimmed or invalidated, depending on the command line options
    • Last nucleotides of the read (if the option is set). Reads that do not pass this validation can be trimmed or invalidated, depending on the command line options
  • Max. number of nucleotides per read whose quality is out of range
  • Max. N occurrences per read
    Note: if Quality Control is selected, it is performed on both the .valid and the .invalid files.
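
The checks listed above can be sketched in a few lines of Python. This is only an illustrative CPU version written for this page, not the tool's CUDA code: the `Thresholds` class, its field names and the functions `is_valid` and `pair_is_valid` are hypothetical, and the exact order and combination of checks in fastq-hpc-tools may differ.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Thresholds:
    # Field names are illustrative; they mirror the documented options
    # but are not the tool's internal names.
    min_read_length: int
    max_read_length: int
    min_quality: int
    max_quality: int
    max_nts_mismatch: int = 3  # nucleotides allowed outside the quality range
    max_n_per_read: int = 0    # N positions allowed in a read
    phred_offset: int = 33


def is_valid(seq: str, qual: str, t: Thresholds) -> bool:
    """Return True when a single read passes every threshold."""
    if not (t.min_read_length <= len(seq) <= t.max_read_length):
        return False
    if seq.upper().count("N") > t.max_n_per_read:
        return False
    scores = [ord(c) - t.phred_offset for c in qual]
    if not scores:  # assumption: a read without qualities is invalid
        return False

    def in_range(q) -> bool:
        return t.min_quality <= q <= t.max_quality

    # Mean and median read quality must both lie in the accepted range.
    if not in_range(mean(scores)) or not in_range(median(scores)):
        return False
    # Count individual nucleotides whose quality is out of range.
    out_of_range = sum(1 for q in scores if not in_range(q))
    return out_of_range <= t.max_nts_mismatch


def pair_is_valid(read1, read2, t: Thresholds) -> bool:
    """Paired-end rule: both reads of the pair must pass."""
    return is_valid(*read1, t) and is_valid(*read2, t)
```

For example, with `Thresholds(min_read_length=4, max_read_length=10, min_quality=20, max_quality=41)`, the read `("ACGT", "IIII")` passes, while the same sequence with qualities `"!!!!"` fails the mean quality check.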

The Filtering implementation takes advantage of GPU processors. The filtering calculations have been implemented on GPUs using CUDA (http://developer.NVidia.com/object/cuda.html). So that the stages can run in parallel, the filter process has been split into several steps (read, load to GPU, quality computations, copy results from GPU, generate read status and write files). The solution is implemented with pthreads. In particular, the Filtering implementation consists of four kinds of concurrent CPU threads:

  • the read threads, which are responsible for disk input operations and load blocks of reads; one thread is used per fastq file
  • the GPU threads, which copy fastq read information to the GPU, launch the Preprocessing kernel on each GPU and return the results to main memory; one thread is used per GPU
  • the results thread, a single thread that uses the GPU results to determine the status of each read
  • the writer thread, a single thread that writes valid and invalid reads to the corresponding files
Three size-limited lists are used by the threads:
  • The first to store fastq reads grouped in batches
  • The second to store GPU quality results
  • The third to store read status (left-trim, right-trim, both-side-trim)
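
The thread and list layout described above follows a producer/consumer pattern. The sketch below reproduces that pattern in Python with `queue.Queue` as the size-limited list. It is a simplified stand-in, not the pthreads/CUDA implementation: the function names are hypothetical, the "GPU" stage computes mean qualities on the CPU, batches are counted in reads rather than bytes, the read status is reduced to a valid/invalid flag, and reads are assumed to have non-empty quality strings.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream in every list


def reader(reads, batch_size, batches):
    """Read thread: group reads into batches and push them to the first list."""
    for i in range(0, len(reads), batch_size):
        batches.put(reads[i:i + batch_size])  # blocks while the list is full
    batches.put(SENTINEL)


def scorer(batches, results):
    """Stand-in for a GPU thread: compute the mean quality of each read."""
    while (batch := batches.get()) is not SENTINEL:
        means = [sum(ord(c) - 33 for c in q) / len(q) for _, q in batch]
        results.put((batch, means))
    results.put(SENTINEL)


def classifier(results, statuses, min_quality):
    """Results thread: turn quality results into a per-read status."""
    while (item := results.get()) is not SENTINEL:
        batch, means = item
        statuses.put([(read, m >= min_quality) for read, m in zip(batch, means)])
    statuses.put(SENTINEL)


def writer(statuses, valid, invalid):
    """Writer thread: route each read to the valid or invalid output."""
    while (item := statuses.get()) is not SENTINEL:
        for read, ok in item:
            (valid if ok else invalid).append(read)


def run_pipeline(reads, batch_size=2, list_size=10, min_quality=20):
    batches, results, statuses = (queue.Queue(maxsize=list_size) for _ in range(3))
    valid, invalid = [], []
    threads = [
        threading.Thread(target=reader, args=(reads, batch_size, batches)),
        threading.Thread(target=scorer, args=(batches, results)),
        threading.Thread(target=classifier, args=(results, statuses, min_quality)),
        threading.Thread(target=writer, args=(statuses, valid, invalid)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return valid, invalid
```

Bounding each list keeps a fast stage from filling memory while a slower stage catches up: `put` blocks when a list is full and `get` blocks when it is empty.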

System Requirements

  • Hardware:
    • 64-bit x86-64 CPUs
    • NVidia CUDA-enabled card with compute capability 2.0 or higher
  • Software:
    • 64-bit Linux system
    • NVidia driver supporting CUDA 4.0 or higher

Installation

  1. Download the tar file ngs-gpu.tgz from the Files menu
  2. In the Linux console, type:
    1. tar xvfz ngs-gpu.tgz
    2. cd ngs-gpu/fastq-hpc-tools/src
    3. make fastq-hpc-tools
The directory ngs-gpu/bin will then contain one executable file:
  • fastq-hpc-tools, the binary used for preprocessing, filtering and quality control

The rest of this page covers only the use of fastq-hpc-tools for filtering.

Command line options

fastq-hpc-tools

 ngs-cpu/fastq-hpc-tools/bin/fastq-hpc-tools --filter [--rfilter-nts] [--lfilter-nts] --min-read-length --max-read-length [--max-n-per-read] [--max-nts-mismatch] --outdir [--batch-size] [--batch-list-size] [--fastq | --fq]|[--fastq1 | --fq1 --fastq2| --fq2] [--phred-quality] [--min-quality] [--max-quality] [--conf] [--t | --time]

 --filter, flag that enables filtering, optional. Note: if no flag is set at all, only quality control is performed

 --rfilter-nts, number of right nucleotides (last nucleotides) whose mean quality is screened for validation, optional (default 0)

 --lfilter-nts, number of left nucleotides (first nucleotides) whose mean quality is screened for validation, optional (default 0)
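
As a rough illustration of what --lfilter-nts and --rfilter-nts screen, the sketch below computes the mean quality of the first and last nucleotides of a read and trims the ends that fail. The functions `screen_ends` and `trim` are hypothetical names, and the default quality range (20 to 41) is chosen only for the example; whether a failing read is trimmed or invalidated in the real tool depends on its command line options.

```python
def screen_ends(qual, lfilter_nts=0, rfilter_nts=0,
                min_quality=20, max_quality=41, phred_offset=33):
    """Return (left_fail, right_fail) for the first and last nucleotides."""
    scores = [ord(c) - phred_offset for c in qual]

    def fails(window):
        if not window:  # option not set (0 nucleotides): nothing to check
            return False
        window_mean = sum(window) / len(window)
        return not (min_quality <= window_mean <= max_quality)

    left = scores[:lfilter_nts] if lfilter_nts > 0 else []
    right = scores[max(0, len(scores) - rfilter_nts):] if rfilter_nts > 0 else []
    return fails(left), fails(right)


def trim(seq, qual, lfilter_nts, rfilter_nts, left_fail, right_fail):
    """Cut the screened ends that failed (left-trim, right-trim or both)."""
    start = lfilter_nts if left_fail else 0
    end = len(seq) - rfilter_nts if right_fail else len(seq)
    end = max(start, end)  # never produce a negative-length slice
    return seq[start:end], qual[start:end]
```

With the qualities `"!!IIII!!"` and two nucleotides screened on each side, both ends fail, and trimming the read `"AACCGGTT"` leaves `"CCGG"` with qualities `"IIII"`.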

 --min-read-length, minimum length allowed in a read, mandatory

 --max-read-length, maximum length allowed in a read, mandatory

 --max-n-per-read, maximum number of N positions allowed in a read, optional (default 0)

 --max-nts-mismatch, maximum number of nucleotides per read whose quality falls outside the accepted quality range, optional (default 3)

 --outdir, directory where the .valid files will be stored, mandatory. Note: for paired-end input, two files are generated

 --batch-size, size in bytes of fastq batches, optional (default 500000)

 --batch-list-size, maximum length of the list that stores fastq read batches (each batch has the size given by --batch-size), optional (default 10)

 --fastq, synonym of --fq, input file in FASTQ format for single-end data, mandatory for single end

 --fastq1, synonym of --fq1, input file in FASTQ format for paired-end data (pair 1), mandatory for paired end

 --fastq2, synonym of --fq2, input file in FASTQ format for paired-end data (pair 2), mandatory for paired end

 --phred-quality, phred quality scale to determine base quality, accepted values: 33, 64, sanger (=33), solexa (=64); optional (default 33)
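
The value of --phred-quality determines how each quality character is converted to a numeric score. A minimal sketch, taking the mapping above at face value (sanger = 33, solexa = 64) and using the hypothetical names `PHRED_OFFSETS` and `decode_qualities`:

```python
# Accepted values of --phred-quality and the ASCII offset each one implies.
PHRED_OFFSETS = {"33": 33, "64": 64, "sanger": 33, "solexa": 64}


def decode_qualities(qual, phred_quality="33"):
    """Convert a FASTQ quality string into a list of integer scores."""
    offset = PHRED_OFFSETS[str(phred_quality).lower()]
    return [ord(c) - offset for c in qual]
```

The character `I` (ASCII 73) decodes to 40 with offset 33, and `h` (ASCII 104) decodes to 40 with offset 64.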

 --conf, path to a file with launch options, optional. Each line of the file must be written exactly as on the command line, either parameter=value or a flag. If set, the options in the file override the command line options
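
The following sketch shows one way such a file could be read and merged with command line options. It rests on assumptions that this page does not confirm: that lines keep their leading dashes (for example `--outdir=/tmp/out`), that blank lines are ignored, and that flags carry no value. The function names `parse_conf` and `merge_options` are hypothetical.

```python
def parse_conf(text):
    """Parse 'parameter=value' and bare flag lines into a dict."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:  # assumption: blank lines are skipped
            continue
        name, sep, value = line.partition("=")
        # A line without '=' is treated as a flag and stored as True.
        options[name.strip()] = value.strip() if sep else True
    return options


def merge_options(command_line, conf):
    """File options override command line options, as documented for --conf."""
    return {**command_line, **conf}
```

For instance, if the command line sets `--outdir` to `/data` and the file contains `--outdir=/tmp/out`, the merged options use `/tmp/out`.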

 --t, synonym of --time, activates timing when set, optional (default no timing)

Common options (HPC and log options)

Additional, non-mandatory options to configure HPC and logging behaviour.

 ngs-cpu/fastq-hpc-tools/bin/fastq-hpc-tools [--cpu-num-threads] [--gpu-num-threads] [--log-file] [--log-level] [--v | --verbose]

 --cpu-num-threads, number of cpu threads launched with OpenMP in QC calculations (default value: number of cpu cores minus two)

 --gpu-num-threads, number of threads per block launched in GPU calculations, optional (default value: depends on compute capability, 512 threads on 2.0)

 --log-file, path for log filename, optional (default value: no log file)

 --log-level, log level, optional (default level 2, info). Scale: 1 (debug), 2 (info), 3 (warn), 4 (error) and 5 (fatal)

 --v, synonym of --verbose, enables or disables console logging, optional (default true)

Web references