Introduction

New high-throughput sequencers are able to produce data at an unprecedented scale, while sequencing costs are in free fall. Primary data processing, which often includes mapping short reads onto a reference genome, is computationally very expensive. Recently, a variety of programs, many of them implementing Burrows-Wheeler Transform (BWT), have been developed for such task. Despite the efficiency of BWT, the general view is that new approaches will be necessary to cope with the increasing flood of sequencing data.

HPG-BWT proposes two new approaches to BWT sequence mapping:

  • The first one takes advantage of CUDA GPGPU processors. We have implemented BTW on GPU, allowing alignments of up to one error. Also, we parallelised the input/output operations on the CPU and the GPU execution of the mapping process. Our experiments show that this heterogeneous and really cheap solution outperforms by 3x the conventional algorithms based only on CPU.
  • The second one is a CPU implementation of the inexact search with BWT. By using optimised branch and bound techniques for biological information the number of errors allowed is greatly increased. In the case of the human genome, we support 6 errors with no sensibility penalty and within a reasonable execution time and memory size. The search routine studies insertions, deletions and mismatches. Our solution outperforms by a factor of 8x other BWT-CPU solutions. Also, by reducing the sensibility in 5-10% the speed of the algorithm can be increased by a factor of 4x, allowing up to 7-10 errors analysis.

System Requirements

  • Hardware:
    • 64-bit x86-64 CPUs
    • 32GB of RAM for human genome index calculation (This will be reduced in the near future).
    • The GPU test program requires two graphic cards, a more complete version is included in the HPG pipeline.
    • The CPU RAM required for searching against the human genome with the CPU or GPU test programs depends on the suffix array compression ratio (S, R and O.desp), with compression ratio 32x it needs about 7GB. This compression does not affect the search algorithm but can affect the speed of seeded mapping when using the tool with an Smith-Watterman implementation.
    • GPUs with at least 3GB of RAM are needed for the human genome.
  • Software:
    • 64-bit Linux system

Installation and datasets

The x86_64 binaries and the source files are here.

The instructions on how to execute the binaries can be found here.

The BWT indexes and datasets employed to measure the performance of this tool are availabe here.

When compiling the tools do not forget to check in the makefile that the preprocessor variables -DVECTOR_O_64BIT_COMPRESSION are present, in order to enable 32/64 bits vector O compression.
The preprocess program needs to activate OpenMP with -fopenmp also.

Both preprocess and search must have been compiled with the same options in order to use the generated index with the search tool.

Command line options

preprocess

./preprocess reference_file output_directory ratio duplicate_reverse nucleotides

reference_file, input file, in FASTA format, containing the reference to be indexed (up to 6GB references are supported)

output_directory, existing directory where the index files will be stored

ratio, the compression ratio of suffix array and its inverse (S and R).

duplicate_reverse, if 1 concatenates de reverse strand genome and computes the double sized index to search reverse and strand at the same time (for the human genome check that you compile the code with the flag -DSA_64, at the moment this takes a lot of memory).

nucleotides, the nucleotides present in the reference, in order.

For example,

./preprocess genome.fa index/ 8 0 ACGT

will index genome.fa and store the output files in directory "index", vectors S and R will be 8x compressed, the duplicate reverse will not be appended and the nucleotides encoding in this reference will be ACGT - 0123.

The preprocess command employs the algorithm and code proposed in "Daisuke Okanohara, Kunihiko Sadakane. A Linear-Time Burrows-Wheeler Transform Using Induced Sorting. In Proc. of SPIRE, LNCS 5721, pp. 90-101, 2009", hosted at https://code.google.com/p/csalib/downloads/list

./inexact_search mappings_file index_dir output_file duplicate_reverse search_tree_size num_errors min_fragment nucleotides

mappings_file The file with the reads

index_dir The directory with the index files

output_file, file where the mappings found will be stored

duplicate_reverse, if 1 the index of the reference and the concatenated duplicate reverse was stored in index_dir.

search_tree_size, the search tree size used for the Breadth First Search exploration (an small value increases performance but decreases the sensitivity).

num_errors, the maximum number of errors allowed during the search.

min_fragment, the minimum fragment size used by the exploration bounding techniques (does not affect sensitivity but an small value decreases performance).

nucleotides, the nucleotides present in the reference, in order.

For example,

./inexact_search drosophila.fa drosophila_dbwt found 0 10000 3 17 ACGT

will map reads in drosophila.fa using the preprocessed index in folder drosophila_dbwt, which does not include the reverse. The reads found will be stored in the file "found", with a search_tree_size of 10000 elements, allowing up to 3 errors and a minimum segment size of 17 nucleotides (for the human genome 30 is a good value). The nucleotides encoding in this reference will be ACGT - 0123.

search_gpu

./search_gpu mappings_file index_dir output_file notfound_file num_errors nucleotides

mappings_file The file with the reads

index_dir The directory with the index files

output_file, file where the mappings found will be stored

notfound_file, file where the mappings not found will be stored

num_errors, the maximum number of errors allowed during the search (only 0 or 1).

nucleotides, the nucleotides present in the reference, in order.

For example,

./search_gpu drosophila.fa drosophila_dbwt found notfound 1 ACGT

will map reads in drosophila.fa using the preprocessed index in folder drosophila_dbwt. The reads found will be stored in the file "found", the reads not found in file "notfound", allowing up to 1 errors. The nucleotides encoding in this reference will be ACGT - 0123.

Currently this is only a test program and it needs a configuration with two nvidia graphic cards in order to work. It does not support duplicate reference indexes at the moment.

Notice also that in all the tools the encoding for the four bases should be indicated. For genomes with less bases please check the reverse strand functions in file "search/csafm.c", and please do not forget to call function init_replace_table when embedding the source code.