Overview

HPG-SW is a modern implementation of the Smith-Waterman algorithm built on high-performance computing techniques. It uses the OpenMP parallel programming model and SSE instructions to take advantage of multi-core processors and the SIMD registers of current CPU cores.

HPG-SW offers a simple C API that applications can call to speed up their alignments. It also provides a CLI, a set of RESTful web services and a web interface for aligning both DNA and protein sequences.

Installation

  1. Download the tar file hpg-sw.tar.gz from the Files menu
  2. In the Linux console, type:
    1. tar xvfz hpg-sw.tar.gz
    2. make
After running make, the bin directory contains two executable files:
  • hpg-sw, which runs the Smith-Waterman algorithm to align multiple sequences, using SSE instructions and OpenMP directives to take advantage of multiple cores and SIMD registers.
  • hpg-sw-bench, which runs a benchmark comparing the HPG-SW version with the EMBOSS + OpenMP version.
The hpg-sw.tar.gz archive contains four substitution score matrix files:
  • for DNA,
    • dnafull
  • for protein,
    • blosum50
    • blosum62
    • blosum80
Two data files are provided in the datasets directory:
  • queries-50k.txt, file containing the queries: 50000 70nt sequences.
  • refs-50k.txt, file containing the references: 50000 480nt sequences.

Usage

hpg-sw

hpg-sw aligns the sequences from the input files by running the Smith-Waterman algorithm based on SSE instructions and OpenMP directives.

 hpg-sw -q query_filename -r ref_filename -o output_filename -p gap_open_penalty -e gap_extend_penalty -s substitution_score_matrix -n number_of_threads -b number_of_reads_per_batch

 --query-file, -q, input file name containing the queries to align

 --ref-file, -r, input file name containing the references to align

 --output-dir, -d, output directory where the results will be saved

 --output-file, -o, output file name where the alignments will be saved

 --gap-open-penalty, -p, penalty for opening a gap: from 0.0 to 100.0

 --gap-extend-penalty, -e, penalty for extending a gap: from 0.0 to 10.0

 --substitution-matrix-file, -s, substitution score matrix file name, for DNA: dnafull, for proteins: blosum50, blosum62, blosum80

 --num-threads, -n, number of threads

 --reads-per-batch, -b, number of reads per batch

Example of command line:

 >./bin/hpg-sw -q datasets/queries-50k.txt -r datasets/refs-50k.txt -o alignment.out -p 10 -e 0.05 -s ./dnafull -n 4 -b 2000 -d .

Output (contents of alignment.out):

>head alignment.out
Query: GGCTGTTTCTTCCCGGGTGTTCATAGGAACCACCACAAGGATTCAGCTCAGTTACTGTTTCAGCACACAA   Start at 0
       |||||||||||||||||||||||||||||||||||||||x|||||||||||||||||||||||||x||||
Ref. : GGCTGTTTCTTCCCGGGTGTTCATAGGAACCACCACAAGAATTCAGCTCAGTTACTGTTTCAGCAAACAA   Start at 199
Score: 332.00   Length: 70      Identity: 97.14%        Gaps: 0.00%

Query: AGTATTTTCTTTATCATTTAATTCATAACATTTGTTAATTTCCATTGTGATTTTTAAATATTTGACCCAG   Start at 0
       |||||||||||||||||||||||||x||||||||||||||||||||||||||||||||||||||||||||
Ref. : AGTATTTTCTTTATCATTTAATTCAAAACATTTGTTAATTTCCATTGTGATTTTTAAATATTTGACCCAG   Start at 201
Score: 341.00   Length: 70      Identity: 98.57%        Gaps: 0.00%

hpg-sw-bench

hpg-sw-bench runs the hpg-sw binary and the Smith-Waterman algorithm version from EMBOSS, displays the resulting alignment times and calculates the speed-up. Only DNA sequences are allowed and the DNAFULL substitution score matrix is used.

 hpg-sw-bench -q query_filename -r ref_filename -o output_filename -v out_emboss_filename -p gap_open_penalty -e gap_extend_penalty -s substitution_score_matrix -n number_of_threads -b number_of_reads_per_batch

 --query-file, -q, input file name containing the queries to align

 --ref-file, -r, input file name containing the references to align

 --output-dir, -d, output directory where the results will be saved

 --output-sse-file, -o, output file name where the HPG-SW alignments will be saved

 --out-emboss-file, -v, output file name where the EMBOSS alignments will be saved

 --gap-open-penalty, -p, penalty for opening a gap: from 0.0 to 100.0

 --gap-extend-penalty, -e, penalty for extending a gap: from 0.0 to 10.0

 --num-threads, -n, number of threads

 --reads-per-batch, -b, number of reads per batch

Example of command line:

>./bin/hpg-sw-bench -q datasets/queries-50k.txt -r datasets/refs-50k.txt -o sse.out -v emboss.out -p 10 -e 0.05 -n 4 -b 2048 -d .

Output:

query-file = datasets/queries-50k.txt
ref-file = datasets/refs-50k.txt
output-dir = .
output-sse-file = sse.out
output-emboss-file = emboss.out
gap-open-penalty = 10.000000
gap-extend-penalty = 0.050000
num-threads = 4
reads-per-batch = 2048
out SSE filename (full path) = ./sse.out
out EMBOSS filename (full path) = ./emboss.out

SSE done (50000 reads)
SSE + OpenMP
        Calculating matrix:
                Thread  0:      3.63423 s
                Thread  1:      3.62769 s
                Thread  2:      3.64567 s
                Thread  3:      3.67055 s
        Tracking back:
                Thread  0:      0.35023 s
                Thread  1:      0.36094 s
                Thread  2:      0.36224 s
                Thread  3:      0.35372 s
        Total:
                Max. time:      4.02426 s

EMBOSS done (50000 reads)
EMBOSS + OpenMP
        Calculating matrix:
                Thread  0:      10.46220 s
                Thread  1:      10.21471 s
                Thread  2:      10.55586 s
                Thread  3:      9.94035 s
        Tracking back:
                Thread  0:      4.15750 s
                Thread  1:      4.49104 s
                Thread  2:      4.53458 s
                Thread  3:      4.36875 s
        Total:
                Max. time:      15.09043 s

Speed-up (EMBOS vs SSE)
        Total             : 3.75