RNA-seq architecture

The RNA-seq architecture is organized as a pipeline of threads that pass batches of reads to one another through lists. The first thread reads batches of reads from the hard drive and stores them in a list. Each batch contains about 200 reads. The second thread takes the batches from this list, applies the Burrows–Wheeler transform (BWT) to map each read, and puts the batch, together with the BWT results, in the next list.

In the next step, another thread generates seeds for each read that was not mapped, applies the BWT to each seed without allowing errors, and stores the resulting alignments in the batch. When it finishes, the thread stores the batch in the next list. The following thread takes the alignments recorded in the previous step and generates CALs (candidate alignment locations), which are the fusion of the seed alignments of each read.

Next, another thread extends the CALs by about 30 nt and connects the extended CALs that are close to each other in the reference genome. It then applies Smith-Waterman to each read and its reference sequences and searches for all splice junctions. To do so, it finds all the large gaps in the Smith-Waterman query output and looks for the intron start and end marks (GT-AG, CT-AC). The splice junctions found are stored in a search tree (AVL), because a fast search structure is needed.

If the dataset being processed is paired-end or mate-pair, the next thread processes all the alignments recorded in the previous phases to find the pair alignments. The last thread of the pipeline writes all the recorded alignments to the hard drive. Finally, the splice junctions stored in the AVL tree are written to the hard drive as well.
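The thread-per-stage layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's actual code: `queue.Queue` stands in for the lists between threads, the `_DONE` sentinel is an invented way of signalling the end of the input, and `tag` is a hypothetical placeholder for the real work of each stage (BWT mapping, seeding, CAL generation, Smith-Waterman, and so on).

```python
import queue
import threading

BATCH_SIZE = 200   # "about 200 reads" per batch, as in the description
_DONE = object()   # assumed end-of-stream marker, not part of the original design


def read_batches(reads, out_q, batch_size=BATCH_SIZE):
    """First stage: group reads into batches and hand them to the next stage."""
    batch = []
    for read in reads:
        batch.append((read, ()))          # (read, trace of stages applied so far)
        if len(batch) == batch_size:
            out_q.put(batch)
            batch = []
    if batch:                             # last, possibly smaller, batch
        out_q.put(batch)
    out_q.put(_DONE)


def run_stage(work, in_q, out_q):
    """Middle stage: take a batch, process it, put it in the next queue."""
    while True:
        batch = in_q.get()
        if batch is _DONE:
            out_q.put(_DONE)              # propagate shutdown downstream
            return
        out_q.put(work(batch))


def tag(name):
    """Hypothetical stand-in for a real stage; it only records the stage name."""
    def work(batch):
        return [(read, trace + (name,)) for read, trace in batch]
    return work


def run_pipeline(reads, stages, batch_size=BATCH_SIZE):
    """Start one thread per stage and collect the batches leaving the last one."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=read_batches,
                                args=(reads, queues[0], batch_size))]
    for i, work in enumerate(stages):
        threads.append(threading.Thread(target=run_stage,
                                        args=(work, queues[i], queues[i + 1])))
    for t in threads:
        t.start()
    results = []
    while True:                           # plays the role of the writer stage
        batch = queues[-1].get()
        if batch is _DONE:
            break
        results.append(batch)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    stages = [tag("bwt"), tag("seeding"), tag("cal"), tag("smith_waterman")]
    out = run_pipeline(["ACGT"] * 450, stages)
    print([len(b) for b in out])          # [200, 200, 50]
```

Because each stage runs in a single thread and the queues are first-in, first-out, batches leave the pipeline in the order they were read. A stage can start on the next batch while later stages are still working on earlier ones, which is the main benefit of this layout.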

Pipeline.png (34.2 kB) Hector Martinez, 01/09/2013 09:08 am