Highly improved homopolymer aware nucleotide-protein alignments with 454 data
Linköping University, Department of Physics, Chemistry and Biology, Bioinformatics. Linköping University, The Institute of Technology.
2012 (English). In: BMC Bioinformatics, ISSN 1471-2105, Vol. 13, no. 230. Article in journal (Refereed), Published
Abstract [en]


Roche 454 sequencing is the leading sequencing technology for producing long-read, high-throughput sequence data. Unlike most methods, where sequencing errors translate to base uncertainties, 454 sequencing inaccuracies create nucleotide gaps. These gaps are particularly troublesome for translated search tools such as BLASTx, where they introduce frame-shifts and result in regions of decreased identity and/or terminated alignments, which affect further analysis.
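The frame-shift problem described above can be illustrated with a minimal sketch. The read and the `translate` helper below are hypothetical examples, not data or code from the paper; the codon table is the standard genetic code. Dropping a single base from a homopolymer run changes every downstream codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an uppercase ACGT string in frame 0, ignoring trailing bases."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical read containing the homopolymer run AAAAA.
true_read = "ATGGCCAAAAAGGTTCTGGAT"
# The same read with one A of the run missed (an undercall).
observed_read = "ATGGCCAAAAGGTTCTGGAT"

print(translate(true_read))      # MAKKVLD
print(translate(observed_read))  # MAKRFW
```

The first three residues still agree, but everything after the shortened run is translated in the wrong frame, which is the kind of region where a translated search loses identity or terminates the alignment.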


To address this issue, the Homopolymer Aware Cross Alignment Tool (HAXAT) was developed. HAXAT uses a novel dynamic programming algorithm for solving the optimal local alignment between a 454 nucleotide sequence and a protein sequence by allowing frame-shifts, guided by 454 flowpeak values. The algorithm is an efficient minimal extension of the Smith-Waterman-Gotoh algorithm that fits easily into other tools. Experiments using HAXAT demonstrate, through the introduction of 454-specific frame-shift penalties, significantly increased accuracy of alignments spanning homopolymer sequence errors. The full effect of the new parameters introduced with this novel alignment model is explored. Experimental results evaluating homopolymer inaccuracy through alignments show a two- to five-fold increase in Matthews Correlation Coefficient over previous algorithms for 454-derived data.
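The general idea of a frame-shift-aware local alignment can be sketched as follows. This is not the HAXAT algorithm: it uses linear rather than affine (Gotoh) gap costs, a flat match/mismatch score instead of a substitution matrix, and a hypothetical `homopolymer_penalties` function that stands in for flowpeak guidance by making frame-shifts cheap inside homopolymer runs. All names, penalties, and the example read are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def homopolymer_penalties(dna, base=12, in_run=4, min_run=3):
    """Hypothetical stand-in for flowpeak guidance: one frame-shift penalty
    per base, lowered inside homopolymer runs of at least min_run bases."""
    penalties = [base] * len(dna)
    start = 0
    for i in range(1, len(dna) + 1):
        if i == len(dna) or dna[i] != dna[start]:
            if i - start >= min_run:
                penalties[start:i] = [in_run] * (i - start)
            start = i
    return penalties


def local_align_score(dna, protein, fs_pen=None, match=5, mismatch=-4, gap=8):
    """Best local nucleotide-protein alignment score.

    H[i][j] is the best score of an alignment ending after i nucleotides
    and j amino acids. fs_pen=None disables frame-shift moves entirely.
    """
    n, m = len(dna), len(protein)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(m + 1):
            cands = [0]  # a local alignment may start anywhere
            if j >= 1:
                cands.append(H[i][j - 1] - gap)  # amino acid opposite a gap
            if i >= 3:
                cands.append(H[i - 3][j] - gap)  # codon opposite a gap
                if j >= 1:
                    aa = CODON_TABLE.get(dna[i - 3:i], "X")
                    score = match if aa == protein[j - 1] else mismatch
                    cands.append(H[i - 3][j - 1] + score)
            if fs_pen is not None:
                # Overcall: skip one extra base without consuming a residue.
                cands.append(H[i - 1][j] - fs_pen[i - 1])
                # Undercall: two bases stand in for a whole codon.
                if i >= 2 and j >= 1:
                    cands.append(H[i - 2][j - 1] - min(fs_pen[i - 2], fs_pen[i - 1]))
            H[i][j] = max(cands)
            best = max(best, H[i][j])
    return best


protein = "MAKKVLD"
observed = "ATGGCCAAAAGGTTCTGGAT"  # hypothetical read, one A missing from a run
plain = local_align_score(observed, protein)
aware = local_align_score(observed, protein, homopolymer_penalties(observed))
print(plain, aware)
```

With frame-shifts disabled, the best local alignment covers only the residues on one side of the shortened run. With the run-dependent penalties, a single cheap frame-shift lets the alignment span the whole read, so the score is higher. A fuller implementation would add affine gap states and derive the penalties from real flowpeak values.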


The increased accuracy provided by HAXAT not only results in improved homologue estimations but also provides uninterrupted reading frames, which greatly facilitate further analysis of protein space, for example phylogenetic analysis. The alignment tool is available at

Place, publisher, year, edition, pages
2012. Vol. 13, no 230
National Category
Natural Sciences
URN: urn:nbn:se:liu:diva-86192
DOI: 10.1186/1471-2105-13-230
ISI: 000314682700001
OAI: diva2:575541
Available from: 2012-12-10. Created: 2012-12-10. Last updated: 2013-10-07. Bibliographically approved.
In thesis
1. Bioinformatic methods for characterization of viral pathogens in metagenomic samples
2013 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Virus infections impose a huge disease burden on humanity and new viruses are continuously found. As most studies of viral disease are limited to the investigation of known viruses, it is important to characterize all circulating viruses. Thus, a broad and unselective exploration of the virus flora would be the most productive development of modern virology. Fueled by the reduction in sequencing costs and the unbiased nature of shotgun sequencing, viral metagenomics has rapidly become the strategy of choice for this exploration.

This thesis mainly focuses on improving key methods used in viral metagenomics as well as the complete viral characterization of two sets of samples using these methods. The major methods developed are an efficient automated analysis pipeline for metagenomics data and two novel, more accurate alignment algorithms for 454 sequencing data. The automated pipeline facilitates rapid, complete and effortless analysis of metagenomics samples, which in turn enables detection of potential pathogens, for instance in patient samples. The two new alignment algorithms cover comparisons against both nucleotide and protein databases, while retaining the underlying 454 data representation. Furthermore, a simulator for 454 data was developed in order to evaluate these methods. This simulator is currently the fastest and most complete simulator of 454 data, which enables further development of algorithms and methods. Finally, we have successfully used these methods to fully characterize a multitude of samples, including samples collected from children suffering from severe lower respiratory tract infections as well as patients diagnosed with chronic fatigue syndrome, both of which are presented in this thesis. In these studies, a complete viral characterization has revealed the presence of both expected and unexpected viral pathogens as well as many potential novel viruses.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2013. 65 p.
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1489
National Category
Natural Sciences
URN: urn:nbn:se:liu:diva-86194
ISBN: 978-91-7519-745-6
Public defence
2013-01-25, Planck, Fysikhuset, Campus Valla, Linköpings universitet, Linköping, 10:15 (English)
Available from: 2012-12-10. Created: 2012-12-10. Last updated: 2012-12-10. Bibliographically approved.

By author/editor
Lysholm, Fredrik