Unsupervised hidden Markov model for automatic analysis of expressed sequence tags
Linköping University, Department of Physics, Chemistry and Biology, Bioinformatics.
2011 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

This thesis provides an in-depth analysis of expressed sequence tags (ESTs), which represent pieces of eukaryotic mRNA, using an unsupervised hidden Markov model (HMM). ESTs are short nucleotide sequences used primarily for rapid identification of new genes with potential coding regions (CDS). ESTs are produced by sequencing double-stranded cDNA, and the resulting ESTs are stored in digital form, usually in FASTA format. Because sequencing is often randomized and parts of an mRNA are non-coding, some ESTs will not represent CDS. It is desirable to remove these unwanted ESTs if the purpose is to identify genes associated with CDS. Applying a stochastic HMM allows the regions contained in an EST to be identified. Software such as ESTScan uses an HMM that is trained by supervised learning on annotated data. However, because annotated data are not always at hand, this thesis focuses on the ability to train an HMM by unsupervised learning on data containing ESTs both with and without CDS. The training data are not annotated, i.e. the regions that an EST consists of are unknown. In this thesis a new HMM is introduced whose parameters are chosen to be reasonably consistent with biologically important regions of an mRNA, such as the Kozak sequence, poly(A) signals and poly(A) tails, in order to guide training and decoding so that ESTs are assigned to the proper states of the HMM. The transition probabilities of the HMM have been adapted to represent the mean length and distribution of the different regions in mRNA. The HMM's specificity and sensitivity have been tested with BLAST, by running BLAST on each EST and comparing the BLAST results with the HMM predictions. A regression analysis shows that the length of the ESTs used for training the HMM has a significant effect: the longer the better.
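The link between transition probabilities and mean region length can be illustrated with a small sketch. It assumes a plain HMM in which the time spent in a state is geometrically distributed, so that a self-transition probability p gives a mean duration of 1 / (1 - p). The region names and mean lengths below are hypothetical placeholders, not values from the thesis, and since the keywords mention GHMM the thesis model may treat durations differently.

```python
def self_transition_from_mean_length(mean_length: float) -> float:
    """Return the self-transition probability whose geometric duration
    distribution has the given mean (in nucleotides)."""
    if mean_length < 1:
        raise ValueError("mean length must be at least 1")
    return 1.0 - 1.0 / mean_length


def expected_length(self_transition: float) -> float:
    """Return the mean duration implied by a self-transition probability."""
    if not 0.0 <= self_transition < 1.0:
        raise ValueError("self-transition probability must be in [0, 1)")
    return 1.0 / (1.0 - self_transition)


# Hypothetical mean region lengths in nucleotides (illustrative only).
mean_lengths = {"5_utr": 200.0, "cds": 1000.0, "3_utr": 500.0}

# One self-transition probability per region state.
self_transitions = {
    region: self_transition_from_mean_length(length)
    for region, length in mean_lengths.items()
}

if __name__ == "__main__":
    for region, p in self_transitions.items():
        print(f"{region}: self-transition {p:.4f}, mean length {expected_length(p):.1f}")
```

Under this assumption, longer regions correspond to self-transition probabilities closer to 1, which is one way a model can encode typical region lengths without annotated training data.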
The final results show that it is possible to train an HMM by unsupervised machine learning, but to be comparable to supervised machine learning as in ESTScan, the HMM needs further extension, such as frame-shift correction of ESTs by improving the HMM's ability to choose correctly positioned start codons or nucleotides. False positives are usually caused by incorrectly positioned start codons, which lead to CDS lengths that are too short. Since no frame-shift correction is implemented, short predicted CDS lengths are not accepted and hence are not counted as coding regions during prediction. However, when supervised models are lacking, an unsupervised HMM is a potential replacement with stable performance that can be adapted to any eukaryotic organism.
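The evaluation described in the abstract, comparing HMM predictions against BLAST results and discarding short predicted CDS, can be sketched as follows. This is a minimal illustration under stated assumptions: the BLAST outcome for each EST is reduced to a boolean reference label, the minimum CDS length `min_cds_length` is a hypothetical threshold, and the example data are invented for demonstration, not taken from the thesis.

```python
from typing import Sequence, Tuple


def call_coding(predicted_cds_length: int, min_cds_length: int) -> bool:
    """Count an EST as coding only if its predicted CDS is long enough.
    The threshold is a hypothetical parameter, not a value from the thesis."""
    return predicted_cds_length >= min_cds_length


def sensitivity_specificity(
    reference: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity); True means 'EST contains CDS'."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented example: reference labels (standing in for BLAST results)
# and predicted CDS lengths in nucleotides for five ESTs.
blast_labels = [True, True, False, False, True]
predicted_lengths = [300, 30, 0, 150, 240]

hmm_calls = [call_coding(n, min_cds_length=90) for n in predicted_lengths]
sens, spec = sensitivity_specificity(blast_labels, hmm_calls)

if __name__ == "__main__":
    print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this toy data the second EST has a true CDS but a predicted length below the threshold, so it is counted as a false negative; this mirrors how a length filter trades sensitivity for fewer spurious short predictions.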

Place, publisher, year, edition, pages
2011. 81 p.
Keyword [en]
Machine learning, Markov Model, Hidden Markov Model, Expressed sequence tag, EST, Baum-Welch, 1-best, Unsupervised, Supervised, GHMM
National Category
Bioinformatics and Systems Biology
URN: urn:nbn:se:liu:diva-69575
ISRN: LiTH-IFM-A-EX--11/2553--SE
OAI: diva2:429096
Available from: 2011-07-04. Created: 2011-07-03. Last updated: 2011-07-04. Bibliographically approved.
