Unsupervised hidden Markov model for automatic analysis of expressed sequence tags
Linköping University, Department of Physics, Chemistry and Biology, Bioinformatics.
2011 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
Abstract [en]

This thesis provides an in-depth analysis of expressed sequence tags (ESTs), which represent pieces of eukaryotic mRNA, using an unsupervised hidden Markov model (HMM). ESTs are short nucleotide sequences used primarily for rapid identification of new genes with potential coding regions (CDS). ESTs are produced by sequencing double-stranded cDNA, and the resulting ESTs are stored in digital form, usually in FASTA format. Because sequencing is often randomized and parts of an mRNA are non-coding, some ESTs will not represent CDS. These unwanted ESTs should be removed if the purpose is to identify genes associated with CDS. Applying a stochastic HMM allows the regions contained in an EST to be identified. Software such as ESTScan uses an HMM that is trained by supervised learning on annotated data. However, because annotated data are not always at hand, this thesis focuses on the ability to train an HMM by unsupervised learning on data containing ESTs both with and without CDS. The training data are not annotated, i.e. the regions that an EST consists of are unknown. A new HMM is introduced whose parameters are chosen to be reasonably consistent with biologically important regions of an mRNA, such as the Kozak sequence, poly(A) signals and poly(A) tails, so that training and decoding assign ESTs to the proper states in the HMM. The transition probabilities of the HMM have been adapted to represent the mean length and distribution of the different regions in mRNA. The HMM's specificity and sensitivity were tested via BLAST, by blasting each EST and comparing the BLAST results with the HMM predictions. A regression analysis shows that the length of the ESTs used for training the HMM is significant: the longer, the better.
The final results show that it is possible to train an HMM with unsupervised machine learning, but to be comparable to supervised machine learning such as ESTScan, further extension of the HMM is necessary, such as frame-shift correction of ESTs by improving the HMM's ability to choose correctly positioned start codons or nucleotides. False positives are usually caused by incorrectly positioned start codons, which lead to CDS lengths that are too short. Since no frame-shift correction is implemented, short predicted CDS lengths are not accepted and are hence not counted as coding regions during prediction. However, when supervised models are lacking, the unsupervised HMM is a potential replacement with stable performance that can be adapted to any eukaryotic organism.
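
As an illustration of how transition probabilities can encode the mean length of a region, here is a minimal sketch. It assumes each region is modelled by a single state with a self-loop, which the abstract does not confirm. In that case the time spent in the state is geometrically distributed, with mean length 1 / (1 - p) for self-transition probability p. The function names and the region lengths are hypothetical and are not taken from the thesis.

```python
def self_transition_from_mean_length(mean_length: float) -> float:
    """Self-transition probability p whose geometric duration has the given mean.

    For a single state with a self-loop, mean duration = 1 / (1 - p),
    so p = 1 - 1 / mean_length.
    """
    if mean_length < 1:
        raise ValueError("mean length must be at least 1 nucleotide")
    return 1.0 - 1.0 / mean_length


def expected_length(p_stay: float) -> float:
    """Mean number of consecutive positions spent in a state with self-loop p_stay."""
    if not 0.0 <= p_stay < 1.0:
        raise ValueError("p_stay must be in [0, 1)")
    return 1.0 / (1.0 - p_stay)


# Hypothetical mean region lengths in nucleotides (illustrative values only).
mean_lengths = {"5'UTR": 150.0, "CDS": 1200.0, "3'UTR": 400.0}

# One self-transition probability per region; the remaining mass (1 - p)
# is the probability of leaving the region at each position.
stay_probabilities = {
    region: self_transition_from_mean_length(length)
    for region, length in mean_lengths.items()
}

if __name__ == "__main__":
    for region, p in stay_probabilities.items():
        print(f"{region}: p_stay = {p:.5f}, mean length = {expected_length(p):.1f}")
```

A single self-loop can only produce a geometric length distribution. Matching other length distributions would need several states per region, which this sketch does not attempt.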

Place, publisher, year, edition, pages
2011, p. 81
Keywords [en]
Machine learning, Markov Model, Hidden Markov Model, Expressed sequence tag, EST, Baum-Welch, 1-best, Unsupervised, Supervised, GHMM
Identifiers
URN: urn:nbn:se:liu:diva-69575
ISRN: LiTH-IFM-A-EX--11/2553--SE
OAI: oai:DiVA.org:liu-69575
DiVA, id: diva2:429096
Subject / course
Bioinformatics
Uppsök
Technology
Available from: 2011-07-04 Created: 2011-07-03 Last updated: 2011-07-04 Bibliographically approved

Open Access in DiVA

Master Thesis (1271 kB), 403 downloads
File information
File: FULLTEXT01.pdf. File size: 1271 kB. Checksum: SHA-512
6185072e3972fb5e2c6548c412d5f0084112c56d0fdcf4ced27c52b5d445fba435327c3b685d37d329cd2117c7764956e3e05cb10d828f2ac6d3174126c17229
Type: fulltext. Mimetype: application/pdf
