On text mining to identify gene networks with a special reference to cardiovascular disease
Linköping University, The Department of Physics, Chemistry and Biology.
2005 (English). Independent thesis, Basic level (professional degree). Student thesis.
Alternative title
Identifiering av genetiska nätverk av betydelse för kärlförkalkning med hjälp av automatisk textsökning i Medline, en medicinsk litteraturdatabas (Swedish)
Abstract [en]

The rate at which articles are published grows exponentially, and the possibility of accessing texts in machine-readable formats is also increasing. The need for an automated system that gathers relevant information from text, that is, text mining, is thus growing.

The goal of this thesis is to find a biologically relevant gene network for atherosclerosis, the main cause of cardiovascular disease, by inspecting gene cooccurrences in abstracts from PubMed. In addition, gene nets for yeast were generated to evaluate the validity of text mining as a method.
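
The cooccurrence idea can be illustrated with a minimal Python sketch. It is not the thesis's actual pipeline: the function name `cooccurrence_edges`, the whitespace tokenisation and the exact-match gene lookup are all assumptions made for illustration, and the gene names are placeholders.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(abstracts, gene_names):
    """Count, for each unordered gene pair, the abstracts that mention both.

    Hypothetical helper: matching is naive (case-insensitive whole tokens),
    which is an assumption and not the method used in the thesis.
    """
    genes = {g.lower() for g in gene_names}
    counts = Counter()
    for text in abstracts:
        # Crude tokenisation: split on whitespace, strip common punctuation.
        tokens = {tok.strip(".,;:()[]").lower() for tok in text.split()}
        present = sorted(tokens & genes)
        # Every pair of genes seen in the same abstract gets one count.
        counts.update(combinations(present, 2))
    return counts


# Toy input with placeholder gene names, not real data.
toy_abstracts = [
    "geneA and geneB interact.",
    "geneB binds geneA in plaques; geneC too.",
    "No genes here.",
]
toy_edges = cooccurrence_edges(toy_abstracts, ["geneA", "geneB", "geneC"])
```

Each key in `toy_edges` is an edge of the cooccurrence graph and each value is the number of abstracts supporting it.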

The nets found were validated in several ways; for example, they were found to have the well-known power law link distribution. They were also compared to gene nets generated by other, often microbiological, methods from different sources. In addition to classic measurements of similarity such as overlap, precision, recall and f-score, a new way to measure similarity between nets is proposed and used. The method uses an urn approximation and measures, in standard deviations, the distance from the result expected when comparing two unrelated nets. The validity of this approximation is supported both analytically and with simulations, for both Erdős-Rényi nets and nets having a power law link distribution. The new method explains how results with very poor overlap, precision, recall and f-score can still be very far from random, and also how much overlap one could expect at random. The cutoff was also investigated.
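
A rough Python sketch of these measures follows. The abstract does not spell out the urn model, so the sketch assumes a hypergeometric one: the edges of one net are treated as draws without replacement from all possible node pairs, some of which are "marked" as edges of the other net. The function name `compare_nets` and the toy edge lists are hypothetical.

```python
import math


def compare_nets(predicted, reference, n_nodes):
    """Overlap, precision, recall, F-score and a z-score for two edge lists.

    Assumes both nets are non-empty, undirected and share the same n_nodes
    nodes. The z-score uses a hypergeometric urn model, which is an
    assumption about what the thesis calls the urn approximation.
    """
    p = {frozenset(e) for e in predicted}
    r = {frozenset(e) for e in reference}
    overlap = len(p & r)
    precision = overlap / len(p)
    recall = overlap / len(r)
    f_score = 2 * precision * recall / (precision + recall) if overlap else 0.0

    # Urn: n_pairs balls, len(r) of them marked, len(p) drawn without replacement.
    n_pairs = n_nodes * (n_nodes - 1) // 2
    mean = len(p) * len(r) / n_pairs
    var = mean * (1 - len(r) / n_pairs) * (n_pairs - len(p)) / (n_pairs - 1)
    z = (overlap - mean) / math.sqrt(var) if var > 0 else float("nan")
    return {"overlap": overlap, "precision": precision, "recall": recall,
            "f_score": f_score, "expected": mean, "variance": var, "z": z}


# Toy nets on 5 nodes (illustrative only).
toy = compare_nets([(1, 2), (2, 3), (3, 4)],
                   [(2, 1), (3, 4), (4, 5), (1, 5)],
                   n_nodes=5)
```

The `z` value is the observed overlap's distance from the random expectation in standard deviations, which is how a small overlap in absolute terms can still be far from random when the number of possible node pairs is large.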

Results are typically on the order of only 1% overlap, but at the remarkable distance of 100 standard deviations from what one could have expected at random. Of particular interest is that one can expect an overlap of only 2 edges, with a variance of 2, when comparing two trees with the same set of nodes. The use of a cutoff at one for cooccurrence graphs is discussed and motivated by, for example, the observation that it eliminates about 60-70% of the false positives but only 20-30% of the overlapping edges. This thesis shows that text mining of PubMed can be used to generate a biologically relevant gene subnet of the human gene net. A reasonable extension of this work is to combine the nets with gene expression data to find a more reliable gene net.
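
The tree figures quoted above can be checked numerically with a small Python sketch. It again assumes a hypergeometric urn model (an assumption, since the abstract does not state the model): a tree on n nodes has n - 1 edges out of n(n - 1)/2 possible node pairs. The function name `tree_overlap_moments` is hypothetical.

```python
def tree_overlap_moments(n_nodes):
    """Mean and variance of the random overlap between two trees on n_nodes.

    Hypergeometric urn assumption: draw n - 1 pairs without replacement from
    n(n - 1)/2 possible pairs, of which n - 1 are edges of the other tree.
    """
    edges = n_nodes - 1
    n_pairs = n_nodes * (n_nodes - 1) // 2
    frac = edges / n_pairs                 # equals 2 / n_nodes
    mean = edges * frac                    # equals 2 * (n - 1) / n
    var = mean * (1 - frac) * (n_pairs - edges) / (n_pairs - 1)
    return mean, var


tree_mean, tree_var = tree_overlap_moments(1000)
```

Under this assumption both the mean and the variance approach 2 as the number of nodes grows, in line with the values stated in the abstract.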

Place, publisher, year, edition, pages
Institutionen för fysik, kemi och biologi, 2005.
Keyword [en]
Bioinformatics, Atherosclerosis, Cardiovascular Disease, Cooccurrence, Data mining, Gene networks, Literature networks, Prior incorporation, Text mining
Keyword [sv]
Bioinformatik
National Category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:liu:diva-2810
ISRN: LITH-IFM-EX--05/1382--SE
OAI: oai:DiVA.org:liu-2810
DiVA: diva2:20152
Uppsok
fysik/kemi/matematik
Available from: 2005-03-22 Created: 2005-03-22

Open Access in DiVA

fulltext (512 kB), 1449 downloads
File information
File name: FULLTEXT01.pdf
File size: 512 kB
Checksum MD5: 10b3663a448e23057457bfd20447e68b2d3e13091d2d0875230dafe9c1c1cad191408a3f
Type: fulltext
Mimetype: application/pdf
