Subdivision of the MDR superfamily of medium-chain dehydrogenases/reductases through iterative hidden Markov model refinement
Linköping University, Department of Physics, Chemistry and Biology, Bioinformatics. Linköping University, The Institute of Technology.
Dept of Medical Biochemistry and Biophysics, Karolinska Institutet, S-171 77 Stockholm, Sweden.
Linköping University, Department of Physics, Chemistry and Biology, Bioinformatics. Linköping University, The Institute of Technology.
2010 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 11, 534. Article in journal (Refereed). Published
Abstract [en]

Background: The Medium-chain Dehydrogenases/Reductases (MDR) form a protein superfamily whose size and complexity defeat traditional means of subclassification: it currently has over 15000 members in the databases, the pairwise sequence identity is typically around 25%, there are members from all kingdoms of life, the chain lengths and the oligomericity vary, and the members take part in a multitude of biological processes. Profile hidden Markov models (HMMs) are available for detecting MDR superfamily members, but none for determining which MDR family each protein belongs to. The current torrential influx of new sequence data enables elucidation of more and more protein families, at an increasingly fine granularity. However, gathering good-quality training data usually requires manual attention by experts and has therefore been the rate-limiting step for expanding the number of available models.

Results: We have developed an automated algorithm for HMM refinement that produces stable and reliable models for protein families. This algorithm uses relationships found in data to generate confident seed sets. Using this algorithm we have produced HMMs for 86 distinct MDR families and 34 of their subfamilies, which can be used in automated annotation of new sequences. We find that MDR forms with 2 Zn2+ ions are in general dehydrogenases, while MDR forms with no Zn2+ are in general reductases. Furthermore, in Bacteria, MDRs without Zn2+ are more frequent than those with Zn2+, while the opposite is true for eukaryotic MDRs, indicating that Zn2+ was recruited into the MDR superfamily after the initial life kingdom separations. We have also developed a web site (http://mdr-enzymes.org) that provides textual and numeric search against various characterised MDR family properties, as well as sequence scan functions for reliable classification of novel MDR sequences.
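The general idea of refining a model until its membership stops changing can be sketched as follows. This is only an illustration under loud assumptions, not the authors' algorithm: a position-specific log-odds profile over ungapped, equal-length sequences stands in for a profile HMM, the seed is given by hand rather than generated from relationships in the data, and the names `build_profile`, `score`, `refine` and the score threshold are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(seqs, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background.

    Assumption: all sequences are ungapped and of equal length. A real
    profile HMM also models insertions and deletions.
    """
    total = len(seqs) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(len(seqs[0])):
        counts = Counter(s[col] for s in seqs)
        profile.append({
            a: math.log((counts[a] + pseudocount) / total * len(ALPHABET))
            for a in ALPHABET
        })
    return profile


def score(profile, seq):
    """Sum of per-column log-odds for one sequence."""
    return sum(column[res] for column, res in zip(profile, seq))


def refine(seed, pool, threshold, max_rounds=20):
    """Rebuild the model from its current members until the set is stable."""
    members = set(seed)
    for _ in range(max_rounds):
        profile = build_profile(sorted(members))
        new_members = {s for s in pool if score(profile, s) >= threshold}
        if not new_members or new_members == members:
            return members  # converged, or nothing scored above threshold
        members = new_members
    return members


# Toy data: two made-up groups of hexapeptides (hypothetical sequences).
pool = ["GAVDGA", "GAVDGG", "GAVEGA", "GAVDAA", "KLMNPQ", "KLMNPR", "KLMQPQ"]
family = refine(seed=["GAVDGA", "GAVDGG"], pool=pool, threshold=2.0)
print(sorted(family))
```

In this toy run the two seed sequences recruit their two close relatives in the first round, and the second round returns the same set, so the loop stops. Stability of the member set between rounds is the convergence criterion used here; whether the published method uses the same criterion is not stated in the abstract.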

Conclusion: Our method of refinement can be readily applied to create stable and reliable HMMs for both MDR and other protein families, and to confidently subdivide large and complex protein superfamilies. HMMs created using this algorithm correspond to evolutionary entities, making resolution of overlapping models straightforward. The implementation and support scripts for running the algorithm on computer clusters are available as open source software, and the database files underlying the web site are freely downloadable. The web site also makes our findings directly useful for non-bioinformaticians.

Place, publisher, year, edition, pages
2010. Vol. 11, 534.
National Category
Natural Sciences
Identifiers
URN: urn:nbn:se:liu:diva-61752; DOI: 10.1186/1471-2105-11-534; ISI: 000284038100001; OAI: oai:DiVA.org:liu-61752; DiVA: diva2:370748
Note
Original Publication: Joel Hedlund, Hans Jörnvall and Bengt Persson, Subdivision of the MDR superfamily of medium-chain dehydrogenases/reductases through iterative hidden Markov model refinement, 2010, BMC Bioinformatics, (11), 534. http://dx.doi.org/10.1186/1471-2105-11-534
Licensee: BioMed Central http://www.biomedcentral.com/
Available from: 2010-11-17 Created: 2010-11-17 Last updated: 2017-12-12
In thesis
1. Bioinformatic protein family characterisation
2010 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Biological research is necessary; not only to further our understanding of the processes of life, but also to combat disease, hunger and environmental damage.

Bioinformatics is the science of handling biological information. It entails integrating, structuring and analysing the ever-increasing amounts of available biological data. In practice it means using computers to analyse huge amounts of very complicated data taken from a field that is only partially understood, to see the hidden trends and connections, and to draw useful conclusions.

My thesis work has mainly concerned the study of protein families, which are groups of evolutionarily related proteins. I have analysed known protein families and created predictive models for them, and developed algorithms for defining new protein families. My principal techniques have been sequence alignments and hidden Markov models (HMM). To aid my work, I have written a lot of software, including MSAView, a visualiser for multiple sequence alignments (MSA).

In this thesis, the protein family of inorganic pyrophosphatases (H+-PPases) is studied, as well as the two protein superfamilies BRICHOS and MDR (medium-chain dehydrogenases/reductases). The H+-PPases are tightly membrane-bound, proton-pumping, dimeric enzymes with ~700-residue subunits that are found in bacteria, plants and eukaryotic parasites, and that use pyrophosphate as an alternative to ATP. The BRICHOS superfamily is only present in higher eukaryotes, but encompasses at least 8 protein families with a wide range of functions and disease associations, such as respiratory distress syndrome, dementia and cancer. The sequences are typically ~200 residues, with even shorter functional forms. Finally, MDR is a large and complex protein superfamily: it currently has over 16000 members, it is present in all kingdoms of life, the pairwise sequence identity is typically around 25 %, the chain lengths and the oligomericity vary, and the members take part in a multitude of biological processes. The member families include the classical liver alcohol dehydrogenase (ADH), quinone reductase, leukotriene B4 dehydrogenase, and many more forms. There are at least 25 human MDR genes excluding close homologues. There are HMMs available for detecting MDR superfamily membership, but none for the individual families.

For the H+-PPase family, we characterised member sequences found using an HMM of a conserved 57-residue region thought to form part of the active site. This region was found to contain two highly conserved nonapeptides, mainly consisting of the four “very early” residues Gly, Ala, Val and Asp, compatible with an ancient origin of the family. The two patterns have charged amino acid residues at positions 1, 5 and 9, are apparent binding sites for the substrate and parts of the active site, and were shown to be so specific for these enzymes that they can be used for automated annotation of new sequences.
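How a short conserved pattern of this kind could drive automated annotation can be sketched with a regular-expression scan. The pattern below is hypothetical and is not one of the two nonapeptides from the thesis: it only encodes the stated shape (nine residues, charged at positions 1, 5 and 9), and it assumes that the charged set is D, E, K, R and that every other position is one of Gly, Ala, Val or Asp, which is stricter than the "mainly consisting of" wording above. The function name and the test sequence are likewise made up.

```python
import re

CHARGED = "DEKR"  # assumption: residues treated as charged
EARLY = "GAVD"    # the four "very early" residues named in the text

# Nine-residue window: charged at positions 1, 5 and 9, early residues
# elsewhere. The lookahead lets overlapping windows be reported as well.
PATTERN = re.compile(
    rf"(?=([{CHARGED}][{EARLY}]{{3}}[{CHARGED}][{EARLY}]{{3}}[{CHARGED}]))"
)


def find_nonapeptides(seq):
    """Return (1-based start position, matched nonapeptide) for each hit."""
    return [(m.start() + 1, m.group(1)) for m in PATTERN.finditer(seq)]


# Made-up test sequence containing one window of the assumed shape.
hits = find_nonapeptides("MKDGAVEGAGKLL")
print(hits)
```

A scan like this is cheap enough to run over every new sequence, but its usefulness depends entirely on how specific the real patterns are; the specificity claimed in the text applies to the thesis patterns, not to this illustrative one.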

For the BRICHOS superfamily, we were able to find three previously unknown member families; group A, which may be ancestral to the ITM2 families (integral membrane protein 2); group B, which is a close relative to the gastrokine families, and group C, which appears to be a truly novel, disjoint BRICHOS family. The C-terminal region of group C has nearly identical sequences in all species ranging from fish to man and is seemingly unique to this family, indicating critical functional or structural properties.

For the MDR superfamily, we characterised and built stable HMMs for 17 member families using an empirical approach. From our experiences we were able to develop an algorithm for automated HMM refinement that uses relationships in data to produce stable and reliable classifiers, and we used it to produce HMMs for 86 distinct MDR families. We have made the program freely available and it can be readily applied to other protein families. We also developed a web site (http://mdr-enzymes.org) that makes our findings directly useful for non-bioinformaticians.

In our analyses of the 86 families, we found that MDR forms with 2 Zn2+ ions in general are dehydrogenases, while MDR forms with no Zn2+ in general are reductases. Furthermore, in Bacteria, MDRs without Zn2+ are more frequent than those with Zn2+, while the opposite is true for eukaryotic MDRs, indicating that Zn2+ has been recruited into the MDR superfamily after the initial life kingdom separations.

Multiple sequence alignments (MSA) play a central part in most work on protein families, and are integral to many bioinformatic methods. With the ongoing explosive increase of available sequence data, the scales of bioinformatic projects are growing, and efficient and human-friendly data visualisation becomes increasingly challenging, but is still essential for making new interpretations and discovering unexpected properties of the data.

Ideally, visualisation should be comprehensive and detailed, and never distract with irrelevant information. It needs to offer natural and responsive ways of exploring the data, as well as provide consistent views in order to facilitate comparisons between datasets. I therefore developed MSAView, which is a fast, modular, configurable and extensible package for analysing and visualising MSAs and sequence features. It has a graphical user interface and a powerful command line client, and can be imported as a package into any Python program. It has a plugin architecture and a user extendable preset library. It can integrate and display data from online sources and launch external viewers for showing additional details. It also includes two new conservation measures; alignment divergences, which indicate atypical residues or deletions, and sequence conformances, which highlight sequences that differ from their siblings at crucial positions.

In conclusion, this thesis details my work in analysing two protein superfamilies and one protein family using bioinformatic methods; developing an algorithm for automated generation of stable and reliable HMMs, as well as a new conservation measure, and a software platform for working with aligned sequences.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2010. 105 p.
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1343
National Category
Natural Sciences
Identifiers
urn:nbn:se:liu:diva-61754 (URN); 978-91-7393-297-4 (ISBN)
Public defence
2010-12-10, Planck, Fysikhuset, Campus Valla, Linköpings universitet, Linköping, 10:00
Available from: 2010-11-17 Created: 2010-11-17 Last updated: 2010-11-17. Bibliographically approved
