Characterization of protein families, sequence patterns, and functional annotations in large data sets
Linköping University, Department of Physics, Chemistry and Biology, Bioinformatics. Linköping University, The Institute of Technology.
2008 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Bioinformatics involves storing, analyzing and making predictions on massive amounts of protein and nucleotide sequence data. The thesis consists of six papers and is focused on proteins. It describes the utilization of bioinformatics techniques to characterize protein families and to detect patterns in gene expression and in polypeptide occurrences. Two protein families were bioinformatically characterized - the membrane associated proteins in eicosanoid and glutathione metabolism (MAPEG) and the Tripartite motif (TRIM) protein families.

In the study of the MAPEG superfamily, the application of different bioinformatic methods made it possible to characterize many new members, doubling the size of the family. Furthermore, the MAPEG members were subdivided into families. Remarkably, in six families whose previously known members were predominantly mammalian, fish representatives were now also detected, which dates the origin of these families back to the Cambrian "species explosion", earlier than previously anticipated. Sequence comparisons made it possible to define diagnostic sequence patterns that can be used in genome annotations. When several MAPEG structures were later published, these patterns were confirmed to be part of the active sites.

In the TRIM study, the bioinformatic analyses made it possible to subdivide the proteins into three subtypes and to characterize a large number of members. In addition, the analyses showed crucial structural dependencies between the RING and the B-box domains of the TRIM member Ro52. The linker region between the two domains, denoted RBL, is known to be disease associated. Now, an amphipathic helix was found to be a characteristic feature of the RBL region, which also was used to divide the family into three subtypes.

The ontology annotation treebrowser (OAT) tool was developed to detect functional similarities or common concepts in long lists of proteins or genes, typically generated from proteomics or microarray experiments. OAT was the first annotation browser to include both Gene Ontology (GO) and Medical Subject Headings (MeSH) into the same framework. The complementarity of these two ontologies was demonstrated. OAT was used in the TRIM study to detect differences in functional annotations between the subtypes.

In the oligopeptide study, we investigated pentapeptide patterns that were over- or under-represented in the current de facto standard database of protein knowledge and a set of completed genomes, compared to what could be expected from amino acid compositions. We found three predominant categories of patterns: (i) patterns originating from frequently occurring families, e.g. respiratory chain-associated proteins and translation machinery proteins; (ii) proteins with structurally and/or functionally favored patterns; (iii) multicopy species-specific retrotransposons, only found in the genome set. Such patterns may influence amino acid residue based prediction algorithms. These findings in the oligopeptide study were utilized for development of a new method that detects translated introns in unverified protein predictions, which are available in great numbers due to the many completed and ongoing genome projects.
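As a rough illustration of the over- and under-representation idea, the sketch below compares observed pentapeptide counts with the counts expected if residues occurred independently at their overall frequencies. The function name `kmer_enrichment`, the independence model and the toy inputs are assumptions made for this example; the statistical procedure actually used in the thesis is not described in this record.

```python
from collections import Counter
from math import prod


def kmer_enrichment(sequences, k=5):
    """Return {kmer: (observed, expected)} for every k-mer seen in `sequences`.

    `expected` assumes residues are independent and drawn from the overall
    amino acid composition of the input set (an assumption of this sketch).
    """
    # Overall residue frequencies across the whole sequence set.
    residue_counts = Counter()
    for seq in sequences:
        residue_counts.update(seq)
    total_residues = sum(residue_counts.values())
    freq = {aa: n / total_residues for aa, n in residue_counts.items()}

    # Observed k-mer counts and the total number of k-residue windows.
    observed = Counter()
    windows = 0
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            observed[seq[i:i + k]] += 1
            windows += 1

    # Expected count = number of windows * product of residue frequencies.
    return {
        kmer: (n, windows * prod(freq[aa] for aa in kmer))
        for kmer, n in observed.items()
    }
```

Patterns whose observed count lies far above or below the expected count are candidates for over- or under-representation. Note that this sketch only reports patterns that occur at least once; finding patterns that are expected but absent would require enumerating all possible k-mers.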

A new comprehensive database of protein sequences from completed genomes, denoted genomeLKPG, was developed. This database was of central importance in the MAPEG, TRIM and oligopeptide studies. The new sequence database has also proven useful in several other studies.

Place, publisher, year, edition, pages
Institutionen för fysik, kemi och biologi, 2008. 85 p.
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1159
Keyword [en]
Bioinformatics, sequence analysis, patterns, protein families
National Category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:liu:diva-10565; ISBN: 978-91-85523-01-6 (print); OAI: oai:DiVA.org:liu-10565; DiVA: diva2:17301
Public defence
2008-02-15, Planck, Fysikhuset, Linköpings Universitet, Linköping, 10:15 (English)
Available from: 2008-01-28 Created: 2008-01-28 Last updated: 2010-01-13 Bibliographically approved
List of papers
1. Bioinformatic and enzymatic characterization of the MAPEG superfamily
2005 (English) In: The FEBS Journal, ISSN 1742-464X, E-ISSN 1742-4658, Vol. 272, no 7, 1688-1703 p. Article in journal (Refereed) Published
Abstract [en]

The membrane associated proteins in eicosanoid and glutathione metabolism (MAPEG) superfamily includes structurally related membrane proteins with diverse functions of widespread origin. A total of 136 proteins belonging to the MAPEG superfamily were found in database and genome screenings. The members were found in prokaryotes and eukaryotes, but not in any archaeal organism. Multiple sequence alignments and calculations of evolutionary trees revealed a clear subdivision of the eukaryotic MAPEG members, corresponding to the six families of microsomal glutathione transferases (MGST) 1, 2 and 3, leukotriene C4 synthase (LTC4), 5-lipoxygenase activating protein (FLAP), and prostaglandin E synthase. Prokaryotes contain at least two distinct potential ancestral subfamilies, of which one is unique, whereas the other most closely resembles enzymes that belong to the MGST2/FLAP/LTC4 synthase families. The insect members are most similar to MGST1/prostaglandin E synthase. With the new data available, we observe that fish enzymes are present in all six families, showing an early origin for MAPEG family differentiation. Thus, the evolutionary origins and relationships of the MAPEG superfamily can be defined, including distinct sequence patterns characteristic for each of the subfamilies. We have further investigated and functionally characterized representative gene products from Escherichia coli, Synechocystis sp., Arabidopsis thaliana and Drosophila melanogaster, and the fish liver enzyme, purified from pike (Esox lucius). Protein overexpression and enzyme activity analysis demonstrated that all proteins catalyzed the conjugation of 1-chloro-2,4-dinitrobenzene with reduced glutathione. The E. coli protein displayed glutathione transferase activity of 0.11 µmol·min−1·mg−1 in the membrane fraction from bacteria overexpressing the protein. Partial purification of the Synechocystis sp. protein yielded an enzyme of the expected molecular mass and an N-terminal amino acid sequence that was at least 50% pure, with a specific activity towards 1-chloro-2,4-dinitrobenzene of 11 µmol·min−1·mg−1. Yeast microsomes expressing the Arabidopsis enzyme showed an activity of 0.02 µmol·min−1·mg−1, whereas the Drosophila enzyme expressed in E. coli was highly active at 3.6 µmol·min−1·mg−1. The purified pike enzyme is the most active MGST described so far with a specific activity of 285 µmol·min−1·mg−1. Drosophila and pike enzymes also displayed glutathione peroxidase activity towards cumene hydroperoxide (0.4 and 2.2 µmol·min−1·mg−1, respectively). Glutathione transferase activity can thus be regarded as a common denominator for a majority of MAPEG members throughout the kingdoms of life whereas glutathione peroxidase activity occurs in representatives from the MGST1, 2 and 3 and PGES subfamilies.

Keyword
MAPEG, microsomal glutathione transferase, prostaglandin, leukotriene
National Category
Natural Sciences
Identifiers
urn:nbn:se:liu:diva-12886 (URN); 10.1111/j.1742-4658.2005.04596.x (DOI)
Available from: 2008-01-28 Created: 2008-01-28 Last updated: 2017-12-14 Bibliographically approved
2. The fellowship of the RING: The RING-B-box linker region (RBL) interacts with the RING in TRIM21/Ro52, contributes to an autoantigenic epitope in Sjögren's syndrome, and is an integral and conserved region in TRIM proteins
2008 (English) In: Journal of Molecular Biology, ISSN 0022-2836, E-ISSN 1089-8638, Vol. 377, no 2, 431-449 p. Article in journal (Refereed) Published
Abstract [en]

Ro52 is a major autoantigen that is targeted in the autoimmune disease Sjögren syndrome and belongs to the tripartite motif (TRIM) protein family. Disease-related antigenic epitopes are mainly found in the coiled-coil domain of Ro52, but one such epitope is located in the Zn2+-binding region, which comprises an N-terminal RING followed by a B-box, separated by a ∼40-residue linker peptide. In the present study, we extend the structural, biophysical, and immunological knowledge of this RING-B-box linker (RBL) by employing an array of methods. Our bioinformatic investigations show that the RBL sequence motif is unique to TRIM proteins and can be classified into three distinct subtypes. The RBL regions of all three subtypes are as conserved as their known flanking domains, and all are predicted to comprise an amphipathic helix. This helix formation is confirmed by circular dichroism spectroscopy and is dependent on the presence of the RING. Immunological studies show that the RBL is part of a conformation-dependent epitope, and its antigenicity is likewise dependent on a structured RING domain. Recombinant Ro52 RING-RBL exists as a monomer in vitro, and binding of two Zn2+ increases its stability. Regions stabilized by Zn2+ binding are identified by limited proteolysis and matrix-assisted laser desorption/ionization mass spectrometry. Furthermore, the residues of the RING and linker that interact with each other are identified by analysis of protection patterns, which, together with bioinformatic and biophysical data, enabled us to propose a structural model of the RING-RBL based on modeling and docking experiments. Sequence similarities and evolutionary sequence patterns suggest that the results obtained from Ro52 are extendable to the entire TRIM protein family.
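The abstract does not say how the amphipathic helix was predicted. One common way to quantify amphipathicity is a helical hydrophobic moment, sketched below under stated assumptions: an ideal alpha helix with 100 degrees of rotation per residue, the Kyte-Doolittle hydropathy scale swapped in as the hydrophobicity measure, and a made-up function name `hydrophobic_moment`. This is an illustration of the concept, not the method used in the paper.

```python
from math import cos, sin, radians, hypot

# Kyte-Doolittle hydropathy values (used here as an assumed scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(seq, angle_deg=100.0):
    """Mean helical hydrophobic moment per residue of `seq`.

    Each residue contributes a vector whose length is its hydropathy and
    whose direction is its angular position around the helix axis.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    x = sum(KYTE_DOOLITTLE[aa] * cos(radians(i * angle_deg))
            for i, aa in enumerate(seq))
    y = sum(KYTE_DOOLITTLE[aa] * sin(radians(i * angle_deg))
            for i, aa in enumerate(seq))
    return hypot(x, y) / len(seq)
```

A segment with hydrophobic residues clustered on one face of the helix gives a large moment, while a segment of uniform hydropathy gives a moment near zero.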

Keyword
Ro52; TRIM21; RING; linker; zinc binding
National Category
Natural Sciences
Identifiers
urn:nbn:se:liu:diva-12887 (URN); 10.1016/j.jmb.2008.01.005 (DOI)
Available from: 2008-01-28 Created: 2008-01-28 Last updated: 2017-12-14 Bibliographically approved
3. Ontology annotation treebrowser: an interactive tool where the complementarity of medical subject headings and gene ontology improves the interpretation of gene lists
2006 (English) In: Applied Bioinformatics, ISSN 1175-5636, Vol. 5, no 4, 225-236 p. Article in journal (Refereed) Published
Abstract [en]

Gene expression and proteomics analysis allow the investigation of thousands of biomolecules in parallel. This results in a long list of interesting genes or proteins and a list of annotation terms in the order of thousands. It is not a trivial task to understand such a gene list and it would require extensive efforts to bring together the overwhelming amounts of associated information from the literature and databases. Thus, it is evident that we need ways of condensing and filtering this information. An excellent way to represent knowledge is to use ontologies, where it is possible to group genes or terms with overlapping context, rather than studying one-dimensional lists of keywords. Therefore, we have built the ontology annotation treebrowser (OAT) to represent, condense, filter and summarise the knowledge associated with a list of genes or proteins.

The OAT system consists of two separate parts: a MySQL® database named OATdb, and a treebrowser engine that is implemented as a web interface. The system is implemented using Perl scripts on an Apache web server, and the gene, ontology and annotation data are stored in the relational MySQL® database. In OAT, we have harmonized the two ontologies of medical subject headings (MeSH) and gene ontology (GO), to enable us to use knowledge both from the literature and the annotation projects in the same tool. OAT includes multiple gene identifier sets, which are merged internally in the OAT database. We have also generated novel MeSH annotations by mapping accession numbers to MEDLINE entries.

The ontology browser OAT was created to facilitate the analysis of gene lists. It can be browsed dynamically, so that a scientist can interact with the data and govern the outcome. Test statistics show which branches are enriched. We also show that the two ontologies complement each other, with surprisingly low overlap, by mapping annotations to the Unified Medical Language System®.
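The abstract does not state which test statistic OAT uses to flag enriched branches. A widely used choice for this kind of question is the one-sided hypergeometric test, sketched below as an assumption for illustration; the function name and parameter names are invented for this example.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in a gene list.

    population: number of genes in the background set
    annotated:  background genes carrying the ontology term (or branch)
    sample:     size of the gene list under study
    hits:       genes in the list carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probabilities for hits, hits + 1, ..., upper.
    tail = sum(
        comb(annotated, x) * comb(population - annotated, sample - x)
        for x in range(hits, upper + 1)
    )
    return tail / total
```

A small p-value suggests that the term occurs in the gene list more often than chance would predict. When many terms are tested, as in an ontology with thousands of nodes, a multiple-testing correction would normally be applied on top of this.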

We have developed a novel interactive annotation browser that is the first to incorporate both MeSH and GO for improved interpretation of gene lists. With OAT, we illustrate the benefits of combining MeSH and GO for understanding gene lists. OAT is available as a public web service at: http://www.ifm.liu.se/bioinfo/oat

National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-12888 (URN)
Available from: 2008-01-28 Created: 2008-01-28 Last updated: 2009-11-07 Bibliographically approved
4. Characterization of oligopeptide patterns in large protein sets
2007 (English) In: BMC Genomics, ISSN 1471-2164, E-ISSN 1471-2164, Vol. 8, no 346, 1-15 p. Article in journal (Refereed) Published
Abstract [en]

Background: Recent sequencing projects and the growth of sequence data banks enable oligopeptide patterns to be characterized on a genome or kingdom level. Several studies have focused on kingdom or habitat classifications based on the abundance of short peptide patterns. There have also been efforts at local structural prediction based on short sequence motifs. Oligopeptide patterns undoubtedly carry valuable information content. Therefore, it is important to characterize these informational peptide patterns to shed light on possible new applications and the pitfalls implicit in neglecting bias in peptide patterns.

Results: We have studied four classes of pentapeptide patterns (designated POP, NEP, ORP and URP) in the kingdoms archaea, bacteria and eukaryotes. POP are highly abundant patterns statistically not expected to exist; NEP are patterns that do not exist but are statistically expected to; ORP are patterns unique to a kingdom; and URP are patterns excluded from a kingdom. We used two data sources: the de facto standard of protein knowledge Swiss-Prot, and a set of 386 completely sequenced genomes. For each class of peptides we looked at the 100 most extreme and found both known and unknown sequence features. Most of the known sequence motifs can be explained on the basis of the protein families from which they originate.
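The kingdom-level classes lend themselves to simple set operations. The sketch below derives, for each kingdom, the patterns found only there (in the spirit of ORP) and the patterns missing there but present elsewhere (in the spirit of URP). Reading "excluded from a kingdom" as "present in every other kingdom but absent from this one" is an assumption of this example, as are the function names and the toy data.

```python
def kmers(sequences, k=5):
    """Set of all k-residue substrings occurring in `sequences`."""
    return {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}


def unique_and_excluded(kingdom_kmers):
    """Map each kingdom name to (unique_patterns, excluded_patterns).

    kingdom_kmers: dict of kingdom name -> set of patterns seen in it.
    """
    result = {}
    for name, own in kingdom_kmers.items():
        others = [s for other, s in kingdom_kmers.items() if other != name]
        in_any_other = set().union(*others)
        in_all_others = set.intersection(*others) if others else set()
        # Unique: seen here and nowhere else.
        # Excluded: seen in every other kingdom but not here.
        result[name] = (own - in_any_other, in_all_others - own)
    return result


# Toy pattern sets, invented purely for illustration.
toy = {
    "archaea": {"AAAAA", "CCCCC"},
    "bacteria": {"CCCCC", "DDDDD"},
    "eukaryotes": {"CCCCC", "DDDDD", "EEEEE"},
}
classes = unique_and_excluded(toy)
```

In this toy input, `AAAAA` is unique to archaea and `DDDDD` is excluded from archaea, while `CCCCC` falls into neither class because every kingdom contains it.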

Conclusion: We find an inherent bias of certain oligopeptide patterns in naturally occurring proteins that cannot be explained solely on the basis of residue distribution in single proteins, kingdoms or databases. We see three predominant categories of patterns: (i) patterns widespread in a kingdom such as those originating from respiratory chain-associated proteins and translation machinery; (ii) proteins with structurally and/or functionally favored patterns, which have not yet been ascribed this role; (iii) multicopy species-specific retrotransposons, only found in the genome set. These categories will affect the accuracy of sequence pattern algorithms that rely mainly on amino acid residue usage. Methods presented in this paper may be used to discover targets for antibiotics, as we identify numerous examples of kingdom-specific antigens among our peptide classes. The methods may also be useful for detecting coding regions of genes.

National Category
Natural Sciences
Identifiers
urn:nbn:se:liu:diva-12889 (URN); 10.1186/1471-2164-8-346 (DOI)
Available from: 2008-01-28 Created: 2008-01-28 Last updated: 2017-12-14 Bibliographically approved
5. Using SVM and tripeptide patterns to detect translated introns
2007 (English) In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105. Article in journal (Refereed) Submitted
National Category
Natural Sciences
Identifiers
urn:nbn:se:liu:diva-12890 (URN)
Available from: 2008-01-28 Created: 2008-01-28 Last updated: 2017-12-14
6. GenomeLKPG: A comprehensive proteome sequence database for taxonomy studies
2008 (English) Article in journal (Refereed) Submitted
Abstract [en]

Background: In order to perform taxonomically unbiased analyses of protein relationships, there is a need for complete proteomes rather than databases with bias towards well characterized protein families. However, no comprehensive resource of completed proteomes is currently available. Instead, the proteomes need to be downloaded manually from different servers, all using different filename conventions and fasta header formats.

Results: We have developed a semi-automatic algorithm that retrieves complete proteomes from multiple FTP-servers and maps the species-specific sequence entries to the NCBI taxonomy. The compiled data is provided in a sequence database named genomeLKPG.

Conclusions: The usefulness of genomeLKPG is proven in several published taxonomical studies.
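To make the header-harmonization problem concrete, the sketch below rewrites FASTA headers from two header styles into a single `>taxid|accession` form. The two header patterns, the output format, the function name `normalize_fasta` and the toy accessions are all assumptions for illustration; the record does not describe the header conventions or the mapping rules actually used for genomeLKPG.

```python
import re

# Hypothetical header styles; real servers differ and need their own rules.
HEADER_PATTERNS = [
    # e.g. "gi|1|ref|ACC1.1| description"
    re.compile(r"^gi\|\d+\|\w+\|(?P<acc>[^|\s]+)\|"),
    # Fallback: take the first whitespace-delimited token as the accession.
    re.compile(r"^(?P<acc>\S+)"),
]


def normalize_fasta(text, taxid):
    """Rewrite every FASTA header in `text` as '>taxid|accession'.

    `taxid` is the NCBI taxonomy identifier of the source proteome,
    assumed here to be known per downloaded file.
    """
    out = []
    for line in text.splitlines():
        if not line.startswith(">"):
            out.append(line)  # sequence line, kept unchanged
            continue
        header = line[1:]
        for pattern in HEADER_PATTERNS:
            match = pattern.match(header)
            if match:
                out.append(f">{taxid}|{match.group('acc')}")
                break
        else:
            out.append(line)  # unrecognized header, left as-is for review
    return "\n".join(out)
```

For example, calling `normalize_fasta` on a human proteome file with `taxid=9606` tags every entry with the NCBI taxonomy identifier for Homo sapiens, so that entries from different servers can be grouped by species regardless of their original header layout.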

National Category
Natural Sciences
Identifiers
urn:nbn:se:liu:diva-52933 (URN)
Available from: 2010-01-13 Created: 2010-01-13 Last updated: 2010-01-13

Open Access in DiVA

fulltext(1061 kB)1862 downloads
File information
File name FULLTEXT01.pdfFile size 1061 kBChecksum MD5
d84fa871df10e1e2d1d9eb623026af5215a78da9bd095c006878ac86937d7d603a87f045
Type fulltextMimetype application/pdf
cover(2346 kB)87 downloads
File information
File name COVER01.pdfFile size 2346 kBChecksum MD5
38d871a79be6611b5c79e4b664b6a88e5eac1add034f12b13b10913b9fbb9370bb861c36
Type coverMimetype application/pdf

Authority records BETA

Bresell, Anders
