Biopathway Projects

1. BioIE: Extraction of Biological Interactions

BioIE is a system that extracts biological interactions, such as protein-protein interactions, from the rapidly growing volume of biomedical literature in online resources such as MEDLINE, and annotates the extracted information with terms from biomedical ontologies such as Gene Ontology. It delivers both quality and diversity in the extracted information by examining the grammatical functions of the arguments of interactions with Combinatory Categorial Grammar and by accepting a wide variety of interaction keywords.

2. AutoGO: Automatic Construction of Gene and Protein Ontologies

Our research group is interested in automatically extending Gene Ontology (GO) with hierarchy and pathway information extracted from biomedical literature by information extraction systems such as BioIE. AutoGO is designed to validate the automatically extended resource and to integrate existing bio-resources in a consistent way.

3. BiopathwayBuilder: Visualization of Inferred Gene and Protein Networks

To gain a full understanding of a biological process, we must be able to augment the known molecular interactions with newly discovered knowledge. We believe that a visualization system is a suitable means for this task because, among other benefits, it gives an intuitive view of the necessary information. However, reported implementations face several problems: (1) the volume of information is not only enormous but also growing very fast, which makes scalability and elision essential properties; (2) the available information is not only incomplete but also unreliable; and (3) the typical information in the field, such as protein modification, is inherently complex, which makes it very difficult to produce a visualization that is intuitive enough for end users as well as field experts. We address all of these problems with a 3D visualization system.

4. BioNLQ: Natural Language Query for Heterogeneous Database Access

Our research group works on natural language database interfaces, in which natural language expressions are transformed into corresponding expressions in formal database languages with a Combinatory Categorial Grammar. In addition to the usual levels of information for natural languages, i.e., syntax, semantics, and discourse, we use an extra level of representation for formal language queries. The formal database languages we address include SQL, OOQL, and CPL. We are particularly interested in providing a unified natural language interface for heterogeneous database access, which is essential in the biomedical domain.

5. BioContrasts: Knowledge Discovery with Protein-Protein Contrasts

Contrasts are effective conceptual vehicles for learning processes such as correcting, highlighting, contrasting, and grouping central concepts. They are therefore useful for exploring the unknown and can provide valuable insights into, and explanations of, observed phenomena. For example, contrasts between proteins in terms of their biological interactions can reveal the similarities, divergences, and relations among the proteins, leading to further insights into their underlying functional nature. The BioContrasts Database stores protein-protein contrastive information. It currently contains 41,471 protein-protein contrasts, which are automatically extracted from MEDLINE abstracts. With the web interface provided on this homepage, users can search for contrastive information about proteins of interest by Swiss-Prot ID or by name. Users can also attempt knowledge discovery with protein-protein contrasts through several user interface templates.

6. Automatic Generation of Gene Summaries

An effective way to grasp new biological concepts is to start with their summaries. In particular, an informative summary gives readers sufficient information, and a coherent summary makes that information easier to understand and to remember. When automatically generated gene summaries achieve both informativeness and coherence of this kind, people in the biomedical domain will be able to use them actively to gain professional knowledge with much less effort.

7. E3DB: Database for Ubiquitin-Protein Ligases

The ubiquitin-proteasome system plays an important role in a number of diseases. Ubiquitin-protein ligases (E3s) are of particular interest because they determine the targeting specificity of the system. Substrate targeting specificity normally depends on the unique interaction among a particular combination of a ubiquitin-conjugating enzyme, an E3, and a target substrate. Thus, as more substrate proteins are found to be targeted by ubiquitination, more of the corresponding E3s are also being discovered in eukaryotes. To help researchers investigate E3 proteins regulated by ubiquitination, we provide an efficient method for identifying proteins involved in ubiquitin-protein ligase activity and for constructing a database that organizes E3-related information, including E3s, substrate proteins, associated proteins, related diseases, and publications. To collect E3-related protein data, we first generate 52 combinations of 13 underlying databases. We use these combinations to retrieve and integrate E3 and related data. From the resulting E3 data, we identify 917 distinct proteins, consisting of single-component E3s and subunits of multicomponent E3 complexes.

8. Automatic Construction of Frame Ontology with Varying Granularities

Information retrieval (IR) and information extraction (IE) techniques have matured enough for widespread everyday use, but they still fall short of fully satisfying users and need much further improvement. Making use of granularity is one of the open problems. To find information of the right granularity in response to a user's request, we must be able to assess the granularity of individual pieces of information, which can in turn be done by using the granularity of the predicate and argument(s) of each sentence. In this project, we developed methods for automatically constructing and managing various ontologies and for using them to provide granularity options to existing IR and IE techniques. The ontologies specify the granularity of predicates and arguments at different levels, so that it becomes possible to compare the granularity of different pieces of information and to zoom in on those of the target granularity. We chose biology and medicine as the domain for our case study for two reasons: many natural language processing techniques are already in heavy use there for information retrieval and extraction because of the explosively growing amount of information, and our research group has been working on customized natural language processing services in this domain with a number of fruitful results.
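
The idea of comparing and filtering by granularity can be illustrated with a minimal sketch. It assumes, purely for illustration, that granularity is measured as depth in an is-a hierarchy; the toy hierarchy and the function names below are hypothetical and are not taken from the project.

```python
# Toy is-a hierarchy for arguments (hypothetical, for illustration only).
# Each term maps to its more general parent; the root maps to None.
PARENT = {
    "protein": None,
    "enzyme": "protein",
    "kinase": "enzyme",
    "ligase": "enzyme",
}


def granularity_level(term: str) -> int:
    """Return the depth of a term in the hierarchy (0 = most general)."""
    level = 0
    while PARENT[term] is not None:
        term = PARENT[term]
        level += 1
    return level


def filter_by_granularity(terms, target_level: int):
    """Keep only the terms whose depth equals the target granularity level."""
    return [t for t in terms if granularity_level(t) == target_level]


# "Zooming in" on the most specific level of this toy hierarchy.
specific_terms = filter_by_granularity(["protein", "enzyme", "kinase", "ligase"], 2)
```

Depth is only one possible proxy for granularity; an ontology built by the methods described above may assign levels differently.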

9. Identifying the Presence and Certainty of Clinical Conditions from Clinical Reports

The specific details of clinical conditions, properly understood, play a critical role in clinical decision making about diagnoses and treatments. Such conditions are usually written up by medical experts in terse but plain English and stored as clinical reports. Nevertheless, it is demanding to determine promptly and accurately the nature and extent of the clinical conditions in such reports and to decide on the next course of action, primarily because of the immense volume of information that must be taken into account, but also because of the complexity of the natural language expressions used in the reports. For this reason, automatically extracting clinical conditions from clinical reports is often considered the first step in various applications of medical language processing (MLP). In this ongoing project, we address issues in extracting and coding clinical conditions with the help of ICD codes, which are used to maintain medical statistics in many countries, including the United States and the Republic of Korea. In particular, we are looking further into negation and uncertainty in natural language expressions, because clinical conditions are sometimes presented in negated or uncertain forms, and the importance of identifying these forms correctly is already well recognized.
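
To make the negation and uncertainty problem concrete, the following is a minimal sketch of a trigger-word approach. It is not the project's method: the trigger lists, the window size, and the function name are assumptions chosen only for illustration, and a real system would need a far larger, validated lexicon and proper scope handling.

```python
import re

# Hypothetical trigger phrases, for illustration only.
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for")
UNCERTAINTY_TRIGGERS = ("possible", "probable", "suspected", "cannot rule out")


def classify_condition(sentence: str, condition: str, window: int = 5) -> str:
    """Label a condition mention as 'negated', 'uncertain', or 'affirmed'.

    Returns 'not_mentioned' when the condition does not occur in the sentence.
    Only the `window` words immediately before the mention are inspected.
    """
    text = sentence.lower()
    idx = text.find(condition.lower())
    if idx < 0:
        return "not_mentioned"
    preceding = re.findall(r"[a-z]+", text[:idx])[-window:]
    # Pad with spaces so that triggers match whole words only.
    context = " " + " ".join(preceding) + " "
    if any(f" {trigger} " in context for trigger in NEGATION_TRIGGERS):
        return "negated"
    if any(f" {trigger} " in context for trigger in UNCERTAINTY_TRIGGERS):
        return "uncertain"
    return "affirmed"
```

For example, under these assumptions `classify_condition("No evidence of pneumonia.", "pneumonia")` yields `"negated"`, while `classify_condition("Cannot rule out pneumonia.", "pneumonia")` yields `"uncertain"`.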

10. CoMAGC: A Corpus with Multi-faceted Annotations of Gene-Cancer Relations

Text mining (TM) systems are invaluable for accessing, both efficiently and accurately, the large amount of information in biomedical literature about genes implicated in various cancers. Current TM systems target either gene-cancer relations or biological processes involving genes and cancers, but the former type produces information that is not comprehensive enough to explain how a gene affects a cancer, and the latter does not provide a concise summary of gene-cancer relations. To support the development of TM systems that specifically target gene-cancer relations but can still capture the complex information in biomedical sentences, we publish CoMAGC, a corpus with multi-faceted annotations of gene-cancer relations. In CoMAGC, each annotation is composed of four semantically orthogonal concepts that together express (1) how a gene changes, (2) how a cancer changes, and (3) the causality between the gene and the cancer. The multi-faceted annotations are shown to have high inter-annotator agreement. In addition, the annotations in CoMAGC allow us to infer the prospective roles of genes in cancers and to classify the genes into three classes according to the inferred roles. We encode the mapping between multi-faceted annotations and gene classes in 10 inference rules. The inference rules produce results with high accuracy as measured against human annotations. CoMAGC consists of 821 sentences on prostate, breast, and ovarian cancers. Currently, of the various types of gene changes, the corpus deals with changes in gene expression levels.

11. OncoSearch: Cancer Gene Search Engine with Literature Evidence

Abnormal gene expression in cancers is actively studied in order to identify genes that are involved in oncogenesis and to understand how such genes affect cancers. To give efficient access to the results of these studies reported in biomedical literature, the relevant information is collected with text-mining tools and made available through the Web. However, current Web tools are not yet tailored enough to allow queries that specify how a cancer changes along with the change in gene expression level, which is an important piece of information for understanding a gene's role in cancer progression or regression. OncoSearch is a Web-based engine that searches MEDLINE abstracts for sentences that mention gene expression changes in cancers, with queries that specify (i) whether a gene expression level is up-regulated or down-regulated, (ii) whether a certain type of cancer progresses or regresses along with such a gene expression change, and (iii) the expected role of the gene in the cancer. OncoSearch is available through http://oncosearch.biopathway.org.
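
The three query facets described above can be pictured as filters over annotated sentences. The sketch below is not the OncoSearch implementation or its API: the `Evidence` record, the `search` function, the field values, and the placeholder records are all hypothetical and serve only to illustrate the shape of such a query.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Evidence:
    """One annotated sentence (hypothetical schema, for illustration only)."""
    gene: str
    cancer: str
    expression_change: str  # "up" or "down"
    cancer_change: str      # "progression" or "regression"
    gene_role: str          # placeholder role label
    sentence: str


def search(
    records: Iterable[Evidence],
    *,
    gene: Optional[str] = None,
    cancer: Optional[str] = None,
    expression_change: Optional[str] = None,
    cancer_change: Optional[str] = None,
    gene_role: Optional[str] = None,
) -> List[Evidence]:
    """Return the records that match every facet that was specified."""
    criteria = {
        "gene": gene,
        "cancer": cancer,
        "expression_change": expression_change,
        "cancer_change": cancer_change,
        "gene_role": gene_role,
    }
    # Facets left as None are treated as "any value".
    active = {name: value for name, value in criteria.items() if value is not None}
    return [
        record
        for record in records
        if all(getattr(record, name) == value for name, value in active.items())
    ]


# Placeholder records; the genes, roles, and sentences are invented.
RECORDS = [
    Evidence("GENE_A", "prostate cancer", "up", "progression", "role_1", "placeholder sentence 1"),
    Evidence("GENE_B", "prostate cancer", "down", "progression", "role_2", "placeholder sentence 2"),
    Evidence("GENE_A", "breast cancer", "up", "regression", "role_2", "placeholder sentence 3"),
]

hits = search(RECORDS, cancer="prostate cancer", expression_change="up")
```

Leaving a facet unspecified matches any value, so a query can combine any subset of the three facets with a gene or cancer name.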
