Manchester Centre for Integrative Systems Biology

CV expand


CV expand is a text mining tool for automatic expansion of controlled vocabularies as a practical alternative to tailor-made named entity recognition methods.


Specification

We describe a text mining (TM) method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies (CVs) with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of CVs across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.

The relevant tasks in CV term acquisition include information retrieval, term recognition and term filtering. The figure below summarises the main steps of our TM approach to CV expansion. First, the information retrieval module gathers documents relevant to a given CV from the literature databases. Once a domain-specific corpus of documents has been assembled, it is searched for potential terms not accounted for in the initial CV. Automatic term recognition extracts terms as domain-specific lexical units, i.e. those that occur frequently in the corpus and bear special meaning in the domain. To reduce the number of terms not directly related to the given CV, we filter out terms of types that typically co-occur with CV terms but belong to sub-domains with more established CVs. These terms can be recognised with a dictionary-based approach that exploits the existing CVs.

Figure: TM workflow
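The three steps of the workflow can be sketched in Java, the language CV expand is written in. This is an illustration only and not the CV expand source code: the class CvExpansionSketch and its methods are hypothetical, retrieval is replaced by an in-memory match against seed terms instead of a query to a literature database, and term recognition is replaced by a crude word-pair frequency count rather than the method implemented in TerMine. Only the last step, dictionary-based filtering against the initial CV and a dictionary of sub-domain terms, follows the description directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of the workflow steps; not part of CVexpand.zip. */
final class CvExpansionSketch {

    /** Step 1 (stand-in): keep documents that mention at least one lower-case seed term. */
    static List<String> retrieve(List<String> literature, Set<String> seedTerms) {
        List<String> corpus = new ArrayList<>();
        for (String document : literature) {
            String lower = document.toLowerCase(Locale.ROOT);
            for (String seed : seedTerms) {
                if (lower.contains(seed)) {
                    corpus.add(document);
                    break;
                }
            }
        }
        return corpus;
    }

    /** Step 2 (stand-in): word pairs that occur at least minFrequency times in the corpus. */
    static SortedSet<String> recognise(List<String> corpus, int minFrequency) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String document : corpus) {
            String[] tokens = document.toLowerCase(Locale.ROOT).split("[^a-z]+");
            for (int i = 0; i + 1 < tokens.length; i++) {
                if (tokens[i].isEmpty()) {
                    continue; // split() yields an empty first token if the text starts with a non-letter
                }
                counts.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }
        SortedSet<String> candidates = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minFrequency) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    /** Step 3: drop terms already in the CV or listed in a sub-domain dictionary. */
    static SortedSet<String> filter(Set<String> candidates, Set<String> initialCv,
                                    Set<String> subDomainTerms) {
        SortedSet<String> newTerms = new TreeSet<>(candidates);
        newTerms.removeAll(initialCv);
        newTerms.removeAll(subDomainTerms);
        return newTerms;
    }

    public static void main(String[] args) {
        // Toy data invented for this sketch.
        List<String> literature = Arrays.asList(
            "Magic angle spinning improves resolution. Magic angle spinning requires a rotor.",
            "Samples were dissolved in deuterium oxide; deuterium oxide was used before magic angle spinning.",
            "Gene expression was measured by microarray.");
        Set<String> initialCv = new HashSet<>(Arrays.asList("magic angle"));
        Set<String> chemicals = new HashSet<>(Arrays.asList("deuterium oxide"));

        List<String> corpus = retrieve(literature, initialCv);
        SortedSet<String> candidates = recognise(corpus, 2);
        System.out.println(filter(candidates, initialCv, chemicals));
    }
}
```

With the toy input in main, the third document is not retrieved, and of the three frequent word pairs only "angle spinning" survives filtering: "magic angle" is already in the initial CV and "deuterium oxide" is found in the dictionary of sub-domain terms.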


Implementation

Project details

Project administrators  Dr Irena Spasić
Developers              1
Development status      4 - Beta
Intended audience       Developers, Science/Research
License                 Academic Free License (AFL) v3.0
Operating system        OS Independent (Written in an interpreted language)
Programming language    Java
Database environment    PostgreSQL
Topics                  Text mining, Bio-Informatics
User interface          Command-line

Prerequisites

To use CV expand, the tools listed below must be installed first. The versions given are those used during the development of CV expand. Later versions should generally work, but have not been tested.

Tool                          Version  URL
Java 2                        1.6.0    http://java.sun.com/javase/
Java WSDP                     2.0      http://java.sun.com/webservices/
PostgreSQL                    8.1      http://www.postgresql.org/
Entrez Utilities Web Service  1.5      http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html
TerMine                                http://www.nactem.ac.uk/software/termine/
UMLSKS API                    5.0      http://umlsks.nlm.nih.gov/kss/

Downloads

All relevant files have been archived into CVexpand.zip. Some of the files listed below are also linked individually for illustration; they do not need to be downloaded separately.

Description of the content:

Folder Description
case_studies The results of the two case studies described in the publications listed below. This folder is provided for illustration only and is not required to run the tool.
config Java classes used by configuration.java. These classes were generated automatically using JAXB (which is distributed with Java WSDP) based on the XML schema config.xsd given in the schema folder.
data Input and output data can be found in the corresponding sub-folders.
schema This folder contains config.xsd (XML schema, which describes how the tool can be configured) and tables.sql (SQL schema of the relational database, which should be installed locally).

File Description
Citation.java Data structure for exchanging the citation details.
config.xml The configuration used by the tool. This XML file conforms to the schema given in config.xsd. The annotations in the XML Schema document explain the configurable parameters.
configuration.java Java class used to retrieve the configuration parameters from config.xml.
CVexpand.java The main class.
Entrez.java Java class for using the Entrez Utilities, used here for information retrieval.
TerMine.java Java class for using TerMine, used here for automatic term recognition.
UMLS.java Java class for using UMLS knowledge sources, used here to retrieve the terms of the given semantic types for term filtering.
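
Because config.xml is expected to conform to config.xsd, it can be worth checking an edited configuration against the schema before running the tool. The following is a minimal sketch using the standard javax.xml.validation API. The class ConfigCheck is hypothetical and not part of the archive, and the file paths in main are an assumption about the unpacked archive layout (config.xml at the top level, config.xsd in the schema folder).

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper; not part of CVexpand.zip. */
final class ConfigCheck {

    /**
     * Returns true if the XML document validates against the XML Schema.
     * Returns false if validation fails or if the schema itself cannot be parsed.
     */
    static boolean conforms(Source xsd, Source xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(xsd);
            schema.newValidator().validate(xml);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed paths, relative to the folder into which the archive was unpacked.
        Source xsd = new StreamSource(new File("schema/config.xsd"));
        Source xml = new StreamSource(new File("config.xml"));
        System.out.println(conforms(xsd, xml)
            ? "config.xml conforms to config.xsd"
            : "config.xml does not conform to config.xsd");
    }
}
```

The check only reports whether the file is valid; the meaning of each parameter is documented in the annotations of config.xsd.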


Publications

Irena Spasić, Daniel Schober, Susanna-Assunta Sansone, Dietrich Rebholz-Schuhmann, Douglas Kell, Norman Paton and the MSI Ontology Working Group Members (2007) Facilitating the development of controlled vocabularies for metabolomics technologies with text mining. BMC Bioinformatics, submitted

Background: Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, it is time-consuming and non-trivial to construct these resources manually.

Results: We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the needs for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms. A total of 5,699 and 2,612 new terms were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections) are the major source of technology-specific terms as opposed to paper abstracts.

Conclusions: We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.

Irena Spasić, Daniel Schober, Susanna-Assunta Sansone, Dietrich Rebholz-Schuhmann, Douglas Kell, Norman Paton and the MSI Ontology Working Group Members (2007) Facilitating the development of controlled vocabularies for metabolomics with text mining, in ISMB/ECCB Special Interest Group (SIG) Meeting Program Materials, Bio-Ontologies SIG Workshop, Vienna, Austria, pp. 103-106
Bioinformatics applications heavily rely on controlled vocabularies and ontologies to consistently interpret and seamlessly integrate information scattered across disparate public resources. Experimental data from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. Here we describe the development of controlled vocabularies for metabolomics investigations. Manual term acquisition approaches are time-consuming, labour-intensive and error-prone, especially in a rapidly developing domain such as metabolomics, where new analytical techniques emerge regularly so that the domain experts are often compelled to use non-standardised terms. We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature.


People

Person                              Role
Dr Irena Spasić                     IS designed and implemented the text mining application for automated controlled vocabulary expansion.
Dr Daniel Schober                   DS maintains the controlled vocabularies and ontologies.
Dr Susanna-Assunta Sansone          SAS coordinates the activities of the MSI OWG group.
Dr Dietrich Rebholz-Schuhmann       DRS participated in the design and coordination of the text mining aspects of the study.
Prof. Douglas B. Kell               DBK provides his expertise in metabolomics to help evaluate the results.
Prof. Norman W. Paton               NP supervises the bioinformatics integration aspects.
MSI Ontology Working Group Members  MSI OWG members participate in provision of the data, discussions and evaluation.


Contact details

General contacts
Area                                 Person             Contact
text mining                          Dr Irena Spasić    I.Spasic-AT-manchester.ac.uk
controlled vocabularies, ontologies  Dr Daniel Schober  schober-AT-ebi.ac.uk
