Manchester Centre for Integrative Systems Biology

Analysis of quantitative data using R in Taverna workflows: an example using microarray data


The statistical analysis of quantitative post-genomic data can present a number of technical challenges to the entry-level scientist. For example, using tools such as R and MATLAB to analyse numerical data requires knowledge of their programming languages in order to implement the underlying analysis algorithm. In addition, combining two or more of these tools can be problematic, as transferring data between them may require manual copying and pasting between different user interfaces. Furthermore, the intermediate data may require a transformation step if they are incompatible as input to the next service.

The Taverna workflow system can be used to construct pipelines that integrate computational tools for the statistical analysis of quantitative data. These workflows can automate the transfer of data between the tools as well as their invocation with appropriate parameters. The workflow shown below uses R to analyse microarray data, as an example of post-genomic data, stored in a maxdload2 database.


Microarray data analysis workflow - download

Note: This workflow requires access to R deployed as a service using RServe on a server or on your PC where your Taverna workbench is installed. See here for further information.

This example workflow analyses microarray data from Castrillo et al. (2006). A series of t-tests is performed on gene expression levels in case and control data sets in order to identify differentially-expressed genes. How these genes may relate to changes in biological processes in the cell is then investigated by identifying common terms associated with the genes from the Gene Ontology.

There are three parts to this workflow. The first part involves the retrieval of control and case microarray data from the maxdload2 database using a web service interface generated by maxdBrowse. Two beanshell scripts are used to allow the workflow user to select control and case data sets for analysis from Taverna.

The expression levels of each gene in the control and case data sets are then compared by t-tests. These t-tests are implemented in a script written in the R programming language, which is executed by R deployed as a service using RServe. The R service is invoked using the RShell processor from Taverna. The RShell processor is available from the service palette.
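As a rough illustration of the per-gene comparison, the following Python sketch computes a t statistic for each gene. It is not the workflow's R script: the Welch (unequal variance) form of the test, the gene names and the expression values are all assumptions made for the example, and only the statistic and degrees of freedom are computed, not the p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, case):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n1, n2 = len(control), len(case)
    v1, v2 = variance(control), variance(case)  # sample variances (n - 1)
    se2 = v1 / n1 + v2 / n2                     # squared standard error
    t = (mean(case) - mean(control)) / sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical expression levels: gene -> (control replicates, case replicates)
expression = {
    "geneA": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "geneB": ([2.0, 2.5, 3.0], [2.1, 2.4, 3.1]),
}

# One test per gene, as in the workflow's series of t-tests
results = {
    gene: welch_t(control, case)
    for gene, (control, case) in expression.items()
}
```

A gene such as the hypothetical geneA, whose case values are well separated from its control values, yields a large t statistic, whereas geneB yields one close to zero.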

The RShell processor in Taverna is configured with an R script which implements the t-test analysis. The processor makes each input port available to the script as a variable named after the port, and each output port reads the variable of the same name once the script has finished. The last value assigned to that variable is the one returned from the processor.
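The port-to-variable convention described above can be mimicked in a few lines of Python. This is only an analogy under stated assumptions, not Taverna's implementation: the function `run_script`, the port names and the toy script are hypothetical, and a real processor script would be written in R.

```python
def run_script(script, inputs, output_ports):
    """Run `script` with inputs bound as variables; return the named outputs."""
    namespace = dict(inputs)   # each input port becomes a variable
    exec(script, namespace)    # the script reads and assigns variables
    # each output port reads the variable that carries its name
    return {port: namespace[port] for port in output_ports}


# Hypothetical script: `result` is assigned twice, the last value wins
script = """
scaled = [v * factor for v in values]
result = sum(scaled)
result = result / len(scaled)
"""

outputs = run_script(
    script,
    inputs={"values": [1.0, 2.0, 3.0], "factor": 2.0},
    output_ports=["result"],
)
```

Here `result` is first set to the sum and then overwritten with the mean, so the output port named `result` carries the mean.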

If a remote R server is used, the RShell processor must be configured with the server's domain name, and with a username and password if these are required for access.

It is also possible to use Bioconductor packages for microarray data analysis in your R scripts if they have been installed on the R server accessed by the Taverna workflow. A version of the workflow described on this page that uses the LIMMA Bioconductor package to identify differentially-expressed genes can be downloaded here.

The list of significant genes which are differentially expressed between the selected control and case data sets is then analysed by the 'analyseGenesPDFOutput' task, which invokes the GoTermFinder tool. This tool identifies common terms from the biological process sub-ontology of the Gene Ontology which are associated with the list of genes.
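To give a feel for how a term can be judged "common" among a list of genes, the sketch below scores each term with a hypergeometric tail probability, a statistic widely used by term-enrichment tools. It is an illustration only and does not reproduce the GoTermFinder tool: the gene identifiers, the term labels and the annotation sets are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample)."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


# Hypothetical background of 10 genes, 3 of which were called significant
background = {f"g{i}" for i in range(1, 11)}
significant = {"g1", "g2", "g3"}

# Hypothetical term annotations: term -> genes annotated with that term
term_genes = {
    "term_a": {"g1", "g2", "g7", "g8"},
    "term_b": {"g3", "g9"},
}

p_values = {
    term: enrichment_p_value(
        population=len(background),
        annotated=len(genes),
        sample=len(significant),
        hits=len(genes & significant),
    )
    for term, genes in term_genes.items()
}
```

A smaller value means the term annotates more of the significant genes than would be expected if the genes had been drawn at random from the background.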


There are two outputs generated from this workflow. Firstly, a report in the form of a PDF file is generated showing the association of the genes with common terms from the Gene Ontology which were identified by the GoTermFinder tool. The PDF report can be viewed using the PDF renderer plugin by right-clicking on the PDF results file and selecting view as PDF.


Secondly, a text file of comma-separated values is produced, listing the significant genes, each annotated with its gene name, ORF number, a description of its function and its t-test p-value. To view the file in tabular format, right-click on the CSV file and select view as comma separated values.
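Outside the Taverna workbench, a file of this kind can also be read with a few lines of Python. The sketch below is a minimal example under stated assumptions: the presence of a header row, the column names (`gene_name`, `orf`, `description`, `p_value`) and the sample rows are hypothetical and would need to be adapted to the actual output file.

```python
import csv
import io

# Hypothetical file content; the real column layout may differ
sample_csv = """gene_name,orf,description,p_value
GENE1,ORF0001,hypothetical description one,0.0004
GENE2,ORF0002,hypothetical description two,0.03
GENE3,ORF0003,hypothetical description three,0.0100
"""


def load_significant_genes(handle):
    """Read gene rows and return them sorted by ascending p-value."""
    rows = list(csv.DictReader(handle))
    for row in rows:
        row["p_value"] = float(row["p_value"])  # parse for numeric sorting
    return sorted(rows, key=lambda row: row["p_value"])


genes = load_significant_genes(io.StringIO(sample_csv))
for gene in genes:
    print(gene["gene_name"], gene["orf"], gene["p_value"])
```

To read a real results file, an open file handle (for example from `open(path, newline="")`) can be passed in place of the `io.StringIO` object.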

The PDFRenderer and CSVRenderer plugins have to be installed in order to display PDF and CSV files in the Taverna workbench. Instructions for installing these renderer plugins are available here.



Results

The results from the carbon and nitrogen t-test comparisons described in the paper can be downloaded here.