Functional similarity by literature mining

A similarity metric for biological processes through literature analysis

A web supplement to the work published in:
Chagoyen M, Carmona-Saez P, Gil C, Carazo JM and Pascual-Montano A.
A literature-based similarity metric for biological processes
BMC Bioinformatics 2006 7: 363 [Full article]

Motivation:

Recent analyses in systems biology pursue the discovery of functional modules within the cell. Recognition of modules in genome-wide information requires the integrative analysis of experimental data together with available functional schemes. Since the literature covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [1]. The scientific literature contains the information to bridge the gap between the abstract definitions of biological processes in current schemes and the interlinked nature of biological networks.

Supplementary data files

Datasets:
(GOP/PMIDs files)

SGD data [tab text]
Validation subset [tab text]

Results:

Pair-wise similarities validation subset [tab text]

Objective:

To construct pair-wise similarities between biological processes through the analysis of the biomedical literature.

Abstract:

In this work we explore the use of the scientific literature to establish potential relationships among biological processes. To this end we have used a document-based similarity method, latent semantic analysis [2], to compute pair-wise similarities between GO biological process categories.

The method has been applied to the biological processes annotated for the Saccharomyces cerevisiae genome (www.yeastgenome.org), using the SGD annotation file (01/25/2006) and the GO database (01/27/2006).

Method overview:

A broad process-document was constructed for each GO biological process by concatenating the titles and abstracts of its relevant bibliographic references, independently of genes and GO hierarchy relationships.
A vector space representation, namely a weighted term-frequency matrix (A), is built using the approach described in [3].
This term-process matrix (A) is mapped by means of a factorization technique, Singular Value Decomposition (SVD), to a lower-dimensional representation (A = USV').
Biological process similarities are computed in the new reduced space as the cosine between each pair of rows of VS.
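
The steps above can be illustrated with a minimal sketch in Python with NumPy. It is not the authors' implementation. The function names, the toy process-documents, and the choice of k = 2 retained dimensions are hypothetical. Raw term counts stand in for the weighting scheme of [3], which is not reproduced here.

```python
import numpy as np

def term_process_matrix(process_docs):
    """Build a term-by-process count matrix A from one text per process.

    Assumption: raw term counts are used here in place of the weighting
    scheme described in [3].
    """
    vocab = sorted({t for text in process_docs.values() for t in text.lower().split()})
    index = {t: i for i, t in enumerate(vocab)}
    names = list(process_docs)
    A = np.zeros((len(vocab), len(names)))
    for j, name in enumerate(names):
        for t in process_docs[name].lower().split():
            A[index[t], j] += 1.0
    return A, vocab, names

def process_similarities(A, k):
    """Cosine similarity between rows of V S, truncated to k dimensions."""
    # Thin SVD: A = U S V'; NumPy returns V' (here Vt) and the singular values s.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Each row of VS is one process in the reduced k-dimensional space.
    VS = Vt[:k].T * s[:k]
    norms = np.linalg.norm(VS, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for all-zero rows
    unit = VS / norms
    return unit @ unit.T

# Hypothetical toy process-documents (concatenated titles and abstracts).
toy_docs = {
    "process_a": "ribosome translation elongation ribosome trna",
    "process_b": "ribosome assembly rrna processing ribosome",
    "process_c": "dna replication origin polymerase helicase",
}

A, vocab, names = term_process_matrix(toy_docs)
S = process_similarities(A, k=2)
print(names)
print(np.round(S, 3))
```

In this toy case the first two process-documents share vocabulary and the third does not, so the first pair ends up with a higher similarity than either has with the third. With real data, the number of retained dimensions is a parameter that has to be chosen.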

References:

[1] Shatkay, H. and R. Feldman, Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 2003. 10(6): p. 821-855.

[2] Deerwester S, Dumais S, Landauer T, Furnas G and Beck L, Improving Information-Retrieval with Latent Semantic Indexing. P Asis Annu Meet, 1988. 25: p. 36-40.

[3] Chagoyen, M., et al., Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics, 2006. 7(1): p. 41. [PubMed] [Full article]