Discovering semantic features in the literature: a foundation for building functional associations

A web supplement to the work published in:
Chagoyen M, Carmona-Saez P, Shatkay H, Carazo JM, Pascual-Montano A.
Discovering semantic features in the literature: a foundation for building functional associations
BMC Bioinformatics. 2006 Jan 26;7(1):41. [PubMed] [Full article]


A complete characterization of biological processes at the genomic and the proteomic level requires the combination of numerous aspects, among which functional information is one of the most difficult to automatically acquire and interpret.

Supplementary data files

(gene/PMIDs files)


SGD8 data:

Reelin data:

The goal of this work is to discover biological topics in the biomedical literature that are relevant to sets of genes/proteins and that allow functional associations to be inferred.


Since the literature covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [1]. Data mining techniques that extract biological patterns from the biomedical literature for large lists of genes are therefore useful tools for interpreting experimental data and deriving new biological knowledge.

In this work we present a method for extracting common biological topics, in the form of semantic features, from the biomedical literature associated with sets of genes/proteins. This characterization of topics provides the means to associate genes with semantic profiles that indicate their functional role, and to establish functional relationships among genes.

Our approach applies non-negative matrix factorization (NMF), a machine-learning algorithm capable of identifying local patterns that exist in only a subset of the data [2]. NMF was originally applied to image and text analysis and more recently has been used to analyse gene expression data [3, 4], sequence data [5] and gene functional annotations [6].
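To make the factorization concrete, here is a minimal sketch of NMF using the multiplicative update rules of Lee and Seung for squared error. It is an illustration only, not the implementation used in the paper. The toy term-by-document matrix, the matrix orientation (rows as terms, columns as documents), the rank k = 2, the iteration count and all function names are assumptions made for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    """Sum of squared differences between V and the product W * H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x k) and H (k x m).

    Uses multiplicative updates, so W and H stay non-negative as long as
    they start positive. eps only guards against division by zero.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical toy data: term counts (rows) in four documents (columns).
terms = ["cell", "cycle", "mitosis", "lipid", "membrane", "sterol"]
v = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
]

w, h = nmf(v, k=2)

# Each column of W can be read as one feature: a weighting over terms.
for feature in range(2):
    ranked = sorted(zip(terms, (row[feature] for row in w)),
                    key=lambda pair: pair[1], reverse=True)
    print(feature, [term for term, _ in ranked[:3]])
```

Under these assumptions, each column of W plays the role of a semantic feature (a non-negative weighting over terms) and each column of H gives the weights of those features in one document. Because nothing is ever subtracted, a feature can describe just the part of the collection where its terms occur, which is the "local pattern" property mentioned above.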

We have applied our method to two datasets in order to test its performance:

  • The SGD8 dataset contains 575 Saccharomyces cerevisiae genes corresponding to eight different biological processes as annotated by the SGD GO slim mapper (namely 'cell cycle', 'cell wall organization and biogenesis', 'DNA metabolism', 'lipid metabolism', 'protein biosynthesis', 'response to stress', 'signal transduction' and 'transport'). Bibliographic annotations under the "Function/Process" category were obtained from the Saccharomyces Genome Database (SGD).
  • The Reelin dataset contains bibliographic annotations from Entrez Gene for the human and mouse genes selected by Homayouni et al. [7].


[1] Shatkay, H. and R. Feldman, Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 2003. 10(6): p. 821-855.

[2] Lee, D.D. and H.S. Seung, Learning the parts of objects by non-negative matrix factorization. Nature, 1999. 401(6755): p. 788-91.

[3] Kim, P.M. and B. Tidor, Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res, 2003. 13(7): p. 1706-18.

[4] Brunet, J.P., et al., Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A, 2004. 101(12): p. 4164-9.

[5] Heger, A. and L. Holm, Sensitive pattern discovery with 'fuzzy' alignments of distantly related proteins. Bioinformatics, 2003. 19 Suppl 1: p. i130-9.

[6] Pehkonen, P., et al., Theme discovery from gene lists for identification and viewing of multiple functional groups. BMC Bioinformatics, 2005. 6: p. 162.

[7] Homayouni, R., et al., Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics, 2005. 21(1): p. 104-15.