PubMed Ontology LEarning with Deep Learning (POLE-DL)

Distributional semantic models (DSMs) derive representations for words in such a way that words occurring in similar contexts will have similar representations. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are well-known DSMs. Although neural models are not new in DSMs, recent advances in artificial neural networks make it feasible to derive word representations from corpora of billions of words, hence the growing interest in Deep Learning. However, word embeddings (i.e. distributed word representations typically induced using neural language models) and traditional distributional semantics methods lack the precise formal definitions that can be found in an ontology. According to Maedche and Staab, an ontology can be described as “sets of concepts, relations, lexical entries, and links between these entities”.
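To make the distributional idea concrete, here is a minimal count-based sketch in plain Python. It is not the project's pipeline: the toy corpus, the window size of 2, and the function names `cooccurrence_vectors` and `cosine` are all assumptions chosen for illustration. Each word is represented by the counts of the words around it, and two words are compared by the cosine of their count vectors.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus; real experiments would use PubMed titles and abstracts.
toy_corpus = [
    "aspirin reduces fever in patients".split(),
    "ibuprofen reduces fever in patients".split(),
    "insulin regulates glucose in blood".split(),
]

vectors = cooccurrence_vectors(toy_corpus)
sim_drugs = cosine(vectors["aspirin"], vectors["ibuprofen"])  # shared contexts
sim_other = cosine(vectors["aspirin"], vectors["insulin"])    # no shared contexts
```

In this toy setting "aspirin" and "ibuprofen" occur in identical contexts and therefore score higher than "aspirin" and "insulin". Methods such as LSA start from counts of this kind and then reduce their dimensionality.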

The experiments page provides technical details and pointers to the software used. The work page contains an outline of the tutorial for the AI-2016 SGAI workshop and recent publications.

We investigate the application of distributional semantics models to facilitate unsupervised extraction of biomedical terms from unannotated corpora. Term extraction is the first step of an ontology learning process that aims at (semi-)automatic annotation of biomedical concepts and relations from a large-scale unannotated MEDLINE/PubMed corpus (titles and abstracts). We are experimenting with both traditional distributional semantics methods (e.g. LDA, LSA) and the neural language models CBOW (Continuous Bag-of-Words) and Skip-gram of Mikolov et al.
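The difference between the two neural models can be illustrated by how each one frames its prediction task. CBOW predicts a word from its surrounding context, while Skip-gram predicts the surrounding words from the word itself. The sketch below only builds the training examples; it does not train a network. The function names `skipgram_pairs` and `cbow_examples`, the window size, and the sample title are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram view: one (target, context_word) pair per neighbour within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW view: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


# Hypothetical title, tokenised on whitespace.
title = "insulin regulates blood glucose".split()

sg = skipgram_pairs(title, window=1)
cb = cbow_examples(title, window=1)

for target, context_word in sg:
    print(f"skip-gram: {target} -> {context_word}")
for context, target in cb:
    print(f"cbow: {context} -> {target}")
```

With a window of 1, the four-word title yields six Skip-gram pairs and four CBOW examples. Larger windows produce more pairs per word and capture broader context.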

Mining PubMed


From text to knowledge

PubMed is the largest biomedical resource. In June 2016, PubMed contained 26 million citations, with an average of 1.5 papers added per minute. Automatic identification of concepts and relations from biomedical publications can help curators, researchers, and clinicians keep up with the findings published in the scientific literature. This remains a challenging task in the realm of Big Data Analytics.