Vector Representations of Multi-Word Terms for Semantic Relatedness
Sam Henry, Clint Cuffy and Bridget T. McInnes, PhD
Introduction: Semantic similarity and relatedness measures quantify the degree to which two concepts are similar (e.g. liver-organ) or related (e.g. headache-aspirin). These metrics are critical to improving many natural language processing tasks involving retrieval and clustering of biomedical and clinical documents, and to developing biomedical terminologies and ontologies. Numerous methods exist to quantify these measures between distributional context vectors, but no direct comparison of these methods, or exploration of how to represent multi-word terms as context vectors, has been performed. We explore several multi-word aggregation methods for distributional context vectors on the task of semantic similarity and relatedness in the biomedical domain.
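As a minimal illustration (not the paper's implementation), relatedness between two distributional context vectors is commonly scored with cosine similarity; the vectors and values below are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical 3-dimensional context vectors for two related terms.
headache = [0.9, 0.1, 0.4]
aspirin = [0.8, 0.2, 0.5]
print(cosine(headache, aspirin))  # a value near 1.0 indicates high relatedness
```

In practice the vectors have hundreds of dimensions, but the scoring step is the same.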
Methods: We use two multi-word aggregation methods: summation and averaging of component word vectors. We also generate single vector representations for multi-word terms directly, creating multi-word vectors with our compoundify tool and concept vectors with the MetaMap tool. Along with these methods, we employ three vector dimensionality reduction techniques: singular value decomposition (SVD) and word embeddings using word2vec's continuous bag of words (CBOW) and skip-gram (SG) approaches. Explicit vectors of word-to-word, term-to-term, or concept-to-concept co-occurrences serve as a baseline. Lastly, we measure differences between vector dimensionalities of 100, 200, 500, 1000, 1500, and 3000.
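The summation and averaging aggregation methods can be sketched as follows; the vocabulary, vector values, and function names are illustrative assumptions, not the authors' code:

```python
import numpy as np

# Hypothetical pre-trained word vectors (values are illustrative only).
vectors = {
    "myocardial": np.array([0.2, 0.7, 0.1]),
    "infarction": np.array([0.4, 0.3, 0.5]),
}

def aggregate(term, method="mean"):
    """Build a single vector for a multi-word term by summing or
    averaging its component word vectors."""
    parts = np.vstack([vectors[w] for w in term.split()])
    return parts.sum(axis=0) if method == "sum" else parts.mean(axis=0)

print(aggregate("myocardial infarction", "sum"))   # element-wise sum
print(aggregate("myocardial infarction", "mean"))  # element-wise average
```

Both operations are cheap relative to retraining, which is why the abstract notes their low computational complexity.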
Results: We evaluate the methods on the UMNSRS and MiniMayoSRS reference standards. Results show that the lower-dimensional word2vec vectors (CBOW and SG), with a dimensionality of 200, outperform the explicit and SVD vectors; SVD performs best with a vector dimensionality of 1000. Among the multi-word aggregation methods, the choice was arbitrary: combining single words into multi-word terms before or after training showed little statistically significant difference across all dimensionality reduction techniques and vector dimensionalities.
Conclusions: In general, there is no increase in correlation for word2vec's SG over CBOW in the biomedical context. Relatively high accuracy with little computational complexity was achieved by using the sum or mean of context vectors to create a single vector representation for multi-word terms. Although the methods for generating distributional context vectors differ, each has strengths and weaknesses depending on the hyper-parameters used.
Natural Language Processing, NLP, Neural Network, Word2vec, Medline, Word Vectors, Word Embeddings
© The Author(s)