

Knowledge discovery is a critical function of infrastructure protection in the U.S. By analyzing key text documents, we can gain insight into the interwoven and interdependent infrastructure system of the U.S. and better understand the security of the system as a whole. Massive amounts of relevant data reside in text documents, which must be gathered and parsed before they can be analyzed on a large scale. Our algorithm collects web-based text embedded in HTML pages and analyzes it to measure how similar two documents are. It is intended as a component of a larger system being developed by the Idaho National Laboratory, which aims to perform this kind of large-scale analysis. By measuring the similarity of these HTML documents, we help the Idaho National Laboratory keep redundant data out of its database; without a check for near-duplicate content, repetitive entries may clog the system with unneeded information. We address this problem with a set of input interfaces that all feed the same comparison algorithm. The input can be a raw String, a text file, or a web URL. For HTML input, the BoilerPipe library extracts the useful text by stripping the document of its tags and applying a series of filters to keep the desired content. For text files, a simple Java Scanner reads the contents. The text is then lemmatized, stripped of punctuation, converted to lowercase, stemmed, and put into a term-document matrix. Finally, we use cosine similarity to produce a percentage representing how similar or dissimilar the two provided text documents are.
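The final comparison step described above can be illustrated with a minimal Java sketch. This is not the project's code: the class and method names (`CosineSimilaritySketch`, `termCounts`, `cosine`) are hypothetical, the BoilerPipe extraction is assumed to have already produced plain text, and lemmatization and stemming are omitted, so only lowercasing, punctuation stripping, term counting, and cosine similarity are shown.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names are illustrative, not taken from the project.
class CosineSimilaritySketch {

    // Lowercase the text, replace punctuation with spaces, and count each term.
    // Lemmatization and stemming are intentionally left out of this sketch.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9\\s]", " ");
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine similarity between the term-count vectors of two documents.
    // Each vector corresponds to one column of a two-document term-document matrix.
    static double cosine(String a, String b) {
        Map<String, Integer> va = termCounts(a);
        Map<String, Integer> vb = termCounts(b);

        // The shared vocabulary is the union of terms from both documents.
        Set<String> vocabulary = new HashSet<>(va.keySet());
        vocabulary.addAll(vb.keySet());

        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (String term : vocabulary) {
            int x = va.getOrDefault(term, 0);
            int y = vb.getOrDefault(term, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }

        // An empty document has no direction, so treat it as fully dissimilar.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double score = cosine("The grid supplies power.", "Power is supplied by the grid!");
        // Report the similarity as a percentage.
        System.out.printf("Similarity: %.1f%%%n", score * 100);
    }
}
```

Because stemming is skipped here, "supplies" and "supplied" count as different terms; in the pipeline described above they would likely collapse to one stem and raise the score.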

Publication Date



computer science, text analytics


Computer Engineering | Engineering

Faculty Advisor/Mentor

Milos Manic

VCU Capstone Design Expo Posters


© The Author(s)

Date of Submission

July 2015

Text Analytic System: Document Similarity