Building an Inverted Index at the DBMS Layer for Fast Full Text Search
Abstract
For accurate and fast full-text search, it is recommended to index the words in the documents. One way to do this is to use an Inverted Index, which records, in a structured form, the occurrences of words in a set of documents. To minimize the number of words stored in the index, a stemmer such as the Porter Stemmer can be used, so that only the root form of each word is kept. In this paper, an Inverted Index is constructed for documents stored in MongoDB and Oracle databases. Four different methods for constructing the Inverted Index are presented and compared to determine which has the best performance. Two of them are implemented in Python: one uses a single thread and the other uses the MapReduce algorithm. The other two use the frameworks and tools provided by the databases: the MapReduce framework for MongoDB and Pipelined Table Functions for Oracle.
Keywords
MapReduce; Inverted Index; Porter Stemmer; Oracle; MongoDB; Pipelined Table Functions
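
As an illustration of the central idea in the abstract, the following minimal single-threaded sketch builds an Inverted Index that maps each stemmed word to the documents in which it occurs, together with an occurrence count. It is not the implementation evaluated in this paper. The function names (crude_stem, build_inverted_index), the sample documents, and the index layout are assumptions made for illustration, and the suffix-stripping routine is a deliberately crude stand-in for the Porter Stemmer, not the Porter algorithm itself.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a real Porter Stemmer: it only strips a few
# common suffixes and does NOT implement the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(documents):
    """Map each stem to a {document id: occurrence count} dictionary."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Tokenize into lowercase alphabetic words.
        for token in re.findall(r"[a-z]+", text.lower()):
            stem = crude_stem(token)
            index[stem][doc_id] = index[stem].get(doc_id, 0) + 1
    return dict(index)


# Assumed sample data: document id -> document text.
documents = {
    1: "Indexing documents makes searches fast",
    2: "The index is searched for each word",
}

inverted_index = build_inverted_index(documents)
print(inverted_index["search"])
```

In this sketch, "searches" and "searched" reduce to the same stem, so a single index entry covers both documents. This is the reduction in stored words that stemming is intended to provide.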