Coder Social home page Coder Social logo

inverted-index's Introduction

inverted-index

This project implements an inverted index, which is commonly used to provide full text search (e.g. in Elasticsearch), and is also used by databases like PostgreSQL (not only for search, but also features like the GIN index on JSON columns).

Indexing

A corpus (or collection of documents) is indexed in the following manner:

  • each document is tokenized (e.g. split on whitespace)
  • the tokens are normalized by case folding, stripping newlines etc. There are more advanced techniques such as stemming and lemmatization, which are not done in this code
  • overly commond words, or "stopwords" (e.g. a, as, it, the) are removed, as they are unlikely to do anything but add noise to the search results
  • Statistics about the following pieces of information are built up:
    • the number of times a term appears in the current document (IDF)
    • the length of the document
    • the number of times a term appears across all documents (TF)
    • the average length of all documents
    • the number of documents in the the collection

Represenation on disk

-Posting file: each row is a JSON encoded list of objects, which contain a document identifier, and the in document frequency (IDF)

  • Document map: this is stored as a map of document id: document length
  • Lexicon: this is stored as a nested object with the outmost key is a term, which points to an object containing the trm frequency (TF), and a pointer to the documents it appears in within the posting file
  • Metadata file: this is a simple JSON object containing information on the collection size, and the average document length (in words)

Searching

This repo uses a ranked retrieval function (BM25), which is intended to provide more relevant search results than a boolean search. This controls for factors such as document length (longer documents contain more terms, but they may not necessarily be more relevant than a shorter document also containing some/all of the query terms)

Approach:

  • for term in query:
    • check if term is a stopword, and skip it if so
    • check if term is in the lexicon (skip it if so), and the lookup the documents the term appears
    • for every document the current term appears in, calculate a BM25 score and store it in an accumulator
    • repeat this process for all remaining terms, and return the document ids containing the highest accumulated BM25 score

Usage

Compilation: go build

Indexing: ./inverted-index -action index -corpus corpus/big

Searching: ./inverted-index -action search -query new -query york

(last tested with Go 1.9.2 on Ubuntu 16.04)

inverted-index's People

Contributors

crolfe avatar

Stargazers

David Pennington avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.