
g-nitin-1 / wikipedia-search-engine

This project is forked from ayushidalmia/wikipedia-search-engine.


This project builds a search engine over the 43 GB Wikipedia data dump of 2013. Search results are returned in real time.


# Wikipedia-Search-Engine

This repository consists of the mini project done as part of the course Information Retrieval and Extraction - Spring 2014. The course was instructed by Dr. Vasudeva Varma.

## Requirements

Python 2.6 or above

Python libraries:

  • PyStemmer (Porter stemmer)
  • A SAX XML parser (e.g. the standard library's xml.sax)

## Problem

The mini project involves building a search engine on the Wikipedia data dump without using any external index. For this project we use the 43 GB data dump of 2013. Search results are returned in real time. Multi-word and multi-field search on the Wikipedia corpus is implemented. A SAX parser is used to parse the XML corpus. After parsing, the following morphological operations are performed:

  • Casefolding: All text is converted to lowercase.
  • Tokenisation: Tokenisation is done using regular expressions.
  • Stop Word Removal: Stop words are removed by checking each token against a stop word list.
  • Stemming: Stemming is performed using the PyStemmer Python library.
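The preprocessing steps above can be sketched as follows. This is a minimal illustration: the stop word list here is a tiny stand-in, and the real code performs stemming with PyStemmer, which is omitted for brevity.

```python
import re

# Tiny stand-in stop word list; the real project reads a full list from disk.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "to", "on"}

def preprocess(text):
    # Casefolding: convert everything to lowercase.
    text = text.lower()
    # Tokenisation: extract runs of alphanumerics via a regular expression.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stop word removal: drop tokens found in the stop word list.
    # (Stemming with PyStemmer would follow here.)
    return [t for t in tokens if t not in STOP_WORDS]
```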

The index, consisting of stemmed words and their posting lists, is built for the corpus after performing the above operations, along with the title and a unique mapping for each document. The original document id of each Wikipedia page is ignored; a compact sequential id is assigned instead. This reduces the index size, since the original document ids in the corpus do not begin with single-digit numbers. Because the corpus will not fit into main memory, several partial index files are generated. These index files are then merged using a K-way merge, and field-based index files are created at the same time.
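A partial in-memory index of this kind can be sketched as below. This is illustrative only; the function name and data shapes are hypothetical, not the project's actual API.

```python
from collections import defaultdict

def build_partial_index(docs):
    # docs: iterable of (doc_id, token_list) pairs, where doc_id is the
    # compact sequential id assigned during parsing.
    index = defaultdict(list)
    for doc_id, tokens in docs:
        for token in sorted(set(tokens)):  # each doc appears once per word
            index[token].append(doc_id)
    # In the real indexer, this dict would be written out as a partial
    # index file once it grows too large for main memory.
    return dict(index)
```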

For example, index0.txt, index1.txt, and index2.txt are generated, and these files may contain the same word. Hence, a K-way merge is applied, and field-based files are generated along with their respective offsets. The field-based files are written using multi-threading, which allows multiple I/O operations to proceed simultaneously. The vocabulary file is also generated at this stage.
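The merge step can be sketched with the standard library's heapq.merge, which performs exactly a K-way merge over sorted inputs. This is a simplified in-memory version; the real code streams records from the index files on disk.

```python
import heapq

def k_way_merge(sorted_indexes):
    # sorted_indexes: list of partial indexes, each a sorted list of
    # (word, posting_list) pairs. heapq.merge does the K-way merge;
    # posting lists for the same word are concatenated.
    merged = []
    for word, postings in heapq.merge(*sorted_indexes):
        if merged and merged[-1][0] == word:
            merged[-1][1].extend(postings)
        else:
            merged.append((word, list(postings)))
    return merged
```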

Along with these, I have also stored the offsets of each of the field files. This reduces the search time to O(log m * log n), where m is the number of words in the vocabulary file and n is the number of words in the largest field file.
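The lookup over the stored offsets amounts to a binary search, sketched here with the standard bisect module. The data layout is hypothetical; the real code reads the vocabulary and offsets from the generated files.

```python
import bisect

def find_offset(word, vocab, offsets):
    # vocab: sorted list of vocabulary words; offsets[i] is the byte
    # offset of vocab[i]'s posting line in the field file.
    # Binary search over the vocabulary is the O(log m) half of the
    # O(log m * log n) search cost.
    i = bisect.bisect_left(vocab, word)
    if i < len(vocab) and vocab[i] == word:
        return offsets[i]
    return None  # word not in the vocabulary
```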

The src folder contains the following files:

### Main Functions

  • wikiIndexer.py: This script takes the corpus as input and creates the entire index in a field-separated manner. Along with the field files, it also creates their offsets, a map from each title to its document id along with its offset, and the vocabulary list.

To run the indexer:

```
python wikiIndexer.py ./sampleText ./outputFolderPath
```

  • search.py: This script takes a query as input and returns the top ten results from the Wikipedia corpus.

To run a search:

```
python search.py ./outputFolderPath
```

### Helper Functions

  • textProcessing.py: This helper performs all the text preprocessing. It acts as a helper for search.py and wikiIndexer.py.

  • fileHandler.py: This helper performs all the file handling. It acts as a helper for wikiIndexer.py.
