booleanRetrieval

===========================================================================================================

  1. Unzip the given file.

  2. Make sure you have Python installed on your PC. I used version 2.7.3.

  3. Open a terminal.

  4. Execute the commands below:

     a) Run: python index.py
     b) It will ask you to enter the directory path containing the input text files. Paste the directory path and press Enter. Example - "doc"
     c) It will ask you to enter the path of the query file (the file where all the queries are written). Refer to the "query" file. Example - "query"
     d) It builds the index, then prints the time taken to build it, the dictionary, and the doc id to file name map.
     e) It reads every line of the query file, treats each line as a new query, and performs a search for it.
     f) It prints the search result along with the time taken.
     g) Have a look at the attached "invertedIndex.txt" file to view the created inverted index.
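Steps e) and f) can be pictured with the rough sketch below. It is not the actual code from index.py: the names parse_query, run_queries and search are hypothetical. It also assumes that terms are joined by the keyword AND (as in the queries listed under Performance) and that query terms are lower-cased to match the index.

```python
import time


def parse_query(line):
    """Split one line of the query file into lower-case terms.

    Assumes terms are joined by the keyword AND, for example
    "with AND without AND yemen".
    """
    terms = [term.strip().lower() for term in line.split(" AND ")]
    return [term for term in terms if term]


def run_queries(lines, search):
    """Run search(terms) for every non-blank line and time each call.

    `search` is a hypothetical callable standing in for the real lookup.
    Returns a list of (query, result, seconds) tuples.
    """
    results = []
    for line in lines:
        query = line.strip()
        if not query:
            continue  # skip blank lines in the query file
        start = time.time()
        result = search(parse_query(query))
        elapsed = time.time() - start
        results.append((query, result, elapsed))
    return results
```

With a real lookup function in place of search, run_queries(open("query"), search) would give one timed result per query line.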

                                         Algorithm
    

=============================================================================================================

A) Build Index:

  1) This method is invoked as soon as the user enters the directory path.
  2) It iterates through all the files in the directory.
  3) For each file, it reads every line and stores it in the lines list.
  4) It then iterates through the lines list and splits each line using the '\W+' delimiter.
  5) It removes all the empty words and iterates through wordsList.
  6) It converts each word to lower case and searches for the word in the dictionary keys.
  7) If the word is present in the dictionary, it searches for the document id.
  8) If the word already has an entry for the current doc id, the position of the word is appended to that posting list; otherwise a new entry is created with docId as key and the position as value.
  9) If the word is not present in the dictionary, a new entry is created in the map with the word as key and its docId and position as value.
  10) Example map -> "word" : {docId : [positionList], docId : [positionList]}
  11) Each file is given a unique doc id.
  12) The position variable is used to calculate the position of all words and is initialized to 1 every time a new file is opened for reading.
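The steps above can be sketched roughly as follows. This is a minimal illustration, not the original index.py. The names add_document, build_index and doc_id_to_filename are hypothetical, dict.setdefault stands in for the explicit "is the word / doc id already present" checks in steps 6 to 9, and assigning doc ids from 1 in sorted file-name order is an assumption.

```python
import os
import re


def add_document(index, doc_id, lines):
    """Add every word of one document to the positional index."""
    position = 1  # restarts at 1 for each new file
    for line in lines:
        # Split on runs of non-word characters and drop empty strings.
        words = [w for w in re.split(r"\W+", line) if w]
        for word in words:
            word = word.lower()
            # word -> {doc_id: [positions]}; create missing levels on demand.
            postings = index.setdefault(word, {})
            postings.setdefault(doc_id, []).append(position)
            position += 1


def build_index(directory):
    """Index every file in `directory`; return (index, doc_id_to_filename)."""
    index = {}
    doc_id_to_filename = {}
    # Sorted names give stable doc ids; starting at 1 is an assumption.
    for doc_id, name in enumerate(sorted(os.listdir(directory)), start=1):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        doc_id_to_filename[doc_id] = name
        with open(path) as handle:
            add_document(index, doc_id, handle)
    return index, doc_id_to_filename
```

For example, calling build_index("doc") would return the index together with the doc id to file name map for the files in the "doc" directory.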

B) Merge:

  1) The and_query method takes a list of query terms as input.
  2) If query_terms contains exactly one term, it gets the posting list for that term. If no result is found, an appropriate message is printed. Otherwise, it prints the respective file names by looking up each doc id of the posting list in the docIdToFileName map.
  3) If query_terms contains more than one term, it gets the posting lists for the first and second terms. It then calls mergePostingList (this function takes 2 lists as input parameters and returns the intersection of both in sorted form), passing the posting lists of term0 and term1. The merge result is then intersected with the posting list of the third term.
  4) This is repeated for all the subsequent query terms.
  5) After step 4, the result holds the merged list for all the query terms.
  6) The final merged list is used to look up the docIdToFileName map and print the respective file names.
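A rough sketch of the merge described above, assuming the index has the shape "word" : {docId : [positionList]}. The original mergePostingList is only described as returning a sorted intersection, so the two-pointer walk over sorted lists shown here is an assumption, as are the snake_case names and the choice to return an empty list instead of printing a message.

```python
def merge_posting_lists(left, right):
    """Return the sorted intersection of two sorted lists of doc ids."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            merged.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # left doc id is too small to match, advance left
        else:
            j += 1  # right doc id is too small to match, advance right
    return merged


def and_query(index, query_terms):
    """Return sorted doc ids containing every term (empty list if none)."""
    if not query_terms:
        return []
    # sorted() over the inner dict yields that term's doc ids in order.
    result = sorted(index.get(query_terms[0], {}))
    for term in query_terms[1:]:
        if not result:
            break  # intersection is already empty
        result = merge_posting_lists(result, sorted(index.get(term, {})))
    return result
```

The returned doc ids can then be mapped to file names through the docIdToFileName map, as in step 6.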

                                            Performance

=============================================================================================================

  1) Average time taken to build the index for the given set of text files --> ~1 to ~1.5 seconds
  2) Time taken by the query "with AND without AND yemen" --> ~0.000185 seconds
  3) Time taken by the query "with AND without AND yemen AND yemeni" --> ~0.0001540 seconds
