
screddy1313 / phrase_query_indexing


In this project we build phrase query retrieval over the 20_newsgroups dataset.

License: GNU General Public License v3.0

Jupyter Notebook 100.00%
phrase-query positional-indexing positional-posting-list nltk-python information-retrieval intermediate-projects natural-language-processing

phrase_query_indexing

In this project we build phrase query retrieval over the 20_newsgroups dataset.

Getting Started

Before running the notebook, download the 20_newsgroups dataset, which is freely available online, and place it in the data folder. Then run the notebook cell by cell.
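
As a rough illustration of reading the dataset once it is in place, the sketch below loads every file from a list of newsgroup folders. The function name `load_documents` and the directory layout in the example call are assumptions made for illustration; they are not taken from the notebook.

```python
from pathlib import Path


def load_documents(data_dir, groups):
    """Return {"group/filename": raw_text} for every file in the given folders.

    Hypothetical helper: the name and the returned key format are assumptions.
    """
    docs = {}
    for group in groups:
        folder = Path(data_dir) / group
        for path in sorted(folder.iterdir()):
            if path.is_file():
                # latin-1 can decode any byte sequence, so posts that are
                # not valid UTF-8 will not raise a decoding error.
                docs[f"{group}/{path.name}"] = path.read_text(encoding="latin-1")
    return docs


# Example call (the path layout is an assumption):
# docs = load_documents("data/20_newsgroups", ["rec.motorcycles", "comp.graphics"])
```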

Installing

The project uses the following packages:

+ Python 3
+ nltk for text processing
+ pickle (part of the Python standard library) for storing and loading data structures
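
A minimal sketch of how an index could be stored and reloaded with pickle. The helper names `save_index` and `load_index`, and the nested-dictionary shape of the index, are assumptions for illustration rather than the notebook's own code.

```python
import pickle


def save_index(index, path):
    """Write the index object to disk (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Read a previously saved index back into memory (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Assumed index shape: {token: {doc_id: [positions]}}
# save_index({"bike": {"doc1": [2, 9]}}, "positional_index.pkl")
# index = load_index("positional_index.pkl")
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.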

Preprocessing

+ Removal of metadata from each file
+ Removal of symbols such as [!@#$%^&*()] from the tokens
+ Removal of non-alphanumeric tokens
+ Lemmatization of tokens
+ Removal of tokens shorter than 2 characters (these carry little information)
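
The steps above could be sketched roughly as follows. This is not the notebook's code: the function names are hypothetical, the metadata step assumes the header block ends at the first blank line, and the lemmatizer is passed in as a callable (identity by default) so the sketch runs without NLTK data installed.

```python
import re


def strip_metadata(raw_text):
    """Drop the header block, assumed to end at the first blank line."""
    _, sep, body = raw_text.partition("\n\n")
    return body if sep else raw_text


def preprocess(raw_text, lemmatize=lambda token: token):
    """Return the cleaned token list for one file (hypothetical helper)."""
    tokens = []
    for raw in strip_metadata(raw_text).split():
        # Strip the listed symbols from the token.
        token = re.sub(r"[!@#$%^&*()]", "", raw)
        # Drop tokens that are empty or still contain non-alphanumeric characters.
        if not token.isalnum():
            continue
        token = lemmatize(token)
        # Drop tokens shorter than 2 characters.
        if len(token) < 2:
            continue
        tokens.append(token)
    return tokens
```

To use NLTK for the lemmatization step, one option is to pass `WordNetLemmatizer().lemmatize` from `nltk.stem` as the `lemmatize` argument, after downloading the WordNet data with `nltk.download("wordnet")`.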

Methodology

  • A positional index is built for two folders of the 20_newsgroups dataset: rec.motorcycles and comp.graphics.
  • Each file in these folders is preprocessed as described above.
  • The positional index dictionary is constructed. (The code is self-explanatory, with comments.)
  • When the user submits a query, we compute the relative distance between the query tokens and retrieve every file that contains those tokens at the same relative distances.
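
The index construction and query matching described above can be sketched as below. This is an illustrative version under stated assumptions, not the notebook's implementation: the function names are hypothetical, the index is assumed to map each token to `{doc_id: [positions]}`, and the query tokens are assumed to be preprocessed the same way as the documents, so that token i of the query must appear exactly i positions after the first token.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs: {doc_id: [token, ...]} -> {token: {doc_id: [positions]}}"""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return dict(index)


def phrase_query(index, query_tokens):
    """Return sorted doc ids containing the query tokens at matching offsets."""
    if not query_tokens or any(t not in index for t in query_tokens):
        return []
    first, rest = query_tokens[0], query_tokens[1:]
    # Only documents that contain every query token can match.
    candidates = set(index[first])
    for token in rest:
        candidates &= set(index[token])
    matches = []
    for doc_id in candidates:
        position_sets = [set(index[token][doc_id]) for token in rest]
        for start in index[first][doc_id]:
            # Each later token must sit at the same relative distance
            # from the first token as it does in the query.
            if all(start + offset in positions
                   for offset, positions in enumerate(position_sets, 1)):
                matches.append(doc_id)
                break
    return sorted(matches)
```

For example, with `docs = {"d1": ["fast", "motor", "bike"], "d2": ["motor", "fast", "bike"]}`, the query `["motor", "bike"]` matches only `d1`, because in `d2` the two tokens are not adjacent.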

References

The dictionary construction follows Chris Manning's book (link).
