
Search Engine

My first search engine, built using Python, NumPy, pandas, and spaCy

Functionalities

Boolean retrieval and filtering: Free-text and phrase queries

By default, passing a string runs an OR query: we search for documents that contain any of the words in the query string. For example, if we pass the query string python pandas, we will get all the documents that contain either python or pandas, or both.

Additionally, if we wrap any word in the query string in double quotes, that word must be present in every returned document. For example, if we pass the query string "python" "pandas", we will get all the documents that contain both "python" and "pandas" (an AND query). Both of these query types are answered by querying inverted indexes built from the cleaned and lemmatized corpus.

TODO: Every word that needs to be present in the document needs to be in double quotes separately. If we want the words python and pandas, we must pass the query string as "python" "pandas". It would be better if we could pass the query string as "python pandas".
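To make the AND/OR mechanism concrete, here is a minimal sketch of how such lookups over an inverted index could work. The names (inverted_index, boolean_query) and the toy data are illustrative only, not the repository's actual API.

```python
# Sketch: answering OR and AND queries with an inverted index.
# The index maps each lemmatized term to the set of document IDs that contain it.
# Names here (inverted_index, boolean_query) are illustrative, not the repo's API.

inverted_index = {
    "python": {1, 2, 5},
    "pandas": {2, 3},
}

def boolean_query(terms, mode="or"):
    """Return the document IDs matching the terms with OR or AND semantics."""
    postings = [inverted_index.get(t, set()) for t in terms]
    if not postings:
        return set()
    if mode == "and":
        return set.intersection(*postings)  # every quoted term must be present
    return set.union(*postings)             # any term may be present

print(boolean_query(["python", "pandas"]))              # OR query: {1, 2, 3, 5}
print(boolean_query(["python", "pandas"], mode="and"))  # AND query: {2}
```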

If the option is_phrase is set to True, the order of the words becomes significant (phrase queries). This is implemented using biwords: if we pass the string I love python, we will get all the documents that contain both biwords I love AND love python. Each biword must occur as an adjacent pair in the document, but the biwords themselves may appear in a different order, so matching is approximate rather than exact.
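A small sketch of how biword generation and matching could look; the helper names (biwords, matches_phrase) are hypothetical and used only for illustration.

```python
# Sketch: phrase matching via biwords (adjacent word pairs).
# Helper names are hypothetical; the repository may structure this differently.

def biwords(tokens):
    """Return the adjacent word pairs of a token sequence."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def matches_phrase(doc_tokens, query_tokens):
    """A document matches when it contains every biword of the query,
    even if the biwords appear in a different order in the document."""
    doc_pairs = set(biwords(doc_tokens))
    return all(pair in doc_pairs for pair in biwords(query_tokens))

query = "I love python".split()
print(biwords(query))  # [('I', 'love'), ('love', 'python')]

doc = "they love python and I love pandas".split()
print(matches_phrase(doc, query))  # True: both biwords occur, though not adjacently
```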

Wildcard queries

Wildcard queries contain a wildcard character *, which matches any number of characters. For example, if we pass the query string pyth*n, we will get all the documents that contain words starting with pyth and ending with n. The wildcard can appear in three different positions:

  1. At the beginning of the word: *thon

  2. At the end of the word: pyth*

  3. In the middle of the word: py*on

Note: Currently only one wildcard character * is supported per query word (though each word in the query string can have its own wildcard character), so a query like p*yth*n is translated to p*n in the backend.

Additionally, when wildcard characters are used in a phrase query, the result is retrieved as an OR query on the biwords instead of an AND query.

Wildcard queries are automatically identified and then queried on, and don't need any extra effort on the user's part.
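A minimal sketch of how a single-wildcard term could be expanded against the vocabulary by prefix and suffix, following the translation rule in the note above; the names (vocabulary, expand_wildcard) and the toy vocabulary are assumptions for illustration.

```python
# Sketch: matching a single-wildcard term such as "pyth*n" against the vocabulary.
# Everything between the first and last '*' is collapsed, mirroring the note
# above (p*yth*n -> p*n). Names are illustrative, not the repo's API.

vocabulary = ["python", "pythons", "pylon", "pandas"]

def expand_wildcard(term):
    """Return all vocabulary words matching prefix*suffix."""
    first, last = term.find("*"), term.rfind("*")
    prefix, suffix = term[:first], term[last + 1:]
    return [w for w in vocabulary
            if w.startswith(prefix) and w.endswith(suffix)
            and len(w) >= len(prefix) + len(suffix)]

print(expand_wildcard("pyth*n"))  # ['python']
print(expand_wildcard("py*on"))   # ['python', 'pylon']
print(expand_wildcard("*thon"))   # ['python']
```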

TODO: The search functionality currently treats any * as a wildcard, so I am building an option to turn wildcard querying off, allowing users to search for text containing a literal *.

Ranking

Ranking is based on the tf-idf scores of the documents for the query. The tf-idf score is calculated one document at a time, and the results are sorted and sent to the primary search function. Only documents that pass the initial boolean filter have their scores sent from the backend. If the user just wants a boolean-filtered result, they can set the ranked parameter to False.

If wildcard characters are present in the query, all the words that match the wildcard query contribute to the score. For example, if we pass dat*, both data and date (and any other matching words) will contribute to the score.
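As a rough sketch of the scoring step, the snippet below computes tf-idf over a toy corpus for the boolean-filtered documents; the log-scaled weighting shown is one common variant and may differ from the repository's exact formula.

```python
import math
from collections import Counter

# Sketch: tf-idf scoring of the boolean-filtered documents over a toy corpus.
# The log-scaled tf and idf used here are one common variant; the repository's
# exact weighting may differ.

docs = {
    1: "python pandas data".split(),
    2: "python python numpy".split(),
    3: "pandas data data date".split(),
}
n_docs = len(docs)
df = Counter(term for tokens in docs.values() for term in set(tokens))

def tfidf_score(doc_id, query_terms):
    tf = Counter(docs[doc_id])
    score = 0.0
    for term in query_terms:
        if tf[term]:
            score += (1 + math.log10(tf[term])) * math.log10(n_docs / df[term])
    return score

# A wildcard term such as dat* would first be expanded, and every matching
# word (data, date, ...) would contribute to the score.
filtered = [1, 3]  # document IDs that passed the boolean filter
ranked = sorted(filtered, key=lambda d: tfidf_score(d, ["data", "date"]), reverse=True)
print(ranked)  # [3, 1]
```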

While retrieving the documents, the user can choose the number of documents they want retrieved using the retrieve_n parameter in the engine's search function.

Note: Biwords are not ranked separately; each word counts towards the ranking individually. The initial filter ensures that the documents do indeed contain the biwords. Though performance comparisons are yet to be done, ranking is expected to improve if the biwords are ranked as well.

Spelling correction

Spelling correction uses a slightly modified version of the Levenshtein edit distance algorithm. If the spell_check option is set to True and the given query string does not match any documents, the engine will try to find the closest match to every word from the corpus and return the documents that contain those words.

In addition to the normal edit distance operations, I added a check for two adjacent characters being swapped (i.e., a transposition operation joins the usual replace, insert, and delete operations when finding the minimum distance between two words).
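With that addition, the measure is essentially the optimal-string-alignment variant of the Damerau-Levenshtein distance. A compact sketch, with the adjacent-swap operation marked (a generic illustration, not the repository's exact implementation):

```python
# Sketch: edit distance with insert, delete, replace, and adjacent-swap
# operations (the optimal-string-alignment Damerau-Levenshtein distance).

def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # replacement
            # the extra operation: swap of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(edit_distance("pyhton", "python"))  # 1 (a single adjacent swap)
print(edit_distance("panads", "pandas"))  # 1
```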

Spell checking ignores wildcard terms and only corrects the words that are not wildcard queries. Since the query itself is being changed, phrase and AND queries are converted to OR queries before searching for the closest matches.

Autocomplete suggestions

Using the same edit distance algorithm, we find the corpus words with the smallest distance to the last word in the query. If two words have the same distance, we compare their frequencies in the corpus: the word that occurs more frequently is the likelier 'autocomplete' candidate.
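A minimal sketch of that ranking rule, reusing the edit_distance function from the spelling-correction sketch above; the vocabulary and its frequencies are made up for illustration.

```python
# Sketch: autocomplete candidates ranked by edit distance to the last query
# word, with corpus frequency breaking ties (more frequent words win).
# The frequencies are invented; edit_distance is the function sketched above.

word_frequencies = {"python": 120, "pythons": 8, "pandas": 95, "panda": 40}

def autocomplete(last_word, k=3):
    """Return the k best completion candidates for the last query word."""
    return sorted(
        word_frequencies,
        key=lambda w: (edit_distance(last_word, w), -word_frequencies[w]),
    )[:k]

print(autocomplete("pytho"))  # ['python', 'pythons', 'panda']
```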

To check for possible autocomplete results instead of a search, set the autocomplete parameter to True in the search function.

The lemmatized words are converted back to possible unlemmatized (inflected) forms using the lemminflect package.

Usage instructions

  1. Clone or download the repository and set up the environment from the env.yml file using

     conda env create -f env.yml

  2. Activate the environment using

     conda activate search_engine

  3. If you want to change the pdfs being read/converted, do the needful and modify the paths wherever necessary. If not, just continue with the next step.

  4. Run the cleaning.py, tokenizing.py, and setup.py files, in that order.

  5. See all the possible usage examples in example_usage.ipynb and adapt them for use in your application.
