Sthir

Search using spectral bloom filters in static sites

Sthir can create a memory-efficient search feature for your static website. Sthir comes with a user-friendly command-line interface. In two steps you can build a working search page for your website!

Description

Sthir is a library for adding search functionality to static websites. It scans your HTML pages for words and indexes them in an efficient data structure called a Spectral Bloom Filter. Spectral Bloom Filters differ from regular Bloom filters in that they store a count for each hash, so they can estimate, at minimum, how many times a hash was indexed. We use an efficient base 15 encoding to compress the bloom filters and transfer them to the client side. Our goal can be described with a simple equation:

Less Memory Footprint + Term Frequency Knowledge = Perfect Search Functionality for Static Sites!
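
The scanning step described above can be pictured with a minimal sketch. This is not sthir's own code: the names TextExtractor and count_words are hypothetical, and the tokenization rule (lowercase runs of letters and digits) is an assumption made for illustration. It uses only the Python standard library.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_words(html_text):
    """Return a Counter of lowercase alphanumeric words in the page text."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Assumed tokenization: runs of ASCII letters and digits, lowercased.
    words = re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())
    return Counter(words)


if __name__ == "__main__":
    page = "<html><body><p>Bloom filters, bloom!</p></body></html>"
    print(count_words(page))
```

The per-word counts produced this way are the term frequencies that a counting structure such as a Spectral Bloom Filter would then store.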

A deployed example of our library can be found on this blog.

Installation

sthir runs on Python 3.5 or above.

Installation with pip via PyPI for Linux, OS X and Windows:

pip install sthir

To check the installation, run the following command:

sthir -h

If the help message prints without any errors, the installation was successful.

Quickstart

Help message:

usage: sthir [-h] [-e ErrorRate] [-s Counter_size] [-l] [-ds] path

Creates a Spectral Bloom filter(SBF) for .html files in the specified
directory.

positional arguments:
  path             Path to source directory for creating the filter

optional arguments:
  -h, --help       show this help message and exit
  -e ErrorRate     Error_rate for the filter Range:[0.0,1.0] Default:0.01
  -s Counter_size  Size in bits of each counter in filter Range:[1,10]
                   Default:4(recommended)
  -l, --lemmetize  Enable Lemmetization
  -ds              Disable stopword removal from files (not recommended)
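
If you want to drive the command from a build script, the flags in the help message above can be assembled programmatically. The sketch below is an illustration only: build_sthir_command and run_sthir are hypothetical helper names, not part of sthir, and the only assumptions about the CLI are the flags and ranges printed in the help text.

```python
import subprocess


def build_sthir_command(path, error_rate=None, counter_size=None,
                        lemmatize=False, keep_stopwords=False):
    """Build an argv list for the sthir CLI using the flags from its help text."""
    cmd = ["sthir"]
    if error_rate is not None:
        # The help text gives the range [0.0, 1.0] for -e.
        if not 0.0 <= error_rate <= 1.0:
            raise ValueError("error_rate must be within [0.0, 1.0]")
        cmd += ["-e", str(error_rate)]
    if counter_size is not None:
        # The help text gives the range [1, 10] for -s.
        if not 1 <= counter_size <= 10:
            raise ValueError("counter_size must be within [1, 10]")
        cmd += ["-s", str(counter_size)]
    if lemmatize:
        cmd.append("-l")
    if keep_stopwords:
        # -ds disables stopword removal (not recommended by the help text).
        cmd.append("-ds")
    cmd.append(path)
    return cmd


def run_sthir(path, **options):
    """Run sthir on a directory; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_sthir_command(path, **options), check=True)


if __name__ == "__main__":
    print(build_sthir_command("public", error_rate=0.01, counter_size=4))
```

Keeping the command construction separate from the subprocess call makes it easy to log or inspect the exact command before running it.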

Basic

To scan your HTML files and generate a static search webpage, use the command: sthir <your-path-name>

By default, a search.html file containing the static search functionality is generated.

Error rate

You can change the error rate of the generated Spectral Bloom Filter using: sthir <your-path-name> -e <error-rate>

We recommend an error rate of 0.01. A higher error rate is likely to produce more false positives (i.e. the search will recommend URLs that do not contain the search word(s)).
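
To get a feel for how the error rate affects filter size, here is a sketch based on the textbook sizing formulas for a standard Bloom filter: m = -n ln(p) / (ln 2)^2 slots and k = (m / n) ln 2 hash functions, for n items and false positive rate p. Whether sthir sizes its filters exactly this way is an assumption; bloom_parameters is a hypothetical helper name.

```python
import math


def bloom_parameters(n_items, error_rate):
    """Return (slots m, hash functions k) for a standard Bloom filter.

    Uses the textbook formulas; sthir's own sizing may differ.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    m = math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        m, k = bloom_parameters(1000, p)
        print(f"error rate {p}: {m} slots, {k} hash functions")
```

Under these formulas, 1,000 distinct words at an error rate of 0.01 need 9,586 slots and 7 hash functions. If each slot is a 4-bit counter, that comes to just under 5 kilobytes before any encoding, which is why lowering the error rate trades memory for accuracy.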

Counter size

By default, our counters can hold a maximum count of 16 (counter_size=4). Spectral Bloom Filters use these counters to keep track of the number of times a particular hash has been indexed in the bloom filter, so by default counts are kept up to 16. However, you can change the counter size using: sthir <your-path-name> -s <counter-size>

Note:

  • A counter_size of x can store up to a maximum count of 2^x. For example, a counter_size of 3 has a maximum count of 2^3, or 8.
  • Because Spectral Bloom Filters are a probabilistic data structure, they cannot be used to accurately determine the upper bound of the counts for each word's hashes. They keep track of the lower bound instead (primarily using the Minimum Increment method).
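
The notes above can be illustrated with a toy spectral Bloom filter that uses fixed-width counters and the Minimum Increment rule. This is a sketch under stated assumptions, not sthir's implementation: the class name SpectralBloomSketch, the salted SHA-256 hashing, and the saturation behavior are all chosen for illustration. In particular, the cap used here is the largest value an unsigned counter of the given width can hold, which may not match sthir's own limit.

```python
import hashlib


class SpectralBloomSketch:
    """Toy spectral Bloom filter: an array of small saturating counters."""

    def __init__(self, n_counters, n_hashes, counter_bits=4):
        self.n_counters = n_counters
        self.n_hashes = n_hashes
        # Assumption: counters saturate at the largest unsigned value that
        # fits in counter_bits bits.
        self.max_count = (1 << counter_bits) - 1
        self.counters = [0] * n_counters

    def _positions(self, word):
        # Derive n_hashes positions by salting SHA-256 with the hash index.
        positions = []
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.n_counters)
        return positions

    def add(self, word):
        """Minimum Increment: bump only the counters holding the current minimum."""
        positions = set(self._positions(word))
        lowest = min(self.counters[p] for p in positions)
        if lowest >= self.max_count:
            return  # Saturated: the stored count can no longer grow.
        for p in positions:
            if self.counters[p] == lowest:
                self.counters[p] += 1

    def estimate(self, word):
        """Estimated count for a word: the minimum over its counters."""
        return min(self.counters[p] for p in self._positions(word))


if __name__ == "__main__":
    sbf = SpectralBloomSketch(n_counters=1024, n_hashes=4, counter_bits=4)
    for _ in range(3):
        sbf.add("bloom")
    print(sbf.estimate("bloom"))
```

Once a word's counters saturate, further insertions are not recorded, so the stored value only tells you the word occurred at least that many times. A larger counter size raises that ceiling at the cost of more bits per counter.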

Documentation

Our entire documentation is available in:

Here is a working demo (image: terminal output).

License

The library is licensed under the MIT License. This project is developed by Parth Parikh, Mrunank Mistry, and Dhruvam Kothari.

Credits

Contributing

  1. Fork it (https://github.com/pncnmnp/Spectral-Bloom-Search/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

Data

https://drive.google.com/uc?id=1UpL5IPdzdPSEmv1U-ethebU_-ODKUxN0&export=download
