
indexor's Introduction

indexor - A toy search engine!

It's not done yet! But it can do some things:

  1. Download https://www.dropbox.com/s/lv44vyl8ia46llx/sample.tar.bz2?dl=0 and extract it next to this directory. This creates a directory named sample (which should be a sibling directory of this repo) containing 8,476 files or so.

  2. Run ./build_index.py to try to build the index.

    This creates a few big files in the parent directory. With our sample from step 1, they take up half a GB of disk space. (It would be pretty easy to cut that in half; haven't bothered yet.)

  3. Run ./read_index.py to read raw entries out of the index.

    This is something less than a real search engine. You can only query one term at a time, and the output is bare-bones. But it works! With our sample, it takes a few seconds to start up; after that, queries are fast.
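
For readers unfamiliar with the idea, a single-term lookup like the one described in step 3 can be sketched with a tiny in-memory inverted index. This is only an illustration under stated assumptions: the names build_index and query, the sample documents, and the whitespace-only tokenizing are all hypothetical, and the real on-disk index format used by build_index.py and read_index.py is not shown here.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased, whitespace-separated term to the set of
    document names that contain it. (Hypothetical helper, not the
    project's actual build step.)"""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in text.lower().split():
            index[term].add(name)
    return index


def query(index, term):
    """Return the sorted names of documents containing exactly this term."""
    return sorted(index.get(term.lower(), ()))


# Made-up sample documents, for illustration only.
docs = {
    "0001.txt": "A penny saved",
    "0002.txt": "Penny. Lane",
}
index = build_index(docs)
print(query(index, "penny"))   # only 0001.txt: "penny." is a different term
```

Because terms are split on whitespace only, trailing punctuation stays attached to the word, so 'penny' and 'penny.' are indexed as separate terms.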

indexor's People

Contributors

jorendorff, sapphicnerd

indexor's Issues

multiple similar entries

This is more of a fug and not a bug. Entries that are the same but are followed by punctuation are treated as different entries. We are not stripping out punctuation at this point, so that is to be expected. Currently, if you search for something like 'penny', you get different results than for 'penny.', 'penny;', and 'penny,'.

To resolve this issue:

  • we need to decide what punctuation should be stripped and what should stay
  • we need to find the best way to strip punctuation out

This shouldn't be very difficult to solve.
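
One possible way to approach the two points above, as a minimal sketch rather than a decision: strip ASCII punctuation from the start and end of each token only, and leave interior punctuation (as in "X-Files" or "don't") alone. The function names normalize and tokenize are hypothetical, and lowercasing is an assumption about how terms are matched; the same normalization would have to be applied both when building the index and when reading a query.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase the token.
    Interior punctuation (hyphens, apostrophes) is kept.
    Lowercasing is an assumption, not something the issue asks for."""
    return token.strip(string.punctuation).lower()


def tokenize(text):
    """Split on whitespace, normalize each token, and drop tokens that
    were nothing but punctuation."""
    tokens = (normalize(t) for t in text.split())
    return [t for t in tokens if t]


print(tokenize("penny penny. penny; penny,"))
```

With this in place, 'penny', 'penny.', 'penny;', and 'penny,' all collapse to the same entry. Which characters to strip is still an open choice; string.punctuation is just a convenient starting set.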

Searching for `superman` should find the article on Superman

> superman
Marvel Comics        1024.txt
London Symphony Orchestra 3593.txt
Broadcast syndication 4622.txt
Dwight Howard        5134.txt
Track cycling        1555.txt
Emblem               4627.txt
San Miguel Beermen   6690.txt
Tim Duncan           4137.txt
Hans Zimmer          3628.txt
2008 in film         6189.txt
> batman
Marvel Comics        1024.txt
The X-Files          1539.txt
Geelong              2563.txt
Sex Pistols          1541.txt
Adult Swim           4621.txt
Joss Whedon          3616.txt
Frank Welker         5153.txt
George Clooney       3622.txt
Andy Warhol          0040.txt
Nicole Kidman        1067.txt

These are super classy results but come on.
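
One possible direction, sketched here as an assumption rather than a plan: move results whose title contains the query term to the front, and keep the existing order otherwise. The function rank, the (title, filename) pair layout, and the "Superman" entry with its file name are all hypothetical and made up for illustration; how the index actually orders results is not described in this issue.

```python
def rank(results, term):
    """Stable sort of (title, filename) pairs: titles containing the
    term as a whole word come first; all other entries keep their
    original relative order."""
    term = term.lower()
    # key is False (sorts first) when the term appears in the title
    return sorted(results, key=lambda r: term not in r[0].lower().split())


# Made-up result list; the Superman entry and its file name are hypothetical.
results = [
    ("Marvel Comics", "1024.txt"),
    ("Hans Zimmer", "3628.txt"),
    ("Superman", "9999.txt"),
]
for title, filename in rank(results, "superman"):
    print(f"{title:<20} {filename}")
```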
