jorendorff / indexor

A toy search engine

Python 37.08% Rust 62.92%

indexor's Introduction

indexor - A toy search engine!

It's not done yet! But it can do some things:

1. Download https://www.dropbox.com/s/lv44vyl8ia46llx/sample.tar.bz2?dl=0 and extract it next to this directory. It creates a directory named sample (that should be a sibling directory of this repo) with about 8,476 files in it.

2. Run ./build_index.py to try to build the index.

   This creates a few big files in the parent directory. With our sample from step 1, they take up half a GB of disk space. (It would be pretty easy to cut that in half; haven't bothered yet.)

3. Run ./read_index.py to read raw entries out of the index.

   This is something less than a real search engine. You can only query one term at a time, and the output is bare-bones. But it works! With our sample, it takes a few seconds to start up; after that, queries are fast.
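For readers new to the idea, here is a minimal in-memory sketch of the build-then-query flow described above. It is not the code in build_index.py or read_index.py (the real tool writes its index to files on disk); the function names `build_index` and `lookup` and the two sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased, whitespace-separated term to the set of
    document names that contain it (a simple inverted index)."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in text.lower().split():
            index[term].add(name)
    return index


def lookup(index, term):
    """Return the sorted names of documents containing a single term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical documents, not taken from the sample archive.
docs = {
    "0001.txt": "Superman is a superhero",
    "0002.txt": "Batman is a superhero too",
}
index = build_index(docs)
print(lookup(index, "superhero"))  # ['0001.txt', '0002.txt']
print(lookup(index, "batman"))     # ['0002.txt']
```

Building the index is the slow part; once the term-to-documents map exists, a single-term query is one dictionary lookup, which is roughly why queries feel fast after startup.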

indexor's Issues

Support storing the index

multiple similar entries

This is more of a fug and not a bug. Entries that are the same but are followed by punctuation are treated as different entries. We are not stripping out punctuation at this point, so that is to be expected. Currently, if you search for something like 'penny', you get different results than for 'penny.', 'penny;', and 'penny,'.

To resolve this issue:

- We need to decide what punctuation should be stripped and what should stay.
- We need to find the best way to strip punctuation out.

This shouldn't be very difficult to solve.
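One possible approach, sketched under the assumption that stripping punctuation only at the edges of each token is acceptable. The helper names `normalize` and `tokenize` are hypothetical and do not come from the repo; the same function would need to run both when building the index and when reading a query.

```python
import string


def normalize(token):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return token.strip(string.punctuation).lower()


def tokenize(text):
    """Split on whitespace, normalize each token, and drop any token
    that is empty after stripping (for example a bare '...')."""
    return [t for t in (normalize(raw) for raw in text.split()) if t]


print(tokenize("A penny saved; a penny, earned. Penny!"))
# ['a', 'penny', 'saved', 'a', 'penny', 'earned', 'penny']
```

Because only the edges are stripped, tokens such as `X-Files` or `don't` keep their internal punctuation. Whether that is the right policy is exactly the first open question in the list above.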

results return only words starting with A or B

lamesearch runs without barking, but results returned are limited to those starting with A or B. Cause unknown at this point.

Searching for `superman` should find the article on Superman

> superman
Marvel Comics        1024.txt
London Symphony Orchestra 3593.txt
Broadcast syndication 4622.txt
Dwight Howard        5134.txt
Track cycling        1555.txt
Emblem               4627.txt
San Miguel Beermen   6690.txt
Tim Duncan           4137.txt
Hans Zimmer          3628.txt
2008 in film         6189.txt
> batman
Marvel Comics        1024.txt
The X-Files          1539.txt
Geelong              2563.txt
Sex Pistols          1541.txt
Adult Swim           4621.txt
Joss Whedon          3616.txt
Frank Welker         5153.txt
George Clooney       3622.txt
Andy Warhol          0040.txt
Nicole Kidman        1067.txt

These are super classy results but come on.
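The cause is not diagnosed here. If it turns out to be a ranking problem (hits are returned in no meaningful order), one direction is to score each hit and put title matches first. The sketch below is only that: the `rank` function, its scoring rule, and the three sample documents are invented for illustration.

```python
def rank(query, docs, limit=10):
    """Order documents matching a single-term query, best first.

    docs maps file name -> (title, body). A document whose title equals
    the query sorts ahead of everything else; remaining ties are broken
    by how often the term occurs in the body, then by title.
    """
    term = query.lower()
    scored = []
    for name, (title, body) in docs.items():
        count = body.lower().split().count(term)
        title_match = title.lower() == term
        if count == 0 and not title_match:
            continue  # the term does not occur in this document at all
        scored.append((title_match, count, title, name))
    # Title matches first, then higher counts, then alphabetical title.
    scored.sort(key=lambda item: (not item[0], -item[1], item[2]))
    return [(title, name) for _, _, title, name in scored[:limit]]


# Hypothetical documents, not taken from the sample archive.
docs = {
    "a.txt": ("Marvel Comics", "superman appears in a crossover with marvel heroes"),
    "b.txt": ("Superman", "superman is a superhero superman first appeared in 1938"),
    "c.txt": ("Hans Zimmer", "zimmer scored a superman film and a batman film"),
}
for title, name in rank("superman", docs):
    print(f"{title:20} {name}")
```

With this data the article titled Superman is listed first, followed by the two pages that merely mention the term.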

Support searching for multiple words

> batman suit
no hits

unacceptable.
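A common way to support multi-word queries is to look up each word separately and intersect the resulting document sets (AND semantics). The sketch below assumes the index can be viewed as a mapping from term to a set of file names; `search_all` and the sample index are hypothetical, not the repo's API.

```python
def search_all(index, query):
    """Return the documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    # Start from the smallest set so the running intersection stays small.
    postings.sort(key=len)
    hits = set(postings[0])
    for docs in postings[1:]:
        hits &= docs
    return sorted(hits)


# Hypothetical index contents.
index = {
    "batman": {"0002.txt", "0003.txt"},
    "suit": {"0003.txt", "0004.txt"},
}
print(search_all(index, "batman suit"))  # ['0003.txt']
```

A query with a term that appears nowhere still returns no hits, which is the expected behaviour for AND; OR semantics would use set union instead.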
