informationretrieval's People

Contributors

aligusnet

informationretrieval's Issues

Get rid of DocumentId's batch/local id parts

DocumentId is the unique identifier of a document. It consists of two IDs:

  • BlockId, a unique identifier of the document's block, and
  • LocalId, a unique identifier inside the block.

DocumentId has some disadvantages:

  • it creates holes in the series of document IDs if the number of documents in a collection is less than 2^16 (which is almost always the case);

  • it does not really save space: e.g. in the case of RangeDocumentId, introducing an intermediate class for storing ranges in a collection, (collectionId, (localId, length)[])[], saves nothing when the ranges are small.
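The first disadvantage can be illustrated with a minimal sketch. It assumes the local part occupies the low 16 bits of the ID (suggested by the 2^16 figure above, but not confirmed by the issue); the names composite_id and assign_ids are hypothetical.

```python
LOCAL_BITS = 16  # assumed width of the local part (the issue mentions 2^16)


def composite_id(block_id, local_id):
    """Pack a block id and a local id into a single integer."""
    return (block_id << LOCAL_BITS) | local_id


def assign_ids(block_sizes):
    """Return the composite ids given to every document, block by block."""
    return [
        composite_id(block, local)
        for block, size in enumerate(block_sizes)
        for local in range(size)
    ]


# Two blocks holding 3 and 2 documents.
ids = assign_ids([3, 2])
print(ids)  # [0, 1, 2, 65536, 65537]
```

With small blocks, almost the whole ID space between the last document of one block and the first document of the next stays unused.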

Implement ExternalIndex

An index that does not cache the whole postings list in memory but instead reads postings from disk as soon as they are required.
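A rough sketch of the idea, not the project's design: only a small directory of term -> (offset, count) stays in memory, and each postings list is read from a file when it is requested. The file layout (little-endian 32-bit integers) and all method names are assumptions made for illustration.

```python
import os
import struct
import tempfile


class ExternalIndex:
    """Keeps only term -> (offset, count) in memory; postings stay on disk."""

    def __init__(self, path):
        self._path = path
        self._directory = {}  # term -> (byte offset, number of postings)

    @classmethod
    def build(cls, path, postings_by_term):
        """Write every postings list to disk and remember where it starts."""
        index = cls(path)
        with open(path, "wb") as f:
            for term, postings in postings_by_term.items():
                index._directory[term] = (f.tell(), len(postings))
                f.write(struct.pack(f"<{len(postings)}I", *postings))
        return index

    def postings(self, term):
        """Read one postings list from disk at the moment it is requested."""
        if term not in self._directory:
            return []
        offset, count = self._directory[term]
        with open(self._path, "rb") as f:
            f.seek(offset)
            return list(struct.unpack(f"<{count}I", f.read(4 * count)))


with tempfile.TemporaryDirectory() as tmp:
    index = ExternalIndex.build(
        os.path.join(tmp, "postings.bin"),
        {"cat": [1, 4, 9], "dog": [2, 4]},
    )
    cat_postings = index.postings("cat")
    missing = index.postings("bird")
```

The trade-off is one seek and read per lookup in exchange for memory use that grows with the number of terms rather than the number of postings.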

make names clearer

  • Storage => Corpus
  • Collection => Block
  • Rename projects using the pattern: <RootProjectName>(\.(Test|App|WPF))?
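One reading of the naming pattern, assuming the suffix is meant to be exactly one of the words Test, App, or WPF (a group with alternation rather than a character class). The root name InformationRetrieval is only an example value.

```python
import re


def project_name_pattern(root):
    """Build a regex for '<root>' optionally followed by .Test, .App or .WPF."""
    return re.compile(rf"{re.escape(root)}(\.(Test|App|WPF))?")


pattern = project_name_pattern("InformationRetrieval")  # example root name

for name in ["InformationRetrieval", "InformationRetrieval.Test", "InformationRetrieval.T"]:
    print(name, bool(pattern.fullmatch(name)))
```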

switch from structs to primitive types in protobuf serialization

DictionaryIndex uses two structs to store the index:

  • DocumentId
  • DocumentIdRange.

There are a couple of issues with that:

  • protobuf-net boxes structs when it serializes them: protobuf-net/protobuf-net#471
  • protobuf in general handles complex types much less efficiently than integers

Both of the structs can be represented as 32-bit integers. We should use integer representations of the structs to avoid unnecessary allocations and optimize index structure on disk.
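A minimal sketch of the integer representation for DocumentId only, assuming a 16/16 bit split between the block and local parts (the issue does not state the actual layout, and the encoding of DocumentIdRange is left open here). The function names are hypothetical.

```python
from array import array

LOCAL_BITS = 16  # assumed split: high 16 bits for the block, low 16 for the local id
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def pack_document_id(block_id, local_id):
    """Encode (block_id, local_id) as one unsigned 32-bit value."""
    if not (0 <= block_id <= LOCAL_MASK and 0 <= local_id <= LOCAL_MASK):
        raise ValueError("both parts must fit in 16 bits")
    return (block_id << LOCAL_BITS) | local_id


def unpack_document_id(value):
    """Decode a packed value back into (block_id, local_id)."""
    return value >> LOCAL_BITS, value & LOCAL_MASK


# A postings list stored as plain integers instead of struct-like objects.
postings = array("L", [pack_document_id(0, 7), pack_document_id(2, 1)])
print(list(postings))  # [7, 131073]
print(unpack_document_id(postings[1]))  # (2, 1)
```

A flat sequence of integers can then be serialized as a repeated integer field instead of a list of nested messages.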

optimize DictionaryIndex

Store DocumentIds as ranges only when it makes sense (e.g. when there are 5 consecutive document IDs in a row).
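The rule could be sketched as follows, assuming sorted, distinct integer IDs and a threshold of 5 taken from the example above. The mixed encoding (plain integers plus (start, length) tuples) is an illustration, not the project's on-disk format.

```python
MIN_RUN = 5  # assumed threshold, taken from the "5 in a row" example


def compress(doc_ids, min_run=MIN_RUN):
    """Encode sorted, distinct ids; only runs of min_run or more become ranges."""
    encoded = []
    i = 0
    while i < len(doc_ids):
        j = i
        # Extend j to the end of the run of consecutive ids starting at i.
        while j + 1 < len(doc_ids) and doc_ids[j + 1] == doc_ids[j] + 1:
            j += 1
        run_length = j - i + 1
        if run_length >= min_run:
            encoded.append((doc_ids[i], run_length))  # (start, length)
        else:
            encoded.extend(doc_ids[i:j + 1])  # short runs stay as single ids
        i = j + 1
    return encoded


def decompress(encoded):
    """Expand the mixed encoding back into a flat list of ids."""
    doc_ids = []
    for item in encoded:
        if isinstance(item, tuple):
            start, length = item
            doc_ids.extend(range(start, start + length))
        else:
            doc_ids.append(item)
    return doc_ids


ids = [1, 2, 3, 10, 11, 12, 13, 14, 15, 40]
encoded = compress(ids)
print(encoded)  # [1, 2, 3, (10, 6), 40]
```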

Combine all text preprocessing steps into one

This saves time on reading from and writing to disk, which, as it turns out, takes most of the time in each of the text preprocessing steps:

  • cleaning from wiki markups
  • tokenizing
  • hashing.

In general, we do not need to store the results of the listed steps on disk, so we can pass the preprocessed text directly to an indexer.
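A minimal sketch of such a combined pipeline, with no intermediate files. The markup cleanup is deliberately crude, CRC32 stands in for whatever hash the project uses, and every function name here is hypothetical.

```python
import re
import zlib
from collections import defaultdict


def clean_wiki_markup(text):
    """Very rough cleanup: unwrap [[link|label]] and drop quote markup."""
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    return text.replace("'''", "").replace("''", "")


def tokenize(text):
    """Split lower-cased text into word tokens."""
    return re.findall(r"\w+", text.lower())


def hash_token(token):
    """Stand-in hash (CRC32); the project's real hash function is not specified."""
    return zlib.crc32(token.encode("utf-8"))


def preprocess(text):
    """Run all three steps in memory and yield hashed tokens."""
    for token in tokenize(clean_wiki_markup(text)):
        yield hash_token(token)


def build_index(documents):
    """Feed preprocessed tokens straight into a toy inverted index."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for token_hash in set(preprocess(text)):
            index[token_hash].append(doc_id)
    return index


docs = [
    "'''Search''' uses an [[Inverted index|index]]",
    "An [[index]] maps terms",
]
index = build_index(docs)
print(index[hash_token("index")])  # [0, 1]
```

Because preprocess is a generator, each document flows through cleaning, tokenizing, and hashing without touching disk between steps.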
