aligusnet / informationretrieval

Information Retrieval

C# 100.00%

information-retrieval nlp

informationretrieval's Issues

Implement new DocumentDataSerializer into/from IList of chars

Optimise ReadDocument of StorageZipReader

It takes a few hundred milliseconds to load a document from storage.
We should not load the whole collection to obtain one document.
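A minimal sketch of reading a single document from a zip archive, in Python rather than the project's C#. It assumes each document is stored as its own archive entry, which the issue does not state; `read_document` and the entry names are hypothetical.

```python
import io
import zipfile

def read_document(archive, name):
    """Return the text of one archive member without extracting the others."""
    with zipfile.ZipFile(archive) as zf:
        # ZipFile reads only the central directory up front; open() then
        # seeks to and decompresses just the requested member.
        with zf.open(name) as member:
            return member.read().decode("utf-8")

# Build a small in-memory archive to demonstrate (assumed layout:
# one document per entry).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(3):
        zf.writestr(f"doc{i}.txt", f"text of document {i}")

second = read_document(buffer, "doc1.txt")
```

If several documents share one entry, an additional offset table inside the entry would be needed to avoid decompressing all of them.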

Strip wiki markup from Wikipedia texts

Partial cleanup is enough to make the tokenizer's job easier.
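A partial cleanup could look like the following sketch. The set of rules (links, simple templates, quote runs, headings) is an assumption chosen for illustration, not the project's actual rule set, and nested templates are deliberately left alone.

```python
import re

def strip_wiki_markup(text):
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop simple, non-nested templates such as {{citation needed}}
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Drop bold/italic quote runs ('' and ''')
    text = re.sub(r"'{2,}", "", text)
    # == Heading == -> Heading
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    return text

sample = "== Moon ==\nThe '''Moon''' orbits [[Earth|the Earth]].{{citation needed}}"
cleaned = strip_wiki_markup(sample)
```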

Implement Binary Search

Operations to support:

Implement LISP-like Boolean query language

E.g. (AND query moon (NOT (OR sun star)))
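One way such a query could be parsed and evaluated is sketched below. The in-memory `index` (term to set of document ids) and the `all_docs` universe needed for NOT are assumptions made for the example; malformed queries are not handled.

```python
def tokenize(query):
    return query.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Parse one expression from the token list, consuming it from the front."""
    token = tokens.pop(0)
    if token != "(":
        return token  # a bare term
    expr = []
    while tokens[0] != ")":
        expr.append(parse(tokens))
    tokens.pop(0)  # drop the closing ")"
    return expr

def evaluate(expr, index, all_docs):
    """Return the set of document ids matching the parsed expression."""
    if isinstance(expr, str):
        return index.get(expr, set())
    op, *args = expr
    results = [evaluate(arg, index, all_docs) for arg in args]
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    if op == "NOT":
        return all_docs - results[0]
    raise ValueError(f"unknown operator: {op}")

# Hypothetical toy index: term -> ids of documents containing it.
index = {"query": {1, 2, 3}, "moon": {2, 3, 4}, "sun": {3}, "star": {5}}
all_docs = {1, 2, 3, 4, 5}

tree = parse(tokenize("(AND query moon (NOT (OR sun star)))"))
matches = evaluate(tree, index, all_docs)
```

Here `tree` is the nested list `["AND", "query", "moon", ["NOT", ["OR", "sun", "star"]]]`, and only document 2 contains both "query" and "moon" while containing neither "sun" nor "star".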

Get rid of DocumentId's batch/local id parts

DocumentId is the unique identifier of a document. It consists of 2 IDs:

BlockId is a unique identifier of the document's block and
LocalId is a unique identifier inside the block.

DocumentId has some disadvantages:

it creates holes in the series of document ids whenever a collection holds fewer than 2^16 documents (which is almost always the case);
it does not really save space: e.g. for RangeDocumentId, introducing an intermediate class that stores ranges per collection, (collectionId, (localId, length)[])[], saves nothing when the ranges are small.
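The holes can be seen in a small sketch. It assumes the id is a 32-bit value with the local id in the low 16 bits, inferred from the 2^16 figure above; the actual layout in the project may differ.

```python
LOCAL_BITS = 16  # assumed from the 2^16 figure above

def pack(block_id, local_id):
    return (block_id << LOCAL_BITS) | local_id

def unpack(document_id):
    return document_id >> LOCAL_BITS, document_id & ((1 << LOCAL_BITS) - 1)

# Two blocks holding 1000 documents each.
last_of_block0 = pack(0, 999)
first_of_block1 = pack(1, 0)

# Ids that can never be assigned between the two blocks.
hole = first_of_block1 - last_of_block0 - 1
```

With 1000 documents per block, 64536 of every 65536 ids go unused, so consecutive documents in neighbouring blocks never form a contiguous id range.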

Add possibility to change elements size in WPF UI

Using GridSplitter

Implement ExternalIndex

An index that does not cache whole postings lists in memory but reads postings from disk as soon as they are required.
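A rough sketch of the idea, assuming a made-up file layout: postings are written back to back as little-endian 32-bit integers, and only a small term directory of (offset, count) pairs stays in memory. `ExternalIndexSketch` and `write_index` are hypothetical names.

```python
import io
import struct

def write_index(stream, postings_by_term):
    """Write postings lists back to back; return term -> (offset, count)."""
    directory = {}
    for term, postings in postings_by_term.items():
        directory[term] = (stream.tell(), len(postings))
        stream.write(struct.pack(f"<{len(postings)}I", *postings))
    return directory

class ExternalIndexSketch:
    def __init__(self, stream, directory):
        self.stream = stream
        self.directory = directory  # only this small table lives in memory

    def postings(self, term):
        if term not in self.directory:
            return []
        offset, count = self.directory[term]
        self.stream.seek(offset)            # jump straight to the list
        data = self.stream.read(4 * count)  # read just this list
        return list(struct.unpack(f"<{count}I", data))

stream = io.BytesIO()  # stands in for a file opened in binary mode
directory = write_index(stream, {"moon": [2, 3, 4], "sun": [3]})
index = ExternalIndexSketch(stream, directory)
moon = index.postings("moon")
```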

Asynchronous index loading in WPF UI

Use new StateMachineTokenizer

make names clearer

Storage => Corpus
Collection => Block
Rename projects using pattern: <RootProjectName>(\.(Test|App|WPF))?

Support postings list compression using varint
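For illustration, a sketch of varint-coded postings using the base-128 scheme that protobuf uses (low 7 bits first, high bit as continuation flag). Storing gaps between sorted ids rather than the ids themselves is an assumption added here because it keeps the numbers small; the issue does not mention it.

```python
def encode_varint(value):
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)                      # final byte, continuation bit clear
    return bytes(out)

def encode_postings(doc_ids):
    """Encode a sorted list of ids as varint-coded gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - previous)
        previous = doc_id
    return bytes(out)

def decode_postings(data):
    doc_ids, current, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                     # more bytes follow for this gap
        else:
            current += value               # gap complete: restore the id
            doc_ids.append(current)
            value, shift = 0, 0
    return doc_ids

postings = [3, 7, 300, 100000]
encoded = encode_postings(postings)
```

The four ids take 7 bytes here instead of 16 as fixed 32-bit integers.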

Implement parallel version of transformers

Simple Parallel.For is enough
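As a rough Python analogue of the Parallel.For approach, a thread pool can map a transformer over the documents. The `transform` function is a hypothetical placeholder; in CPython, CPU-bound transformers would need a process pool instead of threads to run truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(document):
    # Hypothetical transformer: lowercase and split on whitespace.
    return document.lower().split()

def transform_all(documents, workers=4):
    # map() returns results in input order, so output i belongs to
    # document i, much like writing results[i] inside a Parallel.For body.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, documents))

results = transform_all(["The Moon", "A Star", "The Sun"])
```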

switch from structs to primitive types in protobuf serialization

DictionaryIndex uses 2 structures to store index:

DocumentId
DocumentIdRange.

There are a couple of issues with that:

protobuf-net boxes structs when it serializes them: protobuf-net/protobuf-net#471
protobuf in general handles complex types much less efficiently than integers

Both of the structs can be represented as 32-bit integers. We should use integer representations of the structs to avoid unnecessary allocations and optimize index structure on disk.

Optimize merging varint postings lists

in PostingsListWriter.WriteChainedVarint

optimize DictionaryIndex

Store DocumentIds as ranges only when it makes sense (e.g. when there are 5 document ids in a row).
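A sketch of that rule, assuming plain integer ids and a threshold of 5 taken from the example above. Ranges are represented as (start, length) tuples, which is an illustrative choice rather than the DictionaryIndex format.

```python
MIN_RUN = 5  # threshold taken from the example above; tune as needed

def compress(doc_ids, min_run=MIN_RUN):
    """Turn sorted ids into a mix of single ids and (start, length) ranges."""
    result = []
    i = 0
    while i < len(doc_ids):
        # Extend j to the end of the run of consecutive ids starting at i.
        j = i
        while j + 1 < len(doc_ids) and doc_ids[j + 1] == doc_ids[j] + 1:
            j += 1
        run = doc_ids[i:j + 1]
        if len(run) >= min_run:
            result.append((run[0], len(run)))  # long run: store as a range
        else:
            result.extend(run)                 # short run: store ids directly
        i = j + 1
    return result

compressed = compress([1, 2, 3, 10, 11, 12, 13, 14, 15, 40])
```

The run 1..3 is too short and stays as single ids, while 10..15 collapses into one range.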

Combine all text preprocessing steps into one

To save time on reading from and writing to disk.
As it turns out, disk I/O takes most of the time in each of the text preprocessing steps:

cleaning from wiki markups
tokenizing
hashing.

In general, we do not need to store the results of the listed steps on disk, so we can pass the preprocessed text straight to an indexer.
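The combined pipeline could be sketched as a chain of in-memory steps feeding the indexer directly. The cleanup rule, tokenizer and CRC32 hash below are simple stand-ins, not the project's actual implementations.

```python
import re
import zlib

def clean(text):
    return re.sub(r"'{2,}|\[\[|\]\]", "", text)  # placeholder markup cleanup

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def hash_token(token):
    return zlib.crc32(token.encode("utf-8"))     # stand-in hash function

def preprocess(documents):
    """Yield (doc_id, hashed tokens) without writing intermediate results."""
    for doc_id, text in enumerate(documents):
        yield doc_id, [hash_token(t) for t in tokenize(clean(text))]

def build_index(documents):
    index = {}
    for doc_id, hashes in preprocess(documents):
        for h in set(hashes):                    # one posting per term per doc
            index.setdefault(h, []).append(doc_id)
    return index

index = build_index(["The '''Moon'''", "[[Moon]] and sun"])
```

Each document is read once and nothing is written between the steps; only the final index would need to be persisted.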

Implement IBuildableIndex
Implement ISearchableIndex
Implement Serialization/Deserialization
