
milli's Introduction

the milli logo

a concurrent indexer combined with fast and relevant search algorithms


⚠️ ARCHIVED REPOSITORY ⚠️

The content of this repository is now available in the Meilisearch repository in the workspace milli.


Introduction

This repository contains the core engine used in Meilisearch.

It contains a library that can manage one and only one index. Meilisearch handles the multi-index logic itself. Milli does not store updates: that is the job of a layer above it, and this is why it can only process one update at a time.

This repository contains crates to quickly debug the engine:

  • There are benchmarks located in the benchmarks crate.
  • The cli crate is a simple command-line interface that helps run flamegraph on top of it.
  • The filter-parser crate contains the parser for the Meilisearch filter syntax.
  • The flatten-serde-json crate contains the library that flattens serde-json Value objects like Elasticsearch does.
  • The json-depth-checker crate is used to indicate if a JSON must be flattened.

How to use it?

Milli is a search library; it must be embedded in a program. You can generate its documentation with cargo doc --open.

Here is an example usage of the library where we insert documents into the engine and search for one of them right after.

let path = tempfile::tempdir().unwrap();
let mut options = EnvOpenOptions::new();
options.map_size(10 * 1024 * 1024); // 10 MB
let index = Index::new(options, &path).unwrap();

let mut wtxn = index.write_txn().unwrap();
let content = documents!([
    {
        "id": 2,
        "title": "Prideand Prejudice",
        "author": "Jane Austin",
        "genre": "romance",
        "price$": "3.5$",
    },
    {
        "id": 456,
        "title": "Le Petit Prince",
        "author": "Antoine de Saint-Exupéry",
        "genre": "adventure",
        "price$": "10.0$",
    },
    {
        "id": 1,
        "title": "Wonderland",
        "author": "Lewis Carroll",
        "genre": "fantasy",
        "price$": "25.99$",
    },
    {
        "id": 4,
        "title": "Harry Potter ing fantasy\0lood Prince",
        "author": "J. K. Rowling",
        "genre": "fantasy\0",
    },
]);

let config = IndexerConfig::default();
let indexing_config = IndexDocumentsConfig::default();
let mut builder =
    IndexDocuments::new(&mut wtxn, &index, &config, indexing_config.clone(), |_| ())
        .unwrap();
builder.add_documents(content).unwrap();
builder.execute().unwrap();
wtxn.commit().unwrap();


// You can search in the index now!
let rtxn = index.read_txn().unwrap();
let mut search = Search::new(&rtxn, &index);
search.query("horry");
search.limit(10);

let result = search.execute().unwrap();
assert_eq!(result.documents_ids.len(), 1);
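
The snippet above assumes roughly the following imports; the exact module paths are assumptions and can differ between milli versions, so check cargo doc --open for the version you depend on.

// Hedged sketch of the imports assumed by the example above; the paths
// may differ between milli versions.
use milli::heed::EnvOpenOptions; // heed is re-exported by milli
use milli::update::{IndexDocuments, IndexDocumentsConfig, IndexerConfig};
use milli::{Index, Search};
// The documents! macro used above is a milli helper that builds a batch
// of documents from inline JSON-like syntax.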

Contributing

We're glad you're thinking about contributing to this repository! Feel free to pick an issue, and to ask any question you need. Some points might not be clear and we are available to help you!

Also, we recommend following the CONTRIBUTING.md to create your PR.


milli's Issues

Improve the A* based best proximity function

I spotted multiple problems that force the engine to use more CPU than necessary:

  1. Sometimes the paths between words are composed of repeated, equal word positions (e.g. 1, 3, 1, 3). It is not possible to find documents containing different words at the same positions, so computing the intersections of these paths is unnecessary. To avoid generating these paths we should be aware of the complete path every time we generate successors.

  2. We should implement a custom A* that keeps the state of the paths it already explored when finding the previous, shorter proximities; this way we avoid recomputing them.

  3. When we want to find the next word with a proximity of 8 we should search in the word_attribute_docids database and return those documents.

Maybe doing these improvements could make the engine faster for some of these queries:

  • "the best of the do" or "is this a j" on the 5.8 millions wikipedia abstracts.

Improve the indexing disk usage

When indexing a lot of documents the disk usage can be very high. This is likely because all the threads index at the same time with the exact same parameters, and because each thread reads 1/jobs of the CSV lines the workload is evenly spread across the threads.

This becomes a problem when the MTBL max-memory sorter limit is reached at the same time on every thread: they all start writing to disk at the same time and are slowed down because the disk throughput is quickly saturated (e.g. when 16 threads write at once).

We could maybe just randomize the max-memory and linked-hash-map-size parameters of each thread to avoid this unfortunate behavior.

Introduce a threshold to the asc/desc sorting criteria

The current Asc/Desc criteria choose between two ways of doing the job:

  1. If the number of candidates is lower or equal to 1000, it does a simple sort.
  2. If the number of candidates is bigger than 1000, it uses the facet database.

This number (1000) would be better expressed as a percentage of the total number of documents, something like 0.1%. This way, we ensure that the best technique is used even as the database grows. #48 helps in retrieving the number of documents and therefore speeds up computing such a percentage.
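
As a hedged sketch, the check could look like the following; the total_documents count and the floor value are assumptions for illustration, not the actual milli code.

// Decide between the simple sort and the facet database, using a
// percentage of the total number of documents instead of a constant.
fn use_simple_sort(candidates: u64, total_documents: u64) -> bool {
    // 0.1% of the documents in the index; the floor keeps tiny databases
    // on the simple-sort path (matching today's constant of 1000).
    let threshold = ((total_documents as f64) * 0.001).max(1_000.0) as u64;
    candidates <= threshold
}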

Support multiple indexes

Supporting multiple indexes in the same library would give more power to the user, and it would make things easier for the HTTP interface if the library supported this out of the box. This feature would provide index swapping out of the box, as well as dynamic creation and deletion of indexes.

Supporting this feature forces each index to have a dedicated update store. The update stores are notified by using a crossbeam-channel Sender, which means that we must keep a reference to each UpdateStore in memory (e.g. a HashMap wrapped in an RwLock). This HashMap would be populated with the Index and the UpdateStore.

When we swap two indexes we must make sure that this change is propagated to disk (LMDB) and in this HashMap.
We must also make sure that users do not keep the Index struct and the index's UpdateStore alive (i.e. no Arc) and instead give them a reference to it (i.e. an RwLockReadGuard). About that, the parking_lot RwLockReadGuard API provides a map method that can be useful to map the RwLock<HashMap> to a given entry in it, as sketched below.
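
A minimal sketch of that registry, assuming the parking_lot crate and placeholder Index/UpdateStore types (not milli's real ones):

use std::collections::HashMap;
use std::sync::Arc;

use parking_lot::{MappedRwLockReadGuard, RwLock, RwLockReadGuard};

// Placeholder types standing in for milli's Index and the update store.
struct Index;
struct UpdateStore;

struct IndexEntry {
    index: Index,
    updates: UpdateStore,
}

// The shared registry described above, one entry per index name.
type Indexes = Arc<RwLock<HashMap<String, IndexEntry>>>;

// Hand out a mapped read guard instead of an Arc, so the entry cannot
// outlive the registry lock.
fn get_index<'a>(
    indexes: &'a RwLock<HashMap<String, IndexEntry>>,
    name: &str,
) -> Option<MappedRwLockReadGuard<'a, IndexEntry>> {
    RwLockReadGuard::try_map(indexes.read(), |map| map.get(name)).ok()
}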

Change master branch to main

Let's be allies and make this change that means a lot.

Here is a blog post that explains a little more why it's important, and how to easily do it. It will be a bit more complicated with automation, but we should still do it!

Ant Colony Optimization as the word proximity algorithm

I tried to use an Ant Colony Optimization algorithm to retrieve the best path between query words, as it is very difficult to find results even for simple queries with an A*. Querying "back in the day hip hip" on the 109m songs dataset killed the engine: the A* searched in too many directions and stored too many potential paths, OOMing the engine.

A* is a deterministic algorithm that will always answer with the shortest path, so when it comes to potentially complex queries with a lot of possibilities it is impossible to find an answer in a bounded amount of time.

So I tried to use a non-deterministic algorithm, something able to find a not-so-bad solution first and improve it over time, until the maximum amount of time has been reached or no better solution can be found.

I also found the multi-colony Ant Colony Optimization algorithm, which is a way to make the algorithm search for a solution over different criteria, while being able to define that one criterion matters more than another.

Specify the rayon ThreadPool to the UpdateBuilder

Currently the ThreadPool is created inside the IndexDocuments struct, based on the given number of jobs.

https://github.com/Kerollmops/milli/blob/e3de600108026d6be2c2362250713f584ec76626/src/update/index_documents/mod.rs#L326-L327

This is not an appropriate way to manage the number of threads used to index documents or to do anything else with the specified number of jobs. It can be a problem when multiple update stores in different threads are triggered at the same time for different indexes, as more threads than the requested number will be used.

The best way to get rid of this problem would be to pass the ThreadPool to the UpdateBuilder struct (by reference); we could set up this pool in a OnceCell or something similar, initialized at program startup and used concurrently by all the indexes' update stores.
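
A hedged sketch of that setup, assuming the once_cell and rayon crates (the pool size is an arbitrary placeholder):

use once_cell::sync::OnceCell;
use rayon::ThreadPool;

// One indexing pool shared by every update store, initialized once at
// program startup.
static INDEXING_POOL: OnceCell<ThreadPool> = OnceCell::new();

fn indexing_pool() -> &'static ThreadPool {
    INDEXING_POOL.get_or_init(|| {
        rayon::ThreadPoolBuilder::new()
            .num_threads(4) // placeholder: would come from the configuration
            .build()
            .expect("failed to build the indexing thread pool")
    })
}

// An UpdateBuilder could then borrow this pool and run the indexing
// inside it with indexing_pool().install(|| { /* index documents */ }).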

Support bigger proximities

The current search algorithm only supports proximities up to 7 between two query words. It means that if a user searches for a document where the query words are scattered across different attributes, or where the query words are in the same attribute but the proximity between two of them is over 7, then the document will not be found.

This is an optimisation we put in place to be able to benchmark other parts of the engine. Now that we have greatly improved those (these are not the final performances, a program always improves) we can enable this feature again and be able to find documents where two query words are at a proximity higher than 7.

How we compute the proximities less than or equal to 7

To do so we should change the way we discover new paths. At indexing time we stored, for every word and every position where it appears in the documents, the related document ids (e.g. "shoes" in position 12 appears in the docs 321, 432, 531 and 673). When we want to retrieve the next node (word) of a path we take the position of the current node and look whether the next word appears at interesting positions (e.g. "James" in position 12 and "bond" in position 13 give a proximity of 1).

We also added an optimisation to this algorithm: before we run the A* that searches for the best path, we pre-compute the documents that contain all the query words in the same attribute. When we explore the paths graph we therefore check that each path edge (documents between two words at given positions) contains the candidate documents we pre-computed before. When paths are returned from the A* run we remove the documents found from these pre-computed candidate lists and run the A* again for a bigger proximity. Doing so allows us to avoid exploring paths where no documents can be found (because they don't contain all the query words).

Why we can't use this technique with proximities of 8

This is a good technique to discover proximities below 8, but for proximities of 8 or more you need to look at all the other words in the document. Words that give a proximity of 8 can be at a distance of 8 or more (it is clamped) or in two different attributes. As you probably already guessed, there are too many positions to test; computing this is too CPU intensive.

A better solution can be found

When we are searching for proximities of 8 between two words, it means that we have already found all proximities of 7 or less for those two words, so I think we can compute the intersection of the attributes of both words regardless of the positions (we store the list of document ids that contain a word in a given attribute) and remove the documents that were already returned by previous A* runs.
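
A hedged sketch of that computation, assuming the roaring crate and attribute-level docids bitmaps for both words:

use roaring::RoaringBitmap;

// For a proximity of 8 between two words: intersect the attribute-level
// docids of both words (positions are ignored) and subtract everything
// the previous, smaller-proximity runs already returned.
fn proximity_8_docids(
    left_attr_docids: &RoaringBitmap,
    right_attr_docids: &RoaringBitmap,
    already_returned: &RoaringBitmap,
) -> RoaringBitmap {
    (left_attr_docids & right_attr_docids) - already_returned
}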

Keeping the attribute of the relation between two nodes (words) of the returned path is important for the future Attribute/Word criterion.

Introduce prefix caching for small query words

  • Make the Query Tree work with the prefix caches.
  • Introduce an fst Set that indicates the prefixes/words that are cached.
  • Introduce a database for cached prefixes that gives the corresponding docids.
  • Introduce a database for proximities between a whole word and a cached prefix.
  • Make the ranking rules work with the prefix caches:
    • Proximity.
    • Attributes.

Indexing "bug" while writing grenad chunks

Grenad is an append-only key-value store, which means that it can only create key-value pairs by sorting the keys first and merging the conflicting ones; these key-value files are also called sorted-string tables (SSTables).

The library also provides sorters: a data structure that allows the user to insert keys in any order. Grenad sorts the key-values in memory, and the user can define the maximum amount of memory to use; once the sorter reaches this limit, it sorts and merges the entries in memory and then dumps them to disk.

The grenad library, and more specifically the sorter data structure, is used in the indexing system. Keys and values are given to the sorter, and the sorter is limited to the max memory specified at program startup, divided by the number of threads dedicated to the indexing task and then divided by the number of sorters in each of the threads.

This turns out to be a problem: sometimes the sorter data structure "blocks" the indexing system by trying to dump one (1) entry to disk. This is because we use RoaringBitmap and the max memory of the sorter can only accept one value in memory. RoaringBitmap sometimes needs to represent u32s with multiple containers that take up a lot of space, so this single serialized value triggers the chunk disk dumping.
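
A tiny sketch of the budget split described above; the names and the numbers are illustrative, not the exact milli code.

// The per-sorter memory budget: the global limit divided by the number
// of indexing threads, then by the number of sorters in each thread.
fn per_sorter_memory(max_memory: usize, threads: usize, sorters_per_thread: usize) -> usize {
    max_memory / threads / sorters_per_thread
}

// e.g. with a 2 GiB budget, 16 threads and 8 sorters per thread, each
// sorter only gets 16 MiB, which one large serialized RoaringBitmap
// entry can exceed on its own.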

There are a few possible solutions to fix that:

  • Increase the amount of memory allocated to the sorters by either:
    • Reducing the number of threads used for indexing; this gives more memory to the individual sorters of each thread.
    • Increasing the max memory allocated to the program itself.
  • Improve and reduce the size of the RoaringBitmap serialization by supporting run containers.
  • Detect in the sorter datastructure the fact that we are trying to merge and write one (1) entry and do something (but what?).
  • Share the same allocator for all of the threads and grenad sorters, combined with an allocator wrapper it would help balance memory usage.
  • We could also use something like a memmapped Vec, like zkp-mmap-vec but with resize features.
  • We could also move the data that is too big into another file, it would work well with an enum that specifies where the data is located and oversized data is moved into another place (like a file).

Implement synonyms

Implement synonyms to make transplant iso with the current version of MeiliSearch

July 3 2020 brainstorm digest

We've done a little bit of research on the search algorithm and more specifically the search query "the best of the do".

The conclusion was that we must compute the intersections between the word-position document ids.
It was found that computing the proximities in the order of the query (i.e. "the" ⋂ "best" ⋂ "of" ⋂ "the" ⋂ "do") is more CPU-intensive than doing it in inverse order of word popularity (i.e. "do" ⋂ "best" ⋂ "of" ⋂ "the" ⋂ "the"), as sketched below.

Here are the words popularities:

  • "best" 81 values between 1000 and 3005
  • "do" 142 values between 1000 and 3016
  • "the" 403 values between 1000 and 3033
  • "of" 483 values between 1000 and 3022

What's the problem

I would like to find the best shortest path to move from the first word position to the last, in order.
Each word appears at given positions across the whole dataset; we group the words by the positions they appear at.

For example "william" appears at 7 different positions in all the documents, "cowboy" at 5 positions and "the" is one of the very popular words in the dataset and therefore appears at many different positions.

william   0 12 13 14 16 18 32
the       1  2  5  6  8  9 11 ...
cowboy    2  5 21 23 32

When a user searches for "william the cowboy" we want to return the documents containing the words, ordered by proximity. Proximity is computed based on the order of the query words and the order in which they are found in the documents.

For example the best proximity that can exist is "william" in position 0 followed by "the" in position 1 and "cowboy" in position 2, which gives a proximity of (1 - 0) + (2 - 1) = 2. But as you probably already guessed, we don't know whether documents exist with the given words at these positions; we must check that.

This is why we also store the documents ids for each word and position.

william in 0   156 674 764 793 798 801
the     in 1   112 156 256 325 421 674 798 801 ...
cowboy  in 2   112 201 674 798 801

If we follow the path 0-1-2 we find out that there are only 3 documents containing those words at those exact positions: documents 674, 798 and 801. We obtained this by doing an intersection between the positions and words of the shortest paths found.

We must find the next shortest path because there aren't enough documents to return to the user.
We would also like to be able to combine multiple criteria: more than just the global proximity of the path, but also the words that compose it (to be aware of the number of typos for this path).

Current implementation

We chose to use an A* to find the shortest path between word positions; a node represents a move from one layer to the next, with the parent position and the accumulated proximity. Only successors that produce a valid path (i.e. contain documents) are explored.

To be able to detect valid and invalid paths we need to compute the intersection between the parent position and the position we are testing at the layer we are. We obviously cache most of the work we do for every request.

  • A* gives us all of the shortest paths with the same proximity.
  • We retrieve the documents corresponding to these paths.
  • If there aren't enough documents we run the first step again but with a bigger proximity.

The path discovery problem

I'm not sure if we can find a way to keep the explored graph of the previous searches for proximity.

The misconceptions of path exploration

Intersect between less popular words first

We should find a way to know whether a path is valid by computing the intersection between the least popular words of the query first; this would allow us to restrict the candidate documents faster.

We could improve that by finding a better algorithm that is aware of the word positions in advance, which would allow the path validator to use the least popular words to invalidate paths more efficiently.

william 83
the     93
cowboy  39

Compute the documents ids before searching for the best paths

Another solution would also be to compute the documents that we should find before searching for the paths.

We have to store the document ids under all the words and their attributes. When we are searching for the best path we first compute the documents that we must find in the attribute we are searching in. We then intersect this final document ids list with every list of document ids we meet.

william in 0   101 156 256 589 764 793 798 801 ...
william in 1   112 135 256 325 421 635 925 ...
william in 2   100 201 674 798 801 ...

The advantage of this approach is that it reduces the number of documents we will try to find. We must obviously update the candidate list by removing from it the document ids we find while diving through the proximities; this way we can terminate early when no more documents can be found.

Smarter intersections between words positions

When we search for the best path to find documents we intersect the document ids of two word positions to be sure there are documents in common between them.
The current implementation is not smart enough and can validate that two word positions have documents in common even if those documents are not the same as those of the previous word position.

For example there can be documents 1, 2 and 3 between the first and second word positions but documents 4, 5 and 6 between the second and third word positions. The solution would be to keep the intersection computed for the previous word positions and intersect it with the next word positions. This way we are sure that the documents stay in common all along the way to the last word.

Support the other criteria

The criteria

The current implementation is really fast at finding documents with the best proximity, but this is only one of the multiple criteria.
Moreover, all criteria will have to interact with the query tree, and the proximity criterion will need some refactoring.

Definition

This part explains the purpose of each criterion:

  • the behavior when the criterion is present in ranking rules
  • the behavior when the criterion is absent from ranking rules

Typo

This criterion is responsible for fetching the documents which contain fewer typos.

present

the criterion will sort and group candidates by the number of typos in ascending order.

Priority behavior:

  1. Candidates matching 0 typos over all the query (exact words)
  2. Candidates matching 1 typo over all the query (1 word with 1 typo)
  3. Candidates matching 2 typos over all the query (max 1 word with 2 typos or 2 words with 1 typo)
  4. ...

absent

Candidates matching exact words and Candidates matching words with typos are considered equally relevant.

Words

This criterion will rank higher the documents containing more words of the query.
This is only available when the optionalWords setting is set. (?) (this should probably be moved to the query tree scope)

present

This criterion will sort and group candidates by the number of query words matching the candidates, in descending order,
ignoring words in the query from the last to the first.

for the query "hello word 2021" the priority behavior should be:

  1. Candidates containing "hello", "world" and "2021" (all the world of the query)
  2. Candidates containing "hello" and "world" (last word ignored)
  3. Candidates containing "hello"

absent

Candidates containing all the query words and Candidates containing some of the query words are considered equally relevant.

Proximity

This criterion is the one that has been implemented in this engine, it will rank higher documents with the query words near to each other.

present

This criterion will sort and group candidates by the distance between the matching words compared to the query words.

for the query "hello world 2021" the priority behavior should be:

  1. Candidates containing "hello", "world" and "2021" consecutively.
  2. Candidates containing 2 words consecutively and 1 word with a distance of 2 from the group for a global distance of 3 ("hello big world 2021": "world" and "2021" are consecutive but "hello" has a distance of 2 with "world", the global distance is 3 [1 + 2])
  3. Candidates matching the query words with a global distance of 4
  4. ...

absent

Candidates matching query words consecutively and Candidates matching query words at any distance are considered equally relevant.

Attribute

This criterion ranks a document higher when its best match of a query word is in a more important attribute than another document's match.

present

This criterion will sort and group candidates by the attributes matching query words from the first attribute to the last.

Having documents with attributes "title" -> "body" -> "footer", the priority behavior should be:

  1. Candidates matching all the query in the "title"
  2. Candidates matching a part of the query in the "title"
  3. Candidates matching all the query in the "body"
  4. Candidates matching a part of the query in the "body"
  5. Candidates matching all the query in the "footer"
  6. Candidates matching a part of the query in the "footer"

absent

Candidates matching the query in the higher attribute and Candidates matching the query in any attribute are considered equally relevant.

Exact

Documents with words that match query terms exactly are ranked higher. An exact match occurs when a full word in a query is matched without typos to a word in an attribute.

It is also this criterion that is responsible for moving up documents that contain the query word in a field in its exact form (only when there is one word).

present

the priority behavior should be:

  1. Candidates containing an attribute matching exactly the original query and only the original query
  2. Candidates containing a sequence of words matching exactly the original query
  3. any other match

absent

Candidates containing an attribute matching exactly the original query and only the original query and Candidates containing query words anywhere are considered equally relevant.

Custom

This criterion ranks documents based on a custom field; there can be multiple of them, but they cannot be placed between any of the previous textual criteria.

Support reordering the criteria

The real problem is to allow users to reorder the default list of criteria, or even to use a list without one or more of the criteria.
We should find a way to make the criteria generic, maybe by using a trait. An important thing to keep in mind is that some criteria are contextual, meaning that their behavior depends on the position of the other criteria.

Possible interactions with the query tree

Typo

Typo could modify the query by changing the Tolerant leaves.

# [...]
OR
    AND
        Tolerant { dfa_2: "hello"}
        Tolerant { dfa_2: "world" }
        Tolerant { dfa_1: "2021" }
    END-AND
# [...]
  1. 0 typos:
# [...]
OR
    AND
        Exact { word: "hello" }
        Exact { word: "world" }
        Exact { word: "2021" }
    END-AND
# [...]
  2. 1 typo:
# [...]
# simple approach
OR
    AND
        Tolerant { dfa_1: "hello"}
        Tolerant { dfa_1: "world" }
        Tolerant { dfa_1: "2021" }
    END-AND
# [...]
# [...]
# accurate approach (maybe too much)
OR
    AND
        Tolerant { dfa_1: "hello"}
        Exact { word: "world" }
        Exact { word: "2021" }
    END-AND
    AND
        Exact { word: "hello" }
        Tolerant { dfa_1: "world" }
        Exact { word: "2021" }
    END-AND
    AND
        Exact { word: "hello" }
        Exact { word: "world" }
        Tolerant { dfa_1: "2021" }
    END-AND
# [...]

Word

Word could modify the query by eliding branches in the tree.

OR
    AND
        Tolerant { dfa: "hello"}
        Tolerant { dfa: "world" }
        Tolerant { dfa: "2021" }
    END-AND
    AND
        Tolerant { dfa: "hello"}
        Tolerant { dfa: "world" }
    END-AND
    AND
        Tolerant { dfa: "hello"}
    END-AND
 END-OR
# second and third branches elided
OR
    AND
        Tolerant { dfa: "hello"}
        Tolerant { dfa: "world" }
        Tolerant { dfa: "2021" }
    END-AND
 END-OR
# first and third branches elided
OR
    AND
        Tolerant { dfa: "hello"}
        Tolerant { dfa: "world" }
    END-AND
 END-OR

Proximity

Proximity should not modify the query tree but would have a specific interaction with it.

This Criterion needs to fetch all the candidates containing words that match with a certain distance.

note: To compute this distance the algorithm iterates over the query words in a window of 2 ("hello world 2021" = "hello world" -> "world 2021") and sums the minimal distance of each pair.

AND branch

The algorithm should consider sub-branches as words and iterate over them in a window of 2, summing the minimal distance of each pair.

CONSECUTIVE branch

As with the AND branch, the algorithm should consider sub-branches as words, but should only retain candidates for which all the pairs are at a distance of 1.

OR branch

The algorithm should consider sub-branches as alternative words in the same position and aggregate them into one meta-word for the parent branch.

Result of branches

Each branch resolves to: all the first sub-branch words as the left-hand side, all the last sub-branch words as the right-hand side, the candidates, and the distance.

AND
    CONSECUTIVE
        Exact { word: "hello" }
        Exact { word: "world" }
    END-CONSECUTIVE
    OR
        Tolerant { dfa: "2021" }
        Exact {word: "twenties"}
    END-OR
END-AND

step 1: compute leaves

AND
    CONSECUTIVE
        { left: "hello", right: "hello", additional_distance: 0, candidates: C1 }
        { left: "world", right: "world", additional_distance: 0, candidates: C2 }
    END-CONSECUTIVE
    OR
        { left: dfa(2021), right: dfa(2021), additional_distance: 0, candidates: C3 }
        { left: "twenties", right: "twenties", additional_distance: 0, candidates: C4 }
    END-OR
END-AND

step 2: compute CONSECUTIVE sub-branch

AND
    # CONSECUTIVE
    { left: "hello", right: "world", additional_distance: 0, candidates: dist_0(C1, C2) }
    OR
        { left: dfa(2021), right: dfa(2021), additional_distance: 0, candidates: C3 }
        { left: "twenties", right: "twenties", additional_distance: 0, candidates: C4 }
    END-OR
END-AND

step 3: compute OR sub-branch

AND
    # CONSECUTIVE
    { left: "hello", right: "world", additional_distance: 0, candidates: dist_0(C1, C2) }
    # OR
    { left: dfa(2021) | "twenties", right: dfa(2021) | "twenties", additional_distance: 0, candidates: C3 | C4 }
END-AND

step 4: compute AND branch with the minimal distance, let's say 2, using the right field of the first branch with the left field of the second.

# CONSECUTIVE AND OR
{
    left: "hello",
    right: dfa(2021) | "twenties",
    additional_distance: 2,
    candidates: dist_2(dist_0(C1, C2), C3 | C4)
}  

Attributes

As with Proximity, Attributes should not modify the query tree but would have a specific interaction with it.

The algorithm will give a weight to each word depending on its position in the document: a word in the first attribute at the first position will have the minimal weight, and a word in the last attribute at the last position will have the maximal weight.

AND/CONSECUTIVE branch

The algorithm should compute the mean of the weights of the sub-branches and intersect their candidates.

OR branch

The algorithm should keep the minimal weight of the sub-branches and keep the candidates of the minimal sub-branch.
If several sub-branches are minimal, the algorithm keeps the union of the candidates of these minimal sub-branches, as sketched below.
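
A hedged sketch of these two combination rules, assuming the roaring crate and a simple branch representation (not milli's real types):

use roaring::RoaringBitmap;

struct Branch {
    weight: f64,
    candidates: RoaringBitmap,
}

// AND / CONSECUTIVE: mean of the sub-branch weights, intersection of
// their candidates.
fn and_branch(branches: Vec<Branch>) -> Branch {
    let weight = branches.iter().map(|b| b.weight).sum::<f64>() / branches.len() as f64;
    let mut iter = branches.into_iter();
    let first = iter.next().expect("an AND branch has at least one sub-branch");
    let candidates = iter.fold(first.candidates, |acc, b| acc & b.candidates);
    Branch { weight, candidates }
}

// OR: keep the minimal weight and the union of the candidates of the
// minimal sub-branches.
fn or_branch(branches: Vec<Branch>) -> Branch {
    let weight = branches.iter().map(|b| b.weight).fold(f64::INFINITY, f64::min);
    let candidates = branches
        .iter()
        .filter(|b| b.weight == weight)
        .fold(RoaringBitmap::new(), |acc, b| acc | &b.candidates);
    Branch { weight, candidates }
}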

AND
    CONSECUTIVE
        Exact { word: "hello" }
        Exact { word: "world" }
    END-CONSECUTIVE
    OR
        Tolerant { dfa: "2021" }
        Exact {word: "twenties"}
    END-OR
END-AND

step 1: compute leaves

AND
    CONSECUTIVE
        # "hello"
        { weight: W1, candidates: C1 }
        # "world"
        { weight: W2, candidates: C2 }
    END-CONSECUTIVE
    OR
        # ~"2021"
        { weight: W3, candidates: C3 }
        # "twenties"
        { weight: W4, candidates: C4 }
    END-OR
END-AND

step 2: compute CONSECUTIVE (means)

AND
    # "hello world"
    { weight: (W1+W2)/2, candidates: C1&C2 }
    OR
        # ~"2021"
        { weight: W3, candidates: C3 }
        # "twenties"
        { weight: W4, candidates: C4 }
    END-OR
END-AND

step 3: compute OR (min) for W4 < W3

AND
    # "hello"&"world"
    { weight: (W1+W2)/2, candidates: C1&C2 }
    # "twenties"
    { weight: W4, candidates: C4 }
END-AND

step 3 bis: compute OR (min) for W4 = W3

AND
    # "hello"&"world"
    { weight: (W1+W2)/2, candidates: C1&C2 }
    # "2021"|"twenties"
    { weight: W3, candidates: C3|C4 }
END-AND

step 4: compute AND (means) for W4 < W3

# "hello"&"world"&"twenties"
{
    weight: ((W1+W2)/2 + W4)/2,
    candidates: C1&C2&C4
}

step 4 bis: compute AND (means) for W4 = W3

# "hello"&"world"&("2021"|"twenties")
{
    weight: ((W1+W2)/2 + W3)/2,
    candidates: C1&C2&(C3|C4)
}

Concerns about primary key behaviour

While I was writing tests for transplant, I was testing the following:

  • create an index with primary key primary
  • add a document, and set primary key to primaru (testing that this is ignored):
{"primaru": 1, "text": "tata"}

I was expecting that there would be an error, since I have set the primary key to some value, and handed a document that doesn't contain this field. Yet this seems perfectly acceptable. Furthermore, when querying for the document back, I get:

[
  {
    "primaru": 1,
    "text": "tata"
  }
]

So there is no notion of the primary key. The document was given an id, but I have no way of knowing which one. This is concerning, since as a user I specifically requested that a primary key be used, and yet it is ignored.

Here is what I think should be the behaviour:

  • If the primary key is not set, and the engine can infer it, then it is set to the inferred value.
  • If it can't be inferred, the engine creates an id field for each document and uses it as the primary key. This field should be added to all documents and retrieved when doing a GET documents.
  • If the primary key is set, either manually or by any of the previous means, then a document that lacks this primary key causes the addition to fail with an error.

Highlight corner case

The current version of the engine only returns the words found in the different documents, and it is up to the caller to highlight the document's words depending on the types found.

The problem is that the engine, when indexing, transforms the numbers into strings to be able to understand their meaning, and returns to the caller a string that corresponds to the original number. The question is therefore whether the caller should transform the numbers into strings in order to surround them with the mark tag.

The other, complementary solution would be to not highlight them and not apply any transformation to the original document, but in the meantime return the strings to highlight to the client.

Removing a facet on a field has no effect

It seems that I didn't develop the code to remove a facet from the list of faceted fields; currently, removing a facet does nothing.
We must handle this and add some tests to check that it behaves correctly.

Redesign the indexer

The current indexer uses MTBL to be able to index a CSV part on a single thread without any synchronization required.

Once all the CSV parts have been indexed in parallel, a merge function is used to aggregate every entries into one final LMDB database.

This is a good design, but it could be even better to get rid of MTBL and use heed instead. The advantage of MTBL is that it is possible to use a temporary file which will be removed when dropped, but also if the program crashes (the OS manages that itself), which is pretty common when OOMing.

There is a way to create an LMDB database with only one .mdb file, no .lock file. Therefore it should be possible to give it the temporary file where it needs to write.

Doing so allows us to remove one dependency and would simplify the indexing process, I believe. Another advantage is that the temporary .mdb files would be fully valid databases, which can be great for the future of the library.

Implement stop words

Implement stop words to make transplant iso with the current version of MeiliSearch

Use the log crate ecosystem instead of the stderr

We currently use the standard error output and the eprintln macro to log things, which is not the best way to do that.
It is complex to disable this logging when benchmarking the engine, and sometimes the logging can take up to 29% of the time spent.

Simplify the indexing system

I would like to get rid of the current indexing system by making it single-threaded for now. The multithreaded indexing algorithm is probably fast, but it brings too much complexity that I would prefer to avoid at this early stage.

We will only keep the ArcCache on top of LMDB for now, and only one thread will do the indexing job. No more hashing of the tokens to spread the indexing load across threads.

Speedup indexing by not using RoaringBitmaps

When we are indexing documents we always insert new ids under the lists of words or facets; those id lists are in fact RoaringBitmaps, which are fast and compact lists of integers.

However, if the amount of data in memory reaches a certain threshold we dump it to disk and merge those entries (where the values are RoaringBitmaps) when needed. The problem is that the RoaringBitmap serialization format is a little bit costly and forces us to deserialize it into memory before being able to merge both data structures.

I propose that we don't use RoaringBitmaps for the data structures that will be merged many times and must be deserialized for that to happen, and instead use simple lists of integers (native-endian encoded) where we can just push the new integers at the end. Note that those integers are the internal document ids and will each be seen only once during the indexing process.
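
A hedged sketch of such values: appending is just pushing four bytes, and merging serialized values becomes a plain concatenation, with no RoaringBitmap deserialization involved.

// Append one internal document id to a serialized value.
fn push_docid(buffer: &mut Vec<u8>, docid: u32) {
    buffer.extend_from_slice(&docid.to_ne_bytes());
}

// Merge several serialized values: a simple concatenation.
fn merge_docid_lists(values: &[&[u8]]) -> Vec<u8> {
    values.concat()
}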

Move the heed Env into the Index struct

We should move the heed Env into the Index and expose the read_txn and write_txn methods.
This way we will be able to close an index when we close and drop the Env that is stored inside.

But first we must update heed to use the changes in meilisearch/heed#64.

Improve documents highlighting

The current engine doesn't highlight the query words found in the documents; this is due to the fact that it doesn't even know from where, in terms of bytes, the query words were extracted in the documents.

In the previous engine we kept the positions of the words as byte offsets and lengths, and it turned out to be one big source of problems when it came to highlighting the document: many out-of-bounds errors, etc. This was also a big part of what was stored on disk, maybe 30% of the database size.

So I thought: why not get rid of this raw information and use the query words the user entered to reconstruct a highlighted document from the stored document?

We can highlight a document by extracting its content, using the tokenizer to find the words it contains and highlighting those that match. It could be a little bit harder when it comes to highlighting multi-word synonyms, but maybe we could find a solution later.

Asc/Desc criteria must return the remaining candidates that don't have the specified field

The current implementation of the Ascending and Descending criteria does correctly return the candidates in order, but there is one little flaw: candidates that aren't faceted under the asc/desc field are never returned, and they should be.

We must be sure that, once the iterator has reached the end of the facet values and no more candidates can come from it, the remaining candidates are returned.

Improve the words pairs proximities document ids serialization algorithm

We have found that most of the space taken by the database is used by the word pair proximity docids database: 22 GB are used by the keys (w1, w2, prox) and 63 GB by the values (RoaringBitmaps).

I computed some statistics about this word pair database and found out that 75% of the word pair proximities contain only 2 document ids or fewer and that 90% contain fewer than 6 document ids. The problem is really related to the RoaringBitmap serialization algorithm, which uses four u32s as the serialized header.

More than 75% of the values contain a header that is at least twice the size of the payload (the document ids).

The solution would be to write the document ids (u32) one after the other instead of using the RoaringBitmap serialization algorithm when there are no more than 4 integers (the size of the roaring header). When deserializing, a simple condition says that if the serialized data is less than or equal to 4 integers long, it is not a RoaringBitmap.
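
A hedged sketch of that encoding, assuming the roaring crate (illustrative only, not the exact milli code):

use std::io::{self, Write};
use std::mem::size_of;

use roaring::RoaringBitmap;

// The RoaringBitmap header size discussed above: four u32s.
const THRESHOLD: usize = 4;

// Write raw native-endian u32s when there are at most 4 ids, and fall
// back to the regular RoaringBitmap serialization otherwise.
fn serialize_docids<W: Write>(docids: &RoaringBitmap, mut writer: W) -> io::Result<()> {
    if docids.len() <= THRESHOLD as u64 {
        for id in docids {
            writer.write_all(&id.to_ne_bytes())?;
        }
        Ok(())
    } else {
        docids.serialize_into(writer)
    }
}

// A serialized RoaringBitmap is always bigger than 4 u32s, so the byte
// length alone tells us which format we are looking at.
fn deserialize_docids(bytes: &[u8]) -> io::Result<RoaringBitmap> {
    if bytes.len() <= THRESHOLD * size_of::<u32>() {
        // Small payload: raw native-endian u32s.
        let docids = bytes
            .chunks_exact(size_of::<u32>())
            .map(|chunk| u32::from_ne_bytes(chunk.try_into().unwrap()))
            .collect();
        Ok(docids)
    } else {
        RoaringBitmap::deserialize_from(bytes)
    }
}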

words pairs proximities stats
    number of proximity pairs: 1568490098
    first quartile: 1
    median: 1
    third quartile: 2
    90th percentile: 5
    minimum: 1
    maximum: 5032102

Improve the DFS algorithm

@LegendreM found that the DFS algorithm, which is used to explore the proximity graph, does not find the best answers, i.e. answers with the shortest proximity between words.

The current algorithm explores the pair proximities as follows:

soup of   of the   the day   proximity
   1         1        1          3
   1         1        2          4
   1         1        3          5
   1         1        4          6
   1         1        5          7
   1         1        6          8
   1         2        1          4
   1         2        2          5
   1         2        3          6

As you can see, the DFS algorithm does not yield the proximity sums in order: once the highest proximity of the last pair of words is reached, the second-to-last pair of words is incremented. This is not what we expect; we must return proximities in order.

We found a way to specialize the original DFS algorithm and make sure it only visits a subset of the possible solutions at a given time. We use a parameter that we named "mana", which indicates the highest allowed proximity to visit for the next pair of words; each pair of words reduces the "mana" by the proximity it finds, and the remaining "mana" is given to the next pair of words to explore.

Let's say we have 4 "mana" available to find the best proximities between 3 pairs of words (4 words):

  1. The first pair consumes 2 "mana" and gives 2 to the next pairs.
  2. The second pair consumes 1 "mana" and gives 1 to the next and last pair.
  3. The last pair is forced to use the remaining "mana".

These steps give us a final path whose pair proximities sum to 4, allowing us to fetch the documents associated with those pairs and proximities.
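
A hedged sketch of the enumeration implied by this "mana" parameter: the proximity assignments for the pairs are produced in increasing order of their sum (the function names are illustrative).

// Enumerate the proximity assignments for `pairs` pairs of words in
// increasing order of their total ("mana") instead of plain DFS order.
fn mana_assignments(pairs: usize, max_proximity: u32) -> Vec<Vec<u32>> {
    let mut all = Vec::new();
    // The smallest possible total is one "mana" per pair.
    for mana in pairs as u32..=pairs as u32 * max_proximity {
        let mut current = Vec::with_capacity(pairs);
        assign(mana, pairs, max_proximity, &mut current, &mut all);
    }
    all
}

fn assign(mana: u32, remaining: usize, max: u32, current: &mut Vec<u32>, out: &mut Vec<Vec<u32>>) {
    if remaining == 1 {
        // The last pair is forced to consume all the remaining mana.
        if (1..=max).contains(&mana) {
            current.push(mana);
            out.push(current.clone());
            current.pop();
        }
        return;
    }
    for proximity in 1..=max.min(mana) {
        let rest = mana - proximity;
        // Only recurse when the following pairs can still consume `rest`.
        if rest >= remaining as u32 - 1 {
            current.push(proximity);
            assign(rest, remaining - 1, max, current, out);
            current.pop();
        }
    }
}

// mana_assignments(3, 8) yields (1, 1, 1) first, then the three
// combinations summing to 4, matching the table below.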

MANA   soup of   of the   the day   proximity
  3       1         1        1          3
  4       2         1        1          4
  4       1         2        1          4
  4       1         1        2          4
  5       3         1        1          5
  5       2         2        1          5
  5       2         1        2          5
  5       1         3        1          5

Be able to generate the documents ids

There were many user requests about the fact that MeiliSearch isn't able to automatically generate ids for newly inserted documents. This isn't a hard feature to add and it can make the indexing process faster.

The feature could be a setting flag that can be set or unset; when set, the engine will automatically generate the document ids.

Implement distinctAttribute

Implement distinctAttributes.

For this first iteration, the distinct will be applied after the bucket sort.

Giving the possibility to apply the distinct before the bucket sort might be implemented once the new search engine is out; it will need a new issue.

Support more complex queries using a Query Tree

The engine should support a list of features:

  1. Legacy features:
    • Synonyms e.g. NYC -> new york city.
    • Ngram up to 3 (ke roll mops | keroll mops | ke rollmops | kerollmops).
    • Split words e.g. newyork city -> new york city.
    • prefix (many -> manyllmops)
    • typo (poisson -> poison)
  2. future features:
    • Phrase e.g. "new york city" only search for those words without typos in this exact order.
    • complex synonyms (new york city -> NYC)
    • ignored word (-rude)
    • word criterion

To be able to support all of those features we will need to get rid of our current "list" of words and develop a more complex query tree, where there can be rules and paths between the words, and where each word can be associated with another word or a group of words.

For this to work we will need to rework the mdfs a little bit; the new criteria system will also depend on this.

Implementation

The chosen approach is to generate a "query tree" similar to an AST with its own contextual keywords:

  1. Branching keywords:
  • AND: the results of the sub-branches are considered as sequential words
  • OR: the results of the sub-branches are considered as alternative words
  • CONSECUTIVE: same as AND but with a distance of 1 between the words
  2. Leaf keywords:
  • Exact: contains an exact word or an exact prefix to search in the database
  • Tolerant: contains a DFA of a word or of a prefix to retrieve each alternate word with or without a typo.

Features integration

This tree should be sufficient to express every feature described before.

Let's say we have the query hello world 2021:

Synonyms

Synonyms are additional exact words to add to the query matching 1 query word.

we could have "hello" = "hi" or "good morning", the generated tree would be:

# [...]
# here we have 3 alternatives `"hello" OR "hi" OR ("good" AND "morning")`,
# because "hi", "good" and "morning" are synonyms they are considered Exact.
OR
    Tolerant { dfa: "hello" }
    Exact { word: "hi" }
    AND
        Exact { word: "good" }
        Exact { word: "morning" }
    END-AND
END-OR
# [...]
  • implemented
  • tested

Ngrams

Ngrams are concatenations of words used to find alternative relevant words ("blog post" => "blogpost").
The generated ngrams are considered Exact.

We could have hello world and helloworld in the dataset; the generated tree would be:

# [...]
# here we have 2 alternatives `("hello" AND "world") OR "helloworld"`,
# because "helloworld" is an Ngram it is considered Exact.
OR
    AND
        Tolerant { dfa: "hello" }
        Tolerant { dfa: "world" }
    END-AND
    Exact { word: "helloworld" }
END-OR
# [...]
  • implemented
  • tested

Word split

Word split is kind of a reverse Ngram ("blogpost" => "blog post").
The generated words are considered Exact and CONSECUTIVE.

We could have hello and hell o in the dataset; the generated tree would be:

# [...]
# here we have 2 alternatives `"hello" OR ("hell" CONSECUTIVE "o")`
OR
    Tolerant { dfa: "hello" }
    CONSECUTIVE
        Exact { word: "hell" }
        Exact { word: "o" }
    END-CONSECUTIVE
END-OR
# [...]
  • implemented
  • tested

Prefix

Prefix allows the search engine to fetch alternatives of a word starting with it ("world" => "worldmap");
this feature is represented by a boolean in a leaf keyword:

# [...]
Exact { dfa: "hello", is_prefix: true }
# [...]
  • implemented
  • tested

Typo

Typo allows the search engine to fetch alternatives of a word with typos ("world" => "word");
this feature is represented in the tree by the keyword Tolerant:

# [...]
Tolerant { dfa: "hello" }
# [...]
  • implemented
  • tested

Phrase

Phrase is a sequence of words surrounded by double-quotes in the query ("hello world" =/= hello world).
The words are considered Exact and CONSECUTIVE.

for the query "hello world" 2021, the generated tree would be:

# [...]
AND
    CONSECUTIVE
        Exact { word: "hello" }
        Exact { word: "world" }
    END-CONSECUTIVE
    Tolerant { word: "2021" }
END-AND
# [...]
  • implemented
  • tested

Complex synonyms

Complex synonyms are additional exact words to add to the query matching several query words ("new york" = "NYC").
We have several approaches:

  1. compute synonyms after Ngrams and concatenate complex synonyms in one Ngram

"new york" = "NYC" => "newyork" = "NYC" at indexation time,
during query tree generation the Ngram newyork would be generated from new york and synonyms would be fetched for this Ngram. Simple but limited to Ngram size ("the lord of the ring" = "tlotr" would never work if Ngram size = 3).

for the query new york, the generated tree would be:

# [...]
OR
    # original query
    AND
        Tolerant { word: "new" }
        Tolerant { word: "york" }
    END-AND
    OR
        # generated Ngram
        Exact { word: "newyork" }
        #  Ngram Synonym
        Exact {word: "NYC"}
    END-OR
END-OR
# [...]
  2. Have a specific structure to fetch synonyms (to be defined and not in the scope)

for the query new york, the generated tree would be:

# [...]
OR
    # original query
    AND
        Tolerant { word: "new" }
        Tolerant { word: "york" }
    END-AND
    #  query Synonym
    Exact {word: "NYC"}
    # generated Ngram
    Exact { word: "newyork" }
END-OR
# [...]
  • implemented
  • tested

Ignored words

Ignored words are words prefixed by - in the query, meaning that found documents should not contain them.
They may not be represented in the query tree but in an additional field (TO BE DEFINED).

  • implemented
  • tested

Word criterion

Word criterion allows the engine to remove words from the query to find more documents ("new york city" => "new york").

for the query hello world 2021, the generated tree would be:

# [...]
OR
    AND
        Tolerant { word: "hello" }
        Tolerant { word: "world" }
        Tolerant { word: "2021" }
    END-AND
    AND
        Tolerant { word: "hello" }
        Tolerant { word: "world" }
    END-AND
    AND
        Tolerant { word: "hello" }
    END-AND
END-OR
# [...]
  • implemented
  • tested

Support prefix search of any length

Currently, when a user sends a search query to the engine, the words in the query are extracted and derived words are retrieved. Derived words are words with typos, and can also be prefix matches (e.g. he can match hello or hermit).

When we want to find the derived words of the query words, we generate DFAs that are used with the word FST (the word dictionary contains all the known words). This automaton gives us an iterator over all the words that match, but there can be many of them.

Once we have these words, we retrieve all the document ids that contain them. For each list of derived words of a query word we do a union of all the ids, then we intersect the unions; this corresponds to the list of candidate documents for this query. Therefore, the more query words there are, the fewer candidates we get.
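
A hedged sketch of that computation, assuming the roaring crate and one list of docids bitmaps per query word (one bitmap per derived word):

use roaring::RoaringBitmap;

// Union the docids of each query word's derived words, then intersect
// the unions across query words.
fn query_candidates(derived_docids: Vec<Vec<RoaringBitmap>>) -> RoaringBitmap {
    let mut unions = derived_docids.into_iter().map(|derivations| {
        derivations
            .into_iter()
            .fold(RoaringBitmap::new(), |acc, docids| acc | docids)
    });

    match unions.next() {
        Some(first) => unions.fold(first, |acc, union| acc & union),
        None => RoaringBitmap::new(),
    }
}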

Performance issue

A very short token that is declared as a prefix (e.g. se) in a user query will force the engine to aggregate (doing a union) too many document ids spread over too many words, slowing down the computation of the candidates.

Fixing the issue

The idea would be to maintain two word dictionaries: the current classic one, which contains all the words in the database, and a new one, the prefix dictionary, which contains all of the important prefixes. The prefix dictionary is an FST that contains the prefixes that correspond to a lot of words.

Take for example the "a" prefix: it will be present in this prefix FST because something like 5% of the word dictionary matches this prefix. A basic algorithm would be to iterate through the word dictionary, in order, and group the words by their prefix letters; each time a group is found to be bigger than or equal to a given threshold, the prefix is added to the prefix dictionary. Once the prefix dictionary is ready, we are able to construct a prefix-word-docids database equivalent to the word-docids one, where a prefix key gives all of the ids of the documents that we would have been able to find using the original word dictionary; it is equivalent to a cache.
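
A hedged sketch of that grouping step, using plain std types instead of an fst (the prefix length and the threshold are illustrative parameters):

use std::collections::BTreeMap;

// Walk the sorted word dictionary, count how many words share each short
// prefix, and keep the prefixes that cover at least `threshold` words.
fn important_prefixes<'a>(
    sorted_words: impl Iterator<Item = &'a str>,
    max_prefix_len: usize,
    threshold: usize,
) -> Vec<String> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for word in sorted_words {
        for len in 1..=max_prefix_len.min(word.len()) {
            if word.is_char_boundary(len) {
                *counts.entry(word[..len].to_string()).or_insert(0) += 1;
            }
        }
    }

    counts
        .into_iter()
        .filter(|(_, count)| *count >= threshold)
        .map(|(prefix, _)| prefix)
        .collect()
}

// The selected prefixes would then be inserted into the prefix FST and a
// prefix-word-docids database would be built from them.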

We will have to find an accurate way to highlight the words in the output documents too, maybe using the DFAs.

Make the A* paths aware of the query words typos

When we explore the path graph we use an A* algorithm that tries to find the best path regarding the distance between the query words.

In fact we mixed all the derived words (the words that look like the original word) regardless of the number of typos they have; unfortunately this makes the current algorithm unaware of the real distance of a document based on the typos of the words found in it.

So I thought it could be a good solution to split the derived words into three different "buckets" instead of mixing them up; then, when the A* system needs to find the next step for a given path, we give it the paths with a word that contains no typos from the first bucket, one with one typo from the second, and so on.

We obviously need to adjust the cost of moving to these paths by adding the number of typos the paths contain; this way the A* will search for paths with the minimum number of typos. We could represent the cost by using a tuple (typos, proximity).

RAM consumption issue when trying to limit the usage

We need to investigate the RAM consumption issue we have with the flag to limit consumption (for big datasets).

Sorry my issue is not clear, we are going to complete it as we go along.

Not needed for the first release of the new search engine.

(Also Kero sorry if you already created one 😇)

Bug with primary key type.

I have found a bug that concerns the primary key type when working on implementing deletion. This bug arises when we specify a primary key that is not a string in the documents:

[
	{
		"id": 1,
		"content": "foo"
	},
	{
		"id": 2,
		"content": "foo"
	}
]

This update is accepted and processed by the engine. When trying to delete documents, however, the code panics:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Error("invalid type: integer `1`, expected a string", line: 1, column: 1)', /home/mpostma/Documents/code/rust/milli/src/update/delete_documents.rs:98:86

I'm not sure how we want to handle this, but here are the options I see:

  • deny non-string primary keys
  • serialize all primary keys into a string format
  • deserialize the primary key to a value during deletion, then cast it to a string

Whatever we decide, this function must not panic.

Speed-up facet fetching

We currently have some important issues when we need to gather facet distributions. Fetching the facet distributions involves iterating over a lot of RoaringBitmaps to check that some document ids are part of them; this involves deserializing, and that takes a lot of time.

I propose we associate those bitmaps with xor filters or bloom filters to speed up those intersections. I am not sure this is the easiest and most performant way to do so, but it's an idea. Those data structures must be deserializable for free (directly readable from unaligned raw bytes).

Improving the BestProximity algorithm

The best proximity algorithm is a system that returns the best proximities between positions of the different query words.

How the engine works

Consider that the search engine has indexed many documents and those documents contain the words "eminem", "rap" and "god". The document ids are stored under each word occurrence, keyed by the positions they appear at.

If a user searches for "eminem rap god", this is what the engine will be able to retrieve from the key-value store: a list of positions associated with each query word.

In the example below the engine shows that the word "eminem" appears at many different positions in the documents: in the first attribute at indexes 0, 2 and 4 (i.e. positions 0, 2 and 4). It also appears in the second attribute at indexes 0 and 1 (i.e. positions 1000 and 1001).

eminem: [0,    2,    4,       1000, 1001]
rap:    [0, 1, 2, 3,    5,                1002, 1003]
god:    [0, 1, 2, 3,       6, 1000, 1001,       1003]

These numbers don't represent document ids. Once the engine has found the best proximity between these positions, it retrieves the documents associated with these words at these exact positions. For example the word "eminem" at position 1000 can contain documents 30, 32, 35 and 72.

Splitting the document ids related to a given word by position reduces the number of document ids to intersect and allows returning the best documents for a given query (by proximity and position) on the fly.

Once these document ids have been retrieved, the engine does a simple intersection between them to know which documents are concerned.

Improve the generated successors

We use a function that produces the successors of a given path between the different word positions. This successors function does not know whether a combination of positions (an intersection) yields documents or not; we must introduce this knowledge to avoid producing successor paths for which no documents will be found.

Keep the graph exploration algorithm state between iterations

We execute a graph exploration algorithm (Dijkstra, A*) to retrieve the minimum proximity we can find between the different words and positions the engine retrieved for the user query.

This algorithm returns the best path according to its computed proximity, but if not enough documents result from this execution we execute it a second time, ignoring the paths that we already saw in the previous iteration. Doing so means that we drop the whole data structures we filled during the previous iteration and re-explore the same nodes; this is wasted time. We must keep the open/closed lists.

Test the relevancy

We should add integration tests to check the relevancy: are the search results what we expect according to a dataset and the applied queries/settings?

  • "basic" searches
  • check the criteria are applied
  • check the settings are applied (synonyms, stop words, facets, distinct attributes...)

I know there are already some, but you should add more 💪

Feel free to complete my issue with more detailed tests, or how to organize this test suite!

Introduce a typo aware Iterator which returns the best set of candidates

After experimenting with #11 it was found that it was not the best way to deal with typos.

The new idea would be to introduce a new Iterator based on an A* algorithm, a mana depth-first search, which would be able to return the best set of candidates based on the number of typos of the previously found words.

We would first split the words derived from the query words into different buckets; those three buckets would store the words with zero typos, one typo and two typos.

This new Iterator would then take these buckets and produce lists of candidates; these candidates are then given to the current A* system, the mana DFS.
