The criteria
The current implementation is really fast to find documents with the best proximity but this is only one of the multiple criteria.
Moreover, all criteria will have to interact with the query tree, and the proximity criterion will need some refactor.
Definition
This part explains the purpose of each criterion:
- the behavior when the criterion is present in ranking rules
- the behavior when the criterion is absent from ranking rules
Typo
This criterion is responsible for fetching the documents which contain fewer typos.
present
the criterion will sort and group candidates by the number of typos in ascending order.
Priority behavior:
- Candidates matching 0 typos over all the query (exact words)
- Candidates matching 1 typo over all the query (1 word with 1 typo)
- Candidates matching 2 typos over all the query (max 1 word with 2 typos or 2 words with 1 typo)
- ...
absent
Candidates matching exact words and Candidates matching words with typos are considered equally relevant.
Words
This criterion will rank documents with more words of the query.
This is only available when the optionalWords
setting is set. (?) (this should be probably moved in the query tree scope)
present
This criterion will sort and group candidates by the number of words in the query matching candidates n descending order,
ignoring word in the query from the last to the first.
for the query "hello word 2021" the priority behavior should be:
- Candidates containing "hello", "world" and "2021" (all the world of the query)
- Candidates containing "hello" and "world" (last word ignored)
- Candidates containing "hello"
absent
Candidates containing all the query words and Candidates containing some of the query words are considered equally relevant.
Proximity
This criterion is the one that has been implemented in this engine, it will rank higher documents with the query words near to each other.
present
This criterion will sort and group candidates by the distance between matching words comparing to the query words.
for the query "hello word 2021" the priority behavior should be:
- Candidates containing "hello", "world" and "2021" consecutively.
- Candidates containing 2 words consecutively and 1 word with a distance of 2 from the group for a global distance of 3 ("hello big world 2021": "world" and "2021" are consecutive but "hello" has a distance of 2 with "world", the global distance is 3 [1 + 2])
- Candidates matching the query words with a global distance of 4
- ...
absent
Candidates matching query words consecutively and Candidates matching query words at any distance are considered equally relevant.
Attribute
This criterion will rank higher documents that contain the best match of query word in a more important attribute than another match in another document.
present
This criterion will sort and group candidates by the attributes matching query words from the first attribute to the last.
Having documents with attributes "title" -> "body" -> "footer", the priority behavior should be:
- Candidates matching all the query in the "title"
- Candidates matching a part of the query in the "title"
- Candidates matching all the query in the "body"
- Candidates matching a part of the query in the "body"
- Candidates matching all the query in the "footer"
- Candidates matching a part of the query in the "footer"
absent
Candidates matching the query in the higher attribute and Candidates matching the query in any attribute are considered equally relevant.
Exact
Documents with words that match query terms exactly are ranked higher. An exact match occurs when a full word in a query is matched without typos to a word in an attribute.
It is also this criterion that is responsible for moving up documents that contain the query word in a field in its exact form (only when there is one word).
present
the priority behavior should be:
- Candidates containing an attribute matching exactly the original query and only the original query
- Candidates containing a sequence of words matching exactly the original query
- any other match
absent
Candidates containing an attribute matching exactly the original query and only the original query and Candidates containing query words anywhere are considered equally relevant.
Custom
This criterion rank documents based on a custom field, there can be multiple of them but they can't be moved between any of the previous textual criteria.
Support reordering the criteria
The real problem is to allow users to reorder the default list of criteria or even a list without one or more of the criteria.
We should find a way to make the criterion generic, by using a trait maybe, an important thing to keep in mind is that some criteria are contextual, it means that their behaviors depends on the position of others criteria.
Possible interactions with the query tree
Typo
Typo could modify the query by changing the Tolerant
leaves.
# [...]
OR
AND
Tolerant { dfa_2: "hello"}
Tolerant { dfa_2: "world" }
Tolerant { dfa_1: "2021" }
END-AND
# [...]
- 0 typos
# [...]
OR
AND
Exact { word: "hello" }
Exact { word: "world" }
Exact { word: "2021" }
END-AND
# [...]
- 1 typo
# [...]
# simple approach
OR
AND
Tolerant { dfa_1: "hello"}
Tolerant { dfa_1: "world" }
Tolerant { dfa_1: "2021" }
END-AND
# [...]
# [...]
# accurate approach (maybe too much)
OR
AND
Tolerant { dfa_1: "hello"}
Exact { word: "world" }
Exact { word: "2021" }
END-AND
AND
Exact { word: "hello" }
Tolerant { dfa_1: "world" }
Exact { word: "2021" }
END-AND
AND
Exact { word: "hello" }
Exact { word: "world" }
Tolerant { dfa_1: "2021" }
END-AND
# [...]
Word
Word could modify the query by eluding branches in the tree.
OR
AND
Tolerant { dfa: "hello"}
Tolerant { dfa: "world" }
Tolerant { dfa: "2021" }
END-AND
AND
Tolerant { dfa: "hello"}
Tolerant { dfa: "world" }
END-AND
AND
Tolerant { dfa: "hello"}
END-AND
END-OR
# second and third branch eluded
OR
AND
Tolerant { dfa: "hello"}
Tolerant { dfa: "world" }
Tolerant { dfa: "2021" }
END-AND
END-OR
# first and third branch eluded
OR
AND
Tolerant { dfa: "hello"}
Tolerant { dfa: "world" }
END-AND
END-OR
Proximity
Proximity should not modify the query tree but would have a specific interaction with it.
This Criterion needs to fetch all the candidates containing words that match with a certain distance.
note: To compute this distance the algorithm iterate over query words in a window of 2 ("hello world 2021" = "hello world" -> "world 2021") and make the sum of the minimal distance of each pair.
AND branch
The algorithm should consider sub-branches as words and iterate over them in a window of 2 to make the sum of the minimal distance of each pair.
CONSECUTIVE branch
As the AND branch, the algorithm should consider sub-branches as words but should retain candidates which have all the pair with a distance of 1.
OR branch
The algorithm should consider sub-branches as alternates words in the same position and aggregate them in one meta-word for the parent branch
Result of branches
all the first sub-branch words as left-hand, all the last sub-branch words as right-hand, the candidates, and the distance.
AND
CONSECUTIVE
Exact { word: "hello" }
Exact { word: "world" }
END-CONSECUTIVE
OR
Tolerant { dfa: "2021" }
Exact {word: "twenties"}
END-OR
END-AND
step 1: compute leaves
AND
CONSECUTIVE
{ left: "hello", right: "hello", additional_distance: 0, candidates: C1 }
{ left: "world", right: "world", additional_distance: 0, candidates: C2 }
END-CONSECUTIVE
OR
{ left: dfa(2021), right: dfa(2021), additional_distance: 0, candidates: C3 }
{ left: "twenties", right: "twenties", additional_distance: 0, candidates: C4 }
END-OR
END-AND
step 2: compute CONSECUTIVE sub-branch
AND
# CONSECUTIVE
{ left: "hello", right: "world", additional_distance: 0, candidates: dist_0(C1, C2) }
OR
{ left: dfa(2021), right: dfa(2021), additional_distance: 0, candidates: C3 }
{ left: "twenties", right: "twenties", additional_distance: 0, candidates: C4 }
END-OR
END-AND
step 3: compute OR sub-branch
AND
# CONSECUTIVE
{ left: "hello", right: "world", additional_distance: 0, candidates: dist_0(C1, C2) }
# OR
{ left: dfa(2021) | "twenties", right: dfa(2021) | "twenties", additional_distance: 0, candidates: C3 | C4 }
END-AND
step 4: compute AND branch with the minimal distance, let's say 2, using the right
field of the first branch with the left
field of the second.
# CONSECUTIVE AND OR
{
left: "hello",
right: dfa(2021) | "twenties",
additional_distance: 2,
candidates: dist_2(dist_0(C1, C2), C3 | C4)
}
Attributes
As Proximity, Attributes should not modify the query tree but would have a specific interaction with it.
The algorithms will give weight to each word depending on his position in the document, a word in the first attribute at the first position will have the minimum weight, a word in the last attribute in the last position will have the maximal weight.
AND/CONSECUTIVE branch
The algorithm should compute the means of the weights of sub-branches and intersect candidates.
OR branch
The algorithm should keep the minimal weights of sub-branches and keep the candidates of the minimal sub-branch.
If several sub-branches are minimal, the algorithm keeps the union of the candidates of these minimal sub-branches.
AND
CONSECUTIVE
Exact { word: "hello" }
Exact { word: "world" }
END-CONSECUTIVE
OR
Tolerant { dfa: "2021" }
Exact {word: "twenties"}
END-OR
END-AND
step 1: compute leaves
AND
CONSECUTIVE
# "hello"
{ weight: W1, candidates: C1 }
# "world"
{ weight: W2, candidates: C2 }
END-CONSECUTIVE
OR
# ~"2021"
{ weight: W3, candidates: C3 }
# "twenties"
{ weight: W4, candidates: C4 }
END-OR
END-AND
step 2: compute CONSECUTIVE (means)
AND
# "hello world"
{ weight: (W1+W2)/2, candidates: C1&C2 }
OR
# ~"2021"
{ weight: W3, candidates: C3 }
# "twenties"
{ weight: W4, candidates: C4 }
END-OR
END-AND
step 3: compute OR (min) for W4 < W3
AND
# "hello"&"world"
{ weight: (W1+W2)/2, candidates: C1&C2 }
# "twenties"
{ weight: W4, candidates: C4 }
END-AND
step 3 bis: compute OR (min) for W4 = W3
AND
# "hello"&"world"
{ weight: (W1+W2)/2, candidates: C1&C2 }
# "2021"|"twenties"
{ weight: W3, candidates: C3|C4 }
END-AND
step 4: compute AND (means) for W4 < W3
# "hello"&"world"&"twenties"
{
weight: ((W1+W2)/2 + W4)/2,
candidates: C1&C2&C4
}
step 4 bis: compute AND (means) for W4 = W3
# "hello"&"world"&("2021"|"twenties")
{
weight: ((W1+W2)/2 + W3)/2,
candidates: C1&C2&(C3|C4)
}