Since 2df6a42 it is possible to store every datapoint in `n` lists by building with `ivf.build(n_probes=n)`.
This improves recall/QPS quite a lot, but only when going from `n=1` to `n=2`, as seen in the attached figure.
The problem is probably that duplicate matches aren't handled well.
When calling `ctop` from `ivf`, we should somehow tell it about the indices we have already collected, so the distance table can focus on telling us about alternative interesting candidates.
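A toy illustration (not the project's actual code) of why duplicates hurt: when a datapoint is stored in `n` lists, probing two lists can return the same global id twice, and the duplicates then occupy top-k slots that distinct candidates should have had.

```python
import heapq

# Two probed lists of (distance, global_id) candidates.
# Point 7 is stored in both lists, so it shows up twice.
list_a = [(0.10, 7), (0.30, 12), (0.50, 3)]
list_b = [(0.10, 7), (0.40, 9), (0.60, 15)]

top4 = heapq.nsmallest(4, list_a + list_b)
ids = [gid for _, gid in top4]
# Point 7 takes two of the four slots, pushing out a real candidate.
```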
One option is to reuse the `query_pq_sse(transformed_data, self.tables, indices, values, True)` call by calling it on multiple `(transformed_data, tables)` pairs while keeping `(indices, values)` fixed. That way we would also only do rescoring/pass-2 a single time, on all candidates retrieved from the different lists.
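A hedged sketch of that reuse, in pure Python (the real `query_pq_sse` is SIMD code; `pq_distances` below is a hypothetical stand-in for the distance-table lookup). One shared `(indices, values)` top-k buffer is updated across all probed lists, so pass-2/rescoring could run once on the union of candidates. Note the inner loop only sees the local id `i`, which is exactly issue (2) below.

```python
import numpy as np

def pq_distances(transformed_data, tables):
    # Placeholder: in real PQ this would be a lookup into the
    # precomputed distance tables, one entry per subquantizer.
    return np.asarray(transformed_data, dtype=float)

def query_multiple_lists(pairs, k):
    values = np.full(k, np.inf)           # shared top-k distances
    indices = np.full(k, -1, dtype=int)   # shared top-k ids
    for transformed_data, tables in pairs:
        dists = pq_distances(transformed_data, tables)
        for i, d in enumerate(dists):
            j = int(values.argmax())      # current worst slot
            if d < values[j]:
                values[j], indices[j] = d, i
    return indices, values
```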
The issue is that
(1) the binary heap data structure we use can't recognize duplicates, and
(2) the `query_pq_sse` function only knows the "local" id of a point within a list, not the global id.
To solve (2), we could pass a list of the global ids of all the points considered. This would be some extra overhead for `query_pq_sse` to pass around, but perhaps not that much. And we wouldn't have to "relabel" the returned ids afterwards.
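A minimal sketch of that fix, assuming hypothetical names (`scan_list` is not in the codebase): alongside each list's distances we pass the array of global ids, so the scan records global ids directly into the shared buffer and no relabeling is needed afterwards.

```python
import numpy as np

def scan_list(dists, global_ids, indices, values):
    # Update the shared top-k (indices, values) buffer with one list's
    # distance estimates, recording global rather than local ids.
    for local_id, d in enumerate(dists):
        j = int(values.argmax())          # current worst slot
        if d < values[j]:
            values[j] = d
            indices[j] = global_ids[local_id]  # global, not local, id
    return indices, values
```

Usage would be one `scan_list` call per probed list, reusing the same `indices` and `values` arrays throughout the query.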
For (1), we could switch back to using insertion sort, or just heuristically remove some of the duplicates after the heap has produced its candidates.
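One cheap heuristic along those lines, as a hedged sketch (`dedup_topk` is a made-up name, not project code): keep the heap as-is but over-fetch, say 2x `k` candidates, then drop duplicate global ids, keeping the best distance per id. This doesn't guarantee `k` distinct results, but it avoids touching the heap itself.

```python
import heapq

def dedup_topk(candidates, k, overfetch=2):
    # candidates: iterable of (distance, global_id) pairs, possibly
    # containing the same global_id more than once.
    best = {}
    for dist, gid in heapq.nsmallest(k * overfetch, candidates):
        if gid not in best:          # first hit is the smallest distance
            best[gid] = dist
    return sorted(best.items(), key=lambda kv: kv[1])[:k]
```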