rmind / nxsearch

nxsearch: a full-text search engine

License: BSD 2-Clause "Simplified" License
Add support for exclusion, e.g. `science -medicine`, where any documents matching "medicine" are excluded.
Currently, idxdoc_get_termcount() is O(n) and therefore very inefficient. The t_index_limits test exposes this behaviour (slow search).
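A minimal sketch of one way to make the lookup O(1): maintain a per-document hash table mapping term ID to occurrence count, populated when the document is indexed. All names below (doc_termmap_t, termmap_get) are hypothetical, not the existing nxsearch structures.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical O(1) replacement for the linear scan in
 * idxdoc_get_termcount(): a per-document open-addressing hash table
 * keyed by term ID, filled in when the document is added.
 */
typedef struct {
	uint32_t	term_id;	/* 0 marks an empty slot */
	uint32_t	count;
} termmap_entry_t;

typedef struct {
	termmap_entry_t *buckets;	/* power-of-two sized array */
	size_t		size;
} doc_termmap_t;

static uint32_t
termmap_get(const doc_termmap_t *map, uint32_t term_id)
{
	/* Multiplicative hash, then linear probing. */
	size_t i = (term_id * 2654435761u) & (map->size - 1);

	while (map->buckets[i].term_id != 0) {
		if (map->buckets[i].term_id == term_id) {
			return map->buckets[i].count;
		}
		i = (i + 1) & (map->size - 1);
	}
	return 0;  /* the term does not occur in this document */
}
```

The memory cost is one small table per document, which trades space for making the termcount query independent of the number of terms.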
Need a substitution function for normalization, e.g. to replace ė with e. This will either be used by the built-in normalizer or added as a new filter. Note: the ICU library might have this functionality hidden somewhere.
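ICU does ship this functionality in its transliteration API: the "NFD; [:Nonspacing Mark:] Remove; NFC" transform decomposes each character, strips the combining marks and recomposes, turning ė into e. A minimal sketch using the C API from utrans.h:

```c
#include <unicode/ustring.h>
#include <unicode/utrans.h>

/*
 * Strip diacritics in place: decompose (NFD), remove the combining
 * marks, then recompose (NFC), so "ė" becomes "e".
 */
static int
strip_diacritics(UChar *text, int32_t *len, int32_t capacity)
{
	UErrorCode ec = U_ZERO_ERROR;
	UChar id[64];
	int32_t limit = *len;

	u_uastrcpy(id, "NFD; [:Nonspacing Mark:] Remove; NFC");
	UTransliterator *t = utrans_openU(id, -1, UTRANS_FORWARD,
	    NULL, 0, NULL, &ec);
	if (U_FAILURE(ec)) {
		return -1;
	}
	utrans_transUChars(t, text, len, capacity, 0, &limit, &ec);
	utrans_close(t);
	return U_FAILURE(ec) ? -1 : 0;
}
```

Since opening a transliterator is relatively expensive, a real filter would presumably create it once per pipeline rather than once per token.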
Support documents with multiple fields, e.g.:

```json
{
  "title": "Short story",
  "content": "Once upon a time ...",
  "date": 1666111000
}
```

as well as field-specific queries, e.g. `title: "story"` and `content: "once"`.
Currently, when filter_pipeline_run() fails with FILT_ERROR, we have no way to obtain the error message. Provide an API for the error message, and update the parser and the Lua code to support it.

Regardless of the error and the endpoint, the HTTP service currently always returns 200.
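A sketch of what such an API could look like; the names below are hypothetical, not the current nxsearch interface:

```c
/*
 * Hypothetical error-reporting API: the library records the last
 * error code and message on the nxs_t handle, and the caller
 * retrieves them after a failed operation.
 */
typedef struct nxs nxs_t;

/* Set by internal code paths, e.g. when a filter returns FILT_ERROR. */
void	nxs_error_set(nxs_t *, int code, const char *fmt, ...);

/* Retrieve the last error message; valid until the next operation. */
const char *nxs_error_msg(const nxs_t *, int *code);
```

The Lua service could then fetch the message after a failure and map the error code to an appropriate non-200 HTTP status.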
If the index gets deleted, the other nginx workers won't pick it up. Since references use LRU with TTL, the stale reference will eventually be garbage-collected and closed. However, this doesn't work if the index is deleted and immediately re-created. This can be fixed either on the OpenResty side or in the nxsearch lib.
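One way to detect the delete-and-recreate case on the nxsearch side (a sketch, not the current implementation) is to compare the inode behind the open descriptor with the inode currently at the path, and re-open when they differ:

```c
#include <sys/stat.h>
#include <stdbool.h>

/*
 * Sketch: detect that the index file at 'path' is no longer the file
 * we have open (deleted, or deleted and immediately re-created).
 */
static bool
index_file_is_stale(int open_fd, const char *path)
{
	struct stat cur, disk;

	if (fstat(open_fd, &cur) == -1) {
		return true;
	}
	if (stat(path, &disk) == -1) {
		return true;	/* path is gone: the index was deleted */
	}
	/* A different inode/device pair means deleted and re-created. */
	return cur.st_ino != disk.st_ino || cur.st_dev != disk.st_dev;
}
```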
Currently, the tokenizer is a quickly written naive implementation: 1) it is not efficient, as it copies the whole input string; 2) it uses libc strtok_r(), which replaces the separator bytes with a NUL terminator and may break UTF-8 strings.
Objective: write an efficient and UTF-8 compatible tokenizer, probably using a simple state machine, perhaps leveraging some UTF-8 helpers from the libicu Unicode library.
The new tokenizer should handle separators (dots, commas, etc.) and other punctuation. Some cases to think about and consider:

- `foo, bar` producing `foo` and `bar` rather than `foo,` and `bar`.
- `the [client] is <...>`, some `*bold*` or `_underscore_` marks.
- `Some.Text,which doesn't have spaces right;one;two;three.`
- `keep-alive`, `this--done` (the em-dash stands for "is").
- `U.S.A.`, `I.B.M.` or `Dennis M. Ritchie`.
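A minimal sketch of the state-machine direction, walking the input in place with ICU's U8_NEXT macro so multi-byte sequences are never split. It keeps only alphanumeric runs, so it does not yet handle the hyphen and abbreviation cases above:

```c
#include <stddef.h>
#include <unicode/utf8.h>
#include <unicode/uchar.h>

/*
 * Sketch of a UTF-8 aware tokenizer: scan the input without copying
 * it, emitting [start, start+len) byte ranges of alphanumeric runs.
 * The emit() callback is a placeholder for token handling.
 */
static void
tokenize_utf8(const char *text, int32_t len,
    void (*emit)(const char *, size_t, void *), void *arg)
{
	int32_t i = 0, start = -1;

	while (i < len) {
		int32_t prev = i;
		UChar32 c;

		U8_NEXT(text, i, len, c);	/* c < 0 on malformed input */
		if (c >= 0 && u_isalnum(c)) {
			if (start < 0) {
				start = prev;	/* a token begins here */
			}
		} else if (start >= 0) {
			emit(text + start, prev - start, arg);
			start = -1;
		}
	}
	if (start >= 0) {
		emit(text + start, len - start, arg);
	}
}
```

Extending the state machine with extra states (e.g. "inside a hyphenated word", "inside a dotted abbreviation") would cover the trickier cases without giving up the single-pass, zero-copy design.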
Currently, all tokens are treated using the OR logic ("science medicine" is "science" or "medicine"). The query parser should support the following requirements:

- `AND`, `OR` as well as `NOT` logic.
- The `&`, `|` and `-` operators (minus means exclusion, e.g. `science -medicine`).
- Queries such as:
  - `(C OR C++ OR Python) AND developer AND (Linux OR Unix OR BSD) AND NOT (C# OR Java)`
  - `(software AND (engineer OR developer))`
  - `("software engineer" OR "software developer")`

Operator precedence: should `A | B & C` mean `(A | B) & C` or `A | (B & C)`? I'd say stick with the standard precedence in logic: ¬, ∧, ∨. But left-to-right might also be an option.

Consider using re2c and lemon for the lexer and parser.
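To illustrate, here is a minimal, self-contained sketch of the standard precedence (NOT binds tightest, then AND, then OR) as a recursive-descent parser; it prints the parse in postfix order, so `A | B & C` comes out as `A B C AND OR`, i.e. `A | (B & C)`. Single-letter terms and the trivial hand-rolled lexer are simplifications; a real implementation would use a re2c-generated lexer or encode the same precedence in a lemon grammar:

```c
#include <stdio.h>
#include <stdlib.h>

static const char *p;	/* cursor into the query string */

static void parse_or(void);

static int
accept(char c)
{
	while (*p == ' ')
		p++;
	if (*p == c) {
		p++;
		return 1;
	}
	return 0;
}

static void
parse_primary(void)
{
	if (accept('!')) {		/* NOT binds tightest */
		parse_primary();
		printf(" NOT");
		return;
	}
	if (accept('(')) {
		parse_or();
		if (!accept(')')) {
			fprintf(stderr, "missing ')'\n");
			exit(1);
		}
		return;
	}
	while (*p == ' ')
		p++;
	printf(" %c", *p++);		/* a single-letter term */
}

static void
parse_and(void)
{
	parse_primary();
	while (accept('&')) {		/* AND binds tighter than OR */
		parse_primary();
		printf(" AND");
	}
}

static void
parse_or(void)
{
	parse_and();
	while (accept('|')) {
		parse_and();
		printf(" OR");
	}
}

int
main(void)
{
	p = "A | B & C";
	parse_or();			/* prints: A B C AND OR */
	printf("\n");
	return 0;
}
```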
Currently, the index structures are generally append-only, as document removal uses tombstones (special markers) to indicate deletions. Therefore, many deletions produce gaps in the index files, which wastes space. We want to address this problem by implementing compaction.
The following is a proposal for a concurrent compaction algorithm:
```
new-dtmap = exclusive-open-create
lock new-dtmap

// Initial concurrent sync (captures most of the data)
dtmap-sync from current-dtmap

lock current-dtmap

// Sync any remaining data with the lock held if there was a race
dtmap-sync from current-dtmap

// Make the new index globally visible
atomic-posix-rename new-dtmap.filename to current-dtmap.filename

// Publish the compaction offset for the active index references
atomic-store-release current-dtmap.compaction-offset <= last offset in new-dtmap

unlock current-dtmap
unlock new-dtmap
```
When an active index reference observes a compaction-offset change, it should re-open the index (picking up the new file) and sync from this offset. The existing references to the memory-mapped file itself, primarily idxdoc_t::offset, would require a sequential scan to be adjusted.
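A sketch of that reader-side check, pairing a C11 acquire load with the writer's release store of the compaction offset; the field and function names are illustrative, not the existing nxsearch API:

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Illustrative reader state: 'compaction_offset' stands in for the
 * shared counter published by the compaction writer, and
 * 'seen_offset' is the value this reference last acted on.
 */
typedef struct {
	_Atomic uint64_t compaction_offset;
	uint64_t	seen_offset;
	/* ... memory-mapped dtmap state ... */
} idx_ref_t;

static void
idx_ref_check(idx_ref_t *ref)
{
	/* Acquire pairs with the writer's atomic-store-release. */
	uint64_t off = atomic_load_explicit(&ref->compaction_offset,
	    memory_order_acquire);

	if (off != ref->seen_offset) {
		/*
		 * Compaction happened: re-open the file at the same
		 * path (picking up the renamed file), sync from 'off'
		 * and rescan to adjust idxdoc_t::offset references.
		 */
		ref->seen_offset = off;
		/* idx_reopen_and_sync(ref, off); -- hypothetical */
	}
}
```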
Currently, the results in nxs_resp_t are returned in arbitrary order. They should be ordered by score, and the results should have a custom cap/limit, e.g. a default of 10k.
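A minimal sketch of the ordering step, assuming each result entry carries a float score; result_entry_t is a stand-in, not the actual nxs_resp_t layout:

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the actual response entry type. */
typedef struct {
	uint64_t	doc_id;
	float		score;
} result_entry_t;

/* Descending by score, with a deterministic tie-break on doc ID. */
static int
cmp_score_desc(const void *a, const void *b)
{
	const result_entry_t *x = a, *y = b;

	if (x->score != y->score)
		return (x->score < y->score) ? 1 : -1;
	return (x->doc_id > y->doc_id) - (x->doc_id < y->doc_id);
}

/* Sort and return how many entries to expose to the caller. */
static size_t
order_and_cap(result_entry_t *entries, size_t n, size_t limit)
{
	qsort(entries, n, sizeof(result_entry_t), cmp_score_desc);
	return (n > limit) ? limit : n;	/* e.g. limit = 10000 default */
}
```

If only the top N of a large result set are ever needed, a bounded heap selection would avoid sorting the full array, but a plain qsort plus cap is the simplest correct starting point.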