redisearch / redisearch Goto Github PK

A query and indexing engine for Redis, providing secondary indexing, full-text search, vector similarity search and aggregations.

Home Page: https://redis.io/docs/stack/search/

License: Other

Makefile 0.48% C 59.83% C++ 7.44% Python 29.14% Yacc 1.07% Ragel 0.42% Shell 1.29% CMake 0.33%

fulltext geospatial gis inverted-index redis redis-module search search-engine vector-database

redisearch's Introduction

RediSearch

Querying, secondary indexing, and full-text search for Redis

Overview

RediSearch is a Redis module that provides querying, secondary indexing, and full-text search for Redis. To use RediSearch, you first declare indexes on your Redis data. You can then use the RediSearch query language to query that data.

RediSearch uses compressed, inverted indexes for fast indexing with a low memory footprint.

RediSearch indexes enhance Redis by providing exact-phrase matching, fuzzy search, and numeric filtering, among many other features.

Getting started

If you're just getting started with RediSearch, check out the official RediSearch tutorial. Also, consider viewing our RediSearch video explainer.

The fastest way to get up and running with RediSearch is by using the Redis Stack Docker image.

How do I Redis?

Learn for free at Redis University

Build faster with the Redis Launchpad

Try the Redis Cloud

Dive in developer tutorials

Join the Redis community

Work at Redis

Trying RediSearch

To try RediSearch, either use the RediSearch Docker image, or create a free Redis Cloud Essentials account to get a RediSearch instance in the cloud.

Docker image

The Redis Stack Docker image makes it easy to try RediSearch.

To create a local RediSearch container, run:

$ docker run -p 6379:6379 redis/redis-stack-server:latest

To connect to this instance, run:

$ redis-cli

Documentation

The RediSearch documentation provides a complete overview of RediSearch. Helpful sections include:

The RediSearch quick start
The RediSearch command reference
References on features such as aggregations, highlights, stemming, and spelling correction.

Mailing list and forum

Got questions? Join us in #redisearch on the Redis Discord server.

If you have a more detailed question, drop us a line on the RediSearch Discussion Forum.

Client libraries

Official clients


NRedisStack	Jedis	node-redis	redis-py
Redis.OM	Redis OM Spring	redis-om-node	redis-om

Community-maintained clients

Project	Language	License	Author
redisson	Java	MIT	Redisson
redisearch-go	Go	BSD	Redis
rueidis	Go	Apache 2.0	Rueian
redisearch-php	PHP	MIT	Ethan Hann
php-redisearch	PHP	MIT	MacFJA
redisearch-api-rs	Rust	BSD	Redis
redi_search_rails	Ruby	MIT	Dmitry Polyakovsky
redisearch-rb	Ruby	MIT	Victor Ruiz
redi_search	Ruby	MIT	Nick Pezza
coredis	Python	MIT	Ali-Akber Saifee

RediSearch features

Full-Text indexing of multiple fields in Redis hashes
Incremental indexing without performance loss
Document ranking (using tf-idf, with optional user-provided weights)
Field weighting
Complex boolean queries with AND, OR, and NOT operators
Prefix matching, fuzzy matching, and exact-phrase queries
Support for double-metaphone phonetic matching
Auto-complete suggestions (with fuzzy prefix suggestions)
Stemming-based query expansion in many languages (using Snowball)
Support for Chinese-language tokenization and querying (using Friso)
Numeric filters and ranges
Geospatial searches using Redis geospatial indexing
A powerful aggregations engine
Supports for all utf-8 encoded text
Retrieve full documents, selected fields, or only the document IDs
Sorting results (for example, by creation date)

Cluster support

RediSearch has a distributed cluster version that scales to billions of documents across hundreds of servers. At the moment, distributed RediSearch is available as part of Redis Cloud and Redis Enterprise Software.

See RediSearch on Redis Enterprise for more information.

License

RediSearch is licensed under the Redis Source Available License 2.0 (RSALv2) or the Server Side Public License v1 (SSPLv1).

redisearch's People

Contributors

Stargazers

Watchers

Forkers

devopsbox umit rayleyva ceyhunkerti pfreixes davidalphafox zhanglei slact tuhina2020 rishabhpuri reqshark tonylau08 mheiber feathers88 amallia dvirsky nickbondarenko pkmnfrk zackives adrienbrault yurybubnov mbusath gavincastleton vruizext stewartpark arielw xushenkun allenli1990 mnunberg chrismayday zoyhao zhoudianyou awsekfozc faunjoe mission10138 qialie tjzx lntoly guyueyuqi cqwang lxq2537664558 zcfrank1st iloveyou416068 fc13240 hiekay benfucai pygmalion666 smallhu qiansunn jy4618272 rainleo woniulinux asirxing nanbo99 zq9206 snzigod vincent4reason imran-ideas2it jichi-houzz lusenoot deivamagalhaes hugoren houzz ethanhann tornado008 tguangch tenglongwentian tenglognwentian soloestoy wqw-dream lamby junhe jai2033shankar zhenyuan66 sherlockwu rogervaas tuananh josansun dut3062796s chubbymaggie db4u sifusanka slavaganzin panaedra xuzong 13dagger hrsrinivas legendzhu nickyzhang chengyanglei haiyhe1991 ljx0305 salt4pommes ethan-k kink80 zhoujian1210 cassandra-t-oduola tangyibo cnangel luckyseal

redisearch's Issues

Support Configurable Stopwords

We should provide a way to configure per-instance and/or per-index and/or per field list of stopwords, instead of the very primitive hard-coded list we have today.

MAC OS x compilation issues

Ive got some issues compiling RedisSearch in my MAC OS X Yosemite, first issue with the header to import all definitions regarding the malloc.h file and the second o one related with the u_short symbol definition. Next snippet displays the silly modifications made to overcome this issues.

diff --git a/src/rmutil/sdsalloc.h b/src/rmutil/sdsalloc.h
index 33ee741..1538fdf 100644
--- a/src/rmutil/sdsalloc.h
+++ b/src/rmutil/sdsalloc.h
@@ -36,7 +36,11 @@
  * the include of your alternate allocator if needed (not needed in order
  * to use the default libc allocator). */

+#if defined(__MACH__)
+#include <stdlib.h>
+#else
 #include <malloc.h>
+#endif
 //#include "zmalloc.h"
 #define s_malloc malloc
 #define s_realloc realloc
diff --git a/src/types.h b/src/types.h
index 13cde82..9d5be91 100644
--- a/src/types.h
+++ b/src/types.h
@@ -2,8 +2,12 @@
 #define __MDMA_TYPES_H__
 #include <stdlib.h>

+#if defined(__MACH__)
+#include <sys/types.h>
+#endif
+
 typedef u_int32_t t_docId;
 typedef u_int32_t t_offset;


-#endif
\ No newline at end of file
+#endif

Aside of this issue, just worth mentioning that default linker in MAC does not recognize the -Bsymbolic option. To address this issue use the original gcc command rather than the default one shipped with the MAC

$ brew install gcc48
$ export CC=/usr/local/Cellar/gcc48/4.8.4/bin/gcc-4.8
$ export LD=/usr/local/Cellar/gcc48/4.8.4/bin/gcc-4.8
$ make all

Dropping an index and recreating causes duplicates if using a database other than 0

Basically, FT.DROP doesn't seem to clean up all the keys it probably should if you're in a different database. We are running v0.10.

Here is a script that demonstrates the problem. The first time I run it, I get this output:

Recreate index:
OK

Add data:
OK
OK
OK

Index stats:
 1) index_name
 2) test
 3) fields
 4) 1) 1) body
       2) type
       3) TEXT
       4) weight
       5) "1"
 5) num_docs
 6) "3"
 7) max_doc_id
 8) "3"
 9) num_terms
10) "16"
11) num_records
12) "16"
13) inverted_sz_mb
14) "7.62939453125e-05"
15) inverted_cap_mb
16) "0"
17) inverted_cap_ovh
18) "inf"
19) offset_vectors_sz_mb
20) "1.52587890625e-05"
21) skip_index_size_mb
22) "0"
23) score_index_size_mb
24) "0"
25) doc_table_size_mb
26) "9.72747802734375e-05"
27) key_table_size_mb
28) "0.00010013580322265625"
29) records_per_doc_avg
30) "5.333333333333333"
31) bytes_per_record_avg
32) "5"
33) offsets_per_term_avg
34) "1"
35) offset_bits_per_record_avg
36) "8"

Query:
1) (integer) 1
2) "d1"
3) (empty list or set)

Drop index:
OK

However, the second time I run it, I get:

Query:
1) (integer) 2
2) "d1"
3) (empty list or set)
4) "d1"
5) (empty list or set)

A third time yields three results, etc.

If you edit the script to target db 0, then it behaves the way you expect.

0.17 exact search syntax change

seems exact search syntax was changed and now need to use ` except "
I think it need to be documented in release notes and may be change parser to work with " for back compatibility

Error with Snowball libstemmer compilation on Alpine

Hi,

I'm in a journey to install RediSearch on Alpine Linux. First problem was this RedisLabs/triemap#5

Now, is the snowball libstemmer compilation fail:

Console log:

make[1]: Entering directory '/usr/src/RediSearch/RediSearch/src/dep/snowball'
sed 's/@MODULES_H@/modules.h/' libstemmer/libstemmer_c.in >libstemmer/libstemmer.c
libstemmer/mkmodules.pl libstemmer/modules.h src_c libstemmer/modules.txt libstemmer/mkinc.mak
cc -Wall -Wno-unused-function -Wno-unused-variable -Wno-unused-result -fPIC -D_GNU_SOURCE -std=gnu99 -I"/usr/src/RediSearch/RediSearch/src" -DREDIS_MODULE_TARGET -O3 -g -Iinclude  -c -o compiler/space.o compiler/space.c
cc -Wall -Wno-unused-function -Wno-unused-variable -Wno-unused-result -fPIC -D_GNU_SOURCE -std=gnu99 -I"/usr/src/RediSearch/RediSearch/src" -DREDIS_MODULE_TARGET -O3 -g -Iinclude  -c -o compiler/tokeniser.o compiler/tokeniser.c
cc -Wall -Wno-unused-function -Wno-unused-variable -Wno-unused-result -fPIC -D_GNU_SOURCE -std=gnu99 -I"/usr/src/RediSearch/RediSearch/src" -DREDIS_MODULE_TARGET -O3 -g -Iinclude  -c -o compiler/analyser.o compiler/analyser.c
compiler/analyser.c: In function 'check_name_type':
compiler/analyser.c:235:19: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
         case 'r': if (p->type == t_routine ||
                   ^~
compiler/analyser.c:236:54: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'if'
                       p->type == t_external) return; break;
                                                      ^~~~~
compiler/analyser.c: In function 'read_program':
compiler/analyser.c:942:21: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
                     if (q->used && q->definition == 0) error4(a, q); break;
                     ^~
compiler/analyser.c:942:70: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'if'
                     if (q->used && q->definition == 0) error4(a, q); break;
                                                                      ^~~~~
compiler/analyser.c:944:21: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
                     if (q->used && q->grouping == 0) error4(a, q); break;
                     ^~
compiler/analyser.c:944:68: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'if'
                     if (q->used && q->grouping == 0) error4(a, q); break;

Any idea?

BUG when autocompleting unicode strings

Following search provides no results:

redisCommand(ctx, "FT.SUGADD userslex %b 1", "\u010Caji\u0107", sizeof("\u010Caji\u0107") - 1);
redisCommand(ctx, "FT.SUGGET userslex %b MAX 2", "\u010Caj", sizeof("\u010Caj") - 1);

Another example using random bytes:

uint32_t a = 1234432413;
redisCommand(ctx, "FT.SUGADD userslex %b 1", &a, 4);
redisCommand(ctx, "FT.SUGGET userslex %b MAX 2", &a, 3); // nothing found

Wrong result in AND query

I created an index with a single document: FT.ADD myIdx doc1 1.0 FIELDS title "hello world" body "lorem ipsum" url "http://redis.io".
From my understanding of query syntax documentation the following queries should have the same meaning:

FT.SEARCH myIdx test ipsum LIMIT 0 1
FT.SEARCH myIdx ipsum test LIMIT 0 1
They both should return the intersection of test AND ipsum.

Apparently the result set is empty for the first query (correct), while doc1 is returned for the second query (wrong).

Is this an expected behaviour?

Wrong complex query parsing

Hello, i had another kind of problem, i have index with fields @status : text, @current_bet : NUMERIC

I want to find items with: status = open & current_bet > 200 & current_bet < 6000, need to use syntax with 3 conditions

127.0.0.1:7771> FT.EXPLAIN 2_25_1496935565 "   ( @status: \"open\" )   ( (@current_bet: [200.000000 +inf]) (@current_bet: [-inf 6000.000000]) )  "
@status:INTERSECT {
  @status:open
  @status:INTERSECT {
    NUMERIC {200.000000 <= @current_bet <= inf}
    NUMERIC {-inf <= @current_bet <= 6000.000000}
  }
}

same search but duplicate one condition in other place (problem not in duplicate, just for illustrate a problem)

127.0.0.1:7771> FT.EXPLAIN 2_25_1496935565 "   ( @status: \"open\" )   ( (@current_bet: [200.000000 +inf]) (@current_bet: [-inf 6000.000000]) ) (@current_bet: [200.000000 +inf])  "
@status:INTERSECT {
  @status:open
  @status:INTERSECT {
    NUMERIC {200.000000 <= @current_bet <= inf}
    NUMERIC {-inf <= @current_bet <= 6000.000000}
    NUMERIC {200.000000 <= @current_bet <= inf}
  }
}

127.0.0.1:7771> FT.SEARCH 2_25_1496935565 "   ( @status: \"open\" )   ( (@current_bet: [200.000000 +inf]) (@current_bet: [-inf 6000.000000]) )  "
5

127.0.0.1:7771> FT.SEARCH 2_25_1496935565 "   ( @status: \"open\" )   ( (@current_bet: [200.000000 +inf]) (@current_bet: [-inf 6000.000000]) ) (@current_bet: [200.000000 +inf])  "
7

i had different result, seems that @status was ignored

Correct way to use GEO/GEOFILTER?

This is a pretty basic example I put together based on the docs, but these commands yield zero search results...

FT.CREATE GeoTest SCHEMA name TEXT place GEO
FT.ADD GeoTest 158ce9ce941bce 1 FIELDS name Foo place Washington D.C. -77.0366 38.8977
FT.SEARCH GeoTest Foo GEOFILTER place -77.0366 38.8977 1 km

These are the keys the commands create...

1) "ft:GeoTest/foo"
2) "idx:GeoTest"
3) "158ce9ce941bce"

There should probably be a key called something like "geo:GeoTest/place", correct?

Segmentation Fault when running FT.SEARCH

10209:M 26 Jun 00:03:46.864 # Redis 3.9.103 crashed by signal: 11
10209:M 26 Jun 00:03:46.864 # Accessing address: (nil)
10209:M 26 Jun 00:03:46.864 # Failed assertion: <no assertion failed> (<no file>:0)

------ STACK TRACE ------
redis-server 172.30.4.218:6379(logStackTrace+0x29)[0x466ee9]
redis-server 172.30.4.218:6379(sigsegvHandler+0xac)[0x46758c]
/lib64/libpthread.so.0(+0xf370)[0x7f49a2298370]

------ INFO OUTPUT ------
# Server
redis_version:3.9.103
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:297697f387be7f5a
redis_mode:standalone
os:Linux 4.9.32-15.41.amzn1.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.3
process_id:10209
run_id:0bc09b79062ccd5e7739376c50eb61578215abf7
tcp_port:6379
uptime_in_seconds:17
uptime_in_days:0
hz:10
lru_clock:5263202
executable:/home/ec2-user/redis-server
config_file:/etc/redis/redis.conf

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:1

# Memory
used_memory:824792
used_memory_human:805.46K
used_memory_rss:6086656
used_memory_rss_human:5.80M
used_memory_peak:864720
used_memory_peak_human:844.45K
used_memory_peak_perc:95.38%
used_memory_overhead:775042
used_memory_startup:757664
used_memory_dataset:49750
used_memory_dataset_perc:74.11%
total_system_memory:513331200
total_system_memory_human:489.55M
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:419430400
maxmemory_human:400.00M
maxmemory_policy:allkeys-lru
mem_fragmentation_ratio:7.38
mem_allocator:jemalloc-4.0.3
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1498435409
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:1
total_commands_processed:3
instantaneous_ops_per_sec:0
total_net_input_bytes:105
total_net_output_bytes:1055155
instantaneous_input_kbps:0.03
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:1
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:master
connected_slaves:0
master_replid:d8550b568ae09e2d0c3ed8afb8cb393670afbb2a
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.00
used_cpu_user:0.01
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Commandstats
cmdstat_auth:calls=1,usec=2,usec_per_call=2.00
cmdstat_command:calls=1,usec=297,usec_per_call=297.00

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=9,expires=0,avg_ttl=0

------ CLIENT LIST OUTPUT ------
id=2 addr=127.0.0.1:48390 fd=8 name= age=2 idle=0 flags=b db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ft.search

------ REGISTERS ------
10209:M 26 Jun 00:03:46.866 # 
RAX:00007f4999bfbf88 RBX:00000000021bbef0
RCX:00007f49a229725a RDX:00000000021aa968
RDI:00007f499b41c570 RSI:0000000000000000
RBP:00000000021bbdf0 RSP:00007f499815ed38
R8 :0000000000000000 R9 :00000000000027e8
R10:0000000000000000 R11:0000000000000206
R12:00000000021aa900 R13:00000000021aa910
R14:00000000021aa968 R15:00000000021aa938
RIP:0000000000000000 EFL:0000000000010202
CSGSFS:002b000000000033
10209:M 26 Jun 00:03:46.866 # (00007f499815ed47) -> 0000000000000000
10209:M 26 Jun 00:03:46.866 # (00007f499815ed46) -> 0000000000000000
10209:M 26 Jun 00:03:46.866 # (00007f499815ed45) -> 0000000000000000
10209:M 26 Jun 00:03:46.866 # (00007f499815ed44) -> 0000000000000000
10209:M 26 Jun 00:03:46.866 # (00007f499815ed43) -> 0000000000000000
10209:M 26 Jun 00:03:46.866 # (00007f499815ed42) -> 000000332d6c6f6f
10209:M 26 Jun 00:03:46.866 # (00007f499815ed41) -> 702d646165726874
10209:M 26 Jun 00:03:46.866 # (00007f499815ed40) -> 00007f49999b5774
10209:M 26 Jun 00:03:46.866 # (00007f499815ed3f) -> 00000000021aa968
10209:M 26 Jun 00:03:46.866 # (00007f499815ed3e) -> 00000000021aa910
10209:M 26 Jun 00:03:46.866 # (00007f499815ed3d) -> 00000000021aa900
10209:M 26 Jun 00:03:46.866 # (00007f499815ed3c) -> 00000000021ac7c8
10209:M 26 Jun 00:03:46.866 # (00007f499815ed3b) -> 00000000021bbef0
10209:M 26 Jun 00:03:46.866 # (00007f499815ed3a) -> 00007f49a2294765
10209:M 26 Jun 00:03:46.866 # (00007f499815ed39) -> 00007f499815ed70
10209:M 26 Jun 00:03:46.866 # (00007f499815ed38) -> 00007f49999b4c8c

any info about Hebrew support?

NUMERIC data won't delete when drop index.

I run this command.

FT.CREATE myIdx SCHEMA title TEXT WEIGHT 5.0 body TEXT url TEXT sc NUMERIC 
FT.ADD myIdx doc1 1.0 FIELDS title "hello world" body "lorem ipsum" url "http://redis.io" sc 100
FT.DROP myIdx
KEYS *myIdx*

Turn out to be

1) "nm:myIdx/sc"

When I drop an index, it will not delete the NUMERIC data automatically.

Support ALTER INDEX

After the creation, the index is currently fixed and cannot be changed. ALTER should be able to:

Add indexed fields
Change field weight (forward only)
Add sortable fields
Change stopwords
Change field names.

Some of these will require reindexing, which is another issue. We should start by supporting only the changes that do not require re-indexing.

Wildcard Search

Is it possible to query partial words with a wildcard query, such as "he*" for "hello"?

Complex queries of OR and AND doesn't work as expected

I see that AND queries work and OR queries work.
However AND and OR together doesn't work as expected.
Please advise !

"FT.SEARCH" "vc" "(7F647D08|7F647D08)" "NOCONTENT" "VERBATIM" "NOSTOPWORDS" "LIMIT" "0" "1000000"
25,246 results

"FT.SEARCH" "vc" "(7E765B38|7E765B38)" "NOCONTENT" "VERBATIM" "NOSTOPWORDS" "LIMIT" "0" "1000000"
11,468 results

"FT.SEARCH" "vc" "(7F647D08|7F647D08)(7E765B38|7E765B38)" "NOCONTENT" "VERBATIM" "NOSTOPWORDS" "LIMIT" "0" "1000000"

(integer) 1
"v:x544tok"

"FT.SEARCH" "vc" "(7F647D08)(7E765B38)" "NOCONTENT" "VERBATIM" "NOSTOPWORDS" "LIMIT" "0" "1000000"
9,555 results

search queries failling with "Internal error processing query"

i tried out the main example from the wiki.

`27.0.0.1:6379>` FT.CREATE myIdx title 5.0 body 1.0 url 1.0
OK
127.0.0.1:6379> FT.ADD myIdx doc1 1.0 fields title "hello world" body "lorem ipsum" url "http://redis.io" 
OK
`127.0.0.1:6379> FT.SEARCH myIdx "http://re"
1) (integer) 0
127.0.0.1:6379> FT.SEARCH myIdx "hell"
(error) Internal error processing query
127.0.0.1:6379> FT.SEARCH myIdx "wor"
(error) Internal error processing query

i am using the latest redis (4.0-rc2)

"Module path/to/module/module.so initialization failed. Module not loaded" when cluster mode enabled

When cluster mode is enabled in the config file like so:
port 6379
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes
loadmodule /path/RediSearch/src/module.so
I get the error "Module path/to/module/module.so initialization failed. Module not loaded".
The module however loads fine in standalone mode
Am I doing something wrong? Is there a special configuration for cluster mode?

Redisearch list indexes

Hi,
i'm currently using RediSeacrh 0.4 in production and i use FT.INFO num_docs field to get the number of documents in my index. in this method i get less documents then i expect and last documents than showed in redis_docIdCounter, but i guess this counter is global to all indexes.
Is there a way to list indexes ? this will help me debug the problem.

Core when using FT.INFO

1.redis version: 4.0 (https://github.com/antirez/redis/archive/4.0.zip)
2.redisearch version: github master branch
3.command before core: FT.INFO indexName
4.default config, not change any config nor use any redis.conf
5.OS: Darwin Mac.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64
6.gcc: Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1 Apple LLVM version 8.1.0 (clang-802.0.42)
Here is the detail messgae:
6029:M 06 Jul 13:17:46.751 * Ready to accept connections
=== REDIS BUG REPORT START: Cut & paste starting from here ===
6029:M 06 Jul 13:53:02.528 # === ASSERTION FAILED ===
6029:M 06 Jul 13:53:02.528 # ==> networking.c:1898 'c->reply_bytes < SIZE_MAX-(102464)' is not true
6029:M 06 Jul 13:53:02.528 # (forcing SIGSEGV to print the bug report.)
6029:M 06 Jul 13:53:02.528 # Redis 3.9.103 crashed by signal: 11
6029:M 06 Jul 13:53:02.528 # Crashed running the instuction at: 0x10570088d
6029:M 06 Jul 13:53:02.528 # Accessing address: 0xffffffffffffffff
6029:M 06 Jul 13:53:02.528 # Failed assertion: c->reply_bytes < SIZE_MAX-(102464) (networking.c:1898)

------ STACK TRACE ------
EIP:
0 redis-server 0x000000010570088d _serverAssert + 157

Backtrace:
0 redis-server 0x00000001057024dd logStackTrace + 109
1 redis-server 0x000000010570288c sigsegvHandler + 236
2 libsystem_platform.dylib 0x00007fff9b9f3b3a _sigtramp + 26
3 ??? 0x0000000000000400 0x0 + 1024
4 redis-server 0x00000001056c80f9 asyncCloseClientOnOutputBufferLimitReached + 361
5 redis-server 0x000000010572dfe6 RM_ReplyWithSimpleString + 102
6 module.so 0x00000001058ea5b4 IndexInfoCommand + 148
7 redis-server 0x000000010572cdf3 RedisModuleCommandDispatcher + 147
8 redis-server 0x00000001056ba998 call + 216
9 redis-server 0x00000001056bb221 processCommand + 1297
10 redis-server 0x00000001056cb234 processInputBuffer + 292
11 redis-server 0x00000001056b3201 aeProcessEvents + 689
12 redis-server 0x00000001056b34eb aeMain + 43
13 redis-server 0x00000001056be13b main + 1643
14 libdyld.dylib 0x00007fff9b7e4235 start + 1
15 ??? 0x0000000000000003 0x0 + 3

....
------ CLIENT LIST OUTPUT ------
id=2 addr=127.0.0.1:51730 fd=8 name= age=2095 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=2 omem=18446744073709551458 events=r cmd=ft.info

------ CURRENT CLIENT INFO ------
id=2 addr=127.0.0.1:51730 fd=8 name= age=2095 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=2 omem=18446744073709551458 events=r cmd=ft.info
argv[0]: 'FT.INFO'
argv[1]: 'foodspu'

FEATURE REQUEST: utf-8 support in autocomplete

It's me again!
Just tried some searches containing utf-8 characters and here are the results:

// add utf8-encoded string
redisCommand(ctx, "FT.SUGADD userslex %b 1", "\u010Caji\u0107", sizeof("\u010Caji\u0107") - 1); 

// None of the two queries bellow have found the match
redisCommand(ctx, "FT.SUGGET userslex %b MAX 2 FUZZY", "\u010Caj", sizeof("\u010Caj") - 1);
redisCommand(ctx, "FT.SUGGET userslex %b MAX 2 FUZZY", "aji", sizeof("aji") - 1);

// However, if the utf8 character comes after the searched characters
redisCommand(ctx, "FT.SUGADD userslex %b 1", "Caji\u0107", sizeof("Caji\u0107") - 1);

// The query can find it
redisCommand(ctx, "FT.SUGGET userslex %b MAX 2 FUZZY", "aji", sizeof("aji") - 1);

Is it possible to implement this?

results size wrongly reported as 50 for large single word indexes

hi.
i tried to insert a simple one fielded text to and index and then query it.
i inserted the word "hello' for multiple times.
everything was ok until the 5000 insertion but when i inserted the 5001 document the the search result for the world "hello" returned a constant result of 50 appearances , instead of 5001 appearances.

thx all

FT.SUGGET key ""，包含单引号或者双引号，代码里读不出来数据

Can't install redis-server with search module

Need the option to install redis-server 4.0 RCe and specify the search module for the executable.

New Ruby client library for RediSearch

So far the only Ruby gem to work with RediSearch is Rails only. I'm working on a project with Sinatra, so I had to build my own client, and now I've open sourced it. It is framework agnostic, the only dependence is the official Ruby client for Redis 4 RC1. It would be great if you could help to spread the word adding it to the README. Thanks!

https://github.com/vruizext/redisearch-rb
https://rubygems.org/gems/redisearch-rb/versions/0.1.3

LIMIT return redundant records on the end of list

From #61

Seems that paging work not absolutely correct. Imagine i want to organize paging functionality with page-size = 4

create an index and add five documents to it:

FT.CREATE indexName SCHEMA status TEXT

FT.ADD indexName doc1 1 FIELDS status open
FT.ADD indexName doc2 1 FIELDS status open
FT.ADD indexName doc3 1 FIELDS status open
FT.ADD indexName doc4 1 FIELDS status open
FT.ADD indexName doc5 1 FIELDS status open

Then i want to get first page with 4 elements in it:

FT.SEARCH indexName "@status:open" LIMIT 0 4
1) (integer) 5
2) "doc5"
3) 1) "status"
   2) "open"
4) "doc4"
5) 1) "status"
   2) "open"
6) "doc3"
7) 1) "status"
   2) "open"
8) "doc2"
9) 1) "status"
   2) "open"

response correct

Now i want go see last element:

FT.SEARCH indexName "@status:open" LIMIT 4 4
1) (integer) 5
2) "doc4"
3) 1) "status"
   2) "open"
4) "doc3"
5) 1) "status"
   2) "open"
6) "doc2"
7) 1) "status"
   2) "open"
8) "doc1"
9) 1) "status"
   2) "open"

Incorrect - must contains only doc1.

I think i can try fix it in backend and if it is last page - trim result and use just a tail, but it looks like a problem

Query using field modifiers returning syntax error if the field name contains '_'

FT.SEARCH idx @brand:(ford) @model:(mustang gt) => works

If now I index brand and model in a single field and name it brand_model

FT.SEARCH idx @brand_model:(ford mustang) => (error) Syntax error at offset 0 near 'brand'

RediSearch 0.16 with Redis 3.9.103

Normalize and fold Unicode text in auto-complete

Right now we do not process text entered into the autocomplete engine at all. no lower-casing, no accent removing etc.

With the introduction of libnu to the engine, it is now possible to clean and case-fold the input and the query text, to allow more recall and not have the user do this manually.

Building on Mac?

Are there additional dependencies or build steps for Mac that aren't covered in the quick start guide? Running "make all" produces this output:

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C util
make[1]: Nothing to be done for `all'.
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C rmutil
make[1]: Nothing to be done for `all'.
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C dep/snowball libstemmer.o
cc -Wall -Wno-unused-function -Wno-unused-variable -Wno-pointer-sign -fPIC  -std=gnu99 -I./ -DREDIS_MODULE_TARGET -mmacosx-version-min=10.6 -O3 -g -Iinclude  -c -o compiler/space.o compiler/space.c
Could not open input file: app/console
Could not open input file: app/console
Could not open input file: app/console
chmod: -R: No such file or directory
chmod: app/logs: No such file or directory
chmod: -R: No such file or directory
chmod: app/cache: No such file or directory
make[1]: *** [compiler/space.o] Error 1
make: *** [snowball] Error 2

Segfault when using GEOFILTER with missing arguments

A segfault occurs when omitting trailing arguments to GEOFILTER. For example, these commands trigger a segfault...

FT.SEARCH MyIndex Foo GEOFILTER place -77.0 38.0 10
FT.SEARCH MyIndex Foo GEOFILTER place -77.0 38.0
FT.SEARCH MyIndex Foo GEOFILTER place -77.0

Full bug report output...

=== REDIS BUG REPORT START: Cut & paste starting from here ===
48078:M 19 Mar 11:54:51.043 # Redis 999.999.999 crashed by signal: 11
48078:M 19 Mar 11:54:51.043 # Crashed running the instuction at: 0x10e68b03c
48078:M 19 Mar 11:54:51.043 # Accessing address: 0x1cea9cb88
48078:M 19 Mar 11:54:51.043 # Failed assertion: <no assertion failed> (<no file>:0)

------ STACK TRACE ------
EIP:
0   redis-server                        0x000000010e68b03c RM_StringPtrLen + 12

Backtrace:
0   redis-server                        0x000000010e660313 logStackTrace + 115
1   redis-server                        0x000000010e6606d5 sigsegvHandler + 245
2   libsystem_platform.dylib            0x00007fffa3223bba _sigtramp + 26
3   ???                                 0x000000000000001a 0x0 + 26
4   module.so                           0x000000010e83f0c5 rmutil_vparseArgs + 421
5   module.so                           0x000000010e83eeff RMUtil_ParseArgs + 127
6   module.so                           0x000000010e835bee GeoFilter_Parse + 94
7   module.so                           0x000000010e83cb2a SearchCommand + 602
8   redis-server                        0x000000010e68a220 RedisModuleCommandDispatcher + 96
9   redis-server                        0x000000010e618385 call + 213
10  redis-server                        0x000000010e618c27 processCommand + 1335
11  redis-server                        0x000000010e62905b processInputBuffer + 283
12  redis-server                        0x000000010e610cf9 aeProcessEvents + 649
13  redis-server                        0x000000010e610fdb aeMain + 43
14  redis-server                        0x000000010e61bb0a main + 1642
15  libdyld.dylib                       0x00007fffa3016255 start + 1

------ INFO OUTPUT ------
# Server
redis_version:999.999.999
redis_git_sha1:a62f7863
redis_git_dirty:0
redis_build_id:472f0e91ee86701f
redis_mode:standalone
os:Darwin 16.1.0 x86_64
arch_bits:64
multiplexing_api:kqueue
gcc_version:4.2.1
process_id:48078
run_id:9d49d6beb5461f209d45cb9d7ad94a35ca9c415f
tcp_port:6379
uptime_in_seconds:99
uptime_in_days:0
hz:10
lru_clock:13543883
executable:/Users/ethan/src/redis/src/./redis-server
config_file:

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:1051232
used_memory_human:1.00M
used_memory_rss:2723840
used_memory_rss_human:2.60M
used_memory_peak:1051232
used_memory_peak_human:1.00M
used_memory_peak_perc:100.15%
used_memory_overhead:1012870
used_memory_startup:963184
used_memory_dataset:38362
used_memory_dataset_perc:43.57%
total_system_memory:8589934592
total_system_memory_human:8.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:2.59
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1489938792
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:1
total_commands_processed:7
instantaneous_ops_per_sec:0
total_net_input_bytes:643
total_net_output_bytes:81
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:5
keyspace_misses:2
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:master
connected_slaves:0
master_replid:ca8e972b1b8bb0adc5d4bf198b6e949d639d787b
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.05
used_cpu_user:0.03
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Commandstats
cmdstat_select:calls=1,usec=6,usec_per_call=6.00
cmdstat_georadius:calls=1,usec=5,usec_per_call=5.00

# Cluster
cluster_enabled:0

# Keyspace
db15:keys=1,expires=0,avg_ttl=0

------ CLIENT LIST OUTPUT ------
id=2 addr=127.0.0.1:51438 fd=6 name= age=95 idle=0 flags=N db=15 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=ft.search

------ CURRENT CLIENT INFO ------
id=2 addr=127.0.0.1:51438 fd=6 name= age=95 idle=0 flags=N db=15 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=ft.search
argv[0]: 'FT.SEARCH'
argv[1]: 'GeoTest'
argv[2]: 'Foo'
argv[3]: 'GEOFILTER'
argv[4]: 'place'
argv[5]: '-77'
argv[6]: '38'
argv[7]: '100'

------ REGISTERS ------
48078:M 19 Mar 11:54:51.048 # 
RAX:000000010e877d20 RBX:000000010e84615b
RCX:00007fff515f3608 RDX:0000000000000001
RDI:00000001cea9cb80 RSI:0000000000000000
RBP:00007fff515f34a0 RSP:00007fff515f34a0
R8 :0000000000000001 R9 :0000000000000001
R10:0000000000000000 R11:00007f9975510a53
R12:00007fff515f3748 R13:000000010e83f128
R14:00007fff515f35b0 R15:0000000000000005
RIP:000000010e68b03c EFL:0000000000010246
CS :000000000000002b FS:0000000000000000  GS:0000000000000000
48078:M 19 Mar 11:54:51.048 # (00007fff515f34af) -> 0000000000000000
48078:M 19 Mar 11:54:51.048 # (00007fff515f34ae) -> 00007f997551ffff
48078:M 19 Mar 11:54:51.048 # (00007fff515f34ad) -> ff80000000001001
48078:M 19 Mar 11:54:51.048 # (00007fff515f34ac) -> 000000010e73f000
48078:M 19 Mar 11:54:51.048 # (00007fff515f34ab) -> 000000010e83eeff
48078:M 19 Mar 11:54:51.048 # (00007fff515f34aa) -> 00007fff515f35e0
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a9) -> 0000000000000001
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a8) -> 00007fff515f3748
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a7) -> 00007f9975510940
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a6) -> 00007f9975510940
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a5) -> b589ce9f5dcf0044
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a4) -> 0000000000000005
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a3) -> 00007f9975510960
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a2) -> 0000000000000a00
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a1) -> 000000010e83f0c5
48078:M 19 Mar 11:54:51.048 # (00007fff515f34a0) -> 00007fff515f34f0

------ DUMPING CODE AROUND EIP ------
Symbol: RM_StringPtrLen (base: 0x10e68b030)
Module: /Users/ethan/src/redis/src/./redis-server (base 0x10e60c000)
$ xxd -r -p /tmp/dump.hex /tmp/dump.bin
$ objdump --adjust-vma=0x10e68b030 -D -b binary -m i386:x86-64 /tmp/dump.bin
------
48078:M 19 Mar 11:54:51.048 # dump of function (hexdump of 140 bytes):
554889e54885ff74314885f6488b470874550fb650ff89d783e70731c983ff047742488d0d4300000048633cb94801cfffe7c1ea034889d1eb2a4885f6488d05b5750400742148c706280000005dc30fb648fdeb0f0fb748fbeb098b48f7eb04488b48ef48890e5dc30f1f00c6ffffffe3ffffffe9ffffffeffffffff4ffffff4889f0488b7f080fb64fff89

=== REDIS BUG REPORT END. Make sure to include from START to END. ===

"FT.SEARCH index the" returning error

When I run FT.SEARCH ix the, I'm receiving an error. Example:

127.0.0.1:6379> FT.SEARCH bands th
1) (integer) 0
127.0.0.1:6379> FT.SEARCH bands the
(error) Syntax error at offset 0 near ''
127.0.0.1:6379> FT.SEARCH bands then
(error) Syntax error at offset 0 near ''
127.0.0.1:6379> FT.SEARCH bands thenn
1) (integer) 0
127.0.0.1:6379> FT.SEARCH bands '@name:the'
(error) Syntax error at offset 5 near ':'
127.0.0.1:6379>

Other problematic terms: and, if, or...

Environment info:

Redis: 4.0 RC2
RediSearch: 0.12.1

RediSearch - FT.SEARCH with Field Modifiers or INFIELDS option

Trying to perform FT.SEARCH using Field Modifier and INFIELDS option but getting wrong results. Please see below.

I mentioned to search within Title but got result even if it contains in description.

Please suggest the right syntax if I am doing wrong.
Thanks in advance.

Feature suggestion: FT.SEARCHSTORE - store results in a Redis Sorted Set

It would be convenient to have the search result IDs persisted to a Sorted Set (optionally with score) for further manipulation such as intersection or union with other non-FT sets.

Query on special character doesnt seem to work

it seem that the query gets broken on special character instead of treating it as a string

example
127.0.0.1:6379> "FT.SEARCH" "idxvisa" "@BusinessName:"Wells Fargo Bank, National Association"" "VERBATIM"
(error) Syntax error at offset 27 near 'Bank'
works without comma
127.0.0.1:6379> "FT.SEARCH" "idxvisa" "@BusinessName:"Wells Fargo Bank National Association"" "VERBATIM"

(integer) 5
"decc7c4f-45ab-46c1-a1f6-62dd9277ccc4"
1. "LastModified"
2. "37527.108076"
3. "BusinessName"
4. "Wells Fargo Bank, National Association"
5. "Classification"
6. "Processor"
7. "ClientCountry"
8. "UNITED STATES OF AMERICA"
9. "BusinessID"
  10) "10000056"

The Module in a Redis Windows Service?

How to add this module to a Redis Windows Service?

when the document will be delete?

hi. I am using the version 0.16.0
"NOTE: This does not actually delete the document from the index, just marks it as deleted. Thus, deleting and re-inserting the same document over and over will inflate the index size with each re-insertion."
I want to know when the document will actually delete because my index size is very large large now.
Thanks~

LIMIT for NUMERIC fields not working

steps to reproduce

prepare data

FT.CREATE testidx SCHEMA num NUMERIC name TEXT
FT.ADD testidx doc1 1 FIELDS num 140 name test
FT.ADD testidx doc2 1 FIELDS num 1400 name test
FT.ADD testidx doc3 1 FIELDS num 1405 name test

check with 2 records

 FT.SEARCH testidx "@num:[0 1000000]" LIMIT 0 2
return: doc1, doc3

 FT.SEARCH testidx "@num:[0 1000000]" LIMIT 1 2
return: doc2, doc3

 FT.SEARCH testidx "@num:[0 1000000]" LIMIT 2 2
return: doc2, doc3

check with 1 record

 FT.SEARCH testidx "@num:[0 1000000]" LIMIT 0 1
return: doc3

 FT.SEARCH testidx "@num:[0 1000000]" LIMIT 0 1
return: doc3

Can't use uppercase symbols for field names

steps to reproduce:

FT.CREATE testidx SCHEMA somePrice NUMERIC
FT.ADD testidx doc1 1 FIELDS somePrice 144
FT.SEARCH testidx "@somePrice:[0 200]" LIMIT 1 1

return

(integer) 0

Two separate indexes affects each other

create two absolutely separate indexed mixes1 & mixes2 and add two separate documents to each

FT.CREATE mixes1 SCHEMA name text
FT.CREATE mixes2 SCHEMA namename text
FT.ADD mixes1 doc10 1 FIELDS name name
FT.ADD mixes2 doc10 1 FIELDS namename name

notice, that doc10 in mixes1 not same as doc10 in mixes2
search document in mixes1 by name, expect to get doc10 just with name field:

FT.SEARCH mixes1 "@name:name*"


2) "doc10"
3) 1) "name"
   2) "name"
   3) "namename"
   4) "name"

result mixed with mixes2 index fields

Design.MD missing

The README refers to a Design.MD file, but the link is broken.

Thanks for your work on this, and I'm looking forward to reading about the design!

Python - Is Index Exists

when upgrading from 0.4 to 0.9, creating an existing index raised a ResponseError.
is there a way to check if index exist without an exception?

Mark stems with a flag so verbatim search doesn't show them

right now verbatim doesn't stem the word, but if the word is a stem of another word, it still pops up in the results, which is bad

Error searching for

FT.CREATE User SCHEMA name TEXT
FT.ADD User 'bob'
FT.SEARCH User ''

RESULT
(error) Syntax error at offset 0 near ''

Sometimes UI searches pass empty strings when person clicks Search w/o specifying anthing

Create custom tokenization API

We can take one of two approaches (or both):

Allow tokenization extensions.
Allow a structure, pre-tokenized document format where the user has more control of the tokens and their flags.

Feature suggestion: FT.SEARCH - optional "which fields to return" modifier

The INFIELDS modifier tells it which fields to look at, but it returns either all or nothing (via NOCONTENT), but - this doesn't seem very granular compared to similar tools that allow restriction of returned fields (you might want a few key pieces of metadata, but not the heavy text data, for example).

Suggest: something like INFIELDS, but for the values to return in addition to the document id (defaults to all if not specified). There's a question as to what to do if this proposed modifier was specified in addition to NOCONTENT... I guess either an error or NOCONTENT wins?

This is not a blocker for me, though - merely a suggestion.

Server dies on FT.SEARCH

I can't seem to get the module to run. I'm running Redis 4 RC3 and RediSearch 0.18 on macOS. Every time I attempt to FT.SEARCH the server dies and the subsequently client eats it.

Here is what I'm doing:

127.0.0.1:6379> ft.create redishop-ft schema name text
OK
127.0.0.1:6379> hget redishop3:items:Small-Abstract-Mug name
"Small Abstract Mug"
127.0.0.1:6379> FT.ADDHASH redishop-ft redishop3:items:Small-Abstract-Mug 1
OK
127.0.0.1:6379> FT.SEARCH redishop-ft "hello world"
Could not connect to Redis at 127.0.0.1:6379: Connection refused

I can simplify it even more:

127.0.0.1:6379> ft.create redishop-ft schema name text
OK
127.0.0.1:6379> FT.SEARCH redishop-ft "hello world"
Could not connect to Redis at 127.0.0.1:6379: Connection refused

Attached is the bug report.

redisearch-bug-report.txt

Support for Unicode Fuzzy suggestions

Right now the trie and the Levenshtein automaton that handle prefix searches, operate on chars. This means that while we can store and search for prefixes using utf-8 non-latin strings (or any encoding for that matter), fuzzy matching will not work, because levenshtein distance is calculated in bytes and not codepoints / letters.

We need to operate on unicode runes, and not on chars. However, doing this with variable length encoding like utf-8 is difficult in the current implementation of the trie, and will probably reduce performance considerably.

In order to support this, the following steps will be needed:

The trie and levenshtein automaton will be converted to operate on unicode wide characters or runes.
However, to avoid having to hold 32-bits per letter in the trie, which will consume tons of memory, we'll use 16-bit unicode runes, ignoring anything that is not contained in Unicode BMP. This will work for all modern languages.
The Levenshtein automaton will be converted to use 16-bit runes as well. Thus evaluation of any input string will have the correct results in terms of Levenshtein distance. (this means converting the DFA from using a 255 byte pointer array to the next state node, to a sparse array of pointers. This will slow down searches a bit, but can be sped up with sorting the array and using binary search.

Converting text

We assume all input to FT.SUG* commands is valid utf-8.
We convert the input strings to 32-bit unicode, optionally normalizing, case-folding and removing accents on the way. If the conversion fails it's because the input is not valid utf-8.
We trim the 32-bit runes to 16-bit runes using the lower 16 bits. These can be used for insertion, deletion and search.
We convert the output of searches back to utf-8.

Memory Penalty

The above will make the module consume more memory. However, the penalty will not be too bad:

If all the strings in the trie are non-latin utf-8, there won't be much change, since these are already expressed as 2 or 3 bytes in utf-8.
If all strings are latin ascii - while the text representations in the trie will take exactly X2 more memory - in reality, the penalty will be a lot less: Depending on the fragmentation of the tree, anywhere from 20 to 80% of the memory is pointers and metadata. Thus doubling the memory on the rest of the data will not double the memory consumption directly.

Any mix of 1 and 2 will yield a result in between.

The Levenshtein automaton will take up more memory, but this will be negligible.

FT.CREATE myIdx SCHEMA title TEXT WEIGHT 5.0 body TEXT url TEXT sc NUMERIC SORTABLE
FT.ADD myIdx doc1 1.0 FIELDS title "hello world" body "lorem ipsum" url "http://redis.io" sc 1
FT.ADD myIdx doc2 1.0 FIELDS title "hello world" body "lorem ipsum" url "http://redis.io" sc 2

ok. here we sort by sc field(asc) and limit 2 doc return.

FT.SEARCH myIdx "hello" NOCONTENT SORTBY sc ASC LIMIT 0 2

1) (integer) 2
2) "doc1"
3) "doc2"

now we just want return the first one.

FT.SEARCH myIdx "hello" NOCONTENT SORTBY sc ASC LIMIT 0 1

1) (integer) 2
2) "doc2"

it will return doc2. After my practice found, not mater how many data I inserted, when I use limit and sort, the last doc is fixed.