
zuliasearch's Introduction

Zulia

Distributed Lucene with deep object searching

Zulia is a real-time distributed search and storage system. Zulia is designed to scale both vertically and horizontally across servers.

Zulia is:

  • Realtime
  • Distributed
  • Pure Java
  • Open Source
  • Based on Lucene 9.x

Zulia supports:

To learn more, see the wiki.


zuliasearch's Issues

Allow better query handling of nested fields

Consider the sample document:

{
   "id": "123",
   "title": "The best title",
   "authors": [
      {
         "firstName": "Tom",
         "lastName": "Jones"
      },
      {
         "firstName": "Jennifer",
         "lastName": "Smith"
      }
   ]
}

Currently it is possible to search authors.firstName:"Tom" AND authors.lastName:"Smith", but that would erroneously match the document above because the matching is done at the root document level rather than per author.

Using Lucene's block join, it is possible to index child documents alongside the parent document and run separate queries against the child documents.

Zulia should support defining authors as a nested type and then allow searching on firstName and lastName correctly within that nested type.
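
For reference, a minimal sketch of the raw Lucene block-join pattern this could build on (the docType marker field and the flattened child field names are illustrative, not existing Zulia API):

import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

// index each author as its own child document, parent document last in the block
void indexBook(IndexWriter writer) throws Exception {
    Document tom = new Document();
    tom.add(new StringField("authors.firstName", "Tom", Field.Store.NO));
    tom.add(new StringField("authors.lastName", "Jones", Field.Store.NO));

    Document jennifer = new Document();
    jennifer.add(new StringField("authors.firstName", "Jennifer", Field.Store.NO));
    jennifer.add(new StringField("authors.lastName", "Smith", Field.Store.NO));

    Document parent = new Document();
    parent.add(new StringField("id", "123", Field.Store.YES));
    parent.add(new StringField("docType", "parent", Field.Store.NO));

    writer.addDocuments(List.of(tom, jennifer, parent)); // keeps the block contiguous
}

// both clauses must now match within the SAME author child document, so
// firstName "Tom" with lastName "Smith" no longer matches the document above
Query authorQuery() {
    Query childQuery = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("authors.firstName", "Tom")), BooleanClause.Occur.MUST)
            .add(new TermQuery(new Term("authors.lastName", "Smith")), BooleanClause.Occur.MUST)
            .build();
    QueryBitSetProducer parentFilter = new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));
    return new ToParentBlockJoinQuery(childQuery, parentFilter, ScoreMode.None);
}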

Add term query syntax and numeric set syntax

Currently term queries and numeric set queries can be executed as separate queries but not as part of the search syntax.
Add new search syntax to enable these queries inline inside another query.

Term Queries

field1:zl:tq(term1 term2 term3)  // search for the terms term1 term2 term3 in field1
zl:tq(term1 term2)               // search for the terms term1 term2 in the default search fields

Keyword tokenized fields can contain terms with spaces or dashes; such terms should be quoted.

Numeric Set Queries

field1:zl:ns(1 2 3)              // field1 must be numeric
zl:ns(1 2 3 4)                   // uses the default search fields which all must be numeric

To search a field literally named zl (not a recommended field name):

// add zl as a query field instead of using the zl: prefix
new FilterQuery("value").addQueryFields("zl");

// or use multi-field syntax with an empty second field
new FilterQuery("zl,:ns");

Update to Lucene 9.7

Full List:
https://lucene.apache.org/core/9_7_0/changes/Changes.html#v9.7.0

Release notes most relevant to Zulia:

  • KNN indexing and querying can now take advantage of vectorization for distance computation between vectors. To enable this, use exactly Java 20 or 21, and pass --add-modules jdk.incubator.vector as a command-line parameter to the Java program (see the example after this list).

  • Queries sorted by field are now able to dynamically prune hits only using the after value. This yields major speedups when paginating deeply.

  • Reduced merge-time overhead of computing the number of soft deletes.

  • KNN vectors are now disallowed to have non-finite values such as NaN or ±Infinity.
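
For example, if zuliad is launched directly with a java command (the actual launcher script and jar name may differ; the jar name below is illustrative), the incubator module is enabled like this:

java --add-modules jdk.incubator.vector -jar zulia-server.jar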

Allow efficiently fetching all values for facetable fields

Currently the only way to fetch all values for a field is to use a keyword analyzer and run a GetTerms request (which will contain deleted values) on the keyword-analyzed field, or to do a *:* search with a facet on a facetable field.

The Zulia facet index already has all the values, and they can be fetched without running a *:* query and counting every document.

Add a GetFacetValuesRequest that takes an optional list of categories and an optional path. Facet values are never deleted from the taxonomy, so this has the same deleted-value issue as GetTerms.
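
A hypothetical sketch of the proposed client call (GetFacetValuesRequest and its accessors do not exist yet; every name below is illustrative):

// hypothetical API: fetch all known values of the "authors" dimension
GetFacetValuesRequest request = new GetFacetValuesRequest("myIndex")
        .setCategories(List.of("authors"))  // optional categories
        .setPath("Jones");                  // optional path within a hierarchical facet
GetFacetValuesResult result = zuliaWorkPool.getFacetValues(request);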

Default Command Line Host/Port to Environment Variables

Follow these rules for Zulia server specification in the command line utils (everything except zuliad):

  • If --server is given, use the server given on the command line
  • If --server is NOT given, use ZULIA_HOST as the default server if set; otherwise default to localhost

Follow the same logic for --port and ZULIA_PORT, with a default value of 32191.
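
A minimal sketch of that resolution logic, assuming the parsed option values arrive as nullable strings (helper names are illustrative):

// --server wins, then ZULIA_HOST, then localhost
static String resolveServer(String serverArg) {
    if (serverArg != null) {
        return serverArg;
    }
    String envHost = System.getenv("ZULIA_HOST");
    return envHost != null ? envHost : "localhost";
}

// --port wins, then ZULIA_PORT, then 32191
static int resolvePort(String portArg) {
    if (portArg != null) {
        return Integer.parseInt(portArg);
    }
    String envPort = System.getenv("ZULIA_PORT");
    return envPort != null ? Integer.parseInt(envPort) : 32191;
}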

Change Associated Meta To Accept Any JSON/BSON metadata

The current structure for meta is Map<String, String>. This was done due to a protobuf limitation: maps are key/value only, with no handling for generic objects. This needs to be changed to accept a byte[] so that a BSON document with proper types can be stored.
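
A minimal sketch of round-tripping a typed document through a byte[] using the org.bson classes (assuming the MongoDB BSON library is available; the meta field names are illustrative):

import java.nio.ByteBuffer;

import org.bson.BsonBinaryReader;
import org.bson.BsonBinaryWriter;
import org.bson.Document;
import org.bson.codecs.DecoderContext;
import org.bson.codecs.DocumentCodec;
import org.bson.codecs.EncoderContext;
import org.bson.io.BasicOutputBuffer;

// encode a BSON document with real types (ints, booleans, nested docs) to byte[]
Document meta = new Document("pages", 42).append("published", true);
BasicOutputBuffer buffer = new BasicOutputBuffer();
new DocumentCodec().encode(new BsonBinaryWriter(buffer), meta, EncoderContext.builder().build());
byte[] bytes = buffer.toByteArray();

// decode the byte[] back into a document, types intact
Document decoded = new DocumentCodec().decode(
        new BsonBinaryReader(ByteBuffer.wrap(bytes)), DecoderContext.builder().build());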

List Length is Wrong for Arrays of Objects

With the data structure:

{
  "id": "123",
  "authors": [
    {
      "firstName": "Bob",
      "lastName": "Jones",
      "degrees": ["PHD", "MD"]
    },
    {
      "firstName": "Tom",
      "lastName": "Smith",
      "degrees": ["PHD"]
    }
  ]
}

searching |||authors|||:2 will find that record but |||authors.degrees|||:3 will not. Fix the computation to correctly count the number of items in these cases.
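
A minimal sketch of the intended counting, where the length for a nested path sums the array sizes across all parent objects (the helper is illustrative, not Zulia code):

import java.util.List;
import java.util.Map;

// listLength(doc, "authors") == 2 and listLength(doc, "authors.degrees") == 3
static int listLength(Object node, String path) {
    if (node == null) {
        return 0;
    }
    if (node instanceof List<?> list) {
        if (path.isEmpty()) {
            return list.size();
        }
        int total = 0; // sum across every element of an array of objects
        for (Object item : list) {
            total += listLength(item, path);
        }
        return total;
    }
    if (path.isEmpty()) {
        return 1; // a lone scalar counts as a single item
    }
    int dot = path.indexOf('.');
    String head = dot < 0 ? path : path.substring(0, dot);
    String rest = dot < 0 ? "" : path.substring(dot + 1);
    return node instanceof Map<?, ?> map ? listLength(map.get(head), rest) : 0;
}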

Finish adding segment replication functionality

Zulia already contains indexing and query routing and some configuration for replicas; however, there is no code for copying data to the replicas or wiring up the complete handling on replicas. There is also no code for new primary election.

Add distinct value facet

Instead of counting occurrences of each facet, just return the existence of a facet for a search. Storage can be a BitSet instead of the much more memory- and compute-heavy HashIntIntMap. Values would be returned in A-Z order up to a count limit.
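
A minimal sketch of existence-only collection, assuming facet values map to taxonomy ordinals (names are illustrative):

import java.util.BitSet;

BitSet seenOrdinals = new BitSet(); // one bit per facet ordinal, no counts

// inside the collector, for each facet ordinal of a matching document:
// seenOrdinals.set(ordinal);

// after collection, take up to maxValues distinct ordinals; a taxonomy
// lookup (not shown) would translate them back to labels for A-Z ordering
int maxValues = 100;
int[] firstOrdinals = seenOrdinals.stream().limit(maxValues).toArray();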

Update to Lucene 9.6

Adds support for the Java 19/20 optimized foreign memory API (instead of just Java 19)
Adds string optimizations when doc values exist (for example in KeywordField.newSetQuery)

Compress documents on disk

Since Zulia 3.0, Zulia has been storing the BSON document in binary doc values fields instead of stored fields. Inside Lucene, stored fields are compressed but binary doc values are not compressed in newer versions of Lucene.

Goals:

  • Turn on document level compression by default but allow it to be turned off at the index level
  • Use a high speed compression algorithm
  • Existing indexes must not break, but new documents (or reindexed documents) will be stored compressed unless the user disables document-level compression in the index config

Implementation:

  • Add a single-byte flag (bool) to the id meta info (IdInfo) of each document that indicates whether the document is compressed. The protobuf field will default to false to preserve backwards compatibility with existing documents
  • Add a boolean flag to IndexSettings called disableCompression that turns off compression at the index level. The protobuf field will default to false, enabling compression for new or reindexed documents by default
  • Use Snappy compression for high-throughput compression (a sketch follows below)
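
A minimal sketch of both paths with the org.xerial.snappy library (assuming that dependency; the flag handling mirrors the IdInfo byte described above):

import java.io.IOException;

import org.xerial.snappy.Snappy;

// write path: compress unless the index config disables it; the caller
// stores the resulting compressed flag with the document's IdInfo
static byte[] maybeCompress(byte[] bsonBytes, boolean disableCompression) throws IOException {
    return disableCompression ? bsonBytes : Snappy.compress(bsonBytes);
}

// read path: the flag defaults to false in protobuf, so documents written
// before this change are read back unchanged, preserving compatibility
static byte[] readDocument(byte[] stored, boolean compressedFlag) throws IOException {
    return compressedFlag ? Snappy.uncompress(stored) : stored;
}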

Create Index Aliases

Index aliases are pointers to other indexes that can be switched quickly to point to another index.

Add commands:

  • Create/Update Alias (aliasName, indexToPointTo)
  • Remove Alias (aliasName)
  • List Aliases
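
A minimal sketch of the core data structure, assuming aliases resolve to a concrete index name at request time (illustrative, not Zulia code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// alias -> index name; put() atomically repoints an alias in one step
ConcurrentHashMap<String, String> aliases = new ConcurrentHashMap<>();

aliases.put("articles", "articles_v1");             // create
aliases.put("articles", "articles_v2");             // update: instant switch
aliases.remove("articles");                         // remove
Map<String, String> snapshot = Map.copyOf(aliases); // list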
