rbmhtechnology / vind

Vind is built to enable the integration of search facilities in Java projects without getting too deep into the search topic.

Home Page: https://rbmhtechnology.github.io/vind/

License: Apache License 2.0

Java 99.93% Shell 0.07%

information-discovery java library search solr vind

vind's Issues

enable timeout configuration

Add the possibility to set up the connection and read timeouts via configuration file properties.
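
A minimal sketch of what reading such timeouts from a properties file could look like. The property keys, the default value and the `TimeoutConfig` class are assumptions made for illustration; they are not taken from Vind's configuration.

```java
import java.util.Properties;

// Hypothetical holder for the two timeouts. Keys and default are illustrative only.
final class TimeoutConfig {
    static final String CONNECTION_KEY = "example.connection.timeout"; // assumed key
    static final String READ_KEY = "example.read.timeout";             // assumed key
    static final int DEFAULT_MILLIS = 10_000;                           // assumed default

    final int connectionTimeoutMillis;
    final int readTimeoutMillis;

    private TimeoutConfig(int connectionTimeoutMillis, int readTimeoutMillis) {
        this.connectionTimeoutMillis = connectionTimeoutMillis;
        this.readTimeoutMillis = readTimeoutMillis;
    }

    static TimeoutConfig fromProperties(Properties props) {
        return new TimeoutConfig(read(props, CONNECTION_KEY), read(props, READ_KEY));
    }

    // Falls back to the default when the key is absent or blank.
    private static int read(Properties props, String key) {
        String raw = props.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MILLIS;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(key + " must not be negative: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(CONNECTION_KEY, "2500");
        TimeoutConfig config = fromProperties(props);
        System.out.println(config.connectionTimeoutMillis + " / " + config.readTimeoutMillis);
    }
}
```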

Enable analysis view in solr admin

The analysis view is missing a request handler. This should be added in the next release. Otherwise field debugging is quite hard.

Add documentation for termsQueryFilter

A new termsQueryFilter has been added in #45 , but documentation is still missing and should be added.

Improve Filter&Facet report

Design a reporting model for filters and facets far from the current java pojo representation and closer to a user friendly format.

Fix build error due to javadoc plugin

Currently the build fails (at least on mac). An update of the javadoc plugin version fixes this issue, but now javadoc itself fails.

Homogenize monitoring field types

Some of the monitoring fields take a type that depends on the original Vind component (e.g. an interval facet on a numeric field has start and end typed as long/float, while a date interval gives back dates). This creates issues when writing the JSON to Elasticsearch, and probably to other non-structured DBs.

To solve this, identify the affected fields and translate them to a common type (e.g. dates to timestamps).
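
One way to picture the proposed translation is a small normalization step applied before serialization. This is only a sketch: the `MonitoringValues` class is hypothetical, and it assumes that epoch milliseconds are an acceptable common representation for dates.

```java
import java.time.ZonedDateTime;
import java.util.Date;

// Hypothetical helper: maps date-like values to epoch milliseconds
// and returns every other value unchanged.
final class MonitoringValues {
    static Object normalize(Object value) {
        if (value instanceof ZonedDateTime) {
            return ((ZonedDateTime) value).toInstant().toEpochMilli();
        }
        if (value instanceof Date) {
            return ((Date) value).getTime();
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(normalize(ZonedDateTime.now()));
        System.out.println(normalize(42.5f));
    }
}
```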

Provide Docker-Image for Solr Backend

In order to simplify testing vind integration with a "real" backend, it would be convenient to provide a ready-to-use docker image containing the vind-schema and -extensions.

Possibility to index document into two solr servers of different version

In order to enable migration strategies from one Solr version to another, it would be helpful if Vind supported indexing into two Solr servers of different versions at the same time. In such a case, an application could build up the index in the new Solr server in parallel to an already existing one. As soon as both Solr servers contain the same number of documents, the application could switch to the new Solr server for querying.

Set session per query

The reporting server logs, among other information, details about the session and the user.
Currently the session is set when instantiating the report server, but it should be possible to log different sessions with the same reporting server instance.

multiple values in suggestion single value field

Unexpected behavior with the Vind 1.2.3 Solr schema: when a document is indexed with a different suggestion value and the field dynamic_suggest.string_fieldname still holds an old value from a previous version of the index, the new suggestion field dynamic_suggest_analyzed_fieldname gets two values even though it is defined as a single-value field.
This is due to the definition of the copy rule from dynamic_suggest.string_fieldname to dynamic_suggest_analyzed_fieldname.

Create analyzer: report service

Provide the means to generate reports out of the vind logs.

Children search with AND filter searches in all child documents instead of one document

When performing a children search with an AND filter, the resulting query matches the filters across all children instead of within a single child document:

    final FulltextSearch atomSearch = Search.fulltext()
        .filter(AndFilter.fromSet(myFilters));

    final FulltextSearch search = Search.fulltext()
        .filter(parentFilter(xyz))
        .andChildrenSearch(atomSearch, indexer.getAtomDocumentFactory());

results in the following query:

(_type_:asset AND dynamic_multi_filter_string_parent:"xyz") AND 
(
  (
    {!parent which='_type_:asset' v='_type_:atom AND dynamic_multi_filter_string_field_1:"VALUE1"'} AND 
    {!parent which='_type_:asset' v='_type_:atom AND dynamic_multi_filter_string_field_2:"VALUE2"'}
  )
)

instead of:

(_type_:asset AND dynamic_multi_filter_string_parent:"xyz") AND 
(
  (
    {!parent which='_type_:asset' v='_type_:atom AND 
      dynamic_multi_filter_string_field_1:"VALUE1" AND
      dynamic_multi_filter_string_field_2:"VALUE2"'}
  )
)
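
The expected query shape above can be sketched as a plain string builder that puts all child clauses into a single block-join `{!parent}` clause. The `BlockJoinQueries` class is hypothetical and not part of Vind; it performs no escaping, so clauses containing single quotes would need extra handling.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: combines all child clauses into ONE {!parent} clause,
// so that a single child document has to satisfy every clause.
final class BlockJoinQueries {
    static String parentQuery(String parentType, String childType, List<String> childClauses) {
        StringBuilder child = new StringBuilder("_type_:").append(childType);
        for (String clause : childClauses) {
            child.append(" AND ").append(clause);
        }
        return "{!parent which='_type_:" + parentType + "' v='" + child + "'}";
    }

    public static void main(String[] args) {
        System.out.println(parentQuery("asset", "atom", Arrays.asList(
                "dynamic_multi_filter_string_field_1:\"VALUE1\"",
                "dynamic_multi_filter_string_field_2:\"VALUE2\"")));
    }
}
```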

Support Term Query Parser for huge ID searches

There are some use cases where we want to search for a large set of document IDs, but there is no other search filter that identifies this specific group of documents. Hence we need to search via the IDs only, to offer the user further possibilities to sort, page and apply additional filters.

The current problem is that this group can contain up to 5000 document IDs. In the future, this may even be extended to 30-50k.

As the standard query parser only supports up to 1024 boolean clauses, please offer the possibility to use the terms query parser instead.
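
For illustration, Solr's terms query parser takes a field name and a comma-separated value list, which avoids one boolean clause per ID. The sketch below only builds that query string; the `TermsQueries` class is hypothetical, and IDs that themselves contain commas would need a different separator.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper: renders a Solr terms query such as {!terms f=id}a,b,c
final class TermsQueries {
    static String of(String field, Collection<String> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("At least one value is required");
        }
        return "{!terms f=" + field + "}" + String.join(",", values);
    }

    public static void main(String[] args) {
        System.out.println(of("id", Arrays.asList("doc-1", "doc-2", "doc-3")));
    }
}
```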

NOT filter in Solr needs a positive base operator

In Solr filter syntax a NOT operator is not valid as a stand-alone expression, as it is evaluated as a subtraction:
'NOT status:active' is parsed as '-status:active'

For simple operations like the one mentioned above Solr is able to interpret it, but more complex ones such as 'NOT status:active AND (NOT due_date:[* TO NOW])' will not give the expected results.
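
A common workaround is to give every purely negative clause an explicit match-all base (`*:*`) to subtract from. The following sketch shows the idea on plain strings, applied per clause; the `NegativeFilters` class is hypothetical and does not reflect how Vind serializes filters.

```java
// Hypothetical helper: wraps a purely negative clause so that it subtracts
// from the match-all query instead of from nothing.
final class NegativeFilters {
    static String withPositiveBase(String clause) {
        String trimmed = clause.trim();
        boolean negative = trimmed.startsWith("NOT ") || trimmed.startsWith("-");
        return negative ? "(*:* AND " + trimmed + ")" : trimmed;
    }

    public static void main(String[] args) {
        // Each negative sub-clause gets its own positive base before combining.
        String combined = withPositiveBase("NOT status:active")
                + " AND " + withPositiveBase("NOT due_date:[* TO NOW]");
        System.out.println(combined);
    }
}
```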

Scoped facets

Add the possibility to define for which field value use case (Filter, Suggest or Facet) the facet is computed.

make MonitoringServer configurable: exception resilient

Request from an integration:

Can we make the MonitoringSearchServer configurable so that it only logs the monitoring exceptions and performs the search nevertheless? In my opinion the tracking is not important enough to let the search fail if there is a problem only with tracking.
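
The requested behaviour could look roughly like the wrapper below: the search runs first, and a failure in the monitoring step is either rethrown or only logged, depending on a flag. `ResilientMonitoring` and its parameters are hypothetical names for this sketch, not the MonitoringSearchServer API.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical wrapper: the search result is returned even when monitoring fails,
// unless failOnMonitoringError is set.
final class ResilientMonitoring {
    static <T> T execute(Supplier<T> search, Consumer<T> monitor, boolean failOnMonitoringError) {
        T result = search.get();
        try {
            monitor.accept(result);
        } catch (RuntimeException e) {
            if (failOnMonitoringError) {
                throw e;
            }
            System.err.println("Monitoring failed, returning search result anyway: " + e.getMessage());
        }
        return result;
    }

    public static void main(String[] args) {
        String result = ResilientMonitoring.<String>execute(
                () -> "3 hits",
                r -> { throw new IllegalStateException("monitoring backend unreachable"); },
                false);
        System.out.println(result);
    }
}
```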

DateMathExpression monitoring issue

Jackson is not able to serialize the DateMathExpression object.

Passing several children searches

In our case, we search for documents whose child documents match different sets of filter criteria.

Currently only one childrenSearch can be defined:

// first set of filter criteria
final FulltextSearch childSearch1 = Search.fulltext()
   .filter(AndFilter.fromSet(firstSetOfChildCriteria));

final FulltextSearch search = Search.fulltext()
   .filter(parentFilter(xyz))
   .andChildrenSearch(childSearch1, indexer.getAtomDocumentFactory());

which results in

(_type_:asset AND dynamic_multi_filter_string_parent:"xyz") AND 
(
    {!parent which='_type_:asset' v='_type_:atom AND 
      dynamic_multi_filter_string_field_1:"VALUE1" AND
      dynamic_multi_filter_string_field_2:"VALUE2"'}
)

But we need to search for parents which

have children matching our first set of criteria and
have children matching our second set of criteria and so on

Basically we want to result in something like this

(_type_:asset AND dynamic_multi_filter_string_parent:"xyz") AND 
(
    {!parent which='_type_:asset' v='_type_:atom AND 
      dynamic_multi_filter_string_field_1:"VALUE1" AND
      dynamic_multi_filter_string_field_2:"VALUE2"'}
)
 AND 
(
    {!parent which='_type_:asset' v='_type_:atom AND 
      dynamic_multi_filter_string_field_1:"ANOTHER_VALUE1" AND
      dynamic_multi_filter_string_field_2:"ANOTHER_VALUE2"'}
)

Something like this could be imagined

// first set of filter criteria
final FulltextSearch childSearch1 = Search.fulltext()
   .filter(AndFilter.fromSet(firstSetOfChildCriteria));

// second set of filter criteria
final FulltextSearch childSearch2 = Search.fulltext()
   .filter(AndFilter.fromSet(secondSetOfChildCriteria));

final FulltextSearch search = Search.fulltext()
   .filter(parentFilter(xyz))
   .andChildrenSearches(indexer.getAtomDocumentFactory(), childSearch1, childSearch2);

Add health check functionality

At the moment vind does not provide functionality for health checks (e.g. ping) so the clients have to use some custom implementations (for example expose a solr client and use the Spring Boot actuator SolrHealthIndicator). It would be nice if vind could offer some functionality to support these health checks.

Missing final fulltext queries in preprocessing

When preprocessing a set of session monitoring entries, if the last query is valid, it is missing the final = true flag.

Prepare OSS repository hosting

Enable global meta data for batch commit identification

Current State

Vind's SearchServer (https://javadoc.io/page/com.rbmhtechnology.vind/vind/latest/com/rbmhtechnology/vind/api/SearchServer.html) provides several methods to index documents:

void index(Document... doc)
void index(List<Document> doc)
void indexBean(List<Object> t)
void indexBean(Object... t)

Internally, all of these methods trigger an indexing process but not a commit (which is intended behavior, as the server itself can handle commits internally much more efficiently). Note that there are methods for commit, which guarantee that all indexing processes are committed (with all the negative consequences regarding performance).

Problem

In applications that support Read-Your-Writes this behaviour might be a problem (because the application has to guarantee an always-up-to-date index status and thus is forced to use many hard commits).

Idea

Vind could support version numbering for indexing processes, so that an application could check which is the latest version that has been indexed (and thus is able to verify via an additional method whether the necessary index operations have already been processed). This could be an internal counter or a counter maintained by the application, which could lead to the following API:

long index(List<Document> doc)
void index(List<Document> doc, long version)

Note that the other methods would work analogously. To get the latest index version there could be methods like:

long getLatestVersion()
boolean isVersionIndexed(long version)

In addition, each Document could have an additional field version.
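
The idea can be sketched with two counters: one that hands out versions when an indexing call is accepted, and one that records the highest version known to be visible. Everything below is an assumption for illustration: the `VersionedIndexer` class, the `markIndexed` callback and the use of strings as stand-ins for documents are not part of Vind, and the sketch assumes that versions become visible in order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed version counter.
final class VersionedIndexer {
    private final AtomicLong lastAssigned = new AtomicLong();
    private final AtomicLong latestIndexed = new AtomicLong();

    // Assigns the next version to this indexing call; handing the
    // documents to the backend is omitted here.
    long index(List<String> docs) {
        return lastAssigned.incrementAndGet();
    }

    // Assumed callback, invoked once the backend has made a version searchable.
    void markIndexed(long version) {
        latestIndexed.accumulateAndGet(version, Math::max);
    }

    long getLatestVersion() {
        return latestIndexed.get();
    }

    boolean isVersionIndexed(long version) {
        return version <= latestIndexed.get();
    }

    public static void main(String[] args) {
        VersionedIndexer indexer = new VersionedIndexer();
        long version = indexer.index(Arrays.asList("doc-1", "doc-2"));
        System.out.println(indexer.isVersionIndexed(version));
        indexer.markIndexed(version);
        System.out.println(indexer.isVersionIndexed(version));
    }
}
```

An application that needs Read-Your-Writes could then poll `isVersionIndexed` for the version it received instead of forcing a hard commit after every write.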

Add configuration support via environment variables

Currently the vind configuration is mostly done via a properties file. To ensure the cloud-readiness of the library, configuration via environment variables is needed.

Example:

VIND_SERVER_SOLR_CLOUD=true
VIND_HOST=...
...
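
A possible mapping from such variables to property keys is sketched below. The naming rule (strip the `VIND_` prefix, lower-case, replace underscores with dots) is an assumption derived from the example above, and the `EnvConfig` class is hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: VIND_SERVER_SOLR_CLOUD=true becomes server.solr.cloud=true.
final class EnvConfig {
    static final String PREFIX = "VIND_";

    static Properties fromEnvironment(Map<String, String> env) {
        Properties props = new Properties();
        for (Map.Entry<String, String> entry : env.entrySet()) {
            String name = entry.getKey();
            if (name.startsWith(PREFIX)) {
                String key = name.substring(PREFIX.length())
                        .toLowerCase(Locale.ROOT)
                        .replace('_', '.');
                props.setProperty(key, entry.getValue());
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Only variables starting with VIND_ are picked up.
        System.out.println(fromEnvironment(System.getenv()));
    }
}
```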

check Collection manager 404 / success update

While running a collection update from a private repository the collection manager tool logs a 404 when updating but still displays the successful update message (and successfully updates the collection).

TermFacet ignores facet limit property

When a facet limit is set for a search, the TermFacet JSON implementation ignores it completely because the 'limit' parameter is missing from the generated JSON.
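
For reference, a terms facet in Solr's JSON Facet API carries the limit next to the type and field. The sketch below renders such a fragment; the `TermFacetJson` class is hypothetical, does no JSON escaping, and is not Vind's serializer.

```java
// Hypothetical helper: renders a JSON terms facet that includes the limit.
final class TermFacetJson {
    static String of(String name, String field, int limit) {
        return String.format(
                "{\"%s\":{\"type\":\"terms\",\"field\":\"%s\",\"limit\":%d}}",
                name, field, limit);
    }

    public static void main(String[] args) {
        System.out.println(of("categories", "category", 10));
    }
}
```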

Long queries are not supported by current implementation

At the moment the GET method is used for Solr requests, which does not allow long queries to be performed.

Add report post processing

Include in the report server the post processing required to obtain meaningful information out of the basic data.

Update suggestionhandler result to NamedList

Currently the suggestion handler gives back a Map object instead of a NamedList, as Solr usually does. This is inherited from previous versions of the suggestion handler, but it should be changed to NamedList, as it is more efficient and is the result type expected from a Solr handler. Some Vind modifications are needed to support this return type.

Report creation fails with SocketTimeout

The report creation fails with SocketTimeouts.

build	10-Jul-2018 12:41:58	12:41:58.348 [main] WARN  c.r.v.m.utils.ElasticSearchClient - contenthub.global.prod - Try 0 - Error in query scroll request query: Read timed out
build	10-Jul-2018 12:41:58	java.net.SocketTimeoutException: Read timed out
build	10-Jul-2018 12:41:58		at java.net.SocketInputStream.socketRead0(Native Method)
build	10-Jul-2018 12:41:58		at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
build	10-Jul-2018 12:41:58		at java.net.SocketInputStream.read(SocketInputStream.java:171)
build	10-Jul-2018 12:41:58		at java.net.SocketInputStream.read(SocketInputStream.java:141)
build	10-Jul-2018 12:41:58		at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
build	10-Jul-2018 12:41:58		at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
build	10-Jul-2018 12:41:58		at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)

This may be due to not closing the ES scroll queries while setting a large timeout of 30 minutes. As mentioned here https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-request-scroll.html the scroll should be explicitly cleared.

Suggestions not working after upgrading from 1.2.0 to 1.2.1

We did an upgrade from vind 1.2.0 to vind 1.2.1 and updated all our collections to the new config version. Unfortunately the suggestions do not work anymore after the upgrade.

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://....: java.lang.IllegalStateException: Type mismatch: dynamic_multi_stored_suggest_string_company was indexed as SORTED_SET
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:577)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:372)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:325)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1121)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:891)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:827)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
        at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:974)
        at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:990)
        at com.rbmhtechnology.vind.solr.backend.SolrSearchServer.execute(SolrSearchServer.java:817)
        at com.rbmhtechnology.vind.solr.backend.SolrSearchServer.execute(SolrSearchServer.java:808)
        at com.rbmhtechnology.vind.monitoring.MonitoringSearchServer.execute(MonitoringSearchServer.java:349)
        at com.rbmhtechnology.vind.monitoring.MonitoringSearchServer.execute(MonitoringSearchServer.java:342)

Re-indexing the data did not solve the problem. Removing all the documents and indexing them again seems to solve it. However, due to the amount of data, that is not an option for us.

Please provide a way we can do the upgrade without deleting all the data from the index.

Atomic update takes too long

In a specific use case the atomic update takes 2 seconds to update a document.

Find out the reason.
Find a possible fix.

vind's dependency stack includes elasticsearch client

The com.rbmhtechnology.vind:monitoring-api module of vind depends on the elasticsearch client.

+--- com.rbmhtechnology.vind:log-writer:1.2.1
|    \--- com.rbmhtechnology.vind:monitoring-api:1.2.1
|         +--- com.rbmhtechnology.vind:vind-api:1.2.1 (*)
|         +--- com.fasterxml.jackson.core:jackson-databind:2.7.5 -> 2.8.3 (*)
|         +--- com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.7.5 -> 2.8.3
|         |    +--- com.fasterxml.jackson.core:jackson-core:2.8.3
|         |    +--- com.fasterxml.jackson.core:jackson-databind:2.8.3 (*)
|         |    \--- com.fasterxml.jackson.core:jackson-annotations:2.8.0 -> 2.8.3
|         +--- io.redlink.utils:utils:1.1.0
|         |    +--- org.slf4j:slf4j-api:1.7.25 -> 1.7.12
|         |    \--- org.apache.commons:commons-lang3:3.5
|         \--- io.searchbox:jest:5.3.3 -> 2.0.3

If vind is used inside a Spring Boot app (at least in 1.x, 2.x needs to be confirmed), this triggers the Elasticsearch health endpoint to be configured.

Is this dependency necessary, or do we need to configure that somehow?

Wrong filters in 1.2.3

With vind 1.2.1 the following search

{"q":"*","filter":"((static_status='passive') OR (static_status='active')) AND ((static_partitionID='MV-1HP6U6PQS1W11') OR (static_partitionID='MV-1HP6TNXVH1W11') OR (static_partitionID='MV-1HP6UG2V51W11'))","timeZone":"null","sort":[{'direction':'Desc','field':'static_recordLastUpdateTimestamp'}],"result":{"sliceSize":21,"offset":0},"nestedDocSearchFlag":false,"nestedDocOp":"OR","nestedDocFactory":null,"nestedDocSearch":null,"facetFlag":false,"facetMinCount":1,"facetLimit":10,"facet":{},"geoDistance":null,"searchContext":"null","strictFlag":true}

resulted in those filters (only status and partitionID are relevant here)

&fq=((dynamic_multi_stored_filter_string_static_status:"active"+OR+dynamic_multi_stored_filter_string_static_status:"passive")+AND+(dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6UG2V51W11"+OR+dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6TNXVH1W11"+OR+dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6U6PQS1W11"))

which is the expected fq.

When using vind 1.2.3 the same code produces this search

{"q":"*","filter":"((static_status='active') OR (static_status='passive')) AND ((static_partitionID='MV-1HP6U6PQS1W11') OR (static_partitionID='MV-1HP6UG2V51W11') OR (static_partitionID='MV-1HP6TNXVH1W11'))","timeZone":"null","sort":[{'direction':'Desc','field':'static_recordLastUpdateTimestamp'}],"result":{"sliceSize":21,"offset":0},"nestedDocSearchFlag":false,"nestedDocOp":"OR","nestedDocFactory":null,"nestedDocSearch":[],"facetFlag":false,"facetMinCount":1,"facetLimit":10,"facet":{},"geoDistance":null,"searchContext":"null","strictFlag":true}

The filters in the search are the same as before (except the order). However, the generated fq for solr is broken since it now generates this:

&fq=dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6UG2V51W11"+OR+dynamic_multi_stored_filter_string_static_status:"passive"+OR+dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6U6PQS1W11"+OR+dynamic_multi_stored_filter_string_static_status:"active"+OR+dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6TNXVH1W11"

CollectionManagementService does not close CloudSolrClient

The com.rbmhtechnology.vind.solr.cmt.CollectionManagementService creates a CloudSolrClient during construction but never closes the client and also does not provide the means to close the client from the outside.
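
One conventional fix is to let the service implement `AutoCloseable` and delegate to the client it owns, so callers can use try-with-resources. The sketch below uses a generic `Closeable` as a stand-in for the Solr client; `ManagedService` is a hypothetical name, not the actual CollectionManagementService.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical service that owns a client and closes it together with itself.
final class ManagedService implements AutoCloseable {
    private final Closeable client;

    ManagedService(Closeable client) {
        this.client = client;
    }

    @Override
    public void close() {
        try {
            client.close();
        } catch (IOException e) {
            // Wrapped so that callers are not forced to handle a checked exception.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try (ManagedService service = new ManagedService(() -> System.out.println("client closed"))) {
            System.out.println("working with the service");
        }
    }
}
```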

Provide before and after filter for java.util.Date

Please add methods similar to

com.rbmhtechnology.searchlib.api.query.filter.Filter.after(String, ZonedDateTime)
com.rbmhtechnology.searchlib.api.query.filter.Filter.before(String, ZonedDateTime)
to be used with java.util.Date.

It would be easy to convert the Date to a ZonedDateTime outside the search lib. However, since the lib provides a field descriptor to handle java.util.Date (com.rbmhtechnology.searchlib.model.SingleValueFieldDescriptor.UtilDateFieldDescriptor), this conversion should be done by the search lib in that case, to be sure it is done in a consistent way.
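
The conversion itself is short; the open design question is which time zone to attach. The sketch below assumes UTC, which is a choice made for this example and not confirmed behavior of the library. `DateFilters` is a hypothetical helper name.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Date;

// Hypothetical helper: converts a java.util.Date to a ZonedDateTime in UTC (assumed zone).
final class DateFilters {
    static ZonedDateTime toZonedDateTime(Date date) {
        return ZonedDateTime.ofInstant(date.toInstant(), ZoneId.of("UTC"));
    }

    public static void main(String[] args) {
        System.out.println(toZonedDateTime(new Date()));
    }
}
```

A Date-based `before`/`after` overload could then delegate to the existing ZonedDateTime variants after this conversion.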

Tests do not work because of an "outdated" dependency on Suggestion Handler

In vind/backend/solr/src/main/resources/solrhome/core/conf/solrconfig.xml there is still a dependency on the "outdated" suggestion handler. Therefore the tests do not work. This could be fixed by adding the suggestion handler (utils/suggestion-handler) to the classpath.

Add new action reporting to report server

The report server is not logging all the possible actions. Implement reporting for:

update
delete
index
get

10 Error logs per monitoring server search

The error message "Cannot get scope for non existing field descriptor" logged by the class com.rbmhtechnology.vind.api.query.filter.Filter appears many times per search in the logs, even if the search is working.

Page behaviour is inconsistent

I was replacing Slice with Page in a certain use case and ended up asking myself whether a page is 0-based or 1-based.

I guess it is 1-based, because, there is this FulltextSearch constructor

FulltextSearch() {
     this.searchString = "*";
     this.resultSet = new Page(1, SearchConfiguration.get(SearchConfiguration.SEARCH_RESULT_PAGESIZE,10));
}

But if I look into the constructor of Page itself, I see that

    public Page(int page, int pagesize) {
        if(page < 0) {
            log.error("Page number can not be lower than 0: {}",page);
            throw new IllegalArgumentException("Page should not be a negative value:" + page);
        }
        this.page = page;
        this.pagesize = pagesize;
        type = DivisionType.page;
    }

which states that a page with value 0 is fine, as long as it is not negative.

But if I fire up a search with

    final FulltextSearch search = Search.fulltext().page(0,10);

I end up with a solr exception

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/my_collection: 'start' parameter cannot be negative

I guess, the Page constructor should be changed to be consistent here.
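
The inconsistency is consistent with an offset computed as `(page - 1) * pagesize`, which turns negative for page 0. That formula is an assumption inferred from the Solr error above, not taken from the Vind source. A stricter validation could look like this hypothetical sketch:

```java
// Hypothetical helper: treats pages as 1-based and rejects anything lower,
// so the computed Solr 'start' value can never be negative.
final class PageOffsets {
    static int offset(int page, int pagesize) {
        if (page < 1) {
            throw new IllegalArgumentException("Page numbers are 1-based: " + page);
        }
        if (pagesize < 1) {
            throw new IllegalArgumentException("Page size must be positive: " + pagesize);
        }
        return (page - 1) * pagesize;
    }

    public static void main(String[] args) {
        System.out.println(offset(1, 10));
        System.out.println(offset(3, 10));
    }
}
```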

Provide term/tuple autocompletion in suggestion handler

Currently the suggestion handler supports only suggestions for complete field values. Therefore the integration of an autocompletion field (based on fulltext fields) would be a benefit.

Versioning and release policy

Describe in Vind documentation the new versioning and release policy.

Solr parse error when creating filters shared fields on nested and parent docs

Externally reported issue which happens on a nested search or suggestion filtered by a field that is a member of both the parent and the nested document:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://xxx.xxx.xxx.xxx:8983/solr/collection: org.apache.solr.search.SyntaxError: Expected identifier at pos 52 str='{!child of="_type_:asset" v='({!parent which='_type_:asset' v='_type_:marker AND dynamic_multi_stored_facet_string_static_entityType:"asset"'}'
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:577)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:372)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:325)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1121)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:891)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:827)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
        at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:974)
        at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:990)
        at com.rbmhtechnology.vind.solr.backend.SolrSearchServer.execute(SolrSearchServer.java:817)
        at com.rbmhtechnology.vind.monitoring.MonitoringSearchServer.execute(MonitoringSearchServer.java:396)
        at com.redbullmediabase.mediamanager.core.index.AbstractVindSearchEngine.performSuggestionSearchAndGetSuggestionsFromResponse(AbstractVindSearchEngine.java:745)
        at com.redbullmediabase.mediamanager.core.index.AbstractVindSearchEngine.getSuggestions(AbstractVindSearchEngine.java:967)
        at com.redbullmediabase.mediamanager.core.mam.index.AssetVindSearchEngine.getSuggestions(AssetVindSearchEngine.java:121)
        at com.redbullmediabase.mediamanager.manager.module.assets.controller.AssetsModuleController.suggest(AssetsModuleController.java:1356)
        at sun.reflect.GeneratedMethodAccessor2063.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)

Support hierarchical paths as field values

Currently hierarchical paths (e.g. taxonomy fields) are not considered. They have to be supported by a field descriptor and fit properly into the suggestion infrastructure.

Precedence of configuration settings using environment variables

According to the documentation of vind 1.2:

The properties are overwritten following the ordering: Default Properties < Environment Variables < Property File

This behaviour is unlike e.g. Spring and typesafe config which do “Default Properties < Property File < Environment Variables”.

This means we have to provide all settings in every environment as environment variables as we cannot simply provide a property file for development which can be overwritten using environment variables. For productive deployment environment variables are easy to handle but locally one might want to provide defaults using a file instead of manually having to configure IDE env vars.
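
The ordering asked for here (defaults, then property file, then environment variables) amounts to merging three layers where later layers win. The sketch below illustrates that ordering only; `LayeredConfig` is a hypothetical class and the keys are placeholders.

```java
import java.util.Properties;

// Hypothetical helper: later layers overwrite earlier ones, giving
// Default Properties < Property File < Environment Variables.
final class LayeredConfig {
    static Properties merge(Properties defaults, Properties file, Properties env) {
        Properties merged = new Properties();
        merged.putAll(defaults);
        merged.putAll(file);
        merged.putAll(env);
        return merged;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("example.host", "localhost");
        Properties file = new Properties();
        file.setProperty("example.host", "dev-solr");
        Properties env = new Properties();
        System.out.println(merge(defaults, file, env).getProperty("example.host"));
    }
}
```

With this ordering a development property file supplies local defaults, and a deployment overrides individual keys through the environment.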

Suggestion: override of default operator

The default logical operator in the suggestion handler is hard-coded to "AND". This should be fixed by providing the option of setting "OR" instead, if desired.