rbmhtechnology / vind

Vind is built to enable the integration of search facilities in Java projects without getting too deep into the search topic.

Home Page: https://rbmhtechnology.github.io/vind/

License: Apache License 2.0

Java 99.93% Shell 0.07%
information-discovery java library search solr vind

vind's People

Contributors

alfonso-noriega, goerge, ja-fra, luaks, pilzm, purthaler, stefan-sachs, tkurz, wernerharing

vind's Issues

Improve Filter&Facet report

Design a reporting model for filters and facets that moves away from the current Java POJO representation towards a more user-friendly format.

Homogenize monitoring field types

The type of some monitoring fields depends on the originating Vind component (e.g. an interval facet on a numeric field has start and end typed as long/float, while a date interval returns dates). This causes problems when writing the JSON to Elasticsearch, and probably to other non-structured databases.

To solve this, identify the affected fields and translate them to a common type (e.g. dates to timestamps).
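
One way to read this proposal is a small normalisation step applied before a monitoring entry is serialized. The following is a minimal sketch under that assumption, not Vind's implementation; the class `MonitoringValueNormalizer` and its method are hypothetical names, and epoch milliseconds are just one possible common type.

```java
import java.time.ZonedDateTime;
import java.util.Date;

// Hypothetical helper, not part of Vind: maps date-like values to epoch
// milliseconds so that interval bounds always serialize as numbers.
final class MonitoringValueNormalizer {

    static Object normalize(Object value) {
        if (value instanceof ZonedDateTime) {
            return ((ZonedDateTime) value).toInstant().toEpochMilli();
        }
        if (value instanceof Date) {
            return ((Date) value).getTime();
        }
        // Numbers, strings and anything else pass through unchanged.
        return value;
    }

    public static void main(String[] args) {
        System.out.println(normalize(ZonedDateTime.parse("2018-07-10T12:41:58Z")));
        System.out.println(normalize(42L));
    }
}
```

With this in place, the start and end of a date interval and of a numeric interval would both reach the JSON writer as numbers.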

Provide Docker-Image for Solr Backend

In order to simplify testing vind integration with a "real" backend, it would be convenient to provide a ready-to-use docker image containing the vind-schema and -extensions.

Possibility to index document into two solr servers of different version

In order to enable migration strategies from one Solr version to another, it would be helpful if Vind supported indexing into two Solr servers of different versions at the same time. In such a case, an application could build up the index in the new Solr server in parallel to an already existing one. As soon as both Solr servers contain the same number of documents, the application could switch to the new Solr server for querying.
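
The idea can be sketched as a thin composite that forwards every document to both backends and compares document counts before switching. This is an illustration only: `Backend`, `InMemoryBackend` and `DualIndexSketch` are hypothetical types and do not correspond to Vind's SearchServer API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: index into the current and the new backend in parallel.
final class DualIndexSketch {

    interface Backend {
        void index(String doc);
        long count();
    }

    // Stand-in for a real Solr server, so the sketch is runnable.
    static final class InMemoryBackend implements Backend {
        private final List<String> docs = new ArrayList<>();
        public void index(String doc) { docs.add(doc); }
        public long count() { return docs.size(); }
    }

    private final Backend current;
    private final Backend next;

    DualIndexSketch(Backend current, Backend next) {
        this.current = current;
        this.next = next;
    }

    // Every write goes to both servers.
    void index(String doc) {
        current.index(doc);
        next.index(doc);
    }

    // Queries may move to the new server once it has caught up.
    boolean readyToSwitch() {
        return next.count() >= current.count();
    }

    public static void main(String[] args) {
        Backend oldSolr = new InMemoryBackend();
        oldSolr.index("already-there");
        DualIndexSketch dual = new DualIndexSketch(oldSolr, new InMemoryBackend());
        dual.index("doc-1");
        System.out.println(dual.readyToSwitch());
    }
}
```

Comparing counts is only the criterion named in the issue; a real migration would probably also need to backfill the documents that existed before dual indexing started.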

Set session per query

The reporting server logs, among other information, details about the session and the user.
Currently the session is set when instantiating the reporting server, but it should be possible to log different sessions with the same reporting server instance.

multiple values in suggestion single value field

Unexpected behavior with the Vind 1.2.3 Solr schema: when a document is indexed with a different suggestion value while the field dynamic_suggest.string_fieldname still holds an old value from a previous version of the index, the new suggestion field dynamic_suggest_analyzed_fieldname gets two values even though it is defined as a single-value field.
This is due to the definition of the copy rule from dynamic_suggest.string_fieldname to dynamic_suggest_analyzed_fieldname.

Children search with AND filter searches in all child documents instead of one document

When performing a children search with an AND filter the resulting query searches in all children instead of one:

    final FulltextSearch atomSearch = Search.fulltext()
        .filter(AndFilter.fromSet(myFilters));

    final FulltextSearch search = Search.fulltext()
        .filter(parentFilter(xyz))
        .andChildrenSearch(atomSearch, indexer.getAtomDocumentFactory());

results in the following query:

(_type_:asset AND dynamic_multi_filter_string_parent:"xyz") AND 
(
  (
    {!parent which='_type_:asset' v='_type_:atom AND dynamic_multi_filter_string_field_1:"VALUE1"'} AND 
    {!parent which='_type_:asset' v='_type_:atom AND dynamic_multi_filter_string_field_2:"VALUE2"'}
  )
)

instead of:

(_type_:asset AND dynamic_multi_filter_string_parent:"xyz") AND 
(
  (
    {!parent which='_type_:asset' v='_type_:atom AND 
      dynamic_multi_filter_string_field_1:"VALUE1" AND
      dynamic_multi_filter_string_field_2:"VALUE2"'}
  )
)

Support Term Query Parser for huge ID searches

There are some use cases where we want to search for a large set of document IDs, but there is no other search filter that identifies this specific group of documents. Hence we need to search via the IDs only, in order to offer the user further possibilities to sort, page and apply additional filters.

The current problem is that this group can contain up to 5000 document IDs. In the future, this may even be extended to 30-50k.

As the standard query parser only supports up to 1024 boolean clauses, please offer the possibility to use the term query parser instead.
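
For reference, Solr's terms query parser takes the field as a local parameter and a comma-separated value list, which avoids the boolean clause limit. The following is a minimal sketch of building such a filter string; the class name and the field name `id` are assumptions for illustration, and this is not how Vind builds its queries.

```java
import java.util.List;

// Hypothetical helper: builds a filter for Solr's terms query parser,
// i.e. {!terms f=<field>}v1,v2,...
final class TermsQueryBuilder {

    static String termsFilter(String field, List<String> ids) {
        // Values are joined with the parser's default separator, a comma,
        // so IDs that contain commas are not handled by this sketch.
        return "{!terms f=" + field + "}" + String.join(",", ids);
    }

    public static void main(String[] args) {
        System.out.println(termsFilter("id", List.of("doc-1", "doc-2", "doc-3")));
    }
}
```

The resulting string can be passed as a filter query, so sorting, paging and additional filters keep working on top of it.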

NOT filter in Solr needs a positive base operator

In Solr filter syntax a NOT operator is not valid as a standalone expression, because it is evaluated as a subtraction:
'NOT status:active' is parsed as '-status:active'

For simple operations like the one mentioned above Solr is able to interpret it, but more complex ones such as 'NOT status:active AND (NOT due_date:[* TO NOW])' will not give the expected results.
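
A common workaround in Solr is to give a purely negative clause a positive base by combining it with the match-all query `*:*`. The sketch below shows the idea on plain strings; `NegativeFilterFix` is a hypothetical helper, it only looks at the start of the clause it is given, and it would have to be applied to each nested clause separately.

```java
// Hypothetical helper: gives a purely negative clause a positive base.
final class NegativeFilterFix {

    static String withPositiveBase(String clause) {
        String trimmed = clause.trim();
        if (trimmed.startsWith("NOT ")) {
            // *:* matches all documents, so the negation has a set to subtract from.
            return "(*:* AND " + trimmed + ")";
        }
        return trimmed;
    }

    public static void main(String[] args) {
        String inner = withPositiveBase("NOT due_date:[* TO NOW]");
        System.out.println(withPositiveBase("NOT status:active") + " AND " + inner);
    }
}
```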

Scoped facets

Add the possibility to define on which use-case variant of a field value (Filter, Suggest or Facet) the facet is computed.

make MonitoringServer configurable: exception resilient

Request from an integration:

can we make the MonitoringSearchServer configurable so it only logs the monitoring exceptions and performs the search nevertheless? In my opinion the tracking is not important enough to let the search fail if there is a problem only with tracking
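
The requested behaviour could look roughly like the following sketch: the search runs first, and a failure in the tracking step is either rethrown or only logged, depending on a flag. All names here are hypothetical; the real MonitoringSearchServer API is not shown in this issue.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch of an exception-resilient monitoring wrapper.
final class ResilientMonitoring {

    static <R> R searchAndTrack(Supplier<R> search, Consumer<R> tracker, boolean failOnTrackingError) {
        // The search itself is not guarded: its failures must still surface.
        R result = search.get();
        try {
            tracker.accept(result);
        } catch (RuntimeException e) {
            if (failOnTrackingError) {
                throw e;
            }
            // Tracking problems are logged, the search result is still returned.
            System.err.println("Monitoring failed: " + e.getMessage());
        }
        return result;
    }

    public static void main(String[] args) {
        String result = searchAndTrack(
                () -> "3 hits",
                r -> { throw new IllegalStateException("monitoring backend down"); },
                false);
        System.out.println(result);
    }
}
```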

Passing several children searches

In our case, we search for documents that have child documents matching different sets of filter criteria.

Currently only one childrenSearch can be defined:

// first set of filter criteria
final FulltextSearch childSearch1 = Search.fulltext()
   .filter(AndFilter.fromSet(firstSetOfChildCriteria));

final FulltextSearch search = Search.fulltext()
   .filter(parentFilter(xyz))
   .andChildrenSearch(childSearch1, indexer.getAtomDocumentFactory());

which results in

(_type_:asset AND dynamic_multi_filter_string_parent:"xyz") AND 
(
    {!parent which='_type_:asset' v='_type_:atom AND 
      dynamic_multi_filter_string_field_1:"VALUE1" AND
      dynamic_multi_filter_string_field_2:"VALUE2"'}
)

But we need to search for parents which

  • have children matching our first set of criteria and
  • have children matching our second set of criteria and so on

Basically we want to result in something like this

(_type_:asset AND dynamic_multi_filter_string_parent:"xyz") AND 
(
    {!parent which='_type_:asset' v='_type_:atom AND 
      dynamic_multi_filter_string_field_1:"VALUE1" AND
      dynamic_multi_filter_string_field_2:"VALUE2"'}
)
 AND 
(
    {!parent which='_type_:asset' v='_type_:atom AND 
      dynamic_multi_filter_string_field_1:"ANOTHER_VALUE1" AND
      dynamic_multi_filter_string_field_2:"ANOTHER_VALUE2"'}
)

Something like this could be imagined

// first set of filter criteria
final FulltextSearch childSearch1 = Search.fulltext()
   .filter(AndFilter.fromSet(firstSetOfChildCriteria));

// second set of filter criteria
final FulltextSearch childSearch2 = Search.fulltext()
   .filter(AndFilter.fromSet(secondSetOfChildCriteria));

final FulltextSearch search = Search.fulltext()
   .filter(parentFilter(xyz))
   .andChildrenSearches(indexer.getAtomDocumentFactory(), childSearch1, childSearch2);
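
To make the desired query shape concrete, the sketch below assembles the Solr string directly: one `{!parent}` clause per child search, with all filters of one child search kept inside the same clause so they have to match the same child document. It only illustrates the target output; the class and method names are hypothetical and this is not Vind's query builder.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: builds block-join parent clauses as plain strings.
final class BlockJoinSketch {

    // All filters of one child search end up in a single v='...' expression.
    static String parentClause(String parentType, String childType, List<String> childFilters) {
        String child = "_type_:" + childType + " AND " + String.join(" AND ", childFilters);
        return "{!parent which='_type_:" + parentType + "' v='" + child + "'}";
    }

    // Several child searches become several parent clauses joined by AND.
    static String andChildrenSearches(String parentType, String childType, List<List<String>> searches) {
        return searches.stream()
                .map(filters -> "(" + parentClause(parentType, childType, filters) + ")")
                .collect(Collectors.joining(" AND "));
    }

    public static void main(String[] args) {
        System.out.println(andChildrenSearches("asset", "atom", List.of(
                List.of("dynamic_multi_filter_string_field_1:\"VALUE1\"",
                        "dynamic_multi_filter_string_field_2:\"VALUE2\""),
                List.of("dynamic_multi_filter_string_field_1:\"ANOTHER_VALUE1\"",
                        "dynamic_multi_filter_string_field_2:\"ANOTHER_VALUE2\""))));
    }
}
```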

Add health check functionality

At the moment vind does not provide functionality for health checks (e.g. ping) so the clients have to use some custom implementations (for example expose a solr client and use the Spring Boot actuator SolrHealthIndicator). It would be nice if vind could offer some functionality to support these health checks.

Enable global meta data for batch commit identification

Current State

Vind's SearchServer (https://javadoc.io/page/com.rbmhtechnology.vind/vind/latest/com/rbmhtechnology/vind/api/SearchServer.html) provides several methods to index documents:

  • void index(Document... doc)
  • void index(List<Document> doc)
  • void indexBean(List<Object> t)
  • void indexBean(Object... t)

Internally, all of these methods trigger an indexing process but not a commit (this is intended, as the server itself can handle commits much more efficiently). Note that there are separate commit methods, which guarantee that all indexing processes are committed (with all the negative consequences regarding performance).

Problem

In applications that support Read-Your-Writes this behaviour might be a problem (because the application has to guarantee an always-up-to-date index status and thus is forced to use many hard commits).

Idea

Vind could support version numbering for indexing processes, so that an application could check which version was indexed last (and thus, via an additional method, whether the required documents have already been processed). This could be an internal counter or a counter maintained by the application, which could lead to the following API:

  • long index(List<Document> doc)
  • void index(List<Document> doc, long version)

Note that the other methods would work analogously. To get the latest index version there could be methods like:

  • long getLatestVersion()
  • boolean isVersionIndexed(long version)

In addition, each Document could have an additional field version.
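
The proposed API could be backed by two counters, one for handed-out versions and one for versions known to be committed. The following is a minimal sketch of the internal-counter variant; the class name, the use of plain strings as documents and the `onCommit` callback are assumptions, since the issue does not say how Vind would learn about commits.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of version numbering for indexing processes.
final class VersionedIndexSketch {

    private final AtomicLong latestSubmitted = new AtomicLong();
    private final AtomicLong latestCommitted = new AtomicLong();

    // Returns a monotonically increasing version for each index call.
    long index(List<String> docs) {
        // ... the documents would be sent to the backend here ...
        return latestSubmitted.incrementAndGet();
    }

    // Assumed hook: invoked when a commit covers everything up to 'version'.
    void onCommit(long version) {
        latestCommitted.accumulateAndGet(version, Math::max);
    }

    long getLatestVersion() {
        return latestSubmitted.get();
    }

    boolean isVersionIndexed(long version) {
        return version <= latestCommitted.get();
    }

    public static void main(String[] args) {
        VersionedIndexSketch server = new VersionedIndexSketch();
        long version = server.index(List.of("doc-1", "doc-2"));
        System.out.println(server.isVersionIndexed(version));
        server.onCommit(version);
        System.out.println(server.isVersionIndexed(version));
    }
}
```

An application that needs Read-Your-Writes could then poll `isVersionIndexed` for the version it received instead of forcing a hard commit after every write.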

Add configuration support via environment variables

Currently the vind configuration is mostly done via a properties file. To ensure the cloud-readiness of the library the configuration via environment variables is needed.

Example:

VIND_SERVER_SOLR_CLOUD=true
VIND_HOST=...
...
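
One possible naming rule, sketched below, strips a `VIND_` prefix, lower-cases the rest and turns underscores into dots. This rule and the resulting property keys (for example `server.solr.cloud`) are assumptions made for illustration; the issue itself only gives the two variable names above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: derives configuration properties from environment variables.
final class EnvConfigSketch {

    static final String PREFIX = "VIND_";

    // Assumed rule: VIND_SERVER_SOLR_CLOUD -> server.solr.cloud
    static String toPropertyKey(String envName) {
        return envName.substring(PREFIX.length())
                .toLowerCase(Locale.ROOT)
                .replace('_', '.');
    }

    // Variables without the prefix are ignored.
    static Properties fromEnvironment(Map<String, String> env) {
        Properties props = new Properties();
        env.forEach((name, value) -> {
            if (name.startsWith(PREFIX)) {
                props.setProperty(toPropertyKey(name), value);
            }
        });
        return props;
    }

    public static void main(String[] args) {
        System.out.println(fromEnvironment(System.getenv()));
    }
}
```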

check Collection manager 404 / success update

While running a collection update from a private repository the collection manager tool logs a 404 when updating but still displays the successful update message (and successfully updates the collection).

TermFacet ignores facet limit property

When a facet limit is set for a search, the TermFacet json implementation is completely ignoring it due to a missing 'limit' parameter in the json generated.
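
For comparison, a terms facet in Solr's JSON Facet API carries its limit in a `limit` key next to `type` and `field`. The sketch below builds such a fragment as a string; the class name is hypothetical, the field name is not escaped, and it does not reproduce Vind's actual TermFacet serialization.

```java
// Hypothetical sketch: a JSON terms facet that includes the limit parameter.
final class TermFacetJsonSketch {

    static String termFacet(String field, int limit) {
        // No JSON escaping is done here, so the field name must be a plain identifier.
        return String.format("{\"type\":\"terms\",\"field\":\"%s\",\"limit\":%d}", field, limit);
    }

    public static void main(String[] args) {
        System.out.println(termFacet("dynamic_multi_facet_string_category", 10));
    }
}
```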

Update suggestionhandler result to NamedList

Currently the suggestion handler returns a Map object instead of a NamedList, as Solr usually does. This is a legacy of previous suggestion handler versions, but it should be changed to NamedList, as that is more efficient and is the result type expected from a Solr handler. Some Vind modifications are needed to support this return type.

Report creation fails with SocketTimeout

The report creation fails with SocketTimeouts.

build	10-Jul-2018 12:41:58	12:41:58.348 [main] WARN  c.r.v.m.utils.ElasticSearchClient - contenthub.global.prod - Try 0 - Error in query scroll request query: Read timed out
build	10-Jul-2018 12:41:58	java.net.SocketTimeoutException: Read timed out
build	10-Jul-2018 12:41:58		at java.net.SocketInputStream.socketRead0(Native Method)
build	10-Jul-2018 12:41:58		at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
build	10-Jul-2018 12:41:58		at java.net.SocketInputStream.read(SocketInputStream.java:171)
build	10-Jul-2018 12:41:58		at java.net.SocketInputStream.read(SocketInputStream.java:141)
build	10-Jul-2018 12:41:58		at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
build	10-Jul-2018 12:41:58		at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
build	10-Jul-2018 12:41:58		at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)

This may be due to not closing the ES scroll queries while setting a large timeout of 30 minutes. As mentioned at https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-request-scroll.html, the scroll should be cleared explicitly.

Suggestions not working after upgrading from 1.2.0 to 1.2.1

We did an upgrade from vind 1.2.0 to vind 1.2.1 and updated all our collections to the new config version. Unfortunately the suggestions do not work anymore after the upgrade.

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://....: java.lang.IllegalStateException: Type mismatch: dynamic_multi_stored_suggest_string_company was indexed as SORTED_SET
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:577)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:372)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:325)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1121)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:891)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:827)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
        at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:974)
        at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:990)
        at com.rbmhtechnology.vind.solr.backend.SolrSearchServer.execute(SolrSearchServer.java:817)
        at com.rbmhtechnology.vind.solr.backend.SolrSearchServer.execute(SolrSearchServer.java:808)
        at com.rbmhtechnology.vind.monitoring.MonitoringSearchServer.execute(MonitoringSearchServer.java:349)
        at com.rbmhtechnology.vind.monitoring.MonitoringSearchServer.execute(MonitoringSearchServer.java:342)

Indexing the data did not solve the problem. Removing all the documents and indexing seems to solve it. However, due to the amount of data that is not an option for us.

Please provide a way we can do the upgrade without deleting all the data from the index.

Atomic update takes too long

In a specific use case the atomic update takes 2 seconds to update a document.

  • find out the reason.
  • find possible fix.

vind's dependency stack includes elasticsearch client

The com.rbmhtechnology.vind:monitoring-api module of vind depends on the elasticsearch client.

+--- com.rbmhtechnology.vind:log-writer:1.2.1
|    \--- com.rbmhtechnology.vind:monitoring-api:1.2.1
|         +--- com.rbmhtechnology.vind:vind-api:1.2.1 (*)
|         +--- com.fasterxml.jackson.core:jackson-databind:2.7.5 -> 2.8.3 (*)
|         +--- com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.7.5 -> 2.8.3
|         |    +--- com.fasterxml.jackson.core:jackson-core:2.8.3
|         |    +--- com.fasterxml.jackson.core:jackson-databind:2.8.3 (*)
|         |    \--- com.fasterxml.jackson.core:jackson-annotations:2.8.0 -> 2.8.3
|         +--- io.redlink.utils:utils:1.1.0
|         |    +--- org.slf4j:slf4j-api:1.7.25 -> 1.7.12
|         |    \--- org.apache.commons:commons-lang3:3.5
|         \--- io.searchbox:jest:5.3.3 -> 2.0.3

If vind is used inside a Spring Boot app (at least in 1.x; 2.x needs to be confirmed), this triggers the Elasticsearch health endpoint to be configured.

Is this dependency necessary? Or do we need to configure that somehow?

Wrong filters in 1.2.3

With vind 1.2.1 the following search

{"q":"*","filter":"((static_status='passive') OR (static_status='active')) AND ((static_partitionID='MV-1HP6U6PQS1W11') OR (static_partitionID='MV-1HP6TNXVH1W11') OR (static_partitionID='MV-1HP6UG2V51W11'))","timeZone":"null","sort":[{'direction':'Desc','field':'static_recordLastUpdateTimestamp'}],"result":{"sliceSize":21,"offset":0},"nestedDocSearchFlag":false,"nestedDocOp":"OR","nestedDocFactory":null,"nestedDocSearch":null,"facetFlag":false,"facetMinCount":1,"facetLimit":10,"facet":{},"geoDistance":null,"searchContext":"null","strictFlag":true}

resulted in those filters (only status and partitionID are relevant here)

&fq=((dynamic_multi_stored_filter_string_static_status:"active"+OR+dynamic_multi_stored_filter_string_static_status:"passive")+AND+(dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6UG2V51W11"+OR+dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6TNXVH1W11"+OR+dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6U6PQS1W11"))

which is the expected fq.

When using vind 1.2.3 the same code produces this search

{"q":"*","filter":"((static_status='active') OR (static_status='passive')) AND ((static_partitionID='MV-1HP6U6PQS1W11') OR (static_partitionID='MV-1HP6UG2V51W11') OR (static_partitionID='MV-1HP6TNXVH1W11'))","timeZone":"null","sort":[{'direction':'Desc','field':'static_recordLastUpdateTimestamp'}],"result":{"sliceSize":21,"offset":0},"nestedDocSearchFlag":false,"nestedDocOp":"OR","nestedDocFactory":null,"nestedDocSearch":[],"facetFlag":false,"facetMinCount":1,"facetLimit":10,"facet":{},"geoDistance":null,"searchContext":"null","strictFlag":true}

The filters in the search are the same as before (except the order). However, the generated fq for solr is broken since it now generates this:

&fq=dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6UG2V51W11"+OR+dynamic_multi_stored_filter_string_static_status:"passive"+OR+dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6U6PQS1W11"+OR+dynamic_multi_stored_filter_string_static_status:"active"+OR+dynamic_multi_stored_filter_string_static_partitionID:"MV-1HP6TNXVH1W11"
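
The difference between the two fq values is that the nesting was lost: the AND of two OR groups was flattened into one OR chain. A serializer that keeps the grouping could work as in the sketch below, where every composite filter wraps its own children in parentheses. The `Filter` interface and helper methods here are hypothetical and much simpler than Vind's filter classes.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: serializes a filter tree without losing the nesting.
final class FilterSerializationSketch {

    interface Filter {
        String toSolr();
    }

    static Filter eq(String field, String value) {
        return () -> field + ":\"" + value + "\"";
    }

    // The operator only joins the direct children of this node.
    static Filter combine(String operator, List<Filter> children) {
        return () -> children.stream()
                .map(Filter::toSolr)
                .collect(Collectors.joining(" " + operator + " ", "(", ")"));
    }

    public static void main(String[] args) {
        Filter status = combine("OR", List.of(
                eq("static_status", "active"), eq("static_status", "passive")));
        Filter partition = combine("OR", List.of(
                eq("static_partitionID", "MV-1HP6U6PQS1W11"),
                eq("static_partitionID", "MV-1HP6UG2V51W11")));
        System.out.println(combine("AND", List.of(status, partition)).toSolr());
    }
}
```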

Provide before and after filter for java.util.Date

Please add methods similar to

com.rbmhtechnology.searchlib.api.query.filter.Filter.after(String, ZonedDateTime)
com.rbmhtechnology.searchlib.api.query.filter.Filter.before(String, ZonedDateTime)
to be used with java.util.Date.

It would be easy to convert the Date to a ZonedDateTime outside the search lib. However, since the lib provides a field descriptor to handle java.util.Date (com.rbmhtechnology.searchlib.model.SingleValueFieldDescriptor.UtilDateFieldDescriptor), this conversion should be done by the search lib in that case, to make sure it is done in a consistent way.
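
The conversion in question is a one-liner in java.time; the open point is which time zone to use. The sketch below assumes UTC, which is a choice made for this example and not something the issue specifies, and `DateFilterSketch` is a hypothetical name.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Date;

// Hypothetical sketch: one central Date -> ZonedDateTime conversion.
final class DateFilterSketch {

    // Assumption: dates are interpreted in UTC so that all callers agree.
    static ZonedDateTime toZonedDateTime(Date date) {
        return ZonedDateTime.ofInstant(date.toInstant(), ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(toZonedDateTime(new Date()));
    }
}
```

A `before(String, Date)` or `after(String, Date)` overload could then delegate to the existing ZonedDateTime variants through this single conversion.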

10 Error logs per monitoring server search

The error message "Cannot get scope for non existing field descriptor" logged by the class com.rbmhtechnology.vind.api.query.filter.Filter appears many times per search in the logs, even when the search works.

Page behaviour is inconsistent

I was replacing Slice with Page in a certain use case and ended up asking myself whether a page is 0-based or 1-based.

I guess it is 1-based, because there is this FulltextSearch constructor:

FulltextSearch() {
     this.searchString = "*";
     this.resultSet = new Page(1, SearchConfiguration.get(SearchConfiguration.SEARCH_RESULT_PAGESIZE,10));
}

But if I look into the constructor of Page itself, I see that

    public Page(int page, int pagesize) {
        if(page < 0) {
            log.error("Page number can not be lower than 0: {}",page);
            throw new IllegalArgumentException("Page should not be a negative value:" + page);
        }
        this.page = page;
        this.pagesize = pagesize;
        type = DivisionType.page;
    }

states that a page with value 0 is fine, as long as it is not negative.

But if I fire up a search with

    final FulltextSearch search = Search.fulltext().page(0,10);

I end up with a solr exception

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/my_collection: 'start' parameter cannot be negative

I guess, the Page constructor should be changed to be consistent here.
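
The Solr error is consistent with the start offset being derived from a 1-based page number, where page 0 yields a negative offset. The sketch below shows that arithmetic together with a stricter check; it assumes the offset is computed as (page - 1) * pagesize, which the issue does not state explicitly, and `PageOffsetSketch` is a hypothetical name.

```java
// Hypothetical sketch: 1-based page validation and offset calculation.
final class PageOffsetSketch {

    static int startOffset(int page, int pageSize) {
        if (page < 1) {
            // Rejecting 0 here keeps the constructor check and the offset in sync.
            throw new IllegalArgumentException("Page should be 1 or greater: " + page);
        }
        return (page - 1) * pageSize;
    }

    public static void main(String[] args) {
        System.out.println(startOffset(1, 10));
        System.out.println(startOffset(3, 10));
    }
}
```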

Solr parse error when creating filters shared fields on nested and parent docs

Externally reported issue which happens on a nested search or suggestion filtered by a field that is a member of both the parent and the nested document:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://xxx.xxx.xxx.xxx:8983/solr/collection: org.apache.solr.search.SyntaxError: Expected identifier at pos 52 str='{!child of="_type_:asset" v='({!parent which='_type_:asset' v='_type_:marker AND dynamic_multi_stored_facet_string_static_entityType:"asset"'}'
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:577)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:372)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:325)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1121)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:891)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:827)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
        at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:974)
        at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:990)
        at com.rbmhtechnology.vind.solr.backend.SolrSearchServer.execute(SolrSearchServer.java:817)
        at com.rbmhtechnology.vind.monitoring.MonitoringSearchServer.execute(MonitoringSearchServer.java:396)
        at com.redbullmediabase.mediamanager.core.index.AbstractVindSearchEngine.performSuggestionSearchAndGetSuggestionsFromResponse(AbstractVindSearchEngine.java:745)
        at com.redbullmediabase.mediamanager.core.index.AbstractVindSearchEngine.getSuggestions(AbstractVindSearchEngine.java:967)
        at com.redbullmediabase.mediamanager.core.mam.index.AssetVindSearchEngine.getSuggestions(AssetVindSearchEngine.java:121)
        at com.redbullmediabase.mediamanager.manager.module.assets.controller.AssetsModuleController.suggest(AssetsModuleController.java:1356)
        at sun.reflect.GeneratedMethodAccessor2063.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)

Support hierarchical paths as field values

Currently hierarchical paths (e.g. Taxonomy Fields) are not considered. They have to be supported by a field descriptor and properly fit into suggestion infrastructure.

Precedence of configuration settings using environment variables

According to the documentation of vind 1.2:

The properties are overwritten following the ordering: Default Properties < Environment Variables < Property File

This behaviour is unlike e.g. Spring and typesafe config which do “Default Properties < Property File < Environment Variables”.

This means we have to provide all settings in every environment as environment variables as we cannot simply provide a property file for development which can be overwritten using environment variables. For productive deployment environment variables are easy to handle but locally one might want to provide defaults using a file instead of manually having to configure IDE env vars.
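
The ordering asked for here (Default Properties < Property File < Environment Variables) amounts to merging three layers where later layers overwrite earlier ones. The following is a minimal sketch of that merge with `java.util.Properties`; the class name and the property keys in the usage example are hypothetical.

```java
import java.util.Properties;

// Hypothetical sketch: layered configuration with environment variables on top.
final class ConfigPrecedenceSketch {

    // Later putAll calls overwrite earlier values for the same key.
    static Properties merge(Properties defaults, Properties file, Properties env) {
        Properties merged = new Properties();
        merged.putAll(defaults);
        merged.putAll(file);
        merged.putAll(env);
        return merged;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("server.host", "localhost");
        Properties file = new Properties();
        file.setProperty("server.host", "dev-solr");
        Properties env = new Properties();
        env.setProperty("server.host", "prod-solr");
        System.out.println(merge(defaults, file, env).getProperty("server.host"));
    }
}
```

With this ordering, a property file can carry development defaults while a deployment overrides individual keys through its environment.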

Suggestion: override of default operator

The default logical operator in the suggestion handler is hard-coded to "AND". This should be fixed by providing the option of setting "OR" instead, if desired.

Add log writer to Demos

Add the simple log writer and a logger plus configuration to the demo in Vind so there is an example of usage.
