imbo / imbo-metadata-search
Imbo plugin that enables metadata search
License: MIT License
I notice the AST/DSL transformation lowercases all values. I guess this is intentional and wanted.
Values are lowercased as part of the index-time processing, and right now a search for cat results in a hit for Cat.
This is more of an implementation detail, but I guess we should probably lowercase the values nevertheless. Thoughts?
We need to decide on how to store/process the query DSL internally inside this plugin. In the original metadata-search issue (imbo/imbo#268), it was proposed and agreed upon to use a subset of the Mongo query language to specify searches in. So this determines the external/textual representation of our query DSL.
However, the internal structure can technically be whatever we want it to be. So here is my proposal of the three obvious internal representations (ASTs) of the query DSL.
Option 1 Store the Mongo JSON as is
That is, the input-query
{"foo": "bar", "baz": {"$not": {"$gt": 42}}}
would be stored internally as
['foo' => 'bar', 'baz' => ['$not' => ['$gt' => 42]]]
So basically the result of calling json_decode - albeit with a few modifications (translate to lower-case) and checks (throw exceptions on unknown operators like $regex).
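A minimal sketch of option 1, using the example query from above (the lower-casing and operator-validation passes are left out):

```php
<?php
// Option 1 is essentially the output of json_decode() with associative
// arrays enabled, before any lower-casing or operator validation.
$ast = json_decode('{"foo": "bar", "baz": {"$not": {"$gt": 42}}}', true);

// $ast is now ['foo' => 'bar', 'baz' => ['$not' => ['$gt' => 42]]]
```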
Option 2 Store the Mongo JSON as normalised Mongo queries
In Mongo it is possible to represent many queries in multiple different, equivalent ways. Take for instance the two queries
{"foo": "bar", "baz": "blargh"}
{"$and": [{"foo": "bar"}, {"baz": "blargh"}]}
They are equivalent when executed, but the latter is much easier to transform into other query languages because there will only be a few ways of building up queries: basically, all queries can be normalised into one of the following 5 query structures
{"$and": [term, term, term]}
{"$or": [term, term, term]}
{"field": "value"}
{"field": {"$operator": "value"}}
{"field": {"$not": {"$operator": "value"}}}
So for instance the query
{"foo": "bar", "baz": {"$not": {"$gt": 42}}}
would be stored internally as
['$and' => [
    ['foo' => 'bar'],
    ['baz' => ['$not' => ['$gt' => 42]]],
]]
Doing recursive descents over such a simple data structure makes it a lot easier to translate it into e.g. ElasticSearch queries.
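The normalisation step itself is small. A hypothetical sketch (function name and exact behaviour are illustrative, not the plugin's actual implementation): a top-level query with multiple field/value pairs is rewritten into an explicit $and of single-field terms, and existing $and/$or branches are normalised recursively.

```php
<?php
// Hypothetical sketch of the option-2 normalisation.
function normalizeQuery(array $query) {
    // Already an explicit conjunction/disjunction: normalise each branch
    if (isset($query['$and']) || isset($query['$or'])) {
        $op = isset($query['$and']) ? '$and' : '$or';
        return [$op => array_map('normalizeQuery', $query[$op])];
    }

    // Split {"a": ..., "b": ...} into {"$and": [{"a": ...}, {"b": ...}]}
    if (count($query) > 1) {
        $terms = [];
        foreach ($query as $field => $value) {
            $terms[] = [$field => $value];
        }
        return ['$and' => $terms];
    }

    // Already one of the single-term structures
    return $query;
}
```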
Option 3 Store normalized Mongo queries as instances of AST-classes
This is basically doing the normalisation from option 2, but instead of storing it as associative arrays, it would be stored as instances of specific classes, like \Imbo\MetadataSearch\Dsl\Ast\And
So the query
{"foo": "bar", "baz": {"$not": {"$gt": 42}}}
would internally be stored as
new Dsl\Ast\And([
new Dsl\Ast\Field('foo', new Dsl\Ast\Comparison\Equal('bar')),
new Dsl\Ast\NegatedField('baz', new Dsl\Ast\Comparison\GreaterThan(42))
])
This structure makes it even easier / more readable to do recursive descents over the query-DSL's AST. You could do something like the following (I admit this looks a bit silly, but you know - without pattern matching, there is only so much you can do):
function transformToEs(Dsl\Ast $query) {
    switch (TRUE) {
        case $query instanceof Dsl\Ast\And:
            // assuming And/Or expose their sub-terms as $query->terms
            return '(' . implode(' AND ', array_map('transformToEs', $query->terms)) . ')';
        case $query instanceof Dsl\Ast\Or:
            return '(' . implode(' OR ', array_map('transformToEs', $query->terms)) . ')';
        case $query instanceof Dsl\Ast\Field:
            return $query->field . ':' . transformComparisonToEs($query->value);
        case $query instanceof Dsl\Ast\NegatedField:
            return 'NOT ' . $query->field . ':' . transformComparisonToEs($query->comparison);
    }
}
function transformComparisonToEs(Dsl\Ast\Comparison $query) {
    switch (TRUE) {
        case $query instanceof Dsl\Ast\Comparison\Equal:
            return $query->value;
        case $query instanceof Dsl\Ast\Comparison\GreaterThan:
            return '>' . $query->value;
        // and so forth, for >=, <= and <
    }
}
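For completeness, here is a hypothetical sketch of the small class definitions the example above assumes. Note that and/or are reserved words in PHP, so the conjunction classes would need slightly different names in practice (AndConjunction/OrConjunction are used here); Field is assumed to keep its comparison in a public $value property, and the conjunctions their sub-terms in $terms.

```php
<?php
namespace Imbo\MetadataSearch\Dsl\Ast;

// "And"/"Or" are reserved words in PHP, so the real classes would need
// different names - hence AndConjunction/OrConjunction in this sketch.
class AndConjunction {
    public $terms;
    public function __construct(array $terms) { $this->terms = $terms; }
}

class OrConjunction {
    public $terms;
    public function __construct(array $terms) { $this->terms = $terms; }
}

class Field {
    public $field;
    public $value; // a Comparison instance
    public function __construct($field, $value) {
        $this->field = $field;
        $this->value = $value;
    }
}

class NegatedField {
    public $field;
    public $comparison; // a Comparison instance
    public function __construct($field, $comparison) {
        $this->field = $field;
        $this->comparison = $comparison;
    }
}

namespace Imbo\MetadataSearch\Dsl\Ast\Comparison;

class Equal {
    public $value;
    public function __construct($value) { $this->value = $value; }
}

class GreaterThan {
    public $value;
    public function __construct($value) { $this->value = $value; }
}
```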
Personally, I would want to go with either option 2 or 3. By going with option 1, we're going to make it harder than necessary to write transformations for multiple search backends. Doing the normalisation will also allow us to reject more malformed queries...
The difference between 2 and 3 is basically just that option 3 adds a more rigidly enforced structure on the internal representation (AST). It can also make transformation functions easier to read, because we can have potentially more descriptive class names than the text strings that Mongo uses for operators. But this structure does come with the "overhead" of requiring quite a few class definitions, all rather small classes that need to contain 1-2 values.
So what are people's opinions on how the query DSL should be represented internally in this plugin?
-Morten.
The lowercasing of tokens in the AST parser was introduced as part of the Imbo DSL -> Mongo transformation initially written by @christeredvartsen, but is really more of a backend implementation detail. Some of the backends might need lowercasing, but the generic parser shouldn't modify the data.
So for now, the lowercasing can be removed, as the ES backend doesn't care about casing.
In Imbo 2.1.2 the Imbo\Model\ModelInterface interface added a method called getData that is currently missing from this package.
The metadata search should support searching other publickeys by specifying a list of publickeys to search (given that the user searching has access).
This however depends on the introduction of access levels in Imbo; imbo/imbo#319.
GET /users/<pubkey>/search should be changed to SEARCH /users/<pubkey>/images, and SEARCH /images should be used for global image search.
[Wed Dec 09 09:02:01 2015] [error] [client 10.84.100.151] PHP Fatal error: Uncaught Exception with message: {"error":"MapperParsingException[failed to parse [metadata.date]]; nested: MapperParsingException[failed to parse date field [false], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "false"]; ","status":400} in /services/applications/imbo/vendor/imbo/imbo/public/index.php on line 25
The part of the ElasticSearch Search-DSL that we are using in the E.S.-backend was deprecated in 2.0.0-beta, and doesn't work in 5.2 (I haven't checked 5.1, and I think we run 5.0 in production where it works).
So we should update the ElasticSearch DSL-transformation to use a syntax that works from 2.0 and upwards.
We currently use the filtered syntax, and we should probably move over to the bool syntax - I've made a few tests that indicate that the bool syntax seems to support the types of queries that we support.
But what are people's opinions on this? Should we simply change the transformation, which means that pre-2.0 will stop working (and thus we should bump the semver-major version of this package), or should we have two different E.S. backends depending on which version you run?
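To make the change concrete, here is a hedged sketch of what the same single-term filter looks like in the two syntaxes (the field name is just an example, not something the plugin necessarily emits). The deprecated filtered syntax:

```json
{"query": {"filtered": {"filter": {"and": [{"term": {"metadata.animal": "red panda"}}]}}}}
```

And the equivalent bool syntax, which works from 2.0 onwards:

```json
{"query": {"bool": {"filter": [{"term": {"metadata.animal": "red panda"}}]}}}
```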
We need to describe how the Imbo DSL looks and should be used. This has been done in bits and pieces in issues and implementation, but we need to do a more "final" write-up of the details now that we've done the actual implementation.
The order of images returned is not consistent. This should obviously be resolved somehow in order for the searching to behave in a predictable manner.
My observation is that this is not caused by ordering in ES, but by the order of the imageIdentifiers returned from the search backend being shuffled when Imbo fetches from the DB and generates the final response.
I guess there are (at least) two approaches here; ids params array.

Right now we use multiple indices for storing image metadata. For the sake of keeping the elasticsearch indices list tidy and making querying easier, we should store it all in one big index.
I'm currently looking into some improvements that we want to do on our metadata search, so I was examining the raw data that gets stored in ElasticSearch, and I noticed something strange. All of the documents include the following
...,
"mime": "...",
"query": {
"filtered": {
"filter": {
"and": []
}
}
},
"size": ...,
...
That seemed a little suspicious, and like an artefact of some search query. So I looked around a bit, and figured out it's because the same function, prepareParams, is called from both set and search, and by default it populates a default search query for the search functionality so that the later functions can always just extend that filter.
if (isset($params['body']['query']['filtered']['filter'][0])) {
    ...
} else {
    $params['body']['query']['filtered']['filter'] = ['and' => []];
}
But this has the side-effect of also injecting the default search query into all the documents that are stored, which is probably not ideal.
I can take a look at fixing this, if you want - I would probably just move the query part of prepareParams into search...
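A hedged sketch of what that fix could look like. The method names prepareParams/set/search come from the discussion above, but the bodies here are illustrative only: prepareParams builds just the shared boilerplate, and the default filter skeleton is bootstrapped in the search path alone, so stored documents never pick it up.

```php
<?php
// Hypothetical sketch: prepareParams() no longer injects a default query,
// so documents written via set() stay clean.
function prepareParams($index, $type, $id = null) {
    $params = ['index' => $index, 'type' => $type];
    if ($id !== null) {
        $params['id'] = $id;
    }
    return $params; // note: no 'query' key here any more
}

function searchParams($index, $type) {
    $params = prepareParams($index, $type);
    // The default query skeleton now lives with the search path only
    $params['body']['query']['filtered']['filter'] = ['and' => []];
    return $params;
}
```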
By default strings are tokenized in elasticsearch, so right now a search for
{ "animal":"red panda" }
would result in hits for all of the following: red fox, red panda and giant panda.
What do you think is the best way to ensure this works in a predictable way? Is just telling people to configure their indices correctly enough?
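One possible answer is to ship (or at least document) an explicit mapping that marks metadata strings as not analyzed, so they only match as whole values. A hedged sketch for ES 1.x/2.x (in 5.x+ the equivalent is the keyword type); the type and field names here are just examples:

```json
{
  "mappings": {
    "image": {
      "properties": {
        "metadata": {
          "properties": {
            "animal": {"type": "string", "index": "not_analyzed"}
          }
        }
      }
    }
  }
}
```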
These are the result modifiers currently supported by the images resource;
metadata
Whether or not to include metadata in the output. Defaults to 0, set to 1 to enable.
from
Fetch images starting from this Unix timestamp.
to
Fetch images up until this timestamp.
fields[]
An array with fields to display. When not specified all fields will be displayed.
sort[]
An array with fields to sort by. The direction of the sort is specified by appending asc or desc to the field, delimited by :. If no direction is specified asc will be used. Example: ?sort[]=size&sort[]=width:desc is the same as ?sort[]=size:asc&sort[]=width:desc. If no sort is specified Imbo will sort by the date the image was added, in a descending fashion.
ids[]
An array of image identifiers to filter the results by.
checksums[]
An array of image checksums to filter the results by.
originalChecksums[]
An array of the original image checksums to filter the results by.
metadata and fields are supported by the current metadata search implementation as a consequence of it utilizing the db.images.load event handler for fetching from the backend.
What I want input on is the other modifiers. ids, checksums and originalChecksums I don't really see the need for in the context of the metadata search.
sort is the param I'm thinking we probably want. In order to support sorting like we currently do on the images endpoint, we'll need to store all the image data instead of only the metadata. We would end up with a structure looking like the one used by Imbo in MongoDB for images. The data indexed by the search backend would look like this;
{
"publicKey" : "publickey",
"imageIdentifier": "92aa7029b22263ea0b64ba12b4cbf760",
"size": 1001337,
"height": 1337,
"width": 1337,
"added": 1337001337,
"metadata": {
"animal": "Red Panda"
}
}
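With the full image data indexed like this, the sort[] modifier could translate more or less directly into an ElasticSearch sort clause. A sketch of what ?sort[]=size&sort[]=width:desc might become (the exact transformation is an assumption, not implemented yet):

```json
{"sort": [{"size": {"order": "asc"}}, {"width": {"order": "desc"}}]}
```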
I think this is worth the effort, but if people think it's a waste of time I won't care. It would involve;