imbo / imbo-metadata-search

Imbo plugin that enables metadata search

License: MIT License

PHP 89.96% Gherkin 10.04%

imbo-metadata-search's Introduction

Metadata search plugin for Imbo

The metadata search event listener hooks into metadata updates for your images, keeps the search backend of your choice up to date, and lets you find images by querying their metadata.

Installation

Setting up the dependencies

If you've installed Imbo through composer, getting the metadata search up and running is simple: add imbo/imbo-metadata-search as a dependency.

In addition to the metadata search plugin you'll need a search backend client. Right now the plugin ships with support for elasticsearch only, so you'll also want to add elasticsearch/elasticsearch in order to use it as the search backend.

{
    "require": {
        "imbo/imbo-metadata-search": "dev-master",
        "elasticsearch/elasticsearch": "~2.1"
    }
}

The elasticsearch plugin requires that your elasticsearch server is at least version 2.0.

Metadata search setup

In order for the metadata search plugin to be registered and actually do something useful for your Imbo installation, you need to add a config file which declares the routes, resources and event listeners.

After installing with composer you will find a basic config file for the metadata search in vendor/imbo/imbo-metadata-search/config.dist.php. If you want to make changes to the file you should copy it to your config folder.

Indexing

Updates in the search backend are triggered whenever one of the following events is fired: image.delete, images.post, image.post, metadata.post, metadata.put, metadata.delete.

The image.delete event triggers a delete of the indexed object in the search backend, and the other events trigger an update of the full object. When indexing, image data beyond the metadata is also provided to the search backend so that it can be used for sorting and similar operations.

The data provided to the backend is:

Data	Description
`user`	The user "owning" the image
`size`	Byte size of image
`extension`	File extension
`mime`	Mime type of file
`metadata`	Image metadata
`added`	Timestamp representation of when the image was added
`updated`	Timestamp representation of when the image was last updated
`width`	Width of image in pixels
`height`	Height of image in pixels
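
The event handling described above can be summarised in a few lines. This is only an illustrative Python sketch of the mapping, not the plugin's own (PHP) code; the function name `backend_action` and the constant `UPDATE_EVENTS` are made up for the example.

```python
from typing import Optional

# Events that cause the full image object to be re-indexed (from the list above).
UPDATE_EVENTS = {
    "images.post",
    "image.post",
    "metadata.post",
    "metadata.put",
    "metadata.delete",
}


def backend_action(event: str) -> Optional[str]:
    """Return the write the search backend receives for an Imbo event.

    Hypothetical helper: 'delete' removes the indexed object, 'update'
    re-indexes the full object, None means the event is not indexed.
    """
    if event == "image.delete":
        return "delete"
    if event in UPDATE_EVENTS:
        return "update"
    return None
```

Note that metadata.delete maps to an update rather than a delete: the image still exists, so its indexed object is refreshed.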

Querying

Querying is done by issuing an HTTP SEARCH request to /users/<user>/images if you want to search in the images of a single user, or /images if you want to search across multiple users. Supported query parameters are:

Param	Description
`page`	The page number. Defaults to 1.
`limit`	Number of images per page. Defaults to 20.
`metadata`	Whether or not to include metadata in the output. Defaults to 0, set to 1 to enable.
`fields[]`	An array with fields to display. When not specified all fields will be displayed.
`sort[]`	An array with fields to sort by. The direction of the sort is specified by appending asc or desc to the field, delimited by :. If no direction is specified asc will be used. Example: ?sort[]=size&sort[]=width:desc is the same as ?sort[]=size:asc&sort[]=width:desc. If no sort is specified the search backend will rank by relevance.

The query is sent in the request body.
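
As a rough illustration of the parameter rules above, the following Python sketch builds a query string and expands the `sort[]` shorthand. The helper names `normalize_sort` and `build_query_string` are assumptions made for this example and are not part of Imbo or the plugin.

```python
from urllib.parse import urlencode


def normalize_sort(sort_params):
    """Expand 'field' to ('field', 'asc') and split 'field:desc' on ':'."""
    result = []
    for item in sort_params:
        field, _, direction = item.partition(":")
        result.append((field, direction or "asc"))
    return result


def build_query_string(page=1, limit=20, metadata=False, fields=(), sort=()):
    """Build the query string using the documented defaults."""
    pairs = [("page", page), ("limit", limit), ("metadata", 1 if metadata else 0)]
    pairs += [("fields[]", field) for field in fields]
    pairs += [("sort[]", entry) for entry in sort]
    # urlencode percent-encodes the brackets and the colon.
    return urlencode(pairs)
```

With this sketch, `normalize_sort(["size", "width:desc"])` and `normalize_sort(["size:asc", "width:desc"])` give the same result, mirroring the equivalence described for `sort[]`.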

Examples

Querying one user

$ curl -X SEARCH 'http://imbo/users/<user>/images?limit=1&metadata=1' -d '{"foo": "bar"}'

Querying multiple users

$ curl -X SEARCH 'http://imbo/images?users[]=<user1>&users[]=<user2>&limit=1&metadata=1' -d '{"foo": "bar"}'

Both of these requests result in a response that looks like this:

{
  "search": {
    "hits": 3,
    "page": 1,
    "limit": 1,
    "count": 1
  },
  "images": [
    {
      "added": "Mon, 10 Dec 2012 11:57:51 GMT",
      "updated": "Mon, 10 Dec 2012 11:57:51 GMT",
      "checksum": "<checksum>",
      "originalChecksum": "<originalChecksum>",
      "extension": "png",
      "size": 6791,
      "width": 1306,
      "height": 77,
      "mime": "image/png",
      "imageIdentifier": "<image>",
      "user": "<user>",
      "metadata": {
        "key": "value",
        "foo": "bar"
      }
    }
  ]
}
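
For clients that are not shell scripts, a request like the ones above could be prepared as follows. This is a minimal Python sketch using only the standard library; `build_search_request` and `total_pages` are hypothetical helper names, the `Content-Type` header is an assumption, and the page calculation assumes that `hits` is the total number of matches, as the sample response suggests.

```python
import json
import math
import urllib.request


def build_search_request(base_url, query, user=None, params=""):
    """Prepare (but do not send) an HTTP SEARCH request with the query as body."""
    path = f"/users/{user}/images" if user else "/images"
    url = base_url.rstrip("/") + path + (f"?{params}" if params else "")
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see above
        method="SEARCH",
    )


def total_pages(search):
    """Number of pages implied by the 'search' block of a response."""
    return math.ceil(search["hits"] / search["limit"])
```

Passing the returned object to `urllib.request.urlopen` would send it. For the sample response above, `total_pages` yields 3, since there are 3 hits at 1 image per page.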

Imbo DSL

The query language used by Imbo Metadata Search is a subset of the MongoDB query DSL. The query is a JSON-encoded object including key => value matches and/or a combination of the supported operators, sent to Imbo in the request body. This section lists all operators and includes a number of examples showing you how to find images using the metadata query.

Note: The same query might produce slightly different results depending on the backend you use for the metadata.

Key/value matching

The simplest form of a metadata query is a simple key => value match, where the expressions are AND-ed together if there is more than one key/value match in the query.

 {"key":"value","otherkey":"othervalue"}

The above search would result in images that have the metadata key key set to value and otherkey set to othervalue.

Greater than - `$gt`

This operator can be used to check for values greater than the value specified.

{"age":{"$gt":35}}

Greater than or equal - `$gte`

Check for values greater than or equal to the value specified.

 {"age":{"$gte":35}}

Less than - `$lt`

Check for values less than the value specified.

{"age":{"$lt":35}}

Less than or equal - `$lte`

Check for values less than or equal to the value specified.

{"age":{"$lte":35}}

Not equal - `$ne`

Matches values that are not equal to the value specified.

{"name":{"$ne":"christer"}}

In - `$in`

Look for values that appear in the specified set.

{"styles":{"$in":["IPA","Imperial Stout","Lambic"]}}

Not in - `$nin`

Look for values that do not appear in the specified set.

{"styles":{"$nin":["Pilsner"]}}

Field exists - `$exists`

Ensure that a given field does or does not exist.

{"age":{"$exists":true}}

Conjunctions - `$and`

This operator can be used to combine a list of criteria that must all match. It takes an array of queries.

{"$and": [{"name": {"$in": ["kristoffer", "morten"]}}, {"age": {"$lt": 30}}]}

This would find images where the key name is either kristoffer or morten and where the age key is less than 30.

Disjunction - `$or`

This operator can be used to combine a list of criteria where at least one must match. It takes an array of queries.

{"$or":[{"key":"value"},{"otherkey":"othervalue"}]}

This would fetch images that have a key named key with the value value and/or a key named otherkey which has the value of othervalue.

Using several operators in one query

All the above operators can be combined into one query. Consider a collection of images of beers which have all been tagged with the name of the brewery, the name of the beer, the style of the beer and the ABV. If we wanted to find all images of beers within a set of styles, above a specific ABV, from two different breweries, and all images of beers from Nøgne Ø, regardless of style and ABV, but not beers called Wit, regardless of brewery, style or ABV, the query could look like this (formatted for easier reading):

{
    "name":
    {
        "$ne": "Wit"
    },
    "$or":
    [
        {
            "brewery": "Nøgne Ø"
        },

        {
            "$and":
            [
                {
                    "abv":
                    {
                        "$gte": 5.5
                    }
                },

                {
                    "style":
                    {
                        "$in":
                        [
                            "IPA",
                            "Imperial Stout"
                        ]
                    }
                },

                {
                    "brewery":
                    {
                        "$in":
                        [
                            "HaandBryggeriet",
                            "Ægir"
                        ]
                    }
                }
            ]
        }
    ]
}
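
To make the semantics of the operators concrete, here is a small in-memory evaluator written in Python. It is only a sketch of how the DSL reads when applied to a plain metadata dictionary: it is not how the plugin or any search backend executes queries, and real backends may differ (for example in how strings are tokenized). The names `matches` and `_compare` are invented for this example.

```python
import operator

_ORDERING = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def _compare(op, metadata, field, arg):
    """Evaluate a single operator against one metadata field."""
    present = field in metadata
    value = metadata.get(field)
    if op == "$exists":
        return present == arg
    if op == "$ne":
        return value != arg
    if op == "$in":
        return present and value in arg
    if op == "$nin":
        return value not in arg
    if op in _ORDERING:
        return present and _ORDERING[op](value, arg)
    raise ValueError(f"unsupported operator: {op}")


def matches(query, metadata):
    """Return True if the metadata dict satisfies the query.

    All top-level entries of a query object are AND-ed together.
    """
    for key, condition in query.items():
        if key == "$and":
            ok = all(matches(sub, metadata) for sub in condition)
        elif key == "$or":
            ok = any(matches(sub, metadata) for sub in condition)
        elif isinstance(condition, dict):
            ok = all(_compare(op, metadata, key, arg) for op, arg in condition.items())
        else:
            ok = key in metadata and metadata[key] == condition
        if not ok:
            return False
    return True
```

Applied to the beer query above, an image tagged with brewery Nøgne Ø matches regardless of style and ABV, unless its name is Wit, in which case the top-level `$ne` condition rejects it.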

Keep in mind that large complex queries against large image collections can take a while to finish, and might cause performance issues on the Imbo server(s).

License

Licensed under the MIT License


imbo-metadata-search's Issues

Model is missing a method

In Imbo-2.1.2 the Imbo\Model\ModelInterface interface added a method called getData that is currently missing from this package.

500 when inserting wrong elastic type

[Wed Dec 09 09:02:01 2015] [error] [client 10.84.100.151] PHP Fatal error: Uncaught Exception with message: {"error":"MapperParsingException[failed to parse [metadata.date]]; nested: MapperParsingException[failed to parse date field [false], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "false"]; ","status":400} in /services/applications/imbo/vendor/imbo/imbo/public/index.php on line 25

Sorting/ordering of images is not consistent

The order of the images returned is not consistent. This should obviously be resolved somehow in order for the search to behave in a predictable manner.

My observation is that this is not caused by ordering in ES, but by the order of the imageIdentifiers returned from the search backend being shuffled when imbo fetches from DB and generates the final response.

I guess there are (at least) two approaches here;

Modify the response after imbo has fetched the images from the backend, and make sure the image ordering corresponds to the order of the imageIdentifiers from the backend.
Make sure the imbo core generates a response with the same ordering as the imageIdentifiers in the ids params array.
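
The first approach could look roughly like the Python sketch below. It is an illustration of the reordering idea only, not code from Imbo or the plugin; the function name `reorder_images` is hypothetical, and each image is assumed to be a dict with an `imageIdentifier` key, as in the search response.

```python
def reorder_images(images, ordered_ids):
    """Return the images sorted to follow the order of ordered_ids.

    Images whose identifier is not in ordered_ids are dropped.
    """
    position = {image_id: index for index, image_id in enumerate(ordered_ids)}
    known = [image for image in images if image["imageIdentifier"] in position]
    return sorted(known, key=lambda image: position[image["imageIdentifier"]])
```

Building the position lookup once keeps the reordering at O(n log n) instead of scanning the identifier list for every image.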

What types of result modifiers do we need to support in the search?

These are the result modifiers currently supported by the images resource;

metadata
Whether or not to include metadata in the output. Defaults to 0, set to 1 to enable.

from
Fetch images starting from this Unix timestamp.

to
Fetch images up until this timestamp.

fields[]
An array with fields to display. When not specified all fields will be displayed.

sort[]
An array with fields to sort by. The direction of the sort is specified by appending asc or desc to the field, delimited by :. If no direction is specified asc will be used. Example: ?sort[]=size&sort[]=width:desc is the same as ?sort[]=size:asc&sort[]=width:desc. If no sort is specified Imbo will sort by the date the images were added, in a descending fashion.

ids[]
An array of image identifiers to filter the results by.

checksums[]
An array of image checksums to filter the results by.

originalChecksums[]
An array of the original image checksums to filter the results by.

metadata and fields are supported by the current metadata search implementation as a consequence of it utilizing the db.images.load event handler for fetching from the backend.

What I want input on is the other modifiers. ids, checksums and originalChecksums I don't really see the need for in the context of the metadata search.

sort is the param I'm thinking we probably want. In order to support sorting like we currently do on the images endpoint, we'll need to store all the image data instead of only the metadata. We would end up with a structure looking like the one used by Imbo in MongoDB for images. The data indexed by the search backend would look like this;

{
    "publicKey" : "publickey",
    "imageIdentifier": "92aa7029b22263ea0b64ba12b4cbf760",
    "size": 1001337,
    "height": 1337,
    "width": 1337,
    "added": 1337001337,
    "metadata": {
        "animal": "Red Panda"
    }
}

I think this is worth the effort, but if people think it's a waste of time I won't care. It would involve:

Fetching and passing full image objects to the search backend for indexing whenever the image data (not only metadata) is changed in order to keep it in sync
Tweaking the AST -> search DSL transformation slightly in order to build queries using the metadata object.
Adding transformation of search parameters to the search backend query building.

Remove lowercasing from Imbo DSL -> AST transformation

The lowercasing of tokens in the AST parser was introduced as part of the Imbo DSL -> Mongo transformation initially written by @christeredvartsen, but is really more of a backend implementation detail. Some of the backends might need lowercasing, but the generic parser shouldn't modify the data.

So for now, the lowercasing can be removed, as the ES backend doesn't care about casing.

Describe the DSL on the readme

We need to describe how the Imbo DSL looks and should be used. This has been done in bits and pieces in issues and implementation, but we need to do a more "final" write-up of the details now that we've done the actual implementation.

Add support for global search

The metadata search should support searching other publickeys by specifying a list of publickeys to search (given that the user searching has access).

This however depends on the introduction of access levels in Imbo; imbo/imbo#319.

Tokenization of metadata

By default strings are tokenized in elasticsearch, so right now a search for

{ "animal":"red panda" }

would result in hits for all the following; red fox, red panda and giant panda.

What do you think is the best way to ensure this works in a predictable way? Is just telling people to configure their indices correctly enough?
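
The effect can be reproduced with a very simplified model of analyzed matching. The Python sketch below assumes lowercasing, whitespace tokenization and "any token matches" semantics; a real elasticsearch analyzer does more than this, and the function names are invented for the illustration.

```python
def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def analyzed_match(query, value):
    """True if any token of the query occurs among the tokens of the value."""
    return bool(set(tokenize(query)) & set(tokenize(value)))


def exact_match(query, value):
    """What an untokenized (not analyzed) field would give: whole-string equality."""
    return query == value
```

Under this model a search for red panda hits red fox (shared token red) and giant panda (shared token panda), while whole-string equality only hits red panda.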

Use one elasticsearch index for storing metadata

Right now we use multiple indices for storing image metadata. For the sake of keeping the list of elasticsearch indices tidy and making querying easier, we should store it all in one big index.

Change endpoints

GET /users/<pubkey>/search should be changed to SEARCH /users/<pubkey>/images

and

SEARCH /images should be used for global image search.

Documents inserted into ElasticSearch has a query-property

I'm currently looking into some improvements that we want to do on our metadata-search, so I was looking into the raw data that got stored in ElasticSearch, and I noticed something strange. All of the documents include the following

...,
"mime": "...",
"query": {
  "filtered": {
    "filter": {
      "and": []
    }
  }
},
"size": ...,
...

That seemed a little suspicious, and like an artefact of some search query. So I looked around a bit, and have figured out it's because the same function, prepareParams, is called from both set and search, and by default it populates a default search query for the search functionality so that the later functions can always just extend that filter.

if (isset($params['body']['query']['filtered']['filter'][0])) {
  ...
} else {
  $params['body']['query']['filtered']['filter'] = ['and' => []];
}

But this has the side effect of also injecting the default search query into all the documents that are stored, which is probably not ideal.

I can take a look at fixing this, if you want - I would probably just move the query-part of prepareParams into search...

Incompatible with ElasticSearch 5.2

The part of the ElasticSearch Search-DSL that we are using in the E.S.-backend was deprecated in 2.0.0-beta, and doesn't work in 5.2 (I haven't checked 5.1, and I think we run 5.0 in production where it works).

So we should update the ElasticSearch DSL-transformation to use a syntax that works from 2.0 and upwards.

We currently use the filtered-syntax, and we should probably move over to the bool-syntax - I've made a few tests that indicate that the bool-syntax seems to support the types of query that we support.

But what are people's opinions on this? Should we simply change the transformation, which means that pre-2.0 will stop working (and thus we should bump the semver-major version of this package), or should we have two different E.S.-backends depending on which version you run?
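
To give an idea of what such a change involves, the Python sketch below rewrites a request body from the filtered syntax (with and/or filters) into the bool syntax. It is an assumption-laden illustration of the mapping only, not the plugin's transformation code: it handles just the list form of and/or, and the function names are made up.

```python
def _convert_filter(node):
    """Rewrite 'and'/'or' filters into bool clauses; leave other filters as-is."""
    if "and" in node:
        return {"bool": {"filter": [_convert_filter(n) for n in node["and"]]}}
    if "or" in node:
        return {
            "bool": {
                "should": [_convert_filter(n) for n in node["or"]],
                "minimum_should_match": 1,
            }
        }
    return node


def filtered_to_bool(body):
    """Replace a top-level 'filtered' query with an equivalent 'bool' query."""
    filtered = body["query"]["filtered"]
    bool_query = {}
    if "query" in filtered:
        bool_query["must"] = filtered["query"]
    if "filter" in filtered:
        bool_query["filter"] = _convert_filter(filtered["filter"])
    return {**body, "query": {"bool": bool_query}}
```

Setting `minimum_should_match` explicitly keeps the or semantics independent of the defaults that apply in a given context or version.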

Lowercasing in the ES DSL transformation

I notice the AST/DSL transformation lowercases all values. I guess this is intentional and wanted.

Values are lowercased as part of the index-time processing and right now a search for cat results in a hit for Cat.

This is more of an implementation detail, but I guess we should probably lowercase the values nevertheless. Thoughts?

Query DSL representation

We need to decide on how to store/process the Query DSL internally inside this plugin. In the original Metadata-search issue (imbo/imbo#268), it was proposed and agreed upon to use a Mongo query-language subset to specify searches in. So this determines the external/textual representation of our query-DSL.

However, the internal structure can technically be whatever we want it to be. So here are my proposals for the three obvious internal representations (AST) of the query-DSL.

Option 1 Store the Mongo JSON as is

That is, the input-query

{"foo": "bar", "baz": {"$not": {"$gt": 42}}}

would be stored internally as

['foo' => 'bar', 'baz' => ['$not' => ['$gt' => 42]]]

So basically the result of calling json_decode, albeit with a few modifications (translate to lower-case) and checks (throw exceptions on unknown operators like $regex).

Option 2 Store the Mongo JSON as normalised Mongo queries

In Mongo it is possible to represent many queries in multiple different, equivalent ways. Take for instance the two queries

{"foo": "bar", "baz": "blargh"}
{"$and": [{"foo": "bar"}, {"baz": "blargh"}]}

They are equivalent when executed, but the latter is much easier to transform into other query languages, because there will only be a few ways of building up queries: basically all queries can be normalised into one of the following 5 query structures

{"$and": [term, term, term]}
{"$or": [term, term, term]}
{"field": "value"}
{"field": {"$operator": "value"}}
{"field": {"$not": {"$operator": "value"}}}

So for instance the query

{"foo": "bar", "baz": {"$not": {"$gt": 42}}}

would be stored internally as

['$and' => [
  ['foo' => 'bar'],
  ['baz' => ['$not' => ['$gt' => 42]]]
]]

Doing recursive descents over such a simple data structure makes it a lot easier to translate it into e.g. ElasticSearch queries.
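
A normalisation step along these lines could be sketched as follows. The example is in Python rather than PHP purely for illustration, operates on the decoded JSON, and is not the plugin's implementation; the function name `normalize` is an assumption.

```python
def normalize(query):
    """Rewrite a decoded query into the normalised structures listed above.

    Implicit AND-ing of several keys becomes an explicit $and, and a field
    with several operators is split into one term per operator.
    """
    terms = []
    for key, value in query.items():
        if key in ("$and", "$or"):
            terms.append({key: [normalize(sub) for sub in value]})
        elif isinstance(value, dict):
            for op, arg in value.items():
                terms.append({key: {op: arg}})
        else:
            terms.append({key: value})
    # A single term needs no wrapping; anything else becomes an explicit $and.
    return terms[0] if len(terms) == 1 else {"$and": terms}
```

With this sketch both equivalent queries from the example normalise to the same `$and` form, and a query with a single key/value pair is returned unchanged.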

Option 3 Store normalized Mongo queries as instances of AST-classes

This is basically doing the normalisation from option 2, but instead of storing it as associative arrays, it would be stored as instances of specific classes, like \Imbo\MetadataSearch\Dsl\Ast\And

So the query

{"foo": "bar", "baz": {"$not": {"$gt": 42}}}

would internally be stored as

new Dsl\Ast\And([
    new Dsl\Ast\Field('foo', new Dsl\Ast\Comparison\Equal('bar')),
    new Dsl\Ast\NegatedField('baz', new Dsl\Ast\Comparison\GreaterThan(42))
])

This structure makes it even easier / more readable to do recursive descents over the query-DSL's AST. You could do something like the following (I admit this looks a bit silly, but you know - without pattern matching, there is only so much you can do)

function transformToEs(Dsl\Ast $query) {
    switch (true) {
        case $query instanceof Dsl\Ast\And:
            return '(' . implode(' AND ', array_map('transformToEs', $query)) . ')';
        case $query instanceof Dsl\Ast\Or:
            return '(' . implode(' OR ', array_map('transformToEs', $query)) . ')';
        case $query instanceof Dsl\Ast\Field:
            return $query->field . ':' . transformComparisonToEs($query->value);
        case $query instanceof Dsl\Ast\NegatedField:
            return 'NOT ' . $query->field . ':' . transformComparisonToEs($query->comparison);
    }
}

function transformComparisonToEs(Dsl\Ast\Comparison $query) {
    switch (true) {
        case $query instanceof Dsl\Ast\Comparison\Equal:
            return $query->value;
        case $query instanceof Dsl\Ast\Comparison\GreaterThan:
            return '>' . $query->value;
        // and so forth, for >=, <= and <
    }
}
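
For comparison, the same idea can be written as a self-contained, runnable sketch. The version below uses Python dataclasses instead of PHP classes; the class and function names mirror the proposal but are otherwise assumptions, and only the Equal and GreaterThan comparisons are covered.

```python
from dataclasses import dataclass


@dataclass
class Equal:
    value: object


@dataclass
class GreaterThan:
    value: object


@dataclass
class Field:
    name: str
    comparison: object


@dataclass
class NegatedField:
    name: str
    comparison: object


@dataclass
class And:
    terms: list


@dataclass
class Or:
    terms: list


def comparison_to_string(comparison):
    """Render a comparison node; other operators would follow the same pattern."""
    if isinstance(comparison, Equal):
        return str(comparison.value)
    if isinstance(comparison, GreaterThan):
        return f">{comparison.value}"
    raise TypeError(f"unknown comparison: {comparison!r}")


def to_query_string(node):
    """Recursively descend the AST and build a query-string style expression."""
    if isinstance(node, And):
        return "(" + " AND ".join(to_query_string(t) for t in node.terms) + ")"
    if isinstance(node, Or):
        return "(" + " OR ".join(to_query_string(t) for t in node.terms) + ")"
    if isinstance(node, Field):
        return f"{node.name}:{comparison_to_string(node.comparison)}"
    if isinstance(node, NegatedField):
        return f"NOT {node.name}:{comparison_to_string(node.comparison)}"
    raise TypeError(f"unknown node: {node!r}")
```

Raising on unknown node types, rather than silently returning nothing, is one way the stricter structure helps reject malformed queries early.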

Personally, I would want to go with either option 2 or 3. By going with option 1, we're going to make it harder than necessary to write transformations for multiple search backends. Doing the normalisation will also allow us to reject more malformed queries...

The difference between 2 and 3 is basically just that option 3 adds a more rigidly enforced structure on the internal representation (AST). It can also make transformation functions easier to read, because the class names can be more descriptive than the text strings that Mongo uses for operators. But this structure does come with the "overhead" of requiring quite a few class definitions, all for rather small classes that need to contain 1-2 values.

So what are people's opinions on how the query-DSL should be represented internally in this plugin?

-Morten.
