imbo / imbo-metadata-search
Imbo plugin that enables metadata search
License: MIT License
I notice the AST/DSL transformation lowercases all values. I guess this is intentional and wanted.
Values are lowercased as part of the index-time processing, and right now a search for cat results in a hit for Cat.
This is more of an implementation detail, but I guess we should probably lowercase the values nevertheless. Thoughts?
We need to decide on how to store/process the query DSL internally inside this plugin. In the original metadata-search issue (imbo/imbo#268), it was proposed and agreed upon to use a subset of the Mongo query language to specify searches in. So this determines the external/textual representation of our query DSL.
However, the internal structure can technically be whatever we want it to be. So here is my proposal of the three obvious internal representations (ASTs) of the query DSL.
Option 1 Store the Mongo JSON as is
That is, the input-query
{"foo": "bar", "baz": {"$not": {"$gt": 42}}}
would be stored internally as
['foo' => 'bar', 'baz' => ['$not' => ['$gt' => 42]]]
So basically the result of calling json_decode - albeit with a few modifications (translate to lower-case) and checks (throw exceptions on unknown operators like $regex).
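A minimal sketch of option 1, using the example query from above (the lower-casing and operator-validation passes are left out):

```php
<?php
// Option 1 is essentially the output of json_decode() with associative
// arrays enabled, before any lower-casing or operator validation.
$ast = json_decode('{"foo": "bar", "baz": {"$not": {"$gt": 42}}}', true);

// $ast is now ['foo' => 'bar', 'baz' => ['$not' => ['$gt' => 42]]]
```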
Option 2 Store the Mongo JSON as normalised Mongo queries
In Mongo it is possible to represent many queries in multiple different, equivalent ways. Take for instance the two queries
{"foo": "bar", "baz": "blargh"}
{"$and": [{"foo": "bar"}, {"baz": "blargh"}]}
They are equivalent when executed, but the latter is much easier to transform into other query languages because there will only be a few ways of building up queries: basically, all queries can be normalised into one of the following 5 query structures
{"$and": [term, term, term]}
{"$or": [term, term, term]}
{"field": "value"}
{"field": {"$operator": "value"}}
{"field": {"$not": {"$operator": "value"}}}
So for instance the query
{"foo": "bar", "baz": {"$not": {"$gt": 42}}}
would be stored internally as
['$and' => [
    ['foo' => 'bar'],
    ['baz' => ['$not' => ['$gt' => 42]]],
]]
Doing recursive descents over such a simple data structure makes it a lot easier to translate it into e.g. ElasticSearch queries.
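The normalisation step itself is small. A hypothetical sketch (function name and exact behaviour are illustrative, not the plugin's actual implementation): a top-level query with multiple field/value pairs is rewritten into an explicit $and of single-field terms, and existing $and/$or branches are normalised recursively.

```php
<?php
// Hypothetical sketch of the option-2 normalisation.
function normalizeQuery(array $query) {
    // Already an explicit conjunction/disjunction: normalise each branch
    if (isset($query['$and']) || isset($query['$or'])) {
        $op = isset($query['$and']) ? '$and' : '$or';
        return [$op => array_map('normalizeQuery', $query[$op])];
    }

    // Split {"a": ..., "b": ...} into {"$and": [{"a": ...}, {"b": ...}]}
    if (count($query) > 1) {
        $terms = [];
        foreach ($query as $field => $value) {
            $terms[] = [$field => $value];
        }
        return ['$and' => $terms];
    }

    // Already one of the single-term structures
    return $query;
}
```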
Option 3 Store normalized Mongo queries as instances of AST-classes
This is basically doing the normalisation from option 2, but instead of storing it as associative arrays, it would be stored as instances of specific classes, like \Imbo\MetadataSearch\Dsl\Ast\And
So the query
{"foo": "bar", "baz": {"$not": {"$gt": 42}}}
would internally be stored as
new Dsl\Ast\And([
new Dsl\Ast\Field('foo', new Dsl\Ast\Comparison\Equal('bar')),
new Dsl\Ast\NegatedField('baz', new Dsl\Ast\Comparison\GreaterThan(42))
])
This structure makes it even easier / more readable to do recursive descents over the query-DSL's AST. You could do something like the following (I admit this looks a bit silly, but you know - without pattern matching, there is only so much you can do):
function transformToEs(Dsl\Ast $query) {
    switch (TRUE) {
        case $query instanceof Dsl\Ast\And:
            // assuming And/Or expose their sub-terms as $query->terms
            return '(' . implode(' AND ', array_map('transformToEs', $query->terms)) . ')';
        case $query instanceof Dsl\Ast\Or:
            return '(' . implode(' OR ', array_map('transformToEs', $query->terms)) . ')';
        case $query instanceof Dsl\Ast\Field:
            return $query->field . ':' . transformComparisonToEs($query->value);
        case $query instanceof Dsl\Ast\NegatedField:
            return 'NOT ' . $query->field . ':' . transformComparisonToEs($query->comparison);
    }
}
function transformComparisonToEs(Dsl\Ast\Comparison $query) {
    switch (TRUE) {
        case $query instanceof Dsl\Ast\Comparison\Equal:
            return $query->value;
        case $query instanceof Dsl\Ast\Comparison\GreaterThan:
            return '>' . $query->value;
        // and so forth, for >=, <= and <
    }
}
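For completeness, here is a hypothetical sketch of the small class definitions the example above assumes. Note that and/or are reserved words in PHP, so the conjunction classes would need slightly different names in practice (AndConjunction/OrConjunction are used here); Field is assumed to keep its comparison in a public $value property, and the conjunctions their sub-terms in $terms.

```php
<?php
namespace Imbo\MetadataSearch\Dsl\Ast;

// "And"/"Or" are reserved words in PHP, so the real classes would need
// different names - hence AndConjunction/OrConjunction in this sketch.
class AndConjunction {
    public $terms;
    public function __construct(array $terms) { $this->terms = $terms; }
}

class OrConjunction {
    public $terms;
    public function __construct(array $terms) { $this->terms = $terms; }
}

class Field {
    public $field;
    public $value; // a Comparison instance
    public function __construct($field, $value) {
        $this->field = $field;
        $this->value = $value;
    }
}

class NegatedField {
    public $field;
    public $comparison; // a Comparison instance
    public function __construct($field, $comparison) {
        $this->field = $field;
        $this->comparison = $comparison;
    }
}

namespace Imbo\MetadataSearch\Dsl\Ast\Comparison;

class Equal {
    public $value;
    public function __construct($value) { $this->value = $value; }
}

class GreaterThan {
    public $value;
    public function __construct($value) { $this->value = $value; }
}
```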
Personally, I would want to go with either option 2 or 3. By going with option 1, we're going to make it harder than necessary to write transformations for multiple search backends. Doing the normalisation will also allow us to reject more malformed queries...
The difference between 2 and 3 is basically just that option 3 adds a more rigidly enforced structure on the internal representation (AST). It can also make transformation functions easier to read, because we can have potentially more descriptive class names than the text strings that Mongo uses for operators. But this structure does come with the "overhead" of requiring quite a few class definitions, all rather small classes that need to contain 1-2 values.
So what are people's opinions on how the query DSL should be represented internally in this plugin?
-Morten.
The lowercasing of tokens in the AST parser was introduced as part of the Imbo DSL -> Mongo transformation initially written by @christeredvartsen, but is really more of a backend implementation detail. Some of the backends might need lowercasing, but the generic parser shouldn't modify the data.
So for now, the lowercasing can be removed, as the ES backend doesn't care about casing.
In Imbo 2.1.2 the Imbo\Model\ModelInterface interface added a method called getData that is currently missing from this package.
The metadata search should support searching other publickeys by specifying a list of publickeys to search (given that the user searching has access).
This however depends on the introduction of access levels in Imbo; imbo/imbo#319.
GET /users/<pubkey>/search should be changed to SEARCH /users/<pubkey>/images, and SEARCH /images should be used for global image search.
[Wed Dec 09 09:02:01 2015] [error] [client 10.84.100.151] PHP Fatal error: Uncaught Exception with message: {"error":"MapperParsingException[failed to parse [metadata.date]]; nested: MapperParsingException[failed to parse date field [false], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "false"]; ","status":400} in /services/applications/imbo/vendor/imbo/imbo/public/index.php on line 25
The part of the ElasticSearch Search-DSL that we are using in the E.S.-backend was deprecated in 2.0.0-beta, and doesn't work in 5.2 (I haven't checked 5.1, and I think we run 5.0 in production where it works).
So we should update the ElasticSearch DSL-transformation to use a syntax that works from 2.0 and upwards.
We currently use the filtered syntax, and we should probably move over to the bool syntax - I've made a few tests that indicate that the bool syntax seems to support the types of queries that we support.
But what are people's opinions on this? Should we simply change the transformation, which means that pre-2.0 will stop working (and thus we should bump the semver-major version of this package), or should we have two different E.S. backends depending on which version you run?
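To make the change concrete, here is a hedged sketch of what the same single-term filter looks like in the two syntaxes (the field name is just an example, not something the plugin necessarily emits). The deprecated filtered syntax:

```json
{"query": {"filtered": {"filter": {"and": [{"term": {"metadata.animal": "red panda"}}]}}}}
```

And the equivalent bool syntax, which works from 2.0 onwards:

```json
{"query": {"bool": {"filter": [{"term": {"metadata.animal": "red panda"}}]}}}
```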
We need to describe how the Imbo DSL looks and should be used. This has been done in bits and pieces in issues and implementation, but we need to do a more "final" write-up of the details now that we've done the actual implementation.
The order of images returned is not consistent. This should obviously be resolved somehow in order for the searching to behave in a predictable manner.
My observation is that this is not caused by ordering in ES, but by the order of the imageIdentifiers returned from the search backend being shuffled when Imbo fetches from the DB and generates the final response.
I guess there are (at least) two approaches here; ids params array.

Right now we use multiple indices for storing image metadata. For the sake of keeping the elasticsearch indices list tidy and making querying easier, we should store it all in one big index.
I'm currently looking into some improvements that we want to do on our metadata search, so I was examining the raw data that gets stored in ElasticSearch, and I noticed something strange. All of the documents include the following
...,
"mime": "...",
"query": {
"filtered": {
"filter": {
"and": []
}
}
},
"size": ...,
...
That seemed a little suspicious, and like an artefact of some search query. So I looked around a bit, and figured out it's because the same function, prepareParams, is called from both set and search, and by default it populates a default search query for the search functionality so that the later functions can always just extend that filter.
if (isset($params['body']['query']['filtered']['filter'][0])) {
    ...
} else {
    $params['body']['query']['filtered']['filter'] = ['and' => []];
}
But this has the side-effect of also injecting the default search query into all the documents that are stored, which is probably not ideal.
I can take a look at fixing this, if you want - I would probably just move the query part of prepareParams into search...
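A hedged sketch of what that fix could look like. The method names prepareParams/set/search come from the discussion above, but the bodies here are illustrative only: prepareParams builds just the shared boilerplate, and the default filter skeleton is bootstrapped in the search path alone, so stored documents never pick it up.

```php
<?php
// Hypothetical sketch: prepareParams() no longer injects a default query,
// so documents written via set() stay clean.
function prepareParams($index, $type, $id = null) {
    $params = ['index' => $index, 'type' => $type];
    if ($id !== null) {
        $params['id'] = $id;
    }
    return $params; // note: no 'query' key here any more
}

function searchParams($index, $type) {
    $params = prepareParams($index, $type);
    // The default query skeleton now lives with the search path only
    $params['body']['query']['filtered']['filter'] = ['and' => []];
    return $params;
}
```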
By default strings are tokenized in elasticsearch, so right now a search for
{ "animal":"red panda" }
would result in hits for all of the following: red fox, red panda and giant panda.
What do you think is the best way to ensure this works in a predictable way? Is just telling people to configure their indices correctly enough?
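One possible answer is to ship (or at least document) an explicit mapping that marks metadata strings as not analyzed, so they only match as whole values. A hedged sketch for ES 1.x/2.x (in 5.x+ the equivalent is the keyword type); the type and field names here are just examples:

```json
{
  "mappings": {
    "image": {
      "properties": {
        "metadata": {
          "properties": {
            "animal": {"type": "string", "index": "not_analyzed"}
          }
        }
      }
    }
  }
}
```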
These are the result modifiers currently supported by the images resource;
metadata
Whether or not to include metadata in the output. Defaults to 0, set to 1 to enable.
from
Fetch images starting from this Unix timestamp.
to
Fetch images up until this timestamp.
fields[]
An array with fields to display. When not specified all fields will be displayed.
sort[]
An array with fields to sort by. The direction of the sort is specified by appending asc or desc to the field, delimited by :. If no direction is specified asc will be used. Example: ?sort[]=size&sort[]=width:desc is the same as ?sort[]=size:asc&sort[]=width:desc. If no sort is specified Imbo will sort by the date the image was added, in a descending fashion.
ids[]
An array of image identifiers to filter the results by.
checksums[]
An array of image checksums to filter the results by.
originalChecksums[]
An array of the original image checksums to filter the results by.
metadata and fields are supported by the current metadata search implementation as a consequence of it utilizing the db.images.load event handler for fetching from the backend.
What I want input on is the other modifiers. ids, checksums and originalChecksums I don't really see the need for in the context of the metadata search.
sort is the param I'm thinking we probably want. In order to support sorting like we currently do on the images endpoint, we'll need to store all the image data instead of only the metadata. We would end up with a structure looking like the one used by Imbo in MongoDB for images. The data indexed by the search backend would look like this;
{
"publicKey" : "publickey",
"imageIdentifier": "92aa7029b22263ea0b64ba12b4cbf760",
"size": 1001337,
"height": 1337,
"width": 1337,
"added": 1337001337,
"metadata": {
"animal": "Red Panda"
}
}
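With the full image data indexed like this, the sort[] modifier could translate more or less directly into an ElasticSearch sort clause. A sketch of what ?sort[]=size&sort[]=width:desc might become (the exact transformation is an assumption, not implemented yet):

```json
{"sort": [{"size": {"order": "asc"}}, {"width": {"order": "desc"}}]}
```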
I think this is worth the effort, but if people think it's a waste of time I won't care. It would involve;