
luqum's People

Contributors

ahivert, alexgarel, cclauss, connec, delkopiso, dependabot[bot], emptyflash, erlichmen, jscu-cni, linefeedse, mdziwny, mmoriniere, peter92, psanford, qcoumes, raphigaziano, renaud, roxanebellot, yolocodefrench


luqum's Issues

Support for date arithmetics in queries

It would be great if words like now, now-5h, now-1h could be used in dates, without being treated as simple strings.

Parsing a date range like [now-1h TO now] would add more meaning to luqum. These human-readable queries are supported in Elasticsearch; see the "Common options" page of the Elasticsearch documentation.

P.S. When I checked, this feature was not implemented. Kindly reply if it is already there.
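
For reference, the range itself already parses; it's just that now-1h and now come back as plain Word nodes with no date-math semantics. A quick illustration (hedged: the exact repr may differ by version):

from luqum.parser import parser

# "now-1h" and "now" are parsed as plain words inside the range,
# with no date-math awareness.
tree = parser.parse("date:[now-1h TO now]")
print(repr(tree))
# expected something like: SearchField('date', Range(Word('now-1h'), Word('now')))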

Stop using print in code

print should not be used in production code, as below; it is unprofessional.

luqum/luqum/parser.py, lines 181 to 184 at 55c9cdc:

# Error handling rule FIXME
def t_error(t):  # pragma: no cover
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

An exception should be raised instead.

There isn't even a good reason to silently skip certain characters, e.g. the forward slash. Why can't it be processed? It should be a valid character.
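
A minimal sketch of what I mean (names and message are illustrative, not luqum's actual API):

# Raise instead of printing and silently skipping the character.
def t_error(t):  # pragma: no cover
    raise ValueError(
        "Illegal character %r at position %d" % (t.value[0], t.lexpos))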

match doesn't return results if default_field is '*'

Hello,
I'm seeing some strange behavior with full text search.

The parser converts a search for test to something like:

{"query": {"match": {"*": {"query": "test", "zero_terms_query": "none"}}}}

That does not return any results.

Adding a trailing '*', the query is instead converted to this one:

{"query": {"query_string": {"query": "test*", "default_field": "*", "analyze_wildcard": true, "allow_leading_wildcard": true}}}

that correctly returns proper results.

Is there a way to get the latter behavior for a generic search without a trailing *?
For example, converting to query_string instead of match when the field (or default_field) is *.
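
In the meantime I'm considering rewriting the generated dict after the fact. A hedged sketch, assuming the match output shape shown above:

def match_star_to_query_string(query):
    # Rewrite a match on field "*" into a query_string query;
    # apply this to the inner dict under the top-level "query" key.
    match = query.get("match", {})
    if "*" in match:
        return {"query_string": {"query": match["*"]["query"],
                                 "default_field": "*",
                                 "analyze_wildcard": True,
                                 "allow_leading_wildcard": True}}
    return query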

Add an integration test

It would be great to make a sample database, load it into Elastic, and then try the different features against it.

This could help verify which versions of ES we are compatible with.

Bug in base operation

Hey guys!

I'd like to know: is there any reason spaces were removed from the string representation?

value = ("%s" % self.op).join(o.__str__(head_tail=True) for o in self.operands)

Here's the code that I'm running and where I hit the issue:

from luqum.tree import SearchField, AndOperation, Word
a = AndOperation(SearchField('po_cancelled', Word('false')), SearchField('deleted', Word('false')))

Previously I had

str(a) == 'po_cancelled:false AND deleted:false'

After the 0.10.0 release I have

str(a) == 'po_cancelled:falseANDdeleted:false'

which breaks further code execution since the syntax is wrong.

Could you please explain another approach to building the query so my code runs as before, or add the spaces back?
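
If I understand the 0.10 head/tail changes correctly, whitespace now lives on each node and there is a helper for exactly this case. A hedged sketch (assuming luqum.auto_head_tail.auto_head_tail fills in default whitespace on hand-built trees):

from luqum.auto_head_tail import auto_head_tail
from luqum.tree import SearchField, AndOperation, Word

a = AndOperation(SearchField('po_cancelled', Word('false')),
                 SearchField('deleted', Word('false')))
a = auto_head_tail(a)  # adds the missing whitespace around AND
print(str(a))          # should print: po_cancelled:false AND deleted:false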

Allowing double quotes

Thank you for the wonderful library.

I have queries that have some field expressions with double quotes. That is, something like the following:

field_name:""expression text""

When these are parsed, they confuse the parser: the initial pair of double quotes is read as an empty Phrase, and the rest of the expression falls into an unknown operation.

Here is a sample query and the parsing operation in python:

from luqum.parser import parser
query = 'field_name:""Field Text"" OR field_name:text AND field_name:"more text"'
parser.parse(query)

Here is the current output:

UnknownOperation(SearchField('field_name', Phrase('""')), Word('Field'), OrOperation(Word('Text""'), AndOperation(SearchField('field_name', Word('text')), SearchField('field_name', Phrase('"more text"')))))

This is what the expected output of the parsing operation would look like:

OrOperation(SearchField('field_name', Phrase('""Field Text""')), AndOperation(SearchField('field_name', Word('text')), SearchField('field_name', Phrase('"more text"'))))

Any thoughts on this? The doubled quotes (as opposed to single quotes) do have a distinct meaning in this case, hence why I am asking.
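
For what it's worth, escaping the inner quotes keeps the whole expression a single Phrase, if the parser honours Lucene's backslash escapes (a hedged workaround, not an answer to the doubled-quote question itself):

from luqum.parser import parser

# Backslash-escape the inner quotes so the phrase stays in one piece.
query = r'field_name:"\"Field Text\"" OR field_name:text AND field_name:"more text"'
print(repr(parser.parse(query)))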

Parser fails with 'TypeError' on invalid query "~]"

There's a query made of a pair of symbols which breaks the error handling of the parser, so we can't get relevant information about the syntax error.

How to reproduce:

from luqum.parser import parser

parser.parse('~]')

Expected result: ParseSyntaxError
Actual result: TypeError: __str__ returned non-string (type NoneType)

Possible fix:

Update the TokenValue.__str__ method to always return a string, i.e.

class TokenValue:
    # ...
    def __str__(self):
        return str(self.value)

Setting OR operation to only work on adjacent values without needing brackets

Short version: I would like to parse a b OR c d as a AND (b OR c) AND d.

Long version: I'm attempting to simplify the syntax a little before I integrate this into my code, as it's supposed to be an easy-to-use search with optional advanced features. I gave up coding it myself and found this library, which seems nice.

I'd like it to work like Google, where terms are AND-ed by default, but providing "OR" compares the two closest values. I've set UnknownOperationResolver to AND and tried changing the order of parser.precedence, but had no luck.

Here's an example of a query I'd like to use:

Working syntax: (either page 1 or page 2, has either "a" or both "b" and "c", title is not sometitle)

(page:page1 OR page:page2) AND (a OR b AND c) AND -title:sometitle

Wanted syntax:

page:page1 OR page:page2 a OR (b c) -title:sometitle

For the record, we're still stuck on Python 2.7 for another year or two, so as I've already had to fix the yield from lines, I'm not against tweaking other bits of the code if needed.
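
To make the ask concrete, here is the kind of post-parse rewrite I have in mind (a hedged sketch; it assumes the default parse groups a b OR c d as OrOperation(UnknownOperation(a, b), UnknownOperation(c, d)) and handles only that top-level shape):

from luqum.parser import parser
from luqum.tree import AndOperation, OrOperation, UnknownOperation

def rebind_or(node):
    # Bind OR to its two nearest neighbours only; AND everything else.
    if isinstance(node, OrOperation) and len(node.children) == 2:
        left, right = node.children
        left_ops = list(left.children) if isinstance(left, UnknownOperation) else [left]
        right_ops = list(right.children) if isinstance(right, UnknownOperation) else [right]
        middle = OrOperation(left_ops.pop(), right_ops.pop(0))
        operands = left_ops + [middle] + right_ops
        return AndOperation(*operands) if len(operands) > 1 else operands[0]
    return node

print(repr(rebind_or(parser.parse("a b OR c d"))))
# hoped-for shape: AndOperation(Word('a'), OrOperation(Word('b'), Word('c')), Word('d'))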

Parser fails with 'TypeError' on invalid query "a^"

When parsing an invalid query like a^, the parser fails with TypeError: conversion from NoneType to Decimal is not supported

Traceback (most recent call last):
  File "scratches/scratch1.py", line 12, in 
    qb = parser.parse('a^')
  File "lib/python3.8/site-packages/ply/yacc.py", line 333, in parse
    return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
  File "lib/python3.8/site-packages/ply/yacc.py", line 1120, in parseopt_notrack
    p.callable(pslice)
  File "lib/python3.8/site-packages/luqum/parser.py", line 316, in p_boosting
    p[0] = Boost(p[1], p[2].value)
  File "lib/python3.8/site-packages/luqum/tree.py", line 374, in __init__
    self.force = Decimal(force).normalize()
TypeError: conversion from NoneType to Decimal is not supported

Expected result: luqum.exceptions.ParseSyntaxError: Syntax error in input ...
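
A guard along these lines in p_boosting might be all that's needed (a hedged sketch against the code visible in the traceback; the grammar docstring and token name are assumptions, not the shipped rule):

from luqum.exceptions import ParseSyntaxError
from luqum.tree import Boost

def p_boosting(p):
    '''boosting : expression BOOST'''  # rule text assumed
    if p[2].value is None:
        raise ParseSyntaxError("missing boost value after '^'")
    p[0] = Boost(p[1], p[2].value)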

IPV6 parsing failure

The latest ES6 supports IPv6. When I tried the following query, luqum was unable to parse it properly.

srcIp: 1::1


Any suggestion?
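
One workaround that at least parses (hedged, untested against ES): quote the address so the colons sit inside a Phrase.

from luqum.parser import parser

print(repr(parser.parse('srcIp:"1::1"')))
# expected: SearchField('srcIp', Phrase('"1::1"'))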

zero_terms_query appended to match_phrase query

There is a regression since 0.7.x where zero_terms_query: none is being appended to the generated query for match_phrase.

For query some_id: hello-world

0.6.1 generated {'match_phrase': {'some_id': {'query': 'hello-world'}}}
0.7.1 generated {'match_phrase': {'participant_id': {'query': 'LB-S00133', 'zero_terms_query': 'none'}}}

Note that the 0.7.1 query fails on elasticsearch 5.6.8 with: TransportError(400, 'parsing_exception', '[match_phrase] query does not support [zero_terms_query]')

Wrong .tar.gz pypi package

Hi Alex,

I just want to let you know that the .tar.gz file on PyPI has an incorrect structure (the wheel is OK, though).

It looks like:

root@cis-hub:d327478a510f3# tar tzf luqum-0.10.0.linux-x86_64.tar.gz 
./
./home/
./home/alex/
./home/alex/projets/
./home/alex/projets/luqum/
./home/alex/projets/luqum/venv/
./home/alex/projets/luqum/venv/lib/
./home/alex/projets/luqum/venv/lib/python3.8/
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__init__.py
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/__init__.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/auto_head_tail.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/check.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/deprecated_utils.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/exceptions.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/head_tail.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/naming.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/parser.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/parsetab.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/pretty.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/tests.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/tree.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/utils.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/__pycache__/visitor.cpython-38.pyc
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/auto_head_tail.py
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/check.py
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/deprecated_utils.py
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/elasticsearch/
./home/alex/projets/luqum/venv/lib/python3.8/site-packages/luqum/elasticsearch/__init__.py

... etc
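
For what it's worth, the .linux-x86_64 suffix in the filename suggests a bdist_dumb archive was uploaded in place of a proper sdist; rebuilding with python setup.py sdist should produce a correctly structured tarball.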

Thanks for your work,

Mirek

Allow to override `E` elements

I need to customize the behavior of some elements inside ElasticsearchQueryBuilder. Sadly, it is not currently possible to just swap these elements without overriding every method that uses them.

Could you consider making these elements attributes of the class so that we can override them more easily, i.e. something like:

class ElasticsearchQueryBuilder(LuceneTreeVisitorV2):

    E_MUST = EMust
    E_SHOULD = EShould
    [...]

I can make a PR if you agree with the idea.

Parse issue on field_42:42

When calling parser.parse on "field_42:42", the parser does not break it into a search field and a term;
it parses the entire string as a single word.

The root cause of this is the TERM regex, a (?P<term>…) named group built around a class excluding characters such as whitespace, :, ^, ,, quotes, +, ~, -, and the (), [], {} and * characters, which can't break the string into groups.
which can't break the string into groups.

Querying python dictionaries

Hi,

Does anyone know of a module for applying a Lucene-DSL query to a Python dictionary?
Something like an in-memory Elastic implementation in Python.

Thanks.

SearchField.name with spaces

Luqum is mostly working perfectly for me, but I've just hit a bit of a snag. I have a double SearchField to perform a more advanced query, but I've just realised it won't work with spaces.

>>> field:value1:"value 2"
SearchField('field', SearchField('value1', Phrase('"value 2"')))

>>> field:"value 1":"value 2"`
luqum.parser.ParseError: Syntax error in input at LexToken(COLUMN,':',1,24)!

Is this a limitation of yacc or is there a way I could get this working?

In the meantime, I've used this to convert "value 1" to value 1 (swapping the space for a placeholder character); it's awfully messy though.

# This will convert 'field:"value 1":"value 2"' to 'field:value 1:"value 2"'
# It will need to be decoded again before being used
offset = 0
while True:
    try:
        index = value[offset:].index(':"') + 1
    except ValueError:
        break
    offset += index
    end = False
    for i, c in enumerate(value[offset:]):
        if not end:
            if i and c == '"':
                end = True
        elif c == ' ':
            break
        elif c == ':':
            word = value[offset:offset+i]
            new_word = word[1:-1].replace(' ', '\xa0')  # placeholder char; assumption, the original literal was lost in rendering
            value = value[:offset] + new_word + value[offset+i:]
            offset += len(new_word) - len(word)
            break

Keyword fields containing wildcards cannot be searched for exactly

Thank you very much, you have created a really amazing library. 👍🏻

I have come across a special case. I have keyword fields that contain wildcard characters (* or ?). In Elasticsearch this is no problem at all. But it seems luqum has some difficulties with this use case.

Here is an example of indexing a document with a keyword field containing wildcard characters using ES.

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="http://localhost:9200")
mappings = {"properties":{"vendor":{"type":"keyword"}}}
es.indices.create(index="test", mappings=mappings)
es.index(index="test", body={"vendor": "f**k"}, id="example")

Now I want to search for the field. The following works, but is not what I want, because it does a wildcard search and not an exact term search.

es.search(body={
    "query": {
        "query_string": {
            "query": "vendor:f**k"
        }
    }
}, index="test")
{'took': 2,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'test',
    '_id': 'example',
    '_score': 1.0,
    '_source': {'vendor': 'f**k'}}]}}

(1) To search exact you have to escape the wildcard characters. This works in ES.

es.search(body={
    "query": {
        "query_string": {
            "query": "vendor:f\*\*k"
        }
    }
}, index="test")
{'took': 1,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 0.2876821,
  'hits': [{'_index': 'test',
    '_id': 'example',
    '_score': 0.2876821,
    '_source': {'vendor': 'f**k'}}]}}

(2) Alternatively you can also use a phrase query. This works in ES.

es.search(body={
    "query": {
        "query_string": {
            "query": 'vendor:"f\*\*k"'
        }
    }
}, index="test")
{'took': 1,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 0.2876821,
  'hits': [{'_index': 'test',
    '_id': 'example',
    '_score': 0.2876821,
    '_source': {'vendor': 'f**k'}}]}}

Now when I try both (1) and (2) with luqum, it doesn't seem to work.

from luqum.elasticsearch import SchemaAnalyzer, ElasticsearchQueryBuilder
schema_analizer = SchemaAnalyzer({"mappings": mappings})
es_builder = ElasticsearchQueryBuilder(**schema_analizer.query_builder_options())

(1) Luqum creates a wildcard query when the "*" characters are escaped. This behaviour is different from ES and not what I expected. Apparently the escape characters are not removed either.

from luqum.parser import parser
es_builder(parser.parse("vendor:f\*\*k"))
 {'wildcard': {'vendor': {'value': 'f\\*\\*k'}}}

(2) Luqum creates a wildcard query when the search term is entered as a phrase. This behaviour is also different from ES and not what I expected.

from luqum.parser import parser
es_builder(parser.parse('vendor:"f**k"'))
{'wildcard': {'vendor': {'value': 'f**k'}}}

Somehow I don't see any way to formulate a query string such that a term with "*" can be searched for exactly.

Regards, André

Question on behavior of SchemaAnalyzer

I have a double-nested schema given here:

https://gist.github.com/seandavi/528e98e943b24a7ef365fbf1e937f5ba

It seems that the top-level path is dropped by some methods of the SchemaAnalyzer when I use this schema.

import json
m2 = json.load(open('MAPPING_FILE.json'))
m3 = luqum.elasticsearch.SchemaAnalyzer({"mappings" : m2['sra_experiment_joined2']['mappings']['doc']['properties']})

And the output of methods:

 m3.nested_fields()
Out[900]: 
{'attributes': {'tag': {}, 'value': {}},
 'identifiers': {'id': {}, 'namespace': {}, 'uuid': {}},
 'reads': {'base_coord': {},
  'read_class': {},
  'read_index': {},
  'read_type': {}},
 'xrefs': {'db': {}, 'id': {}}}

In this schema, then, the problem arises from the fact that the nested field names (without the parent) are repeated. This results in too short a list, and the paths are not complete.

And sub_fields:

list(m3.sub_fields())
Out[908]: 
['tag.keyword',
 'value.keyword',
 'id.keyword',
 'namespace.keyword',
 'uuid.keyword',
 'Status.keyword',
 'accession.keyword',
 'alias.keyword',
 'attributes.tag.keyword',
 'attributes.value.keyword',
 'broker_name.keyword',
 'center_name.keyword',
 'experiment_accession.keyword',
 'identifiers.id.keyword',
 'identifiers.namespace.keyword',
 'identifiers.uuid.keyword',
 'reads.read_class.keyword',
 'reads.read_type.keyword',
 'run_accession.keyword',
 'run_center.keyword',
 'BioSample.keyword',
 'GEO.keyword',
 'Status.keyword',
 'accession.keyword',
 'alias.keyword',
 'attributes.tag.keyword',
 'attributes.value.keyword',
 'broker_name.keyword',
 'center_name.keyword',
 'description.keyword',
 'identifiers.id.keyword',
 'identifiers.namespace.keyword',
 'numeric_properties.property_id.keyword',
 'numeric_properties.unit_id.keyword',
 'ontology_terms.keyword',
 'organism.keyword',
 'sample_type.keyword',
 'title.keyword',
 'xrefs.db.keyword',
 'xrefs.id.keyword',
 'BioProject.keyword',
 'GEO.keyword',
 'Status.keyword',
 'abstract.keyword',
 'accession.keyword',
 'alias.keyword',
 'attributes.tag.keyword',
 'attributes.value.keyword',
 'broker_name.keyword',
 'center_name.keyword',
 'description.keyword',
 'identifiers.id.keyword',
 'identifiers.namespace.keyword',
 'study_accession.keyword',
 'study_type.keyword',
 'title.keyword',
 'xrefs.db.keyword',
 'xrefs.id.keyword',
 'db.keyword',
 'id.keyword']

Again, note that the field names are missing the parent in the name.

It is quite possible that I am misusing the SchemaAnalyzer, so any thoughts you have are appreciated.

Remove the SEPARATOR token?

I don't know much about yacc, but I've been putting up with this warning each time I use luqum (I presume you already know about it):

WARNING: Token 'SEPARATOR' defined, but not used
WARNING: There is 1 unused token

According to this Stack Overflow question, someone with a similar issue just removed the token from the list and it was fine.

Doing that fixed the warning for me, so I'm wondering if there's a particular reason you are keeping the SEPARATOR token? If it's genuinely not used, shouldn't it be removed from the source code?

Support implicit OR operation

The following should be a valid lucene syntax:

name:bob city:nyc

it should be equivalent to:

name:bob OR city:nyc

But I get an UnknownOperation instead of an OrOperation when parsing the string.
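
For the record, resolving the unknown operation after parsing gets close to this (a sketch using the documented resolver; I'm assuming resolve_to accepts an operation class):

from luqum.parser import parser
from luqum.tree import OrOperation
from luqum.utils import UnknownOperationResolver

resolver = UnknownOperationResolver(resolve_to=OrOperation)
tree = resolver(parser.parse('name:bob city:nyc'))
print(repr(tree))
# expected: OrOperation(SearchField('name', Word('bob')), SearchField('city', Word('nyc')))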

Question: Is it possible to allow `minimum_should_match` on all OR operations?

Is it possible to force a minimum_should_match value on bool clauses whenever an OR operation is present?

Currently I have something like:

search_content = "(a OR b) AND (b OR c)"
tree = parser.parse(search_content)
query = ES_BUILDER(tree)

And the query yields:

{
  "bool": {
    "must": [
      {
        "bool": {
          "should": [
            {
              "match": {
                "content": {
                  "query": "a",
                  "zero_terms_query": "none"
                }
              }
            },
            {
              "match": {
                "content": {
                  "query": "b",
                  "zero_terms_query": "none"
                }
              }
            }
          ]
        }
      },
      {
        "bool": {
          "should": [
            {
              "match": {
                "content": {
                  "query": "b",
                  "zero_terms_query": "none"
                }
              }
            },
            {
              "match": {
                "content": {
                  "query": "c",
                  "zero_terms_query": "none"
                }
              }
            }
          ]
        }
      }
    ]
  }
}

Whereas I'd like it to be:

{
  "bool": {
    "must": [
      {
        "bool": {
          "minimum_should_match": 1,
          "should": [
            {
              "match": {
                "content": {
                  "query": "a",
                  "zero_terms_query": "none"
                }
              }
            },
            {
              "match": {
                "content": {
                  "query": "b",
                  "zero_terms_query": "none"
                }
              }
            }
          ]
        }
      },
      {
        "bool": {
          "minimum_should_match": 1,
          "should": [
            {
              "match": {
                "content": {
                  "query": "b",
                  "zero_terms_query": "none"
                }
              }
            },
            {
              "match": {
                "content": {
                  "query": "c",
                  "zero_terms_query": "none"
                }
              }
            }
          ]
        }
      }
    ]
  }
}

But perhaps I'm doing something wrong?
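
In case there is no builder option for this, a post-processing workaround seems possible (a hedged sketch; ES_BUILDER, parser and search_content are the names from the snippet above):

def add_minimum_should_match(query, value=1):
    # Recursively add minimum_should_match to every bool clause
    # that carries a "should" list.
    if isinstance(query, dict):
        bool_clause = query.get("bool")
        if isinstance(bool_clause, dict) and "should" in bool_clause:
            bool_clause.setdefault("minimum_should_match", value)
        for child in query.values():
            add_minimum_should_match(child, value)
    elif isinstance(query, list):
        for child in query:
            add_minimum_should_match(child, value)
    return query

query = add_minimum_should_match(ES_BUILDER(parser.parse(search_content)))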

Is it possible to manipulate a parsed tree to add elements?

I see in the documentation that we can manipulate the parsed tree to change the value of a field or expression. But is it possible to manipulate the tree to also append an extra element?

Example, manipulate:

dog: "Max" AND color: "brown"

into:

(name:"Max" AND animal:"dog") AND color: "brown"

I can already convert "dog" into "name" by using the LuceneTreeTransformer example in the documentation, but what about adding new nodes? Is it possible? If so, can anyone share a simple example?

Thanks
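
To make the question concrete, here is roughly what I am after, sketched against the LuceneTreeTransformer from the docs (hedged: the import path and method names follow the older visitor API as I understand it):

from luqum.parser import parser
from luqum.tree import AndOperation, Group, Phrase, SearchField
from luqum.utils import LuceneTreeTransformer

class DogToNameAndAnimal(LuceneTreeTransformer):
    def visit_search_field(self, node, parents):
        # Replace dog:"Max" with (name:"Max" AND animal:"dog").
        if node.name == 'dog':
            return Group(AndOperation(
                SearchField('name', node.expr),
                SearchField('animal', Phrase('"dog"'))))
        return node

tree = DogToNameAndAnimal().visit(parser.parse('dog:"Max" AND color:"brown"'))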

missing elasticsearch-dsl dependency

Hi everyone,
In the latest version, 0.9.0, I think there is a problem with the required dependency elasticsearch-dsl. Installing this package with pip and then importing it fails with an error due to the missing dependency.
I think it needs to be added to requirements.txt and setup.py.

Multi-level nesting fields problem

Hello!

First of all I'd like to thank the authors of this amazing library for all the effort - this library really helps when it comes to querying nested fields with ES' Query String Query.

I've found some issue concerning the parser when it comes to parsing multi-level nested fields (nested fields within nested fields).

Here's my definition of ElasticsearchQueryBuilder:

from luqum.elasticsearch import ElasticsearchQueryBuilder

es_builder = ElasticsearchQueryBuilder(nested_fields={
    "study_units": {
        "country": {"id": {}, "name": {}}
    }
})

The query looks like this: study_units.country.name:italy

The produced output looks like this:

{
  'nested': {
    'query': {
      'nested': {
        'query': {
          'match': {
            'study_units.country.name': {
              'type': 'phrase',
              'query': 'italy',
              'zero_terms_query': 'none'
            }
          }
        },
        'path': 'study_units.country'
      }
    },
    'path': 'name'
  }
}

I think the outermost path key should have the value study_units instead of name.

My question is: is this a problem with my definition of nested fields, or is something off here?

ElasticsearchQueryBuilder - support for Objects

Hello!

As mentioned in #15, I'd like to propose a solution for handling elasticsearch-dsl's Object fields (or ES objects in general). During the flattening of nested objects in Elasticsearch, fields of the form:

    country = Object(properties={
        "id": Integer(),
        "name": Text()
    })

are transformed into a flat structure, as described in the Elasticsearch documentation.

Because the following check is present:

def _is_nested(self, node):
    if isinstance(node, SearchField) and '.' in node.name:
        return True

parsed queries of the form country.name:Italy will be transformed into a nested query, which will cause hiccups in Elasticsearch.

Now, country.name seems to be parsed correctly by YACC, as the dot isn't a special char. It's the check in the file mentioned above that causes the problem for me here.

What I have done: I've subclassed ElasticsearchQueryBuilder as follows:

class ElasticsearchDotAwareQueryBuilder(ElasticsearchQueryBuilder):
    def _is_nested(self, node):
        for child in node.children:
            if isinstance(child, SearchField):
                return True
            elif self._is_nested(child):
                return True

        return False

And now I have both : available for nested fields, as well as . for flat objects.

I didn't have enough time to dive deeply, but I think it shouldn't break other functionality. My question is: why was the check for dot presence there in the first place? Is it just an alternative to :, introduced not at the parsing level but later in the query building process, or am I missing something?

Not able to parse certain lucene queries properly

@jurismarches,

We have a Lucene query with the syntax like below:
(state: "Completed" OR "Cancelled") AND (segment: "total" OR "cancelled") AND NOT (comment:"This is a sample")

The above Lucene query expects events with:

  1. a "state" field of "Completed" or "Cancelled",
  2. a "segment" field of "total" or "cancelled",
  3. a comment not equal to "This is a sample".

However, the DSL query formed using the luqum module for the above lucene query is as follows:
{'query': {'bool': {'must': [{'bool': {'should': [{'match_phrase': {'state': {'query': 'Completed'}}}, {'match_phrase': {'text': {'query': 'Cancelled'}}}]}}, {'bool': {'should': [{'match_phrase': {'segment': {'query': 'total'}}}, {'match_phrase': {'text': {'query': 'cancelled'}}}]}}, {'bool': {'must_not': [{'match_phrase': {'comment': {'query': 'This is a sample'}}}]}}]}}}

It can be seen in the above DSL that the 'state' field now only expects 'Completed'; the 'Cancelled' value is instead expected from the 'text' field, which does not exist in our environment. Similar behavior is seen in the 'segment' field parsing as well.

Can you help us resolve this issue as a priority so that we can continue leveraging the luqum module in our application?
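
Part of this looks like query syntax rather than a luqum bug: in Lucene, a field does not distribute over OR, so the bare "Cancelled" falls back to the default field. Grouping the values inside the field should express the intent (a hedged suggestion):

from luqum.parser import parser

tree = parser.parse(
    'state:("Completed" OR "Cancelled") AND segment:("total" OR "cancelled") '
    'AND NOT comment:"This is a sample"'
)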

Parsing error in multithreading

import _thread

from luqum.parser import parser


def run():
    qs1 = '(title:"foo bar" AND body:"quick fox") OR title:fox AND (title:"foo bar" AND body:"quick fox") OR ' \
          'title:fox AND (title:"foo bar" AND body:"quick fox") OR title:fox AND (title:"foo bar" AND body:"quick ' \
          'fox") OR title:fox AND (title:"foo bar" AND body:"quick fox") OR title:fox'
    qs2 = '(title:"foo bar" AND body:"quick fox") OR title:fox AND (title:"foo bar" AND body:"quick fox") OR ' \
          'title:fox AND (title:"foo bar" AND body:"quick fox") OR title:fox AND (title:"foo bar" AND body:"quick ' \
          'fox") OR title:fox AND (title:"foo bar" AND body:"quick fox") OR title:fox'

    parser.parse(qs1)
    parser.parse(qs2)


# The larger the range, the more likely it is
for i in range(100):
    _thread.start_new_thread(run, ())


# The single thread works properly
# for i in range(1000):
#     run()

This raises the error:
luqum.exceptions.ParseSyntaxError: Syntax error in input : unexpected end of expression (maybe due to unmatched parenthesis) at the end!
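
A possible workaround (hedged: it assumes the failure comes from shared mutable state in the PLY-generated parser): serialize calls to parser.parse with a lock, or build one parser per thread.

import threading

from luqum.parser import parser

_parse_lock = threading.Lock()

def safe_parse(query_string):
    # PLY parsers keep internal state; guard against concurrent access.
    with _parse_lock:
        return parser.parse(query_string)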

Visitor example

Is there any example of a tree visitor implementation subclassing the TreeVisitor base class?

I can't find any example in the documentation.

My first use case is a visitor returning a list of the search fields present in a tree.
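
Something like this is what I am after (a hedged sketch; I'm assuming TreeVisitor lives in luqum.visitor, that visit_* methods are generators, and that visit() collects the yielded values):

from luqum.parser import parser
from luqum.visitor import TreeVisitor

class FieldNameLister(TreeVisitor):
    def visit_search_field(self, node, context):
        yield node.name
        # keep walking so nested search fields are found too
        yield from self.generic_visit(node, context)

fields = FieldNameLister().visit(parser.parse('a:1 AND (b:2 OR c:3)'))
print(fields)  # hoping for: ['a', 'b', 'c']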

Question: treating untagged words or phrases as "full text search" across multiple (or all) fields

Luqum is working great for me and my test users, but one thing the test users miss is query_string's behavior of doing a full-text search across all fields when no field is specified (e.g., "London"). I see the ability to specify a default field, but this results in a simple match query. I guess I am looking to convert these into a multi-match across all available text fields? Any suggestions?

Behavior of bare text in ElasticsearchQueryBuilder

I have a pretty naive user community that likes simple plain-text search, but I also want to support power users with nested, field-based queries. When I have a query like:

cancer AND study.title:colon

The query translation after ElasticsearchQueryBuilder results in:

{"bool": {"must": [{"match": {"text": {"query": "cancer", "zero_terms_query": "all"}}}, {"match": {"study.title": {"query": "colon", "zero_terms_query": "all"}}}]}}

I'd like to simulate the behavior of query_string, with cancer matching against all available fields (not just the single field above). Is there a recommended configuration or approach to get the "best of both worlds": nested and object query support while maintaining free text search for bare text? Sorry if I missed this in the docs.
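
One idea, in case there is no built-in option (a hedged sketch; "text" as the default field and the field list are illustrative assumptions): post-process the generated query, replacing a match on the default field with a multi_match.

DEFAULT_FIELD = "text"                      # assumed builder default field
FULL_TEXT_FIELDS = ["text", "study.title"]  # hypothetical field list

def bare_text_to_multi_match(query):
    if isinstance(query, dict):
        match = query.get("match")
        if isinstance(match, dict) and DEFAULT_FIELD in match:
            return {"multi_match": {"query": match[DEFAULT_FIELD]["query"],
                                    "fields": FULL_TEXT_FIELDS}}
        return {key: bare_text_to_multi_match(val) for key, val in query.items()}
    if isinstance(query, list):
        return [bare_text_to_multi_match(item) for item in query]
    return query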

Which Elasticsearch version(s) do you support?

Hello, I am having a problem executing a simple query.

I am running Elasticsearch 6.2.2 locally. From a clean installation, I do:

POST /accounts/person/ 
{
    "name" : "John",
    "lastname" : "Doe",
    "job_description" : "Systems administrator and Linux specialist"
}

I can then run:

GET /accounts/_search?q=name:john

which returns a proper result. I cannot reproduce this result with luqum however. I am trying:

from elasticsearch import Elasticsearch
client = Elasticsearch(host='localhost', port=9200)
client.info()

{'name': 'jQqh6TD',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': 'vMnMGP4XRYC6CAJN7lOzyw',
 'version': {'number': '6.2.2',
  'build_hash': '10b1edd',
  'build_date': '2018-02-16T19:01:30.685723Z',
  'build_snapshot': False,
  'lucene_version': '7.2.1',
  'minimum_wire_compatibility_version': '5.6.0',
  'minimum_index_compatibility_version': '5.0.0'},
 'tagline': 'You Know, for Search'}
schema = client.indices.get_mapping(index='accounts')
schema

{'accounts': {'mappings': {'person': {'properties': {'address': {'properties': {'city': {'type': 'text',
        'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
       'street': {'type': 'text',
        'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}},
     'height': {'type': 'long'},
     'job_description': {'type': 'text',
      'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
     'lastname': {'type': 'text',
      'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
     'name': {'type': 'text',
      'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}}}}
from luqum.parser import parser
from luqum.elasticsearch import ElasticsearchQueryBuilder

es_builder = ElasticsearchQueryBuilder()  # builder construction was omitted from the original report
tree = parser.parse('name:john')
query = es_builder(tree)
query

{'match': {'name': {'query': 'john', 'zero_terms_query': 'none'}}}
response = client.search(
    index='accounts',
    body=query
)

This search returns the following stack trace:

GET http://localhost:9200/accounts/_search [status:400 request:0.002s]
---------------------------------------------------------------------------
RequestError                              Traceback (most recent call last)
<ipython-input-187-e42ebfa4fba0> in <module>()
      1 response = client.search(
      2     index='accounts',
----> 3     body=query
      4 )

/anaconda3/lib/python3.6/site-packages/elasticsearch/client/utils.py in _wrapped(*args, **kwargs)
     74                 if p in kwargs:
     75                     params[p] = kwargs.pop(p)
---> 76             return func(*args, params=params, **kwargs)
     77         return _wrapped
     78     return _wrapper

/anaconda3/lib/python3.6/site-packages/elasticsearch/client/__init__.py in search(self, index, doc_type, body, params)
    653             index = '_all'
    654         return self.transport.perform_request('GET', _make_path(index,
--> 655             doc_type, '_search'), params=params, body=body)
    656 
    657     @query_params('_source', '_source_exclude', '_source_include',

/anaconda3/lib/python3.6/site-packages/elasticsearch/transport.py in perform_request(self, method, url, headers, params, body)
    316                 delay = 2**attempt - 1
    317                 time.sleep(delay)
--> 318                 status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
    319 
    320             except TransportError as e:

/anaconda3/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py in perform_request(self, method, url, params, body, timeout, ignore, headers)
    183         if not (200 <= response.status < 300) and response.status not in ignore:
    184             self.log_request_fail(method, full_url, url, body, duration, response.status, raw_data)
--> 185             self._raise_error(response.status, raw_data)
    186 
    187         self.log_request_success(method, full_url, url, body, response.status,

/anaconda3/lib/python3.6/site-packages/elasticsearch/connection/base.py in _raise_error(self, status_code, raw_data)
    123             logger.warning('Undecodable raw error response from server: %s', err)
    124 
--> 125         raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
    126 
    127 

RequestError: TransportError(400, 'parsing_exception', 'Unknown key for a START_OBJECT in [match].')

Please advise on this. Perhaps I am missing something simple! Thank you very much.
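
(A likely culprit, hedged: client.search expects the clause wrapped under a top-level "query" key, and passing the bare match dict produces exactly this 'Unknown key for a START_OBJECT in [match]' error.)

# Wrapping the built clause under "query" should satisfy the search API:
response = client.search(index='accounts', body={"query": query})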

LICENSE

Hi, we have a problem. When searching for the license of this project, we found "License: GNU Lesser General Public License v3 or later (LGPLv3+)" on https://pypi.org/project/luqum/, but in this repository we see GNU General Public License v3.0.

Please, which one is correct? We would recommend LGPL, however, as GPL is not suitable for most companies.

Python 2

Unfortunately, we are stuck on Py2, and luqum is only compatible with Python 3. The only thing making it incompatible is the yield from usage, which is very easy to convert into supported syntax. Would that be welcomed as a PR?
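
For reference, the mechanical conversion is tiny (a sketch):

# Python 3 only:
#     yield from children
# Python 2 compatible equivalent:
for child in children:
    yield child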

Yacc warnings with ply 3.10

We get these warning messages when we start the Django shell with ply version 3.10:

WARNING: yacc table file version is out of date
WARNING: Couldn't open 'parser.out'. [Errno 13] Permission non accordée: '/usr/local/lib/python3.4/dist-packages/luqum/parser.out'
WARNING: Token 'SEPARATOR' defined, but not used
WARNING: There is 1 unused token
Generating LALR tables
WARNING: 11 shift/reduce conflicts
WARNING: Couldn't create 'luqum.parsetab'. [Errno 13] Permission non accordée: '/usr/local/lib/python3.4/dist-packages/luqum/parsetab.py'

Avoid an endless loop on LuceneTreeTransformer

In the case where the returned node is the initial one plus something (like going from "foo" to "foo OR oof"), the class does not stop visiting the new nodes, and since the first one is not removed it keeps triggering the transformation.

A solution could be to analyze only the initial query and not the new ones that have been added.

Escape & invalid syntax

I was under the impression luqum would be able to catch syntax issues, but is that not the case?

from luqum.parser import parser
from luqum.elasticsearch import ElasticsearchQueryBuilder

test_query = '''http://crazy.c'"om OR a"teste"'''
tree = parser.parse('content: ({})'.format(test_query))
print(str(tree))
es_builder = ElasticsearchQueryBuilder(not_analyzed_fields=["published", "tag"])
query = es_builder(tree)
print(query)

just prints:

content:(http\:\/\/crazy.c'"om OR a"teste")

{'bool': {'should': [{'match': {'content.http\\': {'query': '\\/\\/crazy.c\'"om', 'zero_terms_query': 'none'}}}, {'match': {'content': {'query': 'a"teste"', 'zero_terms_query': 'none'}}}]}}

which is not accepted syntax for ES.

Unknown operation when parsing inequality field groups

The luqum parser yields an UnknownOperation when parsing an inequality with a FieldGroup.

Equality with FieldGroup:

>>> a = "a:(1 OR 2)"
>>> tree = parser.parse(a)
>>> print(repr(tree))
SearchField('a', FieldGroup(OrOperation(Word('1'), Word('2'))))

Inequality with FieldGroup:

>>> a = "a:>(1 OR 2)"
>>> tree = parser.parse(a)
>>> print(repr(tree))
UnknownOperation(SearchField('a', Word('>')), Group(OrOperation(Word('1'), Word('2'))))

Expected result:

SearchField('a', FieldGroup(OrOperation(Word('>1'), Word('>2'))))

Missing CHANGELOG.rst in release

It seems like you forgot to include CHANGELOG.rst in the newest release, which means https://github.com/jurismarches/luqum/blob/master/setup.py#L9 will fail and the package can't be installed.

$ sudo -H pip3 install luqum
Collecting luqum
  Using cached luqum-0.6.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-vlx183/luqum/setup.py", line 9, in <module>
        with open('CHANGELOG.rst', 'r') as f:
    IOError: [Errno 2] No such file or directory: 'CHANGELOG.rst'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-vlx183/luqum/
