o19s / match-query-parser

Search a single field with different query time analyzers in Solr

Home Page: http://opensourceconnections.com/blog/2017/01/23/our-solution-to-solr-multiterm-synonyms/

Java 100.00%

solr query-analysis synonyms match-qp relevance edismax relevant-search

match-query-parser's Introduction

Match Query Parser

Tightly control Solr query parsing and execution by passing the query analyzer at query time. Read more in this blog post.

As an example:

q=sea biscuit likes to fish&bq={!match analyze_as=text_synonym search_with=term qf=body v=$q}

With this query, query processing goes through the following steps:

1. Analyze the query string using the text_synonym field type, perhaps resulting in tokens [seabiscuit][sea] [biscuit] [likes] [to] [fish]
2. Treat the resulting tokens as term queries, with dismax for overlapping positions: (sea biscuit | sea | biscuit) OR likes OR to OR fish
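The request above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical Solr endpoint (`SOLR_SELECT_URL`) and a hypothetical helper name (`build_match_boost_request`); only the `q` and `bq` values come from the example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: adjust host and core name for your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_match_boost_request(user_query: str) -> str:
    """Return a select URL whose bq applies the match parser to the user query."""
    params = {
        "q": user_query,
        # v=$q dereferences the main q parameter inside the local params.
        "bq": "{!match analyze_as=text_synonym search_with=term qf=body v=$q}",
    }
    # urlencode escapes the braces, spaces, and '$' so the URL is safe to send.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_match_boost_request("sea biscuit likes to fish")
# Decode the query string again to confirm the parameters round-trip unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["bq"][0])
```

Encoding the parameters with `urlencode` avoids hand-escaping the local-params braces when the URL is built in code.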

Match QP gives you an extremely high level of control over the search. You control both query analysis and the resulting Lucene queries. For example, if you repeat the above example with a shingle analyzer, you can run a bigram search (like pf2 in edismax):

1. Analyze the query string using text_shingle field type, perhaps resulting in [sea biscuit] [biscuit likes] [likes to] ...
2. Treat the resulting tokens as phrase queries by setting search_with=phrase: ("sea biscuit" OR "biscuit likes" OR "likes to" ...)
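To make the bigram case concrete, here is a small Python simulation of the two steps. It is only an illustration, not the plugin's code: it assumes the shingle analyzer emits bigrams only (no unigrams), and the function names `bigram_shingles` and `phrase_query` are made up for this sketch.

```python
def bigram_shingles(tokens):
    """Return adjacent two-token shingles, e.g. 'sea biscuit', 'biscuit likes'."""
    return [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]


def phrase_query(shingles):
    """Render each shingle as a quoted phrase and OR the phrases together."""
    return "(" + " OR ".join(f'"{shingle}"' for shingle in shingles) + ")"


# Step 1: simulate analysis with a plain whitespace split followed by shingling.
tokens = "sea biscuit likes to fish".split()
shingles = bigram_shingles(tokens)

# Step 2: treat each shingle as a phrase query (search_with=phrase).
query = phrase_query(shingles)
print(query)
# ("sea biscuit" OR "biscuit likes" OR "likes to" OR "to fish")
```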

Or with a synonym analysis that outputs full synonyms as individual tokens, but with search_with=phrase:

[seabiscuit][sea biscuit] [likes] [to] [fish]
("sea biscuit" | seabiscuit) OR likes OR to OR fish

Read more in this tutorial and this blog post.

Download and Install

1. Download plugin for Solr 6.0 | Solr 6.6
2. Place in a suitable location for plugins
3. Add XML to your solrconfig.xml:

<queryParser name="match" class="com.o19s.solr.search.MatchQParserPlugin"></queryParser>

Parameters

qf

A single field to be searched.

analyze_as

Use this field type for analysis. The field type's query-time analyzer is used to analyze the query string. When using match qp, I often create field types for the sole purpose of having different query-time analyzers at my disposal.

If omitted, uses the query analysis of qf.

search_with

Either term (default) or phrase.

term: the tokens output by analysis in step (1) above are turned into term queries
phrase: the tokens output by analysis in step (1) are whitespace tokenized and turned into phrase queries

An important note about position overlaps. In the above example, we pretended that seabiscuit was transformed into just [sea biscuit] when in reality, both tokens [seabiscuit] and [sea biscuit] would be emitted in the same position. In this case, tokens in the same position are wrapped in a DisjunctionMaximum (dismax) query. So the actual query would be, using | to show the dismax operation:

("sea biscuit" | seabiscuit) OR likes OR to OR fish
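The position-overlap rule can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not the plugin's implementation: the `(position, text)` token representation and the `build_query` function are assumptions made for illustration.

```python
from itertools import groupby


def build_query(tokens, search_with="term"):
    """Render (position, text) tokens, sorted by position, as a query string.

    Tokens sharing a position are joined with '|' (dismax); positions are
    joined with OR (the outer boolean query).
    """
    clauses = []
    for _, group in groupby(tokens, key=lambda token: token[0]):
        rendered = []
        for _, text in group:
            # In phrase mode a multi-word token becomes a quoted phrase.
            if search_with == "phrase" and " " in text:
                rendered.append(f'"{text}"')
            else:
                rendered.append(text)
        if len(rendered) == 1:
            clauses.append(rendered[0])
        else:
            clauses.append("(" + " | ".join(rendered) + ")")
    return " OR ".join(clauses)


# [sea biscuit] and [seabiscuit] share position 0, as in the note above.
tokens = [(0, "sea biscuit"), (0, "seabiscuit"), (1, "likes"), (2, "to"), (3, "fish")]
print(build_query(tokens, search_with="phrase"))
# ("sea biscuit" | seabiscuit) OR likes OR to OR fish
```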

mm

Min-should-match expression used to specify the mm of the outer boolean query.

pslop

Phrase slop to use for phrase query type.
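As a convenience, the parameters above can be combined into a local-params string with a small helper. This is a hedged sketch: `match_local_params` is a hypothetical function, and it assumes simple parameter values without whitespace (values containing spaces would need quoting).

```python
def match_local_params(qf, analyze_as=None, search_with="term",
                       mm=None, pslop=None, v="$q"):
    """Build a {!match ...} local-params string from the documented parameters."""
    if search_with not in ("term", "phrase"):
        raise ValueError("search_with must be 'term' or 'phrase'")
    parts = ["!match", f"qf={qf}"]
    # analyze_as is optional: when omitted, the query analysis of qf is used.
    if analyze_as is not None:
        parts.append(f"analyze_as={analyze_as}")
    parts.append(f"search_with={search_with}")
    if mm is not None:
        parts.append(f"mm={mm}")
    if pslop is not None:
        parts.append(f"pslop={pslop}")
    # v names the query text; $q dereferences the request's q parameter.
    parts.append(f"v={v}")
    return "{" + " ".join(parts) + "}"


example = match_local_params("body", analyze_as="text_shingle",
                             search_with="phrase", mm=2, pslop=1)
print(example)
# {!match qf=body analyze_as=text_shingle search_with=phrase mm=2 pslop=1 v=$q}
```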

Acknowledgements

This is somewhat inspired by Elasticsearch's match query
Sponsored by OpenSource Connections

match-query-parser's Issues

Tutorial question ... analyzer=text_general_syn ...

Is there a typo in the tutorial?

You mention using analyzer=text_general_syn, but there are no fields named text_general_syn. How does that analyzer name know which field to use?

Also, for the new analyzers (like the one below), would I place them in the managed-schema file or directly in the solrconfig.xml file?

<fieldType name="synonymized" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" format="solr" ignoreCase="false" expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(_)" replacement=" " replace="all"/>
  </analyzer>
</fieldType>

Unable to boost the qf field

The following syntaxes in "bq" do NOT produce the expected boost:

<str name="bq">{!match qf=MY_FIELD analyze_as=my_analyzer search_with=term v=$q}^2</str>
or
<str name="bq">{!match qf=MY_FIELD^2 analyze_as=my_analyzer search_with=term v=$q}</str>

Note that with edismax, the first syntax above does NOT work either, while the second does work:

<str name="bq">{!edismax qf=MY_FIELD^2 ... v=$q}</str>

The query doesn't actually work, and the analyzer is not passed at runtime. Please fix.

In both cases, the analyzer is not picked up from analyze_as, which can be clearly seen in parsedquery.

Case 1

Query:
http://localhost:8983/solr/def/select?debugQuery=on&q={!match analyze_as=t_synonymized qf=nosynonymized}sea_biscuit

Output:
{
  responseHeader: {
    status: 0,
    QTime: 0,
    params: {
      q: "{!match analyze_as=t_synonymized qf=nosynonymized}sea_biscuit",
      debugQuery: "on"
    }
  },
  response: { numFound: 0, start: 0, docs: [ ] },
  debug: {
    rawquerystring: "{!match analyze_as=t_synonymized qf=nosynonymized}sea_biscuit",
    querystring: "{!match analyze_as=t_synonymized qf=nosynonymized}sea_biscuit",
    parsedquery: "DisjunctionMaxQuery((nosynonymized:sea_biscuit))",
    parsedquery_toString: "(nosynonymized:sea_biscuit)",
    explain: { },
    QParser: "MatchQParser",
    timing: {
      time: 0,
      prepare: {
        time: 0,
        query: { time: 0 },
        facet: { time: 0 },
        facet_module: { time: 0 },
        mlt: { time: 0 },
        highlight: { time: 0 },
        stats: { time: 0 },
        expand: { time: 0 },
        terms: { time: 0 },
        debug: { time: 0 }
      },
      process: {
        time: 0,
        query: { time: 0 },
        facet: { time: 0 },
        facet_module: { time: 0 },
        mlt: { time: 0 },
        highlight: { time: 0 },
        stats: { time: 0 },
        expand: { time: 0 },
        terms: { time: 0 },
        debug: { time: 0 }
      }
    }
  }
}
Here,
parsedquery: "DisjunctionMaxQuery((nosynonymized:sea_biscuit))",

Case 2

Query:
http://localhost:8983/solr/def/select?debugQuery=on&q={!match analyze_as=t_nosynonymized qf=synonymized}sea_biscuit

Output:
{
  responseHeader: {
    status: 0,
    QTime: 0,
    params: {
      q: "{!match analyze_as=t_nosynonymized qf=synonymized}sea_biscuit",
      debugQuery: "on"
    }
  },
  response: {
    numFound: 1,
    start: 0,
    docs: [
      {
        id: "doc1",
        phonetic: [ "book", "hardcover", "four score and twenty" ],
        queryandindexphonetic: [ "book", "hardcover", "four score and twenty" ],
        indexphonetic: [ "book", "hardcover", "four score and twenty" ],
        queryphonetic: [ "book", "hardcover", "four score and twenty" ],
        synonymized: [ "seabiscuit", "sea biscuit the lonely horse" ],
        nosynonymized: [ "seabiscuit", "sea biscuit the lonely horse" ],
        queryphonetic_str: [ "book", "four score and twenty", "hardcover" ],
        _version_: 1614049012032209000,
        indexphonetic_str: [ "book", "four score and twenty", "hardcover" ],
        queryandindexphonetic_str: [ "book", "four score and twenty", "hardcover" ]
      }
    ]
  },
  debug: {
    rawquerystring: "{!match analyze_as=t_nosynonymized qf=synonymized}sea_biscuit",
    querystring: "{!match analyze_as=t_nosynonymized qf=synonymized}sea_biscuit",
    parsedquery: "DisjunctionMaxQuery((synonymized:sea_biscuit))",
    parsedquery_toString: "(synonymized:sea_biscuit)",
    explain: {
      doc1: " 0.45207188 = weight(synonymized:sea_biscuit in 0) [SchemaSimilarity], result of: 0.45207188 = score(doc=0,freq=2.0 = termFreq=2.0 ), product of: 0.2876821 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from: 1.0 = docFreq 1.0 = docCount 1.5714287 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from: 2.0 = termFreq=2.0 1.2 = parameter k1 0.75 = parameter b 9.0 = avgFieldLength 5.0 = fieldLength "
    },
    QParser: "MatchQParser",
    timing: {
      time: 0,
      prepare: {
        time: 0,
        query: { time: 0 },
        facet: { time: 0 },
        facet_module: { time: 0 },
        mlt: { time: 0 },
        highlight: { time: 0 },
        stats: { time: 0 },
        expand: { time: 0 },
        terms: { time: 0 },
        debug: { time: 0 }
      },
      process: {
        time: 0,
        query: { time: 0 },
        facet: { time: 0 },
        facet_module: { time: 0 },
        mlt: { time: 0 },
        highlight: { time: 0 },
        stats: { time: 0 },
        expand: { time: 0 },
        terms: { time: 0 },
        debug: { time: 0 }
      }
    }
  }
}

Here,
parsedquery: "DisjunctionMaxQuery((synonymized:sea_biscuit))"