textexploration / mtas

Multi Tier Annotation Search

Home Page: https://textexploration.github.io/mtas/

License: Apache License 2.0

CSS 0.03% HTML 1.63% JavaScript 0.13% Java 97.81% Dockerfile 0.39%

annotations big-data cql distributed lucene search search-engine search-in-text solr structure text text-analysis

mtas's Issues

Question: limiting the number of results to speed up a query

Is it possible to limit the number of documents that are queried in order to speed up query resolution? We are working with large text corpora (more than 1 billion words) and would like to quickly obtain at most N results, preferably but not necessarily in random order. Our goal is to quickly provide some results, which should be enough for most cases. Currently we use the list query to get a page of results (with the start: and number: parameters), but it is slow for queries with millions of matches. As far as we understand, this is because in such cases nearly all documents need to be queried, which takes several minutes on a single machine.
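
For illustration only, the general idea behind "at most N results" is early termination: pull hits lazily and stop advancing as soon as N have been collected, so the remaining documents are never visited. The sketch below uses plain Java iterators, not the MTAS or Lucene API; FirstNHits and takeAtMost are hypothetical names, and whether the MTAS list query can stop its own traversal like this is exactly what the question asks.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class FirstNHits {

    /**
     * Collects at most {@code limit} hits from a lazy source.
     * The size check comes first, so the source is not touched again
     * once the limit has been reached.
     */
    static <T> List<T> takeAtMost(Iterator<T> hits, int limit) {
        List<T> out = new ArrayList<>();
        while (out.size() < limit && hits.hasNext()) {
            out.add(hits.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a hit stream over a large corpus.
        Iterator<String> hits = List.of("hit-1", "hit-2", "hit-3", "hit-4").iterator();
        System.out.println(takeAtMost(hits, 2));
    }
}
```

This returns the first N hits in traversal order; a random sample would need a different strategy, since sampling uniformly generally requires knowing or visiting all matches.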

ids for s and w in FOLIA

I apologize if this is documented - I couldn't find it:

I am indexing a FOLIA corpus, to be queried via CQL. This works fine as far as "normal" annotations are concerned, i.e. I can query for (e.g.) POS or lemma on the token level, and also for annotations on the sentence level. However, it remains unclear to me how to account for the xml:id attribute on <s> and <w> elements. The XML looks like this:

<s class="line" xml:id="s3">
    <w xml:id="s3.w1">
        <t>are</t>
        <lemma class="be"/>
        <pos class="VBB"/>
    </w>
    <w xml:id="s3.w2">
        <t>you</t>
        <lemma class="you"/>
        <pos class="PNP"/>
    </w>
    <w xml:id="s3.w3">
        <t>ready</t>
        <lemma class="ready"/>
        <pos class="AV0"/>
    </w>
</s>

And I've tried several variants in the indexing configuration file such as:

<!-- id for the <w>-element -->
<token type="string" offset="false" realoffset="false" parent="false">
    <pre>
        <item type="string" value="word.id" />
    </pre>
    <post>
        <item type="attribute" name="#" />
    </post>
</token>

So far, I haven't been able to find or do anything with the xml:ids.

What I'd like to understand/do is:

1. How to represent xml:id on both sentence and token level in the config file
2. How to integrate them into a CQL query
3. How to access the ids programmatically after having done a query

For (3), I currently test my attempts like so:

  List<String> prefixes = new ArrayList<>();
  prefixes.add("t");
  prefixes.add("word.id");
  List<CodecSearchTree.MtasTreeHit<String>> allHits 
          = mtasCodecInfo.getPositionedTermsByPrefixesAndPositionRange("content", index, prefixes, spans.startPosition(), 
              spans.endPosition()-1);
  allHits.sort((MtasTreeHit<String> o1, MtasTreeHit<String> o2) -> Integer.compare(o1.startPosition, o2.startPosition));
  for (CodecSearchTree.MtasTreeHit<String> hit : allHits){
      System.out.print(CodecUtil.termValue(hit.data) + "(" + hit.startPosition + ")" +  " / " );
  }

I'd be grateful if somebody could point me in the right direction. Thanks in advance.

Help regarding configuration

Hi there,
I found your software after a short search online, and it seems likely to fit my needs. Unfortunately, I am a bit lost with the configuration for indexing, as I have rarely dealt with search engines.

I wrote some XML looking like the main example:

<text xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1">
<seg xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1:line.002.1">
<w xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1:line.002.1#w.1">
<t>Qualis</t>
<pos>PROint</pos>
<lemma>qualis2</lemma>
<morph>Case=Nom|Numb=Sing</morph>
</w>

But it's not clear to me what I should do from there. Can you point me to the relevant example or docs?

Question: How to query programmatically

The code we are currently using in INCEpTION to perform an MTAS search looks pretty complicated - but I am pretty sure that is the way we were told that querying MTAS would work:

    private static void doQuery(IndexReader indexReader, String field, MtasSpanQuery q,
            List<String> prefixes)
        throws IOException
    {
        ListIterator<LeafReaderContext> iterator = indexReader.leaves().listIterator();
        IndexSearcher searcher = new IndexSearcher(indexReader);
        final float boost = 0;
        SpanWeight spanweight = q.rewrite(indexReader).createWeight(searcher, false, boost);

        while (iterator.hasNext()) {
            LeafReaderContext lrc = iterator.next();
            Spans spans = spanweight.getSpans(lrc, SpanWeight.Postings.POSITIONS);
            SegmentReader segmentReader = (SegmentReader) lrc.reader();
            Terms terms = segmentReader.terms(field);
            CodecInfo mtasCodecInfo = CodecInfo.getCodecInfoFromTerms(terms);
            if (spans != null) {
                while (spans.nextDoc() != Spans.NO_MORE_DOCS) {
                ...

But normally, querying Lucene would use something like searcher.search(q, myCollector), and there are also search signatures that would, e.g., allow for sorting results. So I was wondering (I haven't tried it yet): can the search/collector approach really not be used with MTAS? If not, why? And if it can be used, does anybody have an example of it?

MTAS for Lucene 9.x

Will there be MTAS versions based on Lucene 9.x?

Question: How to search for unset values

Is it possible to search for features that have no value attached?

It is possible to search for a feature having any value, like so:

<layer.feature=""/>

But how can I search for layers with unset feature values? Such a search could look like one of these:

<layer.feature!=""/>
<!layer.feature=""/>
<layer.feature=none/>
<layer.feature=false/>

Question: Wildcards for layers

Is it possible to use wildcards for layers?

Imagine having several layers with the same features and values:

<layer_1.feature="value">
<layer_2.feature="value">
<layer_3.feature="value">

Is it possible to search for all layers that have feature="value", e.g.

<*.feature="value">

At the moment this works only by chaining all possible layers with or (|)

<layer_1.feature="value"> | <layer_2.feature="value"> | <layer_3.feature="value">
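
Until wildcards are available, the chained form can at least be generated instead of typed by hand. This is a minimal sketch in plain Java that only builds the query string; LayerUnionQuery and anyLayer are hypothetical names, not part of MTAS. It assumes the self-closing form <layer.feature="value"/> and does not escape quotes inside the value.

```java
import java.util.List;
import java.util.stream.Collectors;

class LayerUnionQuery {

    /** Joins one <layer.feature="value"/> clause per layer with the CQL "or" operator. */
    static String anyLayer(List<String> layers, String feature, String value) {
        return layers.stream()
                .map(layer -> "<" + layer + "." + feature + "=\"" + value + "\"/>")
                .collect(Collectors.joining(" | "));
    }

    public static void main(String[] args) {
        // Layer names taken from the example above.
        List<String> layers = List.of("layer_1", "layer_2", "layer_3");
        System.out.println(anyLayer(layers, "feature", "value"));
    }
}
```

The list of layer names still has to come from somewhere, for example from the indexing configuration, so this only removes the manual chaining, not the need to know the layers.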

Unable to read index after MTAS upgrade

Lucene is able to work with indexes created with older versions of Lucene.

However, when upgrading MTAS, say, from 7.7.1.0 to 8.11.1.0, an exception is thrown when trying to open the index:

2022-07-01 20:33:06 [main] ERROR MtasDocumentIndex - Unable to read MTAS index: codec mismatch: actual codec=Lucene70SegmentInfo vs expected codec=Lucene86SegmentInfo (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/bluefire/git/inception-application/inception/inception-search-mtas/target/test-output/MtasUpgradeTest/project/1/indexMtas/_0.si")))
org.apache.lucene.index.CorruptIndexException: codec mismatch: actual codec=Lucene70SegmentInfo vs expected codec=Lucene86SegmentInfo (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/bluefire/git/inception-application/inception/inception-search-mtas/target/test-output/MtasUpgradeTest/project/1/indexMtas/_0.si")))
	at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:208) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:95) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1037) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
...

CQL Support

It looks like MTAS doesn't support the full CQL language, judging by the current specifications.

One example is the ability to search on a span using multiple attribute-value pairs, e.g. [t="dog" & pos="NN"], as per https://www.sketchengine.eu/documentation/cql-basics/#boolean

The MTAS documentation also shows an example of doing this in a slightly different way (no square brackets) using something like:

t="dog" & POS="NN"

However, this appears to generate an exception in the CQL parser: "mtas.parser.cql.ParseException: Encountered "" at line 1, column 1.".

The closest I've come to doing this is using fullyalignedwith to repeatedly match each attribute value pair.
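
For reference, the workaround amounts to something like the following sketch, which only builds the query string in plain Java. AlignedTokenQuery and allOf are hypothetical names; the grouping with parentheses when more than two pairs are chained is an assumption about what the parser accepts, not something confirmed by the documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AlignedTokenQuery {

    /**
     * Chains one [attribute="value"] token per entry with "fullyalignedwith",
     * folding from the left so each new condition constrains the previous result.
     */
    static String allOf(Map<String, String> conditions) {
        String query = null;
        for (Map.Entry<String, String> e : conditions.entrySet()) {
            String token = "[" + e.getKey() + "=\"" + e.getValue() + "\"]";
            query = (query == null)
                    ? token
                    : "(" + query + " fullyalignedwith " + token + ")";
        }
        if (query == null) {
            throw new IllegalArgumentException("at least one condition is required");
        }
        return query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the conditions in insertion order.
        Map<String, String> conditions = new LinkedHashMap<>();
        conditions.put("t", "dog");
        conditions.put("pos", "NN");
        System.out.println(allOf(conditions));
    }
}
```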

Thanks

Tony

Multi-position query without postfix

It would be nice if searching for <field/> had the same effect as searching for <field=".*"/>.