textexploration / mtas
Multi Tier Annotation Search
Home Page: https://textexploration.github.io/mtas/
License: Apache License 2.0
Is it possible to limit the number of documents that are queried, to potentially speed up query resolution? We are working with large text corpora (more than 1 billion words) and would like to quickly obtain at most N results, preferably but not necessarily in random order. Our goal is to quickly provide some results, which should be enough for most cases. At the moment we use the list query to get a page of results (with the start: and number: parameters), but it is slow for queries with millions of matches. As far as we understand, this is because in such cases nearly all documents need to be queried, which takes several minutes on a single machine.
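The early-termination idea asked about here can be pictured with a minimal sketch. Nothing in this thread confirms that the MTAS list query exposes such a cutoff; the class FirstNHits, its collect method, and the per-document iterator are hypothetical stand-ins written in plain Java, not MTAS or Lucene types. The sketch only shows that results are pulled lazily, document by document, and that iteration stops once N hits are collected.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch: stop collecting as soon as maxHits results are found. */
class FirstNHits {

    /**
     * Pulls hits lazily, one document at a time, and stops after maxHits.
     * The Iterator<List<String>> stands in for "the matches of one document";
     * it is an assumption for illustration, not an MTAS type.
     */
    static List<String> collect(Iterator<List<String>> hitsPerDocument, int maxHits) {
        List<String> result = new ArrayList<>();
        while (result.size() < maxHits && hitsPerDocument.hasNext()) {
            for (String hit : hitsPerDocument.next()) {
                result.add(hit);
                if (result.size() == maxHits) {
                    break; // enough results: remaining documents are never visited
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("doc0:hit0", "doc0:hit1"),
            List.of(),
            List.of("doc2:hit0", "doc2:hit1", "doc2:hit2"),
            List.of("doc3:hit0"));
        System.out.println(collect(docs.iterator(), 3));
    }
}
```

With this shape the cost depends on how many documents must be visited to find N hits, not on the total number of matches. Results come back in document order, so a random order would need an extra step such as visiting documents in shuffled order.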
I apologize if this is documented - I couldn't find it:
I am indexing a FOLIA corpus, to be queried via CQL. This works fine as far as "normal" annotations are concerned, i.e. I can query for (e.g.) POS or lemma on the token level, and also for annotations on the sentence level. However, it remains unclear to me how to account for the xml:id attribute on <s> and <w> elements. The XML looks like this:
<s class="line" xml:id="s3">
  <w xml:id="s3.w1">
    <t>are</t>
    <lemma class="be"/>
    <pos class="VBB"/>
  </w>
  <w xml:id="s3.w2">
    <t>you</t>
    <lemma class="you"/>
    <pos class="PNP"/>
  </w>
  <w xml:id="s3.w3">
    <t>ready</t>
    <lemma class="ready"/>
    <pos class="AV0"/>
  </w>
</s>
And I've tried several variants in the indexing configuration file such as:
<!-- id for the <w>-element -->
<token type="string" offset="false" realoffset="false" parent="false">
  <pre>
    <item type="string" value="word.id" />
  </pre>
  <post>
    <item type="attribute" name="#" />
  </post>
</token>
So far, I haven't been able to find or do anything with the xml:ids.
What I'd like to understand/do is:
For (3), I currently test my attempts like so:
List<String> prefixes = new ArrayList<>();
prefixes.add("t");
prefixes.add("word.id");
List<CodecSearchTree.MtasTreeHit<String>> allHits
    = mtasCodecInfo.getPositionedTermsByPrefixesAndPositionRange("content", index, prefixes,
        spans.startPosition(), spans.endPosition()-1);
allHits.sort((MtasTreeHit<String> o1, MtasTreeHit<String> o2) -> Integer.compare(o1.startPosition, o2.startPosition));
for (CodecSearchTree.MtasTreeHit<String> hit : allHits){
    System.out.print(CodecUtil.termValue(hit.data) + "(" + hit.startPosition + ")" + " / " );
}
I'd be grateful if somebody could point me in the right direction. Thanks in advance.
Hi there,
I found your software after a short search online, and it seems likely to fit my needs. Unfortunately, I have rarely dealt with search engines and I am a bit lost with the configuration for indexing.
I wrote some XML looking like the main example:
<text xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1">
  <seg xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1:line.002.1">
    <w xml:id="urn:cts:latinLit:anthoLat.anthoLat.musisquedeoque-lat1:line.002.1#w.1">
      <t>Qualis</t>
      <pos>PROint</pos>
      <lemma>qualis2</lemma>
      <morph>Case=Nom|Numb=Sing</morph>
    </w>
But it's not clear to me what I should do from there. Can you point me to the relevant example or docs?
The code we are currently using in INCEpTION to perform an MTAS search looks pretty complicated - but I am pretty sure that is the way we were told that querying MTAS would work:
private static void doQuery(IndexReader indexReader, String field, MtasSpanQuery q,
        List<String> prefixes)
    throws IOException
{
    ListIterator<LeafReaderContext> iterator = indexReader.leaves().listIterator();
    IndexSearcher searcher = new IndexSearcher(indexReader);
    final float boost = 0;
    SpanWeight spanweight = q.rewrite(indexReader).createWeight(searcher, false, boost);
    while (iterator.hasNext()) {
        LeafReaderContext lrc = iterator.next();
        Spans spans = spanweight.getSpans(lrc, SpanWeight.Postings.POSITIONS);
        SegmentReader segmentReader = (SegmentReader) lrc.reader();
        Terms terms = segmentReader.terms(field);
        CodecInfo mtasCodecInfo = CodecInfo.getCodecInfoFromTerms(terms);
        if (spans != null) {
            while (spans.nextDoc() != Spans.NO_MORE_DOCS) {
                ...
But normally, querying Lucene would use something like searcher.search(q, myCollector), and there are also search signatures which would, e.g., allow for sorting results. So I was wondering (I haven't tried it yet): can the search/collector approach really not be used with MTAS? If not, why? And if it can be used, does anybody have an example for it?
Will there be MTAS versions based on Lucene 9.x?
It is possible to search for a feature having any value, like so:
<layer.feature=""/>
But how can I search for layers with unset feature values? Such a search could look like one of these:
<layer.feature!=""/>
<!layer.feature=""/>
<layer.feature=none/>
<layer.feature=false/>
Imagine having several layers with the same features and values:
<layer_1.feature="value">
<layer_2.feature="value">
<layer_3.feature="value">
Is it possible to search for all layers that have feature="value", e.g.
<*.feature="value">
At the moment this works only by chaining all possible layers with or (|):
<layer_1.feature="value"> | <layer_2.feature="value"> | <layer_3.feature="value">
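Until a wildcard form exists, the chained query can at least be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: LayerQueryBuilder and anyLayer are hypothetical names, not part of MTAS, and the element form is reproduced exactly as written in this issue. The helper only concatenates strings; whether the result parses is up to the MTAS CQL parser.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that builds the "or" chain over a known list of layers. */
class LayerQueryBuilder {

    /**
     * Returns one <layer.feature="value"> clause per layer, joined with " | ".
     * Note: the value is inserted as-is; quotes inside it are not escaped.
     */
    static String anyLayer(List<String> layers, String feature, String value) {
        return layers.stream()
            .map(layer -> "<" + layer + "." + feature + "=\"" + value + "\">")
            .collect(Collectors.joining(" | "));
    }

    public static void main(String[] args) {
        List<String> layers = List.of("layer_1", "layer_2", "layer_3");
        System.out.println(anyLayer(layers, "feature", "value"));
    }
}
```

The list of layer names has to come from the caller, for instance from the application's own layer configuration, so this does not replace a true wildcard such as <*.feature="value">.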
Lucene is able to work with indexes created with older versions of Lucene.
However, when upgrading MTAS say from 7.7.1.0 to 8.11.1.0, an exception is generated when trying to open the index:
2022-07-01 20:33:06 [main] ERROR MtasDocumentIndex - Unable to read MTAS index: codec mismatch: actual codec=Lucene70SegmentInfo vs expected codec=Lucene86SegmentInfo (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/bluefire/git/inception-application/inception/inception-search-mtas/target/test-output/MtasUpgradeTest/project/1/indexMtas/_0.si")))
org.apache.lucene.index.CorruptIndexException: codec mismatch: actual codec=Lucene70SegmentInfo vs expected codec=Lucene86SegmentInfo (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/bluefire/git/inception-application/inception/inception-search-mtas/target/test-output/MtasUpgradeTest/project/1/indexMtas/_0.si")))
at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:208) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:95) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1037) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
...
Hi
It looks like MTAS doesn't support the full CQL language. This is based on the current specifications.
For example, the ability to search on a span using multiple attribute-value pairs, e.g. [t="dog" & pos="NN"], as per https://www.sketchengine.eu/documentation/cql-basics/#boolean
The MTAS documentation also shows an example of doing this in a slightly different way (no square brackets) using something like:
t="dog" & POS="NN"
However, this appears to generate an exception in the CQL parser: "mtas.parser.cql.ParseException: Encountered "" at line 1, column 1.".
The closest I've come to doing this is using fullyalignedwith to repeatedly match each attribute value pair.
Thanks
Tony
It would be nice if searching for <field/> would have the same effect as searching for <field=".*"/>.