inl / clariah-fcs-endpoints
REST endpoints for CLARIAH Federated Content Search
License: GNU General Public License v3.0
Because some members of QueryNode are protected, we had to put the query rewriting classes in eu.clarin.sru.server.fcs.parser, which is an existing package.
I suppose this is bad practice; it would be better to sort this out with the maintainers of eu.clarin.sru.server.fcs.parser.
In some specific cases, the hits view marks the wrong word as hit.
For example, when requesting the following URL, a query for the lemma "de": http://localhost:8080/blacklab-sru-server/sru?operation=searchRetrieve&queryType=fcs&x-fcs-context=nederlab&maximumRecords=20&query=[lemma%3D%22de%22]
In the second hit, "van" is marked as the hit instead of "de". The hit position is probably shifted by one token.
<hits:Result>Taal - en Letterkunde bekroond op voorstel <hits:Hit>van</hits:Hit> de keurraad bestaande uit de heren Prof. Dr. E.</hits:Result>
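To make the suspected shift concrete, here is a minimal sketch of how a hits view could be rendered from token positions. It assumes the backend reports a hit as a half-open token range [start, end); that convention, the class name HitsViewSketch and the method render are assumptions for illustration, not the endpoint's actual code.

```java
import java.util.List;

final class HitsViewSketch {
    /** Wraps the tokens in [start, end) in hits:Hit and joins all tokens with spaces. */
    static String render(List<String> tokens, int start, int end) {
        StringBuilder sb = new StringBuilder("<hits:Result>");
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Assumption: start is inclusive, end is exclusive.
            boolean inHit = i >= start && i < end;
            if (inHit) {
                sb.append("<hits:Hit>").append(tokens.get(i)).append("</hits:Hit>");
            } else {
                sb.append(tokens.get(i));
            }
        }
        return sb.append("</hits:Result>").toString();
    }
}
```

With the tokens "op voorstel van de keurraad", the range [3, 4) marks "de", while a range shifted left by one, [2, 3), marks "van", which is the symptom described above.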
... is completely wrong. They become positive
When issuing a regex query in UD in FCS:
[word="^g.*[^e]$"&pos="ADJ"]
the Nederlab endpoint fails with the following error:
Error during execution of query or back-translation to UD. The query execution failed by this CLARIN-FCS (nederlab) endpoint. Query(cqp=[t_lc="^g.*[^e]$" & pos="ADJ"],server=https://www.nederlab.nl/api/mtas,corpus=nederlab) java.lang.IllegalArgumentException: Illegal group reference
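One possible cause, not confirmed from the endpoint code: in Java, "Illegal group reference" is thrown by String.replaceAll and Matcher.appendReplacement when the replacement string contains a `$` that is not followed by a valid group number. A user regex ending in `$` would trigger this if it is passed as a replacement. The sketch below shows the failure and a fix with Matcher.quoteReplacement; the `{value}` template and the method names are hypothetical.

```java
import java.util.regex.Matcher;

final class ReplacementEscapeSketch {
    /** Hypothetical template step: '$' in userRegex is read as a group reference. */
    static String fillUnsafe(String template, String userRegex) {
        return template.replaceAll("\\{value\\}", userRegex);
    }

    /** Same step, with the replacement quoted so '$' and '\' are taken literally. */
    static String fillSafe(String template, String userRegex) {
        return template.replaceAll("\\{value\\}", Matcher.quoteReplacement(userRegex));
    }
}
```

Calling fillUnsafe with the regex `^g.*[^e]$` throws IllegalArgumentException on Java 8 and later, while fillSafe inserts the regex unchanged. Where no regex matching is needed, String.replace (literal replacement) avoids the problem altogether.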
The endpoint only works with a request to a single corpus; if you select a set of resources in the aggregator, this is not mapped down to the individual corpus level.
Workaround: flat list of corpora
Note: if a recordIdentifier appears in the XML, the aggregator gives an error, because it is not defined in sruResponse.xsd
see https://lists.oasis-open.org/archives/search-ws-comment/201404/msg00000.html
@see eu.clarin.sru.server.SRUSearchResultSet#getRecordIdentifier()
Otherwise, they cannot be included in the federated search
When converting corpus-specific tags back to UD, the mapping rules and code are designed so that only one UD tag can ever be output. It may be good to change this, so that multiple tags (for example POS tag and gender) can be output in UD.
In CHN and other INT corpora, punctuation is a token property
In Nederlab and OpenSonar, punctuation is included as a separate token
"IncludeFeatureNameInRegex": true,
[lemma="het"] in gysseling gives error when querying more than 2900 results
[lemma="de"] in gysseling gives error when querying more than 18600 results
Error during back-translation to UD. The query execution failed by this CLARIN-FCS (Blacklab Server) endpoint: null; Query: Query(cqp=[lemma='de'],server=http://svprmc20.ivdnt.org/blacklab-server/,corpus=gysseling)
Question:
[pos="^N.*" & pos =".*mv.*"] gives hits which are not mv.
The similar query [pos=".*NOU.*" & pos=".*pl.*"] in Corpus Hedendaags Nederlands does work.
(https://portal.clarin.inl.nl/search/page/search?tab=%23query&view=1&word=&lemma=&pos=.\*indef.\*&querybox=%5Bpos%3D%22.\*NOU.\*%22+%26+pos%3D%22.\*pl.\*%22%5D&titleCombined=&authorCombined=&witnessYear_from__from=&witnessYear_from__to=&max=50&key=28BE55E237DA3E2B8A02E1AE247575E0)
The translation of UD [pos="VERB" & Tense="Past" & VerbForm="Fin"]
(which is wrong btw, [pos="^(WW).*" & pos=".*(verl).*" | pos=".*(vd).*" & pos=".*(pv).*"]) does seem to work
PS: Extra brackets in [(((pos="^(SPEC).*" & pos=".*(deeleigen).*") | (pos="^(N).*" & pos=".*(eigen).*")) & word="kip")] give trouble
Also: [pos="NOUN" & Number="Plur"][pos="VERB" & VerbForm="Inf"]
Becomes (opensonar)
[(pos="^(N).*" & pos=".*(soort).*") & pos=".*(mv).*"] [pos="^(WW).*" & pos=".*(inf).*"]
INTERNAL_ERROR: Internal error (java.lang.NullPointerException) (Internal error code 15) in opensonar.ato.nl
Individually, the two segments of this query do not crash (although the first part is not restricted enough)
Also problems on [pos="VERB"][lemma="ik"]
(translated to [pos="^(WW).*"] [lemma="ik"] )
Another one:
The query execution failed by this CLARIN-FCS (Blacklab Server) endpoint: {"error":{"code":"INTERNAL_ERROR","stackTrace":"java.lang.IllegalArgumentException: Comparison method violates its general contract!\n\tat java.util.ComparableTimSort.mergeLo(ComparableTimSort.java:744)\n\tat java.util.ComparableTimSort.mergeAt(ComparableTimSort.java:481)\n\tat java.util.ComparableTimSort.mergeCollapse(ComparableTimSort.java:406)\n\tat java.util.ComparableTimSort.sort(ComparableTimSort.java:213)\n\tat java.util.Arrays.sort(Arrays.java:1312)\n\tat java.util.Arrays.sort(Arrays.java:1506)\n\tat java.util.ArrayList.sort(ArrayList.java:1454)\n\tat java.util.Collections.sort(Collections.java:141)\n\tat nl.inl.blacklab.server.search.SearchCache.performLoadManagement(SearchCache.java:312)\n\tat nl.inl.blacklab.server.search.SearchCache.put(SearchCache.java:244)\n\tat nl.inl.blacklab.server.search.SearchCache.search(SearchCache.java:187)\n\tat nl.inl.blacklab.server.search.SearchManager.search(SearchManager.java:94)\n\tat nl.inl.blacklab.server.requesthandlers.RequestHandlerHits.handle(RequestHandlerHits.java:115)\n\tat nl.inl.blacklab.server.BlackLabServer.handleRequest(BlackLabServer.java:214)\n\tat nl.inl.blacklab.server.BlackLabServer.doGet(BlackLabServer.java:145)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:624)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:731)\n\tat org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)\n\tat org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)\n\tat org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)\n\tat org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)\n\tat org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)\n\tat org.apache.logging.log4j.web.Log4jServletFilter.doFilter(Log4jServletFilter.java:71)\n\tat 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)\n\tat org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)\n\tat org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)\n\tat org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)\n\tat org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)\n\tat org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)\n\tat org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)\n\tat org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)\n\tat org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)\n\tat org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)\n\tat org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)\n\tat org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)\n\tat org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)\n\tat java.lang.Thread.run(Thread.java:748)\n","message":"java.lang.IllegalArgumentException: Comparison method violates its general contract! (Internal error code 32)"}}; Query: Query(cqp=[pos="^(N).*" & pos=".*(,|[,)|].*"]{2, 2},server=http://opensonar.ato.inl.nl/blacklab-server/,corpus=opensonar)
Figure out: Will the aggregator be developed/maintained?
According to the document https://office.clarin.eu/v/CE-2017-1035-CLARINPLUS-D2_9.pdf, there is a 3.0 version of the Aggregator (with a graphical query builder). I cannot find it in the SVN.
Jan is in the FCS task force group; maybe he can post a question?
Can we map everything to, for example, Universal Dependencies?
See
CGN feature-value maps per corpus were introduced in 2c807ff
These are used to infer features from values for CGN tags where only values are given, and are now corpus-specific for OpenSonar and Nederlab, because of different feature names.
These maps, now stored in Java code in CgnMaps.java, have to be converted to JSON.
When converting from corpus-specific to UD tags, only one mapping rule is applied. https://github.com/INL/clariah-fcs-endpoints/blob/master/src/main/java/org/ivdnt/fcs/mapping/ConversionEngine.java#L480
This can give strange results, for example giving "Neut" to a noun, without also giving "N": N(eigen,mv,dim):[Neut]
This might be solved by allowing multiple rules to be applied, but that creates new problems, because the program currently relies on applying only the most complex (longest) rule.
Another solution would be allowing multiple-to-multiple mappings, outputting multiple features in UD: #24
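As a rough sketch of the "multiple rules" idea: every rule whose pattern matches the corpus tag contributes one UD feature, and the results are merged into a single feature map. The rules below are hypothetical illustrations, not the project's mapping files, and the "first matching rule per feature wins" policy is an assumption that replaces the current longest-rule selection.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class MultiFeatureMappingSketch {
    // Hypothetical rules: {regex over the corpus tag, UD feature name, UD value}.
    private static final String[][] RULES = {
        {"^N\\(.*", "pos", "NOUN"},
        {".*\\bmv\\b.*", "Number", "Plur"},
        {".*\\bdim\\b.*", "Gender", "Neut"},
    };

    /** Applies every matching rule; the first rule to set a feature wins. */
    static Map<String, String> toUd(String corpusTag) {
        Map<String, String> features = new LinkedHashMap<>();
        for (String[] rule : RULES) {
            if (Pattern.matches(rule[0], corpusTag)) {
                features.putIfAbsent(rule[1], rule[2]);
            }
        }
        return features;
    }
}
```

Under these assumed rules, toUd("N(eigen,mv,dim)") yields pos, Number and Gender together instead of only "Neut". Conflicts between rules that set the same feature would still need a real policy; putIfAbsent is only a placeholder for one.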
Is still rather limited (cf. also issue about figuring out the status)
Should include
Publicly accessible versions of
Would be somewhat more user-friendly
Cf.:
Unclear!
Gysseling can have multiple tags per word, of the form:
VRB(type=main,finiteness=finite,tense=present,inflection=0)+PD(type=pers,other)
We would like to parse and show these.
Desired output would be this:
pos: VRB+PD
type: main+pers
finiteness: finite+undefined (or "na" or something similar, for "not applicable")
tense: present+undefined
inflection: 0+undefined
Right now, they are misparsed.
I started work on this on the following branch, but have not finished it yet: https://github.com/INL/clariah-fcs-endpoints/tree/multipletags
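The desired output above could be produced along the following lines. This is a minimal sketch, not the code on the multipletags branch: the class and method names are hypothetical, bare values without a feature name (such as "other") are simply skipped, and "undefined" is used as the filler for features that a sub-tag does not have.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MultiTagSketch {
    /** Splits a tag on '+', parses each POS(feature=value,...) part, and joins values per feature. */
    static Map<String, String> parse(String tag) {
        List<Map<String, String>> parsed = new ArrayList<>();
        Set<String> names = new LinkedHashSet<>();
        names.add("pos");
        for (String part : tag.split("\\+")) {
            Map<String, String> features = new LinkedHashMap<>();
            int open = part.indexOf('(');
            features.put("pos", open < 0 ? part.trim() : part.substring(0, open).trim());
            if (open >= 0) {
                int close = part.lastIndexOf(')');
                String inner = part.substring(open + 1, close > open ? close : part.length());
                for (String item : inner.split(",")) {
                    int eq = item.indexOf('=');
                    if (eq > 0) { // bare values such as "other" are skipped in this sketch
                        features.put(item.substring(0, eq).trim(), item.substring(eq + 1).trim());
                    }
                }
            }
            names.addAll(features.keySet());
            parsed.add(features);
        }
        Map<String, String> combined = new LinkedHashMap<>();
        for (String name : names) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < parsed.size(); i++) {
                if (i > 0) {
                    sb.append('+');
                }
                sb.append(parsed.get(i).getOrDefault(name, "undefined"));
            }
            combined.put(name, sb.toString());
        }
        return combined;
    }
}
```

The sketch does not handle feature values that themselves contain '+' or ','; if such values occur in Gysseling, the splitting would need a proper tokenizer.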