
inl / clariah-fcs-endpoints


REST endpoints for CLARIAH Federated Content Search

License: GNU General Public License v3.0

Shell 0.44% Java 96.02% Scala 2.66% Batchfile 0.12% HTML 0.32% JavaScript 0.45%
clariah fcs corpus

clariah-fcs-endpoints's People

Contributors

dependabot[bot], jan-niestadt, jessededoes, ljo, mfannee, peterdekker


clariah-fcs-endpoints's Issues

Ugly additions to eu.clarin.sru.server.fcs.parser

Because QueryNode has protected members, we had to put the query rewriting classes in eu.clarin.sru.server.fcs.parser, which is an existing package.

I suppose this is bad practice, and it would be better to sort this out with the maintainers of eu.clarin.sru.server.fcs.parser.

Wrong word marked as hit

In some specific cases, the hits view marks the wrong word as hit.

For example, the following URL issues a query for the lemma "de": http://localhost:8080/blacklab-sru-server/sru?operation=searchRetrieve&queryType=fcs&x-fcs-context=nederlab&maximumRecords=20&query=[lemma%3D%22de%22]

In the second hit, "van" is marked as hit instead of "de". Probably there is a shift of one position.

<hits:Result>Taal - en Letterkunde bekroond op voorstel <hits:Hit>van</hits:Hit> de keurraad bestaande uit de heren Prof. Dr. E.</hits:Result>
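A shift of one position is the typical result of reading a 0-based token offset as 1-based, or the reverse. The cause has not been confirmed here; the sketch below only reproduces the effect on part of the example sentence. The class `HitMarker` and its half-open `[start, end)` convention are assumptions made for illustration, not the endpoint's actual code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class HitMarker {
    /**
     * Wraps tokens[start, end) in a hit element.
     * start is 0-based and inclusive, end is exclusive.
     */
    static String mark(List<String> tokens, int start, int end) {
        if (start < 0 || end > tokens.size() || start >= end) {
            throw new IllegalArgumentException("invalid hit span");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) sb.append(' ');
            if (i == start) sb.append("<hits:Hit>");
            sb.append(tokens.get(i));
            if (i == end - 1) sb.append("</hits:Hit>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("op", "voorstel", "van", "de", "keurraad");
        // Correct span for "de" (index 3).
        System.out.println(mark(tokens, 3, 4));
        // The same span shifted by one position marks "van" instead.
        System.out.println(mark(tokens, 2, 3));
    }
}
```

If the backend and the endpoint disagree on the offset base, every hit shifts by exactly one token, which matches the output above.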

Nederlab does not work with regex

When issuing a regex query in UD in FCS:
[word="^g.*[^e]$"&pos="ADJ"]

Nederlab does not work and gives an error:
Error during execution of query or back-translation to UD. The query execution failed by this CLARIN-FCS (nederlab) endpoint. Query(cqp=[t_lc="^g.*[^e]$" & pos="ADJ"],server=https://www.nederlab.nl/api/mtas,corpus=nederlab) java.lang.IllegalArgumentException: Illegal group reference
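`java.util.regex.Matcher` throws `IllegalArgumentException: Illegal group reference` when a replacement string contains a `$` that is not followed by a group number or name. The query above contains `$"`, so one plausible cause (not confirmed in the code base) is that the query is passed unescaped as the replacement argument of `String.replaceAll`. The sketch below reproduces that failure and shows the usual fix with `Matcher.quoteReplacement`. The class `QueryTemplate` and the `{query}` placeholder are hypothetical.

```java
import java.util.regex.Matcher;

// Hypothetical helper, for illustration only.
final class QueryTemplate {
    /** Fails when cqp contains '$' or '\', because replaceAll interprets them. */
    static String fillUnsafe(String template, String cqp) {
        return template.replaceAll("\\{query\\}", cqp);
    }

    /** Treats cqp as a literal replacement string. */
    static String fillSafe(String template, String cqp) {
        return template.replaceAll("\\{query\\}", Matcher.quoteReplacement(cqp));
    }

    public static void main(String[] args) {
        String cqp = "[t_lc=\"^g.*[^e]$\" & pos=\"ADJ\"]";
        System.out.println(fillSafe("cqp={query}", cqp));
        try {
            fillUnsafe("cqp={query}", cqp);
        } catch (IllegalArgumentException e) {
            System.out.println("unsafe variant failed: " + e.getMessage());
        }
    }
}
```

`String.replace(CharSequence, CharSequence)` is another option when the placeholder is a plain string, since it never interprets `$`.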

Aggregator resource hierarchy

The endpoint only works with a request to a single corpus; if you select a set of resources in the aggregator, this is not mapped down to the individual corpus level.

Workaround: flat list of corpora
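Mapping a selection down to individual corpora amounts to collecting the leaves below each selected node. The sketch below shows this under assumed types: `ResourceNode` and the grouping used in `main` are hypothetical and do not reflect the aggregator's real data model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical resource tree node, for illustration only.
final class ResourceNode {
    final String id;
    final List<ResourceNode> children;

    ResourceNode(String id, ResourceNode... children) {
        this.id = id;
        this.children = Arrays.asList(children);
    }

    /** Returns the ids of all leaf corpora under the selected nodes, without duplicates. */
    static List<String> leafCorpora(List<ResourceNode> selection) {
        Set<String> leaves = new LinkedHashSet<>();
        for (ResourceNode node : selection) {
            collect(node, leaves);
        }
        return new ArrayList<>(leaves);
    }

    private static void collect(ResourceNode node, Set<String> leaves) {
        if (node.children.isEmpty()) {
            leaves.add(node.id);
            return;
        }
        for (ResourceNode child : node.children) {
            collect(child, leaves);
        }
    }

    public static void main(String[] args) {
        // Made-up grouping of corpora.
        ResourceNode group = new ResourceNode("group",
                new ResourceNode("chn"), new ResourceNode("gysseling"));
        ResourceNode nederlab = new ResourceNode("nederlab");
        System.out.println(leafCorpora(Arrays.asList(group, nederlab)));
    }
}
```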

Deal with punctuation

In CHN and other INT corpora, punctuation is a token property.

In Nederlab and OpenSonar, punctuation is included as a separate token.

Querying high result number gives error

[lemma="het"] in gysseling gives an error when more than 2900 results are requested.
[lemma="de"] in gysseling gives an error when more than 18600 results are requested.

Error during back-translation to UD. The query execution failed by this CLARIN-FCS (Blacklab Server) endpoint: null; Query: Query(cqp=[lemma='de'],server=http://svprmc20.ivdnt.org/blacklab-server/,corpus=gysseling)
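Until the cause is found, a client could request large result sets in smaller windows using the standard SRU parameters `startRecord` (1-based) and `maximumRecords`. Whether this avoids the error is untested, and it is an assumption that the endpoint honours `startRecord`. `SruPaging` is a hypothetical helper that only computes the parameter strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class SruPaging {
    /** Splits a request for `total` records into windows of at most `pageSize` records. */
    static List<String> pageParams(int total, int pageSize) {
        if (total < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("total must be >= 0 and pageSize > 0");
        }
        List<String> pages = new ArrayList<>();
        for (int start = 1; start <= total; start += pageSize) {
            int count = Math.min(pageSize, total - start + 1);
            pages.add("startRecord=" + start + "&maximumRecords=" + count);
        }
        return pages;
    }

    public static void main(String[] args) {
        for (String params : pageParams(2900, 1000)) {
            System.out.println(params);
        }
    }
}
```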

Integrate federated search with blacklab server and/or autosearch UI

  • We do not want to write a completely new user interface for federated search
  • The Aggregator is very limited
  • Could we simulate a blacklab server 'lite' in order to reuse the autosearch UI?
  • (BTW, it might be nice if autosearch has options to publish the corpus for federated search)

And on same token attribute in opensonar

Question:

[pos="^N.*" & pos =".*mv.*"] gives hits which are not mv.

(http://opensonar.ato.inl.nl/search/expert?patt=%5Bpos%3D%22%5EN.\*%22%20%20%26%20pos%20%3D%22.\*mv.\*%22%5D&filter=&within=document&view=1#results)

The similar [pos=".*NOU.*" & pos=".*pl.*"] in corpus hedendaags nederlands does work.
(https://portal.clarin.inl.nl/search/page/search?tab=%23query&view=1&word=&lemma=&pos=.\*indef.\*&querybox=%5Bpos%3D%22.\*NOU.\*%22+%26+pos%3D%22.\*pl.\*%22%5D&titleCombined=&authorCombined=&witnessYear_from__from=&witnessYear_from__to=&max=50&key=28BE55E237DA3E2B8A02E1AE247575E0)

The translation of UD [pos="VERB" & Tense="Past" & VerbForm="Fin"]
(which is wrong btw, [pos="^(WW).*" & pos=".*(verl).*" | pos=".*(vd).*" & pos=".*(pv).*"]) does seem to work
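The translation above is presumably wrong because the alternatives for Tense="Past" are not grouped, so the `|` splits the whole constraint in two. A minimal sketch of a builder that parenthesises every group of alternatives follows. `CqlBuilder` is hypothetical, and the intended grouping is an assumption. Note also that nested brackets are reported elsewhere in this issue to give trouble on OpenSonar, so grouping alone may not be enough.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only.
final class CqlBuilder {
    /**
     * Builds one token constraint. The outer list is AND-ed; each inner list
     * holds alternatives that are OR-ed and wrapped in parentheses when
     * there is more than one.
     */
    static String tokenConstraint(List<List<String>> conjunction) {
        StringJoiner and = new StringJoiner(" & ", "[", "]");
        for (List<String> alternatives : conjunction) {
            String or = String.join(" | ", alternatives);
            and.add(alternatives.size() > 1 ? "(" + or + ")" : or);
        }
        return and.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokenConstraint(Arrays.asList(
                Arrays.asList("pos=\"^(WW).*\""),
                Arrays.asList("pos=\".*(verl).*\"", "pos=\".*(vd).*\""),
                Arrays.asList("pos=\".*(pv).*\""))));
    }
}
```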

PS: Extra brackets in [(((pos="^(SPEC).*" & pos=".*(deeleigen).*") | (pos="^(N).*" & pos=".*(eigen).*")) & word="kip")] give trouble

Also: [pos="NOUN" & Number="Plur"][pos="VERB" & VerbForm="Inf"]
Becomes (opensonar)
[(pos="^(N).*" & pos=".*(soort).*") & pos=".*(mv).*"] [pos="^(WW).*" & pos=".*(inf).*"]

INTERNAL_ERROR: Internal error (java.lang.NullPointerException) (Internal error code 15) in opensonar.ato.nl

Individually, the two segments of this query do not crash (although the first part is not restricted enough)

Also problems on [pos="VERB"][lemma="ik"]
(translated to [pos="^(WW).*"] [lemma="ik"] )

Another one:

The query execution failed by this CLARIN-FCS (Blacklab Server) endpoint: {"error":{"code":"INTERNAL_ERROR","stackTrace":"java.lang.IllegalArgumentException: Comparison method violates its general contract!\n\tat java.util.ComparableTimSort.mergeLo(ComparableTimSort.java:744)\n\tat java.util.ComparableTimSort.mergeAt(ComparableTimSort.java:481)\n\tat java.util.ComparableTimSort.mergeCollapse(ComparableTimSort.java:406)\n\tat java.util.ComparableTimSort.sort(ComparableTimSort.java:213)\n\tat java.util.Arrays.sort(Arrays.java:1312)\n\tat java.util.Arrays.sort(Arrays.java:1506)\n\tat java.util.ArrayList.sort(ArrayList.java:1454)\n\tat java.util.Collections.sort(Collections.java:141)\n\tat nl.inl.blacklab.server.search.SearchCache.performLoadManagement(SearchCache.java:312)\n\tat nl.inl.blacklab.server.search.SearchCache.put(SearchCache.java:244)\n\tat nl.inl.blacklab.server.search.SearchCache.search(SearchCache.java:187)\n\tat nl.inl.blacklab.server.search.SearchManager.search(SearchManager.java:94)\n\tat nl.inl.blacklab.server.requesthandlers.RequestHandlerHits.handle(RequestHandlerHits.java:115)\n\tat nl.inl.blacklab.server.BlackLabServer.handleRequest(BlackLabServer.java:214)\n\tat nl.inl.blacklab.server.BlackLabServer.doGet(BlackLabServer.java:145)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:624)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:731)\n\tat org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)\n\tat org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)\n\tat org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)\n\tat org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)\n\tat org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)\n\tat org.apache.logging.log4j.web.Log4jServletFilter.doFilter(Log4jServletFilter.java:71)\n\tat 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)\n\tat org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)\n\tat org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)\n\tat org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)\n\tat org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)\n\tat org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)\n\tat org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)\n\tat org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)\n\tat org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)\n\tat org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)\n\tat org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)\n\tat org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)\n\tat org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)\n\tat java.lang.Thread.run(Thread.java:748)\n","message":"java.lang.IllegalArgumentException: Comparison method violates its general contract! (Internal error code 32)"}}; Query: Query(cqp=[pos="^(N).*" & pos=".*(,|[,)|].*"]{2, 2},server=http://opensonar.ato.inl.nl/blacklab-server/,corpus=opensonar)

Convert CGN feature-value maps to json file

CGN feature-value maps per corpus were introduced in 2c807ff
These are used to infer features from values for CGN tags where only values are given, and are now corpus-specific for OpenSonar and Nederlab, because of different feature names.

These maps, now stored in Java code in CgnMaps.java, have to be converted to JSON.
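As a starting point, a feature-value map could be dumped to JSON with a few lines of plain Java. The sketch assumes the maps have the shape feature name -> list of values; the actual structure in CgnMaps.java may differ, and the entries used in `main` are illustrative only. `FeatureMapJson` is a hypothetical helper; a JSON library would be the more robust choice in the real code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class FeatureMapJson {
    /** Serialises feature -> values as a compact JSON object of string arrays. */
    static String toJson(Map<String, List<String>> featureValues) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, List<String>> entry : featureValues.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append(quote(entry.getKey())).append(":[");
            List<String> values = entry.getValue();
            for (int i = 0; i < values.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(quote(values.get(i)));
            }
            sb.append(']');
        }
        return sb.append('}').toString();
    }

    /** Escapes backslashes and double quotes only; control characters are not handled. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        map.put("getal", Arrays.asList("ev", "mv"));      // illustrative entries
        map.put("ntype", Arrays.asList("soort", "eigen"));
        System.out.println(toJson(map));
    }
}
```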

Only one rule is applied in corpus-specific->UD conversion

When converting from corpus-specific to UD tags, only one mapping rule is applied. https://github.com/INL/clariah-fcs-endpoints/blob/master/src/main/java/org/ivdnt/fcs/mapping/ConversionEngine.java#L480

This can give strange results, for example giving "Neut" to a noun, without also giving "N": N(eigen,mv,dim):[Neut]

This might be solved by allowing multiple rules to be applied: but this gives new problems, because currently the program relies on just applying the most complex (longest) rule.
Another solution would be allowing multiple-to-multiple mappings, outputting multiple features in UD: #24
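The difference between the two strategies can be sketched as follows. Everything here is hypothetical: the `Rule` type, the rule set, and the merge policy (less specific rules first, more specific rules overriding them) are assumptions for illustration and are not taken from ConversionEngine.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, for illustration only.
final class RuleMerge {
    /** Main tag plus required values, mapped to UD features. */
    static final class Rule {
        final String pos;
        final Set<String> values;
        final Map<String, String> ud;

        Rule(String pos, Set<String> values, Map<String, String> ud) {
            this.pos = pos;
            this.values = values;
            this.ud = ud;
        }

        boolean matches(String tagPos, Set<String> tagValues) {
            return pos.equals(tagPos) && tagValues.containsAll(values);
        }
    }

    static Rule rule(String pos, String feature, String value, String... required) {
        Map<String, String> ud = new LinkedHashMap<>();
        ud.put(feature, value);
        return new Rule(pos, new HashSet<>(Arrays.asList(required)), ud);
    }

    /** Behaviour as described above: only the longest matching rule fires. */
    static Map<String, String> applyLongest(List<Rule> rules, String pos, Set<String> values) {
        Rule best = null;
        for (Rule r : rules) {
            if (r.matches(pos, values) && (best == null || r.values.size() > best.values.size())) {
                best = r;
            }
        }
        Map<String, String> out = new LinkedHashMap<>();
        if (best != null) out.putAll(best.ud);
        return out;
    }

    /** Alternative: all matching rules fire; more specific rules override less specific ones. */
    static Map<String, String> applyAll(List<Rule> rules, String pos, Set<String> values) {
        List<Rule> matching = new ArrayList<>();
        for (Rule r : rules) {
            if (r.matches(pos, values)) matching.add(r);
        }
        matching.sort(Comparator.comparingInt((Rule r) -> r.values.size()));
        Map<String, String> out = new LinkedHashMap<>();
        for (Rule r : matching) out.putAll(r.ud);
        return out;
    }
}
```

With made-up rules `N -> pos=NOUN`, `N(mv) -> Number=Plur` and `N(eigen,dim) -> Gender=Neut`, the tag N(eigen,mv,dim) yields only `Gender=Neut` under `applyLongest`, but all three features under `applyAll`. The open question raised above remains: the override order has to be defined carefully once several rules can set the same feature.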

Aggregator functionality

Is still rather limited (cf. also issue about figuring out the status)

  • Show metadata if a metadata dataview is included in results
  • Metadata search
  • Number of results
  • Run and export complete query result (slow in fcs but possible)
  • Show the translation of the UD query to the corpus-specific query; if possible, link to query execution in the corpus (possible for blacklab-server-based corpora and for Nederlab), so the user can go there, tune the query, and do nicer things with the query results
  • KWIC view highlights only the first word of a hit
  • Advanced data view is not readable

We need a publicly visible test server.

Should include

  • The endpoints
  • The aggregator

Publicly accessible versions of

  • Blacklab server: OpenSonar, Brieven als Buit, Corpus Gysseling
  • Nederlab: test broker

Parse and show multiple tags per word in Gysseling

Gysseling can have multiple tags per word, of the form:
VRB(type=main,finiteness=finite,tense=present,inflection=0)+PD(type=pers,other)
We would like to parse and show these.
Desired output would be this:
pos: VRB+PD type: main+pers finiteness: finite+undefined (or "na" or something similar for "not applicable") tense: present+undefined inflection: 0+undefined
Right now, they are misparsed.

I started work on this, but did not yet finish, on this branch: https://github.com/INL/clariah-fcs-endpoints/tree/multipletags
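A minimal sketch of such a parser, assuming that tags are joined by `+` outside parentheses and that features are comma-separated `name=value` pairs. It is not the code on the multipletags branch. `MultiTagParser` is a hypothetical name, and bare values without `=` (such as `other`) are skipped, because the desired output above does not show them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical parser, for illustration only.
final class MultiTagParser {
    static final String UNDEFINED = "undefined";

    /** Splits "A(...)+B(...)" at '+' characters that are outside parentheses. */
    static List<String> splitTags(String tag) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < tag.length(); i++) {
            char c = tag.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == '+' && depth == 0) {
                parts.add(tag.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(tag.substring(start));
        return parts;
    }

    /** Parses one tag such as "PD(type=pers,other)" into pos plus named features. */
    static Map<String, String> parseSingle(String part) {
        Map<String, String> features = new LinkedHashMap<>();
        int open = part.indexOf('(');
        int close = part.lastIndexOf(')');
        if (open < 0 || close < open) {
            features.put("pos", part.trim());
            return features;
        }
        features.put("pos", part.substring(0, open).trim());
        for (String item : part.substring(open + 1, close).split(",")) {
            int eq = item.indexOf('=');
            if (eq > 0) {
                features.put(item.substring(0, eq).trim(), item.substring(eq + 1).trim());
            }
            // Bare values without '=' are skipped in this sketch.
        }
        return features;
    }

    /** Merges all tags feature by feature, filling gaps with UNDEFINED. */
    static Map<String, String> parse(String tag) {
        List<Map<String, String>> perTag = new ArrayList<>();
        for (String part : splitTags(tag)) {
            perTag.add(parseSingle(part));
        }
        Set<String> names = new LinkedHashSet<>();
        for (Map<String, String> features : perTag) {
            names.addAll(features.keySet());
        }
        Map<String, String> merged = new LinkedHashMap<>();
        for (String name : names) {
            StringJoiner joined = new StringJoiner("+");
            for (Map<String, String> features : perTag) {
                joined.add(features.getOrDefault(name, UNDEFINED));
            }
            merged.put(name, joined.toString());
        }
        return merged;
    }
}
```

For the example tag above, `MultiTagParser.parse` returns a map with `pos=VRB+PD`, `type=main+pers`, `finiteness=finite+undefined`, `tense=present+undefined` and `inflection=0+undefined`, in that order.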
