clarin-eric / curation-dashboard
Java library for CLARIN's CMDI curation
License: GNU General Public License v3.0
this is a test
test
This is wrong. The total number of links shouldn't be queried from the database; it should be calculated dynamically in Java during report generation, as it was before. Only the total number of unique links should be queried from the database.
When choosing to generate local report xml files via setting SAVE_REPORT=true in config.properties, the report sub directories "collection", "instances" and "profiles" are created. However, all report types are written to "profiles" while the other directories stay empty.
Best regards,
Florian
I noticed that in the web pages of Curation 3.0, the jquery and datatable js files are loaded twice, both in minified and non-minified form.
from the page source:
...
<!-- #wrapper-footer-full -->
</div>
<script type="text/javascript" src="/view/fundament/vendor/jquery/jquery.min.js"></script>
<script type="text/javascript" src="/view/fundament/js/fundament.min.js"></script>
<script type="text/javascript" src="/view/js/dropzone.min.js"></script>
<script type="text/javascript" src="/view/js/curate.js"></script>
<script type="text/javascript" charset="utf8" src="https://cdn.datatables.net/1.10.19/js/jquery.dataTables.js"></script>
<script type="text/javascript" src="https://code.jquery.com/jquery-3.3.1.js"></script>
<script type="text/javascript" src="https://cdn.datatables.net/1.10.19/js/jquery.dataTables.min.js"></script>
<script type="text/javascript" src="https://cdn.datatables.net/fixedheader/3.1.5/js/dataTables.fixedHeader.min.js"></script>
...
Note that there is both src="https://code.jquery.com/jquery-3.3.1.js"
and src="/view/fundament/vendor/jquery/jquery.min.js"
and both src="https://cdn.datatables.net/1.10.19/js/jquery.dataTables.js"
and src="https://cdn.datatables.net/1.10.19/js/jquery.dataTables.min.js"
The separate application saves the results in a database, and curation-module fetches the URLs and the results from this database. The separate application should throttle the requests to the different servers so that no single server receives too many requests.
It is currently found on google sheets here: https://docs.google.com/spreadsheets/d/18EyqXjL5-e7tc0kpvTHQNaG5ObXr_WNIfdvrcJRiTAg/edit?usp=sharing
This table should be added into the faq.
A small 'i' or '?' icon next to each section, which expands when hovered upon to explain what the section is about.
When validating a record and generating a report, there are two flags under the facet section with different colors:
⚑ - Derived Facet
⚑ - Value Mapping
These don't seem to be used in the following table, so what do they mean?
The output of the facet mapping tool (written by @menzowindhouwer) is still adding value to the curation module for the detailed mapping description it can provide for a given profile. However it is unclear for how long this tool will be maintained or running in its current location. Moreover it is inconvenient for users to have to use two tools with overlapping functionality.
Sample of useful output that (as far as I know) cannot be obtained from the Curation Module:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>VLO mapping for profile talkbank-session (clarin.eu:cr1:p_1393514855466)</title>
</head>
<body>
<h1><a href="index.html">VLO mapping</a> for profile talkbank-session (<a href="http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1393514855466/xml">clarin.eu:cr1:p_1393514855466</a>)
</h1>
<dl>
<dt>Facet: id</dt>
<dd>
<dl></dl>
</dd>
<dt>Facet: _selfLink</dt>
<dd>
<dl></dl>
</dd>
<dt>Facet: collection</dt>
<dd>
<dl></dl>
</dd>
<dt>Facet: projectName</dt>
<dd>
<dl>
<dt>Matched CMD Element ConceptLink: <a href="http://hdl.handle.net/11459/CCR_C-2536_13fc5f10-c14a-1f64-a669-32736f6d3ef5">http://hdl.handle.net/11459/CCR_C-2536_13fc5f10-c14a-1f64-a669-32736f6d3ef5</a></dt>
<dd>
<dl>
<dt>/c:CMD/c:Components/c:talkbank-session/c:Session/c:MDGroup/c:Project/c:Name/text()</dt>
<dd>xpath accepted</dd>
</dl>
</dd>
<dt>Matched CMD Element ConceptLink: <a href="http://hdl.handle.net/11459/CCR_C-2537_fa206273-223a-f4fa-dde3-ba59b965701f">http://hdl.handle.net/11459/CCR_C-2537_fa206273-223a-f4fa-dde3-ba59b965701f</a></dt>
<dd>
<dl>
<dt>/c:CMD/c:Components/c:talkbank-session/c:Session/c:MDGroup/c:Project/c:Title/text()</dt>
<dd>xpath accepted</dd>
</dl>
</dd>
</dl>
</dd>
<dt>Facet: name</dt>
<dd>
<dl>
<dt>Matched CMD Element ConceptLink: <a href="http://hdl.handle.net/11459/CCR_C-2544_3626545e-a21d-058c-ebfd-241c0464e7e5">http://hdl.handle.net/11459/CCR_C-2544_3626545e-a21d-058c-ebfd-241c0464e7e5</a></dt>
<dd>
<dl>
<dt>/c:CMD/c:Components/c:talkbank-session/c:Session/c:Name/text()</dt>
<dd>xpath accepted</dd>
</dl>
</dd>
<dt>Matched CMD Element ConceptLink: <a href="http://hdl.handle.net/11459/CCR_C-2545_d873f2ab-2a2f-29d6-a9ab-260cde57f227">http://hdl.handle.net/11459/CCR_C-2545_d873f2ab-2a2f-29d6-a9ab-260cde57f227</a></dt>
<dd>
<dl>
<dt>/c:CMD/c:Components/c:talkbank-session/c:Session/c:Title/text()</dt>
<dd>xpath accepted</dd>
</dl>
</dd>
</dl>
</dd>
...
I just tried
and got
HTTP Status 500 - java.lang.ClassCastException: eu.clarin.cmdi.curation.report.ErrorReport cannot be cast to eu.clarin.cmdi.curation.report.CMDInstanceReport
type Exception report
message java.lang.ClassCastException: eu.clarin.cmdi.curation.report.ErrorReport cannot be cast to eu.clarin.cmdi.curation.report.CMDInstanceReport
description The server encountered an internal error that prevented it from fulfilling this request.
exception
javax.servlet.ServletException: java.lang.ClassCastException: eu.clarin.cmdi.curation.report.ErrorReport cannot be cast to eu.clarin.cmdi.curation.report.CMDInstanceReport
org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:434)
org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:372)
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:389)
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:342)
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:229)
org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
root cause
java.lang.ClassCastException: eu.clarin.cmdi.curation.report.ErrorReport cannot be cast to eu.clarin.cmdi.curation.report.CMDInstanceReport
eu.clarin.rest.CurationRestService.assessInstance(CurationRestService.java:26)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:74)
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:247)
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:388)
org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:346)
org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:337)
org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
org.glassfish.jersey.internal.Errors.process(Errors.java:315)
org.glassfish.jersey.internal.Errors.process(Errors.java:297)
org.glassfish.jersey.internal.Errors.process(Errors.java:267)
org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:280)
org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:316)
org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1084)
org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:418)
org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:372)
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:389)
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:342)
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:229)
org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
note The full stack trace of the root cause is available in the Apache Tomcat/8.5.3 logs.
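The root cause in the trace above is a cast from ErrorReport to CMDInstanceReport at CurationRestService.java:26. A minimal sketch of a guarded cast follows. The types below are simplified stand-ins written for this illustration, not the project's real classes, and the status codes are assumptions chosen for the example:

```java
// Stand-in types for illustration only; the real report classes live in
// eu.clarin.cmdi.curation.report and have a different shape.
interface Report {
    String getName();
}

final class ErrorReport implements Report {
    private final String name;

    ErrorReport(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}

final class CMDInstanceReport implements Report {
    private final String name;

    CMDInstanceReport(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}

final class GuardedCast {
    // Picks a response status by inspecting the report type instead of casting blindly.
    static int statusFor(Report report) {
        if (report instanceof CMDInstanceReport) {
            return 200; // safe to handle as an instance report
        }
        if (report instanceof ErrorReport) {
            return 422; // curation failed; hypothetical choice of status code
        }
        return 500; // unknown report type
    }
}
```

With a check like this, a failed curation would reach the client as a readable error instead of an HTTP 500 caused by a ClassCastException.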
The current validation is not namespace-aware, which allows invalid CMD records to pass through unnoticed. Take for example:
The cmdi:Components child elements are all members of the default namespace with URI http://www.openarchives.org/OAI/2.0/. This is not possible in a CMDI record, so this record should be considered invalid.
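A minimal sketch of how such a case could be detected with a namespace-aware DOM parser. The class and method names are hypothetical, and the check only looks for elements in the OAI namespace below a Components element; it is not a full schema validation:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NamespaceCheck {
    static final String OAI_NS = "http://www.openarchives.org/OAI/2.0/";

    // Returns true if any element below a "Components" element is in the OAI-PMH namespace.
    static boolean hasOaiNamespacedComponents(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // DOM parsers are not namespace-aware by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches a Components element in any namespace
            NodeList components = doc.getElementsByTagNameNS("*", "Components");
            for (int i = 0; i < components.getLength(); i++) {
                Element element = (Element) components.item(i);
                if (element.getElementsByTagNameNS(OAI_NS, "*").getLength() > 0) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse record", e);
        }
    }
}
```

A record for which this method returns true could then be reported as invalid.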
Blue font on a black background is hardly visible, especially during presentations. The color scheme should therefore be changed for better contrast.
Email from Florian Schiel:
But when testing CMDI instances nested in a hierarchy I encountered the following (conceptional?) problem:
Each CMDI instance is tested by the module in isolation. Why is this a problem?
Consider for example a 2-level hierarchy of metadata: on the first level (corpus level) the metadata of a complete collection of resources is stored as in [1]; on the second level (linked from the first level as resources of type 'Metadata') the metadata of a single resource is stored as in [2]. To avoid massive replication, metadata that concerns all members of the collection is stored only on the first level, for example availability.
When analysing a single CMD instance of the second level, we can't find this information in the CMDI. But what we find is a pointer to the upper level, namely the IsPartOf entry in the CMDI header.
So, I guess my questions are:
The HTML rendering of, e.g., the link checking is great. I would find it even easier if the data were available as JSON or XML as this would facilitate analysis in case of large numbers by, e.g. grouping by message rather than code only. Given that the HTML seems to be incrementally loaded, just downloading that is also not an option.
(Maybe I am missing something but I do not find an address to ask a question to, so this seems to be the preferred mode of communication?)
Directly under or next to the collection name.
So when clicking on Ok on TLA, it should say x amount of links to show that this collection has this many links with this category,
For example on this page: https://curate.acdh.oeaw.ac.at/statistics/The_Language_Archive/Ok next to "The Language Archive", it can say Total:xxx
Both projects share some functionality:
Float values in generated reports use a comma as the decimal separator. However, a point is the default in most programming languages. For compatibility with other programming languages and plotting packages, e.g. when analysing the quality of collections, I advise using a point as the decimal separator.
Since float formatting in Java is bound to the locale, it can be changed by specifying another locale (e.g. 'US').
See a stackoverflow post on that issue:
http://stackoverflow.com/questions/4553633/java-float-formatting-depends-on-locale
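As a minimal sketch of the locale fix: passing an explicit locale to String.format makes the output independent of the JVM default. The class name and the four-digit precision are assumptions for illustration; Locale.ROOT is used here, and Locale.US would give the same result for this pattern:

```java
import java.util.Locale;

final class ScoreFormat {
    // Formats a value with a point as decimal separator, regardless of the default locale.
    static String format(double value) {
        return String.format(Locale.ROOT, "%.4f", value);
    }
}
```

Without the explicit locale argument, the same call on a JVM whose default locale is, for example, German would produce a comma.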
Currently the subproject curation-module-web uses version 7 of vaadin. Since this version is only supported until the beginning of 2019 (see https://vaadin.com/faq »Are Vaadin 7 and 8 still relevant?«) we should think about an upgrade of vaadin to the current version 10 over the course of the current year.
The latest version of the VLO implements the concept of value mappings, which extends the concept of uniform maps and might replace it in the long run. This new concept has to be reflected in the curation-module.
It is first mentioned here: #53
2.) The Header Section contains a score per ID. Sometimes it matches the score in CMD Profile Report, sometimes it seems to be a difference of 1 to this score?
e.g. https://curate.acdh.oeaw.ac.at/collection/IDS_Repository.html
clarin_eu_cr1_p_1455633534543
Score: 1.65
https://curate.acdh.oeaw.ac.at/profile/clarin_eu_cr1_p_1455633534543.html
Total: 2.65 Max: 3.00
Could this be an error or does the score in the Collection report refer to a different score? If so what score?
This seems to be a bug. In fact it seems, in collection header section, it is one less than profile report score. So somewhere an extra 1 point is being added.
See the following screenshot of https://curate.acdh.oeaw.ac.at/profile/table rendered in Firefox:
Some global page layout issues:
Reducing the window size amplifies these issues:
Using column based layout of Bootstrap or some other tried and tested layout framework would be the best way to tackle this.
I tried to assess a profile and received a java null pointer exception
I was assessing https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1447674760337/xsd
and received the following error message.
Error while curating profile from https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1447674760337/xsd! java.lang.NullPointerException
at eu.clarin.cmdi.curation.report.CMDProfileReport.calculateScore(CMDProfileReport.java:90)
at eu.clarin.cmdi.curation.processor.AbstractProcessor.process(AbstractProcessor.java:17)
at eu.clarin.cmdi.curation.entities.CurationEntity.generateReport(CurationEntity.java:32)
at eu.clarin.cmdi.curation.main.CurationModule.processCMDProfile(CurationModule.java:31)
at eu.clarin.web.views.ResultView.curate(ResultView.java:83)
at eu.clarin.web.views.ResultView.enter(ResultView.java:64)
at com.vaadin.navigator.Navigator.navigateTo(Navigator.java:616)
at com.vaadin.navigator.Navigator.navigateTo(Navigator.java:573)
at eu.clarin.web.views.CurationForm.curate(CurationForm.java:55)
at eu.clarin.web.views.CurationForm.lambda$new$61446b05$2(CurationForm.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.vaadin.event.ListenerMethod.receiveEvent(ListenerMethod.java:508)
at com.vaadin.event.EventRouter.fireEvent(EventRouter.java:198)
at com.vaadin.event.EventRouter.fireEvent(EventRouter.java:161)
at com.vaadin.server.AbstractClientConnector.fireEvent(AbstractClientConnector.java:1008)
at com.vaadin.ui.Button.fireClick(Button.java:377)
at com.vaadin.ui.Button$1.click(Button.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.vaadin.server.ServerRpcManager.applyInvocation(ServerRpcManager.java:158)
at com.vaadin.server.ServerRpcManager.applyInvocation(ServerRpcManager.java:118)
at com.vaadin.server.communication.ServerRpcHandler.handleInvocations(ServerRpcHandler.java:408)
at com.vaadin.server.communication.ServerRpcHandler.handleRpc(ServerRpcHandler.java:273)
at com.vaadin.server.communication.UidlRequestHandler.synchronizedHandleRequest(UidlRequestHandler.java:79)
at com.vaadin.server.SynchronizedRequestHandler.handleRequest(SynchronizedRequestHandler.java:41)
at com.vaadin.server.VaadinService.handleRequest(VaadinService.java:1409)
at com.vaadin.server.VaadinServlet.service(VaadinServlet.java:364)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:108)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:522)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79)
at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:620)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:1096)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:760)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1480)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:745)
see XML
implement normalization step in the core component
The Curation Module provides a lot of useful information. The results could benefit from a link to documentation. Here are a number of examples:
1.) The Collection Report gives an “Average Score: xx out of 15”
How is the number 15 motivated? Where can one find what exactly needs to be improved?
2.) The Header Section contains a score per ID. Sometimes it matches the score in CMD Profile Report, sometimes it seems to be a difference of 1 to this score?
e.g. https://curate.acdh.oeaw.ac.at/collection/IDS_Repository.html
clarin_eu_cr1_p_1455633534543
Score: 1.65
https://curate.acdh.oeaw.ac.at/profile/clarin_eu_cr1_p_1455633534543.html
Total: 2.65 Max: 3.00
Could this be an error or does the score in the Collection report refer to a different score? If so what score?
3.) To improve usefulness, more information on which entries or concepts are linked to facets and how/why would be very helpful.
e.g. How is the relation between the facets found in the Curation Module and the facets in this tool:
https://cmdi.clarin.eu/mapping/index.html#mapping
4.) A partial documentation of Cmd Component Section and Cmd Concept Section can be found in the specification: https://office.clarin.eu/v/CE-2016-0742-CLARINPLUS-D2_1.pdf but it would be much more comfortable to get the information on one page describing all features of the report.
Find a use for the history table. Maybe it could show the history of each link when the link is clicked on the link checker statistics page.
Some lists in the link checking statistics view have thousands of links. Therefore I propose adding a button to download a CSV report of the full list. This way, users can get the full list and run automatic tools on it if they wish.
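A minimal sketch of how one row of such a CSV export could be built. The class name and the choice of columns are hypothetical; the quoting follows the usual convention of wrapping fields that contain commas, quotes or line breaks in double quotes and doubling embedded quotes:

```java
import java.util.List;
import java.util.stream.Collectors;

final class CsvExport {
    // Quotes a field only when it contains a character that would break the row.
    static String escape(String field) {
        boolean needsQuoting = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the escaped fields of one link record into a single CSV line.
    static String toLine(List<String> fields) {
        return fields.stream().map(CsvExport::escape).collect(Collectors.joining(","));
    }
}
```

Writing the lines one by one to the response stream, rather than building the whole file in memory, would also fit collections with very many links.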
There needs to be a solution for huge data. Some pages have millions of links.
Formats:
Much value would be added to the quality assessment if linked resources could be checked for a number of properties:
The link checker currently rates URLs that don't support GET requests as "Broken Link". This is for example the case for every service URL of the WebLicht framework which has to be specified in the respective CMDI record.
Example-URL: http://clarinws.informatik.uni-leipzig.de:8080/clarinwebservices/frequency/11022/0000-0000-2435-C/frequencytcf04
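One possible way to avoid this misclassification is to treat responses that only reject the request method as undetermined rather than broken. The sketch below is an assumption-laden illustration: the category names are hypothetical, and it assumes such services answer with 405 (Method Not Allowed) or 501 (Not Implemented), which has not been verified for the WebLicht URLs:

```java
final class LinkStatus {
    enum Category { OK, BROKEN, UNDETERMINED }

    // Maps an HTTP status code to a link category.
    static Category categorize(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 400) {
            return Category.OK;
        }
        // The server answered but rejected the request method, so the link itself may be fine.
        if (httpStatus == 405 || httpStatus == 501) {
            return Category.UNDETERMINED;
        }
        return Category.BROKEN;
    }
}
```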
I'm going to deploy curation 3.0.1 this Friday (July 12th, 12 p.m.). This will include the following features:
It is in code but after the initial setting of 0, it is not calculated. Therefore it is always 0. Is it possible to calculate this by counting the number of links that have a mime type?
So I assume resource proxy links always have an expected mime type associated and this is extracted during report generation. And no other links have an expected mime type. If this assumption is true, I can calculate this by counting the number of links that have a mime type. Is this a good solution or is there another way to determine the number of resource proxy links (and average number of resource proxy links)?
In collection reports, there is a whole resource proxy section with the following statistics: total number of resource proxies and total number of resource proxies with MIME type. Therefore the number in the URL section is redundant. I will delete it from there. It was never calculated anyway, so it won't be missed.
Wolfgang mentioned that resource proxy section actually belongs to the url report section. So we can talk about incorporating it in there. But for now I'm deleting it from url report.
Some classes/code are either copied from or to the VLO project. To synchronize the evolution of both projects and to ensure that import validation and curation validation are handled in the same way, the curation module should use libraries from the VLO project (we may first have to repackage some classes of the VLO project into separate libraries).
Would it be possible to have (and somewhere in the application provide) a URL that leads to the XML report for a collection, profile or instance? This would be helpful for sharing purposes, e.g. in the context of the centre assessment.
Hi all,
when running an assessment of the profile clarin.eu:cr1:p_1447674760337 I receive a
java.lang.NullPointerException
The schema (CMDI 1.1) is
https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1447674760337/xsd
here the complete error message:
at eu.clarin.cmdi.curation.report.CMDProfileReport.getName(CMDProfileReport.java:77)
at eu.clarin.web.views.ResultView.curate(ResultView.java:101)
at eu.clarin.web.views.ResultView.enter(ResultView.java:64)
at com.vaadin.navigator.Navigator.navigateTo(Navigator.java:616)
at com.vaadin.navigator.Navigator.navigateTo(Navigator.java:573)
at eu.clarin.web.views.CurationForm.curate(CurationForm.java:55)
at eu.clarin.web.views.CurationForm.lambda$new$61446b05$2(CurationForm.java:30)
at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.vaadin.event.ListenerMethod.receiveEvent(ListenerMethod.java:508)
at com.vaadin.event.EventRouter.fireEvent(EventRouter.java:198)
at com.vaadin.event.EventRouter.fireEvent(EventRouter.java:161)
at com.vaadin.server.AbstractClientConnector.fireEvent(AbstractClientConnector.java:1008)
at com.vaadin.ui.Button.fireClick(Button.java:377)
at com.vaadin.ui.Button$1.click(Button.java:54)
at sun.reflect.GeneratedMethodAccessor112.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.vaadin.server.ServerRpcManager.applyInvocation(ServerRpcManager.java:158)
at com.vaadin.server.ServerRpcManager.applyInvocation(ServerRpcManager.java:118)
at com.vaadin.server.communication.ServerRpcHandler.handleInvocations(ServerRpcHandler.java:408)
at com.vaadin.server.communication.ServerRpcHandler.handleRpc(ServerRpcHandler.java:273)
at com.vaadin.server.communication.UidlRequestHandler.synchronizedHandleRequest(UidlRequestHandler.java:79)
at com.vaadin.server.SynchronizedRequestHandler.handleRequest(SynchronizedRequestHandler.java:41)
at com.vaadin.server.VaadinService.handleRequest(VaadinService.java:1409)
at com.vaadin.server.VaadinServlet.service(VaadinServlet.java:364)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:108)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:522)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79)
at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:620)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:349)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:1110)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:785)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1425)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:745)
see XML
The instance assessment form as well as the profiles and collections tables show a stack trace (StringIndexOutOfBoundsException) in the tooltip when hovering over any of the cells.
Links that ultimately redirect to the CLARIN discovery service can be considered to have restricted access, even though the response will consist of a series of redirects ending with a 200 at discovery.clarin.eu.
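A minimal sketch of such a check, assuming the link checker knows the final URL of the redirect chain. The class and method names are hypothetical:

```java
import java.net.URI;

final class RestrictedAccess {
    static final String DISCOVERY_HOST = "discovery.clarin.eu";

    // True if the final URL of a redirect chain points at the CLARIN discovery service.
    static boolean endsAtDiscovery(String finalUrl) {
        String host = URI.create(finalUrl).getHost();
        return host != null && host.equalsIgnoreCase(DISCOVERY_HOST);
    }
}
```

A link for which this returns true could be reported as "restricted access" instead of being counted as a plain 200.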
Currently the web app does not show when a report was created (even though there is a timeStamp attribute in the report XML).
It would be nice to have this information visible in the web interface.
We started the development of curation 3.0 which will have the following features:
Replacement of vaadin
Besides the isolated cases of instance and profile analysis, the curation web app uses the Vaadin framework to display static content (reports generated twice per week as XML files) dynamically: most of the views are created at the moment a user accesses a page, although the displayed content is static. This approach wastes resources and time.
Hence we want the core module not only to generate the reports in xml format but also the HTML views (static pages for static content!). The two cases where we need to create the pages dynamically will be covered by a servlet which transforms XML (the report) to HTML, user-interaction like sorting and filtering is done by jquery, layout by CSS.
Optimization of memory usage
Currently the curation-core module needs between 2 and 4 GB of heap space while generating the collection reports, since it accumulates the information of every single CMDI instance in memory in order to generate the collection report in a final iteration. With some redesign we can pass the required information of each instance directly to the collection report, which would reduce memory usage dramatically.
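The idea of passing each instance's information directly to the collection report can be sketched as a running aggregate that keeps only counters instead of all instance reports. The class below is a hypothetical illustration limited to a score average; the real collection report tracks more values:

```java
final class CollectionAccumulator {
    private long instances;
    private double scoreSum;
    private double maxScore = Double.NEGATIVE_INFINITY;

    // Called once per CMDI instance; the instance report can be discarded afterwards.
    void add(double instanceScore) {
        instances++;
        scoreSum += instanceScore;
        maxScore = Math.max(maxScore, instanceScore);
    }

    long count() {
        return instances;
    }

    double average() {
        return instances == 0 ? 0.0 : scoreSum / instances;
    }

    double max() {
        return maxScore;
    }
}
```

Memory use then stays constant per collection, independent of the number of instances.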
Establishment of multi-threading on the Java-level
In collection mode, the current version of curation-core takes the path to a single collection directory as its input parameter and generates a single collection report by analyzing all files in that directory. This means the program has to be called once per collection, which in our case is done by a shell script. Multi-threading is provided by the shell script, which runs a configurable number of processes, and, for large collections (>10000 files), by stream parallelization in Java.
In curation 3.0 the multi-threading will be established in a configurable way on the Java level.
This means that curation-core will no longer process one single collection but all collections descending from a given root.
This approach also has the advantage that it enables curation-core to generate an overview of all collection results, as needed for the collections view, without having to re-read the collection reports from the file system.
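A minimal sketch of configurable multi-threading on the Java level, using a fixed thread pool from java.util.concurrent. The class name, the representation of a collection as a String and the task function are assumptions for illustration; the actual design of curation 3.0 may differ:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelCollections {
    // Runs the task for every collection on a pool with a configurable number of threads
    // and returns the results in input order.
    static <R> Map<String, R> processAll(List<String> collections, int threads,
                                         Function<String, R> task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<R>> futures = new LinkedHashMap<>();
            for (String collection : collections) {
                Callable<R> job = () -> task.apply(collection);
                futures.put(collection, pool.submit(job));
            }
            Map<String, R> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<R>> entry : futures.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get()); // waits for completion
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because all results end up in one map, an overview of all collections can be produced without reading the individual reports back from disk.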
Currently only URLs starting with http are processed.
Add in configuration handle server url: http://hdl.handle.net/
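A minimal sketch of how a configured handle server URL could be used to make handle links checkable. It assumes that handles appear with an "hdl:" prefix, which is not stated in this issue; the class and method names are hypothetical:

```java
import java.util.Optional;

final class HandleResolver {
    private static final String HANDLE_PREFIX = "hdl:"; // assumed notation for handles

    // Returns a URL the link checker can request, or empty if the link is not supported.
    static Optional<String> toCheckableUrl(String link, String handleServerUrl) {
        if (link.startsWith("http://") || link.startsWith("https://")) {
            return Optional.of(link);
        }
        if (link.startsWith(HANDLE_PREFIX)) {
            return Optional.of(handleServerUrl + link.substring(HANDLE_PREFIX.length()));
        }
        return Optional.empty();
    }
}
```

The handleServerUrl argument would come from the configuration, e.g. http://hdl.handle.net/.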
Example of profile with 0 concept links but ~65% facet coverage: https://catalog.clarin.eu/ds/ComponentRegistry#/?itemId=clarin.eu%3Acr1%3Ap_1337778924932&registrySpace=public
(due to xpath in facet mapping -> potential match assumed)
Chromosome Example: clarin.eu:cr1:p_1337778924932
https://lux17.mpi.nl/isocat/clarin/vlo/mapping/index.html
HTML-version via the REST-API
Expose the XSL-stylesheets
=> make a printable simple html page (via content negotiation or xslt processing instruction)
Add filters in tables where reasonable - collection name, profile name
To be able to explore the instance data of given collection for further inspection, it would be nice to have a link in the collection report pointing to the appropriate query in the VLO.
(However not sure if there is a reliable matching on any vlo-facet - first candidate would be the vlo:collection facet)
Consider: https://curate.acdh.oeaw.ac.at/statistics/IDS_Repository/307; the first link is:
http://hdl.handle.net/10932/00-03FE-203B-D2BD-4801-9, which currently gives the following information:
Message: HTTP entity too large to be buffered in memory
Expected Content Type:
Content Type: text/html;charset=utf-8
Byte Size: 0
Request Duration(ms): 36
Method: N/A
Timestamp: 2020-03-04 17:20:59.0
This information is dubious:
Expected Content Type? If nothing is expected, state it; otherwise, it looks like an error.
Content Type is the one provided by the first URL; at the end of the redirect chain, application/zip is provided.
Method is not helpful.
Maybe it would be possible to modify the checker to report more conspicuously as follows:
Message: (No diagnosis performed on content. HTTP entity too large to be buffered in memory)
Expected Content Type: (no expectation)
Content Type: application/zip
Byte Size: N/A, due to download size
Request Duration(ms): 36
Method: <SOMETHING_USEFUL>
Timestamp: 2020-03-04 17:20:59.0
For Content Type, one might also consider the chain of redirects:
Content Type: ["text/html;charset=utf-8"; NONE; "application/zip"]
I had a look at https://curate.acdh.oeaw.ac.at/'#!ResultView/statistics//Slovenian_language_resource_repository_CLARIN_SI/404
as https://curate.acdh.oeaw.ac.at/#!ResultView/collection//Slovenian_language_resource_repository_CLARIN_SI currently reports a stunning "Total number of broken links: 421".
Yet for all the URLs from the 404 list that I checked, the URL is correct and working.
The link checker currently has no statistics file to read:
"Error while reading the statistics file:java.nio.file.NoSuchFileException: /usr/local/curation-module/reports/statistics/linkCheckerStatistics.html"
Example URL: https://curate.acdh-dev.oeaw.ac.at/#!Link%20Checking%20Statistics
The curation module says for the XML validation section:
Invalid Records: https_talar_sfb833_uni_tuebingen_de_8443_erdora_rest_SFB833_A02_Gedichte_20Emily_20Dickinson_Gedichtkorpus_20Emily_20Dickinson
This refers to the metadata available at https://talar.sfb833.uni-tuebingen.de:8443/erdora/displaycmdi?path=%2FSFB833%2FA02%2FGedichte+Emily+Dickinson%2FGedichtkorpus+Emily+Dickinson%2FFID_129.cmdi.xml
However, this file is valid - at least according to two XML parsers.
It would be helpful to provide the complete validation error - if any - to verify that there is a problem.
Many records have a (mapped) title that is not very descriptive or even friendly to the human reader. For example, the otherwise excellent records from the Leipzig Corpora Collection have names like ukr_newscrawl_2011_1M and LCC data provider "www.elkhabar.com" in resource with identifier 11022/0000-0000-7F4F-B. These values come from a 'name' field; there is no additional title. Adding one would be the solution, but this issue is about identifying such cases.
I don't have an exact solution but it would be nice if the name/title, as shown in the VLO, could be scored according to some heuristic analysis that could involve the length of the title, number of characters, relative number of spaces, relative number of non-alphabetical characters. Perhaps such a measure already exists. I would not make this a big part of the overall score but it would be nice as a soft warning as part of the metadata quality report.
If I come across existing methods for scoring title quality for human readability, I will share them in a comment on this issue.
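To make the suggested heuristic concrete, here is a rough sketch combining the signals mentioned above: title length, relative number of spaces and relative number of non-alphabetical characters. The thresholds (20 characters, 10% spaces) and the equal weighting are arbitrary assumptions, not an established measure:

```java
final class TitleHeuristic {
    // Returns a score in [0, 1]; higher values suggest a more readable title.
    static double score(String title) {
        String t = title.trim();
        if (t.isEmpty()) {
            return 0.0;
        }
        int length = t.length();
        long spaces = t.chars().filter(Character::isWhitespace).count();
        long nonAlpha = t.chars()
                .filter(c -> !Character.isLetter(c) && !Character.isWhitespace(c))
                .count();
        double lengthScore = Math.min(1.0, length / 20.0);               // very short titles score low
        double spaceScore = Math.min(1.0, ((double) spaces / length) / 0.10); // no spaces suggests an identifier
        double alphaScore = 1.0 - (double) nonAlpha / length;            // digits and punctuation lower the score
        return (lengthScore + spaceScore + alphaScore) / 3.0;
    }
}
```

With these assumptions, an identifier-like name such as ukr_newscrawl_2011_1M scores lower than a spelled-out title of similar length, which is the kind of soft warning described above.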
In the collection report we find:
modality | 0.5663
Ratio valid Records: 0.9880
Average rate of populated elements: 0.9835%
I would suggest changing all of these to proper percentages with reasonable precision, for example in these cases 56.6%, 98.8% and 98.4%
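A minimal sketch of such a conversion, assuming the report values are ratios between 0 and 1. The class name is hypothetical; BigDecimal is used so that rounding to one decimal place is done on the decimal representation rather than on binary floating point:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class Percent {
    // Converts a ratio such as 0.5663 into a percentage string with one decimal place.
    static String of(double ratio) {
        BigDecimal percent = new BigDecimal(Double.toString(ratio))
                .movePointRight(2)                 // 0.5663 -> 56.63
                .setScale(1, RoundingMode.HALF_UP); // 56.63 -> 56.6
        return percent.toPlainString() + "%";
    }
}
```

For the three values quoted above this yields 56.6%, 98.8% and 98.4%.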