openaire / iis
Information Inference Service of the OpenAIRE system
License: Apache License 2.0
mvn clean integration-tests MyTest
mvn clean package -Poozie_package,deploy,run -DconnectionProperties=path_to_file
Removing deploy-local and run-local
The properties in these 2 files should be described.
Find out how people do it, prepare a short summary.
Test fails due to:
eu.dnetlib.iis.workflows.schemas.DocumentContentClasspath reference in the workflow.xml file
document_content_classpath.json and document_text_classpath.json
Both are related to the IIS modules refactoring in which package names were changed.
Before rewriting them to spark.
instead of MiniOozie
The diagram should be slightly smaller since the size of the font used in the diagram is larger than the font of surrounding text and this is not recommended.
instead of using OozieClient
Update README files after restructuring the project and moving it from SVN.
We should add information about the Apache 2.0 license to the code. This requires NOTICE and LICENSE in the project's root directory, NOTICE file which "must" be done)
Additional material, information about applying the Apache license to your project, available on the web:
[1] http://www.apache.org/dev/apply-license.html#new
[2] http://blog.maestropublishing.com/2009/11/19/how-to-apply-the-apache-2-0-license-to-your-project/
[3] take a look how it was done in the avroknife project.
We will not be using it anymore.
Remove any code that is related to it.
Move as much of the content of OpenAIRE technical wiki to markdown documents in the source code and to GitHub wiki as possible. In particular:
~/.m2/settings.xml
configuration.
Another fix provided by Ioannis:
https://issue.openaire.research-infrastructures.eu/issues/1549#note-21
should be applied.
Versions of dependencies should be defined in parent poms (iis or iis-workflows) in dependencyManagement sections. After that we should get rid of properties with version numbers from iis/pom.xml.
The PMC metadata importer fails when empty XML document content is provided.
Originally reported in:
https://issue.openaire.research-infrastructures.eu/issues/1619
With the pig node changed to spark.
We should fix all build errors occurring when building packages and performing both unit and integration tests.
Our patched/hacked versions of 3rd party code should contain the original code along with the patch that was applied. This is how it was done in the SVN repo; however, it wasn't migrated to GitHub.
Remember to update README files in the updated Maven projects appropriately.
Currently, when copying an HBase sequence file dump from the CDH4 DM cluster to the CDH5 rumcajs cluster, the following exception is thrown:
Check-sum mismatch between hftp://namenode1.hadoop.dm.openaire.eu/tmp/infospace_export_production_compressed_BLOCK.seq/part-m-00182 and hdfs://spark-cluster-nn/user/mhorst/workflows/top/primary/main/working_dir/hbase_dump/.distcp.tmp.attempt_1438689608784_26671_m_000012_2. Source and target differ in block-size. Use -pb to preserve block-sizes during copy. Alternatively, skip checksum-checks altogether, using -skipCrc. (NOTE: By skipping checksums, one runs the risk of masking data-corruption during file-transfer.)
Skipping checksum verification does not seem to be a good way to go, whereas preserving the larger block size for these large HBase dumps seems quite reasonable.
The current TestingConsumer implementation is quite simple, and sometimes we need to be less strict when comparing Avro records.
One possible scenario is omitting some of the fields when comparing objects, e.g. the Fault datastore's timestamp, which changes at each run, or stacktrace, which contains loads of text we don't want to specify in a JSON file.
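The idea of a less strict comparison could be sketched as follows. This is only an illustration under stated assumptions: LenientRecordComparator and equalsIgnoring are hypothetical names, not part of TestingConsumer or IIS, and records are modelled as plain maps (field name to value) so that the sketch runs without Avro on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of IIS. Records are modelled as plain maps
// (field name -> value) instead of real Avro records.
final class LenientRecordComparator {

    private LenientRecordComparator() {}

    // Returns true when both records agree on every field
    // except the ones listed in ignoredFields.
    static boolean equalsIgnoring(Map<String, Object> expected,
                                  Map<String, Object> actual,
                                  Set<String> ignoredFields) {
        // Work on copies so the caller's records are left untouched.
        Map<String, Object> left = new HashMap<>(expected);
        Map<String, Object> right = new HashMap<>(actual);
        left.keySet().removeAll(ignoredFields);
        right.keySet().removeAll(ignoredFields);
        return left.equals(right);
    }
}
```

With such a helper, the expected JSON file would only need to describe the stable fields, while volatile ones such as timestamp or stacktrace are passed in the ignored set.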
Otherwise, import from the HBase dump fails on CDH5 when trying to connect to localhost/127.0.0.1:2181, which is an invalid ZooKeeper quorum location (set by default).
This problem does not show up on the CDH4 IIS cluster because ZooKeeper properties are part of the environment there, which means they are already set and clients do not need to provide them explicitly.
Without it, line endings in *.sh scripts (like upload_workflow.sh) on Windows are CRLF, which causes errors when the scripts are executed on Unix-like systems.
iis-3rdparty-avro-json/eclipse-classes/ should be added to .gitignore.
Code:
Logger log = LoggerFactory.getLogger(getClass());
log.info("log info");
does not log any info in Eclipse when running an integration test.
Implement according to conclusions made in #26
application-default.properties
mvn integration-test -Dtest=eu.dnetlib...TestClass
deploy-local, run-local profiles (all workflows are deployed by ssh)
-DconnectionProperties=/path/to/cluster/properties
~/.iis/integration-tests.properties, but can be overridden by the -DconnectionProperties property: mvn integration-test -DconnectionProperties=/path/to/cluster/properties
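The lookup order described above (a default file under ~/.iis that a -DconnectionProperties value overrides) might be resolved along these lines. This is a minimal sketch, not the actual IIS code: ConnectionPropertiesLocator and its methods are hypothetical names, and only the property name and the default path come from the text.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of IIS: picks the connection properties file.
final class ConnectionPropertiesLocator {

    // System property name taken from the -DconnectionProperties switch.
    static final String OVERRIDE_PROPERTY = "connectionProperties";

    private ConnectionPropertiesLocator() {}

    // Pure variant: the override wins when it is set and non-blank,
    // otherwise the default file in the user's home directory is used.
    static Path resolve(String override, String userHome) {
        if (override != null && !override.trim().isEmpty()) {
            return Paths.get(override.trim());
        }
        return Paths.get(userHome, ".iis", "integration-tests.properties");
    }

    // Convenience variant reading the JVM system properties.
    static Path resolve() {
        return resolve(System.getProperty(OVERRIDE_PROPERTY),
                       System.getProperty("user.home"));
    }
}
```

Keeping the decision in a method that takes plain strings makes the precedence rule easy to unit test without touching the real home directory.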
This branch should be built based on the IIS-CDH-5.3.0 svn branch.
New patches will possibly have to be applied due to further changes in IIS code and the CDH5 cluster version upgrade to CDH-5.4.2.
IIS depends on snapshot versions of certain modules. This is not only dangerous but also a big issue in the case of mvn release.
The hbase-client dependency is not provided by the cluster environment; therefore, it has to become part of the oozie workflow package.
Originally reported in:
https://issue.openaire.research-infrastructures.eu/issues/1415
In this task we should introduce Fault datastore creation when any exception occurs during PMC metadata extraction. This was already done for PDF metadata extraction (cermine).
This improves on simple logging, which is not visible enough and cannot be treated as a subject for further processing and in-depth analysis.
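A rough sketch of the approach: instead of only logging, each failure is turned into a record that can be stored and analysed later. Everything here is an assumption made for illustration: FaultCapturingExtractor, the nested Fault class and its fields are hypothetical stand-ins, not the actual IIS Fault schema or the PMC importer API.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not the IIS implementation.
final class FaultCapturingExtractor {

    // Stand-in for a Fault datastore record; field names are assumptions.
    static final class Fault {
        final String documentId;
        final long timestamp;
        final String message;
        final String stackTrace;

        Fault(String documentId, Exception cause) {
            this.documentId = documentId;
            this.timestamp = System.currentTimeMillis();
            this.message = cause.getMessage();
            StringWriter buffer = new StringWriter();
            cause.printStackTrace(new PrintWriter(buffer, true));
            this.stackTrace = buffer.toString();
        }
    }

    private FaultCapturingExtractor() {}

    // Runs the extractor on every document (id -> content). Successful results
    // are returned; failures are appended to faults instead of aborting the run.
    static <T> List<T> extractAll(Map<String, String> documents,
                                  Function<String, T> extractor,
                                  List<Fault> faults) {
        List<T> results = new ArrayList<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            try {
                results.add(extractor.apply(doc.getValue()));
            } catch (Exception e) {
                faults.add(new Fault(doc.getKey(), e));
            }
        }
        return results;
    }
}
```

The collected faults could then be written to their own output next to the extracted metadata, so a single broken document does not stop the whole workflow.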
Add file with a list of people that have contributed to project's source code. This is especially important for the people that have contributed before the migration to GitHub because their contribution is not visible in commits any more (although it should still be visible in the source code provided that they signed the code they created).
We have a document summarizing the development of IIS in the OpenAIRE project; it contains a list of the people involved and their areas of involvement, so we can use that.