openaire / iis

Information Inference Service of the OpenAIRE system

License: Apache License 2.0

Languages: Java 94.08%, Shell 0.11%, PigLatin 0.78%, Python 0.32%, Scala 0.27%, HTML 4.39%, HiveQL 0.05%
Topics: iis, openaire, information-inference, data-mining, text-mining, data-processing-system, big-data, hadoop, spark

iis's People

Contributors

anonymoususer110, dependabot[bot], johnfouf, lsmyrnaios, lukdumi, madryk, marekhorst, mateuszneumann, mkobos, mpol, przemyslawjacewicz, przemyslawjacewicz-icm, s6savahd, tasosgig

iis's Issues

Fix primary processing integration test

The test fails due to:

  • an invalid eu.dnetlib.iis.workflows.schemas.DocumentContentClasspath reference in the workflow.xml file
  • invalid resource paths in document_content_classpath.json and document_text_classpath.json

Both problems stem from the refactoring of IIS modules, in which package names were changed.

Update README files

Update README files after restructuring the project and moving it from SVN.

Add license

We should add information about the Apache 2.0 license to the code. This requires:

  1. adding two files, NOTICE and LICENSE, in the project's root directory,
  2. adding a short license header to all source files (although I'm not sure it's really necessary: [1] says that this "should" be done, in contrast to adding the NOTICE file, which according to [1] "must" be done),
  3. updating the README file with information about the license of the project.

Additional material on applying the Apache license to a project:

[1] http://www.apache.org/dev/apply-license.html#new
[2] http://blog.maestropublishing.com/2009/11/19/how-to-apply-the-apache-2-0-license-to-your-project/
[3] Take a look at how it was done in the avroknife project.

Move content from OpenAIRE technical wiki to GitHub

Move as much content as possible from the OpenAIRE technical wiki to Markdown documents in the source code and to the GitHub wiki. In particular:

  • Section describing structure and dependencies between Maven projects.
  • Section about Maven ~/.m2/settings.xml configuration.
  • Deployment tutorial.

Migrate diffs in iis-3rdparty-* modules from SVN repo

Our patched/hacked versions of 3rd party code should contain the original code along with the patch that was applied. This is how it was done in the SVN repo; however, it wasn't migrated to GitHub.

Remember to update README files in the updated Maven projects appropriately.

Fix CRC errors occurring when block size is not uniform between clusters

Currently, when copying an HBase sequence file dump from the CDH4 DM cluster to the CDH5 rumcajs cluster, the following exception is thrown:

Check-sum mismatch between hftp://namenode1.hadoop.dm.openaire.eu/tmp/infospace_export_production_compressed_BLOCK.seq/part-m-00182 and hdfs://spark-cluster-nn/user/mhorst/workflows/top/primary/main/working_dir/hbase_dump/.distcp.tmp.attempt_1438689608784_26671_m_000012_2. Source and target differ in block-size. Use -pb to preserve block-sizes during copy. Alternatively, skip checksum-checks altogether, using -skipCrc. (NOTE: By skipping checksums, one runs the risk of masking data-corruption during file-transfer.)

Skipping checksum verification doesn't seem like a good way to go, whereas preserving the larger block size for these large HBase dumps seems quite reasonable.
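
The error message itself points at the -pb option. A minimal sketch of assembling such a copy command in Java, assuming the copy is launched through the `hadoop distcp` command-line tool. The class name and both paths are hypothetical placeholders, not taken from the IIS code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of IIS: assembles a `hadoop distcp` command line
// that preserves the source block size, so that checksum verification can stay enabled.
class DistCpCommand {

    static List<String> build(String source, String target) {
        List<String> command = new ArrayList<>();
        command.add("hadoop");
        command.add("distcp");
        command.add("-pb"); // preserve block size, as suggested by the error message
        command.add(source);
        command.add(target);
        return command;
    }

    public static void main(String[] args) {
        // Placeholder locations; replace with the real dump and working directory.
        List<String> command = build(
                "hftp://source-namenode/path/to/dump",
                "hdfs://target-namenode/path/to/copy");
        System.out.println(String.join(" ", command));
    }
}
```

The sketch only builds and prints the command; how the workflow actually invokes the copy is not stated in this issue.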

Introduce more flexible testing consumer

The current TestingConsumer implementation is quite simple, and sometimes we need to be less strict when comparing Avro records.

One possible scenario is omitting some fields when comparing objects, e.g. the Fault datastore's timestamp, which changes on every run, or a stack trace containing loads of text we don't want to specify in a JSON file.
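
A minimal sketch of such a field-ignoring comparison, assuming records are represented as plain maps from field name to value. This is a stand-in for Avro records; the class name and the field names `code`, `timestamp` and `stackTrace` are hypothetical and not taken from TestingConsumer or the Fault schema:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not the actual TestingConsumer: compares two records
// while skipping a configurable set of top-level fields.
class LenientRecordComparator {

    // Returns a copy of the record without the ignored fields; the input is not modified.
    static Map<String, Object> withoutFields(Map<String, Object> record, Set<String> ignoredFields) {
        Map<String, Object> copy = new HashMap<>(record);
        copy.keySet().removeAll(ignoredFields);
        return copy;
    }

    // Two records match when all fields outside the ignored set are equal.
    static boolean equalIgnoring(Map<String, Object> expected, Map<String, Object> actual,
            Set<String> ignoredFields) {
        return withoutFields(expected, ignoredFields).equals(withoutFields(actual, ignoredFields));
    }

    public static void main(String[] args) {
        Map<String, Object> expected = new HashMap<>();
        expected.put("code", "java.io.IOException");

        Map<String, Object> actual = new HashMap<>();
        actual.put("code", "java.io.IOException");
        actual.put("timestamp", System.currentTimeMillis()); // differs on every run
        actual.put("stackTrace", "...long text...");

        Set<String> ignored = new HashSet<>(Arrays.asList("timestamp", "stackTrace"));
        System.out.println(equalIgnoring(expected, actual, ignored)); // prints true
    }
}
```

Ignoring fields by name keeps the expected JSON file short. Nested fields would need a path-based variant, which this sketch does not cover.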

HBase dump importer should handle ZooKeeper-related properties

Otherwise the import from an HBase dump fails on CDH5 when trying to connect to localhost/127.0.0.1:2181, which is an invalid ZooKeeper quorum location (set by default).

This problem does not show up on the CDH4 IIS cluster because the ZooKeeper properties are part of the environment there, so they are already set and the client doesn't need to provide them explicitly.
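
A minimal sketch of passing the ZooKeeper settings explicitly, assuming the importer receives them as job parameters. Here `java.util.Properties` stands in for the Hadoop/HBase configuration object, and the class and method names are hypothetical. The two property keys are the standard HBase client settings:

```java
import java.util.Properties;

// Hypothetical helper, not the actual importer code: copies ZooKeeper settings
// from job parameters into the client configuration instead of relying on defaults.
class ZookeeperSettings {

    static final String QUORUM = "hbase.zookeeper.quorum";
    static final String CLIENT_PORT = "hbase.zookeeper.property.clientPort";

    static Properties apply(Properties jobParams, Properties clientConf) {
        String quorum = jobParams.getProperty(QUORUM);
        if (quorum == null || quorum.trim().isEmpty()) {
            // Fail fast rather than silently falling back to the localhost default.
            throw new IllegalArgumentException("missing required property: " + QUORUM);
        }
        clientConf.setProperty(QUORUM, quorum.trim());

        String port = jobParams.getProperty(CLIENT_PORT);
        if (port != null && !port.trim().isEmpty()) {
            clientConf.setProperty(CLIENT_PORT, port.trim());
        }
        return clientConf;
    }

    public static void main(String[] args) {
        Properties jobParams = new Properties();
        jobParams.setProperty(QUORUM, "zk1.example.org,zk2.example.org"); // placeholder hosts
        jobParams.setProperty(CLIENT_PORT, "2181");
        System.out.println(apply(jobParams, new Properties()));
    }
}
```

Failing fast on a missing quorum makes the misconfiguration visible before the client tries to connect to localhost.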

Improve running Oozie packages from Maven

Implement according to the conclusions reached in #26:

  • remove application-default.properties
  • run a single test with: mvn integration-test -Dtest=eu.dnetlib...TestClass
  • remove the deploy-local and run-local profiles (all workflows are deployed over SSH)
  • add a parameter for deploying workflows: -DconnectionProperties=/path/to/cluster/properties
  • integration tests use the cluster configured in ~/.iis/integration-tests.properties by default; this can be overridden with the -DconnectionProperties property: mvn integration-test -DconnectionProperties=/path/to/cluster/properties
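
The lookup order described in the last item could be sketched as follows. The class and method names are hypothetical, and the sketch assumes the override arrives as a JVM system property named `connectionProperties`:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not the actual IIS build code: picks the connection
// properties file, preferring an explicit override over the per-user default.
class ConnectionPropertiesLocator {

    static Path resolve(String override, String userHome) {
        if (override != null && !override.trim().isEmpty()) {
            return Paths.get(override.trim());
        }
        // Default location: ~/.iis/integration-tests.properties
        return Paths.get(userHome, ".iis", "integration-tests.properties");
    }

    public static void main(String[] args) {
        Path path = resolve(
                System.getProperty("connectionProperties"), // set via -DconnectionProperties=...
                System.getProperty("user.home"));
        System.out.println("Using connection properties from: " + path);
    }
}
```

Keeping the resolution in one small function makes the precedence rule easy to test without touching the real home directory.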

Create branch for CDH5 cluster compatibility

This branch should be based on the IIS-CDH-5.3.0 SVN branch.

New patches may have to be applied because of further changes in the IIS code and the CDH5 cluster upgrade to CDH-5.4.2.

Add contributors file

Add a file with a list of people who have contributed to the project's source code. This is especially important for people who contributed before the migration to GitHub, because their contributions are no longer visible in the commits (although they should still be visible in the source code, provided that they signed the code they created).

We have a document summarizing the development of IIS in the OpenAIRE project. It contains a list of the people involved and their areas of involvement, so we can use that.
