
oozie's Issues

build-setup/setup-maven.sh uses bash-isms but references /bin/sh

Script defines a function and therefore fails on systems where /bin/sh is not bash.

Suggest:

--- setup-maven.sh.orig 2010-09-08 16:10:20.000000000 -0700
+++ setup-maven.sh      2010-09-07 16:14:39.000000000 -0700
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
 
 # Licensed to the Apache Software Foundation (ASF) under one
 # or more contributor license agreements. See the NOTICE file
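For illustration, a minimal sketch of the portability issue (hypothetical demo script; it assumes the bashism in question is the bash-only `function` keyword): the POSIX `name() { ... }` form runs under any /bin/sh, while `function name { ... }` does not.

```shell
# Hypothetical demo: write a function using the portable POSIX form and run it
# under plain sh. The bash-only 'function greet {' spelling would fail on
# systems where /bin/sh is dash or another strictly POSIX shell.
cat > demo.sh <<'EOF'
greet() { echo "portable"; }
greet
EOF
sh demo.sh
```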

unix users/groups for testcases should be fixed and normalized

Current test users are '${user.name}, test, test2, test3' and the current test group is 'testg'.

Many test cases (166) fail if the test user used for Oozie is not ${user.name}.

For example, default values for the test users could be testuser1, testuser2, testuser3, testuser4, with a test group 'testgroup1' to which users 2 and 3 belong.

Methods in XTestCase should be renamed to align with the default values.

Annotate Oozie group IDs in the POMs

As with the Oozie version occurrences, add a comment next to each groupId, i.e.:

<groupId>com.yahoo.oozie</groupId> <!-- OOZIE_GROUP_ID -->

This annotation enables easy replacement of the value via scripting, as is already done with the version.
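A minimal sketch of such a scripted replacement (the file name, the new groupId value, and GNU sed's in-place flag are assumptions for illustration):

```shell
# Create a sample POM fragment carrying the annotation (hypothetical file).
printf '<groupId>com.yahoo.oozie</groupId> <!-- OOZIE_GROUP_ID -->\n' > pom-fragment.xml

# Rewrite every annotated groupId in one pass; unannotated groupIds are untouched.
sed -i 's|<groupId>[^<]*</groupId> <!-- OOZIE_GROUP_ID -->|<groupId>org.example.oozie</groupId> <!-- OOZIE_GROUP_ID -->|g' pom-fragment.xml

cat pom-fragment.xml
```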

update/simplify examples

Current examples don't work against a cluster running Kerberos.

Also, the setup of the examples (what the prepare script does) is convoluted and confuses users.

Implement better way of managing test users in unit tests when setting up mini-cluster

Currently, the way the minicluster is set up requires test users to exist in the UNIX environment. This is quite inconvenient, since anyone who wants to run the unit tests must play the role of a system administrator: adding test users, adding a test group, and mapping the test users to the test group.

According to the Hadoop development team, there is a better way of achieving the same result, based on UserGroupInformation.createUserForTesting.

All we have to do is set up the test users in XTestCase the same way we set up the other aspects of the minicluster.

leverage Hadoop rules for JT/NN Kerberos principal resolution

Currently Oozie requires the JT and NN Kerberos principals to be in the WF job properties when submitting a job.

Hadoop has built-in rules to create these principals (i.e. mapred/_HOST@${local.realm}).

Oozie should leverage those rules when the WF job properties do not include the JT/NN Kerberos principals, thus not requiring them as mandatory WF job properties on WF submission.
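A rough sketch of the resolution rule described above (the realm value and the use of `hostname -f` are stand-ins for illustration; Hadoop's actual rule substitutes the service host for the _HOST placeholder):

```shell
# Derive a JT/NN principal when none is given in the WF job properties,
# following the mapred/_HOST@${local.realm} pattern.
service="mapred"
host=$(hostname -f)        # stands in for the _HOST substitution
realm="EXAMPLE.COM"        # stands in for ${local.realm}
principal="${service}/${host}@${realm}"
echo "$principal"
```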

2.2.0 distribution download contains no WAR and won't build?!

According to the quickstart (http://yahoo.github.com/oozie/releases/2.2.0/DG_QuickStart.html), the distribution tar.gz contains an oozie.war file.

But http://github.com/yahoo/oozie/tarball/oozie-2.2.0 contains no such artifact in the tar.gz.

Worse, when I try to build with "mvn clean package" or "mvn clean package assembly:single", maven fails with:

[INFO] Building Oozie Core
[INFO] task-segment: [clean, package]
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] 'add-resource' was specified in an execution, but not found in the plugin

Has anyone else been able to use and/or build the 2.2.0 distribution?

Oozie bundles commands

Create commands for the bundles logic. The parent issue for bundles is GH-49 (http://github.com/yahoo/oozie/issues/#issue/49).

parameterize hadoop/pig artifactIds in the POMs

POMs have groupId and version parameterized with a property from the main POM.

The same should be done for the artifactIds to enable use of JARs available under alternate artifact names.

As with groupId and version, the default values should remain as today.

Enable build to include Hadoop JARs in oozie.war

The generated Oozie WAR does not include the Hadoop JARs.

Add a build option that would force the inclusion of the Hadoop JARs used for building Oozie.

Default behavior should be the current one (no Hadoop JARs in the WAR).

actions should not be materialized after nominal time for current mode jobs

I ran a coordinator job in current mode, then checked its status:

[chaow@pressglass examples]$ oozie job -info 0000007-100727102157647-oozie-chao-C

ID Created Nominal Time
...
0000007-......@2 2010-07-27 19:41 2010-07-27 19:40
...

We see that the creation time is after the nominal time, which is not correct.
Note that the cluster is not stressed at all, so actions should be created a bit earlier than the nominal time.

Oozie should not materialize a coordinator job right after its submission if the job will only run in the far future.

Currently Oozie will materialize a coordinator job right after job submission, even if the job will only run in the far future.

We need to modify CoordJobMatLookupCommand so that it also checks the materialization start time (set to the job's start time for a newly submitted job) against the current time. Only if it falls into a valid range (say, within one hour in the future) do we proceed with materialization.
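A hedged sketch of the proposed check (the one-hour window is the example value from this issue, not a confirmed default, and the job start time is hypothetical):

```shell
# Decide whether a newly submitted coordinator job should be materialized now.
window_secs=3600                  # example window: within one hour in the future
now=$(date +%s)
job_start=$((now + 7200))         # hypothetical job start time: two hours ahead

if [ $((job_start - now)) -le "$window_secs" ]; then
    decision="materialize now"
else
    decision="defer materialization"
fi
echo "$decision"
```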

Oozie bundle client support + servlet + bundle engine

Provide client support for Oozie bundles, provide corresponding support in servlets,
and further provide a bundle engine similar to DagEngine and CoordinatorEngine.

The parent issue of bundles is GH-49 [http://github.com/yahoo/oozie/issues/#issue/49].

Add Hive action

Add support for Hive actions in workflows.

This would be via a new action executor and an extension schema.

Add support for coordinator jobs submitted to run in the far future

The following code in the PriorityDelayQueue class:

public int compareTo(Delayed o) {
    return (int) (getDelay(TimeUnit.MILLISECONDS) - o.getDelay(TimeUnit.MILLISECONDS));
}

casts the long difference to int, which can overflow for delays far in the future. It should be replaced by something like this:

public int compareTo(Delayed o) {
    long diff = getDelay(TimeUnit.MILLISECONDS) - o.getDelay(TimeUnit.MILLISECONDS);
    if (diff > 0) {
        return 1;
    } else if (diff < 0) {
        return -1;
    } else {
        return 0;
    }
}

coordinator rerun doesn't consider empty output-event

User experience:

If the user selects cleanup but there is no "output-events" element in the coordinator XML, the code will throw a NullPointerException (NPE):

private void cleanupOutputEvents(Element eAction, String user, String group) {
        Element outputList = eAction.getChild("output-events", eAction.getNamespace());
        for (Element data : (List) outputList.getChildren("data-out", eAction.getNamespace())) {

The getChildren call on the third line throws the NPE, because outputList is null.

Solution:

if (outputList != null) {
    for (Element data : (List) outputList.getChildren("data-out", eAction.getNamespace())) {
        ......
    }
}

Additional activities:

  1. Add a unit test.
  2. QA needs to add a test case for this scenario.

add support for a share lib directory in HDFS for action binaries.

Currently every workflow that uses a pig action must bundle the Pig JAR in the workflow lib/ directory.

This is also true for commonly used JAR files shared across different workflow apps.

By adding a share lib job property, which is added as a secondary lib/ directory, all common JARs (Pig, Hive) can be placed in a /usr/share/lib directory in HDFS and used by multiple workflow applications, without each workflow app needing a private copy.

The location of the HDFS share lib would be specified as a job property (a later revision of the workflow XML schema may add support for it too).

add support for multiple workflow XMLs in a single HDFS directory

Currently a workflow XML is the 'workflow.xml' file under the HDFS directory specified in the job property 'oozie.wf.application.path'.

This means that a given HDFS directory can have only one workflow app (the workflow.xml file).

In many cases it is desirable to share configurations and binaries among multiple workflow apps.

Today this is not possible.

Proposal:

1. If 'oozie.wf.application.path' points to an HDFS directory, the workflow app is 'workflow.xml' (today's behavior).
2. If 'oozie.wf.application.path' points to an XML file in HDFS, the workflow app is the specified file path and the workflow app directory (for all resources and binaries) is the parent directory.

This proposal preserves backwards compatibility.
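The two cases above can be sketched as follows (hypothetical helper; the real resolution would use the Hadoop FileSystem API rather than string handling):

```shell
# Resolve the workflow app file and app directory from 'oozie.wf.application.path'.
resolve_app() {
    case "$1" in
        *.xml) app_file="$1"; app_dir="${1%/*}" ;;          # case 2: XML file path
        *)     app_dir="$1";  app_file="$1/workflow.xml" ;; # case 1: directory (today's behavior)
    esac
    echo "$app_file $app_dir"
}

resolve_app "hdfs://nn/apps/shared/flow-a.xml"   # shared dir, named workflow XML
resolve_app "hdfs://nn/apps/my-app"              # plain directory
```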

Supporting bundle in oozie

Oozie currently has two levels of abstraction:

  1. Workflows, which execute a DAG of actions.
  2. Coordinators, which execute workflows periodically when a specified set of data directories becomes available.

This issue proposes another abstraction called a 'bundle', which will batch a set of coordinator applications. The user will be able to start/stop/suspend/resume/rerun at the bundle level.

The proposed high-level requirements to support bundles are enumerated below:

  1. This feature will allow the user to specify a list of coordinator applications in an XML file.
  2. The name of the bundle XML file is not hard-coded; the user can give the bundle file any name.
  3. The user will submit a bundle by specifying the bundle application path in a config file. An example command is: oozie job -run -config <bundle.properties>
  4. The bundle application path is defined in the config file as the property "oozie.application.bundle.path", whose value is the full path to the bundle XML in HDFS.
  5. The user can also submit a bundle job through the WS API.
  6. The user will be able to define variables/parameters for each coordinator application.
  7. All variables should be resolved during job submission. For any unresolved variable, Oozie will throw an exception.
  8. The user will be able to submit a bundle with a user-defined external ID, to avoid duplicate submissions in case of a timeout on the first submission.
  9. Oozie will not support any explicit dependencies among the coordinator XMLs in a bundle definition.
  10. Oozie will not support partial bundle submission.
  11. When a user submits a bundle, they will get a bundle ID to track it. Oozie will put the bundle job into the PREP state.
  12. The user will be able to start a bundle using the bundle ID. This will put the bundle job into the RUNNING state.
  13. The user will be able to combine submit and start into run, which will start the bundle immediately.
  14. The user will optionally be able to specify a kick-off time that determines when to start the bundle. The bundle will not run until the kick-off time is reached.
  15. The user will be able to query Oozie for the bundle's status through the CLI and WS API.
  16. The user will be able to query Oozie for all coordinator jobs the bundle started, through the CLI and WS API.
  17. The user will be able to kill a bundle by ID, which will kill all spawned coordinator jobs.
  18. The user will be able to suspend a bundle by ID, which will suspend all spawned coordinator jobs.
  19. The user will be able to pause a bundle by ID with a future time, which will pause all spawned coordinator jobs.
  20. The user will be able to resume a bundle by ID, which will resume all spawned coordinator jobs.
  21. Bundle rerun requirements are TBD.

This is a sample bundle XML :

<bundle-app name="MY_BUNDLE" xmlns="uri:oozie:bundle:0.1">

  <controls>
    <kick-off-time>2009-02-02T00:00Z</kick-off-time>
  </controls>

  <coordinator>
    <configuration>
      <property>
        <name>START_TIME</name>
        <value>2009-02-01T00:00Z</value>
      </property>
      ...
    </configuration>
    <app-path>hdfs:${NAME_NODE}/tmp/bundle-apps/coordinator1.xml</app-path>
  </coordinator>

  <coordinator>
    <configuration>
      <property>
        <name>END_TIME</name>
        <value>2010-02-01T00:00Z</value>
      </property>
      ...
    </configuration>
    <app-path>hdfs:${NAME_NODE}/tmp/bundle-apps/coordinator1.xml</app-path>
  </coordinator>
</bundle-app>

POMs cleanup, remove unneeded repositories, remove/exclude commons-cli 2.0

The main POM file contains references to snapshot repositories which are not needed to build Oozie.

The core and example POM have references to commons-cli 2.0 that are not needed.

The POMs of the Hadoop artifacts used by Oozie have references to commons-cli 2.0 which are not needed and should be excluded.
