yahooarchive / oozie
Oozie - workflow engine for Hadoop
Home Page: http://yahoo.github.com/oozie/
License: Apache License 2.0
This is in response to an Oracle Bug 9577583: False ORA-942 or other errors when multiple schemas have identical object names.
http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html
Currently a workflow XML is the 'workflow.xml' file under the HDFS directory specified in the job property 'oozie.wf.application.path'.
This means that a given HDFS directory can have only one workflow app (the workflow.xml file).
In many cases it is desirable to share configurations and binaries among multiple workflow apps.
Today this is not possible.
Proposal:
1* If 'oozie.wf.application.path' points to an HDFS directory, the workflow app is 'workflow.xml' (today's behavior).
2* If 'oozie.wf.application.path' points to an XML file in HDFS, the workflow app is the specified file path and the workflow app directory (for all resources and binaries) is the parent directory.
This proposal preserves backwards compatibility.
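Under this proposal, the two forms of the property would look like this in job.properties (host names and paths are illustrative):

```
# today's behavior: points to a directory, the app is <dir>/workflow.xml
oozie.wf.application.path=hdfs://namenode:8020/user/joe/apps/wf1

# proposed: points to an XML file; the parent dir holds shared resources/binaries
oozie.wf.application.path=hdfs://namenode:8020/user/joe/apps/wf2.xml
```

This lets several workflow XML files share a single directory of configuration and binaries.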
It is using "users"; it should use the method getTestGroup() instead.
According to Alejandro, these have been deprecated.
When I run it, I get:
The assembly file should take care of that
The following variables need to be redefined: VC_REV, VC_URL
Currently Oozie requires the JT and NN kerberos principals to be in the WF job properties when submitting a job.
Hadoop has built-in rules to create these principals (i.e. mapred/_HOST@${local.realm}).
Oozie should leverage those rules when the WF job properties do not include the JT/NN kerberos principals, thus not requiring them as mandatory WF job properties on WF submission.
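In Hadoop this resolution logic lives in SecurityUtil.getServerPrincipal; the self-contained sketch below only illustrates the _HOST substitution convention Oozie would rely on (the class and method names here are made up for the example, not Oozie or Hadoop API):

```java
// Illustrative sketch of the Hadoop principal-resolution rule: replace the
// _HOST placeholder in a principal such as "mapred/_HOST@EXAMPLE.COM" with
// the concrete service hostname. Not the actual Hadoop implementation.
public class PrincipalResolver {
    static final String HOST_PLACEHOLDER = "_HOST";

    public static String resolve(String principalConfig, String hostname) {
        // a principal has the shape name/host@REALM
        String[] parts = principalConfig.split("[/@]");
        if (parts.length != 3 || !parts[1].equals(HOST_PLACEHOLDER)) {
            return principalConfig; // no placeholder, use the value as-is
        }
        return parts[0] + "/" + hostname.toLowerCase() + "@" + parts[2];
    }

    public static void main(String[] args) {
        // prints mapred/jt01.example.com@EXAMPLE.COM
        System.out.println(resolve("mapred/_HOST@EXAMPLE.COM", "jt01.example.com"));
    }
}
```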
In workflow.xml, when the following line is used:
/tmp/a.jar#a.jar
the jar is added to the distributed cache ("mapred.cache.files") only.
But this jar also needs to be on the Java classpath ("java.class.path").
The correct way of doing this would be using an SPNEGO filter on the server side.
Ideally authentication should be pluggable, allowing support for cookie-based auth, certs, etc.
Script defines a function and therefore fails on systems where /bin/sh is not bash.
Suggest:
--- setup-maven.sh.orig 2010-09-08 16:10:20.000000000 -0700
+++ setup-maven.sh 2010-09-07 16:14:39.000000000 -0700
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
Provide client support for Oozie bundles, provide corresponding support in servlets,
and further provide a bundle engine similar to DagEngine and CoordinatorEngine.
The parent issue of bundles is GH-49 [http://github.com/yahoo/oozie/issues/#issue/49].
There is a 'git info' type tool here: http://justamemo.com/2009/02/09/git-info-almost-like-svn-info/ which may help.
There is some deprecated code that can be removed, such as OozieSchema.java and Schema.java.
Oozie should use fully qualified (schema-qualified) names for the database objects.
Add support for Hive actions in workflows.
This would be via a new action executor and an extension schema.
The generated Oozie WAR does not include the Hadoop JARs.
Add a build option that would force the inclusion of the Hadoop JARs used for building Oozie.
Default behavior should be the current one (no Hadoop JARs in the WAR).
The invocation of the Oozie CLI uses an invalid $EXECCLASS variable.
It should be removed.
Current examples don't work against a cluster running Kerberos.
Also, the setup of the examples (what is done by the prepare script) is convoluted and confuses users.
Currently Oozie will materialize a coordinator job right after job submission, even if the job will only run in the far future.
We need to modify CoordJobMatLookupCommand so that it also checks the materialization start time (set to the job's start time for a newly submitted job) against the current time. Only if it falls into a valid range (say, within one hour in the future) do we proceed with materialization.
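The proposed check can be sketched as follows (class and method names are illustrative, not the actual CoordJobMatLookupCommand API; the one-hour window is the value suggested above):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the lookup-window check: only materialize a newly submitted
// coordinator job if its start time falls within a configured lookahead
// window of the current time.
public class MatLookupCheck {
    // one hour of lookahead, as suggested in the issue
    static final Duration LOOKAHEAD = Duration.ofHours(1);

    public static boolean shouldMaterialize(Instant jobStartTime, Instant now) {
        // materialize only if the start time is not more than LOOKAHEAD ahead
        return !jobStartTime.isAfter(now.plus(LOOKAHEAD));
    }
}
```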
Current test users are '${user.name}, test, test2, test3' and the current test group is 'testg'.
Many test cases (166) fail if the test user used for Oozie is not ${user.name}.
For example, default values for the test users should be testuser1, testuser2, testuser3, testuser4 and the test group 'testgroup1', with users 2 & 3 belonging to it.
Methods in XTestCase should be renamed to be aligned with the default values.
Currently the main POM contains the internal repo reference (a local dir) for plugins only; it should be there for artifacts also.
It would be good to have the ability to supply a comma-separated list of jars in an 'archive' tag instead of putting each jar on a new line. The Hadoop distributed cache allows listing a comma-separated list of files.
This can be done with a new MapReduceMain class.
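The parsing step is straightforward; a minimal sketch (the class name is made up for the example, not part of Oozie) that splits such a value the way Hadoop accepts comma-separated file lists:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: split a comma-separated archive/file list into
// individual entries, trimming whitespace and dropping empty items.
public class ArchiveListParser {
    public static List<String> split(String value) {
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }
}
```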
Oozie currently has two levels of abstraction: workflow and coordinator.
This issue proposes another abstraction called 'bundle' that will batch a set of coordinator applications. The user will be able to start/stop/suspend/resume/rerun in the bundle level.
The proposed high-level requirements to support bundle are enumerated below:
<bundle-app name="MY_BUNDLE" xmlns="uri:oozie:bundle:0.1">
  <controls>
    <kick-off-time>2009-02-02T00:00Z</kick-off-time>
  </controls>
  <coordinator>
    <configuration>
      <property>
        <name>START_TIME</name>
        <value>2009-02-01T00:00Z</value>
      </property>
      ...
    </configuration>
    <app-path>hdfs:${NAME_NODE}/tmp/bundle-apps/coordinator1.xml</app-path>
  </coordinator>
  <coordinator>
    <configuration>
      <property>
        <name>END_TIME</name>
        <value>2010-02-01T00:00Z</value>
      </property>
      ...
    </configuration>
    <app-path>hdfs:${NAME_NODE}/tmp/bundle-apps/coordinator1.xml</app-path>
  </coordinator>
</bundle-app>
The servlets receiving a job submission/rerun should resolve values with variables to their concrete values before proceeding with the submission.
For example, with b=A defined, the value
a=${b}
should be resolved to
a=A
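A minimal sketch of such a resolution step (illustrative only, not Oozie's actual implementation; the class name is made up): substitute ${name} references with their concrete values from the submitted properties.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Resolve ${name} variable references in a property value against a map of
// known properties; unknown references are left untouched.
public class VarResolver {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    public static String resolve(String value, Map<String, String> props) {
        Matcher m = VAR.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // fall back to the literal ${name} text if the property is unknown
            String replacement = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```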
I ran a coordinator job under the current mode, then checked its status:
[chaow@pressglass examples]$ oozie job -info 0000007-100727102157647-oozie-chao-C
ID Created Nominal Time
...
0000007-......@2 2010-07-27 19:41 2010-07-27 19:40
...
We see that creation time is after nominal time, which is not correct.
Note here the cluster is not stressed at all - so actions should be created a bit earlier than the nominal time.
Create commands for the bundles logic. The parent issue of bundles is GH-49 [http://github.com/yahoo/oozie/issues/#issue/49].
The core/pom.xml has ${maven.compile.encoding} as the encoding, which produces a build warning.
Using UTF-8 instead removes the warning.
Currently every workflow that uses a pig action must bundle the Pig JAR in the workflow lib/ directory.
This is also true for commonly used JAR files across different workflow apps.
By adding a share lib job property, which is added as a secondary lib/ directory, all common JARs (Pig, Hive) can be placed in a /usr/share/lib directory in HDFS and used by multiple workflow applications without having a private copy per workflow app.
The location of HDFS share lib would be specified as job property (a later rev of the workflow XML schema may add support for it too)
According to the quickstart (http://yahoo.github.com/oozie/releases/2.2.0/DG_QuickStart.html), the distribution tar.gz contains an oozie.war file.
But http://github.com/yahoo/oozie/tarball/oozie-2.2.0 contains no such artifact in the tar.gz.
Worse, when I try to build with "mvn clean package" or "mvn clean package assembly:single", maven fails with:
[INFO] Building Oozie Core
[INFO] task-segment: [clean, package]
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] 'add-resource' was specified in an execution, but not found in the plugin
Has anyone else been able to use and/or build the 2.2.0 distribution?
The index.html file that launches oozie console references the RowExpander.js file from "ext-2.2/RowExpander.js". The correct location of this file however is "ext-2.2/examples/grid/RowExpander.js".
This causes the console to not show up correctly.
The main POM file contains references to snapshot repositories which are not needed to build Oozie.
The core and example POM have references to commons-cli 2.0 that are not needed.
The POMs of hadoop artifacts used by Oozie have references to commons-cli 2.0 which are not needed and they should be excluded.
Like for the Oozie version occurrences, add a comment next to the groupId, i.e.:
<groupId>com.yahoo.oozie</groupId> <!-- OOZIE_GROUP_ID -->
This annotation enables easy replacement of the value via scripting, as is already done with the version value.
To make Git ignore Maven, Eclipse, IntelliJ, Structure101 and other build files/dirs.
POMs have groupId and version parameterized with a property from the main POM.
The same should be done for the artifactIds to enable use of JARs available under alternate artifact names.
As with groupId and version, the default values should remain as today.
The following code in the PriorityDelayQueue.java class:

public int compareTo(Delayed o) {
    return (int) (getDelay(TimeUnit.MILLISECONDS) - o.getDelay(TimeUnit.MILLISECONDS));
}

should be replaced by something like this:

public int compareTo(Delayed o) {
    long diff = getDelay(TimeUnit.MILLISECONDS) - o.getDelay(TimeUnit.MILLISECONDS);
    if (diff > 0) {
        return 1;
    } else if (diff < 0) {
        return -1;
    } else {
        return 0;
    }
}
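The cast-based version is wrong because truncating a long difference to int can flip or zero out the sign. This self-contained snippet demonstrates the overflow (the demo class is made up for illustration; Long.compare behaves like the expanded if/else fix):

```java
public class CompareOverflowDemo {
    // the buggy pattern: truncate a long difference to int
    public static int buggyCompare(long a, long b) {
        return (int) (a - b);
    }

    // overflow-safe, equivalent to the corrected if/else version
    public static int fixedCompare(long a, long b) {
        return Long.compare(a, b);
    }

    public static void main(String[] args) {
        long a = 0L;
        long b = -(1L << 32); // difference is 2^32, which truncates to 0
        System.out.println(buggyCompare(a, b)); // prints 0: claims equal, wrong
        System.out.println(fixedCompare(a, b)); // prints 1: a > b, correct
    }
}
```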
One workaround would be to run the tests as test user and have the environment setup correctly for that. The other solution would be to by default use the current user for test purposes, and overwrite that to another user where necessary.
Currently, the way the minicluster is set up requires test users to exist in the UNIX environment. This is quite inconvenient, since it requires those who would like to run unit tests to play the role of system administrator on the system: adding test users, adding a test group, and mapping the test users to the test group.
According to the Hadoop development team, there's a better way of achieving the same based on UserGroupInformation.createUserForTesting.
All we have to do is set up the test users in XTestCase the same way we're setting up other aspects of the minicluster.
If the user selects cleanup but there is no "output-events" element in the coordinator XML, the code will throw a NullPointerException (NPE):

private void cleanupOutputEvents(Element eAction, String user, String group) {
    Element outputList = eAction.getChild("output-events", eAction.getNamespace());
    for (Element data : (List) outputList.getChildren("data-out", eAction.getNamespace())) {

Line 3 will throw the NPE.
Solution: guard the loop with a null check:

if (outputList != null) {
    for (Element data : (List) outputList.getChildren("data-out", eAction.getNamespace())) {
        ......
    }
}
mvn install from http://svn.apache.org/repos/asf/commons/sandbox/cli2/trunk creates an artifact with this info:
groupId: org.apache.commons
artifactId: commons-cli2
version: 2.0-SNAPSHOT
this page: http://yahoo.github.com/oozie/releases/2.2.0/DG_QuickStart.html
links to both http://yahoo.github.com/oozie/downloads (as displayed in the HTML) and http://yahoo.github.com/oozie/releases/2.2.0/Http://yahoo.github.com/oozie/downloads.html as the actual link target. Neither is valid.
Currently it requires the user to manually download and expand the extjs ZIP file.
addtowar.sh should also handle the case where the ZIP file is given directly.