RevolutionAnalytics / rmr2
A package that allows R developers to use Hadoop MapReduce
Continuing from issue RevolutionAnalytics/RHadoop#167
Was RevolutionAnalytics/RHadoop#164 in the old repo
I tried to run the rmr2 tutorial on CDH4. I compiled R from source and made it executable by all users. I can now run the WordCount Java code, but I am stuck on the rmr2 package, because I saw this message first:
Error: java.lang.RuntimeException: Error in configuring object
I wonder how I can debug the rmr2 package; any thoughts?
Here are details.
Source code:
[cloudera@localhost ~]$ cat test.rmr.R
library(rmr2)
small.ints = to.dfs(1:1000)
mapreduce(
input = small.ints,
map = function(k, v) cbind(v, v^2))
Rscript is installed on /usr/bin:
[cloudera@localhost ~]$ ls -l /usr/bin/Rscript
lrwxrwxrwx 1 root root 37 Apr 1 15:19 /usr/bin/Rscript -> /home/cloudera/software/R/bin/Rscript
[cloudera@localhost ~]$ ls -l /home/cloudera/software/R/bin/Rscript
-rwxr-xr-x 1 cloudera cloudera 17730 Apr 1 10:23 /home/cloudera/software/R/bin/Rscript
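A note while reading the listing above: the -rwxr-xr-x mode on Rscript itself looks fine, but the exec happens as the task user (typically yarn or mapred on CDH4), which also needs the execute (traverse) bit on every directory leading to the file, and a home directory like /home/cloudera is often mode 700. As a sketch (check_path_perms is a throwaway helper written here, not part of rmr2; /usr/bin/env just stands in for whatever path you want to check):

```shell
#!/bin/sh
# List the mode of a file and of every directory above it.
# The task user needs 'x' on each directory, or running the file
# fails with error=13 (Permission denied) even if the file itself is 755.
check_path_perms() {
  p=$1
  while [ "$p" != "/" ]; do
    ls -ld "$p"
    p=$(dirname "$p")
  done
  ls -ld /
}

check_path_perms /usr/bin/env
```

On systems with util-linux, namei -m /home/cloudera/software/R/bin/Rscript prints the same chain in one call.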
Error message:
[cloudera@localhost ~]$ Rscript test.rmr.R
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
13/04/01 17:35:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/04/01 17:35:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning message:
In to.dfs(1:1000) : Converting to.dfs argument to keyval with a NULL key
13/04/01 17:35:29 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/Rtmps3CHWS/rmr-local-env406f381b22dd, /tmp/Rtmps3CHWS/rmr-global-env406f575cacdf, /tmp/Rtmps3CHWS/rmr-streaming-map406f54bdce9f] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.2.0.jar] /tmp/streamjob34376797830229279.jar tmpDir=null
13/04/01 17:35:31 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:33 INFO mapred.FileInputFormat: Total input paths to process : 1
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: number of splits:2
13/04/01 17:35:33 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
13/04/01 17:35:33 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/04/01 17:35:33 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1364677243840_0008
13/04/01 17:35:33 INFO client.YarnClientImpl: Submitted application application_1364677243840_0008 to ResourceManager at /0.0.0.0:8032
13/04/01 17:35:33 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1364677243840_0008/
13/04/01 17:35:33 INFO mapreduce.Job: Running job: job_1364677243840_0008
13/04/01 17:35:45 INFO mapreduce.Job: Job job_1364677243840_0008 running in uber mode : false
13/04/01 17:35:45 INFO mapreduce.Job: map 0% reduce 0%
13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.&lt;init&gt;(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more
13/04/01 17:36:00 - 17:36:27 INFO mapreduce.Job: [the identical stack trace repeats for failed task attempts attempt_1364677243840_0008_m_000000_0, attempt_1364677243840_0008_m_000001_1, attempt_1364677243840_0008_m_000000_1, attempt_1364677243840_0008_m_000001_2 and attempt_1364677243840_0008_m_000000_2]
13/04/01 17:36:36 INFO mapreduce.Job: map 50% reduce 0%
13/04/01 17:36:36 INFO mapreduce.Job: Job job_1364677243840_0008 failed with state FAILED due to: Task failed task_1364677243840_0008_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
13/04/01 17:36:37 INFO mapreduce.Job: Counters: 6
Job Counters
Failed map tasks=7
Launched map tasks=8
Other local map tasks=6
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=85605
Total time spent by all reduces in occupied slots (ms)=0
13/04/01 17:36:37 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Calls: mapreduce -> mr
Execution halted
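The root cause sits at the bottom of each trace: java.io.IOException: error=13, Permission denied means the operating system refused to exec Rscript, not that R crashed. A minimal sketch that reproduces the same errno with a throwaway script (the temp file here is hypothetical and unrelated to the job above):

```shell
#!/bin/sh
# error=13 (EACCES): the file exists and is readable, but may not be executed.
tmp=$(mktemp)
printf '#!/bin/sh\necho ok\n' > "$tmp"

chmod 644 "$tmp"                          # no execute bit
"$tmp" 2>&1 || echo "exit status: $?"     # shell reports Permission denied

chmod 755 "$tmp"                          # grant the execute bit
"$tmp"                                    # prints: ok

rm -f "$tmp"
```

The same EACCES is also returned when a parent directory lacks the traverse bit for the task user, which is worth checking when, as here, the file's own mode already looks correct.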
Picking up from RevolutionAnalytics/RHadoop#177
It does seem a bit complicated. Your data is structured, but it has two structures, so we need to deal with it as unstructured data, that is, to use lists as arguments to keyval. One way of doing that is to split the data frames by the key you want, as in
splitcars = split(mtcars, mtcars$mpg)
keyval(names(splitcars), splitcars)
The same for the other type of frames. One caveat is that if you have many different keys this is going to create lots of small data frames, which is very inefficient. rmr2 has the same limitation, which comes straight from R. I am not sure I understand the overall goals of your program, so if this doesn't do it please give me a bit more context.
The current setting of N records doesn't generalize from small to large records. What we want is to load as much data as possible in one shot, without running out of memory. For binary formats this is actually easier than a set number of records. The short-term approach could be to leave it as a number of lines for text formats and make it a number of bytes or MB for binary formats.
Started work with 7a03b29; still need to change the man pages.
The keyval.length defaults are not a good fit for vectorized reduce
Makes for less boilerplate code
File cleanup works, but the directory is left behind
It's not the number of calls, it's the number of records that go through any number of calls. In a way this doesn't help with the vectorization issue, because what we need to keep in check is the number of reduce calls, not the number of records, which can be big and yet not imply inefficiency if the number of distinct keys is small and the code vectorized. Unfortunately an efficient fix is not obvious, because the split into distinct keys happens inside the reduce.keyval function, which doesn't return the number of calls to reduce. Moving the counter call inside reduce.keyval would, it seems to me, violate separation of concerns (the keyval concept doesn't depend on any backend or any other part of mapreduce).
Hi,
I was trying to use the equijoin function of rmr 2.1. While using equijoin I encountered the error "rbind.fill must be a dataframe". Only a few of my reduce tasks were affected by this error.
I checked the code, and my suspicion was that there might be keys being passed with NA or null values; I checked, and that wasn't the case.
Then I even coerced my object to a data frame and tried running equijoin again; it still fails.
Can you help me figure out the possible other causes for this problem?
Thanks,
Mayank
Hi,
I was using the output.format argument for my mapreduce output with rmr 2.1.
The output file gets created fine if I specify my output format as
output.format = make.output.format("csv", sep=",")
but if I specify it as
output.format = make.output.format("csv", sep="|")
the last column of my data frame has a "\t" appended at the end of each row.
Is there a reason for the "\t" to be present?
Thanks,
Mayank
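For what it's worth, Hadoop streaming's text output writes key and value separated by a tab by default, so with a NULL key a trailing "\t" on each line is plausibly the streaming separator rather than something rmr appends; treat that explanation as a guess. Until the cause is confirmed, a workaround is to strip the trailing tab from the part files; the inline printf below stands in for a real part file, and the /out/part-00000 path in the comment is hypothetical:

```shell
#!/bin/sh
# Strip a trailing tab from every line of a streaming output file.
# In practice, replace the printf with something like:
#   hadoop fs -cat /out/part-00000
TAB=$(printf '\t')
printf "a|b|c${TAB}\nd|e|f${TAB}\n" | sed "s/${TAB}\$//"
```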
After reading http://www.slideshare.net/Hadoop_Summit/innovations-in-apache-hadoop-mapreduce-pig-hive-for-improving-query-performance (slide 19 in particular), I was reminded not only of the large performance advantages of map-side joins, but also that they have natural use cases in things like star schemas. Moreover, it seems like an rmr implementation shouldn't be all that difficult. One decision is whether we should hide it behind the regular equijoin interface as an implementation change, with at most an API hint to use the map-side algorithm, or add something to the API. The latter is less conservative and makes the API more complex, but it also allows doing the big-to-many-small joins typical of a star schema in one step, if all the small tables fit in memory, which allows skipping the persisting of intermediate results, one for each small table.
I am facing the following problem:
these two R commands:
library(rmr2)
small.ints = to.dfs(1:1000)
produce the following output:
library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
small.ints = to.dfs(1:1000)
sh: 1: /usr/local/hadoop: Permission denied
Warning message:
In to.dfs(1:1000) : Converting to.dfs argument to keyval with a NULL key
HADOOP_CMD is set to /usr/local/hadoop. The directory should be accessible; I even used chmod -R 777...
I use Hadoop v1.1.2, which resides in /usr/local/hadoop.
R is self-compiled v3.0.1 and stored in /usr/local/R.
Does anyone have good ideas, or can anyone advise on what I can try next?
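One thing to check: "sh: 1: /usr/local/hadoop: Permission denied" is the message a shell prints when asked to execute a directory, so a plausible cause is that HADOOP_CMD names the install directory instead of the launcher script. A hedged sketch for a typical Hadoop 1.x layout (both paths below are assumptions about the usual install; verify them against yours):

```shell
# HADOOP_CMD must name the hadoop executable, not the directory containing it.
export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
# rmr2 also reads HADOOP_STREAMING; for Hadoop 1.1.2 the jar usually lives here:
export HADOOP_STREAMING=/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.1.2.jar
```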
If backend.parameters is considered deprecated, what is the best practice for things like rmr-wide options? For example, a multi-purpose Hadoop cluster is unlikely to have its configuration optimized for R tasks. In particular, things like mapred.child.java.opts are likely to be set for Java jobs, allocating a large amount of memory to the JVM. mapreduce jobs need to either reduce the maximum heap space allocated to the JVM (128 MB is as low as I could go) or increase mapred.job.map.memory.mb, which is likely to make everyone else angry at you, since this is probably configured for your cluster's specific hardware to allow efficient distribution of tasks.
Is there, or should there be, another way to set rmr-wide Hadoop parameters (as opposed to job-specific parameters)?
Right now a big data object (returned by mapreduce when output is NULL) is an alternative to an explicit hdfs path, with some interesting properties (it is garbage collected). I would like to explore the possibility of turning this concept into a class, to unify support of different I/O possibilities and better hide the implementation. A big data object could be an explicit path, a temporary garbage-collected file or directory, a managed file which is refreshed when it is stale compared to its inputs and generating function, an HBase table, and so on. It would incorporate a notion of a format, which would no longer be handled separately.
Hi,
I was using the rmr2.1 package and while using the to.dfs I encountered the error below
"Converting to.dfs argument to keyval with a NULL key log4j:ERROR Could not find value for key log4j.appender."
This error is specific to rmr 2.1; I tried the same commands on rmr 2.0.2 and they run fine.
Is there any reason for this to fail in rmr2.1?
Also what could be a fix to solve this issue?
Thanks,
Mayank
partially picking up from RevolutionAnalytics/RHadoop#48
We need to file an issue upstream with the streaming folks, and even better, submit a patch. We need to remark how the Hive and streaming implementations of typedbytes have diverged, and that this is not good.
splitting this from issue #50
Picking up from RevolutionAnalytics/RHadoop#178. What you are trying to do is not supported. Please try with two separate inputs.
It is common practice to apply transformations to mapreduce programs that change the number and nature of the jobs involved, usually to minimize I/O while preserving the same function. This is done in Hive, Pig and Cascading, for example. In rmr it is a little more challenging because
On the positive side are the reflection capabilities of R, which allow inspecting the parse tree, for instance. A little example of what could be done is in a function optimize in the source, completely untested. The only optimization applied is to collapse a chain of mapreduce calls with a reduce only at the end into a single mapreduce job by composing the mappers.
According to the changes in 2.1, we cannot return NULL keys from the reduce function. Instead, we should use 0-length values. This works with the local backend but fails on hadoop:
WORKS:
rmr.options(backend = "local")
mapreduce(to.dfs(keyval(1:3, 1:3)), reduce = function(k, v) if (k %% 2) keyval(k, v) else keyval(integer(0), integer(0)))
WORKS:
rmr.options(backend = "hadoop")
mapreduce(to.dfs(keyval(1:2, 1:2)), reduce = function(k, v) if (k %% 2) keyval(k, v) else keyval(integer(0), integer(0)))
FAILS:
rmr.options(backend = "hadoop")
mapreduce(to.dfs(keyval(1:3, 1:3)), reduce = function(k, v) if (k %% 2) keyval(k, v) else keyval(integer(0), integer(0)))
opening on behalf of @everdark
Hi,
Recently I updated the package to 2.1.0 (from 1.3.1!) and found something unexpected in even the simplest form of a mapreduce job. Here it is:
test <- from.dfs(
mapreduce(
input=fname.sample,
map=function(.,obs) keyval(NULL,1),
input.format=make.input.format(format="csv", sep=","),
reduce=NULL,
combine=NULL
))
test
where fname.sample is a string indicating the path of a .csv file stored in Hadoop.
The result given was:
test
$key
NULL
$val
[1] 1 1
which is quite weird; it's hard for me to understand what's going on.
Why did the values get duplicated? The same thing happens in my code for a more serious job (where the key unnecessarily recycles...), which in turn makes the result quite unpredictable.
Does anyone have ideas on this issue?
The complete log in the R console is as follows.
packageJobJar: [/tmp/RtmpVpDPTq/rmr-local-env16ec54892904, /tmp/RtmpVpDPTq/rmr-global-env16ec49ffa7b3, /tmp/RtmpVpDPTq/rmr-streaming-map16ec5fd2c921, /tmp/hadoop-mis/hadoop-unjar6866723637623951400/] [] /tmp/streamjob7363209296587117900.jar tmpDir=null
13/03/15 00:09:31 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/15 00:09:32 INFO streaming.StreamJob: getLocalDirs(): [/data/hadoop/mapred/temp/]
13/03/15 00:09:32 INFO streaming.StreamJob: Running job: job_201302201645_2649
13/03/15 00:09:32 INFO streaming.StreamJob: To kill this job, run:
13/03/15 00:09:32 INFO streaming.StreamJob: /usr/local/hadoop/bin/hadoop job -Dmapred.job.tracker=s1dhd02.buyabs.corp:8021 -kill job_201302201645_2649
13/03/15 00:09:32 INFO streaming.StreamJob: Tracking URL: XXXXXX
13/03/15 00:09:33 INFO streaming.StreamJob: map 0% reduce 0%
13/03/15 00:09:43 INFO streaming.StreamJob: map 100% reduce 0%
13/03/15 00:09:45 INFO streaming.StreamJob: map 100% reduce 100%
13/03/15 00:09:45 INFO streaming.StreamJob: Job complete: job_201302201645_2649
13/03/15 00:09:45 INFO streaming.StreamJob: Output: /tmp/RtmpVpDPTq/file16ec62d597d8
13/03/15 00:09:52 WARN snappy.LoadSnappy: Snappy native library is available
13/03/15 00:09:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/15 00:09:52 INFO snappy.LoadSnappy: Snappy native library loaded
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:52 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 WARN snappy.LoadSnappy: Snappy native library is available
13/03/15 00:09:54 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/15 00:09:54 INFO snappy.LoadSnappy: Snappy native library loaded
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
13/03/15 00:09:54 INFO compress.CodecPool: Got brand-new decompressor
http://dumbotics.com/2009/06/08/multiple-outputs/
https://github.com/klbostee/dumbo/blob/master/dumbo/backends/streaming.py
possibly reuse some classes from the dumbo-related feathers project; also useful for https://github.com/RevolutionAnalytics/RHadoop/issues/4. Keep in mind, though, that all keys are raw type (bytes in Java) when using the native format.
I tried to run the rmr2 tutorial on CDH4. I compiled R from source and made it executable by all users. Now I can run the WordCount Java example, but I am stuck at the rmr2 package, because I saw this message first:
Error: java.lang.RuntimeException: Error in configuring object
Here are details.
Source code:
[cloudera@localhost ~]$ cat test.rmr.R
library(rmr2)
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
Rscript is installed in /usr/bin:
[cloudera@localhost ~]$ ls -l /usr/bin/Rscript
lrwxrwxrwx 1 root root 37 Apr 1 15:19 /usr/bin/Rscript -> /home/cloudera/software/R/bin/Rscript
[cloudera@localhost ~]$ ls -l /home/cloudera/software/R/bin/Rscript
-rwxr-xr-x 1 cloudera cloudera 17730 Apr 1 10:23 /home/cloudera/software/R/bin/Rscript
Error message:
[cloudera@localhost ~]$ Rscript test.rmr.R
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
13/04/01 17:35:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/04/01 17:35:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning message:
In to.dfs(1:1000) : Converting to.dfs argument to keyval with a NULL key
13/04/01 17:35:29 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/Rtmps3CHWS/rmr-local-env406f381b22dd, /tmp/Rtmps3CHWS/rmr-global-env406f575cacdf, /tmp/Rtmps3CHWS/rmr-streaming-map406f54bdce9f] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.2.0.jar] /tmp/streamjob34376797830229279.jar tmpDir=null
13/04/01 17:35:31 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/01 17:35:32 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/01 17:35:33 INFO mapred.FileInputFormat: Total input paths to process : 1
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: number of splits:2
13/04/01 17:35:33 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
13/04/01 17:35:33 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/04/01 17:35:33 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/04/01 17:35:33 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/04/01 17:35:33 WARN conf.Configuration: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
13/04/01 17:35:33 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
13/04/01 17:35:33 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/04/01 17:35:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1364677243840_0008
13/04/01 17:35:33 INFO client.YarnClientImpl: Submitted application application_1364677243840_0008 to ResourceManager at /0.0.0.0:8032
13/04/01 17:35:33 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1364677243840_0008/
13/04/01 17:35:33 INFO mapreduce.Job: Running job: job_1364677243840_0008
13/04/01 17:35:45 INFO mapreduce.Job: Job job_1364677243840_0008 running in uber mode : false
13/04/01 17:35:45 INFO mapreduce.Job: map 0% reduce 0%
13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:424)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more
13/04/01 17:36:00 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_0, Status : FAILED
13/04/01 17:36:12 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_1, Status : FAILED
13/04/01 17:36:14 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_1, Status : FAILED
13/04/01 17:36:23 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000001_2, Status : FAILED
13/04/01 17:36:27 INFO mapreduce.Job: Task Id : attempt_1364677243840_0008_m_000000_2, Status : FAILED
[the identical stack trace above is repeated for each failed attempt]
13/04/01 17:36:36 INFO mapreduce.Job: map 50% reduce 0%
13/04/01 17:36:36 INFO mapreduce.Job: Job job_1364677243840_0008 failed with state FAILED due to: Task failed task_1364677243840_0008_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
13/04/01 17:36:37 INFO mapreduce.Job: Counters: 6
Job Counters
Failed map tasks=7
Launched map tasks=8
Other local map tasks=6
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=85605
Total time spent by all reduces in occupied slots (ms)=0
13/04/01 17:36:37 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Calls: mapreduce -> mr
Execution halted
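The root cause in the trace is `Cannot run program "Rscript": error=13, Permission denied`: the streaming task runs as a different user (typically mapred or yarn, not cloudera), and that user must be able to traverse every directory on the way to the Rscript binary, not just execute the file itself. A small diagnostic sketch, assuming GNU stat on the cluster node; the paths in the comments are the ones from this report:

```shell
# Walk every directory component above a file and report any directory that
# is not executable by "others". Such a directory blocks other users (e.g.
# the mapred/yarn user running the streaming task) from reaching the file,
# which surfaces as "error=13, Permission denied".
check_exec() {
  dir=$(dirname "$1")
  while [ "$dir" != "/" ]; do
    mode=$(stat -c '%a' "$dir" 2>/dev/null) || mode=""
    case $mode in
      *[1357]) : ;;                       # others-execute bit is set
      *)       echo "missing o+x: $dir" ;;
    esac
    dir=$(dirname "$dir")
  done
}

# Usage against the paths in this report (run on the cluster node):
#   check_exec /home/cloudera/software/R/bin/Rscript
# Typical fix, assuming it is acceptable to open these directories up:
#   chmod o+x /home/cloudera /home/cloudera/software /home/cloudera/software/R
```

The home directory is the usual culprit: /home/cloudera is often mode 700, so the rwxr-xr-x bits on Rscript itself are never reached.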
see https://issues.apache.org/jira/browse/HADOOP-5528
use hashing to make it more user-friendly
With the current reduce interface this doesn't matter that much, because one can sort the values associated with a key in memory. If we revisit the idea of an iterator-type interface for reduce (for when the values are big and cannot be held in memory more than a few at a time), then this will go on the short track.

The reference to hashing above means the following. Since this binary partitioner is very low level and only allows specifying the keys as a number of bytes to consider or skip, it would be very hard to support complex keys and provide a user-friendly API. If we take two lists of primary and secondary keys, hash the former and then prepend the hash to the key, we can use this simple binary partitioner even with complex keys. The next hurdle, though, is ordering: the byte ordering that Java would perform is unlikely to be the correct ordering for the original key domain. An additional hurdle is the efficient implementation of all of this.

One wonders why the author of the patch for the above issue didn't use typedbytes serialization for this case, as he did with the multiplefileoutputformat. In that case the key is an ArrayList with the first element being the filename and the second the actual key. The same could have been done for primary and secondary keys, and we should consider submitting our own patch to make it that way.
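The hash-prepend scheme can be sketched in R; `make.composite.key` is a hypothetical helper for illustration, not part of rmr2 (digest is the same package rmr2 already depends on):

```r
library(digest)

# Sketch of the scheme described above: hash the primary key and prepend the
# hash, so a byte-oriented partitioner that only looks at a fixed-length
# prefix sends all records sharing a primary key to the same reducer, while
# the secondary key stays available for sorting within the group.
make.composite.key = function(primary, secondary) {
  h = digest(primary, algo = "crc32")   # fixed-length 8-hex-char prefix
  paste(h, secondary, sep = "\t")
}
```

Because the hash prefix has a fixed byte length, "partition on the first 8 bytes" works for arbitrarily complex primary keys; the ordering problem discussed above remains, since the hash destroys any natural order of the primary keys.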
The dumbo author comments
No grand reasons for it really, it just seemed sensible to keep it general/low-level. Writing a custom partitioner using typed bytes isn't hard though, e.g.:
picking up from RevolutionAnalytics/RHadoop#113
Hi,
I was looking into join and couldn't find an option for equijoin. Is there any way which would let me specify the output format for equijoin?
Thanks,
Mayank
it seems cmp is a remnant of a bygone version with only one use
When you use from.dfs to load a directory from Hadoop, rmr spawns a separate hadoop dfs -get task for each individual file. Instead it could use a single hadoop fs -getmerge, doing the same job more efficiently and drastically decreasing the JVM startup overhead.
Since the rmr2 format is referred to as "csv", shouldn't it actually call read.csv so that it has the expected default parameters? Of particular importance is comment.char = "", which I spent a surprising amount of time debugging before I finally noticed that rmr actually calls read.table. I think the documentation specifies somewhere that read.table is being called, but I still found it surprising that it's not calling read.csv.
Here is some code:
#!/usr/bin/env Rscript
#
# Based on Breen's Example 2: airline
#
library(rmr2)
# assumes 'airline' and airline/data exists on HDFS under user's home directory
hdfs.data.root = 'airline'
hdfs.data = file.path(hdfs.data.root, 'data')
# unless otherwise specified, directories on HDFS should be relative to user's home
hdfs.out.root = hdfs.data.root
hdfs.out = file.path(hdfs.out.root, 'out')
mapper.year.market.enroute_time = function(k, fields) {
  # Skip header line in csv formatted file
  if (!(as.character(fields[[1]]) == "Year")) {
    keyval(as.character(fields[[9]]), 1)
  }
}
reducer.year.market.enroute_time = function(key, vv) {
  # count values for each key
  keyval(key, sum(as.numeric(vv), na.rm = TRUE))
}
mr.year.market.enroute_time = function(input, output) {
  mapreduce(input = input,
            output = output,
            input.format = make.input.format("csv", sep = ","),
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time)
}
out = from.dfs(mr.year.market.enroute_time(hdfs.data, hdfs.out))
results.df = as.data.frame(out,stringsAsFactors=F )
colnames(results.df) = c('carrier', 'count')
print(results.df)
Here is a sample csv file:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2004,3,25,4,848,840,1241,1225,HA,1,N587HA,353,345,218,16,8,LAX,HNL,2556,4,11,0,,0,16,0,0,0,0
2004,3,25,4,1426,1425,2135,2140,HA,2,N592HA,309,315,402,-5,1,HNL,LAX,2556,10,17,0,,0,0,0,0,0,0
2004,3,25,4,1222,1220,1551,1605,HA,3,N583HA,329,345,192,-14,2,LAX,HNL,2556,4,13,0,,0,0,0,0,0,0
2004,3,25,4,2220,2225,524,525,HA,4,N583HA,304,300,400,-1,-5,HNL,LAX,2556,5,19,0,,0,0,0,0,0,0
2004,3,25,4,1016,1010,1431,1430,HA,7,N591HA,375,380,228,1,6,LAS,HNL,2762,3,24,0,,0,0,0,0,0,0
2004,3,25,4,2243,2250,617,615,HA,8,N584HA,334,325,434,2,-7,HNL,LAS,2762,7,13,0,,0,0,0,0,0,0
2004,3,25,4,1717,1725,2046,2110,HA,9,N584HA,329,345,196,-24,-8,LAX,HNL,2556,6,7,0,,0,0,0,0,0,0
Approximately half of the real rows of the file will be "missed" when processed by the script. There is no warning or error message.
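The silent row loss is consistent with read.table's defaults (comment.char = "#", quote = "\"'") swallowing records whose fields happen to contain those characters. A workaround sketch, assuming (as the report implies) that read.table-style arguments are passed through by make.input.format, is to disable comment and quote handling explicitly:

```r
library(rmr2)

# Workaround sketch: pass read.table-style arguments explicitly so that '#'
# and quote characters occurring inside fields cannot silently swallow rows.
airline.input.format = make.input.format("csv",
                                         sep = ",",
                                         comment.char = "",
                                         quote = "")
```

These are exactly the defaults read.csv would have used, which is the point of the complaint above.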
Hi,
I was reading the earlier issues about backend.parameters being deprecated and your plan to remove it in rmr 2.1.
I think there is still a case for keeping the backend.parameters option. I agree with your view that it is not required on a properly configured cluster, but security and access restrictions sometimes make it hard to tell a misconfigured cluster apart from erroneous code. In these cases backend.parameters helps you figure it out.
Recently I had an issue where only one reducer was being spawned; it turned out to be a wrongly configured cluster, which I was able to diagnose only after using backend.parameters to set the number of reduce tasks explicitly.
So I would really appreciate it if you kept that option. It is helpful.
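For reference, the usage being defended looks roughly like this (a sketch; input, my.map, and my.reduce are placeholders, and the property name differs between MR1 and MR2, mapred.reduce.tasks vs mapreduce.job.reduces):

```r
# Sketch: pin the number of reduce tasks through backend.parameters, which
# forwards -D options to the hadoop streaming command line.
mapreduce(input = input,
          map = my.map,
          reduce = my.reduce,
          backend.parameters = list(
            hadoop = list(D = "mapred.reduce.tasks=16")))
```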
provide a way to set options to Rscript
I'm using RMR and I'd like to serialize multiple randomForests to hdfs.
reducer <- function(k, v) {
  rf <- randomForest(formula = model.formula,
                     data = v,
                     na.action = na.roughfix,
                     ntree = number.trees,
                     do.trace = TRUE)
  keyval(k, list(forest = rf))
}
I'm calling the reducer like this:
mapreduce(input = "train_clean.csv",
          input.format = titanic.input.format,
          map = mapper,
          reduce = reducer,
          output.format = "native",
          output = "titanic-out")
When I run this, the reducers fail like this:
2013-05-02 08:18:27,372 INFO [main] org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:458)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:399)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:218)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376)
2013-05-02 08:18:27,375 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:458)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:399)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:218)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376)
If I change to output.format="text", the same code works.
I'd like to be able to specify the name of my MapReduce job, to identify it in the JobTracker among all the others.
While this can be done by specifying "mapred.job.name" in backend.parameters, that feature is deprecated, as discussed in #9.
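For reference, the deprecated route looks like this; this is a sketch only, since backend.parameters support may change, and the surrounding mapreduce arguments are illustrative:

```r
# Passes -D mapred.job.name=my-rmr2-job through to the streaming jar.
# This works with current rmr2 but relies on the deprecated
# backend.parameters option discussed above.
mapreduce(input = small.ints,
          map = function(k, v) keyval(k, v),
          backend.parameters = list(
            hadoop = list(D = "mapred.job.name=my-rmr2-job")))
```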
There is a warning in equijoin: "Doesn't work with multiple inputs like mapreduce".
I am guessing this means that when multiple files are present inside a folder, mapreduce takes them as multiple input arguments.
If equijoin does not support this, would it also be unable to read the output of a reduce job when multiple part files are created?
Hi,
I'm getting the following error while running MapReduce jobs in rmr2 v2.0.2. The output of the standard error stream is as follows:
Loading required package: rmr2
Loading required package: Rcpp
Loading required package: methods
Loading required package: int64
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Warning: NAs introduced by coercion
Error in .Call("typed_bytes_reader", data, nobjs, PACKAGE = "rmr2") :
negative length vectors are not allowed
Calls: ... keyval.reader -> format -> typed.bytes.reader -> .Call Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Can you shed some light on this issue?
Thanks for your help.
rmr2 uses the tempdir() function to generate names for temporary directories on HDFS. While it is possible to change the base directory by setting the environment variable TMPDIR (or alternatively TMP or TEMP) before starting R, there are limitations. In particular, the directory containing temp files has to be the same on both the local file system and HDFS. The following paragraph illustrates this with an example.
For example, if I want to set TMPDIR to a specific HDFS location, e.g. /user/hduser/tmp, I cannot, because /user/hduser/tmp does not exist on my local file system. (R only honors a temp-dir setting if the directory exists on the local file system and is writable.) This is a problem in itself; furthermore, it gets worse if I also need tempdir() for my own code.
There is another reason why changing TMPDIR, TMP, or TEMP might not be a good idea: it affects other parts of the system as well. For example, it affects LXDE and Openbox; if you set any of those variables to a non-existent local directory, your background wallpaper will not load after logging in, and you can no longer change the desktop preferences.
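The constraint runs even deeper than the existence check: tempdir() is fixed for the lifetime of the session, so the variable must be set before R starts. A minimal illustration (the HDFS-style path is taken from the example above):

```r
# tempdir() is determined once at startup: R reads TMPDIR (or TMP/TEMP),
# falls back to a default such as /tmp if the named directory does not
# exist locally or is not writable, and never re-reads the variable.
Sys.setenv(TMPDIR = "/user/hduser/tmp")  # too late: session already started
tempdir()  # still returns the directory chosen at startup
```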
I'm not exactly sure why, but Rprofmem doesn't always succeed:
> Rprofmem()
Error in Rprofmem() : memory profiling is not available on this system
It looks like activate.profiling works fine, but close.profiling fails because it tries to close both Rprof(NULL) and Rprofmem(NULL).
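Rprofmem is only available when R was compiled with memory-profiling support, and capabilities("profmem") reports whether it is. A minimal sketch of a guarded shutdown, assuming the activate/close split described above (the function name is illustrative, not the actual close.profiling code):

```r
# Guard the Rprofmem call so that shutting profiling down does not
# error on R builds that lack memory-profiling support.
close.profiling.safe <- function() {
  Rprof(NULL)                                  # stop time profiling
  if (capabilities("profmem")) Rprofmem(NULL)  # stop memory profiling only if supported
}
```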
Hi,
While installing rmr2 on a system with CDH4, I get the following message, stating "Cannot find hadoop-core jar file in hadoop home".
This seems to be more of a warning, as the package runs as expected regardless.
Is this something that we should be bothered about?
The exact install log is as follows:
* installing to library '/usr/local/lib64/R/library'
* installing *source* package 'rmr2' ...
** libs
g++ -I/usr/local/lib64/R/include -DNDEBUG -I/usr/local/include `/usr/local/lib64/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic -g -O2 -c extras.cpp -o extras.o
g++ -I/usr/local/lib64/R/include -DNDEBUG -I/usr/local/include `/usr/local/lib64/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic -g -O2 -c hbase-to-df.cpp -o hbase-to-df.o
g++ -I/usr/local/lib64/R/include -DNDEBUG -I/usr/local/include `/usr/local/lib64/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic -g -O2 -c typed-bytes.cpp -o typed-bytes.o
g++ -shared -L/usr/local/lib64 -o rmr2.so extras.o hbase-to-df.o typed-bytes.o -L/usr/local/lib64/R/library/Rcpp/lib -lRcpp -Wl,-rpath,/usr/local/lib64/R/library/Rcpp/lib -L/usr/local/lib64/R/lib -lR
((which hbase && (mkdir -p ../inst; cd hbase-io; sh build_linux.sh; cp build/dist/* ../../inst)) || echo "can't build hbase IO classes, skipping" >&2)
/usr/bin/hbase
build_linux.sh: line 159: [: missing `]'
Using /usr/lib/hadoop as hadoop home
Using /usr/lib/hbase as hbase home
Copying libs into local build directory
ls: cannot access /usr/lib/hadoop/hadoop-*-core.jar: No such file or directory
ls: cannot access /usr/lib/hadoop/hadoop-core-*.jar: No such file or directory
Cannot find hadoop-core jar file in hadoop home
cp: cannot stat `build/dist/*': No such file or directory
can't build hbase IO classes, skipping
installing to /usr/local/lib64/R/library/rmr2/libs
** R
** preparing package for lazy loading
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called 'quickcheck'
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (rmr2)
Continuing issue RevolutionAnalytics/RHadoop#115
This has become issue https://issues.apache.org/jira/browse/HADOOP-9300
Hi,
I've written a MapReduce job that takes as input the output of a previous MapReduce job.
The first few lines of the job look something like this:
map_1 = function(k, input_data) {
  index = which(k == 1)
  if (length(index) > 0) {
    input_data = do.call("rbind", input_data[index])
    input_data = as.data.frame(input_data)
    if (nrow(input_data) != 0) {
      # some processing
      keyval("constant_key", processed_input_data)
    }
  }
}
Now, in the processing part I've clipped some columns, but nothing changes the number of rows. Yet this is the error I am getting:
Loading required package: rmr2
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
data ENTERING FOR at risk19
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `/usr/lib/hadoop/bin/hadoop dfs -put /mnt/data/mapred/local/taskTracker/musigma/jobcache/job_201303201047_0032/attempt_201303201047_0032_m_000000_0/work/tmp/RtmpF1qXQp/file708455830e3d c(" new/output/alltags-v33_1/AtRisk/part-00001", " new/output/alltags-v33_1/AtRisk/part-00002")'
Length of Keys = 1
Length of Values = 3
Error in rmr.recycle(k, v) : Can't recycle 0-length argument
Calls: <Anonymous> ... keyval.writer -> format -> recycle.keyval -> keyval -> rmr.recycle
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Since there is already an if condition that should handle the zero-row case, we cannot figure out why the error is being thrown at keyval.
What could be the problem?
I wrote some sample code for equijoin, and I am getting a 'segfault: memory not mapped' error. My R session crashes if I run it from RStudio, but from plain R I can exit gracefully (please see the output below).
The code I am trying to run is:
library("rmr2")
authors = data.frame(
  surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased = c("yes", rep("no", 4)))
books = data.frame(
  name = I(c("Tukey", "Venables", "Tierney",
             "Ripley", "Ripley", "McNeil", "R Core")),
  title = c("Exploratory Data Analysis",
            "Modern Applied Statistics ...",
            "LISP-STAT",
            "Spatial Statistics", "Stochastic Simulation",
            "Interactive Data Analysis",
            "An Introduction to R"),
  other.author = c(NA, "Ripley", NA, NA, NA, NA,
                   "Venables & Smith"))
to.dfs(kv=authors, output="authors.csv", format=make.output.format("csv", sep=","))
to.dfs(kv=books, output="books.csv", format=make.output.format("csv", sep=","))
eqj = function(left.input="authors.csv", right.input="books.csv", output=NULL) {
  map.left = function(k, v) {
    names(v) = c("surname", "nationality", "deceased")
    keyval(v[, "surname"], v[, drop=FALSE])
  }
  map.right = function(k, v) {
    names(v) = c("name", "title", "other.author")
    keyval(v[, "name"], v[, "title", drop=FALSE])
  }
  equijoin(left.input="authors.csv", right.input="books.csv",
           input.format=make.input.format("csv", sep=",", as.is=TRUE),
           outer="left", map.left=map.left, map.right=map.right)
}
merged = eqj()
The output while running through R is:
> library("rmr2")
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
> from.dfs(equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)), right.input = to.dfs(keyval(1:10, 1:10^3))))
*** caught segfault ***
address 0x8, cause 'memory not mapped'
Traceback:
1: .Call("typedbytes_writer", objects, native, PACKAGE = "rmr2")
2: writeBin(.Call("typedbytes_writer", objects, native, PACKAGE = "rmr2"), con)
3: typedbytes.writer(interleave(keys(kvs), values(kvs)), con, native)
4: format(kv, con)
5: keyval.writer(kv)
6: write.file(kv, tmp)
7: to.dfs(keyval(1:10, 1:10^2))
8: xor(!is.null(left.input), !is.null(input) && (is.null(left.input) == is.null(right.input)))
9: stopifnot(xor(!is.null(left.input), !is.null(input) && (is.null(left.input) == is.null(right.input))))
10: equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)), right.input = to.dfs(keyval(1:10, 1:10^3)))
11: to.dfs.path(input)
12: from.dfs(equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)), right.input = to.dfs(keyval(1:10, 1:10^3))))
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3
[user@localhost ~]$
Any clue as to why this is happening?
I would like to propose the definition of a keyval object and an interface for objects that can be keys or values. The goal is to provide an extensible definition for keyval, which is currently limited to containing vectors, lists, matrices, or data frames. Not too shabby, but what about sparse matrices? The design is already sketched in the file keyval.R. The functions starting with rmr. define an interface for things that can be keys or values. It doesn't actually have to be the same interface for the two, and I think that for the sake of generality the symmetry should be dropped. The functions ending in .keyval define an interface for the keyval class, and hopefully also an implementation that relies completely on the rmr interface to hide the differences between the concrete data structures. There are some exceptions, like the c.or.rbind function, which is an rmr.-type function whose name is maybe too tied to a possible implementation, or key.normalize, which should be part of a key interface, as the name suggests. The goal here is to allow the user to mapreduce any data structure that satisfies a small number of properties: the ability to be split, unsplit, sliced, and measured, plus serializability and a few more, all reasonable.
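The proposal above can be sketched with S3 generics; the names below (rmr.length, rmr.slice) follow the rmr. convention described in the proposal but are illustrative only, not the actual keyval.R code:

```r
# Generic interface: any class that implements these can serve as a
# container of keys or values. Here only "measure" and "slice" are shown.
rmr.length <- function(x) UseMethod("rmr.length")
rmr.slice  <- function(x, r) UseMethod("rmr.slice")

# Default methods cover plain vectors and lists:
rmr.length.default <- function(x) length(x)
rmr.slice.default  <- function(x, r) x[r]

# A new container type, e.g. a sparse matrix class, would just add its
# own methods (row-wise semantics, as a hypothetical example):
# rmr.length.dgCMatrix <- function(x) nrow(x)
# rmr.slice.dgCMatrix  <- function(x, r) x[r, , drop = FALSE]
```

With this split, the .keyval functions could be written once against the generics and never inspect the concrete data structure.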