sryza / aas
Code to accompany Advanced Analytics with Spark from O'Reilly Media
License: Other
I started adding an Acknowledgements section to the Preface; we'll need to make sure it's complete before finishing.
Inside package com.cloudera.datascience.risk, I am not able to find these two classes:
import com.cloudera.datascience.risk.ComputeFactorWeights._
import com.cloudera.datascience.risk.MonteCarloReturns._
Here is the link where you import them: https://github.com/sryza/aas/blob/master/ch09-risk/shell.scala#L11-L12
It's breaking when calling trimmed.head._1 in trimToRegion (line 200), reached from featurize(). I'm still relatively new to Scala, so I'm not well equipped to investigate what exactly is breaking. I tried running it only on factors1, and also only on factors2 instead of the two concatenated together, and it breaks both times. factors2 consists of the S&P and NASDAQ data downloaded with the provided shell script -- as opposed to crude oil and US T-bonds copied and pasted from investing.com -- so I'm thinking this might be a true bug rather than something I introduced. I'll go hone my Scala skills a bit more so I can investigate further. If I find anything, I'll let you know.
spark-submit --class com.cloudera.datascience.risk.RunRisk --master local target/ch09-risk-1.0.2-jar-with-dependencies.jar
Exception in thread "main" java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
at scala.collection.IterableLike$class.head(IterableLike.scala:91)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108)
at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
at com.cloudera.datascience.risk.RunRisk$.trimToRegion(RunRisk.scala:200)
at com.cloudera.datascience.risk.RunRisk$$anonfun$16.apply(RunRisk.scala:112)
at com.cloudera.datascience.risk.RunRisk$$anonfun$16.apply(RunRisk.scala:112)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at com.cloudera.datascience.risk.RunRisk$.readStocksAndFactors(RunRisk.scala:112)
at com.cloudera.datascience.risk.RunRisk$.main(RunRisk.scala:34)
at com.cloudera.datascience.risk.RunRisk.main(RunRisk.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
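For what it's worth, the trace points at `.head` being called on an empty Array inside trimToRegion: if an instrument ends up with no rows inside the trimmed date region, `Array.head` throws exactly this "next on empty iterator" error. A minimal plain-Scala sketch of the failure mode, and of `headOption` as the defensive alternative (the data here is made up):

```scala
// .head on an empty Array throws java.util.NoSuchElementException,
// reported as "next on empty iterator"
val empty = Array.empty[Double]
// empty.head  // would throw

// headOption makes the empty-history case explicit instead of throwing
val first: Option[Double] = empty.headOption
```

So a reasonable place to look is whichever instrument's history comes back empty after trimming.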
@sryza I notice some cases where you name a class like RowMatrix but it's not in code font, as with +RowMatrix+
. Is it worth standardizing on this, to mark class and method names inline in the text as code format?
Following the example in chapter 6, I am getting the following error shortly after running: docTermFreqs.flatMap(_.keySet).distinct().count()
It starts splitting input and executing tasks then:
15/07/10 15:42:41 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: No annotator named tokenize
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:83)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:292)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:125)
at $line171.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.createNLPPipeline(<console>:70)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:92)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:91)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
edu.stanford.nlp.pipeline.AnnotatorImplementations:
15/07/10 15:42:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: No annotator named tokenize
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:83)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:292)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:125)
at $line171.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.createNLPPipeline(<console>:70)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:92)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:91)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/07/10 15:42:41 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Adding annotator pos
15/07/10 15:42:41 INFO TaskSchedulerImpl: Cancelling stage 0
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 9.0 in stage 0.0 (TID 9)
15/07/10 15:42:41 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 3.0 in stage 0.0 (TID 3)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 10.0 in stage 0.0 (TID 10)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 4.0 in stage 0.0 (TID 4)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 11.0 in stage 0.0 (TID 11)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 12.0 in stage 0.0 (TID 12)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 13.0 in stage 0.0 (TID 13)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 5.0 in stage 0.0 (TID 5)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 6.0 in stage 0.0 (TID 6)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 7.0 in stage 0.0 (TID 7)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 14.0 in stage 0.0 (TID 14)
15/07/10 15:42:41 INFO Executor: Executor is trying to kill task 8.0 in stage 0.0 (TID 8)
15/07/10 15:42:41 INFO DAGScheduler: Job 0 failed: count at <console>:96, took 1.887817 s
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... Adding annotator tokenize
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: No annotator named tokenize
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:83)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:292)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:129)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:125)
at $line171.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.createNLPPipeline(<console>:70)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:92)
at $line175.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:91)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
in RunLSA.scala
error: value containsKey is not a member of scala.collection.immutable.Map[String,Int]
case (term, freq) => bTermToId.containsKey(term)
http://www.scala-lang.org/api/2.11.5/index.html#scala.collection.immutable.Map
Looks like it should be "contains" instead of "containsKey".
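Right -- `containsKey` is the java.util.Map method; Scala's immutable Map exposes `contains`. A quick sketch with a made-up term-to-id map:

```scala
// Scala's immutable.Map has `contains`; `containsKey` is Java's Map API
val bTermToId = Map("spark" -> 0, "scala" -> 1)  // hypothetical term -> id map

// keep only pairs whose term is present in the map
val kept = Seq(("spark", 3), ("flink", 1)).filter {
  case (term, freq) => bTermToId.contains(term)
}
```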
Ch 6, the following code snippet generates an error.
import scala.collection.mutable.HashMap
val docTermFreqs = lemmatized.map(terms => {
val termFreqs = terms.foldLeft(new HashMap[String, Int]()) {
(map, term) => {
map += term -> (map.getOrElse(term, 0) + 1)
map
}
}
termFreqs
})
The error is
<console>:64: error: value foldLeft is not a member of (String, Seq[String])
val termFreqs = terms.foldLeft(new HashMap[String, Int]()) {
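The error suggests `lemmatized` here holds `(title, terms)` pairs (as in the repo's mapPartitions version), so the fold is being applied to the tuple rather than to the term sequence. A plain-Scala sketch of the likely fix, destructuring the pair first (the sample document is made up):

```scala
import scala.collection.mutable.HashMap

// stand-in for one (title, terms) element of `lemmatized`
val doc = ("Sample Article", Seq("spark", "scala", "spark"))

// fold over the term Seq (the second element), not the tuple itself
val termFreqs = doc match {
  case (title, terms) =>
    terms.foldLeft(new HashMap[String, Int]()) { (map, term) =>
      map += term -> (map.getOrElse(term, 0) + 1)
      map
    }
}
```

On the RDD, the same shape would be `lemmatized.map { case (title, terms) => ... }`.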
Hi,
I'm trying chapter 6 and i have 2 questions:
First,
cd aas
mvn install
cd ch06-lsa
mvn package
cd ..
./spark/bin/spark-submit --class com.cloudera.datascience.lsa.RunLSA aas/ch06-lsa/target/ch06-lsa-1.0.0.jar
but I get an error :
15/05/28 18:07:33 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.NoClassDefFoundError: edu/umd/cloud9/collection/wikipedia/WikipediaPage
at com.cloudera.datascience.lsa.RunLSA$.preprocessing(RunLSA.scala:54)
at com.cloudera.datascience.lsa.RunLSA$.main(RunLSA.scala:33)
at com.cloudera.datascience.lsa.RunLSA.main(RunLSA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: edu.umd.cloud9.collection.wikipedia.WikipediaPage
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
I'm launching from the master node of an EC2 Spark install (https://spark.apache.org/docs/latest/ec2-scripts.html).
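The NoClassDefFoundError suggests the plain module jar was submitted without its dependencies (Cloud9's WikipediaPage is a third-party class). If the build produced an assembly jar, submitting that instead should bundle everything -- a sketch, assuming the jar-with-dependencies artifact naming used elsewhere in this repo:

```shell
# submit the assembly jar so Cloud9 and the other deps ship with the job
./spark/bin/spark-submit \
  --class com.cloudera.datascience.lsa.RunLSA \
  aas/ch06-lsa/target/ch06-lsa-1.0.0-jar-with-dependencies.jar
```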
Secondly, how do I launch the main function from RunLSA in the SparkShell ?
./spark/bin/spark-shell --jars aas/ch06-lsa/target/ch06-lsa-1.0.0.jar
I have been trying
import com.cloudera.datascience.lsa.RunLSA
RunLSA.main(Array("100","1000","0.1"))
but I get the error
15/05/28 18:14:21 WARN spark.SparkContext: Multiple running SparkContexts detected in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:80)
Just looking for your best practice.
Thanks a lot.
scala> val topConceptTerms = topTermsInTopConcepts(svd, 4, 10, termIds)
<console>:108: error: type mismatch;
found : scala.collection.immutable.Map[String,Int]
required: Map[Int,String]
val topConceptTerms = topTermsInTopConcepts(svd, 4, 10, termIds)
^ (the caret sits under termIds)
using code
import scala.collection.mutable.ArrayBuffer
def topTermsInTopConcepts(svd: SingularValueDecomposition[RowMatrix, Matrix], numConcepts: Int,
numTerms: Int, termIds: Map[Int, String]): Seq[Seq[(String, Double)]] = {
val v = svd.V
val topTerms = new ArrayBuffer[Seq[(String, Double)]]()
val arr = v.toArray
for (i <- 0 until numConcepts) {
val offs = i * v.numRows
val termWeights = arr.slice(offs, offs + v.numRows).zipWithIndex
val sorted = termWeights.sortBy(-_._1)
topTerms += sorted.take(numTerms).map{
case (score, id) => (termIds(id), score)
}
}
topTerms
}
any ideas? Also the function in the book is missing the function definition, though there is still a return statement.
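The mismatch says the `termIds` being passed maps term -> id, while `topTermsInTopConcepts` wants id -> term. If the mapping is one-to-one, inverting it before the call lines the types up -- a sketch with a made-up map:

```scala
// hypothetical term -> id map, the shape the error message reports
val termIds: Map[String, Int] = Map("spark" -> 0, "scala" -> 1)

// invert to the id -> term shape the function signature expects
val idToTerm: Map[Int, String] = termIds.map(_.swap)
```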
In the latest release of 'adam', adamLoad() has been replaced with more specific methods. E.g., for reading, 'adamLoad' has been replaced with 'loadAlignments':
val readsRDD:RDD[AlignmentRecord] = sc.loadAlignments("/Users/davidlaxer/genomics/reads/HG00103")
See:
Kept getting XML errors when running the code pasted from here and found the issues at lines 186 and 187 - that should be MedlineCitationSet.
Ch 3 has the statement: "These lines cause a NumberFormatException". The lines don't necessarily cause a NumberFormatException; they could throw one.
Hi,
I'm running on Ubuntu 14.04 LTS on an EC2 instance.
ubuntu@ip-10-0-1-186:/aas$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty
When I tried to run $mvn package in ch06-lsa:
ubuntu@ip-10-0-1-186:/aas/ch06-lsa$ mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Wikipedia Latent Semantic Analysis 1.0.0
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for com.cloudera.datascience:common:jar:1.0.0 is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.057s
[INFO] Finished at: Tue May 26 21:14:19 UTC 2015
[INFO] Final Memory: 11M/225M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project ch06-lsa: Could not resolve dependencies for project com.cloudera.datascience:ch06-lsa:jar:1.0.0: Failure to find com.cloudera.datascience:common:jar:1.0.0 in http://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
When I tried to run mvn in the root:
ubuntu@ip-10-0-1-186:/aas$ mvn install
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Advanced Analytics with Spark
[INFO] Advanced Analytics with Spark Common
[INFO] Introduction to Data Analysis with Scala and Spark
[INFO] Recommender Engines with Audioscrobbler data
[INFO] Covtype with Random Decision Forests
[INFO] Anomaly Detection with K-means
[INFO] Wikipedia Latent Semantic Analysis
[INFO] Network Analysis with GraphX
[INFO] Temporal and Geospatial Analysis
[INFO] Value at Risk through Monte Carlo Simulation
[INFO] Genomics Analysis with ADAM
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Advanced Analytics with Spark 1.0.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce) @ spark-book-parent ---
[WARNING] Rule 1: org.apache.maven.plugins.enforcer.RequireMavenVersion failed with message:
Detected Maven Version: 3.0.5 is not in the allowed range 3.1.1.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Advanced Analytics with Spark ..................... FAILURE [1.570s]
[INFO] Advanced Analytics with Spark Common .............. SKIPPED
[INFO] Introduction to Data Analysis with Scala and Spark SKIPPED
[INFO] Recommender Engines with Audioscrobbler data ...... SKIPPED
[INFO] Covtype with Random Decision Forests .............. SKIPPED
[INFO] Anomaly Detection with K-means .................... SKIPPED
[INFO] Wikipedia Latent Semantic Analysis ................ SKIPPED
[INFO] Network Analysis with GraphX ...................... SKIPPED
[INFO] Temporal and Geospatial Analysis .................. SKIPPED
[INFO] Value at Risk through Monte Carlo Simulation ...... SKIPPED
[INFO] Genomics Analysis with ADAM ....................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.867s
[INFO] Finished at: Tue May 26 21:15:21 UTC 2015
[INFO] Final Memory: 14M/285M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce (enforce) on project spark-book-parent: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
Any ideas?
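The enforcer message above pinpoints it: the build requires Maven 3.1.1 or later, while Ubuntu 14.04's packaged Maven is 3.0.5. That early failure is also why `com.cloudera.datascience:common` never got installed, which in turn broke the ch06-lsa build. A sketch of the recovery, assuming a newer Maven has been put on the PATH:

```shell
mvn -v        # should now report 3.1.1 or later
cd aas
mvn install   # installs all modules, including the `common` artifact
cd ch06-lsa
mvn package
```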
When using the book's example code:
val idfs = docFreqs.map{
| case (term, count) => (term, math.log(numDocs.toDouble / count))
| }.toMap
I get back:
<console>:104: error: value toMap is not a member of org.apache.spark.rdd.RDD[(String, Double)]
possible cause: maybe a semicolon is missing before `value toMap'?
}.toMap
^
Currently importing:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
import org.apache.spark.SparkContext._
//lemmatization
import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._
//need properties class
import java.util.Properties
//need array buffer class
import scala.collection.mutable.ArrayBuffer
//need rdd class
import org.apache.spark.rdd.RDD
//need foreach
import scala.collection.JavaConversions._
//computing the tf-idfs
import scala.collection.mutable.HashMap
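In the book's listing, docFreqs has already been brought back to the driver; against an RDD there is no `toMap`, so either collect first or use `collectAsMap()`. A plain-Scala stand-in (a local Seq plays the role of the collected RDD; the counts are made up):

```scala
// local stand-in for docFreqs after collecting it to the driver;
// on the RDD itself, use .collectAsMap() (or .collect().toMap) instead of .toMap
val docFreqs = Seq(("spark", 2), ("scala", 1))
val numDocs = 4

val idfs = docFreqs.map {
  case (term, count) => (term, math.log(numDocs.toDouble / count))
}.toMap
```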
In the chapter 6 text,
case (term,freq) => (bTermIds(term), bIdfs(term) * termFreqs(term) / docTotalTerms)
bIdfs isn't defined and I can't find an equivalent in the RunLSA code.
Hi, I have built the mvn package from the root to create all the jar files for the chapters. I copied the jar file for ch08 into the external folder of the spark-1.5.0 install and start with ./bin/spark-shell --jars external/ch08-geotime-1.0.1-jar-with-dependencies.jar --master local[*]
All the imports work ok including import com.github.nscala_time.time.Imports._
I can create a new DateTime, e.g. val test = new DateTime
but when I run
case class Trip(
pickupTime: DateTime,
dropoffTime: DateTime,
pickupLoc: Point,
dropoffLoc: Point)
I get the error: error: not found: type DateTime
Could you please give me a tip as to why it would work for creating a new object but not recognised in the case class?
Many thanks,
Getting the following in the adam-shell:
scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._
scala> import org.bdgenomics.formats.avro.AlignmentRecord
import org.bdgenomics.formats.avro.AlignmentRecord
scala> val readsRDD: RDD[AlignmentRecord] = sc.adamLoad("genomics/reads/HG00103")
<console>:20: error: value adamLoad is not a member of org.apache.spark.SparkContext
val readsRDD: RDD[AlignmentRecord] = sc.adamLoad("genomics/reads/HG00103")
Hi,
It would be great to have data sources to run examples. Could you provide either a link, the data itself or a way to get it ?
Cheers,
Yann
It's not clear from the documentation how to run the .jar files. For example, ch06-lsa requires the file stopwords.txt, which is present at both target/classes/stopwords.txt and src/main/resources/stopwords.txt. But when the jar is run with this command: spark-submit --class com.cloudera.datascience.lsa.RunLSA target/ch06-lsa-1.0.2-jar-with-dependencies.jar
an error is generated indicating that stopwords.txt can't be found.
The following statement from ch 6 generates an error:
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
<console>:59: error: not enough arguments for method plainTextToLemmas: (text: String, stopWords: Set[String], pipeline: edu.stanford.nlp.pipeline.StanfordCoreNLP)Seq[String].
Unspecified value parameter pipeline.
       val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
The repo code is fine.
val lemmatized = plainText.mapPartitions(iter => {
val pipeline = createNLPPipeline()
iter.map{ case(title, contents) => (title, plainTextToLemmas(contents, stopWords, pipeline))}
})
Let's talk a little about the namespace, API, etc and whether it should be inlined into the book code repo.
On page 92 in calculating sumSquares
, the code
val sumSquares = dataAsArray.fold(
new Array[Double](numCols)
)(
(a,b) => a.zip(b).map(t => t._1 + t._2 * t._2)
)
As RDD.fold requires the operator to be commutative and associative, which is violated by the asymmetry in the map() function, the result might differ depending on the number of partitions in the RDD.
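One way to restore symmetry is to square each element in a `map` step first, so the fold itself is plain element-wise addition, which is commutative and associative. A local-collection sketch of that shape (the same pattern applies on the RDD; the data here is made up):

```scala
// square first, then fold with symmetric element-wise addition
val numCols = 2
val dataAsArray = Seq(Array(1.0, 2.0), Array(3.0, 4.0))

val sumSquares = dataAsArray
  .map(arr => arr.map(d => d * d))
  .fold(new Array[Double](numCols)) { (a, b) =>
    a.zip(b).map { case (x, y) => x + y }
  }
```

Because each step of the fold is now order-independent, the result no longer depends on how the data is partitioned.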
@srowen indents up to the open paren where the args start. I indent four spaces past the previous line's indentation. Sean's is the IntelliJ default. Mine is the style used inside Spark.
I prefer mine, but don't have a very strong opinion.
<console>:93: error: value _1 is not a member of scala.collection.mutable.HashMap[String,Int]
val docIds = docTermFreqs.map(_._1).zipWithUniqueId().map(_.swap).collectAsMap()
The code in the book doesn't define a variable called docIds, and I don't see any comments about it in the code. I'm having trouble debugging this because I'm not sure exactly what this line is trying to accomplish. What does (_._1) mean?
It's a bit frustrating that the code in the book doesn't work and I can't manage to get the code on GitHub to work either. Any help would be much appreciated.
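To the `(_._1)` question: it's the placeholder syntax for `x => x._1`, i.e. "take the first element of each pair", and `.swap` flips a pair. A small sketch of what that docIds line computes, with toy data standing in for the RDD:

```scala
// toy stand-in for docTermFreqs.map(_._1).zipWithUniqueId(): (title, id) pairs
val titlesWithIds = Seq(("doc-a", 0L), ("doc-b", 1L))

// _.swap flips each pair to (id, title); the result is an id -> title lookup
val docIds = titlesWithIds.map(_.swap).toMap
```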
I'm trying to get the zebrafish data set. Seems the down-scaled sample is no longer in the Thunder distro.
Hi,
I am trying to work through chapter 6. I am trying to build the package as described in the book, but I am stuck on the error below.
[ERROR] Failed to execute goal on project ch06-lsa: Could not resolve dependencies for project com.cloudera.datascience:ch06-lsa:jar:1.0.0: Failure to find com.cloudera.datascience:common:jar:1.0.0 in http://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
[ERROR]
Thanks,
Vishnu
Maven build errors for XmlInputFormat.java. Stacktrace indicates maven plugin issue (MojoExecutor)
errors like this
".../common/XmlInputFormat.java:[21,24] cannot find symbol
symbol : class StandardCharsets
location: package java.nio.charset "
I'm on java 7.
java version "1.7.0_71"
rolling back to previous version of XmlInputFormat.java (including Guava) works.
Thanks for any guidance.
Can't get 16 gb dataset:
$ curl ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
curl: (6) Could not resolve host: ftp-trace.ncbi.nih.gov
The following code from ch 6 generates error.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP)
: Seq[String] = {
val doc = new Annotation(text)
pipeline.annotate(doc)
val lemmas = new ArrayBuffer[String]()
val sentences = doc.get(classOf[SentencesAnnotation])
for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
val lemma = token.get(classOf[LemmaAnnotation])
if (lemma.length > 2 && !stopWords.contains(lemma) && isOnlyLetters(lemma)) {
lemmas += lemma.toLowerCase
}
}
lemmas
}
The error is
<console>:37: error: value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]
       for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
                        ^
For this line
I am wondering if it should be changed to
implicit val ordering: Ordering[(K,S)] = Ordering.by(_._2)
so it can be sorted by the pickup time. If using _1, it will sort by lic, but within the same partition that is always the same anyway. I think what we need is to sort by the pickup time within the partition.
When trying to follow along with the example in chapter 6, I get an error when trying to convert the xml to plain text.
scala> val plainText = rawXmls.flatMap(wikiXmlToPlainText)
<console>:42: error: value flatmap is not a member of org.apache.spark.rdd.RDD[String]
val plaintext = rawXmls.flatmap(wikiXmlToPlainText)
^
Any ideas?
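Worth a close look at the error output: it shows `flatmap` (lowercase m) is what actually got evaluated, and Scala is case-sensitive -- the RDD method is `flatMap`. A quick local-collection reminder (toy data):

```scala
// flatMap (capital M) exists; `flatmap` does not compile
val xs = Seq("a b", "c")
val words = xs.flatMap(_.split(" "))
```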
In chapter 6,
termDocMatrix.cache()
This variable isn't defined earlier in the chapter. I haven't been able to find a suitable way to do this without changing everything to be more like RunLSA, which creates a differnt set of issues.
Any assistance appreciated
I have packaged the chapter 6 and included the jar using spark-shell.
When I am trying to execute the below code without @transient
@transient val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable],
classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)
I get Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration.
With @transient in place I can proceed further, but after the below transformation
val plainText = rawXmls.flatMap(wikiXmlToPlainText)
I ran plainText.count
And it gives me the below error.
java.lang.NoClassDefFoundError: com/google/common/base/Charsets
at com.cloudera.datascience.common.XmlInputFormat$XmlRecordReader.(XmlInputFormat.java:79)
at com.cloudera.datascience.common.XmlInputFormat.createRecordReader(XmlInputFormat.java:55)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
Am I missing something here?
I am using spark 1.2 and Hadoop 2.5.2
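For what it's worth, com.google.common.base.Charsets is a Guava class, so the NoClassDefFoundError suggests Guava isn't on the classpath at runtime. A possible workaround (jar name assumed from the chapter's build) is to build and load the assembly jar, which bundles the dependencies:

```shell
# from the ch06-lsa directory; the jar-with-dependencies bundles Guava
mvn package
spark-shell --jars target/ch06-lsa-1.0.0-jar-with-dependencies.jar
```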
The book example uses the path to wikidump.xml, but the GitHub code is looking at a directory. Where and how was the XML file broken up? I'm getting this error in the preprocessing function when running flatMap.
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:303)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:302)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:302)
Additionally, is there any documentation on how to run RunLSA? The book example uses spark-shell but I've had to change a few things to get the github code to play nicely with spark-shell
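On running RunLSA outside the shell: a sketch of the usual spark-submit invocation (class and jar names assumed from the repo's conventions; they may differ in your checkout):

```shell
# build the assembly jar first, then submit
mvn package
spark-submit --class com.cloudera.datascience.lsa.RunLSA \
  --master local[*] target/ch06-lsa-1.0.0-jar-with-dependencies.jar
```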
Hi,
I tried to add the nscala-time_2.10-1.8.0.jar to the Spark shell and imported the package. But unfortunately, when I use it, I end up with this error.
scala> import com.github.nscala_time.time.Imports._
import com.github.nscala_time.time.Imports._
scala> val dt = new DateTime(2015,2,2,20,0)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in BuilderImplicits.class refers to term time
in value org.joda which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling BuilderImplicits.class.
That entry seems to have slain the compiler. Shall I replay
your session? I can re-run each line except the last one.
Can you help me resolve this?
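nscala-time is a wrapper around Joda-Time, so the missing org.joda.time symbols suggest the underlying joda-time jar also needs to be on the shell's classpath. A hedged sketch (joda-time version assumed; match it to your nscala-time release):

```shell
# both jars must be supplied; nscala-time alone only has the wrapper classes
spark-shell --jars nscala-time_2.10-1.8.0.jar,joda-time-2.7.jar
```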
To run the following statement in the Spark shell, make the variable @transient.
val conf = new Configuration()
Replace with:
@transient val conf = new Configuration()
Since the chapter code is meant for the Spark shell, it should be mentioned that the variable needs to be @transient.
I can't find this dependency on mvnrepository or GitHub, as described in ch6 and ch7. Can you point me to it?
Your book is great, and I'm learning a lot. Thanks!
<dependency>
<groupId>com.cloudera.datascience</groupId>
<artifactId>common</artifactId>
<version>${project.version}</version>
</dependency>
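As far as I can tell, the common artifact is a module of this repo rather than something published to Maven Central, so it has to be installed into the local Maven repository first (module layout assumed from the repo):

```shell
# from the root of the aas checkout; installs common (and the other modules)
# into ~/.m2 so the chapter poms can resolve the dependency
mvn -DskipTests install
```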
We have some illustrations in chapters, but some chapters still have none. This is a placeholder to remind us to go back and review more recent chapters for illustrations.
Can I run the 'AAS' code from Zeppelin?
If yes, how do I import the chapter's .jar?
Spark Shell:
~/spark/bin/spark-shell --jars target/ch06-lsa-1.0.0.jar
Zeppelin:
./bin/zeppelin-daemon.sh start
Pid dir doesn't exist, create /home/ubuntu/incubator-zeppelin/run
Zeppelin start [ OK ]
Thanks in advance!
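One possible route (applies to older Zeppelin versions; the jar path is assumed): load the chapter jar through the %dep interpreter in a paragraph that runs before the first Spark paragraph:

```
%dep
z.load("/path/to/aas/ch06-lsa/target/ch06-lsa-1.0.0.jar")
```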
Nothing goes wrong when I run mvn assembly:assembly under the ch07-graph folder, but at runtime I get:
INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Exception in thread "main" java.lang.NoClassDefFoundError: com/cloudera/datascience/common/XmlInputFormat
at com.cloudera.datascience.graph.RunGraph$.loadMedline(RunGraph.scala:188)
at com.cloudera.datascience.graph.RunGraph$.main(RunGraph.scala:29)
at com.cloudera.datascience.graph.RunGraph.main(RunGraph.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.cloudera.datascience.common.XmlInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 12 more
16/01/15 16:36:50 INFO spark.SparkContext: Invoking stop() from shutdown hook
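The ClassNotFoundException for XmlInputFormat usually means the thin jar was submitted rather than the assembly jar. A hedged fix (jar name assumed from the chapter's build output):

```shell
# submit the jar-with-dependencies so the com.cloudera.datascience.common
# classes from the common module are bundled in
spark-submit --class com.cloudera.datascience.graph.RunGraph \
  --master yarn-client target/ch07-graph-1.0.0-jar-with-dependencies.jar
```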
I just pulled the book source (master 94fa09d) and got the following error when running mvn:
[INFO] --- scala-maven-plugin:3.2.0:compile (default) @ ch10-genomics ---
[INFO] artifact joda-time:joda-time: checking for updates from central
[INFO] /Users/tom/src/scala/aas/ch10-genomics/src/main/scala:-1: info: compiling
[INFO] Compiling 1 source files to /Users/tom/src/scala/aas/ch10-genomics/target/classes at 1424131711406
[ERROR] /Users/tom/src/scala/aas/ch10-genomics/src/main/scala/com/cloudera/datascience/genomics/RunTFPrediction.scala:16: error: object FeaturesContext is not a member of package org.bdgenomics.adam.rdd.features
[ERROR] import org.bdgenomics.adam.rdd.features.FeaturesContext._
[ERROR] ^
followed by 8 others in the same file.
This would appear to be caused by the last commit to ADAM (bigdatagenomics/adam@3f0eadb) which removed the FeaturesContext
and GeneContext
classes and apparently replaced their functions with the more generic loadFeatures
function in the ADAMContext.
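If that's right, the fix is presumably to drop the removed imports and go through ADAMContext instead. A sketch (exact method signature assumed; check it against the ADAM version being built):

```scala
// ADAMContext's implicits add loadFeatures to the SparkContext
import org.bdgenomics.adam.rdd.ADAMContext._

val features = sc.loadFeatures("/path/to/features.bed")
```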
I think in the end we may want module names and descriptions taken directly from the book, instead of things like rdf.
Forgive me if I'm opening and closing too many issues...
Chapter 6 has the code:
val idfs = docFreqs.map{
case (term, count) => (term, math.log(numDocs.toDouble / count))
}.toMap
I don't see numDocs defined in any of the code up to this point. Is this a typo? Should it be numTerms?
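For context, numDocs in the IDF formula should be the total number of documents in the corpus, not numTerms. If it isn't defined earlier, something like the following would supply it (the docTermFreqs name is an assumption based on the chapter's other code):

```scala
// total document count for the IDF numerator
val numDocs = docTermFreqs.count()
val idfs = docFreqs.map {
  case (term, count) => (term, math.log(numDocs.toDouble / count))
}.toMap
```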
1: I tried the code in the chapter 5 k-means clustering example. In my Eclipse for Scala there are some compile errors about Vector.
def distance(a: Vector,b: Vector) = math.sqrt(a.toArray.zip(b.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
There is an error about Vector: type Vector takes type parameters.
I don't know what's wrong.
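On the first question: the bare name Vector resolves to Scala's generic collection type, which takes a type parameter, hence the compile error. Assuming the chapter means MLlib vectors, importing that type explicitly should fix it:

```scala
// shadow scala.collection.immutable.Vector with MLlib's numeric Vector
import org.apache.spark.mllib.linalg.Vector

def distance(a: Vector, b: Vector): Double =
  math.sqrt(a.toArray.zip(b.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
```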
2: Then I tried the code in chapter 8. I couldn't find a jar containing "com.cloudera.datascience.geotime.GeoJsonProtocol._".
In pom.xml there is a dependency
com.cloudera.datascience
common
${project.version}
Where can I fetch this "common" jar? I couldn't find it on "http://mvnrepository.com/".
Could someone help me?
Thanks.