The book example uses the path to wikidump.xml, but the github code is looking at a di

Ch06 - org.apache.spark.SparkException: Task not serializable about aas HOT 8 CLOSED

jackRogers commented on July 28, 2024

Ch06 - org.apache.spark.SparkException: Task not serializable

from aas.

Comments (8)

srowen commented on July 28, 2024

Dir vs file won't matter, as Hadoop libs will parse either. A dir means "all files in the dir" and there is just one.

I don't know the nature of this problem but it's not related to a file vs dir. The shell is not able to work 100% like compiled code and occasionally you get failures because something can't be removed from the closure that is not serializable. It usually means rewriting the code slightly to make the closure capture less. Is this from executing the code in the book?
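To illustrate "make the closure capture less": Spark serializes task closures with Java serialization, so the effect can be reproduced without Spark. This is a minimal sketch, assuming Scala 2.12 or later. `Pipeline`, `stopWords`, and `isSerializable` are hypothetical names made up for the illustration; they are not from the book or the repo.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Try

// Hypothetical helper: true if Java serialization accepts the object.
def isSerializable(obj: AnyRef): Boolean =
  Try(new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)).isSuccess

// Hypothetical enclosing class that is deliberately NOT Serializable.
class Pipeline {
  val stopWords: Set[String] = Set("the", "a")

  // Reads the field through `this`, so the function captures the whole Pipeline.
  def capturesThis: String => Boolean = word => !stopWords.contains(word)

  // Copies the field into a local val first, so only the Set is captured.
  def capturesLocal: String => Boolean = {
    val localStopWords = stopWords
    word => !localStopWords.contains(word)
  }
}

val pipeline = new Pipeline
println(isSerializable(pipeline.capturesThis))
println(isSerializable(pipeline.capturesLocal))
```

The first function should fail to serialize because it drags the enclosing `Pipeline` along with it; the second should succeed because it only holds the `Set`. In the shell the enclosing object is often a wrapper the REPL generates, which is why the same code can behave differently once compiled.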

The book code is meant to stand alone as-is and be executable in the shell in order. Occasionally it omits a detail and delegates to earlier listings or the Github repo. The Github code is supposed to be a slightly more fleshed out, complete version that could be compiled as a unit, but may not be useful to execute. It should however run as-is as well.

What specifically are the problems?

jackRogers commented on July 28, 2024

Ah, that's good to know.

I will go through the process again today and detail the issues I run into.

jackRogers commented on July 28, 2024

I have two versions of my code in different states of brokenness.

Both are using an XML dump of Zendesk Tickets.
Using WikipediaPage.java as a template, I created a parser for Zendesk Tickets.
https://github.com/jackRogers/Spark-LSA/blob/master/ZendeskTicket.java

The parser has worked great. Using mostly book examples with it, I was able to get the SVD computed before type mismatch errors, caused by issues with the code in the textbook, began to pop up.
https://github.com/jackRogers/Spark-LSA/blob/master/ticketanalysis-book.scala

I call this script like this:

sudo /opt/spark-1.4.0/bin/spark-shell --master spark://Redacted:7077 --jars aas-master/ch06-lsa/target/ch06-lsa-1.0.0-jar-with-dependencies.jar,.m2/repository/edu/umd/cloud9/1.5.0/cloud9-1.5.0.jar,.m2/repository/com/redacted/zendesk/1.0-SNAPSHOT/zendesk-1.0-SNAPSHOT.jar --total-executor-cores 8 --driver-cores 10 --driver-memory 10G --executor-memory 1G -i ticketanalysis-book.scala

I tried a second time, mostly using the GitHub code as a base, but things blow up sooner: flatMap throws a serialization error when plaintext is computed. This error is shown in the first post of this ticket. I couldn't figure out what to do with the separation between ParseWikipedia.scala and RunLSA.scala, so I combined them into a single file.
https://github.com/jackRogers/Spark-LSA/blob/master/ticketanalysis-github.scala

I run this script like this:

Any help would be appreciated. I know the issue isn't with my parser as it works fine in the first example with pretty much the same code, so there is something early on in the github-code based version that causes it to blow up.

srowen commented on July 28, 2024

I think this may be a bit beyond the scope of the book code and repo. In general the spark-shell is good for prototyping and experimentation, but by its nature, has some extra quirks about how it compiles code. "Not serializable" errors are common and sometimes are in fact due to problems with how the Spark app has been constructed. These can be more acute through the shell. So one recommendation is to review exactly what isn't serializable in your code and debug that, and if at a loss, try running as a compiled program outside the shell to compare.
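One way to "review exactly what isn't serializable" is to push a suspect object through Java serialization by hand and read the class name from the exception. This is a rough sketch, not something from the book or repo: `firstNonSerializable` is a hypothetical helper, and the tuple is a stand-in for whatever your closure references.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical helper: returns the name of the first class that Java
// serialization rejects while writing `obj`, or None if it serializes.
def firstNonSerializable(obj: AnyRef): Option[String] = {
  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  try {
    out.writeObject(obj)
    None
  } catch {
    // With default JVM settings the message is the offending class name.
    case e: NotSerializableException => Option(e.getMessage)
  }
}

// Stand-in for captured state: a serializable tuple holding one
// non-serializable member (a plain java.lang.Object).
val suspect = ("ticket", new Object())
println(firstNonSerializable(suspect))
println(firstNonSerializable("just a string"))
```

Starting the JVM with `-Dsun.io.serialization.extendedDebugInfo=true` makes the same exception also list the chain of fields that led to the offending object, which can help when the culprit is nested several levels deep.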

Right now it's not clear what's not serializable, and I'm not sure whether you're reporting a problem in code taken exactly from the book or repo, or in your own modified code.

jackRogers commented on July 28, 2024

I will try to compile it and see if that helps. I guess I'm reporting code from both the book and the repo, each one running into different errors whether I use the original Wikipedia dataset or my own. The changes I've made to the original book and GitHub code are almost all from other issues I've reported on here.

Is there any documentation on how to compile and run the RunLSA/ParseWikipedia code from ch06 in this repo?

srowen commented on July 28, 2024

You can compile it as-is with Maven. You can probably run it via Maven too, using the exec plugin (mvn exec:java). It wasn't really meant to be run this way, though you can; it was more meant to be an elaborated version of the snippets in the book, which are snippets more than a program. In Ch 6 I think it diverged a bit much. In other chapters the two match more closely and the text is a subset of the repo.

jackRogers commented on July 28, 2024

I'll give that a shot. I just want to successfully perform LSA using Spark. The book seems to show that you were able to do that at some point with some set of code. I'm hoping for an example of an implementation I can use to get this thing working, any working example at all. If the GitHub snippets weren't meant to be used and the code in the book is incomplete, then I am kind of screwed.

srowen commented on July 28, 2024

I didn't say that; the snippets in the book are meant to be used, as snippets, which might help write your own similar code. The Github repo is supposed to be the same code but with all the extra details around it filled in so it can compile normally, and some extra structure. It's also meant to be used in the same sense. I'm not sure running the code from a chapter end to end does something useful other than execute all of the illustrations in order, yes. None of the code is a library. You'll want to lift and adapt sections to create your own application. If there's a problem in your app that you're pretty sure manifests in the code here too, we can have a look; if it's really your app code, might be out of scope.
