Comments (8)
Dir vs file won't matter, as Hadoop libs will parse either. A dir means "all files in the dir" and there is just one.
I don't know the nature of this problem but it's not related to a file vs dir. The shell is not able to work 100% like compiled code and occasionally you get failures because something can't be removed from the closure that is not serializable. It usually means rewriting the code slightly to make the closure capture less. Is this from executing the code in the book?
The book code is meant to stand alone as-is and be executable in the shell in order. Occasionally it omits a detail and delegates to earlier listings or the GitHub repo. The GitHub code is supposed to be a slightly more fleshed out, complete version that could be compiled as a unit, but may not be useful to execute. It should however run as-is as well.
What specifically are the problems?
from aas.
Ah, that's good to know.
I will go through the process again today and detail the issues I run into.
from aas.
I have two versions of my code in different states of brokenness.
Both use an XML dump of Zendesk tickets.
Using WikipediaPage.java as a template, I created a parser for Zendesk Tickets.
https://github.com/jackRogers/Spark-LSA/blob/master/ZendeskTicket.java
The parser has worked great. Using mostly book examples with it, I was able to get the SVD computed before type mismatch errors, caused by issues with the code in the textbook, began to pop up.
https://github.com/jackRogers/Spark-LSA/blob/master/ticketanalysis-book.scala
I call this script like this:
sudo /opt/spark-1.4.0/bin/spark-shell --master spark://Redacted:7077 --jars aas-master/ch06-lsa/target/ch06-lsa-1.0.0-jar-with-dependencies.jar,.m2/repository/edu/umd/cloud9/1.5.0/cloud9-1.5.0.jar,.m2/repository/com/redacted/zendesk/1.0-SNAPSHOT/zendesk-1.0-SNAPSHOT.jar --total-executor-cores 8 --driver-cores 10 --driver-memory 10G --executor-memory 1G -i ticketanalysis-book.scala
I tried a second time, mostly using the GitHub code as a base, but things blow up sooner: flatMap throws a serialization error when plaintext is computed. This error is shown in the first post of this ticket. I couldn't figure out what to do with the separation between ParseWikipedia.scala and RunLSA.scala, so I combined them into a single file.
https://github.com/jackRogers/Spark-LSA/blob/master/ticketanalysis-github.scala
I run this script like this:
sudo /opt/spark-1.4.0/bin/spark-shell --master spark://Redacted:7077 --jars aas-master/ch06-lsa/target/ch06-lsa-1.0.0-jar-with-dependencies.jar,.m2/repository/edu/umd/cloud9/1.5.0/cloud9-1.5.0.jar,.m2/repository/com/redacted/zendesk/1.0-SNAPSHOT/zendesk-1.0-SNAPSHOT.jar --total-executor-cores 8 --driver-cores 10 --driver-memory 10G --executor-memory 1G -i ticketanalysis-github.scala
Any help would be appreciated. I know the issue isn't with my parser as it works fine in the first example with pretty much the same code, so there is something early on in the github-code based version that causes it to blow up.
from aas.
I think this may be a bit beyond the scope of the book code and repo. In general the spark-shell is good for prototyping and experimentation but, by its nature, has some extra quirks in how it compiles code. "Not serializable" errors are common and sometimes are in fact due to problems with how the Spark app has been constructed. These can be more acute through the shell. So one recommendation is to review exactly what isn't serializable in your code and debug that, and if at a loss, try running as a compiled program outside the shell to compare.
Right now it's not clear what's not serializable and I'm not sure if this is something you're reporting vs code exactly from the book or repo.
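To illustrate the kind of closure problem described above, here is a minimal sketch that does not need Spark at all. It assumes plain Java serialization is a reasonable stand-in for what Spark does when it ships a closure to executors. Tokenizer, capturesThis, capturesLocal and ClosureCheck are hypothetical names made up for this example; none of them come from the book or the repo.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a helper that is not serializable
// (a parser, a connection, a shell wrapper object, ...).
class Tokenizer(val minLength: Int) { // deliberately does NOT extend Serializable

  // Reads minLength through the enclosing instance, so the closure captures `this`.
  def capturesThis: String => Boolean =
    (s: String) => s.length >= minLength

  // Copies the field into a local val first, so the closure captures only an Int.
  def capturesLocal: String => Boolean = {
    val min = minLength
    (s: String) => s.length >= min
  }
}

object ClosureCheck {
  // Tries to Java-serialize the closure and reports whether that succeeded.
  def isSerializable(f: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(f)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val t = new Tokenizer(3)
    println(s"capturesThis serializable:  ${isSerializable(t.capturesThis)}")
    println(s"capturesLocal serializable: ${isSerializable(t.capturesLocal)}")
  }
}
```

The first closure drags the whole Tokenizer along with it and should fail the check, while the second carries only the copied value. "Rewriting the code slightly to make the closure capture less" usually means this kind of change: copy what the function needs into local vals before the map or flatMap call.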
from aas.
I will try to compile it and see if that helps. I guess I'm reporting code from both the book and the repo, each one running into different errors whether I use the original wikipedia dataset or my own. The changes I've made from the original book and github code are almost all from other issues I've reported on here.
Is there any documentation on how to compile and run the RunLSA/ParseWikipedia code from ch06 in this repo?
from aas.
You can compile it as-is with Maven. You can probably run it via Maven if you whip out the exec plugin (for example, mvn exec:java). It wasn't really meant to be run this way, though you can; it was more meant to be an elaborated version of the snippets in the book, which are snippets more than a program. In Ch 6 I think it diverged a bit much. In other chapters the two match more closely and the text is a subset of the repo.
from aas.
I'll give that a shot. I just want to successfully perform LSA using Spark. The book seems to show that you were able to do that at some point with some set of code. I'm hoping for an example implementation I can use to get this working; any working example would do. If the GitHub snippets weren't meant to be used and the code in the book is incomplete, then I am kind of screwed.
from aas.
I didn't say that; the snippets in the book are meant to be used, as snippets, which might help write your own similar code. The GitHub repo is supposed to be the same code but with all the extra details around it filled in so it can compile normally, and some extra structure. It's also meant to be used in the same sense. I'm not sure running the code from a chapter end to end does anything useful other than execute all of the illustrations in order. None of the code is a library. You'll want to lift and adapt sections to create your own application. If there's a problem in your app that you're pretty sure manifests in the code here too, we can have a look; if it's really your app code, it might be out of scope.
from aas.