Comments (8)

srowen commented on July 28, 2024

Dir vs file won't matter, as Hadoop libs will parse either. A dir means "all files in the dir" and there is just one.

I don't know the nature of this problem but it's not related to a file vs dir. The shell is not able to work 100% like compiled code and occasionally you get failures because something can't be removed from the closure that is not serializable. It usually means rewriting the code slightly to make the closure capture less. Is this from executing the code in the book?
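To illustrate what "making the closure capture less" can look like, here is a minimal sketch in plain Java, with no Spark dependency. It assumes only that closures are shipped with ordinary Java serialization, which is what Spark uses for closures by default. `TicketJob`, `SerFunction`, and `isSerializable` are hypothetical names invented for this illustration; they are not from the book or the repo. A function that reads a field of its enclosing object captures the whole object, while copying the field into a local variable first means only that value is captured.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class ClosureCapture {

    // A function type that is also Serializable, similar in spirit to the
    // function objects that have to be shipped from the driver to executors.
    interface SerFunction<A, B> extends Function<A, B>, Serializable {}

    // Hypothetical driver-side helper that is NOT Serializable.
    static class TicketJob {
        private final String prefix;

        TicketJob(String prefix) {
            this.prefix = prefix;
        }

        // Reads the field inside the lambda, so the lambda captures `this`
        // and drags the whole (non-serializable) TicketJob along.
        SerFunction<String, String> capturesThis() {
            return id -> prefix + id;
        }

        // Copies the field to a local first, so only a String is captured.
        SerFunction<String, String> capturesLocal() {
            final String localPrefix = prefix;
            return id -> localPrefix + id;
        }
    }

    // Returns true if Java serialization accepts the object.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        TicketJob job = new TicketJob("ticket-");
        System.out.println("captures this:  " + isSerializable(job.capturesThis()));  // false
        System.out.println("captures local: " + isSerializable(job.capturesLocal())); // true
    }
}
```

The same idea carries over to Scala code pasted into the shell: a function that refers to a member of an enclosing object can pull that object into the closure, and assigning the needed value to a local `val` before the `map` or `flatMap` call is one common way to avoid that.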

The book code is meant to stand alone as-is and be executable in the shell in order. Occasionally it omits a detail and delegates to earlier listings or the GitHub repo. The GitHub code is supposed to be a slightly more fleshed-out, complete version that could be compiled as a unit, but may not be useful to execute. It should, however, run as-is as well.

What specifically are the problems?

from aas.

jackRogers commented on July 28, 2024

Ah, that's good to know.

I will go through the process again today and detail the issues I run into.


jackRogers commented on July 28, 2024

I have two versions of my code in different states of brokenness.

Both are using an XML dump of Zendesk Tickets.
Using WikipediaPage.java as a template, I created a parser for Zendesk Tickets.
https://github.com/jackRogers/Spark-LSA/blob/master/ZendeskTicket.java

The parser has worked great, and I was able to use mostly book examples with it to get the SVD computed. After that, type mismatch errors caused by issues with the code in the textbook began to pop up.
https://github.com/jackRogers/Spark-LSA/blob/master/ticketanalysis-book.scala

I call this script like this:

sudo /opt/spark-1.4.0/bin/spark-shell --master spark://Redacted:7077 --jars aas-master/ch06-lsa/target/ch06-lsa-1.0.0-jar-with-dependencies.jar,.m2/repository/edu/umd/cloud9/1.5.0/cloud9-1.5.0.jar,.m2/repository/com/redacted/zendesk/1.0-SNAPSHOT/zendesk-1.0-SNAPSHOT.jar --total-executor-cores 8 --driver-cores 10 --driver-memory 10G --executor-memory 1G -i ticketanalysis-book.scala

I tried a second time, mostly using the GitHub code as a base, but things blow up sooner: flatMap throws a serialization error when plaintext is computed. This error is shown in the first post of this ticket. I couldn't figure out what to do with the separation between ParseWikipedia.scala and RunLSA.scala, so I combined them into a single file.
https://github.com/jackRogers/Spark-LSA/blob/master/ticketanalysis-github.scala

I run this script like this:

sudo /opt/spark-1.4.0/bin/spark-shell --master spark://Redacted:7077 --jars aas-master/ch06-lsa/target/ch06-lsa-1.0.0-jar-with-dependencies.jar,.m2/repository/edu/umd/cloud9/1.5.0/cloud9-1.5.0.jar,.m2/repository/com/redacted/zendesk/1.0-SNAPSHOT/zendesk-1.0-SNAPSHOT.jar --total-executor-cores 8 --driver-cores 10 --driver-memory 10G --executor-memory 1G -i ticketanalysis-github.scala

Any help would be appreciated. I know the issue isn't with my parser, as it works fine in the first example with pretty much the same code, so something early on in the GitHub-based version causes it to blow up.


srowen commented on July 28, 2024

I think this may be a bit beyond the scope of the book code and repo. In general the spark-shell is good for prototyping and experimentation, but by its nature, has some extra quirks about how it compiles code. "Not serializable" errors are common and sometimes are in fact due to problems with how the Spark app has been constructed. These can be more acute through the shell. So one recommendation is to review exactly what isn't serializable in your code and debug that, and if at a loss, try running as a compiled program outside the shell to compare.
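One way to "review exactly what isn't serializable" is to try serializing the suspect object directly and read the class name out of the exception. The sketch below is plain Java with no Spark dependency, and assumes default Java serialization is in play. `XmlParserHandle`, `TicketContext`, and `firstNonSerializableClass` are hypothetical names made up for this illustration. It relies on `NotSerializableException` carrying the name of the offending class as its message.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Optional;

public class SerializationProbe {

    // Hypothetical stand-in for a parser or client object that cannot be serialized.
    static class XmlParserHandle {}

    // Hypothetical context object: Serializable itself, but holding a field that is not.
    static class TicketContext implements Serializable {
        private static final long serialVersionUID = 1L;
        final String source;
        final XmlParserHandle parser;

        TicketContext(String source, XmlParserHandle parser) {
            this.source = source;
            this.parser = parser;
        }
    }

    // Tries to serialize the object; returns the name of the first class that
    // Java serialization rejects, or an empty Optional if everything serializes.
    static Optional<String> firstNonSerializableClass(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return Optional.empty();
        } catch (NotSerializableException e) {
            // The exception message is the name of the class that failed.
            return Optional.ofNullable(e.getMessage());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Object ok = "plain strings serialize fine";
        Object bad = new TicketContext("tickets.xml", new XmlParserHandle());

        System.out.println("ok:  " + firstNonSerializableClass(ok).orElse("serializable"));
        System.out.println("bad: " + firstNonSerializableClass(bad).orElse("serializable"));
    }
}
```

If the class name alone is not enough to locate the reference, starting the JVM with `-Dsun.io.serialization.extendedDebugInfo=true` makes the exception include the chain of fields that led to the offending object, which may help when the object was pulled in indirectly by a closure.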

Right now it's not clear what's not serializable, and I'm not sure whether what you're reporting is a problem with your own code or with code taken exactly from the book or repo.


jackRogers commented on July 28, 2024

I will try to compile it and see if that helps. I guess I'm reporting code from both the book and the repo, each one running into different errors whether I use the original Wikipedia dataset or my own. The changes I've made to the original book and GitHub code are almost all from other issues I've reported here.

Is there any documentation on how to compile and run the RunLSA/ParseWikipedia code from ch06 in this repo?


srowen commented on July 28, 2024

You can compile it as-is with Maven. You can probably run it via Maven too, using the exec plugin (mvn exec:java). It wasn't really meant to be run this way, though you can; it was meant to be an elaborated version of the snippets in the book, which are snippets more than a program. In Ch 6 I think it diverged a bit too much. In other chapters the two match more closely and the text is a subset of the repo.


jackRogers commented on July 28, 2024

I'll give that a shot. I just want to successfully perform LSA using Spark. The book seems to show that you were able to do that at some point with some set of code. I'm hoping for an example of an implementation I can use to get this thing working, any working example at all. If the GitHub snippets weren't meant to be used and the code in the book is incomplete, then I am kind of screwed.


srowen commented on July 28, 2024

I didn't say that; the snippets in the book are meant to be used, as snippets, which might help you write your own similar code. The GitHub repo is supposed to be the same code but with all the extra details around it filled in so it can compile normally, plus some extra structure. It's also meant to be used in the same sense. I'm not sure that running the code from a chapter end to end does anything useful other than execute all of the illustrations in order. None of the code is a library. You'll want to lift and adapt sections to create your own application. If there's a problem in your app that you're pretty sure manifests in the code here too, we can have a look; if it's really your app code, it might be out of scope.
