jadianes / spark-movie-lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
License: Other
Hi jadianes,
Thank you for all your work, it really helped me. However, the iteration setting in the engine causes an error on my system when it goes above 5, and I think 5 iterations are not enough for a good recommendation. Can you suggest a way to fix it? The error is:
File "F:\bitirme\spark-2.0.1-bin-hadoop2.7\python\pyspark\mllib\common.py", line 123, in callJavaFunc
17/04/24 22:58:09 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-7,5,main]
java.lang.StackOverflowError
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:147)
at org.apache.spark.util.ByteBufferInputStream.read(ByteBufferInputStream.scala:52)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(5,1493063889491,JobFailed(org.apache.spark.SparkException: Job 5 cancelled because SparkContext was shut down)
My system specifications:
i7 2nd gen
6 GB RAM
SSD 550/440
ASUS N53SV laptop
(C:\Program Files\Anaconda3) F:\Data Science\movielens\spark-movie-lens>server.py
Traceback (most recent call last):
File "F:\Data Science\movielens\spark-movie-lens\server.py", line 3, in
from app import create_app
File "F:\Data Science\movielens\spark-movie-lens\app.py", line 5, in
from engine import RecommendationEngine
File "F:\Data Science\movielens\spark-movie-lens\engine.py", line 115
self.seed = 5L
^
SyntaxError: invalid syntax
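This failure is a Python 2 vs Python 3 incompatibility rather than a logic bug: the `L` suffix on integer literals was removed in Python 3, so `self.seed = 5L` is a SyntaxError there. A minimal sketch of the fix (the shim names are illustrative):

```python
# Python 3 integers are unbounded, so the Python 2 long-literal suffix
# is simply dropped; `self.seed = 5L` becomes:
seed = 5  # valid under both Python 2 and Python 3

# If an explicit long type is needed for 2/3 portability, a small shim
# works:
try:
    long_type = long       # Python 2: the builtin long exists
except NameError:
    long_type = int        # Python 3: int covers arbitrary precision
seed = long_type(5)
```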
Currently, the example code looks like:
my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = new_ratings_model.predictAll(new_user_unrated_movies_RDD)
individual_movie_rating_RDD.take(1)
Should it be this instead?
my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = new_ratings_model.predictAll(my_movie)
individual_movie_rating_RDD.collect()
Hello, I ran the code in building-recommender.ipynb with pyspark. When execution reaches this line:
error_complete = math.sqrt(rates_and_preds_complete.map(lambda r:(r[1][0]-r[1][1])**2).mean())
I get the error below:
It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation.
Hello,
As I was following the guide, I found that the variable sc was not defined; I figured it belongs to Spark.
However, I don't know how to configure Spark to run the notebook.
I'm on Windows. Any help?
In file engine.py, function get_top_ratings, the code reads:
user_unrated_movies_RDD = self.movies_RDD.filter(lambda rating: not rating[1]==user_id).map(lambda x: (user_id, x[0]))
Elements of self.movies_RDD have the form (movie_id, movie_title, movie_category), so rating[1] is the movie_title, not a user id. I guess self.movies_RDD should be self.ratings_RDD; please check this.
py4j.protocol.Py4JJavaError: An error occurred while calling o96.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 7.0 failed 1 times, most recent failure: Lost task 5.0 in stage 7.0 (TID 54, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
Unable to proceed further.
Any help is much appreciated!
spark-movie-lens/engine.py", line 80, in get_top_ratings
user_unrated_movies_rdd = self.movies_rdd.filter(lambda rating: not rating[1] == user_id)\
AttributeError: RecommendationEngine instance has no attribute 'movies_rdd'
FYI:
spark-movie-lens/engine.py", line 121
self.seed = 5L
SyntaxError: invalid syntax
Python 3.5 does not support the L suffix on integer literals.
I want to build a real-time recommender system. How can I achieve this with your spark-movie-lens project? Can you give me some suggestions? Thank you very much.
Thanks for your open source code. I have a small question about updating the model for new users and new movies.
In engine.py, when we add a rating, the engine calls self.__train_model(), which means recomputing the whole model over all ratings again. Do you know how to incrementally augment the model using only the new ratings?
Any pointers would help. Thank you very much.
Hello,
I've managed to run the project locally, but the output from getting top recommendations repeats the same movie.
Does anyone else experience this behavior?
Note that I ran it with the exact source files from this repo.
The output in question looks like this:
"[["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30]]"
Thank you.
new_user_unrated_movies_RDD = (complete_movies_data.filter(lambda x: x[0] not in new_user_ratings_ids).map(lambda x: (new_user_ID, x[0])))
The list of unrated movies contains duplicates:
print(new_user_unrated_movies_RDD.take(10))
[(0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]
Should there be a distinct added?
new_user_unrated_movies_RDD = (complete_movies_data.filter(lambda x: x[0] not in new_user_ratings_ids).map(lambda x: (new_user_ID, x[0]))).distinct()
print(new_user_unrated_movies_RDD.take(10))
[(0, 378), (0, 1934), (0, 3282), (0, 5606), (0, 862), (0, 2146), (0, 3766), (0, 1330), (0, 2630), (0, 4970)]
The predict function that receives new_user_unrated_movies_RDD:
# Use the input RDD, new_user_unrated_movies_RDD, with new_ratings_model.predictAll() to predict new ratings for the movies
new_user_recommendations_RDD = new_ratings_model.predictAll(new_user_unrated_movies_RDD)
I am using the same notebook on Cloudera's QuickStart VM with Anaconda installed; I have made no other changes.
On this step:
small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]
it gives an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-21-61849ee50ee7> in <module>()
----> 1 small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]
/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
1265 """
1266 items = []
-> 1267 totalParts = self.getNumPartitions()
1268 partsScanned = 0
1269
/usr/lib/spark/python/pyspark/rdd.py in getNumPartitions(self)
354 2
355 """
--> 356 return self._jrdd.partitions().size()
357
358 def filter(self, f):
/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling o108.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/datasets/ml-latest-small/ratings.csv
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:64)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
I checked the previous step:
small_ratings_raw_data
This gives the result:
/home/cloudera/datasets/ml-latest-small/ratings.csv MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:-2
Could you please help me with this?
Getting a StackOverflowError while running the application engine:
An error occurred while calling o90.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 49, 192.168.110.130): java.lang.StackOverflowError
at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2846)
at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1455)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invok
Hi Jose, great job!
I'm new to GitHub, so please pardon me if this should not be reported as an issue, but I wanted to bring to your attention the ratings that we provide to the complete dataset for a new user.
The range for new user ratings seems to be [0, 10], and when the recommendation engine makes predictions, it returns predicted ratings in a similar range. Shouldn't it be in the [0, 5] range? When I supply ratings in that range, it predicts movie ratings in [0, 5], but the predictions are drastically different from what they were earlier. Am I missing something here?