jadianes / spark-movie-lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
License: Other
Hi jadianes,
Thank you for all your work, it really helped me. However, the iteration setting in the engine causes an error on my system when it goes above 5, and I think 5 iterations are not enough for a good recommendation. Can you suggest a way to fix it? The error is:
File "F:\bitirme\spark-2.0.1-bin-hadoop2.7\python\pyspark\mllib\common.py", line 123, in callJavaFunc
17/04/24 22:58:09 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-7,5,main]
java.lang.StackOverflowError
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:147)
at org.apache.spark.util.ByteBufferInputStream.read(ByteBufferInputStream.scala:52)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(5,1493063889491,JobFailed(org.apache.spark.SparkException: Job 5 cancelled because SparkContext was shut down)
My system specifications:
i7 2nd gen
6 GB RAM
SSD 550/440
ASUS N53SV laptop
(C:\Program Files\Anaconda3) F:\Data Science\movielens\spark-movie-lens>server.py
Traceback (most recent call last):
File "F:\Data Science\movielens\spark-movie-lens\server.py", line 3, in
from app import create_app
File "F:\Data Science\movielens\spark-movie-lens\app.py", line 5, in
from engine import RecommendationEngine
File "F:\Data Science\movielens\spark-movie-lens\engine.py", line 115
self.seed = 5L
^
SyntaxError: invalid syntax
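This failure is a Python 2 vs Python 3 incompatibility rather than a logic bug: the `L` suffix on integer literals was removed in Python 3, so `self.seed = 5L` is a SyntaxError there. A minimal sketch of the fix (the shim names are illustrative):

```python
# Python 3 integers are unbounded, so the Python 2 long-literal suffix
# is simply dropped; `self.seed = 5L` becomes:
seed = 5  # valid under both Python 2 and Python 3

# If an explicit long type is needed for 2/3 portability, a small shim
# works:
try:
    long_type = long       # Python 2: the builtin long exists
except NameError:
    long_type = int        # Python 3: int covers arbitrary precision
seed = long_type(5)
```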
Currently, the example code looks like:
my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = new_ratings_model.predictAll(new_user_unrated_movies_RDD)
individual_movie_rating_RDD.take(1)
Should it be this instead?
my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = new_ratings_model.predictAll(my_movie)
individual_movie_rating_RDD.collect()
Hello, I ran the code in building-recommender.ipynb with pyspark. When execution reaches this line:
error_complete = math.sqrt(rates_and_preds_complete.map(lambda r:(r[1][0]-r[1][1])**2).mean())
I get the error below:
It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation.
Hello,
As I was following the guide, I found that the variable sc was not defined; I figured it belongs to Spark.
However, I don't know how to configure Spark to run the notebook.
I'm on Windows. Any help?
In file engine.py, function get_top_ratings, the code reads:
user_unrated_movies_RDD = self.movies_RDD.filter(lambda rating: not rating[1]==user_id).map(lambda x: (user_id, x[0]))
Elements of self.movies_RDD have the form (movie_id, movie_title, movie_category), so rating[1] is the movie_title, not a user id. I guess self.movies_RDD should be self.ratings_RDD; please check this.
py4j.protocol.Py4JJavaError: An error occurred while calling o96.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 7.0 failed 1 times, most recent failure: Lost task 5.0 in stage 7.0 (TID 54, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
Unable to proceed further.
Any help is much appreciated!
spark-movie-lens/engine.py", line 80, in get_top_ratings
user_unrated_movies_rdd = self.movies_rdd.filter(lambda rating: not rating[1] == user_id)\
AttributeError: RecommendationEngine instance has no attribute 'movies_rdd'
FYI:
spark-movie-lens/engine.py", line 121
self.seed = 5L
SyntaxError: invalid syntax
Python 3.5 does not support the L suffix on integer literals.
I want to build a real-time recommender system. How can I achieve this with your spark-movie-lens project? Can you give me some suggestions? Thank you very much.
Thanks for your open source code. I have a small question about updating the model for new users and new movies.
In engine.py, when we add a rating, the engine calls self.__train_model(), which means recomputing the whole model over all ratings again. Do you know how to incrementally augment the model using only the new ratings?
Any pointers would help. Thank you very much.
Hello,
I've managed to run the project locally, but the output from getting top recommendations repeats the same movie.
Does anyone else experience this behavior?
Note that I ran it with the exact source files from this repo.
The output in question looks like this:
"[["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30], ["The War (2007)", 8.836370207914264, 30]]"
Thank you.
new_user_unrated_movies_RDD = (complete_movies_data.filter(lambda x: x[0] not in new_user_ratings_ids).map(lambda x: (new_user_ID, x[0])))
The list of unrated movies contains duplicates:
print(new_user_unrated_movies_RDD.take(10))
[(0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]
Should there be a distinct added?
new_user_unrated_movies_RDD = (complete_movies_data.filter(lambda x: x[0] not in new_user_ratings_ids).map(lambda x: (new_user_ID, x[0]))).distinct()
print(new_user_unrated_movies_RDD.take(10))
[(0, 378), (0, 1934), (0, 3282), (0, 5606), (0, 862), (0, 2146), (0, 3766), (0, 1330), (0, 2630), (0, 4970)]
The predict function that receives new_user_unrated_movies_RDD:
# Use the input RDD, new_user_unrated_movies_RDD, with new_ratings_model.predictAll() to predict new ratings for the movies
new_user_recommendations_RDD = new_ratings_model.predictAll(new_user_unrated_movies_RDD)
I am using the same notebook on Cloudera's QuickStart VM with Anaconda installed; I have made no other changes.
On this step:
small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]
it gives an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-21-61849ee50ee7> in <module>()
----> 1 small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]
/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
1265 """
1266 items = []
-> 1267 totalParts = self.getNumPartitions()
1268 partsScanned = 0
1269
/usr/lib/spark/python/pyspark/rdd.py in getNumPartitions(self)
354 2
355 """
--> 356 return self._jrdd.partitions().size()
357
358 def filter(self, f):
/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling o108.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/datasets/ml-latest-small/ratings.csv
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:64)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
I checked the previous step:
small_ratings_raw_data
This gives the result:
/home/cloudera/datasets/ml-latest-small/ratings.csv MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:-2
Could you please help me with this?
Getting a StackOverflowError while running the application engine:
An error occurred while calling o90.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 49, 192.168.110.130): java.lang.StackOverflowError
at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2846)
at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1455)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invok
Hi Jose, great job!
I'm new to GitHub, so please pardon me if this should not be reported as an issue, but I wanted to bring to your attention the ratings that we provide to the complete dataset for a new user.
The range for new user ratings seems to be [0, 10], and when the recommendation engine makes predictions, it returns predicted ratings in a similar range. Shouldn't it be in the [0, 5] range? When I supply ratings in that range, it predicts movie ratings in [0, 5], but the predictions are drastically different from what they were earlier. Am I missing something here?