EE-UNN_datasets

Four recommendation scenarios were created from the original MovieLens and Netflix datasets: (1) an active dataset, (2) a balanced dataset, (3) a trendy dataset, and (4) a repeat dataset. For each scenario, certain users and movies were removed from the original data. To evaluate each scenario against the baseline, we ran 300 rounds of recommendations in every MovieLens experiment (small datasets) and 2000 rounds in every Netflix experiment (large datasets).

  • Active dataset: The active dataset reflects the naive state of real-world data: nothing is manipulated beyond selecting the most active users and movies. A movie is marked as active only when at least 30% of the users have rated it, and an active user must have rated at least 30% of the movies in the dataset. We removed non-active movies and users from the original dataset, and stopped when all remaining users and movies were active. The proportion of likes in the active dataset was much greater than that of dislikes in both MovieLens and Netflix, which indicates that active users tend to like active movies. The ten target movies were randomly selected from the active dataset; for each of MovieLens and Netflix, we randomly chose 200 movies to form the final active dataset. This dataset was used to compare the performance of the various recommendation models. Note that because the original Netflix data had already been filtered once, the Netflix active dataset was more skewed than the MovieLens one.

  • Balanced dataset: As in most recommender system (RS) studies, we address data skew by assuming that user preferences are ambivalent. The balanced dataset was generated from the active dataset with the additional requirement that users like around half of the movies. Because the active dataset has more likes than dislikes, it is easier for an RS to make a successful recommendation there. The balanced dataset, by contrast, contains only around 50% likes, which lowers the chance of a successful recommendation and makes it better suited to comparing RSs. Theoretically, the Random RS has a 50% chance of making a successful recommendation; we expect other RSs to yield better results. The 10 target movies were again randomly selected from the dataset. This dataset was used to examine the performance of the various models under balanced conditions.

  • Trendy dataset: Assuming that popular tastes change over time, this study used the Euclidean distance between movies to represent changing tastes: the greater the distance, the more different the taste. To observe changes in taste trends, we selected the three movies in the active dataset with the largest Euclidean distances between them as our target movies. The trendy dataset was then generated from the active dataset by removing the users who had not rated all three target movies. To simulate trendy preferences, users were sorted by their preference patterns across the three movies. This dataset was used to evaluate model performance when changes in trends must be captured.

  • Repeat dataset: This dataset was used to evaluate how many different liked movies an RS recommends to a given user who uses it multiple times. We observed that some RSs tend to recommend the same high-rated movie again and again without leaving room for exploration. Although this yields low regret, users are unlikely to be satisfied by such repeated recommendations. An RS should cover as many different liked movies as possible over repeated visits by a given user. To understand how different models explore during such a recommendation process, we randomly selected 10 users and 10 movies from the balanced dataset and simulated the users visiting the RS repeatedly. Theoretically, an RS needs only 100 trials to determine the like/dislike ground truth for these 10 users and 10 movies.
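The filtering steps described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the code used to build the datasets: ratings are assumed to be held in a users x movies array with NaN marking unrated entries, users and movies are assumed to be dropped together in each pass, and "longest Euclidean distances" is assumed to mean the triple with the largest sum of pairwise distances. The function names `filter_active`, `users_who_rated_all`, and `farthest_triple` are hypothetical.

```python
import itertools

import numpy as np


def filter_active(ratings, threshold=0.3):
    """Drop non-active users and movies until everything left is active.

    ratings: 2-D float array (users x movies), NaN = unrated (assumed layout).
    Returns the kept user and movie indices into the original array.
    """
    users = np.arange(ratings.shape[0])
    movies = np.arange(ratings.shape[1])
    while users.size and movies.size:
        rated = ~np.isnan(ratings[np.ix_(users, movies)])
        user_ok = rated.mean(axis=1) >= threshold   # share of movies each user rated
        movie_ok = rated.mean(axis=0) >= threshold  # share of users who rated each movie
        if user_ok.all() and movie_ok.all():
            break
        # Assumption: remove failing users and movies in the same pass.
        users, movies = users[user_ok], movies[movie_ok]
    return users, movies


def users_who_rated_all(ratings, movie_ids):
    """Indices of users with a rating for every movie in movie_ids."""
    missing = np.isnan(ratings[:, list(movie_ids)]).any(axis=1)
    return np.flatnonzero(~missing)


def farthest_triple(movie_vectors):
    """Pick three movies maximizing the sum of pairwise Euclidean distances.

    movie_vectors: (n_movies, n_features) array without NaN (assumed).
    """
    v = np.asarray(movie_vectors, dtype=float)
    sq = (v ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (v @ v.T)
    dist = np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding
    return max(
        itertools.combinations(range(len(v)), 3),
        key=lambda t: dist[t[0], t[1]] + dist[t[0], t[2]] + dist[t[1], t[2]],
    )
```

In this sketch, `filter_active` would produce the active subset, `farthest_triple` would pick candidate target movies from it, and `users_who_rated_all` would keep only the users who rated those targets. The exhaustive search over triples is cubic in the number of movies, which should be acceptable for a few hundred movies but not for a full catalogue.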

Contributor: ritatang242
