Coder Social home page Coder Social logo

wey-gu / movie-recommendation-dataset Goto Github PK

View Code? Open in Web Editor NEW
9.0 1.0 4.0 15.23 MB

A dataset for Movie Recommendation with NebulaGraph, ETL to merge two dataset: OMDB & Movielens with dbt, postgres and Nebula-Importer

Home Page: https://www.siwei.io/nebulagraph-etl-dbt/

License: Apache License 2.0

Python 100.00%
dataset dbt graph-database recommendation-system nebulagraph

movie-recommendation-dataset's Introduction

This is a movie Knowledge Graph Dataset ready for NebulaGraph.

The dataset comes from OMDB for cast and movie genre and MovieLens for real world user-movie interaction records.

modeling_omdb_movielens

How to use this Dataset

Refer to https://www.siwei.io/recommendation-system-with-graphdb/

Scehma of the Dataset

The data aggregates OMDB and MovieLens to become a knowledge graph with schema:

  • Tag:

    • user(user_id)
    • movie(name)
    • person(name, birthdate)
    • genre(name)
  • Edge Type:

    • watched(rate(double))
    • with_genre
    • directed_by
    • acted_by

schema_0

Mapping between two Tabular datasources and NebulaGraph property graph model

schema_mapping_to_graph

ELT Process

  • Raw Data wangling
  • Data Loading into Data Warehouse(Postgress)
  • Transform Data into the form ready for Property Graph Model(dbt), export as CSV
  • Load the CSV files into NebulaGraph(Nebula-Importer)

ETL_dbt_nebulagraph_importer

dbt env preparation

Install dbt and dbt-postgres plugin.

python3 -m venv .venv
source .venv/bin/activate
pip install dbt-postgres

dbt init dbt_project
cd dbt_project

Run a postgres container with the following command:

docker run --rm --name postgres \
    -e POSTGRES_PASSWORD=nebula \
    -e POSTGRES_USER=nebula \
    -e POSTGRES_DB=warehouse -d \
    -p 5432:5432 postgres

Raw Data

Download

Download files into dbt_project/raw_data

mkdir -p dbt_project/raw_data
cd dbt_project/raw_data
wget www.omdb.org/data/all_people.csv.bz2
wget www.omdb.org/data/all_people_aliases.csv.bz2
wget www.omdb.org/data/people_links.csv.bz2
wget www.omdb.org/data/all_casts.csv.bz2
wget www.omdb.org/data/job_names.csv.bz2
wget www.omdb.org/data/all_characters.csv.bz2
wget www.omdb.org/data/movie_categories.csv.bz2
wget www.omdb.org/data/movie_keywords.csv.bz2
wget www.omdb.org/data/category_names.csv.bz2
wget www.omdb.org/data/all_categories.csv.bz2
wget www.omdb.org/data/all_movie_aliases_iso.csv.bz2
bunzip2 *.bz2
wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
unzip ml-latest-small.zip
rm *.zip

Data Wrangling

Remove \" from the data and copy to seeds.

grep -v '\\"' raw_data/all_movie_aliases_iso.csv > seeds/all_movie_aliases_iso.csv
grep -v '\\"' raw_data/all_casts.csv > seeds/all_casts.csv
grep -v '\\"' raw_data/all_characters.csv > seeds/all_characters.csv
grep -v '\\"' raw_data/all_people.csv > seeds/all_people.csv
grep -v '\\"' raw_data/category_names.csv > seeds/category_names.csv
grep -v '\\"' raw_data/job_names.csv > seeds/job_names.csv
cp raw_data/movie_categories.csv seeds/movie_categories.csv
cp raw_data/movie_keywords.csv seeds/movie_keywords.csv
cp raw_data/all_categories.csv seeds/all_categories.csv
cp raw_data/ml-latest-small/ratings.csv seeds/movielens_ratings.csv
cp raw_data/ml-latest-small/movies.csv seeds/movielens_movies.csv

Add raw CSV files into data warehouse for further processing.

cd dbt_project
❯ dbt seed
05:58:27  Running with dbt=1.3.0
05:58:27  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 11 seed files, 0 sources, 0 exposures, 0 metrics
05:58:28  
05:58:28  Concurrency: 8 threads (target='dev')
05:58:28  
05:58:28  1 of 11 START seed file public.all_casts ....................................... [RUN]
...
07:10:11  1 of 11 OK loaded seed file public.all_casts ................................... [INSERT 1082228 in 4303.78s]
07:10:11  
07:10:11  Finished running 11 seeds in 1 hours 11 minutes and 43.93 seconds (4303.93s).
07:10:11  
07:10:11  Completed successfully
07:10:11  
07:10:11  Done. PASS=11 WARN=0 ERROR=0 SKIP=0 TOTAL=11

Transform Data with dbt

Process data for NebulaGraph Import.

The transformation magic was done via the model files in dbt_project/models.

dbt run -m acted_by
dbt run -m user_watched_movies
dbt run -m directed_by
dbt run -m with_genre
dbt run -m people
dbt run -m genres
dbt run -m movies

Export data as CSV files

COPY acted_by TO '/tmp/acted_by.csv'  WITH DELIMITER ',' CSV HEADER;
COPY directed_by TO '/tmp/directed_by.csv'  WITH DELIMITER ',' CSV HEADER;
COPY with_genre TO '/tmp/with_genre.csv'  WITH DELIMITER ',' CSV HEADER;
COPY people TO '/tmp/people.csv'  WITH DELIMITER ',' CSV HEADER;
COPY movies TO '/tmp/movies.csv'  WITH DELIMITER ',' CSV HEADER;
COPY genres TO '/tmp/genres.csv'  WITH DELIMITER ',' CSV HEADER;
# for user_watched_movies we dont export header
COPY user_watched_movies TO '/tmp/user_watched_movies.csv'  WITH DELIMITER ',' CSV;
mkdir -p to_nebulagraph
docker cp postgres:/tmp/. to_nebulagraph/

Import Data into NebulaGraph

Create Schema

Run following DDL to create schema.

CREATE SPACE moviegraph(partition_num=10,replica_factor=1,vid_type=fixed_string(32));
# sleep for 20 seconds
USE moviegraph;
CREATE TAG person(name string, birthdate string);
CREATE TAG movie(name string);
CREATE TAG genre(name string);
CREATE TAG user(user_id string);
CREATE EDGE acted_by();
CREATE EDGE directed_by();
CREATE EDGE with_genre();
CREATE EDGE watched(rate float);

Import Data

We use Nebula Importer to import data into NebulaGraph.

The configuration file that tells Nebula Importer how CSV should be loaded into NebulaGraph is here.

cd ..

docker run --rm -ti \
    --network=nebula-net \
    -v ${PWD}:/root/ \
    -v ${PWD}/dbt_project/to_nebulagraph/:/data \
    vesoft/nebula-importer:v3.2.0 \
    --config /root/nebula-importer.yaml

Verify data stats:

USE moviegraph;
SHOW STATS;

It should look like this:

(root@nebula) [moviegraph]> SHOW STATS;
+---------+---------------+---------+
| Type    | Name          | Count   |
+---------+---------------+---------+
| "Tag"   | "genre"       | 14397   |
| "Tag"   | "movie"       | 20701   |
| "Tag"   | "person"      | 263907  |
| "Tag"   | "user"        | 610     |
| "Edge"  | "acted_by"    | 673763  |
| "Edge"  | "directed_by" | 101949  |
| "Edge"  | "watched"     | 31781   |
| "Edge"  | "with_genre"  | 194009  |
| "Space" | "vertices"    | 299615  |
| "Space" | "edges"       | 1001502 |
+---------+---------------+---------+
Got 10 rows (time spent 1693/15136 us)

movie-recommendation-dataset's People

Contributors

harrischu avatar wey-gu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

movie-recommendation-dataset's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.