This project forked from wintermi/movielens-dataform

An example Dataform project which will use the publicly available Movielens dataset to demonstrate how to upload your product catalog and user events into either the Google Cloud Retail API or Google Cloud Discovery Engine and train a personalised product recommendation model.

License: Apache License 2.0

Movielens Dataform Project

TESTING

About

An example Dataform project to load and transform the publicly available dataset from Movielens into a format which can be imported into Discovery for Media or Vertex AI Search and Conversation, allowing you to train a media recommendation model.

This example extends the tutorial found in the documentation here and here.

Prerequisites

Google Cloud Project

Google Cloud projects form the basis for creating, enabling, and using all Google Cloud services, such as Dataform, BigQuery and the Retail API.

If you do not already have a Google Cloud project into which you want to load the Movielens dataset, you will need to create one. The documentation on how to do this can be found here.

Once you have a Google Cloud project, remember to take note of the Project Number and Project ID. These can be found on the Google Cloud project console welcome page, which you can find here.

Google Cloud Storage Bucket

Now that you have a Google Cloud project, you need to create a Google Cloud Storage bucket. The Movielens dataset will be uploaded into this bucket, and Dataform will use it as the source when loading data into BigQuery. The documentation on how to create a new storage bucket can be found here.

Remember to take note of the bucket name, as it is required for one of the Dataform config variables.
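For readers who prefer the command line to the console, the following is a minimal sketch of the bucket creation step. It only prints the `gcloud storage buckets create` command so that it can be reviewed before being run. The function name, the bucket name `my-movielens-bucket` and the location are placeholders chosen for illustration, not values from this project.

```shell
# Print the gcloud command that would create the bucket.
# The function name and argument order are illustrative assumptions.
print_bucket_create_command() {
    bucket_name=$1
    location=$2
    printf 'gcloud storage buckets create gs://%s --location=%s\n' \
        "$bucket_name" "$location"
}

# Example call with placeholder values.
print_bucket_create_command my-movielens-bucket australia-southeast1
```

Once the printed command looks right, it can be copied into a terminal where the gcloud CLI is authenticated against your project.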

Enable Dataform Service

Next, you will need to enable the Dataform service within the Google Cloud project just created. This can be achieved by clicking the "Enable" button here.

Create a Dataform Repository

After the Dataform Service has been enabled, you will be redirected to the BigQuery Dataform page within the Google Cloud console. For reference, this can be found here.

Go ahead and create a repository. For more information on how to do this, go to the documentation page found here.

Grant Permissions to Dataform Service Account

When you create your first Dataform repository, Dataform automatically generates a service account. Dataform uses the service account to interact with BigQuery on your behalf.

Your Dataform service account ID is in the following format:

service-YOUR_PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com

Replace YOUR_PROJECT_NUMBER with the Project Number of your Google Cloud project, which you previously took note of.
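As a small sketch of the substitution described above, the helper below builds the service account ID from a project number. The function name is hypothetical and `123456789012` is a placeholder project number; only the address format comes from this document.

```shell
# Build the Dataform service account ID from a project number.
# If you only know the Project ID, the project number can be looked up with:
#   gcloud projects describe YOUR_PROJECT_ID --format="value(projectNumber)"
dataform_service_account() {
    printf 'service-%s@gcp-sa-dataform.iam.gserviceaccount.com\n' "$1"
}

# Example call with a placeholder project number.
dataform_service_account 123456789012
```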

The Dataform service account requires several IAM roles in order to execute the workflows in BigQuery and load data from the Google Cloud Storage bucket. To grant them, follow these steps:

  1. In the Google Cloud console, go to the IAM page.
  2. Click Add.
  3. In the New principals field, enter your Dataform service account ID.
  4. In the Select a role drop-down list, select the BigQuery Job User role.
  5. Click Add another role, and then in the Select a role drop-down list, select the BigQuery Data Editor role.
  6. Click Add another role, and then in the Select a role drop-down list, select the BigQuery Data Viewer role.
  7. Click Add another role, and then in the Select a role drop-down list, select the Storage Object Viewer role.
  8. Click Save.
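The console steps above can also be approximated with the gcloud CLI. The sketch below prints one `gcloud projects add-iam-policy-binding` command per role instead of executing it, so the output can be checked first. It assumes the four role names listed above map to the predefined role IDs `roles/bigquery.jobUser`, `roles/bigquery.dataEditor`, `roles/bigquery.dataViewer` and `roles/storage.objectViewer`. The function name, the project ID and the project number are placeholders.

```shell
# Print one gcloud command per IAM role the Dataform service account needs.
# The function name and the print-only approach are illustrative assumptions.
print_iam_grant_commands() {
    project_id=$1
    service_account=$2
    for role in \
        roles/bigquery.jobUser \
        roles/bigquery.dataEditor \
        roles/bigquery.dataViewer \
        roles/storage.objectViewer
    do
        printf 'gcloud projects add-iam-policy-binding %s --member=serviceAccount:%s --role=%s\n' \
            "$project_id" "$service_account" "$role"
    done
}

# Example call with placeholder values.
print_iam_grant_commands my-project-id \
    service-123456789012@gcp-sa-dataform.iam.gserviceaccount.com
```

Printing rather than running keeps the sketch free of side effects. Each printed line is a complete command that can be run as-is by someone with permission to change the project's IAM policy.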

Dataform Workflow Settings

The workflow_settings.yaml file contains the following parameters:

  • defaultProject: The Project ID of your Google Cloud project, which you previously took note of
  • defaultLocation: Target BigQuery Location
  • defaultDataset: Name of the BigQuery Dataset in which the Movielens tables are to be created
  • defaultAssertionDataset: Name of the BigQuery Dataset in which any Dataform Assertions are to be created and executed
  • LOAD_GCS_BUCKET: Name of the Google Cloud Storage Bucket, which you previously took note of
  • RAW_DATA: Name of the BigQuery Dataset into which the Movielens data files are to be loaded
  • TARGET_DATA: Name of the BigQuery Dataset in which the final transformed Movielens tables are to be located

Here is an example configuration:

dataformCoreVersion: 3.0.0-beta.4
defaultProject: winter-dataform
defaultLocation: australia-southeast1
defaultDataset: movielens
defaultAssertionDataset: movielens_assertions
vars:
    LOAD_GCS_BUCKET: winter-data/movielens
    RAW_DATA: movielens_staging
    TARGET_DATA: movielens

movielens-dataform's People

Contributors

wintermi, mpbxyz
