Coder Social home page Coder Social logo

gocardless / bigquery-to-datastore Goto Github PK

View Code? Open in Web Editor NEW

This project forked from yu-iskw/bigquery-to-datastore

0.0 8.0 1.0 89 KB

Export a whole BigQuery table to Google Datastore with Apache Beam/Google Dataflow

Dockerfile 0.48% Makefile 2.96% Shell 3.82% Java 92.75%

bigquery-to-datastore's Introduction

bigquery-to-datastore

Build Status codecov

This allows us to export a BigQuery table to a Google Datastore kind using Apache Beam on top of Google Dataflow.

You don't have to have duplicated rows whose key values are same. Apache Beam's DatastoreIO doesn't allow us to write same key at once.

Data Pipeline

Requirements

  • Maven
  • Java 1.8+
  • Google Cloud Platform account

Usage

Required arguments

  • --project: Google Cloud Project
  • --inputBigQueryDataset: Input BigQuery dataset ID
  • --inputBigQueryTable: Input BigQuery table ID
  • --keyColumn: BigQuery column name for a key of Google Datastore kind
  • --outputDatastoreNamespace: Output Google Datastore namespace
  • --outputDatastoreKind: OUtput Google Datastore kind
  • --tempLocation: The Cloud Storage path to use for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.
  • --gcpTempLocation: A GCS path for storing temporary files in GCP.

Optional arguments

  • --runner: Apache Beam runner.
    • When you don't set this option, it will run on your local machine, not Google Dataflow.
    • e.g. DataflowRunner
  • --parentPaths: Output Google Datastore parent path(s)
    • e.g. Parent1:p1,Parent2:p2 ==> KEY('Parent1', 'p1', 'Parent2', 'p2')
  • --indexedColumns: Indexed columns on Google Datastore.
    • e.g. col1,col2,col3 ==> col1, col2 and col2 are indexed on Google Datastore.
  • --numWorkers: The number of workers when you run it on top of Google Dataflow.
  • --workerMachineType: Google Dataflow worker instance type
    • e.g. n1-standard-1, n1-standard-4

Example to run on Google Dataflow

# compile
mvn clean package

# Run bigquery-to-datastore via the compiled JAR file
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.5.1.jar \
  com.github.yuiskw.beam.BigQuery2Datastore \
  --project=your-gcp-project \
  --runner=DataflowRunner \
  --inputBigQueryDataset=test_dataset \
  --inputBigQueryTable=test_table \
  --outputDatastoreNamespace=test_namespace \
  --outputDatastoreKind=TestKind \
  --parentPaths=Parent1:p1,Parent2:p2 \
  --keyColumn=id \
  --indexedColumns=col1,col2,col3 \
  --tempLocation=gs://test_bucket/test-log/ \
  --gcpTempLocation=gs://test_bucket/test-log/

How to run

How to build and run it with java

# compile
mvn clean package
# or
make package

# run
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.5.1.jar --help
# or
./bin/bigquery-to-datastore --help

How to run it on docker

We also offers docker images for this project in yuiskw/bigquery-to-datastore - Docker Hub. We have several docker images based on Apache Beam versions.

docker run yuiskw/bigquery-to-datastore:0.5.1-beam-2.1 --help

How to install it with homebrew

You can install it with homebrew from yu-iskw/homebrew-bigquery-to-datastore.

# install
brew install yu-iskw/bigquery-to-datastore/bigquery-to-datastore

# show help
./bin/bigquery-to-datastore --help

Type conversions between BigQuery and Google Datastore

The below table describes the type conversions between BigQuery and Google Datastore. Since Datastore unfortunately doesn't have any data type for time, bigquery-to-datastore ignore BigQuery columns whose data type are TIME.

BigQuery Datastore
BOOLEAN bool
INTEGER int
DOUBLE double
STRING string
TIMESTAMP timestamp
DATE timestamp
TIME ignored: Google Datastore doesn't have time type.
RECORD array
STRUCT Entity

Note

As you probably know, Google Datastore doesn't have any feature much like UPDATE of MySQL. Since DatastoreIO.Write upsert given input entities, it will just overwrite an entity whether or not it already exists. If we would like to insert multiple data separately, we have to combine them on bigquery beforehand.

License

Copyright (c) 2017 Yu Ishikawa.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.