yu-iskw / bigquery-to-datastore
Export a whole BigQuery table to Google Datastore with Apache Beam/Google Dataflow
Hi, would it be possible to add an option that automatically generates a key identifier when inserting the data, in the way described here: https://cloud.google.com/datastore/docs/concepts/entities#assigning_identifiers ?
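For context, Datastore allocates a numeric ID at commit time whenever the final key path element is left without a name or id (an "incomplete key"). The sketch below models that decision in plain Java; the option name (`--autoGenerateKey`) and the class are hypothetical, not part of this tool.

```java
import java.util.Optional;

/**
 * Sketch of incomplete-key semantics: when no key name/id is supplied,
 * Datastore assigns a numeric ID on commit. The --autoGenerateKey option
 * and this helper are hypothetical illustrations, not the tool's API.
 */
public class KeySketch {

    /** Returns the key name to set, or empty when the server should allocate an ID. */
    public static Optional<String> keyNameFor(String keyColumnValue, boolean autoGenerateKey) {
        if (autoGenerateKey || keyColumnValue == null) {
            return Optional.empty(); // incomplete key -> Datastore assigns an ID
        }
        return Optional.of(keyColumnValue);
    }

    public static void main(String[] args) {
        System.out.println(keyNameFor("user-42", false)); // Optional[user-42]
        System.out.println(keyNameFor("user-42", true));  // Optional.empty
    }
}
```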
I've tried using `default` and `[default]` for the Datastore namespace, but neither writes to the actual '[default]' namespace.
I do have a Google Cloud login and am also logged in via the CLI. How do I tell either the JAR or the Docker image to pick up my credentials?
It would be nice to support installation like this:
brew tap yu-iskw/bigquery-to-datastore
brew install bigquery-to-datastore
What is the reason for setting setExcludedFromIndexes to true? Ideally, this would be an additional flag when running the main shell script.
Apache Beam 2.3 was released.
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
As of version 0.2, no values are indexed. I guess users sometimes want to index specific columns, e.g.:
java -cp ...bigquery-to-datastore.jar
...
--indexedColumns="age,name"
...
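Internally, a flag like the proposed `--indexedColumns` could simply be parsed into a set of column names, with every column outside that set excluded from indexes (in the real converter this would feed `Value.Builder#setExcludeFromIndexes`). A minimal sketch of that parsing logic, with the class name and flag semantics assumed rather than taken from the tool:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Sketch of how a hypothetical --indexedColumns option could work:
 * parse a comma-separated list, then exclude everything else from indexes.
 */
public class IndexedColumnsSketch {

    /** Parses a comma-separated column list, e.g. "age,name". */
    public static Set<String> parseIndexedColumns(String option) {
        Set<String> columns = new LinkedHashSet<>();
        if (option != null && !option.isEmpty()) {
            for (String c : option.split(",")) {
                String trimmed = c.trim();
                if (!trimmed.isEmpty()) {
                    columns.add(trimmed);
                }
            }
        }
        return columns;
    }

    /** A column is excluded from indexes unless explicitly listed. */
    public static boolean isExcludedFromIndexes(Set<String> indexed, String column) {
        return !indexed.contains(column);
    }

    public static void main(String[] args) {
        Set<String> indexed = parseIndexedColumns("age,name");
        System.out.println(isExcludedFromIndexes(indexed, "age"));     // false
        System.out.println(isExcludedFromIndexes(indexed, "address")); // true
    }
}
```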
Hey Yu,
Really great that you put this together. I am finally getting successful builds; however, I am not seeing any data appear in my Datastore. Is there something I am doing wrong?
Output is:
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building bigquery-to-datastore 0.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ bigquery-to-datastore ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /Users/cwilliams/Dropbox/Development/DevOps/Google/interview/bestbuy/bigquery-to-datastore/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.6.1:compile (default-compile) @ bigquery-to-datastore ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ bigquery-to-datastore ---
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.options.DataflowPipelineOptions$StagingLocationFactory create
INFO: No stagingLocation provided, falling back to gcpTempLocation
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.DataflowRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 106 files. Enable logging at DEBUG level to see which files will be staged.
Nov 12, 2017 5:08:37 PM org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Read validate
INFO: Project of TableReference not set. The value of BigQueryOptions.getProject() at execution time will be used.
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Uploading 106 files from PipelineOptions.filesToStage to staging location to prepare for execution.
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Staging files complete: 106 files cached, 0 files newly uploaded
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/Read(BigQueryTableSource) as step s1
Nov 12, 2017 5:08:40 PM org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource setDefaultProjectIfAbsent
INFO: Project ID not set in TableReference. Using default project from BigQueryOptions.
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/ParMultiDo(Identity) as step s2
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/View.AsIterable/View.CreatePCollectionView/ParDo(ToIsmRecordForGlobalWindow) as step s3
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/View.AsIterable/View.CreatePCollectionView/CreateDataflowView as step s4
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/Create(CleanupOperation)/Read(CreateSource) as step s5
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/Cleanup as step s6
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding ParDo(TableRow2Entity) as step s7
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding DatastoreV1.Write/Convert to Mutation/Map as step s8
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding DatastoreV1.Write/Write Mutation to Datastore as step s9
Dataflow SDK version: 2.1.0
Nov 12, 2017 5:08:42 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/bestbuy-185314/dataflow/job/2017-11-12_08_08_41-5441556467331747849
Submitted job: 2017-11-12_08_08_41-5441556467331747849
Nov 12, 2017 5:08:42 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: To cancel the job using the 'gcloud' tool, run:
gcloud beta dataflow jobs --project=bestbuy-185314 cancel 2017-11-12_08_08_41-5441556467331747849
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12.604 s
[INFO] Finished at: 2017-11-12T17:08:42+01:00
[INFO] Final Memory: 34M/113M
[INFO] ------------------------------------------------------------------------
Any ideas?
Best
Chris
I'm having an issue importing a timestamp back into Datastore.
: com.google.datastore.v1.client.DatastoreException: Invalid PROTO payload received. Timestamp seconds exceeds limit for field: timestampValue, code=INVALID_ARGUMENT
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:169)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:89)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:1288)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.finishBundle(DatastoreV1.java:1260)
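This `INVALID_ARGUMENT` typically means the timestamp's seconds value falls outside the range Datastore's `timestampValue` can represent (years 0001–9999, the protobuf `Timestamp` limits), which often happens when a microsecond or millisecond epoch value is treated as seconds. A small sketch of a pre-write range check, assuming plain epoch-seconds input rather than the tool's actual conversion path:

```java
import java.time.Instant;

/**
 * Sketch: validate epoch seconds against Datastore's timestampValue range
 * (years 0001-9999, the protobuf Timestamp limits) before writing.
 */
public class TimestampRangeCheck {

    // Bounds of the protobuf/Datastore timestampValue, in epoch seconds.
    static final long MIN_SECONDS = Instant.parse("0001-01-01T00:00:00Z").getEpochSecond();
    static final long MAX_SECONDS = Instant.parse("9999-12-31T23:59:59Z").getEpochSecond();

    public static boolean isValidTimestampSeconds(long seconds) {
        return seconds >= MIN_SECONDS && seconds <= MAX_SECONDS;
    }

    public static void main(String[] args) {
        System.out.println(isValidTimestampSeconds(Instant.now().getEpochSecond())); // true
        // A microsecond epoch mistakenly passed as seconds blows past the limit:
        System.out.println(isValidTimestampSeconds(1510502917000000L)); // false
    }
}
```

Values that fail the check could be clamped, dropped, or logged depending on what the pipeline should do with out-of-range rows.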
It would be nice to offer this tool via Docker, e.g.:
docker run yuiskw/bigquery-to-datastore \
--project=your-gcp-project \
--runner=DataflowRunner \
--inputBigQueryDataset=test_dataset \
--inputBigQueryTable=test_table \
--outputDatastoreNamespace=test_namespace \
--outputDatastoreKind=TestKind \
--parentPaths=Parent1:p1,Parent2:p2 \
--keyColumn=id \
--indexedColumns=col1,col2,col3 \
--tempLocation=gs://test_bucket/test-log/ \
--gcpTempLocation=gs://test_bucket/test-log/
Maybe I am missing something, but when I try to run the job I'm getting this error:
(dfb1d562509e1bce): java.lang.NullPointerException
at com.github.yuiskw.beam.TableRow2EntityFn.convertTableRowToEntity(TableRow2EntityFn.java:149)
at com.github.yuiskw.beam.TableRow2EntityFn.processElement(TableRow2EntityFn.java:55)
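A `NullPointerException` inside `convertTableRowToEntity` usually points at a NULL column value in the BigQuery row being dereferenced without a guard. The sketch below shows the shape of a null-safe conversion using plain `Map`s as a simplified stand-in for the Beam `TableRow` API; it is not the project's actual fix:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of null-safe row conversion: BigQuery rows may contain NULL
 * columns, and dereferencing them without a check throws NPE. Plain
 * Maps stand in for TableRow/Entity here.
 */
public class NullSafeRowConversion {

    /** Copies only non-null columns, mimicking a guard the converter could apply. */
    public static Map<String, Object> convertRow(Map<String, Object> row) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            if (e.getValue() == null) {
                continue; // skip NULL columns instead of throwing NullPointerException
            }
            properties.put(e.getKey(), e.getValue());
        }
        return properties;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1L);
        row.put("name", null); // a NULL BigQuery column
        System.out.println(convertRow(row).keySet()); // [id]
    }
}
```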