yu-iskw / bigquery-to-datastore
Export a whole BigQuery table to Google Datastore with Apache Beam/Google Dataflow
Hi, would it be possible to add an option that automatically generates a key identifier when inserting the data, in the way described here: https://cloud.google.com/datastore/docs/concepts/entities#assigning_identifiers ?
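For context, Datastore allocates a numeric ID at commit time whenever the final key path element is left without a name or id (an "incomplete key"). The sketch below models that decision in plain Java; the option name (`--autoGenerateKey`) and the class are hypothetical, not part of this tool.

```java
import java.util.Optional;

/**
 * Sketch of incomplete-key semantics: when no key name/id is supplied,
 * Datastore assigns a numeric ID on commit. The --autoGenerateKey option
 * and this helper are hypothetical illustrations, not the tool's API.
 */
public class KeySketch {

    /** Returns the key name to set, or empty when the server should allocate an ID. */
    public static Optional<String> keyNameFor(String keyColumnValue, boolean autoGenerateKey) {
        if (autoGenerateKey || keyColumnValue == null) {
            return Optional.empty(); // incomplete key -> Datastore assigns an ID
        }
        return Optional.of(keyColumnValue);
    }

    public static void main(String[] args) {
        System.out.println(keyNameFor("user-42", false)); // Optional[user-42]
        System.out.println(keyNameFor("user-42", true));  // Optional.empty
    }
}
```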
I've tried using `default` and `[default]` for the Datastore namespace, but neither writes to the actual '[default]' namespace.
I do have a Google Cloud login and am also logged in via the CLI. How do I tell either the JAR or the Docker image to pick up my credentials?
It would be nice to support installation like this:
brew tap yu-iskw/bigquery-to-datastore
brew install bigquery-to-datastore
What is the reason for setting setExcludedFromIndexes to true? Ideally, this would be an additional flag when running the main shell script.
Apache Beam 2.3 was released.
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
As of version 0.2, no values are indexed. I guess users sometimes want to index specific columns, e.g.:
java -cp ...bigquery-to-datastore.jar
...
--indexedColumns="age,name"
...
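Internally, a flag like the proposed `--indexedColumns` could simply be parsed into a set of column names, with every column outside that set excluded from indexes (in the real converter this would feed `Value.Builder#setExcludeFromIndexes`). A minimal sketch of that parsing logic, with the class name and flag semantics assumed rather than taken from the tool:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Sketch of how a hypothetical --indexedColumns option could work:
 * parse a comma-separated list, then exclude everything else from indexes.
 */
public class IndexedColumnsSketch {

    /** Parses a comma-separated column list, e.g. "age,name". */
    public static Set<String> parseIndexedColumns(String option) {
        Set<String> columns = new LinkedHashSet<>();
        if (option != null && !option.isEmpty()) {
            for (String c : option.split(",")) {
                String trimmed = c.trim();
                if (!trimmed.isEmpty()) {
                    columns.add(trimmed);
                }
            }
        }
        return columns;
    }

    /** A column is excluded from indexes unless explicitly listed. */
    public static boolean isExcludedFromIndexes(Set<String> indexed, String column) {
        return !indexed.contains(column);
    }

    public static void main(String[] args) {
        Set<String> indexed = parseIndexedColumns("age,name");
        System.out.println(isExcludedFromIndexes(indexed, "age"));     // false
        System.out.println(isExcludedFromIndexes(indexed, "address")); // true
    }
}
```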
Hey Yu,
Really great that you put this together. I am finally getting successful builds; however, I am not seeing any data appear in my Datastore. Is there something I am doing wrong?
Output is:
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building bigquery-to-datastore 0.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ bigquery-to-datastore ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /Users/cwilliams/Dropbox/Development/DevOps/Google/interview/bestbuy/bigquery-to-datastore/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.6.1:compile (default-compile) @ bigquery-to-datastore ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ bigquery-to-datastore ---
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.options.DataflowPipelineOptions$StagingLocationFactory create
INFO: No stagingLocation provided, falling back to gcpTempLocation
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.DataflowRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 106 files. Enable logging at DEBUG level to see which files will be staged.
Nov 12, 2017 5:08:37 PM org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Read validate
INFO: Project of TableReference not set. The value of BigQueryOptions.getProject() at execution time will be used.
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Uploading 106 files from PipelineOptions.filesToStage to staging location to prepare for execution.
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Staging files complete: 106 files cached, 0 files newly uploaded
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/Read(BigQueryTableSource) as step s1
Nov 12, 2017 5:08:40 PM org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource setDefaultProjectIfAbsent
INFO: Project ID not set in TableReference. Using default project from BigQueryOptions.
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/ParMultiDo(Identity) as step s2
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/View.AsIterable/View.CreatePCollectionView/ParDo(ToIsmRecordForGlobalWindow) as step s3
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/View.AsIterable/View.CreatePCollectionView/CreateDataflowView as step s4
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/Create(CleanupOperation)/Read(CreateSource) as step s5
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/Cleanup as step s6
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding ParDo(TableRow2Entity) as step s7
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding DatastoreV1.Write/Convert to Mutation/Map as step s8
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding DatastoreV1.Write/Write Mutation to Datastore as step s9
Dataflow SDK version: 2.1.0
Nov 12, 2017 5:08:42 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/bestbuy-185314/dataflow/job/2017-11-12_08_08_41-5441556467331747849
Submitted job: 2017-11-12_08_08_41-5441556467331747849
Nov 12, 2017 5:08:42 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: To cancel the job using the 'gcloud' tool, run:
gcloud beta dataflow jobs --project=bestbuy-185314 cancel 2017-11-12_08_08_41-5441556467331747849
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12.604 s
[INFO] Finished at: 2017-11-12T17:08:42+01:00
[INFO] Final Memory: 34M/113M
[INFO] ------------------------------------------------------------------------
Any ideas?
Best
Chris
I'm having an issue importing a timestamp back into Datastore.
: com.google.datastore.v1.client.DatastoreException: Invalid PROTO payload received. Timestamp seconds exceeds limit for field: timestampValue, code=INVALID_ARGUMENT
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:169)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:89)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:1288)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.finishBundle(DatastoreV1.java:1260)
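This `INVALID_ARGUMENT` typically means the timestamp's seconds value falls outside the range Datastore's `timestampValue` can represent (years 0001–9999, the protobuf `Timestamp` limits), which often happens when a microsecond or millisecond epoch value is treated as seconds. A small sketch of a pre-write range check, assuming plain epoch-seconds input rather than the tool's actual conversion path:

```java
import java.time.Instant;

/**
 * Sketch: validate epoch seconds against Datastore's timestampValue range
 * (years 0001-9999, the protobuf Timestamp limits) before writing.
 */
public class TimestampRangeCheck {

    // Bounds of the protobuf/Datastore timestampValue, in epoch seconds.
    static final long MIN_SECONDS = Instant.parse("0001-01-01T00:00:00Z").getEpochSecond();
    static final long MAX_SECONDS = Instant.parse("9999-12-31T23:59:59Z").getEpochSecond();

    public static boolean isValidTimestampSeconds(long seconds) {
        return seconds >= MIN_SECONDS && seconds <= MAX_SECONDS;
    }

    public static void main(String[] args) {
        System.out.println(isValidTimestampSeconds(Instant.now().getEpochSecond())); // true
        // A microsecond epoch mistakenly passed as seconds blows past the limit:
        System.out.println(isValidTimestampSeconds(1510502917000000L)); // false
    }
}
```

Values that fail the check could be clamped, dropped, or logged depending on what the pipeline should do with out-of-range rows.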
It would be nice to offer this tool via Docker, e.g.:
docker run yuiskw/bigquery-to-datastore \
--project=your-gcp-project \
--runner=DataflowRunner \
--inputBigQueryDataset=test_dataset \
--inputBigQueryTable=test_table \
--outputDatastoreNamespace=test_namespace \
--outputDatastoreKind=TestKind \
--parentPaths=Parent1:p1,Parent2:p2 \
--keyColumn=id \
--indexedColumns=col1,col2,col3 \
--tempLocation=gs://test_bucket/test-log/ \
--gcpTempLocation=gs://test_bucket/test-log/
Maybe I am missing something, but when I try to run the job I'm getting this error:
(dfb1d562509e1bce): java.lang.NullPointerException
at com.github.yuiskw.beam.TableRow2EntityFn.convertTableRowToEntity(TableRow2EntityFn.java:149)
at com.github.yuiskw.beam.TableRow2EntityFn.processElement(TableRow2EntityFn.java:55)
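A `NullPointerException` inside `convertTableRowToEntity` usually points at a NULL column value in the BigQuery row being dereferenced without a guard. The sketch below shows the shape of a null-safe conversion using plain `Map`s as a simplified stand-in for the Beam `TableRow` API; it is not the project's actual fix:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of null-safe row conversion: BigQuery rows may contain NULL
 * columns, and dereferencing them without a check throws NPE. Plain
 * Maps stand in for TableRow/Entity here.
 */
public class NullSafeRowConversion {

    /** Copies only non-null columns, mimicking a guard the converter could apply. */
    public static Map<String, Object> convertRow(Map<String, Object> row) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            if (e.getValue() == null) {
                continue; // skip NULL columns instead of throwing NullPointerException
            }
            properties.put(e.getKey(), e.getValue());
        }
        return properties;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1L);
        row.put("name", null); // a NULL BigQuery column
        System.out.println(convertRow(row).keySet()); // [id]
    }
}
```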