anelendata / tap-bigquery
Singer.io tap for extracting data from BigQuery tables
License: Apache License 2.0
--start_datetime is a required argument.
Running this:
~/.venvs/tap-bigquery/bin/tap-bigquery \
-c ~/singer.io/tap_bq_systemchecker/tap_config.json \
--catalog ~/singer.io/tap_bq_systemchecker/catalog.json \
| ~/.venvs/pipelinewise-target-snowflake/bin/target-snowflake \
-c ~/singer.io/pipelinewise_target_snowflake/systemchecker_config.json \
>> ~/singer.io/tap_bq_systemchecker/state.json
returns this:
(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
CRITICAL start_datetime not specified
It should be an optional argument.
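One way the flag could be made optional is to fall back to the bookmark in state.json. A minimal sketch (the argument wiring and the stream name are assumptions for illustration, not the tap's actual code):

```python
import argparse
import json

# Hypothetical sketch: make --start_datetime optional and fall back to the
# bookmark stored in a state file when the flag is omitted.
parser = argparse.ArgumentParser()
parser.add_argument("-c", "--config", required=True)
parser.add_argument("--state")
parser.add_argument("--start_datetime")  # no longer required
args = parser.parse_args()

start_datetime = args.start_datetime
if start_datetime is None and args.state:
    with open(args.state) as f:
        state = json.load(f)
    # "system_checker" stands in for the stream name; real code would loop streams.
    bookmark = state.get("bookmarks", {}).get("system_checker", {})
    start_datetime = bookmark.get("last_update")

if start_datetime is None:
    raise SystemExit("CRITICAL state or start_datetime must be specified")
```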
Trying to extract a table with (record, repeated) fields coming from Google Analytics 4, using this config:
{
  "streams": [
    {
      "name": "analytics_257428027",
      "table": "orchest-ga4-data.analytics_257428027.events_20220810",
      "columns": [
        "event_timestamp",
        "event_name",
        "event_params.key",
        "event_params.value.string_value",
        "event_params.value.int_value",
        "traffic_source.name",
        "traffic_source.medium",
        "traffic_source.source"
      ],
      "datetime_key": "event_date"
    }
  ],
  "credentials_path": "/project-dir/meltano/client_secrets.json",
  "start_datetime": "2022-08-10T00:00:00Z",
  "start_always_inclusive": true
}
I'm getting this error:
INFO Running query:
SELECT event_timestamp,event_name,event_params.key,event_params.value.string_value,event_params.value.int_value,traffic_source.name,traffic_source.medium,traffic_source.source,event_date FROM orchest-ga4-data.analytics_257428027.events_20220810 WHERE 1=1 AND datetime '2022-08-10 00:00:00.000000' <= CAST(event_date as datetime) AND CAST(event_date as datetime) < datetime '2022-08-11 10:00:20.205737' ORDER BY event_date LIMIT 100
CRITICAL 400 Cannot access field key on a value with type ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, ...>>> at [1:48]
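The error comes from BigQuery itself: event_params is a repeated RECORD (ARRAY<STRUCT>), so its fields cannot be selected with dot notation until the array is flattened with UNNEST. A sketch of a query shape that works, run with the google-cloud-bigquery client directly (the alias ep is illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("client_secrets.json")

# event_params is ARRAY<STRUCT<...>>, so it must be flattened with UNNEST
# before its fields can be selected; traffic_source is a plain STRUCT and
# can be dotted into directly.
query = """
SELECT
  event_timestamp,
  event_name,
  ep.key,
  ep.value.string_value,
  ep.value.int_value,
  traffic_source.name,
  traffic_source.medium,
  traffic_source.source
FROM `orchest-ga4-data.analytics_257428027.events_20220810`,
  UNNEST(event_params) AS ep
LIMIT 100
"""
for row in client.query(query).result():
    print(dict(row))
```

Note that UNNEST produces one output row per array element, so a single event fans out into multiple records; generating this automatically would be a behavior change for the tap.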
The instructions for generating the client secrets (https://github.com/anelendata/tap-bigquery#step-1-activate-the-google-bigquery-api) are very detailed (thanks!), but unfortunately it looks like OAuth is the wrong method to use here. When I tried to use it, I first hit gh-22, then got an error:
expected one of ('authorized_user', 'service_account', 'external_account', 'impersonated_service_account', 'gdch_service_account')
Creating a service account, as explained in target-bigquery (https://github.com/adswerve/target-bigquery#step-2-authenticate-with-a-service-account), worked!
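For reference, a minimal sketch of loading service-account credentials for a BigQuery client with google-auth (the key file path and scope here are assumptions):

```python
from google.cloud import bigquery
from google.oauth2 import service_account

# Load a service-account key file (the path is illustrative) instead of
# OAuth "authorized_user" credentials, which fail with the error above.
credentials = service_account.Credentials.from_service_account_file(
    "/project-dir/meltano/client_secrets.json",
    scopes=["https://www.googleapis.com/auth/bigquery"],
)
client = bigquery.Client(
    credentials=credentials, project=credentials.project_id
)
```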
--start_datetime is a required argument. Currently, when it is provided, it overrides the last_update property value from state.json:
{"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
INFO Running query:
SELECT hash_id,processor,processor_cores,string_created_at,created_at,primary_resolution,memory,os,email,browser,netspeed,record_source FROM `dxdiag-aa08f.unioned.vw_system_checker` WHERE 1=1 AND datetime '2020-01-01 00:00:00.000000' <= CAST(created_at as datetime) AND CAST(created_at as datetime) < datetime '2020-09-09 07:07:52.550010' ORDER BY created_at
time=2020-09-09 07:07:54 name=target_snowflake level=INFO message=Table 'systemchecker_singer."SYSTEM_CHECKER"' does not exist. Creating...
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 9147, "tags": {"endpoint": "system_checker"}}
INFO Syncing stream:system_checker
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Uploading 9147 rows to external snowflake stage on S3
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Target S3 bucket: cfsnowflakestage, local file: /tmp/records_b51fywzm.csv, S3 key: prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Loading 9147 rows into 'systemchecker_singer."SYSTEM_CHECKER"'
time=2020-09-09 07:08:03 name=target_snowflake level=INFO message=Loading into systemchecker_singer."SYSTEM_CHECKER": {"inserts": 0, "updates": 9147, "size_bytes": 500703}
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Deleting prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv from external snowflake stage on S3
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Emitting state {"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
If --start_datetime continues to be required, a value in state.json should override the --start_datetime argument value.
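A sketch of that precedence, with the bookmark winning over the flag (the function is hypothetical, not the tap's code):

```python
def resolve_start_datetime(state, stream, cli_start_datetime):
    """Prefer the bookmark in state.json over the --start_datetime flag,
    so reruns resume from the last sync instead of re-extracting."""
    bookmark = (state or {}).get("bookmarks", {}).get(stream, {})
    return bookmark.get("last_update") or cli_start_datetime

# With the state emitted above:
state = {"bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
resolve_start_datetime(state, "system_checker", "2020-01-01T00:00:00Z")
# -> "2020-09-01T13:33:44.712431+00:00"
```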
This breaks many targets, including target-snowflake and Anelen's implementation of tap_bigquery.
Currently:
{"type": "SCHEMA", "stream": "pardot_export", "schema": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "key_properties": []}
And it should be:
{"type": "SCHEMA", "stream": "pardot_export", "schema": {"properties": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "selected": true, "type": "object"}, "key_properties": []}
Hi, please could you create a new release for the functionality added in #18? Thank you!
The releases only list up to 0.3.6 (https://github.com/anelendata/tap-bigquery/releases), but the latest commit (acaf9f2) suggests the code is at 0.3.7 to support env. Can we have the latest release be 0.3.7?
By convention:
The config file contains whatever parameters the Tap needs in order to pull data from the source. Typically this will include the credentials for the API or data source.
Reliance on a required GOOGLE_APPLICATION_CREDENTIALS env var makes this tap an exception, complicating integration with singer-runner tooling.
In some runtimes it is not feasible to load a physical client_secrets.json file, for instance in managed or ephemeral environments. Could we add support for something like credentials_json as a JSON string of the contents of client_secrets.json?
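A sketch of what credentials_json support could look like, using google-auth's in-memory loader (the get_client wrapper is hypothetical; only the credentials_json and credentials_path config keys come from this proposal and the existing config):

```python
import json

from google.cloud import bigquery
from google.oauth2 import service_account

def get_client(config):
    # Proposed behavior: accept the key file's contents inline as a JSON
    # string under "credentials_json", falling back to the existing
    # file-based "credentials_path" flow.
    if "credentials_json" in config:
        info = json.loads(config["credentials_json"])
        credentials = service_account.Credentials.from_service_account_info(info)
    else:
        credentials = service_account.Credentials.from_service_account_file(
            config["credentials_path"]
        )
    return bigquery.Client(
        credentials=credentials, project=credentials.project_id
    )
```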
The project documentation says "start_datetime must also be set in the config file or as the command line argument", but I get a "state or start_datetime must be specified" error even though start_datetime is specified in my tap_config.json.
Thanks,
Dusty
Is it possible to use this tap without the datetime_key? I'm trying to sync a table where the date column is sometimes null but I still want to grab all of the rows.
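A sketch of how the query build could skip the incremental window entirely when no datetime_key is configured, matching the query shape in the logs above (build_query is hypothetical, not the tap's current behavior):

```python
def build_query(stream, start=None, end=None):
    columns = ", ".join(stream["columns"])
    query = f"SELECT {columns} FROM {stream['table']} WHERE 1=1"
    datetime_key = stream.get("datetime_key")
    # Only add the incremental window and ordering when a datetime_key is
    # configured; otherwise extract the full table, null dates included.
    if datetime_key and start and end:
        query += (
            f" AND datetime '{start}' <= CAST({datetime_key} as datetime)"
            f" AND CAST({datetime_key} as datetime) < datetime '{end}'"
            f" ORDER BY {datetime_key}"
        )
    return query
```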
Problem
In tables that use any form of epoch time, a view must be applied to convert the epoch time into a datetime data type. Other methods of watermarking a table also use integers: for example, if a Postgres xmin field were carried into a BigQuery table, it would be an effective watermark for table activity.
In the case of epoch time, the only current solution I can think of is to create a view on the BigQuery table.
Solution
Abstract datetime_key to allow integer watermarks as well.
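A sketch of an abstracted watermark predicate (the watermark_key and watermark_type config keys are invented here for illustration):

```python
def watermark_predicate(stream, start, end):
    # Hypothetical config keys: "watermark_key" names the column and
    # "watermark_type" is "datetime" (current behavior) or "integer"
    # (e.g. epoch seconds or a Postgres xmin carried into BigQuery).
    key = stream["watermark_key"]
    if stream.get("watermark_type") == "integer":
        return f"{int(start)} <= {key} AND {key} < {int(end)}"
    return (
        f"datetime '{start}' <= CAST({key} as datetime)"
        f" AND CAST({key} as datetime) < datetime '{end}'"
    )
```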