anelendata / tap-bigquery
Singer.io tap for extracting data from BigQuery tables
License: Apache License 2.0
--start_datetime is a required argument.
Running this:
~/.venvs/tap-bigquery/bin/tap-bigquery \
-c ~/singer.io/tap_bq_systemchecker/tap_config.json \
--catalog ~/singer.io/tap_bq_systemchecker/catalog.json \
| ~/.venvs/pipelinewise-target-snowflake/bin/target-snowflake \
-c ~/singer.io/pipelinewise_target_snowflake/systemchecker_config.json \
>> ~/singer.io/tap_bq_systemchecker/state.json
returns this:
(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
CRITICAL start_datetime not specified
It should be an optional argument.
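One way the flag could be made optional is to fall back to the bookmark in state.json. A minimal sketch (the argument wiring and the stream name are assumptions for illustration, not the tap's actual code):

```python
import argparse
import json

# Hypothetical sketch: make --start_datetime optional and fall back to the
# bookmark stored in a state file when the flag is omitted.
parser = argparse.ArgumentParser()
parser.add_argument("-c", "--config", required=True)
parser.add_argument("--state")
parser.add_argument("--start_datetime")  # no longer required
args = parser.parse_args()

start_datetime = args.start_datetime
if start_datetime is None and args.state:
    with open(args.state) as f:
        state = json.load(f)
    # "system_checker" stands in for the stream name; real code would loop streams.
    bookmark = state.get("bookmarks", {}).get("system_checker", {})
    start_datetime = bookmark.get("last_update")

if start_datetime is None:
    raise SystemExit("CRITICAL state or start_datetime must be specified")
```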
Trying to extract a table with (record, repeated) fields coming from Google Analytics 4, using this config:
{
  "streams": [
    {
      "name": "analytics_257428027",
      "table": "orchest-ga4-data.analytics_257428027.events_20220810",
      "columns": [
        "event_timestamp",
        "event_name",
        "event_params.key",
        "event_params.value.string_value",
        "event_params.value.int_value",
        "traffic_source.name",
        "traffic_source.medium",
        "traffic_source.source"
      ],
      "datetime_key": "event_date"
    }
  ],
  "credentials_path": "/project-dir/meltano/client_secrets.json",
  "start_datetime": "2022-08-10T00:00:00Z",
  "start_always_inclusive": true
}
I'm getting this error:
INFO Running query:
SELECT event_timestamp,event_name,event_params.key,event_params.value.string_value,event_params.value.int_value,traffic_source.name,traffic_source.medium,traffic_source.source,event_date FROM orchest-ga4-data.analytics_257428027.events_20220810 WHERE 1=1 AND datetime '2022-08-10 00:00:00.000000' <= CAST(event_date as datetime) AND CAST(event_date as datetime) < datetime '2022-08-11 10:00:20.205737' ORDER BY event_date LIMIT 100
CRITICAL 400 Cannot access field key on a value with type ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, ...>>> at [1:48]
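The error comes from BigQuery itself: event_params is a repeated RECORD (ARRAY<STRUCT>), so its fields cannot be selected with dot notation until the array is flattened with UNNEST. A sketch of a query shape that works, run with the google-cloud-bigquery client directly (the alias ep is illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("client_secrets.json")

# event_params is ARRAY<STRUCT<...>>, so it must be flattened with UNNEST
# before its fields can be selected; traffic_source is a plain STRUCT and
# can be dotted into directly.
query = """
SELECT
  event_timestamp,
  event_name,
  ep.key,
  ep.value.string_value,
  ep.value.int_value,
  traffic_source.name,
  traffic_source.medium,
  traffic_source.source
FROM `orchest-ga4-data.analytics_257428027.events_20220810`,
  UNNEST(event_params) AS ep
LIMIT 100
"""
for row in client.query(query).result():
    print(dict(row))
```

Note that UNNEST produces one output row per array element, so a single event fans out into multiple records; generating this automatically would be a behavior change for the tap.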
The instructions for generating the client secrets (https://github.com/anelendata/tap-bigquery#step-1-activate-the-google-bigquery-api) are very detailed (thanks!), but unfortunately it looks like OAuth is the wrong method to use here. When I tried to use it, I first hit gh-22, then got an error:
expected one of ('authorized_user', 'service_account', 'external_account', 'impersonated_service_account', 'gdch_service_account')
Creating a service account, as explained in target-bigquery (https://github.com/adswerve/target-bigquery#step-2-authenticate-with-a-service-account), worked!
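For reference, a minimal sketch of loading service-account credentials for a BigQuery client with google-auth (the key file path and scope here are assumptions):

```python
from google.cloud import bigquery
from google.oauth2 import service_account

# Load a service-account key file (the path is illustrative) instead of
# OAuth "authorized_user" credentials, which fail with the error above.
credentials = service_account.Credentials.from_service_account_file(
    "/project-dir/meltano/client_secrets.json",
    scopes=["https://www.googleapis.com/auth/bigquery"],
)
client = bigquery.Client(
    credentials=credentials, project=credentials.project_id
)
```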
--start_datetime is a required argument. Currently, when it is provided, it overrides the last_update property value from state.json:
{"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
INFO Running query:
SELECT hash_id,processor,processor_cores,string_created_at,created_at,primary_resolution,memory,os,email,browser,netspeed,record_source FROM `dxdiag-aa08f.unioned.vw_system_checker` WHERE 1=1 AND datetime '2020-01-01 00:00:00.000000' <= CAST(created_at as datetime) AND CAST(created_at as datetime) < datetime '2020-09-09 07:07:52.550010' ORDER BY created_at
time=2020-09-09 07:07:54 name=target_snowflake level=INFO message=Table 'systemchecker_singer."SYSTEM_CHECKER"' does not exist. Creating...
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 9147, "tags": {"endpoint": "system_checker"}}
INFO Syncing stream:system_checker
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Uploading 9147 rows to external snowflake stage on S3
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Target S3 bucket: cfsnowflakestage, local file: /tmp/records_b51fywzm.csv, S3 key: prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Loading 9147 rows into 'systemchecker_singer."SYSTEM_CHECKER"'
time=2020-09-09 07:08:03 name=target_snowflake level=INFO message=Loading into systemchecker_singer."SYSTEM_CHECKER": {"inserts": 0, "updates": 9147, "size_bytes": 500703}
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Deleting prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv from external snowflake stage on S3
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Emitting state {"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
If --start_datetime continues to be required, a value in state.json should override the --start_datetime argument value.
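A sketch of that precedence, with the bookmark winning over the flag (the function is hypothetical, not the tap's code):

```python
def resolve_start_datetime(state, stream, cli_start_datetime):
    """Prefer the bookmark in state.json over the --start_datetime flag,
    so reruns resume from the last sync instead of re-extracting."""
    bookmark = (state or {}).get("bookmarks", {}).get(stream, {})
    return bookmark.get("last_update") or cli_start_datetime

# With the state emitted above:
state = {"bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
resolve_start_datetime(state, "system_checker", "2020-01-01T00:00:00Z")
# -> "2020-09-01T13:33:44.712431+00:00"
```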
This breaks many targets, including target-snowflake and Anelen's implementation of tap_bigquery.
Currently:
{"type": "SCHEMA", "stream": "pardot_export", "schema": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "key_properties": []}
And it should be:
{"type": "SCHEMA", "stream": "pardot_export", "schema": {"properties": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "selected": true, "type": "object"}, "key_properties": []}
Hi, please could you create a new release for the functionality added in #18? Thank you!
The releases only list up to 0.3.6 (https://github.com/anelendata/tap-bigquery/releases), but the latest commit (acaf9f2) suggests the code is at 0.3.7 to support env. Can we have the latest release be 0.3.7?
By convention:
The config file contains whatever parameters the Tap needs in order to pull data from the source. Typically this will include the credentials for the API or data source.
Reliance on a required GOOGLE_APPLICATION_CREDENTIALS env var makes this tap an exception, complicating integration with singer-runner tooling.
In some runtimes it is not feasible to load a physical client_secrets.json file, for instance in managed or ephemeral environments. Could we add support for something like credentials_json as a JSON string of the contents of client_secrets.json?
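A sketch of what credentials_json support could look like, using google-auth's in-memory loader (the get_client wrapper is hypothetical; only the credentials_json and credentials_path config keys come from this proposal and the existing config):

```python
import json

from google.cloud import bigquery
from google.oauth2 import service_account

def get_client(config):
    # Proposed behavior: accept the key file's contents inline as a JSON
    # string under "credentials_json", falling back to the existing
    # file-based "credentials_path" flow.
    if "credentials_json" in config:
        info = json.loads(config["credentials_json"])
        credentials = service_account.Credentials.from_service_account_info(info)
    else:
        credentials = service_account.Credentials.from_service_account_file(
            config["credentials_path"]
        )
    return bigquery.Client(
        credentials=credentials, project=credentials.project_id
    )
```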
The project documentation says "start_datetime must also be set in the config file or as the command line argument", but I get a "state or start_datetime must be specified" error even though start_datetime is specified in my tap_config.json.
Thanks,
Dusty
Is it possible to use this tap without the datetime_key? I'm trying to sync a table where the date column is sometimes null but I still want to grab all of the rows.
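A sketch of how the query build could skip the incremental window entirely when no datetime_key is configured, matching the query shape in the logs above (build_query is hypothetical, not the tap's current behavior):

```python
def build_query(stream, start=None, end=None):
    columns = ", ".join(stream["columns"])
    query = f"SELECT {columns} FROM {stream['table']} WHERE 1=1"
    datetime_key = stream.get("datetime_key")
    # Only add the incremental window and ordering when a datetime_key is
    # configured; otherwise extract the full table, null dates included.
    if datetime_key and start and end:
        query += (
            f" AND datetime '{start}' <= CAST({datetime_key} as datetime)"
            f" AND CAST({datetime_key} as datetime) < datetime '{end}'"
            f" ORDER BY {datetime_key}"
        )
    return query
```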
Problem
In tables that use any form of epoch time, a view must be applied to convert the epoch time into a datetime data type. Other methods of watermarking a table also use integers: for example, if a Postgres xmin field were carried into a BigQuery table, it would be an effective watermark for table activity.
In the case of epoch time, the only current solution I can think of is to create a view on the BigQuery table.
Solution
Abstract datetime_key to allow integer watermarks as well.
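A sketch of an abstracted watermark predicate (the watermark_key and watermark_type config keys are invented here for illustration):

```python
def watermark_predicate(stream, start, end):
    # Hypothetical config keys: "watermark_key" names the column and
    # "watermark_type" is "datetime" (current behavior) or "integer"
    # (e.g. epoch seconds or a Postgres xmin carried into BigQuery).
    key = stream["watermark_key"]
    if stream.get("watermark_type") == "integer":
        return f"{int(start)} <= {key} AND {key} < {int(end)}"
    return (
        f"datetime '{start}' <= CAST({key} as datetime)"
        f" AND CAST({key} as datetime) < datetime '{end}'"
    )
```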