
tap-bigquery's People

Contributors

cjohnhanson, daigotanaka, niallrees, seanglynn-thrive, sphinks


tap-bigquery's Issues

start_datetime argument required

--start_datetime is a required argument.

Running this:

~/.venvs/tap-bigquery/bin/tap-bigquery \
    -c ~/singer.io/tap_bq_systemchecker/tap_config.json \
    --catalog ~/singer.io/tap_bq_systemchecker/catalog.json \
    | ~/.venvs/pipelinewise-target-snowflake/bin/target-snowflake \
    -c ~/singer.io/pipelinewise_target_snowflake/systemchecker_config.json \
    >> ~/singer.io/tap_bq_systemchecker/state.json

returns this:

(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
CRITICAL start_datetime not specified

It should be an optional argument.

`Cannot access field key on a value with type ARRAY` when trying to select (record, repeated) fields

I'm trying to extract a table with (record, repeated) fields coming from Google Analytics 4, using this config:

{
  "streams": [
    {
      "name": "analytics_257428027",
      "table": "orchest-ga4-data.analytics_257428027.events_20220810",
      "columns": [
        "event_timestamp",
        "event_name",
        "event_params.key",
        "event_params.value.string_value",
        "event_params.value.int_value",
        "traffic_source.name",
        "traffic_source.medium",
        "traffic_source.source"
      ],
      "datetime_key": "event_date"
    }
  ],
  "credentials_path": "/project-dir/meltano/client_secrets.json",
  "start_datetime": "2022-08-10T00:00:00Z",
  "start_always_inclusive": true
}

I'm getting this error:

 INFO Running query:
    SELECT event_timestamp,event_name,event_params.key,event_params.value.string_value,event_params.value.int_value,traffic_source.name,traffic_source.medium,traffic_source.source,event_date FROM orchest-ga4-data.analytics_257428027.events_20220810 WHERE 1=1 AND datetime '2022-08-10 00:00:00.000000' <= CAST(event_date as datetime) AND CAST(event_date as datetime) < datetime '2022-08-11 10:00:20.205737' ORDER BY event_date LIMIT 100
CRITICAL 400 Cannot access field key on a value with type ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, ...>>> at [1:48]
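
For context, BigQuery cannot address fields of a repeated record (ARRAY<STRUCT<...>>) with plain dot notation; the array has to be flattened with UNNEST first, which the tap's generated flat SELECT does not do. One workaround is to point the tap at a view that does the flattening. Below is a minimal sketch using the google-cloud-bigquery client; the view name events_20220810_flat is hypothetical, and the source table comes from the config above:

# Sketch of the workaround: flatten the repeated `event_params` records
# in a view so the tap can query plain column references. The view name
# is hypothetical; the source table comes from the config above.
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    CREATE OR REPLACE VIEW `orchest-ga4-data.analytics_257428027.events_20220810_flat` AS
    SELECT
      e.event_timestamp,
      e.event_name,
      ep.key AS event_param_key,
      ep.value.string_value AS event_param_string_value,
      ep.value.int_value AS event_param_int_value,
      e.traffic_source.name AS traffic_source_name,
      e.traffic_source.medium AS traffic_source_medium,
      e.traffic_source.source AS traffic_source_source,
      e.event_date
    FROM `orchest-ga4-data.analytics_257428027.events_20220810` AS e,
         UNNEST(e.event_params) AS ep
    """
).result()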

Incorrect auth instructions

The instructions for generating the client secrets https://github.com/anelendata/tap-bigquery#step-1-activate-the-google-bigquery-api are very detailed (thanks!), but unfortunately it looks like OAuth is the wrong method to use here. When I tried to use it, I first hit gh-22, and then got this error:

expected one of ('authorized_user', 'service_account', 'external_account', 'impersonated_service_account', 'gdch_service_account')

Creating a service account, as explained in target-bigquery https://github.com/adswerve/target-bigquery#step-2-authenticate-with-a-service-account, worked 👍🏽
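
For anyone else hitting this: a service-account key file is loaded with google-auth's service_account module, not the OAuth client flow. A minimal sketch, assuming the key was downloaded to the path below:

# Sketch: load a service-account key (not an OAuth client secret) and
# build a BigQuery client from it. The file path is an assumption.
from google.cloud import bigquery
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "/path/to/service_account.json"
)
client = bigquery.Client(credentials=creds, project=creds.project_id)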

When available, use the state.json bookmark instead of the --start_datetime argument

--start_datetime is a required argument.

Currently, when it is provided, it overrides the last_update property value from state.json:

{"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
INFO Running query:
    SELECT hash_id,processor,processor_cores,string_created_at,created_at,primary_resolution,memory,os,email,browser,netspeed,record_source FROM `dxdiag-aa08f.unioned.vw_system_checker` WHERE 1=1 AND datetime '2020-01-01 00:00:00.000000' <= CAST(created_at as datetime) AND CAST(created_at as datetime) < datetime '2020-09-09 07:07:52.550010' ORDER BY created_at
time=2020-09-09 07:07:54 name=target_snowflake level=INFO message=Table 'systemchecker_singer."SYSTEM_CHECKER"' does not exist. Creating...
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 9147, "tags": {"endpoint": "system_checker"}}
INFO Syncing stream:system_checker
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Uploading 9147 rows to external snowflake stage on S3
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Target S3 bucket: cfsnowflakestage, local file: /tmp/records_b51fywzm.csv, S3 key: prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Loading 9147 rows into 'systemchecker_singer."SYSTEM_CHECKER"'
time=2020-09-09 07:08:03 name=target_snowflake level=INFO message=Loading into systemchecker_singer."SYSTEM_CHECKER": {"inserts": 0, "updates": 9147, "size_bytes": 500703}
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Deleting prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv from external snowflake stage on S3
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Emitting state {"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}

If --start_datetime continues to be required, a bookmark value in state.json should override the --start_datetime argument value.
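
A minimal sketch of the proposed precedence; the function and its arguments are hypothetical, not the tap's actual internals:

# Hypothetical sketch of the requested behavior: a bookmark in state.json
# wins over the --start_datetime argument, which becomes a fallback.
def resolve_start_datetime(state, stream_name, args_start_datetime):
    bookmark = (
        state.get("bookmarks", {})
        .get(stream_name, {})
        .get("last_update")
    )
    if bookmark:
        return bookmark
    if args_start_datetime:
        return args_start_datetime
    raise ValueError("provide --start_datetime or a state.json bookmark")

With the state shown above, resolve_start_datetime(state, "system_checker", args.start_datetime) would return the 2020-09-01 bookmark rather than the CLI value.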

Schema row does not contain properties

This breaks many targets, including target-snowflake and Anelen's implementation of tap_bigquery.

Currently:

{"type": "SCHEMA", "stream": "pardot_export", "schema": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "key_properties": []}

And it should be:

{"type": "SCHEMA", "stream": "pardot_export", "schema": {"properties": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "selected": true, "type": "object"}, "key_properties": []}

New release

Hi, please could you create a new release for the functionality added in #18? Thank you 🙇‍♂️

Path to `client_secrets.json` cannot be specified via `config.json`

By convention:

The config file contains whatever parameters the Tap needs in order to pull data from the source. Typically this will include the credentials for the API or data source.

Reliance on a required GOOGLE_APPLICATION_CREDENTIALS env var makes this tap an exception, complicating integration with singer-runner tooling.
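
A minimal sketch of the conventional behavior, assuming a credentials_path key in config.json (the same key used in the GA4 config example above), with the env var kept as a fallback:

# Sketch: read the key-file path from config.json and fall back to the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
import json
import os

with open("config.json") as f:
    config = json.load(f)

credentials_path = config.get("credentials_path") or os.environ.get(
    "GOOGLE_APPLICATION_CREDENTIALS"
)
if credentials_path is None:
    raise ValueError(
        "set credentials_path in config.json or GOOGLE_APPLICATION_CREDENTIALS"
    )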

Support `credentials_json` as option for full-text of `client_secrets.json`

In some runtimes, it is not feasible to load a physical client_secrets.json file, for instance in managed or ephemeral environments.

Could we add support for something like credentials_json as a JSON string of the contents of client_secrets.json?

Relevant code snippets:

https://github.com/z3z1ma/target-bigquery/blob/2b771a26ae04ed34e7411b2bb3aa84c80ee9b141/target_bigquery/target.py#L318-L322

https://github.com/z3z1ma/target-bigquery/blob/2b771a26ae04ed34e7411b2bb3aa84c80ee9b141/target_bigquery/core.py#L213-L222

https://github.com/z3z1ma/target-bigquery/blob/2b771a26ae04ed34e7411b2bb3aa84c80ee9b141/target_bigquery/core.py#L578-L584
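
On the google-auth side this is a small change; a minimal sketch, where credentials_json is the proposed (not yet existing) config key holding the full key-file contents as a string:

# Sketch: build credentials from an in-config JSON string instead of a
# physical key file. `credentials_json` is the proposed config key.
import json

from google.oauth2 import service_account

with open("config.json") as f:
    config = json.load(f)

creds = service_account.Credentials.from_service_account_info(
    json.loads(config["credentials_json"])
)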

Running the tap without datetime_key?

Is it possible to use this tap without the datetime_key? I'm trying to sync a table where the date column is sometimes null but I still want to grab all of the rows.

datetime_key only allows datetime incremental watermarks

Problem
In tables that use any form of epoch time, a view must be applied to convert the epoch time into a datetime data type. Other methods of watermarking a table also use integers: for example, if a Postgres xmin field were carried into a BigQuery table, it would be an effective watermark for table activity.

In the case of epoch time, the only current solution I can think of is to create a view on the BigQuery table.

Solution
Abstract datetime_key to allow integer watermarks as well.
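
Until something like that lands, the view workaround sketched below converts an epoch-seconds column into a DATETIME the tap can use as its datetime_key; all table and column names are hypothetical:

# Sketch of the view workaround: expose an epoch-seconds integer column
# as a DATETIME usable as datetime_key. All names here are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    CREATE OR REPLACE VIEW `my-project.my_dataset.my_table_with_dt` AS
    SELECT
      t.*,
      DATETIME(TIMESTAMP_SECONDS(t.updated_at_epoch)) AS updated_at_dt
    FROM `my-project.my_dataset.my_table` AS t
    """
).result()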
