gcp-bigquery-client's Issues

Project model (numeric_id) is incorrect.

I ran the following code.

client
    .project()
    .list(GetOptions::default().max_results(10))
    .await?
    .projects

However, the following error occurred:

Error: RequestError(reqwest::Error { kind: Decode, source: Error("invalid type: string \"88888\", expected u64", line: 8, column: 33) })

Perhaps the type of numeric_id is incorrect: the API returns it as a JSON string, while the model expects a u64.

Therefore, it is necessary to change the field's type from u64 to String.
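
A minimal sketch of the fix; the derive and field attributes are assumptions based on the error message, which shows the REST API serializing numericId as a JSON string:

#[derive(Debug, serde::Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct Project {
    /// The API returns e.g. "88888" as a string, so deserializing into u64 fails.
    pub numeric_id: Option<String>,
    // ...remaining fields elided
}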

job: get_job: builder error: unsupported pair

Stacktrace: Oct 28 17:51:18.337 ERROR solana_skip_indexer::helpers::bq_utils: FATAL! builder error: unsupported pair

All we got from the client was "builder error: unsupported pair".

The access token was a vector of u8 with a size of 990.
The rest of the data required to execute get_job is shown below.

(Screenshot of the get_job inputs, taken 2021-10-28 at 5:53 PM, not reproduced here.)

Rationale

The purpose of getting this fixed is not just to keep it working, but to allow users to re-check a job's status should it still be in the 'RUNNING' or 'PENDING' state after awaiting the initial insert, delete, etc. call.
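
For reference, the call pattern this unblocks looks roughly like the following. The get_job signature and the Job/JobStatus field names are assumptions based on the REST jobs.get endpoint, not confirmed crate APIs:

// Poll the job until BigQuery reports it as DONE.
loop {
    let job = client.job().get_job(project_id, job_id, None).await?;
    let state = job
        .status
        .as_ref()
        .and_then(|s| s.state.as_deref())
        .unwrap_or("UNKNOWN");
    if state == "DONE" {
        break;
    }
    // Still PENDING or RUNNING: wait and check again.
    tokio::time::sleep(std::time::Duration::from_secs(2)).await;
}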

struct ServiceAccount is used but not available

Hi, this might be an issue with my understanding of the whole situation, but I ran into a problem using service-account keys.

Authenticating a client using

    /// Constructs a new BigQuery client from a [`ServiceAccountKey`].
    /// # Argument
    /// * `sa_key` - A GCP Service Account Key `yup-oauth2` object.
    /// * `readonly` - A boolean setting whether the acquired token scope should be readonly.
    ///
    /// [`ServiceAccountKey`]: https://docs.rs/yup-oauth2/*/yup_oauth2/struct.ServiceAccountKey.html
    pub async fn from_service_account_key(sa_key: ServiceAccountKey, readonly: bool) -> Result<Self, BQError> {
        ClientBuilder::new()
            .build_from_service_account_key(sa_key, readonly)
            .await
    }

references the struct ServiceAccountKey (the struct originates in the yup-oauth2 crate).

While using the default features of this crate, I cannot find ServiceAccountKey, so I cannot create the struct that is needed as a parameter.

I am not sure how to handle this, or whether it is even related to this crate, since I don't know who is responsible for including what.
Any hint is highly appreciated.

Workaround

A workaround for me is to read the key-file of the service account and use

from_service_account_key_file(sa_key_file: &str) -> Result<Self, BQError>
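
Another option is to depend on yup-oauth2 directly and construct the key yourself. A sketch, assuming the yup-oauth2 version in your Cargo.toml matches the one gcp-bigquery-client uses (otherwise the ServiceAccountKey types will not line up):

use yup_oauth2::ServiceAccountKey;

async fn make_client(json_key: &str) -> Result<gcp_bigquery_client::Client, gcp_bigquery_client::error::BQError> {
    // parse_service_account_key takes the JSON contents of a key file.
    let sa_key: ServiceAccountKey =
        yup_oauth2::parse_service_account_key(json_key).expect("invalid service account key JSON");
    gcp_bigquery_client::Client::from_service_account_key(sa_key, false).await
}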

Nice lib btw!

README example is vulnerable to SQL injection

The example in the README uses the following code to build a query:

    // Query
    let mut rs = client
        .job()
        .query(
            project_id,
            QueryRequest::new(format!(
                "SELECT COUNT(*) AS c FROM `{}.{}.{}`",
                project_id, dataset_id, table_id
            )),
        )
        .await?;

This appears to be vulnerable to SQL injection: if any of the project_id, dataset_id, or table_id values come from an untrusted source, they may contain additional SQL, e.g. DROP TABLE, which will be injected into the query and passed on to the BigQuery API.

If this is indeed the case, an example should be provided that avoids the issue. If BigQuery does not provide an API that's immune to SQL injection, the inputs should be sanitized of SQL statements recognized by BigQuery.
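
BigQuery query parameters cannot be used for identifiers (project, dataset, or table names), so those have to be validated; values, on the other hand, can go through query parameters (see the example under the next issue). A minimal identifier allow-list sketch; this is an illustration, not an API of the crate, and it is deliberately stricter than what BigQuery permits:

/// Accept only letters, digits and underscores, which rules out the backticks,
/// dots and whitespace an attacker would need to break out of the quoted name.
fn safe_ident(s: &str) -> Option<&str> {
    if !s.is_empty() && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '_') {
        Some(s)
    } else {
        None
    }
}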

Question about using the QueryParameter

Hey!

Originally posted this in the discussion forum but realized it might not have been the right place for it, so I'm creating an issue instead:

Thanks for creating this library, I'm really happy to be able to use rust for working with bigquery data!

I have a question about using the query parameter option:

Looking through the docs I've found that there is the QueryParameter option; however, it requires you to declare a QueryParameterType, which I'm not quite clear on how to do.

Looking at the code snippet for QueryParameterType, I read this as a recursive type, since there is no Option around array_type. I know recursion is normally fine because of the Box, but I don't see how to terminate it without an Option or some other kind of enum or break point.

pub struct QueryParameterType {
    pub array_type: Box<QueryParameterType>,
    pub struct_types: Option<Vec<QueryParameterTypeStructTypes>>,
    pub r#type: String,
}

Maybe there is some trick I don't know with Box to keep it from being infinitely recursive? I would love a short code snippet showing how to define a simple parameter type like a string or an int, if that is possible.
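
As quoted, the type indeed cannot be finitely constructed, which suggests a bug in that version of the model. In crate versions where array_type is wrapped in Option, a simple STRING parameter can be built along these lines; the module paths and Option-wrapped fields are assumptions based on the REST API models:

use gcp_bigquery_client::model::{
    query_parameter::QueryParameter,
    query_parameter_type::QueryParameterType,
    query_parameter_value::QueryParameterValue,
};

let param = QueryParameter {
    name: Some("name".to_string()),
    parameter_type: Some(QueryParameterType {
        array_type: None,   // only set for ARRAY parameters
        struct_types: None, // only set for STRUCT parameters
        r#type: "STRING".to_string(),
    }),
    parameter_value: Some(QueryParameterValue {
        value: Some("some value".to_string()),
        ..Default::default()
    }),
};

The parameter is then referenced in the SQL as @name, with the vector of parameters attached to the request via QueryRequest's query_parameters field.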

Tips for reading TableRows

Hi, thanks for writing and releasing this repository!

I'm using the query_all API to get a bunch of rows, and I'm trying to transform the results from TableRow to some struct.

As #31 mentioned, it seems like the results are all returned as strings. It's also pretty cumbersome to do this transformation. Consider the parse function in the example below:

use futures_util::StreamExt;
use gcp_bigquery_client::{
    error::BQError,
    model::{job_configuration_query::JobConfigurationQuery, table_row::TableRow},
    Client,
};
use tokio::runtime::Runtime;

// Placeholders for the real service account key path and project id.
const BQ_SA_KEY: &str = "service_account_key.json";
const GCP_PROJECT: &str = "my-project";

struct Example {
    letter: String,
    number: i64,
}

fn parse(row: &TableRow) -> Example {
    Example {
        letter: row
            .columns
            .as_ref()
            .unwrap()
            .get(0)
            .unwrap()
            .value
            .as_ref()
            .unwrap()
            .as_str()
            .unwrap()
            .to_string(),
        number: row
            .columns
            .as_ref()
            .unwrap()
            .get(1)
            .unwrap()
            .value
            .as_ref()
            .unwrap()
            .as_str()
            .unwrap()
            .parse::<i64>()
            .unwrap(),
    }
}

async fn load_examples() -> Result<Vec<Example>, BQError> {
    let client = Client::from_service_account_key_file(BQ_SA_KEY).await?;
    let response = client.job().query_all(
        GCP_PROJECT,
        JobConfigurationQuery {
            query: "SELECT x AS letter, 1 AS number FROM UNNEST(['a', 'b', 'c']) x".to_string(),
            use_legacy_sql: Some(false),
            ..Default::default()
        },
        Some(2),
    );

    tokio::pin!(response);

    let mut examples: Vec<Example> = vec![];
    while let Some(page) = response.next().await {
        match page {
            Ok(rows) => {
                examples.extend(rows.iter().map(parse));
            }
            Err(e) => {
                return Err(e);
            }
        }
    }
    Ok(examples)
}

fn main() {
    let rt = Runtime::new().unwrap();
    let examples = rt.block_on(load_examples()).expect("bigquery error");
    for e in examples {
        println!("letter: {}\tnumber: {}", e.letter, e.number);
    }
}

It's pretty awkward! There are two issues at play:

  1. I need to take the number result as a string and then parse an i64 out of it.
  2. It's pretty cumbersome to get the actual result values from a TableRow.

I'm sure there exists a good way to, given a TableRow and a TableSchema, construct a struct, but I'm not sure how to do it. And maybe the first issue is not a bug, and just an issue with how I'm reading the data, but I can't find a way to get it to work.

Would it be possible to create an example of how to use this API to generate a clean data structure out of a TableRow?
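
One way to cut the boilerplate down, reusing the Example struct above; this is an illustration, not an API offered by the crate. A fallible variant of parse built on a small column-extraction helper:

/// Returns column `i` of `row` as a string slice, if present.
fn col_str(row: &TableRow, i: usize) -> Option<&str> {
    row.columns.as_ref()?.get(i)?.value.as_ref()?.as_str()
}

fn parse(row: &TableRow) -> Option<Example> {
    Some(Example {
        letter: col_str(row, 0)?.to_string(),
        // BigQuery returns INT64 cells as JSON strings, so parse after extracting.
        number: col_str(row, 1)?.parse().ok()?,
    })
}

The collection step then becomes examples.extend(rows.iter().filter_map(parse));, dropping malformed rows instead of panicking.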

Allow to change the base url

Hi! I would like to test this crate locally by using bigquery-emulator.
However, this crate doesn't allow changing the base url (it always assumes https://bigquery.googleapis.com/bigquery/v2/):

let req_url = "https://bigquery.googleapis.com/bigquery/v2/projects";

Instead, google_bigquery2 allows customizing both the base_url and the root_url.

Do you plan to add support for this?
If not, would you accept a PR?
Thanks 🙏

"Rows are not present" panics for query_all

I've been using version 0.16.6 since its release, paging through big results successfully, apart from occasional panics. Since a few days ago, my ingestion panics on every request, making it unusable. I'm wondering if there's been a change on Google's end?

I'm now running 0.16.7, and the only way I could get some successful queries through was to cut my query size way down. But this multiplies my query cost and still panics quite often.

Default Return Type

Is the default return type of TableCell a string? If not, where can I change that? Am I missing some setting in BQ?

Thank you for your help!

[request] release a new crate version

Hi @lquerel, first of all, thanks a lot for developing this crate! I'm using it and it works quite nicely!

And the reason for this issue: could you release a new crate version? I'm having issues running a project that relies on this crate due to conflicts on the hyper-rustls version. Releasing a new version with hyper-rustls 0.24 should fix it.

   --> /Users/andrehahn/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gcp-bigquery-client-0.16.6/src/auth.rs:66:28
    |
66  |                 auth: Some(auth),
    |                       ---- ^^^^ expected `HttpsConnector<HttpConnector>`, found `hyper_rustls::connector::HttpsConnector<HttpConnector>`
    |                       |
    |                       arguments to this enum variant are incorrect
    |
    = note: `hyper_rustls::connector::HttpsConnector<HttpConnector>` and `HttpsConnector<HttpConnector>` have similar names, but are actually distinct types
note: `hyper_rustls::connector::HttpsConnector<HttpConnector>` is defined in crate `hyper_rustls`
   --> /Users/andrehahn/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-rustls-0.24.0/src/connector.rs:19:1
    |
19  | pub struct HttpsConnector<T> {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: `HttpsConnector<HttpConnector>` is defined in crate `hyper_rustls`
   --> /Users/andrehahn/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-rustls-0.23.2/src/connector.rs:20:1
    |
20  | pub struct HttpsConnector<T> {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    = note: perhaps two different versions of crate `hyper_rustls` are being used?

Thanks!

Support profiling queries via tracing

Adding support for tracing via https://docs.rs/tracing/latest/tracing/ would be useful. Given that most of the requests go through reqwest, https://github.com/TrueLayer/reqwest-middleware seems like a good candidate.

Let me know if this is something you think would be generally useful, and I'm happy to submit a PR for it. It could sit behind a cargo feature, maybe profile, so that it is enabled only when profiling.
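
In the meantime, individual calls can already be wrapped in a span from the caller's side. A minimal sketch using tracing's Instrument; this is plain tracing usage, not an API of this crate:

use tracing::Instrument;

// Attach a span to the query future so the request is recorded under it.
let span = tracing::info_span!("bq_query", project_id);
let result_set = client
    .job()
    .query(project_id, query_request)
    .instrument(span)
    .await?;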

Thanks for the awesome crate.

Replace chrono dependency with time?

result of cargo audit when trying to use gcp-bigquery-client:

Crate:         chrono
Version:       0.4.19
Title:         Potential segfault in `localtime_r` invocations
Date:          2020-11-10
ID:            RUSTSEC-2020-0159
URL:           https://rustsec.org/advisories/RUSTSEC-2020-0159
Solution:      No safe upgrade is available!
Dependency tree: 
chrono 0.4.19
├── yup-oauth2 6.3.1
│   └── gcp-bigquery-client 0.11.0
│       └── cjms 1.0.0
└── gcp-bigquery-client 0.11.0

chrono does not appear to be actively maintained.

Could the time crate meet the needs of this repo?

I recently had a PR merged into yup-oauth2 that moves that crate from chrono to time (dermesser/yup-oauth2#172). I'd be happy to do a PR here too.

This would have helped me with integrating gcp-bigquery-client into my project. In the end, I needed such a small fraction of the power of your repo that I've just taken the few pieces that I need.

Support for GZIP compression

Hey @lquerel!
I've noticed that, for some reason, GZIP is not enabled for outgoing request bodies. Based on my data, enabling GZIP compression for the request body results in faster transfer speeds. I want to contribute a small pull request to implement it.

I see two ways to implement this:

  1. Adding a custom Cargo feature, GZIP.
  2. Adding an enum parameter to TableDataApi::insert_all that indicates the compression algorithm.
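
Either way, the core change is small. A sketch of the body compression itself, assuming the implementation would use the flate2 crate and send the result with a Content-Encoding: gzip header (both are implementation choices, not current crate behavior):

use std::io::Write;

use flate2::{write::GzEncoder, Compression};

/// Gzip-compress a serialized request body. The caller would then send it via
/// reqwest with a `Content-Encoding: gzip` header and the compressed bytes as body.
fn gzip_body(json: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(json)?;
    encoder.finish()
}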

The same data is sent three times with a max batch size of 50_000.

Without GZIP:

Inserting 52511 rows to ***, geographic location europe-west2
Inserted 52511 rows to *** in 13.08 seconds

Inserting 52511 rows to ***, geographic location europe-west2
Inserted 52511 rows to *** in 12.93 seconds

Inserting 52511 rows to ***, geographic location europe-west2
Inserted 52511 rows to *** in 12.56 seconds

With GZIP:

Inserting 52511 rows to ***, geographic location europe-west2
Inserted 52511 rows to *** in 7.49 seconds

Inserting 52511 rows to ***, geographic location europe-west2
Inserted 52511 rows to *** in 7.57 seconds

Inserting 52511 rows to ***, geographic location europe-west2
Inserted 52511 rows to *** in 7.26 seconds

How to propagate location setting to Job configuration

Hello there,

It may be a bit of a silly question, but how can one propagate the location setting when creating a job against a BigQuery table?
Something like your pagination.rs example, but with the location set?

Looking through the source code/docs, it is not really obvious where it needs to be set. I'd assume it to be part of JobConfigurationQuery, but that struct doesn't appear to have a location field.

I tried to use ConnectionProperty (which is a field of JobConfigurationQuery), but it doesn't look like the correct option.
And as far as I can see, JobConfigurationQuery is the only thing that can be passed down to the actual call that creates the job.

Would definitely appreciate some help on this one :)

UPD:
Found this LoC, so one can specify the location of the job. However, it seems not to be exposed when calling client.job().query_all, so this feels like a small bug (unless I am missing something else here).
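
For the single-shot query path, the REST jobs.query request carries a top-level location field. Assuming the crate's QueryRequest mirrors it (the field name here is an assumption), something like this should work until query_all exposes the setting:

use gcp_bigquery_client::model::query_request::QueryRequest;

let mut request = QueryRequest::new("SELECT 1");
// The REST jobs.query request accepts the job location directly.
request.location = Some("europe-west2".to_string());
let result_set = client.job().query(project_id, request).await?;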
