### Is there an existing feature request for this?
### Describe the Feature
#### Background
Currently, the dbt-bigquery profiles.yml exposes two relevant configurations: `timeout` and `retries`.

We at Semios have been using dbt extensively to run high-frequency data pipelines on BigQuery (some of them run every 10 minutes). Because of strict SLA requirements, we need more granular control over BigQuery's timeout and retry mechanisms to minimize intermittent query failures and their impact on our pipelines.

We have found that the existing configs are not enough to mitigate all intermittent query failures. For example, the `Deadline of 120.0s exceeded` 503 errors raised by the BigQuery JobInsert API occur from time to time and have impacted several dbt-bigquery users in the dbt community.

We would like to propose a more comprehensive set of BigQuery retry configs to minimize these errors.
#### Technical Context
Before presenting the proposal, let me explain how a BigQuery query works in dbt and how we would like to retry it.
#### How a BigQuery query works in dbt
At its core, a BigQuery query in dbt is made in two steps:
```python
def _query_and_results(self, client, sql, conn, job_params, timeout=None):
    """Query the client and wait for results."""
    # Cannot reuse job_config if destination is set and ddl is used
    job_config = google.cloud.bigquery.QueryJobConfig(**job_params)
    query_job = client.query(sql, job_config=job_config)  # <--- Step 1
    iterator = query_job.result(timeout=timeout)          # <--- Step 2
    return query_job, iterator
```
In the first step, `client.query()` submits the query to the BQ JobInsert API. When the call succeeds, the BQ server creates a new BigQuery query job and returns the job ID to the client as part of the `query_job` object. This step should be very quick, normally under a few seconds; however, in some rare cases it can take much longer, even up to 4 minutes according to the BigQuery engineering team.
In the second step, `query_job.result()` waits for BigQuery to execute (run) the query and returns the results to the client as an iterator. Depending on the complexity of the query, this step can take a long time, from tens of seconds to tens of minutes.
#### How dbt currently retries BigQuery queries
Currently, the dbt `retries` config sets the number of retries of the overall `_query_and_results` call, and the dbt `timeout` config controls only step 2, `query_job.result(timeout=timeout)`. Each of these configs makes sense on its own.
However, you might notice that there is no control over the timeout of `query()` in step 1, other than relying on its [default value](https://github.com/googleapis/python-bigquery/blob/1762e5bd98ab089abf65a0d146234c77217b8dbe/google/cloud/bigquery/client.py#L3193). When the BQ JobInsert API sharding cluster is unstable, it can take up to 4 minutes to create the query job ID in some rare cases. Because `query()` had a client-side default deadline of 120s (`DEFAULT_RETRY = retry.Retry(predicate=_should_retry)`), the client gave up at 120s while the server side was still waiting for the job creation. That's why several dbt community members experienced the [Deadline of 120.0s exceeded errors](https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400).
The BQ team has since patched `query()` to use a client-side default deadline of 10 minutes (`DEFAULT_RETRY = retry.Retry(predicate=_should_retry, deadline=600.0)`). This means that, by default, `query()` can take up to 4 minutes to make one attempt to create a new job ID; if that attempt fails, the client can keep retrying for up to 10 minutes. This works fine for users who don't have stringent timing requirements: they won't mind waiting up to 10 minutes to create a new BQ job ID and then waiting up to another 10 minutes for the result.

Unfortunately, it doesn't work for us. In situations like this, we would rather fail fast and try again. It should not take more than 30 seconds to create a new job ID; if it takes longer, we would rather let `query()` time out and fail, so that we can retry it.
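The fail-fast behavior we want can be sketched in plain Python. This helper is loosely modeled on the semantics of `google.api_core.retry.Retry` (retry transient errors until a deadline elapses); the names here are illustrative, not the real API:

```python
import time


class RetryDeadlineExceeded(Exception):
    """Raised when the retry deadline elapses (illustrative only)."""


def retry_with_deadline(func, deadline, is_transient):
    """Call `func`, retrying transient errors until `deadline` seconds pass."""
    start = time.monotonic()
    while True:
        try:
            return func()
        except Exception as exc:
            if not is_transient(exc):
                raise
            if time.monotonic() - start >= deadline:
                raise RetryDeadlineExceeded("deadline exceeded") from exc
            # a real implementation would sleep with exponential backoff here


attempts = {"n": 0}


def create_job():
    """Simulated JobInsert call: two transient 503s, then success."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("503 Service Unavailable")
    return "job_456"


job_id = retry_with_deadline(
    create_job,
    deadline=30,  # fail fast: give up after 30s instead of the 600s default
    is_transient=lambda exc: isinstance(exc, ConnectionError),
)
print(job_id)  # job_456
```

With a 30-second deadline on job creation, a stuck JobInsert call fails quickly and the overall query can be retried, instead of silently burning 10 minutes of our SLA budget.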
#### The Proposed Solution
To gain fine-grained control over the timeout mechanisms of step 1 and step 2, we would like to propose the following four dbt configs:

```
job_creation_timeout_seconds   # specific to initiating the BQ job; controls the timeout of step 1, query()
job_execution_timeout_seconds  # specific to awaiting the job result; controls the timeout of step 2, result()
job_retry_deadline_seconds     # controls the overall query: the retry deadline of _query_and_results()
job_retries                    # controls the overall query: the retries of _query_and_results()
```
For example, we could set the configs below to fail fast at the BQ job creation step, while still allowing queries with long-running results:

```
job_creation_timeout_seconds=30
job_execution_timeout_seconds=1200
job_retry_deadline_seconds=1500
job_retries=3
```
These settings would allow us to control the timeout behaviors of step 1 and step 2 independently, maximizing our chances of mitigating different kinds of intermittent errors.
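For illustration, here is a sketch of how these proposed keys could sit in a profiles.yml target. Only the four `job_*` keys are part of the proposal; the surrounding profile values are placeholders:

```yaml
my_profile:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: my-gcp-project   # placeholder
      dataset: analytics        # placeholder
      threads: 4
      # proposed configs
      job_creation_timeout_seconds: 30
      job_execution_timeout_seconds: 1200
      job_retry_deadline_seconds: 1500
      job_retries: 3
```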
NOTE:

- `job_execution_timeout_seconds` is a renaming of the current `timeout` config.
- `job_retries` is a renaming of the current `retries` config.
### Describe alternatives you've considered
Currently, we have to override the dbt-core code to allow `query()` to fail fast and retry, in order to mitigate the 503 errors.
### Who will this benefit?
dbt-bigquery users who need finer-grained control over BigQuery query behaviors.
### Are you interested in contributing this feature?
Yes.
### Anything else?
This issue has been discussed by several dbt community members in Slack:
https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400