Assets related to the operation of dbt Labs.
dbt-labs / corp
License: Apache License 2.0
The link to the company values is broken for at least a few of the job postings.
I saw this on the following job postings:
Engineering Manager, Cloud Application
https://boards.greenhouse.io/dbtlabsinc/jobs/4350465005
Engineering Manager, dbt Explorer
https://boards.greenhouse.io/dbtlabsinc/jobs/4351839005
For both of those, the bullet point "Aligns with our core values" points to https://github.com/fishtown-analytics/corp/blob/master/values.md.
That link returns a 404.
I did notice that the posting for Engineering Manager, Orchestration (https://boards.greenhouse.io/dbtlabsinc/jobs/4326424005) works fine since it points to your webpage: https://www.getdbt.com/dbt-labs/values
I hope this channel is ok to report this - I would want someone to tell me and since the 404 was in this git repo it seemed the most obvious way.
Remove or adjust the "ON over USING" recommendation in the style guide.
The USING syntax can make queries more readable and easier to understand when used appropriately.
The style guide recommends using ON instead of USING in the SQL style guide section:
Line 286 in 09e4d4d
This is definitely appropriate when there is a mix of (left) joins over different sets of columns.
However, there are very good use cases for the USING syntax in the databases that support it, in particular with inner joins and full joins where the join column(s) are always the same.
Suppose we have 3 tables: prospects, applicants, and customers, which each share a user_id and each have a corresponding date column. A simple inner join between the 3 of them using the ON syntax might look like:
SELECT
    prospects.user_id,
    prospects.prospect_date,
    applicants.application_date,
    customers.onboard_date
FROM prospects
INNER JOIN applicants
    ON prospects.user_id = applicants.user_id
INNER JOIN customers
    ON prospects.user_id = customers.user_id
;
This works fine, but:
- It is not obvious that user_id is in each of the tables.
- It is prone to copy-and-paste errors, such as leaving prospects.user_id = applicants.user_id in the customers join.
Alternatively, a simple inner join between the 3 of them using the USING syntax might look like:
SELECT
    user_id,
    prospects.prospect_date,
    applicants.application_date,
    customers.onboard_date
FROM prospects
INNER JOIN applicants
    USING(user_id)
INNER JOIN customers
    USING(user_id)
;
To people that are new to USING, the user_id looks ambiguous, but this implies to me (as someone that uses USING a lot) that user_id will actually be in all of the tables precisely because the table prefix has been omitted.
Additionally, the joins are now much cleaner and no longer prone to copy-and-paste errors.
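The inner join comparison above can be checked with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for a warehouse; the sample rows are invented for illustration, and other databases may treat USING slightly differently. It also reproduces the copy-and-paste slip, where the customers join reuses the applicants condition and silently fans out rows.

```python
import sqlite3

# In-memory database with hypothetical sample rows (not from the original issue).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prospects  (user_id INTEGER PRIMARY KEY, prospect_date TEXT);
    CREATE TABLE applicants (user_id INTEGER PRIMARY KEY, application_date TEXT);
    CREATE TABLE customers  (user_id INTEGER PRIMARY KEY, onboard_date TEXT);
    INSERT INTO prospects  VALUES (1, '2023-01-01'), (2, '2023-01-02'), (3, '2023-01-03');
    INSERT INTO applicants VALUES (1, '2023-02-01'), (2, '2023-02-02');
    INSERT INTO customers  VALUES (1, '2023-03-01');
""")

SELECT_COLUMNS = """
    prospects.prospect_date,
    applicants.application_date,
    customers.onboard_date
"""

# Join written with ON: every join key is spelled out twice.
on_rows = conn.execute(f"""
    SELECT prospects.user_id, {SELECT_COLUMNS}
    FROM prospects
    INNER JOIN applicants ON prospects.user_id = applicants.user_id
    INNER JOIN customers  ON prospects.user_id = customers.user_id
    ORDER BY 1
""").fetchall()

# Same join written with USING: the shared key appears once per join.
using_rows = conn.execute(f"""
    SELECT user_id, {SELECT_COLUMNS}
    FROM prospects
    INNER JOIN applicants USING (user_id)
    INNER JOIN customers  USING (user_id)
    ORDER BY 1
""").fetchall()

# Copy-and-paste slip: the customers join repeats the applicants condition,
# so customers is effectively cross joined to every matched row.
buggy_rows = conn.execute(f"""
    SELECT prospects.user_id, {SELECT_COLUMNS}
    FROM prospects
    INNER JOIN applicants ON prospects.user_id = applicants.user_id
    INNER JOIN customers  ON prospects.user_id = applicants.user_id
    ORDER BY 1
""").fetchall()

print(on_rows == using_rows)            # True: both forms return the same single row
print(len(on_rows), len(buggy_rows))    # 1 2: the slip adds a row for user 2
```

The buggy query runs without any error, which is what makes this kind of mistake easy to miss in review.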
Using the same tables as the previous example, a simple full join between the 3 tables using the ON syntax might look like:
SELECT
    COALESCE(prospects.user_id, applicants.user_id, customers.user_id) AS user_id,
    prospects.prospect_date,
    applicants.application_date,
    customers.onboard_date
FROM prospects
FULL JOIN applicants
    ON prospects.user_id = applicants.user_id
FULL JOIN customers
    ON COALESCE(prospects.user_id, applicants.user_id) = customers.user_id
;
Since we have full joins, we need to make use of the COALESCE function to make sure that we're getting all of the non-null values. This does not scale nicely: more joins to more tables leads to larger and larger COALESCE calls.
Alternatively, a simple full join between the 3 of them using the USING syntax might look like:
SELECT
    user_id,
    prospects.prospect_date,
    applicants.application_date,
    customers.onboard_date
FROM prospects
FULL JOIN applicants
    USING(user_id)
FULL JOIN customers
    USING(user_id)
This is much, much simpler and handles all of the COALESCE-ing for us. It extends much more easily than the ON syntax, and still has the benefits laid out in the inner join example.
To clarify, I don't think that USING should always be used: I think it should just be used where it's more appropriate to use than ON (whatever "more appropriate" means). In particular, I don't think USING should be used when different columns are being used in each of the joins.
For example, here's a query that uses the ON syntax:
SELECT
    customers.customer_id,
    loans.loan_id,
    repayments.repayment_id,
    repayments.repayment_date,
    repayments.repayment_value
FROM customers
LEFT JOIN loans
    ON customers.customer_id = loans.customer_id
LEFT JOIN repayments
    ON loans.loan_id = repayments.loan_id
This is one where I think the ON syntax is clearer than the USING syntax, as the USING syntax now hides which columns come from which tables:
SELECT
    customer_id,
    loan_id,
    repayments.repayment_id,
    repayments.repayment_date,
    repayments.repayment_value
FROM customers
LEFT JOIN loans
    USING(customer_id)
LEFT JOIN repayments
    USING(loan_id)
This is a case where I would prefer the ON syntax over the USING syntax.
I don't know how best to rephrase the recommendation to account for this nuance, which is why I think it'd be best to drop the recommendation and leave it to the developers to use the syntax that is more appropriate for their use case. A stab at a rephrased recommendation is:
- Avoid the `using` clause in joins, preferring instead to explicitly list the CTEs and associated join keys with an `on` clause, unless the joins are all over the same column(s).
For reference, this is the PR that added this item in (it has the rationale in the description and across some of the comments):
In the 1.5 release we will be introducing model versioning! 🔥 🥳
As a result a model will have two parts:
- the model name, which is set by the name configuration and will be used in the {{ ref() }} function
- the file name, which is set by defined_in
As a result we will need to update our style guide to demonstrate how we think both the model name and file name should be structured.
Drop the "UNION ALL over UNION" recommendation in the style guide.
Note this is not a request to change the recommendation to "UNION over UNION ALL"; this is a request to not make a comment favouring either.
The style guide recommends using UNION ALL instead of UNION in the SQL style guide section:
Line 278 in 09e4d4d
Although the linked page does show an example where the UNION ALL syntax is required over just UNION, this is not representative of most pipelines.
Given that tables usually have a well-defined primary key, using UNION ALL by default instead of UNION runs the risk of propagating duplicate rows throughout the project, especially in cloud warehouses that don't verify the uniqueness of primary keys. For large data estates with numerous incremental models and aggregates over them, there would be a significant cost associated with running a full refresh to fix the issues caused by these duplicates.
Rather than favouring UNION or UNION ALL, I don't think there should be a recommendation to use one over the other in general. They have different use cases, so it should be up to the developer to choose the one that is appropriate for their given use case.
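To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders_2022 and orders_2023 tables and their rows are hypothetical, chosen so that one order appears in both inputs. It shows UNION ALL carrying the overlapping row through twice, UNION collapsing it, and a simple grouped query that surfaces the duplicated key.

```python
import sqlite3

# Hypothetical tables: order 2 appears in both yearly extracts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_2022 (order_id INTEGER, amount INTEGER);
    CREATE TABLE orders_2023 (order_id INTEGER, amount INTEGER);
    INSERT INTO orders_2022 VALUES (1, 10), (2, 20);
    INSERT INTO orders_2023 VALUES (2, 20), (3, 30);
""")

# UNION ALL keeps every input row, including the overlapping one.
union_all_rows = conn.execute("""
    SELECT order_id, amount FROM orders_2022
    UNION ALL
    SELECT order_id, amount FROM orders_2023
    ORDER BY order_id
""").fetchall()

# UNION removes rows that are identical across all selected columns.
union_rows = conn.execute("""
    SELECT order_id, amount FROM orders_2022
    UNION
    SELECT order_id, amount FROM orders_2023
    ORDER BY order_id
""").fetchall()

# A uniqueness check on the intended primary key of the UNION ALL result.
duplicate_keys = conn.execute("""
    SELECT order_id
    FROM (
        SELECT order_id, amount FROM orders_2022
        UNION ALL
        SELECT order_id, amount FROM orders_2023
    )
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()

print(union_all_rows)   # [(1, 10), (2, 20), (2, 20), (3, 30)]
print(union_rows)       # [(1, 10), (2, 20), (3, 30)]
print(duplicate_keys)   # [(2,)]
```

Note that UNION only removes rows that match on every selected column, so it is not a substitute for a uniqueness test on the key when the overlapping rows differ in other columns.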
This presentation was given at the Fishtown Analytics fall retreat in November of 2019. There are several changes from this presentation that haven't yet been updated in values.md:
New value: We work hard and go home.
This value is new, and intentionally new. It's not a rewording of something that we've already believed, it's actually a commitment to something that we have not in the past been committed to. I personally have been guilty of being a workaholic for most of my adult life, and as a result of this, plus as a result of the difficulties inherent in bootstrapping, much of the history of Fishtown Analytics has involved working a lot of long hours. We're making explicit efforts to correct this, and the addition of this value is a public statement of our commitment.
In doing so, we're stealing Slack's formulation. This formulation rings true: we're committed to maintaining our intensity, but also to confining the part of our days that we all dedicate to work. This is not only the right thing to do, it also promotes long-term sustainability, minimizes burnout, and is more inclusive of those with care, or other, responsibilities.
New value: We are patient, yet urgent.
This value is also new, but it's not new for us. This is something we've lived from the founding of the company. We work every day with a sense of urgency, but we think strategically and optimize for the long-term. This balance between urgency and patience is hard to strike, but the creativity involved in doing so has always been central to our success.
Reformulation: Work done well is its own end.
This used to read "Work done well conveys dignity." While this is not a bad statement, it's not exactly what I was trying to say. Rather, it is that the process of creation, and our full engagement in and commitment to that process, inevitably enriches us. It is not just the nail that is created, it is the blacksmith. I think this formulation gets at that more effectively.
There is no opinion put forth in the dbt style guide about how to name exposures.
The style guide recommends using CTEs for transformation steps, ending with a CTE called final. It also recommends using staging models to select from sources.
Would you also recommend using a single CTE called final in these staging models?
E.g. this dbt learn course doesn't have that CTE and IMHO this makes sense as the only purpose there is to map the sources and you should not do any transformation logic in there.
cc @coapacetic (maintainer of that course)