
Comments (10)

z3z1ma commented on September 26, 2024

You can just supply the columns argument to the @dlt.resource decorator and those should be guaranteed to exist.
[screenshot: a @dlt.resource(...) call with the columns argument highlighted]
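For instance, a minimal sketch of the idea (the resource name, column names, and types here are just placeholders):

```python
import dlt

# Columns declared up front are created in the destination table,
# even when some (or all) extracted records never contain them.
@dlt.resource(
    name="events",
    columns={
        "event_id": {"data_type": "bigint", "nullable": False},
        "payload": {"data_type": "text", "nullable": True},
        "occurred_at": {"data_type": "timestamp", "nullable": True},
    },
)
def events():
    # This record has no "payload" key, but the column still exists in the table.
    yield {"event_id": 1, "occurred_at": "2024-09-01T00:00:00Z"}
```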


hello-world-bfree commented on September 26, 2024

> You can just supply the columns argument to the @dlt.resource decorator and those should be guaranteed to exist.

Thanks! Would that require explicitly defining the table schemas in advance?

I assume since dlt.current.source_schema() would have nothing to offer on the first run without an explicit schema, dlt would revert to building the schema as it processes the data. Is that right?


z3z1ma commented on September 26, 2024

dlt.current.source_schema() is not particularly relevant here. I just use it to access the configured normalizer for snake case conversion.

What is relevant is the columns argument my cursor is on. And what I am trying to highlight is the fact you can populate that argument however you want. In this case, I am populating the columns up front by using the SFDC describe API. You can do something similar for a database by querying the information schema. You can also just list out columns manually. Anything there should be created in the destination.
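For example, a rough sketch of building the columns hint from a database's information schema (the connection string, type mapping, and fetch_orders function below are illustrative assumptions, not part of my screenshot):

```python
import dlt
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/db")  # placeholder DSN

# Very rough mapping from information_schema types to dlt data types; extend as needed.
TYPE_MAP = {
    "integer": "bigint",
    "bigint": "bigint",
    "text": "text",
    "character varying": "text",
    "boolean": "bool",
    "numeric": "decimal",
    "timestamp with time zone": "timestamp",
}

def table_columns(table_name: str) -> dict:
    """Build a dlt `columns` dict covering every column of `table_name`."""
    with engine.connect() as conn:
        rows = conn.execute(
            sa.text(
                "SELECT column_name, data_type, is_nullable "
                "FROM information_schema.columns WHERE table_name = :t"
            ),
            {"t": table_name},
        )
        return {
            row.column_name: {
                "data_type": TYPE_MAP.get(row.data_type, "text"),
                "nullable": row.is_nullable == "YES",
            }
            for row in rows
        }

@dlt.resource(name="orders", columns=table_columns("orders"))
def orders():
    # fetch_orders is a placeholder for however you actually extract the rows;
    # every column declared above is created even if some never appear in the data.
    yield from fetch_orders()
```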


hello-world-bfree commented on September 26, 2024

Ahh alright, got it. Thanks @z3z1ma! That's slightly different from the functionality I'm requesting.

I'd like dlt to fill in the missing columns without having to supply the columns explicitly. At the moment, when the first run fails because no column information was supplied, dlt already appears to have complete knowledge of the table schema and which columns are missing. I'd like dlt to use that info to standardize the schema across all records for me, with no advance input required.


rudolfix commented on September 26, 2024

@hello-world-bfree dlt infers the schema from the data. If optional fields are not present in the actual extract, they will not be added. @z3z1ma is right: if there are any additional columns that you want in the schema, you should add them via columns. All other columns will still be inferred, by the way.

If you have an OpenAPI spec for your endpoint, you can generate Pydantic models for the response items and then use them as columns. We are working on automating this process.
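For example, a sketch with a hand-written model standing in for one generated from an OpenAPI spec (e.g. with a tool like datamodel-code-generator):

```python
from typing import Optional

import dlt
from pydantic import BaseModel

# Stand-in for a generated response-item model.
class Item(BaseModel):
    id: int
    name: str
    discount: Optional[float] = None  # optional fields still become columns

@dlt.resource(name="items", columns=Item)
def items():
    # "discount" is absent from this record, but the column is declared via the model.
    yield {"id": 1, "name": "widget"}
```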


hello-world-bfree commented on September 26, 2024

Thanks @rudolfix! Is it wrong to say that subsequent pipeline runs have knowledge of the existing schema in the destination then? That's the assumption I'm making given the schema evolution docs.

A common problem with any pipeline is that an upstream source system removes a column without notice. If I'm understanding it right, dlt would fail the pipeline. Having to update my dlt code and manually add the removed column to columns whenever this happens, just to preserve my destination table and keep the pipeline running, is workable, but less desirable than having the missing columns handled without failure or user intervention.

After reading closer, this works but only on subsequent pipeline runs. The initial pipeline run is what seems prone to potential issues with missing columns.


hello-world-bfree commented on September 26, 2024

Thanks for bearing with me. I think I've got a better understanding now.

The issue seems to be with the initial pipeline run. If I try to run a complete backfill, it fails. However, if the initial run only covers a subset of the data with a consistent schema (either no missing columns, or only records with missing columns), it works just fine and I can backfill the rest of the data successfully on subsequent runs.

I think it'd be helpful to have an additional step before load for an initial pipeline run that standardizes the schema across the entire dataset.


z3z1ma commented on September 26, 2024

> I think it'd be helpful to have an additional step before load for an initial pipeline run that standardizes the schema across the entire dataset.

Can you try specifying the columns kwarg like I mentioned? That standardizes it up front. I think it even takes a Pydantic model. If you set it, the columns are guaranteed to exist. That bit has nothing to do with initial or subsequent runs per se. dlt just creates an initial schema from that argument and then schema inference/discovery takes it the rest of the way, which means you end up with a superset of the explicitly passed columns and the discovered columns. A schema contract or Pydantic model could probably lock that down.
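For instance, a sketch of combining explicit columns with a schema contract that freezes new columns (contract modes and exact placement may need adjusting for your dlt version):

```python
import dlt

@dlt.resource(
    name="items",
    columns={
        "id": {"data_type": "bigint", "nullable": False},
        "name": {"data_type": "text", "nullable": True},
    },
    # "freeze" raises on columns not already in the schema; the default
    # "evolve" would simply add them.
    schema_contract={"tables": "evolve", "columns": "freeze", "data_type": "evolve"},
)
def items():
    yield {"id": 1, "name": "widget"}
```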

If your contention is that you don't want to specify the schema up front but you want the schema to be standardized, then you have a chicken-and-egg problem. dlt doesn't "know" the schema ahead of time. It only does if the source is able to discover the columns in some way and inform dlt via the columns kwarg or the dlt.mark module. The former is exactly what I was doing in my screenshot.

If you believe it knows the schema ahead of time, then it is probably the state tables (_dlt_<table>) that are providing that continuity on subsequent runs. But on a new dataset, those _dlt-prefixed tables don't exist.

Let me know if any of this is useful @hello-world-bfree


hello-world-bfree commented on September 26, 2024

> Can you try specifying the columns kwarg like I mentioned. That standardizes it up front. I think it even takes a pydantic model. If you set it, the columns are guaranteed to exist.

Apologies for not being more clear @z3z1ma. That method 100% works and gets the correct functional outcome, no doubt. My feature request aims to solve the problem of missing columns by default, though, without explicit user intervention. And to be clear, I mean missing as in intermittently appearing within the same dataset, not missing altogether.

> If your contention is that you don't want to specify the schema up front but you want the schema to be standardized, then you have a chicken and egg problem. dlt doesn't "know" the schema ahead of time.

While dlt may not know the schema ahead of time, it infers the schema during the extract & normalize steps, building the stored schema as it goes, right? My understanding is that once all chunks of the dataset have been normalized, the stored schema holds the superset of all columns; it has to in order to create the destination table. That's why I'm suggesting a means of using the stored schema to pad missing columns where they might exist, so that all data for a new dataset is loaded with a standardized schema. Beyond the initial run, intermittent columns don't seem to be an issue.

Am I missing anything about the schema inference and storage process? And, definitely useful, thanks again!


rudolfix commented on September 26, 2024

@hello-world-bfree I'm not sure I fully grasp your problem, but maybe this will help:

  1. You can execute the extract and normalize steps of the pipeline without loading the data, to infer the schema over several chunks.
  2. You can use the "export" schema feature to get the final schema.
  3. Then you can use this "export" schema as the "import" schema for a new pipeline instance that starts clean but with an already inferred schema (you could even lock the schema with schema contracts). See the sketch below.
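Roughly, a sketch of that flow (pipeline names, destination, and paths are placeholders, and my_source stands in for your own source):

```python
import dlt

# Steps 1 + 2: extract and normalize only, exporting the inferred schema to disk.
infer = dlt.pipeline(
    pipeline_name="schema_inference_only",
    destination="duckdb",
    dataset_name="staging",
    export_schema_path="schemas/export",
)
infer.extract(my_source())  # my_source() is a placeholder for your source/resource
infer.normalize()           # no load step, but the full schema is now inferred

# Step 3: a clean pipeline that imports the exported schema, so its first real
# run already knows the full set of columns.
load = dlt.pipeline(
    pipeline_name="real_load",
    destination="duckdb",
    dataset_name="prod",
    import_schema_path="schemas/export",
)
load.run(my_source())
```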

Another idea: if you discover new columns at runtime and you want to create them in the destination without data, you can use this: https://dlthub.com/docs/general-usage/resource#adjust-schema-when-you-yield-data
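A sketch of that pattern, following the linked docs (exact hint fields may depend on your dlt version, and fetch_items is a placeholder):

```python
import dlt

@dlt.resource(name="items")
def items():
    for item in fetch_items():  # placeholder for your extraction logic
        # Attach column hints to the yielded item, so newly discovered columns
        # are created in the destination even before any row carries their data.
        yield dlt.mark.with_hints(
            item,
            dlt.mark.make_hints(
                columns=[{"name": "api_custom_field", "data_type": "double"}]
            ),
        )
```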

