Comments (29)

tomsej avatar tomsej commented on August 21, 2024 1

Honestly, I did not perform any performance tests, so thanks a lot for this. I think we can try what @jwills is proposing. Or we could dump it directly with something like `COPY (SELECT ...)`, so there would not be any object in the middle. Tomorrow I will take a look.

from dbt-duckdb.

matsonj avatar matsonj commented on August 21, 2024 1

I experimented with a few patterns, and the best performance came from moving the copy operation to the very end. But you can handle larger datasets if you use intermediate materializations and replace the models with views referencing parquet files, since DuckDB can dump those from memory. It is slower, however.

ngould avatar ngould commented on August 21, 2024 1

I've noticed that dbt-duckdb doesn't log out the underlying DuckDB error verbatim. I've found myself jumping into an interactive DuckDB session to run my dbt queries and get the actual error message.

That may help you troubleshoot. But...while I'm here, curious if anyone's looked into propagating the actual DuckDB error messages? Would be very handy, I think...

jwills avatar jwills commented on August 21, 2024 1

@matsonj you should be able to set it in the profile, e.g.

type: duckdb
settings:
  memory_limit: "1G"

but yeah I hear you that we aren't out of the woods yet.

matsonj avatar matsonj commented on August 21, 2024 1

@jwills confirming the config works in the dbt profile (I just set it to smaller and smaller numbers until the build failed).

tomsej avatar tomsej commented on August 21, 2024 1

Ok, I also tried installing duckdb==0.5.2.dev2286 and memory usage is way better:

image

But I still do not understand why a 16 MB Parquet file needs this amount of RAM :/

matsonj avatar matsonj commented on August 21, 2024 1

@jwills I think I'm fine with closing this, given there are 4 workarounds:

  1. Set the max memory with PRAGMA memory_limit.
  2. Use a bigger VM.
  3. Materialize the problematic models as tables and export them with a post-hook.
  4. Don't use external tables and instead use the DuckDB database directly.
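For the post-hook approach, a dbt_project.yml sketch might look like this (the project, model, and path names are illustrative, and the exact COPY options depend on your DuckDB version):

```yaml
models:
  my_project:
    my_model:
      +materialized: table
      +post-hook: "COPY my_model TO '/tmp/my_model.parquet' (FORMAT PARQUET)"
```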

matsonj avatar matsonj commented on August 21, 2024

Repro steps:

Open the small devcontainer on https://github.com/matsonj/nba-monte-carlo in the `dbt-upgrade` branch. Run `make build`, then `make pipeline`, and you will get the error. On `main`, you will get no such error with an identical set of models and number of simulations, but using COPY instead of the external materialization.

jwills avatar jwills commented on August 21, 2024

I think the issue is that we are using a table instead of a view as our temp relation type here:

https://github.com/jwills/dbt-duckdb/blob/master/dbt/include/duckdb/macros/materializations/external.sql#L16

/cc @tomsej I think that will fix the memory usage issue @matsonj is seeing, but I don't know what negative side-effects that could have elsewhere

jwills avatar jwills commented on August 21, 2024

Ah, of course-- you need a table for an external python model to work-- or do you? Maybe it would work with a view as well? Not sure, one way to find out

matsonj avatar matsonj commented on August 21, 2024

So, just spitballing here, but after the table is written out, it should be replaced by a view that is `select * from '{file}.parquet'`. If you leave it as a view (as suggested above), it will instead reference your other models, right? We need to make sure we "inject" the parquet file back into the database.

That way you only load it into memory when it is called later in your DAG. I don't think it needs to exist both in the `:memory:` database and on disk.
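In raw DuckDB terms, the swap described above might look like this (model name and path are illustrative):

```sql
-- write the model out, then swap the in-memory table for a view over the file
COPY my_model TO '/tmp/my_model.parquet' (FORMAT PARQUET);
DROP TABLE my_model;
CREATE VIEW my_model AS SELECT * FROM '/tmp/my_model.parquet';
```

Downstream models that ref the view would then read from the parquet file instead of keeping the full table resident in `:memory:`.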

tomsej avatar tomsej commented on August 21, 2024

@matsonj if you want to try it, you can override the default external macro and replace the table materialization. Sorry, I am out for the weekend without a PC.

tomsej avatar tomsej commented on August 21, 2024

I think this is related to databricks/dbt-databricks#131. I have been playing with dbt threads recently; they are using `multiprocessing.dummy` for threading, which has some issues. For Lambda I had to rewrite this. How many threads did you set in your profiles.yml?

jwills avatar jwills commented on August 21, 2024

Okay, so a couple of things: I think this is the issue b/c afaict the memory problems (which I have successfully repro'd locally) are occurring when we are reading back the external table, not when we are writing it: duckdb/duckdb#3969

The jemalloc fixes Mark describes in there are dropping in the next DuckDB release which should be early next week, so we should retry this then. For now I don't think I have enough evidence to justify changing the existing external implementation (i.e., my view optimizations have a very minimal impact on the actual runtime/memory usage of the NBA sims project.)

Semi-related: does it seem to anyone else that the COPY operations are blocking somehow? Like it looks to me that multiple threads can run CREATE TABLE statements that don't block each other, but only one COPY operation is allowed to run at once? Going to come up with a simple test of that to see if I'm going crazy or not.

Finally, @ngould, fixing up how we report the errors we get from DuckDB has been on my todo list for a while, ever since they revamped the error class hierarchy in DuckDB 0.5.x; I'll file a separate issue to get that addressed though.

jwills avatar jwills commented on August 21, 2024

Ack, no, wait-- I'm wrong. The jemalloc stuff is only impacting Linux, not OS X. 🤔

matsonj avatar matsonj commented on August 21, 2024

@jwills I just read through that thread, I think that issue is at the very least related if not the same. Dumb question - is there a way to set the memory limit in dbt-duckdb? Perhaps in the dbt profile? My hunch is that setting the max RAM limit may actually avoid this error and instead cause duckdb to be slightly slower when using external tables.

matsonj avatar matsonj commented on August 21, 2024

Just as a further note, it doesn’t seem like I can set the memory limit in the config, but I was able to add it in the dbt_project.yml file:

on-run-start: "PRAGMA memory_limit='512MB'"

Decreasing the memory limit does indeed slow the dbt build down, so I can see that it is effective, but we are still running OOM. It is failing a bit more elegantly though:

Out of Memory Error: could not allocate block of 134217736 bytes (404514157/512000000 used)
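For reference, the numbers in that message decode cleanly (pure arithmetic on the figures above):

```python
# figures copied from the error message above
block = 134217736           # bytes requested
used, limit = 404514157, 512000000

print(block - 128 * 1024**2)   # -> 8: a standard 128 MiB block plus 8 bytes
print(round(used / limit, 2))  # -> 0.79: the database was already ~79% full
```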

matsonj avatar matsonj commented on August 21, 2024

Just one more note here:

  1. 1GB memory limit when using external tables - get the error mentioned in first post.
  2. 1GB memory limit when using COPY in on-run-end - runs without issue.

jwills avatar jwills commented on August 21, 2024

@matsonj in my testing, it seems like things really go wrong when models/conformed/inputs/random_num_gen.sql is an external materialization vs. any of the other external materializations; is that what you're seeing as well? Like, if that model is changed from external to table, then I can get through 100k sims fine on my machine, but if it's external then I'm going to OOM.

matsonj avatar matsonj commented on August 21, 2024

Good find. Bizarre that it is an issue at all, but it's most important that it is held as a table so that the simulation returns the same results. It's more or less useless for analysis though; unless you are checking it for "true randomness", it contains basically no other useful information.

To be clear this isn't blocking any work - just odd that it is an issue at all.

For me it was failing at initialize_seeding, which might be doing a full load of random_num_gen multiple times. I'll mess around a bit and see if I can get clearer on why this is occurring.

tomsej avatar tomsej commented on August 21, 2024

Hey @matsonj, I have tried to run your project on the master branch under cProfile to see where the problem might be:

  • I opened the devcontainer
  • I ran `make build` and `make pipeline`.
    But I am getting the following error: FileExistsError: [Errno 17] File exists: 'dbt_packages/dbt-utils-0.9.2/integration_tests/tests/sql'. I also tried deleting the dbt_packages folder and rerunning. Nothing helps. Do you have any idea what I am doing wrong?

matsonj avatar matsonj commented on August 21, 2024

It looks like that might be an issue with dbt deps.

Try the following commands:

  1. meltano invoke dbt-duckdb clean
  2. meltano invoke dbt-duckdb deps

I also DM’d you in the dbt Slack if it's easier.

tomsej avatar tomsej commented on August 21, 2024

All right, I have managed to get some information. I am running this in Codespaces with 2GB of memory. As @jwills wrote, the problem is with the random_num_gen model. If it is materialized as external, the memory consumption looks like this:
image
When the memory consumption exceeds 1750MB, it fails.

With table materialization everything looks just fine:
image

I tried to load the parquet file from /tmp/data_catalog/conformed/random_num_gen.parquet into DuckDB with the following code:

import duckdb

conn = duckdb.connect(":memory:")
# fetchall() materializes every row as Python objects on top of DuckDB's own buffers
res = conn.execute("select * from '/tmp/data_catalog/conformed/random_num_gen.parquet'").fetchall()

And memory consumption looked like this:

image

As you can see, just loading random_num_gen.parquet takes 1.2GB of memory. And if you load this parquet multiple times (the model is a parent table for some other models), memory skyrockets -> OOM killer.

I was running it in Codespaces, it is a Linux Docker container, so the jemalloc fix may help here. We can give it a shot with the new release.

Also, it seems that external materialization should be used only on the leaves of the DAG (at least for now). If a model is referenced from other models, table materialization works better.

matsonj avatar matsonj commented on August 21, 2024

Awesome, thanks @tomsej. Only using it on leaf nodes makes sense.

I can also export it out at the end with COPY, so this is now more of an implementation detail.

I'll do some more testing once v0.6.0 ships and see how many scenarios I can jam through on a codespace VM.

tomsej avatar tomsej commented on August 21, 2024

Ha, version 0.6 was released today :)

jwills avatar jwills commented on August 21, 2024

@matsonj any updates on this, esp. with 0.6.0? Is it still an issue?

matsonj avatar matsonj commented on August 21, 2024

I will take a look and close this out.

matsonj avatar matsonj commented on August 21, 2024

Bad news: running in a small GitHub container, this is unfortunately still happening.

/usr/local/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '


error invoking dbt None        error_message=dbt invocation failed returncode=-9

jwills avatar jwills commented on August 21, 2024

@matsonj gotcha-- thanks man!
