Coder Social home page Coder Social logo

Comments (10)

Eh2406 avatar Eh2406 commented on July 1, 2024

Note that the current algorithm does not check whether the assertions are redundant. As I recall, the assertions we experimented with were very redundant. We often had is_numeric called 3-7 times on many of the columns. We could have reduced this by better research on which assertions implied the others, but we ran out of time.

from orca_test.

smmaurer avatar smmaurer commented on July 1, 2024

Interesting! Thanks for digging into this.

I agree that we shouldn't be generating computed columns unless they're explicitly mentioned in the data spec. Can you tell what's triggering that? It doesn't look like merge_tables() is called directly by orca_test.

Here are some things that come to mind:

  1. If a table is a DataFrameWrapper, orca_test runs assert_table_can_be_generated() on it before testing any column characteristics. This is supposed to check that there are no errors in loading the core DataFrame (like missing data files, etc). If I'm accidentally causing DataFrames to be generated multiple times, this could be unexpectedly costly.

https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L250-L271

  1. As you said, orca_test does have to generate computed columns in order to verify the data spec. This shouldn't be too costly in aggregate, so long as the columns listed in the spec would all be generated eventually anyway. But maybe there's a bug or sequencing issue causing them to be generated multiple times? Here's the code that triggers evaluation of computed columns.

https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L320-L354

I don't think the primary key/ foreign key tests do any merging, just set algebra.

If we can figure out what the source of the inefficiency is, I'm happy to take a crack at fixing it. Adding 30% to execution time is definitely not ideal.

from orca_test.

fscottfoti avatar fscottfoti commented on July 1, 2024

Sorry not merge tables, I misspoke - I think it's this to_frame call. Does that get called a lot? Some columns might not be cached so they might get computed every time...

from orca_test.

smmaurer avatar smmaurer commented on July 1, 2024

Ah, got it.

That line gets called for columns that are indexes of the underlying DataFrame. It's an attempt to get around a limitation in orca where DataFrameWrapper.get_column() fails if you request an index by name. (Is that intended? I've been meaning to open an issue about it.)

Originally we were using pd.Series(t.index) to get the index of DataFrameWrapper t, but @Eh2406 pointed out that this doesn't work for MultiIndexes. So #13 replaced it with

t.to_frame([]).reset_index()[column_name]

The intention is to pass an empty list of columns to to_frame() so that only the indexes are returned, but maybe that's not supported. Here's the relevant orca code:

https://github.com/UDST/orca/blob/master/orca/orca.py#L193-L218

If that's the cause of the slowdown, it would explain why I didn't notice it with Bay Area UrbanSim last August. Any ideas about a more efficient way to pull MultiIndexes out of a DataFrameWrapper?

from orca_test.

smmaurer avatar smmaurer commented on July 1, 2024

I dug into this a bit more.

DataFrameWrapper.to_frame([]) returns the whole DataFrame, which is not what we want.

DataFrameWrapper.to_frame(['']) returns just the index(es), but without their names.

DataFrameWrapper.index will return either an Index or a MultiIndex, including names. So we'll stick with that, and just need some logic to distinguish between the single- and multi-index cases.

Working on a fix!

from orca_test.

fscottfoti avatar fscottfoti commented on July 1, 2024

Honestly I try to avoid multi-indexes so no brilliant ideas, but your explanation certainly makes sense. Hopefully that takes care of it!

from orca_test.

Eh2406 avatar Eh2406 commented on July 1, 2024

Thanks for finding this!

On a different note we have performance issues because our spec looks like:

....
ColumnSpec('zone_id', registered =  True , numeric =True , missing = False, min = 1, max = 2899)
...

It processes them in order.

  1. registered test
  2. numeric, but numeric implies registered so:
    1. registered test
    2. numeric test
  3. missing test
  4. min, but min implies numeric so:
    1. registered test
    2. numeric test
    3. min test
  5. max, but max implies numeric so:
    1. registered test
    2. numeric test
    3. max test

A little redundant yes? But I did not have time to remove the redundant tests in our spec, nore do I see an easy way for OT to remove them automatically.

from orca_test.

smmaurer avatar smmaurer commented on July 1, 2024

(Update: per comments in PR #17, the to_frame() call does not seem to be the source of the slowdown after all.)

You're right about the redundancy, @Eh2406. Could you do a test of execution time with orca_test vs. without orca_test, to get a sense of the performance hit?

I'd expect most of the tests to be practically costless, unless there's a bad cache setting and columns are being regenerated. But maybe I'm wrong -- and the redundancy could also be annoying if it makes the debug output harder to read, or just on a conceptual level.

Right now the hierarchy is specified within the assertion functions. For example, assert_table_can_be_generated() calls assert_table_is_registered() as its first action. But if the redundancy is a problem, we could move that logic up to the function that parses a spec, and call each relevant assertion only once.

from orca_test.

Eh2406 avatar Eh2406 commented on July 1, 2024

We have never run orca_test during our models runs. It lives in its own workbook. As I recall that workbook takes approximately 15 minutes to run. To be clear that spec was generated from some prior art by @semcogli, and some automatic scan of a functional models data. So it is long, redundant and not optimized.

I see 3 good outcomes for dealing with the redundancy:

  1. orca_test detects slow checks and suggests targeted ways to improve cache settings. I.E consider improving the cache settings on nearest_schools
  2. orca_test detects redundancy in the specifications and suggests targeted ways to remove that redundancy. I.E you specified registered = True and numeric =True This is the same but slower than just numeric =True
  3. orca_test detects redundancy but just doesn't do the redundant checks. As you suggested we could do that in the parsing code. Or we could use some kind of memoization cash, so that redundant calls are a no-op. I'm sure there are other options.

from orca_test.

smmaurer avatar smmaurer commented on July 1, 2024

Makes sense. Looking at the orca documentation, the default is for computed data not to be cached at all, which does seem like it would lend itself to situations where orca_test slows down performance by re-generating things.

https://udst.github.io/orca/core.html#caching

As a first step, let's add a test that warns users if a computed object has cache=False.

from orca_test.

Related Issues (18)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.