Comments (10)
Note that the current algorithm does not check whether the assertions are redundant. As I recall, the assertions we experimented with were very redundant. We often had is_numeric called 3-7 times on many of the columns. We could have reduced this with a closer look at which assertions imply the others, but we ran out of time.
from orca_test.
Interesting! Thanks for digging into this.
I agree that we shouldn't be generating computed columns unless they're explicitly mentioned in the data spec. Can you tell what's triggering that? It doesn't look like merge_tables() is called directly by orca_test.
Here are some things that come to mind:
- If a table is a DataFrameWrapper, orca_test runs assert_table_can_be_generated() on it before testing any column characteristics. This is supposed to check that there are no errors in loading the core DataFrame (like missing data files, etc.). If I'm accidentally causing DataFrames to be generated multiple times, this could be unexpectedly costly.
https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L250-L271
- As you said, orca_test does have to generate computed columns in order to verify the data spec. This shouldn't be too costly in aggregate, so long as the columns listed in the spec would all be generated eventually anyway. But maybe there's a bug or sequencing issue causing them to be generated multiple times? Here's the code that triggers evaluation of computed columns.
https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L320-L354
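A minimal stand-in for that evaluation step (plain Python, no orca dependency; `column_funcs` and the column names are made up for illustration, not the real orca_test code):

```python
# Stand-in sketch: force each column listed in the spec to be generated
# once, recording which ones fail to evaluate. The evaluation itself is
# what carries the cost orca_test adds.
column_funcs = {
    'zone_id': lambda: list(range(3)),
    'bad_col': lambda: 1 / 0,       # simulates a column that errors out
}
spec_columns = ['zone_id', 'bad_col']

failures = {}
for name in spec_columns:
    try:
        column_funcs[name]()        # triggers (and pays for) generation
    except Exception as e:
        failures[name] = type(e).__name__
```

If the columns in the spec would all be generated later anyway, each evaluation here is work the model run would have done regardless; the overhead only appears when something causes a column to be computed more than once.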
I don't think the primary key/ foreign key tests do any merging, just set algebra.
If we can figure out what the source of the inefficiency is, I'm happy to take a crack at fixing it. Adding 30% to execution time is definitely not ideal.
from orca_test.
Sorry, not merge_tables(); I misspoke. I think it's this to_frame() call. Does that get called a lot? Some columns might not be cached, so they might get computed every time...
from orca_test.
Ah, got it.
That line gets called for columns that are indexes of the underlying DataFrame. It's an attempt to get around a limitation in orca where DataFrameWrapper.get_column() fails if you request an index by name. (Is that intended? I've been meaning to open an issue about it.)
Originally we were using pd.Series(t.index) to get the index of DataFrameWrapper t, but @Eh2406 pointed out that this doesn't work for MultiIndexes. So #13 replaced it with t.to_frame([]).reset_index()[column_name].
The intention is to pass an empty list of columns to to_frame() so that only the indexes are returned, but maybe that's not supported. Here's the relevant orca code:
https://github.com/UDST/orca/blob/master/orca/orca.py#L193-L218
If that's the cause of the slowdown, it would explain why I didn't notice it with Bay Area UrbanSim last August. Any ideas about a more efficient way to pull MultiIndexes out of a DataFrameWrapper?
from orca_test.
I dug into this a bit more.
DataFrameWrapper.to_frame([]) returns the whole DataFrame, which is not what we want.
DataFrameWrapper.to_frame(['']) returns just the index(es), but without their names.
DataFrameWrapper.index will return either an Index or a MultiIndex, including names. So we'll stick with that, and just need some logic to distinguish between the single- and multi-index cases.
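That single- vs. multi-index logic might be sketched in plain pandas like this (the `get_index_column` helper and the sample frame are hypothetical illustrations, not orca_test code):

```python
import pandas as pd

def get_index_column(df, column_name):
    # Pull a named index level out of a DataFrame, covering both the
    # plain Index and MultiIndex cases.
    idx = df.index
    if isinstance(idx, pd.MultiIndex):
        return pd.Series(idx.get_level_values(column_name), name=column_name)
    if idx.name == column_name:
        return pd.Series(idx, name=column_name)
    raise KeyError(column_name)

# Example with a MultiIndex whose first level is 'zone_id'
df = pd.DataFrame(
    {'v': [10, 20]},
    index=pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b')],
                                    names=['zone_id', 'use']))
zone_ids = get_index_column(df, 'zone_id')
```

This avoids materializing the whole frame via to_frame() just to read an index level.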
Working on a fix!
from orca_test.
Honestly, I try to avoid multi-indexes, so I have no brilliant ideas, but your explanation certainly makes sense. Hopefully that takes care of it!
from orca_test.
Thanks for finding this!
On a different note, we have performance issues because our spec looks like:
....
ColumnSpec('zone_id', registered=True, numeric=True, missing=False, min=1, max=2899)
...
It processes them in order.
- registered test
- numeric, but numeric implies registered, so:
  - registered test
  - numeric test
- missing test
- min, but min implies numeric, so:
  - registered test
  - numeric test
  - min test
- max, but max implies numeric, so:
  - registered test
  - numeric test
  - max test
A little redundant, yes? But I did not have time to remove the redundant tests in our spec, nor do I see an easy way for OT to remove them automatically.
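One way a spec parser could collapse that implication chain, sketched with an illustrative implication table (the `IMPLIES` mapping and assertion names are assumptions for the sketch, not the real orca_test internals):

```python
# Each assertion maps to the checks it already performs internally,
# so any requested assertion implied by another can be dropped.
IMPLIES = {
    'numeric': {'registered'},
    'min': {'numeric', 'registered'},
    'max': {'numeric', 'registered'},
    'missing': {'registered'},
}

def minimal_assertions(spec):
    """Drop any requested assertion that another one in the spec implies."""
    requested = set(spec)
    implied = set().union(*(IMPLIES.get(a, set()) for a in requested))
    return requested - implied

reduced = minimal_assertions(['registered', 'numeric', 'missing', 'min', 'max'])
```

For the zone_id spec above this would leave only the missing, min, and max checks, each of which still runs its prerequisite checks exactly once internally.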
from orca_test.
(Update: per comments in PR #17, the to_frame() call does not seem to be the source of the slowdown after all.)
You're right about the redundancy, @Eh2406. Could you do a test of execution time with orca_test vs. without orca_test, to get a sense of the performance hit?
I'd expect most of the tests to be practically costless, unless there's a bad cache setting and columns are being regenerated. But maybe I'm wrong -- and the redundancy could also be annoying if it makes the debug output harder to read, or just on a conceptual level.
Right now the hierarchy is specified within the assertion functions. For example, assert_table_can_be_generated() calls assert_table_is_registered() as its first action. But if the redundancy is a problem, we could move that logic up to the function that parses a spec, and call each relevant assertion only once.
from orca_test.
We have never run orca_test during our model runs. It lives in its own workbook. As I recall, that workbook takes approximately 15 minutes to run. To be clear, that spec was generated from some prior art by @semcogli and an automatic scan of a functional model's data. So it is long, redundant, and not optimized.
I see 3 good outcomes for dealing with the redundancy:
- orca_test detects slow checks and suggests targeted ways to improve cache settings, e.g. "consider improving the cache settings on nearest_schools".
- orca_test detects redundancy in the specifications and suggests targeted ways to remove it, e.g. "you specified registered=True and numeric=True; this is the same as, but slower than, just numeric=True".
- orca_test detects redundancy but just doesn't do the redundant checks. As you suggested, we could do that in the parsing code. Or we could use some kind of memoization cache, so that redundant calls are a no-op. I'm sure there are other options.
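The memoization idea might look something like this (a sketch; `run_once` and the assertion name are hypothetical stand-ins, not orca_test's API):

```python
import functools

def run_once(assertion):
    # Wrap an assertion so repeated calls with the same arguments
    # become no-ops instead of re-running the check.
    @functools.wraps(assertion)
    def wrapper(*args):
        if args in wrapper._seen:
            return
        wrapper._seen.add(args)
        assertion(*args)
    wrapper._seen = set()
    return wrapper

calls = []

@run_once
def assert_column_is_registered(table, column):
    calls.append((table, column))   # stand-in for the real check

assert_column_is_registered('zones', 'zone_id')
assert_column_is_registered('zones', 'zone_id')  # no-op on repeat
```

A cache like this would make the spec-level redundancy harmless without requiring any changes to how specs are written.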
from orca_test.
Makes sense. Looking at the orca documentation, the default is for computed data not to be cached at all, which does seem like it would lend itself to situations where orca_test slows down performance by re-generating things.
https://udst.github.io/orca/core.html#caching
As a first step, let's add a test that warns users if a computed object has cache=False.
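The idea behind such a warning could be sketched by counting evaluations of a computed column (plain Python, no orca API; `nearest_schools` is just the example column name from the comment above):

```python
import functools

def count_calls(func):
    # Track how many times a computed-column function actually runs.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def nearest_schools():              # stand-in for an un-cached computed column
    return [0.5, 1.2, 0.9]

for _ in range(3):                  # three assertions touching the column
    nearest_schools()

warning = None
if nearest_schools.calls > 1:
    warning = ("consider improving the cache settings on nearest_schools "
               f"(computed {nearest_schools.calls} times)")
```

With orca's default of no caching, every assertion that touches a computed column pays the generation cost again, which matches the ~30% slowdown observed.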
from orca_test.