Comments (10)
Note that the current algorithm does not check whether the assertions are redundant. As I recall, the assertions we experimented with were very redundant. We often had is_numeric called 3-7 times on many of the columns. We could have reduced this with a closer look at which assertions imply the others, but we ran out of time.
from orca_test.
Interesting! Thanks for digging into this.
I agree that we shouldn't be generating computed columns unless they're explicitly mentioned in the data spec. Can you tell what's triggering that? It doesn't look like merge_tables() is called directly by orca_test.
Here are some things that come to mind:
- If a table is a DataFrameWrapper, orca_test runs assert_table_can_be_generated() on it before testing any column characteristics. This is supposed to check that there are no errors in loading the core DataFrame (like missing data files, etc.). If I'm accidentally causing DataFrames to be generated multiple times, this could be unexpectedly costly.
https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L250-L271
- As you said, orca_test does have to generate computed columns in order to verify the data spec. This shouldn't be too costly in aggregate, so long as the columns listed in the spec would all be generated eventually anyway. But maybe there's a bug or sequencing issue causing them to be generated multiple times? Here's the code that triggers evaluation of computed columns.
https://github.com/UDST/orca_test/blob/master/orca_test/orca_test.py#L320-L354
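A minimal stand-in for that evaluation step (plain Python, no orca dependency; `column_funcs` and the column names are made up for illustration, not the real orca_test code):

```python
# Stand-in sketch: force each column listed in the spec to be generated
# once, recording which ones fail to evaluate. The evaluation itself is
# what carries the cost orca_test adds.
column_funcs = {
    'zone_id': lambda: list(range(3)),
    'bad_col': lambda: 1 / 0,       # simulates a column that errors out
}
spec_columns = ['zone_id', 'bad_col']

failures = {}
for name in spec_columns:
    try:
        column_funcs[name]()        # triggers (and pays for) generation
    except Exception as e:
        failures[name] = type(e).__name__
```

If the columns in the spec would all be generated later anyway, each evaluation here is work the model run would have done regardless; the overhead only appears when something causes a column to be computed more than once.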
I don't think the primary key/ foreign key tests do any merging, just set algebra.
If we can figure out what the source of the inefficiency is, I'm happy to take a crack at fixing it. Adding 30% to execution time is definitely not ideal.
from orca_test.
Sorry, not merge_tables(); I misspoke. I think it's this to_frame() call. Does that get called a lot? Some columns might not be cached, so they might get computed every time...
from orca_test.
Ah, got it.
That line gets called for columns that are indexes of the underlying DataFrame. It's an attempt to get around a limitation in orca where DataFrameWrapper.get_column() fails if you request an index by name. (Is that intended? I've been meaning to open an issue about it.)
Originally we were using pd.Series(t.index) to get the index of DataFrameWrapper t, but @Eh2406 pointed out that this doesn't work for MultiIndexes. So #13 replaced it with t.to_frame([]).reset_index()[column_name].
The intention is to pass an empty list of columns to to_frame() so that only the indexes are returned, but maybe that's not supported. Here's the relevant orca code:
https://github.com/UDST/orca/blob/master/orca/orca.py#L193-L218
If that's the cause of the slowdown, it would explain why I didn't notice it with Bay Area UrbanSim last August. Any ideas about a more efficient way to pull MultiIndexes out of a DataFrameWrapper?
from orca_test.
I dug into this a bit more.
DataFrameWrapper.to_frame([]) returns the whole DataFrame, which is not what we want.
DataFrameWrapper.to_frame(['']) returns just the index(es), but without their names.
DataFrameWrapper.index will return either an Index or a MultiIndex, including names. So we'll stick with that, and just need some logic to distinguish between the single- and multi-index cases.
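That single- vs. multi-index logic might be sketched in plain pandas like this (the `get_index_column` helper and the sample frame are hypothetical illustrations, not orca_test code):

```python
import pandas as pd

def get_index_column(df, column_name):
    # Pull a named index level out of a DataFrame, covering both the
    # plain Index and MultiIndex cases.
    idx = df.index
    if isinstance(idx, pd.MultiIndex):
        return pd.Series(idx.get_level_values(column_name), name=column_name)
    if idx.name == column_name:
        return pd.Series(idx, name=column_name)
    raise KeyError(column_name)

# Example with a MultiIndex whose first level is 'zone_id'
df = pd.DataFrame(
    {'v': [10, 20]},
    index=pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b')],
                                    names=['zone_id', 'use']))
zone_ids = get_index_column(df, 'zone_id')
```

This avoids materializing the whole frame via to_frame() just to read an index level.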
Working on a fix!
from orca_test.
Honestly, I try to avoid multi-indexes, so I have no brilliant ideas, but your explanation certainly makes sense. Hopefully that takes care of it!
from orca_test.
Thanks for finding this!
On a different note, we have performance issues because our spec looks like:
....
ColumnSpec('zone_id', registered=True, numeric=True, missing=False, min=1, max=2899)
...
It processes them in order.
- registered test
- numeric, but numeric implies registered, so:
  - registered test
  - numeric test
- missing test
- min, but min implies numeric, so:
  - registered test
  - numeric test
  - min test
- max, but max implies numeric, so:
  - registered test
  - numeric test
  - max test
A little redundant, yes? But I did not have time to remove the redundant tests in our spec, nor do I see an easy way for OT to remove them automatically.
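One way a spec parser could collapse that implication chain, sketched with an illustrative implication table (the `IMPLIES` mapping and assertion names are assumptions for the sketch, not the real orca_test internals):

```python
# Each assertion maps to the checks it already performs internally,
# so any requested assertion implied by another can be dropped.
IMPLIES = {
    'numeric': {'registered'},
    'min': {'numeric', 'registered'},
    'max': {'numeric', 'registered'},
    'missing': {'registered'},
}

def minimal_assertions(spec):
    """Drop any requested assertion that another one in the spec implies."""
    requested = set(spec)
    implied = set().union(*(IMPLIES.get(a, set()) for a in requested))
    return requested - implied

reduced = minimal_assertions(['registered', 'numeric', 'missing', 'min', 'max'])
```

For the zone_id spec above this would leave only the missing, min, and max checks, each of which still runs its prerequisite checks exactly once internally.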
from orca_test.
(Update: per comments in PR #17, the to_frame() call does not seem to be the source of the slowdown after all.)
You're right about the redundancy, @Eh2406. Could you do a test of execution time with orca_test vs. without orca_test, to get a sense of the performance hit?
I'd expect most of the tests to be practically costless, unless there's a bad cache setting and columns are being regenerated. But maybe I'm wrong -- and the redundancy could also be annoying if it makes the debug output harder to read, or just on a conceptual level.
Right now the hierarchy is specified within the assertion functions. For example, assert_table_can_be_generated() calls assert_table_is_registered() as its first action. But if the redundancy is a problem, we could move that logic up to the function that parses a spec, and call each relevant assertion only once.
from orca_test.
We have never run orca_test during our model runs. It lives in its own workbook. As I recall, that workbook takes approximately 15 minutes to run. To be clear, that spec was generated from some prior art by @semcogli and an automatic scan of a functional model's data. So it is long, redundant, and not optimized.
I see 3 good outcomes for dealing with the redundancy:
- orca_test detects slow checks and suggests targeted ways to improve cache settings, e.g. "consider improving the cache settings on nearest_schools".
- orca_test detects redundancy in the specifications and suggests targeted ways to remove it, e.g. "you specified registered=True and numeric=True; this is the same as, but slower than, just numeric=True".
- orca_test detects redundancy but just doesn't do the redundant checks. As you suggested, we could do that in the parsing code. Or we could use some kind of memoization cache, so that redundant calls are a no-op. I'm sure there are other options.
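The memoization idea might look something like this (a sketch; `run_once` and the assertion name are hypothetical stand-ins, not orca_test's API):

```python
import functools

def run_once(assertion):
    # Wrap an assertion so repeated calls with the same arguments
    # become no-ops instead of re-running the check.
    @functools.wraps(assertion)
    def wrapper(*args):
        if args in wrapper._seen:
            return
        wrapper._seen.add(args)
        assertion(*args)
    wrapper._seen = set()
    return wrapper

calls = []

@run_once
def assert_column_is_registered(table, column):
    calls.append((table, column))   # stand-in for the real check

assert_column_is_registered('zones', 'zone_id')
assert_column_is_registered('zones', 'zone_id')  # no-op on repeat
```

A cache like this would make the spec-level redundancy harmless without requiring any changes to how specs are written.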
from orca_test.
Makes sense. Looking at the orca documentation, the default is for computed data not to be cached at all, which does seem like it would lend itself to situations where orca_test slows down performance by re-generating things.
https://udst.github.io/orca/core.html#caching
As a first step, let's add a test that warns users if a computed object has cache=False.
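The idea behind such a warning could be sketched by counting evaluations of a computed column (plain Python, no orca API; `nearest_schools` is just the example column name from the comment above):

```python
import functools

def count_calls(func):
    # Track how many times a computed-column function actually runs.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def nearest_schools():              # stand-in for an un-cached computed column
    return [0.5, 1.2, 0.9]

for _ in range(3):                  # three assertions touching the column
    nearest_schools()

warning = None
if nearest_schools.calls > 1:
    warning = ("consider improving the cache settings on nearest_schools "
               f"(computed {nearest_schools.calls} times)")
```

With orca's default of no caching, every assertion that touches a computed column pays the generation cost again, which matches the ~30% slowdown observed.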
from orca_test.