paddymul / buckaroo

Buckaroo - the data wrangling assistant for pandas. Quickly explore dataframes, and run pandas commands via a GUI. Works inside the jupyter notebook.

Home Page: https://buckaroo-data.readthedocs.io/en/latest/

License: BSD 3-Clause "New" or "Revised" License

JavaScript 1.57% Python 35.79% HTML 0.32% Shell 0.03% Jupyter Notebook 47.89% TypeScript 13.86% CSS 0.53%

buckaroo data-science jupyter pandas paddy

buckaroo's Issues

Error in column sizing via ag-grid sometimes preventing all columns from displaying

Currently the code in DFViewer.tsx depends on a dumb timeout. This should probably run after an onGridReady event, but that also didn't work on the initial build.

    useEffect(() => {
        const timer = setTimeout(() => {
            gridRef.current!.columnApi.autoSizeAllColumns();
        }, 150);
        return () => clearTimeout(timer);
    }, [gridRef]);

This code occasionally throws the error:
v.current is null

Text row labels appear blank

Thanks again - enjoying the exploration so far.

I have noticed that when I use text labels for the Pandas index, these appear blank - is this just me? For example:

import buckaroo
import pandas as pd

df = pd.DataFrame(
    dict(names=['one', 'two', 'three'],
         values=[1, 2, 3])).set_index('names')
df

For me at least, the row labels are blank when they are text, but the default numeric labels appear correctly.

Refactor table_hints

table_hints should have a type key, with multiple values...

Much easier to make typescript happy this way vs is_numeric, is_integer, is_boolean
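
A rough sketch of the proposed shape, written in Python for the backend side. The helper name hints_from_flags and the type strings ("boolean", "integer", "float", "string") are assumptions made for illustration, not buckaroo's actual schema.

```python
# Hypothetical sketch: collapse the boolean flags into a single "type" key.
# The type names below are illustrative, not buckaroo's real schema.

def hints_from_flags(flags):
    """Convert legacy is_* flags into a hint dict with one `type` key."""
    # Check boolean first: a boolean column may also be flagged as numeric.
    if flags.get("is_boolean"):
        col_type = "boolean"
    elif flags.get("is_integer"):
        col_type = "integer"
    elif flags.get("is_numeric"):
        col_type = "float"
    else:
        col_type = "string"
    # Carry over everything that is not one of the legacy flags.
    rest = {k: v for k, v in flags.items() if not k.startswith("is_")}
    return {"type": col_type, **rest}


legacy = {"is_numeric": True, "is_integer": True, "is_boolean": False, "histogram": []}
print(hints_from_flags(legacy))  # {'type': 'integer', 'histogram': []}
```

A single string key like this maps naturally onto a TypeScript discriminated union, which is probably why it is easier to type than three independent booleans.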

Add boolean formatter

Boolean values display in the table as 1 or 0.

Automatic version bump via CI

Figure out how to bump versions automatically in code, and what that means for the next commit. Run this in CI during the merge to main.

Store my PyPI credentials in GitHub secrets.

Incorporate ipyreact

ipyreact seems to be a better way of bridging between jupyter widgets and react.
https://github.com/widgetti/ipyreact

Move the frontend over to this model, probably by just porting the code in https://github.com/widgetti/ipyreact/blob/master/src/widget.tsx

I don't want the react 16/18 testing (this is built on react 18). I also don't want the dynamic build step that ipyreact gets from anywidget... good for dev, adds complication to reading the code for a primarily packaged repo.

Better ticking for histograms

Try to use bokeh or matplotlib ticking to improve bin edges for histograms (and their labels)
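
To illustrate what "better ticking" could mean, here is a pure-Python stand-in for the nice-number approach that plotting libraries use in some form. It is not bokeh's or matplotlib's code; the function names and the target_bins parameter are assumptions for this sketch.

```python
import math


def nice_num(x, round_to_nearest):
    """Return a 'nice' number (1, 2, 5 or 10 times a power of ten) near x."""
    exp = math.floor(math.log10(x))
    frac = x / 10 ** exp  # leading digits, between 1 and 10
    if round_to_nearest:
        nice = 1 if frac < 1.5 else 2 if frac < 3 else 5 if frac < 7 else 10
    else:  # round up
        nice = 1 if frac <= 1 else 2 if frac <= 2 else 5 if frac <= 5 else 10
    return nice * 10 ** exp


def nice_bin_edges(lo, hi, target_bins=10):
    """Bin edges on round numbers that cover [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    span = nice_num(hi - lo, round_to_nearest=False)
    step = nice_num(span / (target_bins - 1), round_to_nearest=True)
    start = math.floor(lo / step) * step
    stop = math.ceil(hi / step) * step
    count = round((stop - start) / step)
    return [start + i * step for i in range(count + 1)]


print(nice_bin_edges(3.2, 97.5))
# [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```

The edges land on multiples of 10 instead of on arbitrary fractions of the data range, so the labels stay short.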

Horrible discoverability

While at the Jupytercon conference, getting people to buckaroo docs or this github page was an embarrassing exercise in frustration.

Provide a simple URL for reaching the buckaroo project.

Make CI pass

Get a baseline check that guarantees the package builds.

Add read-the-docs to DCEF

Add version to dcef

figure out __version__ vs VERSION... whatever the cool kids are using these days. Make it part of the build process, tied to pyproject.toml

Show python errors in the UI

Some combinations of Operations can throw errors, for instance dropping the same column twice. This should show as an error. Showing the python error will also help users develop their own commands
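
A minimal sketch of how the backend could capture such an error and hand it to the frontend instead of raising. The function names and the shape of the error dict are hypothetical, and a plain dict of columns stands in for a DataFrame.

```python
import traceback


def run_operations(columns, operations):
    """Apply each (name, func) operation in order; stop at the first failure.

    Returns (result, error). error is None on success, otherwise a dict
    that a frontend could render next to the offending operation.
    """
    current = dict(columns)
    for index, (name, func) in enumerate(operations):
        try:
            current = func(current)
        except Exception as exc:
            error = {
                "operation_index": index,
                "operation": name,
                "type": type(exc).__name__,
                "message": str(exc),
                "traceback": traceback.format_exc(),
            }
            return current, error
    return current, None


def drop_column(col):
    def _drop(columns):
        remaining = dict(columns)
        del remaining[col]  # raises KeyError if the column is already gone
        return remaining
    return _drop


ops = [("drop a", drop_column("a")), ("drop a again", drop_column("a"))]
result, error = run_operations({"a": [1, 2], "b": [3, 4]}, ops)
print(error["operation_index"], error["type"], error["message"])  # 1 KeyError 'a'
```

Returning the last good result along with the error lets the UI keep showing a table while highlighting the operation that failed.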

Quiet=False sometimes fails

The following results in a stack trace (definitely not quiet)

w = BuckarooWidget(df[:500], showCommands=False)
class Variance(ColAnalysis):
    provides_summary = ["variance"]
    requires_summary = ["mean"]
    quiet = True
    
    @staticmethod
    def summary(sampled_ser, summary_ser, ser):
        mean = summary_ser.get('mean', False)
        arr = ser.to_numpy()
        if mean and pd.api.types.is_integer_dtype(ser):
            return dict(variance=np.mean((arr - mean)**2))
        elif mean and pd.api.types.is_float_dtype(ser):
            return dict(variance=np.mean((arr - mean)**2))
        return dict(variance="NA")
    
    summary_stats_display = [
        'dtype', 'length', 'nan_count', 'distinct_count', 'empty_count',
        'empty_per', 'unique_per', 'nan_per', 
        'is_numeric', 'is_integer', 'is_datetime',
        'mode', 'min', 'max', 'mean', 
        # we must add variance to the list of summary_stats_display, otherwise our new stat won't be displayed
        'variance']

w.add_analysis(Variance)

Remove ipydatagrid references

The DCEF widget was originally built on top of the ipydatagrid repo to make it easier to get something built. Move the necessary widget pieces from the ipydatagrid directory to the DCEF directory. Update the build config files too, and make sure the built package installs properly.

Mono repo including dcf-npm

Currently most of the front end is built as a separate repo, https://github.com/paddymul/dcf-widget-npm . Bring that repo into the main dcf repo.

We could have only a single js/ts build step... I would prefer to keep the frontend in a separate npm package. The dcf-widget-npm repo (containing only the react component - no ipywidget or jupyter widget stuff... poor name for the repo) has a much cleaner js build setup vs the main dcf repo. This fosters collaboration with frontend devs.

Add Red/Black conditional formatting

Add conditional formatting that shows negative numbers in red, positive in black
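
The rule itself is small. As a sketch (the function name is made up, and treating zero and non-numeric cells as black is an assumption the issue does not spell out):

```python
def sign_color(value):
    """Return a CSS color name: red for negative numbers, black otherwise."""
    # Non-numeric cells and NaN (which compares False with everything)
    # fall through to black. Booleans are excluded on purpose.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if is_number and value < 0:
        return "red"
    return "black"


row = [12.5, -3, 0, float("nan"), "n/a"]
print([sign_color(v) for v in row])
# ['black', 'red', 'black', 'black', 'black']
```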

Modernize the styling of the Lowcode UI

The Lowcode UI is very poorly styled.

The Delete button is on the bottom left, instead of embedded into each operation. The table is built with a raw HTML table.
There is also no use of the symbol metadata embedded into JLisp (JSON flavored Lisp, brackets [] instead of parens ()) operations. This metadata is currently only used to distinguish auto-cleaning operations from user added operations, but could do much more (temporarily disabled, dependencies, errors)

When I built the lowcode UI, I was thinking of a sequential timeline for events, similar to design history in many CAD tools. https://collaborate.canadabay.nsw.gov.au/3Dprintclub/using-timeline-function-fusion-360

https://www.youtube.com/watch?v=o5NsPOcXLho

The Lowcode UI was one of the first pieces I built in Buckaroo. It hasn't seen a lot of use or development since.

In general my plan with buckaroo is to try to gain adoption, and then fix problems as they arise. The lowcode UI is an advanced feature that is rarely discovered by users. For the most part buckaroo has a top-of-funnel problem: most jupyter/pandas/polars users don't know about buckaroo at all.

As Buckaroo sees more adoption I want to add to the low-code UI. It's a very powerful tool.

filters step 1

Write the interpreter side of filters.

Make a python side command called "make-filter" that exposes a named variable bound to a filter constraint.

Write some unit tested code that uses filters to shape output transformed data

Allow UI cycling through option sets

There are multiple places where different analyses or presentations could be configured. Having one prescription for all dataframes, and requiring BuckarooWidget to be reinstantiated/configured for each type is cumbersome.

This is particularly relevant for

Auto-cleaning (off, aggressive, types only...)
Summary stats (different sets of summary stats)
Sampling methodology (random, first 500, segments of 50 each)
histogram presentation styles
column ordering
sets of low code UI commands

Auto-cleaning is probably the best candidate.

The function should work such that the backend supplies a list of options and the current value; clicking on that value then cycles through the options, changing the bi-directional widget value of current.
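
The cycling behaviour described above can be sketched as follows. The state layout (an "options" list plus a "current" value) follows the description; the function name and the option labels are illustrative.

```python
def next_option(options, current):
    """Return the option after current, wrapping around at the end."""
    if not options:
        raise ValueError("options must not be empty")
    if current not in options:
        return options[0]  # recover from a stale or unknown value
    return options[(options.index(current) + 1) % len(options)]


# What the backend might expose for auto-cleaning (labels are examples).
state = {"options": ["off", "types only", "aggressive"], "current": "off"}

# Each click replaces `current` with the next option.
for _ in range(3):
    state["current"] = next_option(state["options"], state["current"])
    print(state["current"])
# types only
# aggressive
# off
```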

Move to ag-grid tight styling

Move to this type of styling
https://www.ag-grid.com/angular-data-grid/global-style-customisation-compactness/

I could not get their example code to work. I will distill it into a minimal reproducible test case and either fix it or file a bug with ag-grid.

Update naming to DCEF

DCF isn't available as a pip package name. DCEF, or Data Cleaning Exploration Framework, is available. All references should be updated to DCEF in code and documentation.

Add default rounding

There are many places where excess precision wastes screen real estate and makes comparison harder. Adding a default rounding for computed statistics (the result of a group by or of summary statistics) should definitely be permissible, since you aren't altering raw data. Even for raw data, having a sensible default seems reasonable to me.
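
One possible default, sketched here as an assumption rather than a decided design, is rounding to a fixed number of significant digits so that large and small statistics are treated alike. The function name and the default of 4 digits are placeholders.

```python
import math


def round_sig(value, sig=4):
    """Round value to sig significant digits; pass through 0, NaN and inf."""
    if value == 0 or not math.isfinite(value):
        return value
    # Position of the leading digit decides how many decimals to keep.
    digits = sig - 1 - math.floor(math.log10(abs(value)))
    return round(value, digits)


print(round_sig(123.456789))   # 123.5
print(round_sig(98765, sig=2))  # 99000
```

This only changes what is displayed for computed values; the underlying data would stay untouched.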

Add line plots to histograms

Add a line plot to histograms. It will show only-increasing IDs very clearly.

Possibly use resample sum/last for two different plot lines.
Also add a cumsum plot line.

Undo default capitalization of column names?

Thanks for this very nice-looking tool.

I noticed that the UI capitalizes my column names, at least by default. This is inconvenient, because I have to then separately check what the capitalization is for the columns, before using the names from the display. Is there any way to disable changing case of the column names by default?

Thanks again.

Consolidate dcf-widget-npm with main Buckaroo codebase

I had kept https://github.com/paddymul/dcf-widget-npm in a separate repo because I thought it was easier to have a clean frontend that way. I now see that they can be completely combined and that this will lead to a cleaner, easier-to-develop codebase.

Add live interactive notebook links

Follow what ipyreact does: widgetti/ipyreact#19

Improve histograms

A list of histogram research to follow up on. I'm open to suggestions for how to improve histograms

https://www.cedricscherer.com/2021/06/06/visualizing-distributions-with-raincloud-plots-and-how-to-create-them-with-ggplot2/

https://github.com/jorvlan/raincloudplots

https://www.hindawi.com/journals/amete/2019/1795673/

https://wis.kuleuven.be/stat/robust/papers/2008/adjboxplot-revision.pdf

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6480976/

https://wellcomeopenresearch.org/articles/4-63/v1

Add summary statistics

Summary stats

Add a base no-arg summary statistics function, that can operate on either the input DataFrame or transformed DataFrame. enable a toggle via the UI

Try to limit display to 5 rows

Column level features

for text columns

display number of discrete values vs total number of rows
If there are only a couple of discrete values show those values

If there are a limited number of outliers, show them

for integer and float columns

Show a histogram sparkline
list quartiles
median, mean, stddev
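
The column-level features above can be sketched with the standard library alone. Plain lists stand in for pandas Series here, and the function names and the max_listed threshold are assumptions for this example.

```python
import statistics


def summarize_text(values, max_listed=5):
    """Distinct count vs. total rows; list the values if there are only a few."""
    distinct = sorted(set(values))
    summary = {"distinct_count": len(distinct), "length": len(values)}
    if len(distinct) <= max_listed:
        summary["distinct_values"] = distinct
    return summary


def summarize_numeric(values):
    """Quartiles, median, mean and sample standard deviation."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "quartiles": (q1, q2, q3),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "stddev": statistics.stdev(values),
    }


print(summarize_text(["a", "b", "a", "a"]))
# {'distinct_count': 2, 'length': 4, 'distinct_values': ['a', 'b']}
print(summarize_numeric([1, 2, 3, 4, 5, 6, 7])["quartiles"])
# (2.0, 4.0, 6.0)
```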

Add sorting

Add a sorting config to the python widget for transformed and raw dfs. This is run after all other steps.

Separately make the frontend manipulate the sorting config via clicking on up/down/x icons in column headers

Widget styles not rendered properly in VS Code

Thanks for this great package, it's very promising. I played around with it in VS Code but found the widget is not rendered properly. It might be because it's assuming the widget is rendered in Jupyter Lab context and using some Jupyter Lab css rules, which don't exist in VS Code output webview

Support filters

Currently Commands are prototypes for operations that accept a column or colspec as an option.

Design a UI for row-wise options, and develop commands that work against them.

Examples could include "remove all rows where column A is above 20"
Split into two dataframes... based on column A above or below 20.
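
Both examples can be sketched as small row-wise helpers. Lists of dicts stand in for DataFrames here, and the helper names are hypothetical; the real commands would need matching code generation as well.

```python
def apply_filter(rows, column, predicate):
    """Keep only the rows whose column value satisfies predicate."""
    return [row for row in rows if predicate(row[column])]


def split_on(rows, column, predicate):
    """Split rows into (matching, not_matching) by a predicate on column."""
    matching, rest = [], []
    for row in rows:
        (matching if predicate(row[column]) else rest).append(row)
    return matching, rest


rows = [{"A": 5}, {"A": 20}, {"A": 35}]

# "Remove all rows where column A is above 20"
kept = apply_filter(rows, "A", lambda a: not a > 20)
print(kept)  # [{'A': 5}, {'A': 20}]

# "Split into two dataframes based on column A above or below 20"
above, at_or_below = split_on(rows, "A", lambda a: a > 20)
print(above, at_or_below)  # [{'A': 35}] [{'A': 5}, {'A': 20}]
```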

Add correlations view

Add an extra view, similar to summary statistics with the correlations, colored

Add colormap conditioning

Build a JS Formatter that colors background of each cell based on histogram bin/deviation from mean

Independently verify pip installation

@kolibril13 would you mind giving this a try?

I just renamed my project again and would like confirmation that it installs properly.
https://buckaroo-data.readthedocs.io/en/latest/
What do the docs and initial experience seem like to you?

Thanks

polars DataFrame shape Error

polars dataframe shape is displayed as shape: (1_000, 9)

Data Sketches for strings

not sure if this already exists. I would think that datasketches do well with different statistical distributions on continuous variables.

I'm not sure how they do with strings.

*** address
count occurrences of 'ln st dr ave street lane apt unit' and number strings and any other string. an address field will have a much higher count of those keys than any other field.
*** phone number
count the regex matches against formats such as

2028675309
(202)867-5309
202-867-5309

*** dates
maybe a bunch of date types
12 may 2004
12/4/2004

just count the regex matches

Then run all of these tokenizers over each string column. The address column should be very obvious
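
A small sketch of this idea. The regular expressions are deliberately loose and only cover the formats listed above; the names PATTERNS, ADDRESS_TOKENS and sketch_strings are made up for this example, and the sketch classifies each value once rather than counting every token.

```python
import re

PATTERNS = {
    # 2028675309, (202)867-5309, 202-867-5309
    "phone": re.compile(r"^\(?\d{3}\)?[-\s]?\d{3}-?\d{4}$"),
    # 12/4/2004 or 12 may 2004
    "date": re.compile(r"^(\d{1,2}/\d{1,2}/\d{4}|\d{1,2} [A-Za-z]{3,9} \d{4})$"),
}
ADDRESS_TOKENS = {"ln", "st", "dr", "ave", "street", "lane", "apt", "unit"}


def sketch_strings(values):
    """Count how many values in a string column look like each kind."""
    counts = {"phone": 0, "date": 0, "address": 0, "other": 0}
    for value in values:
        text = value.strip()
        words = set(re.findall(r"[a-z]+", text.lower()))
        if PATTERNS["phone"].match(text):
            counts["phone"] += 1
        elif PATTERNS["date"].match(text):
            counts["date"] += 1
        elif words & ADDRESS_TOKENS and re.search(r"\d", text):
            # An address keyword plus at least one digit.
            counts["address"] += 1
        else:
            counts["other"] += 1
    return counts


column = ["12 Main St Apt 4", "99 Oak Lane", "(202)867-5309", "hello world"]
print(sketch_strings(column))
# {'phone': 1, 'date': 0, 'address': 2, 'other': 1}
```

Running this over every string column and comparing the counts is the "address column should be very obvious" step.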

Add extra dataframes via pluggable_analysis_framework

A correlations dataframe is just one of many possible transformed dataframes that could be shown. Extend pluggable analysis framework to be able to emit, via configuration, multiple different dataframes.

Then set up the frontend to accept arbitrarily named dataframes, and display them via configuration. Summary_df is just a dataframe. Currently summary_df is hardcoded into the json response, but the system would be more usable and configurable if there were many possible dataframes that could be tied to different buttons in the UI.

Webinar today Thursday, October 19th. 1PM EST

I will be giving a webinar about Buckaroo this Thursday, October 19 1:00 – 2:00pm EST.

Learn about Buckaroo and how it can be customized to automate your own data analysis workflow. Register Here

This might get more attention as an issue (didn't know people were paying attention :) )

Better datetime formatting

Datetimes are displayed in an excessively verbose format. Ideally this could use ticking code similar to bokeh or matplotlib to determine the format that provides the proper amount of info.

Maybe even different formats
so for ordered data, show the full date on every row where the full date changes
for the subsequent rows, only show the time of day

Smarter Default sizing of dataframe via widget

Set up the python widget code to send a maximum of 500 rows of data to the JS widget by default. More data is likely to cause performance problems and isn't the right way to explore large datasets; summary stats are.

Warning or error message when buckaroo is run against notebook < 7.0

Buckaroo requires jupyterlab >= 3.6 (3.6 or 4.0)
or notebook >= 7.0

I think that users occasionally run into bugs running against the wrong environment.

The pyproject.toml should be double checked
and a special error/warning added to buckaroo/__init__.py

I'm not sure how to test if code is being executed in the notebook vs being executed in jupyterlab.

It should be ok to have an environment with jupyterlab 3.6, and notebook 6.5... as long as buckaroo is only tried in jupyterlab.
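
A sketch of what the version half of that check could look like. It only inspects installed package versions and does not answer the open question of which frontend is actually running. The minimums come from the text above; the function names are hypothetical.

```python
import re
import warnings
from importlib.metadata import PackageNotFoundError, version

# Minimum (major, minor) versions taken from the issue text.
MINIMUMS = {"jupyterlab": (3, 6), "notebook": (7, 0)}


def parse_major_minor(version_string):
    """Return (major, minor) as ints, ignoring patch and pre-release parts."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def installed_versions(names):
    """Map each installed package name to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            pass  # not installed, so it cannot be the frontend in use
    return found


def too_old(found, minimums=MINIMUMS):
    """Return {package: version} for packages below their minimum."""
    return {
        name: ver
        for name, ver in found.items()
        if parse_major_minor(ver) < minimums[name]
    }


def warn_if_unsupported():
    """Warn (not error), since an old notebook next to a new jupyterlab is fine."""
    old = too_old(installed_versions(MINIMUMS))
    for name, ver in old.items():
        warnings.warn(f"{name} {ver} is older than the version buckaroo supports")
    return old


print(too_old({"jupyterlab": "3.6.3", "notebook": "6.5.4"}))
# {'notebook': '6.5.4'}
```

A warning rather than an exception matches the mixed-environment case described above, where jupyterlab 3.6 and notebook 6.5 coexist.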

Buckaroo doesn't work properly in Google Colab

When used in Google Colab, Buckaroo pulls a very old version of the JS. Pushing new versions of the PyPI package doesn't seem to fix this. Google Colab is pulling some stale version of the NPM JS package; it is unclear which version, or how to control it.

googlecolab/colab-cdn-widget-manager#45

add_analysis reproduce code is duplicated

When adding a flawed ColAnalysis the reproduce code is printed twice

IPykernel warning on every display?

Hi - me again!

I am just testing the latest version - and I notice that:

import buckaroo
import pandas as pd

df = pd.DataFrame(
    dict(names=['one', 'two', 'three'],
         values=[1, 2, 3]))
df

gives:

/Users/mb312/Library/Python/3.9/lib/python/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)

Is this what you expected?

I have:

$ pip list | grep ipy
ipykernel                 5.3.4
ipython                   8.8.0
ipython-genutils          0.2.0
ipywidgets                8.0.2

on Python 3.9 (I know, I'll upgrade soon).

Normalize util

Another possible utility: throw a bunch of dataframes at the multiple-dataframe piece, and have it spit out groups of matching data sketch dataframes. This way separate cleanings can be developed for each archetype.

Once we figure out codegen tied to notebook cells, this could lay out the following

Cell 1: ujdf = unjumbleDataFrames([df1,df2, df3....])
ujdf
#Cell 1 generates the following
Cell 2: group1_widget = Buckaroo(ujdf, 1)
Cell 3: group2_widget = Buckaroo(ujdf, 2)
#... for how ever many groups there are
Cell N: Buckaroo.MultipleDF(group1_widget, group2_widget)

In this way MultipleDF can be live updated as the sketch of the whole df matches better, without re-executing cells. The core thing is that we aren't pointing at one output dataframe but at the widget, which will update. This might also be possible with some type of Clojure-like ref/atom concept.

cell 1: ref1 = Ref()
        Buckaroo(df, ref=ref1)

cell 2: Buckaroo(ref1)

Multiple DataFrame placeholder

Join makes the table wider

Color code table 1 vs table 2 columns. Specially color code the join column as the mixed color,

so table 1 red, table 2 blue, join column purple.

Add a special type of summary stats: how many rows in both tables, how many rows successfully joined.

The first join heuristic is just finding the same column name. Of course this should work on hyperloglog eventually.

For concat, check first the column names, maybe fake in a filename column, then take the data sketch of each column, and see what matches. if everything matches, bob is your auntie.

Otherwise hope that datasketches match.

Stacktrace on dataframes without a numeric column

import pandas as pd
import buckaroo
pd.DataFrame({'a':['foo', 'bar', 'baz']})

results in the following stacktrace

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/IPython/core/formatters.py:915, in IPythonDisplayFormatter.__call__(self, obj)
    913     pass
    914 else:
--> 915     printer(obj)
    916     return True
    917 # Finally look for special method names

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/buckaroo/widget_utils.py:6, in _display_as_buckaroo(df)
      4 def _display_as_buckaroo(df):
      5     from IPython.display import display
----> 6     return display(BuckarooWidget(df, showCommands=False))

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/buckaroo/buckaroo_widget.py:112, in BuckarooWidget.__init__(self, df, sampled, summaryStats, reorderdColumns, showCommands, autoType, postProcessingF)
    110 #we need dfConfig setup first before we get the proper working_df for auto_cleaning
    111 self.raw_df = df
--> 112 self.run_autoclean(autoType)
    114 warnings.filterwarnings('default')

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/buckaroo/buckaroo_widget.py:119, in BuckarooWidget.run_autoclean(self, autoType)
    116 def run_autoclean(self, autoType):
    117     if autoType:
    118         # this will trigger the setting of self.typed_df
--> 119         self.operations = get_auto_type_operations(
    120             self.raw_df, metadata_f=self.typing_metadata_f,
    121             recommend_f=self.typing_recommend_f)
    122     else:
    123         self.set_typed_df(self.get_working_df())

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:748, in TraitType.__set__(self, obj, value)
    746     raise TraitError('The "%s" trait is read-only.' % self.name)
    747 else:
--> 748     self.set(obj, value)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:3607, in List.set(self, obj, value)
   3605     return super().set(obj, [value])
   3606 else:
-> 3607     return super().set(obj, value)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:737, in TraitType.set(self, obj, value)
    733     silent = False
    734 if silent is not True:
    735     # we explicitly compare silent to True just in case the equality
    736     # comparison above returns something other than True/False
--> 737     obj._notify_trait(self.name, old_value, new_value)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:1532, in HasTraits._notify_trait(self, name, old_value, new_value)
   1531 def _notify_trait(self, name, old_value, new_value):
-> 1532     self.notify_change(
   1533         Bunch(
   1534             name=name,
   1535             old=old_value,
   1536             new=new_value,
   1537             owner=self,
   1538             type="change",
   1539         )
   1540     )

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/ipywidgets/widgets/widget.py:635, in Widget.notify_change(self, change)
    632     if name in self.keys and self._should_send_property(name, getattr(self, name)):
    633         # Send new state to front-end
    634         self.send_state(key=name)
--> 635 super().notify_change(change)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:1544, in HasTraits.notify_change(self, change)
   1542 def notify_change(self, change):
   1543     """Notify observers of a change event"""
-> 1544     return self._notify_observers(change)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:1591, in HasTraits._notify_observers(self, event)
   1588 elif isinstance(c, EventHandler) and c.name is not None:
   1589     c = getattr(self, c.name)
-> 1591 c(event)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/buckaroo/buckaroo_widget.py:153, in BuckarooWidget.handle_operations(self, change)
    151 new_ops = change['new']
    152 split_ops = split_operations(new_ops)
--> 153 self.machine_gen_operations = split_ops[0]
    155 user_gen_ops = split_ops[1]
    157 #if either the user_gen part or the machine_gen part changes,
    158 #we still have to recompute the generated code and
    159 #resulting_df because the input df will be different

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:748, in TraitType.__set__(self, obj, value)
    746     raise TraitError('The "%s" trait is read-only.' % self.name)
    747 else:
--> 748     self.set(obj, value)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:3607, in List.set(self, obj, value)
   3605     return super().set(obj, [value])
   3606 else:
-> 3607     return super().set(obj, value)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:737, in TraitType.set(self, obj, value)
    733     silent = False
    734 if silent is not True:
    735     # we explicitly compare silent to True just in case the equality
    736     # comparison above returns something other than True/False
--> 737     obj._notify_trait(self.name, old_value, new_value)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:1532, in HasTraits._notify_trait(self, name, old_value, new_value)
   1531 def _notify_trait(self, name, old_value, new_value):
-> 1532     self.notify_change(
   1533         Bunch(
   1534             name=name,
   1535             old=old_value,
   1536             new=new_value,
   1537             owner=self,
   1538             type="change",
   1539         )
   1540     )

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/ipywidgets/widgets/widget.py:635, in Widget.notify_change(self, change)
    632     if name in self.keys and self._should_send_property(name, getattr(self, name)):
    633         # Send new state to front-end
    634         self.send_state(key=name)
--> 635 super().notify_change(change)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:1544, in HasTraits.notify_change(self, change)
   1542 def notify_change(self, change):
   1543     """Notify observers of a change event"""
-> 1544     return self._notify_observers(change)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/traitlets/traitlets.py:1591, in HasTraits._notify_observers(self, event)
   1588 elif isinstance(c, EventHandler) and c.name is not None:
   1589     c = getattr(self, c.name)
-> 1591 c(event)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/buckaroo/buckaroo_widget.py:193, in BuckarooWidget.interpret_machine_gen_ops(self, change, force)
    191     return # nothing changed, do no computations
    192 new_ops = change['new']
--> 193 self.set_typed_df(self.interpret_ops(new_ops, self.get_working_df()))

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/buckaroo/buckaroo_widget.py:206, in BuckarooWidget.set_typed_df(self, new_df)
    204 # stats need to be rerun each time 
    205 self.stats = DfStats(self.typed_df, [TypingStats, DefaultSummaryStats, ColDisplayHints])
--> 206 self.summaryDf = df_to_obj(self.stats.presentation_sdf, self.stats.col_order)
    207 self.update_based_on_df_config(3)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/buckaroo/analysis_management.py:196, in DfStats.presentation_sdf(self)
    194 if self.ap.summary_stats_display == "all":
    195     return self.sdf
--> 196 return self.sdf.loc[self.ap.summary_stats_display]

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/pandas/core/indexing.py:1153, in _LocationIndexer.__getitem__(self, key)
   1150 axis = self.axis or 0
   1152 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1153 return self._getitem_axis(maybe_callable, axis=axis)

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/pandas/core/indexing.py:1382, in _LocIndexer._getitem_axis(self, key, axis)
   1379     if hasattr(key, "ndim") and key.ndim > 1:
   1380         raise ValueError("Cannot index with multidimensional key")
-> 1382     return self._getitem_iterable(key, axis=axis)
   1384 # nested tuple slicing
   1385 if is_nested_tuple(key, labels):

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/pandas/core/indexing.py:1322, in _LocIndexer._getitem_iterable(self, key, axis)
   1319 self._validate_key(key, axis)
   1321 # A collection of keys
-> 1322 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1323 return self.obj._reindex_with_indexers(
   1324     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1325 )

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/pandas/core/indexing.py:1520, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1517 ax = self.obj._get_axis(axis)
   1518 axis_name = self.obj._get_axis_name(axis)
-> 1520 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1522 return keyarr, indexer

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6114, in Index._get_indexer_strict(self, key, axis_name)
   6111 else:
   6112     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6114 self._raise_if_missing(keyarr, indexer, axis_name)
   6116 keyarr = self.take(indexer)
   6117 if isinstance(key, Index):
   6118     # GH 42790 - Preserve name from an Index

File ~/anaconda3/envs/buckaroo-install-jp4-3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6178, in Index._raise_if_missing(self, key, indexer, axis_name)
   6175     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6177 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6178 raise KeyError(f"{not_found} not in index")

KeyError: "['mean'] not in index"

This was because DfStats expected mean to be present in the summary dataframe, which it wasn't if none of the columns could compute mean. This has been fixed for the general case, with unit tests.
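
The general shape of such a guard, sketched with a plain dict standing in for the summary dataframe. This is not the actual fix in buckaroo; the function name is made up. With pandas, the analogous step would be intersecting the requested labels with the summary dataframe's index before calling .loc.

```python
def presentation_rows(summary, display_order):
    """Select summary rows in display order, skipping stats that are absent."""
    if display_order == "all":
        return dict(summary)
    return {name: summary[name] for name in display_order if name in summary}


# A string-only dataframe produces no "mean" row.
summary = {"dtype": {"a": "object"}, "length": {"a": 3}}
print(presentation_rows(summary, ["dtype", "length", "mean"]))
# {'dtype': {'a': 'object'}, 'length': {'a': 3}}
```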

Add unit tests

The codebase is largely untested because it was built in a rapid prototyped "can this even work" style since I was dealing with new technologies. As the shape of the system has become clearer, it's time to add unit tests to prevent regressions.

Here are the main areas that can be tested

Transform interpreter

Does it properly perform the transformations?
Can we add a Command and then use it?
What do errors look like?
Can we build in testing to the Command class?

Py_Code_Gen interpreter

Does the generated python code produce the same output as the corresponding transform python code when evaled?

Widget python side

Do observers get properly called... can we set a property/model attribute and verify that the proper observer was called and state exists as expected.

Widget ts code (widget side)

Can we instantiate the widget with known values and verify that no errors exist?
Do the sent python types match the typescript types expected?

widget react code

Can we instantiate and display each component without errors
Do the weird setter gymnastics work properly with a widget-ts analogue? (add explanation)

Finish Command/Operation cleanup

Add this text to the README when the code changes have been completed

What is the difference between a command and an operation?

A command is closer to a function, it is the definition of a transform step. An operation is an invocation of a command with concrete arguments.

The UI reads the list of Commands, along with their argspecs, and presents them so the user can build a list of operations.

Commands include a python function to actually perform the transformation, a python function to emit the equivalent python code, and argspecs describing the expected arguments for the Command.
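
The distinction can be sketched in a few lines. This is an illustration only: the Command dataclass, the run and generate helpers, and the "dropcol" example are hypothetical stand-ins, and a dict of columns is used in place of a DataFrame.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Command:
    """The definition of a transform step."""
    name: str
    argspec: tuple       # names of the expected arguments, in order
    transform: Callable  # performs the change on the data
    to_py: Callable      # emits equivalent Python source


def run(commands, operations, data):
    """Interpret operations: each one is [command_name, *concrete_args]."""
    for name, *args in operations:
        data = commands[name].transform(data, *args)
    return data


def generate(commands, operations):
    """Emit Python source that performs the same operations."""
    return "\n".join(commands[name].to_py(*args) for name, *args in operations)


drop_col = Command(
    name="dropcol",
    argspec=("col",),
    transform=lambda data, col: {k: v for k, v in data.items() if k != col},
    to_py=lambda col: f"data = {{k: v for k, v in data.items() if k != {col!r}}}",
)
commands = {drop_col.name: drop_col}

# An operation is an invocation of a command with concrete arguments.
operations = [["dropcol", "b"]]
data = {"a": [1], "b": [2]}

result = run(commands, operations, data)
code = generate(commands, operations)
print(result)  # {'a': [1]}
print(code)    # data = {k: v for k, v in data.items() if k != 'b'}
```

Because each Command carries both functions, the generated code can be executed and compared against the interpreter's result, which is one way to unit test that the two stay in sync.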

Previously, both invocations of commands and commands themselves were called Command (with extra explanatory endings). zainhoda helped with the disambiguation in a pairing session. Most of the changes have been made to the frontend react widget; they need to be carried through to the backend python code.

Add conda package

Package buckaroo via conda too.