
Clumper

A small python library that can clump lists of nested data together.

Part of a video series on calmcode.io.

Base Example

Clumper allows you to quickly parse through a list of json-like data.

Here's an example of such a dataset.

pokemon = [
    {'name': 'Bulbasaur', 'type': ['Grass', 'Poison'], 'hp': 45, 'attack': 49},
    {'name': 'Charmander', 'type': ['Fire'], 'hp': 39, 'attack': 52},
    ...
]

Given this list of dictionaries, we can write the following query:

from clumper import Clumper

clump = Clumper.read_json('https://calmcode.io/datasets/pokemon.json')

(clump
  .keep(lambda d: len(d['type']) == 1)
  .mutate(type=lambda d: d['type'][0],
          ratio=lambda d: d['attack']/d['hp'])
  .select('name', 'type', 'ratio')
  .sort(lambda d: d['ratio'], reverse=True)
  .head(5)
  .collect())
What this code does, line by line:
  1. It imports Clumper.
  2. It fetches a list of json-blobs about pokemon from the internet.
  3. It removes all the pokemon that have more than one type.
  4. The dictionaries that remain have their type as a string instead of a list of strings.
  5. The dictionaries that remain also get a ratio key: attack divided by hp.
  6. All keys besides name, type and ratio are removed.
  7. The collection is sorted by ratio, from high to low.
  8. We grab the top 5 after sorting.
  9. The results are returned as a list of dictionaries.

This is what we get back:

[{'name': 'Diglett', 'type': 'Ground', 'ratio': 5.5},
 {'name': 'DeoxysAttack Forme', 'type': 'Psychic', 'ratio': 3.6},
 {'name': 'Krabby', 'type': 'Water', 'ratio': 3.5},
 {'name': 'DeoxysNormal Forme', 'type': 'Psychic', 'ratio': 3.0},
 {'name': 'BanetteMega Banette', 'type': 'Ghost', 'ratio': 2.578125}]

Documentation

We've got a lovely documentation page that explains how the library works.

Features

  • This library has no dependencies besides a modern version of Python.
  • The library offers a pattern of verbs that are very expressive.
  • You can write code from top to bottom, left to right.
  • You can read in many json/yaml/csv files by using a wildcard *.
  • MIT License

Installation

You can install this package via pip.

pip install clumper

It may be safer, however, to install via:

python -m pip install clumper

For details on why, check out this resource.

There are some optional dependencies that you might want to install as well.

python -m pip install clumper[yaml]

Contributing

Before making a pull request, check the issue list to prevent double work. To get started locally, you can clone the repo and quickly get started using the Makefile.

git clone git@github.com:koaning/clumper.git
cd clumper
make install-dev


clumper's Issues

Verbs to deal with nesting.

It might be nice if a nested structure could easily "unnest" itself.

from clumper import Clumper

data = [{'a': 1, 'items': [1, 2]}]
new_data = Clumper(data).explode(item="items").collect()

new_data == [{'a': 1, 'item': 1}, {'a': 1, 'item': 2}]

YAML files

I have a use-case to deal with YAML files. For now, reading them in is the main focus, but we also want to be able to write them.

In this case we'd be best off introducing a dependency: pyyaml. But I prefer to keep this dependency optional. That way, folks who don't use yaml can choose not to install it.

Explode can remove data by accident.

It seems that explode removes data at times.

from clumper import Clumper 

data = [
    {"name": "john", "series": []},
    {"name": "jane", "series": [1, 2]},
    {"name": "jack", "series": [1, 2, 3]},
]

Clumper(data).explode("series").collect()

This yields:

[{'name': 'jane', 'series': 1},
 {'name': 'jane', 'series': 2},
 {'name': 'jack', 'series': 1},
 {'name': 'jack', 'series': 2},
 {'name': 'jack', 'series': 3}]

Note that john is gone.

I'm not 100% sure this is behavior that I like. You can easily prevent it with a mutate beforehand, though (see the sketch below).

Need to think about this one.
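
For reference, a sketch of that mutate workaround (padding empty lists with a None sentinel is just one possible choice, not settled API):

from clumper import Clumper

data = [
    {"name": "john", "series": []},
    {"name": "jane", "series": [1, 2]},
]

# Pad empty lists with a sentinel so that explode keeps the row.
(Clumper(data)
  .mutate(series=lambda d: d["series"] if d["series"] else [None])
  .explode("series")
  .collect())
# [{'name': 'john', 'series': None},
#  {'name': 'jane', 'series': 1},
#  {'name': 'jane', 'series': 2}]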

New Verb: Rename

It'd be nice if users could rename a key. Syntax should be like:

clump.rename(new_name="old_name")
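
Until such a verb exists, the same effect can be had with the existing mutate and drop verbs (a sketch):

# Emulating clump.rename(new_name="old_name") with existing verbs.
renamed = (clump
  .mutate(new_name=lambda d: d["old_name"])
  .drop("old_name"))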

Reading in multiple files in one go.

Is your feature request related to a problem? Please describe.

I have a folder with lots and lots of json files. Can I read them in all at once?

Describe the solution you'd like

It'd be nice if all of our readers accepted a '*' wildcard or a Path.glob, so that you can read in lots of files at once.

Something like:

Clumper.read_json("folder/*.json") 
Clumper.read_json(pathlib.Path("folder").glob("*"))

Additional context

As far as an implementation goes, we can probably solve this nicely with a decorator. Assuming the function it wraps is a file-reader, we should not need to touch the internal readers.
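
A minimal sketch of such a decorator, assuming the wrapped reader takes a single path as its first argument and ignoring the classmethod wrinkle (the names here are illustrative, not the actual implementation):

import glob
from functools import wraps

def multifile(reader):
    """If the path contains a wildcard, call the wrapped reader once
    per matching file and concatenate the resulting collections."""
    @wraps(reader)
    def wrapper(path, *args, **kwargs):
        if isinstance(path, str) and "*" in path:
            clumps = [reader(p, *args, **kwargs) for p in sorted(glob.glob(path))]
            result = clumps[0]
            for clump in clumps[1:]:
                result = result.concat(clump)
            return result
        return reader(path, *args, **kwargs)
    return wrapper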

Allow for multiple write types.

I noticed a job fail with this traceback:

Traceback (most recent call last):                                                                                      
  File "/home/vincent/Development/gh-dashb/scripts/grab_workflows.py", line 71, in <module>                             
    typer.run(scrape_workflows)                                                                                         
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/typer/main.py", line 859, in run            
    app()                                                                                                               
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/typer/main.py", line 214, in __call__       
    return get_command(self)(*args, **kwargs)                                                                           
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__       
    return self.main(*args, **kwargs)                                                                                   
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/click/core.py", line 782, in main           
    rv = self.invoke(ctx)                                                                                               
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke        
    return ctx.invoke(self.callback, **ctx.params)                                                                      
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke         
    return callback(*args, **kwargs)                                                                                    
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/typer/main.py", line 497, in wrapper        
    return callback(**use_params)  # type: ignore                                                                       
  File "/home/vincent/Development/gh-dashb/scripts/grab_workflows.py", line 66, in scrape_workflows                     
    (clump_workflows.write_jsonl(output_path))                                                                          
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/clumper/clump.py", line 446, in write_jsonl 
    with open(path, "x") as f:                                                                                          
FileExistsError: [Errno 17] File exists:                                                                                
'/home/vincent/Development/gh-dashb/workflows/rasahq/rasa/workflows-2021-02-25.jsonl'   

It would be nice to allow for an overwrite flag instead of assuming "x" with the file-open here.
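
A sketch of what that could look like (the overwrite keyword is a proposal, not current API):

import json

def write_jsonl(self, path, sort_keys=False, indent=None, overwrite=False):
    # "w" truncates an existing file, while "x" raises FileExistsError.
    mode = "w" if overwrite else "x"
    with open(path, mode) as f:
        for json_dict in self.collect():
            f.write(json.dumps(json_dict, sort_keys=sort_keys, indent=indent) + "\n")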

Aggregation Methods: .var()/.std()

We've got mean, count, sum, etc. But it would be nice to add a few more. In particular var/std sounds like a reasonable candidate.
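
For reference, the standard library already offers both, so the aggregation could stay dependency-free (whether we want sample or population statistics is an open design question):

import statistics

values = [45, 39, 44, 50]
statistics.variance(values)   # sample variance
statistics.stdev(values)      # sample standard deviation
statistics.pvariance(values)  # population variance
statistics.pstdev(values)     # population standard deviation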

Data Loader: .from_jsonl()

It'd be nice if we could also read data from disk. A syntax like this would be nice:

Clumper.from_jsonl(path, settings)

[FEATURE] Have multifile support Path objects

Is your feature request related to a problem? Please describe.
N/A

Describe the solution you'd like
Ideally, the multifile decorator should support parsing List[pathlib.Path] and List[str] instead of relying on the asterisk in a path. Currently, it takes a URI which may or may not have an asterisk in it and handles all the parsing. A user should be able to pass a list to it instead. Since pathlib also includes glob and is available in Python 3.4+, it should be pretty trivial.

Describe alternatives you've considered
It currently works as written so no alternatives. I think this feature request would improve its functionality. It also feels a lot more natural to pass Path objects than using the glob module itself.

Additional context
N/A

Sort doesn't work after agg without ungroup

I am not sure if this is a bug but it is definitely not a feature request so I am writing it as a bug report.

Problem
I was expecting the following code to return a sorted list of the counts grouped by the primary_type key, but that is not the case: I see an unsorted list. Is this expected behaviour?

from clumper import Clumper
(
    Clumper.read_jsonl("https://calmcode.io/datasets/pokemon.jsonl")
    .mutate(primary_type = lambda c : c['type'][0])
    .group_by('primary_type')
    .agg(occurence = ('primary_type','count'))
    .sort(key=lambda x : x['occurence'])
    .collect()
)

Additional context

Adding ungroup after agg and before sort solves it. The following code produces what I want.

from clumper import Clumper

(
    Clumper.read_jsonl("https://calmcode.io/datasets/pokemon.jsonl")
    .mutate(primary_type = lambda c : c['type'][0])
    .group_by('primary_type')
    .agg(occurence = ('primary_type','count'))
    .ungroup()
    .sort(key=lambda x : x['occurence'])
    .collect()
)

Datetime utilities

It would be nice if we could group-by day/week/hour given a timestamp.

We should first discuss a proper API before implementing anything, but this would really be a nice feature.
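
Until then, a derived key plus group_by can serve as a workaround. A sketch, assuming a dataset where every dict carries an ISO-formatted 'timestamp' key:

from datetime import datetime
from clumper import Clumper

(Clumper(data)
  .mutate(day=lambda d: datetime.fromisoformat(d["timestamp"]).date().isoformat())
  .group_by("day")
  .agg(n=("day", "count"))
  .collect())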

Aggregation Methods: .first()

We've got mean, count, sum, etc. But it would be nice to add a few more. In particular first sounds like a reasonable candidate.

Grouping by two columns mixes up the keys

Grouping by two columns mixes the keys up. See below: the aggregated output shows a dict with 'grp_1': 'b', 'grp_2': 'c'. That combination of keys is not present in the original list_dicts data.

from clumper import Clumper

list_dicts = [
    {'grp_1': 'a', 'grp_2': 'a', 'a': 6},
    {'grp_1': 'a', 'grp_2': 'b', 'a': 7},
    {'grp_1': 'a', 'grp_2': 'c', 'a': 5},
    {'grp_1': 'b', 'grp_2': 'a', 'a': 2},
    {'grp_1': 'b', 'grp_2': 'b', 'a': 4},
]

(Clumper(list_dicts)
  .group_by('grp_1', 'grp_2')
  .agg(c=('a', 'count'),
       s=('a', 'sum'),
       m=('a', 'mean'))
  .collect()
)

# output
[{'grp_1': 'b', 'grp_2': 'b', 'c': 1, 's': 4, 'm': 4},
 {'grp_1': 'b', 'grp_2': 'a', 'c': 1, 's': 7, 'm': 7},
 {'grp_1': 'b', 'grp_2': 'c', 'c': 1, 's': 2, 'm': 2},     # this key combination is not present in list_dicts
 {'grp_1': 'a', 'grp_2': 'b', 'c': 1, 's': 5, 'm': 5},
 {'grp_1': 'a', 'grp_2': 'c', 'c': 1, 's': 6, 'm': 6}]     # 'grp_1': 'a', 'grp_2': 'a' is missing here

python 3.9.7, clumper 0.2.15

Experimental Method: Table(n)

It might be cool to use a table from rich to show intermediate data in a user-friendly way. I'm not 100% sure about this because I pride myself on having zero dependencies so far. It might be optional?

Add `.schema()` verb

I think it'd be nice to see the schema of the current dict object. Might make it a lot easier to write queries.

I might be willing to import rich for this feature too.
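
A rough sketch of what such a verb could compute, namely the value types seen per key across the collection (illustrative only, not a proposed implementation):

def schema(clump):
    """Collect, per key, the set of value type names seen in the collection."""
    out = {}
    for d in clump.collect():
        for key, value in d.items():
            out.setdefault(key, set()).add(type(value).__name__)
    return out

# e.g. schema(clump) -> {'name': {'str'}, 'hp': {'int'}, 'type': {'list'}}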

Experimental Idea: Mutate that is Group-aware via `row_number()`.

I've got a function that could serve as a row_number().

def row_number():
    """
    This stateful function can be used to calculate row numbers
    on dictionaries.

    Usage:

    ```python
    from clumper import Clumper

    list_dicts = [
        {'a': 1, 'b': 2},
        {'a': 2, 'b': 3},
        {'a': 3},
        {'a': 4}
    ]

    (Clumper(list_dicts)
      .mutate(r=row_number())
      .collect())
    ```
    """
    i = 0

    def incr(_):
        nonlocal i
        i += 1
        return i

    return incr

The question is, can we make this function aware of the group_by in a nice way?

This is definitely an advanced feature; if you're new to functional style programming ... probably best to skip this one.

Flatten the keys

Sometimes I'm dealing with dictionaries that look like this:

{
  'feature_1': {'property_1': 1, 'property_2': 2},
  'feature_2': {'property_1': 3, 'property_2': 4},
  'feature_3': {'property_1': 5, 'property_2': 6},
}

In this case there are three features, but in real life this can be much larger. Currently we have two small issues.

  1. If you read this blob into Clumper, the length is currently 3 instead of 1.
  2. We currently don't have a nice way in Clumper to turn this dictionary into a flatter representation. Something like:

[
  {'feature': 'feature_1', 'property_1': 1, 'property_2': 2},
  {'feature': 'feature_2', 'property_1': 3, 'property_2': 4},
  {'feature': 'feature_3', 'property_1': 5, 'property_2': 6},
]
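
For reference, a plain-Python sketch of that transformation (the feature key name just follows the example above):

nested = {
    'feature_1': {'property_1': 1, 'property_2': 2},
    'feature_2': {'property_1': 3, 'property_2': 4},
    'feature_3': {'property_1': 5, 'property_2': 6},
}

flat = [{'feature': name, **props} for name, props in nested.items()]
# [{'feature': 'feature_1', 'property_1': 1, 'property_2': 2}, ...]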

Aggregation Methods: .median()

We've got mean, count, sum, etc. But it would be nice to add a few more. In particular median sounds like a reasonable candidate.

Groups should return a copy.

This issue was raised here. If you look at our implementations you'll notice that we typically do not return self, but rather a copy of self. This keeps things immutable.

There is currently an exception to that rule: group_by and ungroup do not follow this pattern, as seen here.
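
A sketch of what the fix could look like, following the copy pattern the other verbs use (the internals here are assumptions based on the _create_new helper referenced in other issues):

def group_by(self, *cols):
    # Return a copy with the groups set, instead of mutating self.
    new = self._create_new(self.collect())
    new.groups = cols  # the attribute name is an assumption
    return new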

Split Clumper class by functionality

Is your feature request related to a problem? Please describe.
Our main class is becoming a monolith: the Clumper class is currently over 1500 lines. The major contributor is the documentation, but it still makes the file difficult to navigate while developing.

Describe the solution you'd like
Split the class into multiple smaller classes and/or modules. I think we already have a good structure in the tests, which are organised by functionality. For example, we could split along the following lines:

  • Read/writing
  • Verbs
  • (other?)

Add `foreach` verb.

It's similar to tee. The idea is to have a function that runs for each element, but doesn't change the collection.
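
A minimal sketch of the idea (again assuming the _create_new helper referenced in other issues):

def foreach(self, func):
    """Run func on every item for its side effect; the collection is unchanged."""
    for d in self.collect():
        func(d)
    return self._create_new(self.collect())

# Usage: clump.foreach(print).keep(lambda d: d["hp"] > 40)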

Aggregation Method: .last()

We've got mean, count, sum, etc. But it would be nice to add a few more. In particular last sounds like a reasonable candidate.

Experimental Idea: Expand Verb and Functions

Since we're dealing with nested structures here, we might use the following syntax to deal with the creation of rolling/expanding/smoothing windows.

(clump
 .expand(f1=moving(col='a',window=5),
         f2=expanding(col='a',window=5),
         f3=smoothing(col='a',window=5)))

Here, expand will be like mutate in the sense that we'll add a key, but we'll do it with functions that behave just slightly differently. This is an experimental idea and I'm starting a thread here to gather my thoughts in a single place.

Add `all` aggregation method.

We've got unique but maybe we also want all. Maybe not that name, but at least something that doesn't throw things away.

Data Loader: .from_json()

It'd be nice if we could also read data from disk. A syntax like this would be nice:

Clumper.from_json(path, settings)

Data Writer: json/jsonl

Data Writer: .to_json()/.to_jsonl()

It'd be nice if we could also write data to disk. A syntax like this would be nice:

Clumper.to_json(path, settings)
Clumper.to_jsonl(path, settings)

Data Loader: .from_csv()

It'd be nice if we could also read data from disk. A syntax like this would be nice:

Clumper.from_csv(path, settings)

read_jsonl method : Importing file just renamed from .json to .jsonl is allowed

The user can read in a .json file by just renaming it to .jsonl. With the current code, Clumper parses the whole file as a single line, yielding one big dictionary. Unexpected behaviour will then happen during analysis.

To reproduce:

  1. Rename pokemon.json to pokemon.jsonl (any json file really).
  2. Read it and load it into Clumper:

from clumper import Clumper
wrongly_parsed = Clumper.read_jsonl("pokemon.jsonl")

  3. You can see that len returns 1:

print(len(wrongly_parsed))

I couldn't find an elegant solution on how to verify if the file being read is actually JSONL apart from looking at its extension. Any suggestion is welcome.
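
One possible heuristic, sketched below: a real JSONL file parses line by line, while a renamed multi-line JSON file fails on its first line. It is not watertight (a single-line .json file would still pass), but it catches the common case:

import json

def looks_like_jsonl(path, max_lines=5):
    """Heuristic: the first few lines must each be a standalone JSON document."""
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return False
    return True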

Let's remove "Error occured during writing JSONL file"

This is the result of a failing pytest on my side.

    def write_jsonl(self, path, sort_keys=False, indent=None):
        """
        Writes to a jsonl file.
    
        Arguments:
            path: filename
            sort_keys: If sort_keys is true (default: False), then the output of dictionaries will be sorted by key.
        indent: If indent is a non-negative integer (default: None), then JSON array elements and object members will be pretty-printed with that indent level.
        Usage:
    
        ```python
        from clumper import Clumper
        clump_orig = Clumper.read_jsonl("tests/data/cards.jsonl")
        clump_orig.write_jsonl("tests/data/cards_copy.jsonl")
    
        clump_copy = Clumper.read_jsonl("tests/data/cards_copy.jsonl")
    
        assert clump_copy.collect() == clump_orig.collect()
        ```
        """
    
        try:
            # Create a new file and open it for writing
            with open(path, "x") as f:
                for current_line_nr, json_dict in enumerate(self.collect()):
                    f.write(
                        json.dumps(json_dict, sort_keys=sort_keys, indent=indent) + "\n"
                    )
    
        except Exception:
>           raise RuntimeError("Error occured during writing JSONL file")
E           RuntimeError: Error occured during writing JSONL file

clumper/clump.py:276: RuntimeError

The message Error occured during writing JSONL file is making it harder for me to understand what is actually going on. Can we maybe just remove it?

The error here was that I was trying to write a file that already exists. Instead of giving me this error I got the uninformative "Error occured during writing JSONL file" message.

Readers should be able to add a filename.

When you read a bunch of json files with a glob, you often also want to add the filename to each blob.

Clumper.read_json("path/to", add_filename=True).glob("*/settings.json")

Otherwise you sometimes need to add this info manually.

Add verb to unnest item in dict.

Example.

{
  'nodeid': 'tests/test_cron_parsing.py::test_job_parsing[check0]', 
  'duration': 0.0003903769999999973, 
  'parsed': {'path': 'tests', 'file': 'test_cron_parsing', 'test': 'test_job_parsing[check0]'}
}

I'd like a verb that can remove the parsed part such that the dict remains flat.
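
A sketch of the row-wise operation such a verb would perform (the name unnest is a placeholder):

def unnest(d, key):
    """Merge the nested dictionary stored under `key` into the top level."""
    out = {k: v for k, v in d.items() if k != key}
    out.update(d[key])
    return out

# unnest(row, "parsed") would yield a flat dict with nodeid, duration,
# path, file and test as top-level keys.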

Helper method to nest per dictionary

Let's say that I have the monopoly dataset. I have rows such as:

{'name': 'Boardwalk',
  'rent': '50',
  'house_1': '200',
  'house_2': '600',
  'house_3': '1400',
  'house_4': '1700',
  'hotel': '2000',
  'deed_cost': '400',
  'house_cost': '200',
  'color': 'blue',
  'tile': '39'}

Let's suppose that I want to change that to:

{'name': 'Boardwalk',
  'color': 'blue',
  'tile': '39',
  'costs': {'deed': '400', 'house': '200'},
  'income': {'rent': '50',
   'hotel': '2000',
   'house_1': '200',
   'house_2': '600',
   'house_3': '1400',
   'house_4': '1700'}}

Then you currently need to run this:

(Clumper.read_csv("tests/data/monopoly.csv")
  .mutate(costs=lambda d: {"deed": d["deed_cost"], "house": d["house_cost"]},
          income=lambda d: {**{"rent": d["rent"], "hotel": d["hotel"]}, **{f"house_{i}": d[f"house_{i}"] for i in [1, 2, 3, 4]}})
  .drop("house_1", "house_2", "house_3", "house_4", "rent", "hotel", "deed_cost", "house_cost")
  .collect())

It feels like there should be an easier way to do this, and this issue is a place to discuss it. Since it is a rowwise operation, we might come up with a helper function for mutate; but since we also want to drop the values afterwards, we might be able to come up with something more general.

Data Writer: csv

Data Writer: .to_csv()

It'd be nice if we could also write data to disk. A syntax like this would be nice:

Clumper.to_csv(path, settings)

An important theme here is to keep it simple and to think about how we might want to deal with keys that sometimes go missing.

Join(s) performance enhancement

Is your feature request related to a problem? Please describe.
As mentioned in the codebase itself, the inner_join and left_join implementations are "naive" and a speedup is possible.
I noticed this while working with clumpers of 10k+ dicts.

Describe the solution you'd like
Here is a possible speedup which avoids the inner for-loop, with a few performance comparisons as well:

from clumper import Clumper

def join(self, other, mapping, how="inner", lsuffix="", rsuffix="_joined"):
    """Possible new join implementation, remark that I am adding the `how` keyword argument"""
    
    result = []
    self_keys, other_keys = mapping.keys(), mapping.values()
    
    if how == "inner":
        # If it's an inner join, it's sufficient to keep only the dicts that have all the matching keys
        _self = self.keep(lambda d: all((k in d.keys() for k in self_keys)))
    elif how == "left":
        _self = self
    else:
        raise NotImplementedError()
    
    other_filtered = other.keep(lambda d: all((k in d.keys() for k in other_keys)))
    
    for d_i in _self:
        
        # as already implemented, extract values to join on
        values_i = [d_i.get(k) for k in self_keys]
        
        # exploit the keep method to find all the dicts in the other clumper that match
        matched = other_filtered.keep(lambda d: all(d[k]==v for k, v in zip(other_keys, values_i)))

        if len(matched):
            for d_j in matched:
                result.append(Clumper._merge_dicts(d_i, d_j, mapping, lsuffix, rsuffix))
        else:
            # for left join, we want to keep d_i in any case
            if how == "left":
                 result.append(Clumper._merge_dicts(d_i, {}, mapping, lsuffix, rsuffix))
            
    return self._create_new(result)

Now let's define some helper functions for benchmarking:

from functools import wraps
import numpy as np
import pandas as pd
from time import process_time
from memo import memlist, grid, Runner

def generate_random_clumper(size, keys=list("abc")):
    """
    Creates a Clumper with random integers of shape=(size, len(keys)) 
    starting from a pandas DataFrame
    """
    
    df = pd.DataFrame(
        data=np.random.randint(0, 100, (size, len(keys))),
        columns=keys
    )
    
    clump = Clumper(df.to_dict("records"))
    return clump

def drop_random_keys(clump, frac = 0.1, keys = list("ab")):
    """
    Randomly drops frac percentage of keys not in the provided keys
    """"
    c1 = clump.sample_frac(frac, replace=False).select(*keys)
    c2 = clump.sample_frac(1-frac, replace=False)
    
    return c1.concat(c2)

def timer(func):
    """timer decorator"""
    @wraps(func)
    def wrapper(*args, **kwargs):

        tic = process_time()
        res = func(*args, **kwargs)
        toc = process_time()

        time_elapsed = toc-tic
        return res, time_elapsed

    return wrapper

Time for testing

results = []

@memlist(data=results)
def join_experiment(left_size, right_size, left_drop, right_drop):
    
    c1 = generate_random_clumper(left_size).pipe(drop_random_keys, left_drop)
    c2 = generate_random_clumper(right_size).pipe(drop_random_keys, right_drop)
    
    inner_old, time_inner_old = timer(c1.inner_join)(c2, mapping={"b": "b", "c": "c"})
    left_old, time_left_old = timer(c1.left_join)(c2, mapping={"b": "b", "c": "c"})
    inner_new, time_inner_new = timer(join)(c1, c2, mapping={"b": "b", "c": "c"}, how="inner")
    left_new, time_left_new = timer(join)(c1, c2, mapping={"b": "b", "c": "c"}, how="left")
    
    res = {
        "equals_inner": inner_old.equals(inner_new),
        "equals_left": left_old.equals(left_new),
        "time_inner_old":time_inner_old,
        "time_left_old": time_left_old,
        "time_inner_new": time_inner_new,
        "time_left_new": time_left_new,
        "best_inner": "new" if time_inner_new < time_inner_old else "old",
        "best_left": "new" if time_left_new < time_left_old else "old"
    }
    return res

sizes = [100, 1_000, 10_000]
drop_rates = [0.01, 0.1, 0.25, 0.5, 0.9]

settings = grid(left_size=sizes, right_size=sizes, left_drop=drop_rates, right_drop=drop_rates)
runner = Runner(backend="threading", n_jobs=8)
runner.run(func=join_experiment, settings=settings, progbar=True)

df_res = (pd.DataFrame(results)
    .assign(
        delta_inner = lambda t: t["time_inner_old"]/t["time_inner_new"],
        delta_left = lambda t: t["time_left_old"]/t["time_left_new"]
    )
)

# As a first sanity check make sure every join is as expected
df_res["equals_inner"].all(), df_res["equals_left"].all()
# (True, True)

# Then let's see 
df_res[["delta_inner", "delta_left"]].describe(percentiles=[.01, .05, .25, .5, .75, .9, .99]).T
|             | count | mean    | std     | min      | 1%       | 5%       | 25%     | 50%     | 75%     | 90%     | 99%     | max     |
|-------------|-------|---------|---------|----------|----------|----------|---------|---------|---------|---------|---------|---------|
| delta_inner | 144   | 8.7035  | 14.6614 | 0.649615 | 0.925807 | 1.03278  | 1.75209 | 3.09647 | 10.467  | 15.9198 | 75.1523 | 109.684 |
| delta_left  | 144   | 2.81045 | 2.81197 | 0.686376 | 0.743097 | 0.864589 | 1.04525 | 1.46856 | 2.77124 | 7.99939 | 9.13121 | 14.003  |
  • Inner join(s) improved in 95% of the tests.
  • Left join(s) improved in slightly more than 75% of them, in particular when all (actually, 99%) of the dicts in the right clumper have all the keys (i.e. in the tests where right_drop = 0.01).

Additional context

  • I can imagine that further improvements are possible.
  • If we want to keep the inner_join and left_join methods standalone, we can make join semiprivate and call it from both methods.

New Verb: Sample

It'd be nice if we could randomly sample from a clumper collection.

The API should allow for:

  • n: the number of items to sample
  • frac: the fraction of items to sample
  • replace: if we should replace items yes/no
  • weights: a key that can be passed in; its value is treated as the probability of being drawn
  • random_state: the random seed used for sampling
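
A sketch of how the core could work with the standard library (parameter handling is simplified; the names follow the list above):

import random

def sample(items, n, replace=False, weights=None, random_state=None):
    rng = random.Random(random_state)
    if replace:
        # random.choices supports weights and draws with replacement.
        return rng.choices(items, weights=weights, k=n)
    # random.sample draws without replacement (it has no weights support).
    return rng.sample(items, k=n)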

Dictionaries are Causing Issues

Let's say this is the input.

d = {
  'name': 'name',
 'image': 'img.img',
 'short': 'something short',
 'tags': ['science', 'entertainment'],
 'videos': [{'name': 'Intro',
   'url': 'https://player.vimeo.com/video/414517859'},
  {'name': 'Code', 'url': 'https://player.vimeo.com/video/414517885'},
  {'name': 'Plotting', 'url': 'https://player.vimeo.com/video/414517957'},
  {'name': 'How it Works 1',
   'url': 'https://player.vimeo.com/video/414518015'},
  {'name': 'How it Works 2',
   'url': 'https://player.vimeo.com/video/414518059'},
  {'name': 'Accuracy', 'url': 'https://player.vimeo.com/video/414518106'},
  {'name': 'Benchmark', 'url': 'https://player.vimeo.com/video/414518141'},
  {'name': 'Final Features',
   'url': 'https://player.vimeo.com/video/414518199'}]
}

Then what should come out of this?

Clumper(d).map(lambda d: [d]).collect()

Not this:

[['name'], ['image'], ['short'], ['tags'], ['videos']]

Yet that is exactly what is happening! The root cause is that we currently allow dictionaries to be read in via all of our read_ functions. The issue lies in the map method, which assumes a list of dictionaries. We should consider a decorator that can detect this, but maybe we should also be more strict when we create a Clumper object.
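
A sketch of what a stricter constructor could look like (wrapping a bare dict into a single-item list is one possible choice, and the blob attribute name is an assumption):

def __init__(self, blob):
    if isinstance(blob, dict):
        # One possible choice: treat a bare dict as a collection of one item.
        blob = [blob]
    if not all(isinstance(d, dict) for d in blob):
        raise ValueError("Clumper expects a list of dictionaries.")
    self.blob = blob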

Avoiding class methods to simplify API usage

Is your feature request related to a problem? Please describe.
Looking at the API, it does look a bit odd to me that one needs to call a class method (and import the class) to read files. Is there a need for the class object? From a style perspective, calling class methods breaks with the otherwise very functional method style of the other parts of the lib (e.g. chaining).

Describe the solution you'd like

import clumper

clump = clumper.read_json('https://calmcode.io/datasets/pokemon.json')
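
Such module-level functions could be thin wrappers around the existing class methods (a sketch; the module path follows the clumper/clump.py traceback above):

# Sketch for clumper/__init__.py: thin wrappers around the class methods.
from clumper.clump import Clumper

def read_json(path, **kwargs):
    return Clumper.read_json(path, **kwargs)

def read_jsonl(path, **kwargs):
    return Clumper.read_jsonl(path, **kwargs)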
