
Clumper

A small python library that can clump lists of nested data together.

Part of a video series on calmcode.io.

Base Example

Clumper allows you to quickly parse through a list of json-like data.

Here's an example of such a dataset.

pokemon = [
    {'name': 'Bulbasaur', 'type': ['Grass', 'Poison'], 'hp': 45, 'attack': 49},
    {'name': 'Charmander', 'type': ['Fire'], 'hp': 39, 'attack': 52},
    ...
]

Given this list of dictionaries, we can write the following query:

from clumper import Clumper

clump = Clumper.read_json('https://calmcode.io/datasets/pokemon.json')

(clump
  .keep(lambda d: len(d['type']) == 1)
  .mutate(type=lambda d: d['type'][0],
          ratio=lambda d: d['attack']/d['hp'])
  .select('name', 'type', 'ratio')
  .sort(lambda d: d['ratio'], reverse=True)
  .head(5)
  .collect())
What this code does, line by line:
  1. It imports Clumper.
  2. It fetches a list of json-blobs about pokemon from the internet.
  3. It removes all the pokemon that have more than one type.
  4. The dictionaries that remain have their type as a string instead of a list of strings.
  5. The dictionaries that remain also get a ratio key: attack divided by hp.
  6. All keys besides name, type and ratio are removed.
  7. The collection is sorted by ratio, from high to low.
  8. We grab the top 5 after sorting.
  9. The results are returned as a list of dictionaries.

This is what we get back:

[{'name': 'Diglett', 'type': 'Ground', 'ratio': 5.5},
 {'name': 'DeoxysAttack Forme', 'type': 'Psychic', 'ratio': 3.6},
 {'name': 'Krabby', 'type': 'Water', 'ratio': 3.5},
 {'name': 'DeoxysNormal Forme', 'type': 'Psychic', 'ratio': 3.0},
 {'name': 'BanetteMega Banette', 'type': 'Ghost', 'ratio': 2.578125}]

Documentation

We've got a lovely documentation page that explains how the library works.

Features

  • This library has no dependencies besides a modern version of Python.
  • The library offers a pattern of verbs that are very expressive.
  • You can write code from top to bottom, left to right.
  • You can read in many json/yaml/csv files by using a wildcard *.
  • MIT License

Installation

You can install this package via pip.

pip install clumper

It may be safer, however, to install via:

python -m pip install clumper

For details on why, check out this resource.

There are some optional dependencies that you might want to install as well.

python -m pip install clumper[yaml]

Contributing

Before making a pull request, check the issue list to prevent double work. To get started locally, you can clone the repo and quickly get started using the Makefile.

git clone git@github.com:koaning/clumper.git
cd clumper
make install-dev


clumper's Issues

Verbs to deal with nesting.

It might be nice if a nested structure could easily "unnest" itself.

from clumper import Clumper

data = [{'a': 1, 'items': [1, 2]}]
new_data = Clumper(data).explode(item="items").collect()

new_data == [{'a': 1, 'item': 1}, {'a': 1, 'item': 2}]

YAML files

I have a use-case to deal with YAML files. For now, reading them in is the main focus, but we also want to be able to write them.

In this case we'd be best off introducing a dependency: pyyaml. But I prefer to keep this dependency optional. That way, folks who don't use yaml can choose not to install it.

Explode can remove data by accident.

It seems that explode removes data at times.

from clumper import Clumper 

data = [
    {"name": "john", "series": []},
    {"name": "jane", "series": [1, 2]},
    {"name": "jack", "series": [1, 2, 3]},
]

Clumper(data).explode("series").collect()

This yields:

[{'name': 'jane', 'series': 1},
 {'name': 'jane', 'series': 2},
 {'name': 'jack', 'series': 1},
 {'name': 'jack', 'series': 2},
 {'name': 'jack', 'series': 3}]

Note that john is gone.

I'm not 100% sure this is behavior that I like. You can easily prevent it with a mutate beforehand, though (see the sketch below).

Need to think about this one.
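
For reference, a sketch of that mutate workaround (padding empty lists with a None sentinel is just one possible choice, not settled API):

from clumper import Clumper

data = [
    {"name": "john", "series": []},
    {"name": "jane", "series": [1, 2]},
]

# Pad empty lists with a sentinel so that explode keeps the row.
(Clumper(data)
  .mutate(series=lambda d: d["series"] if d["series"] else [None])
  .explode("series")
  .collect())
# [{'name': 'john', 'series': None},
#  {'name': 'jane', 'series': 1},
#  {'name': 'jane', 'series': 2}]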

New Verb: Rename

It'd be nice if users could rename a key. Syntax should be like:

clump.rename(new_name="old_name")
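
Until such a verb exists, the same effect can be had with the existing mutate and drop verbs (a sketch):

# Emulating clump.rename(new_name="old_name") with existing verbs.
renamed = (clump
  .mutate(new_name=lambda d: d["old_name"])
  .drop("old_name"))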

Reading in multiple files in one go.

Is your feature request related to a problem? Please describe.

I have a folder with lots and lots of json files. Can I read them in all at once?

Describe the solution you'd like

It'd be nice if all of our readers accepted a '*' wildcard or a Path.glob, so that you can read in lots of files at once.

Something like:

Clumper.read_json("folder/*.json") 
Clumper.read_json(pathlib.Path("folder").glob("*"))

Additional context

As far as an implementation goes, we can probably solve this nicely with a decorator. Assuming the function it wraps is a file-reader, we should not need to touch the internal readers.
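
A minimal sketch of such a decorator, assuming the wrapped reader takes a single path as its first argument and ignoring the classmethod wrinkle (the names here are illustrative, not the actual implementation):

import glob
from functools import wraps

def multifile(reader):
    """If the path contains a wildcard, call the wrapped reader once
    per matching file and concatenate the resulting collections."""
    @wraps(reader)
    def wrapper(path, *args, **kwargs):
        if isinstance(path, str) and "*" in path:
            clumps = [reader(p, *args, **kwargs) for p in sorted(glob.glob(path))]
            result = clumps[0]
            for clump in clumps[1:]:
                result = result.concat(clump)
            return result
        return reader(path, *args, **kwargs)
    return wrapper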

Allow for multiple write types.

I noticed a job fail with this traceback:

Traceback (most recent call last):                                                                                      
  File "/home/vincent/Development/gh-dashb/scripts/grab_workflows.py", line 71, in <module>                             
    typer.run(scrape_workflows)                                                                                         
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/typer/main.py", line 859, in run            
    app()                                                                                                               
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/typer/main.py", line 214, in __call__       
    return get_command(self)(*args, **kwargs)                                                                           
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__       
    return self.main(*args, **kwargs)                                                                                   
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/click/core.py", line 782, in main           
    rv = self.invoke(ctx)                                                                                               
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke        
    return ctx.invoke(self.callback, **ctx.params)                                                                      
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke         
    return callback(*args, **kwargs)                                                                                    
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/typer/main.py", line 497, in wrapper        
    return callback(**use_params)  # type: ignore                                                                       
  File "/home/vincent/Development/gh-dashb/scripts/grab_workflows.py", line 66, in scrape_workflows                     
    (clump_workflows.write_jsonl(output_path))                                                                          
  File "/home/vincent/Development/gh-dashb/venv/lib/python3.7/site-packages/clumper/clump.py", line 446, in write_jsonl 
    with open(path, "x") as f:                                                                                          
FileExistsError: [Errno 17] File exists:                                                                                
'/home/vincent/Development/gh-dashb/workflows/rasahq/rasa/workflows-2021-02-25.jsonl'   

It would be nice to allow for an overwrite flag instead of assuming "x" with the file-open here.
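
A sketch of what that could look like (the overwrite keyword is a proposal, not current API):

import json

def write_jsonl(self, path, sort_keys=False, indent=None, overwrite=False):
    # "w" truncates an existing file, while "x" raises FileExistsError.
    mode = "w" if overwrite else "x"
    with open(path, mode) as f:
        for json_dict in self.collect():
            f.write(json.dumps(json_dict, sort_keys=sort_keys, indent=indent) + "\n")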

Aggregation Methods: .var()/.std()

We've got mean, count, sum, etc. But it would be nice to add a few more. In particular var/std sounds like a reasonable candidate.
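
For reference, the standard library already offers both, so the aggregation could stay dependency-free (whether we want sample or population statistics is an open design question):

import statistics

values = [45, 39, 44, 50]
statistics.variance(values)   # sample variance
statistics.stdev(values)      # sample standard deviation
statistics.pvariance(values)  # population variance
statistics.pstdev(values)     # population standard deviation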

Data Loader: .from_jsonl()

It'd be nice if we could also read data from disk. A syntax like this would be nice:

Clumper.from_jsonl(path, settings)

[FEATURE] Have multifile support Path objects

Is your feature request related to a problem? Please describe.
N/A

Describe the solution you'd like
Ideally, the multifile decorator should support parsing List[pathlib.Path] and List[str] instead of relying on the asterisk in a path. Currently, it takes a URI which may or may not have an asterisk in it and handles all the parsing. A user should be able to pass a list to it instead. Since pathlib also includes glob and is available in Python 3.4+, it should be pretty trivial.

Describe alternatives you've considered
It currently works as written so no alternatives. I think this feature request would improve its functionality. It also feels a lot more natural to pass Path objects than using the glob module itself.

Additional context
N/A

Sort doesn't work after agg without ungroup

I am not sure if this is a bug but it is definitely not a feature request so I am writing it as a bug report.

Problem
I was expecting the following code to return a sorted list of the counts grouped by the primary_type key, but that is not the case: I see an unsorted list. Is this expected behaviour?

from clumper import Clumper
(
    Clumper.read_jsonl("https://calmcode.io/datasets/pokemon.jsonl")
    .mutate(primary_type = lambda c : c['type'][0])
    .group_by('primary_type')
    .agg(occurence = ('primary_type','count'))
    .sort(key=lambda x : x['occurence'])
    .collect()
)

Additional context

Adding ungroup after agg and before sort solves it. The following code produces what I want.

from clumper import Clumper

(
    Clumper.read_jsonl("https://calmcode.io/datasets/pokemon.jsonl")
    .mutate(primary_type = lambda c : c['type'][0])
    .group_by('primary_type')
    .agg(occurence = ('primary_type','count'))
    .ungroup()
    .sort(key=lambda x : x['occurence'])
    .collect()
)

Datetime utilities

It would be nice if we could group-by day/week/hour given a timestamp.

We should first discuss a proper API before implementing anything, but this would really be a nice feature.
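
Until then, a derived key plus group_by can serve as a workaround. A sketch, assuming a dataset where every dict carries an ISO-formatted 'timestamp' key:

from datetime import datetime
from clumper import Clumper

(Clumper(data)
  .mutate(day=lambda d: datetime.fromisoformat(d["timestamp"]).date().isoformat())
  .group_by("day")
  .agg(n=("day", "count"))
  .collect())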

Aggregation Methods: .first()

We've got mean, count, sum, etc. But it would be nice to add a few more. In particular first sounds like a reasonable candidate.

Grouping by two columns mixes up the keys

Grouping by two columns mixes the keys up. See below: the aggregated output shows a dict with 'grp_1': 'b', 'grp_2': 'c'. That combination of keys is not present in the original list_dicts data.

from clumper import Clumper

list_dicts = [
    {'grp_1': 'a', 'grp_2': 'a', 'a': 6},
    {'grp_1': 'a', 'grp_2': 'b', 'a': 7},
    {'grp_1': 'a', 'grp_2': 'c', 'a': 5},
    {'grp_1': 'b', 'grp_2': 'a', 'a': 2},
    {'grp_1': 'b', 'grp_2': 'b', 'a': 4},
]

(Clumper(list_dicts)
  .group_by('grp_1', 'grp_2')
  .agg(c=('a', 'count'),
       s=('a', 'sum'),
       m=('a', 'mean'))
  .collect()
)

# output
[{'grp_1': 'b', 'grp_2': 'b', 'c': 1, 's': 4, 'm': 4},
 {'grp_1': 'b', 'grp_2': 'a', 'c': 1, 's': 7, 'm': 7},
 {'grp_1': 'b', 'grp_2': 'c', 'c': 1, 's': 2, 'm': 2},     # this key combination is not present in list_dicts
 {'grp_1': 'a', 'grp_2': 'b', 'c': 1, 's': 5, 'm': 5},
 {'grp_1': 'a', 'grp_2': 'c', 'c': 1, 's': 6, 'm': 6}]     # 'grp_1': 'a', 'grp_2': 'a' is missing here

python 3.9.7, clumper 0.2.15

Experimental Method: Table(n)

It might be cool to use a table from rich to show intermediate data in a user-friendly way. I'm not 100% sure about this because I pride myself on having zero dependencies so far. It might be optional?

Add `.schema()` verb

I think it'd be nice to see the schema of the current dict object. Might make it a lot easier to write queries.

I might be willing to import rich for this feature too.
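
A rough sketch of what such a verb could compute, namely the value types seen per key across the collection (illustrative only, not a proposed implementation):

def schema(clump):
    """Collect, per key, the set of value type names seen in the collection."""
    out = {}
    for d in clump.collect():
        for key, value in d.items():
            out.setdefault(key, set()).add(type(value).__name__)
    return out

# e.g. schema(clump) -> {'name': {'str'}, 'hp': {'int'}, 'type': {'list'}}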

Experimental Idea: Mutate that is Group-aware via `row_number()`.

I've got a function that could serve as a row_number().

def row_number():
    """
    This stateful function can be used to calculate row numbers
    on dictionaries.

    Usage:

    ```python
    from clumper import Clumper

    list_dicts = [
        {'a': 1, 'b': 2},
        {'a': 2, 'b': 3},
        {'a': 3},
        {'a': 4}
    ]

    (Clumper(list_dicts)
      .mutate(r=row_number())
      .collect())
    ```
    """
    i = 0

    def incr(_):
        nonlocal i
        i += 1
        return i

    return incr

The question is, can we make this function aware of the group_by in a nice way?

This is definitely an advanced feature; if you're new to functional style programming ... probably best to skip this one.

Flatten the keys

Sometimes I'm dealing with dictionaries that look like this:

{
  'feature_1': {'property_1': 1, 'property_2': 2},
  'feature_2': {'property_1': 3, 'property_2': 4},
  'feature_3': {'property_1': 5, 'property_2': 6},
}

In this case there are three features, but in real life this can be much larger. Currently we have two small issues.

  1. If you read this blob into Clumper, the length is currently 3 instead of 1.
  2. We currently don't have a nice way in Clumper to turn this dictionary into a flatter representation. Something like:

[
  {'feature': 'feature_1', 'property_1': 1, 'property_2': 2},
  {'feature': 'feature_2', 'property_1': 3, 'property_2': 4},
  {'feature': 'feature_3', 'property_1': 5, 'property_2': 6},
]
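
For reference, a plain-Python sketch of that transformation (the feature key name just follows the example above):

nested = {
    'feature_1': {'property_1': 1, 'property_2': 2},
    'feature_2': {'property_1': 3, 'property_2': 4},
    'feature_3': {'property_1': 5, 'property_2': 6},
}

flat = [{'feature': name, **props} for name, props in nested.items()]
# [{'feature': 'feature_1', 'property_1': 1, 'property_2': 2}, ...]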

Aggregation Methods: .median()

We've got mean, count, sum, etc. But it would be nice to add a few more. In particular median sounds like a reasonable candidate.

Groups should return a copy.

This issue was raised here. If you look at our implementations you'll notice that we typically do not return self, but rather a copy of self. This keeps things immutable.

There is currently an exception to that rule: group_by and ungroup do not follow this pattern, as seen here.
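
A sketch of what the fix could look like, following the copy pattern the other verbs use (the internals here are assumptions based on the _create_new helper referenced in other issues):

def group_by(self, *cols):
    # Return a copy with the groups set, instead of mutating self.
    new = self._create_new(self.collect())
    new.groups = cols  # the attribute name is an assumption
    return new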

Split Clumper class by functionality

Is your feature request related to a problem? Please describe.
Our main class is becoming a monolith: the Clumper class is currently over 1500 lines. The major contributor is the documentation, but it still makes the file difficult to navigate while developing.

Describe the solution you'd like
Split the class into multiple smaller classes and/or modules. I think we already have a good structure in the tests, which are organised by functionality. For example, we could split along the following lines:

  • Read/writing
  • Verbs
  • (other?)

Add `foreach` verb.

It's similar to tee. The idea is to have a function that runs for each element, but doesn't change the collection.
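
A minimal sketch of the idea (again assuming the _create_new helper referenced in other issues):

def foreach(self, func):
    """Run func on every item for its side effect; the collection is unchanged."""
    for d in self.collect():
        func(d)
    return self._create_new(self.collect())

# Usage: clump.foreach(print).keep(lambda d: d["hp"] > 40)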

Aggregation Method: .last()

We've got mean, count, sum, etc. But it would be nice to add a few more. In particular last sounds like a reasonable candidate.

Experimental Idea: Expand Verb and Functions

Since we're dealing with nested structures here, we might use the following syntax to deal with the creation of rolling/expanding/smoothing windows.

(clump
 .expand(f1=moving(col='a',window=5),
         f2=expanding(col='a',window=5),
         f3=smoothing(col='a',window=5)))

Here, expand will be like mutate in the sense that we'll add a key, but we'll do it with functions that behave just slightly differently. This is an experimental idea and I'm starting a thread here to gather my thoughts in a single place.

Add `all` aggregation method.

We've got unique but maybe we also want all. Maybe not that name, but at least something that doesn't throw things away.

Data Loader: .from_json()

It'd be nice if we could also read data from disk. A syntax like this would be nice:

Clumper.from_json(path, settings)

Data Writer: json/jsonl

Data Writer: .to_json()/.to_jsonl()

It'd be nice if we could also write data to disk. A syntax like this would be nice:

Clumper.to_json(path, settings)
Clumper.to_jsonl(path, settings)

Data Loader: .from_csv()

It'd be nice if we could also read data from disk. A syntax like this would be nice:

Clumper.from_csv(path, settings)

read_jsonl method : Importing file just renamed from .json to .jsonl is allowed

The user can read in a .json file by just renaming it to .jsonl. With the current code, Clumper parses the whole file as a single line, yielding one big dictionary. Unexpected behaviour will then happen during analysis.

To reproduce:

  1. Rename pokemon.json to pokemon.jsonl (any json file really).
  2. Read it and load it into Clumper:

from clumper import Clumper
wrongly_parsed = Clumper.read_jsonl("pokemon.jsonl")

  3. You can see that len returns 1:

print(len(wrongly_parsed))

I couldn't find an elegant solution on how to verify if the file being read is actually JSONL apart from looking at its extension. Any suggestion is welcome.
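
One possible heuristic, sketched below: a real JSONL file parses line by line, while a renamed multi-line JSON file fails on its first line. It is not watertight (a single-line .json file would still pass), but it catches the common case:

import json

def looks_like_jsonl(path, max_lines=5):
    """Heuristic: the first few lines must each be a standalone JSON document."""
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return False
    return True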

Let's remove "Error occured during writing JSONL file"

This is the result of a failing pytest on my side.

    def write_jsonl(self, path, sort_keys=False, indent=None):
        """
        Writes to a jsonl file.
    
        Arguments:
            path: filename
            sort_keys: If sort_keys is true (default: False), then the output of dictionaries will be sorted by key.
        indent: If indent is a non-negative integer (default: None), then JSON array elements and object members will be pretty-printed with that indent level.
        Usage:
    
        ```python
        from clumper import Clumper
        clump_orig = Clumper.read_jsonl("tests/data/cards.jsonl")
        clump_orig.write_jsonl("tests/data/cards_copy.jsonl")
    
        clump_copy = Clumper.read_jsonl("tests/data/cards_copy.jsonl")
    
        assert clump_copy.collect() == clump_orig.collect()
        ```
        """
    
        try:
            # Create a new file and open it for writing
            with open(path, "x") as f:
                for current_line_nr, json_dict in enumerate(self.collect()):
                    f.write(
                        json.dumps(json_dict, sort_keys=sort_keys, indent=indent) + "\n"
                    )
    
        except Exception:
>           raise RuntimeError("Error occured during writing JSONL file")
E           RuntimeError: Error occured during writing JSONL file

clumper/clump.py:276: RuntimeError

The message Error occured during writing JSONL file is making it harder for me to understand what is actually going on. Can we maybe just remove it?

The error here was that I was trying to write a file that already exists. Instead of giving me this error I got the uninformative "Error occured during writing JSONL file" message.

Readers should be able to add a filename.

When you read a bunch of json files with a glob, you often also want to add the filename to each blob.

Clumper.read_json("path/to", add_filename=True).glob("*/settings.json")

Otherwise you sometimes need to add this info manually.

Add verb to unnest item in dict.

Example.

{
  'nodeid': 'tests/test_cron_parsing.py::test_job_parsing[check0]', 
  'duration': 0.0003903769999999973, 
  'parsed': {'path': 'tests', 'file': 'test_cron_parsing', 'test': 'test_job_parsing[check0]'}
}

I'd like a verb that can remove the parsed part such that the dict remains flat.
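
A sketch of the row-wise operation such a verb would perform (the name unnest is a placeholder):

def unnest(d, key):
    """Merge the nested dictionary stored under `key` into the top level."""
    out = {k: v for k, v in d.items() if k != key}
    out.update(d[key])
    return out

# unnest(row, "parsed") would yield a flat dict with nodeid, duration,
# path, file and test as top-level keys.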

Helper method to nest per dictionary

Let's say that I have the monopoly dataset. I have rows such as:

{'name': 'Boardwalk',
  'rent': '50',
  'house_1': '200',
  'house_2': '600',
  'house_3': '1400',
  'house_4': '1700',
  'hotel': '2000',
  'deed_cost': '400',
  'house_cost': '200',
  'color': 'blue',
  'tile': '39'}

Let's suppose that I want to change that to:

{'name': 'Boardwalk',
  'color': 'blue',
  'tile': '39',
  'costs': {'deed': '400', 'house': '200'},
  'income': {'rent': '50',
   'hotel': '2000',
   'house_1': '200',
   'house_2': '600',
   'house_3': '1400',
   'house_4': '1700'}}

Then you currently need to run this:

(Clumper.read_csv("tests/data/monopoly.csv")
  .mutate(costs=lambda d: {"deed": d["deed_cost"], "house": d["house_cost"]},
          income=lambda d: {**{"rent": d["rent"], "hotel": d["hotel"]}, **{f"house_{i}": d[f"house_{i}"] for i in [1, 2, 3, 4]}})
  .drop("house_1", "house_2", "house_3", "house_4", "rent", "hotel", "deed_cost", "house_cost")
  .collect())

It feels like there should be an easier way to do this, and this issue is a place to discuss it. Since it is a rowwise operation, we might come up with a helper function for mutate; but since we also want to drop the values afterwards, we might be able to come up with something more general.

Data Writer: csv

Data Writer: .to_csv()

It'd be nice if we could also write data to disk. A syntax like this would be nice:

Clumper.to_csv(path, settings)

An important theme here is to keep it simple and to think about how we might want to deal with keys that sometimes go missing.

Join(s) performance enhancement

Is your feature request related to a problem? Please describe.
As mentioned in the codebase itself, the inner_join and left_join implementations are "naive" and a speedup is possible.
I noticed this while working with clumpers of 10k+ dicts.

Describe the solution you'd like
Here is a possible speedup which avoids the inner for-loop, with a few performance comparisons as well:

from clumper import Clumper

def join(self, other, mapping, how="inner", lsuffix="", rsuffix="_joined"):
    """Possible new join implementation, remark that I am adding the `how` keyword argument"""
    
    result = []
    self_keys, other_keys = mapping.keys(), mapping.values()
    
    if how == "inner":
        # If it's an inner join, it's sufficient to keep only the dicts that have all the matching keys
        _self = self.keep(lambda d: all((k in d.keys() for k in self_keys)))
    elif how == "left":
        _self = self
    else:
        raise NotImplementedError()
    
    other_filtered = other.keep(lambda d: all((k in d.keys() for k in other_keys)))
    
    for d_i in _self:
        
        # as already implemented, extract values to join on
        values_i = [d_i.get(k) for k in self_keys]
        
        # exploit the keep method to find all the dicts in the other clumper that match
        matched = other_filtered.keep(lambda d: all(d[k]==v for k, v in zip(other_keys, values_i)))

        if len(matched):
            for d_j in matched:
                result.append(Clumper._merge_dicts(d_i, d_j, mapping, lsuffix, rsuffix))
        else:
            # for left join, we want to keep d_i in any case
            if how == "left":
                 result.append(Clumper._merge_dicts(d_i, {}, mapping, lsuffix, rsuffix))
            
    return self._create_new(result)

Now let's define some helper functions for benchmarking:

from functools import wraps
import numpy as np
import pandas as pd
from time import process_time
from memo import memlist, grid, Runner

def generate_random_clumper(size, keys=list("abc")):
    """
    Creates a Clumper with random integers of shape=(size, len(keys)) 
    starting from a pandas DataFrame
    """
    
    df = pd.DataFrame(
        data=np.random.randint(0, 100, (size, len(keys))),
        columns=keys
    )
    
    clump = Clumper(df.to_dict("records"))
    return clump

def drop_random_keys(clump, frac = 0.1, keys = list("ab")):
    """
    Randomly drops frac percentage of keys not in the provided keys
    """"
    c1 = clump.sample_frac(frac, replace=False).select(*keys)
    c2 = clump.sample_frac(1-frac, replace=False)
    
    return c1.concat(c2)

def timer(func):
    """timer decorator"""
    @wraps(func)
    def wrapper(*args, **kwargs):

        tic = process_time()
        res = func(*args, **kwargs)
        toc = process_time()

        time_elapsed = toc-tic
        return res, time_elapsed

    return wrapper

Time for testing

results = []

@memlist(data=results)
def join_experiment(left_size, right_size, left_drop, right_drop):
    
    c1 = generate_random_clumper(left_size).pipe(drop_random_keys, left_drop)
    c2 = generate_random_clumper(right_size).pipe(drop_random_keys, right_drop)
    
    inner_old, time_inner_old = timer(c1.inner_join)(c2, mapping={"b": "b", "c": "c"})
    left_old, time_left_old = timer(c1.left_join)(c2, mapping={"b": "b", "c": "c"})
    inner_new, time_inner_new = timer(join)(c1, c2, mapping={"b": "b", "c": "c"}, how="inner")
    left_new, time_left_new = timer(join)(c1, c2, mapping={"b": "b", "c": "c"}, how="left")
    
    res = {
        "equals_inner": inner_old.equals(inner_new),
        "equals_left": left_old.equals(left_new),
        "time_inner_old":time_inner_old,
        "time_left_old": time_left_old,
        "time_inner_new": time_inner_new,
        "time_left_new": time_left_new,
        "best_inner": "new" if time_inner_new < time_inner_old else "old",
        "best_left": "new" if time_left_new < time_left_old else "old"
    }
    return res

sizes = [100, 1_000, 10_000]
drop_rates = [0.01, 0.1, 0.25, 0.5, 0.9]

settings = grid(left_size=sizes, right_size=sizes, left_drop=drop_rates, right_drop=drop_rates)
runner = Runner(backend="threading", n_jobs=8)
runner.run(func=join_experiment, settings=settings, progbar=True)

df_res = (pd.DataFrame(results)
    .assign(
        delta_inner = lambda t: t["time_inner_old"]/t["time_inner_new"],
        delta_left = lambda t: t["time_left_old"]/t["time_left_new"]
    )
)

# As a first sanity check make sure every join is as expected
df_res["equals_inner"].all(), df_res["equals_left"].all()
# (True, True)

# Then let's see 
df_res[["delta_inner", "delta_left"]].describe(percentiles=[.01, .05, .25, .5, .75, .9, .99]).T
|             | count | mean    | std     | min      | 1%       | 5%       | 25%     | 50%     | 75%     | 90%     | 99%     | max     |
|-------------|-------|---------|---------|----------|----------|----------|---------|---------|---------|---------|---------|---------|
| delta_inner | 144   | 8.7035  | 14.6614 | 0.649615 | 0.925807 | 1.03278  | 1.75209 | 3.09647 | 10.467  | 15.9198 | 75.1523 | 109.684 |
| delta_left  | 144   | 2.81045 | 2.81197 | 0.686376 | 0.743097 | 0.864589 | 1.04525 | 1.46856 | 2.77124 | 7.99939 | 9.13121 | 14.003  |
  • Inner join(s) improved in 95% of the tests.
  • Left join(s) improved in slightly more than 75% of them, in particular when all (actually, 99%) of the dicts in the right clumper have all the keys (i.e. in the tests where right_drop = 0.01).

Additional context

  • I can imagine that further improvements are possible.
  • If we want to keep the inner_join and left_join methods standalone, we can make join semiprivate and call it from both methods.

New Verb: Sample

It'd be nice if we could randomly sample from a clumper collection.

The API should allow for:

  • n: the number of items to sample
  • frac: the fraction of items to sample
  • replace: if we should replace items yes/no
  • weights: a key that can be passed in; its value is treated as the probability of being drawn
  • random_state: the random seed used for sampling
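
A sketch of how the core could work with the standard library (parameter handling is simplified; the names follow the list above):

import random

def sample(items, n, replace=False, weights=None, random_state=None):
    rng = random.Random(random_state)
    if replace:
        # random.choices supports weights and draws with replacement.
        return rng.choices(items, weights=weights, k=n)
    # random.sample draws without replacement (it has no weights support).
    return rng.sample(items, k=n)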

Dictionaries are Causing Issues

Let's say this is the input.

d = {
  'name': 'name',
 'image': 'img.img',
 'short': 'something short',
 'tags': ['science', 'entertainment'],
 'videos': [{'name': 'Intro',
   'url': 'https://player.vimeo.com/video/414517859'},
  {'name': 'Code', 'url': 'https://player.vimeo.com/video/414517885'},
  {'name': 'Plotting', 'url': 'https://player.vimeo.com/video/414517957'},
  {'name': 'How it Works 1',
   'url': 'https://player.vimeo.com/video/414518015'},
  {'name': 'How it Works 2',
   'url': 'https://player.vimeo.com/video/414518059'},
  {'name': 'Accuracy', 'url': 'https://player.vimeo.com/video/414518106'},
  {'name': 'Benchmark', 'url': 'https://player.vimeo.com/video/414518141'},
  {'name': 'Final Features',
   'url': 'https://player.vimeo.com/video/414518199'}]
}

Then what should come out of this?

Clumper(d).map(lambda d: [d]).collect()

Not this:

[['name'], ['image'], ['short'], ['tags'], ['videos']]

Yet that is exactly what is happening! The root cause is that we currently allow dictionaries to be read in via all of our read_ functions. The issue lies in the map method, which assumes a list of dictionaries. We should consider a decorator that can detect this, but maybe we should also be more strict when we create a Clumper object.
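
A sketch of what a stricter constructor could look like (wrapping a bare dict into a single-item list is one possible choice, and the blob attribute name is an assumption):

def __init__(self, blob):
    if isinstance(blob, dict):
        # One possible choice: treat a bare dict as a collection of one item.
        blob = [blob]
    if not all(isinstance(d, dict) for d in blob):
        raise ValueError("Clumper expects a list of dictionaries.")
    self.blob = blob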

Avoiding class methods to simplify API usage

Is your feature request related to a problem? Please describe.
Looking at the API, it does look a bit odd to me that one needs to call a class method (and import the class) to read files. Is there a need for the class object? From a style perspective, calling class methods breaks with the otherwise very functional method style of the other parts of the lib (e.g. chaining).

Describe the solution you'd like

import clumper

clump = clumper.read_json('https://calmcode.io/datasets/pokemon.json')
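
Such module-level functions could be thin wrappers around the existing class methods (a sketch; the module path follows the clumper/clump.py traceback above):

# Sketch for clumper/__init__.py: thin wrappers around the class methods.
from clumper.clump import Clumper

def read_json(path, **kwargs):
    return Clumper.read_json(path, **kwargs)

def read_jsonl(path, **kwargs):
    return Clumper.read_jsonl(path, **kwargs)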
