
farsante

Fake Pandas / PySpark DataFrame creator.

Install

pip install farsante

PySpark

Here's how to quickly create a 7-row DataFrame with first_name and last_name fields.

import farsante

df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()
+----------+---------+
|first_name|last_name|
+----------+---------+
|     Tommy|     Hess|
|    Arthur| Melendez|
|  Clemente|    Blair|
|    Wesley|   Conrad|
|    Willis|   Dunlap|
|     Bruna|  Sellers|
|     Tonda| Schwartz|
+----------+---------+

Here's how to create a 5-row DataFrame with Mexican Spanish first and last names.

import farsante
from mimesis import Person

mx = Person('es-mx')

df = farsante.pyspark_df([mx.first_name, mx.last_name], 5)
df.show()
+-----------+---------+
| first_name|last_name|
+-----------+---------+
|     Connie|    Xicoy|
|  Oliverios|   Merino|
|     Castel|    Yáñez|
|Guillelmina|   Prieto|
|     Gezane|   Campos|
+-----------+---------+

Pandas

Here's how to quickly create a 3-row DataFrame with first_name and last_name fields.

import farsante

df = farsante.quick_pandas_df(['first_name', 'last_name'], 3)
print(df)
  first_name last_name
0       Toby   Rosales
1      Gregg    Hughes
2    Terence       Ray

Here's how to create a 5-row DataFrame with Russian first and last names.

import farsante
from mimesis import Person

ru = Person('ru')
df = farsante.pandas_df([ru.first_name, ru.last_name], 5)
print(df)
  first_name   last_name
0      Амиль  Ханженкова
1  Славентий  Голумидова
2    Паладин   Волосиков
3       Акша    Бабашова
4       Ника    Синусова

Fake files

Here's how to create a CSV file with some fake data:

import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime

person = Person()
address = Address()
datetime = Datetime()
df = farsante.pandas_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
df.to_csv('./tmp/fake_data.csv', index=False)
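Note that to_csv (and to_parquet below) won't create the ./tmp output directory if it doesn't already exist. A minimal standard-library sketch to make sure it does before writing:

```python
import os

# Create the output directory if it doesn't exist yet;
# DataFrame.to_csv / to_parquet raise an OSError on a missing directory.
os.makedirs('./tmp', exist_ok=True)
```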

Here's how to create a Parquet file with fake data:

df = farsante.pandas_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
df.to_parquet('./tmp/fake_data.parquet', index=False)

h2o dataset creation

The h2o datasets are widely used to benchmark data processing engines. Farsante uses Rust to generate the h2o datasets.

The following datasets are currently supported:

name         rows            cols  col types                           nulls
groupby      n               9     6 id cols, 2 int cols, 1 float col  optional
join_big     n               7     6 id cols, 1 float col              no
join_big_na  n               7     6 id cols, 1 float col              optional
join_medium  n / 1000        5     4 id cols, 1 float col              optional
join_small   n / 1_000_000   4     3 id cols, 1 float col              optional

Python

To create one of the above datasets, use the generate_h2o_dataset() function from farsante.h2o_dataset_create:

from farsante import generate_h2o_dataset

generate_h2o_dataset(
    ds_type="join_big",
    n=10_000_000,
    k=10,
    nas=10,
    seed=10,
)

To create all of the above datasets in parallel, use the h2o_dataset_create_all.py script:

python h2o_dataset_create_all.py --n 10000000 --k 10 --nas 10 --seed 42

Rust

To generate these datasets directly in Rust:

  1. Install Rust
  2. Install cargo (included with a standard Rust toolchain)
  3. Install the Rust dependencies: cargo install --path .
  4. Run the program; cargo run --release -- --help lists the available options:

cargo run --release -- --n 10000000 --k 10 --nas 10 --seed 42

Contributing

If you would like to help make Farsante better, take a look at our Contributing Guide.

farsante's People

Contributors

jeffbrennan, mrpowers, semyonsinchenko


farsante's Issues

Error Encountered When Running 'test_create_fake_parquet' Test

I’m encountering an issue with the test_create_fake_parquet test. I haven't made any code changes. The test fails with the following error message:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.parquet. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (172.27.58.218 executor driver): org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)). ...

Here’s what I’ve tried so far:

  1. First, I followed the documentation on contributing guidelines, but the "4. Install project dependencies" section doesn't seem to fully reflect the transition from poetry to maturin. Following the provided instructions produces the following error:
    [tool.poetry] section not found in /path/to/project//farsante/pyproject.toml
    I then ran maturin develop inside a virtual environment and was able to run the tests with pytest tests/.
  2. Googled the error message to find similar issues, and tried to change from Java 10.1 to Java 1.8, but I couldn't find a solution that worked.

Environment Info:

Scala: 2.12.18
Java: 1.8.0_402
Python: 3.10.9
PySpark: 3.5.1

Is there anything I'm missing here? Since I haven't made any code changes, I believe something is wrong with my environment.

Thanks in advance for your help!

Python 3.11 Compatibility Issue

I am getting a pickling error when trying to use Python 3.11. It was working correctly with Python 3.10.

_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range

Cannot run test suite

[screenshot of the failing test suite run]

Perhaps we can create a CONTRIBUTING.md file with instructions on how to generate the datasets necessary to run the test suite.

Create dataset generation tests

I think we should have some tests to ensure that the datasets generated in different formats with the same seed produce the same dataframes. I started working on this when I began the Parquet implementation and moved that code into its own branch.

There may be issues with comparing a dataframe from a CSV against one from a schema-defined format like Parquet or Avro, but I think we can worry about that later.
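A minimal sketch of such a comparison, assuming pandas (the function name frames_equal is hypothetical): since CSV carries no schema, cast the CSV-derived frame to the reference frame's dtypes before comparing.

```python
import pandas as pd

def frames_equal(reference: pd.DataFrame, candidate: pd.DataFrame) -> bool:
    """Compare a frame from a schema-defined format (e.g. Parquet)
    against one read from CSV, which loses dtype information."""
    # Align the CSV-derived frame with the reference schema first.
    candidate = candidate.astype(reference.dtypes.to_dict())
    return candidate.equals(reference)
```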
