
farsante

Fake Pandas / PySpark DataFrame creator.

Install

pip install farsante

PySpark

Here's how to quickly create a 7-row DataFrame with first_name and last_name fields.

import farsante

df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()
+----------+---------+
|first_name|last_name|
+----------+---------+
|     Tommy|     Hess|
|    Arthur| Melendez|
|  Clemente|    Blair|
|    Wesley|   Conrad|
|    Willis|   Dunlap|
|     Bruna|  Sellers|
|     Tonda| Schwartz|
+----------+---------+

Here's how to create a 5-row DataFrame with Mexican Spanish first and last names.

import farsante
from mimesis import Person

mx = Person('es-mx')

df = farsante.pyspark_df([mx.first_name, mx.last_name], 5)
df.show()
+-----------+---------+
| first_name|last_name|
+-----------+---------+
|     Connie|    Xicoy|
|  Oliverios|   Merino|
|     Castel|    Yáñez|
|Guillelmina|   Prieto|
|     Gezane|   Campos|
+-----------+---------+

Pandas

Here's how to quickly create a 3-row DataFrame with first_name and last_name fields.

import farsante

df = farsante.quick_pandas_df(['first_name', 'last_name'], 3)
print(df)
  first_name last_name
0       Toby   Rosales
1      Gregg    Hughes
2    Terence       Ray

Here's how to create a 5-row DataFrame with Russian first and last names.

import farsante
from mimesis import Person

ru = Person('ru')
df = farsante.pandas_df([ru.first_name, ru.last_name], 5)
print(df)
  first_name   last_name
0      Амиль  Ханженкова
1  Славентий  Голумидова
2    Паладин   Волосиков
3       Акша    Бабашова
4       Ника    Синусова

Fake files

Here's how to create a CSV file with some fake data:

import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime

person = Person()
address = Address()
datetime = Datetime()
df = farsante.pandas_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
df.to_csv('./tmp/fake_data.csv', index=False)
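Note that to_csv (and to_parquet below) won't create the ./tmp output directory if it doesn't already exist. A minimal standard-library sketch to make sure it does before writing:

```python
import os

# Create the output directory if it doesn't exist yet;
# DataFrame.to_csv / to_parquet raise an OSError on a missing directory.
os.makedirs('./tmp', exist_ok=True)
```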

Here's how to create a Parquet file with fake data:

df = farsante.pandas_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
df.to_parquet('./tmp/fake_data.parquet', index=False)

h2o dataset creation

The h2o datasets are widely used to benchmark data processing engines. Farsante uses Rust to generate the h2o datasets.

The following datasets are currently supported:

name         rows            cols  col types                           nulls
groupby      n               9     6 id cols, 2 int cols, 1 float col  optional
join_big     n               7     6 id cols, 1 float col              no
join_big_na  n               7     6 id cols, 1 float col              optional
join_medium  n / 1000        5     4 id cols, 1 float col              optional
join_small   n / 1_000_000   4     3 id cols, 1 float col              optional

Python

To create one of the above datasets, use the generate_h2o_dataset() function from farsante.h2o_dataset_create:

from farsante import generate_h2o_dataset

generate_h2o_dataset(
    ds_type="join_big",
    n=10_000_000,
    k=10,
    nas=10,
    seed=10,
)

To create all of the above datasets in parallel, use the h2o_dataset_create_all.py script:

python h2o_dataset_create_all.py --n 10000000 --k 10 --nas 10 --seed 42

Rust

To generate these datasets directly in Rust:

  1. Install Rust
  2. Install cargo (included with a standard Rust toolchain)
  3. Install the Rust dependencies: cargo install --path .
  4. Run the program; cargo run --release -- --help lists the available options:

cargo run --release -- --n 10000000 --k 10 --nas 10 --seed 42

Contributing

If you would like to help make Farsante better, take a look at our Contributing Guide.

farsante's People

Contributors

jeffbrennan, mrpowers, semyonsinchenko


farsante's Issues

Error Encountered When Running 'test_create_fake_parquet' Test

I’m encountering an issue with the test_create_fake_parquet test. I haven't made any code changes. The test fails with the following error message:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.parquet. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (172.27.58.218 executor driver): org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)). ...

Here’s what I’ve tried so far:

  1. First, I followed the documentation on contributing guidelines, but the "4. Install project dependencies" section doesn't seem to fully reflect the transition from poetry to maturin. Following the provided instructions produces the following error:
    [tool.poetry] section not found in /path/to/project//farsante/pyproject.toml
    I then ran maturin develop inside a virtual environment and was able to run the tests with pytest tests/.
  2. Googled the error message to find similar issues, and tried to change from Java 10.1 to Java 1.8, but I couldn't find a solution that worked.

Environment Info:

Scala: 2.12.18
Java: 1.8.0_402
Python: 3.10.9
PySpark: 3.5.1

Is there anything I'm missing here? Since I haven't made any code changes, I believe something is wrong with my environment.

Thanks in advance for your help!

Python 3.11 Compatibility Issue

I am getting a pickling error when trying to use Python 3.11. It was working correctly with Python 3.10.

_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range

Cannot run test suite

[screenshot of the failing test suite run]

Perhaps we can create a CONTRIBUTING.md file with instructions on how to generate the datasets necessary to run the test suite.

Create dataset generation tests

I think we should have some tests to ensure that the datasets generated in different formats with the same seed produce the same dataframes. I started working on this when I began the Parquet implementation and moved that code into its own branch.

There may be issues with comparing a dataframe from a CSV against one from a schema-defined format like Parquet or Avro, but I think we can worry about that later.
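A minimal sketch of such a comparison, assuming pandas (the function name frames_equal is hypothetical): since CSV carries no schema, cast the CSV-derived frame to the reference frame's dtypes before comparing.

```python
import pandas as pd

def frames_equal(reference: pd.DataFrame, candidate: pd.DataFrame) -> bool:
    """Compare a frame from a schema-defined format (e.g. Parquet)
    against one read from CSV, which loses dtype information."""
    # Align the CSV-derived frame with the reference schema first.
    candidate = candidate.astype(reference.dtypes.to_dict())
    return candidate.equals(reference)
```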
