intake-spark's Introduction

Intake: Take 2

A general Python package for describing, loading and processing data

Taking the pain out of data access and distribution

Intake is an open-source package to:

  • describe your data declaratively
  • gather data sets into catalogs
  • search catalogs and services to find the right data you need
  • load, transform and output data in many formats
  • work with third party remote storage and compute platforms
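For example, a declarative description is just a short catalog entry in YAML. The following is a minimal sketch; the source name, driver and path are hypothetical:

```yaml
sources:
  measurements:            # hypothetical entry name
    description: Daily measurements stored as CSV
    driver: csv            # the built-in CSV plugin
    args:
      urlpath: "data/measurements.csv"   # hypothetical path
```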

Documentation is available at Read the Docs.

Please report issues at https://github.com/intake/intake/issues

Install

Recommended method using conda:

conda install -c conda-forge intake

You can also install using pip, in which case you can choose how many of the optional dependencies to install; the simplest option has the fewest requirements:

pip install intake

Note that you may well need specific drivers and other plugins, which usually have additional dependencies of their own.

Development

  • Create development Python environment with the required dependencies, ideally with conda. The requirements can be found in the yml files in the scripts/ci/ directory of this repo.
    • e.g. conda env create -f scripts/ci/environment-py311.yml and then conda activate test_env
  • Install intake using pip install -e .
  • Use pytest to run tests.
  • Create a fork on GitHub to be able to submit PRs.
  • We respect, but do not enforce, PEP 8 standards; all new code should be covered by tests.

intake-spark's People

Contributors

danielballan, martindurant, yergi


intake-spark's Issues

Requires pyspark >=2.3

While the requirements don't state a specific pyspark version is required, the use of the _to_corrected_pandas_type function from pyspark.sql.dataframe requires pyspark 2.3, as that was when that function was introduced.

Either the requirements should be updated, or that function could simply be copied into intake-spark.
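If the first option is taken, the fix is a one-line pin in the package requirements (a sketch, assuming the usual requirements-file layout):

```
# requirements.txt / install_requires
pyspark>=2.3
```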

Store functions in YAML file

In the intake-spark documentation there is a brief mention that:

Note that you can pass functions using this formalism, but only encode python built-ins into a YAML file, e.g., len => !!python/name:builtins.len ''.

Does this mean that we cannot store function calls in the YAML file if it's not a builtin function? Our main use case is trying to use functions from pyspark.sql.functions so that we can define more complex datasets than just a table, or being able to define a complex query through pyspark syntax rather than storing it as text.
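For reference, the built-in encoding quoted from the docs looks like this inside a catalog entry (a sketch; the argument name is hypothetical):

```yaml
args:
  transform: !!python/name:builtins.len ''   # resolves to the built-in len
```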

Release the current version on PyPI

Hi. Thanks for the package. Is it planned to release the current version to PyPI? The one on PyPI seems a bit outdated. Thanks a lot.

Caching files for Spark

What is the best way to cache a file for the spark_dataframe driver? I'm guessing this has a very simple answer, but I'm struggling to see it as a new user. I don't see a way to connect urlpath to the generic caching mechanisms in intake (since the Spark drivers don't take that arg) and the drivers don't use fsspec, so simplecache is out.

All I'd like to do is point a spark_dataframe driver at this json file: https://storage.googleapis.com/open-targets-data-releases/20.04/input/evidence-files/eva-2020-03-20.json.gz. I'd like intake to manage the local download and caching, and then have Spark read it locally. Is there a straightforward way to do it?

Even more ideally, the downloaded file would be converted to parquet (also locally) and then linked from the catalog that way. Should I just manage this kind of thing in building the catalog rather than having intake do it on .read or .to_spark?

Thanks!
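The manual-download workaround described in this issue can be sketched with the standard library alone. The helper name and cache location below are hypothetical, and this is not part of intake or intake-spark:

```python
import os
import urllib.request

def cached_download(url, cache_dir="spark_cache"):
    """Fetch url into cache_dir once and return the local path.

    Hypothetical helper: the spark_dataframe driver takes paths
    directly, so we download the remote file ourselves and hand
    Spark the cached local copy on subsequent reads.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):   # skip the download if cached
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

The returned path could then be passed to `spark.read.json` (or converted to Parquet first), though managing this at catalog-build time, as suggested above, avoids the extra step on every `.read`.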
