intake-spark's Introduction

Intake: Take 2

A general Python package for describing, loading and processing data

Taking the pain out of data access and distribution

Intake is an open-source package to:

  • describe your data declaratively
  • gather data sets into catalogs
  • search catalogs and services to find the right data you need
  • load, transform and output data in many formats
  • work with third party remote storage and compute platforms
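For example, a declarative description is just a short catalog entry in YAML. The following is a minimal sketch; the source name, driver and path are hypothetical:

```yaml
sources:
  measurements:            # hypothetical entry name
    description: Daily measurements stored as CSV
    driver: csv            # the built-in CSV plugin
    args:
      urlpath: "data/measurements.csv"   # hypothetical path
```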

Documentation is available at Read the Docs.

Please report issues at https://github.com/intake/intake/issues

Install

Recommended method using conda:

conda install -c conda-forge intake

You can also install using pip, in which case you can choose how many of the optional dependencies to install; the simplest option has the fewest requirements:

pip install intake

Note that you may well need specific drivers and other plugins, which usually have additional dependencies of their own.

Development

  • Create development Python environment with the required dependencies, ideally with conda. The requirements can be found in the yml files in the scripts/ci/ directory of this repo.
    • e.g. conda env create -f scripts/ci/environment-py311.yml and then conda activate test_env
  • Install intake using pip install -e .
  • Use pytest to run tests.
  • Create a fork on GitHub to be able to submit PRs.
  • We respect, but do not enforce, PEP 8 standards; all new code should be covered by tests.

intake-spark's People

Contributors

danielballan, martindurant, yergi


intake-spark's Issues

Requires pyspark >=2.3

While the requirements don't state a specific pyspark version is required, the use of the _to_corrected_pandas_type function from pyspark.sql.dataframe requires pyspark 2.3, as that was when that function was introduced.

Either the requirements should be updated, or that function could simply be copied into intake-spark.
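If the first option is taken, the fix is a one-line pin in the package requirements (a sketch, assuming the usual requirements-file layout):

```
# requirements.txt / install_requires
pyspark>=2.3
```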

Store functions in YAML file

In the intake-spark documentation there is a brief mention that:

Note that you can pass functions using this formalism, but only encode python built-ins into a YAML file, e.g., len => !!python/name:builtins.len ''.

Does this mean that we cannot store function calls in the YAML file if it's not a builtin function? Our main use case is trying to use functions from pyspark.sql.functions so that we can define more complex datasets than just a table, or being able to define a complex query through pyspark syntax rather than storing it as text.
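For reference, the built-in encoding quoted from the docs looks like this inside a catalog entry (a sketch; the argument name is hypothetical):

```yaml
args:
  transform: !!python/name:builtins.len ''   # resolves to the built-in len
```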

Release the current version on PyPI

Hi. Thanks for the package. Is it planned to release the current version to PyPI? The one on PyPI seems a bit outdated. Thanks a lot.

Caching files for Spark

What is the best way to cache a file for the spark_dataframe driver? I'm guessing this has a very simple answer, but I'm struggling to see it as a new user. I don't see a way to connect urlpath to the generic caching mechanisms in intake (since the Spark drivers don't take that arg) and the drivers don't use fsspec, so simplecache is out.

All I'd like to do is point a spark_dataframe driver at this json file: https://storage.googleapis.com/open-targets-data-releases/20.04/input/evidence-files/eva-2020-03-20.json.gz. I'd like intake to manage the local download and caching, and then have Spark read it locally. Is there a straightforward way to do it?

Even more ideally, the downloaded file would be converted to parquet (also locally) and then linked from the catalog that way. Should I just manage this kind of thing in building the catalog rather than having intake do it on .read or .to_spark?

Thanks!
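The manual-download workaround described in this issue can be sketched with the standard library alone. The helper name and cache location below are hypothetical, and this is not part of intake or intake-spark:

```python
import os
import urllib.request

def cached_download(url, cache_dir="spark_cache"):
    """Fetch url into cache_dir once and return the local path.

    Hypothetical helper: the spark_dataframe driver takes paths
    directly, so we download the remote file ourselves and hand
    Spark the cached local copy on subsequent reads.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):   # skip the download if cached
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

The returned path could then be passed to `spark.read.json` (or converted to Parquet first), though managing this at catalog-build time, as suggested above, avoids the extra step on every `.read`.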
