patterns-app / patterns-devkit

Data pipelines from re-usable components

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
data-science data-analysis pipelines etl etl-pipeline etl-framework functional-reactive-programming data-engineering sql immutability

patterns-devkit's People

Contributors

ajalt, kvh, samnickolay, stanleychris2, trav

patterns-devkit's Issues

Add support for specifying and uploading Schemas

Users can declare schemas in their graph.yml file:

schemas:
  - path/to/SchemaDef.yml
  - ...

Schemas declared will be added to the organization's Schema library and can then be used by any node.

Isolate Schemas as separate library

Schemas are a potentially general-purpose piece of data ecosystems, reusable and valuable across projects and languages, and may be more successful as an independent library.

Explicit field type support checks and validation for non-database formats

SQLAlchemy does a good job of smoothing over datatype support differences across database engines, but there is no similar framework for other data formats and storage engines (json, csv, dataframe, etc). Snapflow should provide a framework for easily defining the common operations and conversions for these other formats, and provide implementations for the core formats.

Build a "common" module

Provide Schemas for most common data record types:

  • Date (date, day of week, year, month, day, week number, quarter, etc)
  • TimeSeries (datetime, value)
  • Address (line1, line2, city, state/province, postal code, country)
  • Geo (city, metro, county, state/province, country, region) (??)
  • Others?

along with any associated convenience pipes.

Support Streams for pipe output

Right now the only way to emit chunks of data out of a pipe is via a Python generator of objects, which snapflow treats as a single DataBlock. There are scenarios where we want to create multiple DataBlocks from a single pipe, like the obvious case of a "split" or "chunk" pipe that breaks a bigger DataBlock into smaller chunks. Currently the only way to achieve this is a hack where you create DataBlocks within the pipe and manually log those blocks as output, which is error-prone and breaks the basic contract of the pipe and execution.
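
To make the "split" case concrete, here is a minimal sketch of a pipe body written as a plain Python generator that yields several chunks. The names `chunk_records` and `chunk_size` are hypothetical and not part of snapflow. Today everything yielded would end up in a single DataBlock; the proposal is that each yielded chunk could become its own DataBlock.

```python
from typing import Dict, Iterator, List

Record = Dict[str, object]


def chunk_records(records: List[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Yield successive chunks of at most chunk_size records (hypothetical helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(records), chunk_size):
        # Each yielded list stands in for what would become one DataBlock.
        yield records[start:start + chunk_size]


rows = [{"id": i} for i in range(5)]
chunks = list(chunk_records(rows, 2))
```

With stream support, the runtime rather than the pipe author would be responsible for turning each chunk into a logged output block.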

Expand StorageApi to support more high-level operations

Expanding StorageApi to support more operations allows for simpler pipes, more code re-use, and a natural place to make optimizations and per-engine enhancements.

Operations:

  • concat(*blocks: DataBlock)
  • split(block: DataBlock, parts: int)
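
As a rough illustration of the intended semantics, the two operations could behave like this on plain lists of records. This is only a sketch: lists of dicts stand in for DataBlocks, and a real StorageApi implementation would push the work down to each storage engine.

```python
from typing import List, Sequence

Rows = List[dict]  # stand-in for the records of a DataBlock (assumption)


def concat(*blocks: Sequence[dict]) -> Rows:
    """Combine the records of several blocks, preserving order."""
    combined: Rows = []
    for block in blocks:
        combined.extend(block)
    return combined


def split(block: Sequence[dict], parts: int) -> List[Rows]:
    """Split a block into `parts` pieces whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size, extra = divmod(len(block), parts)
    pieces: List[Rows] = []
    start = 0
    for i in range(parts):
        # The first `extra` pieces receive one additional record.
        end = start + size + (1 if i < extra else 0)
        pieces.append(list(block[start:end]))
        start = end
    return pieces


records = [{"id": i} for i in range(5)]
pieces = split(records, 2)
```

In this sketch `concat(*split(block, n))` returns the original records in order, which seems like a useful invariant for the real operations as well.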

App crash

I am trying to write a crawler to explore your v. cool app. So far I like the constructs, schedule, state and table are pretty useful. Here's a bug I encountered:

If I log a lot of data to the State, the front-end app crashes when I click on the settings tab. If I just ignore the settings tab, everything works: rerun, logging, etc. (gj backend!) Presumably this is because of fetching or formatting lots of data. Below are steps to reproduce and a screenshot of the crash. You can create a key using the free tier in Bing, or just fake a bunch of data; some sample data is included.

Code to reproduce:

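# Imports were omitted from the original snippet; these are the ones it needs.
import time
import urllib.parse

import requests
from patterns import Parameter, State, Table

BING_SUBSCRIPTION_KEY = Parameter("BING_SUBSCRIPTION_KEY")
BING_API_IMAGE_ENDPOINT = "https://api.bing.microsoft.com/v7.0/images/search"
STATE = State()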

table = Table("bing_search_spotify_album", "w")


def process(offset):
    headers = {"Ocp-Apim-Subscription-Key": BING_SUBSCRIPTION_KEY}
    params = {
        "q": "site:https://open.spotify.com/album/",
        "textDecorations": True,
        "textFormat": "HTML",
        "count": 200,
        "offset": offset,
    }
    resp = requests.get(BING_API_IMAGE_ENDPOINT, headers=headers, params=params)
    if not resp:
        print(resp.text)
    resp.raise_for_status()
    search_results = resp.json()
    # print(search_results)
    last_offset = offset
    batch = []
    for i, result in enumerate(search_results.get("value", [])):
        print(result)
        last_offset = offset+i
        row = dict(
            offset = last_offset,
            query = urllib.parse.urlencode(params),
            thumbnail_url = result.get("thumbnailUrl"),
            content_url = result.get("contentUrl"),
            name = result.get("name"),
            host_url = result.get("hostPageUrl"),
            width = result.get("width"),
            height = result.get("height"),
            thumbnail_width = result.get("thumbnail").get("width"),
            thumbnail_height = result.get("thumbnail").get("height"),
        )
        batch.append(row)
    return batch, dict(
        last_offset=last_offset,
        last_completed=time.time(),
        last_batch=batch,    # NOTE: Removing this fixed the issue
        last_result=search_results  # NOTE: Removing this fixed the issue
    )


# sleep for 2 hours to keep <1000 req per month
if time.time() > STATE.get_value("last_completed", 0) + 7200:
    batch, update = process(
        offset=STATE.get_value("last_offset", 0)
    )
    STATE.set(update)
    print(update)
    table.append(batch)

Sample data:

{'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=site%3ahttps%3a%2f%2fopen.spotify.com%2falbum%2f&id=E9ADCC2340F3E6C6AC86AA063964C7BB35B55AD4&simid=608034758239470266', 'name': 'Afrobeats 2019 by Various Artists on Spotify', 'thumbnailUrl': 'https://tse1.explicit.bing.net/th?id=OIP.IZSQBS103aMDbmK69eFHAwHaHa&pid=Api', 'datePublished': '2020-05-12T11:39:00.0000000Z', 'isFamilyFriendly': False, 'contentUrl': 'https://i.scdn.co/image/ab67616d0000b273902a0a4e4706077a4f08bddc', 'hostPageUrl': 'https://open.spotify.com/album/7gBEu5fhAqpGbEtPBlgumB', 'contentSize': '89555 B', 'encodingFormat': 'jpeg', 'hostPageDisplayUrl': '<b>https://open.spotify.com/album/</b>7gBEu5fhAqpGbEtPBlgumB', 'width': 640, 'height': 640, 'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.R52bdbpEO49IDxcvodLRPQ&amp;pid=Api', 'hostPageDomainFriendlyName': 'Spotify', 'hostPageDiscoveredDate': '2019-04-30T00:00:00.0000000Z', 'thumbnail': {'width': 474, 'height': 474}, 'imageInsightsToken': 'ccid_IZSQBS10*cp_43C66ACB7882454BCA658DE55D8F9DC5*mid_E9ADCC2340F3E6C6AC86AA063964C7BB35B55AD4*simid_608034758239470266*thid_OIP.IZSQBS103aMDbmK69eFHAwHaHa', 'insightsMetadata': {'pagesIncludingCount': 3, 'availableSizesCount': 3}, 'imageId': 'E9ADCC2340F3E6C6AC86AA063964C7BB35B55AD4', 'accentColor': 'CBA700'}
{'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=site%3ahttps%3a%2f%2fopen.spotify.com%2falbum%2f&id=9E0EB6940AEF800E6583EE319B5A7C3F7B9997C6&simid=608022440268620733', 'name': 'Latino Hits 2021 - Compilation by Various Artists | Spotify', 'thumbnailUrl': 'https://tse1.mm.bing.net/th?id=OIP.m8nKtJZA4hVjac06QG-4MAHaHa&pid=Api', 'datePublished': '2021-06-06T19:41:00.0000000Z', 'isFamilyFriendly': True, 'contentUrl': 'https://i.scdn.co/image/ab67616d0000b273032ff8097e2f7d7a0224cd3e', 'hostPageUrl': 'https://open.spotify.com/album/6lho1kKuau2Y42v1OBFokO', 'contentSize': '111293 B', 'encodingFormat': 'jpeg', 'hostPageDisplayUrl': '<b>https://open.spotify.com/album/</b>6lho1kKuau2Y42v1OBFokO', 'width': 640, 'height': 640, 'hostPageDiscoveredDate': '2021-04-14T00:00:00.0000000Z', 'thumbnail': {'width': 474, 'height': 474}, 'imageInsightsToken': 'ccid_m8nKtJZA*cp_021366D268A21C90781B29F60A4FCAAE*mid_9E0EB6940AEF800E6583EE319B5A7C3F7B9997C6*simid_608022440268620733*thid_OIP.m8nKtJZA4hVjac06QG-4MAHaHa', 'insightsMetadata': {'pagesIncludingCount': 6, 'availableSizesCount': 4}, 'imageId': '9E0EB6940AEF800E6583EE319B5A7C3F7B9997C6', 'accentColor': '27609A'}
{'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=site%3ahttps%3a%2f%2fopen.spotify.com%2falbum%2f&id=4C740C11D9EE305300B49FA7D994852786DEAEE2&simid=608001974760133715', 'name': 'PRISM (Deluxe) - Album by Katy Perry | Spotify', 'thumbnailUrl': 'https://tse4.mm.bing.net/th?id=OIP.xuefx77Rebve4ytjkBu5tgHaHa&pid=Api', 'datePublished': '2013-10-23T07:50:00.0000000Z', 'isFamilyFriendly': True, 'contentUrl': 'https://i.scdn.co/image/ab67616d0000b27347f930accd8ac01686401fa2', 'hostPageUrl': 'https://open.spotify.com/album/5MQBzs5YlZlE28mD9yUItn', 'contentSize': '92459 B', 'encodingFormat': 'jpeg', 'hostPageDisplayUrl': '<b>https://open.spotify.com/album/</b>5MQBzs5YlZlE28mD9yUItn', 'width': 640, 'height': 640, 'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.R52bdbpEO49IDxcvodLRPQ&amp;pid=Api', 'hostPageDomainFriendlyName': 'Spotify', 'hostPageDiscoveredDate': '2013-10-23T07:50:18.0000000Z', 'thumbnail': {'width': 474, 'height': 474}, 'imageInsightsToken': 'ccid_xuefx77R*cp_38E9D4B8FE7B810AC79F9DA208FE8F2E*mid_4C740C11D9EE305300B49FA7D994852786DEAEE2*simid_608001974760133715*thid_OIP.xuefx77Rebve4ytjkBu5tgHaHa', 'insightsMetadata': {'recipeSourcesCount': 0, 'pagesIncludingCount': 469, 'availableSizesCount': 102}, 'imageId': '4C740C11D9EE305300B49FA7D994852786DEAEE2', 'accentColor': 'B1951A'}
{'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=site%3ahttps%3a%2f%2fopen.spotify.com%2falbum%2f&id=000CFEA5E6DA4823C3C7688ADF6B8BEEB68FE086&simid=607996782153506816', 'name': 'Tabata Songs 2020: 20 Sec. Work &amp; 10 Sec. Rest Cycles - Album by Tabata ...', 'thumbnailUrl': 'https://tse1.mm.bing.net/th?id=OIP.WT55rX93sI4F7pp9Xm_IKwHaHa&pid=Api', 'datePublished': '2021-04-21T22:09:00.0000000Z', 'isFamilyFriendly': True, 'contentUrl': 'https://i.scdn.co/image/ab67616d0000b273d551eb09e4c9cee22b96dba4', 'hostPageUrl': 'https://open.spotify.com/album/7sK0MYzPWdIOflCEJmVvWD', 'contentSize': '90176 B', 'encodingFormat': 'jpeg', 'hostPageDisplayUrl': '<b>https://open.spotify.com/album/</b>7sK0MYzPWdIOflCEJmVvWD', 'width': 640, 'height': 640, 'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.R52bdbpEO49IDxcvodLRPQ&amp;pid=Api', 'hostPageDomainFriendlyName': 'Spotify', 'hostPageDiscoveredDate': '2020-02-20T00:00:00.0000000Z', 'thumbnail': {'width': 474, 'height': 474}, 'imageInsightsToken': 'ccid_WT55rX93*cp_A89DC841AFCB85C91019D4E896699ECB*mid_000CFEA5E6DA4823C3C7688ADF6B8BEEB68FE086*simid_607996782153506816*thid_OIP.WT55rX93sI4F7pp9Xm!_IKwHaHa', 'insightsMetadata': {'pagesIncludingCount': 2, 'availableSizesCount': 2}, 'imageId': '000CFEA5E6DA4823C3C7688ADF6B8BEEB68FE086', 'accentColor': '666666'}
{'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=site%3ahttps%3a%2f%2fopen.spotify.com%2falbum%2f&id=ACD0A1DA61D46EF5E66B674072AF293E78770D6F&simid=608032894229619605', 'name': 'reputation - Album by Taylor Swift | Spotify', 'thumbnailUrl': 'https://tse4.mm.bing.net/th?id=OIP.117skYvCoUrVk9T_s_rXnAAAAA&pid=Api', 'datePublished': '2017-09-22T22:40:00.0000000Z', 'isFamilyFriendly': True, 'contentUrl': 'https://i.scdn.co/image/ab67616d0000b273da5d5aeeabacacc1263c0f4b', 'hostPageUrl': 'https://open.spotify.com/album/6DEjYFkNZh67HP7R9PSZvv', 'contentSize': '132658 B', 'encodingFormat': 'jpeg', 'hostPageDisplayUrl': '<b>https://open.spotify.com/album/</b>6DEjYFkNZh67HP7R9PSZvv', 'width': 640, 'height': 640, 'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.R52bdbpEO49IDxcvodLRPQ&amp;pid=Api', 'hostPageDomainFriendlyName': 'Spotify', 'hostPageDiscoveredDate': '2017-09-22T22:40:33.0000000Z', 'thumbnail': {'width': 474, 'height': 474}, 'imageInsightsToken': 'ccid_117skYvC*cp_2EAB6AEA0C662632B737D2899F5BC828*mid_ACD0A1DA61D46EF5E66B674072AF293E78770D6F*simid_608032894229619605*thid_OIP.117skYvCoUrVk9T!_s!_rXnAAAAA', 'insightsMetadata': {'recipeSourcesCount': 0, 'pagesIncludingCount': 1292, 'availableSizesCount': 484}, 'imageId': 'ACD0A1DA61D46EF5E66B674072AF293E78770D6F', 'accentColor': '212121'}
{'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=site%3ahttps%3a%2f%2fopen.spotify.com%2falbum%2f&id=A8A9801DBB5910696DFE8BCE3EE7D26818389A88&simid=607992263846672663', 'name': 'Deep House Relax - Compilation by Various Artists | Spotify', 'thumbnailUrl': 'https://tse3.mm.bing.net/th?id=OIP.w1Qr0TdzT_ItWaTIwXzTnAHaHa&pid=Api', 'datePublished': '2017-03-26T00:48:00.0000000Z', 'isFamilyFriendly': True, 'contentUrl': 'https://i.scdn.co/image/ab67616d0000b273e6ab6301659453c587a4bc4b', 'hostPageUrl': 'https://open.spotify.com/album/0rwYBNJwMprvq8I0WV7i8H', 'contentSize': '121039 B', 'encodingFormat': 'jpeg', 'hostPageDisplayUrl': '<b>https://open.spotify.com/album/</b>0rwYBNJwMprvq8I0WV7i8H', 'width': 640, 'height': 640, 'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.R52bdbpEO49IDxcvodLRPQ&amp;pid=Api', 'hostPageDomainFriendlyName': 'Spotify', 'hostPageDiscoveredDate': '2017-03-26T00:48:00.0000000Z', 'thumbnail': {'width': 474, 'height': 474}, 'imageInsightsToken': 'ccid_w1Qr0Tdz*cp_E05B645E8EF4F7049FAB390F4C107CCA*mid_A8A9801DBB5910696DFE8BCE3EE7D26818389A88*simid_607992263846672663*thid_OIP.w1Qr0TdzT!_ItWaTIwXzTnAHaHa', 'insightsMetadata': {'pagesIncludingCount': 3, 'availableSizesCount': 3}, 'imageId': 'A8A9801DBB5910696DFE8BCE3EE7D26818389A88', 'accentColor': '083880'}
{'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=site%3ahttps%3a%2f%2fopen.spotify.com%2falbum%2f&id=7E8DD6DEA59E47C79A46EF16F68316CB0A866C4C&simid=608027379491675102', 'name': 'Love Songs by Beth on Spotify', 'thumbnailUrl': 'https://tse1.explicit.bing.net/th?id=OIP.GooE4oMjM85ZSoh76cjNswHaHa&pid=Api', 'datePublished': '2021-03-11T16:10:00.0000000Z', 'isFamilyFriendly': False, 'contentUrl': 'https://i.scdn.co/image/ab67616d0000b2739e9b35c23db7b4c4250acd5e', 'hostPageUrl': 'https://open.spotify.com/album/63SJYNPV82aAeMU1iuMvU8', 'contentSize': '47911 B', 'encodingFormat': 'jpeg', 'hostPageDisplayUrl': '<b>https://open.spotify.com/album/</b>63SJYNPV82aAeMU1iuMvU8', 'width': 640, 'height': 640, 'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.R52bdbpEO49IDxcvodLRPQ&amp;pid=Api', 'hostPageDomainFriendlyName': 'Spotify', 'hostPageDiscoveredDate': '2017-01-10T00:00:00.0000000Z', 'thumbnail': {'width': 474, 'height': 474}, 'imageInsightsToken': 'ccid_GooE4oMj*cp_E35A2EF6C028AE2C1209FF1E92D0BF9F*mid_7E8DD6DEA59E47C79A46EF16F68316CB0A866C4C*simid_608027379491675102*thid_OIP.GooE4oMjM85ZSoh76cjNswHaHa', 'insightsMetadata': {'pagesIncludingCount': 3, 'availableSizesCount': 2}, 'imageId': '7E8DD6DEA59E47C79A46EF16F68316CB0A866C4C', 'accentColor': 'C24509'}
{'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=site%3ahttps%3a%2f%2fopen.spotify.com%2falbum%2f&id=76321ACC2F37D8963CBA91CE622E81111C4E1CC5&simid=608034363109737202', 'name': 'Top Radio Hits - Album by Top 40 Hits, The Cover Crew, Dance Hits 2017 ...', 'thumbnailUrl': 'https://tse4.mm.bing.net/th?id=OIP.1--fEfj5kRx_6NxI9mi2qwHaHa&pid=Api', 'datePublished': '2021-04-25T18:01:00.0000000Z', 'isFamilyFriendly': True, 'contentUrl': 'https://i.scdn.co/image/ab67616d0000b273237353afe0bd5986e3911684', 'hostPageUrl': 'https://open.spotify.com/album/6hLT8YyHHWrEfU0sTvAQ6F', 'contentSize': '155717 B', 'encodingFormat': 'jpeg', 'hostPageDisplayUrl': '<b>https://open.spotify.com/album/</b>6hLT8YyHHWrEfU0sTvAQ6F', 'width': 640, 'height': 640, 'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.R52bdbpEO49IDxcvodLRPQ&amp;pid=Api', 'hostPageDomainFriendlyName': 'Spotify', 'hostPageDiscoveredDate': '2018-04-03T00:00:00.0000000Z', 'thumbnail': {'width': 474, 'height': 474}, 'imageInsightsToken': 'ccid_1++fEfj5*cp_8DC0DA3C8EC4E93B0E343553C8FB9B18*mid_76321ACC2F37D8963CBA91CE622E81111C4E1CC5*simid_608034363109737202*thid_OIP.1--fEfj5kRx!_6NxI9mi2qwHaHa', 'insightsMetadata': {'pagesIncludingCount': 6, 'availableSizesCount': 5}, 'imageId': '76321ACC2F37D8963CBA91CE622E81111C4E1CC5', 'accentColor': 'C90267'}

Screenshot:
Screen Shot 2022-12-03 at 1 11 32 PM

Support "versioning" of pipes

Support for pipe "versions" would enable a couple of valuable features:

  • allow iterated improvement of modules without breaking existing projects
  • ability to define multiple versions of a single pipe for different runtimes and engines: a version for mysql, a version for postgres, a version for dataframes, etc. Users could simply specify the abstract pipe and allow snapflow to pick a compatible version for the given runtime.

Versions come with an abstraction cost, and there would be challenges ensuring equivalence among different pipe version implementations, but keeping the defaults as they are now while supporting this capability seems worth it.
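
One way to picture runtime-specific versions is a small registry keyed by pipe name and runtime. The sketch below is purely illustrative: `register`, `resolve`, and the runtime names are hypothetical and not part of snapflow.

```python
from typing import Callable, Dict, Tuple

# Maps (abstract pipe name, runtime) to a concrete implementation.
_PIPE_VERSIONS: Dict[Tuple[str, str], Callable] = {}


def register(pipe_name: str, runtime: str) -> Callable[[Callable], Callable]:
    """Decorator that registers an implementation for one runtime."""
    def decorator(fn: Callable) -> Callable:
        _PIPE_VERSIONS[(pipe_name, runtime)] = fn
        return fn
    return decorator


def resolve(pipe_name: str, runtime: str) -> Callable:
    """Pick the implementation compatible with the given runtime."""
    try:
        return _PIPE_VERSIONS[(pipe_name, runtime)]
    except KeyError:
        raise LookupError(
            f"No version of pipe {pipe_name!r} for runtime {runtime!r}"
        ) from None


@register("dedupe", "postgres")
def dedupe_postgres(table: str) -> str:
    return f"select distinct * from {table}"


@register("dedupe", "python")
def dedupe_python(records: list) -> list:
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

A user would refer only to `"dedupe"`; the lookup by runtime is where the equivalence concern mentioned above would need testing.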

Slight improvement to node connection mismatch error messages

Not high priority:
If I specify an InputStream instead of InputTable in a downstream Python node, I can get the following error:

ValueError: Invalid graph: ['Cannot connect sql_output: input is a stream, but output is a table']

It might be nicer to give more specifics, for example:
Cannot connect sql_output to downstream_node: downstream_node expects an InputStream, but sql_output is a table

Support categorical field types

Most data formats and engines support enum / categorical types; snapflow should too.

What it would take:

  • add sqlalchemy enum to valid field types
  • add inference rules for categorical types
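
An inference rule for categorical types might look roughly like the heuristic below. It is only a sketch: the function name and both thresholds are arbitrary assumptions, not values taken from snapflow.

```python
from typing import Iterable, Optional


def looks_categorical(
    values: Iterable[Optional[str]],
    max_categories: int = 20,
    max_ratio: float = 0.5,
) -> bool:
    """Guess whether a column of values is categorical.

    A column qualifies when all non-null values are strings, there are few
    distinct values, and values repeat often enough.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    if not all(isinstance(v, str) for v in non_null):
        return False
    distinct = set(non_null)
    return (
        len(distinct) <= max_categories
        and len(distinct) / len(non_null) <= max_ratio
    )
```

For example, a column like `["red", "blue", "red", "blue"]` would pass, while a column of unique identifiers would not.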

Evaluate Apache Arrow as common data format

Arrow seems like a natural fit for snapflow, and may help alleviate some format conversion issues. Next steps are to evaluate if Arrow can support the core snapflow use cases -- conversion across files, databases, and memory while respecting "database-like" type information. It seems like "Apache Arrow + snapflow schemas" could be a sufficient solution.

SchemaTranslation and implementations have confused directionality

SchemaTranslations map "from_schema_field": "to_schema_field", while implementations use the inverse relation, to_field: from_field. This is confusing, although there is a case to be made that each direction is the "natural" one in its own context. One possible solution is to rename SchemaTranslation to something that implies the inverse mapping, such as SchemaAssignment or SchemaRoles. It may be easier to just invert the order used by implementations.
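
If the two directions were unified, converting between them is a small dictionary inversion. The helper below is a hypothetical sketch, not existing snapflow code; it also shows the one real hazard, which is two source fields mapping to the same target.

```python
from typing import Dict


def invert_mapping(mapping: Dict[str, str]) -> Dict[str, str]:
    """Turn {from_field: to_field} into {to_field: from_field}."""
    inverted: Dict[str, str] = {}
    for source, target in mapping.items():
        if target in inverted:
            # Two sources mapping to one target cannot be inverted unambiguously.
            raise ValueError(
                f"Cannot invert: {target!r} is mapped from both "
                f"{inverted[target]!r} and {source!r}"
            )
        inverted[target] = source
    return inverted


translation = {"customer_name": "name", "customer_email": "email"}
implementation = invert_mapping(translation)
```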

Progress updates for long-running pipes

Some long-running pipes, like extractors, can give progress updates and/or emit datablock chunks as they go (related to streams as output, see #22). Showing this progress to the end user would let them know that progress is being made and nothing is stuck. For pipes where progress updates are not possible (long-running SQL, for instance), the UI could show a timer or spinner instead.

Documentation

Create complete documentation for snapflow:

  • Getting started
  • Building pipelines
  • Common use case examples
  • Deploying
  • Developing modules
  • API reference

Support quoting identifiers in core SQL

Field and alias names that do not comply with a storage system's naming rules (say a field name with a capital letter in Postgres) currently cause errors.

Resolving this issue requires supporting engine-specific quoting in core SQL. This overlaps with larger initiatives for runtime-specific pipe versions and better database APIs.
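
For reference, engine-specific quoting can be sketched in a few lines. The function and the dialect names below are hypothetical. The quoting rules themselves are standard: PostgreSQL and SQLite accept double-quoted identifiers with embedded double quotes doubled, and MySQL uses backticks with embedded backticks doubled.

```python
def quote_identifier(name: str, dialect: str) -> str:
    """Quote a field or alias name for the given SQL dialect."""
    if dialect in ("postgresql", "sqlite"):
        # Standard SQL quoting: wrap in double quotes, double any embedded ones.
        return '"' + name.replace('"', '""') + '"'
    if dialect == "mysql":
        # MySQL default quoting: wrap in backticks, double any embedded ones.
        return "`" + name.replace("`", "``") + "`"
    raise ValueError(f"Unsupported dialect: {dialect!r}")


sql = f"select {quote_identifier('FieldName', 'postgresql')} from t"
```

Quoting `FieldName` this way preserves its case in Postgres, where an unquoted identifier would be folded to lower case.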

Login fails when the environment is invalid

I've enabled stack traces in my shell, so this is atypical output. Usually it just says something like "failed to login: None".

The issue, I believe, is that the environment it defaults to is invalid. The prompt shows [tests/Default], which appears to be the default value. However, tests is one environment and Default is a different one. Each of those is valid on its own.

~/g/b/t/slow-pipe► poetry run basis login
Email ([email protected]): 
Password: 
Select an environment [tests/Default]: 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/jtravis/git/basis-devkit/basis/cli/main.py", line 83, in main
    app()
  File "/Users/jtravis/git/basis-devkit/.venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jtravis/git/basis-devkit/.venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/jtravis/git/basis-devkit/.venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jtravis/git/basis-devkit/.venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jtravis/git/basis-devkit/.venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/jtravis/git/basis-devkit/.venv/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/jtravis/git/basis-devkit/basis/cli/commands/login.py", line 33, in login
    organization_id=ids.organization_id, environment_id=ids.environment_id
  File "/opt/homebrew/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/functools.py", line 982, in __get__
    val = self.func(instance)
  File "/Users/jtravis/git/basis-devkit/basis/cli/services/lookup.py", line 80, in environment_id
    return envs_by_name[env_name]["uid"]
KeyError: None
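
A guarded lookup would turn the bare KeyError into an actionable message. The sketch below reuses the names `envs_by_name` and `"uid"` from the traceback; the function shape and the exception class are assumptions, not the devkit's actual code.

```python
from typing import Dict, Optional


class EnvironmentLookupError(Exception):
    """Raised when the selected environment does not exist."""


def environment_uid(envs_by_name: Dict[str, dict], env_name: Optional[str]) -> str:
    """Return the uid of the named environment or explain what went wrong."""
    if env_name is None or env_name not in envs_by_name:
        available = ", ".join(sorted(envs_by_name)) or "(none)"
        raise EnvironmentLookupError(
            f"Unknown environment {env_name!r}. Available environments: {available}"
        )
    return envs_by_name[env_name]["uid"]


envs = {"tests": {"uid": "e1"}, "Default": {"uid": "e2"}}
```

With this shape, an invalid default such as `tests/Default` would produce a message listing `Default, tests` instead of `KeyError: None`.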

Support serialization of core snapflow objects to yaml/json

A clean serialization format is necessary for working across runtimes, storing graphs and configurations, and for sharing complex pipelines easily.

For example, to serialize a graph:

graphs:
  - name: test
    nodes:
      - name: source
        pipe: core.extract_csv
        output_dataset_name: population
        config:
          path: /var/population.csv
      - name: aggregate
        pipe: core.run_sql
        upstream: source
        output_dataset_name: population2
        config:
          sql: select * from population

Local python objects that cannot be serialized will cause serialization to fail.
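
The failure mode for local Python objects can be shown with the standard library alone. This sketch uses JSON rather than YAML to stay dependency-free, and `serialize_graph` is a hypothetical name, not a snapflow function.

```python
import json

graph = {
    "graphs": [
        {
            "name": "test",
            "nodes": [
                {
                    "name": "source",
                    "pipe": "core.extract_csv",
                    "output_dataset_name": "population",
                    "config": {"path": "/var/population.csv"},
                },
            ],
        }
    ]
}


def serialize_graph(graph: dict) -> str:
    """Serialize a graph definition, failing clearly on non-serializable values."""
    try:
        return json.dumps(graph, indent=2, sort_keys=True)
    except TypeError as exc:
        raise ValueError(
            f"Graph contains a value that cannot be serialized: {exc}"
        ) from exc


serialized = serialize_graph(graph)
```

Passing a graph whose config holds a local function or lambda would raise the `ValueError` above rather than producing a partial document.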

Support lightweight / in-line schema annotations

Could there be a way to specify the expected schema (really just the expected columns) in-line in a pipe definition, or in a lightweight way close by?

def pipe(block: DataBlock[Schema[fieldname1, fieldname2]]):
    ...

or

QuickType = quick_schema(fields=[("field1", "Unicode"), ("field2", "Numeric")])

def pipe(block: DataBlock[QuickType]):
    ...
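
To sketch what `quick_schema` might return, here is one hypothetical shape: a small immutable object that only knows field names and type names. Nothing here is existing snapflow API, and the `missing_fields` check is an assumption about how such a schema could be used.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class QuickSchema:
    """Lightweight schema: an ordered tuple of (field name, type name) pairs."""
    fields: Tuple[Tuple[str, str], ...]

    def field_names(self) -> List[str]:
        return [name for name, _ in self.fields]

    def missing_fields(self, record: dict) -> List[str]:
        """Return expected fields that are absent from a record."""
        return [name for name in self.field_names() if name not in record]


def quick_schema(fields: Iterable[Tuple[str, str]]) -> QuickSchema:
    return QuickSchema(fields=tuple((name, type_name) for name, type_name in fields))


QuickType = quick_schema(fields=[("field1", "Unicode"), ("field2", "Numeric")])
```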

Isolate format and storage conversion as separate library

Snapflow data conversions mostly reuse existing solutions, but with additional hinting from schemas. If both schemas and conversions are pulled out into their own libraries, they could provide more general solutions to the problems snapflow is trying to solve.

Print more meaningful errors when server rejects uploads

When trying to upload my graph, I got the following response:

Upload failed: 422 Client Error: Unprocessable Entity for url: http://localhost:8000/api/graph_versions/?graph_name=Test+graph&organization_name=organic-capybara

The issue at hand is that the graph name in graph.yml does not conform to the current requirements ("Test graph" contains a space, but only hyphens, letters, and digits are allowed). We should print the server's response in the failure output of the devkit so people know what to fix.
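
A minimal sketch of the proposed behaviour: append the response body to the failure message, pretty-printing it when it is JSON. The function name is hypothetical, and the example body in the usage line is invented for illustration; the real server response format is not shown in this issue.

```python
import json


def format_upload_error(status_code: int, reason: str, body: str) -> str:
    """Build a failure message that includes the server's response body."""
    message = f"Upload failed: {status_code} {reason}"
    body = body.strip()
    if not body:
        return message
    try:
        # Pretty-print JSON bodies; fall back to the raw text otherwise.
        body = json.dumps(json.loads(body), indent=2)
    except ValueError:
        pass
    return f"{message}\nServer response:\n{body}"


# Hypothetical response body, for illustration only.
example = format_upload_error(
    422, "Unprocessable Entity", '{"graph_name": ["invalid name"]}'
)
```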

Build essential core pipes

The core module needs more basic functionality, including the following pipes:

  • dedupe: support common deduplication scenarios
  • subtract: datablock subtraction (for deleting records)
  • chunk: split a datablock into smaller chunks
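
The first two pipes could behave roughly as follows on in-memory records. This is a sketch under stated assumptions: lists of dicts stand in for datablocks, `key_fields` is a hypothetical parameter, and "keep the latest record per key" is just one of the common deduplication scenarios.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]


def _key(record: Record, key_fields: Sequence[str]) -> Tuple[object, ...]:
    return tuple(record[field] for field in key_fields)


def dedupe(records: List[Record], key_fields: Sequence[str]) -> List[Record]:
    """Keep the last record seen for each key, in order of first appearance."""
    latest: Dict[Tuple[object, ...], Record] = {}
    for record in records:
        latest[_key(record, key_fields)] = record
    return list(latest.values())


def subtract(
    records: List[Record], to_remove: List[Record], key_fields: Sequence[str]
) -> List[Record]:
    """Drop every record whose key appears in to_remove."""
    remove_keys = {_key(record, key_fields) for record in to_remove}
    return [r for r in records if _key(r, key_fields) not in remove_keys]


rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
deduped = dedupe(rows, ["id"])
remaining = subtract(deduped, [{"id": 2}], ["id"])
```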
