meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com

Home Page: https://sdk.meltano.com

License: Apache License 2.0


sdk's Introduction

Meltano Singer SDK

The Tap and Target SDKs are the fastest way to build custom data extractors and loaders! Taps and targets built on the SDK are automatically compliant with the Singer Spec, the de-facto open source standard for extract and load pipelines.



Meltano Singer SDK Logo


Future-proof extractors and loaders, with less code

On average, developers tell us that they write about 70% less code by using the SDK, which makes learning the SDK a great investment. Furthermore, as new features and capabilities are added to the SDK, your taps and targets can always take advantage of the latest capabilities and bug fixes, simply by updating your SDK dependency to the latest version.

Meltano

Not familiar with Meltano? Meltano is your CLI for ELT+ that:

  • Starts simple: Meltano is pip-installable and comes in a prepackaged Docker container, so you can have your first ELT pipeline running within minutes.
  • Has DataOps out-of-the-box: Meltano provides tools that make DataOps best practices easy to use in every project.
  • Integrates with everything: 300+ natively supported data sources & targets, as well as additional plugins like Great Expectations and dbt.
  • Is easily customizable: Meltano isn't just extensible, it's built to be extended! The Singer SDK (for Connectors) & EDK (for Meltano Components) are easy to use. Meltano Hub helps you find all of the connectors and components created across the data community.
  • Is a mature system: Developed since 2018, runs in production at large companies like GitLab, and currently powers over a million pipeline runs monthly.
  • Has first class ELT tooling built-in: Extract data from any data source, load into any target, use inline maps to transform data on the fly, and test the incoming data, all in one package.

If you want to get started with Meltano, we suggest you:

Documentation

Contributing back to the SDK

Making a new release of the SDK

  1. Trigger a version bump using the GitHub web UI or the CLI:

    $ gh workflow run

    The increment: auto option will figure out the most appropriate bump based on commit history.

  2. Follow the checklist in the PR description.

  3. Publish a new release using the GitHub web UI.

sdk's People

Contributors

aaronsteers, buzzcutnorman, cjohnhanson, dependabot[bot], douwem, edgarrmondragon, ericboucher, flexponsive, github-actions[bot], haleemur, jack-burnett, jlloyd-widen, kgpayne, laurents, meltybot, mjsqu, mkranna, niallrees, pnadolny13, pre-commit-ci[bot], qbatten, radbrt, reubenfrankel, sbalnojan, stkbailey, tayloramurphy, tombriggsallego, vicmattos, visch, willdasilva


sdk's Issues

Discuss spec for auto-parsing of environment variables

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/10

Originally created by @aaronsteers on 2020-12-22 20:08:18


Most orchestrators (e.g. meltano, pipelinewise, and tapdance) have developed a convention of being able to parse config values from environment variables. This capability especially helps in serverless environments and CI/CD scenarios.

With the goal of establishing CI/CD for this project, I implemented the same. That said, we should discuss the ramifications of this and also the naming convention patterns, to make sure this is a considered decision.

Here's a sample which demonstrates this functionality for this project's CI testing pipeline (the plugin name in this case is SAMPLE_TAP_GITLAB, but in a production scenario it would be TAP_GITLAB):

(screenshot omitted)

The proposed convention, which I've carried over from tapdance and meltano is as follows:

  • Env vars are prefixed by the plugin name, converted to uppercase, with dashes replaced by underscores.
  • Env vars are suffixed with the setting name (case sensitive, most often lower case), with any dashes replaced by underscores.
  • Values from environment variables will override those specified in the config file.
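
As an illustration, here is a minimal sketch of that convention (a hypothetical helper, not the SDK's actual implementation):

import os
from typing import Any, Dict, List


def parse_env_config(plugin_name: str, setting_names: List[str]) -> Dict[str, Any]:
    """Collect settings from environment variables per the convention above."""
    # For plugin "tap-gitlab" and setting "api_url", this looks for TAP_GITLAB_api_url,
    # following the case-sensitive suffix rule described in the bullets.
    prefix = plugin_name.upper().replace("-", "_") + "_"
    config: Dict[str, Any] = {}
    for setting in setting_names:
        env_var = prefix + setting.replace("-", "_")
        if env_var in os.environ:
            config[setting] = os.environ[env_var]  # overrides the config file value
    return config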

In theory there should be no conflict with orchestrators, since values would at worst be set twice. Since the orchestrator may not know whether its tap or target can parse its own environment variables, the orchestrator may choose to pop the consumed variables from the passed environment context before launching the respective tap or target.


UPDATE (2021-01-07):

  • Latest spec update is to use a --config=ENV input as indicator to pull from environment variables. This can optionally be combined with --config=path/to/file.json. If ENV is not set as a config source, environment variables will not be parsed.

Add strong typing and validation for settings using JSONSchema

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/6

Originally created by @aaronsteers on 2020-12-12 03:06:18


It came up in discussion here (!1) and here (#10) that rather than a simple list of string names (and a list of lists of strings for required option sets), we could instead allow developers to provide a JSON schema and then use the jsonschema library for validation.

RE: https://gitlab.com/meltano/tap-base/-/merge_requests/1#note_464222633
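
For illustration, settings validation with the jsonschema library could look roughly like this (hypothetical schema and config, not the SDK's actual implementation):

from jsonschema import validate

config_jsonschema = {
    "type": "object",
    "properties": {
        "api_key": {"type": "string"},
        "start_date": {"type": "string", "format": "date-time"},
    },
    "required": ["api_key"],
}

config = {"api_key": "1234", "start_date": "2020-01-01T00:00:00Z"}

# Raises jsonschema.ValidationError if the config does not match the schema.
validate(instance=config, schema=config_jsonschema)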

Check linting on pre-commit and in CI

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/12

Originally created by @DouweM on 2020-12-29 21:55:58


We recently added linting to Meltano (https://gitlab.com/meltano/meltano/-/issues/2470), in https://gitlab.com/meltano/meltano/-/merge_requests/1970 and some changes in follow-up MRs like https://gitlab.com/meltano/meltano/-/merge_requests/1983.

Let's add the same linting here, and try to get it right from the start without needing to ignore tons of existing issues using a flakehell baseline :)

Of course we can tweak the rules as needed. I would want to be more strict on docstring completeness in the public interface of this framework, for example.

Question: Should config keys be lower-cased when they are read from the environment?

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/27

Originally created by @edgarrmondragon on 2021-02-26 08:06:04


For example, suppose the JSON config for tap-example looks like this:

{
  "api_key": "1234"
}

Should the corresponding env var be TAP_EXAMPLE_API_KEY or TAP_EXAMPLE_api_key?

If it's the former, then I think there's a bug in https://gitlab.com/meltano/singer-sdk/blob/development/singer_sdk/plugin_base.py#L87.

Making the following change solves the issue for me:

config_key = k.split(plugin_env_prefix)[1].lower()
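
A worked example of that change (hypothetical values):

plugin_env_prefix = "TAP_EXAMPLE_"
k = "TAP_EXAMPLE_API_KEY"

# k.split(plugin_env_prefix) -> ["", "API_KEY"], so config_key == "api_key",
# matching the "api_key" key in the JSON config above.
config_key = k.split(plugin_env_prefix)[1].lower()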

I can open an MR if this is how you folks expect it to work.

Connection pooling at the Tap level (esp. Database-type streams)

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/21

Originally created by @aaronsteers on 2021-01-14 00:14:00


  • @DouweM started a discussion: (+1 comment)

    What do you think about moving this to the tap, so that multiple streams could use a connection pool managed by the tap?


Following from the discussion on !1, as noted above, there are some complexities with moving connections to the tap level, and I thought those may be better discussed here as their own topic. Note: As of now, only database-type streams inherit these methods.

  1. Currently streams can initialize themselves using class factory methods cls.from_input_catalog() and cls.from_discovery().
  2. Currently taps don't need to inherit from specialized base classes (unlike streams) and all custom handling of specific stream use cases happens in the Stream class itself.
  3. It will be difficult to preserve the above behavior while moving more logic to the tap, specifically execute_query() and open_connection().
  4. Previously this logic lived in a "connection" class, but it proved much simpler to have the stream class fully self-sufficient in its access to its underlying data. Refactoring back to a dedicated connection class or back to the Tap class will have a similar outcome of splitting system logic across more than one class, which could make some design choices more difficult.
  5. Since execute_query() and open_connection() are already defined as class methods, we should be able to implement class-level connection pooling and a class-level max concurrency. Without much change in design structure, this would allow global limits on how many times a class (or perhaps even a set of derived subclasses) could instantiate a new connection.
  6. As a counterpoint, the downside of using class members is that we do have to pass instance variables explicitly. So far, this hasn't been a problem though, since in most cases we only need to pass the query (the sql string) and the config dict.

Following from bullet (5) above, I'm inclined to try implementing a class-level connection pool as first preference, and see if we still can get good and intuitive usability from a design/dev perspective.
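
To make bullet (5) concrete, a rough sketch of class-level pooling with a class-level concurrency cap might look like this (purely illustrative; names and structure are assumptions, not SDK code):

import threading
from typing import Any, ClassVar, Dict, List


class DatabaseStream:
    """Illustrative stream base with class-level connection reuse."""

    # Shared by the class (and any subclasses that do not override these members).
    _connection_pool: ClassVar[List[Any]] = []
    _connection_slots: ClassVar[threading.BoundedSemaphore] = threading.BoundedSemaphore(5)

    @classmethod
    def _create_connection(cls, config: Dict[str, Any]) -> Any:
        raise NotImplementedError("Subclasses would implement the actual connect call.")

    @classmethod
    def open_connection(cls, config: Dict[str, Any]) -> Any:
        """Reuse a pooled connection, or open a new one within the class-level limit."""
        if cls._connection_pool:
            return cls._connection_pool.pop()
        cls._connection_slots.acquire()  # blocks once the connection cap is reached
        return cls._create_connection(config)

    @classmethod
    def release_connection(cls, conn: Any) -> None:
        """Return a connection to the pool for reuse by other streams of the class."""
        cls._connection_pool.append(conn)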

Opening this thread to support expanded discussion and exploring of various options.

Consolidate and streamline page token processing - [merged]

Merges feature/consolidate-page-token-processing -> development

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/merge_requests/9


This update removes the insert_page_token method from the Stream class and instead adds a next_page_token argument to both get_url_params and get_request_payload. In practice, this will simplify the creation of the request, since the paging value is known at the time the request's params and request body are created - whereas previously the request parameters were created in one method and altered in another. For simple use cases this seemed fine, but it breaks down for more complex use cases such as the Power BI REST API - since the paging token is incompatible with the rest of the request params, and since an additional layer of "paging" must be created by the developer to compensate for the API only allowing 24 hours of data to be requested in any one URL call.

The updated GitLab REST API sample demonstrates the new methods nicely:

import copy
from typing import Any, Dict, Optional

import requests

from singer_sdk.streams import RESTStream


class GitlabStream(RESTStream):
    """Sample tap test for gitlab."""

    ...

    def get_url_params(self, partition: Optional[dict], next_page_token: Optional[Any] = None) -> Dict[str, Any]:
        """Return a dictionary of values to be used in URL parameterization."""
        state = self.get_stream_or_partition_state(partition)
        result = copy.deepcopy(state)
        result.update({"start_date": self.config.get("start_date")})
        result["page"] = next_page_token or 1
        return result

    def get_next_page_token(self, response: requests.Response, previous_token: Optional[Any] = None) -> Optional[Any]:
        """Return token for identifying next page or None if not applicable."""
        next_page_token = response.headers.get("X-Next-Page", None)
        if next_page_token:
            self.logger.info(f"Next page token retrieved: {next_page_token}")
        if next_page_token and next_page_token == previous_token:
            raise RuntimeError(
                f"Loop detected in pagination. Pagination token {next_page_token} is identical to previous run."
            )
        return next_page_token

Swagger Codegen for Singer taps based on OpenAPI spec: RFC

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/2333

Originally created by @vfnevin on 2020-08-31 18:57:45


I wanted to file this issue to start collecting thoughts from the community on the prospect of building a codegen extension for Swagger, to allow anyone to write an OpenAPI spec describing an API and then have Swagger build a well-formatted tap for them automatically.

  • Does anyone have any experience building a Swagger Codegenerator extension?
  • What roadblocks can anyone see?
  • Does this sound like a useful tool for the community?
  • What thoughts / opinions do you have about a tool like this?

Spec work on `Replication key(s)`

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/23

Originally created by @aaronsteers on 2021-01-23 00:27:58


The following discussion from !1 should be addressed:

  • @DouweM started a discussion: (+2 comments)

    We don't have any logic yet that actually uses this replication_key_value when building queries.

    Perhaps we can rip out the state/bookmark business for the moment, and bring it back when we can actually focus on getting that and metadata to work perfectly in all scenarios we can imagine?

This ticket can be used to track progress and discuss resolution of the spec, specifically regarding replication keys, sort orders, and bookmarking scenarios.

Replace 'master' with 'main' as default branch name

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/1

Originally created by @aaronsteers on 2020-11-26 23:29:16


As a general practice, I have found it best to avoid 'master' as the primary branch name and instead use 'main'. This is easier to do earlier in the project lifecycle, before there are other dependencies.

Process I've used successfully in the past:

  1. Create and push a new branch called 'main' which is at the same position as 'master'.
  2. Update default branch settings and branch protection mechanisms to point to 'main'.
  3. Delete 'master' branch.

Happy to discuss additional thoughts and/or alternative approaches to this.

Add `Property` class and `ObjectType` class, preferring name-less types - [merged]

Merges feature/property-type-separation -> development

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/merge_requests/8


This MR updates how json schema helper functions are implemented. Previously the "type" classes were functioning as "properties" - as each took a "name" argument and an optional "required" argument. However, this approach breaks down for arrays which wrap another type and have no name themselves.

Rather than have some types acting as properties and others as types, this update introduces a Property class, which is a more appropriate and natural way to model properties which contain the (required) name and (optional) required attributes. The Property class's constructor takes name, type, and an optional required flag - with the type input accepting either a type class or a type object.

Here is the new syntax example as listed at the top of helpers/typing.py:

    from singer_sdk.helpers.typing import (  # module path per the note above
        ArrayType,
        BooleanType,
        DateTimeType,
        IntegerType,
        NumberType,
        ObjectType,
        PropertiesList,
        Property,
        StringType,
    )

    jsonschema = PropertiesList(
        Property("id", IntegerType, required=True),
        Property("name", StringType),
        Property("tags", ArrayType(StringType)),
        Property("ratio", NumberType),
        Property("days_active", IntegerType),
        Property("updated_on", DateTimeType),
        Property("is_deleted", BooleanType),
        Property(
            "author",
            ObjectType(
                Property("id", StringType),
                Property("name", StringType),
            )
        ),
        Property(
            "groups",
            ArrayType(
                ObjectType(
                    Property("id", StringType),
                    Property("name", StringType),
                )
            )
        ),
    ).to_json()

Implementation note: in order to make this work properly, the type_dict properties are now class properties, which means they work identically on the type classes as well as instances of those classes.

Deprecation notices:

  • the name property of the type helpers should be considered deprecated and preference going forward should be for the new wrapped syntax Property(name, type).
  • Going forward: "properties" have names and required specifications, while "types" should be used to define schema. The exception to this is "objects", which themselves have named "properties". While a little more complex to describe, this more closely aligns with the JSON Schema spec.
  • ComplexType references should be replaced with ObjectType. The latter does not take a name argument, and the reference to object more closely aligns with the native JSONSchema type of object.
  • ComplexType and the optional name argument for type constructors will continue to work for a while longer, but that support will likely be removed at/before the first public beta.

SPIKE on adding target functionality under the same repo

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/4

Originally created by @aaronsteers on 2020-12-04 19:11:18


I'm finding on !1 that there could be some significant benefit to adding target processing into the same repo - or at least in developing both sides in tandem. I may open up a new PR to explore what this would look like.

If we did decide to put tap and target functionality into the same repo (either for the base classes or the derived actual classes), this would certainly cause a big headache in repo naming and discovery. For this reason, I think we'll probably ultimately still keep these separate. However, in the meanwhile, if and when it's convenient to test both sides of this process together, I may temporarily add in target-based functionality in order to (1) establish proper end-to-end testing and (2) build out the complete end-to-end inheritance vision.

Fix reading catalog from JSON file - [merged]

Merges bugfix/read-catalog-file -> development

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/merge_requests/11


Trying to use a catalog for extraction results in an error due to the catalog being incorrectly instantiated:

$ tap-example --config .secrets/config.json --discover > catalog.json
$ tap-example --config .secrets/config.json --catalog catalog.json
time=2021-03-09 11:50:58 name=tap-example level=INFO message=Parsing env var for settings config...                                                                              
time=2021-03-09 11:50:58 name=tap-example level=INFO message=Config validation passed with 0 errors and 0 warnings.
Traceback (most recent call last):                                                                                                                                              
  File "<string>", line 1, in <module>                                                                                                                                          
  File "venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__                                     
    return self.main(*args, **kwargs)                                                                                                                                           
  File "venv/lib/python3.8/site-packages/click/core.py", line 782, in main                                         
    rv = self.invoke(ctx)                                                                                                                                                       
  File "venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke                                      
    return ctx.invoke(self.callback, **ctx.params)                                                                                                                              
  File "venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke                                       
    return callback(*args, **kwargs)                                                                                                                                            
  File "venv/lib/python3.8/site-packages/singer_sdk/tap_base.py", line 178, in cli
    tap.sync_all()    
  File "venv/lib/python3.8/site-packages/singer_sdk/tap_base.py", line 141, in sync_all
    for stream in self.streams.values():
  File "venv/lib/python3.8/site-packages/singer_sdk/tap_base.py", line 57, in streams
    stream.apply_catalog(self.input_catalog)                                                                                                                                    
  File "venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 423, in apply_catalog
    catalog_entry: singer.CatalogEntry = catalog.get_stream(self.name)                                                                                                          
  File "venv/lib/python3.8/site-packages/singer/catalog.py", line 130, in get_stream
    if stream.tap_stream_id == tap_stream_id:                                                                                                                                   
AttributeError: 'str' object has no attribute 'tap_stream_id'

This fixes the error.

Preview: Target SDK release - [merged]

Merges feature/target-base -> main

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/merge_requests/4


Closes #96
Partially implements #63
Includes and replaces !92


Summary of implementation, from the docs here in this branch:

Creating targets with singer-sdk requires overriding just two classes:

  1. The Target class. This class governs configuration, validation, and stream discovery.
  2. The Sink class. This class is responsible for writing records to the target and keeping a tally of written records. Each Sink implementation may write records immediately in Sink.load_record() or in batches during Sink.drain().
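
As a purely illustrative sketch of the Sink side of that split, using the load_record/drain method names described here (class name and signatures are assumptions, not the documented API):

import json
from typing import Any, Dict, List


class JsonlBufferSink:  # in a real target this would subclass the SDK's Sink class
    """Illustrative sink that buffers records and writes them during drain()."""

    def __init__(self, stream_name: str) -> None:
        self.stream_name = stream_name
        self._records: List[Dict[str, Any]] = []

    def load_record(self, record: Dict[str, Any]) -> None:
        """Receive one record; here we just buffer it (the 'lazy drain' approach)."""
        self._records.append(record)

    def drain(self) -> None:
        """Write all buffered records at once, then reset the buffer."""
        with open(f"{self.stream_name}.jsonl", "a") as out:
            for record in self._records:
                out.write(json.dumps(record) + "\n")
        self._records = []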

This writeup describes the need to decouple the Sink and Stream classes:

While the default implementation will create one Sink per stream, this is not required. The Stream:Sink mapping behavior can be overwritten in the following ways:

  • 1:1: This is the default, where only one sink is active for each incoming stream name.
    • The exception to this rule is when a new STATE message is received for an already-active stream. In this case, the existing sink will be marked to be drained and a new sink will be initialized to receive the next incoming records.
    • In the case that a sink is archived because of a superseding STATE message, all prior version(s) of the stream's sink are guaranteed to be drained in creation order.
    • Example: a database-type target where each stream will land in a dedicated table.
  • 1:many: In this scenario, the target intentionally creates multiple sinks per stream. The developer may override Target.get_sink() and use details within the record (or a randomization algorithm) to send records to multiple sinks all for the same stream.
    • Example: a data lake target where output files should be pre-partitioned according to one or more attributes within the record. Multiple smaller files, named according to their partition key values, are more efficient than fewer larger files.
  • many:1: In this scenario, the target intentionally sends all records to the same sink, regardless of the stream name. The stream name will likely be made an attribute of the final output, but records do not need to be segregated by the stream name.
    • Example: a json file writer where the desired output is a single combined json file with all records from all streams.

By default, the current implementation triggers a full drain and flush of all sink objects when a STATE message is received. This implementation prioritizes emitting the provided STATE message, and is well tuned for scenarios where only one stream is being sent at a time (the vast majority).

Lower priority: To optimize for randomized streams, we may in the future add an alternate implementation which instead prioritizes for fewer drain operations or a minimum batch size for the drain process to be triggered.


Included samples

I have two samples so far which I'm using to test the interface. Both use the lazy drain() method to write records all at once, rather than writing data during load_record().

How to test during the preview

The easiest way to test is to run the following from your own Target repo:

  1. Run poetry remove singer-sdk within your existing sdk-based repo or simply comment-out the singer-sdk line in your repo's pyproject.toml.
  2. According to your preference:
    • Depend on this branch (feature/target-base): poetry add git+https://gitlab.com/meltano/singer-sdk.git#feature/target-base
    • Clone and make a local dev dependency: poetry add --dev ../singer-sdk/ (assumes singer-sdk is a sibling directory).

Configure CI testing in GitLab with Poetry package management

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/3

Originally created by @aaronsteers on 2020-12-02 18:29:20


In order to ensure stable, high quality code, we want to establish automated testing in GitLab CI. This also introduces the need to pick a framework (or none) for python testing, package management, and dependency management.

I've been interested to try Poetry and I see related discussion for Meltano here and here.

(Currently CI tests on the #2 PR are failing due to missing setup.py. This would resolve that failure and set a framework for sustainable management over time.)

Consider universal config options for `primary_keys` and `replication_key` overrides

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/5

Originally created by @aaronsteers on 2020-12-12 03:02:24


In tap-adwords, there's a convention of accepting an optional primary_keys option in the config file. The expected value is a map of stream names to the respective lists of primary key property names.

I've been considering what this might look like and tap-base seems like a perfect place for this to live.

Possible spec:

  1. taps would accept by default an optional primary_keys config option with a value of type Dict[str, Optional[List[str]]]
  2. taps would accept by default an optional replication_keys config option with a value of type Dict[str, Optional[str]]
  3. if provided in config options, these values would override tap-specified defaults and also would override any conflicting specs in the catalog.json file.
    • similarly, an explicit null in either the primary_keys or replication_key would clear the respective setting if it was specified in the catalog or in the default tap behavior.
  4. the tap would throw an error if it rejects the setting or for any reason is unable to accept it.
    • likely reasons the tap would decide to throw an error:
      1. primary keys or replication keys do not exist as properties in the stream
      2. requested replication keys not possible given the extract mode
  5. Because this behavior could be undesirable or problematic on API sources, we may choose to implement it by default only on base classes that are Generic or Database stream classes.

Ref: https://gitlab.com/meltano/tap-adwords#user-content-create-the-config-file


Background:

Database-like, file-like, and report-like data sources do not generally know their own metadata. This feature would enable users of those taps to quickly and easily implement the desired behavior.
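
For illustration, a config using the proposed options might look like this (shown as a Python dict; stream and column names are invented):

config = {
    # Dict[str, Optional[List[str]]]: per-stream primary key overrides.
    "primary_keys": {
        "orders": ["order_id"],
        "order_lines": ["order_id", "line_number"],
    },
    # Dict[str, Optional[str]]: per-stream replication key overrides; an explicit
    # null/None clears any replication key set by the catalog or tap defaults.
    "replication_keys": {
        "orders": "updated_at",
        "order_lines": None,
    },
}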

Accelerated tap development framework (v0.0.1-apha) - [merged]

Merges feature/initial-base-classes -> development

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/merge_requests/1


This initial MR would add core capabilities and provide the initial interface spec for tap developers to use when creating new taps.

Objectives:

  • We want tap and target developers to get more done with less code, without having to become experts in the Singer spec.
  • We want the cost of supporting taps and targets to be significantly decreased.
  • We want to enable new features and new Singer spec additions in a systemic and minimally invasive way.
  • We want to create a smooth onramp for existing taps.
  • We want to take advantage of modern Python typing to eliminate guesswork during development.

Related discussion

Please see https://gitlab.com/meltano/meltano/-/issues/2401 for a robust discussion of the need for a new framework.

Status:

  • All core capabilities are working: discovery via scan, discovery via catalog file, sync_one, sync_all, and CLI execution.
  • docs/dev_guide.md shows recommended usage.
  • A Cookiecutter template has been started as well, modeled after the Parquet test. (Needs updates.)
  • GitLab CI testing is online, leveraging the pytest framework.
  • Poetry has been implemented for package and dependency management.
  • Current samples implemented:
    • GitLab (REST/GraphQL streams hybrid)
    • Countries API (GraphQL stream type)
    • Snowflake (Database stream type)
    • Parquet (Generic stream type)

Known limitations:

  1. Need more Robust paging for REST and GraphQL sources.
    • Currently there is no paging implemented for the GraphQL source, and the REST implementation is based on a single sample (GitLab). In theory, the developer can simply override the get_next_page() method, returning something truthy if there's another page, but this is not well defined or well documented as of yet.
  2. Templating and parameterization is not well documented as of yet, and it may be worth leveraging jinja instead of doing it by hand.
    • I'd like to evaluate whether, instead of using the current and generic {my_val} syntax for templating, we should migrate to the Jinja syntax {{my_val}}, so that developers and tap users can implement more complex logic if and when it is needed. (I have spun off an issue, #11, to discuss this in more depth.)
    • A basic means of templating is already implemented for the url_suffix parameterization in REST calls, but we probably will also want a similar standard for parsing tap settings like filepath or file_naming_scheme, as well as GraphQL queries and perhaps also in SQL queries.

Many functions still need to be added, and I've opened tickets here for follow-on items. That said, the MR is already getting quite large, so I would love to get initial feedback on a first merge to main.

Request Logging

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/31

Originally created by @vischous on 2021-03-03 18:15:31


I'm writing a BambooHR tap that uses basic auth, and I'm trying to debug the requests that are sent out.

@aaronsteers recommended putting in an issue here. Note: I'm a noob with Python, so there are almost certainly things I'm missing. re https://meltano.slack.com/archives/CFG3C3D1Q/p1614794618011900

$ poetry run tap-bamboohr --config .secrets/config.json
INFO Parsing env var for settings config...
DEBUG Running config validation using jsonschema: {'type': 'object', 'properties': {'auth_token': {'type': ['string'], 'required': True}, 'subdomain': {'type': ['string'], 'required': True}}, 'required': ['auth_token', 'subdomain']}
INFO Config validation passed with 0 errors and 0 warnings.
INFO Beginning FULL_TABLE sync of stream 'employees'...
{"type": "SCHEMA", "stream": "employees", "schema": {"type": "object", "properties": {"id": {"type": ["number", "null"], "required": false}, "displayName": {"type": ["string", "null"], "required": false}, "firstName": {"type": ["string", "null"], "required": false}, "lastName": {"type": ["string", "null"], "required": false}, "gender": {"type": ["string", "null"], "required": false}, "jobTitle": {"type": ["string", "null"], "required": false}, "workPhone": {"type": ["string", "null"], "required": false}, "workPhoneExtension": {"type": ["string", "null"], "required": false}, "skypeUsername": {"type": ["string", "null"], "required": false}, "facebook": {"type": ["string", "null"], "required": false}}, "required": []}, "key_properties": null}
INFO {'url': 'https://api.bamboohr.com/api/gateway.php/autoidmtest/v1/employees/directory', 'params': {}, 'request_data': None}
DEBUG Starting new HTTPS connection (1): api.bamboohr.com:443
DEBUG https://api.bamboohr.com:443 "GET /api/gateway.php/autoidmtest/v1/employees/directory HTTP/1.1" 401 None
INFO Skipping request to https://api.bamboohr.com/api/gateway.php/autoidmtest/v1/employees/directory
INFO Reason: 401 - b''
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/singer_sdk/tap_base.py", line 178, in cli
    tap.sync_all()
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/singer_sdk/tap_base.py", line 142, in sync_all
    stream.sync()
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 390, in sync
    self._sync_records()
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 307, in _sync_records
    for row_dict in self.get_records(partition=partition):
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/singer_sdk/streams/rest.py", line 187, in get_records
    for row in self.request_records(partition):
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/singer_sdk/streams/rest.py", line 141, in request_records
    resp = self._request_with_backoff(prepared_request)
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
  File "/home/visch/git/tap_bamboohr/.venv/lib/python3.8/site-packages/singer_sdk/streams/rest.py", line 94, in _request_with_backoff
    raise RuntimeError(
RuntimeError: Requested resource was unauthorized, forbidden, or not found.

Evaluate `jinja2` as a standard and robust templating engine

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/11

Originally created by @aaronsteers on 2020-12-22 20:42:32


The Jinja templating engine has grown in adoption with its integration into dbt, Cookiecutter, and other tools. Rather than do all templating by hand in the legacy print() or f-string-like syntax, I'd like to evaluate whether Jinja could be a better foundation.

Jinja supports complex expressions, including if-then logic, and we could start with an implementation that simply surfaces the config dictionary as the templating inputs.

In the case of our GitLab sample code, instead of the url_suffix being /projects/{project_id}?statistics=1, it would be /projects/{{project_id}}?statistics=1. Then we just pass the template string and config dictionary to the jinja render() command rather than doing text-based substitution ourselves.
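
A minimal illustration of that (hypothetical config value):

from jinja2 import Template

config = {"project_id": "12345"}
url_suffix = Template("/projects/{{project_id}}?statistics=1").render(**config)
# url_suffix == "/projects/12345?statistics=1"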

In theory, this would require very little change for developers and users, but it could give a big benefit in the long run.

Support Defaults in Tap `config_jsonschema`

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/35

Originally created by @kgpayne on 2021-03-12 13:45:07


For clarity, it would be great if I could specify defaults when declaring the config schema. Without this functionality, defaults are likely to be implemented using .get() defaults buried in individual Stream implementations:

self.config.get('batch_size', -1)

Instead, I would like to do the following:

config_jsonschema = PropertiesList(
        Property("host", StringType, required=True),
        Property("username", StringType, required=True),
        Property("password", StringType, required=True),
        Property("batch_size", IntegerType, default=-1),
    ).to_dict()

One-time formatting cleanup and added CI format test(s)

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/36

Originally created by @aaronsteers on 2021-03-12 18:50:32


Towards resolving #12, we need to do a one-time cleanup of the existing codebase, combined with a new CI test which fails if standard formatting requirements are not met.

My plan of approach would be to run black for autoformatting and mypy for linting. I'm also open to alternative or additional linter suggestions. Likely I would set up the CI test first and then use a TDD approach to resolve any failures in subsequent commits.

Since these types of changes generate a large number of potential code conflicts, I would like to prioritize highly and include as part of our 0.1.0 release plan.

Connection-free testing capability: `--replay` and `--demo`

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/30

Originally created by @aaronsteers on 2021-03-02 18:49:07


Background

In my effort to support development on tap-powerbi-metadata, I'm realizing that I am overly reliant on the tester and co-developer because I don't have access to valid credentials for the source. It is not trivial to create test creds either, since this requires at minimum: creating a new Azure account with a valid credit card, creating an Active Directory Domain, creating a Service Principal, creating a new Power BI tenant, and finally granting my Service Principal access to the Power BI tenant. In fact, there are still more steps on top of the above, since my new Power BI environment probably doesn't have any reports, workspaces, log history, etc. In other words, even after gaining valid creds, I would not necessarily have valid test data to validate the data sync methods are working properly.

Proposal:

As a sibling and complement of planned connection tests in #14, we would like to be able to run "as close as possible" to a full sync test without having any access to the upstream connection. This is difficult by nature to make generic, since there isn't a "dummy data set" which would make sense for every stream. We'd also like to create portable and replayable versions of synced data, so that we get better reproducibility - and to do so in a way that "just works", regardless of which tap or source system we are dealing with.

Use cases

  1. Find bugs, repro them, and apply fixes locally - without requiring source credentials or network connectivity.
  2. Provide a sample data output option for tap users.
    • By specifying the path to a "golden set" of jsonl output in the repo, developers can optionally "opt in" to enabling sample data output for end users. This "demo mode" would allow users (and orchestrators like meltano!) to preview the data and understand the emitted data shape even before specifying a connection. To be widely adopted, the data format for the sample set needs to be generated automatically from the SDK (of course, likely after some amount of anonymizing by the developer).
  3. Adding replay-capability would vastly expand the number and breadth of tests which could be run - with or without connectivity. Importantly:
    1. We could test that SCHEMA message generation is correct, based on catalog input and/or predefined tap logic.
    2. We could verify STATE message emitting behaviors are tested against config inputs.
    3. We could ensure stream selection rules are tested, as specified by the optional catalog input, per usual.
    4. We can test that each stream's jsonschema is compliant with the JSON Schema spec and that data within the sample stream complies with the specified jsonschema definition on the stream.
  4. Bug reproducibility across teams.
    • We would have a better way to repro errors across teams: as a user experiences an issue specific to their environment, they can save the output and transmit it (securely and/or anonymized, of course) to a tap developer who can then replay the offending stream, repro the issue, debug and confirm the fix, add unit tests, etc. - all without ongoing input from the user.
  5. Perform quality control tests en masse.
    • For indexes like the planned singer-db, we can more realistically scale to hundreds or thousands of taps, performing automated generic CI/CD testing without having to manage hundreds or thousands of corresponding source credentials. (Having sample data capabilities would likely result in a "badge" of some sort in the index, along with the latest CI/CD test results based on that sample data.)

Proposed implementation

Internal changes proposed:

Write a new Stream._replay_records() into the SDK base classes as an alternative path to get_records(). This function would never need to be overridden by developers since it would be implemented generically. In order to meet that design goal (i.e. not requiring dev effort), we would require a generic, predefined text file format. The easiest and most generalizable file format is our already-defined jsonl output from the tap itself.
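
A minimal sketch of what such a generic replay path could look like (hypothetical function shape, not the SDK's implementation):

import json
from pathlib import Path
from typing import Any, Dict, Iterable


def _replay_records(replay_file: Path, stream_name: str) -> Iterable[Dict[str, Any]]:
    """Yield RECORD payloads for one stream from previously captured Singer output."""
    with replay_file.open() as f:
        for line in f:
            message = json.loads(line)
            # SCHEMA and STATE messages are not echoed; per the caveats below, they
            # would be regenerated by the tap or used as test assertions instead.
            if message.get("type") == "RECORD" and message.get("stream") == stream_name:
                yield message["record"]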

Proposed CLI updates:

  1. Add a new --replay=path/to/output.jsonl capability which would then run in dry-run mode using the sample data. The process of creating a source connection would then be skipped.
    • At least initially, --catalog would be required whenever --replay is set.
    • The --replay option should be
  2. Add a new optional --demo capability which is automatically enabled if the tap developer specifies a path to a valid and replayable demo data set, including a catalog file and at least one jsonl file. When the capability is supported, tap-mysource --demo is equivalent to tap-mysource --replay=path/to/demo/replay-file.json --catalog=path/to/demo/catalog.json.

Why raw jsonl sync output as the standard "data replay" format:

After considering several options, I landed on native jsonl output as the best storage mechanism I could think of for enabling this functionality across the wide ecosystem of existing taps.

  1. By definition, this output already describes all the nuances of each diverse data set, which is hard to say for any other data serialization method. I first considered using target-csv generically but experience has shown that CSV doesn't work well for complex and nested data sets. We could consider target-jsonl or target-parquet but neither is simpler or offers any significant benefit over simply replaying the raw output data. (See "out of scope" section below for possible future extensibility options.)
  2. As a native text file format, jsonl is very easy to review for PII and other confidential information which could then be relatively easily be replaced with obfuscated/generic data. (See "out of scope" for thoughts around auto-obfuscation.)
  3. It's very easy to truncate all but the first 100 or 1000 rows in order to get a smaller data file.
  4. At least in terms of generating the datasets themselves, no new code or training is needed, since this already comes out of box with every Singer tap - even those not built on the SDK. (That means we can replay data generated on a pre-SDK version using the SDK version, and then validate the new output against the original.)

Other Notes:

Caveats:

For this to be valuable and effective for testing purposes, we should run through as much of the "real" data flow as possible:

  • Since part of what we are wanting to test is that SCHEMA and RECORD message types are properly generated, we would need to treat RECORD messages as "raw data" and not simply echo them.
  • Similarly for SCHEMA messages: those stored in the jsonl output should either be ignored completely or used as test assertions. We would not simply echo them out, since one of the objectives of the test is to ensure that they are correctly generated by the developer's implementation.
  • Config values would still need to be parsed or passed as usual, since some of those config values will modify how the output is generated. Credential-based config might still be required (as per usual validation rules) but dummy values could be passed, since those specific setting values would effectively be ignored.

Out-of-scope but worthy of discussion:

  • Eventually we could add an auto-anonymization option via something like pyanonymizer.
  • We might eventually allow developers to write alternative dummy-data generation methods in addition to the private _replay_records() method discussed here as generic.
  • We might eventually create a process or toolset for running diffs against successive outputs. For example, this could be built as a CI/CD test to better ensure properly behaving taps and highlight any changes across releases.

Inherit `Click` CLI behavior from base class

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/8

Originally created by @aaronsteers on 2020-12-17 20:59:26


In order for the CLI to work as expected, we require each tap to paste the boilerplate code below into their package. This could be removed if we can get Click to work with the base class method declaration directly. As of my latest efforts on this, Click throws an error due to confusion over the first argument of the method (the cls argument).

Workarounds Considered

I considered making this a static method, which doesn't get a cls argument, but that wouldn't give us access to the class type, which we need for initialization purposes.

Sample boilerplate code today

# CLI Execution:

import click

@click.option("--version", is_flag=True)
@click.option("--discover", is_flag=True)
@click.option("--config")
@click.option("--catalog")
@click.command()
def cli(
    discover: bool = False,
    config: str = None,
    catalog: str = None,
    version: bool = False,
):
    SampleTapGitlab.cli(
        version=version, discover=discover, config=config, catalog=catalog
    )

Conventions re: Multi-call REST implementations

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/20

Originally created by @aaronsteers on 2021-01-12 21:21:54


There are two main scenarios under which a REST-based tap may require multiple HTTP calls, likely with different parameter values in subsequent calls.

Scenario A: ID of entity being streamed is keyed into URL, requiring multiple calls: one call each for every captured entry.

Example: GitLab Projects.

  • The project ID is in the URL, and (in this case) the user is able to specify a comma-separated list of projects to be extracted.
  • Multiple calls need to be made to retrieve the full list of projects. (One call per provided project ID.)

Requirements to fulfill:

  • get_url() should be plural.
  • get_params() should probably be plural - or dynamic based on an iterated input.

Scenario B: Parent ID is parameterized, requiring multiple calls: one call each for every parent entity.

  • Note: This scenario and scenario C are difficult with traditional REST but may be significantly easier in GraphQL, since joins can be performed in the query itself.

Example: Gitlab Epic Issues.

Sample path: /groups/{id}/epics/{secondary_id}/issues

  • Since Epics are child to Groups, and since project ID and issue ID are both part of the URL to retrieve issue comments, we must make multiple calls, one per issue ID identified during extraction.
  • Since Epic Issues cannot be known unless Epics are known, it is not possible to query Epic Issues without also querying Epics.

Requirements to fulfill:

  • URLs need to be generated dynamically from the output of other REST requests.
  • We may need intelligence to differentiate behavior:
    • Epics is selected - in which case we emit matching records and also chain calls to the child stream
    • Epics is not selected - we still need to query Epics to get the IDs, but we should not emit the corresponding records.

Some options:

  • Option A: The parent entity's stream either calls the child stream or registers future calls for the child stream.
  • Option B: The child entity gets access to the parent's keys or is empowered to make calls to retrieve them.
  • Option C: The child entity registers itself in relationship to its parent, and, once registered, receives a call on subsequent iterations of the parent key instance.

Scenario C: Follow-on (child) HTTP calls needed for full stream definition.

  • The core stream needs some vital information from another lookup, either in a 1:1, 1:many, or many:many relationship.

Example: Hypothetical "Students" records requiring chained calls (above).

Using the hypothetical example of relationships between "students", "users", and "addresses", for example:

  • "users" - 1:1 with "students", needs additional lookup by "user_id" on the "student" record
  • "home_zip_code" - 1:many with "students", needs additional lookup to "addresses" by the "address_id" on the "student" record.
  • "majors" - many:many with "students", needs additional lookup to "subjects" by the "major_subject_ids" on the "student" record.

Requirements to fulfill:

  • Already available: We can use post_process() for the additional lookups.
  • Lower priority: We could create a documented paradigm in samples for calling the (parameterized and auto-retrying) HTTP methods from within post_process().
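
An illustrative sketch of the post_process() approach for the 1:1 "users" lookup in scenario C (stream, field names, helper, and signature are all assumptions):

from typing import Any, Dict, Optional


class StudentsStream:  # a real tap would subclass the SDK's RESTStream
    """Illustrative stream that enriches each student record via a follow-on lookup."""

    def post_process(self, row: Dict[str, Any], partition: Optional[dict] = None) -> Dict[str, Any]:
        # In practice this should reuse the stream's parameterized, auto-retrying
        # HTTP helpers rather than issuing a bare request.
        row["user"] = self._fetch_user(row["user_id"])
        return row

    def _fetch_user(self, user_id: str) -> Dict[str, Any]:
        """Placeholder for the follow-on HTTP call keyed by user_id."""
        raise NotImplementedError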

feature: Add built-in support for `BATCH` message type (aka `FAST_SYNC` spec)

This feature would bring BATCH message type support to SDK taps and targets, beginning with the .jsonl.gz file format saved to local storage.

Spec discussion:


Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/9

Originally created by @aaronsteers on 2020-12-18 18:12:08


Spec discussion (old)

This enhancement would add framework support for the new FAST_SYNC spec as described on the meltano thread (https://gitlab.com/meltano/meltano/-/issues/2364).

To kick off the discussion, what about this as a strawman spec:

List of spec changes to support Fast Sync (partial, wip):

  • register_batch_export_handler() - Registers a handler function to respond to batch export requests. Includes in the registration command a declaration of what file type and storage options are supported by the handler, along with the relative priority of the specific handler.
  • register_batch_import_handler() - Same as above but for targets.

Following from other design practices, we would not require that the tap author knows how to implement the BATCH message type, just that they return file paths in a way we can properly pass them to the downstream client (according to spec work on https://gitlab.com/meltano/meltano/-/issues/2364).

Example:

In the case of a Redshift UNLOAD command, the register_batch_export_handler() might give a function to execute the UNLOAD command, save to S3, and then download the files locally and return the corresponding local filepaths.

Add support for sharded state bookmarks for substreams (ex. substreams by parent key)

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/22

Originally created by @aaronsteers on 2021-01-14 20:02:25


On #20 and in discussions on !1, we have uncovered the need for additional spec work on complex bookmarking requirements - specifically bookmarks which track distinct "shards" or "subdomains" of the total domain of entities in the stream.

A common scenario, for instance, would be keeping a distinct bookmark for each "project" in the GitLab tap, since virtually all streams must be keyed off of a project_id. For instance, the nested bookmark structure is needed to track the latest replication key for each substream, since each was run at a slightly different time, and each may need to be retried separately from one another if a failure affects one substream and not the others.

Add support for ACTIVATE_VERSION message types

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/18

Originally created by @aaronsteers on 2021-01-06 23:04:29


From the singer-python library:

ACTIVATE_VERSION message (EXPERIMENTAL).

The ACTIVATE_VERSION messages has these fields:
  * stream - The name of the stream.
  * version - The version number to activate.

This is a signal to the Target that it should delete all previously
seen data and replace it with all the RECORDs it has seen where the
record's version matches this version number.

Note that this feature is experimental. Most Taps and Targets should
not need to use the "version" field of "RECORD" messages or the
"ACTIVATE_VERSION" message at all.

msg = singer.ActivateVersionMessage(
    stream='users',
    version=2)

Implementing for taps:

I think we can safely implement for taps and send the message by default. For cases where targets cannot tolerate the unknown message types, we should support a disable_activate_version_messages=True option.

When FULL_TABLE replication is selected in the tap:

  1. Initialize a version number (likely an epoch-based integer): https://github.com/transferwise/pipelinewise-tap-snowflake/blob/aa89f2e4235999dbeafc7406a7f8b382542d8d5b/tap_snowflake/sync_strategies/common.py#L33
  2. Include version as property within emitted RECORD messages. https://github.com/transferwise/pipelinewise-tap-snowflake/blob/aa89f2e4235999dbeafc7406a7f8b382542d8d5b/tap_snowflake/sync_strategies/common.py#L200
  3. Emit ACTIVATE_VERSION at the beginning of the first FULL_TABLE sync operation: https://github.com/transferwise/pipelinewise-tap-snowflake/blob/aa89f2e4235999dbeafc7406a7f8b382542d8d5b/tap_snowflake/sync_strategies/full_table.py#L87-L95
  4. Emit ACTIVATE_VERSION after a successful FULL_TABLE sync: https://github.com/transferwise/pipelinewise-tap-snowflake/blob/aa89f2e4235999dbeafc7406a7f8b382542d8d5b/tap_snowflake/sync_strategies/full_table.py#L114
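
A rough sketch of those tap-side steps using singer-python (stream name, record, and version value are illustrative):

import time

import singer

stream_name = "users"
version = int(time.time() * 1000)  # step 1: epoch-based version number

# Step 3: emit ACTIVATE_VERSION at the beginning of the first FULL_TABLE sync.
singer.write_message(singer.ActivateVersionMessage(stream=stream_name, version=version))

# Step 2: include the version on each emitted RECORD message.
singer.write_message(
    singer.RecordMessage(stream=stream_name, record={"id": 1}, version=version)
)

# Step 4: emit ACTIVATE_VERSION again after a successful FULL_TABLE sync.
singer.write_message(singer.ActivateVersionMessage(stream=stream_name, version=version))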

Provide spec documentation and samples for dynamic schema detection (aka dynamic discovery)

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/33

Originally created by @aaronsteers on 2021-03-03 19:51:40


All of the samples and cookiecutter boilerplate currently assume a static schema per stream. We should add an example which uses a dynamic schema detection method so that developers can more easily adopt this behavior in their own taps.

In theory, the code would look something like this:

class MyStream(...):
    ...
    @property
    def schema(self) -> dict:
        # piggy-back on self.request_url() or similar, leveraging the same auth flow
        return {... something dynamically calculated ...}
    ...

Another option is to discover and initialize each stream in the Tap class by overriding this static approach with a dynamic one:

class Tap{{ cookiecutter.source_name }}(Tap):
    """{{ cookiecutter.source_name }} tap class."""
    # ...
    def discover_streams(self) -> List[Stream]:
        """Return a list of discovered streams."""
        return [stream_class(tap=self) for stream_class in STREAM_TYPES]

Meltano extractor install ModuleNotFoundError

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/34

Originally created by @vischous on 2021-03-04 00:08:28


Trying to add my initial plugin to Meltano as a custom plugin, and I'm running into some issues. I think this is an SDK thing but I could be wrong.

I manually copied the files to extract/tap-bamboohr. Also tried this with git+https://gitlab.com/autoidm/tap-bamboohr.git and had the same issue.

Feel free to just point me somewhere, but it sounds like you appreciate the issues!

meltano.yml extractors section

  - name: tap-bamboohr
    namespace: tap_bamboohr
    pip_url: extract/tap-bamboohr
    config:
      auth_token: ***
      subdomain: ***
~/git/meltano-projects/testproject$ meltano install extractor tap-bamboohr
Installing 1 plugins...
Installing extractor 'tap-bamboohr'...
Extractor 'tap-bamboohr' could not be installed: failed to install plugin 'tap-bamboohr'.
ERROR: Exception:
Traceback (most recent call last):
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 189, in _main
    status = self.run(options, args)
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 178, in wrapper
    return func(self, options, args)
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 316, in run
    requirement_set = resolver.resolve(
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 100, in resolve
    r = self.factory.make_requirement_from_install_req(
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 301, in make_requirement_from_install_req
    cand = self._make_candidate_from_link(
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 167, in _make_candidate_from_link
    self._link_candidate_cache[link] = LinkCandidate(
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 300, in __init__
    super().__init__(
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 144, in __init__
    self.dist = self._prepare()
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 226, in _prepare
    dist = self._prepare_distribution()
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 311, in _prepare_distribution
    return self._factory.preparer.prepare_linked_requirement(
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/operations/prepare.py", line 457, in prepare_linked_requirement
    return self._prepare_linked_requirement(req, parallel_builds)
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/operations/prepare.py", line 500, in _prepare_linked_requirement
    dist = _get_prepared_distribution(
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/operations/prepare.py", line 66, in _get_prepared_distribution
    abstract_dist.prepare_distribution_metadata(finder, build_isolation)
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/distributions/sdist.py", line 39, in prepare_distribution_metadata
    self._setup_isolation(finder)
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_internal/distributions/sdist.py", line 97, in _setup_isolation
    reqs = backend.get_requires_for_build_wheel()
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_vendor/pep517/wrappers.py", line 177, in get_requires_for_build_wheel
    return self._call_hook('get_requires_for_build_wheel', {
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_vendor/pep517/wrappers.py", line 284, in _call_hook
    raise BackendUnavailable(data.get('traceback', ''))
pip._vendor.pep517.wrappers.BackendUnavailable: Traceback (most recent call last):
  File "/home/visch/git/meltano-projects/testproject/.meltano/extractors/tap-bamboohr/venv/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 86, in _build_backend
    obj = import_module(mod_path)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'poetry.core.masonry.rest'
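
The final ModuleNotFoundError names poetry.core.masonry.rest, which is not a real Poetry module; the standard Poetry build backend is poetry.core.masonry.api. The plugin's pyproject.toml isn't shown in the issue, so this is only an assumption, but a typical working build-system block looks like:

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"  # not "poetry.core.masonry.rest"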

Automated PyPI publish - [merged]

Merges feature/pypi-publish -> main

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/merge_requests/6


Adds automated PyPI publish with the following rules:

  1. Commits on main are pushed as new versions.
    • The publish will fail if the version has not been bumped since the prior release on main.
    • If the bump is forgotten, a bump commit made directly on main will resolve the failed build and redo the publish.
  2. Commits on all non-main branches are pushed as prerelease versions.
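
Expressed as a minimal Python sketch of the same gating logic (illustrative only; the real check lives in the CI configuration, and the function and argument names here are assumptions):

def publish_channel(branch: str, package_version: str, last_published_version: str) -> str:
    """Decide how a commit should be published under the rules above."""
    if branch == "main":
        if package_version == last_published_version:
            # The version was not bumped since the prior release on main: fail the
            # build; a follow-up bump commit on main will redo the publish.
            raise RuntimeError("Bump the package version before publishing from main.")
        return "release"
    # Every non-main branch is published as a prerelease version.
    return "prerelease"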

Properly handle stream selection and `selected_properties`, including both `RECORD` and `SCHEMA` messages

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/7

Originally created by @aaronsteers on 2020-12-12 03:13:16


This would add proper stream selection and property selection via the input catalog (when provided).

Approach

Following from the discussion here, the selection filtering should occur at four steps in the process:

  1. Selecting or deselecting entire streams.
  2. As an input to get_records() - so tap developers can avoid collecting data they don't need.
    • This is available today in some manner, if the developer interrogates Tap.input_catalog (not a trivial effort).
  3. As a property filter on the standard (base class) transformation from raw data into the RECORD message.
    • Failsafe for excessive fields being returned from get_records().
  4. As a property filter on the standard (base class) generation of SCHEMA messages.
    • Many taps don't perform this filtering today, which results in downstream columns being unnecessarily (and confusingly) created in the target's destination tables.
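
A minimal sketch of steps 3 and 4 (illustrative only, not the SDK's actual implementation): given the set of property names selected in the input catalog, filter both the outgoing RECORD payload and the properties advertised in the SCHEMA message.

from typing import Any, Dict, Set


def filter_record(record: Dict[str, Any], selected: Set[str]) -> Dict[str, Any]:
    """Step 3: drop deselected properties before writing the RECORD message."""
    return {key: value for key, value in record.items() if key in selected}


def filter_schema(schema: Dict[str, Any], selected: Set[str]) -> Dict[str, Any]:
    """Step 4: advertise only the selected properties in the SCHEMA message."""
    properties = schema.get("properties", {})
    return {
        **schema,
        "properties": {key: value for key, value in properties.items() if key in selected},
    }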

Spec re: dynamic param resolution for urls, url params, and query params

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/25

Originally created by @aaronsteers on 2021-01-23 02:52:21


The following discussion from !1 should be addressed:

  • @DouweM started a discussion:

    I was initially confused by this name, thinking it returned a dict of key/values that would be used as the URL's query params (e.g. {"foo": "bar"} becomes foo=bar), rather than variables that can be used in the URL template.

    What do you think about path_variables (if we rename url_suffix to path) or url_variables? That assumes this'd also become a property rather than an explicit getter.


The way it's currently written, Stream.get_params() does as its docstring says:

Return a dictionary of values to be used in parameterization.

By default, this includes all settings which are not secrets, along with any
stored values from the stream or partition state, as passed via the
`stream_or_partition_state` argument.

The three places where something like this is needed:

  1. URL parameterization - for classes based on RESTStream, the path can be dynamically calculated from parent stream keys or hardcoded param values.
  2. REST params and payloads - for RESTStream URL calls which require params, and for GraphQL, where variables play a similar role.
  3. Dynamic SQL queries for classes based on DatabaseStream.

The generic implementation just cleans out any identifiable secrets (to not accidentally leak credentials into external systems) and adds in any bookmark keys which might be relevant based on the context.

    def get_params(self, stream_or_partition_state: dict) -> dict:
        result = {
            k: v for k, v in self.config.items() if not isinstance(v, SecretString)
        }
        result.update(stream_or_partition_state)
        return result

This basically works as-is but I wanted to create this issue to discuss trade-offs of this implementation versus alternatives:

  • splitting out each use case separately (i.e. url_params, query_params, variables, etc.).
  • requiring the developer to explicitly declare variable collections for one or more of these use cases.
  • defaulting to also include secrets, possibly with other mitigation steps or overrides available.
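
For illustration, a hedged sketch of the first alternative (splitting each use case into its own hook). None of these method names are SDK API; they only show the shape such a split might take. (For comparison, the SDK's RESTStream did later gain hooks in this spirit, such as get_url_params.)

from urllib.parse import urlencode


class MyRESTStream:
    """Hypothetical stream demonstrating per-use-case parameter hooks."""

    path = "/accounts/{account_id}/invoices"

    def __init__(self, config: dict):
        self.config = config

    def get_url_variables(self, context: dict) -> dict:
        """Values interpolated into the URL path template only."""
        return {"account_id": self.config["account_id"]}

    def get_query_params(self, context: dict) -> dict:
        """Values sent as HTTP query-string parameters only."""
        params = {"per_page": 100}
        if context and "replication_key_value" in context:
            params["updated_since"] = context["replication_key_value"]
        return params

    def build_url(self, base_url: str, context: dict) -> str:
        """Combine both hooks into a full request URL."""
        path = self.path.format(**self.get_url_variables(context))
        return f"{base_url}{path}?{urlencode(self.get_query_params(context))}"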

Inventory of working samples

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/32

Originally created by @aaronsteers on 2021-03-03 19:26:58


Until we get the new Singer Hub up and running, it could be helpful for prospective SDK users to be able to browse other developers' working examples. Should we add a section in the repo that links to an inventory of working examples?

In the interim, here are links to the 4 WIP implementations I am currently aware of:

And (at least?) one more private repo which is being built by Edgar Ramírez Mondragón as noted here in slack: https://meltano.slack.com/archives/C01PKLU5D1R/p1614382174006900
