porter's Introduction

Porter


Durable and asynchronous data imports for consuming data at scale and publishing testable SDKs

Porter is the all-purpose PHP data importer. She fetches data from APIs, web scraping or anywhere and serves it as an iterable record collection, encouraging processing one record at a time instead of loading full data sets into memory. Durability features provide automatic, transparent recovery from intermittent network errors by default.

Porter's interface triad of providers, resources and connectors allows us to publish testable SDKs and maps well to APIs and HTTP endpoints. For example, a typical API such as GitHub would define the provider as GitHubProvider, a resource as GetUser or ListRepositories and the connector could be HttpConnector.

Porter supports asynchronous imports via fibers (PHP 8.1), allowing multiple imports to be started, paused and resumed concurrently. Async allows us to import data as fast as possible, transforming applications from network-bound (slow) to CPU-bound (optimal). Throttle support ensures we do not exceed peer connection or throughput limits.


Contents

  1. Benefits
  2. Quick start
  3. About this manual
  4. Usage
  5. Porter's API
  6. Overview
  7. Import specifications
  8. Record collections
  9. Asynchronous
  10. Transformers
  11. Filtering
  12. Durability
  13. Caching
  14. Architecture
  15. Providers
  16. Resources
  17. Connectors
  18. Limitations
  19. Testing
  20. Contributing
  21. License

Benefits

  • Defines an easily testable interface triad for data imports: providers represent one or more resources that fetch data from connectors. These interfaces make it very easy to test and mock specific parts of the import lifecycle using industry standard tools, whether we want to mock at the connector level and feed in raw responses or mock at the resource level and supply ready-hydrated objects.
  • Provides memory-efficient data processing interfaces that handle large data sets one record at a time, via iterators, which can be implemented using deferred execution with generators.
  • Asynchronous imports offer highly efficient CPU-bound data processing for large scale imports across multiple connections concurrently, eliminating network latency performance bottlenecks. Concurrency can be rate-limited using throttling.
  • Protects against intermittent network failures with durability features that transparently and automatically retry failed data fetches.
  • Offers post-import transformations, such as filtering and mapping, to transform third-party data into useful data for our applications.
  • Supports PSR-6 caching, at the connector level, for each fetch operation.
  • Automatically joins two or more linked data sets together using sub-imports.

Quick start

To get started quickly by consuming an existing Porter provider, try one of our quick start guides: the plain PHP quickstart or the Symfony quickstart.

For a more thorough introduction continue reading.

About this manual

Those consuming a Porter provider create one instance of Porter for their application and an instance of Import for each data import they wish to perform. Those publishing providers must implement Provider and ProviderResource.

The first half of this manual covers Porter's main API for consuming data services. The second half covers architecture, interface and implementation details for publishing data services. There's an intermission in-between, so you'll know where the separation is!

Text marked as inline code denotes literal code, as it would appear in a PHP file. For example, `Porter` refers specifically to the class of that name within this library, whereas Porter in plain text refers to this project as a whole.

Usage

Creating the container

Create a new Porter instance—we'll usually only need one per application. Porter's constructor requires a PSR-11 compatible ContainerInterface that acts as a repository of providers.

When integrating Porter into a typical MVC framework application, we'll usually have a service locator or DI container implementing this interface already. We can simply inject the entire container into Porter, although it's best practice to create a separate container just for Porter's providers. For an example of doing this correctly in Symfony, see the Symfony quickstart.

Without a framework, pick any PSR-11 compatible library and inject an instance of its container class. We could even write our own container since the interface is easy to implement, but using an existing library is beneficial, particularly since most support lazy-loading of services. If you're not sure which to use, Joomla DI seems fairly simple and light.
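
For illustration, the following minimal sketch hand-rolls a tiny PSR-11 container and injects it into Porter. The anonymous container class and its service map are purely hypothetical; in practice any PSR-11 library or your framework's container fills this role.

use Psr\Container\ContainerInterface;

// Hypothetical minimal container: an immutable map of service name => instance.
// A real container must throw a NotFoundExceptionInterface exception for unknown ids.
$container = new class ([
    // Provider services are registered here; see "Registering providers" below.
]) implements ContainerInterface {
    public function __construct(private readonly array $services)
    {
    }

    public function get($id)
    {
        return $this->services[$id];
    }

    public function has($id): bool
    {
        return isset($this->services[$id]);
    }
};

$porter = new Porter($container);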

Registering providers

Configure the container by registering one or more Porter providers. In this example we'll add the ECB provider for foreign exchange rates. Most provider libraries will export just one provider class; in this case it's EuropeanCentralBankProvider. We could add the provider to the container by writing something similar to $container->set(EuropeanCentralBankProvider::class, new EuropeanCentralBankProvider), but consult the manual for your particular container implementation for the exact syntax.

It is recommended to use the provider's class name as the container service name, as in the example in the previous paragraph. Porter will retrieve the service matching the provider's class name by default, so this reduces friction when getting started. If we use a different service name, it will need to be configured on the Import by calling setProviderName().
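
If a custom service name is used instead of the class name, the import must be pointed at that service. The following is a hedged sketch, assuming the ECB provider was registered under the hypothetical service name 'ecb.provider'.

$import = new Import(new DailyForexRates);
$import->setProviderName('ecb.provider');

$records = $porter->import($import);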

Importing data

Porter's import method accepts an Import that describes which data should be imported and how the data should be transformed. To import DailyForexRates without applying any transformations we can write the following.

$records = $porter->import(new Import(new DailyForexRates));

Calling import() returns an instance of PorterRecords or CountablePorterRecords, which both implement Iterator, allowing each record in the collection to be enumerated using foreach as in the following example.

foreach ($records as $record) {
    var_dump($record);
}

Porter's API

Porter's simple API comprises data import methods that must always be used to begin imports, instead of calling methods directly on providers or resources, in order to take advantage of Porter's features correctly.

Porter provides just two public methods for importing data. These are the methods to be most familiar with, where the life of a data import operation begins.

  • import(Import): PorterRecords|CountablePorterRecords – Imports one or more records from the resource contained in the specified import specification. If the total size of the collection is known, CountablePorterRecords is returned, which implements Countable; otherwise PorterRecords is returned.
  • importOne(Import): ?array – Imports one record from the resource contained in the specified import specification. If more than one record is imported, ImportException is thrown. Use this when a resource implements SingleRecordResource, signalling that it returns just a single record (see the example below).
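
For example, a minimal sketch of importing a single record, assuming a hypothetical MyLookupResource that implements SingleRecordResource:

$record = $porter->importOne(new Import(new MyLookupResource));

// importOne() returns the record array, or null if the resource yielded nothing.
var_dump($record);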

Overview

The following data flow diagram gives a high level overview of Porter's main interfaces and the data flows between them when importing data. Note that we use the term resource for brevity, but the interface is actually called ProviderResource, because resource is a reserved word in PHP.

Data flow diagram

Our application calls Porter::import() with an Import and receives PorterRecords back. Everything else happens internally, so we don't need to worry about it unless writing custom providers, resources or connectors.

Import specifications

Import specifications specify what to import, how it should be transformed and whether to use caching. Create a new instance of Import and pass a ProviderResource that specifies the resource we want to import.

Options may be configured using the methods below; a combined example follows the list.

  • setProviderName(string) – Sets the provider service name.
  • addTransformer(Transformer) – Adds a transformer to the end of the transformation queue.
  • addTransformers(Transformer[]) – Adds one or more transformers to the end of the transformation queue.
  • setContext(mixed) – Specifies user-defined data to be passed to transformers.
  • enableCache() – Enables caching. Requires a CachingConnector.
  • setMaxFetchAttempts(int) – Sets the maximum number of fetch attempts per connection before failure is considered permanent.
  • setFetchExceptionHandler(FetchExceptionHandler) – Sets the exception handler invoked each time a fetch attempt fails.
  • setThrottle(Throttle) – Sets the connection throttle, invoked each time a connector fetches data.
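
The following sketch combines several of these options on one Import. The 'rate' field and the minimum-rate filter are illustrative assumptions about the shape of the ECB records; the option methods themselves are those listed above.

$import = new Import(new DailyForexRates);
$import->setMaxFetchAttempts(3);
$import->setContext(['minimumRate' => 1.0]);
$import->addTransformer(new FilterTransformer(
    // The predicate receives each record and the user-defined context.
    static fn (array $record, $context) => $record['rate'] >= $context['minimumRate']
));

$records = $porter->import($import);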

Record collections

Record collections are Iterators, guaranteeing imported data is enumerable using foreach. Each record of the collection is the familiar and flexible array type, allowing us to present structured or flat data, such as JSON, XML or CSV, as an array.

Details

Record collections may be Countable, depending on whether the imported data was countable and whether any destructive operations were performed after import. Filtering is a destructive operation since it may remove records and therefore the count reported by a ProviderResource would no longer be accurate. It is the responsibility of the resource to supply the total number of records in its collection by returning an iterator that implements Countable, such as ArrayIterator, or more commonly, CountableProviderRecords. When a countable iterator is used, Porter returns CountablePorterRecords, provided no destructive operations were performed.
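
For example, a caller can test for countability before deciding how to report progress. A minimal sketch using the ECB resource from earlier:

$records = $porter->import(new Import(new DailyForexRates));

if ($records instanceof \Countable) {
    echo 'Importing ', count($records), " records.\n";
}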

Record collections are composed by Porter using the decorator pattern. If provider data is not modified, PorterRecords will decorate the ProviderRecords returned from a ProviderResource. That is, PorterRecords has a pointer back to the previous collection, which could be written as: PorterRecords → ProviderRecords. If a filter was applied, the collection stack would be PorterRecords → FilteredRecords → ProviderRecords. Normally this is an unimportant detail but can sometimes be useful for debugging.

The stack of record collection types informs us of the transformations a collection has undergone and each type holds a pointer to relevant objects that participated in the transformation. For example, PorterRecords holds a reference to the Import that was used to create it and can be accessed using PorterRecords::getImport.

Metadata

Since record collections are just objects, it is possible to define derived types that implement custom fields to expose additional metadata in addition to the iterated data. Collections are very good at representing a repeating series of data but some APIs send additional non-repeating data which we can expose as metadata. However, if the data is not repeating at all, it should be treated as a single record rather than metadata.

The result of a successful Porter::import call is always an instance of PorterRecords or CountablePorterRecords, depending on whether the number of records is known. If we need to access methods of the original collection, returned by the provider, we can call findFirstCollection() on the collection. For an example, see CurrencyRecords of the European Central Bank Provider and its associated test case.
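
A brief sketch of reaching the provider's original collection, assuming the ECB example cited above; any custom metadata methods would then be called on the returned object.

$records = $porter->import(new Import(new DailyForexRates));

// Returns the innermost collection created by the provider, e.g. CurrencyRecords.
$providerCollection = $records->findFirstCollection();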

Asynchronous

Porter has had asynchronous support since version 5 (2019) thanks to Amp integration. In v5, async was implemented with coroutines, but from version 6 onwards, Porter uses the simpler fibers model. Fiber support is included in PHP 8.1 and can be added to PHP 8.0 using ext-fiber. PHP 7 does not support fibers, so if you are stuck with that version of PHP, coroutines are the only option. It is strongly recommended to upgrade to PHP 8.1 to use async, to avoid unnecessary bugs leading to segfaults and to avoid getting trapped in the coroutine architecture that is cumbersome to upgrade, difficult to debug and harder to reason about.

In version 5, Porter offered a dual API to support the asynchronous code path. That is, Porter::import had the async analogue Porter::importAsync, and Porter::importOne had Porter::importOneAsync. In version 6 we switched to fibers but kept the dual API to make migrating from coroutines to fibers slightly easier. In version 7, we unified the dual API because async with fibers can be almost entirely transparent: the synchronous and asynchronous code paths are identical, so we don't even have to think about async unless and until we want to start leveraging its benefits in our application.

To use async in Porter v7 onwards, simply wrap an import() or importOne() call in an async() call using one of the following two methods.

use function Amp\async;

async(
    $this->porter->import(...),
    new Import(new MyResource())
);

// -OR-

async(fn () => $this->porter->import(new Import(new MyResource())));

In order for this to work, the only requirement is that the underlying connector supports fibers. To know whether a particular connector supports fibers, consult its documentation. The most common connector, HttpConnector, already has fiber support.

Calling async() returns a Future representing the eventual result of an asynchronous operation. How futures are composed and abstracted, and how to await and iterate collections of futures, is beyond the scope of this document. Full details about async programming can be found in the official Amp documentation.
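
For example, the following sketch starts two imports concurrently and then blocks on each result, assuming Amp v3's Future::await() and two hypothetical resources whose connector supports fibers.

use function Amp\async;

$first = async(fn () => $this->porter->import(new Import(new MyResource)));
$second = async(fn () => $this->porter->import(new Import(new MyOtherResource)));

// Each await() suspends the current fiber until that import completes.
$firstRecords = $first->await();
$secondRecords = $second->await();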

Note: At the time of writing, Amp v3 is still in beta, so you may find it necessary to lower a project's minimum stability to include Amp packages, via composer.json:

"minimum-stability": "beta"

To avoid pulling in any betas other than those absolutely necessary for the dependency solver to be satisfied, it is recommended to also set stable packages as the preferred stability when using the above setting.

"prefer-stable": true

Throttling

The asynchronous import model is very powerful because it changes our application's performance model from I/O-bound, limited by the speed of the network, to CPU-bound, limited by the speed of the CPU. In the traditional synchronous model, each import operation must wait for the previous one to complete before the next begins, meaning the total import time depends on how long each import's network I/O takes to finish. In the async model, we send many requests concurrently without waiting for the previous to complete. On average, each import operation only takes as long as our CPU takes to process it, since we are busy processing other imports during network latency (except during the initial "spin-up").

Synchronously, we seldom trip protection measures even for high-volume imports; however, the naïve approach to asynchronous imports is often fraught with peril. If we import 10,000 HTTP resources at once, one of two things usually happens: either we run out of PHP memory and the process terminates prematurely, or the HTTP server rejects us after we send too many requests in a short period. The solution is throttling.

Async Throttle is included with Porter to throttle asynchronous imports. The throttle works by preventing additional operations starting when too many are executing concurrently, based on user-defined limits. By default, NullThrottle is assigned, which does not throttle connections. DualThrottle can be used to set two independent connection rate limits: the maximum number of connections per second and the maximum number of concurrent connections.

A DualThrottle can be assigned by modifying the import specification as follows.

(new Import)->setThrottle(new DualThrottle)

ThrottledConnector

A throttle can be assigned to a connector implementing the ThrottledConnector interface. This allows a provider to apply a throttle to all its resources by default. When a throttle is assigned to both a connector and an import specification, the specification's throttle takes priority. If the connector we want to use does not implement ThrottledConnector, simply extend the connector and implement the interface.

Implementing ThrottledConnector is likely to be preferable when we want many resources to share the same throttle or when we want to inject the throttle using dependency injection, since specifications are typically instantiated inline whereas connectors are not. That is, we would usually declare connectors in our application framework's service configuration.

Transformers

Transformers manipulate imported data. Transforming data is useful because third-party data seldom arrives in a format that looks exactly as we want. Transformers are added to the transformation queue of an Import by calling its addTransformer method and are executed in the order they are added.

Porter includes one transformer, FilterTransformer, that removes records from the collection based on a predicate. For more information, see filtering. More powerful data transformations can be designed with MappingTransformer. More transformers may be available from Porter transformers.

Writing a transformer

Transformers implement the Transformer and/or AsyncTransformer interfaces that define one or more of the following methods.

public function transform(RecordCollection $records, mixed $context): RecordCollection;

public function transformAsync(AsyncRecordCollection $records, mixed $context): AsyncRecordCollection;

When transform() or transformAsync() is called the transformer may iterate each record and change it in any way, including removing or inserting additional records. The record collection must be returned by the method, whether or not changes were made.

Transformers should also implement the __clone magic method if they store any object state, in order to facilitate deep copy when Porter clones the owning Import during import.
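
For instance, a transformer holding a mutable collaborator might deep-copy it like this (a sketch; $this->mapping is a hypothetical property):

public function __clone()
{
    // Deep-copy object state so the cloned transformer is independent of the original.
    $this->mapping = clone $this->mapping;
}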

Filtering

Filtering provides a way to remove some records. For each record, if the specified predicate function returns false (or a falsy value), the record will be removed, otherwise the record will be kept. The predicate receives the current record as an array as its first parameter and context as its second parameter.

In general, we would like to avoid filtering because it is inefficient to import data and then immediately remove some of it, but some immature APIs do not provide a way to reduce the data set on the server, so filtering on the client is the only alternative. Filtering also invalidates the record count reported by some resources, meaning we no longer know how many records are in the collection before iteration.

Example

The following example filters out any records that do not have an id field present.

$records = $porter->import(
    (new Import(new MyResource))
        ->addTransformer(
            new FilterTransformer(static function (array $record) {
                return array_key_exists('id', $record);
            })
        )
);

Durability

Porter automatically retries connections when an exception occurs during Connector::fetch. This helps mitigate intermittent network conditions that cause temporary data fetch failures. The number of retry attempts can be configured by calling the setMaxFetchAttempts method of an Import.
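
For example, the following sketch raises the limit to ten attempts for a hypothetical MyResource.

$import = new Import(new MyResource);
$import->setMaxFetchAttempts(10);

$records = $porter->import($import);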

The default exception handler, ExponentialSleepFetchExceptionHandler, causes a failed fetch to pause the entire program for a series of increasing delays, doubling each time. Given that the default number of retry attempts is five, the exception handler may be called up to four times, delaying each retry attempt for ~0.1, ~0.2, ~0.4, and finally, ~0.8 seconds. After the fifth and final failure, FailingTooHardException is thrown.

The exception handler can be changed by calling setFetchExceptionHandler. For example, the following code changes the initial retry delay to one second.

$specification->setFetchExceptionHandler(new ExponentialSleepFetchExceptionHandler(1000000));

Durability only applies when connectors throw a recoverable exception type derived from RecoverableConnectorException. If an unexpected exception occurs the fetch attempt will be aborted. For more information, see implementing connector durability. Exception handlers receive the thrown exception as their first argument. An exception handler can inspect the recoverable exception and throw its own exception if it decides the exception should be treated as fatal instead of recoverable.

Caching

Any connector can be wrapped in a CachingConnector to provide PSR-6 caching facilities to the base connector. Porter ships with one cache implementation, MemoryCache, which caches fetched data in memory, but this can be substituted for any other PSR-6 cache implementation. The CachingConnector caches raw responses for each unique request, where uniqueness is determined by DataSource::computeHash.

Remember that whilst using a CachingConnector enables caching, caching must also be enabled on a per-import basis by calling Import::enableCache().

Example

The following example enables connector caching.

$records = $porter->import(
    (new Import(new MyResource))
        ->enableCache()
);
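
On the provider side, the base connector must actually be wrapped. The following sketch assumes CachingConnector accepts the connector to wrap via its constructor (backed by MemoryCache by default); check the class's constructor for the exact signature.

final class MyCachingProvider implements Provider
{
    private Connector $connector;

    public function __construct()
    {
        // Assumed constructor: wrap the HTTP connector to enable PSR-6 caching.
        $this->connector = new CachingConnector(new HttpConnector);
    }

    public function getConnector(): Connector
    {
        return $this->connector;
    }
}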

INTERMISSION ☕️

Congratulations! We have covered everything needed to use Porter.

The rest of this readme is for those wishing to go deeper. Continue when you're ready to learn how to write providers, resources and connectors.


Architecture

The following UML class diagram shows a partial architectural overview illustrating Porter's main components and how they are related. Asynchronous implementation details are mostly omitted since they mirror the synchronous system. [enlarge]

Class diagram

Providers

Providers supply their ProviderResource objects with a Connector. The provider must ensure it supplies a connector of the correct type for accessing its service's resources. A provider implements Provider that defines one method with the following signature.

public function getConnector(): Connector;

A provider neither knows how many resources it has nor maintains a list of them, and neither does any other part of Porter. That is, a resource class can be created at any time and claim to belong to a given provider without any formal registration.

Writing a provider

Providers must implement the Provider interface and supply a valid connector when getConnector is called. From Porter's perspective, writing a provider often requires little more than supplying the correct type hint when storing a connector instance, but we can embellish the class with any other features we may want. For HTTP service providers, it is common to add a base URL constant and some static methods to compose URLs, reducing code duplication in its resources.

Implementation example

In the following example we create a provider that only accepts HttpConnector instances. We also create a default connector in case one is not supplied. Note it is not always possible to create a default connector, and it is perfectly valid to insist the caller supplies a connector.

final class MyProvider implements Provider
{
    private $connector;

    public function __construct(?Connector $connector = null)
    {
        $this->connector = $connector ?: new HttpConnector;
    }

    public function getConnector(): Connector
    {
        return $this->connector;
    }
}

Resources

Resources fetch data using the supplied connector and format it as a collection of arrays. A resource implements ProviderResource, which defines the following two methods.

public function getProviderClassName(): string;
public function fetch(ImportConnector $connector): \Iterator;

A resource supplies the class name of the provider it expects a connector from when getProviderClassName() is called.

When fetch() is called it is passed the connector from which data must be fetched. The resource must ensure data is formatted as an iterator of array values whilst remaining as true to the original format as possible; that is, we must avoid renaming or restructuring data because it is the caller's prerogative to perform data customization if desired. The recommended way to return an iterator is to use yield to implicitly return a Generator, which has the added benefit of processing one record at a time.

The fetch method receives an ImportConnector, which is a runtime wrapper for the underlying connector supplied by the provider. This wrapper is used to isolate the connector's state from the rest of the application. Since PHP doesn't have native immutability support, working with cloned state is the only way we can guarantee unexpected changes do not occur once an import has started. This means it's safe to import one resource, make changes to the connector's settings and then start another import before the first has completed. Providers can also safely make changes to the underlying connector by calling getWrappedConnector(), because the wrapped connector is cloned as soon as ImportConnector is constructed.

Providing immutability via cloning is an important concept because resources are often implemented using generators, which implies delayed code execution. Multiple fetches can be started with different settings, but execute in a different order some time later when they're finally enumerated. This is even more pertinent for asynchronous fetches, where multiple fetches execute concurrently. However, we don't need to worry about this implementation detail unless writing a connector ourselves.

Writing a resource

Resources must implement the ProviderResource interface. getProviderClassName() usually returns a hard-coded provider class name and fetch() must always return an iterator of array values.

In this contrived example that uses dummy data and ignores the connector, suppose we want to return the numeric series one to three: the following implementation would be invalid because it returns an iterator of integer values instead of an iterator of array values.

public function fetch(ImportConnector $connector): \Iterator
{
    return new ArrayIterator(range(1, 3)); // Invalid return type.
}

Either of the following fetch() implementations would be valid.

public function fetch(ImportConnector $connector): \Iterator
{
    foreach (range(1, 3) as $number) {
        yield [$number];
    }
}

Since the total number of records is known, the iterator can be wrapped in CountableProviderRecords to enrich the caller with this information.

public function fetch(ImportConnector $connector): \Iterator
{
    $series = function ($limit) {
        foreach (range(1, $limit) as $number) {
            yield [$number];
        }
    };

    return new CountableProviderRecords($series($count = 3), $count, $this);
}

Implementation example

In the following example we create a resource that receives a connector from MyProvider and uses it to retrieve data from a hard-coded URL. We expect the data to be JSON encoded so we decode it into an array and use yield to return it as a single-item iterator.

class MyResource implements ProviderResource, SingleRecordResource
{
    private const URL = 'https://example.com';

    public function getProviderClassName(): string
    {
        return MyProvider::class;
    }

    public function fetch(ImportConnector $connector): \Iterator
    {
        $data = $connector->fetch(self::URL);

        yield json_decode($data, true);
    }
}

If the data represents a repeating series, yield each record separately instead, as in the following example, and remove the SingleRecordResource marker interface.

public function fetch(ImportConnector $connector): \Iterator
{
    $data = $connector->fetch(self::URL);

    foreach (json_decode($data, true) as $datum) {
        yield $datum;
    }
}

Exception handling

Unrecoverable exceptions will be thrown and can be caught as normal, but good connector implementations will wrap their connection attempts in a retry block and throw a RecoverableConnectorException. The only way to intercept a recoverable exception is by attaching a FetchExceptionHandler to the ImportConnector by calling its setExceptionHandler() method. Exception handlers cannot be used for flow control because their return values are ignored, so the main application of such handlers is to re-throw recoverable exceptions as non-recoverable exceptions.

Connectors

Connectors fetch remote data from a source specified at fetch time. Connectors for popular protocols are available from Porter connectors. It might be necessary to write a new connector if dealing with uncommon or currently unsupported protocols. Writing providers and resources is a common task that should be fairly easy but writing a connector is less common.

Writing a connector

A connector implements the Connector interface that defines one method with the following signature.

public function fetch(DataSource $source): mixed;

When fetch() is called the connector fetches data from the specified data source. Connectors may return data in any format that's convenient for resources to consume, but in general, such data should be as raw as possible and without modification. If multiple pieces of information are returned it is recommended to use a specialized object, like the HttpResponse returned by the HTTP connector that contains the response headers and body together.

Data sources

The DataSource interface must be implemented to supply the necessary parameters for a connector to locate a data source. For an HTTP connector, this might include URL, method, body and headers. For a database connector, this might be a SQL query.

DataSource specifies one method with the following signature.

public function computeHash(): string;

Data sources are required to return a unique hash for their state. If the state changes, the hash must change. If states are effectively equivalent, the hash must be the same. This is used by the cache system to determine whether the fetch operation has been seen before and thus can be served from the cache rather than fetching fresh data again.

It is important to define a canonical order for hashed inputs such that identical state presented in different orders does not create different hash values. For example, we might sort HTTP headers alphabetically before hashing because header order is not significant and reordering headers should not produce different output.
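
Putting the two interfaces together, the following sketch shows a hypothetical file-based connector and its data source. Only the Connector and DataSource interfaces are taken from Porter; everything else is illustrative.

final class FilePathSource implements DataSource
{
    public function __construct(private readonly string $path)
    {
    }

    public function getPath(): string
    {
        return $this->path;
    }

    public function computeHash(): string
    {
        // The path alone identifies this source, so hashing it is sufficient.
        return md5($this->path);
    }
}

final class FileConnector implements Connector
{
    public function fetch(DataSource $source): mixed
    {
        if (!$source instanceof FilePathSource) {
            throw new \InvalidArgumentException('Unsupported data source: ' . $source::class);
        }

        // Return the raw contents unmodified; a production connector would throw a
        // RecoverableConnectorException subclass on transient failures (see Durability below).
        return file_get_contents($source->getPath());
    }
}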

Durability

To support Porter's durability features a connector may throw a subclass of RecoverableConnectorException to signal that the fetch operation can be retried. Execution will halt as normal if any other exception type is thrown. It is recommended to throw a recoverable exception type when the fetch operation is idempotent.

Limitations

Current limitations that may affect some users and should be addressed in the near future.

  • No end-to-end data streaming interface.

Testing

Porter is fully unit and mutation tested.

  • Run unit tests with the composer test command.
  • Run mutation tests with the composer mutation command.

Contributing

Everyone is welcome to contribute anything, from ideas and issues to code and documentation!

License

Porter is published under the open source GNU Lesser General Public License v3.0. However, the original Porter character and artwork is copyright © 2022 Bilge and may not be reproduced or modified without express written permission.

Support

Porter is supported by JetBrains for Open Source products.

porter's People

Contributors

a-barzanti, bilge, markchalloner, samvdb


porter's Issues

Laravel and CachingConnector?

How to enable CachingConnector in Laravel?

    public function handle() {
        app()->bind(HttpConnector::class, CachingConnector::class);

        app()->bind(EuropeanCentralBankProvider::class, EuropeanCentralBankProvider::class);

        $porter = new Porter(app());

        $specification = new ImportSpecification(new DailyForexRates());
        $specification->enableCache();
        $rates = $porter->import($specification);

        foreach ($rates as $rate) {
            echo "$rate[currency]: $rate[rate]\n";
        }
    }

ScriptFUSION\Porter\Cache\CacheUnavailableException : Cannot cache: connector does not support caching.

Dev mode

The introduction of a developer mode would allow an opinionated preset to be applied to Porter's feature set, in contrast to its defaults, enabling or disabling certain features or modifying default values to be more conducive to development work.

For example, developer mode may:

  • reduce automatic retries from 5 -> 1
  • <add more ideas here...>

Rate Limiter

Any thoughts on adding some type of rate limiter functionality, so as not to clobber the servers?

Dependency on psr/cache:^1

I just wanted to take Porter for a quick spin, created a new Symfony project and tried to require the Porter package, resulting in this error:

scriptfusion/porter 7.0.0 requires psr/cache ^1 -> found psr/cache[1.0.0, 1.0.1] but the package is fixed to 3.0.0 (lock file version)

Is an update feasible, ideally for psr/container as well?

ConnectorOptions is a bad design

At first glance, one would think tying options to a connector would create concurrency issues, where two requests could set different options on the connector at the same time. Due to cloning, this is not an issue; however, the problems with ConnectorOptions reach further than just potential concurrency issues. Since connectors may be decorated, finding the options we need to modify often means traversing the stack of connectors from ImportConnector down, which is cumbersome and error-prone.

We cannot simply remove connector options and let implementations do as they please because the cache needs knowledge of the particular options exported by the connector in order to determine whether two requests are identical and thus the cache may be reused.

We propose changing the signature of fetch(string) to fetch(object), where object is some implementation-defined object that encapsulates both the original source string plus the connector options. In this way, everything needed to define the request is passed through all connectors in the stack and can be inspected or modified as needs be when it passes through. This also precludes the need to clone the connector (and its options), which makes implementations much easier and cleaner.

This change would be a BC break, and moreover, the signature is less convenient than simply passing a string, which can be sufficient for HTTP GET requests and some others. It is a consideration that we may support object|string, however this does complicate the interface and make it more taxing to implement.

Rather than just fetch(object) where object is literally typed to object, which is unsupported in PHP 7.1 anyway, we should probably have a Source interface that specifies toArray and serializes all configurable options as an array, for use with caching.

Add FAQ or cookbook to documentation

Add examples either in the form of an FAQ or "cookbook" to demonstrate pattern solutions to common problems.

Scenarios:

  • Importing binary data
  • Import two or more collections at once (collections of collections)

[BC-BREAK] scriptfusion/retry 1.1.2

Hi,

When running porter 3.*, retry 1.1.2 will be installed because of the following composer requirement:

"scriptfusion/retry": "^1.1",

The retry lib works on 1.1.1 with porter; upgrading to 1.1.2 breaks stuff.

Specific lines in the retry lib that are triggered:

if ($result instanceof \Generator) {
    throw new \UnexpectedValueException('Cannot retry a Generator. You probably meant something else.');
}

Porter causes this because a generator is returned in Porter.php line 98

function () use ($provider, $resource) {
    if (($records = $provider->fetch($resource)) instanceof \Iterator) {
        // Force generator to run until first yield to provoke an exception.
        $records->valid();
    }

    return $records;     <----- this breaks
},

Document multiple instances of same provider

Although we normally add a provider to the container by its class name and expect a single instance of each provider in the container, there are many valid use cases for adding the same provider multiple times. Document these use cases with examples and how-tos.

Often, we may operate multiple accounts with a given provider for various reasons. Examples:

  • Multiple Stripe accounts for handling payments in different currencies
  • Multiple Discord bots to leverage separate request rate limits

Spawn a temporary SOAP server to test SoapConnector

The only file not fully tested, and thus preventing 100% code coverage, is SoapConnector. Its analogue, HttpConnector, is tested by the functional test, HttpConnectorTest, that spawns a temporary HTTP server using php -S to test the connector. In a similar fashion I suggest spawning a temporary SOAP server to test SoapConnector, however I do not know the best way to do this.

A question posted to StackOverflow asking how to write a minimum valid WSDL has received no answers.

ExponentialAsyncDelayRecoverableExceptionHandler not being cloned correctly

A recent high-concurrency import, which fails catastrophically when the target service is down, produced an integer overflow indicating that state is somehow being shared across the default implementation of the recoverable exception handler.

A debugging session shows the handler is being cloned, and initialize() is called at least once, but somehow the series of delays keeps growing beyond the default five retries.

In case it matters, the specific resource implementation calls fetchAsync() 80 times, but each call should still be independent as the ImportConnector clones a new handler for each fetch*() call.

Integrate async throttle

It was previously thought that directly integrating the async throttle with Porter was not needed because we can just throttle high level Porter import operations. However, this is false, for two reasons:

  1. Internally, fetches can be retried (by default up to 5 times).
  2. Any given import may pull down any number of resources to satisfy the import operation. The most common case is enumerating a paginated resource that results in n requests for n pages.

Each of these additional requests must be throttled independently to avoid triggering limits, whether a retry or the next resource in a sequence. For this to be possible, the throttle must be integrated into ImportConnector so it can throttle transparently without burdening the developer with additional calls or configuration.

A default throttle should be provided for async imports but it should be possible to override with a custom configuration or implementation via AsyncImportSpecification. Throttling will not be available for sync imports until such a time as the sync API converges with the async API internally.

Drop PHP 5 support

It is currently planned to drop support for PHP 5 and target either 7.0 or 7.1 for Porter v5.

Fix specification cloning in Porter::import()

The specification is cloned too late during import() because members of the specification are shared with other objects before cloning takes place, thus creating shared mutable state. The specification must be cloned before any of its members are shared.

Durability is broken for subsequent generator iterations after the first

Durability is provided for the $provider->fetch call, but Provider::fetch is declared to return Iterator, which is typically implemented using generators. Generators imply deferred code execution, which means that even if the generator throws an exception, it is not caught by the retry handler because execution has already left that code block.

This common case is not captured by PorterTest because it only tests that Provider::fetch throws an exception directly instead of the generator throwing an exception.

Integrate formatters into the architecture

This ticket is an open discussion about whether there is a good way to integrate formatters into the architecture. Data might flow through objects in the following order.

Connector → Formatter → ProviderResource

However, we need to understand what the interface for Formatter must be and how it integrates into the rest of the system in a meaningful and reusable way.

Reconsider whether forcing resources to return arrays is correct

Currently Porter believes resources should always want to return structured data as an array. However, there may be use cases where structured data is either unavailable or undesirable. I have yet to encounter any compelling cases but am very interested to hear about them.

If we open up the return type to be mixed, this would allow resources to return objects, which would solve #12. Allowing objects can be convenient for object-oriented applications, but if resources return objects as the de-facto standard, this could be inefficient for applications that just want to work with raw data. However, mixed would even permit resources to return different types depending on some configuration parameter.

Forcing the array return type is nice because it feeds into the transformers subsystem, giving transformers a consistent type to work with. However, I'm willing to forgo the entire transformers system in a future version, or change it to only be available when the return type is array, or change it to work with any return type, as necessary. Ultimately, the consequences for the transformers system are not important because Porter's primary responsibility is fetching data reliably, not transforming it.

Document static data imports

It is possible to use Porter to import data we already have using static imports via StaticDataImportSpecification. This brings with it the same post-import benefits as importing data over a network and is especially useful in testing.

Retry delays are tied to the lifetime of an import specification

Since ImportSpecification creates the ExponentialBackoffExceptionHandler, the current retry delay is tied to the lifetime of the specification. That is, if an import fails five times and the same specification is used to import again, the next delay begins with the sixth attempt delay time instead of restarting from one.

Ideally the retry counter would restart at the beginning of a new import regardless of whether the specification is reused or not. However, this tends to be a low-impact bug because specifications are typically not reused. As a workaround, anyone encountering this issue can simply create a new specification for each import instead of reusing specifications.

HttpConnectorTest intermittent CI failure

Travis occasionally fails to pass HttpConnectorTest with an error similar to the following.

There was 1 error:

1) ScriptFUSIONTest\Functional\Porter\Net\Http\HttpConnectorTest::testConnectionToLocalWebserver
ScriptFUSION\Retry\FailingTooHardException: Operation failed after 5 attempt(s).

/home/travis/build/ScriptFUSION/Porter/vendor/scriptfusion/retry/src/retry.php:29
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:96
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:34

Caused by
ScriptFUSION\Porter\Net\Http\HttpConnectionException: file_get_contents(http://[::1]:12345/test?baz=qux): failed to open stream: Connection refused

/home/travis/build/ScriptFUSION/Porter/src/Net/Http/HttpConnector.php:65
/home/travis/build/ScriptFUSION/Porter/src/Connector/CachingConnector.php:62
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:110
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:86
/home/travis/build/ScriptFUSION/Porter/vendor/scriptfusion/retry/src/retry.php:26
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:96
/home/travis/build/ScriptFUSION/Porter/test/Functional/Porter/Net/Http/HttpConnectorTest.php:34

This never used to be a problem, and thanks to the five retries it should have plenty of time to spin up the server. However, this is also the first test in the suite so it may have something to do with PHPUnit start-up time. We should consider moving slower tests to the end of the suite, and if that doesn't work, we'll have to increase the retry delay coefficient.

Document FetchExceptionHandlers

After rewriting a 4000-word manual for Porter v4 I didn't really feel like writing about FetchExceptionHandlers. This feature will seldom be required, and for those who do need it, if they can't figure it out for themselves, the docblocks in the file should probably suffice. Nevertheless, we should document the interface properly at some point.

Add asynchronous fetch support

Performing many sub-imports is currently equivalent to queuing a series of I/O-bound operations whose total execution time is the sum of each import's individual execution time. By running sub-imports concurrently, we reduce the total execution time to that of the longest-running sub-import only. For highly concurrent sub-imports this is a significant time saving.

Document Symfony integration best practices

The readme is written in a framework-agnostic way, as if one were to use Porter in isolation, which is a good default tone to take since it makes no assumptions. However, a lot of people use Symfony and it would be useful to describe what a Porter integration with Symfony should look like for people looking to get started in a Symfony framework environment.

Document Porter's main API

Explicitly document the public methods of Porter, specifically import(), importOne(), the provider methods, including details about tagging, and all other public methods.

Enable cache substitution in connectors

There's no point in implementing PSR-6 caching interfaces if the default caching implementation cannot be changed. However, due to some oversight, none of the first party connectors expose a method to change the cache implementation.

SingleRecord interface

Instead of requiring consumers to guess whether to use import() or importOne(), resources that emit only one record should implement a new SingleRecord interface to clearly indicate that importOne() should be used and which we can use to verify the correct method has been called.

This provides a clear mechanism for data publishers to express intent and makes sense, because resources always know if they export one or multiple records, so they should have a way to express this.

CachingConnector::fetch should allow passing in the cache key

If we allow CachingConnector to take a cache key parameter, then it can be used with existing or shared caches where the keys are not of the form CachingConnector::hash produces.

My use case for this is using existing ODM Mongo documents to cache values with the document ID being the cache key.

To maintain backward compatibility the parameter should be optional and, in the event of null, CachingConnector should fall back to generating the cache key using CachingConnector::hash.

CachingConnector is a poor user experience

Having to wrap a connector in CachingConnector just to use caching is not as convenient as the cache simply working with any connector. Moreover, cache + connector is a violation of the single responsibility principle. The cache should be refactored as a separate entity, apart from connectors.

Integrate hydrators into the architecture

Porter's notion of records is arrays, which are very flexible to pass between interfaces, but once data leaves Porter it is common for applications to want to work with objects instead. The job of a hydrator is to use array data to populate object fields. We should investigate the value of designing a hydrator interface and whether there are any existing hydration libraries fit for purpose.

Make Mapper a suggested dependency

Mapper is currently a required dependency, but users who do not use mappings do not need to install it at all. In order to make Mapper a suggested dependency care must be taken to ensure Porter works correctly when Mapper is unavailable, including tests to verify correct operation in this scenario.

Lazy-load registered providers

A typical Porter factory might load many providers to support all use cases of an application, even though only a smaller subset may actually be used during one execution life-cycle. Therefore we would like a mechanism to lazy-load registered providers only when they are required.

One such mechanism may be a factory interface that looks similar to the following.

interface PorterProviderFactory
{
    public function getProviderClassName() : string;

    public function createProvider() : Provider;
}
