apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.

Home Page: https://beam.apache.org/

License: Apache License 2.0

Topics: python, java, big-data, beam, batch, golang, sql, streaming

beam's Introduction

Apache Beam

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, Google Cloud Dataflow, and Hazelcast Jet.


Overview

Beam provides a general approach to expressing embarrassingly parallel data processing pipelines and supports three categories of users, each of which has relatively disparate backgrounds and needs.

  1. End Users: Writing pipelines with an existing SDK and running them on an existing runner. These users want to focus on writing their application logic and have everything else just work.
  2. SDK Writers: Developing a Beam SDK targeted at a specific user community (Java, Python, Scala, Go, R, graphical, etc). These users are language geeks and would prefer to be shielded from all the details of various runners and their implementations.
  3. Runner Writers: Have an execution environment for distributed processing and would like to support programs written against the Beam Model. Would prefer to be shielded from details of multiple SDKs.

The Beam Model

The model behind Beam evolved from several internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. This model was originally known as the “Dataflow Model”.

To learn more about the Beam Model (though still under the original name of Dataflow), see the World Beyond Batch: Streaming 101 and Streaming 102 posts on O’Reilly’s Radar site, and the VLDB 2015 paper.

The key concepts in the Beam programming model are:

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.
  • PTransform: represents a computation that transforms input PCollections into output PCollections.
  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
  • PipelineRunner: specifies where and how the pipeline should execute.
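
A minimal sketch of how these concepts fit together in the Java SDK (the input path and the transform choices here are only placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.PCollection;

public class MinimalExample {
  public static void main(String[] args) {
    // The PipelineRunner choice and its configuration come from the options
    // (the DirectRunner is used if nothing else is specified).
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // The Pipeline manages the DAG of PTransforms and PCollections.
    Pipeline p = Pipeline.create(options);

    // Each apply() adds a PTransform to the graph; each PTransform consumes
    // and produces PCollections.
    PCollection<String> lines =
        p.apply(TextIO.read().from("gs://some-bucket/input-*.txt"));
    PCollection<Long> lineCount = lines.apply(Count.<String>globally());

    // Nothing runs until the chosen PipelineRunner executes the graph.
    p.run().waitUntilFinish();
  }
}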

SDKs

Beam supports multiple language-specific SDKs for writing pipelines against the Beam Model.

Currently, this repository contains SDKs for Java, Python and Go.

Have ideas for new SDKs or DSLs? See the sdk-ideas label.

Runners

Beam supports executing programs on multiple distributed processing backends through PipelineRunners. Currently, the following PipelineRunners are available:

  • The DirectRunner runs the pipeline on your local machine.
  • The PrismRunner runs the pipeline on your local machine using Beam Portability.
  • The DataflowRunner submits the pipeline to Google Cloud Dataflow.
  • The FlinkRunner runs the pipeline on an Apache Flink cluster. The code has been donated from dataArtisans/flink-dataflow and is now part of Beam.
  • The SparkRunner runs the pipeline on an Apache Spark cluster.
  • The JetRunner runs the pipeline on a Hazelcast Jet cluster. The code has been donated from hazelcast/hazelcast-jet and is now part of Beam.
  • The Twister2Runner runs the pipeline on a Twister2 cluster. The code has been donated from DSC-SPIDAL/twister2 and is now part of Beam.
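
For example, a runner can be selected on the command line (--runner=FlinkRunner) or programmatically through the pipeline options; a small sketch, assuming the Flink runner dependency is on the classpath:

import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // Overrides any --runner flag passed on the command line; by default Beam uses the DirectRunner.
    options.setRunner(FlinkRunner.class);
    Pipeline p = Pipeline.create(options);
    // ... build and run the pipeline against the selected runner.
  }
}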

Have ideas for new Runners? See the runner-ideas label.

Instructions for building and testing Beam itself are in the contribution guide.

📚 Learn More

Here are some resources actively maintained by the Beam community to help you get started:

Resource Details
Apache Beam Website Our website discussing the project and its specifics.
Java Quickstart A guide to getting started with the Java SDK.
Python Quickstart A guide to getting started with the Python SDK.
Go Quickstart A guide to getting started with the Go SDK.
Tour of Beam A comprehensive, interactive learning experience covering Beam concepts in depth.
Beam Quest A certification granted by Google Cloud, certifying proficiency in Beam.
Community Metrics Beam's Git Community Metrics.

Contact Us

To get involved with Apache Beam, see the contact and community information on the Apache Beam website.

beam's Issues

Merging of watermark holds depends on runner-specific "take the minimum" behavior

When multiple watermark holds are to be combined via an OutputTimeFn and committed to persistent storage, the case where the OutputTimeFn computes the minimum is elided, assuming the underlying runner will perform this combine by default.

This is a natural default, but today it is a runner-specific behavior that the runner-agnostic code relies upon.

Imported from Jira BEAM-24. Original Jira may contain additional context.
Reported by: kenn.

Matcher(s) for TableRow

TableRow has poorly behaved equality and a certain amount of automatic coercion, so matchers based on equals() are not applicable in tests. It would be handy to have matchers such as "isTableRowEqualTo(otherRow)".
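
A sketch of what such a matcher could look like using Hamcrest; the string-based normalization below is only a placeholder for whatever coercion rules the real matcher would need:

import com.google.api.services.bigquery.model.TableRow;
import org.hamcrest.Description;
import org.hamcrest.Matcher;
import org.hamcrest.TypeSafeMatcher;

public class TableRowMatchers {

  public static Matcher<TableRow> isTableRowEqualTo(TableRow expected) {
    return new TypeSafeMatcher<TableRow>() {
      @Override
      protected boolean matchesSafely(TableRow actual) {
        // Compare field by field, normalizing values to strings so that automatically
        // coerced values (e.g. 1 vs "1") still match. This normalization is a stand-in
        // for the real coercion rules, not a definitive implementation.
        if (!actual.keySet().equals(expected.keySet())) {
          return false;
        }
        for (String field : expected.keySet()) {
          if (!String.valueOf(expected.get(field)).equals(String.valueOf(actual.get(field)))) {
            return false;
          }
        }
        return true;
      }

      @Override
      public void describeTo(Description description) {
        description.appendText("a TableRow equal to ").appendValue(expected);
      }
    };
  }
}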

Imported from Jira BEAM-332. Original Jira may contain additional context.
Reported by: kenn.

Retry failures in the DirectRunner

Bundles may be retried by a runner; the DirectRunner should be able to retry bundles that fail, based on some configuration.

Imported from Jira BEAM-296. Original Jira may contain additional context.
Reported by: tgroh.

In TriggerRunner, avoid reading the set of finished subtriggers when no subtrigger ever finishes.

For a trigger expression, TriggerRunner tracks which of the subexpressions have finished to track which subexpression is the currently active trigger that determines when output is permitted. For some triggers (such as the default trigger) this information is not needed. The reads and writes of this data can be elided.

Imported from Jira BEAM-30. Original Jira may contain additional context.
Reported by: kenn.

Create Vector, Matrix types and operations to enable linear algebra API

A survey of existing linear algebra libraries within Mahout (Samsara), Spark (SparkML), and Flink (FlinkML) shows that they all expose local vector and matrix types (sparse and dense), distributed matrix types, and operations on them. We should define these data types and operations within Beam and add this to the BeamML document https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA/edit#heading=h.n51rhya8bv4f

Imported from Jira BEAM-478. Original Jira may contain additional context.
Reported by: Kam Kasravi.
Subtask of issue #17859

Integrate code coverage to build and review process

We cannot use codecov, but we can use coveralls. We have the maven plugin included in the pom and need to invoke it appropriately in our various builds, and disseminate knowledge about browser extensions to get it into the pull request UI.

Imported from Jira BEAM-177. Original Jira may contain additional context.
Reported by: kenn.

Testing harness for Unbounded Sources

SourceTestUtils has a fair bit of support for testing things like dynamic work rebalancing and initial splitting.

We should add variants that test UnboundedSource checkpoint-and-resume, etc.

Imported from Jira BEAM-74. Original Jira may contain additional context.
Reported by: dhalperi.

Add SocketStream IO

We could add a SocketStream source, as Flink provides, to read data from a network socket.

Imported from Jira BEAM-21. Original Jira may contain additional context.
Reported by: bakey.

Using Merging Windows and/or Triggers without a downstream aggregation should fail

Both merging windows (such as sessions) and triggering only actually happen at an aggregation (GroupByKey). We should produce errors in any of these cases:

  1. Merging window used without a downstream GroupByKey
  2. Triggers used without a downstream GroupByKey
  3. Window inspected after inserting a merging window and before the GroupByKey
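
A minimal sketch of case 1, where a merging window is applied but nothing downstream aggregates it (the input and gap duration are arbitrary); today this silently has no effect rather than failing:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class SessionsWithoutGroupByKey {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> events = p.apply(Create.of("a", "b", "c"));

    // Case 1: a merging WindowFn is assigned, but no GroupByKey ever consumes it, so
    // session merging never actually happens. Ideally this would fail at pipeline
    // construction time instead of silently doing nothing.
    events.apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));

    p.run().waitUntilFinish();
  }
}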

Imported from Jira BEAM-184. Original Jira may contain additional context.
Reported by: bchambers.

Add Hadoop MapReduce runner

I think a MapReduce runner could be a good addition to Beam. It would allow users to smoothly "migrate" from MapReduce to Spark or Flink.

Of course, the MapReduce runner will run in batch mode (not stream).

Imported from Jira BEAM-165. Original Jira may contain additional context.
Reported by: jbonofre.

Should use allowedLateness from downstream computations in ReduceFnRunner

Much of the reasoning about holds, late data, and final panes in ReduceFnRunner assumes the current getAllowedLateness is an upper bound on the getAllowedLateness of all downstream computations.

There is currently no test that this is indeed the case.

It may be much simpler (for us and our users) to have a global allowed lateness setting.
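
For illustration, a sketch of a pipeline that violates the assumption (transform choices and durations are arbitrary): the downstream aggregation declares a larger allowed lateness than the upstream one, and nothing currently rejects or tests this.

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class AllowedLatenessMismatch {
  /** Two aggregations where the downstream allowed lateness exceeds the upstream one. */
  public static PCollection<KV<KV<String, Long>, Long>> expand(PCollection<String> events) {
    // Upstream aggregation: allowed lateness of 1 minute.
    PCollection<KV<String, Long>> counts =
        events
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
                .withAllowedLateness(Duration.standardMinutes(1)))
            .apply(Count.<String>perElement());

    // Downstream aggregation with a *larger* allowed lateness (10 minutes). ReduceFnRunner's
    // reasoning assumes this never happens, yet nothing rejects it today.
    return counts
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(5)))
            .withAllowedLateness(Duration.standardMinutes(10)))
        .apply(Count.<KV<String, Long>>perElement());
  }
}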

Imported from Jira BEAM-237. Original Jira may contain additional context.
Reported by: mshields822.

Need serialized form and serialVersionUID for user-facing superclasses

When a class does not have an explicit serialVersionUID, it should be considered an unstable value based on the exact version of the code. This is fine for transmission most of the time, but never acceptable for persistence where backwards compatibility matters.

There are two use cases that require explicit serialized form and serialVersionUID even just for transmission. They are required for user-facing superclasses such as DoFn, WindowFn, etc, to support the following:

  • Encoding a pipeline with a JDK and decoding with a JDK that computes defaults differently.
  • Encoding a pipeline against a version of the Beam SDK and decoding with a different version.

The first situation should be rare since there is a deterministic spec, but we have unfortunately seen it.

The second situation is very reasonable; a runner might want to run with additional security fixes in the SDK, etc. Given a correct semantic version for the SDK, the pipeline author and runner author may reasonably expect it to work.

So we should add explicit serialization to superclasses that are necessarily encoded as part of a user's pipeline.
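
For comparison, this is what the mechanism looks like on a concrete user DoFn (a hypothetical class); the issue asks for the same explicit serialVersionUID and serialized-form discipline on the user-facing superclasses themselves:

import org.apache.beam.sdk.transforms.DoFn;

public class FormatCountFn extends DoFn<Long, String> {
  // Pinning the serialized form lets a pipeline encoded with one JDK or SDK version be
  // decoded with another; this issue asks the user-facing superclasses (DoFn, WindowFn,
  // etc.) to make the same guarantee via explicit serialVersionUIDs and serialized forms.
  private static final long serialVersionUID = 1L;

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output("count=" + c.element());
  }
}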

Imported from Jira BEAM-169. Original Jira may contain additional context.
Reported by: kenn.

Incremental join

Consider a co-group by key over the two (streaming) collections:
l : PCollection<KV<K, L>>
r : PCollection<KV<K, R>>
Each processElement sees a K, Iterable<L> and Iterable<R>.

If the underlying trigger only allows a single PaneInfo.Timing.ON_TIME pane then it is trivial to calculate the traditional cross-product, including any of the inner/outer join combinations should Iterable<L> or Iterable<R> be empty.

However if the underlying trigger supports speculative (ie PaneInfo.Timing.EARLY) or late (ie PaneInfo.Timing.LATE) panes then the corresponding speculative output panes are awkward to compute.

(left_already_seen ∪ new_left) X (right_already_seen ∪ new_right)
==
(left_already_seen X right_already_seen) ∪
(new_left X right_already_seen) ∪
(left_already_seen X new_right) ∪
(new_left X new_right)

Currently the barrier between 'already seen' and 'new' must be maintained for left and right in per-window state. That suppresses some optimizations.

This bug is for finding a cleaner way to express this combinator.
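
For reference, the plain (non-incremental) cross-product join via CoGroupByKey looks roughly like the sketch below; with EARLY or LATE panes, each firing re-walks both full iterables, which is the redundancy described above (names are illustrative):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class CrossProductJoin {
  /** Naive cross-product join; a real pipeline would also set coders for the generic output. */
  public static <K, L, R> PCollection<KV<K, KV<L, R>>> join(
      PCollection<KV<K, L>> left, PCollection<KV<K, R>> right) {
    final TupleTag<L> leftTag = new TupleTag<>();
    final TupleTag<R> rightTag = new TupleTag<>();

    return KeyedPCollectionTuple.of(leftTag, left)
        .and(rightTag, right)
        .apply(CoGroupByKey.<K>create())
        .apply(ParDo.of(new DoFn<KV<K, CoGbkResult>, KV<K, KV<L, R>>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            K key = c.element().getKey();
            CoGbkResult result = c.element().getValue();
            // Emit the full cross-product. With a single ON_TIME pane this is the whole
            // answer; with EARLY/LATE panes each firing re-emits pairs that were already
            // produced, which is exactly the redundancy this issue describes.
            for (L l : result.getAll(leftTag)) {
              for (R r : result.getAll(rightTag)) {
                c.output(KV.of(key, KV.of(l, r)));
              }
            }
          }
        }));
  }
}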

Imported from Jira BEAM-197. Original Jira may contain additional context.
Reported by: mshields822.

Provide Beam turnkey binary distributions embedding runners and execution runtime

Now, the only distribution Beam provides is the source distribution.

For new users, it could be interesting to have ready-to-use binary distribution embedding the SDK, a specific runner with the backend execution runtime.

For instance, we could provide:

  • beam-spark-xxx.tar.gz containing SDK, Spark runner, Spark
  • beam-flink-xxx.tar.gz containing SDK, Flink runner, Flink

Thoughts ?

Imported from Jira BEAM-320. Original Jira may contain additional context.
Reported by: jbonofre.

Storm Runner

Gathering place for interest in a Storm runner for Beam.

Imported from Jira BEAM-9. Original Jira may contain additional context.
Reported by: frances.

Make sure we unit test bounded sessions

A few customers have been using Window.into(Sessions...) and of course quickly realize they are exposed to unbounded sessions.

We should have unit tests to confirm various combinations of AfterPane.elementCountAtLeast and AfterProcessingTime... correctly force sessions to be broken apart.

We should also check this all works with repeated messages with the same timestamp (since they will create the exact same session window and can thus see trigger state from previous sessions).

At some point we may move on to reworking bounded sessions to be done directly rather than via Sessions plus triggers.
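
A sketch of the kind of configuration these tests should exercise, where sessions are forced to emit output after an element count or a processing-time delay (the specific gap, count, and delay values are arbitrary):

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class BoundedSessions {
  /** Sessions forced to emit after 1000 elements or 10 minutes of processing time. */
  public static PCollection<KV<String, Long>> boundedSessions(PCollection<String> events) {
    return events
        .apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
            .triggering(Repeatedly.forever(AfterFirst.of(
                AfterPane.elementCountAtLeast(1000),
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10)))))
            .withAllowedLateness(Duration.standardHours(1))
            .discardingFiredPanes())
        .apply(Count.<String>perElement());
  }
}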

Imported from Jira BEAM-279. Original Jira may contain additional context.
Reported by: mshields822.

Support for limiting parallelism of a step

Users may want to limit the parallelism of a step. Two classic uses cases are:

  • User wants to produce at most k files, so sets TextIO.Write.withNumShards(k).
  • External API only supports k QPS, so user sets a limit of k/(expected QPS/step) on the ParDo that makes the API call.

Unfortunately, there is no way to do this effectively within the Beam model. A GroupByKey with exactly k keys will guarantee that only k elements are produced, but runners are free to break fusion in ways that each element may be processed in parallel later.

To implement this functionality, I believe we need to add this support to the Beam Model.
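
A sketch of the GroupByKey workaround described above, with all names illustrative; as noted, a runner may still break fusion after the GroupByKey, so this does not actually guarantee a parallelism limit:

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class LimitedParallelismCalls {
  /** Funnels the input through k keys so at most k grouped elements reach the ParDo. */
  public static PCollection<String> callExternalApi(PCollection<String> input, final int k) {
    return input
        // Assign each element one of k keys.
        .apply(WithKeys.of((String element) -> ThreadLocalRandom.current().nextInt(k))
            .withKeyType(TypeDescriptors.integers()))
        // The GroupByKey produces at most k grouped elements...
        .apply(GroupByKey.<Integer, String>create())
        // ...but a runner is still free to break fusion afterwards and regain parallelism.
        .apply(ParDo.of(new DoFn<KV<Integer, Iterable<String>>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            for (String value : c.element().getValue()) {
              c.output(callRateLimitedApi(value)); // hypothetical external API call
            }
          }
        }));
  }

  private static String callRateLimitedApi(String value) {
    return value; // placeholder for the real rate-limited call
  }
}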

Imported from Jira BEAM-68. Original Jira may contain additional context.
Reported by: dhalperi.

Add declarative DSLs (XML & JSON)

Even if users would still be able to use the API directly, it would be great to provide a DSL on top of the API covering batch and streaming data processing as well as data integration.
Instead of designing a pipeline as a chain of apply() calls wrapping functions (DoFns), we can provide a fluent DSL allowing users to directly leverage turnkey functions.

For instance, a user would be able to design a pipeline like:


.from("kafka:localhost:9092?topic=foo").reduce(...).split(...).wiretap(...).map(...).to("jms:queue:foo...");

The DSL will allow to use existing pipelines, for instance:


.from("cxf:...").reduce().pipeline("other").map().to("kafka:localhost:9092?topic=foo&acks=all")

So it means that we will have to create an IO sink that can trigger the execution of a target pipeline: from("trigger:other") triggers the pipeline execution when another pipeline definition starts with pipeline("other"). We can also imagine mixing runners: the pipeline() can run on one runner and the from("trigger:other") on another. It's not trivial, but it would give Beam strong flexibility and key value.

In a second step, we can provide DSLs in different languages (the first one would be Java, but why not provide XML, Akka, or Scala DSLs).

We can note in the previous examples that the DSL would also provide data integration support to Beam in addition to data processing. Data integration is an extension of the Beam API to support some Enterprise Integration Patterns (EIPs). As we would need metadata for data integration (even if metadata can also be interesting in stream/batch data processing pipelines), we can provide a DataxMessage built on top of PCollection. A DataxMessage would contain:

  • structured headers
  • binary payload

For instance, the headers can contain an Avro schema to describe the payload. The headers can also contain useful information coming from the IO source (for instance, the partition/path where the data comes from, …).

Imported from Jira BEAM-14. Original Jira may contain additional context.
Reported by: jbonofre.

Implement a CSV file reader

We should implement a CSV-based source.

One possibility would be to support the same options as BigQuery. https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats These options are:

fieldDelimiter: allowing a custom delimiter... csv vs tsv, etc. My guess is this is critical. One common delimiter that people use is 'thorn' (þ).

quote: Custom quote char. By default, this is '"', but this allows users to set it to something else, or, perhaps more commonly, remove it entirely (by setting it to the empty string). For example, tab-separated files generally don't need quotes.

allowQuotedNewlines: whether you can quote newlines. In the official CSV RFC, newlines can be quoted... that is, you can have "a", "b\n", "c" in a single line. This makes splitting of large CSV files impossible, so we should disallow quoted newlines by default unless the user really wants them (in which case, they'll get worse performance).

allowJaggedRows: This allows inferring null if not enough columns are specified. Otherwise we give an error for the row.

ignoreUnknownValues: The opposite of allowJaggedRows, this means that if a user has too many values for the schema, we will ignore the ones we don't recognize, rather than reporting an error for the row.

skipHeaderRows: How many header lines are in the file.

encoding: UTF-8 vs. Latin-1, etc.
compression: gzip, bzip, etc.
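
A purely hypothetical sketch of what such a source's builder could look like; CsvIO, CsvRow, and all of these methods are invented here to mirror the option list above and do not exist in Beam:

// Hypothetical API sketch only: CsvIO and these builder methods do not exist in Beam.
PCollection<CsvRow> rows = p.apply(
    CsvIO.read()
        .from("gs://some-bucket/input-*.csv")
        .withFieldDelimiter('\t')          // csv vs tsv vs thorn-delimited
        .withQuote('"')                    // custom quote char; quoting could also be disabled
        .withAllowQuotedNewlines(false)    // quoted newlines prevent splitting large files
        .withAllowJaggedRows(true)         // infer null for missing trailing columns
        .withIgnoreUnknownValues(true)     // drop extra columns instead of failing the row
        .withSkipHeaderRows(1)
        .withEncoding("UTF-8")
        .withCompression("gzip"));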

Imported from Jira BEAM-51. Original Jira may contain additional context.
Reported by: dhalperi.

Source design pattern: option to retain metadata

  • FileBasedSource: user may want to know original path, directory, line number, byte offset, etc.
  • PubSubIO: user may want to know things about the originating message (such as metadata attributes), not just the payload. BEAM-53
  • etc.

We should have a standard pattern for this that provides a good guideline for implementing sources.
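
One possible shape for such a pattern, purely illustrative: the record type and the readWithMetadata() entry point below are invented for this sketch and do not exist in Beam.

// Hypothetical sketch: a source variant that keeps per-record provenance with the payload.
public class FileRecord implements java.io.Serializable {
  public final String path;        // originating file
  public final long byteOffset;    // where in the file the record started
  public final String payload;     // the record itself

  public FileRecord(String path, long byteOffset, String payload) {
    this.path = path;
    this.byteOffset = byteOffset;
    this.payload = payload;
  }
}

// A metadata-retaining read would then produce PCollection<FileRecord> instead of
// PCollection<String>, e.g. (hypothetical API):
// PCollection<FileRecord> records = p.apply(TextIO.readWithMetadata().from("gs://bucket/*"));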

Imported from Jira BEAM-72. Original Jira may contain additional context.
Reported by: dhalperi.

ProtoIO

Make it easy to read and write binary files of Protobuf objects. If there is a standard open source format for this, use it.

If not, roll our own and implement it?

Imported from Jira BEAM-221. Original Jira may contain additional context.
Reported by: dhalperi.

Apache Sqoop connector

A bounded source for Apache Sqoop has been requested in the past.

Imported from Jira BEAM-67. Original Jira may contain additional context.
Reported by: dhalperi.

Running tests of dynamic work rebalancing

SourceTestUtils has an exhaustive test for splitAtFraction. However, we should also add a runner-independent test that, for example, can verify that dynamic work rebalancing is actually invoked for a new BoundedSource.

An example of this test was proposed in https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/test/java/com/google/cloud/dataflow/sdk/transforms/GroupByKeyTest.java#L435 , but was specific to the Google Cloud Dataflow service.

Imported from Jira BEAM-75. Original Jira may contain additional context.
Reported by: dhalperi.

When triggers are changed via pipeline update, stale finished triggers data applied

TriggerRunner tracks which trigger subexpressions are finished. When the trigger expression is updated via pipeline update, this data is applied arbitrarily to the new trigger expression.

Implementation note: trigger subexpressions are identified by number, and their finished state stored in a bit set.

Imported from Jira BEAM-31. Original Jira may contain additional context.
Reported by: kenn.

Native support for conditional iteration

Ported from: GoogleCloudPlatform/DataflowJavaSDK#50

There are a variety of use cases which would benefit from native support for conditional iteration.

For instance, http://stackoverflow.com/questions/31654421/conditional-iterations-in-google-cloud-dataflow/31659923?noredirect=1#comment51264604_31659923 asks about being able to write a loop like the following:


PCollection data = ...;
while (needsMoreWork(data)) {
  data = doAStep(data);
}

If there are specific use cases please let us know the details. In the future we will use this issue to post progress updates.

Imported from Jira BEAM-106. Original Jira may contain additional context.
Reported by: lcwik.

Retractions

We still haven't added retractions to Beam, even though they're a core part of the model. We should document all the necessary aspects (uncombine, reverting DoFn output with DoOvers, sink integration, source-level retractions, etc), and then implement them.

Imported from Jira BEAM-91. Original Jira may contain additional context.
Reported by: takidau.

Add implicit conf/pipeline-default.conf options file

Right now, most users provide the pipeline options via the main arguments.
For instance, it's the classic way to provide the pipeline input, etc.

For convenience, it would be great if the pipeline looked for options in conf/[pipeline_name]-default.conf by default, and overrode those options with the main arguments.
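
A sketch of what this could look like in the Java SDK, assuming a simple properties-style conf file; the file layout, key names, and merge logic here are only illustrative:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DefaultConfOptions {
  /** Loads conf/[pipelineName]-default.conf (if present), then lets main args override it. */
  public static PipelineOptions load(String pipelineName, String[] args) throws IOException {
    Map<String, String> merged = new LinkedHashMap<>();

    // Defaults from the conf file first.
    Path conf = Paths.get("conf", pipelineName + "-default.conf");
    if (Files.exists(conf)) {
      Properties defaults = new Properties();
      try (InputStream in = Files.newInputStream(conf)) {
        defaults.load(in);
      }
      for (String key : defaults.stringPropertyNames()) {
        merged.put(key, defaults.getProperty(key));
      }
    }

    // Explicit main arguments of the form --key=value override the defaults.
    for (String arg : args) {
      if (arg.startsWith("--") && arg.contains("=")) {
        int eq = arg.indexOf('=');
        merged.put(arg.substring(2, eq), arg.substring(eq + 1));
      }
    }

    String[] mergedArgs = merged.entrySet().stream()
        .map(e -> "--" + e.getKey() + "=" + e.getValue())
        .toArray(String[]::new);
    return PipelineOptionsFactory.fromArgs(mergedArgs).create();
  }
}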

Thoughts ?

Imported from Jira BEAM-137. Original Jira may contain additional context.
Reported by: jbonofre.

DataflowPipelineJob should have an API that prints messages but doesn't wait for completion

DataflowPipelineJob has a method waitToFinish() that takes a handler for printing the job's output messages, AND waits for the job to finish, printing messages along the way using that handler.

However, there are cases when a caller would like to poll for the job's messages and print them, but would like to keep the job under the caller's control, rather than having to wait for it to complete.

E.g., one can imagine wanting to do the following "wait until a certain Aggregator in the job reaches a certain value, and then cancel the job, printing messages along the way". This is not possible with the current API, without copying code of waitToFinish().

Imported from Jira BEAM-89. Original Jira may contain additional context.
Reported by: jkff.

Data-driven triggers

For some applications, it's useful to declare a pane/window to be emitted (or finished) based on its contents. The simplest of these is the AfterCount trigger, but more sophisticated predicates could be constructed.

The requirements for consistent trigger firing are essentially that the state of the trigger form a lattice and that the "should fire?" question is a monotonic predicate on the lattice. Basically it asks "are we high enough up the lattice?"

Because the element types may change between the application of Windowing and the actuation of the trigger, one idea is to extract the relevant data from the element at Windowing and pass it along implicitly where it can be combined and inspected in a type safe way later (similar to how timestamps and windows are implicitly passed with elements).
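
The simplest existing instance of this, element-count firing, looks like the following in the Java SDK (window size, count, and lateness are arbitrary); richer data-driven predicates would generalize the AfterPane part:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ElementCountTrigger {
  /** Fires a pane each time 100 more elements arrive; "at least 100 elements" is monotonic. */
  public static PCollection<KV<String, Long>> countWithEarlyFirings(PCollection<String> events) {
    return events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(10)))
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(100)))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes())
        .apply(Count.<String>perElement());
  }
}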

Imported from Jira BEAM-101. Original Jira may contain additional context.
Reported by: robertwb.

Support side inputs for Sessions

Today, the side input window for Sessions is undefined in the model (thus the Java implementation of #getSideInputWindow throws UnsupportedOperationException).

Imported from Jira BEAM-28. Original Jira may contain additional context.
Reported by: kenn.

Add 'garbage collection' hold when receiving a late element

We currently add a 'garbage collection' hold in WatermarkHold (invoked via ReduceFnRunner) if the closing behavior is FIRE_ALWAYS. This means an element which has come in too late for a data hold and an end-of-window hold may end up setting no hold at all. As a result, the eventual pane containing that element may end up dropped as being too late.

Imported from Jira BEAM-311. Original Jira may contain additional context.
Reported by: mshields822.

Data-dependent sinks

The current sink API writes all data to a single destination, but there are many use cases where different pieces of data need to be routed to different destinations, and the set of destinations is data-dependent (so it can't be implemented with a Partition transform).

One internally discussed proposal was an API of the form:


PCollection<Void> PCollection<T>.apply(
    Write.using(DoFn<T, SinkT> where,
                MapFn<SinkT, WriteOperation<WriteResultT, T>> how))

so an item T gets written to a destination (or multiple destinations) determined by "where"; and the writing strategy is determined by "how" that produces a WriteOperation (current API - global init/write/global finalize hooks) for any given destination.

This API also has other benefits:

  • allows the SinkT to be computed dynamically (in "where"), rather than specified at pipeline construction time
  • removes the necessity for a Sink class entirely
  • is sequenceable w.r.t. downstream transforms (you can stick transforms onto the returned PCollection<Void>, while the current Write.to() returns a PDone)

Imported from Jira BEAM-92. Original Jira may contain additional context.
Reported by: jkff.

Make TextIOTest and AvroIOTest runner-agnostic

PipelineOptions contains a tempLocation property, and IOChannelUtils should be capable of handling arbitrary file locations that are used as a tempLocation.

The read and write tests for TextIO and AvroIO can use these properties to be written in a runner-agnostic fashion, and then be marked as RunnableOnService. Doing so allows all runners integrated with RunnableOnService to benefit from the existing tests.

Imported from Jira BEAM-181. Original Jira may contain additional context.
Reported by: tgroh.

Visual Pipeline Designers / Editors

This concept can be used by business people with little or no programming ability if there is a visual editor. If the editor is a component, it can be reused in integrations with other products.

Imported from Jira BEAM-266. Original Jira may contain additional context.
Reported by: sirinath.

IOChannelUtils.getFactory should throw an unchecked exception rather than IOException

In this PR, we've added a new method, IOChannelUtils.hasFactory, which provides a convenient way to verify that a factory exists before calling getFactory. As such, the IOException thrown by getFactory is extra cruft for consumers using the new pattern.

We should update additional consumers to check hasFactory first, and then migrate getFactory to throw a RuntimeException.
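
A sketch of the recommended call pattern, with signatures as described in this issue (IOChannelUtils and IOChannelFactory as they existed at the time; they have since been superseded by the FileSystems API):

import java.io.IOException;
import org.apache.beam.sdk.util.IOChannelFactory;
import org.apache.beam.sdk.util.IOChannelUtils;

public class FactoryLookup {
  /** Checks hasFactory before getFactory, as this issue recommends. */
  public static IOChannelFactory factoryFor(String spec) throws IOException {
    if (!IOChannelUtils.hasFactory(spec)) {
      // The up-front check gives callers a clear unchecked error...
      throw new IllegalArgumentException("No IOChannelFactory registered for: " + spec);
    }
    // ...so the checked IOException still declared by getFactory is the "extra cruft"
    // that this issue proposes turning into a RuntimeException.
    return IOChannelUtils.getFactory(spec);
  }
}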

Imported from Jira BEAM-281. Original Jira may contain additional context.
Reported by: swegner.

Improve TypeDescriptor inference of DoFn's created inside a generic PTransform

Commit aa7f07f introduced the ability to infer a TypeDescriptor from an object created inside a concrete instance of a PTransform and used it to simplify SimpleFunction usage.

We should probably look at using the same mechanism elsewhere, such as when inferring the output type of a ParDo.

Imported from Jira BEAM-324. Original Jira may contain additional context.
Reported by: bchambers.

FileBasedSource/IOChannelFactory: Custom glob expansion

Many cloud and distributed filesystems are eventually consistent, for instance Amazon s3 and Google Cloud Storage.

To work around this, many systems that produce files, such as Beam's FileBasedSink or Google BigQuery, provide ways to determine the number and set of files produced. E.g.,

  • Beam FileBasedSink uses -00000-of-NNNNN
  • BigQuery export jobs use -000000 -000001 -000002 ... until an empty file is produced
  • Another system may produce a .filelist suffix that contains a list of all files.

Users should be able to supply a glob to FileBasedSource but additionally supply a "glob expander" that can provide a custom implementation for file expansion. That way, e.g., Beam pipelines can be run back-to-back-to-back where each consumes the output of the previous, on an inconsistent filesystem, without data loss.

Imported from Jira BEAM-60. Original Jira may contain additional context.
Reported by: dhalperi.
