Comments (5)
No. In the direct DStream approach, the progress directory is critical to the progress of the whole streaming pipeline, so we will not open it up to users, at least not in the near future.
from azure-event-hubs-spark.
It's a generic ProgressTracker. In what way is a file system necessary for the DStream approach?
from azure-event-hubs-spark.
Because ProgressTracker is mostly about translating sequence numbers to offsets for the next batch, which is critical to the whole pipeline. We will not open it up to users, to avoid unnecessary trouble.
Similarly, Structured Streaming in Spark doesn't open offsetLog to the user and enforces that it be based on HDFS.
So that's it.
from azure-event-hubs-spark.
ProgressTracker seems to only manage saving the progress state back to storage and loading it back up again. It's up to the EH client to determine what the next sequence number is. The tracker is obviously important as things are blocked until the latest numbers are committed, but that doesn't mean it has to be tied to the file system.
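To illustrate the point about decoupling, here is a minimal sketch of what a storage-agnostic save/load contract could look like. It is written in Python for brevity and is not the library's API: `ProgressStore`, `InMemoryProgressStore`, `FileProgressStore`, and the `progress-<batch>.json` file naming are all hypothetical names made up for this example. The only assumption taken from the discussion is that the tracker commits per-partition sequence numbers for a batch and later loads the latest committed set.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ProgressStore(ABC):
    """Hypothetical storage contract; not part of azure-event-hubs-spark."""

    @abstractmethod
    def commit(self, batch_id: int, seq_numbers: Dict[str, int]) -> None:
        """Persist the per-partition sequence numbers for one batch."""

    @abstractmethod
    def load_latest(self) -> Optional[Dict[str, int]]:
        """Return the sequence numbers of the highest committed batch."""


class InMemoryProgressStore(ProgressStore):
    """Backend with no file system at all (illustration only)."""

    def __init__(self) -> None:
        self._batches: Dict[int, Dict[str, int]] = {}

    def commit(self, batch_id: int, seq_numbers: Dict[str, int]) -> None:
        self._batches[batch_id] = dict(seq_numbers)

    def load_latest(self) -> Optional[Dict[str, int]]:
        if not self._batches:
            return None
        return dict(self._batches[max(self._batches)])


class FileProgressStore(ProgressStore):
    """File-backed variant behind the same contract."""

    def __init__(self, directory: str) -> None:
        self._dir = directory
        os.makedirs(directory, exist_ok=True)

    def commit(self, batch_id: int, seq_numbers: Dict[str, int]) -> None:
        # Write to a temp file first, then rename, so readers never
        # observe a half-written progress file.
        fd, tmp_path = tempfile.mkstemp(dir=self._dir, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(seq_numbers, f)
        os.replace(tmp_path, os.path.join(self._dir, f"progress-{batch_id}.json"))

    def load_latest(self) -> Optional[Dict[str, int]]:
        ids = [
            int(name[len("progress-"):-len(".json")])
            for name in os.listdir(self._dir)
            if name.startswith("progress-") and name.endswith(".json")
        ]
        if not ids:
            return None
        with open(os.path.join(self._dir, f"progress-{max(ids)}.json")) as f:
            return json.load(f)
```

Code that blocks until the latest numbers are committed would only depend on `commit` and `load_latest`, so the backend could be swapped without touching the logic that works out the next sequence number.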
from azure-event-hubs-spark.
OK, let me end the discussion.
You can implement whatever you want as the storage location for progress files. However, that does not mean we will adopt it in this project.
from azure-event-hubs-spark.