temporalio / sdk-core

Core Temporal SDK that can be used as a base for language specific Temporal SDKs

License: MIT License

Topics: temporal, rust, rust-sdk, workflow-engine, workflow-tool, workflow-automation, rust-lang

sdk-core's Introduction


Temporal Core SDK

Core SDK that can be used as a base for other Temporal SDKs; it currently serves as the base of several language-specific Temporal SDKs.

Documentation

Core SDK documentation can be generated with cargo doc; output will be placed in the target/doc directory.

The architecture doc provides some high-level information about how the Core SDK works and how language layers interact with it.

For the reasoning behind the Core SDK, see the introductory blog post.

Development

You will need the protoc protobuf compiler installed to build Core.

This repo is composed of multiple crates:

  • temporal-sdk-core-protos ./sdk-core-protos - Holds the generated proto code and extensions
  • temporal-client ./client - Defines client(s) for interacting with the Temporal gRPC service
  • temporal-sdk-core-api ./core-api - Defines the API surface exposed by Core
  • temporal-sdk-core ./core - The Core implementation
  • temporal-sdk ./sdk - A (currently prototype) Rust SDK built on top of Core. Used for testing.
  • rustfsm ./fsm - Implements a procedural macro used by core for defining state machines (contains subcrates). It is Temporal-agnostic.

Visualized (dev dependencies are in blue):

Crate dependency graph

All the following commands are enforced for each pull request:

Building and testing

You can build and test the project using cargo:

cargo build
cargo test

Run integ tests with cargo integ-test. By default it will start an ephemeral server. You can also use an already-running server by passing -s external.

Run load tests with cargo test --test heavy_tests.

Formatting

To format all code run: cargo fmt --all

Linting

We are using clippy for linting. You can run it using: cargo clippy --all -- -D warnings

Debugging

The crate uses tracing to help with debugging. To enable it for a test, insert the below snippet at the start of the test. By default, tracing data is output to stdout in a (reasonably) pretty manner.

crate::telemetry::test_telem_console();

The options passed in at initialization can be customized to export to an OTel collector, etc.
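
As a concrete example, a test might begin like this (a minimal sketch; the test name and the #[tokio::test] attribute are illustrative):

#[tokio::test]
async fn workflow_completes_as_expected() {
    // Enable (reasonably) pretty stdout tracing for just this test.
    crate::telemetry::test_telem_console();

    // ... rest of the test ...
}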

To run integ tests with OTel collection on, you can use integ-with-otel.sh. You will want to make sure you are running the collector via docker, which can be done like so:

docker-compose -f docker/docker-compose.yaml -f docker/docker-compose-telem.yaml up

If you are working on a language SDK, you are expected to initialize tracing early in your main equivalent.

Proto files

This repo uses a subtree for upstream protobuf files. The path sdk-core-protos/protos/api_upstream is a subtree. To update it, use:

git pull --squash --rebase=false -s subtree ssh://git@github.com/temporalio/api.git master --allow-unrelated-histories

Do not question why this git command is the way it is. It is not our place to interpret git's ways.

The Java test server protos are also pulled from the sdk-java repo, but since we only need a subdirectory of that repo, we just copy the files with read-tree:

# add sdk-java as a remote if you have not already
git remote add -f -t master --no-tags testsrv-protos git@github.com:temporalio/sdk-java.git
# delete existing protos
git rm -rf sdk-core-protos/protos/testsrv_upstream
# pull from upstream & commit
git read-tree --prefix sdk-core-protos/protos/testsrv_upstream -u testsrv-protos/master:temporal-test-server/src/main/proto
git commit

Fetching Histories

Tests which would like to replay stored histories rely on that history being made available in binary format. You can fetch histories in that format like so (from a local docker server):

cargo run --bin histfetch {workflow_id} [{run_id}]

You can change the TEMPORAL_SERVICE_ADDRESS env var to fetch from a different address.
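
For example, to fetch from a local server explicitly (the address shown here is the conventional local default, and the ids are placeholders):

TEMPORAL_SERVICE_ADDRESS=localhost:7233 cargo run --bin histfetch my_workflow_id my_run_id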

Style Guidelines

Error handling

Any error which is returned from a public interface should be well-typed, and we use thiserror for that purpose.
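
A minimal sketch of that pattern (the error type and variants below are hypothetical, not actual Core types):

use thiserror::Error;

#[derive(Debug, Error)]
pub enum ExampleError {
    /// A workflow-specific failure, carrying the run id.
    #[error("No workflow found for run id {0}")]
    RunNotFound(String),
    /// Wraps an underlying I/O failure.
    #[error("Transport error: {0}")]
    Transport(#[from] std::io::Error),
}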

Errors returned from things only used in testing are free to use anyhow for less verbosity.

The Rust "SDK"

This repo contains a prototype Rust SDK in the sdk/ directory. This SDK should be considered pre-alpha in terms of its API surface. Since it's still using Core underneath, it is generally functional. We do not currently have any firm plans to productionize this SDK. If you want to write workflows and activities in Rust, feel free to use it, but be aware that the API may change at any time without warning and we do not provide any support guarantees.


sdk-core's Issues

Rust SDK Prototype Release

We've had a number of people ask about Rust SDK progress. This issue serves for now as a place to track that progress.

Currently there is no official timeline for releasing a Rust SDK. However, I do plan to make the prototype SDK publicly exposed with no API compatibility guarantees. This issue will track that. It directly depends on #176

[Refactor] Separate polling & completing "halves" of WFT/activity management

The last round of refactoring that moved WFT management inside of workers still leaves room for improvement. Hopefully this is the last round of work we need to do here for quite a while.

There are a bunch of synchronization concerns that would be made simpler by following the below pattern for both WFTs and Activities:

  • Polling happens in the background (this is largely already true, just need to remove limiting semaphore)
    • Locks workflow run state to make needed updates
    • Poll results are fed into a stream
  • Poller selects from stream(s) (for wft: pending, real poll results)
  • Completing locks workflow run state to make needed updates

By keeping the polling half separate from the completing half, the locking (or whatever synchronization mechanism is used; it could be message passing) permits polling and completing to happen simultaneously per worker, but there cannot be multiple concurrent polls. Multiple concurrent completions for different workflow runs are permitted (as they are now).
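
A rough sketch of that shape using message passing (all names here are illustrative, not actual Core types):

use tokio::sync::mpsc;

enum WftSource {
    // A workflow task returned by a real server poll.
    ServerPoll(Vec<u8>),
    // A pending activation generated locally.
    PendingActivation(String),
}

async fn polling_half(tx: mpsc::Sender<WftSource>) {
    // Polling happens in the background and feeds a stream; no limiting semaphore here.
    loop {
        // (The actual server poll is elided; only the flow shape is shown.)
        let task: Vec<u8> = Vec::new();
        if tx.send(WftSource::ServerPoll(task)).await.is_err() {
            break; // receiver dropped: worker is shutting down
        }
    }
}

async fn completing_half(mut rx: mpsc::Receiver<WftSource>) {
    // The selecting/completing half locks run state only while handling one item.
    while let Some(source) = rx.recv().await {
        match source {
            WftSource::ServerPoll(_task) => { /* lock run state, update, emit activation */ }
            WftSource::PendingActivation(_run_id) => { /* activate cached run */ }
        }
    }
}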

Before doing this I think it's probably worth diagramming it out, to land on a really nice final iteration.

[Bug] Completed activity for which cancellation was requested causes duplicate `ResolveActivity` activation job

Describe the bug
When a Workflow requests to cancel an Activity, Core uses TRY_CANCEL mode and immediately activates the workflow with a ResolveActivity activation whose result is canceled.
If the activity completes before it heartbeats again, Core will resolve the activity again, this time with the result set to completed.
The Node SDK will throw an error saying that there's no pending completion for that activity.

To Reproduce
Steps to reproduce the behavior:
Run the cancel-http-request workflow in the Node SDK and set the activity's heartbeatTimeout to something high, like 3 minutes.

Expected behavior
Core should not resend a ResolveActivity activation job.

Screenshots/Terminal output

Caught error while worker was running Error: No completion for taskSeq 0

[Feature Request] Allow activity-only workers

Relates to #224

A user has a need to limit the maximum number of activities targeting a certain resource and they do this by using many task queues. That's fine, but at the moment those workers would also poll (pointlessly) on workflow queues which introduces unnecessary burden. Allow running activity-only workers which do not poll on workflow queues at all.

Implement dataconverter-esque facility for prototype SDK

We need a more ergonomic way to deal with payloads in testing (and of course eventually in the real Rust SDK).

My initial inclination is to use serde traits as a requirement on all types which are eventually turned into payloads. Then there would also be a dataconverter trait which allows creating a series of layers like so:

serde::Serialize -> DCSerializer (chooses a format, ex: serde_json or bincode) -> Payload
-> DCRaw (operates on raw payloads to allow things like compression or encryption) -> Payload

Then in reverse, outputting a serde::Deserialize when reading payloads in.
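
A rough sketch of that layering (trait and type names below are hypothetical):

use serde::Serialize;
use std::collections::HashMap;

// Stand-in for the protobuf Payload message.
pub struct Payload {
    pub metadata: HashMap<String, Vec<u8>>,
    pub data: Vec<u8>,
}

// Chooses a wire format, e.g. serde_json or bincode.
pub trait DcSerializer {
    fn to_payload<T: Serialize>(&self, value: &T) -> anyhow::Result<Payload>;
}

// Operates on raw payloads, e.g. for compression or encryption.
pub trait DcRaw {
    fn transform(&self, payload: Payload) -> anyhow::Result<Payload>;
}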

Connection closed - Worker state changed -> FAILED

Expected Behavior

My microservice doesn't restart per "x" minutes.

Actual Behavior

I have a microservice that is working for "x" minutes. Where finish this period the microservice dies and I always see the same error in the log.

2021-08-26T06:04:49.715Z [ERROR] Worker failed {
  error: [TransportError: Unhandled error when calling the temporal server: Status { code: Unknown, message: "transport error: operation was canceled: connection closed" }]
}
2021-08-26T06:04:49.716Z [INFO] Worker state changed { state: 'FAILED' }
  Aug 26 06:04:49.717  WARN temporal_sdk_core::workflow::concurrency_manager: Sending side of workflow machine creator was dropped. Likely the WorkflowConcurrencyManager was dropped. This indicates a failure to call shutdown.

[TransportError: Unhandled error when calling the temporal server: Status { code: Unknown, message: "transport error: operation was canceled: connection closed" }]

In 12 hours the microservice has restarted 138 times.

NAME                                READY   STATUS             RESTARTS   AGE
aggregation-app-production-0        0/1     CrashLoopBackOff   96902      288d
aggregation-app-sandbox-0           1/1     Running            0          71d
auth-76878d74b-s6dlh                0/1     CrashLoopBackOff   15893      56d
calls-f488df9dc-hp68g               1/1     Running            0          56d
campaigns-548f6447c8-m5p4s          1/1     Running            138        16h

However, Temporal seems to be working fine; I don't see any restarts.

$ kubectl get pods -n temporal
NAME                                                      READY   STATUS    RESTARTS   AGE
temporal-web-67d5fc7f4f-jggqp                             1/1     Running   0          5d13h
temporaltest-temporal-server-admintools-b98547786-hhvj4   1/1     Running   0          47d
temporaltest-temporal-server-frontend-5c48d5cc75-tcmjb    1/1     Running   0          47d
temporaltest-temporal-server-history-6c6785b94b-2hvfg     1/1     Running   0          47d
temporaltest-temporal-server-matching-865f56cd6d-bhckg    1/1     Running   0          47d
temporaltest-temporal-server-web-84b6b7dd6c-zjr2k         1/1     Running   0          5d13h
temporaltest-temporal-server-worker-7775955945-xv8xk      1/1     Running   0          28d

Specifications

Version:
Platform: EKS AWS

[Bug] poll_* methods not interrupted when shutting down

Describe the bug
When calling multiple poll methods (activities, workflows) concurrently, only one of them is interrupted during shutdown.

To Reproduce
This happens in the Node SDK after running a workflow.

RUN_INTEGRATION_TESTS=1 npx ava packages/test/lib/test-integration.js -T 60s -m sleep

Expected behavior
Any outstanding poll request should immediately resolve when shutdown is requested.

[Feature Request] Report extra commands as nondeterminism errors sooner

Assuming this is actually something that makes sense to do and can be done (following up on that):

When replaying, if a WFT application results in commands being produced which do not have a corresponding command-event in history following that WFT's completion event, we should blow up with a nondeterminism error along the lines of "extra commands produced".

As it stands that should happen eventually, in that some command event will be applied which doesn't match the extra command, but it doesn't happen as soon as it could, which harms debuggability.

Ideally, we should blow up as soon as we finish applying a WFT (through WFT completed + commands in that WFT) if we notice there are extra commands lying around that were not matched. My initial attempts to do this though resulted in a bunch of spurious errors. (Note to self: Saved a shelf with my attempts)

[Feature Request] On worker clean shutdown, release cached workflow executions from sticky queue

Is your feature request related to a problem? Please describe.

We've been experiencing increased workflow execution latencies whenever we re-deploy workers and shut down the old versions. This seems to be due to cached workflow executions and the default sticky task queue timeout of 10s: https://temporalio.slack.com/archives/CTRCR8RBP/p1635215037092500

Describe the solution you'd like

I'd like the option to release all cached workflow executions from the sticky queue whenever a worker is cleanly shut down -- our particular deploy mechanism uses SIGTERM.

From the Slack discussion this feature doesn't exist in other SDKs yet but could as well!

Stop returning workflow related errors from poll/complete and instead pipe them to evictions & attach reason

Rather than returning errors that lang can't do anything about except evict the workflow, Core should assume the responsibility of just properly issuing evictions in such situations. Eviction jobs will also need to have reasons attached as a result.

The only errors that should be returned from core's top-level APIs should be fatal:

  • Unrecoverable/unretryable network/server error
  • Lang sent malformed data
  • Worker shutdown / not present

All errors having to do with a specific workflow should instead be piped into an eviction.

sdk-core and sdk-go

Hi, team! I use sdk-go in my projects now. Will sdk-go be rewritten with sdk-core in the future?

[Bug] sticky task poller should specify TaskQueueKind::Sticky

When the frontend receives a polling request, it needs to decide which partition to forward it to. The kind on the request is what the frontend uses to decide whether this is a sticky queue or a normal queue. When it is a sticky queue, its name is used to decide which matching engine owns the queue. If it is a normal queue, an extra random partition is picked first, and then the matching engine that owns that partition is determined.
The partition count is defined by dynamic config and defaults to 4, which means a normal queue costs roughly 4x the server resources of a sticky queue.

[Feature Request] Allow activity to notify workflow while activity is still running

Is your feature request related to a problem? Please describe.

Allow an activity to notify its workflow while the activity is still running. This can be done solely by the SDK providing an API that allows an activity to signal the workflow it belongs to.

Describe the solution you'd like

The new SDK API should also allow context propagation.

Additional context

N/A

What happens when a new pending activation interrupts long poll?

See original discussion here: #89

We need to test and determine best course of action when a new PA interrupts a poll. Ideally it should cancel the request, but there are many potential weird corner cases here like what to do if things race and we somehow get a new task back but don't acknowledge it.

[Feature Request] Ensure large payloads are trimmed in debug/tracing output

We could end up really bogging things down with the default Debug output of the many things that contain payloads, since that could print or trace giant payloads, which is pointless. Avoid that.

Would be easier if we could just override the Debug impl of Payload protobuf messages, but that's unfortunately dependent on tokio-rs/prost#334 still

For now we should probably just make sure we're relying on Display impls for all logging/tracing that might contain big stuff.
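
For illustration, a Display impl along those lines might summarize rather than dump the bytes (a sketch only, not existing Core code):

use std::fmt;

struct PayloadDisplay<'a>(&'a [u8]);

impl fmt::Display for PayloadDisplay<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.0.len() > 64 {
            write!(f, "[{} bytes, truncated]", self.0.len())
        } else {
            write!(f, "{:?}", self.0)
        }
    }
}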

Drain buffered server responses on shutdown

The buffered server responses we've got are not currently drained on shutdown, which could mean we time out tasks, etc. Drain them and allow lang to respond before completing shutdown.

Avoid timing-sensitive sleeps in test code

We have a handful of tests that require sleeps to get things to happen in an expected order. This is obviously not ideal. We need a reasonable way to wait for certain things to happen deep in the internals of the code and expose that to tests in a way that doesn't require making all sorts of ugly changes purely for that ability.

If we could leverage tracing for this, that'd be nice. We could just wait to see some specific event in the test.

[Feature Request] Split into more crates to allow shared workflows between integ/replay tests

Right now there's no good way to define shared workflows in the test_utils crate because it depends on core in non-dev mode, but core depends on it in dev mode, and thus if you try to use shared workflows in unit tests the types mismatch and everything breaks.

Seemingly the only real way to fix this is to split out all the code that deals with mocking server responses and building fake histories into yet another crate (as well as possibly another crate for all the gateway stuff, possibly the pollers abstraction) so that we can define tests which can use mock responses and the history builder etc and also depend on the shared workflows.

This is a pretty big and annoying chunk of work, but it will ultimately result in a nicer structure.

[Bug] Core throws HeartbeatTimeoutNotSet when activity heartbeats and no heartbeat timeout is configured

Describe the bug
As described in title, record_activity_heartbeat throws when there's no heartbeat timeout configured.

.ok_or(ActivityHeartbeatError::HeartbeatTimeoutNotSet)?

It would be better to ignore or simply warn in this case since the activity sending heartbeats is decoupled from the workflow that sets its ActivityOptions.

To Reproduce
Schedule an activity without a heartbeat timeout.

Getting an error after worker inactivity

Happens when running the sdk-node worker.

[Error: Poll response from server was malformed: PollWorkflowTaskQueueResponse { task_token: [], workflow_execution: None, workflow_type: None, previous_started_event_id: 0, started_event_id: 0, attempt: 0, backlog_count_hint: 0, history: None, next_page_token: [], query: None, workflow_execution_task_queue: None, scheduled_time: None, started_time: None, queries: {} }]

[Bug] "Unhandled Command" errors when completing don't trigger eviction but need to

A recent server update rejects extra commands issued after completing the workflow. Via the signal_workflow_signal_not_handled_on_workflow_completion test, this revealed that we were actually sending complete workflow twice in that test, almost certainly because the unhandled command problem doesn't trigger a cache clear, but it really should.

[Bug] poll_activity_task returns an empty task when poll timeout is reached

Describe the bug
When calling poll_activity_task, core returns an empty task when poll timeout is reached.

To Reproduce
Steps to reproduce the behavior:
Run the NodeJS worker
Wait for 30 seconds (the default poll timeout).

Expected behavior
Core should poll again instead of returning an empty task.

Screenshots/Terminal output
Output when adding a print of the received activity task in the Node SDK bridge:

ActivityTask { task_token: [], activity_id: "", variant: Some(Start(Start { workflow_namespace: "", workflow_type: "", workflow_execution: None, activity_type: "", header_fields: {}, input: [], heartbeat_details: [], scheduled_time: None, current_attempt_scheduled_time: None, started_time: None, attempt: 0, schedule_to_close_timeout: None, start_to_close_timeout: None, heartbeat_timeout: None, retry_policy: None })) }

Node worker throws

Error: Got StartActivity without an "activityType" attribute

[Feature Request] Make compilation work on Mac M1.

When compiling the node SDK, I got the following error with an M1 CPU.

npm ERR! error: failed to run custom build command for `temporal-sdk-core v0.1.0 (/Users/bergundy/temporal/sdk-node/packages/worker/native/sdk-core)`
npm ERR!
npm ERR! Caused by:
npm ERR!   process didn't exit successfully: `/Users/bergundy/temporal/sdk-node/packages/worker/native/target/release/build/temporal-sdk-core-de4edef256dc9a6c/build-script-build` (exit code: 1)
npm ERR!   --- stderr
npm ERR!   Error: Custom { kind: Other, error: "failed to invoke protoc (hint: https://docs.rs/prost-build/#sourcing-protoc): Bad CPU type in executable (os error 86)" }

This is easily fixed by installing protobuf with brew install protobuf and exporting PROTOC=$(which protoc) (see the tonic docs).
This is of course not critical, but it would be nice for compilation to "just work".

[Feature Request] Improve DX for when Core encounters server / network errors.

In #202 we tied the logging of retries to the retry config's max_retries.

I feel like a better experience would be to start logging immediately when we fail to connect to the server.

[WARN] Failed to connect to Temporal server at {address}, trying again in 5 seconds...
[WARN] Failed to connect to Temporal server at {address}, trying again in 10 seconds...
[INFO] Connected to Temporal server at {address}

When the connection is lost we should also log

[WARN] Lost connection to Temporal server at {address}

Initially the PR made it so namespace not found errors are retried by default, a decision which was reverted before it was merged, as it might be an issue when users deploy to production.
Their Workers will seem healthy while in practice they are standing idle and unable to progress.
I see the value of retrying in test scenarios but I'm hesitant to have this as the default SDK behavior.

I asked @Sushisource what the Go SDK does in this case and he said that Go retries (almost?) any non-retriable error.
We should list which non-retriable errors we actually want to retry and make this behavior consistent across our SDKs.

Differentiating fatal and per-workflow / other non-fatal errors

As of right now, all errors returned from the Core interface are in one enum, listed here:

pub enum CoreError {
    /// [Core::shutdown] was called, and there are no more replay tasks to be handled. You must
    /// call [Core::complete_task] for any remaining tasks, and then may exit.
    ShuttingDown,
    /// Poll response from server was malformed: {0:?}
    BadDataFromWorkProvider(PollWorkflowTaskQueueResponse),
    /// Lang SDK sent us a malformed completion: {0:?}
    MalformedCompletion(TaskCompletion),
    /// Error buffering commands
    CantSendCommands(#[from] SendError<Vec<WFCommand>>),
    /// Couldn't interpret command from <lang>
    UninterpretableCommand(#[from] InconvertibleCommandError),
    /// Underlying error in history processing
    UnderlyingHistError(#[from] HistoryInfoError),
    /// Underlying error in state machines: {0:?}
    UnderlyingMachinesError(#[from] WFMachinesError),
    /// Task token had nothing associated with it: {0:?}
    NothingFoundForTaskToken(Vec<u8>),
    /// Error calling the service: {0:?}
    TonicError(#[from] tonic::Status),
    /// Server connection error: {0:?}
    TonicTransportError(#[from] tonic::transport::Error),
    /// Failed to initialize tokio runtime: {0:?}
    TokioInitError(std::io::Error),
    /// Invalid URI: {0:?}
    InvalidUri(#[from] InvalidUri),
    /// State machines are missing for the workflow with run id {0}!
    MissingMachines(String),
}

Most of these are not fatal. In fact, probably the only fatal error here that should cause lang sdk to give up entirely is TokioInitError, and maaaybe CantSendCommands, which is specific to a workflow, but if it happens something very wrong has happened (other workflows might continue working but that one never will). All other errors are either transient network problems (ex tonic errors), expected (ShuttingDown), bad data from the server or the lang sdk, or are specific to some individual workflow.

So, we need a way to make that more explicit. A few options:

Nested enums

This is probably the most natural and obvious option

Break up CoreError into this top level enum, and create new sub-enums

pub enum CoreError {
    Fatal(FatalError),
    WorkflowError {
        run_id: String,
        error: WorkflowError,
    },
    ServerError(ServerError),   // Bad data received from server
    LangSDKError(LangSDKError), // Bad data given to us from lang
}
pub enum FatalError { .. }
pub enum WorkflowError { .. }
// etc

Function based approach

We'd keep all the errors at the top level, and use functions implemented on the enum to indicate which ones are fatal or per-workflow, etc. This is a bit more ergonomic for the insides of core when constructing errors, and potentially on the lang side when it wants to match on some specific error. This also allows errors to be of more than one category. For example, the CantSendCommands error could be both fatal and a workflow error. That's kinda nice.

impl CoreError {
    /// Returns true if this is a fatal error
    pub fn is_fatal(&self) -> bool {
        todo!()
    }
    /// Returns the workflow's run id if this is a workflow error
    pub fn is_workflow_error(&self) -> Option<String> {
        todo!()
    }
    // for server / bad data errors, etc.
}

Now that I've written this I'm sorta leaning towards the second option.

Not explicitly mentioned above: for all workflow errors we need to include the run id in the error enum variant.

Thoughts?

Follow up on "buffering" signals/etc between WFT started events

Follow up w/ Max re: Not "buffering" everything between two WFT started events, and instead applying events from started until the end of produced commands for that WFT.

There seem to be two concerns:

  • Ensuring signals are applied serially (already handled lang side right now, and can probably be kept that way, as we send signal jobs in the order they should be applied)
  • Avoiding potentially eating up a bunch of memory by buffering a bunch of signals after commands end but before next WFT begins. Not exactly sure what the issue is here though since we already will have received all that data on the wire. Follow up on that.

[Feature Request] More efficient large payload handling

Right now it's possible (in particular for local activities, which is why I'm writing this) to clone large payloads inside events during history lookahead (and possibly also application).

Ideally that would be avoided as it could cause substantial overhead during replay. Create a bench test to verify and then improve.

Possible fixes:

  • Rc/Arc Payload(s) types - also would require a custom HistoryEvent version (maybe prost annotation?)
  • Never clone HistoryEvents, always use refs

This might require #180 to be done, as it probably allows using Rc more freely.
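
As a sketch of the first option above (types are illustrative, not actual Core types):

use std::sync::Arc;

// Cloning this only bumps a reference count; the payload bytes themselves are shared.
#[derive(Clone)]
struct SharedPayload(Arc<Vec<u8>>);

fn lookahead(events: &[SharedPayload]) -> Vec<SharedPayload> {
    // Cheap clones during history lookahead instead of copying payload data.
    events.to_vec()
}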

[Bug] Nondeterminism problem when running queries on finished workflows

Describe the bug

Shawn was able to mostly reliably repro a bug querying closed workflows. The bug repro'd when spamming queries against a workflow that has completed.

To Reproduce

Doable with Shawn's nextJS demo

Wf code:
https://github.com/temporalio/samples-node/blob/addNextjsExample/nextjs-oneclick/temporal/src/workflows/index.ts

Query runner:
https://github.com/temporalio/samples-node/blob/addNextjsExample/nextjs-oneclick/pages/api/getBuyState.js

What can happen is when the query is spammed we'll actually run into one of these:

2021-09-25T00:02:32.180Z [WARNING] Poll resulted in WorkflowError, converting to a removeFromCache job [WorkflowError: Workflow update error] {
  runId: 'd3593c96-b3ad-46ee-9d75-84e3c91a226a',
  source: 'Nondeterminism error: During event handling, this event had an initial command ID but we could not find a matching command for it: HistoryEvent { event_id: 15, event_time: Some(Timestamp { seconds: 1632528151, nanos: 868323513 }), event_type: WorkflowTaskCompleted, version: 0, task_id: 2097515, attributes: Some(WorkflowTaskCompletedEventAttributes(WorkflowTaskCompletedEventAttributes { scheduled_event_id: 13, started_event_id: 14, identity: "[email protected]", binary_checksum: "@temporalio/[email protected]" })) }'

Obviously, that shouldn't happen. Worse, this then surfaces as the query timing out, when we should have rejected it.

Expected behavior

None of that bad stuff happens.

Notes to self:

  • Make a query-spam-against-completed-wf test in core. Unit and integ, probably
  • If there's no repro, do it in a JS integ test

Unifying Logging / Metrics w/ OpenTelemetry between Core & Lang

We've discussed the desire to unify logging across core/lang.

Right now, core uses https://github.com/tokio-rs/tracing to simultaneously perform traditional logging functions, while also generating OpenTelemetry traces.

When we consider unifying these concerns across the language barrier, we really want to do two things:

  1. Have consistent "traditional" logging output - i.e. what you'd see in the console / a log file.
  2. Pass tracing context across the language barrier

For 1, one option (a possibly better option is described below) is to expose a log function from core which takes a level parameter, a span parameter, a message parameter, and optionally a series of key/value pairs to be logged (all logging is structured). Then, the lang side has to translate its logging interface to the core logging interface in whatever way feels most natural/idiomatic and pass logging through to core. The core implementation of that function then constructs a tracing event (https://tracing-rs.netlify.app/tracing/struct.event) corresponding to the passed-in parameters.

If there's something more automatic we could do here I'm all ears but nothing comes to mind.
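
As a sketch of what that function's shape could look like (entirely hypothetical; not a real Core API):

pub enum LogLevel { Trace, Debug, Info, Warn, Error }

// Opaque handle to a span previously handed to lang (see point 2 below).
pub struct SpanHandle(u64);

pub fn core_log(level: LogLevel, span: Option<&SpanHandle>, message: &str, fields: &[(&str, String)]) {
    // Core would turn these parameters into a tracing event and emit it through
    // its configured subscriber / OpenTelemetry exporter.
    let _ = (level, span, message, fields);
}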

For 2, it seems best for poll_workflow_task and poll_activity_task to return a Span as defined by OpenTelemetry here: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto#L57. Then, lang can convert that into the Span type of whatever tracing library it has chosen to use, and proceed from there to pass it around and create new child spans, etc. It would then pass it / such spans back into the logging interface described above as the span parameter.

While writing the above, I realized there probably is a more elegant option here. The core side can provide an OpenTelemetry "Collector" (https://opentelemetry.io/docs/concepts/data-collection/) which the lang side will need to provide an implementation for. Then, ideally, lang simply uses its OpenTelemetry library, registering the custom collector which is core. On the core side, the collector is just a pass-through to tokio tracing. Lang would be expected to disable all output, relying on core to collect the logs/traces and output them as console/file/tracing etc. This is more work up front, and possibly per-lang, but also feels substantially less hacky. Unfortunately, I think the spans from 2 still need to be passed into lang so that what it's doing can be associated with certain workflow/activity tasks.

[Bug] use one sticky queue per worker process

Both sdk-go and sdk-java use one sticky task queue per worker process.
Currently core uses one sticky queue per named task queue per process.
There are use cases where one worker process would poll from hundreds of task queues, and this would be 100x more expensive on the server.
Some simple bookkeeping would be needed in the SDK if proper concurrency and RPS control is needed per named task queue.

[Feature Request] Reflect workflow task failures in Core

Is your feature request related to a problem? Please describe.

Some errors are logged only in Temporal Web, and some errors are only logged in the terminal. It's confusing to check two places when developing.

Describe the solution you'd like

Pass through workflow task failures as WARN in the terminal.

How to test

  • try to write a naughty workflow (e.g., as of right now, use WeakMap) and execute it
  • the error would show up in the web UI, but not in the terminal

Additional context

slack discussion https://temporaltechnologies.slack.com/archives/C01FT8U10GK/p1634677730404700
