
iand675 / hs-opentelemetry


OpenTelemetry support for the Haskell programming language

Home Page: https://iankduncan.com/projects/opentelemetry

License: BSD 3-Clause "New" or "Revised" License

Languages: Haskell 97.45%, Shell 0.14%, Nix 2.06%, Makefile 0.35%
Topics: opentelemetry, haskell, honeycomb, prometheus, metrics, logging, tracing, observability, performance

hs-opentelemetry's Introduction


Support Haskell OpenTelemetry development | Status: Alpha


OpenTelemetry for Haskell

This repository provides an implementation of the OpenTelemetry API and OpenTelemetry SDK for the Haskell programming language.

Implementation Status

This library still has a number of areas that don't fully cover the breadth of the OpenTelemetry specification. The core aspects of the tracing system are implemented; you may view the detailed conformance checklist here.

Metrics and logs are not yet supported.

Usage

If you're instrumenting a library directly, or as a wrapper, use the hs-opentelemetry-api package.

If you're instrumenting an application, use the hs-opentelemetry-sdk package.
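
For orientation, here is a minimal sketch of application setup with the SDK, assembled from the same functions used in the examples and issues further down this page (the exact module exports and the "my-app" tracer name are illustrative and may differ between versions):

{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (bracket)
import OpenTelemetry.Trace
  ( createTracerProvider, defaultSpanArguments, getTracerProviderInitializationOptions
  , inSpan, makeTracer, setGlobalTracerProvider, shutdownTracerProvider, tracerOptions )

main :: IO ()
main =
  bracket initTracing shutdownTracerProvider $ \provider -> do
    let tracer = makeTracer provider "my-app" tracerOptions
    inSpan tracer "main" defaultSpanArguments $
      putStrLn "doing work inside a span"
  where
    -- Build a provider from environment-based configuration (OTLP by default)
    -- and register it globally so instrumentation libraries can find it.
    initTracing = do
      (processors, opts) <- getTracerProviderInitializationOptions
      provider <- createTracerProvider processors opts
      setGlobalTracerProvider provider
      pure provider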

Examples

Fully instrumented application examples are available in the examples folder.

Provided Instrumentation Libraries

  • wai: hs-opentelemetry-instrumentation-wai
  • yesod-core: hs-opentelemetry-instrumentation-yesod
  • persistent: hs-opentelemetry-instrumentation-persistent
  • esqueleto: hs-opentelemetry-instrumentation-persistent
  • persistent-mysql: hs-opentelemetry-instrumentation-persistent-mysql
  • postgresql-simple: hs-opentelemetry-instrumentation-postgresql-simple
  • http-client: hs-opentelemetry-instrumentation-http-client
  • http-conduit: hs-opentelemetry-instrumentation-http-client
  • conduit: hs-opentelemetry-instrumentation-conduit

Provided Exporters

  • OTLP: hs-opentelemetry-exporter-otlp (module OpenTelemetry.Exporter.OTLP)
  • Jaeger: not implemented
  • Zipkin: not implemented
  • Honeycomb: use hs-opentelemetry-exporter-otlp

Provided Propagators

  • W3CBaggage: hs-opentelemetry-propagator-w3c (module OpenTelemetry.Propagator.W3CBaggage)
  • W3CTraceContext: hs-opentelemetry-propagator-w3c (module OpenTelemetry.Propagator.W3CTraceContext)
  • B3: hs-opentelemetry-propagator-b3 (module OpenTelemetry.Propagator.B3)
  • Jaeger: hs-opentelemetry-propagator-jaeger (module OpenTelemetry.Propagator.Jaeger), not implemented
  • Datadog: hs-opentelemetry-propagator-datadog (module OpenTelemetry.Propagator.Datadog)

Contributing

See CONTRIBUTING.md.


PRs and Reviews

When creating a PR please follow the process described here.

PRs will be automatically associated with reviewers based on CODEOWNERS. PRs will also be automatically assigned to one of the maintainers or approvers for facilitation.

hs-opentelemetry's People

Contributors

9999years, chrismanning, claymager, evanlauer1, gabriella439, iand675, jfraudeau, jkeuhlen, jonathanlorimer, kakkun61, lf-, ocharles, parsonsmatt, pbrisbin, psilospore, qqwy, solomon-b


hs-opentelemetry's Issues

Logging roadmap

While implementing metrics will presumably be more challenging, implementing logging looks much easier in the meantime.

The basic OpenTelemetry.Logging.Core.Log type is there (although the tracingDetails field should be broken up as you can have a TraceID without a SpanID, and vice versa; likewise for the TraceFlags), and the protos are there (for the OTLP protocol, anyway). It looks like it should be straightforward enough to:

  1. Refactor the TracerProvider and Tracer classes to be implemented in terms of generic SignallerProvider and Signaller types, which are parameterised by the signal output type (e.g. ImmutableSpan or Log), signal-specific extra data and hooks. Much of the existing logic for these types is generic. This is probably the step with the most effort, but it shouldn't be difficult.
  2. Implement LoggerProvider and Logger in terms of SignallerProvider and Signaller
  3. Implement the Logs Bridge API (that is, implement the emit function; should be trivial given the prior two steps)
  4. Implement the remainder of the Logs SDK (most of this logic should already be in the SignallerProvider and Signaller implementations)
  5. Refactor the two SDK processors (Simple and Batch) to be parameterised by signal type. I don't think either of them actually depend on ImmutableSpans at the moment, other than overly-strict type annotations?
  6. Refactor the existing OTLP exporter code to be generic over the signal type.
  7. Implement the construction of the Log PDUs for the request. Should be trivial.

The first step will be the biggest change, and probably the only one which might break existing code (unless appropriate type aliases and possibly pattern synonyms are introduced); after that, each step can be implemented gradually in point releases if necessary.
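
Purely as an illustration of step 1, the generic types might look roughly like the following. None of these names exist in hs-opentelemetry today, and Processor here is a stand-in for a processor parameterised by signal type as described in step 5:

-- Hypothetical sketch only; not part of hs-opentelemetry.
newtype Processor signal = Processor { processorOnEnd :: signal -> IO () }

data SignallerProvider signal extra = SignallerProvider
  { signallerProcessors :: [Processor signal]
  , signallerExtra :: extra  -- signal-specific data, e.g. a sampler for traces
  }

data Signaller signal extra = Signaller
  { signallerName :: String
  , signallerProvider :: SignallerProvider signal extra
  }

-- The existing tracing types would then become aliases along the lines of
--   type TracerProvider = SignallerProvider ImmutableSpan TracingExtras
--   type LoggerProvider = SignallerProvider Log ()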

Am I missing anything?

Combining WAI and Yesod instrumentation loses exception details

I've got an application instrumented with both hs-opentelemetry-instrumentation-wai and hs-opentelemetry-instrumentation-yesod. For the most part this works well; however, I've found that the details of exceptions (exception message, exception backtrace, span status message) thrown by handlers never make it to the WAI middleware trace.

I see why this happens: by default Yesod catches the exception and converts it to an error page -- the WAI instrumentation never sees the exception. However, I'm not sure what the proper fix is. The Yesod middleware could simply recordException targeting the WAI trace. Unfortunately this still results in the WAI middleware setting a "" span status message rather than a meaningful description.

GHC 9.4 Compilation Failure

GHC 9.4 is failing to compile with the Word# and Word64# switcharoo:

Failed to build hs-opentelemetry-api-0.0.3.5.
Build log (
/home/matt/.cabal/logs/ghc-9.4.1/hs-opentelemetry-api-0.0.3.5-280d98a9f948cd5074faf3edf808743d75cc05f40bdcbb81b10c8c3d3e83bf17.log
):
Configuring library for hs-opentelemetry-api-0.0.3.5..
Preprocessing library for hs-opentelemetry-api-0.0.3.5..
Building library for hs-opentelemetry-api-0.0.3.5..
[ 1 of 32] Compiling OpenTelemetry.Attributes ( src/OpenTelemetry/Attributes.hs, dist/build/OpenTelemetry/Attributes.o, dist/build/OpenTelemetry/Attributes.dyn_o )
[ 2 of 32] Compiling OpenTelemetry.Baggage ( src/OpenTelemetry/Baggage.hs, dist/build/OpenTelemetry/Baggage.o, dist/build/OpenTelemetry/Baggage.dyn_o )
[ 3 of 32] Compiling OpenTelemetry.Common ( src/OpenTelemetry/Common.hs, dist/build/OpenTelemetry/Common.o, dist/build/OpenTelemetry/Common.dyn_o )
[ 4 of 32] Compiling OpenTelemetry.Context.Types ( src/OpenTelemetry/Context/Types.hs, dist/build/OpenTelemetry/Context/Types.o, dist/build/OpenTelemetry/Context/Types.dyn_o )
[ 5 of 32] Compiling OpenTelemetry.Propagator ( src/OpenTelemetry/Propagator.hs, dist/build/OpenTelemetry/Propagator.o, dist/build/OpenTelemetry/Propagator.dyn_o )
[ 6 of 32] Compiling OpenTelemetry.Resource ( src/OpenTelemetry/Resource.hs, dist/build/OpenTelemetry/Resource.o, dist/build/OpenTelemetry/Resource.dyn_o )
[ 7 of 32] Compiling OpenTelemetry.Resource.Cloud ( src/OpenTelemetry/Resource/Cloud.hs, dist/build/OpenTelemetry/Resource/Cloud.o, dist/build/OpenTelemetry/Resource/Cloud.dyn_o )
[ 8 of 32] Compiling OpenTelemetry.Resource.Container ( src/OpenTelemetry/Resource/Container.hs, dist/build/OpenTelemetry/Resource/Container.o, dist/build/OpenTelemetry/Resource/Container.dyn_o )
[ 9 of 32] Compiling OpenTelemetry.Resource.DeploymentEnvironment ( src/OpenTelemetry/Resource/DeploymentEnvironment.hs, dist/build/OpenTelemetry/Resource/DeploymentEnvironment.o, dist/build/OpenTelemetry/Resource/D
eploymentEnvironment.dyn_o )
[10 of 32] Compiling OpenTelemetry.Resource.Device ( src/OpenTelemetry/Resource/Device.hs, dist/build/OpenTelemetry/Resource/Device.o, dist/build/OpenTelemetry/Resource/Device.dyn_o )
[11 of 32] Compiling OpenTelemetry.Resource.FaaS ( src/OpenTelemetry/Resource/FaaS.hs, dist/build/OpenTelemetry/Resource/FaaS.o, dist/build/OpenTelemetry/Resource/FaaS.dyn_o )
[12 of 32] Compiling OpenTelemetry.Resource.Host ( src/OpenTelemetry/Resource/Host.hs, dist/build/OpenTelemetry/Resource/Host.o, dist/build/OpenTelemetry/Resource/Host.dyn_o )
[13 of 32] Compiling OpenTelemetry.Resource.Kubernetes ( src/OpenTelemetry/Resource/Kubernetes.hs, dist/build/OpenTelemetry/Resource/Kubernetes.o, dist/build/OpenTelemetry/Resource/Kubernetes.dyn_o )
[14 of 32] Compiling OpenTelemetry.Resource.OperatingSystem ( src/OpenTelemetry/Resource/OperatingSystem.hs, dist/build/OpenTelemetry/Resource/OperatingSystem.o, dist/build/OpenTelemetry/Resource/OperatingSystem.dyn
_o )
[15 of 32] Compiling OpenTelemetry.Resource.Process ( src/OpenTelemetry/Resource/Process.hs, dist/build/OpenTelemetry/Resource/Process.o, dist/build/OpenTelemetry/Resource/Process.dyn_o )
[16 of 32] Compiling OpenTelemetry.Resource.Service ( src/OpenTelemetry/Resource/Service.hs, dist/build/OpenTelemetry/Resource/Service.o, dist/build/OpenTelemetry/Resource/Service.dyn_o )
[17 of 32] Compiling OpenTelemetry.Resource.Telemetry ( src/OpenTelemetry/Resource/Telemetry.hs, dist/build/OpenTelemetry/Resource/Telemetry.o, dist/build/OpenTelemetry/Resource/Telemetry.dyn_o )
[18 of 32] Compiling OpenTelemetry.Resource.Webengine ( src/OpenTelemetry/Resource/Webengine.hs, dist/build/OpenTelemetry/Resource/Webengine.o, dist/build/OpenTelemetry/Resource/Webengine.dyn_o )
[19 of 32] Compiling OpenTelemetry.Trace.Id.Generator ( src/OpenTelemetry/Trace/Id/Generator.hs, dist/build/OpenTelemetry/Trace/Id/Generator.o, dist/build/OpenTelemetry/Trace/Id/Generator.dyn_o )
[20 of 32] Compiling OpenTelemetry.Trace.Id ( src/OpenTelemetry/Trace/Id.hs, dist/build/OpenTelemetry/Trace/Id.o, dist/build/OpenTelemetry/Trace/Id.dyn_o )

src/OpenTelemetry/Trace/Id.hs:116:10: error:
    • Couldn't match expected type ‘Word#’ with actual type ‘Word64#’
    • In the first argument of ‘or#’, namely
        ‘(indexWord64Array# arr 0#)’
      In the first argument of ‘eqWord#’, namely
        ‘(or# (indexWord64Array# arr 0#) (indexWord64Array# arr 1#))’
      In the first argument of ‘isTrue#’, namely
        ‘(eqWord#
            (or# (indexWord64Array# arr 0#) (indexWord64Array# arr 1#))
            (int2Word# 0#))’
    |
116 |         (indexWord64Array# arr 0#)
    |          ^^^^^^^^^^^^^^^^^^^^^^^^

src/OpenTelemetry/Trace/Id.hs:117:10: error:
    • Couldn't match expected type ‘Word#’ with actual type ‘Word64#’
    • In the second argument of ‘or#’, namely
        ‘(indexWord64Array# arr 1#)’
      In the first argument of ‘eqWord#’, namely
        ‘(or# (indexWord64Array# arr 0#) (indexWord64Array# arr 1#))’
      In the first argument of ‘isTrue#’, namely
        ‘(eqWord#
            (or# (indexWord64Array# arr 0#) (indexWord64Array# arr 1#))
            (int2Word# 0#))’
    |
117 |         (indexWord64Array# arr 1#))
    |          ^^^^^^^^^^^^^^^^^^^^^^^^

src/OpenTelemetry/Trace/Id.hs:177:6: error:
    • Couldn't match expected type ‘Word#’ with actual type ‘Word64#’
    • In the first argument of ‘eqWord#’, namely
        ‘(indexWord64Array# arr 0#)’
      In the first argument of ‘isTrue#’, namely
        ‘(eqWord# (indexWord64Array# arr 0#) (int2Word# 0#))’
      In the expression:
        isTrue# (eqWord# (indexWord64Array# arr 0#) (int2Word# 0#))
    |
177 |     (indexWord64Array# arr 0#)
    |      ^^^^^^^^^^^^^^^^^^^^^^^^
[22 of 32] Compiling OpenTelemetry.Trace.Id.Generator.Dummy ( src/OpenTelemetry/Trace/Id/Generator/Dummy.hs, dist/build/OpenTelemetry/Trace/Id/Generator/Dummy.o, dist/build/OpenTelemetry/Trace/Id/Generator/Dummy.dyn
_o )
[23 of 32] Compiling OpenTelemetry.Trace.TraceState ( src/OpenTelemetry/Trace/TraceState.hs, dist/build/OpenTelemetry/Trace/TraceState.o, dist/build/OpenTelemetry/Trace/TraceState.dyn_o )
[24 of 32] Compiling OpenTelemetry.Util ( src/OpenTelemetry/Util.hs, dist/build/OpenTelemetry/Util.o, dist/build/OpenTelemetry/Util.dyn_o )
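
For context, GHC 9.4 changed indexWord64Array# to return Word64# rather than Word#, and added word64ToWord# to convert back. A hedged sketch of one way to paper over the difference (not necessarily the fix that landed in the library):

{-# LANGUAGE CPP #-}
{-# LANGUAGE MagicHash #-}

import GHC.Exts

-- Read a 64-bit word out of a ByteArray# as a Word#, on either side of the
-- GHC 9.4 Word64# change (the pre-9.4 branch assumes a 64-bit platform).
#if __GLASGOW_HASKELL__ >= 904
indexWord64AsWord# :: ByteArray# -> Int# -> Word#
indexWord64AsWord# arr i = word64ToWord# (indexWord64Array# arr i)
#else
indexWord64AsWord# :: ByteArray# -> Int# -> Word#
indexWord64AsWord# = indexWord64Array#
#endif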

Windows support

The hs-opentelemetry-sdk package depends on unix, preventing it from building on Windows, while at first glance the only Unix-specific functions it seems to use are getProcessID (present in unix-compat-0.7.1) and getEffectiveUserName (a no-op on Windows?).

Resource Merge Prioritizes Old Resource

The function mergeResources in OpenTelemetry.Resource seems to be backwards. The documentation calls the first argument the old resource and the second argument the updating resource, meaning the second argument's attribute values should take precedence when a key is in both resources. However, due to the underlying HashMap's left-bias, the old value is taking precedence. For example:

l = mkResource ["host.name" .= ("left" :: Text)] :: Resource 'Nothing
r = mkResource ["host.name" .= ("right" :: Text)] :: Resource 'Nothing
getMaterializedResourcesAttributes . materializeResources $ mergeResources l r
-- Attributes {attributes = fromList [("host.name",AttributeValue (TextAttribute "left"))], attributesCount = 2, attributesDropped = 0}

As a result, getTracerProviderInitializationOptions' does not prioritize the user-provided resource attributes.
This makes it very difficult to override the built-in resources (such as the host detector).

Similarly, I think getTracerProviderInitializationOptions' should prioritize the attributes read from OTEL_RESOURCE_ATTRIBUTES above the detected resources. Currently we have:

let allRs = mergeResources (builtInRs <> envVarRs) rs

Since <> is defined in a similar way as mergeResources and is also left-biased, both issues could be resolved by updating <> and mergeResources to be right-biased. Alternatively, the documentation for mergeResources could be updated to reflect the current behavior, and getTracerProviderInitializationOptions' could be updated to reverse the order of arguments.
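
As a concrete illustration of that last option, under today's left-biased semantics the updating resource can be made to win simply by passing it first (a workaround sketch, not a proposed final API):

-- Workaround sketch: pass the resource whose attributes should win as the
-- *first* argument while the current left-biased behavior is in place.
l = mkResource ["host.name" .= ("left" :: Text)] :: Resource 'Nothing
r = mkResource ["host.name" .= ("right" :: Text)] :: Resource 'Nothing
getMaterializedResourcesAttributes . materializeResources $ mergeResources r l
-- host.name is now "right", i.e. the "updating" resource takes precedence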

Improving project stewardship

I've been feeling lately that I'm not necessarily responding to issues, PRs, and upstream evolutions of the OpenTelemetry spec in a timely fashion.

My desire for this project has been to build tracing that is fast, stable, relatively easy to use, and that supports most popular Haskell libraries. Ultimately, I'd like to contribute the library to the OpenTelemetry project itself. However, I've both had a new baby and had my oldest child pass away suddenly in the past year, so I have very little time or motivation to really work on anything outside of what is needed on this project to support the company that I work for.

So, I'd like to bring on one or two more people as core owners / contributors that will feel empowered to make decisions and shepherd the project into becoming mature enough that it becomes a go-to tool for production Haskell users. If you are interested, please let me know, and we can figure out next steps.

Metrics?

I'd love to start collecting metrics; is this in development at all?

`hspec` combinators

I think we can get a better hspec integration by providing functions that use hspec hooks to call inSpan. For example:

-- | Like 'Hspec.describe' but creates a new 'Span' with the message.
describe :: HasCallStack => String -> SpecWith a -> SpecWith a
describe msg spec =
  aroundWith wrapSpec do
    withFrozenCallStack Hspec.describe msg spec
  where
    wrapSpec runTest a =
      inSpan (pack msg) defaultSpanArguments (runTest a)

Working on this locally, will figure out an extraction if it makes sense

Nix utilities

This is a bit of a brainstorm:

I think we have a few issues that need to be fixed upstream (in GHC itself, or cabal, etc.)

  • Detecting versions of what we're instrumenting
  • Sending contexts through to threads that we didn't create
  • Cleaning up thread local state more reliably

I could see us providing some nix utilities in the meantime that update wrapped dependencies to use patched versions of async, expose their Paths_ module, etc. This would give us much deeper integration abilities into the broader ecosystem at the expense of tooling being less useful outside of the nix ecosystem.

Graceful offline behavior

When there is no internet connection, the library hangs for a while before giving up. This can result in delay and lag in processes.

I'm not sure the best way to detect this and work around it - there must be a differentiation between "the network isn't available at all" and "the service isn't responding promptly."

Support OTEL_SDK_DISABLED

We'd love to see support for OTEL_SDK_DISABLED, so we don't need conditionals in our code to only enable tracing in environments where it's appropriate. Today, we add code to read this variable ourselves and initialize with emptyTracerProviderOptions when set, or our usual way when not.

https://opentelemetry.io/docs/reference/specification/sdk-environment-variables/#general-sdk-configuration

Boolean value. If “true”, a no-op SDK implementation will be used for all telemetry signals. Any other value or absence of the variable will have no effect and the SDK will remain enabled. This setting has no effect on propagators configured through the OTEL_PROPAGATORS variable.

Further details on encoding booleans: https://opentelemetry.io/docs/reference/specification/sdk-environment-variables/#boolean-value.

Any value that represents a Boolean MUST be set to true only by the case-insensitive string "true", meaning "True" or "TRUE" are also accepted, as true. An SDK MUST NOT extend this definition and define additional values that are interpreted as true. Any value not explicitly defined here as a true value, including unset and empty values, MUST be interpreted as false. If any value other than a true value, case-insensitive string "false", empty, or unset is used, a warning SHOULD be logged to inform users about the fallback to false being applied. All Boolean environment variables SHOULD be named and defined such that false is the expected safe default behavior. Renaming or changing the default value MUST NOT happen without a major version upgrade.
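
Until native support lands, a hedged sketch of the workaround described above (it assumes emptyTracerProviderOptions is exported alongside the other initialization helpers; the function name is illustrative):

import Data.Char (toLower)
import System.Environment (lookupEnv)
import OpenTelemetry.Trace
  ( TracerProvider, createTracerProvider, emptyTracerProviderOptions
  , getTracerProviderInitializationOptions )

-- Honour OTEL_SDK_DISABLED by hand: with no processors configured, the
-- resulting provider is effectively a no-op.
initializeTracerProviderRespectingDisabled :: IO TracerProvider
initializeTracerProviderRespectingDisabled = do
  disabled <- lookupEnv "OTEL_SDK_DISABLED"
  if (map toLower <$> disabled) == Just "true"
    then createTracerProvider [] emptyTracerProviderOptions
    else do
      (processors, opts) <- getTracerProviderInitializationOptions
      createTracerProvider processors opts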

Cannot build with GHC 9

src/OpenTelemetry/Baggage.hs:84:27: error:
    • Couldn't match type ‘Q’ with ‘Code m’
      Expected: Code m Token
        Actual: Q (TExp a0)
    • In the expression: unsafeTExpCoerce $ bsToExp tok
      In an equation for ‘liftTyped’:
          liftTyped (Token tok) = unsafeTExpCoerce $ bsToExp tok
      In the instance declaration for ‘Lift Token’
    • Relevant bindings include
        liftTyped :: Token -> Code m Token
          (bound at src/OpenTelemetry/Baggage.hs:84:3)
   |
84 |   liftTyped (Token tok) = unsafeTExpCoerce $ bsToExp tok
   |    
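
For reference, the error stems from template-haskell 2.17 (GHC 9.0) changing liftTyped to return Code m a instead of Q (TExp a). A self-contained toy illustration of the usual migration pattern (Token' is a stand-in type; this is not the library's actual fix):

{-# LANGUAGE CPP #-}
{-# LANGUAGE TemplateHaskellQuotes #-}

import Language.Haskell.TH.Syntax

-- Toy stand-in type to show the Code/TExp migration shape only.
newtype Token' = Token' String

instance Lift Token' where
  lift (Token' s) = [| Token' s |]
#if MIN_VERSION_template_haskell(2,17,0)
  -- GHC 9.0+: liftTyped must produce a Code, so coerce the untyped lift.
  liftTyped = unsafeCodeCoerce . lift
#elif MIN_VERSION_template_haskell(2,16,0)
  liftTyped = unsafeTExpCoerce . lift
#endif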

`SpanArguments` with an exception handler

SpanArguments should have a field for determining what happens with a thrown exception. The existing default behavior is to do this:

        forM_ e $ \(SomeException inner) -> do
          setStatus s $ Error $ T.pack $ displayException inner
          recordException s [] Nothing inner

I think that recordException s [] Nothing inner is probably good to leave in, and the "customizable" stuff will be the setStatus s $ Error $ T.pack $ displayException inner.

Related: #90

Comment: #91 (comment)

I think I'd like to see something like:

newtype SpanExceptionBehavior = SpanExceptionBehavior { unSpanExceptionBehavior :: SomeException -> Maybe (Span -> IO ()) }
  deriving newtype (Semigroup, Monoid)

spanExceptionSetStatusError :: SpanExceptionBehavior
spanExceptionSetStatusError =
  SpanExceptionBehavior $ \(SomeException inner) ->
    Just $ \s -> setStatus s $ Error $ T.pack $ displayException inner

addSpanExceptionBehavior :: SpanExceptionBehavior -> SpanArguments -> SpanArguments
addSpanExceptionBehavior new sa = sa { exceptionBehavior = new <> exceptionBehavior sa }

That allows you to compose and combine them. However, it does not allow you to override them. So I think we need something other than Maybe there:

data OverrideOrCombine a 
    = Override a
    | Combine a
    | Nothin

instance Semigroup a => Semigroup (OverrideOrCombine a) where
    Override a <> _ = Override a
    _ <> Override a = Override a
    Combine a <> Combine b = Combine (a <> b)
    Nothin <> r = r
    l <> Nothin = l

Then, for the ExitSuccess, we can do an override:

ignoreExitSuccess :: ExceptionBehavior
ignoreExitSuccess = ExceptionBehavior $ \exn ->
    case fromException exn of
        Just ExitSuccess ->
            Override mempty
        _ ->
            Nothin

Simple processor hangs on shutdown, shutdown logic probably should be rewritten

It appears that the simple processor hangs on shutdown, which is quite surprising. It seems to have much more creative use of synchronization primitives than the batch processor (especially after I deleted some of the batch processor ones in #23), and this is probably its downfall.

I am writing an example for the hspec exporter and wanted to test it against stdout. Here's where I got to:

I tossed some traces in OpenTelemetry.Processor.Simple (none of them are printed (!) so it's stuck on the cancel):

Begin tests
done tests
"shutdownTracerProvider a processor"
"7870e2a554ff4454a446890e12de20e4" "a73b5effd0cd55a8" Timestamp (TimeSpec {sec = 1661365350, nsec = 54536000}) run tests
"shutdownTracerProvider a processor"
"shutdown batch proc cleanly"
  pure $ Processor
    { processorOnStart = \_ _ -> pure ()
    , processorOnEnd = writeChan inChan
    , processorShutdown = async $ mask $ \restore -> do
        cancel exportWorker
        traceShowM "cancel"
        -- TODO handle timeouts
        restore $ do
          traceShowM "shutitdown"
          -- TODO, not convinced we should shut down processor here
          shutdownProcessor outChan `finally` Exporter.exporterShutdown exporter
        pure ShutdownSuccess
    , processorForceFlush = pure ()
    }

package.yaml:

name: hspec-example

dependencies:
# minimal example dependencies
- base
- hspec
- unliftio
- text
# opentelemetry dependencies
- hs-opentelemetry-sdk
- hs-opentelemetry-exporter-handle
- hs-opentelemetry-instrumentation-hspec

tests:
  test:
    dependencies:
      - hspec-example
    main: Main.hs
    source-dirs:
      - test

library:
  source-dirs: src

Program:

{-# LANGUAGE OverloadedStrings #-}
import OpenTelemetry.Trace.Sampler
import OpenTelemetry.Exporter.Handle
import qualified OpenTelemetry.Context as Context
import OpenTelemetry.Trace hiding (inSpan)
import OpenTelemetry.Context.ThreadLocal (attachContext)
import Control.Monad (void)
import GHC.Stack (HasCallStack)
import Data.Text (Text)
import UnliftIO (MonadUnliftIO, bracket)
import qualified OpenTelemetry.Trace as Trace
import OpenTelemetry.Instrumentation.Hspec (wrapSpec)
import Test.Hspec.Runner (hspecWith, defaultConfig)
import OpenTelemetry.Processor.Simple (SimpleProcessorConfig(..), simpleProcessor)

-- | Initialize the global tracing provider for the application and run an action
--   (that action is generally the entry point of the application), cleaning
--   up the provider afterwards.
--
--   This also sets up an empty context (creating a new trace ID).
withGlobalTracing :: Sampler -> IO a -> IO a
withGlobalTracing sampler act = do
  void $ attachContext Context.empty
  bracket
    (initializeTracing sampler)
    shutdownTracerProvider
    $ const act

initializeTracing :: Sampler -> IO TracerProvider
initializeTracing sampler = do
  (processors, tracerOptions') <- getTracerProviderInitializationOptions
  proc <- simpleProcessor SimpleProcessorConfig{ exporter = stderrExporter' (pure . defaultFormatter)}
  let processors' = proc : processors
  provider <- createTracerProvider processors' (tracerOptions' {tracerProviderOptionsSampler = sampler})
  setGlobalTracerProvider provider

  pure provider

inSpan :: (MonadUnliftIO m, HasCallStack) => Text -> SpanArguments -> m a -> m a
inSpan name args act = do
  tp <- getGlobalTracerProvider
  let tracer = makeTracer tp "hspec-example" tracerOptions
  Trace.inSpan tracer name args act

main :: IO ()
main = do
  putStrLn "Begin tests"
  withGlobalTracing alwaysOn $
    inSpan "run tests" defaultSpanArguments
    (putStrLn "done tests")
  putStrLn "done"

Doesn't compile on GHC 9.2.3

Attempting to use this compiler version results in errors similar to this:

[20 of 32] Compiling OpenTelemetry.Trace.Id ( src/OpenTelemetry/Trace/Id.hs, /home/janus/flipstone/hs-opentelemetry/api/dist-newstyle/build/x86_64-linux/ghc-9.2.3/hs-opentelemetry-api-0.0.3.3/build/OpenTelemetry/Trace/Id.o, /home/janus/flipstone/hs-opentelemetry/api/dist-newstyle/build/x86_64-linux/ghc-9.2.3/hs-opentelemetry-api-0.0.3.3/build/OpenTelemetry/Trace/Id.dyn_o )

src/OpenTelemetry/Trace/Id.hs:114:43: error:
    • Couldn't match expected type ‘Word#’ with actual type ‘Word8#’
    • In the first argument of ‘convertByte’, namely ‘w’
      In the expression: convertByte w
      In a pattern binding: !(# w1, w2 #) = convertByte w
    |
114 |           let !(# w1, w2 #) = convertByte w
    |                                           ^

It's curious that there is a stack file for 9.2, but I can't actually get it to compile. Is it supposed to work on the 9.2 series?

inSpan without UnliftIO

I am working on an app where the monad doesn't support UnliftIO, and I would like to understand why inSpan requires it. It seems like the constraint is only needed for the bracketError function. It would be great to provide an inSpan function that doesn't require it, but I am not sure where to start.
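
For anyone exploring this, here is a rough sketch (not part of the library) of an inSpan-style combinator built on generalBracket from the exceptions package. It uses only functions that already appear elsewhere on this page, assumes they are re-exported from OpenTelemetry.Trace (adjust to OpenTelemetry.Trace.Core otherwise), and deliberately skips the thread-local context bookkeeping the real inSpan performs, so child spans will not nest under it:

import Control.Exception (displayException)
import Control.Monad.Catch (ExitCase (..), MonadMask, generalBracket)
import Control.Monad.IO.Class (MonadIO)
import Data.Text (Text)
import qualified Data.Text as T
import OpenTelemetry.Context.ThreadLocal (getContext)
import OpenTelemetry.Trace

-- Sketch: an inSpan-like combinator that needs only MonadIO and MonadMask.
inSpanMasked :: (MonadIO m, MonadMask m) => Tracer -> Text -> SpanArguments -> m a -> m a
inSpanMasked tracer name args action = do
  ctx <- getContext
  fmap fst $ generalBracket
    (createSpan tracer ctx name args)
    (\s exitCase -> do
        case exitCase of
          ExitCaseException e -> do
            recordException s [] Nothing e
            setStatus s $ Error $ T.pack $ displayException e
          _ -> pure ()
        endSpan s Nothing)
    (const action)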

Adding an eventlog exporter

I'd like to add an eventlog exporter to this library.

I'm thinking it would allow writing traces to the eventlog, and some extra data from the eventlog can be added (eg, sleeping threads, extra GC info, why threads are being blocked), and then a separate process can re-export these traces using the existing exporters.

I wanted to open an issue before starting work on this in case anyone had any thoughts about it.

Instrumented application example?

Hello!

Just wondering if it would be possible to have a quick example of an instrumented application, in the examples? One that is not using any of the instrumentation libraries.

See through more exception wrappers

Our Honeycomb query for exception types is a bit disappointing - the most common exception we deal with is AnnotatedException SomeException, seconded by SomeAsyncException. As a result, we don't get a lot of benefit from the exception type.

It'd be good to "dig through" the exception types and collect extra information on them. For example, instead of doing a mere SomeException inner, we can do something like:

case someException of
  SomeException inner 
    | Just (SomeAsyncException inner') <- cast inner
    -> -- now we can report a more interesting type

But there are a bunch of library defined wrappers, like unliftio's {As,S}yncExceptionWrapper, and AnnotatedException, and I'm sure more that we're not considering. So it may be good to consider a means of providing "unwrappers" that can dig through and add interesting stuff to an exception's type.
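
A rough sketch of the "unwrapper" idea for the async wrapper alone (the same shape would be repeated, or made pluggable, for AnnotatedException, unliftio's wrappers, and so on):

import Control.Exception (SomeAsyncException (..), SomeException, fromException, toException)

-- Keep peeling SomeAsyncException until a more informative exception is found.
unwrapAsync :: SomeException -> SomeException
unwrapAsync e =
  case fromException e of
    Just (SomeAsyncException inner) -> unwrapAsync (toException inner)
    Nothing -> e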

MonadMask Support

My MTL stack includes ExceptT and I want to leave the door open for StateT too, but UnliftIO doesn't support it. I poked around a bit and got it working by making versions of inSpan and bracketError that rely only on MonadMask instead

Would you accept a PR that adds a new version of inSpan that doesn't require UnliftIO?

Great project btw! Glad to see someone finally tackling this issue in Haskell.

Help logging to stdout

Hello, I am trying to figure out the best way to log tracing events to stdout for development. I have considered two approaches, but I'm not sure which would be best (and couldn't find any documentation on this).

The first approach is to use the inMemoryChannelExporter in a setup like this:

    withTracerDev :: ((TracerOptions -> Tracer) -> IO c) -> IO c
    withTracerDev f = bracket
      -- Install the SDK, pulling configuration from the environment
      (do
        (processor, chann) <- inMemoryChannelExporter
        provider <- createTracerProvider [processor]
          TracerProviderOptions
            { tracerProviderOptionsIdGenerator = _
            , tracerProviderOptionsSampler = _
            , tracerProviderOptionsResources = _
            , tracerProviderOptionsAttributeLimits = _
            , tracerProviderOptionsSpanLimits = _
            , tracerProviderOptionsPropagators = _
            , tracerProviderOptionsLogger = _
            }
        pure (provider, chann)
      )
      -- Ensure that any spans that haven't been exported yet are flushed
      (shutdownTracerProvider . fst)
      -- Get a tracer so you can create spans
      (\(tracerProvider, chann) -> f chann $ makeTracer tracerProvider "my-service")

Then I would just fork a thread, read from the channel, then write to stdout.

The second approach uses the logger option on TracerProviderOptions. But I couldn't figure out what the semantics of this action are just from the source.
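
A third option, pieced together from the hspec reproduction elsewhere on this page (and hedged the same way, since emptyTracerProviderOptions may need to come from a different module depending on the version): use hs-opentelemetry-exporter-handle together with the simple processor, so spans are written out as soon as they end, with no channel or extra thread. Note the separate issue above about the simple processor hanging on shutdown.

import Control.Exception (bracket)
import OpenTelemetry.Exporter.Handle (defaultFormatter, stderrExporter')
import OpenTelemetry.Processor.Simple (SimpleProcessorConfig (..), simpleProcessor)
import OpenTelemetry.Trace

-- Development-only setup: print each span to stderr as it ends.
withDevTracerProvider :: (TracerProvider -> IO a) -> IO a
withDevTracerProvider =
  bracket
    (do
      processor <- simpleProcessor SimpleProcessorConfig {exporter = stderrExporter' (pure . defaultFormatter)}
      provider <- createTracerProvider [processor] emptyTracerProviderOptions
      setGlobalTracerProvider provider
      pure provider)
    shutdownTracerProvider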

Code detection is flaky when you define helpers for `inSpan`

If you define an inSpan helper function like:

myInSpan :: HasCallStack => Text -> IO a -> IO a
myInSpan name action = inSpan name defaultSpanArguments action

Then every span created through myInSpan will be reported as originating in myInSpan itself. This is because inSpan' uses callerAttributes, which only looks at the immediate caller of inSpan.
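
One possible workaround (hedged; whether it helps depends on exactly which stack frame callerAttributes inspects) is to freeze the call stack inside the helper, so the top frame that inSpan sees points at the helper's call site in user code rather than at the helper's body:

import GHC.Stack (HasCallStack, withFrozenCallStack)

-- Hypothetical helper: freeze the stack before delegating to inSpan.
myInSpan :: HasCallStack => Text -> IO a -> IO a
myInSpan name action = withFrozenCallStack $ inSpan name defaultSpanArguments action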

There is no way to set the `service.name` attribute in code

AFAICT the only supported way to set the service name is via the OTEL_SERVICE_NAME env var. The problem with this is that if multiple services are running in the same environment (such as in CI) then all the services will pick up that env var.

I've seen other (non-haskell) opentelemetry libraries which allow you to set the service.name attribute explicitly in code, but have the env var take precedence. This seems like a reasonable behavior.

Without all the data constructors exported, the closest a library consumer can get to this behavior is to write their own version of initializeTracerProvider:

myInitializeTracerProvider = do
  (processors, opts) <- Otel.getTracerProviderInitializationOptions' (Otel.mkResource ["service.name" Otel..= ("my_cool_service_name" :: Text)] :: Otel.Resource 'Nothing)
  Otel.createTracerProvider processors opts

However, this doesn't work because merging HashMaps gives precedence to the left map. It seems like you have to patch the library to make this work.

I think the simplest solution would be to change detectService to receive an optional custom service name and thread that name down from a custom initializeTracerProvider:

detectService :: Maybe T.Text -> IO Service
detectService customName = do
  mSvcName <- lookupEnv "OTEL_SERVICE_NAME"
  svcName <- case mSvcName of
    Nothing -> maybe (T.pack . ("unknown_service:" <>) <$> getProgName) pure customName
    Just svcName -> pure $ T.pack svcName
  pure $
    Service
      { serviceName = svcName
      , serviceNamespace = Nothing
      , serviceInstanceId = Nothing
      , serviceVersion = Nothing
      }

initializeTracerProviderCustomService :: T.Text -> IO TracerProvider
initializeTracerProviderCustomService = ...

I have a branch with these changes on it if we want this behavior. Or if you prefer another implementation I am happy to write the PR.

Unsure if instrumentation-persistent is tracing write operations

Here's an example trace with an exception,

[screenshot: example trace]

The exception (correctly annotated on the red spans) is:

SqlError
  { sqlState = "23505"
  , sqlExecStatus = FatalError
  , sqlErrorMsg = "duplicate key value violates unique constraint \"...\""
  , sqlErrorDetail = "Key (id)=(1) already exists."
  , sqlErrorHint = ""
  }

So clearly there was a write operation that failed, but there's no span for it. Shouldn't there be?

If I look at the instrumentation code, I see:

              { stmtQuery = \ps -> do
                  ctxt <- getContext
                  let spanCreator = do
                        s <- createSpan
                          t
                          ctxt
                          sql
                          (defaultSpanArguments { kind = Client, attributes = ("db.statement", toAttribute sql) : attrs })
                        adjustContext (insertSpan s)
                        pure (lookupSpan ctxt, s)
                      spanCleanup (parent, s) = do
                        s `endSpan` Nothing
                        adjustContext $ \ctx ->
                          maybe ctx (`insertSpan` ctx) parent

                  (p, child) <- mkAcquire spanCreator spanCleanup

                  annotateBasics child conn
                  case stmtQuery stmt ps of
                    Acquire stmtQueryAcquireF -> Acquire $ \f ->
                      handleAny
                        (\(SomeException err) -> do
                          recordException child [] Nothing err
                          endSpan child Nothing
                          throwIO err
                        )
                        (stmtQueryAcquireF f)

              , stmtExecute = \ps -> do
                inSpan' t sql (defaultSpanArguments { kind = Client, attributes = ("db.statement", toAttribute sql) : attrs }) $ \s -> do
                  annotateBasics s conn
                  stmtExecute stmt ps

So I see exception handling in stmtQuery but no similar handling in stmtExecute. I'll admit I'm not very familiar with all this, but could that be why?

Async exception safety in the batch exporter

The following is a stream-of-consciousness diagnosis and debugging of a hanging shutdown:

Debugging the shutdownTracerProvider

I noticed that the test suite was failing to cleanly exit.
The final thing in the app is withGlobalTracing, which looks like:

bracket initializeTracing shutdownTracerProvider (const action)

This bracket comes from UnliftIO.Exception, which means that
shutdownTracerProvider runs in an uninterruptibleMask. This makes it pretty
fragile - if it blocks, it may become unkillable.

Turning it into a Control.Exception.bracket, which does not mask the cleanup
action, did allow the shutdown to complete. So something potentially odd is
going on here!

I did a crude timer on the shutdown, and it took 30 seconds.
That's a suspicious amount - seems like a timeout to me.

-- | This method provides a way for provider to do any cleanup required.
--
-- This will also trigger shutdowns on all internal processors.
--
-- @since 0.0.1.0
shutdownTracerProvider :: MonadIO m => TracerProvider -> m ()
shutdownTracerProvider TracerProvider{..} = liftIO $ do
  asyncShutdownResults <- forM tracerProviderProcessors $ \processor -> do
    processorShutdown processor
  mapM_ wait asyncShutdownResults

So, we'll look at each processor, call processorShutdown on it, and then wait
on those asyncs to finish. processorShutdown is a field on the Processor
record:

data Processor = Processor
  { processorOnStart :: IORef ImmutableSpan -> Context -> IO ()
  -- ^ Called when a span is started. This method is called synchronously on the thread that started the span, therefore it should not block or throw exceptions.
  , processorOnEnd :: IORef ImmutableSpan -> IO ()
  -- ^ Called after a span is ended (i.e., the end timestamp is already set). This method is called synchronously within the 'OpenTelemetry.Trace.endSpan' API, therefore it should not block or throw an exception.
  , processorShutdown :: IO (Async ShutdownResult)
  -- ^ Shuts down the processor. Called when SDK is shut down. This is an opportunity for processor to do any cleanup required.
  --
  -- Shutdown SHOULD be called only once for each SpanProcessor instance. After the call to Shutdown, subsequent calls to OnStart, OnEnd, or ForceFlush are not allowed. SDKs SHOULD ignore these calls gracefully, if possible.
  --
  -- Shutdown SHOULD let the caller know whether it succeeded, failed or timed out.
  --
  -- Shutdown MUST include the effects of ForceFlush.
  --
  -- Shutdown SHOULD complete or abort within some timeout. Shutdown can be implemented as a blocking API or an asynchronous API which notifies the caller via a callback or an event. OpenTelemetry client authors can decide if they want to make the shutdown timeout configurable.
{- snip ... -}

As a record field, we now need to find where this provider is created.
Fortunately, that's right there in bracket:

-- src/Mercury/Tracing.hs

initializeTracing :: MonadUnliftIO m => m TracerProvider
initializeTracing = do
  (processors, tracerOptions') <- liftIO getTracerProviderInitializationOptions

  -- FIXME: this is probably an upstream bug in hs-opentelemetry, but regardless
  -- Throw away all spans if the OTLP endpoint is not configured: anyone
  -- exporting to Honeycomb would have to have set this, so it should be fine
  otlpEndpoint <- liftIO $ lookupEnv "OTEL_EXPORTER_OTLP_ENDPOINT"
  let processors' = case otlpEndpoint of
        Just _ -> processors
        Nothing -> []

  provider <- createTracerProvider processors' tracerOptions'
  setGlobalTracerProvider provider
  pure provider

So we use the processors for the default, but only if the otlpEndpoint is
populated.

-- src/OpenTelemetry/Trace.hs

getTracerProviderInitializationOptions :: IO ([Processor], TracerProviderOptions)
getTracerProviderInitializationOptions = getTracerProviderInitializationOptions' (mempty :: Resource 'Nothing)

getTracerProviderInitializationOptions' :: (ResourceMerge 'Nothing any ~ 'Nothing) => Resource any -> IO ([Processor], TracerProviderOptions)
getTracerProviderInitializationOptions' rs = do
  sampler <- detectSampler
  attrLimits <- detectAttributeLimits
  spanLimits <- detectSpanLimits
  propagators <- detectPropagators
  processorConf <- detectBatchProcessorConfig
  exporters <- detectExporters
  builtInRs <- detectBuiltInResources
  envVarRs <- mkResource . map Just <$> detectResourceAttributes
  let allRs = mergeResources (builtInRs <> envVarRs) rs
  processors <- case exporters of
    [] -> do
      pure []
    e:_ -> do
      pure <$> batchProcessor processorConf e
  let providerOpts = emptyTracerProviderOptions
        { tracerProviderOptionsIdGenerator = defaultIdGenerator
        , tracerProviderOptionsSampler = sampler
        , tracerProviderOptionsAttributeLimits = attrLimits
        , tracerProviderOptionsSpanLimits = spanLimits
        , tracerProviderOptionsPropagators = propagators
        , tracerProviderOptionsResources = materializeResources allRs
        }
  pure (processors, providerOpts)

-- src/OpenTelemetry/Trace/Core.hs

-- | Initialize a new tracer provider
--
-- You should generally use 'getGlobalTracerProvider' for most applications.
createTracerProvider :: MonadIO m => [Processor] -> TracerProviderOptions -> m TracerProvider
createTracerProvider ps opts = liftIO $ do
  let g = tracerProviderOptionsIdGenerator opts
  pure $ TracerProvider
    (V.fromList ps)
    g
    (tracerProviderOptionsSampler opts)
    (tracerProviderOptionsResources opts)
    (tracerProviderOptionsAttributeLimits opts)
    (tracerProviderOptionsSpanLimits opts)
    (tracerProviderOptionsPropagators opts)
    (tracerProviderOptionsLogger opts)

So createTracerProvider isn't doing anything fancy.

processors is either an empty list, or if we have at least one exporter, then
we create the batchProcessor. Nothing else modifies it, so we're good to just
look at the function:

-- |
-- The batch processor accepts spans and places them into batches. Batching helps better compress the data and reduce the number of outgoing connections
-- required to transmit the data. This processor supports both size and time based batching.
--
batchProcessor :: MonadIO m => BatchTimeoutConfig -> Exporter ImmutableSpan -> m Processor
batchProcessor BatchTimeoutConfig{..} exporter = liftIO $ do
  batch <- newIORef $ boundedMap maxQueueSize
  workSignal <- newEmptyMVar
  worker <- async $ loop $ do
    req <- liftIO $ timeout (millisToMicros scheduledDelayMillis)
      $ takeMVar workSignal
    batchToProcess <- liftIO $ atomicModifyIORef' batch buildExport
    res <- liftIO $ Exporter.exporterExport exporter batchToProcess

    -- if we were asked to shutdown, quit cleanly after this batch
    -- FIXME: this could lose batches if there's more than one in queue?
    case req of
      Just Shutdown -> throwE res
      _ -> pure ()

  pure $ Processor
    { processorOnStart = \_ _ -> pure ()
    , processorOnEnd = \s -> do
        span_ <- readIORef s
        appendFailed <- atomicModifyIORef' batch $ \builder ->
          case push span_ builder of
            Nothing -> (builder, True)
            Just b' -> (b', False)
        when appendFailed $ void $ tryPutMVar workSignal Flush

    , processorForceFlush = void $ tryPutMVar workSignal Flush
    -- TODO where to call restore, if anywhere?
    , processorShutdown = async $ mask $ \_restore -> do
        -- flush remaining messages
        void $ tryPutMVar workSignal Shutdown

        shutdownResult <- timeout (millisToMicros exportTimeoutMillis) $
          wait worker
        -- make sure the worker comes down
        uninterruptibleCancel worker
        -- TODO, not convinced we should shut down processor here

        case shutdownResult of
          Nothing -> pure ShutdownFailure
          Just _ -> pure ShutdownSuccess
    }
  where
    millisToMicros = (* 1000)

There's a lot here, so let's just look at processorShutdown - that's the actual
code we're running.

    -- TODO where to call restore, if anywhere?
    , processorShutdown = async $ mask $ \_restore -> do
        -- flush remaining messages
        void $ tryPutMVar workSignal Shutdown

        shutdownResult <- timeout (millisToMicros exportTimeoutMillis) $
          wait worker
        -- make sure the worker comes down
        uninterruptibleCancel worker
        -- TODO, not convinced we should shut down processor here

        case shutdownResult of
          Nothing -> pure ShutdownFailure
          Just _ -> pure ShutdownSuccess

Well, we have a mask there, and we never restore it. But the original code
was in an uninterruptibleMask thanks to bracket anyway.

Line by line, let's look at these:

        -- flush remaining messages
        void $ tryPutMVar workSignal Shutdown

tryPutMVar is guaranteed not to block, so unlikely to be a problem here.
However, if the workSignal is full, then the Shutdown message is not
received.

        shutdownResult <- timeout (millisToMicros exportTimeoutMillis) $
          wait worker

This may be our culprit. timeout comes from System.Timeout in the base
package. The docs have a hint:

-- |Wrap an 'IO' computation to time out and return @Nothing@ in case no result
-- is available within @n@ microseconds (@1\/10^6@ seconds). In case a result
-- is available before the timeout expires, @Just a@ is returned. A negative
-- timeout interval means \"wait indefinitely\". When specifying long timeouts,
-- be careful not to exceed @maxBound :: Int@.
--
-- >>> timeout 1000000 (threadDelay 1000 *> pure "finished on time")
-- Just "finished on time"
--
-- >>> timeout 10000 (threadDelay 100000 *> pure "finished on time")
-- Nothing
--
-- The design of this combinator was guided by the objective that @timeout n f@
-- should behave exactly the same as @f@ as long as @f@ doesn't time out. This
-- means that @f@ has the same 'myThreadId' it would have without the timeout
-- wrapper. Any exceptions @f@ might throw cancel the timeout and propagate
-- further up. It also possible for @f@ to receive exceptions thrown to it by
-- another thread.
--
-- A tricky implementation detail is the question of how to abort an @IO@
-- computation. This combinator relies on asynchronous exceptions internally
-- (namely throwing the computation the 'Timeout' exception).  The technique
-- works very well for computations executing inside of the Haskell runtime
-- system, but it doesn't work at all for non-Haskell code.  Foreign function
-- calls, for example, cannot be timed out with this combinator simply because
-- an arbitrary C function cannot receive asynchronous exceptions. When
-- @timeout@ is used to wrap an FFI call that blocks, no timeout event can be
-- delivered until the FFI call returns, which pretty much negates the purpose
-- of the combinator. In practice, however, this limitation is less severe than
-- it may sound. Standard I\/O functions like 'System.IO.hGetBuf',
-- 'System.IO.hPutBuf', Network.Socket.accept, or 'System.IO.hWaitForInput'
-- appear to be blocking, but they really don't because the runtime system uses
-- scheduling mechanisms like @select(2)@ to perform asynchronous I\/O, so it
-- is possible to interrupt standard socket I\/O or file I\/O using this
-- combinator.
---
-- Note that 'timeout' cancels the computation by throwing it the 'Timeout'
-- exception. Consequently blanket exception handlers (e.g. catching
-- 'SomeException') within the computation will break the timeout behavior.

Ah, so, alas, it throws an asynchronous exception. But we are in an
uninterruptible masked state. So the exception is never delivered to our
timeout. That explains why UnliftIO.Exception.bracket causes the problem.

(also, that timeout? that's the 30 seconds we're waiting on --
batchTimeoutConfig has a default of 30,000 milliseconds to wait)

Woof. That's why our thread is not dying, but why isn't it, uh, dying
correctly?

Well, back to the shutdown code - what happens next?

        -- make sure the worker comes down
        uninterruptibleCancel worker
        -- TODO, not convinced we should shut down processor here

Huh. Not entirely clear what goes on here!

-- | Cancel an asynchronous action
--
-- This is a variant of `cancel`, but it is not interruptible.
{-# INLINE uninterruptibleCancel #-}
uninterruptibleCancel :: Async a -> IO ()
uninterruptibleCancel = uninterruptibleMask_ . cancel

OK, so we're not saying "you MUST cancel, NOW", we're saying, "We're going to
cancel you, and the cancellation is uninterruptible." Let's look at cancel
itself:

-- | Cancel an asynchronous action by throwing the @AsyncCancelled@
-- exception to it, and waiting for the `Async` thread to quit.
-- Has no effect if the 'Async' has already completed.
--
-- > cancel a = throwTo (asyncThreadId a) AsyncCancelled <* waitCatch a
--
-- Note that 'cancel' will not terminate until the thread the 'Async'
-- refers to has terminated. This means that 'cancel' will block for
-- as long said thread blocks when receiving an asynchronous exception.
--
-- For example, it could block if:
--
-- * It's executing a foreign call, and thus cannot receive the asynchronous
-- exception;
-- * It's executing some cleanup handler after having received the exception,
-- and the handler is blocking.
{-# INLINE cancel #-}
cancel :: Async a -> IO ()
cancel a@(Async t _) = throwTo t AsyncCancelled <* waitCatch a

OK, so cancel actually waits for the Async to finish! So if the Async
isn't responding to shutdown for whatever reason, then cancel won't return.

It seems to me that we should just be using cancel. That way, if we receive an
async exception in our shutdown, then we don't completely freak out.

        case shutdownResult of
          Nothing -> pure ShutdownFailure
          Just _ -> pure ShutdownSuccess

OK, this is pure, nothing funny or fancy here.

Investigating the worker

OK, so timeout is why we're not dying, but the real reason is that we're
doing something that takes forever inside of that timeout - trying to do a wait worker. So let's look at worker and see what it's doing.

  worker <- async $ loop $ do
    req <- liftIO $ timeout (millisToMicros scheduledDelayMillis)
      $ takeMVar workSignal
    batchToProcess <- liftIO $ atomicModifyIORef' batch buildExport
    res <- liftIO $ Exporter.exporterExport exporter batchToProcess

    -- if we were asked to shutdown, quit cleanly after this batch
    -- FIXME: this could lose batches if there's more than one in queue?
    case req of
      Just Shutdown -> throwE res
      _ -> pure ()

worker is an Async ExportResult. loop is a neat trick:

-- | Exitable forever loop
loop :: Monad m => ExceptT e m a -> m e
loop = liftM (either id id) . runExceptT . forever

throwE terminates the loop.

So, when the worker is created, we use async. When we call
withGlobalTracing, we aren't in a masked state. And the initialization doesn't
appear to do any masking either. So this should be cancellable.

We have another timeout - this time, we're trying to takeMVar from the
workSignal. If workSignal is empty, this will block. So we takeMVar for
about ~5000ms by default - 5 seconds. There are a few things that put things in
the work signal: Shutdown causes the loop to exit.

Exporter.exporterExport is a record field:

data Exporter a = Exporter
  { exporterExport :: HashMap InstrumentationLibrary (Vector a) -> IO ExportResult
  , exporterShutdown :: IO ()
  }

That was provided in getTracerProviderInitializationOptions':

  exporters <- detectExporters
  {- snip -}
  processors <- case exporters of
    [] -> do
      pure []
    e:_ -> do
      pure <$> batchProcessor processorConf e

So, where does this come from?

knownExporters :: [(T.Text, IO (Exporter ImmutableSpan))]
knownExporters =
  [ ("otlp", do
      otlpConfig <- loadExporterEnvironmentVariables
      otlpExporter otlpConfig
    )
  , ("jaeger", error "Jaeger exporter not implemented")
  , ("zipkin", error "Zipkin exporter not implemented")
  ]

-- TODO, support multiple exporters
detectExporters :: IO [Exporter ImmutableSpan]
detectExporters = do
  exportersInEnv <- fmap (T.splitOn "," . T.pack) <$> lookupEnv "OTEL_TRACES_EXPORTER"
  if exportersInEnv == Just ["none"]
    then pure []
    else do
      let envExporters = fromMaybe ["otlp"] exportersInEnv
          exportersAndRegistryEntry = map (\k -> maybe (Left k) Right $ lookup k knownExporters) envExporters
          (_notFound, exporterIntializers) = partitionEithers exportersAndRegistryEntry
      -- TODO, notFound logging
      sequence exporterIntializers

So, the only real option appears to be otlpExporter. Digging in... and it's a
big function, so skim it real quick.

-- | Initial the OTLP 'Exporter'
otlpExporter :: (MonadIO m) => OTLPExporterConfig -> m (Exporter OT.ImmutableSpan)
otlpExporter conf = do
  -- TODO, url parsing is janky
-- TODO configurable retryDelay, maximum retry counts
  req <- liftIO $ parseRequest (maybe "http://localhost:4318/v1/traces" (<> "/v1/traces") (otlpEndpoint conf))

  let (encodingHeader, encoder) = maybe (id, id)
        (\case
          None -> (id, id)
          GZip -> (((hContentEncoding, "gzip") :), compress)
        )
        (otlpTracesCompression conf <|> otlpCompression conf)

      baseReqHeaders = encodingHeader $
        (hContentType, protobufMimeType) :
        (hAcceptEncoding, protobufMimeType) :
        fromMaybe [] (otlpHeaders conf) ++
        fromMaybe [] (otlpTracesHeaders conf) ++
        requestHeaders req
      baseReq = req
        { method = "POST"
        , requestHeaders = baseReqHeaders
        }
  pure $ Exporter
    { exporterExport = \spans_ -> do
        let anySpansToExport = H.size spans_ /= 0 && not (all V.null $ H.elems spans_)
        if anySpansToExport
          then do
            result <- try $ exporterExportCall encoder baseReq spans_
            case result of
              Left err -> do
                print err
                pure $ Failure $ Just err
              Right ok -> pure ok
          else pure Success
    , exporterShutdown = pure ()
    }
  where
    retryDelay = 100_000 -- 100ms
    maxRetryCount = 5
    isRetryableStatusCode status_ =
      status_ == status408 || status_ == status429 || (statusCode status_ >= 500 && statusCode status_ < 600)
    isRetryableException = \case
      ResponseTimeout -> True
      ConnectionTimeout -> True
      ConnectionFailure _ -> True
      ConnectionClosed -> True
      _ -> False

    exporterExportCall encoder baseReq spans_ = do
      msg <- encodeMessage <$> immutableSpansToProtobuf spans_
      -- TODO handle server disconnect
      let req = baseReq
            { requestBody =
                RequestBodyLBS $ encoder $ L.fromStrict msg
            }
      sendReq req 0 -- TODO =<< getTime for maximum cutoff

    sendReq req backoffCount = do
      eResp <- try $ httpBS req

      let exponentialBackoff = if backoffCount == maxRetryCount
            then pure $ Failure Nothing
            else do
              threadDelay (retryDelay `shiftL` backoffCount)
              sendReq req (backoffCount + 1)

      case eResp of
        Left err@(HttpExceptionRequest _ e) -> if isRetryableException e
          then exponentialBackoff
          else pure $ Failure $ Just $ SomeException err
        Left err -> pure $ Failure $ Just $ SomeException err
        Right resp -> if isRetryableStatusCode (responseStatus resp)
          then case lookup hRetryAfter $ responseHeaders resp of
            Nothing -> exponentialBackoff
            Just retryAfter -> do
              -- TODO support date in retry-after header
              case readMaybe $ C.unpack retryAfter of
                Nothing -> exponentialBackoff
                Just seconds -> do
                  threadDelay (seconds * 1_000_000)
                  sendReq req (backoffCount + 1)

          else pure $! if statusCode (responseStatus resp) >= 300
            then Failure Nothing
            else Success

Wow! OK, that's pretty huge. The active code appears to be the then branch
inside of the record field:

        if anySpansToExport
          then do
            result <- try $ exporterExportCall encoder baseReq spans_
            case result of
              Left err -> do
                print err
                pure $ Failure $ Just err
              Right ok -> pure ok
          else pure Success

printing? in my dependencies? it's more likely than you think

anyway, we're definitely doing something that can take a while. Looks like our
app is potentially doing a lot of threadDelay to try and get these things in.

Well, that could explain the behavior we're seeing - if we're stuck in an
exponentialBackoff that initially waits 100ms and doubles every failure, we
may see that behavior.

Just out of curiosity, it does seem like... this should eventually end? So I'm
just going to let it run for maybe 20 minutes or so and see what happens.
Starting at about 3:15 my time...

and at 3:24, it's still goin'. Not good. Something is up.

            result <- try $ exporterExportCall encoder baseReq spans_
            case result of
              Left err -> do
                print err
                pure $ Failure $ Just err

That err is a SomeException

data ExportResult
  = Success
  | Failure (Maybe SomeException)

which means we could have caught an async exception, since this is plain
Control.Exception:

import Control.Exception (SomeException(..), try)

But, if we got a Failure, then we should have handled that in the call to
the exporter, right? Let's jump back up the stack a bit, and see that exporter
call.

  worker <- async $ loop $ do
    req <- liftIO $ timeout (millisToMicros scheduledDelayMillis)
      $ takeMVar workSignal
    batchToProcess <- liftIO $ atomicModifyIORef' batch buildExport
    res <- liftIO $ Exporter.exporterExport exporter batchToProcess

    -- if we were asked to shutdown, quit cleanly after this batch
    -- FIXME: this could lose batches if there's more than one in queue?
    case req of
      Just Shutdown -> throwE res
      _ -> pure ()

oops! We don't actually do anything with res! Success or failure, we just
kinda... ignore it.

So my new hypothesis is something like this:

The exporterExport caught an async exception in try, and this was dropped in
the worker loop. For some reason, we don't get the memo to shutdown, and the
timeout never completes, because the thread is blocked waiting on the
worker.

So, what else can fill that MVar?

    , processorOnEnd = \s -> do
        span_ <- readIORef s
        appendFailed <- atomicModifyIORef' batch $ \builder ->
          case push span_ builder of
            Nothing -> (builder, True)
            Just b' -> (b', False)
        when appendFailed $ void $ tryPutMVar workSignal Flush

    , processorForceFlush = void $ tryPutMVar workSignal Flush

These are our two candidates. They can do a tryPutMVar with Flush, which
would not shut the worker down.

What calls processorOnEnd? endSpan, which is used liberally in our codebase
as part of inSpan = bracket createSpan endSpan.

OK, let's run this through:

  1. worker starts a loop iteration. The timeout starts.
  2. endSpan happens, and we get a Flush in the MVar.
  3. worker takeMVar succeeds, emptying the MVar.
  4. worker then calls exporterExport, which begins the export. This may take
    a long time if the request fails for some reason!
  5. endSpan happens, we get another Flush in the MVar.
  6. BOOM, we've called processorShutdown. Now we tryPutMVar Shutdown, but
    this does nothing because it already has a Flush.
  7. The timeout begins, waiting for the worker to complete. But because the
    Shutdown write didn't take, it won't ever stop.
  8. Then the worker gets a cancel, receiving an async exception to kill the
    thread. BUT... if this happens in that try? then we swallow the exception
    and keep going! Most likely, because that try surrounds a lot of
    threadDelay, we will throw to that try.

wait worker never completes.


I'm gonna work up a PR that should alleviate these issues.

Extensibility for Propagators and Samplers

👋 Hi there!

We're just getting started with OpenTelemetry and are looking to emit OTel traces from our app and then use the AWS collector to export to X-Ray. This means we need an AWS X-Ray Propagator, IdGenerator, and Sampler. I've written some of this code and it's working (I think), but it's been awkward (IMO) to get the SDK to use it.

I had to basically re-implement initializeGlobalTracerProvider just so that I could inject a record update to add my propagator after all the environment-variable-based configuration has happened. But then I look at the docs for said environment variables, and it seems like I should be able to "register" these things somehow, and then just say (e.g.) OTEL_PROPAGATORS=tracecontext,baggage,awsxray at runtime to turn them on. That would be really useful as we start migrating from X-Ray to Datadog -- to be able to choose via ENV at runtime like that.

So I guess my question is, am I missing a way in this library to do that?

Handling `ExitSuccess` gracefully

Currently if I do an inSpan and then exitSuccess, it'll report that the span was an error. This is because Haskell implements exitSuccess by throwing an exception, which hs-opentelemetry records as an exception event on the span.

Add thread start / thread end hooks to RTS

We currently have a hard time making sure that contexts are accessible in threads that are outside of our control (e.g. internal threading in 3rd party libraries). Extending the RTS to allow us to hook into the thread lifecycle would allow us to make sure that execution contexts are able to propagate through the system without requiring libraries to monkeypatch their concurrent code.

Deadlock test suite

We should implement a test suite that runs a binary program and verifies that various configurations don't deadlock.

This needs to be managed outside of the Haskell RTS since the Haskell RTS is incapable of handling this internally. Consider:

main = do
  cancel =<< uninterruptibleMask_ (async $ forever $ pure ())

This main loop can only be killed via a system signal, so we cannot test for the absence of a deadlock in a Haskell-only test suite. This means we'll need to have compiled binaries that the test suite builds and runs, and then verifies that they complete within some specified time.
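
A hedged sketch of what the external watchdog could look like (the binary path is whatever the test suite builds; all names here are illustrative):

import System.Process (proc, terminateProcess, waitForProcess, withCreateProcess)
import System.Timeout (timeout)

-- Run a pre-built example binary and fail if it does not exit within the
-- deadline, which we treat as evidence of a deadlock.
assertExitsWithin :: Int -> FilePath -> IO ()
assertExitsWithin micros exe =
  withCreateProcess (proc exe []) $ \_ _ _ ph -> do
    result <- timeout micros (waitForProcess ph)
    case result of
      Nothing -> do
        terminateProcess ph
        fail (exe <> " did not exit in time (possible deadlock)")
      Just _exitCode -> pure ()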

Release propagators-datadog

Any plans to release https://github.com/iand675/hs-opentelemetry/tree/main/propagators/datadog ?

I've been able to get Datadog correlation fully working, as far as I can tell, by just adding a trace ID to my logs. The collector handles translating IDs for traces, I guess, and then I just set my Trace ID field when injecting logs, and it all matches up.

Thing is, I've had to inline the bit fiddling to do it, which I'd rather not. I'm not actually sure what a full-on Propagator will do for me, but I'd at least like to rely on the package for the bit-fiddling functions.
