
oteps's Introduction

OpenTelemetry Enhancement Proposal (OTEP)


Evolving OpenTelemetry at the speed of Markdown

OpenTelemetry uses an "OTEP" (similar to an RFC) process for proposing changes to the OpenTelemetry specification.

What changes require an OTEP

The OpenTelemetry OTEP process is intended for changes that are cross-cutting - that is, applicable across languages and implementations - and either introduce new behaviour, change desired behaviour, or otherwise modify requirements.

In practice, this means that OTEPs should be used for such changes as:

  • New tracer configuration options
  • Additions to span data
  • New metric types
  • Modifications to extensibility requirements

On the other hand, they do not need to be used for such changes as:

  • Bug fixes
  • Rephrasing, grammatical fixes, typos, etc.
  • Refactoring
  • Things that affect only a single language or implementation

Note: The above lists are intended only as examples and are not meant to be exhaustive. If you don't know whether a change requires an OTEP, please feel free to ask!

Extrapolating cross-cutting changes

Sometimes, a change that is only immediately relevant within a single language or implementation may be indicative of a problem upstream in the specification. We encourage you to add an OTEP if and when you notice such cases.

OTEP scope

While OTEPs are intended for "significant" changes, we recommend trying to keep each OTEP's scope as small as makes sense. A general rule of thumb is that if the core functionality proposed could still provide value without a particular piece, then that piece should be removed from the proposal and used instead as an example (and, ideally, given its own OTEP!).

For example, an OTEP proposing configurable sampling and various samplers should instead be split into one OTEP proposing configurable sampling as well as an OTEP per sampler.

Writing an OTEP

  • First, fork this repo.
  • Copy 0000-template.md to text/0000-my-OTEP.md, where my-OTEP is a title relevant to your proposal, and 0000 is the OTEP ID. Leave the number as is for now. Once a Pull Request is made, update this ID to match the PR ID.
  • Fill in the template. Put care into the details: It is important to present convincing motivation, demonstrate an understanding of the design's impact, and honestly assess the drawbacks and potential alternatives.

Submitting the OTEP

  • An OTEP is proposed by posting it as a PR. Once the PR is created, update the OTEP file name to use the PR ID as the OTEP ID.
  • An OTEP is approved when four reviewers github-approve the PR. The OTEP is then merged.
  • If an OTEP is rejected or withdrawn, the PR is closed. Note that these OTEP submissions are still recorded, as GitHub retains both the discussion and the proposal, even if the branch is later deleted.
  • If an OTEP discussion becomes long, and the OTEP then goes through a major revision, the next version of the OTEP can be posted as a new PR, which references the old PR. The old PR is then closed. This makes OTEP review easier to follow and participate in.

Integrating the OTEP into the Spec

  • Once an OTEP is approved, an issue is created in the specification repo to integrate the OTEP into the spec.
  • When reviewing the spec PR for the OTEP, focus on whether the spec is written clearly, and reflects the changes approved in the OTEP. Please abstain from relitigating the approved OTEP changes at this stage.
  • An OTEP is integrated when four reviewers github-approve the spec PR. The PR is then merged, and the spec is versioned.

Implementing the OTEP

  • Once an OTEP is integrated into the spec, an issue is created in the backlog of every relevant OpenTelemetry implementation.
  • PRs are made until all the requested changes are implemented.
  • The status of the OpenTelemetry implementation is updated to reflect that it is implementing a new version of the spec.

Changes to the OTEP process

The hope and expectation is that the OTEP process will evolve with the OpenTelemetry project. The process is by no means fixed.

Have suggestions? Concerns? Questions? Please raise an issue or raise the matter on our community chat.

Background on the OpenTelemetry OTEP process

Our OTEP process borrows from the Rust RFC and Kubernetes Enhancement Proposal processes, the former also being very influential on the latter; as well as the OpenTracing OTEP process. Massive kudos and thanks to the respective authors and communities for providing excellent prior art 💖


oteps's Issues

Use pull/issue # as sequence numbers in the RFC file names

It appears one of the reasons people want to merge RFCs asap is to reserve the RFC number. That is a completely artificial and unnecessary reason to merge RFCs without addressing feedback.

I propose to rename all files to yyyy-mm-dd-short-title.md, using the date the RFC is created.

Update 2019-09-13

Per discussion below, changing this proposal to use tracking issue # or the PR # (if there's no tracking issue) for RFC numbers.

Proposal: "Pluggable backend" Tracing Client/Query library

Hello all,

Apologies in advance, as I am not sure whether this repo is the proper place to start a proposal; please feel free to redirect me to the right channel.

I'd like to ask if the OpenTelemetry specification is considering creating an "agnostic" backend library to consume traces stored in any tracing platform (Jaeger, Tempo, Zipkin).

Today, applications can use the OpenTelemetry SDK to ingest traces and spans in a common format. Projects like Kiali (https://kiali.io/) consume metrics, traces, logs, and configuration, correlating and combining them to offer users added value in the Service Mesh domain.

One of the requests from users is the ability to change the tracing platform (i.e. kiali/kiali#4278), but unfortunately, querying traces from a specific platform requires a technical dependency on that platform (in the Kiali case, Jaeger).

There are proto definitions for a gRPC service, but even with those it's not obvious how to switch from one platform to another (or at least I'm not aware of an effort in the open-telemetry group around this).

So, one possibility is to wait for the specific implementations (Jaeger, Zipkin, Tempo, others) to implement a common API for backend queries, but for backend applications it would be nice to have a single dependency (i.e. opentelemetry-backend) that enables querying one backend or another in a common format.
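
For illustration, here is a minimal Go sketch of what such a backend-agnostic query dependency could look like. The interface name, methods, and types are hypothetical, not an existing OpenTelemetry API:

package tracequery

import "context"

// Trace and TraceQuery are placeholders; a real proposal would define a
// vendor-neutral data model (likely based on the OTLP trace protos).
type Trace struct{ /* spans, resource, etc. */ }

type TraceQuery struct {
	ServiceName string
	Limit       int
}

// TraceQueryClient is a hypothetical backend-agnostic query interface.
// Concrete implementations would wrap Jaeger, Tempo, Zipkin, and so on,
// so consumers like Kiali depend only on this interface.
type TraceQueryClient interface {
	GetTrace(ctx context.Context, traceID string) (*Trace, error)
	FindTraces(ctx context.Context, q TraceQuery) ([]*Trace, error)
	GetServices(ctx context.Context) ([]string, error)
}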

In the Kiali project this is something we'd like to offer to users, and we'd like to explore the possibility of creating a proxy client [2] that abstracts the details of specific implementations (Jaeger, Tempo, others) away from the consumer of the traces/spans, for an easy change from one platform to another.

We think this effort could be interesting for the OpenTelemetry community, and it would be nice if it were fostered under the OpenTelemetry umbrella, allowing users to participate more actively.

My goal here would be to get feedback about:

  • Is this idea interesting for the OpenTelemetry community?
  • If a similar project has already started, we would rather join forces and collaborate with it than start anything new.
  • If not, we could volunteer to start a PoC, in a similar fashion to [2], but aimed at a wider audience rather than only the specific requirements of the Kiali project.

Any feedback would be welcome.

Thank you!
Lucas

[2] https://github.com/lucasponce/jaeger-proto-client

Proposal: Establish consistent guidelines for organizing vendor specific contributions to the OpenTelemetry Collector

Today, some vendor contributed components are being accepted into the OpenTelemetry Collector repositories. Other vendor contributions are being rejected. There are no consistent rules for merging vendor specific code in the OpenTelemetry Collector repositories.

I propose that all vendor specific components for the Collector reside in the “github.com/open-telemetry/opentelemetry-collector-contrib” repository. This will facilitate static linking of all Go components needed by the user.
The alternative is that vendor components reside in vendor-hosted repositories which will lead to fragmentation of the Collector.

In particular, I request that the AWS authentication contribution supporting AWS SIGv4 for Collector based exporters reside in the github.com/open-telemetry/opentelemetry-collector-contrib repository. This would provide users with an easy-to-access, single source for all OpenTelemetry Collector components. In addition, it enables users to easily build exporters with statically linked SIGv4 support.

AWS will continue to maintain its component and create PRs against the collector-contrib repo for all revisions. We will work toward achieving collector-contrib maintainer status and then be able to manage code reviews and handle merging PRs appropriately. In the meantime, we will continue to submit PRs to be reviewed and merged into the collector-contrib repo by the current maintainers.

cc: Collector maintainers @bogdandrutu @tigrannajaryan

Proposal: clarify behavior when retrieving non-existent currently active span

For languages that provide an implicitly propagated Context, the API should provide a way to retrieve the currently active span.
See https://github.com/open-telemetry/opentelemetry-specification/blob/2cfad37daf7e0d20851fd8a639a55375c3fc93dd/specification/trace/api.md#context-interaction

However, I am seeing a divergence in behaviour between SDKs, which I believe we should be coherent about.
If there is no current active span, most SDKs will return an invalid/noop span, while others will return undefined.

I believe this is a pretty big difference between those SDKs, as depending on the language being used, folks may get errors if they get a context which unexpectedly doesn't have any span.
Or they may be losing data if they get an invalid span and don't check for it.

My proposal is therefore the following (a short illustration follows the list):

  • SDKs that provide a way to retrieve the current span MUST return an invalid or noop span if none were set in the context.
  • SDKs MAY log a debug message if an invalid/noop span was returned.
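
For comparison, this is roughly how the Go API behaves today; a minimal sketch, noting that other SDKs differ, which is exactly the divergence this proposal wants to resolve:

package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/trace"
)

func main() {
	ctx := context.Background()

	// No span has been set on this context. The Go implementation returns
	// a no-op span rather than nil/undefined.
	span := trace.SpanFromContext(ctx)

	if !span.SpanContext().IsValid() {
		fmt.Println("no active span: received an invalid/no-op span")
	}
}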

Proposal: Resource Scope and Namespace API

Resource Scope and Namespace API

Resources

Resources are a term for properties describing things like the environment, process name and identifiers, and shard numbers: properties that are usually statically known in a library of code.

In current terms, we call such properties "attributes" on a span or span event, "labels" on a metric, and "correlations" in the distributed context. Resources are the implicit properties that become span attributes and metric labels when and where they are used. This is a generally accepted idea for process-wide properties, which may be initialized with the SDK and are not an absolute requirement in the API.

Resources are not included in the current OpenTelemetry APIs (they were removed in the v0.2 release).

Namespaces

Namespace refers to a qualifier on names used in the OpenTelemetry API. While both spans and metric instruments are named entities, there is concern that unrelated code may use the same name, therefore namespaces are introduced. Names are only considered identical when their namespaces match. Namespace is a property of the exported span and metric instrument.

The metric API recommends that Meter implementations SHOULD generate errors when metric instruments are registered with the same name and different kind. For this to work reliably, the API should support a namespace. Most existing metric APIs include a namespace concept, so this is probably a requirement for OpenTelemetry v1.0.

Status of this issue

This issue is filed alongside #68 and #73, as raising a complex issue for discussion and consideration. There is also a relationship between these issues, discussed below. This is not meant as a proposal to be incorporated into the v1.0 OpenTelemetry API.

Detailed discussion

Across the OpenTelemetry APIs, every method is defined in association with a "current" distributed context. This proposal introduces the notion of a "current" static context. These contexts are practically identical, only they are used differently.

Distributed context is passed from call to call dynamically. Static context is organized by units of code. We sometimes refer to the unit of code as "libraries", "components", or "modules". Concretely, this proposal refers to this concept as a Resource Scope.

The "current" Resource Scope determines the following properties of the static OpenTelemetry context:

  1. Tracer SDK: an implementation of the trace.Tracer API
  2. Meter SDK: an implementation of the metric.Meter API
  3. Namespace: the namespace of any new Span or Metric instrument
  4. Resources: implicit properties associated with Metric events.

To understand why the current Scope's resources only apply to Metric events, it is important to recognize that Spans are scopes of their own. Spans start with a set of attributes that serve, in this proposal, as a new Resource Scope. Span events happen in their own scope, whereas metric events happen in the context of another scope.

Scope type

The Scope type supports accessors that return the Tracer and Meter API for use. In this proposal, all Tracer and Meter API functionality is accessed via a Scope, the contract being that when these API functions are called, the Scope's namespace and resources take effect.
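
A rough Go sketch of the Scope surface described above, with names taken from the prototype (open-telemetry/opentelemetry-go#427); the import paths follow the v0.2-era API and the exact shape is illustrative, not a final design:

package scope

import (
	"context"

	"go.opentelemetry.io/otel/api/core"
	"go.opentelemetry.io/otel/api/metric"
	"go.opentelemetry.io/otel/api/trace"
)

// Scope bundles the static context: a Tracer, a Meter, a namespace, and resources.
type Scope interface {
	// Accessors for the APIs bound to this scope.
	Tracer() trace.Tracer
	Meter() metric.Meter

	// Derived scopes with an added namespace or additional resources.
	WithNamespace(name string) Scope
	WithResources(kv ...core.KeyValue) Scope

	// Install this scope as the "current" scope in a context.
	InContext(ctx context.Context) context.Context
}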

Tracer relationship

The Tracer API supports only a Start method. When starting a span, the resources associated with the Tracer's Scope are included in the span's attributes. The Scope's namespace is associated with the new span.

The Span interface returned by Start is considered a scope of its own. When the Span interface is used, those events do not implicitly take on static properties from the current Resource Scope.

Meter relationship

The Meter API supports New constructors for each kind of instrument. When creating metric instruments, the namespace associated with the Meter's Scope is used to disambiguate the new instrument's name.

The Meter API supports a RecordBatch function that reports multiple metric events. When recording a batch of measurements, the resources associated with the Meter's Scope are included in the metric event's labels.

Metric instruments (and bound instruments) are not considered scopes the way spans are; metric events take on static properties from the current Resource Scope when they are used.

Global Scope provider

This proposal replaces the independent global Tracer and Meter singletons with a single global Scope. The global scope will be used as a default whenever there is no "current" scope otherwise defined.

The global Scope is used as the default "current" Resource Scope, allowing process-wide resources to be set through the API. This proposal recommends #74, i.e., that the global scope only be initialized once.

Changes to existing APIs

The new functionality is nearly independent of existing Trace, Metric, and Context Propagation APIs. The Tracer API is unchanged in this proposal. Context propagation is completely independent of static resource scope.

The Meter API is simplified by this proposal. The metric LabelSet API moves into the Resource Scope and disappears from the metric API. In each of the metric calling conventions, where the former call accepted a LabelSet, the replacement in this proposal takes a list of additional labels (called "call site" labels). The current resource scope is combined with the call-site labels to generate a call-site resource scope.
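
A small sketch of the change in calling convention; the signatures approximate the v0.2-era Go API that the prototype targets and are not meant to be exact:

package example

import (
	"context"

	"go.opentelemetry.io/otel/api/key"
	"go.opentelemetry.io/otel/api/metric"
)

// Before: the caller builds an explicit LabelSet (v0.2-era convention).
func recordRequestToday(ctx context.Context, meter metric.Meter, counter metric.Int64Counter) {
	labels := meter.Labels(key.String("shard", "a"))
	counter.Add(ctx, 1, labels)
}

// After (this proposal): call-site labels are passed directly and combined
// with the resources of the current Resource Scope.
func recordRequestProposed(ctx context.Context, counter metric.Int64Counter) {
	counter.Add(ctx, 1, key.String("shard", "a"))
}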

Benefits of using Scopes

Using the Scope API as proposed here ensures that it is easy for developers to coordinate their span attributes with their metric labels. Whereas before the developer was responsible for computing metric label sets, these variables can now be placed into the current resource scope and used implicitly.

Metrics that happen in the context of a Span will automatically include the span's attributes in their LabelSet, by setting the Span as the Resource Scope. This addresses a topic raised in open-telemetry/opentelemetry-specification#381 about making the Tracer and Meter APIs more "aware" of each other.

If OpenTelemetry adds a logging interface, the current resource scope would implicitly apply to log events.

Relationship with Context Propagation

Context propagation happens independently of the current resource scope. In the context of #66, the Scope type here determines the current set of Propagators. The global Scope determines the global propagators.

Relationship with Span

It is unclear what the relationship between Scope and Span should be. Should the Scope's Span have been started by the Scope's Tracer? Must it have been? There are subtle implications. What resource scope should the Span's Tracer() method return?

Followed to its logical conclusion, the proposal here would replace the Span.Tracer() accessor by a Scope() accessor, and the former functionality would be accessed through Span.Scope().Tracer(). This would make it easy to explicitly switch into a Span's scope.

Relationship with "Named" Tracers and Meters

The "Named Tracer" and "Named Meter" proposal #76 overlaps with this topic because it sounds appealingly similar to a namespace, and it is also an implied resource (i.e., the reporting library).

This proposal does not directly address the topic of that proposal, however. The question behind "Named" Tracers and Meters is whether the developer is obligated to provide a name when obtaining a new Tracer or Meter. The same can be done in this proposal by preventing the construction of a Scope without providing a name.

Prototype code

This proposal has been prototyped in the Golang repository. See open-telemetry/opentelemetry-go#427. The new SDK initialization sequence looks like:

// initTelemetry initializes the global OpenTelemetry SDK and returns
// a function to shut it down before exiting the process.
func initTelemetry() func() {
	tracer := initTracer()
	meter := initMeter()
	global.SetScope(
		scope.WithTracerSDK(tracer.Get()).
			WithMeterSDK(meter.Get()).
			WithNamespace("example").
			AddResources(
				key.String("process1", "value1"),
				key.String("process2", "value2"),
			),
	)
	return func() {
		tracer.Stop()
		meter.Stop()
	}
}

To inject resources into a module of instrumented code, for example:

   // All instrumentation from this client is tagged by "shard=...".
   ctx := global.Scope().WithResources(key.String("shard", ...)).InContext(context.Background())
   someClient := somepackage.NewClient(ctx)

To attach a namespace to a group of metric instruments:

func NewClient(ctx context.Context) *Client {
   // All metrics used in this code are namespaced by "somepackage".
   scope := scope.Current(ctx).WithNamespace("somepackage")
   meter := scope.Meter()
   client := &Client{
      instrument1: meter.NewCounter("instrument1"),
      instrument2: meter.NewCounter("instrument2"),
   }
   return client
}

Points of interest:

  1. See the above example/basic/main.go for more context
  2. An example of four ways to generate equivalent metric events with different uses of Scope
  3. The new Scope type
  4. The new current Scope machinery
  5. The label.Set type is now concrete

profiles/follow up: consistent time format

This is a follow up for #239 (comment) around the request for a consistent time precision:

In ProfileContainer there are start_time_unix_nano and end_time_unix_nano. Should we have the same precision with timestamps in Sample and also use ns instead of ms?

// Timestamps associated with Sample represented in ms. These timestamps are expected
// to fall within the Profile's time range. [optional]
repeated uint64 timestamps = 13;

With Profile.time_nanos there is another timestamp in the message that uses nanosecond precision.

Proposal: Remote Sampling

Introduction

Remote Sampling is a sampling technique in which the sampling configuration for a path/request can be applied in a distributed way. For example, if there are 5 hosts in the fleet and the user has set a sampling reservoir of 5 req/sec and a fixed rate of 5%, then remote sampling helps distribute sampling configs to all the hosts so that, ultimately, the SDKs sample 5 requests/sec plus an additional 5% of requests after that.

This document discusses X-Ray Centralized Sampling and how to implement it using the Sampling interface in OTel SDKs.

The proposal here would be to standardize the remote sampling data model/protocol required by the SDK-side implementation. Obviously, to implement such a sampling approach, the backend would also have to compute quota and feed data into the SDK for the sampling decision.
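
A minimal Go sketch of the SDK side, using the current sdktrace.Sampler interface; the quota-polling protocol itself is exactly what this OTEP proposes to standardize and is not shown:

package remotesampling

import (
	"sync/atomic"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// remoteSampler delegates to a sampler (e.g. reservoir + fixed rate, as
// computed by a backend such as X-Ray) that is swapped in whenever new
// quota arrives from the backend.
type remoteSampler struct {
	delegate atomic.Value // holds an sdktrace.Sampler
}

func (r *remoteSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	return r.delegate.Load().(sdktrace.Sampler).ShouldSample(p)
}

func (r *remoteSampler) Description() string { return "RemoteSampler" }

// updateFromBackend is called by a background poller with a freshly
// computed sampler, e.g. a reservoir of N req/s plus a fixed-rate tail.
func (r *remoteSampler) updateFromBackend(s sdktrace.Sampler) {
	r.delegate.Store(s)
}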

Simplify RFC process to better leverage GitHub

Currently, the process for submitting an RFC goes through several stages: proposed -> approved -> implemented. While this is logical, it means that the actual RFC review cannot occur during a PR, as the PRs are only meant to address state changes in the RFC.

However - in practice - when an RFC is proposed, all of the meaningful discussion currently occurs in the initial PR. This has turned out to work well. Github provides a lot of tooling in the PR process to support threaded discussion, line-by-line commenting, notifications, and automatic references to other PRs and issues.

I suggest we leverage Github PRs for our approval process:

  • An RFC is proposed by posting it as a PR.
  • An RFC is approved when four reviewers github-approve the PR.
  • If approved, the RFC is updated to use the PR ID as the RFC ID, and then merged.
  • If an RFC is rejected or withdrawn, the PR is closed. Note that these RFC submissions are still recorded, as Github retains both the discussion and the proposal, even if the branch is later deleted.
  • If an RFC discussion becomes long, and the RFC then goes through a major revision, the next version of the RFC can be posted as a new PR, which references the old PR. The old PR is then closed. This makes RFC review easier to follow and participate in.

Proposal: Ability to associate tracer by alias with exporter/appender/destination

Today's API

Currently the API is TracerProvider.GetTracer(library_name, library_version). This API implies that one tracer provider is:

  • single-tenant
  • single-destination
  • single-exporter

This is implied by the statement that the instrumentation library must be supplied to GetTracer, and that the instrumentation library is NOT the instrumented library or module.

Quote: [screenshot of the specification's definition of "instrumentation library"]

library_name and library_version parameters and real-world multitenant applications

When a customer manually instruments their complex enterprise-grade application with the OpenTelemetry SDK and no existing instrumentation library, it would be beneficial to pass the module name rather than the instrumentation library as the first parameter to the API. Different components of the app should each be able to obtain their own tracer, associated with their own instrumentationName - that is, the name of the component being instrumented.

Proposal

It would be great to adjust the spec to allow for an implementation-dependent definition of instrumentationName, i.e. the first parameter passed to GetTracer. For example, GetTracer(name), where name remains the instrumentationName, but its semantics could be either the instrumentation library name or the instrumented module name.

    auto metaTracerProvider = GetConfigurableTracerProvider("configuration.file");
    auto tracer1 = metaTracerProvider.GetTracer("com.acme.Module1");
    auto tracer2 = metaTracerProvider.GetTracer("com.acme.Module2");

Further, having this implemented, we can provision a separate configuration piece that allows rewiring a module, by name, to a given tracer and/or instrumentation outside of code:

  • named tracer1 ("com.acme.Module1") -> wired to Instrumentation A / Exporter A1 / Tenant1
  • named tracer2 ("com.acme.Module2") -> wired to Instrumentation Library A / Exporter A2 / Tenant2

This approach would enable us:

  • to follow the best external config-provisioning practices established by Apache log4j and log4cxx

  • to allow a single TracerProvider to enable different configurable destinations: the same provider could, by name, return different instrumentation libraries and/or multiple exporters of the same library_name/version chained together. For example, provisioning an instance of an instrumentation library with different exporter arguments, to send data to a different tenant in the cloud storage.

  • to provide the ability to specify configurable destinations (outside of the OTEL spec) by alias or by name, where tracers and loggers can be wired to different exporters, and where the configuration itself may also supply additional details, such as an Instrumentation Key, Authorization Key, or Storage Destination for a given Named Tracer or Named Logger.

Conceptually the same should apply to the Logger Provider too. In other words, TracerProvider.GetTracer(name) or LoggerProvider.GetLogger(name) allows rewiring to the same or a different exporter dynamically at runtime, supplying additional configuration details (implementation dependent) outside of code. That way developers do not need to re-instrument (they can keep the same module tracer/logger name) if the ingestion/authorization key changes, if the data ingestion destination URL changes, or even if they decide to move to a different exporter / cloud provider; they would only need to provision a different configuration for the given tracer or logger name.

One can define the config mapping like this, with exporter and options provisioned similarly to how it's done in log4j:

Tracer (Logger) Name | Exporter | Exporter Options
console              | Stream   |
Module1              | OTLP     | host:port
Module2              | AzMon    | iKey=x,maxBatchSize=2MB,etc.

That way it's easy to rewire Module1 to any exporter, with whatever externally provisioned options. One may also chain, or specify, multiple exporters for a given tracer name - for example, allowing the same GetTracer("Module1") to return an instance that routes incoming traces, events, and logs to more than one exporter.
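
A rough Go sketch of the "meta" provider idea; loading and watching the external config is omitted, and this does not aim to be a complete TracerProvider implementation:

package metaprovider

import "go.opentelemetry.io/otel/trace"

// metaTracerProvider routes GetTracer calls by name/alias to concretely
// configured providers (exporter, tenant, options), as provisioned in an
// external config file similar to the table above.
type metaTracerProvider struct {
	byName   map[string]trace.TracerProvider // e.g. "Module1" -> OTLP-backed provider
	fallback trace.TracerProvider
}

func (m *metaTracerProvider) Tracer(name string, opts ...trace.TracerOption) trace.Tracer {
	if p, ok := m.byName[name]; ok {
		return p.Tracer(name, opts...)
	}
	return m.fallback.Tracer(name, opts...)
}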

For the Module2 example: if the authorization key (iKey) used by Module2 changes, customers do not need to re-instrument their logic for acquiring the logger; they'll adjust the corresponding config. Or if the customer wants to migrate from one channel / provider to another, i.e. if hosted on different clouds, they would not need to recompile their application. The name alias binds their tracer to a concrete library/version via the config file rather than by explicitly providing the library name / version via an API call.

Short Summary

In addition to GetTracer(library, version), introduce GetTracer(name), which allows for dynamic routing of a named tracer or logger, i.e. allows the name parameter to be anything: it could be library_name, or something else such as an alias or unique identifier. That way a single TracerProvider (or meta provider) can be used to aggregate multiple exporters, to dynamically route a single named tracer to /dev/null, or to provide additional attributes outside of code (in a config file) that route events emitted through a named tracer to a given destination.

What it gives - recap

Manageability. Once the log or trace statements have been inserted into the code (for named loggers and tracers), they can be controlled with configuration files without re-instrumenting the code. Loggers and Tracers can be selectively enabled or disabled, and sent / rewired to different (and multiple) output targets, i.e. different or multiple exporters, in user-chosen formats. Developers operate on named tracers and loggers; configuration opaquely covers the wiring from instrumentationName => concrete tracer + exporter + tracer configuration. One can also develop a meta-TracerProvider that aggregates different kinds of underlying TracerProviders, which may subsequently retain the old semantics (library_name, library_version). In that flow, TenantModuleA maps to library_X, version_Y, configOptions..; the developer calls GetTracer(TenantModuleA) and a single TracerProvider factory maps the acquisition of a tracer to the corresponding library and version, with the given configuration / authentication settings.

Labels for closed + unmerged PRs

The OTEP process defines the states rejected and deferred, which lead to closing a PR without it being merged. It would be great if we had labels and assigned them to closed PRs to see which state they are in.
Additionally, I guess that OTEP 0008 is in a WIP state (to be reopened after being worked on by the proposer), so labeling that would be good too.

Proposal: OpenTelemetry Sandbox

Over the last months, I have seen a few situations where people have come to our community proposing interesting ideas to be adopted. I have also seen vendors offering code donations to the project, some of which are now mostly unmaintained.

As a possible solution to this, I would like to propose a new GitHub organization, opentelemetry-sandbox. This organization would host projects until we are confident they have a healthy community behind them. They would also serve as a neutral place for the community to conduct experiments.

The advantage of a sandbox organization is that we can still have governance rules there, making sure it’s an inclusive place for people to collaborate while keeping the reputation of the OpenTelemetry project as a whole untouched, given that it would be clear that OpenTelemetry doesn’t officially support projects within the sandbox.

There is a desire, but not an expectation, that projects will be moved from the sandbox as an official SIG or incorporated into an existing SIG. There’s also no expectation that the OpenTelemetry project will provide resources to the sandbox project, like extra GitHub CI minutes or Zoom meeting rooms, although we might evaluate individual requests.

This OTEP is inspired by CNCF’s sandbox projects, but the process is significantly different.

Examples

Here are a few projects that I see as suitable for the sandbox:

  1. We have previously discussed having or fostering experiments with LLMs related to observability. The sandbox will be the perfect place for this without risking reputational damage to the project if the outcomes aren’t on par with the expectations.
  2. There are a couple of code donation proposals in place that could have been accepted as part of the sandbox, such as:
  3. During a previous Outreachy internship, a command-line interface tool was developed to assist in the bootstrapping of OpenTelemetry Collector components. It was primarily developed in the intern’s GitHub account, with little community visibility and involvement.
  4. I have a few custom distributions for the OpenTelemetry Collector, such as the “sidecar”, that are currently hosted on my employer’s organization. Given that they are not tied to my employer’s backends, they would probably benefit a broader range of users by being available in the sandbox.

Acceptance criteria

A low barrier to entry would be desired for the sandbox. While the process can be refined based on our experience, my initial proposal for the process is the following:

  1. Proposals should be written following the template below and have one Technical Committee (TC) and/or Governance Committee (GC) sponsor, who will regularly provide the TC and GC information about the state of the project.
  2. Once a sponsor is found, the TC and GC will vote on accepting this new project on the Slack channel #opentelemetry-gc-tc.
    1. After one week, the voting closes automatically, with the proposal being accepted if it has received at least one 👍 (that of the sponsor, presumably).
    2. If at least one 👎 is given, or a TC/GC member has restrictions about the project but hasn’t given a 👎 , the voting continues until a majority is reached or the restrictions are cleared.
    3. The voting closes automatically once a simple majority of the TC/GC electorate has chosen one side.
  3. Proponents should abide by OpenTelemetry’s Code of Conduct (currently the same as CNCF’s).
  4. There’s no expectation that small sandbox projects will have regular calls, but there is an expectation that all decisions will be made in public and transparently.
  5. Sandbox projects do NOT have the right to feature OpenTelemetry’s name on their websites.

Template

Project name:

Repository name:

Motivation:

Zoom room requested?

Example

Project name: OpenTelemetry Collector Community Distributions

Repository name: opentelemetry-collector-distributions

Motivation: The OpenTelemetry Collector Builder allows people to create their own distributions, and while the OpenTelemetry Collector project has no intentions (yet) on hosting other more specialized distributions, some community members are interested in providing those distributions, along with best practices on building and managing such distributions, especially around the CI/CD requirements.

Zoom room requested? No

Further details

  • A new GitHub user group will be created with the current members of the TC and GC as members. This group shall be the admin for all repositories in the organization.
  • Project proponents are added as maintainers and encouraged to recruit other maintainers from the community.
  • Code hosted under this organization is owned by the OpenTelemetry project and is under the governance of OTel’s Governance Committee.

Enable misspell go tool

Please enable the tool you enabled on the specification repo here as well. Can you please use CircleCI for that?

Proposal: Dynamic configuration of metrics

This OTEP is to add support for parsing request information to generate metrics.
Some business-related metrics need to be generated by parsing the request information (e.g., request body, header, response body).

For example, counting the number of active users per minute.
The configuration could be as follows.

{
	"service": "user-center",
	"config": [{
		"name": "request",
		"key": "user",
		"interval": "60s",
		"type": "counter"
	}]
}

Provide a method for end users to pass in request information.
Parse the request information, record the number of times the user field appears, and report it once per minute.

I want this configuration to be stored on a remote server, so that the client can listen for changes in the configuration to turn the reporting of metrics on or off.
The end user can enable data collection for a metric at any time for the information in the request.
This could be the number of successful or failed requests, the number of requests with a delay greater than 3ms, or the number of requests with the key equal to user.
If the open-telemetry group is interested, I can provide a preliminary design and implementation.

Acceptance PRs for proposed OTEPs

In https://github.com/open-telemetry/oteps#submitting-a-new-rfc:

  1. A new "work-in-progress" (WIP) pull request will be created that updates the RFC's status to approved
    • TODO: This should probably be automated

Currently there are several OTEPs merged as proposed but without any follow-up PR:

Ideally we should also have links to any follow-up PRs in the OTEP (since there's no GitHub feature for that, it is hard to find them).

CC @jmacd

Proposal: Span Context == Span

This proposal stands as a counterpoint to #68. Both of these proposals are aimed at clarifying what it means to have a "Span object" after OTEP #66 is accepted.

The "Span object" concept does not really exist in the API as it is specified today, although the language could be improved. This proposal states that all we should do is improve the specification language for the "Span interface".

The two reasons given in #68:

When Context API is moved into an independent layer below all other layers, the way extractors might work is like this: ctx = extractor.extract(ctx, request). Because extract() cannot start a new span, it must store span context in the returned ctx. With this proposal, it will always keep the span context only in the context, never the Span.

Extract is defined as returning a context. Extract's job is not to start a span, but the language here is correct. Spans are not created, they are started. Creating a span implies there is a new object. Starting a span implies there is an event. If an SDK decides to keep some sort of span-like object in the context, it may do so, but it would be the result of StartSpan, not Extract. The job of specifying this API is not to dictate how the SDK works; there has never been a requirement that spans be implemented as objects. The early "Streaming" SDK developed in the Go repo demonstrates this: here a span is essentially just a context with the addition of a process-wide sequence number to order span events.

Not giving users references to Span objects simplifies memory management. In OpenTracing it was pretty difficult to pool span objects because user can keep a reference to the span even after calling span.finish(). The tracer can keep a buffer of active spans indexed by immutable span contexts that are kept by the user code. When span is finished the tracer can reuse the span object for another trace. if a later call from user code comes with span context not in the table, trace can either ignore the call, or capture that data by other means (if the backend supports merging of spans).

The semantics of the OpenTelemetry API already permit an SDK to simplify memory management as this implies, as demonstrated in the Go streaming SDK (link above).

I would like to revise the specification language to make it very clear that the Span returned by StartSpan() is an interface value that is logically equivalent to the span context. Operations on Span values semantically report events, associated with that span context, that the SDK will implement accordingly. An operation like CurrentSpan() is logically just Span{CurrentContext()}.
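
A self-contained Go sketch of this reading of the spec, with hypothetical types that are not the real API, just to make the "Span is a span context reference" idea concrete:

package sketch

import "context"

// SpanContext is whatever identifies a span (trace ID, span ID, flags).
type SpanContext struct {
	TraceID [16]byte
	SpanID  [8]byte
}

// Span is nothing more than a handle on a SpanContext; operations on it
// report events that the SDK interprets as it sees fit.
type Span struct {
	sc SpanContext
}

type spanContextKey struct{}

func spanContextFromContext(ctx context.Context) SpanContext {
	if sc, ok := ctx.Value(spanContextKey{}).(SpanContext); ok {
		return sc
	}
	return SpanContext{} // invalid: behaves as a no-op span
}

// CurrentSpan is logically just Span{CurrentContext()}.
func CurrentSpan(ctx context.Context) Span {
	return Span{sc: spanContextFromContext(ctx)}
}

// AddEvent reports an event associated with the span context; a streaming
// SDK can forward it without holding any span object in memory.
func (s Span) AddEvent(name string) {
	reportSpanEvent(s.sc, name) // hypothetical SDK hook
}

func reportSpanEvent(sc SpanContext, name string) {
	// A real SDK would export this event; a no-op suffices for the sketch.
	_, _ = sc, name
}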

The specification will have to be adjusted to clarify this matter. We should stop describing starting a span as "Creation". Most of the specification already refers to "Span interface", but the leading definition is incorrect:

Span is a mutable object storing information about the current operation execution.

This is just not true. We've decoupled the API from the SDK and there is certainly not a requirement that the Span be mutable or an object. Span is better described as a "span context reference", and operations on Span interfaces as creating "span events".

After this revision to the specification language, I believe all of @yurishkuro's concerns will actually be addressed, and there is no need to change any existing APIs.

Detail: Finishing a span

This may raise some questions, for example, what does it mean to "Finish" a Span?

Finishing a span is just another Span event. What if the span was already finished? Then the SDK will have to deal with a duplicate span Finish event.

If the SDK maintains a span-like object in the Context as an optimization (although it risks memory management issues), perhaps it will recognize immediately that the subsequent Finish event was a duplicate--it can record a warning, but this is not a bug (it could be a race condition).

If the SDK does not maintain a span-like object in the Context, as in the streaming SDK discussed above, then it may actually not have any record of the span anywhere. This will happen naturally if the user forgets to finish spans and the SDK decides to purge unfinished spans from memory. This is not a bug, this is a duplicate Finish.

What happens to the context after the Span is finished? The context is unchanged, so it's possible for the user to continue operating after the span is finished. This is, again, not a bug, and some SDKs will be able to incorporate these events. For example, in a stateless SDK the process itself will not know whether the Span was already finished; it will simply record an event for the downstream system to parse. The downstream system in this case may record a warning, but I wouldn't call this a bug. As discussed in comments for #66, this makes it semantically meaningful to record span events both before a span is started and after it is finished. In an SDK specification, I would say that SDKs are not required to handle such events, but they are still semantically meaningful.

profiles/follow up: location references in sample

This is a follow up for #239 (comment) around message Sample and its use of location_index, locations_start_index and locations_length:

message Sample {
// The indices recorded here correspond to locations in Profile.location.
// The leaf is at location_index[0]. [deprecated, superseded by locations_start_index / locations_length]
repeated uint64 location_index = 1;
// locations_start_index along with locations_length refers to a slice of locations in Profile.location.
// Supersedes location_index.
uint64 locations_start_index = 7;
// locations_length along with locations_start_index refers to a slice of locations in Profile.location.
// Supersedes location_index.
uint64 locations_length = 8;

As an example, consider the following stack in a folded format:

foo;bar;baz 100
abc;def 200
foo;bar 300
abc;ghi 400
foo;bar;qux 500

Like in most stack traces, the base frames are similar, but there is a variation in the leaf frames. To reflect this, the last two traces use different leaf frames, ghi and qux.

Should the resulting sample look like the following?

sample:
  - locations_start_index: 0
    locations_length: 3
    value:
      - 100
  - locations_start_index: 3
    locations_length: 2
    value:
      - 200
  - locations_start_index: 0
    locations_length: 2
    value:
      - 300
  - locations_start_index: 5
    locations_length: 2
    value:
      - 400
  - locations_start_index: 7
    locations_length: 3
    value:
      - 500
location_indices:
  - 0 # foo
  - 1 # bar
  - 2 # baz
  - 3 # abc
  - 4 # def
  - 3 # abc 
  - 5 # ghi
  - 0 # foo
  - 1 # bar
  - 6 # qux
location:
  - line:
      - function_index: 0 # foo
  - line:
      - function_index: 1 # bar
  - line:
      - function_index: 2 # baz
  - line:
      - function_index: 3 # abc
  - line:
      - function_index: 4 # def
  - line:
      - function_index: 5 # ghi
  - line:
      - function_index: 6 # qux
function:
  - name: 1 # foo
  - name: 2 # bar
  - name: 3 # baz
  - name: 4 # abc
  - name: 5 # def
  - name: 6 # ghi
  - name: 7 # qux

In particular, for deep stack traces with a high number of similar frames where only the leaf frames differ, the use of locations_start_index and locations_length with location_indices gets more complex than the (deprecated) location_index, which just holds a list of IDs into the location table.

The original pprof message Sample also does not use the _start_index / _length approach. From my understanding, all messages of type Sample within the same Profile group stack traces from the same origin / with the same attributes.
For a different set of attributes, I think a dedicated Profile with its own attributes should be preferred.

An alternative, to allow sharing Mapping, Location, and Function information between stack traces with different attributes, would be to move these three tables one layer up into ProfileContainer, so that they can be referenced from each Profile.

Given that the variety of leaf frames is usually high and attributes are often more static, can we remove the deprecated label from location_index in message Sample and let the user set either location_index or locations_start_index with locations_length?

Add labels to entry level tasks for new contributors

Hi,

Based on open-telemetry/community#469, I have added open-telemetry/oteps to Up For Grabs:

https://up-for-grabs.net/#/filters?names=658

There are currently no issues with label help wanted. Please add this label to your entry level tasks, so people can find a way to contribute easily.

If "help wanted" is not the right label, let me know and I can change it (e.g. to "good first issue" or "up-for-grabs"), or you can provide a pull request by editing https://github.com/up-for-grabs/up-for-grabs.net/blob/gh-pages/_data/projects/opentelemetry-oteps.yml

Thanks!

Proposal: Reduce clock-skew issues in mobile and other client-side trace sources

I'm creating this ticket per discussion in the OpenTelemetry maintainers' meeting 05/10/2021

Clock-skew will always be a problem with distributed tracing, but the degree of skew that occurs on unmanaged devices (by 'unmanaged' I mean devices outside of the software provider's control) is untenable.

[Screenshot: spans from a synchronous request showing clock skew between a mobile device and a backend server]

This screenshot shows the degree of clock skew between a mobile device and a backend server while tracing a synchronous request. The mobile device is using an automatically sync'd system clock, but the degree of skew could be much, much worse, as the clock can be set at the whim of the mobile phone's owner (think days, months, years of skew).

I'd like to brainstorm some solutions to this problem.
Some possible solutions could be (a rough offset-correction sketch follows this list):

  • client side monitors should operate in offset times that can be later set relative to some time authority (collector?)
  • client side could sync to a non-system time authority
  • distributed traces could be processed to re-align spans based on their relationships (if an HTTP request is made to a backend service, the spans should probably overlap to some degree)
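
As a starting point for the first and third bullets, here is a small Go sketch of an NTP-style offset estimate that a collector (or any trusted time authority) could compute around a synchronous request and then apply to client-side span timestamps:

package clocksync

import "time"

// estimateOffset computes an NTP-style estimate of the client clock offset
// from four timestamps taken around a synchronous request:
//   t0: client sends request     (client clock)
//   t1: server receives request  (server clock)
//   t2: server sends response    (server clock)
//   t3: client receives response (client clock)
// offset ~= ((t1 - t0) + (t2 - t3)) / 2
func estimateOffset(t0, t1, t2, t3 time.Time) time.Duration {
	return (t1.Sub(t0) + t2.Sub(t3)) / 2
}

// adjust shifts a client-side timestamp into the authority's timeline.
func adjust(clientTime time.Time, offset time.Duration) time.Time {
	return clientTime.Add(offset)
}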

Proposal: Support ML Monitoring

Hello all,

Not sure if this is the right repo, but apologies in advance if not; I would be glad to be redirected to the right place. Also not sure if this has been discussed before elsewhere; it could be a long shot or completely inappropriate. :)
I would like to propose a way for Machine Learning (ML) monitoring to become a first-class citizen of the specification.
Machine Learning enabled applications deviate from the traditional cloud native ones, but they are being adopted heavily as part of the enterprise stack.
Monitoring an ML model is important for ML in production; essentially, an ML model cannot be operated without proper
visibility. ML monitoring is also part of MLOps practices.
It is very common to have a model served via a service and for that model to emit metrics, e.g. latency of
model scoring, performance-related metrics such as accuracy, or metrics related to concept drift.
Of course, this is only part of the story of the data-related observability domain. To be more specific, as part of the OTEL spec a new resource could be added to capture the concept of a model (similar to FaaS). Then specific metrics can be defined per ML model category; for an example, check here.
Adding such support also helps connect the metadata that exists in ML metadata stores directly to the deployed models and their emitted metrics. Tracing can also be enhanced to be ML-specific, e.g. ML operations over input.
I am sure existing concepts could be used to build something on top, but some key benefits of this are:
a) Establishing a model for emitted info that is understandable by data scientists and others involved in ML.
b) Helping with integration with different systems that create similar information by defining common ground.
c) Making OpenTelemetry easy to use in an important domain so that users don't have to re-invent concepts.

Any feedback would be welcome.

Thank you!
Stavros

Proposal: The OpenTelemetry Spec should allow SDKs to export all the spans regardless of their sampled flag

To minimize telemetry processing overhead in the hosting application, letting the OpenTelemetry sidecar handle as much telemetry data processing as possible is a very attractive solution - for example, gathering span metrics (statistics) and extracting other summary information from spans per trace (request), etc.

To enable that, the SDK needs the capability to export all spans to the agent regardless of their sampled flag, carrying along the sampled flag with the sampling decision made by the SDK's sampler. By doing this, the agent has the opportunity to see all the span data and to decide at the end whether to persist the spans in storage based on the carried sampled flag.

See related Spec issue filed: open-telemetry/opentelemetry-specification#2986
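
For reference, one way to approximate this today with the Go SDK is a sampler that records and exports everything while carrying the "real" decision as an attribute; a sketch only, noting that it also flips the propagated sampled flag, which is why first-class spec support is being requested (the attribute name below is illustrative):

package alwaysexport

import (
	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// recordAllSampler exports every span and records the wrapped sampler's
// decision as an attribute so an agent/collector can decide persistence.
type recordAllSampler struct {
	delegate sdktrace.Sampler
}

func (s recordAllSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	res := s.delegate.ShouldSample(p)
	sampled := res.Decision == sdktrace.RecordAndSample
	return sdktrace.SamplingResult{
		Decision:   sdktrace.RecordAndSample, // export everything
		Attributes: append(res.Attributes, attribute.Bool("sampling.decision.sampled", sampled)),
		Tracestate: res.Tracestate,
	}
}

func (s recordAllSampler) Description() string { return "RecordAllSampler" }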

Proposal: Add Sensitive Data Labels

The Pitch

The idea is rather simple: extend the current OTLP to include an additional dictionary that is intended to store sensitive data.

The Rationale

Making the use of sensitive data explicit instead of implicit means that data vendors and data processors (pre and post) can:

  • Ensure any data regulations are followed (Thinking GDPR, CCPA, FedRamp to a degree)
  • Allow for consumers of this data to handle the data correctly

As we collect more telemetry from users' experiences, the likelihood of including UGC and PII increases, and this should help the systems that collect it be clear about the actions they take on it.

The outcome

The idea I have in mind and would love further discussion on is how this is consumed by the client (the application actually sending the data).

The idea is that OTLP is extended to include another dictionary that is explicit in its sensitive nature, and that SDKs implement an AddSensitiveLabel method to make it clear what the data is.

From that the otel-collector could filter, drop, or transform those labels if they are known.

From there, the exported data vendors / processes can set up ingestion policies on what should happen when sensitive data arrives.
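
A hypothetical Go sketch of the SDK-facing surface; neither the dictionary nor AddSensitiveLabel exists in the current API or OTLP, and the names are purely illustrative:

package sensitive

import "go.opentelemetry.io/otel/attribute"

// SensitiveLabels mirrors the proposed extra OTLP dictionary: values placed
// here are explicitly marked as sensitive so collectors and exporters can
// filter, drop, or transform them by policy.
type SensitiveLabels struct {
	kvs []attribute.KeyValue
}

// AddSensitiveLabel records a value that is explicitly flagged as sensitive.
func (s *SensitiveLabels) AddSensitiveLabel(key, value string) {
	s.kvs = append(s.kvs, attribute.String(key, value))
}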

Add a "not implemented" stage to maturity levels

Let's take profiling as an example - if it becomes a new signal with a new protocol or data model, then it would be added to the specification first. Once it's in the specification, it would be added to the collector, thus initially being "not implemented".

In either case, not implemented should likely be an optional state - client libraries definitely have this today.

Originally posted by @flands in #232 (comment)

Proposal: specify how opentelemetry will deal with idle metrics no longer being reported

I've been testing the OpenTelemetry SDK (Python) to report counters and gauges to the OTEL Collector, and found that when using PeriodicExportingMetricReader to periodically report the collected measurements, even though no values are actually being reported from the observable gauges, the counters and gauges keep reporting the last reported values (probably within the reader itself?).

There are a couple of problems associated with this sort of behavior.

  • For the gauge type of data, continuing to report the last reported value with a new timestamp may create the impression that the gauge measurement is still active, even though no point is actually being reported from the source.
  • There is no way for users to explicitly access the reader storage and remove the entry during the lifecycle of the reader.
  • In a highly ephemeral environment, if the sources change often, this may create a large number of unnecessary telemetry time series that can pose performance and storage problems.

Therefore, we may need a clear definition of how long we keep idle metrics (those which have stopped reporting regularly from the sources) before they eventually disappear from the reader's metrics storage. As far as I'm aware, there is no mention of this concept yet in OpenTelemetry metrics.

A good example is how statsd lets users configure such behavior in its configuration:
https://github.com/statsd/statsd/blob/master/exampleConfig.js?MobileOptOut=1#L61
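
To make the requested behavior concrete, here is a rough Go sketch of the kind of idle-expiry rule being asked for; nothing here is an existing SDK feature, and the names are illustrative:

package idlemetrics

import "time"

// stream tracks the last time a value was actually recorded for a metric stream.
type stream struct {
	lastRecorded time.Time
	value        float64
}

type reader struct {
	idleTimeout time.Duration // user-configurable, e.g. 5 * time.Minute
	streams     map[string]*stream
}

// collect drops streams that have not recorded anything within idleTimeout,
// instead of re-exporting their last value with a fresh timestamp forever.
func (r *reader) collect(now time.Time) map[string]float64 {
	out := make(map[string]float64, len(r.streams))
	for name, s := range r.streams {
		if now.Sub(s.lastRecorded) > r.idleTimeout {
			delete(r.streams, name)
			continue
		}
		out[name] = s.value
	}
	return out
}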

I hope this issue ticket helps steer the development of OpenTelemetry in a better direction.

How to convert Java Flight Recorder (JFR) file to Profiling Data Model v2

I have a question about #239.

The Java Flight Recorder (JFR for short) binary format can contain multiple (over 100) types of events.
We can use the jfr tool (like the pprof command-line tool) to view the events in a JFR file.

jfr summary jfr.jfr

 Event Type                              Count  Size (bytes)
=============================================================
 jdk.ObjectAllocationOutsideTLAB         12239        232059
 jdk.ObjectAllocationInNewTLAB            1514         33952
 jdk.ExecutionSample                      1102         15880
 jdk.JavaMonitorWait                      1030         30284
...

In the above fragment, the jdk.ExecutionSample event type is the CPU sample; it contains 1102 events, and the interval between 2 consecutive events of one thread is 10 milliseconds. The fields for each sample are timestamp, thread, thread stack, and thread state.

The jdk.ObjectAllocationInNewTLAB event type is the allocation sample; it contains 1514 events, and the interval between 2 consecutive events of one thread is not fixed, because Java records this sample when a new TLAB (Thread Local Allocation Buffer) is created, and the TLAB size is adjusted ergonomically.

I wonder whether it is possible to convert a JFR file to a single ProfilesData file.

Implementation issues for accepted OTEPs are missing

In the OTEP process (README.md of this repository): https://github.com/open-telemetry/oteps#implementing-an-rfc

Once an RFC has been approved, a corresponding issue should be created and prioritized accordingly in the relevant repository.

It seems that this (IMHO quite useful) part of the process has not been followed until now.

I suggest creating such issues for the accepted OTEPs. Furthermore, I suggest adding a link to the implementation issue to each accepted OTEP (in the future, I propose we even require that for merging an OTEP as "accepted").

Proposal: Enable security vulnerability scans on OTel repos

Motivation
The OpenTelemetry code repos should have security vulnerability scanning enabled by default. This can be done with a GitHub Actions workflow where a freely available security scan tool, CodeQL, is triggered on a daily basis. Running such a scan would increase trust in the code quality of the project: developer trust, by providing more information about security gaps that need to be addressed (e.g. dependency updates that may need to be done), as well as customer trust in using OTel code in production.

Explanation
GitHub provides a CodeQL action workflow that can be enabled on any and all repos. See https://github.com/github/codeql-action. CodeQL automatically uploads the results to GitHub so they can be displayed in the repository's security tab. CodeQL runs an extensible set of queries (https://github.com/github/codeql), which have been developed by the community and the GitHub Security Lab (https://securitylab.github.com/) to find known vulnerabilities in your code.

Internal details
This proposal will not make blocking changes to any code, but instead will provide recommendations for how security of the code can be improved. The current development flow will not be affected as these will not be a part of the CI. These security scans will be run overnight daily as a GitHub workflow in order to consistently check for security vulnerabilities, and the results will be available under the “security” tab within each individual repo.
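For illustration, a nightly CodeQL workflow could look roughly like the sketch below; the file name, schedule, action versions, and language list are assumptions to be adjusted per repo.

# .github/workflows/codeql.yml (illustrative)
name: "CodeQL"

on:
  schedule:
    - cron: "0 2 * * *"   # run nightly

jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      security-events: write   # needed to upload results to the Security tab
    steps:
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: go         # e.g. for the Collector and Go SDK repos
      - uses: github/codeql-action/autobuild@v2
      - uses: github/codeql-action/analyze@v2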

Trade-offs and mitigations
There are no significant trade-offs with this proposal; it simply sheds light on security recommendations. Enabling this workflow is a win-win for developers and customers.

Prior art and alternatives
Other security scanners such as Veracode and SonarQube also exist; however, CodeQL is free and easy to set up as a GitHub workflow.

Future possibilities
More workflows for security scanning can be added as well. For example, we can add GoSec for the Go-based projects (i.e. Collector, Go SDK, Go-Contrib). If there are popular scanning tools for other languages, please feel free to add them to this thread.

cc: @amanbrar1999 @AzfaarQureshi @shovnik

Proposal: Add support for Elastic Common Schema (ECS) in OpenTelemetry

This OTEP proposes adding support for the Elastic Common Schema (ECS) to the OpenTelemetry specification and providing full interoperability for ECS in OpenTelemetry component implementations.

Adding the Elastic Common Schema (ECS) to OpenTelemetry (OTEL) is a great way to accelerate the integration of vendor-created logging with OTEL component logs (i.e. OTEL Collector log receivers). The goal is to define vendor-neutral semantic conventions for the most popular types of systems and to support vendor-created or open-source components (for example HTTP access logs, network logs, and system access/authentication logs), extending OTEL correlation to these new signals.
Covering ECS in OTEL would provide guidance to authors of OpenTelemetry Collector log receivers and help establish the OTEL Collector as a de facto standard log collector with a well-defined schema that allows for richer data definition.

Please see attached document for the full proposal.

Doc:
https://docs.google.com/document/d/1y63W66EyobrnCa9BNZjKzWfETyLMlhC5FiEJzGzaeWU/edit?usp=sharing

We look forward to comments and feedback from the OTEL community. Please join the initial review of this proposal in the Logs SIG meeting on Feb 23, 2022.

Thanks @cyrille-leclerc, Daniel Khan, Jonah Kowall, @kumoroku and others for collaborating on this initial proposal.

Proposal: remove Span

This is obviously controversial, but it would be nice if we removed the concept of the Span from the tracing API and replaced it with methods on the Tracer, such as:

tracer.SetSpanAttribute(ctx, key, value)
tracer.RecordSpanEvent(ctx, event)

There are two reasons for that:

  1. When the Context API is moved into an independent layer below all other layers, extractors might work like this: ctx = extractor.extract(ctx, request). Because extract() cannot start a new span, it must store the span context in the returned ctx. With this proposal, the span context is always kept only in the context, never in a Span object.
  2. Not giving users references to Span objects simplifies memory management. In OpenTracing it was quite difficult to pool span objects, because a user can keep a reference to a span even after calling span.finish(). Instead, the tracer can keep a buffer of active spans indexed by the immutable span contexts held by user code. When a span is finished, the tracer can reuse the span object for another trace. If a later call from user code arrives with a span context that is not in the table, the tracer can either ignore the call or capture that data by other means (if the backend supports merging of spans). A rough sketch of this approach is shown below.
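A minimal sketch of what such a span-less Tracer could look like in Go follows; every type and name here is hypothetical and does not correspond to an existing OpenTelemetry API.

package tracer

import (
	"context"
	"sync"
)

// SpanContext is an immutable, comparable span identifier (hypothetical).
type SpanContext struct {
	TraceID [16]byte
	SpanID  [8]byte
}

type contextKey struct{}

// SpanContextFromContext extracts the span context that an extractor or
// span-starting call stored in ctx.
func SpanContextFromContext(ctx context.Context) (SpanContext, bool) {
	sc, ok := ctx.Value(contextKey{}).(SpanContext)
	return sc, ok
}

// spanData is the tracer-owned, poolable storage for an active span.
type spanData struct {
	attributes map[string]interface{}
}

// Tracer keeps a buffer of active spans indexed by immutable span contexts,
// so user code never holds a reference to a Span object.
type Tracer struct {
	mu     sync.Mutex
	active map[SpanContext]*spanData
}

// SetSpanAttribute records an attribute on the span identified by ctx.
func (t *Tracer) SetSpanAttribute(ctx context.Context, key string, value interface{}) {
	sc, ok := SpanContextFromContext(ctx)
	if !ok {
		return
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if sd, found := t.active[sc]; found {
		sd.attributes[key] = value
		return
	}
	// The span has already finished (its storage may have been reused):
	// ignore the call, or capture the data by other means if the backend
	// supports merging of spans.
}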

Proposal: Non-core components like Exporters should live in contrib repos

Problem
As OpenTelemetry continues to grow, the number of PRs is expected to exceed the capacity for maintainers to provide timely code reviews. The maintainers are already spread thin across the language-core and language-contrib repos for their SDKs. This increases the backlog of pending code reviews. In addition, maintainers are typically focused on achieving feature completion and source stability and often de-prioritize review of non-core components.

Solution
We propose that non-core components, such as exporters, be moved into contrib repos so that maintainers for the core components are not overburdened and so that other developers can become maintainers for the non-core code. In addition to addressing maintainability, this solution helps make building and releasing artifacts easier by decoupling from core builds and schedules. For example, if we move the Prometheus exporters from the core sdk repos into the contrib repos they can be maintained by other developers without being impeded by core-sdk maintainer schedules and concerns. Furthermore, developers who do not have maintainer rights on the core repos can then be allowed to help maintain non-core components in the contrib repos. See table below for proposed relocations of Prometheus exporters from core to contrib repos.

Note: This issue has been discussed in the maintainer, language, and Collector SIG meetings and has been generally agreed to. The intent of this issue is to formally recognize this solution going forward.

Proposed relocations of Prometheus exporters in OpenTelemetry

| SDK | Prometheus Exporter Type | Current location | Proposed location |
|-----------|------|---------|---------|
| C++ | Pull | Core | Contrib |
| Python | Pull | Core | Contrib |
| JavaScript | Pull | Core | Contrib |
| Java | Pull | Core | Contrib |
| Go | Pull | Core | Contrib |
| Go | Push | Contrib | Contrib |
| Collector | Pull | Core | Contrib |
| Collector | Push | Core | Contrib |
| DotNet | Pull | Core | Contrib |
| Ruby | N/A | - | - |
| Erlang | N/A | - | - |
| Rust | Pull | Core | Contrib |
| PHP | Pull | Core | Contrib |

Proposal: Exemplars

This proposal defines exemplars within OpenTelemetry and specifies behaviour for exemplars for the default set of aggregations: Proposal doc

Proposal: Service renaming

OpenTelemetry has become the observability data industry standard for several platforms, especially backend services. While it probably was not foreseeable at the time of its creation, it has become so widely popular that it is now being used for purposes beyond telemetry for backend services, and has moved into other scopes such as mobile apps and web pages. This speaks volumes about how fast this community has grown and is a testament to the hard work and love that has gone into expanding its capabilities.

As happens with any growth story, as time passes and OpenTelemetry expands and evolves, some of its parts, which were exactly what was needed at the beginning, may no longer make much sense. One important aspect that has been discussed in the Client SIG is the way we refer to an "entity that produces telemetry", be it a backend service, a web app, or a mobile app. Under the current semantic conventions, all of these are defined as a service, which has proven confusing for people who are starting to adopt OpenTelemetry for purposes other than backend services.

This issue aims to provide a term for an "entity that produces telemetry" that is not tied to any particular environment, so that it better represents the wide range of use cases that OpenTelemetry has come to support over time, and hopefully covers any use case that might arise in the future as well. I am conscious of the longevity of the current term service and how widely adopted it is across existing services and even non-service entities, which will not make this an easy change. However, given how fast OpenTelemetry is growing, the longer we wait to make these kinds of changes, the more difficult they will become.

The proposed replacement name is origin; more details on what the change would look like are in this PR.

Proposal: Adding profiling as a supported event type

Profiling events

There is a growing view that performance monitoring and application monitoring (tracking the time spent in functions and methods versus how long it takes to serve a request) are nearly identical and both fall under the realm of Observability (understanding how your service is performing).

How is this different from tracing

Conventional tracing shows how a user's request flows through the application and where time is spent in different operations. However, it can miss background operations that indirectly impact the user's request flow.

For example, take a rate-limiting service that has a background sync to share state among other nodes:

import (
    "net/http"

    "go.opentelemetry.io/otel/trace"
)

func ShouldRateLimit(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // The span for the inbound request was started by upstream middleware
        // and travels in the request context.
        span := trace.SpanFromContext(r.Context())
        defer span.End()

        // ratelimit and limits are application-specific helpers.
        key, err := ratelimit.GetKey(r)
        if err != nil || limits.Key(key).Exceed() {
            w.WriteHeader(http.StatusTooManyRequests) // return 429 status code
            return
        }
        next.ServeHTTP(w, r)
    })
}

func (l *limits) SyncLimits() {
    l.cache.RLock()
    defer l.cache.RUnlock()
    for _, limit := range l.cache.entries { // l.cache: application-specific, RWMutex-guarded cache
        // Publish data to each node or distributed cache.
        // Update internal values with shared updates.
        _ = limit
    }
}

In the above example, I can clearly see how the function ShouldRateLimit impacts request processing time, since the context carried by the request can be used to link spans together. However, there is a hidden cost in SyncLimits that currently cannot be exposed, because it runs independently of inbound requests and thus cannot (and should not) share the same context.

Now, the SyncLimits function could emit metrics to help expose runtime performance issues, but that approach is problematic because:

  • As a developer, I need to know in advance what to start observing in order to diagnose the problem
  • The problem may disappear due to its nature (race conditions, Heisenbugs)
  • It is hard to measure the performance of one function relative to the entire application
  • Deadlocks / livelocks cannot easily be measured without elaborate code orchestration

Suggestion

At least within the Go community, https://github.com/google/pprof has been the leading tool for answering these kinds of questions, and it has first-party support in Go. Moreover, AWS also has its own solution, https://aws.amazon.com/codeguru/, which offers something similar for JVM-based applications.

Desired outcomes of data:

  • Show cumulative runtime of functions (could also derive percentage from this data)
  • Map resource usage (CPU, Memory, and I/O) to internal methods / functions

Desired outcomes of orchestration:

  • Low friction when adding profiling support (as an example, pprof adds a single handler to perform software-based profiling)
  • Should not require major modifications of existing code to work (should not require adding functions that would complicate existing logic)

I understand that software-based profiling is not 100% accurate, as per the write-up at https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md. However, it could give amazing insight into hidden application performance, helping increase reliability and performance and uncovering resource issues that are hard to discover with the existing events being emitted.
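For reference, the "single handler" integration mentioned in the orchestration goals above is roughly the following standard-library setup (the port is an arbitrary choice):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // CPU, heap, goroutine, block, and mutex profiles are then available
    // under http://localhost:6060/debug/pprof/ for tools such as pprof.
    log.Println(http.ListenAndServe("localhost:6060", nil))
}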

Proposal: Supporting Real User Monitoring Events in OpenTelemetry

Real User Monitoring in OpenTelemetry Data Model

This is a proposal to add real user monitoring (RUM) as an independent observability tool, or ‘signal’, to the OpenTelemetry specification. Specifically, we propose a data model and semantics that support the collection and export of RUM telemetry.

Motivation

Our goal is to make it easy for application owners to move their real user monitoring (RUM) telemetry across services. We aim to accomplish this by providing application owners with a standardized, platform agnostic tool set for recording RUM telemetry. Such a tool set would include (1) a common API and (2) SDKs that implement the API and support multiple platforms including web applications and native mobile applications.

To achieve this goal, we propose modifying the OpenTelemetry specification to support collecting and exporting RUM telemetry. Specifically, OpenTelemetry currently supports three signals: tracing, metrics, and logs. We propose adding a fourth signal, RUM events, which will be used to record telemetry for interactions between end users and the application being monitored. See the Alternatives section for a discussion of why we propose a new signal over using an existing signal.

Background

What is RUM?

RUM allows customers to monitor user interactions within their applications in real time. For example, RUM can provide application owners with insights into how users navigate their application, how quickly the application loads for users or how many new users tried the application. RUM provides application owners with a way to rapidly address issues and improve the user experience.

Examples of RUM use cases include:

  • Counting the number of new users versus number of returning users
  • Visualizing how users navigate the application's UI
  • Identifying pages that generate a high number of errors
  • Identifying pages with high load latency
  • Counting conversions from home page to purchase
  • Linking traces (e.g., RPCs and HTTP requests) to user interactions

To enable application monitoring, RUM collects telemetry (e.g., button clicks, load times, errors) from applications with user interfaces (e.g., JavaScript in browsers, or native Android or iOS applications) and dispatches this telemetry to a collection service.

RUM Model

RUM is analogous to, but semantically different from tracing. While tracing records a compute operation, RUM records data relating to the experience of a user performing a task. We refer to the interaction between a user and an application to perform a task as a session. The diagram below shows the structure of a RUM session.

RUM Session Data Model

A session represents the interactions that occur between a user and an application while the user works to accomplish a task. Because an application is UI driven, RUM records telemetry based on which page (or UI) the user is viewing. This (1) allows engineers to correlate events with the UI that generated them, and (2) allows designers to view how users navigate the application. Pages have a set of attributes (an attribute is a key/value pair) and a list of events (an event is a named and timestamped set of attributes).

Because RUM aims to aggregate data from multiple sessions into metrics, it is unnecessary and impractical to export entire sessions from a client to a collector. Instead, we export events as they occur and aggregate events from multiple sessions in the collector, or later on using batch processing. The diagram below visualizes this relationship.

RUM Event Collection

Internal Details

The Open Telemetry specification currently defines three signals: tracing, metrics and logs. We propose adding a fourth signal, RUM events, which would provide observability into real user interactions with an application.

RUM Event Context Definition

RUM records and dispatches telemetry, such as button clicks and errors, in near real time. To support aggregating this data across dimensions, context must be propagated with each event. The context for an event includes session and page attributes, which represent the dimensions by which events will be aggregated.

For example, consider a JavaScript (JS) error event. Context such as (1) page ID and (2) browser type must be propagated with the event to efficiently aggregate metrics such as (1) number of JS errors by page and (2) number of JS errors by browser type.

Events are grouped by (1) session and then (2) page. Session fields include:

| Field Name | Type | Description |
|---|---|---|
| Resource | Resource | Uniquely identifies an application. |
| User ID | string | Identifies the user of an application. This can be either a random ID for unauthenticated users, or the ID of an authenticated user. |
| Session ID | string | Identifies a series of interactions between a user and an application. |
| Attributes | map | Session attributes are extensible. For example, they may include data such as browser, operating system or device. |

Pages represent discrete UIs, or views, within an application. For web applications, pages can be represented by a URL, or more commonly, a subset of the URL such as the path or hash fragment. Native mobile applications will have a different type of ID for their pages. Page fields include:

| Field Name | Type | Description |
|---|---|---|
| Page/View ID | string | Uniquely identifies a discrete user interface within an application. For example, a web application may identify pages by the URL's path or hash fragment. |
| Attributes | map | Page attributes are extensible. For example, they may include data such as the URL of the web page. |

RUM Event Definition

Pages generate zero or more events. Events store and transmit information about an interaction between a user and the application being monitored. Event fields include:

| Field Name | Type | Description |
|---|---|---|
| Timestamp | uint64 | An epoch timestamp in milliseconds, measured on the client system when the event occurred. |
| Event type | string | Uniquely identifies an event schema. The event type contains an event name prefix (e.g., com.amazon.aws) followed by an event name (e.g., dom_event). When the event is sent to a collection service, the event schema instructs the collection service how to validate and deserialize the event details. |
| Details | object | Each event has a unique schema. Event schemas are not fixed -- they may be created, modified and removed, and are therefore outside of the scope of this data model. This field contains a JSON object which adheres to a schema unique to the event type. |
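Taken together, a minimal sketch of these groupings as Go types might look like the following. Field and type names are purely illustrative, not a proposed wire format; on the wire, each exported event would carry its session and page context rather than being nested like this.

package rum

import "encoding/json"

// Session groups everything recorded while a user works on a task.
type Session struct {
    Resource   map[string]string // uniquely identifies the application
    UserID     string            // random ID or authenticated user ID
    SessionID  string
    Attributes map[string]string // e.g. browser, operating system, device
    Pages      []Page
}

// Page represents a discrete UI (view) within the application.
type Page struct {
    PageID     string            // e.g. URL path or hash fragment
    Attributes map[string]string // e.g. full URL of the web page
    Events     []Event
}

// Event is a named, timestamped set of attributes generated by a page.
type Event struct {
    TimestampMillis uint64          // epoch millis, measured on the client
    EventType       string          // schema ID, e.g. "com.amazon.aws.dom_event"
    Details         json.RawMessage // schema-specific payload
}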

RUM Event Types

Because there is no fixed set of RUM events (RUM event types may be created, modified or removed), specific events are not part of the RUM data model. Examples of RUM event types may include, but are not limited to:

  • Session start
  • Page view
  • Page load timing
  • Resource load timing
  • DOM interaction
  • JavaScript error
  • HTTP request with trace

Example of a RUM event record

We display this example as a JSON object, as JSON is natively supported by JavaScript and web protocols. Alternatively, the SDK may transmit the record as a protobuf.

{
  "resource": {
    "application_id": "2ecec2d5-431a-41d5-a28c-1448c6284d44"
  },
  "user_id": "93c71068-9cd9-11eb-a8b3-0242ac130003",
  "session_id": "a8cc5ef0-9cd9-11eb-a8b3-0242ac130003",
  "session_attributes": {
    "browser": "Chrome",
    "operating_system": "Android",
    "device_type": "Mobile"
  },
  "page_id": "/console/home",
  "page_attributes": {
    "host": "console.amazon.aws.com",
    "path": "/console/home",
    "hash": "#about"
  },
  "event": {
    "timestamp": 1591898400000,
    "type": "com.amazon.aws.dom_event",
    "details": {
      "event": "click",
      "element_id": "submitButton"
    }
  }
}

What does this data look like on the wire?

Events are human generated and therefore sparse. We estimate about 1 - 60 events per minute per user, depending on the application. The number of events for a single session is small; however, because of the volume of users, the cost of network calls and storage may be high compared to the value of the data, so the number of events may be capped or events may be sampled. For example, events for a session may be capped at a few hundred.

Alternatives / Discussion

Why create a RUM event signal instead of using the log signal?

Benefits of transmitting RUM telemetry using the log signal include: (1) less work would be required to modify and implement the Open Telemetry specification, and (2) the complexity of the Open Telemetry specification would not increase substantially.

We propose creating a new data model rather than using the existing logs signal. Using logs would require soft contracts between (1) the application and the SDK and (2) the SDK and the collector. Such soft contracts, without a standardized and strongly typed API, could fracture SDK implementations, which would hurt maintainability and the portability of RUM telemetry.

Some aspects of the RUM signal may also be cross cutting concerns, which is not supported by the log signal. For example, it may be valuable to propagate RUM context (e.g., session ID, page ID, UI events) across API boundaries, so that downstream executions can be associated with the user interactions that triggered them.


By creating a new signal, we get stronger typing at the expense of added complexity. This does not open the door to a signal for every domain; for example, we would not create new signal types for databases, pub/sub, etc.

We view databases and pub/sub as specific technologies that need to be monitored, while tracing, metrics and logging are monitoring technologies. Our proposition is that (1) real user monitoring, like tracing, metrics and logging, is a monitoring technology, and (2) there are advantages to treating real user monitoring as a first-class monitoring technology within OTel.


Could we use semantic conventions instead of a new signal by packaging RUM data as traces?

The opentelemetry-js and opentelemetry-js-contrib SDKs already capture certain browser activity associated with RUM (i.e., http requests, document load behavior and DOM events) as traces. Conceptually, we view tracing as the process of recording the execution of a program. This fits very well for specific web application execution activities like HTTP requests, load timing and executions that are initiated by DOM events.

However, we view RUM as the process of recording the experience of a person interacting with a program, which is something that traces cannot effectively model. Because RUM is driven by human interactions with the application, we need a system that can capture events over a long period of time and link those events together into a timeline of the user’s experience.

RUM events can model many different types of telemetry, such as: traces, errors, sequences of DOM interactions, web vitals measurements, etc. These events must be associated with a RUM session and a view of the application (i.e., the page the user is viewing). The Splunk SDK (i.e., opentelemetry-js + splunk-sdk-javascript) makes this association by attaching the session ID and page URL to spans as attributes.

The long-term problems with using traces to record RUM sessions are that (1) there is no guarantee that each implementation behaves the same, reducing data portability, (2) many events are not traces, which violates object-oriented design principles and reduces the maintainability of the SDK, and (3) it makes it more difficult to define and validate events.

Regarding (1), we would like the ability to change RUM providers with minimal changes to an application’s monitoring instrumentation.

Regarding (2), we would like the ability to define session attributes (e.g., browser, device, platform), page attributes (e.g., page ID, page URL, page interaction level) and event attributes.

Regarding (2), we would also like the ability to install plugins in the SDK which record RUM events. I don’t think using traces or logs prevents this; however, I think it reduces maintainability.

Regarding (3), we would like the ability to define schemas for events so that (a) we can type-check events when implementing RUM SDKs, (b) we can verify that incoming event payloads are valid during collection, and (c) we can query the event data after it is stored.


Could we use semantic conventions instead of a new signal by packaging RUM data as logs?

Logs store unstructured data and also suffer from (1) and (3) above. In addition, it might be beneficial to separate log traffic from RUM event traffic so that the collection service doesn’t need to do so itself.


Could we achieve stronger type safety on top of existing log or trace signals, for example, by adding a projection layer on top of these signals?

Potentially -- would a projection layer improve understandability and maintainability compared to adding a new signal?

cc: @qhanam

Group maturity status into experimental and stable

As discussed in #232, it may be helpful to bucket maturity statuses, especially given that the terms experimental and stable are used extensively throughout OTel. For example:

  • Experimental: not implemented, development, alpha, beta, RC
  • Stable: GA, unmaintained, deprecated
