vectordotdev / vector


A high-performance observability data pipeline.

Home Page: https://vector.dev

License: Mozilla Public License 2.0

Languages: Rust 60.10%, Dockerfile 0.08%, Shell 0.14%, Lua 0.06%, Batchfile 0.01%, Ruby 0.08%, CUE 36.98%, DIGITAL Command Language 0.01%, PowerShell 0.01%, CSS 0.07%, JavaScript 0.26%, TypeScript 0.77%, Sass 0.12%, HTML 1.33%, Starlark 0.01%, Python 0.01%
Topics: router, logs, metrics, rust, observability, forwarder, vector, parser, events, stream-processing

vector's Introduction

Quickstart  •   Docs  •   Guides  •   Integrations  •   Chat  •   Download  •   Rust Crate Docs

Vector

What is Vector?

Vector is a high-performance, end-to-end (agent & aggregator) observability data pipeline that puts you in control of your observability data. Collect, transform, and route all your logs and metrics to any vendors you want today and any other vendors you may want tomorrow. Vector enables dramatic cost reduction, novel data enrichment, and data security where you need it, not where it is most convenient for your vendors. Additionally, it is open source and up to 10x faster than every alternative in the space.

To get started, follow our quickstart guide or install Vector.

Principles

  • Reliable - Built in Rust, Vector's primary design goal is reliability.
  • End-to-end - Deploys as an agent or aggregator. Vector is a complete platform.
  • Unified - Logs, metrics (beta), and traces (coming soon). One tool for all of your data.

Use cases

  • Reduce total observability costs.
  • Transition vendors without disrupting workflows.
  • Enhance data quality and improve insights.
  • Consolidate agents and eliminate agent fatigue.
  • Improve overall observability performance and reliability.

Community

  • Vector is relied on by startups and enterprises like Atlassian, T-Mobile, Comcast, Zendesk, Discord, Fastly, CVS, Trivago, Tuple, Douban, Visa, Mambu, Blockfi, Claranet, Instacart, Forcepoint, and many more.
  • Vector is downloaded over 100,000 times per day.
  • Vector's largest user processes over 30TB daily.
  • Vector has over 100 contributors and growing.


Comparisons

Performance

The following performance tests demonstrate baseline performance between common protocols with the exception of the Regex Parsing test.

Test | Vector | Filebeat | FluentBit | FluentD | Logstash | Splunk UF | Splunk HF
TCP to Blackhole | 86 MiB/s | n/a | 64.4 MiB/s | 27.7 MiB/s | 40.6 MiB/s | n/a | n/a
File to TCP | 76.7 MiB/s | 7.8 MiB/s | 35 MiB/s | 26.1 MiB/s | 3.1 MiB/s | 40.1 MiB/s | 39 MiB/s
Regex Parsing | 13.2 MiB/s | n/a | 20.5 MiB/s | 2.6 MiB/s | 4.6 MiB/s | n/a | 7.8 MiB/s
TCP to HTTP | 26.7 MiB/s | n/a | 19.6 MiB/s | <1 MiB/s | 2.7 MiB/s | n/a | n/a
TCP to TCP | 69.9 MiB/s | 5 MiB/s | 67.1 MiB/s | 3.9 MiB/s | 10 MiB/s | 70.4 MiB/s | 7.6 MiB/s

To learn more about our performance tests, please see the Vector test harness.

Correctness

The following correctness tests are not exhaustive, but they demonstrate fundamental differences in quality and attention to detail:

Tools compared: Vector, Filebeat, FluentBit, FluentD, Logstash, Splunk UF, Splunk HF.
Tests: Disk Buffer Persistence, File Rotate (create), File Rotate (copytruncate), File Truncation, Process (SIGHUP), JSON (wrapped).

To learn more about our correctness tests, please see the Vector test harness.

Features

Vector is an end-to-end, unified, open data platform.

Tools compared: Vector, Beats, Fluentbit, Fluentd, Logstash, Splunk UF, Splunk HF, Telegraf.
Features: End-to-end (Agent, Aggregator), Unified (Logs, Metrics, Traces 🚧), Open (Open-source, Vendor-neutral), Reliability (Memory-safe, Delivery guarantees, Multi-core).

⚠ = Not interoperable, metrics are represented as structured logs


Developed with ❤️ by Datadog - Security Policy - Privacy Policy

vector's People

Contributors

001wwang, binarylogic, blt, bruceg, dependabot[bot], dsmith3197, fanatid, fuchsnj, hoverbear, jamtur01, jdrouet, jeanmertz, jeffail, jszwedko, juchiast, ktff, leebenson, luciofranco, lucperkins, lukesteensen, michaelfairley, mozgiii, neuronull, pablosichert, prognant, pront, spencergilbert, stephenwakely, tobz, tshepang


vector's Issues

Initial config file implementation

We should be able to take a structured config file as input (likely just JSON at first) and use it to drive the dynamic topology builder from #45.

This should be more focused on getting a feel for the right "shape" of the config data and how it interacts with the implementation than on making it the most convenient configuration system possible. Once we have this in place we can look into more convenient ways of generating that configuration data.
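To make the discussion concrete, here is a minimal sketch of what that "shape" might look like as serde-deserializable Rust structs. The field names and component types below are hypothetical, not a committed schema:

```rust
use serde::Deserialize;
use std::collections::HashMap;

// Hypothetical config shape: named sources, transforms, and sinks, with
// `inputs` edges describing the topology between them.
#[derive(Debug, Deserialize)]
struct Config {
    sources: HashMap<String, SourceConfig>,
    #[serde(default)]
    transforms: HashMap<String, TransformConfig>,
    sinks: HashMap<String, SinkConfig>,
}

#[derive(Debug, Deserialize)]
struct SourceConfig {
    r#type: String, // e.g. "tcp"
    #[serde(default)]
    address: Option<String>,
}

#[derive(Debug, Deserialize)]
struct TransformConfig {
    r#type: String,      // e.g. "sampler"
    inputs: Vec<String>, // names of upstream components
    #[serde(default)]
    rate: Option<u64>,
}

#[derive(Debug, Deserialize)]
struct SinkConfig {
    r#type: String, // e.g. "splunk_hec"
    inputs: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = r#"{
        "sources":    { "in":     { "type": "tcp", "address": "0.0.0.0:9000" } },
        "transforms": { "sample": { "type": "sampler", "inputs": ["in"], "rate": 10 } },
        "sinks":      { "out":    { "type": "splunk_hec", "inputs": ["sample"] } }
    }"#;
    let config: Config = serde_json::from_str(raw)?;
    println!("{config:#?}");
    Ok(())
}
```

The point of starting from plain data structures like these is that the same shape can later be fed from JSON, TOML, or an API without changing the topology builder.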

Improve config validation

Config validation currently happens while building the actual sink; it should be pushed up into the topology section of the router.

Make a first pass at structuring data internally

Data is currently represented and operated on as a simple String. In order to support things like parsing, more intelligent sampling, routing, etc, we should have a more structured internal data format.

While we may eventually move towards something like SSF or Cernan's internal format, a good start would be a simple struct wrapping the string content of each log line and a map of key/value pairs.
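As a rough illustration of that starting point (names are illustrative, not the eventual internal types):

```rust
use std::collections::HashMap;

// A minimal structured event: the raw line plus extracted key/value fields.
#[derive(Debug, Clone)]
struct Record {
    raw: String,
    fields: HashMap<String, String>,
}

impl Record {
    fn new(raw: impl Into<String>) -> Self {
        Record { raw: raw.into(), fields: HashMap::new() }
    }
}

fn main() {
    let mut record = Record::new("GET /index.html 200");
    record.fields.insert("method".into(), "GET".into());
    record.fields.insert("status".into(), "200".into());
    println!("{record:?}");
}
```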

Simple first parsing component

With #37, we have the beginning of support for structured data. To complete the initial level of support, we need a way to extract fields from raw input into that structured format.

The simplest implementation is probably regex-based with captures. Eventually we'll want parsers that are much more convenient and easy to use (e.g. native parsers for common formats, Lua-based parsers, maybe rosie-lang, etc), but we can start with the basics to get a feel for how parsers will fit into the system as a whole. Once we have one in place and working, it should be relatively straightforward to expand.
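A tiny sketch of the regex-with-named-captures approach using the regex crate (the pattern and field names here are just examples):

```rust
use regex::Regex;
use std::collections::HashMap;

// Extract every named capture group from a raw line into a field map.
fn parse(pattern: &Regex, line: &str) -> Option<HashMap<String, String>> {
    let caps = pattern.captures(line)?;
    Some(
        pattern
            .capture_names()
            .flatten()
            .filter_map(|name| {
                caps.name(name)
                    .map(|m| (name.to_string(), m.as_str().to_string()))
            })
            .collect(),
    )
}

fn main() {
    let pattern = Regex::new(r"(?P<method>\w+) (?P<path>\S+) (?P<status>\d{3})").unwrap();
    let fields = parse(&pattern, "GET /index.html 200").unwrap();
    println!("{fields:?}");
}
```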

New `aws_kinesis_streams` source

It would be nice if Vector could ingest logs from an AWS Kinesis data stream (not Firehose, which is covered in #3566). A rough sketch of what such a source's configuration might look like follows the requirements below.

Requirements

  • Ability to exclusively read partitions across multiple Vector instances.
  • Checkpointing the stream to resume properly when Vector is restarted (and prevent data loss).
    • Bonus points if checkpoints can be stored remotely, like in Dynamo.
  • Ability to specify where to start reading from the stream (horizon, etc).
  • Add the kinesis.stream and kinesis.partition as context fields (. denoting nested fields).
  • Ability to merge split/multi-line messages.
  • It should handle all of the various stream statuses and react accordingly.
  • Consider using the new enhanced fanout and HTTP2 capabilities to avoid polling.
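Purely to capture these requirements as options, a hypothetical configuration surface for the source might look like the following; none of the field names are final, and the checkpoint/fan-out details are assumptions:

```rust
use serde::Deserialize;

// Hypothetical option surface for an `aws_kinesis_streams` source.
#[derive(Debug, Deserialize)]
struct KinesisStreamsSourceConfig {
    stream_name: String,
    /// Where to begin reading: "trim_horizon", "latest", or "at_timestamp".
    #[serde(default = "default_starting_position")]
    starting_position: String,
    /// Optional remote checkpoint store (e.g. a DynamoDB table name) so a
    /// restarted Vector can resume without data loss.
    #[serde(default)]
    checkpoint_table: Option<String>,
    /// Merge split / multi-line records back into single events.
    #[serde(default)]
    multiline: bool,
    /// Use enhanced fan-out (HTTP/2 push) instead of polling for records.
    #[serde(default)]
    enhanced_fanout: bool,
}

fn default_starting_position() -> String {
    "trim_horizon".to_string()
}

fn main() {
    let raw = r#"
        stream_name = "app-logs"
        starting_position = "latest"
        checkpoint_table = "vector-checkpoints"
    "#;
    let config: KinesisStreamsSourceConfig = toml::from_str(raw).unwrap();
    println!("{config:#?}");
}
```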

Automated performance testing

As we make architectural decisions and add features, we need to have a good idea of how they affect the performance of the system in various configurations. This has been done manually up to this point, but that is very labor-intensive and error-prone.

Similar to #39, we should have a system in place that lets us easily measure the performance of various router configurations and determine how that performance is changing over time.

Initial metrics support

We'd like the router to support both transforming logs into metrics and collecting metrics directly. For the first pass at this, we'll probably need a few basic components:

  • The ability to represent metrics in our internal data format. Both cernan and SSF could be good inspiration here.
  • A transform that can aggregate its input and emit output on a scheduled interval
  • A simple way to send that output somewhere we can see it. This could be as simple as treating it as log lines we can use with existing sinks, or just logging it out.

Some things to think about but not necessarily solve right away:

  • How "typed" do we want our internal data format? There's a whole spectrum between a simple map of attributes and full-fledged "log line" / "metric" types that apply to transforms/sources/sinks.
  • How will we support "bucketing" of aggregated metrics? Purely based on the wall-clock time of their arrival, or will there be a concept of event time vs processing time? How would late data be handled in that case?
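A hedged sketch of how metrics might sit alongside log lines in the internal format (the types and names here are illustrative only, not a design decision):

```rust
use std::collections::HashMap;

// One possible internal representation: an event is either a log record or a
// metric, so the same topology machinery can route both.
#[derive(Debug, Clone)]
enum Event {
    Log {
        raw: String,
        fields: HashMap<String, String>,
    },
    Metric {
        name: String,
        kind: MetricKind,
        value: f64,
        tags: HashMap<String, String>,
    },
}

#[derive(Debug, Clone)]
enum MetricKind {
    Counter,
    Gauge,
    Timer,
}

fn main() {
    let metric = Event::Metric {
        name: "requests_total".into(),
        kind: MetricKind::Counter,
        value: 1.0,
        tags: HashMap::from([("status".to_string(), "200".to_string())]),
    };
    println!("{metric:?}");
}
```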

Initial Lua integration for transforms

We want the router to be operator-programmable and it seems like the most straightforward implementation of that would be loadable Lua transforms (following the precedent of Nginx, Haproxy, etc). Early experiments used the rlua crate with success.

There are a few issues that an initial implementation should keep in mind and try to get a feel for:

  1. Interface: how do we expose the "shape" of a transform to the Lua code in a way that makes it clear what the contract is for authors and allows all the functionality we want?
  2. Testing and validation: how do we make it easy when authoring transforms in Lua to be confident that your code will behave in production? It should be trivial to have the router run your code locally with a variety of inputs (simple fuzzing, even). Ideally, it would even give you an idea of your code's performance relative to that of the larger system (i.e. "this works and handles errors, but will likely bottleneck the system to roughly X throughput").
  3. Libraries: one of the benefits of Lua is that the Heka/Hindsight ecosystem has already built a variety of log parsers that we can potentially reuse. It'd be nice to support those parsers as well as the underlying libraries they use (lpeg) in transforms written for the router. We should think about how those will be packaged and distributed for or into the router.
  4. Performance: we should try to get an idea of how much performance is lost when writing a sampler or parser in Lua instead of natively in Rust. If it's not much, we can consider building up a larger "standard library" of Lua functions and building blocks. If it's a lot, we can focus more on easier ways to compose native transforms.

Again, with the initial implementation we're looking more to map out the territory than to necessarily solve all of these issues. It's possible, for example, that there's not a clean way to integrate lpeg and the Heka parsers, and that even doing so in a messy way leads to much worse performance. If that's the case, we may choose to take an alternative path to operator-programmability.
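To make the interface question concrete, here is a minimal sketch of calling an operator-supplied transform using the mlua crate (a maintained successor to rlua). The event-as-a-Lua-table contract shown here is an assumption, not a settled design:

```rust
use mlua::{Lua, Result, Table};

fn main() -> Result<()> {
    let lua = Lua::new();

    // An operator-supplied transform: it receives the event as a table and
    // returns the (possibly modified) table, or nil to drop the event.
    let transform: mlua::Function = lua
        .load(
            r#"
            function (event)
                event.processed = "true"
                return event
            end
            "#,
        )
        .eval()?;

    let event = lua.create_table()?;
    event.set("message", "hello world")?;

    let out: Option<Table> = transform.call(event)?;
    if let Some(out) = out {
        let message: String = out.get("message")?;
        let processed: String = out.get("processed")?;
        println!("message={message} processed={processed}");
    }
    Ok(())
}
```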

Just for fun, some of the more experimental ideas we've had as alternatives to Lua:

  1. wasm (could actually be very promising)
  2. weld
  3. eBPF
  4. RPC w/ client libraries
  5. Exec-style scripts with pipes

New `vector_metrics` source

Following #64, it would be neat if we could create an "internal" metrics source and use it to instrument the rest of the application.

This would give us good feedback on our metrics implementation and make our own observability story better. It would replace the existing prometheus counters and give us the flexibility to use any metrics sink we want.

Taken to the extreme, we could have the source be our existing logger and use transforms, aggregates, etc to derive all the metrics we want. We don't need to go that far at first, but it could be something interesting to work towards.

Build Splunk sink

The first sink we need to build is for Splunk. At this stage, we only provide value to them by being in front of Splunk itself.

From our perspective, by far the most desirable Splunk integration would be with the HTTP Event Collector (HEC). There are multiple open source examples of integrations with this collector that we could work from, and the integration itself amounts to relatively simple HTTP requests.

Based on our meeting, it sounds like they're in the process of testing the HEC and it is not yet supported in production. It seems unlikely that supporting something else would be a better decision than simply making their rollout of the HEC a dependency of their rollout of the router, but we should verify that point with them before committing to it completely.
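For reference, a single HEC event submission is roughly the following; this is a sketch using reqwest, not the sink's actual code, and the endpoint, token, and payload are placeholders:

```rust
use reqwest::blocking::Client;
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder endpoint and token; HEC listens on port 8088 by default and
    // expects a `Splunk <token>` Authorization header.
    let client = Client::new();
    let body = json!({
        "sourcetype": "_json",
        "event": { "message": "hello from the router", "level": "info" }
    });

    let response = client
        .post("https://splunk.example.com:8088/services/collector/event")
        .header("Authorization", "Splunk 00000000-0000-0000-0000-000000000000")
        .json(&body)
        .send()?;

    println!("HEC responded with {}", response.status());
    Ok(())
}
```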

Fix flaky tests

There is now a `flaky` feature flag for tests that may fail intermittently even though they are actually correct. We should fix up these tests so that we can allow them to run on CI.

Build S3 sink

After #19, the next most valuable sink to build is likely to be S3. This would provide them with a cheap place to send data that is dropped from the Splunk ingestion pipeline.

Simple compressed flat files will probably suffice as a first pass, but longer term we'll want to support more efficient structured formats like ORC and Parquet.

Improve integration testing story

Currently, our "integration tests" consist of running the main binary (which is an awkward combination of the actual program and a harness for exercising it) and watching for things to blow up. This is in the process of falling apart as we add a second configuration (the ES writer) and would require a big increase in complexity to cover that case as well.

Instead, we should build a first-class test harness that lets us easily run data through various topologies (e.g. tcp in -> sample -> tcp out, tcp in -> parse -> ES out) and assert that they behave as expected.

There are a lot of ways this could be implemented, but a good source of inspiration is the TopologyTestDriver from Kafka Streams. We don't (yet) have an equivalent to their StreamBuilder, so it's possible we need to take a different approach. There are also likely benefits to a completely external test harness that treats the router process as a black box, but it would need a way to configure the topologies it wants to test.

Failure testing

We should have a portion of the test suite to exercise the various failure paths of the router. This could use something like the fail points library to inject failures deterministically.

There will be a lot of decisions we need to make about what behavior is desirable under various types and combinations of failures, and the point of this work is not necessarily to make all of those decisions. Instead, we should aim to map out as many of those decision points as possible so that we can chat as a group and hopefully come up with a coherent overall plan for failure handling.
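A small sketch of what deterministic injection could look like with the `fail` crate (one possible fail-points library; the flush function below is hypothetical):

```rust
// Requires the `fail` crate with its `failpoints` feature enabled for tests.
fn flush_to_sink(batch: &[String]) -> Result<(), String> {
    // A named fail point on the flush path: a no-op in normal builds, but it
    // can be armed in failure tests to return an error instead.
    fail::fail_point!("sink-flush", |_| Err("injected flush failure".to_string()));
    println!("flushed {} events", batch.len());
    Ok(())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn flush_fails_when_point_is_armed() {
        fail::cfg("sink-flush", "return").unwrap();
        assert!(flush_to_sink(&["event".to_string()]).is_err());
        fail::remove("sink-flush");
    }
}
```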

Gzip ES sink bodies

Like the S3 sink, we should be gzipping our request bodies to ES to save on bandwidth.
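Roughly, using the flate2 crate (an assumption about the compression library; the sink may wire this in differently):

```rust
use flate2::{write::GzEncoder, Compression};
use std::io::Write;

// Compress a request body before attaching it to the bulk request; the sink
// would also need to set the `Content-Encoding: gzip` header.
fn gzip_body(body: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(body)?;
    encoder.finish()
}

fn main() -> std::io::Result<()> {
    let body = br#"{"index":{}}
{"message":"hello"}
"#;
    let compressed = gzip_body(body)?;
    println!("{} bytes -> {} bytes", body.len(), compressed.len());
    Ok(())
}
```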

Consistent sampling

Modify the Sampler behavior to provide a consistent sampling decision for a given input. That is, if the log line "foo" is passed through the sampler, all future log lines "foo" will also be passed. Likewise, if "foo" is rejected, all future "foo" lines will also be rejected.

To do so, replace the Rng-based implementation with one based on a hash of the content. Eventually this will be tweaked to look at a configurable field, but working with the whole line is a reasonable start and should behave roughly the same as the random sampler for a given sampling rate.
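A minimal sketch of the hash-based decision, using the standard library's default hasher (the real transform would hash a configurable field rather than the whole line):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Keep roughly 1 in `rate` lines, but make the decision a pure function of
/// the content so the same line always gets the same answer.
fn keep(line: &str, rate: u64) -> bool {
    let mut hasher = DefaultHasher::new();
    line.hash(&mut hasher);
    hasher.finish() % rate == 0
}

fn main() {
    for line in ["foo", "bar", "foo"] {
        println!("{line}: {}", if keep(line, 10) { "pass" } else { "drop" });
    }
}
```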

Investigate dynamic configuration strategy

The purpose of this exercise is to gauge how much work dynamic configuration would take to implement. We need a general idea of its cost and time to determine whether it should make the 0.1 milestone.

Derive basic metrics from trace events

Try implementing a custom subscriber that can aggregate counters and timers from trace events and expose that data in a configurable way.

The most important data we'll want initially are record throughput per node, rates of things like parse failures, timings for calls to external services, retry counts, and batch sizes.

If the custom-subscriber route turns out to be tricky or slow, reevaluate using a more standard metrics collection library.
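A hedged sketch of the idea (not the design the issue settled on), using a tracing-subscriber Layer that counts every event it sees, the simplest possible derived metric:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};
use tracing_subscriber::{layer::Context, prelude::*, Layer};

// A Layer that aggregates a single counter from trace events. A fuller
// version would dispatch on span/event fields to build named counters and
// timers.
struct EventCounter {
    events: Arc<AtomicU64>,
}

impl<S: tracing::Subscriber> Layer<S> for EventCounter {
    fn on_event(&self, _event: &tracing::Event<'_>, _ctx: Context<'_, S>) {
        self.events.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let events = Arc::new(AtomicU64::new(0));
    tracing_subscriber::registry()
        .with(EventCounter { events: events.clone() })
        .init();

    tracing::info!(component = "parser", "parse failure");
    tracing::info!(component = "sink", "retrying request");

    println!("events observed: {}", events.load(Ordering::Relaxed));
}
```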

Flesh out CI builds

  • Clippy
  • S3 integration test
  • Splunk integration test
  • Cloudwatch integration test
  • Run benchmarks (and log the performance changes between builds?)

Handle `data_dir` changing

Currently, changes to data_dir are completely ignored during config reloading. We should probably prevent the reload if it's changed (and require a clean shutdown+restart to change it).

Improve the logging story

This is just a tracking issue to follow up on how we should be logging things. We have a lot of state machines and it will be useful to have trace statements that help debug why things may go wrong.

Have `Topology::start` use dynamic setup

Rather than having a special way of starting all of the components at boot time, startup could be treated as a "reload" where the previous config was completely empty.

In-process data plane

A component crucial to the efficiency and reliability of the router is exactly how we move data from the various sources, between different transforms, and ultimately to sinks. The Heka postmortem essentially blames Go channels for killing the project, so we want to be sure to avoid the same mistake.

The project that Heka points to as doing this right, Hindsight, uses disk queues. It essentially uses files on disk as the intermediary between components, relying on the OS pagecache to keep things efficient. This is a very similar strategy to Kafka, and well proven in practice.

When building the equivalent functionality for the router, we should dig into Hindsight's implementation as much as possible to learn from their experience. We should also model our solution as much as possible as a single-machine Kafka. This model provides a lot of benefits that we want, such as decoupling producers from consumers, allowing limited retention, durability across restarts, etc.
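To illustrate the core mechanic only (this is a toy, not a proposal for the actual data plane), appending length-prefixed records to a segment file and leaning on the OS page cache looks like this:

```rust
use std::fs::OpenOptions;
use std::io::{Result, Write};

// Append a length-prefixed record to a segment file. Recently written
// segments stay in the page cache, so readers running slightly behind the
// writer remain cheap, which is the property Hindsight and Kafka lean on.
fn append_record(path: &str, record: &[u8]) -> Result<()> {
    let mut segment = OpenOptions::new().create(true).append(true).open(path)?;
    segment.write_all(&(record.len() as u32).to_le_bytes())?;
    segment.write_all(record)?;
    Ok(())
}

fn main() -> Result<()> {
    append_record("queue-00000001.segment", b"first event")?;
    append_record("queue-00000001.segment", b"second event")?;
    Ok(())
}
```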

Config validation and error messages

We should try to get the basics of a framework in place that will let us provide users with useful and comprehensive error messages when something is wrong with their configuration.

This exercise could also provide some valuable feedback on the design of the configuration system, which ideally would make it hard to write an incorrect config.

Startup checks for sources and sinks

We should allow sources and sinks to run a health check on their dependencies at startup to avoid situations where we boot up and then immediately fail or go into a surprising retry loop. This could potentially tie into #66.

One thing to keep in mind is that there may be cases (e.g. intermittent connectivity issues or recovering after an incident) where starting up in spite of certain types of issues is desirable. A simple solution would be a flag to skip these checks, but a better (and more difficult) one would be some way of differentiating between errors that can be retried and those that are fatal.

Decide on initial source

Should we limit ourselves to places they've already deployed Filebeat, or do we need to support input from Splunk?

Filebeat

If we're willing to tie ourselves to their Filebeat rollout, we have a few options:

  • Lumberjack protocol (i.e. pretend to be Logstash)
  • Kafka protocol
  • Redis protocol

The Lumberjack protocol would take some research and reverse-engineering, since it is not directly documented. On the plus side, being a drop-in replacement for Logstash (at least for this type of input) could be interesting.

The Kafka protocol is not incredibly simple, but it's more of a known quantity and well designed for the type of workload we're targeting. It also has the big advantage of cluster-aware clients. Even though we're unlikely to support clustering right away, using the Kafka protocol gives us one of the only clear paths towards it.

The Redis protocol is very simple and would likely be the quickest way to get started.

Splunk

If we're not willing to limit ourselves to places they've already rolled out alternatives to the Splunk collector, our only option (other than reverse-engineering the Splunk protocol, which is likely not allowed) is to implement a BSD syslog server.

The syslog protocol is quite simple and likely wouldn't take terribly long to implement, but it's not a protocol that provides the kind of features we want moving forward.

Dynamic topologies

Currently, the topology of sources, transforms, and sinks run by the router is hard-coded in main. For users to define their own topologies without writing code and recompiling, we need to be able to build them up dynamically at runtime.

We shouldn't worry too much about configuration file formats or anything like that for this first pass but just focus on a convenient way to describe a topology as data and then build and run the actual topology from that data.

Build sampling transform

It seems that the "transform" they have the most interest in by far is sampling. Their stated goal was to send only 10% of their existing volume to Splunk, though it wasn't clear how feasible they considered this.

We can quite easily write a native sampling transform that performs extremely well and can have the sampling rate updated on the fly by hitting a configuration endpoint. This feels like the obvious first piece of this type of functionality to build.

One additional feature I imagine they'll require is a way to match certain types of log lines and always forward them (i.e. skip the sampling step). This should also be very straightforward to support as part of the sampling transform, but we should find out exactly how sophisticated this matching would need to be. Ideally simple constant string matching would be fine (e.g. does the log line include the literal string very-important-billing-event), but there's a chance they'd also want regex.
