vectordotdev / vector


A high-performance observability data pipeline.

Home Page: https://vector.dev

License: Mozilla Public License 2.0

Languages: Rust 60.10%, Dockerfile 0.08%, Shell 0.14%, Lua 0.06%, Batchfile 0.01%, Ruby 0.08%, CUE 36.98%, DIGITAL Command Language 0.01%, PowerShell 0.01%, CSS 0.07%, JavaScript 0.26%, TypeScript 0.77%, Sass 0.12%, HTML 1.33%, Starlark 0.01%, Python 0.01%
Topics: router, logs, metrics, rust, observability, forwarder, vector, parser, events, stream-processing

vector's Introduction

Quickstart  •   Docs  •   Guides  •   Integrations  •   Chat  •   Download  •   Rust Crate Docs

Vector

What is Vector?

Vector is a high-performance, end-to-end (agent & aggregator) observability data pipeline that puts you in control of your observability data. Collect, transform, and route all your logs and metrics to any vendors you want today and any other vendors you may want tomorrow. Vector enables dramatic cost reduction, novel data enrichment, and data security where you need it, not where it is most convenient for your vendors. Additionally, it is open source and up to 10x faster than every alternative in the space.

To get started, follow our quickstart guide or install Vector.

Principles

  • Reliable - Built in Rust, Vector's primary design goal is reliability.
  • End-to-end - Deploys as an agent or aggregator. Vector is a complete platform.
  • Unified - Logs, metrics (beta), and traces (coming soon). One tool for all of your data.

Use cases

  • Reduce total observability costs.
  • Transition vendors without disrupting workflows.
  • Enhance data quality and improve insights.
  • Consolidate agents and eliminate agent fatigue.
  • Improve overall observability performance and reliability.

Community

  • Vector is relied on by startups and enterprises like Atlassian, T-Mobile, Comcast, Zendesk, Discord, Fastly, CVS, Trivago, Tuple, Douban, Visa, Mambu, Blockfi, Claranet, Instacart, Forcepoint, and many more.
  • Vector is downloaded over 100,000 times per day.
  • Vector's largest user processes over 30TB daily.
  • Vector has over 100 contributors and growing.


Comparisons

Performance

The following performance tests demonstrate baseline performance between common protocols with the exception of the Regex Parsing test.

Test | Vector | Filebeat | FluentBit | FluentD | Logstash | Splunk UF | Splunk HF
TCP to Blackhole | 86 MiB/s | n/a | 64.4 MiB/s | 27.7 MiB/s | 40.6 MiB/s | n/a | n/a
File to TCP | 76.7 MiB/s | 7.8 MiB/s | 35 MiB/s | 26.1 MiB/s | 3.1 MiB/s | 40.1 MiB/s | 39 MiB/s
Regex Parsing | 13.2 MiB/s | n/a | 20.5 MiB/s | 2.6 MiB/s | 4.6 MiB/s | n/a | 7.8 MiB/s
TCP to HTTP | 26.7 MiB/s | n/a | 19.6 MiB/s | <1 MiB/s | 2.7 MiB/s | n/a | n/a
TCP to TCP | 69.9 MiB/s | 5 MiB/s | 67.1 MiB/s | 3.9 MiB/s | 10 MiB/s | 70.4 MiB/s | 7.6 MiB/s

To learn more about our performance tests, please see the Vector test harness.

Correctness

The following correctness tests are not exhaustive, but they demonstrate fundamental differences in quality and attention to detail:

Tools compared: Vector, Filebeat, FluentBit, FluentD, Logstash, Splunk UF, Splunk HF.
Tests: Disk Buffer Persistence, File Rotate (create), File Rotate (copytruncate), File Truncation, Process (SIGHUP), JSON (wrapped).

To learn more about our correctness tests, please see the Vector test harness.

Features

Vector is an end-to-end, unified, open data platform.

Tools compared: Vector, Beats, Fluentbit, Fluentd, Logstash, Splunk UF, Splunk HF, Telegraf.
Features: End-to-end (Agent, Aggregator), Unified (Logs, Metrics, Traces 🚧), Open (Open-source, Vendor-neutral), Reliability (Memory-safe, Delivery guarantees, Multi-core).

⚠ = Not interoperable, metrics are represented as structured logs


Developed with ❤️ by Datadog - Security Policy - Privacy Policy

vector's People

Contributors

001wwang, binarylogic, blt, bruceg, dependabot[bot], dsmith3197, fanatid, fuchsnj, hoverbear, jamtur01, jdrouet, jeanmertz, jeffail, jszwedko, juchiast, ktff, leebenson, luciofranco, lucperkins, lukesteensen, michaelfairley, mozgiii, neuronull, pablosichert, prognant, pront, spencergilbert, stephenwakely, tobz, tshepang


vector's Issues

Initial config file implementation

We should be able to take a structured config file as input (likely just JSON at first) and use it to drive the dynamic topology builder from #45.

This should be more focused on getting a feel for the right "shape" of the config data and how it interacts with the implementation than on making it the most convenient configuration system possible. Once we have this in place we can look into more convenient ways of generating that configuration data.
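To make the discussion concrete, here is a minimal sketch of what that "shape" might look like as serde-deserializable Rust structs. The field names and component types below are hypothetical, not a committed schema:

```rust
use serde::Deserialize;
use std::collections::HashMap;

// Hypothetical config shape: named sources, transforms, and sinks, with
// `inputs` edges describing the topology between them.
#[derive(Debug, Deserialize)]
struct Config {
    sources: HashMap<String, SourceConfig>,
    #[serde(default)]
    transforms: HashMap<String, TransformConfig>,
    sinks: HashMap<String, SinkConfig>,
}

#[derive(Debug, Deserialize)]
struct SourceConfig {
    r#type: String, // e.g. "tcp"
    #[serde(default)]
    address: Option<String>,
}

#[derive(Debug, Deserialize)]
struct TransformConfig {
    r#type: String,      // e.g. "sampler"
    inputs: Vec<String>, // names of upstream components
    #[serde(default)]
    rate: Option<u64>,
}

#[derive(Debug, Deserialize)]
struct SinkConfig {
    r#type: String, // e.g. "splunk_hec"
    inputs: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = r#"{
        "sources":    { "in":     { "type": "tcp", "address": "0.0.0.0:9000" } },
        "transforms": { "sample": { "type": "sampler", "inputs": ["in"], "rate": 10 } },
        "sinks":      { "out":    { "type": "splunk_hec", "inputs": ["sample"] } }
    }"#;
    let config: Config = serde_json::from_str(raw)?;
    println!("{config:#?}");
    Ok(())
}
```

The point of starting from plain data structures like these is that the same shape can later be fed from JSON, TOML, or an API without changing the topology builder.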

Improve config validation

Config validation currently happens while building the actual sink; it should be pushed up into the topology section of the router.

Make a first pass at structuring data internally

Data is currently represented and operated on as a simple String. In order to support things like parsing, more intelligent sampling, routing, etc, we should have a more structured internal data format.

While we may eventually move towards something like SSF or Cernan's internal format, a good start would be a simple struct wrapping the string content of each log line and a map of key/value pairs.
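As a rough illustration of that starting point (names are illustrative, not the eventual internal types):

```rust
use std::collections::HashMap;

// A minimal structured event: the raw line plus extracted key/value fields.
#[derive(Debug, Clone)]
struct Record {
    raw: String,
    fields: HashMap<String, String>,
}

impl Record {
    fn new(raw: impl Into<String>) -> Self {
        Record { raw: raw.into(), fields: HashMap::new() }
    }
}

fn main() {
    let mut record = Record::new("GET /index.html 200");
    record.fields.insert("method".into(), "GET".into());
    record.fields.insert("status".into(), "200".into());
    println!("{record:?}");
}
```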

Simple first parsing component

With #37, we have the beginning of support for structured data. To complete the initial level of support, we need a way to extract fields from raw input into that structured format.

The simplest implementation is probably regex-based with captures. Eventually we'll want parsers that are much more convenient and easy to use (e.g. native parsers for common formats, Lua-based parsers, maybe rosie-lang, etc), but we can start with the basics to get a feel for how parsers will fit into the system as a whole. Once we have one in place and working, it should be relatively straightforward to expand.
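A tiny sketch of the regex-with-named-captures approach using the regex crate (the pattern and field names here are just examples):

```rust
use regex::Regex;
use std::collections::HashMap;

// Extract every named capture group from a raw line into a field map.
fn parse(pattern: &Regex, line: &str) -> Option<HashMap<String, String>> {
    let caps = pattern.captures(line)?;
    Some(
        pattern
            .capture_names()
            .flatten()
            .filter_map(|name| {
                caps.name(name)
                    .map(|m| (name.to_string(), m.as_str().to_string()))
            })
            .collect(),
    )
}

fn main() {
    let pattern = Regex::new(r"(?P<method>\w+) (?P<path>\S+) (?P<status>\d{3})").unwrap();
    let fields = parse(&pattern, "GET /index.html 200").unwrap();
    println!("{fields:?}");
}
```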

New `aws_kinesis_streams` source

It would be nice if Vector could ingest logs from an AWS Kinesis data stream (not Firehose, which is covered in #3566). A rough sketch of what such a source's configuration might look like follows the requirements below.

Requirements

  • Ability to exclusively read partitions across multiple Vector instances.
  • Checkpointing the stream to resume properly when Vector is restarted (and prevent data loss).
    • Bonus points if checkpoints can be stored remotely, like in Dynamo.
  • Ability to specify where to start reading from the stream (horizon, etc).
  • Add the kinesis.stream and kinesis.partition as context fields (. denoting nested fields).
  • Ability to merge split/multi-line messages.
  • It should handle all of the various stream statuses and react accordingly.
  • Consider using the new enhanced fanout and HTTP2 capabilities to avoid polling.
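Purely to capture these requirements as options, a hypothetical configuration surface for the source might look like the following; none of the field names are final, and the checkpoint/fan-out details are assumptions:

```rust
use serde::Deserialize;

// Hypothetical option surface for an `aws_kinesis_streams` source.
#[derive(Debug, Deserialize)]
struct KinesisStreamsSourceConfig {
    stream_name: String,
    /// Where to begin reading: "trim_horizon", "latest", or "at_timestamp".
    #[serde(default = "default_starting_position")]
    starting_position: String,
    /// Optional remote checkpoint store (e.g. a DynamoDB table name) so a
    /// restarted Vector can resume without data loss.
    #[serde(default)]
    checkpoint_table: Option<String>,
    /// Merge split / multi-line records back into single events.
    #[serde(default)]
    multiline: bool,
    /// Use enhanced fan-out (HTTP/2 push) instead of polling for records.
    #[serde(default)]
    enhanced_fanout: bool,
}

fn default_starting_position() -> String {
    "trim_horizon".to_string()
}

fn main() {
    let raw = r#"
        stream_name = "app-logs"
        starting_position = "latest"
        checkpoint_table = "vector-checkpoints"
    "#;
    let config: KinesisStreamsSourceConfig = toml::from_str(raw).unwrap();
    println!("{config:#?}");
}
```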

Automated performance testing

As we make architectural decisions and add features, we need to have a good idea of how they affect the performance of the system in various configurations. This has been done manually up to this point, but that is very labor-intensive and error-prone.

Similar to #39, we should have a system in place that lets us easily measure the performance of various router configurations and determine how that performance is changing over time.

Initial metrics support

We'd like the router to support both transforming logs into metrics and collecting metrics directly. For the first pass at this, we'll probably need a few basic components:

  • The ability to represent metrics in our internal data format. Both cernan and SSF could be good inspiration here.
  • A transform that can aggregate its input and emit output on a scheduled interval
  • A simple way to send that output somewhere we can see it. This could be as simple as treating it as log lines we can use with existing sinks, or just logging it out.

Some things to think about but not necessarily solve right away:

  • How "typed" do we want our internal data format? There's a whole spectrum between a simple map of attributes and full-fledged "log line" / "metric" types that apply to transforms/sources/sinks.
  • How will we support "bucketing" of aggregated metrics? Purely based on the wall-clock time of their arrival, or will there be a concept of event time vs processing time? How would late data be handled in that case?
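A hedged sketch of how metrics might sit alongside log lines in the internal format (the types and names here are illustrative only, not a design decision):

```rust
use std::collections::HashMap;

// One possible internal representation: an event is either a log record or a
// metric, so the same topology machinery can route both.
#[derive(Debug, Clone)]
enum Event {
    Log {
        raw: String,
        fields: HashMap<String, String>,
    },
    Metric {
        name: String,
        kind: MetricKind,
        value: f64,
        tags: HashMap<String, String>,
    },
}

#[derive(Debug, Clone)]
enum MetricKind {
    Counter,
    Gauge,
    Timer,
}

fn main() {
    let metric = Event::Metric {
        name: "requests_total".into(),
        kind: MetricKind::Counter,
        value: 1.0,
        tags: HashMap::from([("status".to_string(), "200".to_string())]),
    };
    println!("{metric:?}");
}
```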

Initial Lua integration for transforms

We want the router to be operator-programmable and it seems like the most straightforward implementation of that would be loadable Lua transforms (following the precedent of Nginx, Haproxy, etc). Early experiments used the rlua crate with success.

There are a few issues that an initial implementation should keep in mind and try to get a feel for:

  1. Interface: how do we expose the "shape" of a transform to the Lua code in a way that makes it clear what the contract is for authors and allows all the functionality we want?
  2. Testing and validation: how do we make it easy when authoring transforms in Lua to be confident that your code will behave in production? It should be trivial to have the router run your code locally with a variety of inputs (simple fuzzing, even). Ideally, it would even give you an idea of your code's performance relative to that of the larger system (i.e. "this works and handles errors, but will likely bottleneck the system to roughly X throughput").
  3. Libraries: one of the benefits of Lua is that the Heka/Hindsight ecosystem has already built a variety of log parsers that we can potentially reuse. It'd be nice to support those parsers as well as the underlying libraries they use (lpeg) in transforms written for the router. We should think about how those will be packaged and distributed for or into the router.
  4. Performance: we should try to get an idea of how much performance is lost when writing a sampler or parser in Lua instead of natively in Rust. If it's not much, we can consider building up a larger "standard library" of Lua functions and building blocks. If it's a lot, we can focus more on easier ways to compose native transforms.

Again, with the initial implementation we're looking more to map out the territory than to necessarily solve all of these issues. It's possible, for example, that there's not a clean way to integrate lpeg and the Heka parsers, and that even doing so in a messy way leads to much worse performance. If that's the case, we may choose to take an alternative path to operator-programmability.
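To make the interface question concrete, here is a minimal sketch of calling an operator-supplied transform using the mlua crate (a maintained successor to rlua). The event-as-a-Lua-table contract shown here is an assumption, not a settled design:

```rust
use mlua::{Lua, Result, Table};

fn main() -> Result<()> {
    let lua = Lua::new();

    // An operator-supplied transform: it receives the event as a table and
    // returns the (possibly modified) table, or nil to drop the event.
    let transform: mlua::Function = lua
        .load(
            r#"
            function (event)
                event.processed = "true"
                return event
            end
            "#,
        )
        .eval()?;

    let event = lua.create_table()?;
    event.set("message", "hello world")?;

    let out: Option<Table> = transform.call(event)?;
    if let Some(out) = out {
        let message: String = out.get("message")?;
        let processed: String = out.get("processed")?;
        println!("message={message} processed={processed}");
    }
    Ok(())
}
```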

Just for fun, some of the more experimental ideas we've had as alternatives to Lua:

  1. wasm (could actually be very promising)
  2. weld
  3. eBPF
  4. RPC w/ client libraries
  5. Exec-style scripts with pipes

New `vector_metrics` source

Following #64, it would be neat if we could create an "internal" metrics source and use it to instrument the rest of the application.

This would give us good feedback on our metrics implementation and make our own observability story better. It would replace the existing prometheus counters and give us the flexibility to use any metrics sink we want.

Taken to the extreme, we could have the source be our existing logger and use transforms, aggregates, etc to derive all the metrics we want. We don't need to go that far at first, but it could be something interesting to work towards.

Build Splunk sink

The first sink we need to build is for Splunk. At this stage, we only provide value to them by being in front of Splunk itself.

From our perspective, by far the most desirable Splunk integration would be with the HTTP Event Collector (HEC). There are multiple open source examples of integrations with this collector that we could work from, and the integration itself amounts to relatively simple HTTP requests.

Based on our meeting, it sounds like they're in the process of testing the HEC and it is not yet supported in production. It seems unlikely that supporting something else would be a better decision than simply making their rollout of the HEC a dependency of their rollout of the router, but we should verify that point with them before committing to it completely.
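For reference, a single HEC event submission is roughly the following; this is a sketch using reqwest, not the sink's actual code, and the endpoint, token, and payload are placeholders:

```rust
use reqwest::blocking::Client;
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder endpoint and token; HEC listens on port 8088 by default and
    // expects a `Splunk <token>` Authorization header.
    let client = Client::new();
    let body = json!({
        "sourcetype": "_json",
        "event": { "message": "hello from the router", "level": "info" }
    });

    let response = client
        .post("https://splunk.example.com:8088/services/collector/event")
        .header("Authorization", "Splunk 00000000-0000-0000-0000-000000000000")
        .json(&body)
        .send()?;

    println!("HEC responded with {}", response.status());
    Ok(())
}
```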

Fix flaky tests

There is now a `flaky` feature flag for tests that may fail intermittently even though they are actually correct. We should fix up these tests so that we can allow them to run on CI.

Build S3 sink

After #19, the next most valuable sink to build is likely to be S3. This would provide them with a cheap place to send data that is dropped from the Splunk ingestion pipeline.

Simple compressed flat files will probably suffice as a first pass, but longer term we'll want to support more efficient structured formats like ORC and Parquet.

Improve integration testing story

Currently, our "integration tests" consist of running the main binary (which is an awkward combination of the actual program and a harness for exercising it) and watching for things to blow up. This is in the process of falling apart as we add a second configuration (the ES writer) and would require a big increase in complexity to cover that case as well.

Instead, we should build a first-class test harness that lets us easily run data through various topologies (e.g. tcp in -> sample -> tcp out, tcp in -> parse -> ES out) and assert that they behave as expected.

There are a lot of ways this could be implemented, but a good source of inspiration is the TopologyTestDriver from Kafka Streams. We don't (yet) have an equivalent to their StreamBuilder, so it's possible we need to take a different approach. There are also likely benefits to a completely external test harness that treats the router process as a black box, but it would need a way to configure the topologies it wants to test.

Failure testing

We should have a portion of the test suite to exercise the various failure paths of the router. This could use something like the fail points library to inject failures deterministically.

There will be a lot of decisions we need to make about what behavior is desirable under various types and combinations of failures, and the point of this work is not necessarily to make all of those decisions. Instead, we should aim to map out as many of those decision points as possible so that we can chat as a group and hopefully come up with a coherent overall plan for failure handling.
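A small sketch of what deterministic injection could look like with the `fail` crate (one possible fail-points library; the flush function below is hypothetical):

```rust
// Requires the `fail` crate with its `failpoints` feature enabled for tests.
fn flush_to_sink(batch: &[String]) -> Result<(), String> {
    // A named fail point on the flush path: a no-op in normal builds, but it
    // can be armed in failure tests to return an error instead.
    fail::fail_point!("sink-flush", |_| Err("injected flush failure".to_string()));
    println!("flushed {} events", batch.len());
    Ok(())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn flush_fails_when_point_is_armed() {
        fail::cfg("sink-flush", "return").unwrap();
        assert!(flush_to_sink(&["event".to_string()]).is_err());
        fail::remove("sink-flush");
    }
}
```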

Gzip ES sink bodies

Like the S3 sink, we should be gzipping our request bodies to ES to save on bandwidth.
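Roughly, using the flate2 crate (an assumption about the compression library; the sink may wire this in differently):

```rust
use flate2::{write::GzEncoder, Compression};
use std::io::Write;

// Compress a request body before attaching it to the bulk request; the sink
// would also need to set the `Content-Encoding: gzip` header.
fn gzip_body(body: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(body)?;
    encoder.finish()
}

fn main() -> std::io::Result<()> {
    let body = br#"{"index":{}}
{"message":"hello"}
"#;
    let compressed = gzip_body(body)?;
    println!("{} bytes -> {} bytes", body.len(), compressed.len());
    Ok(())
}
```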

Consistent sampling

Modify the Sampler behavior to provide a consistent sampling decision for a given input. That is, if the log line "foo" is passed through the sampler, all future log lines "foo" will also be passed. Likewise, if "foo" is rejected, all future "foo" lines will also be rejected.

To do so, replace the Rng-based implementation with one based on a hash of the content. Eventually this will be tweaked to look at a configurable field, but working with the whole line is a reasonable start and should behave roughly the same as the random sampler for a given sampling rate.
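A minimal sketch of the hash-based decision, using the standard library's default hasher (the real transform would hash a configurable field rather than the whole line):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Keep roughly 1 in `rate` lines, but make the decision a pure function of
/// the content so the same line always gets the same answer.
fn keep(line: &str, rate: u64) -> bool {
    let mut hasher = DefaultHasher::new();
    line.hash(&mut hasher);
    hasher.finish() % rate == 0
}

fn main() {
    for line in ["foo", "bar", "foo"] {
        println!("{line}: {}", if keep(line, 10) { "pass" } else { "drop" });
    }
}
```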

Investigate dynamic configuration strategy

The purpose of this exercise is to gauge how much work dynamic configuration would take to implement. We need a general idea of its cost and time to determine whether it should make the 0.1 milestone.

Derive basic metrics from trace events

Try implementing a custom subscriber that can aggregate counters and timers from trace events and expose that data in a configurable way.

The most important data we'll want initially are record throughput per node, rates of things like parse failures, timings for calls to external services, retry counts, and batch sizes.

If the custom-subscriber route turns out to be tricky or slow, reevaluate using a more standard metrics collection library.
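A hedged sketch of the idea (not the design the issue settled on), using a tracing-subscriber Layer that counts every event it sees, the simplest possible derived metric:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};
use tracing_subscriber::{layer::Context, prelude::*, Layer};

// A Layer that aggregates a single counter from trace events. A fuller
// version would dispatch on span/event fields to build named counters and
// timers.
struct EventCounter {
    events: Arc<AtomicU64>,
}

impl<S: tracing::Subscriber> Layer<S> for EventCounter {
    fn on_event(&self, _event: &tracing::Event<'_>, _ctx: Context<'_, S>) {
        self.events.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let events = Arc::new(AtomicU64::new(0));
    tracing_subscriber::registry()
        .with(EventCounter { events: events.clone() })
        .init();

    tracing::info!(component = "parser", "parse failure");
    tracing::info!(component = "sink", "retrying request");

    println!("events observed: {}", events.load(Ordering::Relaxed));
}
```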

Flesh out CI builds

  • Clippy
  • S3 integration test
  • Splunk integration test
  • Cloudwatch integration test
  • Run benchmarks (and log the performance changes between builds?)

Handle `data_dir` changing

Currently, changes to data_dir are completely ignored during config reloading. We should probably prevent the reload if it's changed (and require a clean shutdown+restart to change it).

Improve the logging story

This is just a tracking issue to follow up on how we should be logging things. We have a lot of state machines and it will be useful to have trace statements that help debug why things may go wrong.

Have `Topology::start` use dynamic setup

Rather than having a special way of starting all of the components at boot time, startup could be treated as a "reload" where the previous config was completely empty.

In-process data plane

A component crucial to the efficiency and reliability of the router is exactly how we move data from the various sources, between different transforms, and ultimately to sinks. The Heka postmortem essentially blames Go channels for killing the project, so we want to be sure to avoid the same mistake.

The project that Heka points to as doing this right, Hindsight, uses disk queues. It essentially uses files on disk as the intermediary between components, relying on the OS pagecache to keep things efficient. This is a very similar strategy to Kafka, and well proven in practice.

When building the equivalent functionality for the router, we should dig into Hindsight's implementation as much as possible to learn from their experience. We should also model our solution as much as possible as a single-machine Kafka. This model provides a lot of benefits that we want, such as decoupling producers from consumers, allowing limited retention, durability across restarts, etc.
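To illustrate the core mechanic only (this is a toy, not a proposal for the actual data plane), appending length-prefixed records to a segment file and leaning on the OS page cache looks like this:

```rust
use std::fs::OpenOptions;
use std::io::{Result, Write};

// Append a length-prefixed record to a segment file. Recently written
// segments stay in the page cache, so readers running slightly behind the
// writer remain cheap, which is the property Hindsight and Kafka lean on.
fn append_record(path: &str, record: &[u8]) -> Result<()> {
    let mut segment = OpenOptions::new().create(true).append(true).open(path)?;
    segment.write_all(&(record.len() as u32).to_le_bytes())?;
    segment.write_all(record)?;
    Ok(())
}

fn main() -> Result<()> {
    append_record("queue-00000001.segment", b"first event")?;
    append_record("queue-00000001.segment", b"second event")?;
    Ok(())
}
```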

Config validation and error messages

We should try to get the basics of a framework in place that will let us provide users with useful and comprehensive error messages when something is wrong with their configuration.

This exercise could also provide some valuable feedback on the design of the configuration system, which ideally would make it hard to write an incorrect config.

Startup checks for sources and sinks

We should allow sources and sinks to run a health check on their dependencies at startup to avoid situations where we boot up and then immediately fail or go into a surprising retry loop. This could potentially tie into #66.

One thing to keep in mind is that there may be cases (e.g. intermittent connectivity issues or recovering after an incident) where starting up in spite of certain types of issues is desirable. A simple solution would be a flag to skip these checks, but a better (and more difficult) one would be some way of differentiating between errors that can be retried and those that are fatal.

Decide on initial source

Should we limit ourselves to places they've already deployed Filebeat, or do we need to support input from Splunk?

Filebeat

If we're willing to tie ourselves to their Filebeat rollout, we have a few options:

  • Lumberjack protocol (i.e. pretend to be Logstash)
  • Kafka protocol
  • Redis protocol

The Lumberjack protocol would take some research and reverse-engineering, since it is not directly documented. On the plus side, being a drop-in replacement for Logstash (at least for this type of input) could be interesting.

The Kafka protocol is not incredibly simple, but it's more of a known quantity and well designed for the type of workload we're targeting. It also has the big advantage of cluster-aware clients. Even though we're unlikely to support clustering right away, using the Kafka protocol gives us one of the only clear paths towards it.

The Redis protocol is very simple and would likely be the quickest way to get started.

Splunk

If we're not willing to limit ourselves to places they've already rolled out alternatives to the Splunk collector, our only option (other than reverse-engineering the Splunk protocol, which is likely not allowed) is to implement a BSD syslog server.

The syslog protocol is quite simple and likely wouldn't take terribly long to implement, but it's not a protocol that provides the kind of features we want moving forward.

Dynamic topologies

Currently, the topology of sources, transforms, and sinks run by the router is hard-coded in main. For users to define their own topologies without writing code and recompiling, we need to be able to build them up dynamically at runtime.

We shouldn't worry too much about configuration file formats or anything like that for this first pass but just focus on a convenient way to describe a topology as data and then build and run the actual topology from that data.

Build sampling transform

It seems that the "transform" they have the most interest in by far is sampling. Their stated goal was to send only 10% of their existing volume to Splunk, though it wasn't clear how feasible they considered this.

We can quite easily write a native sampling transform that performs extremely well and can have the sampling rate updated on the fly by hitting a configuration endpoint. This feels like the obvious first piece of this type of functionality to build.

One additional feature I imagine they'll require is a way to match certain types of log lines and always forward them (i.e. skip the sampling step). This should also be very straightforward to support as part of the sampling transform, but we should find out exactly how sophisticated this matching would need to be. Ideally simple constant string matching would be fine (e.g. does the log line include the literal string very-important-billing-event), but there's a chance they'd also want regex.
