jaegertracing / jaeger-clickhouse
Jaeger ClickHouse storage plugin implementation
License: Apache License 2.0
The number of spans that can be recorded at a time is a constant, so it is not flexible at all.
Make that number a parameter in the config.
Describe the bug
Got an error in the Jaeger UI with the ClickHouse gRPC plugin when searching for traces:
HTTP Error: plugin error: rpc error: code = Unavailable
desc = connection error:
desc = "transport: error while dialing: dial unix /tmp/plugin2381205323:
connect: connection refused
Seems it happens
Clickhouse is up and running.
Restarting Jaeger Query fixes the problem temporarily (until the next long search).
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Traces are found
Version (please complete the following information):
What troubleshooting steps did you try?
Additional context
jaeger-query-clickhouse-5ff64c9dbc-h7jr4.log
Add integration test that spins up ClickHouse, Jaeger with storage plugin. The test would store and query data.
Jaeger already has storage integration tests - https://github.com/jaegertracing/jaeger/tree/master/plugin/storage/integration. It would be great if they could be reused.
We should document how the old data can be removed (alter table jaeger_spans drop partition 20201) and add support for TTL https://clickhouse.tech/docs/en/sql-reference/statements/alter/ttl/ (the user could specify the number of days in the config).
E.g.
CREATE TABLE IF NOT EXISTS jaeger_index_local (
timestamp DateTime CODEC(Delta, ZSTD(1)),
traceID String CODEC(ZSTD(1)),
service LowCardinality(String) CODEC(ZSTD(1)),
operation LowCardinality(String) CODEC(ZSTD(1)),
durationUs UInt64 CODEC(ZSTD(1)),
tags Array(String) CODEC(ZSTD(1)),
INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 64,
INDEX idx_duration durationUs TYPE minmax GRANULARITY 1
) ENGINE MergeTree()
PARTITION BY toDate(timestamp)
ORDER BY (service, -toUnixTimestamp(timestamp))
TTL timestamp + INTERVAL 90 DAY
SETTINGS index_granularity=1024
cc) @chhetripradeep could you please join in and document how you delete old data?
There are two issues when working with a ClickHouse database in a non-UTC time zone:
could not connect to database: "could not load time location: unknown time zone {any non-UTC time zone}"
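The "unknown time zone" text is the error Go's `time.LoadLocation` returns when it cannot find a zone in any time zone database. One possible cause, which is an assumption here since the issue does not say how the plugin was deployed, is a minimal container image without system tzdata. The sketch below shows the standard-library way to embed the database into the binary; `resolveZone` is a hypothetical helper, not part of the plugin.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embeds the IANA time zone database into the binary (Go 1.15+)
)

// resolveZone is a hypothetical helper that resolves a time zone name and
// returns a descriptive error instead of silently falling back to UTC.
func resolveZone(name string) (*time.Location, error) {
	loc, err := time.LoadLocation(name)
	if err != nil {
		return nil, fmt.Errorf("could not load time location %q: %w", name, err)
	}
	return loc, nil
}

func main() {
	// With the embedded database this lookup does not depend on the host image.
	loc, err := resolveZone("Europe/Berlin")
	fmt.Println(loc, err)

	// An unknown name still fails, but with a clear message.
	_, err = resolveZone("Not/AZone")
	fmt.Println(err)
}
```

Embedding tzdata increases the binary size somewhat; installing the tzdata package in the image is an alternative.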
The repository does not have any unit tests. It would be great to add tests to increase coverage. E.g. Jaeger uses 95% test coverage.
We currently use the jaeger-clickhouse image and our security team has flagged it as being impacted by two HIGH CVEs
To resolve these CVEs the following packages need to be updated to a minimum version of:
We prefer to have the packages fixed upstream to ensure that everyone can benefit from the updates.
Using a vulnerability scanner (e.g. aqua/trivy), scan the jaeger-clickhouse image:
trivy image jaeger-clickhouse:0.13.0
No vulnerabilities listed.
Test replicated database in integration tests as well.
No such config in workflows.
Hi! Thanks for the project, I believe it's of a great value to the community.
Currently, this plugin accumulates data and then writes it to the database. I think it's important to do several things to ensure more durable writes:
What do you think?
Currently, each insert batch contains a single span. ClickHouse performs better when data is inserted in large batches.
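A minimal sketch of size-based batching, under the assumption that spans are buffered in memory and handed to a single bulk-insert callback. The `Span` and `batcher` types and the `flush` callback are hypothetical names for illustration, not the plugin's actual writer.

```go
package main

import "fmt"

// Span is a placeholder type for this sketch.
type Span struct{ TraceID string }

// batcher buffers spans and hands them to flush in groups of up to size.
type batcher struct {
	size  int
	buf   []Span
	flush func([]Span) error // e.g. one multi-row INSERT per call
}

// Add buffers one span and flushes once the buffer reaches size.
func (b *batcher) Add(s Span) error {
	b.buf = append(b.buf, s)
	if len(b.buf) < b.size {
		return nil
	}
	return b.Flush()
}

// Flush writes whatever is buffered, if anything.
func (b *batcher) Flush() error {
	if len(b.buf) == 0 {
		return nil
	}
	batch := b.buf
	b.buf = nil
	return b.flush(batch)
}

func main() {
	b := &batcher{size: 10, flush: func(batch []Span) error {
		fmt.Println("inserting", len(batch), "spans")
		return nil
	}}
	for i := 0; i < 25; i++ {
		if err := b.Add(Span{TraceID: fmt.Sprint(i)}); err != nil {
			fmt.Println("flush failed:", err)
		}
	}
	_ = b.Flush() // write the remainder on shutdown
}
```

A production writer would also flush on a timer so that a partially filled buffer does not wait indefinitely; this sketch omits that and is not safe for concurrent use.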
The distributed table could be created with multiple sharding functions: rand(), cityHash64(traceID) - see https://clickhouse.tech/docs/en/sql-reference/functions/hash-functions/.
The hash functions take an argument; we should consider using traceID to keep data from a single trace in the same location.
CREATE TABLE IF NOT EXISTS jaeger_spans AS jaeger_spans_local ENGINE = Distributed('{cluster}', default, jaeger_spans_local, cityHash64(traceID));
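To illustrate why a hash of traceID keeps a trace together while rand() does not, here is a simplified sketch. It assumes equally weighted shards, and FNV-1a stands in for cityHash64 (which is not in the Go standard library), so the shard numbers will not match what ClickHouse computes. `shardFor` is a hypothetical name.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a trace ID to one of n shards. FNV-1a is a stand-in for
// ClickHouse's cityHash64; only the "same key, same shard" property matters here.
func shardFor(traceID string, n uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte(traceID)) // Write on a hash.Hash never returns an error
	return h.Sum64() % n
}

func main() {
	// Every span of a trace carries the same traceID, so all of them
	// map to the same shard regardless of when they are written.
	for _, id := range []string{"trace-a", "trace-a", "trace-b"} {
		fmt.Println(id, "-> shard", shardFor(id, 3))
	}
}
```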
As an active jaeger-clickhouse user I'd like to suggest to use the native ClickHouse communication protocol instead of database/sql-compatible one. This change might significantly increase the overall performance and speed up the spans writer.
jaeger-clickhouse uses the clickhouse-go 1.5.4 client. It provides the standard database/sql interface for communication with ClickHouse.
There is a benchmark section in the readme of the repository. It claims that migration to v2 might significantly speed up write and read. This speed up is possible due to usage of the native TCP ClickHouse client-server protocol. Furthermore, new versions (>= 2.3.0) use ch-go for compression.
Switch to the newer version of the clickhouse-go client and enjoy the better performance.
Maybe it's even possible to switch to ch-go, but I think the library may not support all the used high-level features of clickhouse-go at the moment.
I have two questions in mind:
Do you, folks, see any pitfalls?
Ability to achieve service dependency graph for clickhouse backend
Hi team, I was exploring this repo and wonder how the service dependency graph is achieved in Jaeger with ClickHouse as the data source. Is there a Spark job similar to https://github.com/jaegertracing/spark-dependencies that supports the ClickHouse backend?
I have a number of proposals that I can make to this project:
ghcr.io/jaegertracing/jaeger-collector:1.29.0-clickhouse-0.8.0-stretch
ghcr.io/jaegertracing/jaeger-collector:1.29.0-clickhouse-0.8.0
ghcr.io/jaegertracing/jaeger-collector:1.29.0-clickhouse
ghcr.io/jaegertracing/jaeger-collector:clickhouse-0.8.0-stretch
ghcr.io/jaegertracing/jaeger-collector:clickhouse-0.8.0
ghcr.io/jaegertracing/jaeger-collector:clickhouse
all-in-one, jaeger-agent, jaeger-collector, jaeger-ingester, jaeger-query.
The implementation of part of the above can be found in this project: https://github.com/levonet/docker-jaeger.
I'm ready to move this infrastructure over, and my team can support it for as long as we use Jaeger.
There are long-lived processes in our ecosystem involving many services. Sometimes this leads to anomalously huge traces (100k+ spans). Fetching all spans from such traces slows down trace search.
There is no option (similar to es.max-num-spans for Elasticsearch storage) to limit the number of spans fetched for each trace.
Introduce a new setting in the config file to limit the number of spans fetched for each trace.
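One way such a setting could work is to append a LIMIT clause to the per-trace query only when the limit is configured. The sketch below assumes a hypothetical `maxNumSpans` option and a hypothetical `model` column holding the encoded span; neither name is taken from the plugin's code.

```go
package main

import "fmt"

// buildTraceQuery returns the SQL used to fetch the spans of one trace.
// A maxNumSpans of zero or less means "no limit" in this sketch.
func buildTraceQuery(table string, maxNumSpans int) string {
	q := fmt.Sprintf("SELECT model FROM %s WHERE traceID = ?", table)
	if maxNumSpans > 0 {
		q += fmt.Sprintf(" LIMIT %d", maxNumSpans)
	}
	return q
}

func main() {
	fmt.Println(buildTraceQuery("jaeger_spans", 0))
	fmt.Println(buildTraceQuery("jaeger_spans", 10000))
}
```

When several traces are fetched in one query, ClickHouse's `LIMIT n BY traceID` clause would apply the cap per trace rather than to the whole result.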
Our replication and sharding guide (https://github.com/pavolloffay/jaeger-clickhouse/blob/main/guide-sharding-and-replication.md#replication) uses the '{cluster}' substitution when creating a distributed table, e.g.
CREATE TABLE IF NOT EXISTS jaeger_spans ON CLUSTER '{cluster}' AS jaeger_spans_local ENGINE = Distributed('{cluster}', default, jaeger_spans_local, cityHash64(traceID));
I am not sure if I understand what it exactly does. Could somebody explain it? @EinKrebs @chhetripradeep
Let's say my CH deployment defines two clusters
<remote_servers>
<example_cluster1>
...
</example_cluster1>
<example_cluster2>
...
</example_cluster2>
</remote_servers>
So if the create command is executed, would it create tables on all clusters?
In the jaeger_index tables, tags are encoded as a nested array of keys and values.
That works well when Jaeger Query is the only consumer, but in our company we also use Jaeger for analytics purposes.
Since ClickHouse 21.3, the Map type (https://clickhouse.com/docs/en/sql-reference/data-types/map/) is available. I think it could be a good alternative to Nested.
Have you already run any performance (time and storage) tests with Map?
Would it be an acceptable contribution (behind a flag that is off by default)?
If we change the TTL days, ClickHouse by default removes expired rows through row-level merges, which rewrites a lot of data: https://clickhouse.com/docs/en/operations/settings/settings/#ttl_only_drop_parts
It would be great not to waste resources on merges that remove expired rows one by one. Right now this setting cannot be set at table creation time, and changing the TTL later consumes a lot of CPU.
Set ttl_only_drop_parts to 1 for the tables by default (dropping whole parts, which here means whole days), or make it configurable up front.
@chhetripradeep from https://github.com/pavolloffay/jaeger-clickhouse/pull/34#issuecomment-886882402
We run with 3 replicas and, as we need to expand the cluster, we add more shards. One thing to note is that ClickHouse doesn't have any built-in data balancing feature, i.e. once data is written to a node, it stays there for the lifetime of that data unless the operator moves it manually, so it's good to do capacity planning at the beginning of cluster provisioning.
Would you like to create a doc/guide for capacity planning?
As a ClickHouse analytics user, I want the clickhouse-jaeger schema to allow using ClickHouse native JSON columns so that we can query data in ClickHouse more efficiently (both in terms of performance and query simplicity).
Currently, Clickhouse-Jaeger stores JSON span data in a String column, which makes it quite verbose to query fields within the column using ClickHouse's JSON functions, especially once you get past 2 levels of nesting.
This is very evident when you want to query the ingested data to generate your own analytics or insights. It would be nice if jaeger-clickhouse added support for ClickHouse native JSON columns.
A solution may be to start providing support for the native JSON datatype (It's still "experimental", but the spec has been quite stable for a while)
The major open question is how this would affect the split between protobuf and json encoded data (currently, string supports both) and whether it'll add more complexities to the project. Need to observe more to see the impact of this, but wanted to raise this with the community/maintainers to get an idea of their thoughts.
The replication guide (https://github.com/pavolloffay/jaeger-clickhouse/blob/main/guide-sharding-and-replication.md#replication) requires users to run SQL scripts on one node (because we use ON CLUSTER).
We could add a new config option replication: true that would indicate that replication is enabled. The plugin would then use ON CLUSTER.
cc) @EinKrebs is this smth that interests you?
Describe the bug
Filter "error=true" does not show traces with errors.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Traces with Redis error found
Version (please complete the following information):
What troubleshooting steps did you try?
Tag filter works as expected with other tags (e.g. "param.driverID").
Error tag filter works with Elasticsearch plugin.
I use that docker-compose to compare ELK and Clickhouse storages.
Make assumes the path to Jaeger all-in-one is ${HOME}/projects/jaegertracing/jaeger/cmd/all-in-one/all-in-one-linux-amd64; it may be better to make this an environment variable.
I ran jaeger grpc-plugin integration tests with this plugin and it failed.
Integration test failed because this plugin doesn't support jaeger/spanstore.Operation.SpanKind.
Can we change the insert SQL to the following form?
insert into mytable select col1, col2 from input('col1 String, col2 DateTime64(3), col3 Int32')
This form of SQL gives the best performance.
https://clickhouse.com/docs/en/integrations/java#with-input-table-function
Changing the SQL should stop spans from being dropped.
I did it the simplest way, following https://github.com/jaegertracing/jaeger-clickhouse/blob/main/guide-kubernetes.md
kubectl get cm jaeger-clickhouse -o yaml
apiVersion: v1
data:
  config.yaml: |
    address: clickhouse-jaeger:9000
    username: clickhouse_operator
    password: clickhouse_operator_password
    spans_table:
    spans_index_table:
    operations_table:
kind: ConfigMap
But it reports that it cannot connect to ClickHouse.
The error log is as follows:
[ERROR] jaeger-clickhouse: Failed to create a storage: @module=jaeger-clickhouse EXTRA_VALUE_AT_END="could not connect to database: \"dial tcp: missing address\"" timestamp=2023-02-09T09:02:29.083Z
Defaulted container "jaeger" out of: jaeger, install-plugin (init)
2023/02/09 09:02:29 maxprocs: Leaving GOMAXPROCS=36: CPU quota undefined
{"level":"info","ts":1675933349.0723028,"caller":"flags/service.go:119","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1675933349.0723774,"caller":"flags/service.go:125","msg":"Mounting expvar handler on admin server","route":"/debug/vars"}
{"level":"info","ts":1675933349.07262,"caller":"flags/admin.go:129","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1675933349.0726702,"caller":"flags/admin.go:143","msg":"Starting admin HTTP server","http-addr":":14269"}
{"level":"info","ts":1675933349.0726848,"caller":"flags/admin.go:121","msg":"Admin server started","http.host-port":"[::]:14269","health-status":"unavailable"}
2023-02-09T09:02:29.075Z [WARN] plugin configured with a nil SecureConfig
2023-02-09T09:02:29.075Z [DEBUG] starting plugin: path=/plugin/jaeger-clickhouse args=["/plugin/jaeger-clickhouse", "--config", "/plugin-config/config.yaml"]
2023-02-09T09:02:29.076Z [DEBUG] plugin started: path=/plugin/jaeger-clickhouse pid=23
2023-02-09T09:02:29.076Z [DEBUG] waiting for RPC address: path=/plugin/jaeger-clickhouse
2023-02-09T09:02:29.083Z [ERROR] jaeger-clickhouse: Failed to create a storage: @module=jaeger-clickhouse EXTRA_VALUE_AT_END="could not connect to database: \"dial tcp: missing address\"" timestamp=2023-02-09T09:02:29.083Z
{"level":"fatal","ts":1675933349.0847218,"caller":"./main.go:109","msg":"Failed to init storage factory","error":"grpc-plugin builder failed to create a store: error attempting to connect to plugin rpc client: Unrecognized remote plugin message: \n\nThis usually means that the plugin is either invalid or simply\nneeds to be recompiled to support the latest protocol.","stacktrace":"main.main.func1\n\t./main.go:109\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:968\nmain.main\n\t./main.go:240\nruntime.main\n\truntime/proc.go:250"}
2023-02-09T09:02:29.084Z [ERROR] plugin process exited: path=/plugin/jaeger-clickhouse pid=23 error="exit status 1"
ClickHouse and the database are working normally:
# kubectl get pod
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 25 (30m ago) 25h
chi-jaeger-cluster1-0-0-0 1/1 Running 0 18m
jaeger-clickhouse-854dfc4c5d-fkcq9 0/1 CrashLoopBackOff 6 (5m6s ago) 10m
# kubectl exec -it statefulset.apps/chi-jaeger-cluster1-0-0 -- clickhouse-client -h clickhouse-jaeger --user clickhouse_operator --password clickhouse_operator_password
ClickHouse client version 22.1.3.7 (official build).
Connecting to clickhouse-jaeger:9000 as user clickhouse_operator.
Connected to ClickHouse server version 22.1.3 revision 54455.
:) use jaeger
USE jaeger
Query id: c0bb766d-3fbe-4e4f-9fe8-68ac5a0b2345
Ok.
0 rows in set. Elapsed: 0.001 sec.
:) show databases;
SHOW DATABASES
Query id: 19942759-efd9-4d13-adf7-26acd678425b
┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default │
│ information_schema │
│ jaeger │
│ system │
└────────────────────┘
5 rows in set. Elapsed: 0.002 sec.
SELECT
query_id,
client_hostname,
initial_address
FROM system.processes
Query id: 13ca8ec2-1449-45ca-a44a-89e37fa070b7
┌─query_id─────────────────────────────┬─client_hostname────────────────────┬─initial_address────┐
│ 13ca8ec2-1449-45ca-a44a-89e37fa070b7 │ clickhouse-client-5574484945-b7zx9 │ ::ffff:10.100.3.42 │
└──────────────────────────────────────┴────────────────────────────────────┴────────────────────┘
1 rows in set. Elapsed: 0.002 sec.
I followed https://github.com/jaegertracing/jaeger-clickhouse/blob/main/guide-kubernetes.md in the simplest way.
Is this document missing necessary steps? I also tried creating the table manually.
my jaeger-operator version is 1.41.0
Is it an image version problem?
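For context on the log above: "dial tcp: missing address" is the error Go's `net.Dial` returns when it is given an empty address string, which suggests the plugin did not see a value for `address` at connection time. Why that happened is not established in this report. As a sketch of how a clearer message could be produced before dialing, a hypothetical `validateAddress` helper (not part of the plugin) might look like this:

```go
package main

import (
	"fmt"
	"net"
)

// validateAddress checks that addr has the host:port form expected by a
// TCP dial, and explains what is wrong when it does not.
func validateAddress(addr string) error {
	if addr == "" {
		return fmt.Errorf("address is empty; expected host:port")
	}
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("invalid address %q: %w", addr, err)
	}
	if host == "" || port == "" {
		return fmt.Errorf("invalid address %q: host and port are both required", addr)
	}
	return nil
}

func main() {
	for _, addr := range []string{"clickhouse-jaeger:9000", "clickhouse-jaeger", ""} {
		fmt.Printf("%q: %v\n", addr, validateAddress(addr))
	}
}
```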
Describe the bug
The plugin writes a very large number of traces (service='jaeger-query', operation='/jaeger.storage.v1.SpanWriterPlugin/WriteSpan'), even when there are no other spans to write. With default settings, at least 20 such traces are written per tick. After writing spans that are not from jaeger-query, several thousand spans per tick can continue to be written for a very long time, even when no spans are being sent to Jaeger.
To Reproduce
Steps to reproduce the behavior:
SELECT count()
FROM jaeger_index_local
WHERE (service = 'jaeger-query') AND (operation = '/jaeger.storage.v1.SpanWriterPlugin/WriteSpan') AND (timestamp >= (now() - toIntervalMinute(1)))
┌─count()─┐
│ 15329 │
└─────────┘
Expected behavior
No/very little of such spans.
Version (please complete the following information):
What troubleshooting steps did you try?
Didn't find any information about this problem.
Describe the bug
serialized, err = proto.Marshal(span) insert error
Version (please complete the following information):
What troubleshooting steps did you try?
Additional context
Right now the default encoding is JSON, because that is how it was set historically. I would like to understand why JSON is preferred over protobuf.
https://github.com/pavolloffay/jaeger-clickhouse/blob/main/config.yaml#L8
This project does not seem to have an active maintainer. There are a couple of open PRs from @nickbp and @bocharovf. Are any of you willing to take part in the project and maintain it?
cc) @EinKrebs
When I was using jaeger-clickhouse, I found that WriteSpan via grpc stream will reduce the CPU utilization of jaeger and increase the throughput. So I made a pull request to jaeger, which supports grpc stream. But this still requires plugin support.
Atomic engine should allow us to remove arguments from ENGINE ReplicatedMergeTree('/clickhouse/tables/{shard}/jaeger_index', '{replica}')
Log the SQL statements that are executed during init. Use debug level.
It's useful to know how schema is initialized.
For fine-tuning parameters like
# Batch size. Default 10_000.
batch_write_size:
# Batch flush size. Default 5s.
batch_flush_interval:
It would be great to expose metrics like:
Although, I am not sure if the batch size influences performance or not.
Upload released binary with example config to Github Release page.
The SQL scripts https://github.com/pavolloffay/jaeger-clickhouse/tree/main/sqlscripts are needed at Jaeger startup. To simplify distribution these scripts can be embedded into the binary.
Disclaimer: I am just starting with ClickHouse.
As far as I can tell our scripts https://github.com/pavolloffay/jaeger-clickhouse/blob/main/sqlscripts/0002-jaeger-spans.sql create only local tables and not distributed ones so this tutorial will likely need different scripts.
The System Architecture page reported an error when I used ClickHouse as the storage backend. The error looks like this:
HTTP Error: plugin error:rpc error: code = Unknown desc = not implemented
I expect System Architecture to display a DAG of the calling relationship of each system
1.47.0
OpenTelemetry javaagent 1.28.0
javaagent->otelcol->jaeger
clickhouse
linux
CLI
As a Jaeger Operator
I want to be able to modify the TTL configuration of my tables/databases
So that I can change these settings after the initial database creation
Currently, TTL is set ONLY on database creation.
A change to the TTL config values after database creation is not propagated to the database or its tables.
We can add sqlscripts to perform the TTL adjustment independently from database creation.
For the spans table, this new script will look similar to
ALTER TABLE {{.SpansTable}}
MODIFY {{.TTLTimestamp}}
and we should make sure we run this script AFTER the one that creates the table, so it won't fail on new installs.
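A rough sketch of how the proposed script could be rendered with Go's text/template. The field names follow the placeholders in the snippet above; the assumption that `TTLTimestamp` expands to a full `TTL <expr>` clause, as well as the `ttlParams` type and the example values, are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// ttlParams carries the values substituted into the script.
// TTLTimestamp is assumed to hold a complete "TTL <expr>" clause.
type ttlParams struct {
	SpansTable   string
	TTLTimestamp string
}

// modifyTTL mirrors the script proposed in the issue.
const modifyTTL = "ALTER TABLE {{.SpansTable}}\nMODIFY {{.TTLTimestamp}}"

// renderModifyTTL fills the template and returns the SQL statement.
func renderModifyTTL(p ttlParams) (string, error) {
	tmpl, err := template.New("modify-ttl").Parse(modifyTTL)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, p); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	sql, err := renderModifyTTL(ttlParams{
		SpansTable:   "jaeger_spans_local",
		TTLTimestamp: "TTL timestamp + INTERVAL 30 DAY",
	})
	fmt.Println(sql, err)
}
```

Because ALTER TABLE ... MODIFY TTL replaces the table's TTL expression, running such a script at every startup after the create script would apply the currently configured value.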
Describe the bug
Integration tests fail due to incorrect behaviour of GetOperations.
Expected behavior
Test won't fail
Screenshots
Version (please complete the following information):
2021.07.14 17:06:49.783711 [ 219 ] {11925d3b-7684-4919-827b-319af811c400} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
2021.07.14 17:06:49.783769 [ 1010 ] {d4de6e5e-6305-4802-842e-13c660886ef2} <Error> TCPHandler: Code: 202, e.displayText() = DB::Exception: Too many simultaneous queries. Maximum: 100, Stack trace:
0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8d31b5a in /usr/bin/clickhouse
1. DB::ProcessList::insert(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::IAST const*, std::__1::shared_ptr<DB::Context const>) @ 0xfcd6802 in /usr/bin/clickhouse
2. ? @ 0xfe21ab3 in /usr/bin/clickhouse
3. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum, bool) @ 0xfe208e3 in /usr/bin/clickhouse
4. DB::TCPHandler::runImpl() @ 0x1069f6c2 in /usr/bin/clickhouse
5. DB::TCPHandler::run() @ 0x106b25d9 in /usr/bin/clickhouse
6. Poco::Net::TCPServerConnection::start() @ 0x1338b30f in /usr/bin/clickhouse
7. Poco::Net::TCPServerDispatcher::run() @ 0x1338cd9a in /usr/bin/clickhouse
8. Poco::PooledThread::run() @ 0x134bfc19 in /usr/bin/clickhouse
9. Poco::ThreadImpl::runnableEntry(void*) @ 0x134bbeaa in /usr/bin/clickhouse
10. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
11. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
The DB is started as docker run --rm -it -p9000:9000 --name some-clickhouse-server --ulimit nofile=262144:262144 yandex/clickhouse-server:21