jaegertracing / jaeger-clickhouse

Jaeger ClickHouse storage plugin implementation

License: Apache License 2.0

Makefile 2.13% Go 97.70% Dockerfile 0.17%

clickhouse clickhouse-database grpc jaegertracing

jaeger-clickhouse's Issues

[Feature]: Add ttl_only_drop_parts to the table settings or make it configurable

Requirement

If we change the TTL days, ClickHouse by default removes expired rows during merges, which rewrites a lot of data (see https://clickhouse.com/docs/en/operations/settings/settings/#ttl_only_drop_parts).

Problem

It would be great not to waste resources on merges that remove expired rows one by one. Right now this setting cannot be set at table creation time. If you change the TTL later, it consumes a lot of CPU resources.

Proposal

Set ttl_only_drop_parts to 1 for the tables by default (expired data is then dropped by whole parts, which here means days), or make it configurable in advance.
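
A minimal sketch of how the proposal could look in the plugin, assuming a hypothetical boolean option that controls whether the setting is appended to the table's SETTINGS clause. The function name and the option are illustrative only; ttl_only_drop_parts itself is an existing MergeTree setting.

```go
package main

import (
	"fmt"
	"strings"
)

// tableSettings renders the SETTINGS clause for a MergeTree table.
// ttlOnlyDropParts is a hypothetical plugin option, not an existing one.
func tableSettings(indexGranularity int, ttlOnlyDropParts bool) string {
	settings := []string{fmt.Sprintf("index_granularity = %d", indexGranularity)}
	if ttlOnlyDropParts {
		// With the value 1, ClickHouse drops whole expired parts
		// instead of rewriting parts to remove expired rows.
		settings = append(settings, "ttl_only_drop_parts = 1")
	}
	return "SETTINGS " + strings.Join(settings, ", ")
}

func main() {
	fmt.Println(tableSettings(1024, true))
}
```

Because the tables are partitioned by day, dropping by parts would expire data in day-sized chunks.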

Open questions

No response

Remove "_v2" suffix from table names

Configure release

Upload the released binary with an example config to the GitHub Releases page.

Support writeSpan via grpc stream

Requirement - what kind of business use case are you trying to solve?

When I was using jaeger-clickhouse, I found that writing spans via a gRPC stream reduces Jaeger's CPU utilization and increases throughput. So I made a pull request to Jaeger that adds gRPC stream support, but it still requires support in the plugin.

ref: jaegertracing/jaeger#3636

Hardcoded jaeger path

The Makefile assumes that the path to Jaeger all-in-one is ${HOME}/projects/jaegertracing/jaeger/cmd/all-in-one/all-in-one-linux-amd64. It would be better to make this configurable through an environment variable.

[Feature]: Allow changing TTL configuration on existing tables

Requirement

As a Jaeger Operator
I want to be able to modify the TTL configuration of my tables/databases
So that I can change these settings after the initial database creation

Problem

Currently, TTL is set ONLY on database creation.
Changes to TTL config values after database creation are not propagated to the database or its tables.

Proposal

We can add SQL scripts that adjust the TTL independently of database creation.

For the spans table, this new script will look similar to

ALTER TABLE {{.SpansTable}}
    MODIFY {{.TTLTimestamp}}

and we should make sure we run this script AFTER the one that creates the table, so it won't fail on new installs.
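
As a rough illustration, the script could be rendered with Go's text/template, using the placeholder names from the proposal. The struct name, the helper name and the rendered form of TTLTimestamp ("TTL timestamp + INTERVAL 30 DAY") are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// ttlParams mirrors the placeholders used in the proposed script.
// What TTLTimestamp expands to is an assumption for illustration.
type ttlParams struct {
	SpansTable   string
	TTLTimestamp string
}

const alterTTL = "ALTER TABLE {{.SpansTable}}\n    MODIFY {{.TTLTimestamp}}"

// renderAlterTTL fills the template and returns the final statement.
func renderAlterTTL(p ttlParams) (string, error) {
	tmpl, err := template.New("alter-ttl").Parse(alterTTL)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, p); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	stmt, err := renderAlterTTL(ttlParams{
		SpansTable:   "jaeger_spans_local",
		TTLTimestamp: "TTL timestamp + INTERVAL 30 DAY",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(stmt)
}
```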

Open questions

No response

Durable database writes

Hi! Thanks for the project, I believe it's of a great value to the community.

Currently, this plugin accumulates data and writes it to the database. I think it's important to do several things to make writes more durable:

Retry on network and database failures. Use exponential backoff when the database cannot serve a write immediately.
Buffer data that has not yet been written to the DB. Ensure that the buffer does not overflow. Drop data intentionally if it cannot be stored in the DB.
Reload the connection string when requested: a user can add new shards to a ClickHouse installation.

What do you think?

Model alternative for jaeger_index table

In the jaeger_index tables, tags are encoded as a nested array of keys and values.
That works well when Jaeger Query is the only consumer, but in our company we also use Jaeger for analytics.
Since ClickHouse 21.3, the Map type (https://clickhouse.com/docs/en/sql-reference/data-types/map/) is available. I think it could be a good alternative to Nested.

Have you already run any performance tests (time and storage) with Map?
Would this be an acceptable contribution (behind a flag, so it is not activated by default)?

Implement dependency store

Make maxSpanCount of WriteWorkerPool a parameter

Problem

The number of spans that can be recorded at a time is a constant, so it is not flexible at all.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Make that amount a parameter in config.

Add integration test for replicated database.

Requirement - what kind of business use case are you trying to solve?

Test replicated database in integration tests as well.

Problem - what in Jaeger-ClickHouse blocks you from solving the requirement?

No such config in workflows.

Find optimal values for batch size and flush interval

https://github.com/pavolloffay/jaeger-clickhouse/blob/main/config.yaml#L4

Document and add support for deleting data/TTL

We should document how old data can be removed (alter table jaeger_spans drop partition 20201) and add support for TTL https://clickhouse.tech/docs/en/sql-reference/statements/alter/ttl/ (the user could specify the number of days in the config).

E.g.

CREATE TABLE IF NOT EXISTS jaeger_index_local (
     timestamp DateTime CODEC(Delta, ZSTD(1)),
     traceID String CODEC(ZSTD(1)),
     service LowCardinality(String) CODEC(ZSTD(1)),
     operation LowCardinality(String) CODEC(ZSTD(1)),
     durationUs UInt64 CODEC(ZSTD(1)),
     tags Array(String) CODEC(ZSTD(1)),
     INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 64,
     INDEX idx_duration durationUs TYPE minmax GRANULARITY 1
) ENGINE MergeTree()
PARTITION BY toDate(timestamp)
ORDER BY (service, -toUnixTimestamp(timestamp))
TTL timestamp + INTERVAL 90 DAY
SETTINGS index_granularity=1024

cc) @chhetripradeep could you please chime in and document how you delete old data?

serialized, err = proto.Marshal(span) insert error

Describe the bug
serialized, err = proto.Marshal(span) insert error

Version (please complete the following information):

Jaeger version: latest
ClickHouse version: 21.8.3.44

Explanation of '{cluster}'

Our replication and sharding guide (https://github.com/pavolloffay/jaeger-clickhouse/blob/main/guide-sharding-and-replication.md#replication) uses the '{cluster}' substitution when creating a distributed table, e.g.

CREATE TABLE IF NOT EXISTS jaeger_spans ON CLUSTER '{cluster}' AS jaeger_spans_local ENGINE = Distributed('{cluster}', default, jaeger_spans_local, cityHash64(traceID));

I am not sure if I understand what it exactly does. Could somebody explain it? @EinKrebs @chhetripradeep

Let's say my CH deployment defines two clusters

<remote_servers>
    <example_cluster1>
       ...
    </example_cluster1>
    <example_cluster2>
       ...
    </example_cluster2>
</remote_servers>

So if the create command is executed, would it create tables on all clusters?

Add integration tests

Add integration test that spins up ClickHouse, Jaeger with storage plugin. The test would store and query data.

Jaeger already has storage integration tests - https://github.com/jaegertracing/jaeger/tree/master/plugin/storage/integration. It would be great if they could be reused.

Looking for maintainers

This project does not seem to have an active maintainer. There are a couple of open PRs from @nickbp and @bocharovf. Are any of you willing to take part in the project and maintain it?

cc) @EinKrebs

Log all SQL statements that are executed during init

Log all SQL statements that are executed during init. Use debug level.

It's useful to know how schema is initialized.

Make replicated deployment work without user explicitly creating tables

The guide at https://github.com/pavolloffay/jaeger-clickhouse/blob/main/guide-sharding-and-replication.md#replication requires users to run SQL scripts on one node (because we use ON CLUSTER).

We could add a new config option replication: true that would indicate that replication is enabled. The plugin would then use

ON CLUSTER
replicated merge trees in local tables
create global tables
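
A sketch of how such an option might switch the DDL for a local table. The function name and the replication parameter are hypothetical, the column list is elided as (...), and the ReplicatedMergeTree arguments follow the pattern already used in the replication guide.

```go
package main

import "fmt"

// localTableDDL sketches how a hypothetical `replication` option could
// change the statement that creates a local table.
// The column list is intentionally elided as (...).
func localTableDDL(table string, replication bool) string {
	if !replication {
		return fmt.Sprintf("CREATE TABLE IF NOT EXISTS %s (...) ENGINE MergeTree()", table)
	}
	return fmt.Sprintf(
		"CREATE TABLE IF NOT EXISTS %s ON CLUSTER '{cluster}' (...) "+
			"ENGINE ReplicatedMergeTree('/clickhouse/tables/{shard}/%s', '{replica}')",
		table, table)
}

func main() {
	fmt.Println(localTableDDL("jaeger_index_local", false))
	fmt.Println(localTableDDL("jaeger_index_local", true))
}
```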

cc) @EinKrebs is this smth that interests you?

Add TLS database connection support

Search by error tag does not work

Describe the bug
Filter "error=true" does not show traces with errors.

To Reproduce
Steps to reproduce the behavior:

Use clickhouse storage plugin
Run HotROD and produce some traces
In Jaeger UI select "Redis" service and find traces. Check that there are traces with errors in Redis service.
Add Tags filter "error=true"
See "No trace results. Try another query."

Expected behavior
Traces with Redis error found

Screenshots

Version (please complete the following information):

OS: windows
Jaeger version: 1.27
Deployment: docker-compose
Clickhouse plugin version: 0.8
Clickhouse version: yandex/clickhouse-server:21

What troubleshooting steps did you try?
The tag filter works as expected with other tags (e.g. "param.driverID").
Error tag filter works with Elasticsearch plugin.

I use that docker-compose to compare ELK and Clickhouse storages.

Integration GetOperations test fails

Describe the bug
Integration tests fail due to incorrect behaviour of GetOperations.

Expected behavior
Test won't fail

Version (please complete the following information):

OS: Linux
Jaeger-ClickHouse version: 0.7.0
Deployment: bare metal & docker ClickHouse server

How to support ARM

[Bug]: Why can't Jaeger connect to the ClickHouse database?

What happened?

I did it in the simplest way, following https://github.com/jaegertracing/jaeger-clickhouse/blob/main/guide-kubernetes.md

kubectl get cm jaeger-clickhouse -o yaml
apiVersion: v1
data:
  config.yaml: |
    address: clickhouse-jaeger:9000
    username: clickhouse_operator
    password: clickhouse_operator_password
    spans_table:
    spans_index_table:
    operations_table:
kind: ConfigMap

But Jaeger reports that it cannot connect to ClickHouse.

The error log is as follows:

[ERROR] jaeger-clickhouse: Failed to create a storage: @module=jaeger-clickhouse EXTRA_VALUE_AT_END="could not connect to database: \"dial tcp: missing address\"" timestamp=2023-02-09T09:02:29.083Z

kubectl logs jaeger-clickhouse-854dfc4c5d-fkcq9

Defaulted container "jaeger" out of: jaeger, install-plugin (init)
2023/02/09 09:02:29 maxprocs: Leaving GOMAXPROCS=36: CPU quota undefined
{"level":"info","ts":1675933349.0723028,"caller":"flags/service.go:119","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1675933349.0723774,"caller":"flags/service.go:125","msg":"Mounting expvar handler on admin server","route":"/debug/vars"}
{"level":"info","ts":1675933349.07262,"caller":"flags/admin.go:129","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1675933349.0726702,"caller":"flags/admin.go:143","msg":"Starting admin HTTP server","http-addr":":14269"}
{"level":"info","ts":1675933349.0726848,"caller":"flags/admin.go:121","msg":"Admin server started","http.host-port":"[::]:14269","health-status":"unavailable"}
2023-02-09T09:02:29.075Z [WARN]  plugin configured with a nil SecureConfig
2023-02-09T09:02:29.075Z [DEBUG] starting plugin: path=/plugin/jaeger-clickhouse args=["/plugin/jaeger-clickhouse", "--config", "/plugin-config/config.yaml"]
2023-02-09T09:02:29.076Z [DEBUG] plugin started: path=/plugin/jaeger-clickhouse pid=23
2023-02-09T09:02:29.076Z [DEBUG] waiting for RPC address: path=/plugin/jaeger-clickhouse
2023-02-09T09:02:29.083Z [ERROR] jaeger-clickhouse: Failed to create a storage: @module=jaeger-clickhouse EXTRA_VALUE_AT_END="could not connect to database: \"dial tcp: missing address\"" timestamp=2023-02-09T09:02:29.083Z
{"level":"fatal","ts":1675933349.0847218,"caller":"./main.go:109","msg":"Failed to init storage factory","error":"grpc-plugin builder failed to create a store: error attempting to connect to plugin rpc client: Unrecognized remote plugin message: \n\nThis usually means that the plugin is either invalid or simply\nneeds to be recompiled to support the latest protocol.","stacktrace":"main.main.func1\n\t./main.go:109\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:968\nmain.main\n\t./main.go:240\nruntime.main\n\truntime/proc.go:250"}
2023-02-09T09:02:29.084Z [ERROR] plugin process exited: path=/plugin/jaeger-clickhouse pid=23 error="exit status 1"

ClickHouse and the database are working normally:

# kubectl get pod
NAME                                             READY   STATUS             RESTARTS       AGE
busybox                                          1/1     Running            25 (30m ago)   25h
chi-jaeger-cluster1-0-0-0                        1/1     Running            0              18m
jaeger-clickhouse-854dfc4c5d-fkcq9               0/1     CrashLoopBackOff   6 (5m6s ago)   10m


# kubectl exec -it statefulset.apps/chi-jaeger-cluster1-0-0 -- clickhouse-client  -h clickhouse-jaeger --user clickhouse_operator --password clickhouse_operator_password
ClickHouse client version 22.1.3.7 (official build).
Connecting to clickhouse-jaeger:9000 as user clickhouse_operator.
Connected to ClickHouse server version 22.1.3 revision 54455.

:) use jaeger

USE jaeger

Query id: c0bb766d-3fbe-4e4f-9fe8-68ac5a0b2345

Ok.

0 rows in set. Elapsed: 0.001 sec. 

 :) show databases;

SHOW DATABASES

Query id: 19942759-efd9-4d13-adf7-26acd678425b

┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default            │
│ information_schema │
│ jaeger             │
│ system             │
└────────────────────┘

5 rows in set. Elapsed: 0.002 sec. 

SELECT
    query_id,
    client_hostname,
    initial_address
FROM system.processes

Query id: 13ca8ec2-1449-45ca-a44a-89e37fa070b7

┌─query_id─────────────────────────────┬─client_hostname────────────────────┬─initial_address────┐
│ 13ca8ec2-1449-45ca-a44a-89e37fa070b7 │ clickhouse-client-5574484945-b7zx9 │ ::ffff:10.100.3.42 │
└──────────────────────────────────────┴────────────────────────────────────┴────────────────────┘

1 rows in set. Elapsed: 0.002 sec.

Steps to reproduce

I do it in the simplest way,
https://github.com/jaegertracing/jaeger-clickhouse/blob/main/guide-kubernetes.md

Is this document missing necessary steps? I also tried creating the tables manually.

my jaeger-operator version is 1.41.0

Expected behavior

Is it an image version problem?

Add option to limit number of fetched spans per trace

Requirement - what kind of business use case are you trying to solve?

There are long-lived processes in our ecosystem that involve many services. Sometimes this leads to anomalously huge traces (100k+ spans). Fetching all spans from such traces slows down trace search.

Problem - what in Jaeger blocks you from solving the requirement?

There is no option (similar to es.max-num-spans for Elasticsearch storage) to limit the number of spans fetched for each trace.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Introduce a new setting in the config file to limit the number of spans fetched for each trace.
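
A minimal sketch of the idea, assuming a hypothetical maxNumSpans setting where 0 means "no limit". The query text and the model column are illustrative and not taken from the plugin.

```go
package main

import "fmt"

// spansQuery builds the SELECT that loads the spans of a single trace.
// maxNumSpans is a hypothetical config value; 0 disables the limit.
func spansQuery(table string, maxNumSpans int) string {
	query := fmt.Sprintf("SELECT model FROM %s WHERE traceID = ?", table)
	if maxNumSpans > 0 {
		query += fmt.Sprintf(" LIMIT %d", maxNumSpans)
	}
	return query
}

func main() {
	fmt.Println(spansQuery("jaeger_spans_local", 0))
	fmt.Println(spansQuery("jaeger_spans_local", 1000))
}
```

If several traces are fetched in one query, ClickHouse's LIMIT n BY clause could cap the span count per trace ID instead.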

Got plugin error "transport: error while dialing: dial unix /tmp/plugin"

Describe the bug
I get an error in the Jaeger UI with the ClickHouse gRPC plugin when searching for traces:
HTTP Error: plugin error: rpc error: code = Unavailable
desc = connection error:
desc = "transport: error while dialing: dial unix /tmp/plugin2381205323:
connect: connection refused

It seems to happen either

after several hours of Jaeger Query inactivity, or
after jaeger_index_local exceeds ~70 million (70kk) rows.

ClickHouse is up and running.
Restarting Jaeger Query fixes the problem temporarily (until the next long search).

To Reproduce
Steps to reproduce the behavior:

Ingest ~70 million (70kk) rows into jaeger_index_local
Search for traces

Expected behavior
Traces are found

Screenshots

Version (please complete the following information):

OS: Linux
Jaeger Query version: 1.25
Clickhouse plugin version: 0.8
Clickhouse version: 21.8.10.19
Deployment: Kubernetes

What troubleshooting steps did you try?

Additional context
jaeger-query-clickhouse-5ff64c9dbc-h7jr4.log

Add golangci

https://github.com/golangci/golangci-lint

[Feature]: change insert

Requirement

Can we change the insert SQL to the following form?

insert into mytable select col1, col2 from input('col1 String, col2 DateTime64(3), col3 Int32')

This form of SQL should give the best performance.
https://clickhouse.com/docs/en/integrations/java#with-input-table-function

Problem

Changing the SQL could stop spans from being dropped.

Proposal

No response

Open questions

No response

[Feature]: Support Native JSON columns in Clickhouse

Requirement

As a Clickhouse analytics user, I want the clickhouse-jaeger schema to allow using Clickhouse native JSON columns so that we can query data in clickhouse more efficiently (both in terms of performance and query simplicity)

Problem

Currently, Clickhouse-Jaeger stores JSON span data in a string column, which makes it quite verbose to query fields within the column using Clickhouse's JSON functions, especially past 2 levels of nesting.

This is very evident when you want to query the ingested data to generate your own analytics/insights. It would be nice if jaeger-clickhouse added support for Clickhouse native JSON columns.

Proposal

A solution may be to start providing support for the native JSON datatype (It's still "experimental", but the spec has been quite stable for a while)

Open questions

The major open question is how this would affect the split between protobuf and json encoded data (currently, string supports both) and whether it'll add more complexities to the project. Need to observe more to see the impact of this, but wanted to raise this with the community/maintainers to get an idea of their thoughts.

Running with hotrod results in Too many simultaneous queries. Maximum: 100

2021.07.14 17:06:49.783711 [ 219 ] {11925d3b-7684-4919-827b-319af811c400} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
2021.07.14 17:06:49.783769 [ 1010 ] {d4de6e5e-6305-4802-842e-13c660886ef2} <Error> TCPHandler: Code: 202, e.displayText() = DB::Exception: Too many simultaneous queries. Maximum: 100, Stack trace:

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8d31b5a in /usr/bin/clickhouse
1. DB::ProcessList::insert(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::IAST const*, std::__1::shared_ptr<DB::Context const>) @ 0xfcd6802 in /usr/bin/clickhouse
2. ? @ 0xfe21ab3 in /usr/bin/clickhouse
3. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum, bool) @ 0xfe208e3 in /usr/bin/clickhouse
4. DB::TCPHandler::runImpl() @ 0x1069f6c2 in /usr/bin/clickhouse
5. DB::TCPHandler::run() @ 0x106b25d9 in /usr/bin/clickhouse
6. Poco::Net::TCPServerConnection::start() @ 0x1338b30f in /usr/bin/clickhouse
7. Poco::Net::TCPServerDispatcher::run() @ 0x1338cd9a in /usr/bin/clickhouse
8. Poco::PooledThread::run() @ 0x134bfc19 in /usr/bin/clickhouse
9. Poco::ThreadImpl::runnableEntry(void*) @ 0x134bbeaa in /usr/bin/clickhouse
10. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
11. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so

The DB is started as docker run --rm -it -p9000:9000 --name some-clickhouse-server --ulimit nofile=262144:262144 yandex/clickhouse-server:21

Create "tutorial" with sharding and replication and HA setup

Disclaimer: I am just starting with ClickHouse.

As far as I can tell our scripts https://github.com/pavolloffay/jaeger-clickhouse/blob/main/sqlscripts/0002-jaeger-spans.sql create only local tables and not distributed ones so this tutorial will likely need different scripts.

Decide on sharding function for distributed table

The distributed table could be created with multiple sharding functions: rand(), cityHash64(traceID) - see https://clickhouse.tech/docs/en/sql-reference/functions/hash-functions/.

The hash functions take an argument, we should consider using traceID to keep data from a single trace in the same location.

CREATE TABLE IF NOT EXISTS jaeger_spans AS jaeger_spans_local ENGINE = Distributed('{cluster}', default, jaeger_spans_local, cityHash64(traceID));
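
The following sketch illustrates why a hash of traceID keeps a trace together. FNV-1a from Go's standard library stands in for cityHash64 here, and shardFor is a hypothetical helper. Only the determinism matters: the same trace ID always maps to the same shard, which rand() cannot guarantee.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a trace ID to one of numShards shards.
// FNV-1a is a stand-in for cityHash64; the point is that the
// result depends only on the trace ID.
func shardFor(traceID string, numShards uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte(traceID))
	return h.Sum64() % numShards
}

func main() {
	// All spans of a trace share its traceID, so they land on one shard.
	for _, id := range []string{"trace-a", "trace-a", "trace-b"} {
		fmt.Println(id, shardFor(id, 3))
	}
}
```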

Dockerizing proposal

I have a number of proposals that I can make to this project:

Two-stage build in docker. In this way, we will have a build in a reproducible environment.
Optimized linking as much as possible. The image is required for production use. At high loads, even minor optimizations save resources.
Build plugin along with Jaeger source code. In this way, we will influence the optimization of the building of Jaeger. We can use cache or saved docker levels to speed up building.
Use Debian releases instead of the Alpine distribution. One of the optimizations is linking with system libraries. Alpine has limited multithreading functionality due to the use of musl instead of glibc. But there is no problem supporting both distributions.
Image versioning that includes the Jaeger version, the plugin version, and the label that this container contains the plugin. The same approach is used by snyk. For example, the image will have the following tags:
- ghcr.io/jaegertracing/jaeger-collector:1.29.0-clickhouse-0.8.0-stretch
- ghcr.io/jaegertracing/jaeger-collector:1.29.0-clickhouse-0.8.0
- ghcr.io/jaegertracing/jaeger-collector:1.29.0-clickhouse
- ghcr.io/jaegertracing/jaeger-collector:clickhouse-0.8.0-stretch
- ghcr.io/jaegertracing/jaeger-collector:clickhouse-0.8.0
- ghcr.io/jaegertracing/jaeger-collector:clickhouse
Have a complete set of images of own production: all-in-one, jaeger-agent, jaeger-collector, jaeger-ingester, jaeger-query.
Run E2E tests using docker-compose (example).

The implementation of part of the above can be found in this project https://github.com/levonet/docker-jaeger.
I'm ready to move this infrastructure over and have my team support it for as long as we use Jaeger.

Add support for archive storage

Too many jaeger-query WriteSpan traces written.

Describe the bug
The plugin writes a very large number of traces (service='jaeger-query', operation='/jaeger.storage.v1.SpanWriterPlugin/WriteSpan'), even when there are no other spans to write. With default settings, at least 20 such traces are written per tick. After writing spans that are not from jaeger-query, several thousand spans per tick can keep being written for a very long time, even without any new spans being sent to Jaeger.

To Reproduce
Steps to reproduce the behavior:

Start docker image of clickhouse-server
Start Jaeger with jaeger-clickhouse on default settings
Generate a few spans (e.g. with tracegen or HotROD)
See a huge number of spans being written on every timer tick.
Inspecting these spans shows that almost all of them are jaeger-query/WriteSpan.
Query run long after tracegen had finished:

SELECT count()
FROM jaeger_index_local 
WHERE (service = 'jaeger-query') AND (operation = '/jaeger.storage.v1.SpanWriterPlugin/WriteSpan') AND (timestamp >= (now() - toIntervalMinute(1)))

┌─count()─┐
│   15329 │
└─────────┘

Expected behavior
No/very little of such spans.

Version (please complete the following information):

OS: Linux
Jaeger version: Jaeger v1.24, jaeger-clickhouse v0.7.0
Deployment: bare metal

What troubleshooting steps did you try?
I didn't find any information about this problem.

Decide on default encoding

Right now the default encoding is JSON. It's JSON bc it was historically set to JSON. I would like to understand why JSON is preferred over protobuf.

https://github.com/pavolloffay/jaeger-clickhouse/blob/main/config.yaml#L8

Add unit tests

The repository does not have any unit tests. It would be great to add tests to increase coverage. E.g. Jaeger uses 95% test coverage.

[Bug]: System Architecture reported an error when I used ClickHouse as the storage backend

What happened?

System Architecture reported an error when I used ClickHouse as the storage backend. The error looks like this:

HTTP Error: plugin error:rpc error: code = Unknown desc = not implemented

Steps to reproduce

Use ClickHouse as the storage. The command is: SPAN_STORAGE_TYPE=grpc-plugin jaeger-all-in-one --grpc-storage-plugin.binary=/path/to/jaeger-clickhouse-linux-amd64 --grpc-storage-plugin.configuration-file=/path/to/config.yml
Start Jaeger.

Click "System Architecture". It reports an error like this:

HTTP Error: plugin error:rpc error: code = Unknown desc = not implemented

Expected behavior

I expect System Architecture to display a DAG of the calling relationship of each system

Jaeger backend version

1.47.0

SDK

OpenTelemetry javaagent 1.28.0

Pipeline

javaagent->otelcol->jaeger

Storage backend

clickhouse

Operating system

linux

Deployment model

CLI

Expose metrics

For fine-tuning parameters like

# Batch size. Default 10_000.
batch_write_size:
# Batch flush interval. Default 5s.
batch_flush_interval:

It would be great to expose metrics like:

batch flush size
number of flushes due to timer expiration (batch_flush_interval)

Although, I am not sure if the batch size influences performance or not.
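
A small sketch of what such counters could look like, using only sync/atomic from the standard library (atomic.Int64 needs Go 1.19 or later). The type, field and method names are hypothetical. A real implementation would more likely register the values with the metrics library the plugin already uses.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type flushReason int

const (
	flushBySize  flushReason = iota // batch_write_size was reached
	flushByTimer                    // batch_flush_interval expired
)

// writerMetrics holds hypothetical counters for the span writer.
type writerMetrics struct {
	sizeFlushes  atomic.Int64
	timerFlushes atomic.Int64
	spansWritten atomic.Int64
}

// observeFlush records one flush and the number of spans it contained.
func (m *writerMetrics) observeFlush(reason flushReason, batchLen int) {
	switch reason {
	case flushBySize:
		m.sizeFlushes.Add(1)
	case flushByTimer:
		m.timerFlushes.Add(1)
	}
	m.spansWritten.Add(int64(batchLen))
}

// String summarizes the counters for logging.
func (m *writerMetrics) String() string {
	return fmt.Sprintf("size_flushes=%d timer_flushes=%d spans_written=%d",
		m.sizeFlushes.Load(), m.timerFlushes.Load(), m.spansWritten.Load())
}

func main() {
	var m writerMetrics
	m.observeFlush(flushBySize, 10000)
	m.observeFlush(flushByTimer, 42)
	fmt.Println(m.String())
}
```

Comparing the two flush counters would show whether batches usually fill up or are mostly flushed by the timer, which is the signal needed to tune both parameters.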

Capacity planning

@chhetripradeep from https://github.com/pavolloffay/jaeger-clickhouse/pull/34#issuecomment-886882402

We run with 3 replica and as we need to expand the cluster we add more shards. One thing to note is clickhouse doesn't have any inbuilt data balancing feature i.e. once a data is written to a node, it will stay there throughout the lifetime of that data unless the operator moves the data manually, so it's good to do capacity planning in the beginning of cluster provisioning.

Would you like to create a doc/guide for capacity planning?

Add support for sharding and replication to archive store

[Feature]: Use native ClickHouse interface instead of database/sql

Requirement

As an active jaeger-clickhouse user I'd like to suggest to use the native ClickHouse communication protocol instead of database/sql-compatible one. This change might significantly increase the overall performance and speed up the spans writer.

Problem

jaeger-clickhouse uses the clickhouse-go 1.5.4 client. It provides the standard database/sql interface for communication with ClickHouse.

There is a benchmark section in the readme of the repository. It claims that migration to v2 might significantly speed up write and read. This speed up is possible due to usage of the native TCP ClickHouse client-server protocol. Furthermore, new versions (>= 2.3.0) use ch-go for compression.

Proposal

Switch to the newer version of the clickhouse-go client and enjoy the better performance.

Maybe it's even possible to switch to ch-go, but I think the library may not support all the used high-level features of clickhouse-go at the moment.

Open questions

I have two questions in mind:

Is it possible to switch to the native ClickHouse TCP protocol without breaking compatibility?
Is it worth it? We need to create a benchmark that compares two protocols on specific to jaeger-clickhouse queries.

Do you, folks, see any pitfalls?

[Feature]: Dependencies job for clickhouse backend

Requirement

Ability to achieve service dependency graph for clickhouse backend

Problem

Hi team, I was exploring this repo and wonder how the service dependency graph is produced in Jaeger with a ClickHouse data source. Is there a Spark job similar to
https://github.com/jaegertracing/spark-dependencies that supports the ClickHouse backend?

Proposal

No response

Open questions

No response

Use atomic database in replicated setup

https://clickhouse.tech/docs/en/engines/database-engines/atomic/#replicatedmergetree-in-atomic-database

Atomic engine should allow us to remove arguments from ENGINE ReplicatedMergeTree('/clickhouse/tables/{shard}/jaeger_index', '{replica}')

Add Operation.SpanKind support

Requirement - what kind of business use case are you trying to solve?

I ran jaeger grpc-plugin integration tests with this plugin and it failed.

Problem - what in Jaeger blocks you from solving the requirement?

Integration test failed because this plugin doesn't support jaeger/spanstore.Operation.SpanKind.

Document how multitenant deployment should look like

[Bug]: Resolve High CVEs

What happened?

We currently use the jaeger-clickhouse image and our security team has flagged it as being impacted by two HIGH CVEs

To resolve these CVEs the following packages need to be updated to a minimum version of:

golang.org/x/net - 0.1.1-0.20221104162952-702349b0e862
golang.org/x/text - 0.3.8

We prefer to have the packages fixed upstream to ensure that everyone can benefit from the updates.

Steps to reproduce

Using a vulnerability scanner (e.g. aqua/trivy), scan the jaeger-clickhouse image:

trivy image jaeger-clickhouse:0.13.0

Expected behavior

No vulnerabilities listed.

Fix time zone issues

There are two issues when working with a ClickHouse database in a non-UTC time zone:

Plugin cannot connect to database due to following error:
could not connect to database: "could not load time location: unknown time zone {any non-UTC time zone}"
Searching for traces works incorrectly: search range "shifts" by difference between UTC and ClickHouse server time zone

Embed default SQL scripts into go binary

The SQL scripts https://github.com/pavolloffay/jaeger-clickhouse/tree/main/sqlscripts are needed at Jaeger startup. To simplify distribution these scripts can be embedded into the binary.

See https://pkg.go.dev/embed

Change batch writing policy

Currently, each batch we insert contains a single span. ClickHouse performs better when data is inserted in large batches.
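
A minimal sketch of size-based batching, under the assumption that spans are buffered in memory and handed to a single insert once the batch is full. The batcher type and its fields are hypothetical, and spans are represented as plain strings to keep the example short.

```go
package main

import "fmt"

// batcher collects spans and hands them to flush in groups of batchSize.
type batcher struct {
	batchSize int
	pending   []string
	flush     func(batch []string) // e.g. one multi-row INSERT
}

// add buffers a span and flushes once the batch is full.
func (b *batcher) add(span string) {
	b.pending = append(b.pending, span)
	if len(b.pending) >= b.batchSize {
		b.flushNow()
	}
}

// flushNow writes whatever is buffered; a timer could call it as well.
func (b *batcher) flushNow() {
	if len(b.pending) == 0 {
		return
	}
	b.flush(b.pending)
	b.pending = nil
}

func main() {
	b := &batcher{batchSize: 2, flush: func(batch []string) {
		fmt.Println("inserting", len(batch), "spans")
	}}
	for i := 0; i < 5; i++ {
		b.add(fmt.Sprintf("span-%d", i))
	}
	b.flushNow() // flush the remainder
}
```

This sketch is not safe for concurrent use; a real writer would guard pending with a mutex or feed the batcher from a channel.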