zalando-zmon / service-level-reporting
Calculate SLI/SLO metrics from ZMON's timeseries data
License: Other
At the moment the endpoint GET /service-level-objectives/{product}/reports/{report_type}
only generates a plain average value for each SLI for the whole day. This average should be weighted, i.e. minutes with more requests should get more weight.
Example of avg vs weighted avg:
select avg(sli_value) from zsm_data.service_level_indicator where sli_name = 'latency.p95' and date(sli_timestamp) = '2016-09-27'
-> 102.34ms
select sum(sli1.sli_value * sli2.sli_value)/sum(sli2.sli_value) from zsm_data.service_level_indicator sli1 join zsm_data.service_level_indicator sli2 on sli1.sli_timestamp = sli2.sli_timestamp and sli2.sli_name = 'requests' where sli1.sli_name = 'latency.p95' and date(sli1.sli_timestamp) = '2016-09-27';
-> 113.86ms
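The difference between the two queries can be sketched in Python with sample data (the per-minute (sli_value, request_count) pairs below are illustrative; the request counts act as the weights, as in the second SQL query):

```python
# Plain vs request-weighted average over sample per-minute data points.
# A busy minute (1000 requests) dominates the weighted average.
samples = [(100.0, 10), (120.0, 1000), (90.0, 5)]

plain_avg = sum(v for v, _ in samples) / len(samples)
weighted_avg = sum(v * w for v, w in samples) / sum(w for _, w in samples)
```

Here the plain average is ~103.33 while the weighted average is ~119.66, because the busiest minute carries almost all the weight.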
There is an extra / after slr in the URL https://slo-host/slr//my-services/a-service/20170831-20170906/index.html, but when visiting this URL we end up on the root page of the reports instead of the requested report.
The configuration can only be done via SQL right now; provide a UI or at least an HTTP API.
Currently, some requests to KairosDB fail because the size of the requested data slice is too large. That's because the app tries to fill in all the missing data since it was last updated, which could be anywhere from a couple of minutes to several months. This should be restricted to a maximum of one day.
@hjacobs would you agree?
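A minimal sketch of capping the backfill window (the names `backfill_start`, `last_updated`, and `MAX_BACKFILL` are illustrative, not from the codebase):

```python
from datetime import datetime, timedelta

MAX_BACKFILL = timedelta(days=1)  # never request more than one day at once

def backfill_start(last_updated: datetime, now: datetime) -> datetime:
    # Clamp the start of the requested slice so a single KairosDB query
    # covers at most MAX_BACKFILL, even if the app was down for months.
    return max(last_updated, now - MAX_BACKFILL)
```

Older gaps would then be filled in one-day chunks across several runs instead of one oversized query.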
Report generation seems to fail when no breaches are found.
Getting this error:
File "/generate-slr.py", line 153, in generate_weekly_report
slo['breaches'] = max(breaches_by_sli.values())
ValueError: max() arg is an empty sequence
The error disappears when the threshold in the SLO definition is adjusted to a value that ensures some breaches occur.
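A likely one-line fix, sketched under the assumption that `breaches_by_sli` can be an empty dict when no breaches were found:

```python
# max() raises "ValueError: max() arg is an empty sequence"; passing a
# default lets report generation proceed when no breaches were found.
breaches_by_sli = {}  # empty when no SLO breaches occurred this period
breaches = max(breaches_by_sli.values(), default=0)
```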
We get the following error when trying to generate a report:
$ ./generate-slr.py $API_ENDPOINT pss
Can not determine "period_from" and "period_to" for the report. Terminating!
What is the cause?
The current Swagger spec contains only the example securityDefinitions:

securityDefinitions:
  oauth2:
    type: oauth2
    flow: implicit
    authorizationUrl: https://example.com/oauth2/dialog
    scopes:
      uid: Unique identifier of the user accessing the service.

This suggests the wrong flow is specified.
Ability to add, update or delete SLO from the UI
Use Authorization headers in verifying token info
Users reported being unable to find new products. Since the current number of products is more than 100 and the pagination size is 100, newer products are not displayed. And since the search is done on the frontend, products outside the current page are not found.
As a quick fix, the pagination limit will be increased to 150 and search will be moved to the API. Proper pagination handling should be implemented in the UI afterwards.
Unit tests with decent coverage.
Data collection (as defined in zsm_data.data_source) should be automatic. This could be done with a simple background job/thread (locking?).
The CLI should return error 400 or 404. Right now it ends up with a NULL product_group_id.
Switch legacy service/app to read-only API.
Showing the average number of requests per second per day is not very meaningful. We should compute the total number of requests per day and show it instead.
It seems that it's currently not possible to delete SLO resources. We should extend the API to be able to delete SLOs.
Currently anybody with a valid token can modify the product/SLO/SLI data for any of the teams. As suggested by @hjacobs we should add authorization at possibly the product level.
An OAuth2 token sent in an Authorization: Bearer <token> HTTP header is ignored if the HTTP request also contains an slr-session cookie. The application continues to use the token stored in the session, even if this token is expired.

Steps to reproduce:
1. Send a request with an Authorization: Bearer header containing a valid token. (This succeeds and sets an slr-session cookie.)
2. Send a request with the slr-session cookie and an Authorization: Bearer header containing a new (valid) token. This fails with a 401 Unauthorized response.

API requests should have no session handling at all and always rely on the Authorization header.
Some resources accept POST requests directly on the resource path while others require the .../update URL. The HTTP verb should be enough for the operation decision.
weight = 1
When two different metrics are plotted in one graph, using the left and right y axis, both axes use the same range (same minimum and maximum).
This compresses one of the graphs and makes it useless in many cases.
The two axes should be scaled independently of each other.
SLRs should ideally be generated and served (?) by the web application itself.
To be discussed:
Currently the handler which retrieves the data for the report has the time range hardcoded to the past week. If the app has to support reports for larger intervals like a month or a year, this handler should accept a start and an end date as parameters.
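A sketch of resolving the period from optional parameters, falling back to the current past-week behaviour (the function name and signature are hypothetical):

```python
from datetime import date, timedelta

def report_period(start=None, end=None):
    # If explicit ISO dates are passed (e.g. from new handler or CLI
    # parameters), use them; otherwise fall back to the past week,
    # matching the current hardcoded behaviour.
    if start and end:
        return date.fromisoformat(start), date.fromisoformat(end)
    today = date.today()
    return today - timedelta(days=7), today
```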
Show the day of the week in the report, to immediately see what day it is (especially relevant for weekends).
Creating a product group that violates the UNIQUE constraint returns a basic 500 error:
HTTP/1.1 500 INTERNAL SERVER ERROR
Connection: keep-alive
Content-Length: 252
Content-Type: application/problem+json
Date: Wed, 15 Feb 2017 16:51:16 GMT
{
"detail": "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
"status": 500,
"title": "Internal Server Error",
"type": "about:blank"
}
We should improve these errors so that clients can figure out how to work around them.
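One way to improve this, sketched as a Connexion-style handler returning problem+json (the handler name is hypothetical, and the in-memory set stands in for the database's UNIQUE constraint; real code would catch the driver's IntegrityError instead):

```python
# Sketch: turn a UNIQUE-constraint violation into a 409 Conflict with a
# problem+json body instead of a generic 500.
_existing_names = set()

def create_product_group(body):
    name = body["name"]
    if name in _existing_names:
        problem = {
            "type": "about:blank",
            "title": "Conflict",
            "status": 409,
            "detail": "A product group named %r already exists." % name,
        }
        return problem, 409
    _existing_names.add(name)
    return body, 201
```

With a 409 and a descriptive detail field, clients can distinguish a duplicate name from a genuine server failure.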
The gnuplot csplines smoothing causes spikes that make it more difficult to interpret the data.
Add an end field in the resource to allow updating specific ranges. This could be useful to manually fix certain failing ranges with a large data point count.
Optimize report generation
Your Zappr file does not yield the correct config.
Consider doing this:

X-Zalando-Team: "zmon"
X-Zalando-Type: code
approvals:
  minimum: 2
commit:
  message:
    patterns:        # commit message has to match any one of
      - "^ *#[0-9]+" # starts with hash and digits
Can we upgrade to the latest Connexion release (https://github.com/zalando/connexion/releases/tag/1.1.14) or are there any impediments?
Currently the same template is used for directory indexes. This gives us the useful reversed sorting index of reports but doesn't really help with the product page index.
Provide UI:
TBD
Using the client, if a user tries to create a new data source while the sli_name already exists, the API returns a 500.
A better error message and status code would make more sense.
Report: requests aggregation is wrong
In the run_sli_update greenlet, if the SLI update for one product fails, the other products won't be updated.
I have another use case:
I have rare events, so when I run my check every 5 minutes, often there are no events at all, or just 1 or 2. I cannot calculate a 95th percentile every 5 minutes in this case. Instead I can easily record the maximum value for the last 5 minutes, or 0.
It would be nice to have something like "aggregation": "p95" in the reporting service, so that when the report is generated, it aggregates all the data for the week and can calculate correct percentiles.
Could you please consider implementing this type of aggregation?
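Computing the percentile once over the whole period's raw values (rather than averaging per-interval percentiles, which is statistically wrong) can be sketched as:

```python
import math

def percentile(values, p):
    # Nearest-rank percentile: the smallest value such that at least
    # p% of the data is at or below it. Real code might use
    # numpy.percentile instead; this is a dependency-free sketch.
    ordered = sorted(values)
    k = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(k - 1, 0)]

# Raw event values collected over the whole week (sample data):
week_values = [1, 2, 2, 3, 100] * 20
weekly_p95 = percentile(week_values, 95)
```

The key point is that the input is all raw values for the week, not 5-minute pre-aggregates, so sparse intervals contribute correctly.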
When editing an indicator, pressing the delete button yields the error message "Can't delete indicator".
Usually the API responds with informative error messages that could be surfaced in the UI (i.e. avoid generic error messages).
See #79
Ability to display count of metrics on the UI for the information of management users. This can also serve as an indicator of the capacity for the tool.
Check result keys should be flattened when compared with SLI source keys.
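Flattening nested check-result keys into dotted paths can be sketched as follows (the dot separator is an assumption, chosen to match SLI names like latency.p95 seen elsewhere in this document):

```python
def flatten(result, prefix=""):
    # Turn nested check results like {"latency": {"p95": 1.2}} into
    # {"latency.p95": 1.2} so keys can be compared directly with
    # SLI source keys.
    flat = {}
    for key, value in result.items():
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat
```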
The time resolution is currently hardcoded to minutes while inserting new SLI values.
There is, at least, one code path that can result in a division by zero exception under certain circumstances.
Traceback (most recent call last):
File "/generate-slr.py", line 159, in <module>
cli()
File "/usr/local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.5/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/generate-slr.py", line 155, in cli
generate_weekly_report(base_url, product, output_dir)
File "/generate-slr.py", line 87, in generate_weekly_report
val = sum(values_by_sli[target['sli_name']]) / len(values_by_sli[target['sli_name']])
ZeroDivisionError: division by zero
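A defensive sketch of the failing line from the traceback (the fallback value of 0.0 for an SLI with no data is an assumption):

```python
def safe_average(values):
    # Avoid the ZeroDivisionError above when no values were collected
    # for an SLI; falling back to 0.0 is an assumed default.
    return sum(values) / len(values) if values else 0.0

values_by_sli = {"latency.p95": []}  # sample: no data collected
val = safe_average(values_by_sli["latency.p95"])
```

Whether an empty SLI should yield 0.0 or be skipped entirely in the report is a design decision worth making explicit.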
Right now the max-y value of the chart is calculated according to the SLO threshold, not the maximum value of the data. As a result, the whole chart is not visible: it is cut off at the top, and points that are much higher than the SLO threshold cannot be seen.
Example of problem: https://pages.github.bus.zalan.do/continuous-delivery/automata-service-level-reports/messaging-bus/nakadi/20170519-20170522/index.html
Is it expected behavior or a bug?
If it's a bug - I can contribute with a fix.