book's People

Contributors

710leo, canghai908, dependabot[bot], hantmac, ning1875, nxsre, ulricqin, yimeng, yubo, yutons

book's Issues

Feishu alert notifications

Upgrade Python

1. Install the build dependencies (to avoid installation errors)
yum install gcc-c++ gcc make cmake zlib-devel bzip2-devel openssl-devel ncurses-devel libffi-devel -y
2. Download the Python 3.7 source package
# Enter the tmp directory
cd /tmp
# Download Python 3.7.3
wget https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tar.xz
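3. Extract and configure

# Extract the archive
tar Jxvf Python-3.7.3.tar.xz
# Enter the Python-3.7.3 directory
cd Python-3.7.3
# Create the install directory
mkdir -p /usr/local/python3
# Configure (set the install prefix)
./configure --prefix=/usr/local/python3 --enable-optimizations
4. Build and install

make && make install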
5. Replace the system default Python
1) Back up the old system python and pip

mv /usr/bin/python /usr/bin/python.bak
# Back up pip only if it exists
[ -e /usr/bin/pip ] && mv /usr/bin/pip /usr/bin/pip.bak
2) Create symlinks for the new Python and pip

ln -s /usr/local/python3/bin/python3.7 /usr/bin/python
ln -s /usr/local/python3/bin/pip3 /usr/bin/pip
3) Check the Python version

python -V
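6. Fix yum
yum depends on Python 2, so it stops working once the default Python version is changed. Fix the following 3 places.
Fix 1:
vim /usr/bin/yum

Change the first line to: #! /usr/bin/python2.7

Fix 2:
vim /usr/libexec/urlgrabber-ext-down

Change the first line to: #! /usr/bin/python2.7

Fix 3 (presumably the same first-line change; the original lists only the files):
vim /usr/sbin/firewalld
vim /usr/bin/firewall-cmd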
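
The first-line edits above could also be scripted instead of done by hand in vim. The following is only a minimal sketch: `pin_shebang` is a hypothetical helper, it works on the file text only (reading and writing the files is left out), and it assumes each file begins with a `#!` line that points at a Python interpreter.

```python
def pin_shebang(text, interpreter="/usr/bin/python2.7"):
    """Return `text` with its first-line shebang pointing at `interpreter`.

    Arguments that follow the old interpreter path (for example "-Es")
    are preserved. Text without a shebang on line 1 is returned unchanged.
    """
    first, sep, rest = text.partition("\n")
    if not first.startswith("#!"):
        return text
    parts = first[2:].split()
    args = parts[1:]  # anything after the old interpreter path
    new_first = " ".join(["#!" + interpreter] + args)
    return new_first + sep + rest


# Example on an in-memory string rather than a real system file
print(pin_shebang("#!/usr/bin/python\nimport sys\n"))
```

Applying it to a real file would mean reading the file, calling the helper, and writing the result back, ideally after keeping a backup copy.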

Edit the N9E server.yml file
vim /opt/n9e/server/etc/server.yml
contactKeys:

  - label: "Feishu Robot Token" # newly added
    key: feishu_robot_token # newly added

notifyChannels:

  - feishu # newly added

Edit the N9E alert script
vim /opt/n9e/server/etc/script/notify.py
Replace its contents with the following

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
import json
import os
import smtplib
import time
import requests
from email.mime.text import MIMEText
from email.header import Header

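# Intended behavior of this demo:
# 1. After reading the alert from stdin, format it as indented JSON and write it to a temporary file
# 2. The file path and name are .alerts/${timestamp}_${ruleid}
# 3. Send the alert through an SMTP server; WeChat, DingTalk, Feishu, Slack, Jira, SMS, phone calls and so on are left to the community

# Guide to customizing this script
# 1. Adapt the sending logic to the structure of TEST_ALERT_JSON below; the generated alert text looks like this (the labels are emitted in Chinese)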
"""
[告警类型:prometheus]
[规则名称:a]
[是否已恢复:已触发]
[告警级别:1]
[触发时间:2021-07-02 16:05:14]
[可读表达式:go_goroutines>0]
[当前值:[vector={__name__="go_goroutines", instance="localhost:9090", job="prometheus"}]: [value=33.000000]]
[标签组:instance=localhost:9090 job=prometheus]
"""
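# 2. Each alert is stored as a JSON file under LOCAL_EVENT_FILE_DIR, named filename = '%d_%d_%d' % (rule_id, event_id, trigger_time)
# 3. Each notification channel needs a same-named send_xxx method on the Send class, which is called by reflection: for example, event.notify_channels = [qq dingding] requires send_qq and send_dingding methods on Send
# 4. Sending to an IM group (for example a DingTalk group) needs the group's webhook robot token; it can be kept in the user's contacts map and handled by each send_ method
# 5. Create a virtual user to hold the IM group's robot token in that user's contacts map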

mail_host = "smtp.qq.com"
mail_port = 994
mail_user = "ulricqin"
mail_pass = "password"
mail_from = "[email protected]"

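# Local directory where alert event JSON files are stored
LOCAL_EVENT_FILE_DIR = ".alerts"
NOTIFY_CHANNELS_SPLIT_STR = " "

# Contact key that holds the group robot token
FEISHU_ROBOT_TOKEN_NAME = "feishu_robot_token"
FEISHU_API = "replace with the robot webhook URL"

# Example of the alert JSON received on stdin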
TEST_ALERT_JSON = {
    "event": {
        "alert_duration": 10,
        "notify_channels": "feishu",
        "res_classpaths": "",
        "id": 4,
        "notify_group_objs": None,
        "rule_note": "",
        "history_points": [
            {
                "metric": "go_goroutines",
                "points": [
                    {
                        "t": 1625213114,
                        "v": 33.0
                    }
                ],
                "tags": {
                    "instance": "localhost:9090",
                    "job": "prometheus"
                }
            }
        ],
        "priority": 1,
        "last_sent": True,
        "tag_map": {
            "instance": "localhost:9090",
            "job": "prometheus"
        },
        "hash_id": "ecb258d2ca03454ee390a352913c461b",
        "status": 0,
        "tags": "instance=localhost:9090 job=prometheus",
        "trigger_time": 1625213114,
        "res_ident": "",
        "rule_name": "a",
        "is_prome_pull": 1,
        "notify_users": "1",
        "notify_groups": "",
        "runbook_url": "",
        "values": "[vector={__name__=\"go_goroutines\", instance=\"localhost:9090\", job=\"prometheus\"}]: [value=33.000000]",
        "readable_expression": "go_goroutines>0",
        "notify_user_objs": None,
        "is_recovery": 0,
        "rule_id": 1
    },
    "rule": {
        "alert_duration": 10,
        "notify_channels": "feishu",
        "enable_stime": "00:00",
        "id": 1,
        "note": "",
        "create_by": "root",
        "append_tags": "",
        "priority": 1,
        "update_by": "root",
        "type": 1,
        "status": 0,
        "recovery_notify": 0,
        "enable_days_of_week": "1 2 3 4 5 6 7",
        "callbacks": "localhost:10000",
        "notify_users": "1",
        "notify_groups": "",
        "runbook_url": "",
        "name": "a",
        "update_at": 1625211576,
        "create_at": 1625211576,
        "enable_etime": "23:59",
        "group_id": 1,
        "expression": {
            "evaluation_interval": 4,
            "promql": "go_goroutines>0"
        }
    },
    "users": [
        {
            "username": "root",
            "status": 0,
            "contacts": {
                "feishu_robot_token": "xxxxxxx"
            },
            "create_by": "system",
            "update_at": 1625211432,
            "create_at": 1624871926,
            "email": "",
            "phone": "",
            "role": "Admin",
            "update_by": "root",
            "portrait": "",
            "nickname": "\u8d85\u7ba1",
            "id": 1
        }
    ]
}


def main():
    payload = json.load(sys.stdin)
    trigger_time = payload['event']['trigger_time']
    event_id = payload['event']['id']
    rule_id = payload['rule']['id']
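    raw_channels = payload['event'].get('notify_channels') or ""
    # Drop empty entries so that an empty string yields an empty list
    notify_channels = [ch for ch in raw_channels.strip().split(NOTIFY_CHANNELS_SPLIT_STR) if ch]
    if len(notify_channels) == 0:
        msg = "notify_channels_empty"
        print(msg)
        return
    # Persist the event to a local JSON file
    persist(payload, rule_id, event_id, trigger_time)
    # Build the alert text
    alert_content = content_gen(payload)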

    for ch in notify_channels:
        send_func_name = "send_{}".format(ch.strip())
        has_func = hasattr(Send, send_func_name)

        if not has_func:
            msg = "[send_func_name_err][func_not_found_in_Send_class:{}]".format(send_func_name)
            print(msg)
            continue
        send_func = getattr(Send, send_func_name)
        send_func(alert_content, payload)


def content_gen(payload):
    # Build the formatted alert text
    text = ""
    event_obj = payload.get("event")

    rule_type = event_obj.get("is_prome_pull")
    type_str_m = {1: "prometheus", 0: "n9e"}
    rule_type = type_str_m.get(rule_type)

    text += "[告警类型:{}]\n".format(rule_type)

    rule_name = event_obj.get("rule_name")
    text += "[规则名称:{}]\n".format(rule_name)

    is_recovery = event_obj.get("is_recovery")
    is_recovery_str_m = {1: "已恢复", 0: "已触发"}
    is_recovery = is_recovery_str_m.get(is_recovery)
    text += "[是否已恢复:{}]\n".format(is_recovery)

    priority = event_obj.get("priority")
    text += "[告警级别:{}]\n".format(priority)

    trigger_time = event_obj.get("trigger_time")
    text += "[触发时间:{}]\n".format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(trigger_time))))

    readable_expression = event_obj.get("readable_expression")
    text += "[可读表达式:{}]\n".format(readable_expression)

    values = event_obj.get("values")
    text += "[当前值:{}]\n".format(values)

    tags = event_obj.get("tags")
    text += "[标签组:{}]\n".format(tags)

    print(text)
    return text


def persist(payload, rule_id, event_id, trigger_time):
    if not os.path.exists(LOCAL_EVENT_FILE_DIR):
        os.makedirs(LOCAL_EVENT_FILE_DIR)

    filename = '%d_%d_%d' % (rule_id, event_id, trigger_time)
    filepath = os.path.join(LOCAL_EVENT_FILE_DIR, filename)
    with open(filepath, 'w') as f:
        f.write(json.dumps(payload, indent=4))


class Send(object):
    @classmethod
    def send_feishu(cls, alert_content, payload):
        # Sending to a Feishu group needs the group's webhook robot token, which can be kept in the user's contacts map

        users = payload.get("users")

        for u in users:
            contacts = u.get("contacts")

            feishu_robot_token = contacts.get(FEISHU_ROBOT_TOKEN_NAME, "")

            if feishu_robot_token == "":
                print("feishu_robot_token_not_found")
                continue

            feishu_api_url = FEISHU_API  # robot webhook URL configured at the top of the script
            atMobiles = [u.get("phone")]
            headers = {'Content-Type': 'application/json;charset=utf-8'}
            pay_load = {
                "msg_type": "text",
                "content": {
                    "text": alert_content
                },
                "at": {
                    "atMobiles": atMobiles,
                    "isAtAll": False
                }
            }
            res = requests.post(feishu_api_url, json.dumps(pay_load), headers=headers)
            print(res.status_code)
            print(res.text)

            print("send_feishu")


def mail_test():
    print("mail_test_todo")

    recipients = ["[email protected]", "[email protected]"]

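    # Build the mail body from the sample alert so that mail_body is defined
    mail_body = content_gen(TEST_ALERT_JSON)
    message = MIMEText(mail_body, 'html', 'utf-8')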
    message['From'] = mail_from
    message['To'] = ", ".join(recipients)
    message["Subject"] = "n9e alert"

    smtp = smtplib.SMTP_SSL(mail_host, mail_port)
    smtp.login(mail_user, mail_pass)
    smtp.sendmail(mail_from, recipients, message.as_string())
    smtp.close()

    print("mail_test_done")


if __name__ == "__main__":
    if len(sys.argv) == 1:
        main()
    elif sys.argv[1] == "mail":
        mail_test()
    else:
        print("I am confused")
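
The `at` / `atMobiles` block in `send_feishu` resembles the DingTalk robot format. As far as I know, a Feishu custom-bot text message only needs `msg_type` and `content.text`, so the request body could be reduced as sketched below. `build_feishu_text` is a hypothetical helper, the sample text is made up, and the exact fields the webhook accepts should be checked against the Feishu documentation.

```python
import json


def build_feishu_text(alert_content):
    """Build the JSON body for a Feishu custom-bot text message (assumed shape)."""
    return {"msg_type": "text", "content": {"text": alert_content}}


# Serialize a sample body the same way send_feishu does before posting it
body = json.dumps(build_feishu_text("[rule_name:a]\n[priority:1]\n"))
print(body)
```

The resulting string can be passed to `requests.post` in place of `json.dumps(pay_load)`.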

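Calling the create-strategy API

A request to the create-strategy endpoint 'api/portal/stra' fails with an unknown start time error: {u'err': u'unknown enable_stime: '}. The parameters follow the example in the official documentation.

Installing from the RPM package

In the command that starts all components, the fourth component name has an extra "n" in front of it.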

Running hugo server fails

hugo version: Hugo Static Site Generator v0.74.3-DA0437B4 linux/amd64 BuildDate: 2020-07-23T16:22:34Z

Error message:

Error: Error building site: TOCSS: failed to transform "scss/main.scss" (text/x-scss): resource "scss/scss/main.scss_9fadf33d895a46083cdd64396b57ef68" not found in file cache

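Source debugging issue

When debugging in GoLand, runner.Cwd returns /private/var/folders/16/4t4szn4n6rbblq528_m2lsqh0000gn/T/GoLand, which makes startup fail; I had to change the path by hand. Could the maintainers check whether this counts as a bug?

Index-related

The endpoint given in the "clean up monitoring index" documentation returns 404. Looking through the source code, I found no corresponding handler either.
Only the following routes exist, and none of them uses the DELETE method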

POST /api/index/metrics
POST /api/index/tagkv
POST /api/index/counter/clude
POST /api/index/counter/fullmatch

Installing from the RPM package

An n9e user has been created with the password n9epwd123; edit /usr/local/n9e/etc/mysql.yml.yml

should be

An n9e user has been created with the password n9epwd123; edit /usr/local/n9e/etc/mysql.yml

Installation and deployment

There is no login page, although ports 8000 and 9000 on the server are both reachable.

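Changing the monitoring database

Environment: Nightingale V5
Question: after finishing the one-click installation, I want the monitoring data to be written to MySQL. What needs to be changed?

Log monitoring

Could log monitoring support one more time format? Caché database log entries look like this: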
11/20/20-08:50:14:496 (12748) 0 Stopping System Jobs
11/20/20-08:50:14:602 (2196) 0 EXPDMN exited due to system shutdown
11/20/20-08:50:14:604 (5072) 0 JRNDMN exited due to system shutdown
11/20/20-08:50:14:605 (9296) 0 GARCOL exited due to system shutdown
11/20/20-08:50:15:207 (2620) 0 No blocks pending in WIJ file
11/20/20-08:50:15:230 (2620) 0 WRTDMN exited due to system shutdown
11/20/20-08:50:15:312 (13120) 0 CONTROL exited due to system shutdown
11/20/20-08:50:16:411 (12748) 0 Shutdown complete
mm/dd/yy-HH:MM:SS
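
For reference, the timestamp layout in these log lines can be described with standard `strptime` directives. This is only a sketch: it assumes the last colon-separated field is milliseconds, and `parse_cache_log_time` and `CACHE_LOG_TIME_FORMAT` are hypothetical names, not part of Nightingale.

```python
from datetime import datetime

# mm/dd/yy-HH:MM:SS followed by ":" and a fractional part (assumed milliseconds)
CACHE_LOG_TIME_FORMAT = "%m/%d/%y-%H:%M:%S:%f"


def parse_cache_log_time(line):
    """Parse the leading timestamp of a Caché log line into a datetime."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, CACHE_LOG_TIME_FORMAT)


ts = parse_cache_log_time("11/20/20-08:50:14:496 (12748) 0 Stopping System Jobs")
print(ts.isoformat())
```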

Deploying the collector on its own

Startup fails: /home/n9e/n9e-collector (code=exited, status=203/EXEC)
[root@Nightingale etc]# systemctl status n9e-collector.service
● n9e-collector.service - Nightingale collector
Loaded: loaded (/usr/lib/systemd/system/n9e-collector.service; enabled; vendor preset: disabled)
Active: active (running) since 一 2020-07-20 15:03:14 CST; 2ms ago
Main PID: 73473 ((ollector))
Tasks: 0
Memory: 0B
CGroup: /system.slice/n9e-collector.service
└─73473 (ollector)

7月 20 15:03:14 open_falcon systemd[1]: Started Nightingale collector.
7月 20 15:03:14 open_falcon systemd[1]: n9e-collector.service: main process exited, code=exited, status=203/EXEC
7月 20 15:03:14 open_falcon systemd[1]: Unit n9e-collector.service entered failed state.
7月 20 15:03:14 open_falcon systemd[1]: n9e-collector.service failed.

[root@Nightingale etc]# systemctl status n9e-collector.service
● n9e-collector.service - Nightingale collector
Loaded: loaded (/usr/lib/systemd/system/n9e-collector.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since 一 2020-07-20 15:03:16 CST; 28ms ago
Process: 73481 ExecStart=/home/n9e/n9e-collector (code=exited, status=203/EXEC)
Main PID: 73481 (code=exited, status=203/EXEC)

7月 20 15:03:16 Nightingale systemd[1]: Unit n9e-collector.service entered failed state.
7月 20 15:03:16 Nightingale systemd[1]: n9e-collector.service failed.
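
systemd reports status=203/EXEC when it could not execute the ExecStart binary at all, which usually means the file is missing, is not a regular file, or lacks execute permission. A quick check along those lines is sketched below. `diagnose_exec` is a hypothetical helper; it only inspects permission bits, so it will not catch other causes such as a wrong architecture or a noexec mount.

```python
import os


def diagnose_exec(path):
    """Return a short reason why `path` may fail to execute, or "ok"."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a regular file"
    if not os.stat(path).st_mode & 0o111:
        return "no execute bit set"
    return "ok"


# Path taken from the ExecStart line in the status output above
print(diagnose_exec("/home/n9e/n9e-collector"))
```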

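Installation and deployment

I followed the 4.0 installation steps and everything deployed successfully, but Nginx keeps returning 403 when I open the web UI. Why is that? I have already changed all the files to be owned by root, with permissions 755.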
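
One common cause of a 403 in this situation (an assumption, not something confirmed in this issue) is that the nginx worker user cannot traverse one of the parent directories of the static files, even though the files themselves are 755. The sketch below lists ancestor directories that lack the world-execute bit; `untraversable_parents` is a hypothetical helper, and the path passed to it must exist.

```python
import os
import stat


def untraversable_parents(path):
    """Return ancestor directories of `path` that lack the world-execute bit."""
    blocked = []
    current = os.path.dirname(os.path.abspath(path))
    while True:
        mode = os.stat(current).st_mode
        if not mode & stat.S_IXOTH:
            blocked.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return blocked
```

Calling it with the path of the front-end index file would show whether any directory on the way blocks users other than the owner and group.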

Monitoring data

The transfer log shows this error
2020-08-21 14:45:36.203742 ERROR routes/query_router.go:37 index addr is nil {1597980000 1597980120 [{[192.168.22.189 192.168.22.189] proc.num [0xc0005474d0 0xc000547500]}]}

The command used was
curl -d '{"start":1597980000,"end":1597980120,"series":[{"endpoints":["192.168.22.189","192.168.22.189"],"metric":"proc.num","tagkv":[{"tagk":"target","tagv":["/data/joygames/jdk/bin/java"]},{"tagk":"service","tagv":["mtj-game-server"]}]}]} ' http://192.168.22.189/api/transfer/data -u root:root
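
The request body from the curl command can also be built programmatically, which avoids shell-quoting slips. This is only a sketch: `build_query` is a hypothetical helper, the field names are copied from the command above, and it does not address the "index addr is nil" error itself.

```python
import json


def build_query(start, end, endpoints, metric, tags):
    """Build the JSON body used in the curl command above."""
    tagkv = [{"tagk": k, "tagv": list(v)} for k, v in tags.items()]
    series = [{"endpoints": list(endpoints), "metric": metric, "tagkv": tagkv}]
    return {"start": start, "end": end, "series": series}


query = build_query(
    1597980000,
    1597980120,
    ["192.168.22.189", "192.168.22.189"],
    "proc.num",
    {"target": ["/data/joygames/jdk/bin/java"], "service": ["mtj-game-server"]},
)
print(json.dumps(query))
```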

A server with two network cards cannot join as a node.

transfer log ("未知的主机" means "unknown host"):
ifconfig: --help' gives usage information.:5841 fail: ens34: 未知的主机 ifconfig: --help' gives usage information.:5841 get connection fail: conn , err address ens34: 未知的主机
ifconfig: --help' gives usage information.:5841: too many colons in address. proc: Name:ens34: 未知的主机 ifconfig: --help' gives usage information.:5841,Cnt:0,active:0,all:0,free:0
