
sassoftware / viya4-monitoring-kubernetes


Provides simple scripts and customization options to deploy monitoring, alerts, and log aggregation for Viya 4 running on Kubernetes

License: Apache License 2.0

Languages: Shell 92.23%, Smarty 0.42%, Dockerfile 0.64%, Python 6.71%
Topics: sas-quest

viya4-monitoring-kubernetes's Introduction

SAS® Viya® Monitoring for Kubernetes

SAS® Viya® Monitoring for Kubernetes provides scripts and customization options to deploy metric monitoring, alerts, and log-message aggregation for SAS® Viya®.

Note: The documentation included in this GitHub project uses "SAS Viya" to refer to the SAS Viya platform and compatible SAS Viya applications, including these products:

  • SAS® Viya® (previously called SAS Visual Machine Learning)
  • SAS® Viya® Advanced (previously called SAS Visual Data Science)
  • SAS® Viya® Programming (previously called SAS Visual Data Science Programming)
  • SAS® Viya® Enterprise (previously called SAS Visual Data Science Decisioning)

Log monitoring and metric monitoring can be deployed independently or together. There are no hard dependencies between the two.

For more information about SAS® Viya® Monitoring for Kubernetes, see the SAS Viya Monitoring for Kubernetes Help Center.

viya4-monitoring-kubernetes's People

Contributors

bryanellington, ceelias, cumcke, dancristiancecoi, davidbass13, dwstern, gsmith-sas, jefurbee, susanburton1, technomerlyn, tynsh, vishaans, vladislaff


viya4-monitoring-kubernetes's Issues

Kibana adds an ingress rule for root url (/)

Hi,

I (successfully) deployed the logging stack using an ingress controller and path-based routing, following the instructions.

I noticed that the ingress for Kibana points to the root URL (/) when it should point to /kibana instead. Using the root URL might cause issues with Viya, because SAS Drive also adds an ingress for the root URL.

After deployment Kibana is reachable using both paths:

http://<ingress.controller.dns.name>/
http://<ingress.controller.dns.name>/kibana

I would have expected Kibana to be reachable only via /kibana. My site-config-monitoring/logging/user-values-elasticsearch-open.yaml contains the correct path setting:

kibana:
...
  ingress:
...
    hosts:
    - <ingress.controller.dns.name>
    path: /kibana

I used these commands to check (abbreviated). Note the path initially pointing to "/":

[me@jumphost mirror]$ kubectl describe ing v4m-es-kibana -n logging
Name:             v4m-es-kibana
Namespace:        logging
Address:          <ingress.controller.dns.name>
Default backend:  default-http-backend:80 (<none>)
Rules:
  Host                                                                           Path  Backends
  ----                                                                           ----  --------
  <ingress.controller.dns.name>                               /   v4m-es-kibana-svc:443 (192.168.68.69:5601)

I can fix this by patching the path:

[me@jumphost mirror]$ kubectl get ing v4m-es-kibana -o json -n logging \
>  | jq '(.spec.rules[].http.paths[].path |= "/kibana")' \
>  | kubectl apply -f -

Which gives me the right path: "/kibana"

[me@jumphost  mirror]$ kubectl describe ing v4m-es-kibana -n logging
Name:             v4m-es-kibana
Namespace:        logging
Address:          <ingress.controller.dns.name>
Default backend:  default-http-backend:80 (<none>)
Rules:
  Host                                                                           Path  Backends
  ----                                                                           ----  --------
  <ingress.controller.dns.name>                                /kibana   v4m-es-kibana-svc:443 (192.168.68.69:5601)

Grafana dashboard - SAS CAS overview

The SAS CAS Overview dashboard reports memory usage for the CAS server that is very different from what is reported by the OS or kubectl top.

The Grafana dashboard reports about 22 GB of memory, while kubectl reports about 1 GB. It may be that the stats are exported from the 'sas-cas-control' pod, which is not on the same node as the 'sas-cas-server-default-controller' pod.

failed calling webhook validate.nginx.ingress.kubernetes.io

I got this issue when I tried to install the monitoring components on AKS 1.22 (the AKS version may be relevant):

W0328 10:52:45.748943 6465 reflector.go:441] k8s.io/client-go@<version>/tools/cache/reflector.go:167: watch of *unstructured.Unstructured ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
Error: release v4m-prometheus-operator failed, and has been uninstalled due to atomic being set: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=10s": dial tcp 10.0.144.68:443: connect: connection refused
ERROR Exiting script [deploy_monitoring_cluster.sh] due to an error executing the command [helm $helmDebug upgrade --install $promRelease --namespace $MON_NS -f monitoring/values-prom-operator.yaml -f $istioValuesFile -f $tlsValuesFile -f $nodePortValuesFile -f $wnpValuesFile -f $PROM_OPER_USER_YAML --atomic --timeout 20m --set nameOverride=$promName --set fullnameOverride=$promName --set prometheus-node-exporter.fullnameOverride=$promName-node-exporter --set kube-state-metrics.fullnameOverride=$promName-kube-state-metrics --set grafana.fullnameOverride=$promName-grafana --set grafana.adminPassword="$grafanaPwd" --version $KUBE_PROM_STACK_CHART_VERSION prometheus-community/kube-prometheus-stack].
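For anyone hitting the same message: the failure comes from the NGINX Ingress admission webhook being unreachable, not from the monitoring charts themselves. A hedged way to confirm the webhook service is healthy before re-running the deploy script (resource names assume a default ingress-nginx installation):

# Check that the admission webhook service and its endpoints exist
kubectl -n ingress-nginx get svc ingress-nginx-controller-admission
kubectl -n ingress-nginx get endpoints ingress-nginx-controller-admission

# Confirm the controller pods backing the webhook are Running
kubectl -n ingress-nginx get pods -l app.kubernetes.io/component=controller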

ERROR The OpenSearch REST endpoint has NOT become accessible in the expected time; exiting

Hi,
This error is happening when deploying:

(⎈|jre-prod:jreprod)➜ logging git:(master) ✗ bash -xv /home/domora/JRE/JRE-60/monitoring_logging_issue_2/viya4-monitoring-kubernetes/logging/bin/deploy_logging.sh 2>~/tmp.tmp
INFO User directory: /home/domora/JRE/JRE-60/monitoring_logging_issue_2/cis-viya4-monitoring-kubernetes/tls
INFO Helm client version: 3.8.2
INFO Kubernetes client version: v1.21.1
INFO Kubernetes server version: v1.22.6
INFO Loading user environment file: /home/domora/JRE/JRE-60/monitoring_logging_issue_2/cis-viya4-monitoring-kubernetes/tls/logging/user.env

Deploying logging components to the [logging] namespace [Tue Jun 14 20:55:55 -03 2022]

INFO Deploying Event Router ...
serviceaccount/eventrouter unchanged
clusterrole.rbac.authorization.k8s.io/eventrouter unchanged
clusterrolebinding.rbac.authorization.k8s.io/eventrouter unchanged
configmap/eventrouter-cm unchanged
deployment.apps/eventrouter unchanged
INFO Event Router has been deployed

configmap "run-securityadmin.sh" deleted
configmap/run-securityadmin.sh created
configmap/run-securityadmin.sh labeled
"opensearch" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "opensearch" chart repository
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "fluent" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈
INFO Deploying OpenSearch
Release "opensearch" has been upgraded. Happy Helming!
NAME: opensearch
LAST DEPLOYED: Tue Jun 14 20:56:57 2022
NAMESPACE: logging
STATUS: deployed
REVISION: 5
TEST SUITE: None
NOTES:
Watch all cluster members come up.
$ kubectl get pods --namespace=logging -l app.kubernetes.io/component=v4m-search -w
INFO Waiting on OpenSearch pods to be Ready
pod/v4m-search-0 condition met
INFO OpenSearch has been deployed

INFO Loading Content into OpenSearch
ERROR The OpenSearch REST endpoint has NOT become accessible in the expected time; exiting.
ERROR Review the OpenSearch pod's events and log to identify the issue and resolve it before trying again.
ERROR Exiting script [deploy_logging.sh] due to an error executing the command [logging/bin/deploy_opensearch_content.sh].
(⎈|jre-prod:jreprod)➜ logging git:(master) ✗

Already saw that there is a 503 error coming from that endpoint from debugging the script deploy_opensearch_content.sh:

+ for pause in 30 30 30 30 30 30 60
++ curl -s -o /dev/null -w '%{http_code}' -XGET https://localhost:41829 --user admin: --insecure
+ response=503

Checked pod logs and says:
[2022-06-15T00:10:10,469][ERROR][o.o.s.a.BackendRegistry ] [v4m-search-0] Not yet initialized (you may need to run securityadmin)

Which kind of matches somehow with comments in that loop to verify the endpoint:
"returns 503 (and outputs "Open Distro Security not initialized.") when ODFE isn't ready yet"
"TO DO: check for 503 specifically?"

What should I do to resolve this?
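For reference, a hedged way to re-check the REST endpoint once the security plugin has initialized and then re-run only the failed step (the service name and the credentials placeholder are assumptions based on the output above):

# Port-forward to the OpenSearch service and probe the REST endpoint
kubectl -n logging port-forward svc/v4m-search 9200:9200 &
curl -sk -o /dev/null -w '%{http_code}\n' https://localhost:9200 --user admin:<admin-password>

# Once it returns 200 instead of 503, re-run the step that failed
logging/bin/deploy_opensearch_content.sh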

fluent bit pod failed with " cannot open database" error

After running deploy_logging_open.sh, I got the error below:

[2021/11/01 08:27:26] [error] [sqldb] cannot open database /var/log/sas_viya_flb.db
[2021/11/01 08:27:26] [error] [input:tail:tail.0] could not open/create database
[2021/11/01 08:27:26] [error] Failed initialize input tail.0
[2021/11/01 08:27:26] [error] [lib] backend failed

Any idea?
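In case it helps with debugging: the file in question is the Fluent Bit tail-position database written to the node's /var/log hostPath, so a hedged first check is whether that path is mounted and writable from the pod (the label selector is an assumption, and the exec step only works if the image provides a shell):

# Identify the failing Fluent Bit pod and the node it runs on
kubectl -n logging get pods -l app.kubernetes.io/name=fluent-bit -o wide

# If the image provides a shell, check the hostPath mount it writes to
kubectl -n logging exec <fluent-bit-pod> -- ls -ld /var/log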

Integration with PagerDuty

A customer has asked if we can integrate with PagerDuty. Is this something that can be supported in the future?

Upgrade instruction

Hi Team,
We are planning to upgrade our existing Viya monitoring and logging components. Are there any upgrade instructions? We have customizations such as path-based routing and Slack alerts, and we don't want to lose these settings.

Grafana username "admin" and password not working

Hi All,
I'm Ira from SAS Technical Support. I'm working on a ticket opened by a PSD engineer following a Viya 2021.2.5 deployment on AWS.
Viya Monitoring was installed with the Viya 4 Deployment project. He configured the password in the ansible-vars.yaml file with the variable V4M_GRAFANA_PASSWORD.

In the Ansible installation log, the following message appears:
TASK [monitoring : cluster-monitoring - output credentials] ********************
ok: [localhost] => {
    "msg": [
        "Grafana - username: admin, password: Password123"
    ]
}

When he logs in to Grafana, he gets an error: invalid username or password. He also tried admin/admin but it doesn't work.

I've asked him to check the current password, which is kept as a Kubernetes secret, by querying it with kubectl. He tried it, but the issue persists:

kubectl -n monitoring get secret v4m-grafana -o yaml
apiVersion: v1
data:
  admin-password: UGFzc3dvcmQxMjM=
  admin-user: YWRtaW4=

echo "YWRtaW4=" | base64 --decode
admin
echo "UGFzc3dvcmQxMjM=" | base64 --decode
Password123

Are there any additional steps to be performed when setting up Grafana using this project?
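For reference, one hedged way to force the admin password back to a known value in place, assuming the default v4m-grafana deployment name and that grafana-cli is available inside the Grafana container (this is not an official step from the project documentation):

# Reset the Grafana admin password from inside the running pod
kubectl -n monitoring exec deploy/v4m-grafana -c grafana -- \
  grafana-cli admin reset-admin-password Password123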

Thank you for your help

Best regards,
Ira

Helm 3.9 compatibility issue

Helm version 3.9.0 (the latest as of June 2022) causes an error during the Kibana/logging install.

Error: Kubernetes cluster unreachable: exec plugin: invalid apiVersion "client.authentication.k8s.io/v1alpha1"

Once we downgraded Helm to 3.8.2, we were able to successfully install Kibana/logging.

Please add a note about this under "Dependencies - Helm version".

Sudhir

authentication for prometheus and alertManager

I've recently deployed the monitoring dashboards and components in AWS using the example files available in samples/ingress. The Grafana dashboard includes an authentication layer by default, but Prometheus and Alertmanager are not protected by any sort of authentication. Is there a way to configure authentication for the Prometheus and Alertmanager applications? If not, what do you recommend doing to ensure that these applications are not available to unauthenticated users?

For this deployment, we are using the same URL as the Viya applications themselves, and it would be great if there were a way to leverage SASLogon so users could log in with their SAS credentials to use the monitoring tools.
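While waiting for an official answer, one hedged stopgap is NGINX Ingress HTTP basic authentication via annotations; a minimal sketch, assuming the Prometheus ingress is named v4m-prometheus in the monitoring namespace (this does not provide SASLogon integration):

# Create an htpasswd secret and attach basic-auth annotations to the ingress
htpasswd -c ./auth monitor-admin
kubectl -n monitoring create secret generic prometheus-basic-auth --from-file=./auth
kubectl -n monitoring annotate ingress v4m-prometheus \
  nginx.ingress.kubernetes.io/auth-type=basic \
  nginx.ingress.kubernetes.io/auth-secret=prometheus-basic-auth \
  nginx.ingress.kubernetes.io/auth-realm="Authentication Required"

The same annotations could be applied to the Alertmanager ingress with its own secret.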

Dashboard for Viya 4 Open Distro Elasticsearch (as opposed to Viya 4 logging Elasticsearch)

After installing Viya 4 Monitoring and Logging, I noticed that the Elasticsearch dashboard covers the Viya 4 logging stack. However, I would like to see a dashboard for the Viya 4 Open Distro for Elasticsearch as well.

I do see the ServiceMonitor for sas-elasticsearch.

I'd appreciate any recommendation on how to surface the metrics for the Viya 4 Open Distro for Elasticsearch in Grafana. Maybe make a copy of the Viya 4 logging Elasticsearch dashboard and modify it? However, I didn't see the necessary metrics, so I must be missing something obvious.

Thanks.

What's the hardware and resource requirements?

May I ask what the hardware resource requirements are for this solution (both monitoring and logging)?
If I want to use a dedicated node for these tools, what is the minimum instance size I need to use?

azure sample "user-values-elasticsearch-open.yaml" makes the logging framework installation fail

Hi there, I'm deploying the logging framework in Azure using the provided Azure sample (with a path-based ingress).

It used to work, but now the setup fails with the message below:

Successfully packaged chart and saved it to: /tmp/sas.mon.1kETJF7q/opendistro-build/helm/opendistro-es/opendistro-es-1.8.0.tgz
Release "odfe" does not exist. Installing it now.
Error: Ingress.extensions "v4m-es-kibana" is invalid: metadata.annotations: Invalid value: "nginx.ingress.kubernetes.io configuration-snippet": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')

If I remove the nginx.ingress.kubernetes.io configuration-snippet, the setup is successful, but I cannot reach the Kibana web page.
I'm using the 1.0.9 release.

Thanks for your help !

Host based to Path based Ingress

Can someone please let me know how to convert the host-based ingress to a path-based ingress for logging and monitoring, when we run cluster-monitoring and cluster-logging using the ansible-playbook method with the viya4-deployment GitHub code?

Example (current, host-based):

rules:
- host: dashboards.nachai2ns.stic.unx.sas.com
  http:
    paths:
    - path: /

Desired (path-based):

rules:
- host: nachai2ns.stic.unx.sas.com
  http:
    paths:
    - path: /dashboards

Documentation issue in logging readme

The logging readme.md mentions a NODE_PLACEMENT_ENABLE variable that does not work / exist.
It looks like this should read LOG_NODE_PLACEMENT_ENABLE, as that is the variable used in the .sh scripts.
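For anyone searching later, a minimal sketch of the working setting, assuming the usual $USER_DIR override workflow described in the docs:

# $USER_DIR/logging/user.env
LOG_NODE_PLACEMENT_ENABLE=true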

Grafana dashboard - SAS Launched Jobs - User Activity

I am struggling to reconcile the information presented by the graph for 'filesystem I/O by user'. While writing a 100 GB file to SAS WORK, the dashboard reports the following data points for filesystem I/O:

sasadm - Reads | 0 B/s | 8.2 kB/s | 340 B/s | 512 B/s
sasadm - Writes | -58.1 kB/s | 0 B/s | -1.8 kB/s | 0 B/s

while iotop run on the OS reports I/O of about 200 MB/s. SAS WORK resides on local NVMe drives. The same happens in the 'SAS Launched Jobs - User Activity' dashboard.

TLS defaults are different for monitoring and logging stack

Hi,

I noticed a difference in the default settings that causes an error when deploying the logging stack without TLS (stating that cert-manager is not available).

While the monitoring stack defaults to "false" if the TLS_ENABLE env var is not set, the logging stack defaults to "true":

export TLS_ENABLE="${LOG_TLS_ENABLE:-${TLS_ENABLE:-true}}"
(logging/bin/common.sh).

This is not the default value as indicated in logging/user.env. If TLS is unwanted, one needs to set TLS_ENABLE to "false" in $USER_DIR/logging/user.env
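In practice, that means a deployment without TLS needs one of these overrides; a minimal sketch based on the common.sh line quoted above:

# $USER_DIR/logging/user.env
TLS_ENABLE=false
# or, to affect only the logging stack:
LOG_TLS_ENABLE=false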

Set the maximum log retention period beyond one month

Hi All,

I'm Ira, a member of SAS Technical Support. A customer is asking whether it is possible to increase the maximum log retention period beyond one month. If so, how can it be done?

According to SAS PSD, "the customer will need log retention of more than 1 month because we are implementing complex processes under Viya 4 where 1700 users will be triggering logging jobs. The production start-up is done on a very tight schedule and we will need to follow the jobs launched in a precise manner and eventually go back to the logs history."
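For context, the retention period is driven by a user.env setting (LOG_RETENTION_PERIOD, which also appears in a later issue on this page); whether values beyond one month are supported is exactly the open question here. A hedged sketch with an illustrative value:

# $USER_DIR/logging/user.env  (value in days; 60 is only an illustrative value)
LOG_RETENTION_PERIOD=60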

Thank you

Regards,
Ira

Breaking change in helm 3.3.2

Hi,

I wanted to make you aware of a breaking change introduced in helm version 3.3.2 (https://github.com/helm/helm/releases/tag/v3.3.2):
"The default behavior of helm repo add was changed: It no longer overwrites repositories by default. "

As a result, the deployment scripts deploy_monitoring_cluster.sh and deploy_logging_open.sh will fail with this error:
"Error: repository name (stable) already exists, please specify a different name"

To work around this issue, one can edit the scripts and add --force-update to the helm repo add commands (however, this would cause errors for Helm versions below 3.3.2).
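A hedged example of the workaround for the repository named in the error (the URL shown is the standard location of the legacy stable charts repository; the scripts may add other repositories that need the same flag):

helm repo add stable https://charts.helm.sh/stable --force-update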

See here for a discussion: helm/helm#8771

Grafana's SAS Dashboards - no data

Hi,
After installing Viya 4 monitoring and logging using the viya4-deployment tool, the Kubernetes dashboards show nice statistics and graphs in Grafana; however, none of the SAS dashboards are picking up any data. The Viya version is LTS 2021.1.

The Prometheus query text box is also unable to find any matching metrics for "cas_" or "cas_nodes".
Is there a post-installation step that I am missing?


How can I disable and create a new monitoring rule and alert

I am trying to configure a custom alert such as "Two CAS pods should not run on the same worker node". I am using the steps below, but I am unable to configure the new alert rule:

1) Created a new alert rule:

additionalPrometheusRules:   # I have tried both adding and removing this line
  - name: CustomeRules
    groups:
      - name: CAS-Alert-Rule
        rules:
          - alert: CASPodsNode
            expr: topk(1,sort_desc((count (kube_pod_info{created_by_kind="CASDeployment"}) by (node)))) == 1
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "More than one CAS pod is running on the same node: (instance {{ $labels.node }})"
              description: "Two CAS pods got assigned to the same node. This could cause a performance issue."

2) Added the rules under the prometheus section of user-values-prom-operator.yaml.

3) Ran the monitoring/bin/deploy_monitoring_cluster.sh script to apply the changes.

Unfortunately, this is not working for me. It would be helpful if you could provide any assistance with this.
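One hedged alternative, in case the chart-values route keeps failing, is to create the rule as a standalone PrometheusRule resource. This is a sketch, not the project's documented method: the release label is an assumption and must match the ruleSelector of the deployed Prometheus, and the expression uses "> 1" so it fires when any node runs more than one CAS pod:

cat <<'EOF' | kubectl -n monitoring apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-cas-rules
  labels:
    release: v4m-prometheus-operator   # assumption: must match the Prometheus ruleSelector
spec:
  groups:
  - name: CAS-Alert-Rule
    rules:
    - alert: CASPodsNode
      # One series per node that is running more than one CAS pod
      expr: count(kube_pod_info{created_by_kind="CASDeployment"}) by (node) > 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "More than one CAS pod is running on node {{ $labels.node }}"
EOF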

PostgreSQL dashboard not showing namespace choice

We have two Viya 4 namespaces in one cluster.
On the PostgreSQL dashboard, I cannot choose the namespace.
On the PostgreSQL database dashboard, I also cannot choose the namespace, but I can choose two instance IDs.

How can I solve this?

Regards,

Dik Pater

Issues executing this script on MacOS

I'm experiencing issues executing these scripts on macOS. The issue seems to be related to the sleep command. Specifically, macOS doesn't like the 's' suffix on the sleep value. When I remove the 's' from all values in each of the scripts, Kibana/Grafana deploys without issue.

Here are further details:
MacOS 12.0.1
Mac Terminal v2.12
also on iTerm2 Build 3.4.15
Helm client version: 3.7.1
Kubernetes client version: v1.21.0
Kubernetes server version: v1.20.7
git version 2.32.0
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin21)

I checked out Viya Monitoring for Kubernetes v1.1.4

For example, in viya4-monitoring-kubernetes/logging/bin/deploy_elasticsearch_open.sh this seems to fail:

#log_verbose "Waiting [90] seconds to allow PVCs for pod [v4m-es-master-0] to be matched with available PVs [$(date)]"
sleep 90s

However, when I update the script to this, it succeeds:

#log_verbose "Waiting [90] seconds to allow PVCs for pod [v4m-es-master-0] to be matched with available PVs [$(date)]"
sleep 90

It also applies to other pauses in the script. It fails with this:

log_verbose "Pod [v4m-es-master-0] is not ready yet...sleeping for [$pause] more seconds before checking again."
sleep ${pause}s

But works with this:

log_verbose "Pod [v4m-es-master-0] is not ready yet...sleeping for [$pause] more seconds before checking again."
sleep ${pause}

Deploy Duration

Why does it take many times longer to deploy logging and monitoring than it takes to deploy Viya itself? The output below is only for adding logging; if you add monitoring, it also seems to take forever:

PLAY RECAP *********************************************************************
localhost                  : ok=119  changed=37   unreachable=0    failed=0    skipped=66   rescued=0    ignored=0

Wednesday 17 November 2021  17:04:36 +0000 (0:00:00.236)       0:14:27.898 ****
===============================================================================
monitoring : cluster-logging - deploy --------------------------------- 663.18s
vdm : manifest - deploy ------------------------------------------------ 87.83s
vdm : kustomize - Generate deployment manifest ------------------------- 35.94s
vdm : prereqs - cluster-local deploy ----------------------------------- 10.62s
vdm : manifest - deploy update ------------------------------------------ 9.35s
vdm : assets - Download ------------------------------------------------- 9.27s
vdm : prereqs - cluster-wide -------------------------------------------- 5.55s
vdm : assets - Get License ---------------------------------------------- 5.52s
vdm : prereqs - cluster-api --------------------------------------------- 4.08s
vdm : copy - VDM generators --------------------------------------------- 2.51s
vdm : copy - VDM transformers ------------------------------------------- 2.07s
vdm : assets - Extract downloaded assets -------------------------------- 2.03s
monitoring : v4m - download --------------------------------------------- 1.31s
baseline : Deploy cert-manager ------------------------------------------ 0.97s
jump-server : jump-server - create folders ------------------------------ 0.95s
vdm : Download viya4-orders-cli ----------------------------------------- 0.88s
jump-server : jump-server - lookup groups ------------------------------- 0.86s
vdm : Create namespace -------------------------------------------------- 0.84s
Gathering Facts --------------------------------------------------------- 0.82s
vdm : copy - VDM resources ---------------------------------------------- 0.81s

Grafana not accessible through its ingress

In the sample file monitoring/samples/azure-deployment, there are ingresses set up for Prometheus, Alertmanager, and Grafana. Using the sample file (changing hostnames as directed), I could not access Grafana from its ingress address (but I could access Prometheus and Alertmanager). What I found is that the section of this file pertaining to Grafana is missing the following two lines, which appear in both the Prometheus and Alertmanager sections:

annotations:
  kubernetes.io/ingress.class: nginx

Once I added those two lines to the Grafana section and redeployed, I was able to access the Grafana ingress. If I'm missing something and have approached this incorrectly, please let me know -- but I think this could be an issue.

getlogs.sh problem

Hi Expert,

Thanks for providing such a good tool to monitor our Viya deployment.
Recently, we wanted to export our logs to files.
While using getlogs.sh, we keep getting an error, even though we're pretty sure the host/port/user/password values are correct, because we can log in through the website.

We use the ingress host and port as below.
[screenshot omitted]

Here is the index page of our Elasticsearch.
[screenshot omitted]

Did we do anything wrong, or did we misunderstand the meaning of the parameters?
Thanks

Best,
Stacey

deployment of monitoring gives an error

I receive this error when deploying monitoring. I cloned the repository on 6/28/2022. The error occurs on the deploy_logging step.

msg: '[Errno 2] No such file or directory: b''/tmp/ansible.nvc9ptvd/viya4-monitoring-kubernetes/logging/bin/deploy_logging.sh'''
rc: 2

Missing toleration for Prometheus push-gateway pod ?

In my AKS deployment, there is no available node for the push-gateway pod.
Unlike the other Prometheus/Grafana pods, the "push-gateway" pod does not have a toleration for the "stateful" nodes.
As a result, the pod remains Pending.
When I add this to my $USER_DIR/monitoring/user-values-pushgateway.yaml:

tolerations:
  - effect: NoSchedule
    key: workload.sas.com/class
    value: stateful

it solves the issue.
Thanks

Upgrade : Unable to access Kibana URL after upgrade.

Hi Team,

Our Kibana URL before the upgrade was https://Client -viya4-Box.xxxxxxxx.xxxxxxx.corp/kibana, but after the upgrade an "ip-x-x-x-x.ec2.internal:" instance address is shown instead. We are unable to access Kibana right now.

Note: The URL above has been changed for security purposes.

NOTES:
Viya Monitoring for Kubernetes 1.1.8-SNAPSHOT is installed

Elasticsearch/Kibana Access Controls
Assign users the back-end role of [V4MCLUSTER_ADMIN_kibana_users] to
grant them access to Kibana and log messages from
ALL tenants and ALL namespaces

Accessing the monitoring applications

You can access KIBANA via the following URL:
http://ip-x-x-x-x.ec2.internal:31033

It was not possible to determine the URL needed to access ELASTICSEARCH

Note: These URLs may be incorrect if your ingress and/or other network
configuration includes options this script does not handle.

The deployment of logging components has completed [Tue Apr 26 12:12:55 UTC 2022]

DEBUG Deleted temporary directory: [/tmp/sas.mon.Ff1P1CSh]


upgrade process

1) Download the latest repo.
2) export USER_DIR=<path> (the custom files are listed below):

   user.env
   user-values-elasticsearch-path.yaml -- has the custom URL

3) Run ./logging/bin/deploy_logging_open.sh


kibana:
  extraEnvs:
  # Needed for path-based ingress
  - name: SERVER_BASEPATH
    value: /kibana
  # Username & password need to be set here since helm replaces array values
  - name: ELASTICSEARCH_USERNAME
    valueFrom:
      secretKeyRef:
        name: internal-user-kibanaserver
        key: username
  - name: ELASTICSEARCH_PASSWORD
    valueFrom:
      secretKeyRef:
        name: internal-user-kibanaserver
        key: password
  service:
    type: ClusterIP
    nodePort: null
  ingress:
    annotations:
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/affinity: "cookie"
      nginx.ingress.kubernetes.io/configuration-snippet: |
        rewrite (?i)/kibana/(.*) /$1 break;
        rewrite (?i)/kibana$ / break;
      nginx.ingress.kubernetes.io/rewrite-target: /kibana
    enabled: true
    # ingressClassName is not currently supported by the Open Distro helm chart
    # ingressClassName: nginx
    path: /kibana
    pathType: Prefix
    hosts:
    - Client -viya4-Box.xxxxxxxx.xxxxxxx.corp

Change in Log retention period for Kibana is not working

Hi Team,
I have been trying to change the log retention period for Kibana; however, the changes are not working.

1) I changed the parameter LOG_RETENTION_PERIOD=15 in the logging/user.env file and re-ran the deploy_logging_open.sh script. But the Kibana dashboard is still showing logs for only 3 days.

Could you please help with this?

Instructions for applying Viya ServiceMonitors

I have recently deployed the monitoring components into a new Viya deployment on EKS. However, Prometheus and Grafana are not displaying any information, and tech support directed me to ask this project for further instructions or resources. I do see the documentation regarding ServiceMonitors here.

However, there is not a specific set of instructions for deployment. Is it as simple as applying the manifest, or are there additional steps? For example:

kubectl apply -f viya/serviceMonitor-sas-cas-operator.yaml -n monitoring
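In case it helps, a hedged sketch of applying every monitor definition under that directory in one pass (the directory path is the one referenced elsewhere on this page; whether additional steps are required is the open question):

# Apply all SAS Viya ServiceMonitor/PodMonitor definitions shipped with the project
for f in monitoring/monitors/viya/*.yaml; do
  kubectl apply -n monitoring -f "$f"
done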

Service discovery not working properly

Hello,

I have deployed the monitoring cluster and viya on AKS version 1.21.9 and SAS version 2021.2.5

When I go to the dashboards, the Kubernetes dashboards work well, but the SAS dashboards are empty; and when I go to Prometheus, I find these service discovery targets at 0:

[podMonitor/monitoring/eventrouter/0 (0 / 652 active targets)]
[podMonitor/monitoring/sas-deployment-operator/0 (0 / 24 active targets)]
[podMonitor/monitoring/sas-go-pods/0 (0 / 24 active targets)]
[podMonitor/monitoring/sas-java-pods/0 (0 / 24 active targets)]
[serviceMonitor/monitoring/elasticsearch/0 (0 / 455 active targets)]
[serviceMonitor/monitoring/fluent-bit-v2/0 (0 / 455 active targets)]
[serviceMonitor/monitoring/fluent-bit/0 (0 / 455 active targets)]
[serviceMonitor/monitoring/sas-arke/0 (0 / 21 active targets)]
[serviceMonitor/monitoring/sas-cas-operator/0 (0 / 21 active targets)]
[serviceMonitor/monitoring/sas-cas-server/0 (0 / 21 active targets)]
[serviceMonitor/monitoring/sas-elasticsearch/0 (0 / 21 active targets)]
[serviceMonitor/monitoring/sas-postgres/0 (0 / 21 active targets)]
[serviceMonitor/monitoring/sas-rabbitmq-server/0 (0 / 21 active targets)]
[serviceMonitor/monitoring/v4m-kube-controller-manager/0 (0 / 41 active targets)]
[serviceMonitor/monitoring/v4m-kube-etcd/0 (0 / 41 active targets)]
[serviceMonitor/monitoring/v4m-kube-proxy/0 (0 / 41 active targets)]
[serviceMonitor/monitoring/v4m-kube-scheduler/0 (0 / 41 active targets)]

I have deployed the monitoring cluster with HTTPS using my valid certificate.

Any suggestion?

Integration of Viya4 Monitoring with GCP Cloud Monitoring

GCP recommends using workload metrics instead of the sidecar container for integration with its Cloud Logging and Monitoring framework. See https://cloud.google.com/stackdriver/docs/solutions/gke/managing-metrics#workload-metrics for more details. Our deployment descriptors from under https://github.com/sassoftware/viya4-monitoring-kubernetes/tree/9ead02a1e46cd34842005d2b903d3d839081122d/monitoring/monitors/viya are not readily compatible with GCP. Could you please adopt the new recommended way of integrating with Cloud Monitoring & Logging?

shows sas-esp-operator-metrics target as down even though working

Dear sassoftware,
we've installed the Viya 4 monitoring components and find these targets down even though the pods are running fine:

serviceMonitor/viya4/sas-esp-operator-metrics/0 (0/1 up)
serviceMonitor/viya4/sas-esp-operator-metrics/1 (0/1 up)

In the details of the targets' configuration, the URL differs from the pod annotations.

The endpoints Prometheus uses:
http://10.244.0.167:8383/metrics
http://10.244.0.167:8686/metrics

The endpoints according to the pods' annotations:

annotations:
  prometheus.io/path: /internal/metrics
  prometheus.io/port: '8383'
  prometheus.io/scheme: http
  prometheus.io/scrape: 'true'

annotations:
  prometheus.io/path: /internal/metrics
  prometheus.io/port: '8686'
  prometheus.io/scheme: http
  prometheus.io/scrape: 'true'

Is there a way to correct and update the URL path setting for monitoring?

SAS Viya 4 - logging & monitoring - how to shutdown?

SAS Viya 4 - logging & monitoring - how to shutdown?

Problem Description:
I did try this, but I think there are some missing steps. The components are still up.

[root@server ~]# kubectl -n monitoring scale deployment v4m-grafana --replicas=0
deployment.apps/v4m-grafana scaled
[root@server ~]# kubectl -n monitoring scale deployment v4m-kube-state-metrics --replicas=0
deployment.apps/v4m-kube-state-metrics scaled
[root@server ~]# kubectl -n monitoring scale deployment v4m-operator --replicas=0
deployment.apps/v4m-operator scaled
[root@server ~]# kubectl -n logging scale deployment v4m-es-kibana --replicas=0
deployment.apps/v4m-es-kibana scaled
[root@server ~]# kubectl -n logging scale deployment v4m-es-exporter --replicas=0
deployment.apps/v4m-es-exporter scaled
[root@server ~]# kubectl -n logging scale deployment v4m-es-client --replicas=0
deployment.apps/v4m-es-client scaled
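For reference, a hedged sketch of the additional objects that also need attention, since several components run as StatefulSets or DaemonSets rather than Deployments (names are not exhaustive; DaemonSets such as the node exporter and Fluent Bit have no replica count and would need to be deleted or re-deployed later):

# Scale down everything that supports replicas in both namespaces
kubectl -n monitoring scale deployment,statefulset --all --replicas=0
kubectl -n logging scale deployment,statefulset --all --replicas=0

# DaemonSets cannot be scaled to zero; list them to handle separately
kubectl -n monitoring get daemonsets
kubectl -n logging get daemonsets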

Deploying rules in AWS

Hi,

INFO Helm client version: 3.5.4
INFO Kubernetes client version: v1.18.19
INFO Kubernetes server version: v1.18.16-eks-7737de

We ran into this webhook error while deploying monitoring on an EKS cluster. The error occurred when applying this rule: rules-sas-launcher-job.yaml

INFO Adding Prometheus recording rules...
Error from server (InternalError): error when creating "monitoring/rules/viya/rules-sas-launcher-job.yaml": Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://v4m-operator.monitoring.svc:443/admission-prometheusrules/mutate?timeout=10s: context deadline exceeded

I modified deploy_monitoring_cluster.sh and commented out this section, and the monitoring installed correctly with the exception of applying the rules.

    # Rules
    log_info "Adding Prometheus recording rules..."
    for f in monitoring/rules/viya/rules-*.yaml; do
      kubectl apply -n $MON_NS -f $f
    done

I ran into a similar issue with webhooks in AWS before, but it didn't seem to like any modifications I made to (--set webhook.hostNetwork=true,webhook.securePort=) during the Helm install of Prometheus. Those two settings had to be set in another Helm install that had a webhook issue, and that fixed the problem in the past. I also checked whether there were any lingering CRDs, ValidatingWebhookConfigurations, or MutatingWebhookConfigurations.

Thanks in advance

TLS kibana 502 bad gateway

Hello, I'm deploying the logging part with TLS enabled.

I have made the customizations in my YAML file as follows:

kibana:
  extraEnvs:
  - name: SERVER_BASEPATH
    value: /kibana
  - name: ELASTICSEARCH_USERNAME
    valueFrom:
       secretKeyRef:
          name: internal-user-kibanaserver
          key: username
  - name: ELASTICSEARCH_PASSWORD
    valueFrom:
       secretKeyRef:
          name: internal-user-kibanaserver
          key: password
  service:
    type: ClusterIP
    nodePort: null
  ingress:
    annotations:
      kubernetes.io/ingress.class: nginx
      cert-manager.io/cluster-issuer: letsencrypt-prod
      nginx.ingress.kubernetes.io/backend-protocol: HTTPS
      nginx.ingress.kubernetes.io/affinity: "cookie"
      nginx.ingress.kubernetes.io/ssl-redirect: "false"
      nginx.ingress.kubernetes.io/configuration-snippet: |
        rewrite (?i)/kibana/(.*) /$1 break;
        rewrite (?i)/kibana$ / break;
      nginx.ingress.kubernetes.io/rewrite-target: /kibana
    enabled: true
    hosts:
    - hostname/kibana
    tls:
      - secretName: kibana-ingress-tls-secret
        hosts:
          - hotsname

When I try to access the web page, I get a 502 Bad Gateway with the following error in the NGINX ingress log:

2021/09/28 12:26:38 [error] 5574#5574: *415692 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: 192.168.16.4, server: hosthame, request: "GET /kibana HTTP/2.0", upstream: "https://192.168.16.89:5601/", host: "hosthame"
2021/09/28 12:26:38 [error] 5574#5574: *415692 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: 192.168.16.4, server: hosthame, request: "GET /kibana HTTP/2.0", upstream: "https://192.168.16.89:5601/", host: "hosthame"
2021/09/28 12:26:38 [error] 5574#5574: *415692 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: 192.168.16.4, server: hosthame, request: "GET /kibana HTTP/2.0", upstream: "https://192.168.16.89:5601/", host: "hosthame"
192.168.16.4 - admin [28/Sep/2021:12:26:38 +0000] "GET /kibana HTTP/2.0" 502 552 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36" 25 0.003 [logging-v4m-es-kibana-svc-443] [] 192.168.16.89:5601, 192.168.16.89:5601, 192.168.16.89:5601 0, 0, 0 0.004, 0.000, 0.000 502, 502, 502 df84a71026cdc776daf36c9cc15789b8

Any suggestion as to where the error might be?

Thanks

getlogs.sh uses https protocol even when http is specified, leading to query failure

On line 608 of logging/bin/getlogs.sh, the -XPOST argument to curl specifies a hardcoded protocol of 'https'. It should use the variable '$protocol' to support either https (the default) or http, as requested by the user.

When attempting to use getlogs.sh with HTTP on port 80, this leads to the curl query failing with an error like this:

$ LOG_DEBUG_ENABLE=true logging/bin/getlogs.sh -ho my-hostname -po 80 -us admin -pw my-elasticsearch-admin-password -pr http -n my-viya-namespace
DEBUG OpenShift not detected
DEBUG OpenShift not detected. Skipping 'oc' checks.
DEBUG Working directory: /home/cloud-user/viya4-monitoring-kubernetes
DEBUG User directory: /home/cloud-user/viya4-monitoring-kubernetes
DEBUG Temporary directory: [/tmp/sas.mon.Oyvw3rPs]
DEBUG Number of Input Parms: 12 Input Parms: -ho my-hostname -po 80 -us admin -pw my-elasticsearch-admin-password -pr http -n my-viya-namespace
DEBUG Connection options PROTOCOL: http HOST: my-hostname PORT: 80  USERNAME: admin
DEBUG Validation of connection information succeeded  [200] rc:[0]
DEBUG Query Count:1  List of Days: 2021-09-03
INFO Search for matching log messages started... Fri Sep  3 06:09:44 EDT 2021
DEBUG Query [1] of [1] Fri Sep  3 06:09:44 EDT 2021
DEBUG curl command response: [000] rc:[35]
ERROR The curl command failed while attempting to submit query [1]; script exiting.
DEBUG Deleted temporary directory: [/tmp/sas.mon.Oyvw3rPs]

I edited my copy of getlogs.sh to change line 608 to replace the 'https' with '$protocol', so that it looks like this:
response=$(curl -m $maxtime -s -o $qresults_file -w "%{http_code}" -XPOST "$protocol://$host:$port/_opendistro/_sql?format=$format" -H 'Content-Type: application/json' -d @$query_file $output_txt --user $username:$password -k)

Then I reran my query, and it successfully returned logs in CSV format.

No values from CAS

The CAS dashboard is empty; it looks like Prometheus is not getting any metrics.

cas_node_cpu_time_seconds

cas_node_cpu_time_seconds{cas_node="controller.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Controller", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="idle"} NaN
cas_node_cpu_time_seconds{cas_node="controller.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Controller", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="iowait"} NaN
cas_node_cpu_time_seconds{cas_node="controller.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Controller", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="irq"} NaN
cas_node_cpu_time_seconds{cas_node="controller.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Controller", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="softirq"} NaN
cas_node_cpu_time_seconds{cas_node="controller.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Controller", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="system"} NaN
cas_node_cpu_time_seconds{cas_node="controller.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Controller", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="user"} NaN
cas_node_cpu_time_seconds{cas_node="worker-0.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Worker", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="idle"} NaN
cas_node_cpu_time_seconds{cas_node="worker-0.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Worker", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="iowait"} NaN
cas_node_cpu_time_seconds{cas_node="worker-0.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Worker", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="irq"} NaN
cas_node_cpu_time_seconds{cas_node="worker-0.sas-cas-server-default.saslab.svc.cluster.local", cas_node_type="Worker", cas_server="cas-shared-default", container="cas", endpoint="http", instance="10.110.129.86:8777", job="sas-cas-server-default-client", namespace="saslab", pod="sas-cas-server-default-controller", service="sas-cas-server-default-client", type="softirq"}

Typo in 1.0.3 version of viya4-monitoring-kubernetes/samples/tls/monitoring/user.env file

Last week I was deploying the 1.0.3 version of the monitoring components and noticed what I believe is a typo in the viya4-monitoring-kubernetes/samples/tls/monitoring/user.env file.

It says TLS_ENABLED which I believe is incorrect, and it should be TLS_ENABLE.

I didn't realize this was wrong at first and I copied that sample file over to my USER_DIR directory (as per the instructions) and didn't change that variable. I did the rest of the steps and the deployment didn't report any errors, but I got HTTP 502 errors when trying to access my grafana and prometheus endpoints. When I eventually realized it should be TLS_ENABLE, I re-did the monitoring deployment exactly as I did before (but just with the correct environment variable this time in that user.env file) and it worked.

This does seem to be fixed in that same file on the master branch, so perhaps this typo has already been noticed. But I did want to report it here because the 'Perform Pre-Deployment Tasks' steps do explicitly say to grab the latest release from the Releases page, and as of right now that is 1.0.3. So if others are following these instructions using the 1.0.3 release and using TLS, they might run into problems like I did if they don't realize the variable in the sample TLS user.env file is wrong.

1.0.3 version I was using shows TLS_ENABLED:
https://github.com/sassoftware/viya4-monitoring-kubernetes/blob/1.0.3/samples/tls/monitoring/user.env

master version shows TLS_ENABLE:
https://github.com/sassoftware/viya4-monitoring-kubernetes/blob/master/samples/tls/monitoring/user.env

Viya monitoring upgrade error

I have downloaded the latest repo as a zip, and on running deploy_logging_open.sh to upgrade our existing logging deployment, I am facing the issue below.


DEBUG Working directory: /home/ec2-user/new_sas4viya/viya4-monitoring-kubernetes-logging/viya4-monitoring-kubernetes-master
INFO User directory: /home/ec2-user/new_sas4viya/viya4-monitoring-kubernetes-logging/Logging_cust_files
INFO Helm client version: 3.4.0
ERROR Unsupported kubectl version: [v1.18.8-eks-7c9bda]
[ec2-user@ip-x-x-x-x-] viya4-monitoring-kubernetes-master]$


kubectl version --short --client
Client Version: v1.18.8-eks-7c9bda

From AWS, we tried the kubectl version mentioned for EKS 2.21 as per the doc below, and AWS provides that same kubectl version:

https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html

Grafana Dashboards not working

Hi ,

I tried a couple of times to deploy the Viya 4 monitoring framework. During installation with the standard scripts I see no errors; however, when I try to access one of the standard dashboards I get an error (screenshot omitted) and nothing is displayed.
Also, when I try to access some of the Viya 4 specific dashboards I get similar errors (screenshots omitted).

Anything I'm missing ?

Both SAS Launched Jobs (node and user) do not show activity

Hi,

I just installed the latest version of the Viya 4 monitoring project, and both the SAS Launched Jobs - Node Activity and SAS Launched Jobs - User Activity dashboards show no data, although I clearly have sas-launcher pods running in my namespace.

Is there anything I need to configure to make sure these two reports work?

The rest (CAS, RabbitMQ, Postgres, ...) all show correct data.

Bye,
Wouter.
