
opensearch-operator's Introduction

OpenSearch Operator


Description

The Charmed OpenSearch Operator deploys and operates the OpenSearch software on VMs and machine clusters.

This operator provides an OpenSearch cluster, with:

  • TLS (for the HTTP and Transport layers)
  • Automated node discovery
  • Observability
  • Backup / Restore
  • Safe horizontal scale-down/up
  • Large deployments

The Operator in this repository is a Python project that installs and manages OpenSearch from the OpenSearch snap, providing lifecycle management and handling events (install, start, etc.).

Usage

Bootstrap an LXD controller with Juju and create a model:

juju add-model opensearch

Configure the system settings required by OpenSearch. We do this by creating a cloudinit-userdata.yaml file and setting it on the model, as well as by setting some kernel parameters on the host machine.

cat <<EOF > cloudinit-userdata.yaml
cloudinit-userdata: |
  postruncmd:
    - [ 'sh', '-c', 'echo vm.max_map_count=262144 >> /etc/sysctl.conf' ]
    - [ 'sh', '-c', 'echo vm.swappiness=0 >> /etc/sysctl.conf' ]
    - [ 'sh', '-c', 'echo net.ipv4.tcp_retries2=5 >> /etc/sysctl.conf' ]
    - [ 'sh', '-c', 'echo fs.file-max=1048576 >> /etc/sysctl.conf' ]
    - [ 'sysctl', '-p' ]
EOF

sudo tee -a /etc/sysctl.conf > /dev/null <<EOT
vm.max_map_count=262144
vm.swappiness=0
net.ipv4.tcp_retries2=5
fs.file-max=1048576
EOT

sudo sysctl -p

juju model-config --file=./cloudinit-userdata.yaml
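You can verify that the model picked up the configuration by reading it back, for example:

juju model-config cloudinit-userdata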

Basic Usage

To deploy a single unit of OpenSearch using its default configuration:

juju deploy opensearch --channel=2/edge
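To scale out later, add more units as usual with Juju (shown here as an example; three data nodes are a common minimum for high availability):

juju add-unit opensearch -n 2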

Relations / Integrations

The relevant relations provided by Charmed OpenSearch are:

Client interface:

To connect to the Charmed OpenSearch Operator and exchange data, relate to the opensearch-client endpoint:

juju deploy data-integrator --channel=2/edge
juju integrate opensearch data-integrator
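Once the integration settles, the generated credentials and endpoints can typically be retrieved from the data-integrator charm via its get-credentials action (Juju 3.x syntax shown as an example):

juju run data-integrator/leader get-credentials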

Large deployments:

Charmed OpenSearch also allows forming large clusters or joining an existing deployment through the following relations:

  • peer-cluster
  • peer-cluster-orchestrator
juju integrate main:peer-cluster-orchestrator data-hot:peer-cluster

TLS:

The Charmed OpenSearch Operator also supports TLS encryption as a first class citizen, on both the HTTP and Transport layers. TLS is enabled by default and is a requirement for the charm to start.

The charm relies on the tls-certificates interface.

1. Self-signed certificates:

# Deploy the self-signed TLS Certificates Operator.
juju deploy self-signed-certificates --channel=latest/stable

# Add the necessary configurations for TLS.
juju config \
    self-signed-certificates \
    ca-common-name="Test CA" \
    certificate-validity=365 \
    root-ca-validity=365
    
# Enable TLS via relation.
juju integrate self-signed-certificates opensearch

# Disable TLS by removing relation.
juju remove-relation opensearch self-signed-certificates

Note: The TLS settings shown here are for self-signed-certificates, which are not recommended for production clusters. The Self Signed Certificates Operator offers a variety of configuration options. Read more on the TLS Certificates Operator here.

Security

Security issues in the Charmed OpenSearch Operator can be reported through LaunchPad. Please do not file GitHub issues about security issues.

Contributing

Please see the Juju SDK docs for guidelines on enhancements to this charm following best practice guidelines, and CONTRIBUTING.md for developer guidance.

License

The Charmed OpenSearch Operator is free software, distributed under the Apache Software License, version 2.0. See LICENSE for more information.


opensearch-operator's Issues

Remove reference to proxy requests in backups

The backups module makes use of a proxy request method which handles empty, None, and error responses in a special way.
We need to revisit this part and make the module converge on using exclusively the main request method.

Remove role re-balancing

If there is an even number of cluster-manager-eligible nodes, OpenSearch automatically excludes one of them from the voting configuration:

There should normally be an odd number of master-eligible nodes in a cluster. If there is an even number, Elasticsearch leaves one of them out of the voting configuration to ensure that it has an odd size.

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-voting.html#_even_numbers_of_master_eligible_nodes

Therefore, there is no need to ensure an odd number of cluster-manager-eligible nodes by re-balancing the roles ourselves


This solves the issue with the deterministic role re-balancing in #209 where:

  • initial deployment
  • even number of units
  • leader unit is highest unit number
  • (highest unit number is not cluster manager eligible)
  • (leader unit has to start first, but must be cluster manager eligible)

Alternative solutions to that issue:

  • start highest unit as cluster manager eligible and restart it later to remove cluster-manager-eligible role (cons: extra restart, at what point do you restart unit—requires special case for initial start, could affect HA during initial start)
  • replace deterministic role re-balancing with non-deterministic role re-balancing (cons: upgrade cannot be coordinated without use of peer databag which harms rollback resiliency—detailed in #209 (comment) or without race for lock [which has other concerns & deviates from upgrade design used in other charms])

This also removes restarts (that were needed on a unit when its roles changed)

Context: https://chat.canonical.com/canonical/pl/756bhdey33ysjx3qdee3ktgoxo

[Upgrade] Keep the same `Paths` set for the old snap until unit is free to upgrade

At upgrade, two things should happen: (1) the refresh, which replaces the charm with the new charm version; and (2) the unit must wait for its turn before it can actually upgrade to the new version.

Therefore, for a certain time, we need to keep the original Paths, and the corresponding environment variables must be created with the old values. We also must account for the fact that Paths will be upgraded within a hook, which will:

  1. Start with the old snap revision
  2. Stop the service
  3. Run the upgrade of the snap
  4. Start the service with the new snap revision
  5. Finish hook

During this time, we need to be able to update Paths and environment variables accordingly.
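A minimal sketch of the idea, using a hypothetical helper (not the charm's actual API) that re-derives the Java home from whichever snap revision is active when the hook runs, rather than a value cached before the refresh:

from pathlib import Path


def current_java_home() -> str:
    """Hypothetical helper: resolve the JVM shipped with the active snap revision.

    /snap/opensearch/current always points at the currently installed revision,
    so the result stays valid before and after the snap refresh.
    """
    jvm_dir = Path("/snap/opensearch/current/usr/lib/jvm")
    candidates = sorted(jvm_dir.glob("java-*-openjdk-*"))
    if not candidates:
        raise RuntimeError("no bundled JVM found under the active snap revision")
    return str(candidates[-1])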

With current #242, this is the error that happens at this stage:

unit-opensearch-0: 20:13:16 ERROR unit.opensearch/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_plugin_manager.py", line 422, in _installed_plugins
    return self._opensearch.run_bin("opensearch-plugin", "list").split("\n")
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 182, in run_bin
    return self._run_cmd(script_path, args, stdin=stdin)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 338, in _run_cmd
    raise OpenSearchCmdError(output.stderr)
charms.opensearch.v0.opensearch_exceptions.OpenSearchCmdError: could not find java in OPENSEARCH_JAVA_HOME at /snap/opensearch/current/usr/lib/jvm/java-21-openjdk-amd64/bin/java


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 615, in _on_config_changed
    if self.plugin_manager.run():
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_plugin_manager.py", line 158, in run
    logger.debug(f"Status: {self.status(plugin)}")
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_plugin_manager.py", line 328, in status
    if not self._is_installed(plugin):
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_plugin_manager.py", line 346, in _is_installed
    return plugin.name in self._installed_plugins()
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_plugin_manager.py", line 424, in _installed_plugins
    raise OpenSearchPluginError("Failed to list plugins: " + str(e))
charms.opensearch.v0.opensearch_plugins.OpenSearchPluginError: Failed to list plugins: could not find java in OPENSEARCH_JAVA_HOME at /snap/opensearch/current/usr/lib/jvm/java-21-openjdk-amd64/bin/java


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-0/charm/./src/charm.py", line 264, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 627, in _on_config_changed
    self.status.set(BlockedStatus(PluginConfigChangeError), app=True)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/helper_charm.py", line 84, in set
    context.status = upgrade_status
  File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/ops/model.py", line 404, in status
    raise RuntimeError('cannot set application status as a non-leader unit')

[RFE] Extend plugin manager

The plugin manager should support more use cases:

  • More than one relation per plugin
  • Same relation for multiple plugins
  • A mix of configs and relations

Juju status does not show open ports

Steps to reproduce

  1. Deploy
  2. Run:
$ juju status

Expected behavior

I would expect to see the workload's open ports under the Ports column when running juju status.

Actual behavior

$ juju status
Model       Controller         Cloud/Region     Version  SLA          Timestamp
opensearch  google-controller  google/us-east1  2.9.44   unsupported  16:38:20-05:00

App                        Version  Status   Scale  Charm                      Channel      Rev  Exposed  Message
opensearch                          blocked      1  opensearch                 2/edge        28  no       1 or more 'replica' shards are not assigned, please scale your application up.
opensearch-di-admin                 blocked      1  data-integrator            latest/edge   13  no       Please specify either topic, index, or database name
tls-certificates-operator           active       1  tls-certificates-operator  latest/edge   27  no       

Unit                          Workload  Agent  Machine  Public address  Ports  Message
opensearch-di-admin/0*        blocked   idle   0        34.23.69.175           Please specify either topic, index, or database name
opensearch/0*                 active    idle   0        34.23.69.175           
tls-certificates-operator/0*  active    idle   0        34.23.69.175           

Machine  State    Address       Inst id        Series  AZ          Message
0        started  34.23.69.175  juju-952c6f-0  jammy   us-east1-b  RUNNING

Versions

Operating system: Ubuntu Jammy
Juju CLI: 2.9.44-ubuntu-amd64
Juju Controller: 2.9.44
Charm revision: 2/edge (rev 2)
Cloud Substrate: GCP

Additional context

Workaround

juju run -u opensearch/0 'open-port 9200/TCP'
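A possible charm-side fix is for the charm to declare the port via ops (a sketch, assuming ops 2.x where Unit.open_port is available), e.g. inside the start or config-changed handler:

# sketch: declare the HTTP port so `juju status` shows it
self.unit.open_port(protocol="tcp", port=9200)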

Unable to re-use lock after upgrade

#263 broke upgrades: whenever a unit checks whether it holds the OpenSearch lock, it runs

# Attempt to create document id 0
try:
    response = self._opensearch.request(
        "PUT",
        endpoint=f"/{self.OPENSEARCH_INDEX}/_create/0?refresh=true&wait_for_active_shards=all",
        host=host,
        alt_hosts=alt_hosts,
        retries=3,
        payload={"unit-name": self._charm.unit.name},
    )
except OpenSearchHttpError as e:
    if e.response_code == 409 and "document already exists" in e.response_body.get(
        "error", {}
    ).get("reason", ""):
        # Document already created
        pass
    else:
        logger.exception("Error creating OpenSearch lock document")
        return False

When a unit starts after upgrade, wait_for_active_shards=all causes a timeout (since the upgrading unit is offline)

example of timeout
unit-opensearch-2: 11:14:14 ERROR unit.opensearch/2.juju-log Error creating OpenSearch lock document
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.10/ssl.py", line 1303, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.10/ssl.py", line 1159, in read
    return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/urllib3/connectionpool.py", line 469, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/urllib3/connectionpool.py", line 358, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.139.243.54', port=9200): Read timed out. (read timeout=5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 272, in request
    resp = call(urls[0])
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 223, in call
    for attempt in Retrying(
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 250, in call
    response = s.request(**request_kwargs)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='10.139.243.54', port=9200): Read timed out. (read timeout=5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 264, in acquired
    response = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 279, in request
    raise OpenSearchHttpError(response_text=str(e))
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=None
self.response_text="HTTPSConnectionPool(host='10.139.243.54', port=9200): Read timed out. (read timeout=5)"

This could be worked around by checking if a unit already has the opensearch lock instead of trying to create it, but that would bypass this check:

# Ensure write was successful on all nodes
# "It is important to note that this setting [`wait_for_active_shards`] greatly
# reduces the chances of the write operation not writing to the requisite
# number of shard copies, but it does not completely eliminate the possibility,
# because this check occurs before the write operation commences. Once the
# write operation is underway, it is still possible for replication to fail on
# any number of shard copies but still succeed on the primary. The `_shards`
# section of the write operation’s response reveals the number of shard copies
# on which replication succeeded/failed."
# from
# https://www.elastic.co/guide/en/elasticsearch/reference/8.13/docs-index_.html#index-wait-for-active-shards
if response["_shards"]["failed"] > 0:
    logger.error("Failed to write OpenSearch lock document to all nodes")
    return False
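A minimal sketch of that workaround, reusing the helper names from the snippets above (illustrative only, not necessarily the eventual fix):

# Before attempting the create, check whether this unit already holds the lock.
# Note: this path skips the wait_for_active_shards check quoted above.
try:
    document = self._opensearch.request(
        "GET",
        endpoint=f"/{self.OPENSEARCH_INDEX}/_source/0",
        host=host,
        alt_hosts=alt_hosts,
        retries=3,
    )
    if document.get("unit-name") == self._charm.unit.name:
        return True  # lock already held by this unit
except OpenSearchHttpError:
    pass  # fall through to the create attempt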

Potential solutions:

Endpoints relation-data passed to client applications are refreshed in random order every `update-status`

Issue

Unsorted OpenSearch endpoints cause unnecessary relation-changed events on every update-status for client applications.
Because the endpoints are not sorted, their order is random each time, which registers as a 'change' even though nothing actually changed.

Steps to reproduce

Relate a charm to opensearch:opensearch-client, then log the relation-changed events and the endpoints they carry.

Expected behavior

Not getting relation-changed every update-status with the same content

Actual behavior

Getting relation-changed every update-status with the same content

Log output

(from a client application, printing out what it gets from the relation-changed event)

juju debug-log -l DEBUG --lines 10000 --exclude-module juju | grep self.charm.state.opensearch_server.endpoints | grep dashboards-32 | tail -n 5

unit-opensearch-dashboards-32: 15:54:10 INFO unit.opensearch-dashboards/32.juju-log opensearch_client:35: self.charm.state.opensearch_server.endpoints=['10.103.116.180:9200', '10.103.116.216:9200', '10.103.116.16:9200']
unit-opensearch-dashboards-32: 15:54:20 INFO unit.opensearch-dashboards/32.juju-log opensearch_client:35: self.charm.state.opensearch_server.endpoints=['10.103.116.216:9200', '10.103.116.180:9200', '10.103.116.16:9200']
unit-opensearch-dashboards-32: 15:54:30 INFO unit.opensearch-dashboards/32.juju-log opensearch_client:35: self.charm.state.opensearch_server.endpoints=['10.103.116.216:9200', '10.103.116.16:9200', '10.103.116.180:9200']
unit-opensearch-dashboards-32: 15:54:39 INFO unit.opensearch-dashboards/32.juju-log opensearch_client:35: self.charm.state.opensearch_server.endpoints=['10.103.116.16:9200', '10.103.116.180:9200', '10.103.116.216:9200']
unit-opensearch-dashboards-32: 15:54:48 INFO unit.opensearch-dashboards/32.juju-log opensearch_client:35: self.charm.state.opensearch_server.endpoints=['10.103.116.216:9200', '10.103.116.180:9200', '10.103.116.16:9200']
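A minimal sketch of the kind of fix this implies, assuming the endpoints are assembled as a Python list before being written to the relation databag (names are illustrative):

def endpoints_field(unit_ips: list[str], port: int = 9200) -> str:
    """Sort endpoints so that equal sets always serialize identically."""
    # Deterministic ordering avoids spurious relation-changed events on every update-status.
    return ",".join(sorted(f"{ip}:{port}" for ip in unit_ips))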

[STABILITY] Service stuck in `blocked` state: 1 or more 'replica' shards are not assigned, please scale your application up.

This is an old error; it occurred here (and in a few other data-integrator pipelines at the time).

A bugfix was provided at the time (https://warthogs.atlassian.net/browse/DPE-3573); however, this error still seems to hang around. Currently reproducible on the relation.

Recent occurrences:

Narrow scope of cluster health where upgrade should proceed

42d8bf5

Currently, between each upgrade step, we check whether cluster health is green or yellow.

It may be possible to narrow this to "just green", or to "green" plus "yellow in specific cases", so that we can be more confident about automatically proceeding with the upgrade and less likely to break things or lose data.

See https://chat.canonical.com/canonical/pl/s5j64ekxwi8epq53kzhd8fhrco and https://chat.canonical.com/canonical/pl/zaizx3bu3j8ftfcw67qozw9dbo

Destroying opensearch application results in 2/3 errored units

It seems it is no longer possible to remove the opensearch application without --force. In the end, it leaves 2 opensearch units, both in error with:

Model       Controller         Cloud/Region   Version  SLA          Timestamp
opensearch  aws-tf-controller  aws/us-east-1  3.4.2    unsupported  14:16:53+02:00

SAAS                             Status  Store              URL
alertmanager-karma-dashboard     active  aws-tf-controller  admin/cos.alertmanager-karma-dashboard
grafana-grafana-dashboard        active  aws-tf-controller  admin/cos.grafana-grafana-dashboard
loki-logging                     active  aws-tf-controller  admin/cos.loki-logging
prometheus-receive-remote-write  active  aws-tf-controller  admin/cos.prometheus-receive-remote-write

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
grafana-agent                      unknown      0  grafana-agent             latest/stable   65  no       
opensearch                         active       2  opensearch                2/edge          60  no       
self-signed-certificates           active       1  self-signed-certificates  latest/stable   72  no       
ubuntu                    22.04    active       1  ubuntu                    latest/stable   24  no       

Unit                         Workload  Agent  Machine  Public address   Ports  Message
opensearch/1*                error     idle   2        192.168.235.125         hook failed: "storage-detaching"
opensearch/2                 error     idle   3        192.168.235.252         hook failed: "storage-detaching"
self-signed-certificates/0*  active    idle   0        192.168.235.97          
ubuntu/0*                    active    idle   4        192.168.235.243         

Machine  State    Address          Inst id              Base          AZ          Message
0        started  192.168.235.97   i-071c5e7d69b1c481e  [email protected]  us-east-1a  running
2        started  192.168.235.125  i-088477c2ef3b121f9  [email protected]  us-east-1a  running
3        started  192.168.235.252  i-01d8ebfd3ae8d2a24  [email protected]  us-east-1a  running
4        started  192.168.235.243  i-074de4290a1500f16  [email protected]  us-east-1a  running

Full logs: https://pastebin.ubuntu.com/p/vHxJX9rWdr/

Core error being:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 272, in request
    resp = call(urls[0])
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 224, in call
    for attempt in Retrying(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 251, in call
    response.raise_for_status()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://192.168.235.252:9200/.charm_node_lock/_source/0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/./src/charm.py", line 94, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 467, in _on_opensearch_data_storage_detaching
    self.node_lock.release()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 327, in release
    if self._unit_with_lock(host) == self._charm.unit.name:
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 199, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 284, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}

[Large Deployments] Allow to override the `OpenSearchHealth.apply`

Currently, on large deployments, OpenSearchHealth.apply will return HealthColors.IGNORE for the orchestrator clusters. However, there are certain situations, e.g. upgrades, where we need to know the status of the cluster before executing the node upgrade.

Therefore, we need an option to override the check:

    def apply(
        self,
        wait_for_green_first: bool = False,
        use_localhost: bool = True,
        app: bool = True,
        override: bool = False,
    ) -> str:
        """Fetch cluster health and set it on the app status."""
        try:
            host = self._charm.unit_ip if use_localhost else None
            status = self._fetch_status(host, wait_for_green_first)

......

            # compute health only in clusters where data nodes exist
            if override:
                return status
            else:
                compute_health = (
                    deployment_desc.start == StartMode.WITH_GENERATED_ROLES
                    or "data" in deployment_desc.config.roles
                )
                if not compute_health:
                    return HealthColors.IGNORE
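Call sites that need the real status (e.g. pre-upgrade checks) would then pass the flag, along the lines of (illustrative, assuming the usual HealthColors members):

# illustrative call site during a pre-upgrade check
health = self.health.apply(wait_for_green_first=True, override=True)
if health not in (HealthColors.GREEN, HealthColors.YELLOW):
    event.defer()
    return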

Race condition due to mix of start/restart logic

When running the OpenSearchPluginManager, it fails to correctly restart all of the nodes after plugin installation/configuration. The only node that reliably gets restarted is the leader unit. I am occasionally getting clusters that have not accounted for the newly installed repository-s3 plugin (i.e. they did not restart after install). I can see the locks being acquired and then released, but journalctl does not show a restart at the corresponding time.

The first thing is that we are using the acquire_lock event with two different callbacks: _start_opensearch and _restart_opensearch. Looking at the RollingOps code, I can see it sets the callback_override function at acquire_lock.emit and then uses the method that was defined at constructor time. That means we are always using _start_opensearch.

The first step is to get the charm to use a single callback. _restart_opensearch is the de facto choice, as it optionally stops the running opensearch service.

However, _restart_opensearch first checks whether a starting flag is set in the peer relation. That is important at the moment of the very first start, but afterwards it can cause deadlocks (e.g. a subsequent _start_opensearch may set that flag and exit prematurely before removing it, after which no other _restart_opensearch call will be able to stop the service and no new configuration will be picked up).

The second step is to find a more suitable check to decide if we should stop the service.

The next issue: we are over-complicating _restart_opensearch by calling _start_opensearch inside it. _start_opensearch has multiple exit points that defer. That effectively means we are deferring the RunWithLockEvent, which means potentially new restarts.


I think there is a mix of tasks in this section of the code.

The start and restart should be distinct steps. Indeed, an OpenSearch start can be very complex if we consider large clusters, where we need a combination of on_start, peer_joined and config_changed events to bring in different pieces of information.

However, restart should be simple. The restart means the lock has been given to the unit and it must either do a restart quickly or pass the lock onwards to other units.

We should break up the starting logic: have a step-by-step flow that deliberately starts its unit manually, since the cluster is not yet set up. Then, once that logic is finished, we can start calling RollingOps to manage locks and restarts for us.

[Large Deployments] `_rel_err_data` will always return `should_sever_relation = True` for main orchestrator

The check on this line is:

         elif orchestrators.failover_app and orchestrators.failover_app != self.charm.app.name:  # <<<-----------
             should_sever_relation = True
             blocked_msg = (
                 "Cannot have 2 'failover'-orchestrators. Relate to the existing failover."
             )
         elif not self.charm.is_admin_user_configured():

The check: orchestrators.failover_app != self.charm.app.name will always be true. I believe that line should instead be:

         ... and orchestrators.failover_app == self.charm.app.name:

Here we check whether the orchestrators' failover application is the application of the current cluster. That safeguards against a cluster that used to be main and was recently demoted.

Juju status

Model              Controller           Cloud/Region         Version  SLA          Timestamp
test-backups-5a8g  localhost-localhost  localhost/localhost  3.4.2    unsupported  17:34:04+02:00

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
data-hot                           blocked      2  opensearch                                 1  no       Cannot have 2 'failover'-orchestrators. Relate to the existing failover.
failover                           blocked      1  opensearch                                 2  no       Cannot have 2 'failover'-orchestrators. Relate to the existing failover.
main                               active       2  opensearch                                 0  no       
s3-integrator                      active       1  s3-integrator             latest/edge     17  no       
self-signed-certificates           active       1  self-signed-certificates  latest/stable   72  no       

Unit                         Workload  Agent      Machine  Public address  Ports  Message
data-hot/0*                  active    idle       3        10.41.46.161           
data-hot/1                   active    idle       6        10.41.46.204           
failover/0*                  active    idle       2        10.41.46.242           
main/0                       active    idle       4        10.41.46.197           
main/1*                      active    executing  5        10.41.46.145           
s3-integrator/0*             active    idle       1        10.41.46.58            
self-signed-certificates/0*  active    idle       0        10.41.46.3             

Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.41.46.3    juju-6d82d4-0  [email protected]      Running
1        started  10.41.46.58   juju-6d82d4-1  [email protected]      Running
2        started  10.41.46.242  juju-6d82d4-2  [email protected]      Running
3        started  10.41.46.161  juju-6d82d4-3  [email protected]      Running
4        started  10.41.46.197  juju-6d82d4-4  [email protected]      Running
5        started  10.41.46.145  juju-6d82d4-5  [email protected]      Running
6        started  10.41.46.204  juju-6d82d4-6  [email protected]      Running

Integration provider                   Requirer                           Interface            Type     Message
data-hot:node-lock-fallback            data-hot:node-lock-fallback        node_lock_fallback   peer     
data-hot:opensearch-peers              data-hot:opensearch-peers          opensearch_peers     peer     
failover:node-lock-fallback            failover:node-lock-fallback        node_lock_fallback   peer     
failover:opensearch-peers              failover:opensearch-peers          opensearch_peers     peer     
failover:peer-cluster-orchestrator     data-hot:peer-cluster              peer_cluster         regular  
main:node-lock-fallback                main:node-lock-fallback            node_lock_fallback   peer     
main:opensearch-peers                  main:opensearch-peers              opensearch_peers     peer     
main:peer-cluster-orchestrator         data-hot:peer-cluster              peer_cluster         regular  
main:peer-cluster-orchestrator         failover:peer-cluster              peer_cluster         regular  
s3-integrator:s3-integrator-peers      s3-integrator:s3-integrator-peers  s3-integrator-peers  peer     
self-signed-certificates:certificates  data-hot:certificates              tls-certificates     regular  
self-signed-certificates:certificates  failover:certificates              tls-certificates     regular  
self-signed-certificates:certificates  main:certificates                  tls-certificates     regular  

Steps to reproduce

Deploy as follows:

juju deploy tls-certificates-operator --channel stable --show-log --verbose
juju config tls-certificates-operator generate-self-signed-certificates=true ca-common-name="CN_CA"

# deploy main-orchestrator cluster 
juju deploy -n 3 ./opensearch.charm \
    main \
    --config cluster_name="log-app" --config init_hold=false --config roles="cluster_manager"

# deploy failover-orchestrator cluster
juju deploy -n 2 ./opensearch.charm \
    failover \
    --config cluster_name="log-app" --config init_hold=true --config roles="cluster_manager"

# deploy data-hot cluster
juju deploy -n 2 ./opensearch.charm \
    data-hot \
    --config cluster_name="log-app" --config init_hold=true --config roles="data.hot"

# integrate TLS
juju integrate tls-certificates-operator main
juju integrate tls-certificates-operator failover
juju integrate tls-certificates-operator data-hot

# integrate the "main"-orchestrator with all clusters:
juju integrate main:peer-cluster-orchestrator failover:peer-cluster
juju integrate main:peer-cluster-orchestrator data-hot:peer-cluster
juju integrate failover:peer-cluster-orchestrator data-hot:peer-cluster

Expected behavior

Should render an all-green deployment.

Actual behavior

Non main orchestrators are stuck in "blocked" on app level

`resume-upgrade` fails if highest unit is also the leader unit

The resume-upgrade fails with:

Running operation 7 with 1 task
  - task 8 on unit-failover-1

Waiting for task 8...
Action id 8 failed: Highest number unit is unhealthy. Upgrade will not resume.

This happens when the leader unit is the unit with the highest identifier.

Using pdb, I can confirm the following stack:

  /var/lib/juju/agents/unit-failover-1/charm/src/charm.py(267)<module>()
-> main(OpenSearchOperatorCharm)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(544)main()
-> manager.run()
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(520)run()
-> self._emit()
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(509)_emit()
-> _emit_charm_event(self.charm, self.dispatcher.event_name)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(143)_emit_charm_event()
-> event_to_emit.emit(*args, **kwargs)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(352)emit()
-> framework._emit(event)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(851)_emit()
-> self._reemit(event_path)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(941)_reemit()
-> custom_handler(event)
  /var/lib/juju/agents/unit-failover-1/charm/src/charm.py(188)_on_resume_upgrade_action()
-> self._upgrade.reconcile_partition(action_event=event)
> /var/lib/juju/agents/unit-failover-1/charm/src/machine_upgrade.py(114)reconcile_partition()
-> unhealthy = state is not upgrade.UnitState.HEALTHY

The charm fails because `state` reports:

(Pdb) state
<UnitState.UPGRADING: 'upgrading'>

Full Status:

Model                                Controller           Cloud/Region         Version  SLA          Timestamp
test-large-deployment-upgrades-36oo  localhost-localhost  localhost/localhost  3.4.2    unsupported  16:59:24+02:00

App                       Version  Status   Scale  Charm                               Channel        Rev  Exposed  Message
failover                           blocked      2  opensearch                                           1  no       Upgrading. Verify highest unit is healthy & run `resume-upgrade` action. To rollback, `juju refresh` to last revision
main                               active       1  pguimaraes-opensearch-upgrade-test  latest/edge     19  no       
opensearch                         active       3  opensearch                                           0  no       
self-signed-certificates           active       1  self-signed-certificates            latest/stable   72  no       

Unit                         Workload  Agent      Machine  Public address  Ports     Message
failover/0                   active    idle       0        10.173.208.166  9200/tcp  OpenSearch 2.12.0 running; Snap rev 40 (outdated); Charmed operator 1+3cebf31-dirty+3cebf31-dirty+3cebf31-dirty+3cebf...
failover/1*                  active    executing  1        10.173.208.236  9200/tcp  (resume-upgrade) OpenSearch 2.12.0 running; Snap rev 44; Charmed operator 1+3cebf31-dirty+3cebf31-dirty+3cebf31-dirty...
main/0*                      active    idle       2        10.173.208.119  9200/tcp  
opensearch/0                 active    idle       3        10.173.208.182  9200/tcp  
opensearch/1*                active    idle       4        10.173.208.21   9200/tcp  
opensearch/2                 active    idle       5        10.173.208.245  9200/tcp  
self-signed-certificates/0*  active    idle       6        10.173.208.15             

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.173.208.166  juju-bb32e7-0  [email protected]      Running
1        started  10.173.208.236  juju-bb32e7-1  [email protected]      Running
2        started  10.173.208.119  juju-bb32e7-2  [email protected]      Running
3        started  10.173.208.182  juju-bb32e7-3  [email protected]      Running
4        started  10.173.208.21   juju-bb32e7-4  [email protected]      Running
5        started  10.173.208.245  juju-bb32e7-5  [email protected]      Running
6        started  10.173.208.15   juju-bb32e7-6  [email protected]      Running

[STABILITY] Opensearch charm gets stuck at install: Waiting for OpenSearch to start...

More details can be found on the pipelines where the issue occurs:

What's worth mentioning here is that this issue is different from #219, where there's an attempt to access the service.

Here the service is seemingly stuck, with NO particular issue communicated.

At some point we simply run into a timeout.

Incorrect usage of `emit()` or incorrect comment

This block of code does not behave as the comment describes:

# since when an IP change happens, "_on_peer_relation_joined" won't be called,
# we need to alert the leader that it must recompute the node roles for any unit whose
# roles were changed while the current unit was cut-off from the rest of the network
self.on[PeerRelationName].relation_joined.emit(
    self.model.get_relation(PeerRelationName)
)

relation-joined will only be emitted on that unit (where

if self.opensearch_config.update_host_if_needed():
evaluates to True)

And that unit will immediately return

def _on_peer_relation_joined(self, event: RelationJoinedEvent):
    """Event received by all units when a new node joins the cluster."""
    if not self.unit.is_leader():
        return

More info:

Note that the emission of custom events is handled immediately. In other words, custom events are not queued, but rather nested. For example:

1. Main hook handler (emits custom_event_1)
2.   Custom event 1 handler (emits custom_event_2)
3.     Custom event 2 handler
4.   Resume custom event 1 handler
5. Resume main hook handler

https://ops.readthedocs.io/en/latest/#ops.BoundEvent.emit

Upgrade fails if the leader is the highest unit

If the leader is the highest unit in the cluster, then the resume-upgrade will fail with: Highest number unit is unhealthy. Upgrade will not resume.

The reason is that the leader, being the highest unit, has already done its own upgrade and moved from UnitState.HEALTHY to UnitState.UPGRADING:

-> if outdated or unhealthy:
(Pdb) l
107                 outdated = (
108                     self._unit_workload_container_versions.get(unit.name)
109                     != self._app_workload_container_version
110                 )
111                 unhealthy = state is not upgrade.UnitState.HEALTHY
112  ->             if outdated or unhealthy:
113                     if outdated:
114                         message = "Highest number unit has not upgraded yet. Upgrade will not resume."
115                     else:
116                         message = "Highest number unit is unhealthy. Upgrade will not resume."
117                     logger.debug(f"Resume upgrade event failed: {message}")
(Pdb) p unhealthy
True
(Pdb) p state
<UnitState.UPGRADING: 'upgrading'>

The check should be instead:

unhealthy = state not in [upgrade.UnitState.HEALTHY, upgrade.UnitState.UPGRADING]

OpenSearch fails if grafana-agent is related since the start

When deploying OpenSearch from 2/edge and relating it right away with grafana-agent, it fails with:

2024-03-04 17:51:57,868 DEBUG    `cos-tool` unavailable. Leaving expression unchanged: sum by (cluster, instance, node) (opensearch_jvm_mem_heap_used_percent) > 75

2024-03-04 17:51:57,871 DEBUG    `cos-tool` unavailable. Leaving expression unchanged: sum by (cluster, instance, node) (opensearch_os_cpu_percent) > 90

2024-03-04 17:51:57,873 DEBUG    `cos-tool` unavailable. Leaving expression unchanged: sum by (cluster, instance, node) (opensearch_process_cpu_percent) > 90

2024-03-04 17:51:57,876 DEBUG    Reading <property object at 0x7f247ca65c10> rule from src/alert_rules/prometheus/prometheus_alerts.yaml
2024-03-04 17:51:57,879 DEBUG    Alert rules path does not exist: src/loki_alert_rules
2024-03-04 17:51:57,909 ERROR    Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-5/charm/./src/charm.py", line 94, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-5/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-5/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-5/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-5/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-5/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-5/charm/lib/charms/grafana_agent/v0/cos_agent.py", line 376, in _on_refresh
    metrics_scrape_jobs=self._scrape_jobs,
  File "/var/lib/juju/agents/unit-opensearch-5/charm/lib/charms/grafana_agent/v0/cos_agent.py", line 394, in _scrape_jobs
    scrape_configs = self._scrape_configs()
  File "/var/lib/juju/agents/unit-opensearch-5/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 1086, in _scrape_config
    ca = app_secrets.get("ca-cert")
AttributeError: 'NoneType' object has no attribute 'get'

Juju status:

Model       Controller         Cloud/Region   Version  SLA          Timestamp
opensearch  aws-tf-controller  aws/us-east-1  3.4.0    unsupported  18:59:14+01:00

SAAS                             Status  Store              URL
alertmanager-karma-dashboard     active  aws-tf-controller  admin/cos.alertmanager-karma-dashboard
grafana-grafana-dashboard        active  aws-tf-controller  admin/cos.grafana-grafana-dashboard
loki-logging                     active  aws-tf-controller  admin/cos.loki-logging
prometheus-metrics-endpoint      active  aws-tf-controller  admin/cos.prometheus-metrics-endpoint
prometheus-receive-remote-write  active  aws-tf-controller  admin/cos.prometheus-receive-remote-write

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
grafana-agent                      active       3  grafana-agent             stable          28  no       
opensearch                         active       3  opensearch                2/edge          39  yes      
self-signed-certificates           active       1  self-signed-certificates  latest/stable   72  no       
sysconfig                          blocked      3  sysconfig                 stable          33  no       update-grub and reboot required. Changes in: /etc/default/grub.d/90-sysconfig.cfg

Unit                         Workload     Agent      Machine  Public address   Ports  Message
opensearch/3                 maintenance  executing  4        192.168.235.136         Plugin configuration started.
  grafana-agent/2            active       idle                192.168.235.136         
  sysconfig/2                blocked      idle                192.168.235.136         update-grub and reboot required. Changes in: /etc/default/grub.d/90-sysconfig.cfg
opensearch/4                 maintenance  executing  5        192.168.235.232         Plugin configuration started.
  grafana-agent/1            active       idle                192.168.235.232         
  sysconfig/1                blocked      idle                192.168.235.232         update-grub and reboot required. Changes in: /etc/default/grub.d/90-sysconfig.cfg
opensearch/5*                active       executing  6        192.168.235.198         
  grafana-agent/0*           active       idle                192.168.235.198         
  sysconfig/0*               blocked      idle                192.168.235.198         update-grub and reboot required. Changes in: /etc/default/grub.d/90-sysconfig.cfg
self-signed-certificates/0*  active       idle       0        192.168.235.38          

Machine  State    Address          Inst id              Base          AZ          Message
0        started  192.168.235.38   i-0cbe3dd0e477c2b80  [email protected]  us-east-1a  running
4        started  192.168.235.136  i-09a8a15dbebd3cf1e  [email protected]  us-east-1a  running
5        started  192.168.235.232  i-0646338afc099528e  [email protected]  us-east-1a  running
6        started  192.168.235.198  i-0e7cd4c302b0d77bf  [email protected]  us-east-1a  running

Integration provider                                  Requirer                                     Interface                Type         Message
grafana-agent:grafana-dashboards-provider             grafana-grafana-dashboard:grafana-dashboard  grafana_dashboard        regular      
grafana-agent:peers                                   grafana-agent:peers                          grafana_agent_replica    peer         
loki-logging:logging                                  grafana-agent:logging-consumer               loki_push_api            regular      
opensearch:cos-agent                                  grafana-agent:cos-agent                      cos_agent                subordinate  
opensearch:juju-info                                  sysconfig:juju-info                          juju-info                subordinate  
opensearch:opensearch-peers                           opensearch:opensearch-peers                  opensearch_peers         peer         
opensearch:service                                    opensearch:service                           rolling_op               peer         
prometheus-receive-remote-write:receive-remote-write  grafana-agent:send-remote-write              prometheus_remote_write  regular      
self-signed-certificates:certificates                 opensearch:certificates                      tls-certificates         regular  
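A small defensive sketch for the failing line in _scrape_config (illustrative; the point is that the peer app secrets may simply not exist yet when the relation fires this early):

# Guard against the peer relation's app data not being populated yet.
ca = (app_secrets or {}).get("ca-cert")
if ca is None:
    return []  # no scrape jobs until the TLS material is available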

[unit test] Move to a single framework: `pytest` and update secret mocking

We are currently using unittest as our main framework. However, as we move to the data platform workflows, we would gain a lot by moving to pytest instead.

One important point raised by @carlcsaposs-canonical is how we can parametrize Juju versions for secret management:
https://github.com/canonical/mysql-router-k8s-operator/blob/f2cbb11ba9c333563acbb9f9b1e159adbded15b6/tests/unit/conftest.py#L60-L63

When testing this with OpenSearch, I got the following errors:

_________________________________________________________________________________ ERROR at setup of TestOpenSearchInternalData.test_data_has_0_app _________________________________________________________________________________
test_data_has_0_app does not support fixtures, maybe unittest.TestCase subclass?
Node id: tests/unit/lib/test_opensearch_secrets.py::TestOpenSearchInternalData::test_data_has_0_app
Function type: TestCaseFunction

Also, we must remove the parameterized dependency and unify everything under pytest.mark.parametrize.
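A sketch of the kind of fixture that could replace the parameterized usage and parametrize Juju secrets support (names and the patched target are illustrative, not the final conftest):

import pytest


@pytest.fixture(params=[True, False], ids=["juju-secrets", "no-juju-secrets"])
def juju_has_secrets(request, monkeypatch):
    """Run each test once with and once without Juju secrets support."""
    monkeypatch.setattr(
        "ops.JujuVersion.has_secrets", property(lambda _self: request.param)
    )
    return request.param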

ha/test_ha_networking.py fails to run lxc command since lxd 5.21

test passed on 2024-04-11 with lxd 5.20-f3dd836 snap rev 27049 https://github.com/canonical/opensearch-operator/actions/runs/8640068334/job/23687547198
test failed on 2024-04-12 with lxd 5.21.1-43998c6 snap rev 28155 https://github.com/canonical/opensearch-operator/actions/runs/8655512868/job/23734702946#step:21:1686

 _________ test_full_network_cut_without_ip_change_node_with_elected_cm _________
Traceback (most recent call last):
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/_pytest/runner.py", line 341, in from_call
    result: Optional[TResult] = func()
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/_pytest/runner.py", line 262, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_callers.py", line 181, in _multicall
    return outcome.get_result()
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_result.py", line 99, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/_pytest/runner.py", line 177, in pytest_runtest_call
    raise e
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/_pytest/runner.py", line 169, in pytest_runtest_call
    item.runtest()
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/_pytest/python.py", line 1792, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_callers.py", line 181, in _multicall
    return outcome.get_result()
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_result.py", line 99, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/_pytest/python.py", line 194, in pytest_pyfunc_call
    result = testfunction(**testargs)
  File "/home/runner/work/opensearch-operator/opensearch-operator/.tox/integration/lib/python3.10/site-packages/pytest_asyncio/plugin.py", line 532, in inner
    _loop.run_until_complete(task)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/runner/work/opensearch-operator/opensearch-operator/tests/integration/ha/test_ha_networking.py", line 330, in test_full_network_cut_without_ip_change_node_with_elected_cm
    await cut_network_from_unit_without_ip_change(ops_test, app, first_elected_cm_unit_id)
  File "/home/runner/work/opensearch-operator/opensearch-operator/tests/integration/ha/helpers.py", line 347, in cut_network_from_unit_without_ip_change
    subprocess.check_call(limit_set_command.split())
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['lxc', 'config', 'set', 'juju-fa995e-2', 'limits.network.priority=10']' returned non-zero exit status 1.

https://chat.canonical.com/canonical/pl/877qxrhr37d99n3qibhgdayyra

[backup] Tests occasionally fail after the 30m mark with "repository-s3 unreachable"

I am noticing that, occasionally, we get test failures because OpenSearch is no longer able to reach its S3 repository.

The CI run eventually fails and the logs point to:

2024-05-06T19:31:54.7439602Z unit-main-0: 19:26:12 ERROR unit.main/0.juju-log s3-credentials:19: Request PUT to https://10.81.35.210:9200/_snapshot/s3-repository with payload: {'type': 's3', 'settings': {'endpoint': 'http://localhost', 'protocol': 'http', 'bucket': 'error', 'base_path': '/', 'region': 'default'}} failed.(Attempts left: 5)
2024-05-06T19:31:54.7443652Z 	Error: 500 Server Error: Internal Server Error for url: https://10.81.35.210:9200/_snapshot/s3-repository
2024-05-06T19:31:54.7447603Z unit-main-0: 19:26:14 ERROR unit.main/0.juju-log s3-credentials:19: Request PUT to https://10.81.35.210:9200/_snapshot/s3-repository with payload: {'type': 's3', 'settings': {'endpoint': 'http://localhost', 'protocol': 'http', 'bucket': 'error', 'base_path': '/', 'region': 'default'}} failed.(Attempts left: 4)
2024-05-06T19:31:54.7451431Z 	Error: 500 Server Error: Internal Server Error for url: https://10.81.35.210:9200/_snapshot/s3-repository
2024-05-06T19:31:54.7455157Z unit-main-0: 19:26:15 ERROR unit.main/0.juju-log s3-credentials:19: Request PUT to https://10.81.35.210:9200/_snapshot/s3-repository with payload: {'type': 's3', 'settings': {'endpoint': 'http://localhost', 'protocol': 'http', 'bucket': 'error', 'base_path': '/', 'region': 'default'}} failed.(Attempts left: 3)
2024-05-06T19:31:54.7458823Z 	Error: 500 Server Error: Internal Server Error for url: https://10.81.35.210:9200/_snapshot/s3-repository
2024-05-06T19:31:54.7462450Z unit-main-0: 19:26:17 ERROR unit.main/0.juju-log s3-credentials:19: Request PUT to https://10.81.35.210:9200/_snapshot/s3-repository with payload: {'type': 's3', 'settings': {'endpoint': 'http://localhost', 'protocol': 'http', 'bucket': 'error', 'base_path': '/', 'region': 'default'}} failed.(Attempts left: 2)
2024-05-06T19:31:54.7466642Z 	Error: 500 Server Error: Internal Server Error for url: https://10.81.35.210:9200/_snapshot/s3-repository
2024-05-06T19:31:54.7471900Z unit-main-0: 19:26:18 ERROR unit.main/0.juju-log s3-credentials:19: Request PUT to https://10.81.35.210:9200/_snapshot/s3-repository with payload: {'type': 's3', 'settings': {'endpoint': 'http://localhost', 'protocol': 'http', 'bucket': 'error', 'base_path': '/', 'region': 'default'}} failed.(Attempts left: 1)
2024-05-06T19:31:54.7475420Z 	Error: 500 Server Error: Internal Server Error for url: https://10.81.35.210:9200/_snapshot/s3-repository
2024-05-06T19:31:54.7477620Z unit-main-0: 19:26:19 ERROR unit.main/0.juju-log s3-credentials:19: Failed to setup backup service with state repository s3 is unreachable

`OpenSearchUserManager.create_user` does not distinguish between creation error and user already exists

The _start_opensearch method constantly fails at _post_start_init with the errors below.

unit-opensearch-0: 15:04:27 ERROR unit.opensearch/0.juju-log creating user monitor failed
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 826, in _start_opensearch
    self._post_start_init(event)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 925, in _post_start_init
    self._put_or_update_internal_user_leader(COSUser)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 1115, in _put_or_update_internal_user_leader
    self.user_manager.put_internal_user(user, hashed_pwd)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_users.py", line 309, in put_internal_user
    self.create_user(COSUser, roles, hashed_pwd)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_users.py", line 179, in create_user
    raise OpenSearchUserMgmtError(f"creating user {user_name} failed")
charms.opensearch.v0.opensearch_users.OpenSearchUserMgmtError: creating user monitor failed

That happens because create_user does not distinguish between a request that genuinely failed and one that "failed" only because the user already exists. In the latter case the response returns status_code==200 and status=="OK". The create_role method has a similar problem but better error-handling logic:

        if resp.get("status") != "CREATED" and not (
            resp.get("status") == "OK" and "updated" in resp.get("message")
        ):
            logging.error(f"Couldn't create role: {resp}")
            raise OpenSearchUserMgmtError(f"creating role {role_name} failed")
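A minimal sketch of how create_user could apply the same idea; the response shape is inferred from the create_role snippet above and is an assumption, not the actual implementation:

# Treat "user already exists" (HTTP 200, status "OK") as success rather than
# raising; only genuinely unexpected responses should fail the hook.
if resp.get("status") not in ("CREATED", "OK"):
    logging.error(f"Couldn't create user: {resp}")
    raise OpenSearchUserMgmtError(f"creating user {user_name} failed")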

hook "leader-elected" fails when adding a unit after scale down to zero units

Steps to reproduce

juju add-model opensearch
# apply the kernel parameters required for opensearch
juju model-config --file ./cloudinit-userdata.yaml
juju create-storage-pool opensearch-storage lxd volume-type=standard
juju deploy opensearch -n 2 --channel 2/edge --storage opensearch-data=opensearch-storage,1G,1
juju deploy self-signed-certificates
juju config self-signed-certificates ca-common-name="CN_CA"
juju relate self-signed-certificates opensearch
juju remove-unit opensearch/1
juju remove-unit opensearch/0
juju add-unit opensearch --attach-storage=opensearch-data/0

Expected behavior

The newly added unit should start up without error.

Actual behavior

$ juju status --storage
Model  Controller  Cloud/Region         Version  SLA          Timestamp
dev    opensearch  localhost/localhost  3.1.8    unsupported  06:52:18Z

App                       Version  Status  Scale  Charm                     Channel  Rev  Exposed  Message
opensearch                         active      1  opensearch                           1  no       
self-signed-certificates           active      1  self-signed-certificates  stable    72  no       

Unit                         Workload  Agent  Machine  Public address  Ports  Message
opensearch/2*                error     idle   5        10.27.170.244          hook failed: "leader-elected"
self-signed-certificates/0*  active    idle   2        10.27.170.141          

Machine  State    Address        Inst id        Base          AZ  Message
2        started  10.27.170.141  juju-622e8b-2  [email protected]      Running
5        started  10.27.170.244  juju-622e8b-5  [email protected]      Running

Storage Unit  Storage ID         Type        Pool                Mountpoint                   Size     Status    Message
              opensearch-data/1  filesystem  opensearch-storage                               1.0 GiB  detached  
opensearch/2  opensearch-data/0  filesystem  opensearch-storage  /var/snap/opensearch/common  1.0 GiB  attached  

Versions

Operating system: Ubuntu 24.04 LTS, Ubuntu 22.04 LTS
Juju CLI: 3.1.8-genericlinux-amd64
Juju agent: 3.1.8
Charm revision: 47
LXD: 5.21.1 LTS

Log output

unit-opensearch-2: 06:53:05 ERROR unit.opensearch/2.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-2/charm/./src/charm.py", line 267, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 302, in _on_leader_elected
    self._put_or_update_internal_user_leader(user)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 1244, in _put_or_update_internal_user_leader
    self.user_manager.update_user_password(user, hashed_pwd)
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_users.py", line 268, in update_user_password
    resp = self.opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-2/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 266, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=None
self.response_text='Host 10.27.170.244:9200 and alternative_hosts: [] not reachable.'
unit-opensearch-4: 06:53:06 ERROR juju.worker.uniter.operation hook "leader-elected" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

I assume the issue is with security_index_initialised, which is no longer present in the peer data:

$ jhack show-relation opensearch:opensearch-peers opensearch:opensearch-peers
                                                                                             relation data v0.6                                                                                             
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ peer relation (id: 2) ┃ opensearch                                                                                                                                                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ type                  │ peer                                                                                                                                                                             │
│ interface             │ opensearch_peers                                                                                                                                                                 │
│ model                 │ the current model                                                                                                                                                                │
│ relation ID           │ 2                                                                                                                                                                                │
│ endpoint              │ opensearch-peers                                                                                                                                                                 │
│ leader unit           │ 2                                                                                                                                                                                │
├───────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ application data      │ ╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │
│                       │ │                                                                                                                                                                              │ │
│                       │ │  admin_user_initialized                     True                                                                                                                             │ │
│                       │ │  allocation-exclusions-to-delete            ,opensearch-2                                                                                                                    │ │
│                       │ │  delete-voting-exclusions                   True                                                                                                                             │ │
│                       │ │  deployment-description                     {"config": {"cluster_name": "opensearch-attz", "init_hold": false, "roles": [], "data_temperature": null}, "start":              │ │
│                       │ │                                             "start-with-generated-roles", "pending_directives": [], "typ": "main-orchestrator", "app": "opensearch", "state": {"value":      │ │
│                       │ │                                             "active", "message": ""}, "promotion_time": 1716446675.797672}                                                                   │ │
│                       │ │  opensearch:app:admin-password              secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7ebls8c16j9paghi7g                                                               │ │
│                       │ │  opensearch:app:admin-password-hash         secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7ebls8c16j9paghi80                                                               │ │
│                       │ │  opensearch:app:app-admin                   secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eblc8c16j9paghi50                                                               │ │
│                       │ │  opensearch:app:kibanaserver-password       secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eblk8c16j9paghi6g                                                               │ │
│                       │ │  opensearch:app:kibanaserver-password-hash  secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eblk8c16j9paghi70                                                               │ │
│                       │ │  opensearch:app:monitor-password            secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7ec248c16j9paghib0                                                               │ │
│                       │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ unit data             │ ╭─ opensearch/opensearch/2 ──────────────────────────────────────────────────────────────────────────────╮                                                                       │
│                       │ │                                                                                                        │                                                                       │
│                       │ │  opensearch:unit:2:unit-http       secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eevc8c16j9paghic0  │                                                                       │
│                       │ │  opensearch:unit:2:unit-transport  secret://d95bf0dc-53cc-4a8c-8f9e-538bd7622e8b/cp7eevc8c16j9paghibg  │                                                                       │
│                       │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯                                                                       │
└───────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

This is where an adjustment might be necessary: https://github.com/canonical/opensearch-operator/blob/main/lib/charms/opensearch/v0/opensearch_base_charm.py#L271

[LOW IMPORTANCE][STABILITY] KeyError: 'node.roles'

I've seen this error a few times on local runs, so I am adding this ticket to flag it. If it occurs to others or on pipelines, please add more info to the ticket.

The following exception has been raised a couple of times:

unit-opensearch-1: 11:00:28 ERROR unit.opensearch/1.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 408, in current
    nodes = self.request("GET", f"/_nodes/{self.node_id}", alt_hosts=self._charm.alt_hosts)
  File "/usr/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 375, in node_id
    nodes = self.request("GET", "/_nodes").get("nodes")
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 297, in request
    resp = call(retries, resp_status_code)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 252, in call
    raise OpenSearchHttpError()
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/./src/charm.py", line 94, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 456, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 943, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 434, in _on_opensearch_data_storage_detaching
    self._stop_opensearch()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 825, in _stop_opensearch
    self.opensearch_exclusions.delete_current()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_nodes_exclusions.py", line 58, in delete_current
    self._node.is_cm_eligible() or self._node.is_voting_only()
  File "/usr/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_nodes_exclusions.py", line 161, in _node
    return self._charm.opensearch.current()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 423, in current
    roles=conf_on_disk["node.roles"],
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ruamel/yaml/comments.py", line 842, in __getitem__
    return ordereddict.__getitem__(self, key)
KeyError: 'node.roles'

It may be worth taking a look; it may just be a programming oversight.
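A possible defensive fix, assuming conf_on_disk is the parsed on-disk YAML configuration referenced in the traceback; defaulting to an empty role list is my assumption, not the charm's current behaviour:

# Avoid the KeyError when "node.roles" has not been written to the on-disk
# config yet: fall back to an empty role list instead of crashing the hook.
roles = conf_on_disk.get("node.roles", [])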

OpenSearch Snap depends on snapd 2.60+

Commit: canonical/opensearch-snap@e9964ef

The opensearch-snap repository added a dependency on snapd 2.60; otherwise, installing the opensearch snap fails with permission denied on /dev/shm/performance-analyzer.
However, not all deployments will start with that snapd version, so snapd should be refreshed before the installation can proceed.

We should add something like:


    def __init__(self, charm, peer_relation: str):
        super().__init__(charm, peer_relation)

        for attempt in Retrying(stop=stop_after_attempt(5), wait=wait_fixed(wait=5)):
            with attempt:
                cache = snap.SnapCache()
                self._opensearch = cache["opensearch"]
+               self._snapd = cache["snapd"]


    @override
    def install(self):
        """Install opensearch from the snapcraft store."""
        if self._opensearch.present:
            return

        try:
+           self._snapd.ensure(snap.SnapState.Latest, channel="latest/stable")
            self._opensearch.ensure(snap.SnapState.Latest, channel="edge")
            self._opensearch.connect("process-control")
+           self._opensearch.connect("shmem-perf-analyzer")
        except SnapError as e:
            logger.error(f"Failed to install opensearch. \n{e}")
            raise OpenSearchInstallError()

Clean-up `relation_changed.emit()` calls

I noticed that we are generating a disproportionate amount of opensearch_peers_relation_changed events that get constantly deferred. That significantly increases the startup time.

One example:

2024-04-26 14:16:51 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: Re-emitting deferred event <ConfigChangedEvent via OpenSearchOperatorCharm/on/config_changed[36]>.
2024-04-26 14:16:52 WARNING unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: 'app' expected but not received.
2024-04-26 14:16:52 WARNING unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: 'app_name' expected in snapshot but not found.
2024-04-26 14:16:52 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: Emitting custom event <RelationChangedEvent via OpenSearchOperatorCharm/on/opensearch_peers_relation_changed[47]>.
2024-04-26 14:16:52 ERROR unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: [Errno 111] Connection refused
2024-04-26 14:16:52 ERROR unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: [Errno 111] Connection refused
2024-04-26 14:16:52 ERROR unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: [Errno 111] Connection refused
2024-04-26 14:16:53 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: Deferring <RelationChangedEvent via OpenSearchOperatorCharm/on/opensearch_peers_relation_changed[47]>.
2024-04-26 14:16:53 ERROR unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: [Errno 111] Connection refused
2024-04-26 14:16:53 WARNING unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: Plugin management: cluster not ready yet at config changed
2024-04-26 14:16:53 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: Deferring <ConfigChangedEvent via OpenSearchOperatorCharm/on/config_changed[36]>.

We can see that RelationChangedEvent via OpenSearchOperatorCharm/on/opensearch_peers_relation_changed gets constantly created and deferred in the process.

My recommendation is to replace any known relation_* emits with direct calls to their handler functions.
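A minimal sketch of that recommendation; the event and handler names are illustrative placeholders, not the charm's actual code:

# Instead of queueing another event that is likely to be deferred again:
#     self.on[PEER_RELATION].relation_changed.emit(relation, app=self.app)
# run the same logic synchronously within the current hook:
self._on_peer_relation_changed(event)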

Race condition in h-scaling-integration between Coordinator, Juju app leader and node going away

I am seeing what seems to me as a race condition between several nodes in the h-scaling-integration, at our CI.

This is one example of failed run: https://github.com/canonical/opensearch-operator/actions/runs/7199744236/job/19612482542

The failure happens in tests/integration/ha/test_horizontal_scaling.py::test_safe_scale_down_roles_reassigning.

The main issue in that run happens between 20:15:12 and 20:15:19. During this time period:

So, we have opensearch/5 having to deal with two nodes going away at the same time, in a 4-node cluster.

The unit opensearch/2 will fail to finish storage-detaching (https://pastebin.ubuntu.com/p/8wTTBKzrK5/), because opensearch/5 set the cluster to red as it could not copy/move some of the shard replicas to deal with opensearch/2 going away.
Besides that, CI is running with:

automatically-retry-hooks:
  value: false
  source: model

https://pastebin.ubuntu.com/p/488WTZBRHZ/

Which means, for the next 20 minutes, storage-detaching will never be retried and the test will fail.


Proposal:

I do think we should keep that value as automatically-retry-hooks=false. That will highlight race conditions such as this one.

I believe one fix would be to check the cluster health before stopping the opensearch service within storage-detaching, and to retry within the hook, as such:

with retry_x_times:
    if cluster_is_healthy:
        stop_service
    if wait_until_cluster_is_healthy_or_timeout:
        finish_hook_and_exit
    raise Exception

That way, node opensearch/2 would have noticed that opensearch/0 is leaving and waited for the cluster to settle before deciding to leave itself. Also, if the cluster never returns to a healthy state (i.e. the first cluster_is_healthy check returns false every time), we are protected against multiple nodes changing state at once, as happened in this case, because the stop will never happen.
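A minimal sketch of that proposal; cluster_is_healthy() and wait_until_cluster_is_healthy() are hypothetical helpers, and the retry parameters are arbitrary, so this is not the actual storage-detaching code:

from tenacity import Retrying, stop_after_attempt, wait_fixed

def _on_opensearch_data_storage_detaching(self, event) -> None:
    """Only stop this node once the cluster is healthy, retrying within the hook."""
    for attempt in Retrying(stop=stop_after_attempt(10), wait=wait_fixed(30), reraise=True):
        with attempt:
            if not self.cluster_is_healthy():
                # another node may still be leaving or shards relocating: retry
                raise RuntimeError("cluster not healthy, refusing to stop this node yet")
            self._stop_opensearch()

    # only finish the hook once the remaining nodes have settled again
    if not self.wait_until_cluster_is_healthy(timeout=1200):
        raise RuntimeError("cluster did not return to a healthy state after stopping")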

Deferring start/restart event causes lock to be prematurely released

I am currently seeing failures such as this error in the early steps of the CI run.

The entire description here is based in this PR: https://github.com/canonical/opensearch-operator/tree/DPE-3352-last-passing-tests

Full deployment:

Model              Controller           Cloud/Region         Version  SLA          Timestamp
test-backups-23v7  github-pr-2fa19-lxd  localhost/localhost  3.1.7    unsupported  14:16:34Z

App                       Version  Status       Scale  Charm                     Channel  Rev  Exposed  Message
opensearch                         maintenance      3  opensearch                           0  no       Beginning rolling service
s3-integrator                      active           1  s3-integrator             stable    13  no       
self-signed-certificates           active           1  self-signed-certificates  stable    72  no       

Unit                         Workload  Agent      Machine  Public address  Ports  Message
opensearch/0                 waiting   idle       2        10.241.94.24           Waiting for OpenSearch to start...
opensearch/1*                waiting   executing  3        10.241.94.127          Waiting for OpenSearch to start...
opensearch/2                 error     idle       4        10.241.94.7            hook failed: "opensearch-peers-relation-changed"
s3-integrator/0*             active    idle       1        10.241.94.199          
self-signed-certificates/0*  active    idle       0        10.241.94.194          

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.241.94.194  juju-ad79f3-0  [email protected]      Running
1        started  10.241.94.199  juju-ad79f3-1  [email protected]      Running
2        started  10.241.94.24   juju-ad79f3-2  [email protected]      Running
3        started  10.241.94.127  juju-ad79f3-3  [email protected]      Running
4        started  10.241.94.7    juju-ad79f3-4  [email protected]      Running

Integration provider                   Requirer                           Interface            Type     Message
opensearch:opensearch-peers            opensearch:opensearch-peers        opensearch_peers     peer     
opensearch:service                     opensearch:service                 rolling_op           peer     
s3-integrator:s3-credentials           opensearch:s3-credentials          s3                   regular  
s3-integrator:s3-integrator-peers      s3-integrator:s3-integrator-peers  s3-integrator-peers  peer     
self-signed-certificates:certificates  opensearch:certificates            tls-certificates     regular  

Storage Unit  Storage ID         Type        Pool    Mountpoint                   Size    Status    Message
opensearch/0  opensearch-data/0  filesystem  rootfs  /var/snap/opensearch/common  72 GiB  attached  
opensearch/1  opensearch-data/1  filesystem  rootfs  /var/snap/opensearch/common  72 GiB  attached  
opensearch/2  opensearch-data/2  filesystem  rootfs  /var/snap/opensearch/common  72 GiB  attached  

Unit opensearch/2 is failing because all shards disappeared at once. At 14:15:01, everything is there:

2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: https://10.241.94.7:9200 "GET /_cat/indices?v HTTP/1.1" 200 606
2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: Getting secret app:admin-password
2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: Starting new HTTPS connection (1): 10.241.94.7:9200
2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: https://10.241.94.7:9200 "GET /_cat/shards?v HTTP/1.1" 200 1774
2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: indices status:
[{'health': 'green', 'status': 'open', 'index': '.opensearch-observability', 'uuid': '_pC83dCcQVuYW5VWnbasfg', 'pri': '1', 'rep': '2', 'docs.count': '0', 'docs.deleted': '0', 'store.size': '416b', 'pri.store.size': '208b'}, {'health': 'green', 'status': 'open', 'index': '.plugins-ml-config', 'uuid': 'RXeUPKcXToOBpRQPM4VQEA', 'pri': '1', 'rep': '2', 'docs.count': '1', 'docs.deleted': '0', 'store.size': '7.7kb', 'pri.store.size': '3.8kb'}, {'health': 'green', 'status': 'open', 'index': '.opendistro_security', 'uuid': 'FXZtZFctTWyEjVt9ZDRqcw', 'pri': '1', 'rep': '2', 'docs.count': '10', 'docs.deleted': '2', 'store.size': '105.2kb', 'pri.store.size': '57.8kb'}]
indices shards:
[{'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '0b', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.8kb', 'ip': '10.241.94.127', 'node': 'opensearch-1'}, {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.241.94.24', 'node': 'opensearch-0'}, {'index': '.opensearch-observability', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '0', 'store': '0b', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.opensearch-observability', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '0', 'store': '208b', 'ip': '10.241.94.127', 'node': 'opensearch-1'}, {'index': '.opensearch-observability', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '0', 'store': '208b', 'ip': '10.241.94.24', 'node': 'opensearch-0'}, {'index': '.opensearch-sap-log-types-config', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': None, 'store': None, 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.opensearch-sap-log-types-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': None, 'store': None, 'ip': '10.241.94.127', 'node': 'opensearch-1'}, {'index': '.opensearch-sap-log-types-config', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': None, 'store': None, 'ip': '10.241.94.24', 'node': 'opensearch-0'}, {'index': '.opendistro_security', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '10', 'store': '0b', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '10', 'store': '57.8kb', 'ip': '10.241.94.127', 'node': 'opensearch-1'}, {'index': '.opendistro_security', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '10', 'store': '47.4kb', 'ip': '10.241.94.24', 'node': 'opensearch-0'}]

Then, it disappears:

2024-02-07 14:16:21 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: https://10.241.94.7:9200 "GET /_cat/shards?v HTTP/1.1" 200 555
2024-02-07 14:16:21 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: indices status:
[{'health': 'green', 'status': 'open', 'index': '.opensearch-observability', 'uuid': '_pC83dCcQVuYW5VWnbasfg', 'pri': '1', 'rep': '0', 'docs.count': '0', 'docs.deleted': '0', 'store.size': '208b', 'pri.store.size': '208b'}, {'health': 'green', 'status': 'open', 'index': '.plugins-ml-config', 'uuid': 'RXeUPKcXToOBpRQPM4VQEA', 'pri': '1', 'rep': '0', 'docs.count': '1', 'docs.deleted': '0', 'store.size': '3.9kb', 'pri.store.size': '3.9kb'}, {'health': 'red', 'status': 'open', 'index': '.opendistro_security', 'uuid': 'FXZtZFctTWyEjVt9ZDRqcw', 'pri': '1', 'rep': '0', 'docs.count': None, 'docs.deleted': None, 'store.size': None, 'pri.store.size': None}]
indices shards:
[{'index': '.opensearch-observability', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '0', 'store': '208b', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.opensearch-sap-log-types-config', 'shard': '0', 'prirep': 'p', 'state': 'UNASSIGNED', 'docs': None, 'store': None, 'ip': None, 'node': None}, {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'UNASSIGNED', 'docs': None, 'store': None, 'ip': None, 'node': None}]

Looking at the logs, I can see that:

  • Opensearch/0: stops at: 2024-02-07 14:15:57 DEBUG unit.opensearch/0.juju-log server.go:325 service:2: Rolling Ops Manager: stop_opensearch called
  • Opensearch/1: stops at: 2024-02-07 14:15:34 DEBUG unit.opensearch/1.juju-log server.go:325 service:2: Rolling Ops Manager: stop_opensearch called

That means that, by the time opensearch/2 runs its routine, all other nodes are gone and it fails.

It is happening because both opensearch/0 and opensearch/1 are deferring their RunWithLock events:

  • Opensearch/0: RunWithLock via OpenSearchOperatorCharm/on/service_run_with_lock[283] appears multiple times
  • Opensearch/1: RunWithLock via OpenSearchOperatorCharm/on/service_run_with_lock[318] appears multiple times

Looking into the rolling-ops code: there is no post-processing that checks whether the event has been deferred. It only reacts to an exception, which breaks the entire hook.

I will report a bug with rolling-ops as well, asking for _on_run_with_lock to capture any exceptions and to double-check whether the event has been marked for deferral after the callback returns.
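A minimal sketch of what that rolling-ops change could look like; the lock handling shown is illustrative and based on the callback call visible in the tracebacks elsewhere in this document, not on the actual rollingops library internals:

def _on_run_with_lock(self, event) -> None:
    callback = getattr(self.charm, self._callback_name)  # illustrative lookup
    callback(event)
    # ops sets event.deferred = True when event.defer() is called, so only
    # release the lock once the start/restart callback actually completed.
    if event.deferred:
        return
    self._release_lock(event)  # illustrative; rollingops manages its own lock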

[BUG] `internal_users.yaml` is not synchronized across units

Steps to reproduce

Start up a 2-member charm cluster and check with:
juju ssh opensearch/? sudo cat /var/snap/opensearch/current/etc/opensearch/opensearch-security/internal_users.yml

Expected behavior

In order to safely switch leader anytime (potentially re-initializing the security index), these local users must be the same on all nodes.

What currently happens is: on the leader, during the leader-elected event, we

  1. wipe out the internal_users.yml file
  2. add back the users that we need

This process has to run on all units.
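A minimal sketch of that per-unit step, assuming the hashed passwords are already shared via the peer application data; the file layout follows the OpenSearch security plugin's internal_users.yml format, and the helper itself is hypothetical:

import yaml

INTERNAL_USERS = "/var/snap/opensearch/current/etc/opensearch/opensearch-security/internal_users.yml"

def rewrite_internal_users(hashed_passwords: dict) -> None:
    """Rebuild internal_users.yml identically on every unit from the shared hashes."""
    users = {"_meta": {"type": "internalusers", "config_version": 2}}
    for name, hashed in hashed_passwords.items():
        users[name] = {"hash": hashed, "reserved": True}
    with open(INTERNAL_USERS, "w") as f:
        yaml.safe_dump(users, f)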

Large deployments do not clean up the error in `_rel_err_data`

Steps to reproduce

Deploy a large deployment setup with:

juju deploy tls-certificates-operator --channel stable --show-log --verbose
juju config tls-certificates-operator generate-self-signed-certificates=true ca-common-name="CN_CA"

# deploy main-orchestrator cluster 
juju deploy -n 3 ./opensearch.charm \
    main \
    --config cluster_name="log-app" --config init_hold=false --config roles="cluster_manager"

# deploy failover-orchestrator cluster
juju deploy -n 2 ./opensearch.charm \
    failover \
    --config cluster_name="log-app" --config init_hold=true --config roles="cluster_manager"

# deploy data-hot cluster
juju deploy -n 2 ./opensearch.charm \
    data-hot \
    --config cluster_name="log-app" --config init_hold=true --config roles="data.hot"

# integrate TLS
juju integrate tls-certificates-operator main
juju integrate tls-certificates-operator failover
juju integrate tls-certificates-operator data-hot

# integrate the "main"-orchestrator with all clusters:
juju integrate main:peer-cluster-orchestrator failover:peer-cluster
juju integrate main:peer-cluster-orchestrator data-hot:peer-cluster

~~juju integrate failover:peer-cluster-orchestrator data-hot:peer-cluster~~ # Do not add this relation

Once the cluster has settled, it shows the blocked message at the app level for both failover and data-hot.

Relate the data-hot with failover: juju integrate failover:peer-cluster-orchestrator data-hot:peer-cluster

The cluster will keep the blocked message.

Juju status

(testing with s3-integrator, but this is not mandatory)

Model              Controller           Cloud/Region         Version  SLA          Timestamp
test-backups-ti92  localhost-localhost  localhost/localhost  3.4.2    unsupported  10:31:46+02:00

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
data-hot                           blocked      1  opensearch                                 2  no       Cannot have 2 'failover'-orchestrators. Relate to the existing failover.
failover                           blocked      1  opensearch                                 0  no       Cannot have 2 'failover'-orchestrators. Relate to the existing failover.
main                               active       2  opensearch                                 1  no       
s3-integrator                      blocked      1  s3-integrator             latest/edge     17  no       Missing parameters: ['access-key', 'secret-key']
self-signed-certificates           active       1  self-signed-certificates  latest/stable   72  no       

Unit                         Workload  Agent  Machine  Public address  Ports  Message
data-hot/0*                  active    idle   3        10.41.46.190           
failover/0*                  waiting   idle   2        10.41.46.80            Waiting for OpenSearch to start...
main/0*                      active    idle   4        10.41.46.88            
main/1                       active    idle   5        10.41.46.254           
s3-integrator/0*             blocked   idle   1        10.41.46.63            Missing parameters: ['access-key', 'secret-key']
self-signed-certificates/0*  active    idle   0        10.41.46.113           

Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.41.46.113  juju-1998d4-0  [email protected]      Running
1        started  10.41.46.63   juju-1998d4-1  [email protected]      Running
2        started  10.41.46.80   juju-1998d4-2  [email protected]      Running
3        started  10.41.46.190  juju-1998d4-3  [email protected]      Running
4        started  10.41.46.88   juju-1998d4-4  [email protected]      Running
5        started  10.41.46.254  juju-1998d4-5  [email protected]      Running

Integration provider                   Requirer                           Interface            Type     Message
data-hot:node-lock-fallback            data-hot:node-lock-fallback        node_lock_fallback   peer     
data-hot:opensearch-peers              data-hot:opensearch-peers          opensearch_peers     peer     
failover:node-lock-fallback            failover:node-lock-fallback        node_lock_fallback   peer     
failover:opensearch-peers              failover:opensearch-peers          opensearch_peers     peer     
main:node-lock-fallback                main:node-lock-fallback            node_lock_fallback   peer     
main:opensearch-peers                  main:opensearch-peers              opensearch_peers     peer     
main:peer-cluster-orchestrator         data-hot:peer-cluster              peer_cluster         regular  
main:peer-cluster-orchestrator         failover:peer-cluster              peer_cluster         regular  
s3-integrator:s3-integrator-peers      s3-integrator:s3-integrator-peers  s3-integrator-peers  peer     
self-signed-certificates:certificates  data-hot:certificates              tls-certificates     regular  
self-signed-certificates:certificates  failover:certificates              tls-certificates     regular  
self-signed-certificates:certificates  main:certificates                  tls-certificates     regular

Expected behavior

The error_data present in the relation should go away.

Actual behavior

The error message between main (the provider) and its follower clusters persists.

Unit Test `test_opensearch_tls.py::test_get_sans` fail with wrong `sans_ip`

It is not 100% reproducible, but I can see test_get_sans failing most of the time in PR #248 with:

AssertionError: {'san[34 chars]: ['192.0.2.0', 'XX.XXX.XX.XXX', 'address1', '[55 chars]-0']} != {'san[34 chars]: ['1.1.1.1', 'XX.XXX.XX.XXX', 'address1', 'ad[53 chars]-0']}
  {'sans_dns': ['alias', 'nebula', 'opensearch-0'],
-  'sans_ip': ['192.0.2.0', 'XX.XXX.XX.XXX', 'address1', 'address2'],
?                -- ^ ^ ^

+  'sans_ip': ['1.1.1.1', 'XX.XXX.XX.XXX', 'address1', 'address2'],
?                 ^ ^ ^

   'sans_oid': ['1.2.3.4.5.5']}

Looking into the origins of this IP, I noticed that newer versions of the ops framework include a commit that adds a default network, 192.0.2.0, when no existing networks are bound.

We were originally using the patch_network_get decorator in this test. I think we should move to _TestModelBackend.add_network instead.
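A minimal sketch of the suggested change in the unit test, assuming an ops version where Harness.add_network is available; the charm import path and address are illustrative:

from ops.testing import Harness

from charm import OpenSearchOperatorCharm  # illustrative import path

harness = Harness(OpenSearchOperatorCharm)
# Bind a deterministic address instead of patching network_get, so the default
# 192.0.2.0 network added by newer ops versions never leaks into the SANs.
harness.add_network("1.1.1.1")
harness.begin()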

`_start_opensearch` does not catch OpenSearchUserMgmtError

There seems to be some race condition happening at _post_start_init, where some of the CI tests fail with OpenSearchUserMgmtError. That is shown in the stack trace at the end.

Side note: _post_start_init should be renamed to something else, as it runs as the very first line of _start_opensearch.

Stack trace: https://pastebin.ubuntu.com/p/WRMY3FnZCc/

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 276, in call
    response = s.request(**request_kwargs)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='10.206.26.175', port=9200): Read timed out. (read timeout=5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_users.py", line 137, in get_users
    return self.opensearch.request("GET", f"{USER_ENDPOINT}/")
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 297, in request
    resp = call(retries, resp_status_code)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 285, in call
    return call(
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 238, in call
    raise OpenSearchHttpError()
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-4/charm/./src/charm.py", line 94, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/ops/main.py", line 434, in main
    framework.reemit()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/ops/framework.py", line 863, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/rolling_ops/v0/rollingops.py", line 410, in _on_run_with_lock
    callback(event)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 618, in _start_opensearch
    self._post_start_init()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 720, in _post_start_init
    self._put_monitoring_user()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 840, in _put_monitoring_user
    users = self.user_manager.get_users()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_users.py", line 139, in get_users
    raise OpenSearchUserMgmtError(e)
charms.opensearch.v0.opensearch_users.OpenSearchUserMgmtError
2024-01-24 08:12:51 ERROR juju.worker.uniter.operation runhook.go:180 hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
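A minimal sketch of the kind of guard this issue asks for, placed around the _post_start_init call inside _start_opensearch; deferring and returning is my assumption about the desired behaviour, not the current code:

try:
    self._post_start_init()
except OpenSearchUserMgmtError as e:
    # User management can fail transiently right after start (e.g. read
    # timeouts against the security index); retry later instead of erroring.
    logger.warning(f"Post-start user setup not ready yet: {e}")
    event.defer()
    return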

`DeferTriggerEvent` has no effect

Current implementation of DeferTriggerEvent

defer_trigger_event = EventSource(DeferTriggerEvent)

# helper to defer events without any additional logic
self.framework.observe(self.defer_trigger_event, self._on_defer_trigger)

def _on_defer_trigger(self, _: DeferTriggerEvent):
    """Hook for the trigger_defer event."""
    pass

Example usage

event.defer()
self.defer_trigger_event.emit()

Intended behavior

(secondhand understanding from @Mehdi-Bendriss)
Within the current Juju event, re-trigger deferred events (similar to an infinite while loop) until the event is no longer deferred.

Actual behavior

self.defer_trigger_event.emit() has no effect

Test results

Charm used for testing
#!/usr/bin/env python3
# Copyright 2024 Ubuntu
# See LICENSE file for licensing details.
#
# Learn more at: https://juju.is/docs/sdk

"""Charm the service.

Refer to the following tutorial that will help you
develop a new k8s charm using the Operator Framework:

https://juju.is/docs/sdk/create-a-minimal-kubernetes-charm
"""

import logging

import ops

logger = logging.getLogger(__name__)

class FooEvent(ops.EventBase):
    pass


class FooCharm(ops.CharmBase):
    """Charm the service."""

    foo_event = ops.EventSource(FooEvent)

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.install, self._on_install)
        self.framework.observe(self.on.update_status, self._on_update_status)
        self.framework.observe(self.foo_event, self._on_foo)

    def _on_install(self, event: ops.InstallEvent):
        logger.warning("A")
        event.defer()
        self.foo_event.emit()
        return

    def _on_foo(self, _):
        pass

    def _on_update_status(self, event: ops.UpdateStatusEvent):
        logger.warning(f"{list(self.framework._storage.notices())=}")

if __name__ == "__main__":  # pragma: nocover
    ops.main(FooCharm)  # type: ignore
juju debug-log

Steps to reproduce:

  1. Deploy charm
unit-foo-0: 10:43:43 INFO juju.worker.uniter found queued "install" hook
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log ops 2.11.0 up and running.
unit-foo-0: 10:43:44 INFO unit.foo/0.juju-log Running legacy hooks/install.
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log ops 2.11.0 up and running.
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Charm called itself via hooks/install.
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Legacy hooks/install exited with status 0.
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Using local storage: not a Kubernetes podspec charm
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Initializing SQLite local storage: /var/lib/juju/agents/unit-foo-0/charm/.unit-state.db.
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Emitting Juju event install.
unit-foo-0: 10:43:44 WARNING unit.foo/0.juju-log A
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Deferring <InstallEvent via FooCharm/on/install[1]>.
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Emitting custom event <FooEvent via FooCharm/foo_event[2]>.
unit-foo-0: 10:43:44 INFO juju.worker.uniter.operation ran "install" hook (via hook dispatching script: dispatch)
unit-foo-0: 10:43:44 INFO juju.worker.uniter found queued "leader-elected" hook
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log ops 2.11.0 up and running.
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Re-emitting deferred event <InstallEvent via FooCharm/on/install[1]>.
unit-foo-0: 10:43:44 WARNING unit.foo/0.juju-log A
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Deferring <InstallEvent via FooCharm/on/install[1]>.
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Emitting custom event <FooEvent via FooCharm/foo_event[7]>.
unit-foo-0: 10:43:44 DEBUG unit.foo/0.juju-log Emitting Juju event leader_elected.
unit-foo-0: 10:43:44 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log ops 2.11.0 up and running.
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Re-emitting deferred event <InstallEvent via FooCharm/on/install[1]>.
unit-foo-0: 10:43:45 WARNING unit.foo/0.juju-log A
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Deferring <InstallEvent via FooCharm/on/install[1]>.
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Emitting custom event <FooEvent via FooCharm/foo_event[13]>.
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Emitting Juju event config_changed.
unit-foo-0: 10:43:45 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-foo-0: 10:43:45 INFO juju.worker.uniter found queued "start" hook
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log ops 2.11.0 up and running.
unit-foo-0: 10:43:45 INFO unit.foo/0.juju-log Running legacy hooks/start.
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log ops 2.11.0 up and running.
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Charm called itself via hooks/start.
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Legacy hooks/start exited with status 0.
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Re-emitting deferred event <InstallEvent via FooCharm/on/install[1]>.
unit-foo-0: 10:43:45 WARNING unit.foo/0.juju-log A
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Deferring <InstallEvent via FooCharm/on/install[1]>.
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Emitting custom event <FooEvent via FooCharm/foo_event[19]>.
unit-foo-0: 10:43:45 DEBUG unit.foo/0.juju-log Emitting Juju event start.
unit-foo-0: 10:43:45 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)

ops source code

This lines up with the ops source code, where deferred events are re-run by ops.framework.Framework.reemit(), which is called in ops.main.main(). Framework.reemit() calls _reemit().

Event.emit() calls ops.framework.Framework._emit(), which calls _reemit(event_path).

behavior of _reemit():

  • when called without event path, call handlers for all events (including deferred events)
  • when called with event path, only call handlers for that event

Unit tests fail with: importlib.metadata.PackageNotFoundError: No package metadata was found for juju

Running our unit tests against latest main branch.

I've setup a lxc container and ran on it:

git clone https://github.com/canonical/opensearch-operator
cd opensearch-operator/
apt update
apt install -y python3 python3-pip
pip3 install tox
tox -e unit

Output: https://pastebin.ubuntu.com/p/jx73gwBXKP/

Running:
LIBJUJU_VERSION_SPECIFIER=3.1.2.0 tox -e unit

Resolves the issue. I believe we can move to 3.1 by default in our tox.ini, since it is the newer supported version, and add some CI tests to check 2.9 compatibility.

[SHARDING][data-integrator] 1 or more 'replica' shards are not assigned, please scale your application up.

Steps to reproduce

  1. Run data-integrator Opensearch pipeline with 3 units

Expected behavior

Tests pass

Actual behavior

If 3 units are used in the test, I consistently keep getting the following error, while the unit is stuck in a blocked state.

1 or more 'replica' shards are not assigned, please scale your application up.

The problem goes away when running the test with 2 units only.

Versions

Operating system: jammy

Juju CLI:

Juju agent: 3.2.2

Charm revision: edge

Log output

Juju debug log: see the pipeline referenced above

Additional context

I can't reproduce the issue locally. I executed the same pipeline multiple times on a powerful C5 AWS server and it passes green.
(It may be worth doing the same on a less powerful node too.)

Fix role re-balancing for large deployments

This works fine for simple deployments where we only have 1 cluster. However, in large deployments (multiple clusters) it doesn't seem like `if len(current_cluster_nodes) % 2 == 0` is enough; this must be tested against the full list of nodes of the fleet where `node.is_cm_eligible()`.

Originally posted by @Mehdi-Bendriss in #209 (comment)
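A minimal sketch of the suggested check; all_fleet_nodes is a hypothetical accessor covering every cluster of the large deployment, not an existing charm attribute:

# Count cluster-manager-eligible nodes across the whole fleet, not just the
# current cluster, before deciding whether roles need re-balancing.
cm_eligible = [node for node in all_fleet_nodes if node.is_cm_eligible()]
if len(cm_eligible) % 2 == 0:
    ...  # re-balance roles as in the single-cluster case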

`_post_start_init` triggers continuous password updates on `COSUser`

The _post_start_init routine may be re-executed multiple times on a single start, if conditions are not yet met for the node to start. In this scenario, there is a risk of COSUser being recreated multiple times, resulting in new passwords and, hence, secret-changed events.

We should adapt _put_or_update_internal_user so that: (1) if no password was provided, it checks for existing users and, if the user is already present, returns early; and (2) if a password was provided, it proceeds with the update.

Also, the return type annotation of get_user does not match its docstring.
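A minimal sketch of the proposed behaviour; the method signature and the password-generation helper are assumptions based on the issue text, not the actual charm API:

def _put_or_update_internal_user(self, user: str, hashed_pwd: str | None = None) -> None:
    if hashed_pwd is None:
        # (1) No password supplied: create the user only if it does not exist yet,
        # so repeated _post_start_init runs do not rotate the COSUser password.
        if user in self.user_manager.get_users():
            return
        hashed_pwd = self._generate_hashed_password()  # hypothetical helper
    # (2) A password was supplied (or the user is new): proceed with the update.
    self.user_manager.put_internal_user(user, hashed_pwd)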
