[chart/redis-ha][BUG] about charts (14 comments, closed)

dandydeveloper avatar dandydeveloper commented on May 18, 2024
[chart/redis-ha][BUG]

from charts.

Comments (14)

yariksheptykin avatar yariksheptykin commented on May 18, 2024

@krakazyabra are you connecting to redis via ha-proxy? Or straight to one of the redis pods? What chart version are you using?


krakazyabra avatar krakazyabra commented on May 18, 2024

I noticed that I cannot connect from one redis pod to another:

/data $ redis-cli -p 6379 -h prod-redis-ha-server-1
Could not connect to Redis at prod-redis-ha-server-1:6379: Name does not resolve
not connected> 

maybe because of this:

root@nextcloud-0:/var/www/html# getent ahosts prod-redis-ha-server-0.prod-redis-ha-announce-0

Zero output, but at the same time:

root@nextcloud-0:/var/www/html# getent ahosts prod-redis-ha-server-0.prod-redis-ha
10.113.63.110   STREAM prod-redis-ha-server-0.prod-redis-ha.nextcloud.svc.cluster.local
10.113.63.110   DGRAM  
10.113.63.110   RAW    
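One way to narrow down which of those names should resolve (a sketch only; the service names are taken from the getent output above, and kubectl access to the cluster is assumed):

```shell
# The headless StatefulSet service resolves per-pod names; the announce
# services are separate per-pod Services. Check that each has endpoints:
kubectl -n nextcloud get endpoints prod-redis-ha prod-redis-ha-announce-0

# Check which DNS names the cluster actually serves for the announce service:
kubectl -n nextcloud run -it --rm dnscheck --image=busybox --restart=Never -- \
  nslookup prod-redis-ha-announce-0.nextcloud.svc.cluster.local
```

Note that Kubernetes only publishes per-pod A records under the pod's own subdomain service (here the headless prod-redis-ha), so a name like prod-redis-ha-server-0.prod-redis-ha-announce-0 is not generally expected to resolve; the announce services are addressed directly by their own names.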


krakazyabra avatar krakazyabra commented on May 18, 2024

@krakazyabra are you connecting to redis via ha-proxy? Or straight to one of the redis pods? What chart version are you using?

Hello. I'm connecting via haproxy

      'redis' =>
      array (
        'host' => 'prod-redis-ha-haproxy',
        'port' => '6379',
        'timeout' => 0.0,
      ),

Chart version is 4.6.2


yariksheptykin avatar yariksheptykin commented on May 18, 2024

@krakazyabra that's interesting.

To me it seems that you are facing several problems here. One is frequent redis failover, which causes election of a new master. The connectivity issues, gateway timeouts, logs about "redis going away", and the amount of traffic should all be inter-related.

Another is Nextcloud being unable to gracefully deal with redis being temporarily unavailable (during failover, for example). This probably causes the pages without CSS and the need to flush the cache after connectivity between Nextcloud and redis is restored.

I believe the good questions to ask are: does redis failover work as expected in the staging environment? What happens if you deliberately kill the master in staging? Can a new master be successfully elected? What causes connectivity problems or failover in the production environment? Do production redis pods get re-scheduled by kubernetes to other nodes?


krakazyabra avatar krakazyabra commented on May 18, 2024

What happens if you deliberately kill master in staging?

Thanks for your advice. I killed the master pod.
While the master pod was down, on the slaves I saw:

stage-redis-ha-announce-2:6379> INFO replication
# Replication
role:slave
master_host:10.105.28.146
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:533984
master_link_down_since_seconds:32
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:11a0f7f63526fdb89f601913be98421a31b3f4df
master_replid2:d309ed9e580077c3df3babbcc0a64178a6dc1097
master_repl_offset:533984
second_repl_offset:496807
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:867
repl_backlog_histlen:533118

Can the new master be successfully elected?

Within those 32 seconds redis was not able to choose a new master. Is that ok? And redis-0 became master again (even though I had deleted it).
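For what it's worth, the master_link_status field can be pulled out of that output with a tiny helper, which makes it easy to watch in a loop during a failover drill (the function below is just an illustrative sketch, not part of the chart):

```shell
# info_field: print one field from `redis-cli INFO replication` output read
# on stdin. INFO lines are CRLF-terminated, hence the tr.
info_field() {
  tr -d '\r' | awk -F: -v k="$1" '$1 == k { print $2 }'
}

# Usage during a drill (hostname from this thread):
#   redis-cli -h stage-redis-ha-announce-2 INFO replication | info_field master_link_status
```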

What causes connectivity problems or failover in production environment?

I don't know. Maybe node overload, maybe OOM, maybe k8s rescheduling pods to free nodes.


DandyDeveloper avatar DandyDeveloper commented on May 18, 2024

@yariksheptykin Failover should be almost immediate assuming your quorum is healthy and the two remaining slaves are able to determine:

a. Master is down (Above shows they are)
b. Slaves are healthy.

I should also check - what state is the container itself in during these 32 seconds? Does it no longer exist? I think the announce service relies on the pod physically not existing / the process having completely terminated to be considered in a state of "changeover". For example, if you instead just remove the selector / affinity for the existing master, how quickly does it fail over then?

One thing I should ask: is hairpin mode enabled on your CNI? I've seen a lot of known issues where people don't have hairpinning, so the announce services fail to resolve from their own pod.


krakazyabra avatar krakazyabra commented on May 18, 2024

Hi, @DandyDeveloper. Thanks for your message.

What state is the container itself in during these 32 seconds?

Which container? Master is restarting (Terminating, Init, ContainerCreating). Slaves are always Running.

if you actually just remove the selector / affinity for the existing master, how quickly does it failover then?

I'll do this during the day.

is hairpin mode enabled on your CNI?

Yes, I'm using hairpin in the service. I enabled it when I could not connect to a pod through its service.

kind: Service
metadata:
  annotations:
    kube-router.io/service.hairpin: ""


krakazyabra avatar krakazyabra commented on May 18, 2024

if you actually just remove the selector / affinity for the existing master, how quickly does it failover then?

Switching was instant. Now it's time to test on prod :)

Probably the previous time I deleted 2 pods out of 3, so there was no quorum; that's why a pod couldn't become master.


krakazyabra avatar krakazyabra commented on May 18, 2024

So, I performed the same test on the production redis. The result was different: it took more time to elect a master (~20 sec), but after the new master was elected, the application couldn't work with it. In nextcloud's log I saw this message:

"message": {
  "Exception": "RedisException",
  "Message": "NOREPLICAS Not enough good replicas to write.",

If you need it, here is the full trace:

Nextcloud log
{
	"reqId": "45BO04TogwWjXaeqZ5Nm",
	"level": 3,
	"time": "2020-08-24T11:21:48+02:00",
	"remoteAddr": "46.28.104.66",
	"user": "domain-auth-dd2f1a98-72ff-42a8-b69e-d0c997fcf33e",
	"app": "index",
	"method": "GET",
	"url": "/apps/files/?dir=/home-nextcloud/PHOTO/Photos/%D0%91%D0%B5%D0%B7%D0%B5%D0%BD%D0%B3%D0%B8%20%D0%BE%D1%82%20%D0%A1%D0%B0%D1%88%D0%B8%2C%20%D0%B8%D1%8E%D0%BD%D1%8C%202018/100ANV01&fileid=5407188",
	"message": {
		"Exception": "RedisException",
		"Message": "NOREPLICAS Not enough good replicas to write.",
		"Code": 0,
		"Trace": [
			{
				"file": "/var/www/html/lib/private/Memcache/Redis.php",
				"line": 65,
				"function": "setex",
				"class": "Redis",
				"type": "->",
				"args": [
					"5adeb77b6d2a45ad965d6c2b4311e1e7/AutoloaderViewController",
					60,
					"[]"
				]
			},
			{
				"file": "/var/www/html/lib/autoloader.php",
				"line": 171,
				"function": "set",
				"class": "OC\\Memcache\\Redis",
				"type": "->",
				"args": [
					"ViewController",
					[],
					60
				]
			},
			{
				"function": "load",
				"class": "OC\\Autoloader",
				"type": "->",
				"args": [
					"ViewController"
				]
			},
			{
				"function": "spl_autoload_call",
				"args": [
					"ViewController"
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/Utility/SimpleContainer.php",
				"line": 99,
				"function": "__construct",
				"class": "ReflectionClass",
				"type": "->",
				"args": [
					"ViewController"
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/Utility/SimpleContainer.php",
				"line": 116,
				"function": "resolve",
				"class": "OC\\AppFramework\\Utility\\SimpleContainer",
				"type": "->",
				"args": [
					"ViewController"
				]
			},
			{
				"file": "/var/www/html/lib/private/ServerContainer.php",
				"line": 149,
				"function": "query",
				"class": "OC\\AppFramework\\Utility\\SimpleContainer",
				"type": "->",
				"args": [
					"ViewController",
					true
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/DependencyInjection/DIContainer.php",
				"line": 388,
				"function": "query",
				"class": "OC\\ServerContainer",
				"type": "->",
				"args": [
					"ViewController",
					true
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/App.php",
				"line": 110,
				"function": "query",
				"class": "OC\\AppFramework\\DependencyInjection\\DIContainer",
				"type": "->",
				"args": [
					"ViewController"
				]
			},
			{
				"file": "/var/www/html/lib/private/AppFramework/Routing/RouteActionHandler.php",
				"line": 47,
				"function": "main",
				"class": "OC\\AppFramework\\App",
				"type": "::",
				"args": [
					"ViewController",
					"index",
					{
						"__class__": "OC\\AppFramework\\DependencyInjection\\DIContainer"
					},
					{
						"_route": "files.view.index"
					}
				]
			},
			{
				"function": "__invoke",
				"class": "OC\\AppFramework\\Routing\\RouteActionHandler",
				"type": "->",
				"args": [
					{
						"_route": "files.view.index"
					}
				]
			},
			{
				"file": "/var/www/html/lib/private/Route/Router.php",
				"line": 297,
				"function": "call_user_func",
				"args": [
					{
						"__class__": "OC\\AppFramework\\Routing\\RouteActionHandler"
					},
					{
						"_route": "files.view.index"
					}
				]
			},
			{
				"file": "/var/www/html/lib/base.php",
				"line": 1007,
				"function": "match",
				"class": "OC\\Route\\Router",
				"type": "->",
				"args": [
					"/apps/files/"
				]
			},
			{
				"file": "/var/www/html/index.php",
				"line": 37,
				"function": "handleRequest",
				"class": "OC",
				"type": "::",
				"args": []
			}
		],
		"File": "/var/www/html/lib/private/Memcache/Redis.php",
		"Line": 65,
		"CustomMessage": "--"
	},
	"userAgent": "Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0",
	"version": "19.0.1.1"
}

In nextcloud config I'm using prod-redis-ha-haproxy host

  'redis' => 
  array (
    'host' => 'prod-redis-ha-haproxy',
    'port' => '6379',
    'timeout' => 0.0,
  ),
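For context on the NOREPLICAS error: redis refuses writes when, from the new master's point of view, fewer healthy replicas are connected than its configured minimum. In Redis 5 the relevant settings (reflected in the INFO output above as min_slaves_good_slaves) are, to the best of my knowledge:

```
# redis.conf (Redis 5 also still accepts the older min-slaves-* spellings):
min-replicas-to-write 1   # refuse writes when fewer "good" replicas are connected
min-replicas-max-lag 10   # a replica counts as "good" only within this lag (seconds)
```

They can be inspected at runtime with redis-cli CONFIG GET 'min-*'. Relaxing them trades durability for availability during failover, so whether that is acceptable depends on your setup.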


krakazyabra avatar krakazyabra commented on May 18, 2024

Outage happened again.
The redis/sentinel/haproxy logs are empty (nothing unusual).
From the nextcloud pod I can connect to redis:

redis-0
root@nextcloud-0:/var/www/html# redis-cli -h prod-redis-ha-announce-0
prod-redis-ha-announce-0:6379> INFO replication
# Replication
role:slave
master_host:10.110.222.19
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:1332637369
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1332637369
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331588794
repl_backlog_histlen:1048576


prod-redis-ha-announce-0:6379> INFO
# Server
redis_version:5.0.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:c33a24c35f10df99
redis_mode:standalone
os:Linux 4.15.18-28-pve x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:8.3.0
process_id:1
run_id:7d15135b21e457ef0bbe0077dcdc25238faf1ffc
tcp_port:6379
uptime_in_seconds:324213
uptime_in_days:3
hz:10
configured_hz:10
lru_clock:4450371
executable:/data/redis-server
config_file:/data/conf/redis.conf

# Clients
connected_clients:8
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

# Memory
used_memory:3030880
used_memory_human:2.89M
used_memory_rss:6508544
used_memory_rss_human:6.21M
used_memory_peak:5972704
used_memory_peak_human:5.70M
used_memory_peak_perc:50.75%
used_memory_overhead:2098660
used_memory_startup:791320
used_memory_dataset:932220
used_memory_dataset_perc:41.63%
allocator_allocated:3249752
allocator_active:3780608
allocator_resident:11415552
total_system_memory:67416342528
total_system_memory_human:62.79G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:4294967296
maxmemory_human:4.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.16
allocator_frag_bytes:530856
allocator_rss_ratio:3.02
allocator_rss_bytes:7634944
rss_overhead_ratio:0.57
rss_overhead_bytes:-4907008
mem_fragmentation_ratio:2.18
mem_fragmentation_bytes:3519664
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:0
mem_clients_normal:168148
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:1976
rdb_bgsave_in_progress:0
rdb_last_save_time:1598285145
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:581632
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:1442317
total_commands_processed:73777474
instantaneous_ops_per_sec:7
total_net_input_bytes:8463013966
total_net_output_bytes:111404463614
instantaneous_input_kbps:0.48
instantaneous_output_kbps:1.46
rejected_connections:0
sync_full:2
sync_partial_ok:3
sync_partial_err:2
expired_keys:92715
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:63706642
keyspace_misses:245179
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:201
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:slave
master_host:10.110.222.19
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:1332704850
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1332704850
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331656275
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:4092.593556
used_cpu_user:1406.304003
used_cpu_sys_children:1.139776
used_cpu_user_children:4.543999

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1360,expires=485,avg_ttl=2400226
redis-1
root@nextcloud-0:/var/www/html# redis-cli -h prod-redis-ha-announce-1
prod-redis-ha-announce-1:6379> INFO replication
# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=10.111.179.178,port=6379,state=online,offset=1332660983,lag=0
slave1:ip=10.98.125.94,port=6379,state=online,offset=1332660983,lag=0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:ee8596993dd6a4ea403df2728166bbd092be62c2
master_repl_offset:1332661125
second_repl_offset:1292024488
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331612550
repl_backlog_histlen:1048576


prod-redis-ha-announce-1:6379> INFO
# Server
redis_version:5.0.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:c33a24c35f10df99
redis_mode:standalone
os:Linux 5.4.0-42-generic x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:8.3.0
process_id:1
run_id:088c4d4f1a2d405c5efbd2bc817cb90fc7ea4beb
tcp_port:6379
uptime_in_seconds:324300
uptime_in_days:3
hz:10
configured_hz:10
lru_clock:4450415
executable:/data/redis-server
config_file:/data/conf/redis.conf

# Clients
connected_clients:13
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

# Memory
used_memory:3330240
used_memory_human:3.18M
used_memory_rss:6664192
used_memory_rss_human:6.36M
used_memory_peak:7263552
used_memory_peak_human:6.93M
used_memory_peak_perc:45.85%
used_memory_overhead:2365354
used_memory_startup:791320
used_memory_dataset:964886
used_memory_dataset_perc:38.00%
allocator_allocated:3520432
allocator_active:4083712
allocator_resident:11649024
total_system_memory:67312480256
total_system_memory_human:62.69G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:4294967296
maxmemory_human:4.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.16
allocator_frag_bytes:563280
allocator_rss_ratio:2.85
allocator_rss_bytes:7565312
rss_overhead_ratio:0.57
rss_overhead_bytes:-4984832
mem_fragmentation_ratio:2.05
mem_fragmentation_bytes:3416888
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:99388
mem_clients_normal:318302
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:1647
rdb_bgsave_in_progress:0
rdb_last_save_time:1598285145
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:577536
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:737061
total_commands_processed:29667369
instantaneous_ops_per_sec:355
total_net_input_bytes:3172946430
total_net_output_bytes:35185604100
instantaneous_input_kbps:41.23
instantaneous_output_kbps:539.16
rejected_connections:0
sync_full:1
sync_partial_ok:2
sync_partial_err:1
expired_keys:38949
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:21096558
keyspace_misses:127908
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:478
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=10.111.179.178,port=6379,state=online,offset=1332715787,lag=0
slave1:ip=10.98.125.94,port=6379,state=online,offset=1332715787,lag=0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:ee8596993dd6a4ea403df2728166bbd092be62c2
master_repl_offset:1332715787
second_repl_offset:1292024488
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331667212
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:2146.312761
used_cpu_user:908.084540
used_cpu_sys_children:1.189789
used_cpu_user_children:5.337125

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1372,expires=497,avg_ttl=1384779

redis-2
root@nextcloud-0:/var/www/html# redis-cli -h prod-redis-ha-announce-2
prod-redis-ha-announce-2:6379> INFO replication
# Replication
role:slave
master_host:10.110.222.19
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:1332670864
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1332670864
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331622289
repl_backlog_histlen:1048576


prod-redis-ha-announce-2:6379> INFO
# Server
redis_version:5.0.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:c33a24c35f10df99
redis_mode:standalone
os:Linux 4.15.18-28-pve x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:8.3.0
process_id:1
run_id:5785cea7517aa3589b4525cc83809bb441f822ef
tcp_port:6379
uptime_in_seconds:324199
uptime_in_days:3
hz:10
configured_hz:10
lru_clock:4450297
executable:/data/redis-server
config_file:/data/conf/redis.conf

# Clients
connected_clients:7
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

# Memory
used_memory:3011544
used_memory_human:2.87M
used_memory_rss:6574080
used_memory_rss_human:6.27M
used_memory_peak:4686520
used_memory_peak_human:4.47M
used_memory_peak_perc:64.26%
used_memory_overhead:2082442
used_memory_startup:791320
used_memory_dataset:929102
used_memory_dataset_perc:41.85%
allocator_allocated:3270800
allocator_active:3854336
allocator_resident:11304960
total_system_memory:67416309760
total_system_memory_human:62.79G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:4294967296
maxmemory_human:4.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.18
allocator_frag_bytes:583536
allocator_rss_ratio:2.93
allocator_rss_bytes:7450624
rss_overhead_ratio:0.58
rss_overhead_bytes:-4730880
mem_fragmentation_ratio:2.18
mem_fragmentation_bytes:3563584
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:0
mem_clients_normal:151226
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:1881
rdb_bgsave_in_progress:0
rdb_last_save_time:1598285145
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:577536
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:380762
total_commands_processed:7715420
instantaneous_ops_per_sec:7
total_net_input_bytes:571975929
total_net_output_bytes:799037066
instantaneous_input_kbps:0.43
instantaneous_output_kbps:1.28
rejected_connections:0
sync_full:3
sync_partial_ok:0
sync_partial_err:3
expired_keys:9
expired_stale_perc:0.29
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:4588
keyspace_misses:14
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:232
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:slave
master_host:10.110.222.19
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:1332685387
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:7f1074e6bfcccc057654c286ae9281b71d8edda0
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1332685387
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1331636812
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:409.080339
used_cpu_user:312.498500
used_cpu_sys_children:1.044103
used_cpu_user_children:4.027608

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1371,expires=496,avg_ttl=1169059

FLUSHALL + FLUSHDB fixed the problem.


yariksheptykin avatar yariksheptykin commented on May 18, 2024

@krakazyabra so it seems that the failover does work fine. According to @DandyDeveloper it takes somewhat longer than usual (20 sec on production). Nevertheless, the sentinels eventually elect the new master and the redis "cluster" restores its normal operation. To me it all sounds like the redis-ha chart does its job.

Still, for some reason Nextcloud cannot handle redis failover. You mention that flushing redis contents restores Nextcloud's operation. Does failover corrupt redis contents? This might occur if two masters exist simultaneously, unaware of each other due to a network partition. If Nextcloud writes to the stale master during failover and then switches to the newly elected master, the data written to the old master gets lost. But that seems to happen also when you manually delete the "master pod". In that case there is no stale master to write to. Still Nextcloud breaks. You mention that your "simple script" can successfully communicate with redis after the breakage whereas Nextcloud cannot. Thus it seems like Nextcloud is reading something from redis that causes it to break.

Why do you believe the bug is in the chart, and not in Nextcloud?

I don't think you can solve this issue without being able to reproduce it in a staging environment where you can study the breakage safely. Meanwhile on production, as a workaround, you could check Nextcloud health periodically and flush the redis cache if you detect an outage. Or, if that is not an option, take the next outage as an opportunity to investigate the problem as it occurs, before you flush the cache manually.
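The periodic-check workaround could look roughly like this. A sketch only: status.php is Nextcloud's standard health endpoint, the hostnames are the ones from this thread, and the status_needs_flush helper plus its maintenance-flag heuristic are hypothetical - what actually counts as "unhealthy" depends on your failure mode:

```shell
# status_needs_flush: reads the JSON from Nextcloud's status.php on stdin and
# succeeds (exit 0) when the instance looks unhealthy.
status_needs_flush() {
  local body
  body=$(cat)
  case "$body" in
    *'"maintenance":true'*) return 0 ;;   # stuck in maintenance mode
    *'"installed":true'*)   return 1 ;;   # looks healthy enough
    *)                      return 0 ;;   # empty or garbled response
  esac
}

# Cron-style usage (hypothetical in-cluster hostnames):
#   curl -fsS http://nextcloud/status.php | status_needs_flush \
#     && redis-cli -h prod-redis-ha-haproxy FLUSHALL
```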


krakazyabra avatar krakazyabra commented on May 18, 2024

Hi @yariksheptykin
There is logic in your words :) Maybe the problem is in Nextcloud.
I've already opened the same issue in Nextcloud's repo, nextcloud/server#22250, but there is no answer or reply yet. You were faster :)

I can exclude network problems, because we use a lot of redis-ha "clusters" from this chart with other applications, and there are no errors.

But one thing haunts me: the outage can also happen when I use a single replica without haproxy and sentinel. So the master cannot go away in that case.

I'll wait for answers from the Nextcloud team then. Anyway, thank you for your help.


DandyDeveloper avatar DandyDeveloper commented on May 18, 2024

@yariksheptykin @krakazyabra Worth noting, I'm saying it can take upwards of 20 seconds depending on how your cluster responds to the announce services declaring that the container hosting the master is offline.

Typically, the main "quick" failovers come from things like corruption in the db or node networking completely dying - things that result in an immediate notice that the master is offline.

Terminating a container is not the same, as it's a graceful shutdown, which means the master is still responsive until the Redis process terminates on the pod itself. So there's a potential delay in failover until the master reports that it's down.

That said, I completely agree with @yariksheptykin on getting some firmer reproducible steps on a minikube environment or something similar. I'm not doubting improvements can be made here, but issues like this are much harder to tie down.

One workaround I could advise: you could add an automated remediation step to ungracefully force-kill the pod, which should help the failover happen faster, but it could have other negative effects when bringing the original master back online.


DandyDeveloper avatar DandyDeveloper commented on May 18, 2024

I'm closing this as it's been a while since I've heard from the reporter and Nextcloud appeared to be the root cause. Let me know if you did find a fix!


