
Comments (24)

zhangjingcat commented on June 24, 2024

Hi @rnewson and @nickva, yes, I commented out the buffer_response=true setting in the .ini and I'm able to sync the huge database now. Thanks a lot for your help!
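
For anyone hitting the same problem, the fix amounts to commenting out (or deleting) that one line in the [chttpd] section of the .ini and restarting, roughly:

[chttpd]
; buffer_response = true   ; disabled: buffering builds the whole response in memory first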

rnewson commented on June 24, 2024

sorry I didn't notice it sooner, but thanks to @nickva for spotting it.

rnewson commented on June 24, 2024

Can you clarify step 5? I assume you meant that document "penny-003" did not appear in couchdb-A?

Further questions:

  1. The one-shot replications return important information in the response, like the number of documents written to the target. Can you please show those? (An example response is sketched after this list.)
  2. Are you running any proxy or other software in front of either couchdb server? If so, what is it and what is it doing?
  3. In your 5-step example, did you use a filter on the replication?
  4. Your "not_found" discovery is perhaps a failure to find the design document that your filter is in, rather than an indication that the _changes endpoint itself is missing (which would suggest a broken installation).
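
For reference, a completed one-shot _replicate response reports counters along these lines (a sketch; values are illustrative and some fields are elided):

{
  "ok": true,
  "session_id": "...",
  "source_last_seq": "...",
  "history": [
    {
      "docs_read": 1000,
      "docs_written": 1000,
      "doc_write_failures": 0,
      "missing_checked": 1000,
      "missing_found": 1000
    }
  ]
}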

zhangjingcat commented on June 24, 2024

> Can you clarify step 5? I assume you meant that document "penny-003" did not appear in couchdb-A?
>
> Further questions:
>
>   1. The one-shot replications return important information in the response, like the number of documents written to the target. Can you please show those?
>   2. Are you running any proxy or other software in front of either couchdb server? If so, what is it and what is it doing?
>   3. In your 5-step example, did you use a filter on the replication?
>   4. Your "not_found" discovery is perhaps a failure to find the design document that your filter is in, rather than an indication that the _changes endpoint itself is missing (which would suggest a broken installation).

Hi @rnewson, thanks a lot for your quick reply!

Yes, "penny-003" did not appear in couchdb-A.

1. I used the UI to configure the replications and did not keep the response. What I remember is 0 documents written and null pending. (I can try to reproduce that and copy the new results here later.)

2. No.

3. I did not use a filter on the replication; I just followed the UI instructions to set a source and target.

4. I have some more discoveries about the .ini settings that might cause the replication to fail.

As a special case of data syncing, I carried out some tests replicating data from a local database to another local database (with a dummy name).

For example, we have a "sw360db" which is a small database, and a "sw360changelogs" which is a large database.

The replication job from the local existing "sw360db" database to the local new "penny1" database succeeded.

The replication job from the local existing "sw360changelogs" database to the local new "penny2" database failed (0 documents written and null pending).

I suspect the [replicator] configuration, and I'm carrying out some tests to verify my assumptions.

[couchdb]
max_document_size = 4294967296 ; bytes
uuid = ca6e9bf190c040ec9987fc5512563704
database_dir = ./data
view_index_dir = ./data
file_compression = snappy
max_document_size = 8000000 ; bytes
os_process_timeout = 7000 ; 7 seconds. for view servers.
max_dbs_open = 500
attachment_stream_buffer_size = 8192

[purge]
max_document_id_number = 1000
max_revisions_number = 1000
index_lag_warn_seconds = 86400

[cluster]
q=8
n=3
seedlist = xxx,xxx,xxx

[database_compaction]
doc_buffer_size = 524288
checkpoint_after = 5242880

[view_compaction]
keyvalue_buffer_size = 2097152

[smoosh]
db_channels = upgrade_dbs,ratio_dbs,slack_dbs
view_channels = upgrade_views,ratio_views,slack_views
cleanup_channels = index_cleanup
compaction_log_level = info
persist = true

[smoosh.ratio_dbs]
priority = ratio
min_priority = 2.0

[smoosh.ratio_views]
priority = ratio
min_priority = 2.0

[smoosh.slack_dbs]
priority = slack
min_priority = 536870912

[smoosh.slack_views]
priority = slack
min_priority = 536870912

[attachments]
compression_level = 9
compressible_types = text/*, application/javascript, application/json, application/xml

[uuids]
algorithm = sequential

[stats]
interval = 10

[ioq]
concurrency = 12
ratio = 0.01

[ioq.bypass]
os_process = true
read = true
write = true
view_update = true

[replicator]
max_jobs = 500
interval = 60000
max_churn = 20
max_history = 20
worker_batch_size = 500
worker_processes = 8
http_connections = 20
retries_per_request = 5
connection_timeout = 30000
socket_options = [{keepalive, true}, {nodelay, false}]
checkpoint_interval = 5000
use_checkpoints = true
use_bulk_get = true
valid_socket_options = buffer,keepalive,nodelay,priority,recbuf,sndbuf

[chttpd]
bind_address = 0.0.0.0
buffer_response = true
max_http_request_size = 4294967296 ; 4 GB

[httpd]
server_options = [{backlog, 1024}, {acceptor_pool_size, 16}, {max, 8192}]

[chttpd_auth]
secret = f2336db1
allow_persistent_cookies = true
auth_cache_size = 50
timeout = 600

[rexi]
buffer_count = 2000
server_per_node = true
use_kill_all = true
stream_limit = 7

[prometheus]
additional_port = true
bind_address = 0.0.0.0
port = 9985

[admins]
admin = xxxxxx

Steps:

1. Create the one-time replication job

Screenshot 2024-06-20 at 10 39 15

2. Check the replication job status

Screenshot 2024-06-20 at 10 39 24

3. See if any documents succeeded; unfortunately, none did.

Screenshot 2024-06-20 at 10 39 33

I got this error when creating the replication job.

Request:

curl --request POST \
  --url https://xxx:xxx@xxxxxx/_replicate \
  --header 'Content-Type: application/json' \
  --data '{
    "source": "https://xxx:xxx@xxxxxx/sw360changelogs",
    "target": "https://xxx:xxx@xxxxxx/penny",
    "create_target": true
  }'

Response:

{
  "error": "changes_req_failed",
  "reason": "{error,req_timedout}"
}

rnewson commented on June 24, 2024

Ok, so it sounds like the replication simply crashes at the start, which explains the state of the databases: we don't get as far as knowing which documents on the source need replicating to the target.

I don't see anything in your configuration that explains the issue.

I see you are using https, but your configuration does not include the settings to enable that natively in couchdb. Is there something else providing https, or did you just omit those settings from the configuration above?

Can you test independently of the replicator? E.g., with ApacheBench, run something like ab -n 5000 -A user:pass <url to source db/_changes> and see if all the requests succeed.

zhangjingcat commented on June 24, 2024

Hi @rnewson, thanks for your suggestions. Actually, we created the DNS record pointing to the IP address separately, in the cloud configuration panel. I tried ApacheBench to see the results; the response (70007) is attached below. For other databases the replication was quite fast, and I'm not sure whether the .ini parameters affect the efficiency.

I used the local way to set up the replication job, but it also failed.

Request:

{
  "source": "sw360changelogs",
  "target": "penny",
  "create_target": true
}

Response:

{
  "error": "changes_req_failed",
  "reason": "{error,req_timedout}"
}

ApacheBench test results for sw360changelogs:

(base) i545542@C02FD05EMD6T ~ % ab -n 5000 -A user:pass https://FQDN/sw360changelogs/_changes
This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking FQDN (be patient)
apr_pollset_poll: The timeout specified has expired (70007)

I know replication might be difficult for big databases. Could we declare a small limit under the [replicator] section, so that the whole changes feed is fetched in smaller partitions and it gets faster for us? Is that supported by a setting in the [replicator] section of the .ini files now? Thank you very much!

rnewson commented on June 24, 2024

This does not sound like a replicator problem. Your couchdb is not contactable at all. Focus on testing connectivity with curl and ApacheBench and figure out why that is not working first; then the replication problem will also be solved.

There is no "local way" for replication, internally the "target":"penny" is converted to a url anyway, all replication is over http/https, so you should specify source and target as url's yourself for clarity.

try curl https://FQDN/sw360changelogs/_changes and tell me if that works. try it ten times and tell me if any fail.

zhangjingcat commented on June 24, 2024

Hi @rnewson, thanks a lot for the hints!
Now I tried curl https://FQDN/sw360changelogs/_changes several times, and it took about 4 minutes to respond, as our sw360changelogs database is huge and has 1,818,337 docs inside. We may try increasing the connection timeout in the [replicator] section of the .ini first. What confuses me is that when we reduce the case to a very simple .ini file like the one below, we do not face the replicator issue for sw360changelogs, but when we change back to the complex .ini file for the db setup, the timeout issue appears.

Screenshot 2024-06-20 at 20 41 18
Screenshot 2024-06-20 at 20 55 43

The simple .ini file that works:

[admins]
xxx = xxx

[couchdb]
uuid = 9cfc779d1c23054f8e515f92042695c0
max_document_size = 4294967296 ; bytes
database_dir = ./data
view_index_dir = ./data
file_compression = snappy
os_process_timeout = 7000 ; 7 seconds. for view servers.
max_dbs_open = 500
attachment_stream_buffer_size = 8192

[couch_httpd_auth]
secret = fdccaf1f00c36ea9550a84b423b2e755

[chttpd]
bind_address = 0.0.0.0
port = 5984
max_http_request_size = 4294967296 ; 4 GB

[httpd]
max_http_request_size = 4294967296 ; 4 GB
server_options = [{backlog, 1024}, {acceptor_pool_size, 16}, {max, 8192}]

[cluster]
n = 1

[replicator]
max_jobs = 500
interval = 60000
max_churn = 20
max_history = 20
worker_batch_size = 500
worker_processes = 8
http_connections = 20
retries_per_request = 5
connection_timeout = 30000
socket_options = [{keepalive, true}, {nodelay, false}]
checkpoint_interval = 5000
use_checkpoints = true
use_bulk_get = true
valid_socket_options = buffer,keepalive,nodelay,priority,recbuf,sndbuf

[prometheus]
additional_port = true
bind_address = 0.0.0.0
port = 9985

rnewson commented on June 24, 2024

You say 4 minutes to respond; is that accurate? I would expect retrieving the entire changes response body to take time, but that is not a problem (the replicator processes it as a stream, no matter how long it is).

The req_timedout error is about whether the response even starts, not whether it finishes.

Try curl again and add ?limit=1; tell me if that is quick and reliable.
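
For example (same placeholder FQDN as before):

curl 'https://FQDN/sw360changelogs/_changes?limit=1'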

nickva commented on June 24, 2024

I wonder if it's the buffer_response = true setting; I noticed the simpler .ini file didn't have that?

zhangjingcat commented on June 24, 2024

> You say 4 minutes to respond; is that accurate? I would expect retrieving the entire changes response body to take time, but that is not a problem (the replicator processes it as a stream, no matter how long it is).
>
> The req_timedout error is about whether the response even starts, not whether it finishes.
>
> Try curl again and add ?limit=1; tell me if that is quick and reliable.

Yes, with limit=1 it responds quickly. Not sure whether the data length matters.

Screenshot 2024-06-20 at 22 31 21

zhangjingcat commented on June 24, 2024

> I wonder if it's the buffer_response = true setting; I noticed the simpler .ini file didn't have that?

Thanks for pointing that out; I'll do some comparison tests then.

rnewson commented on June 24, 2024

hm ok, one tip: you really should remove the buffer_response = true setting. It is not the default, and it causes couchdb to attempt to build the entire response in memory before starting to send it.
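
As a sketch, one way to drop it at runtime is the node-local config API, run against each node (admin credentials and FQDN are placeholders); note that if the line stays in the .ini it will come back on restart:

curl -X DELETE https://admin:pass@FQDN/_node/_local/_config/chttpd/buffer_response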

rnewson commented on June 24, 2024

sidebar: we should exclude the _changes response from that setting, or perhaps remove the setting entirely.

zhangjingcat commented on June 24, 2024

> hm ok, one tip: you really should remove the buffer_response = true setting. It is not the default, and it causes couchdb to attempt to build the entire response in memory before starting to send it.

Ok, thanks a lot! Let me try re-setting it up without that param.

nickva commented on June 24, 2024

np at all @zhangjingcat! thanks for stopping by

pennyZhang2024 commented on June 24, 2024

Hi @rnewson and @nickva,

We may have another discovery on the replication. What is the difference between the _replicator database and _replicate?

Actually, we found that "_replicate" works very quickly on our dev couchdb, while "_replicator" does not seem to start.

What might cause this? Thank you very much!

Attached is the .ini file from one of the 3 nodes:


[couchdb]
uuid = a709547499a14e1c9f7ac7b98512b334
file_compression = snappy
os_process_timeout = 7000 ; 7 seconds. for view servers.
max_dbs_open = 500

[purge]
max_document_id_number = 1000  ;by default 100
max_revisions_number = 1000    ;by default 1000
index_lag_warn_seconds = 86400 ;by default 86400

[cluster]
q=8
n=3
seedlist = [email protected],[email protected],[email protected]

[ioq]
concurrency = 12
ratio = 0.01  ; by default 0.01

; [ioq.bypass]
; os_process = true
; read = true
; write = true
; view_update = true

[replicator]
max_jobs = 500
interval = 60000
max_churn = 20
max_history = 20
worker_batch_size = 500
worker_processes = 8
http_connections = 20
retries_per_request = 5
connection_timeout = 30000
socket_options = [{keepalive, true}, {nodelay, false}]
checkpoint_interval = 5000
use_checkpoints = true
use_bulk_get = true
valid_socket_options = buffer,keepalive,nodelay,priority,recbuf,sndbuf

[chttpd]
bind_address = 0.0.0.0
; buffer_response = false  ; root cause, could not replicate data
max_http_request_size = 4294967296 ; 4 GB

[httpd]
server_options = [{backlog, 1024}, {acceptor_pool_size, 16}, {max, 8192}]

[chttpd_auth]
secret = c0bc8fb2
timeout = 600

[rexi]
buffer_count = 2000
server_per_node = true
use_kill_all = true
stream_limit = 8 ;refer to: https://docs.couchdb.org/en/stable/config/cluster.html#rexi/stream_limit

; [prometheus]
; additional_port = true
; bind_address = 0.0.0.0
; port = 9985

[admins]
xxx = xxx

This replication works with "_replicate", though it does not show the time:

Screenshot 2024-06-21 at 15 09 47

This replication fails with "_replicator":
Screenshot 2024-06-21 at 15 00 51

Thank you very much!

pennyZhang2024 commented on June 24, 2024

And when I tried with the APIs,

curl --request PUT \
  --url https://xxx:xxx@FQDN/_replicator \
  --header 'Content-Type: application/json' \
  --data '{
    "_id": "penny_test_001",
    "source": "https://xxx:xxx@FQDN/sw360changelogs",
    "target": "https://xxx:xxx@FQDN/tomato"
  }'

It seems the response is weird.

{
  "error": "file_exists",
  "reason": "The database could not be created, the file already exists."
}

We do not have the tomato database, as I never created it. The _replicator database is empty with no docs in it now, so it should not have dirty data. I haven't deleted it for now, and I'm not sure whether I can delete it to see if I could re-create another empty _replicator database.
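
One likely explanation: a PUT to /_replicator itself is the "create database" request, which would explain the file_exists error, since _replicator already exists; the request body never comes into play. Putting the document id in the URL instead would look something like this (a sketch; placeholders as before):

curl --request PUT \
  --url https://xxx:xxx@FQDN/_replicator/penny_test_001 \
  --header 'Content-Type: application/json' \
  --data '{
    "source": "https://xxx:xxx@FQDN/sw360changelogs",
    "target": "https://xxx:xxx@FQDN/tomato",
    "create_target": true
  }'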

Screenshot 2024-06-21 at 16 19 12

pennyZhang2024 commented on June 24, 2024

And it not only failed when syncing a local database to another local database, but also failed when configuring remote replications, so I guess the issue might be that the replicators are broken.

This also failed.

Screenshot 2024-06-21 at 18 39 53

pennyZhang2024 commented on June 24, 2024

Emmm, for some reason one node of the cluster might be rebooting again and again.

In the _membership response we can see the node going in and out every second.

The response flips frequently between these two:
{
  "all_nodes": [
    "[email protected]",
    "[email protected]"
  ],
  "cluster_nodes": [
    "[email protected]",
    "[email protected]",
    "[email protected]"
  ]
}

{
  "all_nodes": [
    "[email protected]",
    "[email protected]",
    "[email protected]"
  ],
  "cluster_nodes": [
    "[email protected]",
    "[email protected]",
    "[email protected]"
  ]
}

Something might be wrong with the node itself now.
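
One way to see which node is flapping is to poll each node's _up endpoint directly (a sketch; the node addresses are hypothetical stand-ins for the three cluster nodes):

# a healthy node answers {"status":"ok"}; a rebooting one refuses the connection or times out
curl -s http://node1.example.com:5984/_up
curl -s http://node2.example.com:5984/_up
curl -s http://node3.example.com:5984/_up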

error logs found:

[error] 2024-06-17T07:29:51.795109Z [email protected] <0.23400.24> -------- CRASH REPORT Process  (<0.23400.24>) with 0 neighbors exited with reason: killed at gen_server:decode_msg/9(line:481) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {couch_index_updater,init,['Argument__1']}, ancestors: [<0.23385.24>,<0.23384.24>], message_queue_len: 0, links: [], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 29, reductions: 2786440
[error] 2024-06-17T07:29:51.795167Z [email protected] <0.23400.24> -------- CRASH REPORT Process  (<0.23400.24>) with 0 neighbors exited with reason: killed at gen_server:decode_msg/9(line:481) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {couch_index_updater,init,['Argument__1']}, ancestors: [<0.23385.24>,<0.23384.24>], message_queue_len: 0, links: [], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 29, reductions: 2786440
[error] 2024-06-17T07:29:51.795200Z [email protected] <0.23401.24> -------- gen_server <0.23401.24> terminated with reason: killed
  last msg: redacted
     state: {st,<0.23385.24>,couch_mrview_index,undefined}
    extra: []
[error] 2024-06-17T07:29:51.795219Z [email protected] <0.23401.24> -------- gen_server <0.23401.24> terminated with reason: killed
  last msg: redacted
     state: {st,<0.23385.24>,couch_mrview_index,undefined}
    extra: []
[error] 2024-06-17T07:29:51.795270Z [email protected] <0.23401.24> -------- CRASH REPORT Process  (<0.23401.24>) with 0 neighbors exited with reason: killed at gen_server:decode_msg/9(line:481) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {couch_index_compactor,init,['Argument__1']}, ancestors: [<0.23385.24>,<0.23384.24>], message_queue_len: 0, links: [], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 29, reductions: 48074
[error] 2024-06-17T07:29:51.795312Z [email protected] <0.23401.24> -------- CRASH REPORT Process  (<0.23401.24>) with 0 neighbors exited with reason: killed at gen_server:decode_msg/9(line:481) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {couch_index_compactor,init,['Argument__1']}, ancestors: [<0.23385.24>,<0.23384.24>], message_queue_len: 0, links: [], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 29, reductions: 48074

rnewson commented on June 24, 2024

_replicate starts a non-persistent replication process; if it crashes (or completes), then it is over.

Documents created inside the _replicator database also create replication processes, but (with continuous: true) these jobs are restarted on a crash (or node reboot).

https://docs.couchdb.org/en/stable/replication/intro.html#transient-and-persistent-replication
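
A minimal persistent (continuous) replication document, per those docs, would be created with something like this (a sketch; credentials and hosts are placeholders):

curl --request PUT \
  --url https://admin:pass@FQDN/_replicator/sw360changelogs-to-penny \
  --header 'Content-Type: application/json' \
  --data '{
    "source": "https://user:pass@source-FQDN/sw360changelogs",
    "target": "https://user:pass@target-FQDN/penny",
    "create_target": true,
    "continuous": true
  }'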

rnewson commented on June 24, 2024

As the original issue is resolved, I am closing this ticket. If you want a more general chat, I suggest joining our Slack instance or the user mailing list.

pennyZhang2024 commented on June 24, 2024

> _replicate starts a non-persistent replication process; if it crashes (or completes), then it is over.
>
> Documents created inside the _replicator database also create replication processes, but (with continuous: true) these jobs are restarted on a crash (or node reboot).
>
> https://docs.couchdb.org/en/stable/replication/intro.html#transient-and-persistent-replication

Thanks a lot! If possible, could you add me to your user mailing list, as I'm currently not able to use Slack. Thanks.

rnewson commented on June 24, 2024

Hi, the steps to subscribe to our mailing list are on https://couchdb.apache.org
