Comments (6)
Thanks Nick. We bounced our nodes to sync the doc count with the master (source). Now we find that replication has stopped again for the same databases: we have 300 databases, and 6 or 7 of them stopped replicating again on the same nodes. Is there anything we should check at the database/shard level to see if there is an issue with these databases? We have not yet deployed your recommended config changes; we will apply them, observe for a few days, and let you know.
from couchdb.
- If you don't see the replications in _scheduler/jobs, try passing in a higher limit=N value. By default _scheduler/jobs returns the first 100 jobs; if you have more, pass in a higher limit value. There is a skip=N parameter too, to skip over a number of jobs when paginating.
- 32 worker processes and 300 http connections does seem a bit high; try reducing them some, especially with 1500 jobs per cluster (if indeed you reach that limit).
- retries_per_request = 10 also seems high; try using 5 or so. When a single request fails it is retried with an exponential backoff. 10 retries can add up to a relatively long backoff, so the replication job would be stalled waiting to retry that one request. It may be better to just let the job crash and restart.
- We just released 3.3.3, where we fixed a few security and replication related issues and upgraded our Erlang version (used for packages and docker). See if upgrading makes any difference.
- See if there are any associated crashes or timeouts in the logs right around the time the replication jobs appear to get "stuck".
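As a rough illustration of the limit/skip pagination mentioned above, here is a minimal Python sketch. It assumes `base_url` points at a node that answers `/_scheduler/jobs` without extra authentication handling (real clusters will usually need admin credentials), and the helper names `fetch_page` and `all_jobs` are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(base_url, limit, skip):
    """GET one page of /_scheduler/jobs and return the decoded JSON body."""
    query = urlencode({"limit": limit, "skip": skip})
    with urlopen(f"{base_url}/_scheduler/jobs?{query}") as resp:
        return json.load(resp)


def all_jobs(base_url, page_size=100, fetch=fetch_page):
    """Collect every job by advancing skip until a short or empty page arrives."""
    jobs, skip = [], 0
    while True:
        page = fetch(base_url, page_size, skip).get("jobs", [])
        jobs.extend(page)
        if len(page) < page_size:
            return jobs
        skip += page_size
```

The `fetch` argument is injectable so the paging loop can be tried against canned data before pointing it at a live cluster.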
from couchdb.
Nick, we are seeing the same issue after configuring our cluster with the new values per your recommendation:
{
"max_churn": "20",
"interval": "60000",
"checkpoint_interval": "5000",
"startup_jitter": "5000",
"worker_batch_size": "2000",
"ssl_certificate_max_depth": "3",
"max_jobs": "1500",
"retries_per_request": "5",
"connection_timeout": "120000",
"max_history": "7",
"cluster_quiet_period": "600",
"verify_ssl_certificates": "false",
"socket_options": "[{keepalive, true}, {nodelay, false}]",
"worker_processes": "4",
"http_connections": "20"
}
We have 300 databases. After we bounced all 5 nodes with the new configuration (retries_per_request=5, worker_processes=4, http_connections=20), replication stopped for some databases after a few days.
{
"database": "_replicator",
"id": "d61a276b9c8a25a384356f230df2e840+continuous",
"pid": "<0.15554.5790>",
"source": "https://item-cache-master.pr-xxxxxxx.str.xxxxxx.com/item_store-104/",
"target": "http://localhost:5984/item_store-104/",
"user": null,
"doc_id": "9a1dfe729a75461380105a54506eed95",
"info": {
"revisions_checked": 56566938,
"missing_revisions_found": 12566327,
"docs_read": 12548807,
"docs_written": 12548807,
"changes_pending": null,
"doc_write_failures": 0,
"bulk_get_docs": 4183751,
"bulk_get_attempts": 4183751,
"checkpointed_source_seq": "23258687-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISI9JBcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEjNWrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUGt7DnacWWKaZUpaGs1CzYRIxx2AOA4apefawY4zMDExNzC1JN-CLADRX9Se",
"source_seq": "23258687-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISI9JBcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEjNWrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUGt7DnacWWKaZUpaGs1CzYRIxx2AOA4apefawY4zMDExNzC1JN-CLADRX9Se",
"through_seq": "23258687-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISI9JBcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEjNWrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUGt7DnacWWKaZUpaGs1CzYRIxx2AOA4apefawY4zMDExNzC1JN-CLADRX9Se"
},
"history": [
{
"timestamp": "2024-01-05T20:04:20Z",
"type": "started"
},
{
"timestamp": "2024-01-05T20:04:20Z",
"type": "crashed",
"reason": "{noproc,{gen_server,call,[<0.16674.5865>,stop,1200000]}}"
},
{
"timestamp": "2024-01-05T19:54:14Z",
"type": "started"
},
{
"timestamp": "2024-01-05T19:54:14Z",
"type": "crashed",
"reason": "{changes_reader_died,{timeout,ibrowse_stream_cleanup}}"
},
{
"timestamp": "2024-01-05T19:51:39Z",
"type": "started"
},
{
"timestamp": "2024-01-05T19:51:39Z",
"type": "crashed",
"reason": "{changes_reader_died,{timeout,ibrowse_stream_cleanup}}"
},
{
"timestamp": "2024-01-05T19:38:58Z",
"type": "started"
}
],
"node": "[email protected]",
"start_time": "2024-01-03T15:21:28Z"
}
Error log:
Jan 04 09:52:20 item-cache-ro-us-east4-5 couchdb[3345357]: [email protected] <0.18866.1840> -------- Replicator, request GET to "https://item-cache-master.pr-store-xxxxxx.str.xxxxxx.com/item_store-104/_changes?feed=continuous&style=all_docs&since=23237954-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISLYMxcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEhRdrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUCt2BjvOLDHNMiUtjWahZkKk4w5AHAeN0p3rwY4zMDExNzC1JN-CLAAhTNNR&timeout=40000" failed due to error {connection_closed,mid_stream}
Jan 04 09:52:50 item-cache-ro-us-east4-5 couchdb[3345357]: [email protected] <0.23398.1979> -------- Replication d61a276b9c8a25a384356f230df2e840+continuous (https://item-cache-master.pr-store-xxxxx.str.xxxxx.com/item_store-104/ -> http://localhost:5984/item_store-104/) failed: {changes_reader_died,{timeout,ibrowse_stream_cleanup}}
Jan 04 09:58:51 item-cache-ro-us-east4-5 couchdb[3345357]: [email protected] <0.10629.1972> -------- Replicator, request GET to "https://item-cache-master.pr-store-xxxxxx.str.xxxxxx.com/item_store-104/_changes?feed=continuous&style=all_docs&since=23237977-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISLYPxcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEhRfrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUCt2AzvOLDHNMiUtjWahZkKk4w5AHAeN0p1bwY4zMDExNzC1JN-CLAA-XdNo&timeout=40000" failed due to error {connection_closed,mid_stream}
Jan 04 09:59:21 item-cache-ro-us-east4-5 couchdb[3345357]: [email protected] <0.26825.1830> -------- Replication d61a276b9c8a25a384356f230df2e840+continuous (https://item-cache-master.pr-store-xxxxx.str.xxxxxx.com/item_store-104/ -> http://localhost:5984/item_store-104/) failed: {changes_reader_died,{timeout,ibrowse_stream_cleanup}}
Note: the above error messages appear for all databases. They also appeared earlier for this database (item_store-104), but replication did not stop at that time.
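To compare jobs across the 300 databases, the history array in the job output above can be tallied by crash reason. The following is only a sketch: it assumes each job is a dict shaped like the _scheduler/jobs entry shown earlier, and `crash_summary` is a hypothetical helper name.

```python
from collections import Counter


def crash_summary(job):
    """Count 'crashed' history entries per reason for one scheduler job dict."""
    return Counter(
        event.get("reason", "unknown")
        for event in job.get("history", [])
        if event.get("type") == "crashed"
    )


# History entries copied from the job output above (timestamps omitted).
job = {
    "history": [
        {"type": "started"},
        {"type": "crashed",
         "reason": "{noproc,{gen_server,call,[<0.16674.5865>,stop,1200000]}}"},
        {"type": "started"},
        {"type": "crashed",
         "reason": "{changes_reader_died,{timeout,ibrowse_stream_cleanup}}"},
        {"type": "started"},
        {"type": "crashed",
         "reason": "{changes_reader_died,{timeout,ibrowse_stream_cleanup}}"},
        {"type": "started"},
    ]
}

for reason, count in crash_summary(job).most_common():
    print(count, reason)
```

Running the same tally over stopped and healthy jobs may show whether the stopped ones crash for a different reason or simply more often.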
from couchdb.
Do the source servers show any errors in the logs? failed due to error {connection_closed,mid_stream} indicates something interrupted the changes feed while it was reading or waiting on data. Do you have a proxy in between that might have an inactivity timeout or a max connection usage limit?
Are there other issues with the endpoints? When the condition occurs, do basic doc reads and writes still work?
You could try reducing the batch size a bit more (2000 -> 500) and the worker count as well (4 -> 2), and increasing the timeout a bit more (120000 -> 240000).
Another idea is to upgrade to 3.3.3 with the latest deb package, as we upgraded Erlang/OTP there to fix a memory leak issue. There were also some replication fixes since then.
If possible, when you identify a stopped db, try updating a doc on the source and see what happens: does that change get propagated to the target?
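For reference, the three values suggested above could be applied through the node-local config endpoint (PUT /_node/_local/_config/{section}/{key}). The sketch below only builds the requests and sends nothing; `base_url`, the `SUGGESTED` dict, and the `config_requests` helper are assumptions made for this example, and admin credentials would still need to be added before sending.

```python
import json
from urllib.request import Request

# Values taken from the suggestion above; config values are sent as JSON strings.
SUGGESTED = {
    "worker_batch_size": "500",
    "worker_processes": "2",
    "connection_timeout": "240000",
}


def config_requests(base_url, settings=SUGGESTED, node="_local"):
    """Build one PUT request per [replicator] setting; nothing is sent here."""
    return [
        Request(
            f"{base_url}/_node/{node}/_config/replicator/{key}",
            data=json.dumps(value).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        for key, value in settings.items()
    ]


for req in config_requests("http://localhost:5984"):
    print(req.get_method(), req.full_url, req.data.decode())
```

Each request targets a single node, so in a 5-node cluster the same settings would have to be sent to every node.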
from couchdb.