
Comments (6)

Sdas0000 commented on May 26, 2024

Thanks Nick. We bounced our nodes to sync the doc counts with the master (source). Replication has now stopped again for the same databases: we have 300 databases, and 6 or 7 of them stopped replicating again on the same nodes. Is there anything we should check at the database or shard level to see whether there is an issue with these databases? We have not yet deployed your recommended config changes; we will apply them, observe for a few days, and let you know.

from couchdb.

nickva commented on May 26, 2024
  • If you don't see the replications in _scheduler/jobs, try passing a higher limit=N value. By default, _scheduler/jobs returns only the first 100 jobs. There is also a skip=N parameter for skipping over a number of jobs when paginating.

  • 32 worker processes and 300 HTTP connections do seem a bit high; try reducing them, especially with 1500 jobs per cluster (if you do reach that limit).

  • retries_per_request = 10 also seems high; try 5 or so. When a single request fails, it is retried with exponential backoff, and 10 retries can add up to a relatively long backoff, so the replication job would be stalled waiting to retry that one request. It may be better to let the job crash and restart.

  • We just released 3.3.3, which fixes a few security and replication-related issues and upgrades our Erlang version (used for packages and Docker). See if upgrading makes any difference.

  • See if there are any associated crashes or timeouts in the logs right around the time it looks like the replication jobs get "stuck".
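
As a rough illustration of the limit=N and skip=N pagination mentioned in the first point, here is a minimal Python sketch. It assumes the response carries "total_rows" and "jobs" fields; the helper names (fetch_json, iter_scheduler_jobs), the base URL, and the page size are hypothetical, and authentication is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Default fetcher (hypothetical helper): GET a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def iter_scheduler_jobs(base_url, page_size=500, fetch=fetch_json):
    """Yield every job from _scheduler/jobs, one page at a time."""
    skip = 0
    while True:
        query = urlencode({"limit": page_size, "skip": skip})
        page = fetch(f"{base_url}/_scheduler/jobs?{query}")
        jobs = page.get("jobs", [])
        yield from jobs
        skip += len(jobs)
        # Stop on an empty page or once total_rows jobs have been seen.
        if not jobs or skip >= page.get("total_rows", 0):
            break
```

Something like `[j["id"] for j in iter_scheduler_jobs("http://localhost:5984")]` would then list all job ids rather than only the first 100.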


Sdas0000 commented on May 26, 2024

Nick, we are seeing the same issue after configuring our cluster with the new values you recommended:
{
"max_churn": "20",
"interval": "60000",
"checkpoint_interval": "5000",
"startup_jitter": "5000",
"worker_batch_size": "2000",
"ssl_certificate_max_depth": "3",
"max_jobs": "1500",
"retries_per_request": "5",
"connection_timeout": "120000",
"max_history": "7",
"cluster_quiet_period": "600",
"verify_ssl_certificates": "false",
"socket_options": "[{keepalive, true}, {nodelay, false}]",
"worker_processes": "4",
"http_connections": "20"
}
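
For a sense of why retries_per_request matters, the sketch below sums the waits of a doubling backoff. The 0.25 s first wait and the factor of 2 are assumptions for illustration only; the actual schedule used by the replicator is not stated in this thread.

```python
def total_backoff_seconds(retries, first_wait=0.25, factor=2.0):
    """Sum the waits before each of `retries` retries of a single request.

    first_wait and factor are assumed values, not taken from CouchDB.
    """
    return sum(first_wait * factor ** attempt for attempt in range(retries))


for retries in (5, 10):
    print(retries, "retries ->", total_backoff_seconds(retries), "seconds")
```

Under those assumptions, 5 retries add up to 7.75 seconds of waiting and 10 retries to 255.75 seconds, which is the kind of stall a single failing request could cause before the job is allowed to crash and restart.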

We have 300 databases. After we bounced all 5 nodes with the new configuration (retries_per_request = 5, worker_processes = 4, http_connections = 20), replication stopped for some databases after a few days.

{
"database": "_replicator",
"id": "d61a276b9c8a25a384356f230df2e840+continuous",
"pid": "<0.15554.5790>",
"source": "https://item-cache-master.pr-xxxxxxx.str.xxxxxx.com/item_store-104/",
"target": "http://localhost:5984/item_store-104/",
"user": null,
"doc_id": "9a1dfe729a75461380105a54506eed95",
"info": {
"revisions_checked": 56566938,
"missing_revisions_found": 12566327,
"docs_read": 12548807,
"docs_written": 12548807,
"changes_pending": null,
"doc_write_failures": 0,
"bulk_get_docs": 4183751,
"bulk_get_attempts": 4183751,
"checkpointed_source_seq": "23258687-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISI9JBcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEjNWrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUGt7DnacWWKaZUpaGs1CzYRIxx2AOA4apefawY4zMDExNzC1JN-CLADRX9Se",
"source_seq": "23258687-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISI9JBcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEjNWrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUGt7DnacWWKaZUpaGs1CzYRIxx2AOA4apefawY4zMDExNzC1JN-CLADRX9Se",
"through_seq": "23258687-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISI9JBcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEjNWrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUGt7DnacWWKaZUpaGs1CzYRIxx2AOA4apefawY4zMDExNzC1JN-CLADRX9Se"
},
"history": [
{
"timestamp": "2024-01-05T20:04:20Z",
"type": "started"
},
{
"timestamp": "2024-01-05T20:04:20Z",
"type": "crashed",
"reason": "{noproc,{gen_server,call,[<0.16674.5865>,stop,1200000]}}"
},
{
"timestamp": "2024-01-05T19:54:14Z",
"type": "started"
},
{
"timestamp": "2024-01-05T19:54:14Z",
"type": "crashed",
"reason": "{changes_reader_died,{timeout,ibrowse_stream_cleanup}}"
},
{
"timestamp": "2024-01-05T19:51:39Z",
"type": "started"
},
{
"timestamp": "2024-01-05T19:51:39Z",
"type": "crashed",
"reason": "{changes_reader_died,{timeout,ibrowse_stream_cleanup}}"
},
{
"timestamp": "2024-01-05T19:38:58Z",
"type": "started"
}
],
"node": "[email protected]",
"start_time": "2024-01-03T15:21:28Z"
}
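
When comparing many jobs like the one above, a small summary can help. This is a minimal sketch that only uses fields visible in the status above ("history", "type", "reason", "timestamp", "info", "checkpointed_source_seq", "source_seq"); the function name summarize_job is hypothetical.

```python
from collections import Counter
from datetime import datetime


def summarize_job(job):
    """Condense a scheduler job status dict into a few diagnostic facts."""
    history = job.get("history", [])
    crashes = [e for e in history if e.get("type") == "crashed"]
    times = [datetime.strptime(e["timestamp"], "%Y-%m-%dT%H:%M:%SZ") for e in history]
    info = job.get("info") or {}
    checkpoint = info.get("checkpointed_source_seq")
    return {
        "crash_count": len(crashes),
        "crash_reasons": Counter(e.get("reason") for e in crashes),
        "last_event": max(times).isoformat() if times else None,
        # True when the last checkpoint equals the last source sequence seen.
        "checkpoint_matches_source_seq": checkpoint is not None
        and checkpoint == info.get("source_seq"),
    }
```

For the status above this would report three crashes, two of them with the changes_reader_died reason, and a checkpoint equal to source_seq, since the three sequence values shown are identical.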

Error log:

Jan 04 09:52:20 item-cache-ro-us-east4-5 couchdb[3345357]: [email protected] <0.18866.1840> -------- Replicator, request GET to "https://item-cache-master.pr-store-xxxxxx.str.xxxxxx.com/item_store-104/_changes?feed=continuous&style=all_docs&since=23237954-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISLYMxcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEhRdrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUCt2BjvOLDHNMiUtjWahZkKk4w5AHAeN0p3rwY4zMDExNzC1JN-CLAAhTNNR&timeout=40000" failed due to error {connection_closed,mid_stream}
Jan 04 09:52:50 item-cache-ro-us-east4-5 couchdb[3345357]: [email protected] <0.23398.1979> -------- Replication d61a276b9c8a25a384356f230df2e840+continuous (https://item-cache-master.pr-store-xxxxx.str.xxxxx.com/item_store-104/ -> http://localhost:5984/item_store-104/) failed: {changes_reader_died,{timeout,ibrowse_stream_cleanup}}
Jan 04 09:58:51 item-cache-ro-us-east4-5 couchdb[3345357]: [email protected] <0.10629.1972> -------- Replicator, request GET to "https://item-cache-master.pr-store-xxxxxx.str.xxxxxx.com/item_store-104/_changes?feed=continuous&style=all_docs&since=23237977-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISLYPxcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEhRfrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUCt2AzvOLDHNMiUtjWahZkKk4w5AHAeN0p1bwY4zMDExNzC1JN-CLAA-XdNo&timeout=40000" failed due to error {connection_closed,mid_stream}
Jan 04 09:59:21 item-cache-ro-us-east4-5 couchdb[3345357]: [email protected] <0.26825.1830> -------- Replication d61a276b9c8a25a384356f230df2e840+continuous (https://item-cache-master.pr-store-xxxxx.str.xxxxxx.com/item_store-104/ -> http://localhost:5984/item_store-104/) failed: {changes_reader_died,{timeout,ibrowse_stream_cleanup}}

Note: the above error messages appear for all databases. The same messages also appeared for this database (item_store-104) earlier, but replication did not stop then.


nickva commented on May 26, 2024

Do the source servers show any errors in their logs? The message failed due to error {connection_closed,mid_stream} indicates that something interrupted the changes feed while it was reading or waiting on data. Do you have a proxy in between that might have an inactivity timeout or a maximum connection usage limit?

Are there other issues with the endpoints? When the condition occurs, do basic doc reads and writes still work?

You could also try reducing the batch size a bit more (2000 -> 500) and the worker count as well (4 -> 2), and increasing the timeout a bit more (120000 -> 240000).
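
One way to apply those three values without editing config files is the per-node config HTTP API. The sketch below only builds the PUT requests; it assumes the keys live in the [replicator] section, the helper name build_config_requests and the base URL are hypothetical, and admin authentication is omitted.

```python
import json
from urllib.request import Request

# Values suggested above, as strings (config values are sent as JSON strings).
SUGGESTED = {
    "worker_batch_size": "500",
    "worker_processes": "2",
    "connection_timeout": "240000",
}


def build_config_requests(base_url, settings, node="_local"):
    """Build one PUT request per setting for /_node/{node}/_config/replicator/{key}."""
    requests = []
    for key, value in settings.items():
        url = f"{base_url}/_node/{node}/_config/replicator/{key}"
        body = json.dumps(value).encode("utf-8")
        requests.append(
            Request(url, data=body, method="PUT",
                    headers={"Content-Type": "application/json"})
        )
    return requests
```

Each request could then be sent with `urllib.request.urlopen(req)`. Since this configuration is per node, the same requests would need to go to every node in the cluster.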

Another idea is to upgrade to 3.3.3 with the latest deb package, as we upgraded Erlang/OTP there to fix a memory leak issue. There have also been some replication fixes since then.

If possible, when you identify a stopped db, try updating a document on the source and see what happens: does that change get propagated to the target?
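
That check could be scripted roughly as follows: update a document on the source by hand, note its new revision, then poll the target until the same revision shows up or a timeout passes. This is a sketch under assumptions; fetch_rev and wait_for_rev are hypothetical helpers, and authentication is omitted.

```python
import json
import time
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_rev(db_url, doc_id):
    """Return the current _rev of a document, or None if it does not exist."""
    try:
        with urlopen(f"{db_url.rstrip('/')}/{doc_id}") as resp:
            return json.load(resp).get("_rev")
    except HTTPError as err:
        if err.code == 404:
            return None
        raise


def wait_for_rev(get_rev, expected_rev, timeout=60.0, interval=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_rev() until it returns expected_rev; False if timeout passes first."""
    deadline = clock() + timeout
    while True:
        if get_rev() == expected_rev:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, `wait_for_rev(lambda: fetch_rev("http://localhost:5984/item_store-104", doc_id), new_rev)` would return False if the change never reaches the target within the timeout, where doc_id and new_rev are the document and revision just written on the source.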

