resqu-server's People

Contributors

aharsani, dmolnarqu, ftibensky, jperdochqu, jzernovic, martincivan, matuszeman, mcerva

resqu-server's Issues

Stuck ResQU jobs: La_WorkReport_ProcessUserSlotReportJob

Describe the bug
During maintenance I found some stuck workers:

|7|[[email protected] ~]$ ps aux | grep resqu-v5-
root      3298  0.0  0.0 492948 11384 ?        S    07:50   0:00 resqu-v5-11602-std-proc: Processing job La_WorkReport_ProcessUserSlotReportJob
root      3308  0.0  0.0 492948 11436 ?        S    07:50   0:00 resqu-v5-11601-std-proc: Processing job La_WorkReport_ProcessUserSlotReportJob
root     11601  0.1  0.0 490900  5568 ?        S    máj03  24:12 resqu-v5-w-default_job_pool-s: Forked 3308 at 2020-05-13 00:50:14
root     11602  0.1  0.0 490900  5668 ?        S    máj03  24:08 resqu-v5-w-default_job_pool-s: Forked 3298 at 2020-05-13 00:50:14

They seem to just exist, with nothing to do.

Sending SIGTERM has no effect:

|7|[[email protected] ~]$ kill 3298 3308
|7|[[email protected] ~]$ ps aux | grep resqu-v5-
root      3298  0.0  0.0 492948 11384 ?        S    07:50   0:00 resqu-v5-11602-std-proc: Processing job La_WorkReport_ProcessUserSlotReportJob
root      3308  0.0  0.0 492948 11436 ?        S    07:50   0:00 resqu-v5-11601-std-proc: Processing job La_WorkReport_ProcessUserSlotReportJob
root     11601  0.1  0.0 490900  5568 ?        S    máj03  24:12 resqu-v5-w-default_job_pool-s: Forked 3308 at 2020-05-13 00:50:14
root     11602  0.1  0.0 490900  5668 ?        S    máj03  24:08 resqu-v5-w-default_job_pool-s: Forked 3298 at 2020-05-13 00:50:14

Force killing was successful:

|7|[[email protected] ~]$ kill 3298 3308
|7|[[email protected] ~]$ ps aux | grep resqu-v5-
root      3298  0.0  0.0 492948 11384 ?        S    07:50   0:00 resqu-v5-11602-std-proc: Processing job La_WorkReport_ProcessUserSlotReportJob
root      3308  0.0  0.0 492948 11436 ?        S    07:50   0:00 resqu-v5-11601-std-proc: Processing job La_WorkReport_ProcessUserSlotReportJob
root     11601  0.1  0.0 490900  5568 ?        S    máj03  24:12 resqu-v5-w-default_job_pool-s: Forked 3308 at 2020-05-13 00:50:14
root     11602  0.1  0.0 490900  5668 ?        S    máj03  24:08 resqu-v5-w-default_job_pool-s: Forked 3298 at 2020-05-13 00:50:14
root     28504  0.0  0.0 112736   964 pts/0    S+   09:39   0:00 grep --color=auto resqu-v5-
|7|[[email protected] ~]$ kill -9 3298 3308
|7|[[email protected] ~]$ ps aux | grep resqu-v5-
root     11601  0.1  0.0 186312  6528 ?        R    máj03  24:12 resqu-v5-w-default_job_pool-s: Shutting down
root     11602  0.1  0.0 359828  6660 ?        R    máj03  24:08 resqu-v5-w-default_job_pool-s: Shutting down
root     28576  0.0  0.0 112736   964 pts/0    S+   09:39   0:00 grep --color=auto resqu-v5-
|7|[[email protected] ~]$ ps aux | grep resqu-v5-
root     28604  0.0  0.0 112732   964 pts/0    S+   09:39   0:00 grep --color=auto resqu-v5-

Expected behavior
All ResQU 5 processes shut down correctly, without any manual killing.
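
Until the root cause is found, the shutdown can be escalated operationally. A minimal sketch, assuming ext-posix and ext-pcntl are available and the stuck PIDs were identified as above; this helper is hypothetical and not part of resqu:

<?php

// Escalate from SIGTERM to SIGKILL for a stuck job process.
// Hypothetical helper; requires ext-posix (posix_kill) and
// ext-pcntl (SIGTERM/SIGKILL constants).
function terminateStuckJob(int $pid, int $gracePeriodSeconds = 10): bool
{
    if (!posix_kill($pid, SIGTERM)) {
        return false; // process already gone or no permission
    }

    // Give the process a grace period to shut down cleanly.
    for ($i = 0; $i < $gracePeriodSeconds; $i++) {
        sleep(1);
        if (!posix_kill($pid, 0)) { // signal 0 only checks existence
            return true; // process exited after SIGTERM
        }
    }

    // SIGTERM was ignored, as in this report - force kill.
    return posix_kill($pid, SIGKILL);
}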

Remove conflicting composer dependencies

Dependency version differences between LA and Resqu can cause serious issues with job execution.

The whole LA environment is directly included in the resqu environment, which makes this possible.
There are a few available solutions:

  • Create a standalone process for each executed job, without the resqu environment in it:
    This would severely impact performance and would likely require additional scaling. The development effort is significant, because a whole new execution model would have to be implemented, causing instability and maintenance overhead. This is an effective but relatively expensive long-term solution, and it might not be worth it considering the following proposal.
  • Migrate away from resqu completely:
    The most expensive of all the solutions; the point of this migration is to eliminate the artificial feature requirements that are difficult to meet and maintain. By altering the application code slightly, we should in theory be able to use any off-the-shelf solution for job queuing and execution, which would reduce our maintenance costs. This solution, however, comes with its own challenges, and it has to be treated as a very long-term effort. This scenario is our original target, but the scope of the task is too big to fit into the time constraints we currently have.
  • Bake all the dependencies into the project namespace (a minimal sketch follows below):
    By far the fastest solution. Even though the project is almost at a standstill from the perspective of new features, any update to support a new major PHP release will require this to be repeated. The cost, however, is still lower than the cost of manually keeping the dependency versions of resqu and LA in sync, which would be very time-consuming and error-prone.
    This can be further optimized by limiting the scope of the used libraries to better match the needs of resqu, lowering the amount of work needed to bake them all into the project.
  • Do not depend on the same libraries:
    This is an option in theory, but not in practice.

Edit: Another proposal was to merge resqu into LA and run workers as LA entry points. That approach is equivalent to the second option, because workers for every deployed version of LA would have to exist. As mentioned above, a significant effort is needed to achieve that, and at that point, there is no reason to prefer resqu over other solutions on the market.
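
For the baking option, the Resque\Libs\Colinmollenhour\Credis namespaces visible in the stack traces elsewhere in this tracker suggest prefixing vendored code under Resque\Libs. A minimal sketch of a scoper.inc.php, assuming humbug/php-scoper is used for the baking (configuration keys can differ between php-scoper versions, and the library list is only an example):

<?php

// scoper.inc.php - sketch only, not the actual build configuration.
use Isolated\Symfony\Component\Finder\Finder;

return [
    // All scoped code is moved under this namespace prefix, so vendored
    // classes end up under Resque\Libs\... instead of clashing with the
    // versions shipped by LA.
    'prefix' => 'Resque\\Libs',

    // Limit the scope to the libraries resqu actually needs, which keeps
    // the amount of baked-in code small.
    'finders' => [
        Finder::create()
            ->files()
            ->in('vendor/colinmollenhour/credis')
            ->name('*.php'),
    ],

    'patchers' => [],
];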

Same PID for 2 pools

On a Docker installation we identified a PID on a worker process that was already in use by a different pool.

This happened on a Docker installation, and afterwards the conversation indexing worker wasn't spawned, because its PID was already allocated by the sendmail workers.

Can we rethink the allocation check in the future?
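
A bare "PID exists" check cannot survive PID reuse. A minimal sketch of a stricter check (hypothetical helper, not the actual resqu allocation code; Linux-only because it reads /proc): verify that the process behind the PID is still a worker of the expected pool by looking at its process title.

<?php

// Hypothetical stricter liveness check: the worker process title contains
// the pool name (e.g. "resqu-v5-w-default_job_pool-s: ..." in the ps
// output above), so a reused PID from another pool will not match.
function pidBelongsToPool(int $pid, string $poolName): bool
{
    $cmdlinePath = "/proc/{$pid}/cmdline"; // Linux-only
    if (!is_readable($cmdlinePath)) {
        return false; // process is gone
    }

    // cmdline separates arguments with NUL bytes.
    $cmdline = str_replace("\0", ' ', (string) file_get_contents($cmdlinePath));

    return strpos($cmdline, $poolName) !== false;
}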

Interrupted system call in SignalTracker.php

Log from Kibana

pcntl_sigtimedwait(): Interrupted system call in /opt/qu/php-resque-releases/php-resque-5.5.1/lib/Process/SignalTracker.php on line 66

Additional context
host: 4_app-q_la_ws-eu, 1_app-q_la_linode-us-nj, 2_app-q_la_linode-de
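
This warning occurs when pcntl_sigtimedwait() is interrupted by an unrelated signal (EINTR) before the timeout elapses. A minimal sketch of an EINTR-tolerant wrapper (hypothetical; the actual SignalTracker code may differ), which retries with the remaining timeout instead of surfacing the warning:

<?php

// Retry pcntl_sigtimedwait() when it is interrupted by an unrelated
// signal (EINTR); requires ext-pcntl and PHP 8+ for the union return type.
function waitForSignal(array $signals, int $timeoutSeconds): int|false
{
    $deadline = time() + $timeoutSeconds;

    do {
        $remaining = max(0, $deadline - time());
        $signo = @pcntl_sigtimedwait($signals, $info, $remaining);

        if ($signo !== false) {
            return $signo; // received one of the awaited signals
        }
        // Retry only on EINTR; any other error is a real failure.
    } while (pcntl_get_last_error() === PCNTL_EINTR && time() < $deadline);

    return false; // timed out or failed
}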

Exception: All pool names must be unique in GlobalConfig.php

Logs from Kibana:

Uncaught Resque\Config\ConfigException: All pool names must be unique. 
in /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Config/GlobalConfig.php:221 Stack trace: 
#0 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Config/GlobalConfig.php(123): Resque\Config\GlobalConfig->validatePoolNames() 
#1 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Config/GlobalConfig.php(64): Resque\Config\GlobalConfig::reload() 
#2 /opt/qu/php-resque-releases/php-resque-5.6.4/scripts/startManagers.php(14): Resque\Config\GlobalConfig::initialize('/etc/resque-ser...') 
#3 {main}   thrown in /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Config/GlobalConfig.php on line 221

Details:

  • 1_app-q_la_linode-uk
  • level: Fatal_error
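
The exception indicates that two pools in the config file share a name. A minimal sketch of the kind of check that raises it (hypothetical reimplementation; the real validatePoolNames() in GlobalConfig.php may differ and throws Resque\Config\ConfigException):

<?php

// Pool names collected from the config must be unique across all pools.
function validatePoolNames(array $poolNames): void
{
    // array_unique() drops later duplicates; array_diff_assoc() then
    // yields exactly those dropped (duplicate) entries.
    $duplicates = array_diff_assoc($poolNames, array_unique($poolNames));
    if ($duplicates !== []) {
        throw new RuntimeException(
            'All pool names must be unique. Duplicates: ' . implode(', ', $duplicates)
        );
    }
}

// validatePoolNames(['default_job_pool', 'sendmail', 'default_job_pool']);
// throws: All pool names must be unique. Duplicates: default_job_pool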

Provide SHA256 hash sum on Resqu built stage

Whenever a new Resqu version is built, the hash of the generated archive should be calculated right away and published in a separate file, in the release notes, or wherever is convenient and easy to implement. This is needed because in the current workflow admins download the archive and calculate the hash on their own, which means the archive could have been altered in transit between GitHub and the admin, and there is no way to prove it wasn't.
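
A minimal sketch of such a build step (hypothetical script; the archive name is a placeholder):

<?php

// Build-stage step: compute the SHA256 of the generated archive and write
// it next to the archive, so admins verify their download against a hash
// published at build time instead of one they compute themselves.
$archive = $argv[1] ?? 'php-resque-5.6.4.tar.gz'; // placeholder name

$hash = hash_file('sha256', $archive);
if ($hash === false) {
    fwrite(STDERR, "Cannot read {$archive}\n");
    exit(1);
}

// Same line format as sha256sum(1), so `sha256sum -c` can verify it.
file_put_contents($archive . '.sha256', "{$hash}  {$archive}\n");
echo "{$hash}  {$archive}\n";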

PHP 8.1 compatibility

In order to update to PHP 8.1, compatibility should be checked and any problems found should be fixed.

Uncaught RedisException: socket error on read socket in Client.php.

Logs from Kibana:

Uncaught RedisException: socket error on read socket in /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Libs/Colinmollenhour/Credis/Client.php:1172 Stack trace:
#0 [internal function]: Redis->eval('redis.call('DEL...', Array, 2)
#1 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Libs/Colinmollenhour/Credis/Client.php(1172): call_user_func_array(Array, Array)
#2 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Redis.php(370): Resque\Libs\Colinmollenhour\Credis\Client->__call('eval', Array)
#3 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Worker/WorkerImage.php(115): Resque\Redis->__call('eval', Array)
#4 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Maintenance/StaticPoolMaintainer.php(88): Resque\Worker\WorkerImage->unregister()
#5 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Maintenance/StaticPoolMaintainer.php(62): Resque\Maintenance\StaticPoolMaintainer->cleanupWorkers(10)
#6 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Init/InitProcess.php(45): Resque\Maintenance\StaticPoolMaintainer->maintain()
#7 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Init/InitProcess.php(37): Resque\Init\InitProcess->recover()
#8 /opt/qu/php-resque-releases/php-resque-5.6.4/scripts/startManagers.php(21): Resque\Init\InitProcess->maintain()
#9 {main}
Next Resque\Libs\Colinmollenhour\Credis\CredisException: socket error on read socket in /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Libs/Colinmollenhour/Credis/Client.php:1190 Stack trace:
#0 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Redis.php(370): Resque\Libs\Colinmollenhour\Credis\Client->__call('eval', Array)
#1 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Worker/WorkerImage.php(115): Resque\Redis->__call('eval', Array)
#2 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Maintenance/StaticPoolMaintainer.php(88): Resque\Worker\WorkerImage->unregister()
#3 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Maintenance/StaticPoolMaintainer.php(62): Resque\Maintenance\StaticPoolMaintainer->cleanupWorkers(10)
#4 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Init/InitProcess.php(45): Resque\Maintenance\StaticPoolMaintainer->maintain()
#5 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Init/InitProcess.php(37): Resque\Init\InitProcess->recover()
#6 /opt/qu/php-resque-releases/php-resque-5.6.4/scripts/startManagers.php(21): Resque\Init\InitProcess->maintain()
#7 {main}
Next Resque\RedisError: Error communicating with Redis: socket error on read socket in /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Redis.php:412 Stack trace:
#0 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Redis.php(372): Resque\Redis->attemptCallRetry(Object(Resque\Libs\Colinmollenhour\Credis\CredisException), 'eval', Array)
#1 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Worker/WorkerImage.php(115): Resque\Redis->__call('eval', Array)
#2 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Maintenance/StaticPoolMaintainer.php(88): Resque\Worker\WorkerImage->unregister()
#3 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Maintenance/StaticPoolMaintainer.php(62): Resque\Maintenance\StaticPoolMaintainer->cleanupWorkers(10)
#4 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Init/InitProcess.php(45): Resque\Maintenance\StaticPoolMaintainer->maintain()
#5 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Init/InitProcess.php(37): Resque\Init\InitProcess->recover()
#6 /opt/qu/php-resque-releases/php-resque-5.6.4/scripts/startManagers.php(21): Resque\Init\InitProcess->maintain()
#7 {main}   thrown in /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Redis.php on line 412

Additional context
Level: Fatal_error
Host: 2_app-q_la_linode-us-tx
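
The trace shows the failure surfacing through Resque\Redis->attemptCallRetry(), i.e. a retry already happened and also failed. A minimal sketch of that general pattern (hypothetical reimplementation using the phpredis client; connection parameters are placeholders, and the real resqu code wraps Credis instead):

<?php

// Retry a Redis call once after a dropped connection
// ("socket error on read socket").
function callWithRetry(Redis $redis, string $method, array $args)
{
    try {
        return $redis->{$method}(...$args);
    } catch (RedisException $e) {
        // Reconnect and retry once; a second failure propagates.
        $redis->close();
        $redis->connect('127.0.0.1', 6379); // placeholder connection params
        return $redis->{$method}(...$args);
    }
}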

Overloaded resque workers by mass action

Today both app-q machines in la.linode-de were overloaded, possibly by this mass action job. Both workers (1.app-q and 2.app-q) had 100% CPU utilization at the time, and the queue for async_rpc_jobs was rising, which caused problems when a computed filter was requested.
We were able to fix this by restarting the resqu-5 service on both workers, after which the load on 2.app-q dropped to normal values (around 11:42 CEST). 1.app-q was overloaded even after the restart, but I suppose the mass action needs to finish first.

Stuck queue locks

We found lingering queue locks with no trace of what might have happened to prevent their removal. The issue can be temporarily worked around by simply purging resqu-v5:unique:queue_locks.
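
The workaround as a minimal phpredis sketch (connection parameters are placeholders); resqu recreates the locks as jobs run:

<?php

// Temporary workaround: drop the whole lock structure.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379); // placeholder

$redis->del('resqu-v5:unique:queue_locks');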

Remove locks that are older than one day

Seeing that #11 is not going to be fixed any time soon, we should at least add a way to mitigate locks getting stuck. Since we are also slowly moving away from resqu, the proper fix might not be needed at all.
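
A minimal sketch of the proposed mitigation, under the assumption that resqu-v5:unique:queue_locks is a hash of lock name => creation timestamp (the real structure may differ, in which case the iteration has to be adapted):

<?php

// Drop every queue lock older than one day.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379); // placeholder

$cutoff = time() - 86400; // one day ago

// Assumed layout: lockName => unix timestamp of creation.
foreach ($redis->hGetAll('resqu-v5:unique:queue_locks') as $lock => $createdAt) {
    if ((int) $createdAt < $cutoff) {
        $redis->hDel('resqu-v5:unique:queue_locks', $lock);
    }
}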

Missing Resqu documentation

Currently, there is no actual Resqu documentation for its configuration. https://github.com/QualityUnit/resqu-server/tree/master/resources is clearly outdated and provides no context. We should have documentation covering what we may change, what is better left alone, and, generally, what we can do to adjust Resqu settings in production. Ideally, the docs should clearly explain each setting we may use; any details about the inner workings of Resqu that the author thinks admins would benefit from knowing are very much appreciated.

Warning: Failed to open stream: Permission denied in Lang/Storage/CacheFile.class.php

Logs from Kibana:

file_put_contents(/opt/qu/apps/accounts/u196628/accounts/default1/cache/lang/la_sk_5.37.2.18.s.php): 
Failed to open stream: Permission denied in /opt/qu/apps/versions/la/5-37-2-18/include/Gpf/Lang/Storage/CacheFile.class.php on line 21

Details:

  • host: 5_app-q_la_ws-eu
  • version: 5-37-2-18
  • level: Warning
  • account id: u196628

Array to string conversion in Redis.php.

Logs from Kibana:

Array to string conversion in /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Redis.php on line 404

Additional context
Level: Warning
Host: 2_app-q_la_linode-us-tx
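
PHP emits this warning when an array ends up in string context, which in a Redis wrapper typically happens while inlining call arguments into an error message. A minimal sketch of a guard (hypothetical; the actual code around Redis.php line 404 is not shown in the log):

<?php

// Stringify Redis call arguments safely for error messages:
// nested arrays are JSON-encoded instead of being cast to "Array".
function describeArgs(array $args): string
{
    return implode(', ', array_map(
        static fn ($arg) => is_scalar($arg) ? (string) $arg : json_encode($arg),
        $args
    ));
}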

Config file failed to parse in GlobalConfig.php

Logs from Kibana:

Uncaught RuntimeException: Config file failed to parse. in /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Config/GlobalConfig.php:78 Stack trace: 
#0 /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Config/GlobalConfig.php(64): Resque\Config\GlobalConfig::reload() 
#1 /opt/qu/php-resque-releases/php-resque-5.6.4/scripts/startManagers.php(14): Resque\Config\GlobalConfig::initialize('/etc/resque-ser...') 
#2 {main}   thrown in /opt/qu/php-resque-releases/php-resque-5.6.4/lib/Config/GlobalConfig.php on line 78

Details:

  • host: 1_app-q_la_linode-uk
  • level: Fatal_error

Resque lock is not removed automatically

As far as I know, we have a mechanism that removes resque locks. But customer ftmo.ladesk.com has already been locked twice, and the lock was not removed automatically. We don't know what caused the first lock, but the second one seems to have been caused by an error during the execution of a mass action job. The error is reported here: https://github.com/QualityUnit/la-issues/issues/13439.

The whole situation meant that only ~4 emails per hour were sent, while there were >1000 emails waiting to be sent. It seems most emails were rescheduled instead of sent.
