
chaoss / grimoirelab-kingarthur

King Arthur commands his loyal knight Perceval on the quest to retrieve data from software repositories.

License: GNU General Public License v3.0

Python 91.02% HTML 8.02% Makefile 0.27% Dockerfile 0.70%

grimoirelab-kingarthur's People

Contributors

cewilliams, dependabot[bot], dlumbrer, jgbarah, jjmerchante, mafesan, olblak, sduenas, valeriocos, vchrombie, whatstherub, zhquan


grimoirelab-kingarthur's Issues

Decouple job events listener from job scheduler

The job events listener is coupled inside the scheduler. Its function is to listen for new events and to handle them. The current handlers only reschedule jobs, and they are provided when the listener is created.

The problem with this structure is that, to reschedule jobs, the scheduler has to be passed to the handlers, creating a scheduler-listener-scheduler cycle. This is bad design.

This task aims to refactor the code and decouple these objects. It will improve the logic, make it easier to monitor which jobs are doing what, and simplify how the code is tested.
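
A minimal sketch of the decoupling idea, with hypothetical names: handlers receive only the event, so the listener never needs a reference back to the scheduler.

    # Hypothetical sketch: the listener dispatches events to plain callables,
    # breaking the scheduler-listener-scheduler cycle.
    class JobEventsListener:
        def __init__(self, handlers=None):
            # maps an event type to a callable that takes the event
            self.handlers = dict(handlers or {})

        def dispatch(self, event):
            handler = self.handlers.get(event.type)
            if handler is not None:
                handler(event)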

Store job logs

We should find a way to store the log messages produced by each job. These messages contain valuable information to keep track of the status of a job and to debug any possible error. So far these messages are only printed by the workers.

The easiest way to add this feature is to store the logs in log files, but that would make it difficult to access them through the REST API that arthurd provides. Maybe the best option is to store them alongside the jobs in RQ and Redis.
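
A hypothetical sketch of that option: a logging handler that pushes each job's records into a Redis list keyed by the job id, where the REST API could later read them. The key layout and the handler itself are assumptions, not existing arthur code.

    import logging

    class RedisJobLogHandler(logging.Handler):
        """Store a job's log records in Redis under a per-job key (hypothetical)."""

        def __init__(self, conn, job_id):
            super().__init__()
            self.conn = conn  # a redis.Redis connection
            self.key = 'arthur:logs:' + job_id

        def emit(self, record):
            self.conn.rpush(self.key, self.format(record))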

Moving task scheduling outside of king arthur

Hi,

After a year of working with different Big Data platforms, I have reached the conclusion that task scheduling should be done outside arthur. arthur should be a task repository and also the engine for executing tasks, tracking executions, stopping them, and providing stats about past executions. The real execution is done by the data processing engine; in our case, RQ and the Python tasks for collecting the data.

Task scheduling is something totally different from task execution. For example, in Unix systems, scheduling is done with tools like cron. In our case, for the GrimoireLab platform, there is a Python-based tool called Airflow (https://airflow.apache.org/), which is the base for the Google Composer service, demonstrating its maturity.

Airflow could ask arthur to execute the jobs according to different schedules. The scheduling can be defined in Python code, and it has a great web interface to visualize complex schedules (I am not sure, but you can probably also visualize the execution progress of the tasks).

My proposal is to use Airflow in the GrimoireLab platform for task scheduling, and to simplify arthur by removing the logic related to scheduling.
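
A rough sketch of the idea, assuming Airflow drives arthur through its REST API. The DAG and payload below are illustrative only; the /add endpoint and task shape follow the example shown later on this page.

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def add_arthur_task():
        # Ask arthur to run a collection task (payload shape is illustrative)
        payload = {
            "tasks": [{
                "task_id": "example-repo",
                "backend": "git",
                "backend_args": {"uri": "https://github.com/chaoss/grimoirelab-kingarthur"}
            }]
        }
        requests.post("http://localhost:8080/add", json=payload)

    # Airflow owns the schedule; arthur only executes and tracks the jobs
    dag = DAG('arthur_collect', start_date=datetime(2018, 1, 1),
              schedule_interval='@hourly')

    PythonOperator(task_id='add_arthur_task', python_callable=add_arthur_task, dag=dag)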

Guarding against result.summary being unset

I'm trying King Arthur out, and I have run into a couple of instances where the job result is None or result.summary is None, but the code tries to get fields from it and crashes. I put some quick if result.summary: clauses in my local version to plug the hole, but I wonder if that's just masking a deeper issue: an assumed guarantee about when result.summary is set that is not being respected.

I've seen this happen with at least this snippet from scheduler.py:

            logger.error("Job #%s (task: %s) failed but will be resumed",
                         job_id, task_id)

            if result.summary.fetched > 0:
                task.backend_args['next_from_date'] = result.summary.max_updated_on

                if result.summary.max_offset:
                    task.backend_args['next_offset'] = result.summary.max_offset

which raises AttributeError: 'NoneType' object has no attribute 'summary', and this one in jobs.py:

    logger.debug("Job #%s (task: %s) completed (%s) - %s/%s items (%s) fetched",
                 result.job_id, task_id, result.backend,
                 str(result.summary.fetched), str(result.summary.skipped),
                 result.category)

which raises AttributeError: 'NoneType' object has no attribute 'fetched'.
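
For reference, the local workaround looks roughly like this; it plugs the crash but leaves open the deeper question of when result.summary is guaranteed to be set:

    # Guard against a missing result or summary before touching its fields
    if result is not None and result.summary is not None:
        if result.summary.fetched > 0:
            task.backend_args['next_from_date'] = result.summary.max_updated_on

            if result.summary.max_offset:
                task.backend_args['next_offset'] = result.summary.max_offset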

Update task to RUNNING state

Tasks are not updated to the RUNNING state while their jobs are running in a worker. To do this, the job needs to send an event when it starts running in the worker.

[server] Return job_number field when querying tasks

Once #98 is added, the next step is to add to the CherryPy server the functionality of returning the job_number field in the requests for tasks and for a single task by its id.

This task is directly related to Bitergia/raistlin#16

async is now a reserved word in Python 3.7

Python 3.7 makes async a reserved word: https://docs.python.org/3/reference/compound_stmts.html#async

This causes errors where it is used as a parameter name with RQ:
https://github.com/chaoss/grimoirelab-kingarthur/blob/master/arthur/scheduler.py#L90
https://github.com/chaoss/grimoirelab-kingarthur/blob/master/tests/test_jobs.py#L544

RQ version 0.12 changed the parameter name to is_async: CHANGES.md

On Travis CI, using

dist: xenial
python:
  - "3.7"

will show the error: build
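
A minimal sketch of the rename, assuming RQ >= 0.12:

    from redis import Redis
    from rq import Queue

    # Before RQ 0.12 (breaks on Python 3.7, where `async` is a keyword):
    #   queue = Queue('create', connection=Redis(), async=True)

    # From RQ 0.12 on:
    queue = Queue('create', connection=Redis(), is_async=True)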

Require DCO sign-off for new commits

This issue is to activate protobot/dco (or similar bot) to check that all commits have a sign-off in this repository.

The CHAOSS Project Charter section 8.2.1 requires that all contributions are signed-off. The CHAOSS project has been piloting the use of DCO sign-offs. Once contributors know how to do it, sign-offs are easy to do with little overhead.

For users of the git command line interface, a sign-off is accomplished with the -s flag as part of the commit command: git commit -s -m 'This is a commit message'

For users of the GitHub interface, a sign-off is accomplished by writing Signed-off-by: Your Name <[email protected]> into the commit comment field. This can be automated by using a browser plugin like scottrigby/dco-gh-ui

To-Do for repo maintainers: Please inform your contributors about DCO sign-offs and comment on this issue when you are ready for the DCO bot to be activated on this repository.

Status page

In order to monitor the Arthur server and workers, it would be good to have a status report in the REST API informing about the list of workers and their status, the list of queues, the number of tasks and jobs queued, and the status of other services such as Redis.

pipermail backend does not work

The pipermail Perceval backend does not work with arthur.

I start arthurd and a worker:

(acs@dellx) ~ $ arthurd -g -d redis://localhost/8 --log-path /tmp/arthurd --no-cache && tail -f /tmp/arthurd/arthur.log

(acs@dellx) ~ $ arthurw -g -d redis://localhost/8

and send the task:

(acs@dellx) ~ $ curl -XPOST -H "Content-Type: application/json" http://localhost:8080/add -d'
{
 "tasks": [
  {
   "task_id": "http://lists.wikimedia.org/pipermail/analytics",
   "backend_args": {
    "tag": "http://lists.wikimedia.org/pipermail/analytics",
    "uri": "http://lists.wikimedia.org/pipermail/analytics"
   },
   "cache": {
    "cache": true,
    "cache_path": null,
    "fetch_from_cache": false
   },
   "backend": "pipermail",
   "scheduler": {
    "delay": 60
   }
  }
 ]
}'
Tasks added

and in the worker logs:

[2017-12-18 06:21:58,375 - rq.worker - ERROR] - arthur.errors.NotFoundError: pipermail not found
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/kingarthur-0.1.1-py3.5.egg/arthur/jobs.py", line 112, in __init__
    self._bklass = perceval.find_backends(perceval.backends)[0][backend]
KeyError: 'pipermail'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rq/worker.py", line 700, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 500, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.5/dist-packages/kingarthur-0.1.1-py3.5.egg/arthur/jobs.py", line 299, in execute_perceval_job
    rq_job.connection, qitems)
  File "/usr/local/lib/python3.5/dist-packages/kingarthur-0.1.1-py3.5.egg/arthur/jobs.py", line 114, in __init__
    raise NotFoundError(element=backend)
arthur.errors.NotFoundError: pipermail not found
[2017-12-18 06:21:58,375 - rq.worker - DEBUG] - Invoking exception handler <bound method Worker.move_to_failed_queue of <arthur.worker.ArthurWorker object at 0x7fbf3c447e10>>
[2017-12-18 06:21:58,375 - rq.worker - WARNING] - Moving job to 'failed' queue

Extract handlers from the scheduler

The handlers that process JobEvent.COMPLETED and JobEvent.FAILURE (the methods _handle_successful_job and _handle_failed_job) are tightly coupled to the Scheduler. As a result, it's hard to test whether events are really handled, because these are private methods and they run inside a listener.

The goal is to extract these methods and convert them into callables.

command to output larger data-set

Hi Team,

Thanks for constantly showing support.
Here I am with yet another issue: I was able to start and follow all the steps up to generating a task.
I need some help understanding how to check all the records generated by a task in Elasticsearch.
I am trying to fetch Node.js repository data for further analysis. I could create tasks.json and see everything, but I am not sure how to see the entire loaded data set.

Regards

Allow to run again a failed task

When a task fails, it is impossible to reload it. So far, the only way to do that is to delete the task and add it again. This is a problem, since related information, like the job history, will be lost.

Arthur should provide a method to re-run these tasks, probably allowing their configuration parameters to be updated.

Task job history

Tasks should keep track of the series of jobs that were run. Tasks should keep the job identifiers and whether each job finished successfully or with an error.

The REST API should return this list of jobs.

rq worker has a wrong from_date

Hello, I found a problem in Arthur.
I configured Mordred to fetch only JIRA.

When executing arthurw, it sometimes uses a wrong 'from_date'.
The wrong value is not produced by arthurd; arthurd didn't create a job with a wrong 'from_date'.
But arthurw often ends up with a previous 'from_date' value.

So Perceval fetches from the previous 'from_date', which causes duplicated fetches.

e.g. 1st update job: 'from_date': datetime.datetime(2018, 3, 20, 4, 43, 56, tzinfo=tzlocal()),
...
3rd update job: 'from_date': datetime.datetime(2018, 3, 7, 4, 45, 39, tzinfo=tzlocal()),
Back to the previous date!

Please check it out.
Thanks.

Task status

With the current design, there's no way to know the status of a task. The system should update the status of each task during its life cycle.

Tasks are composed of recurring jobs. Only one job is running at a time. When a job finishes successfully, the scheduler adds a new job to the queue for that task; when a job finishes with an error, the scheduler cancels recurring jobs for that task.

Therefore, tasks go through several stages. These stages might be defined as:

  • NEW: when the task is added to the system
  • SCHEDULED: when a task job was scheduled to run
  • ENQUEUED: when a task job is in a queue waiting for its turn to run
  • RUNNING: when a task job is running in a worker
  • COMPLETED: when the last task job finished successfully
  • FAILED: when the last task job failed during its execution

The goal of this issue is to define a set of statuses so that tasks can be updated accordingly over their life cycle.
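
A minimal sketch of such a status set as a simple enum; the names come from the list above, and nothing here is existing arthur code:

    import enum

    class TaskStatus(enum.Enum):
        NEW = 'new'              # task added to the system
        SCHEDULED = 'scheduled'  # a job for the task was scheduled to run
        ENQUEUED = 'enqueued'    # a job is waiting in a queue for its turn
        RUNNING = 'running'      # a job is running in a worker
        COMPLETED = 'completed'  # the last job finished successfully
        FAILED = 'failed'        # the last job failed during its execution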

Expose jobs to the REST API

In order to add the job logs feature (#72) to Raistlin, we have to expose the jobs through a REST request to the CherryPy server.

The request to the server must return all the necessary information about the requested job.

Reschedule tasks a limited number of times

When a task is scheduled, it will run over and over again until either it fails or it is cancelled. Sometimes it is useful to run a task only once or a few times. This might also be useful for testing purposes.

README.md

Could we have a README.md stating what is being developed in this repo, how to install it, and how to run it (at least the basics)? I think the help message in bin/arthur could be a good starting point...

Support Perceval's archive mode

Recently, Perceval added a new feature that replaces the old cache mode. To support it in King Arthur, I suggest the following:

  • Remove the Cache completely (it is not supported anymore by Perceval).
  • Use the fetch and fetch_from_archive functions defined in the backend module. Some refactoring will be needed, because non-primitive objects cannot be sent to workers (i.e. ArchiveManager).
  • Create a new dedicated queue named archive where this kind of job will be pushed. Once these jobs end, they will be rescheduled to the update queue; a rough sketch follows this list.
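
A rough sketch of that routing, with hypothetical names; only the queue names come from this issue:

    Q_ARCHIVE = 'archive'  # jobs fetching items from a Perceval archive
    Q_UPDATE = 'update'    # recurring jobs fetching fresh data

    def enqueue_job(scheduler, task):
        # hypothetical flag telling whether this task reads from an archive
        queue = Q_ARCHIVE if task.fetch_from_archive else Q_UPDATE
        scheduler.enqueue(queue, task)

    def on_archive_job_finished(scheduler, task):
        # once the archive job ends, move the task to the update cycle
        task.fetch_from_archive = False
        scheduler.enqueue(Q_UPDATE, task)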

Define better log messages

Most of the messages defined in the code are not really useful for understanding what Arthur is doing and which tasks are running. We need better messages, and to define the right log level for each message.

Does Arthur enrich the data?

Does Arthur only fetch the raw data? Is there a way to enrich data in multiple processes?

Also, can I start several workers on one machine?

Wrong date conversion

[2018-05-09 11:28:48,289] - JobListener instence crashed. Error: None is not a valid date
[2018-05-09 11:28:48,290] - Traceback (most recent call last):
  File "/arthur/src/grimoirelab-toolkit/grimoirelab/toolkit/datetime.py", line 165, in unixtime_to_datetime
    dt = datetime.datetime.utcfromtimestamp(ut)
TypeError: a float is required

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/arthur/.local/lib/python3.5/site-packages/kingarthur-0.1.6-py3.5.egg/arthur/scheduler.py", line 211, in run
    self.listen()
  File "/arthur/.local/lib/python3.5/site-packages/kingarthur-0.1.6-py3.5.egg/arthur/scheduler.py", line 245, in listen
    handler(job)
  File "/arthur/.local/lib/python3.5/site-packages/kingarthur-0.1.6-py3.5.egg/arthur/scheduler.py", line 339, in _handle_successful_job
    from_date = unixtime_to_datetime(result.max_date)
  File "/arthur/src/grimoirelab-toolkit/grimoirelab/toolkit/datetime.py", line 169, in unixtime_to_datetime
    raise InvalidDateError(date=str(ut))
grimoirelab.toolkit.datetime.InvalidDateError: None is not a valid date

Human readable job ids

Job ids are generated using a hash function. This is fine for generating unique identifiers, but they are not human readable.

The goal of this task is to generate ids that can be printed and read by humans.

User-defined job queues

So far, Arthur only allows two kinds of queues: create and update. In some cases, such as reducing congestion or having priority lanes, it might be useful for admins or users to define their own queues when tasks are scheduled.

Take into account that this can also create problems, because workers need to be spawned manually.
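
A hypothetical example of what this could look like when adding a task, with an extra queue field in the payload; the field is an assumption, and the rest of the shape follows the /add example elsewhere on this page:

    import requests

    task = {
        "tasks": [{
            "task_id": "high-priority-repo",
            "backend": "git",
            "backend_args": {"uri": "https://example.com/repo.git"},
            "queue": "priority"  # hypothetical user-defined queue name
        }]
    }
    requests.post("http://localhost:8080/add", json=task)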

Documentation: Configuration file?

Hi
I see that we can run the arthur daemon with a configuration file, but I can't find an example (or documentation) of it.
Does it exist?

Use Perceval summary

Since Perceval incorporates a Summary object with the result of a fetching process (see chaoss/grimoirelab-perceval/issues/529), Arthur doesn't need to generate this summary on its own.

The JobResult and PercevalJob classes should be rewritten to carry the information that comes from Perceval.
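
A minimal sketch of what the rewritten JobResult could look like, keeping Perceval's Summary object instead of rebuilding it; the class shape here is an assumption:

    class JobResult:
        """Result of a Perceval job (hypothetical rewrite)."""

        def __init__(self, job_id, task_id, backend, category, summary=None):
            self.job_id = job_id
            self.task_id = task_id
            self.backend = backend
            self.category = category
            # the Summary produced by Perceval's fetch process, or None
            self.summary = summary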

Update dependencies

Arthur relies on old dependencies (see below). Since we are putting effort into it, it would be good to update them to reduce the gap with the latest versions available.

redis==3.0.0 ---> latest 3.3.11
rq==1.0.0 ---> latest 1.3.1
cheroot==5.8.3 ---> latest 8.2.1
cherrypy>=8.1, <=11.0.0 ---> latest 18.3.0

Persistent tasks

As the number of repositories to analyze in Arthur increases, it starts to make sense to store the tasks in a persistent place. In my opinion, tasks should be stored in Redis, taking as an example the way RQ stores its Job objects.

After implementing this, and as future work, Arthur will be able to read the tasks back from the storage system once the service is restarted.
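
A hypothetical sketch of that storage, serializing each task as JSON under a per-task key; the key layout and the to_dict() serializer are assumptions:

    import json

    TASK_KEY = 'arthur:tasks:%s'

    def save_task(conn, task):
        # conn is a redis.Redis connection
        conn.set(TASK_KEY % task.task_id, json.dumps(task.to_dict()))

    def load_tasks(conn):
        # restore all persisted tasks after a service restart
        for key in conn.scan_iter(match='arthur:tasks:*'):
            yield json.loads(conn.get(key))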

Error creating an index in ElasticSearch

I followed the instructions to run arthur (latest commit checked out from this repo), but I get an error:

$ arthurd -g -d redis://localhost/8 --es-index http://localhost:9200/items --log-path /tmp/logs --no-cache
Traceback (most recent call last):
  File "/tmp/test2/bin/arthurd", line 225, in <module>
    main()
  File "/tmp/test2/bin/arthurd", line 88, in main
    writer = ElasticItemsWriter(args.es_index)
  File "/tmp/test2/lib/python3.5/site-packages/arthur/writers.py", line 71, in __init__
    was_created = self.create_index(self.idx_url, clean=clean)
  File "/tmp/test2/lib/python3.5/site-packages/arthur/writers.py", line 138, in create_index
    raise ElasticSearchError(cause=cause)
arthur.writers.ElasticSearchError: Error creating Elastic Search index http://localhost:9200/items

However, I can create that index manually, so it seems there is no issue with ElasticSearch itself.

Just in case it matters, it is ElasticSearch 6.0 (pre-release), and it is working nicely with other applications.

The log file reads:

[2017-07-19 23:52:31,500 - root - INFO] - King Arthur is on command.
[2017-07-19 23:52:31,501 - root - DEBUG] - Redis connection stablished with redis://localhost/8.
[2017-07-19 23:52:31,505 - urllib3.connectionpool - DEBUG] - Starting new HTTP connection (1): localhost
[2017-07-19 23:52:31,520 - urllib3.connectionpool - DEBUG] - http://localhost:9200 "GET /items HTTP/1.1" 404 None
[2017-07-19 23:52:31,522 - urllib3.connectionpool - DEBUG] - Starting new HTTP connection (1): localhost
[2017-07-19 23:52:31,528 - urllib3.connectionpool - DEBUG] - http://localhost:9200 "POST /items HTTP/1.1" 400 None
[2017-07-19 23:52:31,528 - arthur.writers - INFO] - Can't create index http://localhost:9200/items (400)

Curiously enough, in the ElasticSearch logs I see no access which could be attributed to arthur.

I tried with ElasticSearch 5.1.x as well, with the same results.

Any idea?

Just in case it matters, here is the output of pip freeze (I know there are many packages that are not needed, but I'm using this venv for some other stuff):

arthur==0.1.0.dev1
beautifulsoup4==4.6.0
certifi==2017.4.17
chardet==3.0.4
cheroot==5.7.0
CherryPy==11.0.0
click==6.7
feedparser==5.2.1
grimoire-elk==0.30.4
grimoire-kidash==0.30.4
grimoirelab-toolkit==0.1.0
idna==2.5
Jinja2==2.9.6
MarkupSafe==1.0
numpy==1.13.1
pandas==0.20.3
perceval==0.9.0
perceval-mozilla==0.1.1
perceval-opnfv==0.1.0
pkg-resources==0.0.0
portend==2.1.2
PyMySQL==0.7.11
python-dateutil==2.6.1
pytz==2017.2
redis==2.10.5
requests==2.18.1
rq==0.8.0
six==1.10.0
sortinghat==0.4.0
SQLAlchemy==1.1.11
tempora==1.8
urllib3==1.21.1

Refactor job resuming

Job resuming is integrated inside execute_perceval_job. This is an antipattern: the logic to resume tasks/jobs should live in the scheduler. We decided to include this feature inside that function because it stores all the information needed to resume a task, but since Perceval now returns a summary of the last execution, this should not be a problem anymore.

pyflakes errors: variables assigned to but never used

Running pyflakes on version 7c0ec65, I got the following warnings:

./arthur/arthur.py:98: local variable 'e' is assigned to but never used
./tests/test_arthur.py:291: local variable 'ex' is assigned to but never used
./tests/test_arthur.py:305: local variable 'ex' is assigned to but never used

Closing condition:

  • pyflakes does not return these errors on the master branch

Redefine default queues

Arthur creates three default queues, which describe the different types of services:

  • create: stores jobs that retrieve data for the first time; the rationale is that these processes are usually longer than others and consume more resources.
  • update: keeps recurring jobs, i.e. jobs that run after the first time.
  • archive: stores jobs that retrieve items from a Perceval archive.

In my opinion, only archive really describes what its queue stores.

The goal of this task is to propose a better schema, or to keep the current one but rename the queues to make clear what they do.

[cherrypy] Endpoint not reachable using containers

CherryPy binds to localhost by default, which is not publicly available and not reachable from outside the container.

By modifying the arthurd source file, I was able to change this behavior, adding the desired configuration just before the call to quickstart():

cherrypy.config.update({'server.socket_host': 'DOCKER_CONTAINER_IP'})

So I guess we should analyze how we can fix this.
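
One possible direction, as a sketch rather than a decided fix: bind to all interfaces, or make the bind address configurable. cherrypy.config.update is CherryPy's standard API; the chosen values are assumptions.

    import cherrypy

    # Bind on all interfaces so the endpoint is reachable from outside
    # the container; the port value here is an assumption.
    cherrypy.config.update({'server.socket_host': '0.0.0.0',
                            'server.socket_port': 8080})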

Remove `arthur` script

The arthur script was developed for testing purposes. It is not used frequently, though, which means it's outdated. I think it makes more sense to remove it from the product.

Error, likely only with Python 3.7

When running KingArthur in a Python 3.7 environment, I get:

$ arthurd --help
Traceback (most recent call last):
  File "/tmp/install/bin/arthurd", line 35, in <module>
    from arthur.server import ArthurServer
  File "/tmp/install/lib/python3.7/site-packages/arthur/server.py", line 32, in <module>
    from .arthur import Arthur
  File "/tmp/install/lib/python3.7/site-packages/arthur/arthur.py", line 33, in <module>
    from .scheduler import Scheduler
  File "/tmp/install/lib/python3.7/site-packages/arthur/scheduler.py", line 90
    async=self.async_mode)  # noqa: W606
        ^
SyntaxError: invalid syntax

Maybe this is due to some problem specific to Python 3.7?

Find a better schema for job ids

Job ids are generated with the prefix arthur-<task_id>- followed by a hash. This schema is not clear at all, and it doesn't improve readability.

The goal of this task is to define a better schema for ids, if that's necessary.
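
One hypothetical alternative: pair the task id with a monotonically increasing job number instead of a hash suffix (purely illustrative):

    def generate_job_id(task_id, job_number):
        # e.g. 'arthur-mytask-42' instead of 'arthur-mytask-1f3a9c...'
        return 'arthur-%s-%d' % (task_id, job_number)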

Tasks are not re-triggered

Is it the role of arthur to update tasks on a regular basis, like hourly?
When used with Mordred, I noticed that I have to flush the Redis database and restart arthur/mordred if I want to have new jobs in Redis.

Refactor JobScheduler into a TaskScheduler

The JobScheduler should be a TaskScheduler, because jobs are tied to RQ queues. In this way, the concept of a job will be limited to the moment when a job is created and enqueued. Jobs can also be considered the different executions a task had.

This will make it easier to understand how Arthur works and, of course, to test it.
