chaoss / grimoirelab-kingarthur
King Arthur commands his loyal knight Perceval on the quest to retrieve data from software repositories.
License: GNU General Public License v3.0
The job events listener is coupled inside the scheduler. Its function is to listen for new events and to handle them. The current handlers only reschedule the jobs, and they are provided when the listener is created.
The problem with this structure is that, to reschedule the jobs, the scheduler has to be passed to the handlers, which creates a scheduler-listener-scheduler cycle. This is bad design.
This task aims to refactor the code to decouple these objects. It will improve the logic, make it easier to monitor which jobs are doing what, and make the code easier to test.
We should find a way to store the log messages produced by each job. These messages contain valuable information to keep track of the status of a job and to debug any possible error. So far these messages are only printed by the workers.
The easiest way to add this feature is to store the logs in log files, but that would make it difficult to access them through the REST API that arthurd provides. The best option is probably to store them within the jobs in RQ and Redis.
When the server is running in the foreground, log messages are written to a log file instead of to standard output or standard error. The default behaviour should be to write to stderr first.
Hi,
After one year working with different Big Data platforms, I have reached the conclusion that task scheduling should be done outside arthur. arthur should be a task repository and also the engine for executing the tasks: tracking their execution, stopping it, and keeping stats about past executions. The real execution is done by the data processing engine; in our case, RQ and the Python tasks for collecting the data.
Task scheduling is something entirely different from task execution. In Unix systems, for example, scheduling is done with tools like cron. In our case, for the GrimoireLab platform, there is a Python-based tool called Airflow (https://airflow.apache.org/), which is the base for the Google Composer service, a sign of its maturity.
Airflow could ask arthur to execute the jobs according to different schedules. The scheduling can be defined in Python code, and it has a great web interface to visualize complex schedules (I am not sure, but you can probably also visualize the execution of the tasks).
My proposal is to use Airflow in the GrimoireLab platform for task scheduling, and to simplify arthur by removing the logic related to scheduling.
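Under this proposal, an external scheduler would only need to submit tasks through arthur's existing REST API. A rough sketch (the `/add` endpoint, server URL, and payload fields mirror the example shown elsewhere in this tracker; the `git` backend and its arguments here are illustrative):

```python
import json
import urllib.request


def submit_task(server_url, task):
    """POST a single task to arthur's /add endpoint and return its reply."""
    body = json.dumps({"tasks": [task]}).encode("utf-8")
    req = urllib.request.Request(server_url + "/add", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


# Example payload, mirroring the JSON used in other issues in this tracker
task = {
    "task_id": "git-commits",
    "backend": "git",
    "backend_args": {"uri": "https://github.com/chaoss/grimoirelab-kingarthur.git"},
    "scheduler": {"delay": 60},
}
# An Airflow operator (or any cron-like tool) would then call:
# submit_task("http://localhost:8080", task)
```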
I'm trying King Arthur out, and have run into a couple of instances where the job result is None, or result.summary is None, but the code tries to get fields from it and crashes. I put some quick if result.summary: clauses in my local version to plug the hole, but I wonder if that's just masking some deeper issue: an assumed guarantee about when result.summary is set that is not being respected.
I've seen this happen with at least this snippet from scheduler.py:
logger.error("Job #%s (task: %s) failed but will be resumed",
             job_id, task_id)
if result.summary.fetched > 0:
    task.backend_args['next_from_date'] = result.summary.max_updated_on
if result.summary.max_offset:
    task.backend_args['next_offset'] = result.summary.max_offset
which raises AttributeError: 'NoneType' object has no attribute 'summary', and this one in jobs.py:
logger.debug("Job #%s (task: %s) completed (%s) - %s/%s items (%s) fetched",
             result.job_id, task_id, result.backend,
             str(result.summary.fetched), str(result.summary.skipped),
             result.category)
which raises AttributeError: 'NoneType' object has no attribute 'fetched'.
Tasks are not updated to RUNNING state while their jobs are running in a worker. To do this, the job needs to send an event when it starts running in the worker.
Once #98 is added, the next step is to add to the CherryPy server the functionality of returning the job_number in the requests for the list of tasks and for a task by its id:
This task is directly related to Bitergia/raistlin#16
Python 3.7 makes async a reserved word: https://docs.python.org/3/reference/compound_stmts.html#async
This causes errors where it is used as a parameter name with rq:
https://github.com/chaoss/grimoirelab-kingarthur/blob/master/arthur/scheduler.py#L90
https://github.com/chaoss/grimoirelab-kingarthur/blob/master/tests/test_jobs.py#L544
RQ version 0.12 changes the parameter name to is_async: CHANGES.md
On travis-ci using
dist: xenial
python:
- "3.7"
will show the error: build
This issue is to activate protobot/dco (or similar bot) to check that all commits have a sign-off in this repository.
The CHAOSS Project Charter section 8.2.1 requires that all contributions are signed-off. The CHAOSS project has been piloting the use of DCO sign-offs. Once contributors know how to do it, sign-offs are easy to do with little overhead.
For users of the git command line interface, a sign-off is accomplished with the -s flag as part of the commit command: git commit -s -m 'This is a commit message'
For users of the GitHub interface, a sign-off is accomplished by writing Signed-off-by: Your Name <[email protected]> into the commit comment field. This can be automated by using a browser plugin like scottrigby/dco-gh-ui
To-Do for repo maintainers: Please inform your contributors about DCO sign-offs and comment on this issue when you are ready for the DCO bot to be activated on this repository.
To monitor the Arthur server and workers, it would be good to have a status report in the REST API informing about the list of workers and their status, the list of queues, the number of tasks and jobs queued, and the status of other services such as Redis.
The pipermail perceval backend does not work with arthur.
I start arthurd and a worker:
(acs@dellx) ~ $ arthurd -g -d redis://localhost/8 --log-path /tmp/arthurd --no-cache && tail -f /tmp/arthurd/arthur.log
(acs@dellx) ~ $ arthurw -g -d redis://localhost/8
and send the task:
(acs@dellx) ~ $ curl -XPOST -H "Content-Type: application/json" http://localhost:8080/add -d'
{
"tasks": [
{
"task_id": "http://lists.wikimedia.org/pipermail/analytics",
"backend_args": {
"tag": "http://lists.wikimedia.org/pipermail/analytics",
"uri": "http://lists.wikimedia.org/pipermail/analytics"
},
"cache": {
"cache": true,
"cache_path": null,
"fetch_from_cache": false
},
"backend": "pipermail",
"scheduler": {
"delay": 60
}
}
]
}'
Tasks added
and in the worker logs:
[2017-12-18 06:21:58,375 - rq.worker - ERROR] - arthur.errors.NotFoundError: pipermail not found
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/kingarthur-0.1.1-py3.5.egg/arthur/jobs.py", line 112, in __init__
self._bklass = perceval.find_backends(perceval.backends)[0][backend]
KeyError: 'pipermail'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/rq/worker.py", line 700, in perform_job
rv = job.perform()
File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 500, in perform
self._result = self.func(*self.args, **self.kwargs)
File "/usr/local/lib/python3.5/dist-packages/kingarthur-0.1.1-py3.5.egg/arthur/jobs.py", line 299, in execute_perceval_job
rq_job.connection, qitems)
File "/usr/local/lib/python3.5/dist-packages/kingarthur-0.1.1-py3.5.egg/arthur/jobs.py", line 114, in __init__
raise NotFoundError(element=backend)
arthur.errors.NotFoundError: pipermail not found
[2017-12-18 06:21:58,375 - rq.worker - DEBUG] - Invoking exception handler <bound method Worker.move_to_failed_queue of <arthur.worker.ArthurWorker object at 0x7fbf3c447e10>>
[2017-12-18 06:21:58,375 - rq.worker - WARNING] - Moving job to 'failed' queue
The handlers that process JobEvent.COMPLETED and JobEvent.FAILURE (methods _handle_successful_job and _handle_failed_job) are tightly coupled to the Scheduler. Therefore, it's hard to test whether events are really handled, because these are private methods and they run in a listener.
The goal is to extract these methods and convert them into callables.
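One possible shape, sketched with hypothetical names (the real listener/scheduler classes may differ): handlers become plain callables registered on the listener, so they can be unit-tested without a running Scheduler.

```python
class JobEventType:
    """Hypothetical event type constants, mirroring JobEvent.* in the issue."""
    COMPLETED = "completed"
    FAILURE = "failure"


class JobEventsListener:
    """Dispatches events to registered callables; knows nothing about schedulers."""
    def __init__(self):
        self._handlers = {}

    def subscribe(self, event_type, handler):
        """Register any callable for an event type."""
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, event_type, payload):
        for handler in self._handlers.get(event_type, []):
            handler(payload)


# Handlers are now free functions / callables, trivially testable in isolation
completed = []
listener = JobEventsListener()
listener.subscribe(JobEventType.COMPLETED, completed.append)
listener.dispatch(JobEventType.COMPLETED, {"job_id": "1"})
```

The scheduler would register its own callables at start-up, breaking the scheduler-listener-scheduler cycle.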
Hi Team,
Thanks for constantly showing support.
Here I am with yet another issue: I was able to follow all the steps up to generating a task.
I need some help understanding how to check all the records generated by a task in Elasticsearch.
I am trying to fetch Node.js repository data for further analysis. I could create tasks.json and see everything, but I am not sure how to see the entire data set that was loaded.
Regards
When a task fails, it is impossible to reload it. So far, the only way to do that is to delete the task and add it again. This is a problem because related information, like the jobs history, will be lost.
Arthur should provide a method to re-run these tasks, probably updating their configuration parameters.
Tasks should keep track of the series of jobs that were run. Tasks should keep the job identifiers and whether the jobs finished successfully or with an error.
The REST API should return this list of jobs.
Hello, I found a problem in Arthur.
I configured Mordred to fetch only JIRA.
When executing arthurw, it sometimes has a wrong 'from_date'. The wrong value is not produced by arthurd; that is, arthurd didn't create a job with a wrong 'from_date'. But arthurw often has a previous 'from_date' value, so Perceval fetches from that earlier date again, causing duplicated fetches.
e.g. 1st update job: 'from_date': datetime.datetime(2018, 3, 20, 4, 43, 56, tzinfo=tzlocal()),
...
3rd update job: 'from_date': datetime.datetime(2018, 3, 7, 4, 45, 39, tzinfo=tzlocal()),
Back to a previous date!
Please check it out.
Thanks.
With the current design, there's no way to know the status of a task. The system should update the status of each task during their life cycle.
Tasks are composed of recurring jobs. Only one job is running at a time. When a job finishes successfully, the scheduler adds a new job to the queue for that task; when a job finishes with an error, the scheduler cancels recurring jobs for that task.
Therefore, tasks go through several stages. These stages might be defined as:
The goal of this issue is to define a set of statuses, and for tasks to be updated accordingly over their life cycle.
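The stages described above could be captured with something as small as an enum plus the allowed transitions (the names and transitions below are a proposal, not the project's):

```python
import enum


class TaskStatus(enum.Enum):
    NEW = "new"              # task registered, no job enqueued yet
    SCHEDULED = "scheduled"  # a job for the task is waiting in a queue
    RUNNING = "running"      # a worker is executing one of its jobs
    COMPLETED = "completed"  # the last job finished successfully
    FAILED = "failed"        # the last job failed; recurring jobs cancelled


# Allowed transitions, following the life cycle described above:
# one job at a time; success re-enqueues, failure cancels recurring jobs.
TRANSITIONS = {
    TaskStatus.NEW: {TaskStatus.SCHEDULED},
    TaskStatus.SCHEDULED: {TaskStatus.RUNNING},
    TaskStatus.RUNNING: {TaskStatus.COMPLETED, TaskStatus.FAILED},
    TaskStatus.COMPLETED: {TaskStatus.SCHEDULED},  # recurring task re-enqueued
    TaskStatus.FAILED: set(),
}


def can_move(src, dst):
    """Check whether a status transition is legal under this proposal."""
    return dst in TRANSITIONS[src]
```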
To avoid errors when returning a job that is "queued", we need to add the job_number
field in the metadata of the Job, before adding the job to the queue, here:
https://github.com/chaoss/grimoirelab-kingarthur/blob/master/arthur/scheduler.py#L185
In order to add the feature of job logs (#72) to Raistlin, we have to expose the jobs through a REST request to the CherryPy server.
The request to the server must return all the needed information about the requested job.
For producing a pip package that can be later used as a dependency for mordred, I would need a stable version.
When a task is scheduled, it will run over and over again until either it fails or it is cancelled. Sometimes it is useful to run a task only once or few times. This might be used for testing purposes too.
Could we have a README.md stating what is being developed in this repo, how to install it, and how to run it? (at least the basics). I find the help message in bin/arthur could be a good starting point...
When a task fails inside the create queue, it is re-scheduled into the update queue. These tasks should be re-scheduled into create instead.
Recently, Perceval has added a new feature that replaces the old cache mode. To support this in King Arthur, I suggest the following:
- Remove Cache (it is not supported anymore by Perceval).
- Use the fetch and fetch_from_archive functions defined in the backend module. Some refactoring will be needed, because non-primitive objects cannot be sent to workers (i.e. ArchiveManager).
- Add a new queue archive where these kinds of jobs will be pushed. Once these jobs end, they will be rescheduled to the update queue.

Most of the messages defined inside the code are not really useful to understand what Arthur is doing and what tasks it is running. We need better messages, and to define the right threshold levels for each message.
Does Arthur only fetch the raw data? Is there a way to enrich data with multiple processes?
Also, can I start different workers on one machine?
[2018-05-09 11:28:48,289] - JobListener instence crashed. Error: None is not a valid date
[2018-05-09 11:28:48,290] - Traceback (most recent call last):
File "/arthur/src/grimoirelab-toolkit/grimoirelab/toolkit/datetime.py", line 165, in unixtime_to_datetime
dt = datetime.datetime.utcfromtimestamp(ut)
TypeError: a float is required
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/arthur/.local/lib/python3.5/site-packages/kingarthur-0.1.6-py3.5.egg/arthur/scheduler.py", line 211, in run
self.listen()
File "/arthur/.local/lib/python3.5/site-packages/kingarthur-0.1.6-py3.5.egg/arthur/scheduler.py", line 245, in listen
handler(job)
File "/arthur/.local/lib/python3.5/site-packages/kingarthur-0.1.6-py3.5.egg/arthur/scheduler.py", line 339, in _handle_successful_job
from_date = unixtime_to_datetime(result.max_date)
File "/arthur/src/grimoirelab-toolkit/grimoirelab/toolkit/datetime.py", line 169, in unixtime_to_datetime
raise InvalidDateError(date=str(ut))
grimoirelab.toolkit.datetime.InvalidDateError: None is not a valid date
Job ids are generated using a hash function. This is fine for generating unique identifiers, but they are not human-readable.
The goal of this task is to generate ids that can be printed and read by humans.
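One common approach is to combine the task id with a monotonically increasing counter, keeping ids unique per task while staying readable (a sketch, not the project's scheme; id format and class name are assumptions):

```python
import itertools


class JobIdGenerator:
    """Produce readable, unique job ids like 'git-commits-3'."""
    def __init__(self):
        self._counters = {}

    def next_id(self, task_id):
        # One counter per task; ids read as '<task_id>-<run number>'
        counter = self._counters.setdefault(task_id, itertools.count(1))
        return "%s-%d" % (task_id, next(counter))


gen = JobIdGenerator()
```

Such ids also double as a run history: the suffix tells you how many jobs the task has executed.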
So far, Arthur only allows two kinds of queues: create and update. In some cases, such as reducing congestion or having priority lines, it might be useful for admins or users to define their own queues when tasks are scheduled.
Take into account that this can also create problems, because workers need to be spawned manually.
Hi
I see that we can run an arthur daemon with a configuration file, but I can't find an example (or documentation) of it.
Does it exist?
In setup.py, only the perceval dependency is defined, but arthur also needs:
Since Perceval incorporates a summary object with the result of a fetching process (see chaoss/grimoirelab-perceval/issues/529), Arthur doesn't need to generate this summary on its own.
The JobResult and PercevalJob classes should be rewritten to add the information that comes from Perceval.
Arthur relies on old dependencies (see below). Since we are putting effort into it, it would be good to update them to reduce the gap with the latest versions available.
redis==3.0.0 ---> latest 3.3.11
rq==1.0.0 ---> latest 1.3.1
cheroot==5.8.3 ---> latest 8.2.1
cherrypy>=8.1, <=11.0.0 ---> latest 18.3.0
As the number of repositories to analyze increases, it starts to make sense to store Arthur's tasks in a persistent place. In my opinion, tasks should be stored in Redis, following the way rq stores its Job objects.
After implementing this, and as future work, Arthur will be able to read the tasks from the storage system once the service is restarted.
I followed instructions to run arthur (latest commit checkout in this repo), but I get an error:
$ arthurd -g -d redis://localhost/8 --es-index http://localhost:9200/items --log-path /tmp/logs --no-cache
Traceback (most recent call last):
File "/tmp/test2/bin/arthurd", line 225, in <module>
main()
File "/tmp/test2/bin/arthurd", line 88, in main
writer = ElasticItemsWriter(args.es_index)
File "/tmp/test2/lib/python3.5/site-packages/arthur/writers.py", line 71, in __init__
was_created = self.create_index(self.idx_url, clean=clean)
File "/tmp/test2/lib/python3.5/site-packages/arthur/writers.py", line 138, in create_index
raise ElasticSearchError(cause=cause)
arthur.writers.ElasticSearchError: Error creating Elastic Search index http://localhost:9200/items
However, I can create that index manually, so it seems there is no issue with ElasticSearch.
Just in case it matters, it is ElasticSearch 6.0 (pre-release), and it is working nicely with other applications.
The log file reads:
[2017-07-19 23:52:31,500 - root - INFO] - King Arthur is on command.
[2017-07-19 23:52:31,501 - root - DEBUG] - Redis connection stablished with redis://localhost/8.
[2017-07-19 23:52:31,505 - urllib3.connectionpool - DEBUG] - Starting new HTTP connection (1): localhost
[2017-07-19 23:52:31,520 - urllib3.connectionpool - DEBUG] - http://localhost:9200 "GET /items HTTP/1.1" 404 None
[2017-07-19 23:52:31,522 - urllib3.connectionpool - DEBUG] - Starting new HTTP connection (1): localhost
[2017-07-19 23:52:31,528 - urllib3.connectionpool - DEBUG] - http://localhost:9200 "POST /items HTTP/1.1" 400 None
[2017-07-19 23:52:31,528 - arthur.writers - INFO] - Can't create index http://localhost:9200/items (400)
Curiously enough, in the ElasticSearch logs I see no access which could be attributed to arthur.
I tried with ElasticSearch 5.1.x as well, with the same results.
Any idea?
Just in case it matters, the result of pip freeze (I know there are many packages that are not needed, but I'm using the venv for some other stuff):
arthur==0.1.0.dev1
beautifulsoup4==4.6.0
certifi==2017.4.17
chardet==3.0.4
cheroot==5.7.0
CherryPy==11.0.0
click==6.7
feedparser==5.2.1
grimoire-elk==0.30.4
grimoire-kidash==0.30.4
grimoirelab-toolkit==0.1.0
idna==2.5
Jinja2==2.9.6
MarkupSafe==1.0
numpy==1.13.1
pandas==0.20.3
perceval==0.9.0
perceval-mozilla==0.1.1
perceval-opnfv==0.1.0
pkg-resources==0.0.0
portend==2.1.2
PyMySQL==0.7.11
python-dateutil==2.6.1
pytz==2017.2
redis==2.10.5
requests==2.18.1
rq==0.8.0
six==1.10.0
sortinghat==0.4.0
SQLAlchemy==1.1.11
tempora==1.8
urllib3==1.21.1
Job resuming is integrated inside execute_perceval_job. This is an antipattern: the logic to resume tasks/jobs should be in the scheduler. We decided to include this feature inside that function because it stores all the information needed to resume a task, but since Perceval now returns a summary of the last execution, this should no longer be a problem.
Running pyflakes over the version 7c0ec65 I got the following pyflake warnings:
./arthur/arthur.py:98: local variable 'e' is assigned to but never used
./tests/test_arthur.py:291: local variable 'ex' is assigned to but never used
./tests/test_arthur.py:305: local variable 'ex' is assigned to but never used
Closing condition: the warnings are fixed in the master branch.

The default queues created by Arthur are three, and they describe the different types of services:
- create: stores jobs that retrieve data for the first time; the rationale is that these processes are usually longer than others and consume more resources.
- update: keeps recurring jobs, i.e. jobs that run after the first time.
- archive: stores jobs that retrieve items from a Perceval archive.

In my opinion, only archive describes what these queues really store.
The goal of this task is to propose a better schema, or to keep the current one but rename the queues to make clear what they do.
Cherrypy by default binds to localhost, which is not publicly available, and not reachable from outside the container.
By modifying the arthurd source file, I was able to change this behavior by adding the desired configuration just before the quickstart() call:
cherrypy.config.update( {'server.socket_host': 'DOCKER_CONTAINER_IP' } )
So I guess we should analyze how we can fix this.
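One way to fix it without hard-coding any container IP is to make the bind address a CLI option and feed it into CherryPy's global config (the `--host`/`--port` flag names are a proposal, not arthurd's current interface; `server.socket_host` and `server.socket_port` are CherryPy's real config keys):

```python
import argparse


def server_config(argv=None):
    """Build a CherryPy config dict from CLI flags (flag names are a proposal).

    Pass --host 0.0.0.0 to listen on all interfaces, e.g. inside a container.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", default="127.0.0.1",
                        help="address to bind the HTTP server to")
    parser.add_argument("--port", type=int, default=8080,
                        help="port to bind the HTTP server to")
    args = parser.parse_args(argv)
    return {"server.socket_host": args.host,
            "server.socket_port": args.port}


# Then, before cherrypy.quickstart():
#   cherrypy.config.update(server_config())
```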
The arthur script was developed for testing purposes. It is not used frequently, though, which means it's outdated. I think it makes more sense to remove it from the product.
RQ 1.0 was released recently. We should upgrade Arthur to use this release.
When running KingArthur in a Python 3.7 environment, I get:
$ arthurd --help
Traceback (most recent call last):
File "/tmp/install/bin/arthurd", line 35, in <module>
from arthur.server import ArthurServer
File "/tmp/install/lib/python3.7/site-packages/arthur/server.py", line 32, in <module>
from .arthur import Arthur
File "/tmp/install/lib/python3.7/site-packages/arthur/arthur.py", line 33, in <module>
from .scheduler import Scheduler
File "/tmp/install/lib/python3.7/site-packages/arthur/scheduler.py", line 90
async=self.async_mode) # noqa: W606
^
SyntaxError: invalid syntax
Maybe this is due to some problem specific to Python 3.7 ?
Job ids are generated with the prefix arthur-<task_id>-hash. This schema is not clear at all, and it doesn't improve readability.
The goal of this task is to define a better schema for ids, if that turns out to be necessary.
Is it the role of arthur to update tasks on a regular basis, like hourly?
When used with mordred, I noticed that I have to flush the Redis database and restart arthur/mordred if I want to have new jobs in Redis.
The JobScheduler should be a TaskScheduler, because jobs are related to RQ queues. In this way, the concept of Job will be limited to the moment when a job is created and enqueued. Jobs can also be considered the different executions a Task had.
This will make it easier to understand how Arthur works and, of course, to test it.