wikimedia / ores

🤖 A hosting service for 'revscoring' models.

Home Page: https://mediawiki.org/wiki/ORES

License: MIT License

Languages: Python 76.34%, Jupyter Notebook 16.21%, JavaScript 1.84%, HTML 4.93%, CSS 0.33%, Makefile 0.10%, Dockerfile 0.20%, Shell 0.05%

Topics: artificial-intelligence

Introduction

ORES

โš ๏ธ Warning: As of late 2023, the ORES infrastructure is being deprecated by the WMF Machine Learning team, please check https://wikitech.wikimedia.org/wiki/ORES for more info.

While the code in this repository may still work, it is unmaintained and may break at any time. Special consideration should also be given to the machine learning models, whose prediction quality may drift over time.

The replacement for ORES and associated infrastructure is Lift Wing: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing

Some Revscoring models from ORES run on the Lift Wing infrastructure, but they are otherwise unsupported (no new training or code updates).

They can be downloaded from the links documented at: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Revscoring_models_(migrated_from_ORES)

In the long term, some or all of these models may be replaced by newer models specifically tailored to run on modern ML infrastructure like Lift Wing.

If you have any questions, contact the WMF Machine Learning team: https://wikitech.wikimedia.org/wiki/Machine_Learning

A webserver for hosting scoring services. For more information, see the ORES documentation on MediaWiki.

Installation

ORES is based on Python 3. Use pip to install ORES:

pip install ores (or pip3 install ores if your distribution defaults to Python 2)

If you're running with the default Redis configuration, you'll need to install a few more optional libraries:

pip install ores[redis]

Then you can run a test server with:

ores applications.wsgi

Use the -h argument to view its usage.

ores applications.wsgi -h

Visit these pages to see if your installation works:

http://localhost:8080/
http://localhost:8080/v2/scores/testwiki/revid/641962088?features=true
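
If you'd rather check from code, the following is a minimal sketch that queries the same endpoint with the Python standard library (it assumes the test server from the previous step is listening on port 8080):

import json
import urllib.request

# Fetch a score (with feature values) from the local test server and
# pretty-print the JSON response.
url = ("http://localhost:8080/v2/scores/testwiki/revid/641962088"
       "?features=true")
with urllib.request.urlopen(url) as response:
    doc = json.loads(response.read().decode("utf-8"))
print(json.dumps(doc, indent=2))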

Running ORES using Docker Compose

As an easy way to run ORES for development, download and install docker-compose and then run:

docker-compose build && docker-compose up

ORES will be accessible at http://localhost:8080.

Running tests

For a native installation, make sure the test dependencies are installed:

pip install -r test-requirements.txt

then run:

py.test .

For a Docker installation, run:

docker-compose exec ores-worker py.test /ores

Utilities

ORES provides several utilities:

  • precached: Starts a daemon that requests scores for revisions as they happen
  • score_revisions: Scores a set of revisions using an ORES API
  • stress_test: Scores a large set of revisions at a configurable rate
  • test_api: Runs a series of tests against a live ORES API

Run any of them through the ./utility wrapper:

./utility test_api -h

For Docker installations, run it inside one of the containers:

docker-compose exec ores-worker /ores/utility test_api -h

Authors

Contributors: accraze, adamwight, arlolra, chaitanyamogal, codez266, elukey, ethgra, halfak, he7d3r, kevinbazira, ladsgroup, legoktm, mdew192837, perryprog, pix1234, revi, soumyaa1804, tgr, tklausmann, toarushiroineko, ureesoriano, yuvipanda

Issues

IndexError: list index out of range on ores-test

Today I got the following from
http://ores-test.wmflabs.org/scores/ptwiki?models=reverted&revids=41947433|41947297

{
  "41947297": {
    "reverted": {
      "error": {
        "message": "expected string or buffer",
        "type": "<class 'TypeError'>"
      }
    }
  },
  "41947433": {
    "reverted": {
      "error": {
        "message": "list index out of range",
        "type": "<class 'IndexError'>"
      }
    }
  }
}

A few moments later, the same URL returned this:

{
  "41947297": {
    "reverted": {
      "error": {
        "message": "list index out of range",
        "type": "<class 'IndexError'>"
      }
    }
  },
  "41947433": {
    "reverted": {
      "prediction": true,
      "probability": {
        "false": 0.20332099489369587,
        "true": 0.7966790051063041
      }
    }
  }
}

I tried a third time, and the result changed again, to

{
  "41947297": {
    "reverted": {
      "prediction": false,
      "probability": {
        "false": 0.8112268331164385,
        "true": 0.18877316688356172
      }
    }
  },
  "41947433": {
    "reverted": {
      "prediction": true,
      "probability": {
        "false": 0.19660444985265774,
        "true": 0.8033955501473425
      }
    }
  }
}

Enable CORS

We should enable CORS so tools that want to use CORS can use it instead of having to rely on JSONP.
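
For illustration, a minimal sketch of what this could look like with Flask (which ORES uses to serve its API); the hook and header values are an assumption, not the deployed configuration:

from flask import Flask

app = Flask(__name__)

@app.after_request
def add_cors_headers(response):
    # Scores are public data, so any origin may read them.
    response.headers["Access-Control-Allow-Origin"] = "*"
    response.headers["Access-Control-Allow-Methods"] = "GET, OPTIONS"
    return response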

ORES should return predictions for available models even if requested together with missing models

Currently, there is a "reverted" model for ptwiki:
http://ores.wmflabs.org/scores/ptwiki/?models=reverted&revids=41338880|41339445

{
  "41338880": {
    "reverted": {
      "prediction": true,
      "probability": {
        "false": 0.2614995893827041,
        "true": 0.7385004106172957
      }
    }
  },
  "41339445": {
    "reverted": {
      "prediction": false,
      "probability": {
        "false": 0.8541050428468395,
        "true": 0.14589495715316042
      }
    }
  }
}

but the models "goodfaith" and "damaging" are not available yet. This should not make the following request fail completely:
http://ores.wmflabs.org/scores/ptwiki/?models=reverted|damaging|goodfaith&revids=41338880|41339445

{
  "error": {
    "code": "bad request",
    "message": "Models '['goodfaith', 'damaging']' not available for ptwiki."
  }
}

I expected a response which would contain the predictions for the available models, and error messages for the others. E.g.:

{
  "error": {
    "code": "bad request",
    "message": "Models '['goodfaith', 'damaging']' not available for ptwiki."
  },
  "41338880": {
    "reverted": {
      "prediction": ...
    }
  },
  "41339445": {
    "reverted": {
      "prediction": ...
    }
  }
}

or even a more verbose output like

{
  "41338880": {
    "reverted": {
      "prediction": ...
    },
    "goodfaith": {
      "error": ...
    },
    "damaging": {
      "error": ...
    }
  },
  "41339445": {
    "reverted": {
      "prediction": ...
    },
    "goodfaith": {
      "error": ...
    },
    "damaging": {
      "error": ...
    }
  }
}
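
A sketch of how the scoring loop could implement this behavior; score() and available_models are hypothetical stand-ins for the real scoring machinery:

def score_revisions(context, models, rev_ids, available_models, score):
    available = [m for m in models if m in available_models]
    missing = [m for m in models if m not in available_models]
    response = {}
    for rev_id in rev_ids:
        response[rev_id] = {}
        for model in available:
            # Score with every model that exists for this wiki.
            response[rev_id][model] = score(context, model, rev_id)
        for model in missing:
            # Attach a per-model error instead of failing the whole request.
            response[rev_id][model] = {
                "error": {
                    "code": "not found",
                    "message": "Model '{0}' not available for {1}."
                               .format(model, context)
                }
            }
    return response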

ORES gives 500 Internal Server Error on revert score query for imported revisions

ORES chokes on this URL: http://ores.wmflabs.org/scores/enwiki/?revids=408030634&models=reverted and returns a 500 Internal Server Error.

My hypothesis is that this is because that very early revision was lost, and later restored from the nostalgia Wikipedia. ORES gives the same error when requesting the revert score of the other imported revision: http://ores.wmflabs.org/scores/enwiki/?revids=408030635&models=reverted

Instead of a server error, it should probably return the same kind of "nice" error as when a revision was deleted.

I discovered this while building archaeo. As part of the associated Wikimania talk, I'm fetching scores for all revisions of the Bee article on the English Wikipedia.

If it helps, Quarry 3990 provides the list of all revisions for "Bee". The imported revisions are chronologically the first ones (they have the earliest rev_timestamps) but they have very high rev_ids.

When scoring a revision, return the version of the model used

Something like this:

{
  "version": 12,
  "scores": {
    "123": {
      "reverted": { ... }
    },
    "456": {
      "reverted": { ... }
    }
  }
}

or maybe

{
  "meta": { "version": 12, "timestamp": ..., "etc": ... },
  "scores": {
    "123": {
      "reverted": { ... }
    },
    "456": {
      "reverted": { ... }
    }
  }
}

This might be useful when debugging user reports.

max_revids as a config param.

There should be a way to limit the maximum number of revisions that can be requested from ORES. Right now, ORES will try to deal with as many revids as you can fit in the URL.
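
A sketch of the validation this would enable; the parameter name max_revids and the error shape are illustrative, not an actual ORES config key:

def check_revids_limit(rev_ids, max_revids=50):
    # Reject the request up front if it asks for too many revisions.
    if len(rev_ids) > max_revids:
        return {
            "error": {
                "code": "bad request",
                "message": "Too many rev_ids: {0} requested, limit is {1}."
                           .format(len(rev_ids), max_revids)
            }
        }
    return None  # within the limit; proceed with scoring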

Provide datasource/feature overrides for scoring

As a follow-on to #100, it would be great to be able to simply send a set of feature values (for example, the current features of a real revision, but with 10 extra references) and get the model's prediction for a revision with those features.

This would let us get a better feel for the behavior of the model, and also let us experiment with using the model to suggest specific interventions that will most improve a page.
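
A sketch of the override idea, assuming hypothetical helpers extract_features() and model.score(); the real revscoring interfaces differ:

def score_with_overrides(model, extract_features, rev_id, overrides):
    # Start from the features of a real revision...
    features = extract_features(rev_id)   # e.g. {"num_references": 12, ...}
    # ...then apply the client-supplied overrides before predicting.
    features.update(overrides)            # e.g. {"num_references": 22}
    return model.score(features)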

ImportError: bad magic number in 'ores.models': b'\x03\xf3\r\n'

When I execute the lines

import sys;sys.path.insert(0, "../") # Makes ores package accessible
from ores.models.enwiki import features
features

of the notebook, I get

ImportError                               Traceback (most recent call last)
<ipython-input-1-0dd819d43f8a> in <module>()
      1 import sys;sys.path.insert(0, "../") # Makes ores package accessible
----> 2 from ores.models.enwiki import features
      3 features

ImportError: bad magic number in 'ores.models': b'\x03\xf3\r\n'

Compute and store scores for all revisions

Love this idea, especially updating it backwards as the LIFO queue of recent changes.

How often would we want to invalidate the cached data? Is there any value in archiving previous model version results, maybe to study the progress of this tool?

Provide a model for thanked edits

I was wondering if we could provide users a way to easily identify recent changes which are more likely to be 'nice edits', and maybe we could use the thanks log as a label for training a model for this.

pip install -r requirements.txt fails

Ignoring link https://pypi.python.org/packages/source/s/scipy/scipy-0.9.0.zip#md5=a37933c9e3c4fdf8d087624cd7dcb47d (from https://pypi.python.org/simple/scipy/), version 0.9.0 doesn't match ==0.15.1
  Using version 0.15.1 (newest of versions: 0.15.1, 0.15.1)
  Downloading from URL https://pypi.python.org/packages/source/s/scipy/scipy-0.15.1.tar.gz#md5=be56cd8e60591d6332aac792a5880110 (from https://pypi.python.org/simple/scipy/)
  Running setup.py (path:/home/yuvipanda/build/scipy/setup.py) egg_info for package scipy
    Download error on https://pypi.python.org/simple/numpy/: [X509] PEM lib (_ssl.c:2734) -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    Download error on https://pypi.python.org/simple/: [X509] PEM lib (_ssl.c:2734) -- Some packages may not be found!
    No local packages or download links found for numpy>=1.5.1
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/home/yuvipanda/build/scipy/setup.py", line 249, in <module>
        setup_package()
      File "/home/yuvipanda/build/scipy/setup.py", line 246, in setup_package
        setup(**metadata)
      File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
        _setup_distribution = dist = klass(attrs)
      File "/home/yuvipanda/lib/python3.4/site-packages/setuptools/dist.py", line 239, in __init__
        self.fetch_build_eggs(attrs.pop('setup_requires'))
      File "/home/yuvipanda/lib/python3.4/site-packages/setuptools/dist.py", line 264, in fetch_build_eggs
        replace_conflicting=True
      File "/home/yuvipanda/lib/python3.4/site-packages/pkg_resources.py", line 580, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/home/yuvipanda/lib/python3.4/site-packages/pkg_resources.py", line 818, in best_match
        return self.obtain(req, installer) # try and download/install
      File "/home/yuvipanda/lib/python3.4/site-packages/pkg_resources.py", line 830, in obtain
        return installer(requirement)
      File "/home/yuvipanda/lib/python3.4/site-packages/setuptools/dist.py", line 314, in fetch_build_egg
        return cmd.easy_install(req)
      File "/home/yuvipanda/lib/python3.4/site-packages/setuptools/command/easy_install.py", line 587, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('numpy>=1.5.1')
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.python.org/simple/numpy/: [X509] PEM lib (_ssl.c:2734) -- Some packages may not be found!

Provide values for the input features along with the predictions

When I get the wp10 score for a revision, what I'd really like to be able to do is to get the values for the features that model uses to calculate the wp10 prediction.

Ideally, these feature values would be as close to the source text as possible, i.e., the actual number of references rather than log(number of references) or similar.

Having these feature values would allow my project to start figuring out how to provide specific advice for article improvement based on which feature scores are particularly weak relative to the others.

Provide a score checker on ores

Helder: halfak: BTW, I was wondering if we could have a path in the server for showing not only the score data, but also the diff, so that people could follow a single link when analysing certain false positives
Helder: for example http://ores-test.wmflabs.org/scores/ptwiki/?models=reverted&revids=42193847 returns the JSON used by the gadget
Helder: but we could have a http://ores-test.wmflabs.org/scores/ptwiki/?models=reverted&revids=42193847&showdiff=true (or something like that) to show an HTML page with the diff obtained from the API, and the scores from ores
halfak: Helder, interesting. One thing we could do is dump the cache from the dependency solver.
halfak: It would be nice to support that. We already have all that data, we're just not returning it.
halfak: We could even allow the user to specify keys that they want.
halfak: e.g. cache=datasource.added_words|datasource.removed_words
halfak: would return two lists of words.
halfak: Then we could build a UI like labels/form_builder
halfak: It could be ores.wmflabs.org/score_checker/
halfak: It wouldn't take much to set up a little DB for filing reports.
halfak: That UI would be able to query the API and generate a browse-able structure for review.
Helder: interesting
Helder: that would be similar to the special pages from AbuseFilter extension
halfak: Indeed.
halfak: :)

Indicate progress when running ORES

I think it would be helpful for the code to display a count, say every 100 revisions processed, to give an indication of how many have been processed. The number could be printed after a new line for formatting purposes. A sketch of what this could look like follows.
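
Here, process() stands in for whatever per-revision work the utility does:

import sys

def process_revisions(rev_ids, process):
    for i, rev_id in enumerate(rev_ids, start=1):
        process(rev_id)
        if i % 100 == 0:
            # Print the running count after a new line, as suggested.
            sys.stderr.write("\n{0} revisions processed\n".format(i))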

Historical model variants

It turns out that different modeling strategies produce different ranges of scoring probabilities and other differences in scorer model behavior. In order not to surprise users with such changes, we should allow users to choose to continue to use an old model even after a new one is deployed.

For example, the following URL gets a score for the "primary" model:

/scores/enwiki/damaging/123456789

This could be done explicitly with a variant param.

/scores/enwiki/damaging/123456789?variant=gradient_boosting

We'd need to change the output for when model info is requested so that there can be multiple variants reported.

/scores/enwiki/damaging/ returns:

{
  "linear_svc_balanced": { ...model_info..., "primary": false},
  "gradient_boosting": { ...model_info..., "primary": true}
}

This would also change the way we think about caching scores. Right now, a score is stored and retrieved based on a key "<context>:<model>:<version>:<rev_id>". We'd need to add "variant" to that: "<context>:<model>:<variant>:<version>:<rev_id>". This raises the question -- when we say "model", do we really mean *model*? We're now generalizing the concept of a "model" to a "modeling problem" -- e.g. "predict whether an edit is damaging".

Under this scheme, we could still make updates to the models by adding new sources of signal and making backwards incompatible changes to `revscoring`, but the overall behavior of each variant should stay relatively consistent. 
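
A sketch of the extended cache key described above, with "variant" inserted between model and version:

def score_cache_key(context, model, variant, version, rev_id):
    return "{0}:{1}:{2}:{3}:{4}".format(
        context, model, variant, version, rev_id)

# score_cache_key("enwiki", "damaging", "gradient_boosting", "0.3.0", 123456789)
# -> "enwiki:damaging:gradient_boosting:0.3.0:123456789"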

Error when setting up development environment

Here are the steps I followed to set up the development environment:

  • On OS X:
git clone https://github.com/wiki-ai/ores
cd ores
vagrant up
vagrant ssh

Then, once in vagrant virtual machine:

git clone https://github.com/wiki-ai/ores
cd ores
utility dev_server
  • Since the above was complaining about the following missing modules, I installed them:
    docopt, yamlconf, flask, flask-jsonpify, stopit, celery, mwapi, revscoring
  • Finally I got stuck at ImportError: No module named 'revscoring'; since pip was not installing it, I ran pip install ores
  • Installing ores fails with the error: https://gist.github.com/sabyasachi/4d134c2c2404e0071fe7

RevisionNotFoundErr

Since we should expect revisions (particularly damaging ones) to end up getting deleted or outright oversighted for a variety of reasons, we should have our code display a more meaningful warning than a traceback. Perhaps something like:

"|Newline|The revision # was not found.|Newline|"

or perhaps it could be a "-" instead of the "."
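
A sketch of the suggested handling; RevisionNotFound is a stand-in for whatever exception the extractor actually raises when a revision is missing:

class RevisionNotFound(Exception):
    """Stand-in for the extractor's missing-revision error."""

def safe_score(score, rev_id):
    try:
        return score(rev_id)
    except RevisionNotFound:
        # Print a meaningful one-line warning instead of a traceback.
        print("\nThe revision {0} was not found.\n".format(rev_id))
        return None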

Celery logs full of exceptions.

For example:

Sep 10 11:21:15 ores-worker-01 celery[17746]: Traceback (most recent call last):
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
Sep 10 11:21:15 ores-worker-01 celery[17746]: R = retval = fun(*args, **kwargs)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 438, in __protected_call__
Sep 10 11:21:15 ores-worker-01 celery[17746]: return self.run(*args, **kwargs)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/celery.py", line 38, in _score_task
Sep 10 11:21:15 ores-worker-01 celery[17746]: return Timeout._score(self, context, model, rev_id, cache=cache)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/score_processor.py", line 44, in _score
Sep 10 11:21:15 ores-worker-01 celery[17746]: return self._process(context, model, process_cache)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/timeout.py", line 21, in _process
Sep 10 11:21:15 ores-worker-01 celery[17746]: seconds=self.timeout)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/timeout.py", line 62, in timeout
Sep 10 11:21:15 ores-worker-01 celery[17746]: result = func(*args, **kwargs)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/score_processor.py", line 31, in _process
Sep 10 11:21:15 ores-worker-01 celery[17746]: score = scoring_context.score(model, cache)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/scoring_contexts/scoring_context.py", line 45, in score
Sep 10 11:21:15 ores-worker-01 celery[17746]: feature_values = list(self.solve(model, cache))
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 251, in _solve_many
Sep 10 11:21:15 ores-worker-01 celery[17746]: value, cache, history = _solve(dependent, context, cache)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 229, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: cache=cache, history=history)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 229, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: cache=cache, history=history)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 229, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: cache=cache, history=history)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 229, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: cache=cache, history=history)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 241, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: raise CaughtDependencyError(message, e, tb)
Sep 10 11:21:15 ores-worker-01 celery[17746]: revscoring.errors.CaughtDependencyError: TimeoutException: Failed to process <revscoring.languages.english.parent_revision.badwords>:

and

Sep 10 10:41:05 ores-worker-01 celery[17746]: [2015-09-10 10:41:05,212: ERROR/MainProcess] Task ores.score_processors.celery._score_task[frwiki:reverted:118516900:0.3.0] raised unexpected: AttributeError("'NoneType' object has no attribute 'json'",)
Sep 10 10:41:05 ores-worker-01 celery[17746]: Traceback (most recent call last):
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
Sep 10 10:41:05 ores-worker-01 celery[17746]: R = retval = fun(*args, **kwargs)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 438, in __protected_call__
Sep 10 10:41:05 ores-worker-01 celery[17746]: return self.run(*args, **kwargs)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/celery.py", line 38, in _score_task
Sep 10 10:41:05 ores-worker-01 celery[17746]: return Timeout._score(self, context, model, rev_id, cache=cache)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/score_processor.py", line 39, in _score
Sep 10 10:41:05 ores-worker-01 celery[17746]: caches={rev_id: cache})[rev_id]
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/score_processor.py", line 23, in _get_root_ds
Sep 10 10:41:05 ores-worker-01 celery[17746]: return scoring_context.extract_roots(model, rev_ids, caches=caches)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/scoring_contexts/scoring_context.py", line 76, in extract_roots
Sep 10 10:41:05 ores-worker-01 celery[17746]: for rev_id, (error, root_vals) in zip(rev_ids, error_root_vals):
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api.py", line 140, in _extract_many
Sep 10 10:41:05 ores-worker-01 celery[17746]: rev_docs = self.get_rev_doc_map(rev_ids_missing_data)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api.py", line 197, in get_rev_doc_map
Sep 10 10:41:05 ores-worker-01 celery[17746]: properties=props)}
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api.py", line 194, in <dictcomp>
Sep 10 10:41:05 ores-worker-01 celery[17746]: return {rd['revid']: rd
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/mw/api/collections/revisions.py", line 131, in query
Sep 10 10:41:05 ores-worker-01 celery[17746]: rev_docs, rvcontinue = self._query(*args, **kwargs)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/mw/api/collections/revisions.py", line 188, in _query
Sep 10 10:41:05 ores-worker-01 celery[17746]: doc = self.session.get(params)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/mw/util/api.py", line 30, in get
Sep 10 10:41:05 ores-worker-01 celery[17746]: return self.request('GET', params, **kwargs)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/mw/api/session.py", line 129, in request
Sep 10 10:41:05 ores-worker-01 celery[17746]: doc = super().request(type, params, **kwargs).json()
Sep 10 10:41:05 ores-worker-01 celery[17746]: AttributeError: 'NoneType' object has no attribute 'json'

The latter should be handled better.

"ImportError: No module named 'jsonschema'" when running ipython

(3.4) helder@std:~/projects/ores/ipython
$ ipython notebook --pylab inline
Traceback (most recent call last):
  File "/home/helder/env/3.4/lib/python3.4/site-packages/IPython/nbformat/validator.py", line 10, in <module>
    from jsonschema import ValidationError
ImportError: No module named 'jsonschema'

revscoring.dependent.DependencyError: Failed to process <parent_revision.markup_chars>: expected string or buffer

Per IRC discussion here is the traceback.

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 102, in _solve
    value = dependent(*args)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/features/feature.py", line 31, in __call__
    value = super().__call__(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 25, in __call__
    return self.process(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/features/parent_revision.py", line 60, in process_markup_chars
    return sum(len(m.group(0)) for m  in MARKUP_RE.finditer(parent_revision_text))
TypeError: expected string or buffer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eva/Github/Objective-Revision-Evaluation-Service/ores/features_reverted.py", line 96, in run
    print('\t'.join(str(v) for v in (list(values) + [reverted])))
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 39, in solve_many
    value, cache, history = _solve(dependent, cache)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 105, in _solve
    .format(dependent, e), e)
revscoring.dependent.DependencyError: Failed to process <parent_revision.markup_chars>: expected string or buffer

Basic request metrics collection and a magic word for precached

We should have basic logging that counts (1) the number of scores requested per unit of time, (2) the proportion of requests that are returned from the cache, generated, or errored, and (3) the response time of scoring requests.

Statsd seems like a good option.

We'll need a magic word to flag requests coming from precached to be logged differently so that we don't count those as real scores used.

This will also need to be abstracted in some way so that such logging can be optional.
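
A sketch using the statsd client library; the metric names, the precached flag, and the wiring are assumptions, not the deployed setup:

import statsd

metrics = statsd.StatsClient("localhost", 8125, prefix="ores")

def record_score_request(outcome, duration_seconds, precached=False):
    # (1) scores requested per unit of time
    metrics.incr("scores_requested")
    # (2) proportion served from cache vs. generated vs. errored
    metrics.incr("score." + outcome)  # "cache_hit", "generated" or "errored"
    # (3) response time, in milliseconds
    metrics.timing("response_time", duration_seconds * 1000)
    # Requests flagged by the precached magic word are counted separately
    # so they are not mistaken for real score usage.
    if precached:
        metrics.incr("precached_requests")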

Backpressure

The API needs a way to put backpressure on a client to encourage them to either reduce their request rate or try again later.
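
One possible mechanism, sketched with Flask: answer 503 with a Retry-After header when the scoring queue is over capacity. The queue-depth check and threshold are stand-ins, not the real implementation:

from flask import Flask, jsonify

app = Flask(__name__)
MAX_QUEUE_DEPTH = 100  # illustrative threshold

def current_queue_depth():
    """Stand-in for inspecting the depth of the scoring queue."""
    return 0

@app.before_request
def apply_backpressure():
    if current_queue_depth() > MAX_QUEUE_DEPTH:
        response = jsonify({"error": {
            "code": "server overloaded",
            "message": "Request rate too high; try again later."}})
        response.status_code = 503
        # Hint to well-behaved clients when to retry.
        response.headers["Retry-After"] = "5"
        return response  # short-circuits the request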

Allow browser caching of ORES API responses

Following up on #45 (comment), it would be nice if some Cache-Control headers were added to the response by the ORES API web service.

Right now, making the same request multiple times reaches ORES each time. Something basic like Cache-Control: public, max-age=300 would make a good start.

In addition (or alternatively), output an ETag header (e.g. set to the digits of a SHA-1 digest of the JSON output, or something more sophisticated). This way browsers will pass it back in the form of If-None-Match, which the application can respond to by short-circuiting the response as 304 Not Modified, which saves a bit of bandwidth.

The latter has the benefit of working regardless of any fixed max-age (e.g. during the first 5 minutes the browser will use its cache without a server round trip; after 5 minutes, the next request goes to the server with an If-None-Match header and either gets back a short 304 response indicating the cache can be re-used, or uses the contents of the fresh 200 OK response).
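
A sketch of the ETag side of this in Flask: derive the tag from a SHA-1 digest of the JSON body and short-circuit with 304 when If-None-Match matches. Details such as quoting and the max-age value are illustrative:

import hashlib
from flask import Flask, Response, request

app = Flask(__name__)

def etagged_json(body):
    etag = hashlib.sha1(body.encode("utf-8")).hexdigest()
    # If the browser already has this exact body, answer 304 with no payload.
    if request.headers.get("If-None-Match") == etag:
        return Response(status=304)
    response = Response(body, mimetype="application/json")
    response.headers["ETag"] = etag
    response.headers["Cache-Control"] = "public, max-age=300"
    return response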

ORES wp10 endpoint sometimes returns a float instead of score probabilities

I've gotten occasional error logs while handling data fetched from ORES. In particular, after parsing the JSON into a Ruby hash and then getting the data for a particular revision from that hash, the routine checks whether the key 'probability' is present in that revision's data. But instead of being a hash, that data is sometimes a float.

Here's the error:

lib/importers/revision_score_importer.rb in block in save_scores at line 64
NoMethodError: undefined method `key?' for 1443972475.2339349:Float

Here's the routine that ingests ores data: https://github.com/WikiEducationFoundation/WikiEduDashboard/blob/master/lib/importers/revision_score_importer.rb

This only seems to happen every once in a while, and our system gathers revision scores for every revision in the system... so the same revisions that cause an error one time seem to be working fine on subsequent attempts.
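
In Python terms, the defensive check the importer needs looks roughly like this (the Ruby routine linked above would do the equivalent):

def extract_probability(revision_data):
    # Only treat the payload as a score if it is a mapping that carries
    # a "probability" key; occasionally ORES hands back a bare float.
    if isinstance(revision_data, dict) and "probability" in revision_data:
        return revision_data["probability"]
    return None  # malformed payload; skip and retry later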

Do not redirect from https to http

Currently if I access
https://ores-test.wmflabs.org/scores/ptwiki?models=reverted&revids=42150243
I get redirected to
http://ores-test.wmflabs.org/scores/ptwiki/?models=reverted&revids=42150243

In particular, when loading such a URL in JavaScript from an https:// page, the target will be blocked due to
https://developer.mozilla.org/docs/Security/MixedContent

In my script, I had to add the "/" to the end of the URL as a workaround:
he7d3r/mw-gadget-ScoredRevisions@50ad171

I get a "NameError: name 'RevisionDocumentNotFound' is not defined" from features_reverted.py, line 90.

Command:

cat quarry-2159-20000-revisions-from-trwiki-for-revscores-run10107.tsv | tail -n+2 | ./features_reverted ores.features.trwiki.damaging --language=revscoring.languages.turkish --api=https://tr.wikipedia.org/w/api.php > /Datasets/trwiki.features_reverted.20k_2.tsv

First traceback

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/datasources/rev_doc.py", line 9, in process
    'flags', 'size'})
  File "/usr/local/lib/python3.4/dist-packages/mw/api/collections/revisions.py", line 45, in get
    raise KeyError(rev_id)
KeyError: 15117380

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eva/Github/Objective-Revision-Evaluation-Service/ores/features_reverted.py", line 90, in run
    print('\t'.join(str(v) for v in (list(values) + [reverted])))
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 34, in solve_many
    value, cache, history = _solve(dependent, cache)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 95, in _solve
    value = dependent(*values)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 20, in __call__
    return self.process(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/datasources/rev_doc.py", line 12, in proces                                                                                                             s
    raise RevisionDocumentNotFound({'rev_id': rev_id})
NameError: name 'RevisionDocumentNotFound' is not defined


-------------------------------------------
Second traceback

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/datasources/rev_doc.py", line 9, in process
    'flags', 'size'})
  File "/usr/local/lib/python3.4/dist-packages/mw/api/collections/revisions.py", line 45, in get
    raise KeyError(rev_id)
KeyError: 15096494

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eva/Github/Objective-Revision-Evaluation-Service/ores/features_reverted.py", line 90, in run
    print('\t'.join(str(v) for v in (list(values) + [reverted])))
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 34, in solve_many
    value, cache, history = _solve(dependent, cache)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 95, in _solve
    value = dependent(*values)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 20, in __call__
    return self.process(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/datasources/rev_doc.py", line 12, in proces                                                                                                             s
    raise RevisionDocumentNotFound({'rev_id': rev_id})
NameError: name 'RevisionDocumentNotFound' is not defined
