wikimedia / ores

🤖 A hosting service for 'revscoring' models.

Home Page: https://mediawiki.org/wiki/ORES

License: MIT License

Languages: Python 76.34%, Jupyter Notebook 16.21%, JavaScript 1.84%, HTML 4.93%, CSS 0.33%, Makefile 0.10%, Dockerfile 0.20%, Shell 0.05%

Topics: artificial-intelligence

Introduction

ORES

โš ๏ธ Warning: As of late 2023, the ORES infrastructure is being deprecated by the WMF Machine Learning team, please check https://wikitech.wikimedia.org/wiki/ORES for more info.

While the code in this repository may still work, it is unmaintained and may break at any time. Special consideration should also be given to the machine learning models, whose prediction quality may drift over time.

The replacement for ORES and associated infrastructure is Lift Wing: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing

Some Revscoring models from ORES run on the Lift Wing infrastructure, but they are otherwise unsupported (no new training or code updates).

They can be downloaded from the links documented at: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Revscoring_models_(migrated_from_ORES)

In the long term, some or all of these models may be replaced by newer models specifically tailored to run on modern ML infrastructure like Lift Wing.

If you have any questions, contact the WMF Machine Learning team: https://wikitech.wikimedia.org/wiki/Machine_Learning

A webserver for hosting scoring services. For more information, see the ORES documentation on MediaWiki.

Installation

ORES is based on Python 3. Use pip to install ORES:

pip install ores (or pip3 install ores if your distribution defaults to Python 2)

If you're running with the default Redis configuration, you'll need to install a few more optional libraries:

pip install ores[redis]

Then you can run a test server with:

ores applications.wsgi

Use the -h argument to view its usage.

ores applications.wsgi -h

Visit these pages to see if your installation works:

http://localhost:8080/
http://localhost:8080/v2/scores/testwiki/revid/641962088?features=true
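
If you'd rather check from code, the following is a minimal sketch that queries the same endpoint with the Python standard library (it assumes the test server from the previous step is listening on port 8080):

import json
import urllib.request

# Fetch a score (with feature values) from the local test server and
# pretty-print the JSON response.
url = ("http://localhost:8080/v2/scores/testwiki/revid/641962088"
       "?features=true")
with urllib.request.urlopen(url) as response:
    doc = json.loads(response.read().decode("utf-8"))
print(json.dumps(doc, indent=2))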

Running ORES using Docker Compose

As an easy way to run ORES for development, download and install docker-compose and then run:

docker-compose build && docker-compose up

ORES will be accessible at http://localhost:8080.

Running tests

For a native installation, make sure the test dependencies are installed:

pip install -r test-requirements.txt

then run:

py.test .

For a Docker installation, run:

docker-compose exec ores-worker py.test /ores

Utilities

ORES provides several utilities:

  • precached: Starts a daemon that requests scores for revisions as they happen
  • score_revisions: Scores a set of revisions using an ORES API
  • stress_test: Scores a large set of revisions at a configurable rate
  • test_api: Runs a series of tests against a live ORES API

Run any of them through the ./utility wrapper:

./utility test_api -h

For Docker installations, run it inside one of the containers:

docker-compose exec ores-worker /ores/utility test_api -h

Authors

Contributors: accraze, adamwight, arlolra, chaitanyamogal, codez266, elukey, ethgra, halfak, he7d3r, kevinbazira, ladsgroup, legoktm, mdew192837, perryprog, pix1234, revi, soumyaa1804, tgr, tklausmann, toarushiroineko, ureesoriano, yuvipanda

Issues

IndexError: list index out of range on ores-test

Today I got the following from
http://ores-test.wmflabs.org/scores/ptwiki?models=reverted&revids=41947433|41947297

{
  "41947297": {
    "reverted": {
      "error": {
        "message": "expected string or buffer",
        "type": "<class 'TypeError'>"
      }
    }
  },
  "41947433": {
    "reverted": {
      "error": {
        "message": "list index out of range",
        "type": "<class 'IndexError'>"
      }
    }
  }
}

A few moments later, the same URL returned this:

{
  "41947297": {
    "reverted": {
      "error": {
        "message": "list index out of range",
        "type": "<class 'IndexError'>"
      }
    }
  },
  "41947433": {
    "reverted": {
      "prediction": true,
      "probability": {
        "false": 0.20332099489369587,
        "true": 0.7966790051063041
      }
    }
  }
}

I tried a third time, and the result changed again, to

{
  "41947297": {
    "reverted": {
      "prediction": false,
      "probability": {
        "false": 0.8112268331164385,
        "true": 0.18877316688356172
      }
    }
  },
  "41947433": {
    "reverted": {
      "prediction": true,
      "probability": {
        "false": 0.19660444985265774,
        "true": 0.8033955501473425
      }
    }
  }
}

Enable CORS

We should enable CORS so tools that want to use CORS can use it instead of having to rely on JSONP.
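
For illustration, a minimal sketch of what this could look like with Flask (which ORES uses to serve its API); the hook and header values are an assumption, not the deployed configuration:

from flask import Flask

app = Flask(__name__)

@app.after_request
def add_cors_headers(response):
    # Scores are public data, so any origin may read them.
    response.headers["Access-Control-Allow-Origin"] = "*"
    response.headers["Access-Control-Allow-Methods"] = "GET, OPTIONS"
    return response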

ORES should return predictions for available models even if requested together with missing models

Currently, there is a "reverted" model for ptwiki:
http://ores.wmflabs.org/scores/ptwiki/?models=reverted&revids=41338880|41339445

{
  "41338880": {
    "reverted": {
      "prediction": true,
      "probability": {
        "false": 0.2614995893827041,
        "true": 0.7385004106172957
      }
    }
  },
  "41339445": {
    "reverted": {
      "prediction": false,
      "probability": {
        "false": 0.8541050428468395,
        "true": 0.14589495715316042
      }
    }
  }
}

but the models "goodfaith" and "damaging" are not available yet. This should not make the following request fail completely:
http://ores.wmflabs.org/scores/ptwiki/?models=reverted|damaging|goodfaith&revids=41338880|41339445

{
  "error": {
    "code": "bad request",
    "message": "Models '['goodfaith', 'damaging']' not available for ptwiki."
  }
}

I expected a response which would contain the predictions for the available models, and error messages for the others. E.g.:

{
  "error": {
    "code": "bad request",
    "message": "Models '['goodfaith', 'damaging']' not available for ptwiki."
  },
  "41338880": {
    "reverted": {
      "prediction": ...
    }
  },
  "41339445": {
    "reverted": {
      "prediction": ...
    }
  }
}

or even a more verbose output like

{
  "41338880": {
    "reverted": {
      "prediction": ...
    },
    "goodfaith": {
      "error": ...
    },
    "damaging": {
      "error": ...
    }
  },
  "41339445": {
    "reverted": {
      "prediction": ...
    },
    "goodfaith": {
      "error": ...
    },
    "damaging": {
      "error": ...
    }
  }
}
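
A sketch of how the scoring loop could implement this behavior; score() and available_models are hypothetical stand-ins for the real scoring machinery:

def score_revisions(context, models, rev_ids, available_models, score):
    available = [m for m in models if m in available_models]
    missing = [m for m in models if m not in available_models]
    response = {}
    for rev_id in rev_ids:
        response[rev_id] = {}
        for model in available:
            # Score with every model that exists for this wiki.
            response[rev_id][model] = score(context, model, rev_id)
        for model in missing:
            # Attach a per-model error instead of failing the whole request.
            response[rev_id][model] = {
                "error": {
                    "code": "not found",
                    "message": "Model '{0}' not available for {1}."
                               .format(model, context)
                }
            }
    return response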

ORES gives 500 Internal Server Error on revert score query for imported revisions

ORES chokes on this URL: http://ores.wmflabs.org/scores/enwiki/?revids=408030634&models=reverted and returns a 500 Internal Server Error.

My hypothesis is that this is because that very early revision was lost, and later restored from the nostalgia Wikipedia. ORES gives the same error when requesting the revert score of the other imported revision: http://ores.wmflabs.org/scores/enwiki/?revids=408030635&models=reverted

Instead of a server error, it should probably return the same kind of "nice" error as when a revision was deleted.

I discovered this while building archaeo. As part of the associated Wikimania talk, I'm fetching scores for all revisions of the Bee article on the English Wikipedia.

If it helps, Quarry 3990 provides the list of all revisions for "Bee". The imported revisions are chronologically the first ones (they have the earliest rev_timestamps) but they have very high rev_ids.

When scoring a revision, return the version of the model used

Something like this:

{
  "version": 12,
  "scores": {
    "123": {
      "reverted": { ... }
    },
    "456": {
      "reverted": { ... }
    }
  }
}

or maybe

{
  "meta": { "version": 12, "timestamp": ..., "etc": ... },
  "scores": {
    "123": {
      "reverted": { ... }
    },
    "456": {
      "reverted": { ... }
    }
  }
}

This might be useful when debugging user reports.

max_revids as a config param.

There should be a way to limit the maximum number of revisions that can be requested from ORES. Right now, ORES will try to deal with as many revids as you can fit in the URL.
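
A sketch of the validation this would enable; the parameter name max_revids and the error shape are illustrative, not an actual ORES config key:

def check_revids_limit(rev_ids, max_revids=50):
    # Reject the request up front if it asks for too many revisions.
    if len(rev_ids) > max_revids:
        return {
            "error": {
                "code": "bad request",
                "message": "Too many rev_ids: {0} requested, limit is {1}."
                           .format(len(rev_ids), max_revids)
            }
        }
    return None  # within the limit; proceed with scoring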

Provide datasource/feature overrides for scoring

As a follow-on to #100, it would be great to be able to simply send a set of feature values (for example, the current features of a real revision, but with 10 extra references) and get the model's prediction for a revision with those features.

This would let us get a better feel for the behavior of the model, and also let us experiment with using the model to suggest specific interventions that will most improve a page.
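
A sketch of the override idea, assuming hypothetical helpers extract_features() and model.score(); the real revscoring interfaces differ:

def score_with_overrides(model, extract_features, rev_id, overrides):
    # Start from the features of a real revision...
    features = extract_features(rev_id)   # e.g. {"num_references": 12, ...}
    # ...then apply the client-supplied overrides before predicting.
    features.update(overrides)            # e.g. {"num_references": 22}
    return model.score(features)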

ImportError: bad magic number in 'ores.models': b'\x03\xf3\r\n'

When I execute the lines

import sys;sys.path.insert(0, "../") # Makes ores package accessible
from ores.models.enwiki import features
features

of the notebook, I get

ImportError                               Traceback (most recent call last)
<ipython-input-1-0dd819d43f8a> in <module>()
      1 import sys;sys.path.insert(0, "../") # Makes ores package accessible
----> 2 from ores.models.enwiki import features
      3 features

ImportError: bad magic number in 'ores.models': b'\x03\xf3\r\n'

Compute and store scores for all revisions

Love this idea, especially updating it backwards as the LIFO queue of recent changes.

How often would we want to invalidate the cached data? Is there any value in archiving previous model version results, maybe to study the progress of this tool?

Provide a model for thanked edits

I was wondering if we could provide users a way to easily identify recent changes which are more likely to be 'nice edits', and maybe we could use the thanks log as a label for training a model for this.

pip install -r requirements.txt fails

Ignoring link https://pypi.python.org/packages/source/s/scipy/scipy-0.9.0.zip#md5=a37933c9e3c4fdf8d087624cd7dcb47d (from https://pypi.python.org/simple/scipy/), version 0.9.0 doesn't match ==0.15.1
  Using version 0.15.1 (newest of versions: 0.15.1, 0.15.1)
  Downloading from URL https://pypi.python.org/packages/source/s/scipy/scipy-0.15.1.tar.gz#md5=be56cd8e60591d6332aac792a5880110 (from https://pypi.python.org/simple/scipy/)
  Running setup.py (path:/home/yuvipanda/build/scipy/setup.py) egg_info for package scipy
    Download error on https://pypi.python.org/simple/numpy/: [X509] PEM lib (_ssl.c:2734) -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    Download error on https://pypi.python.org/simple/: [X509] PEM lib (_ssl.c:2734) -- Some packages may not be found!
    No local packages or download links found for numpy>=1.5.1
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/home/yuvipanda/build/scipy/setup.py", line 249, in <module>
        setup_package()
      File "/home/yuvipanda/build/scipy/setup.py", line 246, in setup_package
        setup(**metadata)
      File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
        _setup_distribution = dist = klass(attrs)
      File "/home/yuvipanda/lib/python3.4/site-packages/setuptools/dist.py", line 239, in __init__
        self.fetch_build_eggs(attrs.pop('setup_requires'))
      File "/home/yuvipanda/lib/python3.4/site-packages/setuptools/dist.py", line 264, in fetch_build_eggs
        replace_conflicting=True
      File "/home/yuvipanda/lib/python3.4/site-packages/pkg_resources.py", line 580, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/home/yuvipanda/lib/python3.4/site-packages/pkg_resources.py", line 818, in best_match
        return self.obtain(req, installer) # try and download/install
      File "/home/yuvipanda/lib/python3.4/site-packages/pkg_resources.py", line 830, in obtain
        return installer(requirement)
      File "/home/yuvipanda/lib/python3.4/site-packages/setuptools/dist.py", line 314, in fetch_build_egg
        return cmd.easy_install(req)
      File "/home/yuvipanda/lib/python3.4/site-packages/setuptools/command/easy_install.py", line 587, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('numpy>=1.5.1')
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.python.org/simple/numpy/: [X509] PEM lib (_ssl.c:2734) -- Some packages may not be found!

Provide values for the input features along with the predictions

When I get the wp10 score for a revision, what I'd really like to be able to do is to get the values for the features that model uses to calculate the wp10 prediction.

Ideally, these feature values would be as close to the source text as possible, i.e., the actual number of references rather than log(number of references) or similar.

Having these feature values would allow my project to start figuring out how to provide specific advice for article improvement based on which feature scores are particularly weak relative to the others.

Provide a score checker on ores

Helder: halfak: BTW, I was wondering if we could have a path in the server for showing not only the score data, but also the diff, so that people could follow a single link when analysing certain false positives
Helder: for example http://ores-test.wmflabs.org/scores/ptwiki/?models=reverted&revids=42193847 returns the JSON used by the gadget
Helder: but we could have a http://ores-test.wmflabs.org/scores/ptwiki/?models=reverted&revids=42193847&showdiff=true (or something like that) to show an HTML page with the diff obtained from the API, and the scores from ores
halfak: Helder, interesting. One thing we could do is dump the cache from the dependency solver.
halfak: It would be nice to support that. We already have all that data, we're just not returning it.
halfak: We could even allow the user to specify keys that they want.
halfak: e.g. cache=datasource.added_words|datasource.removed_words
halfak: would return two lists of words.
halfak: Then we could build a UI like labels/form_builder
halfak: It could be ores.wmflabs.org/score_checker/
halfak: It wouldn't take much to set up a little DB for filing reports.
halfak: That UI would be able to query the API and generate a browse-able structure for review.
Helder: interesting
Helder: that would be similar to the special pages from AbuseFilter extension
halfak: Indeed.
halfak: :)

Indicate progress when running ORES

I think it would be helpful for the code to display a count, say every 100 revisions processed, to give an indication of how many have been processed. The number could be printed after a new line for formatting purposes. A sketch of what this could look like follows.
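
Here, process() stands in for whatever per-revision work the utility does:

import sys

def process_revisions(rev_ids, process):
    for i, rev_id in enumerate(rev_ids, start=1):
        process(rev_id)
        if i % 100 == 0:
            # Print the running count after a new line, as suggested.
            sys.stderr.write("\n{0} revisions processed\n".format(i))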

Historical model variants

It turns out that different modeling strategies produce different ranges of scoring probabilities and other differences in scorer model behavior. In order not to surprise users with such changes, we should allow users to choose to continue to use an old model even after a new one is deployed.

For example, the following URL gets a score for the "primary" model:

/scores/enwiki/damaging/123456789

This could be done explicitly with a variant param.

/scores/enwiki/damaging/123456789?variant=gradient_boosting

We'd need to change the output for when model info is requested so that there can be multiple variants reported.

/scores/enwiki/damaging/ returns:

{
  "linear_svc_balanced": { ...model_info..., "primary": false},
  "gradient_boosting": { ...model_info..., "primary": true}
}

This would also change the way we think about caching scores. Right now, a score is stored and retrieved based on a key "<context>:<model>:<version>:<rev_id>". We'd need to add "variant" to that: "<context>:<model>:<variant>:<version>:<rev_id>". This raises the question -- when we say "model", do we really mean *model*? We're now generalizing the concept of a "model" to a "modeling problem" -- e.g. "predict whether an edit is damaging".

Under this scheme, we could still make updates to the models by adding new sources of signal and making backwards incompatible changes to `revscoring`, but the overall behavior of each variant should stay relatively consistent. 
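
A sketch of the extended cache key described above, with "variant" inserted between model and version:

def score_cache_key(context, model, variant, version, rev_id):
    return "{0}:{1}:{2}:{3}:{4}".format(
        context, model, variant, version, rev_id)

# score_cache_key("enwiki", "damaging", "gradient_boosting", "0.3.0", 123456789)
# -> "enwiki:damaging:gradient_boosting:0.3.0:123456789"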

Error when setting up development environment

Here are the steps I followed to set up the development environment:

  • On OS X:
git clone https://github.com/wiki-ai/ores
cd ores
vagrant up
vagrant ssh

Then, once in vagrant virtual machine:

git clone https://github.com/wiki-ai/ores
cd ores
utility dev_server
  • Since the above was complaining about the following missing modules, I installed them:
    docopt, yamlconf, flask, flask-jsonpify, stopit, celery, mwapi, revscoring
  • Finally I got stuck at ImportError: No module named 'revscoring'; since pip was not installing it, I ran pip install ores
  • Installing ores fails with the error: https://gist.github.com/sabyasachi/4d134c2c2404e0071fe7

RevisionNotFoundErr

Since we should expect revisions (particularly damaging ones) to end up getting deleted or outright oversighted for a variety of reasons, we should have our code display a more meaningful warning than a traceback. Perhaps something like:

"|Newline|The revision # was not found.|Newline|"

or perhaps it could be a "-" instead of the "."
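
A sketch of the suggested handling; RevisionNotFound is a stand-in for whatever exception the extractor actually raises when a revision is missing:

class RevisionNotFound(Exception):
    """Stand-in for the extractor's missing-revision error."""

def safe_score(score, rev_id):
    try:
        return score(rev_id)
    except RevisionNotFound:
        # Print a meaningful one-line warning instead of a traceback.
        print("\nThe revision {0} was not found.\n".format(rev_id))
        return None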

Celery logs full of exceptions.

For example:

Sep 10 11:21:15 ores-worker-01 celery[17746]: Traceback (most recent call last):
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
Sep 10 11:21:15 ores-worker-01 celery[17746]: R = retval = fun(*args, **kwargs)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 438, in __protected_call__
Sep 10 11:21:15 ores-worker-01 celery[17746]: return self.run(*args, **kwargs)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/celery.py", line 38, in _score_task
Sep 10 11:21:15 ores-worker-01 celery[17746]: return Timeout._score(self, context, model, rev_id, cache=cache)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/score_processor.py", line 44, in _score
Sep 10 11:21:15 ores-worker-01 celery[17746]: return self._process(context, model, process_cache)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/timeout.py", line 21, in _process
Sep 10 11:21:15 ores-worker-01 celery[17746]: seconds=self.timeout)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/timeout.py", line 62, in timeout
Sep 10 11:21:15 ores-worker-01 celery[17746]: result = func(*args, **kwargs)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/score_processor.py", line 31, in _process
Sep 10 11:21:15 ores-worker-01 celery[17746]: score = scoring_context.score(model, cache)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/scoring_contexts/scoring_context.py", line 45, in score
Sep 10 11:21:15 ores-worker-01 celery[17746]: feature_values = list(self.solve(model, cache))
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 251, in _solve_many
Sep 10 11:21:15 ores-worker-01 celery[17746]: value, cache, history = _solve(dependent, context, cache)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 229, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: cache=cache, history=history)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 229, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: cache=cache, history=history)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 229, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: cache=cache, history=history)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 229, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: cache=cache, history=history)
Sep 10 11:21:15 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/dependencies/functions.py", line 241, in _solve
Sep 10 11:21:15 ores-worker-01 celery[17746]: raise CaughtDependencyError(message, e, tb)
Sep 10 11:21:15 ores-worker-01 celery[17746]: revscoring.errors.CaughtDependencyError: TimeoutException: Failed to process <revscoring.languages.english.parent_revision.badwords>:

and

Sep 10 10:41:05 ores-worker-01 celery[17746]: [2015-09-10 10:41:05,212: ERROR/MainProcess] Task ores.score_processors.celery._score_task[frwiki:reverted:118516900:0.3.0] raised unexpected: AttributeError("'NoneType' object has no attribute 'json'",)
Sep 10 10:41:05 ores-worker-01 celery[17746]: Traceback (most recent call last):
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
Sep 10 10:41:05 ores-worker-01 celery[17746]: R = retval = fun(*args, **kwargs)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 438, in __protected_call__
Sep 10 10:41:05 ores-worker-01 celery[17746]: return self.run(*args, **kwargs)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/celery.py", line 38, in _score_task
Sep 10 10:41:05 ores-worker-01 celery[17746]: return Timeout._score(self, context, model, rev_id, cache=cache)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/score_processor.py", line 39, in _score
Sep 10 10:41:05 ores-worker-01 celery[17746]: caches={rev_id: cache})[rev_id]
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/score_processors/score_processor.py", line 23, in _get_root_ds
Sep 10 10:41:05 ores-worker-01 celery[17746]: return scoring_context.extract_roots(model, rev_ids, caches=caches)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/ores/scoring_contexts/scoring_context.py", line 76, in extract_roots
Sep 10 10:41:05 ores-worker-01 celery[17746]: for rev_id, (error, root_vals) in zip(rev_ids, error_root_vals):
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api.py", line 140, in _extract_many
Sep 10 10:41:05 ores-worker-01 celery[17746]: rev_docs = self.get_rev_doc_map(rev_ids_missing_data)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api.py", line 197, in get_rev_doc_map
Sep 10 10:41:05 ores-worker-01 celery[17746]: properties=props)}
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api.py", line 194, in <dictcomp>
Sep 10 10:41:05 ores-worker-01 celery[17746]: return {rd['revid']: rd
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/mw/api/collections/revisions.py", line 131, in query
Sep 10 10:41:05 ores-worker-01 celery[17746]: rev_docs, rvcontinue = self._query(*args, **kwargs)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/mw/api/collections/revisions.py", line 188, in _query
Sep 10 10:41:05 ores-worker-01 celery[17746]: doc = self.session.get(params)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/mw/util/api.py", line 30, in get
Sep 10 10:41:05 ores-worker-01 celery[17746]: return self.request('GET', params, **kwargs)
Sep 10 10:41:05 ores-worker-01 celery[17746]: File "/srv/ores/venv/lib/python3.4/site-packages/mw/api/session.py", line 129, in request
Sep 10 10:41:05 ores-worker-01 celery[17746]: doc = super().request(type, params, **kwargs).json()
Sep 10 10:41:05 ores-worker-01 celery[17746]: AttributeError: 'NoneType' object has no attribute 'json'

The latter should be handled better.

"ImportError: No module named 'jsonschema'" when running ipython

(3.4) helder@std:~/projects/ores/ipython
$ ipython notebook --pylab inline
Traceback (most recent call last):
  File "/home/helder/env/3.4/lib/python3.4/site-packages/IPython/nbformat/validator.py", line 10, in <module>
    from jsonschema import ValidationError
ImportError: No module named 'jsonschema'

revscoring.dependent.DependencyError: Failed to process <parent_revision.markup_chars>: expected string or buffer

Per IRC discussion here is the traceback.

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 102, in _solve
    value = dependent(*args)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/features/feature.py", line 31, in __call__
    value = super().__call__(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 25, in __call__
    return self.process(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/features/parent_revision.py", line 60, in process_markup_chars
    return sum(len(m.group(0)) for m  in MARKUP_RE.finditer(parent_revision_text))
TypeError: expected string or buffer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eva/Github/Objective-Revision-Evaluation-Service/ores/features_reverted.py", line 96, in run
    print('\t'.join(str(v) for v in (list(values) + [reverted])))
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 39, in solve_many
    value, cache, history = _solve(dependent, cache)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 96, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 105, in _solve
    .format(dependent, e), e)
revscoring.dependent.DependencyError: Failed to process <parent_revision.markup_chars>: expected string or buffer

Basic request metrics collection and a magic word for precached

We should have basic logging that counts (1) the number of scores requested per unit of time, (2) the proportion of requests that are returned from the cache, generated, or errored, and (3) the response time of scoring requests.

Statsd seems like a good option.

We'll need a magic word to flag requests coming from precached to be logged differently so that we don't count those as real scores used.

This will also need to be abstracted in some way so that such logging can be optional.
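
A sketch using the statsd client library; the metric names, the precached flag, and the wiring are assumptions, not the deployed setup:

import statsd

metrics = statsd.StatsClient("localhost", 8125, prefix="ores")

def record_score_request(outcome, duration_seconds, precached=False):
    # (1) scores requested per unit of time
    metrics.incr("scores_requested")
    # (2) proportion served from cache vs. generated vs. errored
    metrics.incr("score." + outcome)  # "cache_hit", "generated" or "errored"
    # (3) response time, in milliseconds
    metrics.timing("response_time", duration_seconds * 1000)
    # Requests flagged by the precached magic word are counted separately
    # so they are not mistaken for real score usage.
    if precached:
        metrics.incr("precached_requests")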

Backpressure

The API needs a way to put backpressure on a client to encourage them to either reduce their request rate or try again later.
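
One possible mechanism, sketched with Flask: answer 503 with a Retry-After header when the scoring queue is over capacity. The queue-depth check and threshold are stand-ins, not the real implementation:

from flask import Flask, jsonify

app = Flask(__name__)
MAX_QUEUE_DEPTH = 100  # illustrative threshold

def current_queue_depth():
    """Stand-in for inspecting the depth of the scoring queue."""
    return 0

@app.before_request
def apply_backpressure():
    if current_queue_depth() > MAX_QUEUE_DEPTH:
        response = jsonify({"error": {
            "code": "server overloaded",
            "message": "Request rate too high; try again later."}})
        response.status_code = 503
        # Hint to well-behaved clients when to retry.
        response.headers["Retry-After"] = "5"
        return response  # short-circuits the request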

Allow browser caching of ORES API responses

Following up on #45 (comment), it would be nice if some Cache-Control headers were added to the response by the ORES API web service.

Right now, making the same request multiple times reaches ORES each time. Something basic like Cache-Control: public, max-age=300 would make a good start.

In addition (or alternatively), output an ETag header (e.g. set to the digits of a SHA-1 digest of the JSON output, or something more sophisticated). This way browsers will pass it back in the form of If-None-Match, which the application can respond to by short-circuiting the response as 304 Not Modified, which saves a bit of bandwidth.

The latter has the benefit of working regardless of any fixed max-age (e.g. during the first 5 minutes the browser will use its cache without a server round trip; after 5 minutes, the next request goes to the server with an If-None-Match header and either gets back a short 304 response indicating the cache can be re-used, or uses the contents of the fresh 200 OK response).
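
A sketch of the ETag side of this in Flask: derive the tag from a SHA-1 digest of the JSON body and short-circuit with 304 when If-None-Match matches. Details such as quoting and the max-age value are illustrative:

import hashlib
from flask import Flask, Response, request

app = Flask(__name__)

def etagged_json(body):
    etag = hashlib.sha1(body.encode("utf-8")).hexdigest()
    # If the browser already has this exact body, answer 304 with no payload.
    if request.headers.get("If-None-Match") == etag:
        return Response(status=304)
    response = Response(body, mimetype="application/json")
    response.headers["ETag"] = etag
    response.headers["Cache-Control"] = "public, max-age=300"
    return response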

ORES wp10 endpoint sometimes returns a float instead of score probabilities

I've gotten occasional error logs while handling data fetched from ORES. In particular, after parsing the JSON into a Ruby hash and then getting the data for a particular revision from that hash, the routine checks whether the key 'probability' is present in that revision's data. But instead of being a hash, that data is sometimes a float.

Here's the error:

lib/importers/revision_score_importer.rb in block in save_scores at line 64
NoMethodError: undefined method `key?' for 1443972475.2339349:Float

Here's the routine that ingests ores data: https://github.com/WikiEducationFoundation/WikiEduDashboard/blob/master/lib/importers/revision_score_importer.rb

This only seems to happen every once in a while, and our system gathers revision scores for every revision in the system... so the same revisions that cause an error one time seem to be working fine on subsequent attempts.
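
In Python terms, the defensive check the importer needs looks roughly like this (the Ruby routine linked above would do the equivalent):

def extract_probability(revision_data):
    # Only treat the payload as a score if it is a mapping that carries
    # a "probability" key; occasionally ORES hands back a bare float.
    if isinstance(revision_data, dict) and "probability" in revision_data:
        return revision_data["probability"]
    return None  # malformed payload; skip and retry later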

Do not redirect from https to http

Currently if I access
https://ores-test.wmflabs.org/scores/ptwiki?models=reverted&revids=42150243
I get redirected to
http://ores-test.wmflabs.org/scores/ptwiki/?models=reverted&revids=42150243

In particular, when loading such a URL in JavaScript from an https:// page, the target will be blocked due to
https://developer.mozilla.org/docs/Security/MixedContent

In my script, I had to add the "/" to the end of the URL as a workaround:
he7d3r/mw-gadget-ScoredRevisions@50ad171

I get a "NameError: name 'RevisionDocumentNotFound' is not defined" from features_reverted.py, line 90.

Command:

cat quarry-2159-20000-revisions-from-trwiki-for-revscores-run10107.tsv | tail -n+2 | ./features_reverted ores.features.trwiki.damaging --language=revscoring.languages.turkish --api=https://tr.wikipedia.org/w/api.php > /Datasets/trwiki.features_reverted.20k_2.tsv

First traceback

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/datasources/rev_doc.py", line 9, in process
    'flags', 'size'})
  File "/usr/local/lib/python3.4/dist-packages/mw/api/collections/revisions.py", line 45, in get
    raise KeyError(rev_id)
KeyError: 15117380

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eva/Github/Objective-Revision-Evaluation-Service/ores/features_reverted.py", line 90, in run
    print('\t'.join(str(v) for v in (list(values) + [reverted])))
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 34, in solve_many
    value, cache, history = _solve(dependent, cache)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 95, in _solve
    value = dependent(*values)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 20, in __call__
    return self.process(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/datasources/rev_doc.py", line 12, in proces                                                                                                             s
    raise RevisionDocumentNotFound({'rev_id': rev_id})
NameError: name 'RevisionDocumentNotFound' is not defined


-------------------------------------------
Second traceback

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/datasources/rev_doc.py", line 9, in process
    'flags', 'size'})
  File "/usr/local/lib/python3.4/dist-packages/mw/api/collections/revisions.py", line 45, in get
    raise KeyError(rev_id)
KeyError: 15096494

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/eva/Github/Objective-Revision-Evaluation-Service/ores/features_reverted.py", line 90, in run
    print('\t'.join(str(v) for v in (list(values) + [reverted])))
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 34, in solve_many
    value, cache, history = _solve(dependent, cache)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 91, in _solve
    value, cache, history = _solve(dependency, cache, history)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 95, in _solve
    value = dependent(*values)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/dependent.py", line 20, in __call__
    return self.process(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/revscoring-0.1.0-py3.4.egg/revscoring/datasources/rev_doc.py", line 12, in proces                                                                                                             s
    raise RevisionDocumentNotFound({'rev_id': rev_id})
NameError: name 'RevisionDocumentNotFound' is not defined
