hltcoe / turkle

Django-based clone of Amazon's Mechanical Turk service running in your local environment.

Home Page: https://turkle.readthedocs.io

License: Other

Python 65.14% CSS 10.78% HTML 11.26% Shell 0.13% JavaScript 2.36% Less 10.33%

amazon-mechanical-turk annotation crowdsourcing hlt labeling mechanical-turk nlp

turkle's Introduction

Turkle

Run a clone of Amazon's Mechanical Turk service in your local environment.

Turkle is implemented as a Django-based web application that can be deployed on your local network or hosted on a public server. It is compatible with Human Intelligence Tasks (HITs) from Amazon Mechanical Turk. Turkle can use the same HTML Task template files and CSV files as the MTurk requester web GUI. The results of the Tasks completed by the workers can be exported to CSV files.

Turkle's features include:

Authentication support for Users
Projects can be restricted to Users who are members of a particular Group
Projects can be configured so that each Task needs to be completed by multiple Workers (redundant annotations)
An admin GUI for managing Users, Groups, Projects, and Batches of Tasks
Scripts to automate the creation of Users, Projects, and Batches of Tasks
Docker images using the SQLite and MySQL database backends

Full documentation is available at Read the Docs (https://turkle.readthedocs.io).

turkle's Issues

show previously entered values when redisplaying a HIT form

Once a HIT has been annotated, it would be really, really great if the annotator could go back and look at their annotations and modify them if need be. Currently, the only way to change an annotation on a particular HIT is to completely redo the entire HIT. Also, it would be great to be able to pull back up an annotated HIT later to show someone else or to ask questions.

install fails without the wheel package installed

This happens on a fresh Python 3.5 environment with an older pip (version 8) and no wheel package installed. unicodecsv does not include a pre-built wheel on PyPI, so pip tries to build one and you get a failure like so:

Building wheels for collected packages: unicodecsv
Running setup.py bdist_wheel for unicodecsv ... error
Complete output from command /home/username/envs/turkle/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-fsr2y88g/unicodecsv/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmpv7oyp396pip-wheel- --python-tag cp35:
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help

error: invalid command 'bdist_wheel'

Failed building wheel for unicodecsv

pull out common code in unit tests into small library

There is a lot of boilerplate code in the unit tests that create projects, batches, and users before performing the tests. This can be pulled out into a utility module used by the tests.

Poster: Cash Costello id: 192

How to access worker's data from turkle folder?

I want to access the annotations done by workers directly from the turkle folder. This would be useful if the server has issues that prevent me from downloading the results. Is there any way to do so? Are the results saved inside db.sqlite3?

Use bulk_create in Batch.create_tasks_from_csv()

Batch.create_tasks_from_csv() is currently executing an SQL query for every row in a CSV file. It should be inserting multiple rows at a time using QuerySet.bulk_create():

https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
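
The change proposed above amounts to collecting rows in memory and inserting them a chunk at a time. Below is a framework-free sketch of the chunking step only. The names chunked, load_rows, and insert_many are hypothetical and not taken from Turkle. In Turkle the callback would presumably build model instances and pass them to QuerySet.bulk_create(), which also accepts a batch_size argument.

```python
import csv
import io
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_rows(csv_file, insert_many, batch_size=500):
    """Read CSV rows and hand them to `insert_many` one chunk at a time."""
    reader = csv.DictReader(csv_file)
    total = 0
    for chunk in chunked(reader, batch_size):
        # One call (and ideally one database round trip) per chunk, not per row.
        insert_many(chunk)
        total += len(chunk)
    return total


# Stand-in for the database: record each chunk that would be inserted.
calls = []
sample = io.StringIO("text\n" + "\n".join("row %d" % i for i in range(5)) + "\n")
total = load_rows(sample, calls.append, batch_size=2)
```

With five rows and a batch size of two, the callback is invoked three times instead of five.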

Poster: Craig Harman id: 191

Can view and submit a task that has been completed

Create project and batch. 1 assignment per task.
Complete the task
Press the back button

You will see the task again and, I think, be able to submit it again. I assume it overwrites the previous submission. I haven't done much testing with this, but I could see annotators using the back button accidentally. You can also view tasks that have been completed with the direct URL, like http://localhost:8000/turkle/task/5/assignment/7/

Poster: Cash Costello id: 130

Prevent Django version upgrade issues

This has now happened twice in a row, so it seems warranted to try to stop it. PR #25 freezes my current versions in the requirements file. I think this is a reasonable solution, especially given the volume of commits on the repo. N.B.: depends on #22.

parse csv input data files equivalently to Amazon Mechanical Turk.

E.g., If double quotes in the csv surround a field that contains commas, those commas aren't used as field delimiters.

Here is the only Amazon documentation on the CSV input file, which does not define the CSV file format:
http://docs.aws.amazon.com/AWSMechTurk/latest/RequesterUI/PublishingYourBatchofHITs.html

Here's a proposed CSV standard:
https://tools.ietf.org/html/rfc4180

Here's another suggestion:
https://en.wikipedia.org/wiki/Comma-separated_values#Toward_standardization
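
For reference, here is a small sketch of how Python's standard csv module handles the quoting case described above. Its default dialect follows the RFC 4180 conventions for quoted fields and doubled quotes. Whether MTurk parses input files the same way is an assumption, since Amazon does not document the format. The sample data is made up.

```python
import csv
import io

# Two data rows: one with a comma inside a quoted field,
# one with doubled double quotes inside a quoted field.
raw = 'id,sentence\n1,"Hello, world"\n2,"She said ""hi"", then left"\n'

rows = list(csv.reader(io.StringIO(raw)))
header, first, second = rows
```

Each data row comes back with exactly two fields: the commas inside quotes are kept as data, and each doubled quote becomes a single literal quote.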

each HIT should have a "date published" field.

This way it will be easy to group HITs for a given template into separate upload dates.

csv.field_size_limit(sys.maxsize) (Python int too large to convert to C long)

Resolved in https://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072
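
A minimal sketch of the workaround pattern, assuming the goal is simply to raise the limit as far as the platform allows. The helper name raise_csv_field_limit is hypothetical. On platforms where a C long is 32 bits (such as Windows), sys.maxsize does not fit, so the value is reduced until csv.field_size_limit() accepts it.

```python
import csv
import sys


def raise_csv_field_limit():
    """Set the csv field size limit as high as the platform accepts."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value does not fit in a C long here; shrink and retry.
            limit = limit // 10


limit = raise_csv_field_limit()
```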

Adding a new user via script does not work

Adding a new user via the script

https://github.com/hltcoe/turkle/blob/master/scripts/add_user.py

does not work for us, even though the script says "Success" at the end.

$ python add_user.py -u bricksdont user-test-from-script password-test-from-script --server [server name]
Admin password:
Success

After that, the users overview in the GUI does not show this new user.

The exact same happens with import_users.py. We also tried logging into the machine the server is running on, then running the script with localhost.

What could be potential reasons for this?

`--insecure` as a hard-coded option

Turkle's manage.py hardcodes using --insecure for the runserver command:

https://github.com/hltcoe/turkle/blob/master/manage.py#L26

Which means that static files are served even if not in debug mode, right? Why is that?

Thanks so much for your help.

Implement external questions

Related to #98

MTurk does not provide a UI for creating a HIT that is an External Question. You have to use their API for that.

The data structure for creation is described here: https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_ExternalQuestionArticle.html

ExternalURL looks to be the key parameter.

MTurk adds these query parameters to the URL: assignmentId, hitId, turkSubmitTo, and workerId
When previewing a hit, assignmentId=ASSIGNMENT_ID_NOT_AVAILABLE

The HIT is submitted to https://www.mturk.com/mturk/externalSubmit

It must be a POST and include assignmentId.
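
A rough sketch of how the task URL could be assembled, based only on the parameter names listed above. The function name, the example URL, and the idea that Turkle would pass its own address as turkSubmitTo are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PREVIEW_ASSIGNMENT_ID = "ASSIGNMENT_ID_NOT_AVAILABLE"


def build_external_url(external_url, assignment_id, hit_id, worker_id, submit_to):
    """Append the MTurk-style query parameters to an ExternalURL."""
    parts = urlsplit(external_url)
    # Keep any query parameters the requester already put on the URL.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("assignmentId", assignment_id),
        ("hitId", hit_id),
        ("turkSubmitTo", submit_to),
        ("workerId", worker_id),
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical values for a preview request.
preview_url = build_external_url(
    "https://example.org/task?lang=en",
    PREVIEW_ASSIGNMENT_ID,
    "hit-1",
    "worker-1",
    "http://localhost:8000",
)
```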

Poster: Cash Costello id: 135

turkle using mysql does not handle 4 byte emojis

Original comment: https://gitlab.hltcoe.jhu.edu/research/turkle/issues/173#note_28038

Pasting 😃 into a text field for a project results in a MySQL error. Doing the same in a CSV or in a field on a task does not create an error. The error for the project template field is 1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x83' for column 'html_template' at row 1"

The emoji is saved as \ud83d\ude03 in the database if submitted in a form.
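
The bytes in the error message and the escaped form stored in the database are two encodings of the same character, which the short check below illustrates. A likely explanation, assuming the html_template column uses MySQL's legacy utf8 character set, is that utf8 stores at most three bytes per character, while this emoji needs four. The utf8mb4 character set exists for that case.

```python
emoji = "\U0001F603"  # the smiley from the report above

# UTF-8 form: four bytes, matching the bytes quoted in MySQL error 1366.
utf8_bytes = emoji.encode("utf-8")

# UTF-16 form: a surrogate pair, matching the \ud83d\ude03 seen in the database.
utf16_bytes = emoji.encode("utf-16-be")
surrogates = [utf16_bytes[i:i + 2].hex() for i in (0, 2)]
```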

Poster: Cash Costello id: 175

Button "Save and add another" user does not work

While creating a new user in the GUI, after filling in all relevant fields, clicking

Save and add another

does not work as intended. It takes you back to the overview of all users, instead of opening an empty form for a new user, as I was expecting.

Downloading results via script fails

Downloading results for all batches gives me the following error:

$ python download_results.py -u [admin user name] --server [server address]
Admin password:
Traceback (most recent call last):
  File "download_results.py", line 21, in <module>
    result = client.download(args.dir)
  File "/Users/mathiasmuller/Desktop/turkle/scripts/client.py", line 13, in wrapper
    return func(*args, **kwargs)
  File "/Users/mathiasmuller/Desktop/turkle/scripts/client.py", line 70, in download
    for row in soup.find('table', id='result_list').tbody.findAll('tr'):
AttributeError: 'NoneType' object has no attribute 'tbody'

Looking at the HTML content in soup at

https://github.com/hltcoe/turkle/blob/master/scripts/client.py#L69

it seems the script never makes it past the login page. An excerpt:

<div class="form-row">
<label class="required" for="id_password">Password:</label> <input id="id_password" name="password" required="" type="password"/>
<input name="next" type="hidden" value="/admin/turkle/batch/"/>
</div>
<div class="submit-row">
<label> </label><input type="submit" value="Log in"/>
</div>
</form>
</div>
<br class="clear"/>
</div>

I am using current master, and my password is definitely correct.

Thanks so much for your help.

tests fail because of missing csv files

The csv files from hits/tests/resources/ are all missing from the repo. Anyone from the COE have access to them?

sign up option

Hello,

Is there an option to let people sign up themselves and then access the active batches?

Specifically, I would like to ask some people to take a listening test by sending them a link to it. Something like a Google Form with audio samples.

But in Turkle, an admin has to create an account for each of them beforehand, unless there is a way to offer a sign-up option so that they can register and then do the test.

There is also the option of not requiring login, but then I cannot have each person work through a list of questions.

Thanks

manage HITs via web interface instead of command line interface

Enable User-level access control

Currently, Turkle restricts access at the Group level.

Based on discussions with @ateichert and @vandurme, there are use cases where access to Tasks should be restricted at the individual User level, instead of the Group level.

Poster: Craig Harman id: 128

Versioning of templates

Keep each version of the template for a project with a unique identifier
Record the version of the template used for each task
Include this template ID in the output CSV file
Change the batch download to include the CSV file and all templates used for that batch
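
One possible way to get a unique identifier per template version, sketched here as an assumption rather than the agreed design, is to derive it from the template content. The function name and the 12-character length are arbitrary choices for illustration.

```python
import hashlib


def template_version_id(html_template):
    """Return a short content-based identifier for a template version."""
    digest = hashlib.sha256(html_template.encode("utf-8")).hexdigest()
    return digest[:12]


v1 = template_version_id("<p>${sentence}</p>")
v1_again = template_version_id("<p>${sentence}</p>")
v2 = template_version_id("<p>${sentence}</p><input name='label'>")
```

Identical templates map to the same identifier, so re-saving an unchanged template would not create a new version. A plain incrementing version number would also satisfy the first item on the list.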

charman vandurme I believe this summarizes what we discussed through email.

Poster: Cash Costello id: 156

update to django 1.10

PR #21

Add resumable projects

Support users starting a task, saving intermediate results, and then coming back to it later.

The admin UI needs a new checkbox for resumable
This option conflicts with anonymous annotators so need client side/server side checks on that
Need to think about how task expiration fits with this. Maybe no expiration for resumable tasks or expiration does not affect tasks that have intermediate results
Exporting a batch should not include intermediate results
Requestors changing the template could break loading intermediate results. No way around this. Just need to document this and maybe provide a warning when updating a project that has any intermediate results.
Loading intermediate results is not trivial. Template builders may need to build special code to populate the form with the intermediate results. Need to consider that we have html only templates that only use input controls and richer templates that have plenty of javascript or store results as data attributes on the DOM or in memory.

Also, this would be the first feature that makes our templates incompatible with MTurk. I believe the goal would be to continue to support MTurk templates but offer a superset of features. So we won't require anything in a template that would make us incompatible, but we will add optional features to better support our use cases.

Poster: Cash Costello id: 171

If the input csv file includes ids, use those instead of 1, 2, 3 to label the HITs.

Is there a standardized fieldname for the id of a HIT?

Example

It would be really nice if there were an example subdirectory with an example template and set of HITs.

HIT template reload

I'm using Turkle for development, and it would be extremely helpful to me if there was an easy way to "reload" the HIT template for a batch. That is, I'm recreating a project + batch every time I want to check the template, a multi-step process; if I could instead sync the template for a batch to a file on disk, it would make my development cycle much faster. (It would help if there were just a "reload" button or script to press, but I could even imagine logic going in the view to re-read a specified file from disk every time.)

This would definitely add complexity, and I don't know if anyone does, or ever would, use Turkle in a similar way to how I'm using it, so this might not be worth the costs. And maybe there's a better way to develop HIT templates that I just haven't realized?

Thanks :)

Allow users to view contributions

It would be helpful if users had access to a page where they could see information such as how much time they'd spent on the system, how many tasks they'd done, and so on, perhaps broken down by week or month.

New Django versions break turkle

e.g. 1.9.X no longer has the syncdb command: 1.6 works fine.

Double quotes characters can cause problems when expanded as javascript literal strings

e.g.

csv file:

var1,var2
"""data for var1""","""data for var2"""

when replacing the variable in a template file like this:

<html>
<p>${var1}</p>
<script>
var x="${var2}"
</script>
</html>

There will be a syntax error when the line with var2 is expanded to:

var x=""data for var2""
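
One common way to avoid this class of error, sketched here as an assumption and not as how Turkle expands variables, is to serialize the value as a JSON string before placing it in a script. The helper name js_string_literal is hypothetical. With this approach the template would write var x=${var2} without its own surrounding quotes, because the serializer supplies them.

```python
import json


def js_string_literal(value):
    """Return `value` as a quoted, escaped JavaScript string literal."""
    literal = json.dumps(value)
    # Also keep a "</script>" inside the data from closing the script element.
    return literal.replace("</", "<\\/")


expanded = "var x=" + js_string_literal('"data for var2"')
```

The embedded double quotes are backslash-escaped in the result, so the expanded line parses as a single string assignment.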

Add a form builder

WYSIWYG editor for forms similar to Google Forms. This allows people who don't know HTML to create simple form-based templates.

Poster: Cash Costello id: 169

Error due to missing trailing comma

The TEMPLATE_DIRS variable in turkle/settings.py needs a comma after its one item, to ensure that it's interpreted as a tuple: otherwise manage.py throws an error.
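
A quick illustration of the Python behaviour behind this. The path is made up. Parentheses alone do not create a tuple; the trailing comma does.

```python
# Without the comma this is just a parenthesized string.
without_comma = ("/path/to/templates")

# With the comma it is a one-element tuple, as the setting expects.
with_comma = ("/path/to/templates",)

kinds = (type(without_comma).__name__, type(with_comma).__name__)
```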

Allow admin users to reject a completed Task Assignment

As requested by dlawrie

CC: mayfield

Poster: Craig Harman id: 132

OverflowError when running migrate on Windows

When I run python manage.py migrate in an anaconda environment with the requirements recently installed on Windows 10, I get this stack trace:

Traceback (most recent call last):
  File "manage.py", line 29, in <module>
    execute_from_command_line(sys.argv)
  File "C:\ProgramData\Anaconda3\envs\coref-annotation\lib\site-packages\django\core\management\__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "C:\ProgramData\Anaconda3\envs\coref-annotation\lib\site-packages\django\core\management\__init__.py", line 338, in execute
    django.setup()
  File "C:\ProgramData\Anaconda3\envs\coref-annotation\lib\site-packages\django\__init__.py", line 27, in setup
    apps.populate(settings.INSTALLED_APPS)
  File "C:\ProgramData\Anaconda3\envs\coref-annotation\lib\site-packages\django\apps\registry.py", line 108, in populate
    app_config.import_models()
  File "C:\ProgramData\Anaconda3\envs\coref-annotation\lib\site-packages\django\apps\config.py", line 202, in import_models
    self.models_module = import_module(models_module_name)
  File "C:\ProgramData\Anaconda3\envs\coref-annotation\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "C:\Users\cjmay\Documents\turkle\turkle\models.py", line 23, in <module>
    csv.field_size_limit(sys.maxsize)
OverflowError: Python int too large to convert to C long

Display a user-assigned task name at the top of the list of HITs.

It's entirely possible that someone may have two different turkle annotation windows open, and the only way to currently tell which task is which is to actually open and look at the format of a HIT.

name HITs (or display them) with the template name they come from

If the input csv file includes ids, use those instead of 1, 2, 3 to label the HITs.

display HITs grouped by HIT template

Creating a new user: fields are pre-populated

While creating a new user with the GUI, annoyingly, the username and password fields are pre-populated with my own username and password.

Also, after clicking Save, my browser asks me to update my stored credentials, which is also annoying.

I don't know enough to know if 1) this is a problem with my browser settings or 2) a bug on Turkle's side.

Could you please clarify? Thanks so much.

Extend dump_results to write a different result file per template/field-name-list

PR #19

Add template library

MTurk has a set of starting templates for common tasks

Poster: Cash Costello id: 170

mechanism for returning to and editing completed HITs

@charman @cash @vandurme

Is there such a mechanism, perhaps related to the CADET-style correction of existing NER tags etc? What I'm envisioning is: tasks have a boolean switch (e.g. "persistent") that, if set, keeps it in a user's landing page even if they've completed it, so they can go back and edit.

If not, does anyone see a particular reason not to have this functionality, or that it would be difficult to implement? If not, I'd take a shot at it.

FYI this is so that humanists can do the sort of annotation they're used to, but with very easy pivots to crowdsourcing (and the huge advantage of having people willing and able to implement interfaces for new data etc).

Hangs on HIT submission

When I use the template at the end of this issue, with the following batch file:

image_url
https://epsilon.aeon.co/images/2aadc0ca-7531-4d00-a9ef-0ff27280499c/idea_sized-regimentofprinces-tl.jpg
https://bloximages.newyork1.vip.townnews.com/eastoregonian.com/content/tncms/assets/v3/editorial/c/e5/ce526a60-d0cb-11e9-8f09-c7c27e38cb0b/5d729759017cc.image.jpg

it seems to work fine under the prototurk tool (I can add boxes, submit, and the generated JSON has the bounding coordinates etc). But in Turkle, when I submit, it hangs on "Loading next HIT..." (and the admin interface shows that the HIT has not been submitted).

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form answer-format="flatten-objects">
    <crowd-bounding-box 
        src="${image_url}"
        labels="['Text', 'Dog', 'Bird']"
        header="Draw bounding boxes around the requested items"
        name="annotatedResult">
        <short-instructions>Draw boxes around the requested target of interest.</short-instructions>
        <full-instructions header="Bounding Box Instructions">
            <p>Use the bounding box tool to draw boxes around the requested target of interest:</p>
            <ol>
              	<li>Draw a rectangle using your mouse over each instance of the target.</li>
                <li>Make sure the box does not cut into the target, leave a 2 - 3 pixel margin</li>
               	<li>When targets are overlapping, draw a box around each object, include all 
                      contiguous parts of the target in the box. Do not include parts that are completely 
                      overlapped by another object.</li>
               	<li>Do not include parts of the target that cannot be seen, even though you think you 
                      can interpolate the whole shape of the target.</li>
               	<li>Avoid shadows, they're not considered as a part of the target.</li>
               	<li>If the target goes off the screen, label up to the edge of the image.</li>
            </ol>
        </full-instructions>
    </crowd-bounding-box>
</crowd-form>
<input type="hidden" name="answer">

Add REST API, update command line scripts to use REST API

The current command line scripts scrape the HTML of the admin pages, and break when the formatting of these pages changes.

This seems to be one of the more popular REST frameworks for Django:

https://www.django-rest-framework.org

Poster: Craig Harman id: 154

the results file is not in the same format as provided by Amazon Mechanical Turk

Optionally redirect to next hit on submit

PR #18

delete VERSION and LOG.markdown files

Initial step migrate hangs

I'm installing Turkle on a new server, steps:

virtualenv with Python3, activated it, installed all requirements
cloned current master from Github

Then, for some reason,

python manage.py migrate

hangs indefinitely. It also cannot be stopped with Ctrl+C; it has to be killed with SIGKILL, so I do not get a traceback either.

On my local computer, I did not have this problem; migration went smoothly.

Are there any requirements I am not aware of? Does migrate need write/read permissions in certain places?

Move issues from Gitlab (warning: will create lots of notifications)

Everyone, we are moving our outstanding issues from our internal Gitlab instance to Github. This will probably create a few hundred email notifications.

To avoid those, you can go to https://github.com/settings/notifications and turn off email notifications under Watching or unwatch this repository.

I'll be moving the issues on Friday morning so you can readjust your notification settings Friday afternoon.

Scripts do not check for wrong password

If I use any of the scripts in

https://github.com/hltcoe/turkle/tree/master/scripts

they do not check for wrong passwords. For instance, even if I enter a wrong password, the script add_user.py says "Success". This makes it harder to debug (= to know if the cause for malfunctioning might be the password).

Could this script perhaps output a helpful error message if the password is incorrect?
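
One possible check, sketched with the standard library only: after logging in, the script could test whether the page it received is still the login form. The heuristic below looks for a password input named "password", as in the login-page excerpt from the download_results.py report. The class and function names are hypothetical, and this is not how the Turkle scripts currently work.

```python
from html.parser import HTMLParser


class _LoginFormFinder(HTMLParser):
    """Flag pages that contain <input name="password" type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "input"
                and attributes.get("type") == "password"
                and attributes.get("name") == "password"):
            self.found = True


def looks_like_login_page(html):
    """Return True if the HTML appears to be the admin login form."""
    finder = _LoginFormFinder()
    finder.feed(html)
    return finder.found


login_html = ('<label class="required" for="id_password">Password:</label> '
              '<input id="id_password" name="password" required="" type="password"/>')
batch_html = '<table id="result_list"><tbody><tr><td>1</td></tr></tbody></table>'
```

A script could then print something like "Login failed: check the username and password" and exit with a non-zero status instead of reporting "Success".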

Allow per-project static file uploads

This is one possible solution to a common issue, but I'm not certain it's the right solution. Feedback appreciated!

Turkle, like Mechanical Turk, only allows a single HTML template file per project. While third-party libraries (jQuery, Bootstrap) can be accessed via links to third-party CDNs, any project-specific CSS and JavaScript needs to be in the single HTML template file. There is no support for project-specific image files.

When Turkle is used on offline machines, the third-party CDNs are inaccessible, and users need to resort to things like copy-pasting entire jQuery libraries into an HTML template.

If a web application is built using a framework like React, there are potentially a lot of project-specific CSS and JavaScript files, and copying the contents of these files into a single HTML template adds complexity to the development process.

One way to address these copy-pasting issues is to allow Turkle users to upload static files that are tied to a specific project. In order to prevent path collisions, we would need to introduce a new Turkle-specific template variable - e.g. $TURKLE_STATIC_PATH. When the template is rendered, the Turkle template variable would be replaced with a Django-generated, project-specific URL prefix.
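
A minimal sketch of the substitution step, under stated assumptions: the function name and the /static/projects/<id> URL layout are hypothetical, and a real implementation would take the prefix from Django's static file handling and not from a hard-coded string.

```python
def expand_static_path(template_html, project_id, static_root="/static/projects"):
    """Replace $TURKLE_STATIC_PATH with a project-specific URL prefix."""
    prefix = "%s/%d" % (static_root, project_id)
    return template_html.replace("$TURKLE_STATIC_PATH", prefix)


template = (
    '<link rel="stylesheet" href="$TURKLE_STATIC_PATH/app.css">\n'
    '<script src="$TURKLE_STATIC_PATH/app.js"></script>\n'
    "<p>${sentence}</p>"
)
rendered = expand_static_path(template, 7)
```

Ordinary task variables such as ${sentence} are left untouched, so the existing CSV-driven expansion would not be affected.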

This behavior would obviously diverge from MTurk's behavior, but could make it easier to support offline environments.

CC: ccostello dpennell vandurme

Poster: Craig Harman id: 143

Add API to Turkle

Currently, we have a few scripts that parse html which is less than ideal. We should replace those with an API similar to mTurk.

Documentation on their API: https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_OperationsArticle.html

I suggest we prototype a subset of the methods around HIT management (create, list, delete).

I'd like to try to maintain parameter compatibility with mTurk so that our scripts/clients are compatible. This likely means leaving several parameters per method as dummy parameters for things like managing paying turkers.

We would also need to implement the same style of user authentication for scripts to be interoperable.