Hi @DUT-lujunyu, thanks for your interest in our work, sorry for the delayed response.
Here are some answers:
- The annotated files include annotations from human experts, while the main toxigen file does not. The train file contains the annotations we collected first, which made it into the original paper submission. The test file contains the annotations collected afterwards (same annotators). Together, they make up ~10k human-annotated samples.
- Where are you getting the label column from in annotated_train.csv? I do not see that in the original dataset on Hugging Face.
from toxigen.
Thanks for your detailed answers!
I downloaded annotated_train.csv from the Hugging Face link "https://huggingface.co/datasets/skg/toxigen-data/blob/main/annotated_train.csv", and got the data as follows. The "label" column does not seem to agree with the calculation method in the paper. So what does the label refer to?
Sorry for the slow response, this is a strange problem. The annotated_train.csv file indeed has that label field, but when I download the dataset through Hugging Face, I don't see it. I believe this label might indicate whether the original intention was to generate hate or non-hate for this instance.
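One quick way to sanity-check that hypothesis is to look at the distinct values the column takes: a generation-intent flag should have exactly two. A minimal sketch, assuming the column name `label` from this thread (the actual encoding of the values is an assumption):

```python
import pandas as pd

def label_values(df):
    """Count the distinct values in the `label` column.

    If `label` really records generation intent, this should return
    exactly two categories (hate / non-hate, or a 0/1 encoding)."""
    return df["label"].value_counts()

# Usage with the real file (path is a placeholder):
# print(label_values(pd.read_csv("annotated_train.csv")))
```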
Hi @Thartvigsen,
I have downloaded the dataset from Hugging Face.
However, this version of the dataset differs from the one described in the paper.
The paper reports a total of 274186 generated prompts.
However, the dataset available on Hugging Face contains 8960, 940, and 250951 prompts in annotated_train.csv, annotated_test.csv, and toxigen.csv, respectively.
Why is that? Am I missing something here?
Also, from your previous responses, I do not understand a few things:
- Which is the test set used in the paper?
- Are annotated_train.csv and annotated_test.csv also present in toxigen.csv?
- Which field of annotated_train.csv and annotated_test.csv should we consider the ground truth?
Could you clarify?
Thank you.
Hi @AmenRa thanks for your interest in our work!
I believe the 274k vs. 260k discrepancy comes from duplicate removal, but the original resources were made unavailable, so unfortunately I can't go back and check to be certain.
- The original test set is the 940 annotations in annotated_test.csv.
- I don't believe annotated_train.csv and annotated_test.csv are present in toxigen.csv, though this can be double-checked by looking for the overlap.
- We compute ground truth as a balance of annotator scores for toxicity, introduced in the "Convert human scores to binary labels" section of this notebook.
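Both checks above are straightforward with pandas. A minimal sketch, with the caveat that the column names (`text`, `toxicity_human`), the 1-5 score scale, and the midpoint threshold are my assumptions about the CSV schema, not the exact rule from the linked notebook:

```python
import pandas as pd

def overlap(annotated, main, key="text"):
    """Rows of `annotated` whose statement also appears in `main`.

    An empty result would confirm that the annotated splits are not
    contained in toxigen.csv. The join key `text` is an assumption."""
    return annotated[annotated[key].isin(main[key])]

def binarize(scores, threshold=3.0):
    """Convert mean annotator toxicity scores (assumed 1-5) to 0/1 labels.

    The midpoint threshold stands in for the rule in the notebook's
    'Convert human scores to binary labels' section."""
    return (scores >= threshold).astype(int)

# Usage (paths are placeholders for your local downloads):
# ann = pd.read_csv("annotated_test.csv")
# main = pd.read_csv("toxigen.csv")
# print(len(overlap(ann, main)))          # 0 if there is no overlap
# print(binarize(ann["toxicity_human"]))  # column name is an assumption
```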
Thanks for the fast reply!
However, I am still a bit confused.
The paper reports "We selected 792 statements from TOXIGEN to include in our test set".
The shared test set, which you are telling me is the original one, comprises 940 samples.
Could you clarify?
Thanks.
This is a good question and I'm not sure. I don't have access to some of the original internal docs, so this confusion is likely irreducible for us both. I will try to hunt this down. I suspect the root issue is that at the time of the original submission we had annotations for <1k samples; by the time of paper acceptance we had annotations for ~10k samples, resulting in two versions of the dataset, each of which we split. The 792 may be an artifact of the original numbers, not the larger annotated set. The 8960-sample annotated_train.csv should include the annotations collected in the second wave post-submission, but this may also have affected the 792 count somehow.
Ok, thanks!