neulab / explainaboard
Interpretable Evaluation for AI Systems
License: MIT License
ExplainaBoard/explainaboard/tasks.py (line 42 in 2a118f0), for example:
TaskCategory("span-text-prediction", "prediction based on span and text",
             [Task(TaskType.aspect_based_sentiment_classification, True, ["f1_score_seqeval"])]),
Is there a way to enable analysis for open-domain question answering datasets? Or at least on the Reading Comprehension (RC) side: given different retrieved contexts from multiple retrieval models, could we use/submit different versions of the context dataset for the same RC task?
Researchers using ExplainaBoard may want to prototype new feature functions, but not be very familiar with the internals of the ExplainaBoard SDK. In this case, it might be nice if they could define the buckets for each example externally, but do analysis over the buckets within ExplainaBoard.
There is an example of this in compare-mt here: https://github.com/neulab/compare-mt#incorporating-wordsentence-labels
This could also help create a pipeline going from "prototyping outside of explainaboard" -> "figuring out the best feature functions" -> "implementation of the most useful ones directly in explainaboard".
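A minimal sketch of the idea, assuming a hypothetical external file my_buckets.tsv with one example_id<TAB>bucket_label pair per line (none of these names are part of the ExplainaBoard SDK):

from collections import defaultdict

# Hypothetical external bucket file: one "example_id<TAB>bucket_label" per line.
buckets = defaultdict(list)
with open("my_buckets.tsv") as f:
    for line in f:
        example_id, bucket = line.rstrip("\n").split("\t")
        buckets[bucket].append(example_id)

# Analysis over the user-defined buckets would then happen inside ExplainaBoard,
# e.g. aggregating a per-example metric within each bucket.
for bucket, ids in buckets.items():
    print(bucket, len(ids))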
Currently we have no example for relation extraction functionality. I heard from @pfliu-nlp that @jinlanfu might have an example, so it might be nice to add if you have one!
Hi @pfliu-nlp, I have a few questions about the datasets directory:
I've been looking at the outputs for text generation tasks such as summarization
https://github.com/neulab/ExplainaBoard/blob/main/data/system_outputs/cnndm/cnndm_mini.bart
and also inputs for tasks such as classification
https://github.com/neulab/ExplainaBoard/blob/main/data/system_outputs/sst2/sst2-cnn.tsv
and they all seem to be tokenized. However, in many cases our inputs/outputs are not tokenized. In fact, best practices like those advocated by sacrebleu suggest that we should be feeding untokenized inputs into our evaluation tools.
Right now I think explainaboard doesn't do tokenization internally, right? If so, maybe we should consider having this as an option.
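A minimal sketch of what an optional tokenization step might look like, here using sacremoses as an assumed dependency (ExplainaBoard itself does not currently ship this):

from sacremoses import MosesTokenizer  # assumed dependency, pip install sacremoses

# Accept raw (untokenized) text and tokenize it on the fly before analysis.
tokenizer = MosesTokenizer(lang="en")

def maybe_tokenize(text: str, tokenize: bool = True) -> str:
    return tokenizer.tokenize(text, return_str=True) if tokenize else text

print(maybe_tokenize("ExplainaBoard doesn't tokenize internally, right?"))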
I think these could be merged into a single file; would that be OK with you @pfliu-nlp? If so I'll do it:
Dataset: https://github.com/nightingal3/metaphor-qa
Format of system output:
[
  { "question": ..., "answer1": ..., "answer2": ..., "true_label": ... },
  ...
]
Currently in requirements.txt there is a single, old version of seqeval specified:
seqeval==0.0.12
In general it's probably better to avoid these "==" requirements and only have ">=" requirements (e.g. seqeval>=0.0.12), otherwise it can lead to environment conflicts.
Currently system_outputs is comma separated: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/explainaboard_main.py#L48
However, this doesn't play well with some standard ways of doing things on Linux, which assume that multiple arguments are space separated. For example, the following don't work now but would work if we made it space separated:
explainaboard --system_outputs my_outputs/*
explainaboard --system_outputs my_outputs/{sys1,sys2,sys3}.tsv
What do you think about making system_outputs space separated, @pfliu-nlp? I'm happy to make the change if it's OK. The main downside is that this is a very central part of the CLI and the change will break other code that uses explainaboard. However, it might be better to do it now than later (as it'll be hard to change later).
P.S.: I just noticed that currently the CLI only supports single-system analysis. If that's the case then obviously there's no upstream code using this at the moment, so it should be fine to change unless there's a strong argument for preferring comma separated.
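For reference, a minimal argparse sketch of the space-separated variant (not the current ExplainaBoard code):

import argparse

# nargs="+" lets the shell expand globs like my_outputs/* into multiple
# space-separated arguments naturally.
parser = argparse.ArgumentParser(prog="explainaboard")
parser.add_argument("--system_outputs", nargs="+", required=True,
                    help="one or more system output files, space separated")
args = parser.parse_args(["--system_outputs", "sys1.tsv", "sys2.tsv"])
print(args.system_outputs)  # ['sys1.tsv', 'sys2.tsv']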
The CLI example for SQuAD extractive QA seems to output zero accuracy every time. This seems like a bug.
For example:
"results": {
    "overall": [
        {
            "metric_name": "f1_score_qa",
            "value": "0",
            "confidence_score_low": "0",
            "confidence_score_up": "0"
        },
        {
            "metric_name": "exact_match_qa",
            "value": "0",
            "confidence_score_low": "0",
            "confidence_score_up": "0"
        }
    ],
Perhaps related to the file format: #70
Looking at the documentation for Extractive QA, the file format is not clear with respect to how to analyze a new dataset. The system output JSON file specifies question IDs, but what if the dataset is not SQuAD but something entirely different, such as a dataset newly created by the user? How are the datasets specified, and where do I get the question IDs?
It'd be nice if the documentation could be updated and/or support could be added for analyzing new datasets!
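For illustration, a record in a user-created dataset might look something like the following; every field name here is a hypothetical sketch, not the actual ExplainaBoard schema:

# Hypothetical extractive QA record with a user-assigned question ID.
example = {
    "id": "q-0001",  # chosen by the dataset creator, not tied to SQuAD
    "question": "Who wrote Hamlet?",
    "context": "Hamlet is a tragedy written by William Shakespeare.",
    "true_answers": ["William Shakespeare"],
    "predicted_answer": "William Shakespeare",
}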
According to the documentation, multiple metrics are supported for the summarization task:
https://github.com/neulab/ExplainaBoard/blob/main/docs/supported_tasks.md#summarization
However, by default only BLEU is printed in the output JSON, and none of the documentation I can find tells me how I can print out (some variety of) ROUGE instead.
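As a stopgap until metric selection is documented, ROUGE can be computed outside of ExplainaBoard, for example with Google's rouge-score package (an external dependency, not part of ExplainaBoard):

from rouge_score import rouge_scorer  # pip install rouge-score

# Score a single reference/hypothesis pair with several ROUGE variants.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "police killed the gunman",  # reference
    "police kill the gunman",    # system output
)
print(scores["rougeL"].fmeasure)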
Add a description field into the feature schema, and fill in the description field of existing supported tasks.
)Currently the publicly available explainaboard code doesn't support NLG tasks, so this should be integrated.
Speech recognition is a standard generation task where the input is speech, output is text. For now, analysis could be done on the output side only.
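For instance, an output-side metric like word error rate can be computed from the text alone; a minimal sketch using the jiwer package (an assumed external dependency):

import jiwer  # pip install jiwer

# Word error rate between reference transcripts and system hypotheses.
references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]
print(f"WER: {jiwer.wer(references, hypotheses):.3f}")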
Currently explainaboard prints an extraneous message "Your request has been sent." to stdout during (some) analysis:
$ explainaboard --task summarization --system_outputs ./data/system_outputs/cnndm/cnndm_mini.bart > report.json
$ head -n 1 report.json
Your request has been sent.
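A minimal sketch of the likely fix, assuming the message is printed somewhere in the analysis path: route status messages to stderr (or a logger) so stdout stays clean for the report:

import sys

# Status/progress messages should go to stderr so that
# `explainaboard ... > report.json` captures only the JSON report.
print("Your request has been sent.", file=sys.stderr)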
As evidenced by the NotImplementedError here: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/explainaboard_main.py#L58
It would be nice to support analysis of multiple systems at the same time!
Hi,
I was looking at explainaboard online and immediately a few questions came to mind:
I found that for NER, there was no calibration analysis. I was wondering if this is something you would like to support in the future?
Also, will you keep model binaries, as a model zoo? Or only analysis on benchmark datasets?
I believe a large-scale model zoo is missing in NLP, but the collection thereof might even be more complex. What are your thoughts?
Cheers,
Jordy
"Currently covered systems" in the README is empty. Maybe it should be replaced by a link on explainaboard.nlpedia.ai that lists all of the current systems?
https://github.com/neulab/ExplainaBoard#currently-covered-systems
A popular task in NLP is syntactic analysis using Universal Dependencies. Example subtasks are POS tagging, lemmatization, morphological tagging, and dependency parsing.
Currently due to some hard-coded relative paths to config files ExplainaBoard can't be run from anywhere but the top directory. This should be fixed.
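A minimal sketch of the usual fix, assuming the config files live next to the package code: resolve paths relative to the module file rather than the current working directory:

from pathlib import Path

# Resolve config files relative to this module, not os.getcwd(), so the CLI
# works from any directory. The "config" subdirectory is an assumed layout.
CONFIG_DIR = Path(__file__).resolve().parent / "config"

def load_config(name: str) -> str:
    return (CONFIG_DIR / name).read_text()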
The following would be good to improve on the main README
table_schema (https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/table_schema.py) is used to design the table format that will be presented in the frontend. Here we can add one for the aspect-based sentiment classification task.
It would be really useful if the system outputs had IDs which could be used to match outputs across models. For instance, the CNN/DailyMail outputs are not in the order I expected, so it's not easy to map from my model's outputs to the ones shared by ExplainaBoard.
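As an illustration only (these field names are not an existing ExplainaBoard schema), each output record could carry a stable ID shared across systems:

# Hypothetical output record; "id" is the stable key used to align the same
# example across different models' outputs.
record = {
    "id": "cnndm-test-00042",
    "source": "...",
    "system_output": "...",
}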
Currently, datasets are included in different places in the code: dataset and explainaboard/example. It should be fixed so that all datasets are in a single location and have a uniform format. I think dataset is probably a better location.
There is a CLI example and dataset for extractive QA, but I wasn't able to easily guess what the dataset format is. I think we need some documentation for this, like we have for text classification.
(1) re-format existing output file (tsv? remove first line? table id?)
(2) create a readme.md file in: https://github.com/neulab/ExplainaBoard/tree/main/output_format/semp, add necessary information
I am planning on doing the following with interpret_eval and tensoreval:
We already support summarization, so support for MT should be similar. We might also consider having a hierarchical level of support, where there is a base set of features shared across all seq2seq tasks, and then specialized features for each sub-task (see the sketch below).
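A minimal sketch of what such a hierarchy could look like, assuming hypothetical processor classes and feature names (nothing here is the existing ExplainaBoard class structure):

# Hypothetical class hierarchy; names are illustrative only.
class Seq2SeqProcessor:
    # features shared by all seq2seq tasks
    features = ["source_length", "target_length", "length_ratio"]

class SummarizationProcessor(Seq2SeqProcessor):
    # summarization-specific additions
    features = Seq2SeqProcessor.features + ["compression_ratio", "copy_rate"]

class MTProcessor(Seq2SeqProcessor):
    # MT-specific additions
    features = Seq2SeqProcessor.features + ["source_target_length_ratio"]

print(MTProcessor.features)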
In particular, it would be great if we could add some support for analyzing the datasets used in the class that I'm currently teaching, which would probably be helpful for the students in the class (although the deadline for the assignment is 2/25): http://phontron.com/class/multiling2022/assignment2.html
It'd be cool if we could easily grab datasets from huggingface datasets: https://huggingface.co/docs/datasets/
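A minimal sketch of what that could look like with the datasets library (the mapping into ExplainaBoard's own loaders is the part that would need to be designed):

from datasets import load_dataset  # pip install datasets

# Pull a dataset from the Hugging Face Hub and iterate over a few examples,
# roughly the shape a dataset loader would need to consume.
dataset = load_dataset("glue", "sst2", split="validation")
for example in dataset.select(range(3)):
    print(example["sentence"], example["label"])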
It seems that the generation of error cases for tagging/chunking tasks is very slow; I bet it can be optimized somehow.
I think it would be nice to have a tutorial that is basically the first thing that people look at when they want to figure out how to use ExplainaBoard on their own systems. The text classification tutorial is one good candidate for this, as text classification is a very easy-to-understand task with wide applicability.
However, while it tells you how to run the program and get an output json file, it is not quite useful yet because we need:
My ideal solution would be to link it together with something like what will be implemented in this PR (#60) which will be much more interpretable than a static JSON file, but at least an explanation of the JSON file format and what we could glean from it would be useful.
In addition, some basic functionality of ExplainaBoard is not documented, e.g.:
It'd be nice to work on this documentation!
ExplainaBoard/explainaboard/tasks.py (line 86 in cd19492)
Regarding the aspect-based-sentiment-classification task, should its metrics be (1) f1_score_seqeval or (2) F1score and Accuracy?
When doing analysis via the CLI it should allow automatic submission of results to the web interface (ideally by default), and return a link to where the results can be browsed.
@pfliu-nlp , could you take a look at this maybe together with @OscarWang114 and @lyuyangh ?
The web interface would include a visual browser of the results, and also a place where you can download the JSON report (so the web interface could stand in as a complete replacement for the CLI).
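A heavily hedged sketch of what the submission step might look like; the endpoint URL, payload shape, and response fields are all assumptions, not an existing ExplainaBoard API:

import json
import requests  # assumed dependency

# Hypothetical submission flow: POST the local JSON report and print the
# returned browse URL. Endpoint and response schema are invented for
# illustration only.
with open("report.json") as f:
    report = json.load(f)

response = requests.post("https://explainaboard.example/api/reports", json=report)
response.raise_for_status()
print("Browse your results at:", response.json()["url"])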