neulab / explainaboard
Interpretable Evaluation for AI Systems
License: MIT License
ExplainaBoard/explainaboard/tasks.py (line 42 in 2a118f0), for example:
TaskCategory("span-text-prediction", "prediction based on span and text",
             [Task(TaskType.aspect_based_sentiment_classification, True, ["f1_score_seqeval"])]),
Is there a way to enable analysis for open-domain question answering datasets? Or at least on the Reading Comprehension (RC) side: given different retrieved contexts from multiple retrieval models, could we use/submit different versions of the context dataset for the same RC task?
Researchers using ExplainaBoard may want to prototype new feature functions, but not be very familiar with the internals of the ExplainaBoard SDK. In this case, it might be nice if they could define the buckets for each example externally, but do analysis over the buckets within ExplainaBoard.
There is an example of this in compare-mt here: https://github.com/neulab/compare-mt#incorporating-wordsentence-labels
This could also help create a pipeline going from "prototyping outside of explainaboard" -> "figuring out the best feature functions" -> "implementation of the most useful ones directly in explainaboard".
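A minimal sketch of the idea, assuming a hypothetical external file my_buckets.tsv with one example_id<TAB>bucket_label pair per line (none of these names are part of the ExplainaBoard SDK):

from collections import defaultdict

# Hypothetical external bucket file: one "example_id<TAB>bucket_label" per line.
buckets = defaultdict(list)
with open("my_buckets.tsv") as f:
    for line in f:
        example_id, bucket = line.rstrip("\n").split("\t")
        buckets[bucket].append(example_id)

# Analysis over the user-defined buckets would then happen inside ExplainaBoard,
# e.g. aggregating a per-example metric within each bucket.
for bucket, ids in buckets.items():
    print(bucket, len(ids))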
Currently we have no example for relation extraction functionality. I heard from @pfliu-nlp that @jinlanfu might have an example, so it might be nice to add if you have one!
Hi @pfliu-nlp, I have a few questions about the datasets directory:
I've been looking at the outputs for text generation tasks such as summarization
https://github.com/neulab/ExplainaBoard/blob/main/data/system_outputs/cnndm/cnndm_mini.bart
and also inputs for tasks such as classification
https://github.com/neulab/ExplainaBoard/blob/main/data/system_outputs/sst2/sst2-cnn.tsv
and they all seem to be tokenized. However, in many cases our inputs/outputs are not tokenized. In fact, best practices like those advocated by sacrebleu suggest that we should be feeding untokenized inputs into our evaluation tools.
Right now I think explainaboard doesn't do tokenization internally, right? If so, maybe we should consider having this as an option.
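A minimal sketch of what an optional tokenization step might look like, here using sacremoses as an assumed dependency (ExplainaBoard itself does not currently ship this):

from sacremoses import MosesTokenizer  # assumed dependency, pip install sacremoses

# Accept raw (untokenized) text and tokenize it on the fly before analysis.
tokenizer = MosesTokenizer(lang="en")

def maybe_tokenize(text: str, tokenize: bool = True) -> str:
    return tokenizer.tokenize(text, return_str=True) if tokenize else text

print(maybe_tokenize("ExplainaBoard doesn't tokenize internally, right?"))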
I think these could be merged into a single file; would that be OK with you @pfliu-nlp? If so I'll do it:
Dataset: https://github.com/nightingal3/metaphor-qa
Format of system output:
[
  { "question": ..., "answer1": ..., "answer2": ..., "true_label": ... },
  ...
]
Currently in requirements.txt there is a single, old version of seqeval specified:
seqeval==0.0.12
In general it's probably better to avoid these "==" requirements and only have ">=" requirements (e.g. seqeval>=0.0.12), otherwise it can lead to environment conflicts.
Currently system_outputs is comma separated: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/explainaboard_main.py#L48
However, this doesn't play well with some standard ways of doing things on Linux, which assume that multiple arguments are space separated. For example, the following don't work now but would work if we made it space separated:
explainaboard --system_outputs my_outputs/*
explainaboard --system_outputs my_outputs/{sys1,sys2,sys3}.tsv
What do you think about making system_outputs space separated, @pfliu-nlp? I'm happy to make the change if it's OK. The main downside is that this is a very central part of the CLI and the change will break other code that uses explainaboard. However, it might be better to do it now than later (as it'll be hard to change later).
P.S.: I just noticed that currently the CLI only supports single-system analysis. If that's the case then obviously there's no upstream code using this at the moment, so it should be fine to change unless there's a strong argument for preferring comma separated.
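For reference, a minimal argparse sketch of the space-separated variant (not the current ExplainaBoard code):

import argparse

# nargs="+" lets the shell expand globs like my_outputs/* into multiple
# space-separated arguments naturally.
parser = argparse.ArgumentParser(prog="explainaboard")
parser.add_argument("--system_outputs", nargs="+", required=True,
                    help="one or more system output files, space separated")
args = parser.parse_args(["--system_outputs", "sys1.tsv", "sys2.tsv"])
print(args.system_outputs)  # ['sys1.tsv', 'sys2.tsv']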
The CLI example for SQuAD extractive QA seems to output zero accuracy every time. This seems like a bug.
For example:
"results": {
    "overall": [
        {
            "metric_name": "f1_score_qa",
            "value": "0",
            "confidence_score_low": "0",
            "confidence_score_up": "0"
        },
        {
            "metric_name": "exact_match_qa",
            "value": "0",
            "confidence_score_low": "0",
            "confidence_score_up": "0"
        }
    ],
Perhaps related to the file format: #70
Looking at the documentation for Extractive QA, the file format is not clear with respect to how to analyze a new dataset. The system output JSON file specifies question IDs, but what if the dataset is not SQuAD but something entirely different, such as a dataset newly created by the user? How are the datasets specified, and where do I get the question IDs?
It'd be nice if the documentation could be updated and/or support could be added for analyzing new datasets!
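For illustration, a record in a user-created dataset might look something like the following; every field name here is a hypothetical sketch, not the actual ExplainaBoard schema:

# Hypothetical extractive QA record with a user-assigned question ID.
example = {
    "id": "q-0001",  # chosen by the dataset creator, not tied to SQuAD
    "question": "Who wrote Hamlet?",
    "context": "Hamlet is a tragedy written by William Shakespeare.",
    "true_answers": ["William Shakespeare"],
    "predicted_answer": "William Shakespeare",
}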
According to the documentation, multiple metrics are supported for the summarization task:
https://github.com/neulab/ExplainaBoard/blob/main/docs/supported_tasks.md#summarization
However, by default only BLEU is printed in the output JSON, and none of the documentation I can find tells me how I can print out (some variety of) ROUGE instead.
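As a stopgap until metric selection is documented, ROUGE can be computed outside of ExplainaBoard, for example with Google's rouge-score package (an external dependency, not part of ExplainaBoard):

from rouge_score import rouge_scorer  # pip install rouge-score

# Score a single reference/hypothesis pair with several ROUGE variants.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "police killed the gunman",  # reference
    "police kill the gunman",    # system output
)
print(scores["rougeL"].fmeasure)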
Add a description field into the feature schema, and fill in the description field of existing supported tasks.
)Currently the publicly available explainaboard code doesn't support NLG tasks, so this should be integrated.
Speech recognition is a standard generation task where the input is speech, output is text. For now, analysis could be done on the output side only.
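For instance, an output-side metric like word error rate can be computed from the text alone; a minimal sketch using the jiwer package (an assumed external dependency):

import jiwer  # pip install jiwer

# Word error rate between reference transcripts and system hypotheses.
references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]
print(f"WER: {jiwer.wer(references, hypotheses):.3f}")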
Currently explainaboard prints an extraneous message "Your request has been sent." to stdout during (some) analysis:
$ explainaboard --task summarization --system_outputs ./data/system_outputs/cnndm/cnndm_mini.bart > report.json
$ head -n 1 report.json
Your request has been sent.
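A minimal sketch of the likely fix, assuming the message is printed somewhere in the analysis path: route status messages to stderr (or a logger) so stdout stays clean for the report:

import sys

# Status/progress messages should go to stderr so that
# `explainaboard ... > report.json` captures only the JSON report.
print("Your request has been sent.", file=sys.stderr)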
As evidenced by the NotImplementedError here: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/explainaboard_main.py#L58
It would be nice to support analysis of multiple systems at the same time!
Hi,
I was looking at explainaboard online and immediately a few questions came to mind:
I found that for NER, there was no calibration analysis. I was wondering if this is something you would like to support in the future?
Also, will you keep model binaries, as a model zoo? Or only analysis on benchmark datasets?
I believe a large-scale model zoo is missing in NLP, but the collection thereof might even be more complex. What are your thoughts?
Cheers,
Jordy
"Currently covered systems" in the README is empty. Maybe it should be replaced by a link on explainaboard.nlpedia.ai that lists all of the current systems?
https://github.com/neulab/ExplainaBoard#currently-covered-systems
A popular task in NLP is syntactic analysis using Universal Dependencies. Example subtasks are POS tagging, lemmatization, morphological tagging, and dependency parsing.
Currently due to some hard-coded relative paths to config files ExplainaBoard can't be run from anywhere but the top directory. This should be fixed.
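A minimal sketch of the usual fix, assuming the config files live next to the package code: resolve paths relative to the module file rather than the current working directory:

from pathlib import Path

# Resolve config files relative to this module, not os.getcwd(), so the CLI
# works from any directory. The "config" subdirectory is an assumed layout.
CONFIG_DIR = Path(__file__).resolve().parent / "config"

def load_config(name: str) -> str:
    return (CONFIG_DIR / name).read_text()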
The following would be good to improve on the main README
table_schema (https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/table_schema.py) is used to design the table format that will be presented in the frontend. Here we can add one for the aspect-based sentiment classification task.
It would be really useful if the system outputs had IDs which could be used to match outputs across models. For instance, the CNN/DailyMail outputs are not in the order I expected, so it's not easy to map from my model's outputs to the ones shared by ExplainaBoard.
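As an illustration only (these field names are not an existing ExplainaBoard schema), each output record could carry a stable ID shared across systems:

# Hypothetical output record; "id" is the stable key used to align the same
# example across different models' outputs.
record = {
    "id": "cnndm-test-00042",
    "source": "...",
    "system_output": "...",
}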
Currently, datasets are included in different places in the code: dataset and explainaboard/example. It should be fixed so that all datasets are in a single location and have a uniform format. I think dataset is probably a better location.
There is a CLI example and dataset for extractive QA, but I wasn't able to easily guess what the dataset format is. I think we need some documentation for this, like we have for text classification.
(1) re-format existing output file (tsv? remove first line? table id?)
(2) create a readme.md file in: https://github.com/neulab/ExplainaBoard/tree/main/output_format/semp, add necessary information
I am planning on doing the following with interpret_eval and tensoreval:
We already support summarization, so support for MT should be similar. We might also consider having a hierarchical level of support, where there is a base set of features shared across all seq2seq tasks, and then specialized features for each sub-task (see the sketch below).
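A minimal sketch of what such a hierarchy could look like, assuming hypothetical processor classes and feature names (nothing here is the existing ExplainaBoard class structure):

# Hypothetical class hierarchy; names are illustrative only.
class Seq2SeqProcessor:
    # features shared by all seq2seq tasks
    features = ["source_length", "target_length", "length_ratio"]

class SummarizationProcessor(Seq2SeqProcessor):
    # summarization-specific additions
    features = Seq2SeqProcessor.features + ["compression_ratio", "copy_rate"]

class MTProcessor(Seq2SeqProcessor):
    # MT-specific additions
    features = Seq2SeqProcessor.features + ["source_target_length_ratio"]

print(MTProcessor.features)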
In particular, it would be great if we could add some support for analyzing the datasets used in the class that I'm currently teaching, which would probably be helpful for the students in the class (although the deadline for the assignment is 2/25): http://phontron.com/class/multiling2022/assignment2.html
It'd be cool if we could easily grab datasets from huggingface datasets: https://huggingface.co/docs/datasets/
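A minimal sketch of what that could look like with the datasets library (the mapping into ExplainaBoard's own loaders is the part that would need to be designed):

from datasets import load_dataset  # pip install datasets

# Pull a dataset from the Hugging Face Hub and iterate over a few examples,
# roughly the shape a dataset loader would need to consume.
dataset = load_dataset("glue", "sst2", split="validation")
for example in dataset.select(range(3)):
    print(example["sentence"], example["label"])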
It seems that the generation of error cases for tagging/chunking tasks is very slow; I bet it can be optimized somehow.
I think it would be nice to have a tutorial that is basically the first thing that people look at when they want to figure out how to use ExplainaBoard on their own systems. The text classification tutorial is one good candidate for this, as text classification is a very easy-to-understand task with wide applicability.
However, while it tells you how to run the program and get an output json file, it is not quite useful yet because we need:
My ideal solution would be to link it together with something like what will be implemented in this PR (#60) which will be much more interpretable than a static JSON file, but at least an explanation of the JSON file format and what we could glean from it would be useful.
In addition, some basic functionality of ExplainaBoard is not documented, e.g.:
It'd be nice to work on this documentation!
ExplainaBoard/explainaboard/tasks.py (line 86 in cd19492)
Regarding the aspect-based-sentiment-classification task, should its metrics be (1) f1_score_seqeval or (2) F1score and Accuracy?
When doing analysis via the CLI it should allow automatic submission of results to the web interface (ideally by default), and return a link to where the results can be browsed.
@pfliu-nlp , could you take a look at this maybe together with @OscarWang114 and @lyuyangh ?
The web interface would include a visual browser of the results, and also a place where you can download the JSON report (so the web interface could stand in as a complete replacement for the CLI).
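A heavily hedged sketch of what the submission step might look like; the endpoint URL, payload shape, and response fields are all assumptions, not an existing ExplainaBoard API:

import json
import requests  # assumed dependency

# Hypothetical submission flow: POST the local JSON report and print the
# returned browse URL. Endpoint and response schema are invented for
# illustration only.
with open("report.json") as f:
    report = json.load(f)

response = requests.post("https://explainaboard.example/api/reports", json=report)
response.raise_for_status()
print("Browse your results at:", response.json()["url"])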