explainaboard's People

Contributors

ccatherinee, hwidjaja, jinlanfu, lyuyangh, neubig, nightingal3, noelchen90, odashi, oscarwang114, paulcccccch, pfliu-nlp, qinyiwei, qjiang002, rooa, shuaichenchang, tahmid04, tetsuok, yixinl7, yuh-zha, yyy-apple, zdou0830

explainaboard's Issues

Better support for analyzing retrieve-and-read based QA systems

Is there a way to enable analysis for open-domain question answering datasets? Or, at least on the reading comprehension (RC) side, would it be possible to use/submit different versions of the context dataset for the same RC task, given the different contexts retrieved by multiple retrieval models?

Support for externally calculated bucketing functions

Researchers using ExplainaBoard may want to prototype new feature functions without being very familiar with the internals of the ExplainaBoard SDK. In that case, it would be nice if they could define the buckets for each example externally, but do the analysis over those buckets within ExplainaBoard.

There is an example of this in compare_mt here: https://github.com/neulab/compare-mt#incorporating-wordsentence-labels

This could also help create a pipeline going from "prototyping outside of explainaboard" -> "figuring out the best feature functions" -> "implementation of the most useful ones directly in explainaboard".
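
As a rough sketch of the kind of analysis this would enable (the label file format and the function below are hypothetical, not part of the current SDK), one could read one bucket label per example and aggregate a metric per bucket:

# Hypothetical sketch: aggregate accuracy over externally supplied buckets.
# Assumes a label file with one bucket name per line, aligned with the system
# output, similar to compare-mt's word/sentence label files.
from collections import defaultdict

def bucket_accuracy(bucket_labels, gold_labels, predicted_labels):
    correct, total = defaultdict(int), defaultdict(int)
    for bucket, gold, pred in zip(bucket_labels, gold_labels, predicted_labels):
        total[bucket] += 1
        correct[bucket] += int(gold == pred)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}

# Toy usage with made-up labels:
print(bucket_accuracy(["short", "long", "short"],
                      ["pos", "neg", "neg"],
                      ["pos", "neg", "pos"]))  # {'short': 0.5, 'long': 1.0}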

Questions about datasets directory

Hi @pfliu-nlp , I have a few questions about the datasets directory:

  1. The README explains how to send a PR, but it doesn't explain what format the dataset should be in, where it should be placed, etc. (there are a few examples at https://github.com/neulab/ExplainaBoard/tree/main/dataset/chunk_conll00/system_output, but they seem to be system outputs, not the underlying data)
  2. What should I do in the case that a dataset consists of a few sub-datasets? For example, the same dataset has different dev/devtest/test splits, or contains data from different languages or domains?

Does ExplainaBoard handle tokenization? Should it?

I've been looking at the outputs for text generation tasks such as summarization
https://github.com/neulab/ExplainaBoard/blob/main/data/system_outputs/cnndm/cnndm_mini.bart

and also inputs for tasks such as classification
https://github.com/neulab/ExplainaBoard/blob/main/data/system_outputs/sst2/sst2-cnn.tsv

and they all seem to be tokenized. However, in many cases our inputs/outputs are not tokenized. In fact, best practices like those advocated by sacrebleu suggest that we should be feeding untokenized inputs into our evaluation tools.

Right now I think explainaboard doesn't do tokenization internally, right? If so, maybe we should consider having this as an option.
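
For reference, this is roughly the workflow sacrebleu encourages (a minimal sketch with made-up strings): the tool receives detokenized text and applies its own standardized tokenization internally.

# Minimal sketch: sacrebleu takes raw, detokenized hypotheses/references and
# tokenizes internally, so no external tokenization step is needed.
import sacrebleu

hypotheses = ["The cat sat on the mat."]            # raw system output
references = [["The cat is sitting on the mat."]]   # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)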

Requirements specify a single old version of seqeval

Currently in requirements.txt there is a single, old version of seqeval specified:

seqeval==0.0.12

In general it's probably better to avoid these "==" pins and use ">=" lower-bound requirements instead (e.g. seqeval>=0.0.12); exact pins can easily lead to environment conflicts.

Suggestion: make system_outputs space-separated?

Currently system_outputs is comma separated: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/explainaboard_main.py#L48

However, this doesn't play super-well with some standard ways of doing things in Linux, which assume that multiple arguments are space separated. For example, the following don't work now but would work if we made it space separated:

explainaboard --system_outputs my_outputs/*
explainaboard --system_outputs my_outputs/{sys1,sys2,sys3}.tsv

What do you think about making system_outputs space separated @pfliu-nlp? I'm happy to make the change if it's OK. The main downside is that this is a very central part of the CLI, so the change would break other code that uses explainaboard. However, it might be better to do it now rather than later (as it'll only get harder to change).

P.S.: I just noticed that the CLI currently only supports single-system analysis 😄 If that's the case then obviously there's no upstream code using this at the moment, so it should be fine to change unless there's a strong argument for preferring comma-separated values.
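
For concreteness, a minimal sketch of the proposed change (illustrative only, not the actual explainaboard_main.py code):

# Sketch: accept space-separated system outputs via argparse's nargs, which
# plays nicely with shell globs like my_outputs/*.
import argparse

parser = argparse.ArgumentParser()
# Current behaviour (roughly): one comma-separated string.
# parser.add_argument("--system_outputs", type=lambda s: s.split(","))
# Proposed: one or more space-separated paths.
parser.add_argument("--system_outputs", nargs="+",
                    help="one or more system output files")

args = parser.parse_args(["--system_outputs", "sys1.tsv", "sys2.tsv"])
print(args.system_outputs)  # ['sys1.tsv', 'sys2.tsv']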

Extractive QA CLI example is broken?

The CLI example for SQuAD extractive QA seems to output a score of zero for every metric. This seems like a bug.

For example:

"results": {
    "overall": [
        {
            "metric_name": "f1_score_qa",
            "value": "0",
            "confidence_score_low": "0",
            "confidence_score_up": "0"
        },
        {
            "metric_name": "exact_match_qa",
            "value": "0",
            "confidence_score_low": "0",
            "confidence_score_up": "0"
        }
    ],

Perhaps related to the file format: #70

Method for analysis of new extractive QA datasets is not clear

Looking at the documentation for extractive QA, it is not clear from the file format how to analyze a new dataset. The system output JSON file specifies question IDs, but what if the dataset is not SQuAD but something entirely different, such as a dataset newly created by the user? How are such datasets specified, and where do the question IDs come from?

It'd be nice if the documentation could be updated and/or support could be added for analyzing new datasets!

Warming up

  • read docs [1] to play with the ExplainaBoard SDK
  • read docs [2] and try to add a new feature (e.g., the number of sentiment words) for the text classification task (see the sketch after this list)
  • read docs [3] and try to add a new format (e.g., json) for text classification
  • read docs [4] and try to add a new task
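
A rough sketch of what such a feature could look like (the lexicon and function name are made up for illustration, not the actual SDK interface):

# Hypothetical "number of sentiment words" feature for text classification.
SENTIMENT_WORDS = {"good", "great", "bad", "terrible", "love", "hate"}

def count_sentiment_words(text: str) -> int:
    # Count tokens that appear in a small sentiment lexicon.
    return sum(1 for token in text.lower().split() if token in SENTIMENT_WORDS)

print(count_sentiment_words("I love this great movie"))  # 2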

Add NLG tasks

Currently the publicly available explainaboard code doesn't support NLG tasks, so this should be integrated.

Extraneous message printed to stdout during analysis

Currently explainaboard prints an extraneous message "Your request has been sent." to stdout during (some) analysis:

$ explainaboard --task summarization --system_outputs ./data/system_outputs/cnndm/cnndm_mini.bart > report.json
$ head -n 1 report.json
Your request has been sent.
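
A simple fix would be to send status messages to stderr so that stdout contains only the report; a sketch (the actual print statement lives somewhere in the explainaboard code):

# Sketch: route status messages to stderr so `> report.json` captures only JSON.
import sys

print("Your request has been sent.", file=sys.stderr)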

Great initiative!

Hi,

I was looking at explainaboard online and some questions immediately came to mind:

I found that for NER, there was no calibration analysis. I was wondering if this is something you would like to support in the future?

Also, will you keep model binaries as a model zoo, or only analyses on benchmark datasets?
I believe a large-scale model zoo is missing in NLP, but collecting one might be even more complex. What are your thoughts?

Cheers,

Jordy

Some miscellaneous improvements on the main README

The following would be good to improve in the main README:

  • It'd be good to explain a bit more about what the CLI allows you to do, or to refer to the additional documentation
  • It's probably better not to include the list of supported systems/tasks in the README, as it will get out of date quickly. Instead, perhaps it'd be good to link to the web interface?

Add instance IDs to model outputs

It would be really useful if the system outputs had IDs which could be used to match outputs across models. For instance, the CNN/DailyMail outputs are not in the order which I expected, so it's not easy for me to map from my model's outputs to the ones shared by ExplainaBoard.
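
For example (a hypothetical layout, not an existing ExplainaBoard format), if each output carried the instance ID of its source example, matching outputs across models would be a simple join:

# Sketch: with per-instance IDs, outputs from two models can be aligned by ID.
# The field names and IDs here are hypothetical.
model_a = [{"id": "cnndm-test-0001", "prediction": "summary from model A ..."}]
model_b = [{"id": "cnndm-test-0001", "prediction": "summary from model B ..."}]

b_by_id = {example["id"]: example["prediction"] for example in model_b}
pairs = [(example["prediction"], b_by_id.get(example["id"])) for example in model_a]
print(pairs)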

Dataset cleanup

Currently, datasets are included in different places in the code: dataset and explainaboard/example. This should be fixed so that all datasets live in a single location with a uniform format; dataset is probably the better location.

New Task: Support for MT

We already support summarization, and support for MT should be similar. We might also consider having hierarchical levels of support, like:

  • sequence-to-sequence
    • summarization
    • MT
    • ...

Here, a base set of features would be shared across all seq2seq tasks, with specialized features added for each sub-task.
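
A rough sketch of how that hierarchy might look (class and feature names are illustrative, not the actual SDK interfaces):

# Illustrative sketch of hierarchical task support, not the actual SDK classes.
class Seq2SeqProcessor:
    # Features shared by all sequence-to-sequence tasks.
    def features(self, source: str, target: str) -> dict:
        src_len, tgt_len = len(source.split()), len(target.split())
        return {"source_length": src_len,
                "target_length": tgt_len,
                "length_ratio": tgt_len / max(src_len, 1)}

class SummarizationProcessor(Seq2SeqProcessor):
    def features(self, source: str, target: str) -> dict:
        feats = super().features(source, target)
        feats["compression"] = feats["source_length"] / max(feats["target_length"], 1)
        return feats

class MTProcessor(Seq2SeqProcessor):
    def features(self, source: str, target: str) -> dict:
        feats = super().features(source, target)
        # MT-specific features (e.g., source/target language pair) would go here.
        return feats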

In particular, it would be great if we could add some support for analyzing the datasets used in the class that I'm currently teaching, which would probably be helpful for the students in the class (although the deadline for the assignment is 2/25): http://phontron.com/class/multiling2022/assignment2.html

Several things are not documented in text classification tutorial

I think it would be nice to have a tutorial that is basically the first thing that people look at when they want to figure out how to use ExplainaBoard on their own systems. The text classification tutorial is one good candidate for this, as text classification is a very easy-to-understand task with wide applicability.

However, while it tells you how to run the program and get an output JSON file, it is not quite useful yet, because we also need:

  • An explanation on how to interpret or visualize the file

My ideal solution would be to link it together with something like what will be implemented in PR #60, which will be much more interpretable than a static JSON file, but at least an explanation of the JSON file format and what we can glean from it would be useful.
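
Even a small snippet would go a long way, e.g. something like the following, assuming the report has the same overall structure as the extractive QA excerpt earlier in this issue list:

# Sketch: print the overall metrics from a report.json produced by the CLI,
# assuming the structure shown in the extractive QA issue above.
import json

with open("report.json") as f:
    report = json.load(f)

for metric in report["results"]["overall"]:
    print(metric["metric_name"], metric["value"],
          "[", metric["confidence_score_low"], ",", metric["confidence_score_up"], "]")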

In addition, some basic functionality of ExplainaBoard is not documented, e.g.:

  • Pairwise analysis
  • Fine-grained error analysis

It'd be nice to work on this documentation!

double check?

TaskCategory("span-text-prediction", "prediction based on span and text",

Regarding the aspect-based-sentiment-classification task, should its metrics be:
(1) f1_score_seqeval, or
(2) F1score and Accuracy?

CLI should allow automatic submission of results to web interface

When doing analysis via the CLI, it should be possible to automatically submit results to the web interface (ideally by default) and get back a link where the results can be browsed.

@pfliu-nlp , could you take a look at this maybe together with @OscarWang114 and @lyuyangh ?

The web interface would include a visual browser of the results, and also a place to download the JSON report (so it could stand in as a complete replacement for the CLI).
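
Roughly, the CLI flow could look like the following sketch (the endpoint URL and response fields are hypothetical placeholders, not a real API):

# Hypothetical sketch of automatic submission from the CLI.
import json
import requests

with open("report.json") as f:
    report = json.load(f)

# Placeholder endpoint and response schema, not a real API.
response = requests.post("https://explainaboard.example.org/api/reports", json=report)
print("Browse your results at:", response.json()["url"])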
