kasnerz / tabgenie

A multi-purpose toolkit for table-to-text generation: web interface, Python bindings, CLI commands.

Home Page: https://quest.ms.mff.cuni.cz/nlg/tabgenie

License: MIT License

Languages: Python 67.93%, JavaScript 16.08%, HTML 11.50%, CSS 4.48%
Topics: flask, python, table-to-text

tabgenie's People

Contributors: kasnerz, kategerasimenko, oplatek

tabgenie's Issues

Expand README

Add instructions for running and developing the app

Lightweight models for deployment

Currently, the app is running on a public URL http://quest.ms.mff.cuni.cz/rel2text/tabgenie/. However, this version of the app does not have access to GPUs and thus cannot use the models interactively.

Try whether some small models (such as T5-small, GPT-2-small, or models adapted for fp16/fp8) could perform inference on a CPU in reasonable time to showcase the capabilities of the app.

Better error handling

If a pipeline is not working properly, display a single warning and hide the rest of the errors (for example, the model_api pipeline complains far too much when the model is not connected).

Also add proper error logging to a file so that future problems can be discovered and handled easily.

Basic tests

Add some basic assertions, e.g.:

  • all the examples in the dataset may be transformed into a table
  • all the tables are rectangular (cf. #17)
  • all the examples may be exported
  • pipelines can handle the inputs
  • etc.
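The rectangularity assertion could be sketched like this (the cell dicts mirror the ToTTo-style format used elsewhere in these issues; the helper name is made up):

```python
def is_rectangular(rows):
    """True if every row spans the same number of columns,
    counting column_span for merged cells."""
    widths = {sum(cell.get("column_span", 1) for cell in row) for row in rows}
    return len(widths) <= 1

ok = [
    [{"value": "a", "column_span": 2}],
    [{"value": "b"}, {"value": "c"}],
]
broken = [
    [{"value": "a", "column_span": 2}],
    [{"value": "b"}],
]
assert is_rectangular(ok) and not is_rectangular(broken)
```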

View pre-generated model outputs

Besides online processing, the app should show pre-generated outputs.

The outputs will be loaded from a directory; the name of each file will correspond to the model & dataset.

In the left panel, there will be a list of checkboxes selecting which outputs to display. In contrast to pipelines (whose outputs are generated on the fly), these outputs will be instantly available together with the table.

Gold references should be moved under this category.

Handling data errors

Due to some data errors, the tables may not be a perfect MxN rectangle.

We should consider whether these cases can be fixed with heuristics and, if not, how to handle them during export.

See e.g. example #20 in ToTTo, in which a cell containing a dash with a column span of 2 is missing from the original raw data:

table screenshot (cf. the row "Neftekhimik Nizhnekamsk")

ToTTo (excerpt from the example)

[{'column_span': 1, 'is_header': False, 'row_span': 3, 'value': 'Neftekhimik Nizhnekamsk (loan)'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '2012–13'},
  {'column_span': 1, 'is_header': False, 'row_span': 2, 'value': 'Russian FNL'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '6'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '0'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '0'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '0'},
  {'column_span': 2, 'is_header': False, 'row_span': 1, 'value': ''},
// here another cell of column_span 2 is missing
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '6'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '0'}],

output HTML (screenshot)
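One possible heuristic for the export path is to pad short rows with a single empty spanning cell. A sketch, using the cell format from the excerpt above (note it deliberately ignores row_span carry-over from previous rows, which a full fix must account for):

```python
def pad_row(row, target_width):
    """Pad a row whose column spans sum to less than target_width with
    one empty cell. Heuristic sketch only: it does not account for cells
    spilling over from previous rows via row_span."""
    width = sum(cell.get("column_span", 1) for cell in row)
    if width < target_width:
        row.append({"column_span": target_width - width, "is_header": False,
                    "row_span": 1, "value": ""})
    return row

row = [{"column_span": 2, "is_header": False, "row_span": 1, "value": ""}]
padded = pad_row(row, 4)
assert sum(c["column_span"] for c in padded) == 4
```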

Export improvements

It is currently not possible to export table properties in JSON – it should be enough to add the appropriate processor.

Export should also be documented better and properly tested for all the datasets (cf. #29).

Better navigation

Small improvements for navigating between the examples in the dataset:

  • display the total number of examples
  • add a button for jumping to a random example
  • (optionally) display "sections", distinguished e.g. by the number of triples in WebNLG, the source of the data in DART, etc.

Fix Excel export with multi-index columns

For some tables (presumably those with hierarchical headers), export to XLSX does not work.

To replicate:

in the web interface, go to HiTab – dev – table 1277 → export to XLSX

  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/pandas/io/formats/excel.py", line 578, in _format_header_mi
    raise NotImplementedError(
NotImplementedError: Writing to Excel with MultiIndex columns and no index ('index'=False) is not yet implemented.
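The NotImplementedError comes from pandas itself: to_excel with MultiIndex columns currently requires index=True. A minimal reproduction and possible workaround sketch (the table data here is made up):

```python
import pandas as pd

# A tiny table with hierarchical (MultiIndex) headers, as in HiTab.
df = pd.DataFrame(
    [[1, 0], [25, 2]],
    columns=pd.MultiIndex.from_tuples([("league", "apps"), ("league", "goals")]),
)

# pandas raises NotImplementedError for MultiIndex columns with index=False,
# so keep the index when exporting (requires an Excel writer engine such
# as openpyxl; the index column could be stripped afterwards if unwanted).
try:
    df.to_excel("/tmp/table.xlsx", index=True)
except ImportError:
    pass  # no Excel writer engine installed in this environment
```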

Improve interactive mode

Currently, only a very basic version of the interactive mode is implemented (#4). The user can enter edit mode, change cell value(s), and re-run the pipelines.

Only the edited cell ids are sent to the backend (together with the new values), and an altered version of the table is created for the current request. This should be relatively safe since the dataset will still contain the original, unmodified table.

TODO:

  • invalidate / hide pre-generated outputs
  • investigate if the current approach is safe enough
  • think about persistence - should we keep the edited values after a new table is fetched?

MultiWOZ turn based dataset

  • whole conversation view with reference instructions to users. See #63
  • Turn based view:
    • Not straightforward using the Hugging Face Dataset & tabgenie: I need a 1-to-n mapping between conversations and turns. Either:
      • Need to pregenerate all prefixes - bad for storage
      • Need to create completely new class DialogueDataset / equivalent to TabularDataset
        • computing of get_example_count will differ - turns instead of conversations (as tables)
        • the turn prefixes will be generated on the fly from the stored conversation
    • system view
    • user view
  • consider adding multiple annotations. See the "all_versions" MultiWOZ dataset: https://huggingface.co/datasets/pietrolesci/multiwoz_all_versions/viewer/pietrolesci--multiwoz_all_versions/test

See also budzianowski/multiwoz#119 (comment)

Favourites not working properly

  • Trying to remove a favourite example triggers
Failed to store action to server: Removed favourite totto-dev-6270-undefined-undefined

on the frontend and

Traceback (most recent call last):
  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/kasner/tabgenie/tabgenie/main.py", line 403, in favourite
    assert dataset and split and isinstance(table_idx, int), (dataset, split, table_idx)
AssertionError: ('totto-dev-6270', None, None)

on the backend.

  • After clicking on a favourite example, the app sometimes shows the first example in the split instead. It seems non-deterministic – perhaps a race condition between the JavaScript requests?
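The AssertionError above suggests the favourite id "totto-dev-6270" arrives at the backend as one string and is never split back into dataset, split, and index. A defensive parsing sketch (the function name and the handling of the frontend's "-undefined" suffixes are hypothetical):

```python
def parse_favourite_id(fav_id):
    """Split an id like 'totto-dev-6270' into (dataset, split, table_idx).

    Splitting from the right keeps dataset names containing dashes intact;
    dangling '-undefined' suffixes from the frontend are stripped first.
    """
    while fav_id.endswith("-undefined"):
        fav_id = fav_id[: -len("-undefined")]
    dataset, split, idx = fav_id.rsplit("-", 2)
    return dataset, split, int(idx)

assert parse_favourite_id("totto-dev-6270-undefined-undefined") == ("totto", "dev", 6270)
```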

Extra fields

Find a principled way to represent all the extra information for each table. This may include:

  • title, subtitle, section name, category, ...
  • URL
  • logical forms
  • other related tables
  • etc.

This information should be:

  • loaded for the dataset,
  • displayed separately from the main table (in the top area),
  • optionally included in the input of the model (this will depend on the formatting tags for the model input, see #6).

Task and domain knowledge

Although the datasets have a unified appearance, the domains and tasks are different for each dataset.

This information should be one of the inputs to the table2text generation system.

The system needs to know whether the task is to describe the overall appearance of the data, trends, summaries, commentary, etc.

The goal is to find a principled way to represent this information and present it to the system.

Run baselines

Now that the web app can display pre-generated outputs, it would be useful to have some outputs to display.

Goals:

  • prepare suitable baselines
  • prepare the inputs (tabgenie export should come in useful here)
  • run the baselines
  • import the outputs in the app

Baselines = any existing models which can be applied to table-to-text generation.

It is not necessary for the baselines to be applicable to every dataset (some work only with triples, some only with highlighted cells, some with logical forms, etc.), but it would be cool to run them on as many datasets as possible.

It is also not necessary to actually run the baselines – if they already provide outputs for exactly our datasets, we can simply use those outputs (or, if the datasets are similar, we can try to adapt them).

Proposals:

Further ideas:

Data export

Add an option to export a dataset. Each dataset would be exported in a unified format.

For starters, it should be enough to add a simple JSON export with linearized table (see #6):

[
 { 
   "in" : <linearized table>,
   "out": <reference>
 },
 (...)
]
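As a sketch of producing the proposed format (the `linearize` helper and the sample data are illustrative, not the actual tabgenie linearization):

```python
import json

def linearize(cells):
    """Toy linearization: wrap each cell value in [cell] tags."""
    return " ".join(f"[cell] {v} [/cell]" for v in cells)

dataset = [(["club", "season"], "The table lists club statistics per season.")]
export = [{"in": linearize(cells), "out": ref} for cells, ref in dataset]
print(json.dumps(export, indent=1))
```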

Later, it would also be nice to add an option to export each table as an object and to include the extra fields (#5) in the exported file.

It would also make sense to enable this from the command line.

New generation methods - implementation and visualization

Add methods other than a seq2seq model for generating text from the table.

The methods may include:

  • representing cells as text using simple templates,
  • deriving information from the table using rules or more advanced methods (sums, averages, trends, ...),
  • generating relevant information from the table using a QG + QA model,
  • retrieving additional information from external sources.

The goal is to examine all possible ways to generate the content present in a table caption and find which methods are applicable (or whether the task of generating table captions is feasible at all).

The methods should be launchable on-demand from the app and visualizable alongside the tables.

Formatting the model input

The user should be able to format the inputs to the model. The format would be used either for export or for live interaction with the model.

The input could be based on Jinja. For example, this formatting string:

[title] {{ title }} [/title] {% for cell in cells %} [cell] {{ cell }} [/cell] {% endfor %}

could create the following input for the model:

[title] career statistics [/title] [cell] club [/cell] [cell] season [/cell] [cell] league [/cell] (...)
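A minimal rendering sketch with the jinja2 library (note the {% endfor %} needed to close the loop; the sample data follows the example above):

```python
from jinja2 import Template

template = Template(
    "[title] {{ title }} [/title]"
    "{% for cell in cells %} [cell] {{ cell }} [/cell]{% endfor %}"
)
rendered = template.render(title="career statistics", cells=["club", "season", "league"])
print(rendered)
# → [title] career statistics [/title] [cell] club [/cell] [cell] season [/cell] [cell] league [/cell]
```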

Make "reference" a list

Make the "reference" field a list to account for multiple references per table (e.g. in WebNLG and ToTTo).

Convert setting parameters to flags

Some settings would be easier to use as command line flags instead of parameters in config.yml.

Currently, these are cache_dev_splits and hostname (i.e. the ones which differ between deployments).

The implementation needs to take into account that run in fact calls the default flask run command - I was not able to find a way to integrate extra parameters into this command.

But it should be possible to implement it as tabgenie [EXTRA_ARGS] run [FLASK_ARGS] (although it is a bit clumsy).
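One way to sketch the `tabgenie [EXTRA_ARGS] run` layout is a click group with global options, parsed before the subcommand (option names are illustrative; the real run command would hand off to flask run):

```python
import click

@click.group()
@click.option("--hostname", default="localhost",
              help="Deployment-specific setting, moved out of config.yml.")
@click.pass_context
def cli(ctx, hostname):
    # Global flags land in the context before any subcommand runs.
    ctx.obj = {"hostname": hostname}

@cli.command()
@click.pass_context
def run(ctx):
    # The real command would delegate to `flask run` here.
    click.echo(f"starting on {ctx.obj['hostname']}")
```

Invoked as `tabgenie --hostname example.org run`, which is a bit clumsy, as noted, but keeps the default flask run machinery intact.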

Import and visualize generated outputs

For a particular dataset split, the user will upload a text file.

The file will have as many lines as there are examples in the dataset. Each line will contain an output generated by the model.

The app will then display the generated outputs along with each example.
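A sketch of the loader side, assuming one generated output per line (the function name is hypothetical):

```python
from pathlib import Path
import os
import tempfile

def load_outputs(path, expected_count):
    """Read one model output per line; the line count must match the
    number of examples in the dataset split."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if len(lines) != expected_count:
        raise ValueError(f"expected {expected_count} lines, got {len(lines)}")
    return lines

# Illustrative usage with a throwaway file:
fd, tmp = tempfile.mkstemp()
os.close(fd)
Path(tmp).write_text("out 1\nout 2\nout 3\n", encoding="utf-8")
outputs = load_outputs(tmp, 3)
os.remove(tmp)
```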

Load own HF model

The user will be able to interactively generate output with their own model.

If the model is loaded alongside the app (which is the default idea), it should be fairly easy – the user will just specify the path to the checkpoint plus the model name, and AutoModel and AutoTokenizer will do the rest.

Other options to consider: communicate via REST API, use some existing APIs.

Manual analysis mode

Add features which will facilitate manual error analysis:

  • generate a random sample from the dataset
  • bookmark interesting examples
  • write notes

There are two ways to do this:

  • easier: implement it in JS without any saving and force the user to export the data (the data would be lost after reloading the page)
  • harder: implement it using some sort of cookies / sessions / user accounts (to keep the data even after the app is reloaded or closed)

Keyboard shortcuts

Would it be possible to browse the examples using the keyboard (PgUp/Left, PgDn/Right, or something like that)?

App not starting

I run tabgenie run and get the following:

2023-02-12 18:14:27 INFO Application ready
 * Debug mode: off
2023-02-12 18:14:36 INFO Page loaded
2023-02-12 18:14:36 INFO Initializing totto
2023-02-12 18:14:36 INFO Loading totto / dev
2023-02-12 18:14:38 WARNING Found cached dataset totto (/Users/kategerasimenko/.cache/huggingface/datasets/GEM___totto/totto/1.0.0/e27f5cf45f2aaed97e1626a55d68f1fa3b7a358cbc4fa7c4abb784c3d4e4b20d)
2023-02-12 18:14:38 INFO session=<SecureCookieSession {'favourites': {}}> favourites table_data['session']=<SecureCookieSession {'favourites': {}}>
2023-02-12 18:14:38 ERROR Exception on /table [GET]
Traceback (most recent call last):
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/app.py", line 1823, in full_dispatch_request
    return self.finalize_request(rv)
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/app.py", line 1842, in finalize_request
    response = self.make_response(rv)
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/app.py", line 2153, in make_response
    rv = self.json.response(rv)
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/json/provider.py", line 309, in response
    f"{self.dumps(obj, **dump_args)}\n", mimetype=mimetype
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/json/provider.py", line 230, in dumps
    return json.dumps(obj, **kwargs)
  File "/Users/kategerasimenko/.pyenv/versions/3.9.4/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/Users/kategerasimenko/.pyenv/versions/3.9.4/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/kategerasimenko/.pyenv/versions/3.9.4/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/json/provider.py", line 122, in _default
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
TypeError: Object of type LocalProxy is not JSON serializable
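The last frame shows a flask LocalProxy (here, most likely the session object) being handed to the JSON encoder. Converting the proxy to a plain dict before building the response avoids the TypeError; a minimal reproduction-and-fix sketch (the route and secret key are illustrative):

```python
from flask import Flask, jsonify, session

app = Flask(__name__)
app.secret_key = "dev"  # illustrative only

@app.route("/table")
def table():
    # `session` is a LocalProxy; serializing it directly raises the
    # TypeError above, so copy it into a plain dict first.
    data = {"favourites": dict(session.get("favourites", {}))}
    return jsonify(data)

client = app.test_client()
resp = client.get("/table")
```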

More efficient dataset loading

Find a more efficient way to load the datasets.

Currently, some have to be downloaded directly from their original sources. This makes deployment quite cumbersome, since the datasets have to be packed together with the app.

Also, each data split is loaded into memory, which is not very memory-efficient and takes a lot of time.

Possible solutions:

  • add all the datasets to Huggingface - this would facilitate the deployment significantly
  • use the dataset stream for loading the HF datasets - this would help the efficiency, but prevent jumping to random examples
  • load several examples for each dataset in the background - this would improve the quality of the interaction while still being memory efficient
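The streaming option corresponds to load_dataset(..., streaming=True) in the Hugging Face datasets library. A library-free sketch of the trade-off, using a plain generator as a stand-in for the streamed split:

```python
from itertools import islice

def example_stream():
    """Stand-in for a streamed HF dataset: examples are produced lazily,
    so the full split is never held in memory."""
    for idx in range(1_000_000):
        yield {"table_idx": idx}

# Cheap: take the first few examples without materializing the split.
first = list(islice(example_stream(), 3))

# Expensive: "random" access means skipping forward, which is why
# streaming prevents jumping to arbitrary examples efficiently.
example_500 = next(islice(example_stream(), 500, None))
```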

Python module interface

Enable using Tabgenie as a module in other projects.

Example use-case:

from tabgenie import load_dataset, get_pipeline

# load data
dataset = load_dataset("totto")
pipeline = get_pipeline("rdf_triples")

# process the data
triples = pipeline.process(dataset)

# ... custom processing ...

Persistent URLs

Add a possibility to link a specific table by URL, e.g. https://[TABGENIE_URL]?dataset=webnlg&split=dev&table_idx=55.

Currently, such a request returns just the HTML code of the table, not the whole app.
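A sketch of the routing change in Flask (the parameter names follow the URL above; the rendered string is a stand-in for serving the full app shell):

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

@app.route("/")
def index():
    # Read the deep-link parameters and serve the whole app, letting the
    # frontend load the requested table on startup.
    dataset = request.args.get("dataset", "totto")
    split = request.args.get("split", "dev")
    table_idx = request.args.get("table_idx", default=0, type=int)
    return render_template_string(
        "loading {{ d }}/{{ s }}/{{ i }}", d=dataset, s=split, i=table_idx
    )

client = app.test_client()
resp = client.get("/?dataset=webnlg&split=dev&table_idx=55")
```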

Curate the existing datasets

Go through all the existing datasets and make sure they are added properly.

For each dataset:

  • check license
  • add data card
  • check if the data is loaded properly
  • check if no relevant fields are missing

UX bugs

  • double request (for the target table and table 0) when opening a table from favourites / notes:
    • open the list of favourites / notes
    • click on any item
    • the target table is loaded first for half a second and then reloaded with table 0
    • request tracking shows two requests when clicking on the table link
  • "remove from favourites" button throws a 500:
    • add an item to favourites
    • open favourites
    • delete the item
    • the action fails
  • "remove note" acts weirdly:
    • create two notes
    • try to remove one of them
    • it is not removed
    • in general it acts weirdly – try experimenting and you will see
  • the XLSX table for annotation is not available for export (I thought Ondrej integrated it – was it removed on purpose?)
  • favourites are apparently saved somewhere when the user closes the tab with the application (the "favourite" button is pressed when I reopen the tab), but the "favourites" list is empty
  • the icons for the interactive mode and export are displayed as missing characters in some OS / browser / font combinations; it would be better to replace them with PNG images (the icons may stay the same)

Linking cells to headings

Find a principled way to fetch headings related to the highlighted cells.

At the moment, only the cells themselves and the table title are sent as input to the model.

It would make sense to fetch the row and column heading(s) as well. Moreover, hierarchical cases where a heading covers multiple rows / columns should be handled, too.

Fix custom model API input (UI, templates)

The custom input is currently not working as expected, both on the frontend and the backend side.

The templates have to be loaded properly and be modifiable by the user through the web interface.

Potential bugs in the datasets

This is a list of things I found weird that will need double-checking.

  • scigen - text and reference confused?
  • scigen - for values in bold, _h object in cell value, error when linearizing
  • numericnlg - refs don't look like refs
  • logic2text - look into the highlights provided by contlog
  • logicnlg - restore the highlights, add disclaimer
  • hitab train 66 - error
  • sportsett - why first three cols empty? upd: fixed by #78

Add about infobox

The "About" button is currently just a placeholder.

The button should display a small infobox or infobubble with information about the authors and the project.

Do not store datasets/splits in app.config

If app.config is changed, the app reloads in debug mode.
IMO, app.config is intended to be persistent.

Currently, this breaks the use of Flask sessions in the development server with auto-reloading: auto-reloading means that a new session and a new secret key are used.

Extend CLI interface

TabGenie currently supports the commands tabgenie run, for launching the web interface, and tabgenie export, for exporting the datasets to JSON.

TODO:

  • extend the export command so that it supports all the formats that are already supported through the web interface (CSV, XLSX, HTML)
  • extend the export command so that it supports exporting references in a text file
  • add a list command which lists supported datasets and pipelines (and possibly generated outputs)

Export of merged cells to csv/excel

There's a general problem with pandas when handling merged cells. A table with merged cells (HiTab dev 1277) looks like this in CSV/Excel:

club season division league league fa cup fa cup other other total total
club season division apps goals apps goals apps goals apps goals
chester city 1987-88 third division 1 0 0 0 0 0 1 0
chester city 1988-89 third division 25 2 0 0 5 0 30 2
chester city 1989-90 third division 18 4 2 0 6 1 26 5
chester city total total 44 6 2 0 11 1 57 7
tranmere rovers 1993-94 first division 0 0 0 0 0 0 0 0
port vale 1993-94 second division 2 0 1 0 0 0 3 0

The values are duplicated.
For CSV, it is not clear how these cells should be displayed.
For Excel, the structure should be preserved, but this requires switching away from pandas to different processing.
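For CSV, one convention would be to keep a merged value only in its first cell and blank the repeats. A naive sketch (it compares values rather than spans, so it would also blank legitimately repeated values such as consecutive zeros; a real implementation should track the original cell spans instead):

```python
def blank_horizontal_repeats(row):
    """Blank cells that repeat their left neighbour, so the duplicated
    header 'league league' becomes 'league' followed by an empty cell.
    Caveat: equal-but-unmerged values get blanked too; use the cell
    spans from the source table for a correct implementation."""
    return [v if i == 0 or v != row[i - 1] else "" for i, v in enumerate(row)]

header = ["club", "season", "division", "league", "league", "fa cup", "fa cup"]
print(blank_horizontal_repeats(header))
# → ['club', 'season', 'division', 'league', '', 'fa cup', '']
```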

UX improvements

  • in the table navigation block, the elements with table numbers have larger height than buttons
  • show notes themselves (at least the beginning) in the notes list
  • fix the key column width for properties (now it changes dynamically when toggling the individual properties)
  • the button for regenerating the model output in the interactive mode is too small - change to "Regenerate output" button in the pipeline output box
  • in "toggle view", do we need the view only with pre-generated outputs? is it needed without looking at input data?
  • think of better json structure in the export
  • clickable email addresses in the about section?
  • "export examples with notes" doesn't include notes into the files
  • for datasets without a header (e.g. E2E), a dummy header is inserted when exporting to CSV
  • mobile version? :)

Update README

The README should be updated so that it reads like a normal README, focusing less on development and more on usage.

It should also be checked for outdated information.

Interactive mode

Add an interactive mode in which people would be able to edit existing inputs or create their own.

Idea: a selection box for switching between the highlighting & interactive mode. In the interactive mode, each table cell in the existing datasets would be editable.

A "playground" would then allow the user to create a new empty table MxN and edit fields in it.

UX improvements

  • hide the less important table properties and display them on demand (using a collapsible block)
  • remove the switches for pipelines and outputs in the left panel, unify them with the output boxes in the right panel (the switch will be just a small button available at the related output box)
  • integrate the pipeline switches with the pipeline settings: e.g. add a small cogwheel at the top of the output box
