kasnerz / tabgenie

A multi-purpose toolkit for table-to-text generation: web interface, Python bindings, CLI commands.

Home Page: https://quest.ms.mff.cuni.cz/nlg/tabgenie

License: MIT License

Languages: Python 67.93%, JavaScript 16.08%, HTML 11.50%, CSS 4.48%
Topics: flask, python, table-to-text

tabgenie's People

Contributors: kasnerz, kategerasimenko, oplatek

tabgenie's Issues

Expand README

Add instructions for running and developing the app

Lightweight models for deployment

Currently, the app is running on a public URL http://quest.ms.mff.cuni.cz/rel2text/tabgenie/. However, this version of the app does not have access to GPUs and thus cannot use the models interactively.

Try whether some small models (such as T5-small, GPT-2-small, or models adapted for fp16/fp8) could perform inference on a CPU in reasonable time to showcase the capabilities of the app.

Better error handling

If a pipeline is not working properly, display a single warning and hide the rest of the errors (for example, the model_api pipeline complains far too much when the model is not connected).

Also add proper error logging to a file so that future problems can be discovered and handled easily.

Basic tests

Add some basic assertions, e.g.:

  • all the examples in the dataset may be transformed into a table
  • all the tables are rectangular (cf. #17)
  • all the examples may be exported
  • pipelines can handle the inputs
  • etc.
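The rectangularity assertion could be sketched like this (the cell dicts mirror the ToTTo-style format used elsewhere in these issues; the helper name is made up):

```python
def is_rectangular(rows):
    """True if every row spans the same number of columns,
    counting column_span for merged cells."""
    widths = {sum(cell.get("column_span", 1) for cell in row) for row in rows}
    return len(widths) <= 1

ok = [
    [{"value": "a", "column_span": 2}],
    [{"value": "b"}, {"value": "c"}],
]
broken = [
    [{"value": "a", "column_span": 2}],
    [{"value": "b"}],
]
assert is_rectangular(ok) and not is_rectangular(broken)
```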

View pre-generated model outputs

Besides online processing, the app should show pre-generated outputs.

The outputs will be loaded from a directory; the name of each file will correspond to the model & dataset.

In the left panel, there will be a list of checkboxes selecting which outputs to display. In contrast to pipelines (whose outputs are generated on the fly), these outputs will be instantly available together with the table.

Gold references should be moved under this category.

Handling data errors

Due to some data errors, the tables may not be a perfect MxN rectangle.

We should consider whether these cases can be fixed with heuristics and, if not, how to handle them during export.

See e.g. example #20 in ToTTo, in which a cell containing a dash with a column span of 2 is missing from the original raw data:

table screenshot (cf. the row "Neftekhimik Nizhnekamsk")

ToTTo (excerpt from the example)

[{'column_span': 1, 'is_header': False, 'row_span': 3, 'value': 'Neftekhimik Nizhnekamsk (loan)'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '2012–13'},
  {'column_span': 1, 'is_header': False, 'row_span': 2, 'value': 'Russian FNL'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '6'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '0'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '0'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '0'},
  {'column_span': 2, 'is_header': False, 'row_span': 1, 'value': ''},
// here another cell of column_span 2 is missing
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '6'},
  {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '0'}],

output HTML (screenshot)
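One possible heuristic for the export path is to pad short rows with a single empty spanning cell. A sketch, using the cell format from the excerpt above (note it deliberately ignores row_span carry-over from previous rows, which a full fix must account for):

```python
def pad_row(row, target_width):
    """Pad a row whose column spans sum to less than target_width with
    one empty cell. Heuristic sketch only: it does not account for cells
    spilling over from previous rows via row_span."""
    width = sum(cell.get("column_span", 1) for cell in row)
    if width < target_width:
        row.append({"column_span": target_width - width, "is_header": False,
                    "row_span": 1, "value": ""})
    return row

row = [{"column_span": 2, "is_header": False, "row_span": 1, "value": ""}]
padded = pad_row(row, 4)
assert sum(c["column_span"] for c in padded) == 4
```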

Export improvements

It is currently not possible to export table properties in JSON – it should be enough to add the appropriate processor.

Export should also be documented better and properly tested for all the datasets (cf. #29).

Better navigation

Small improvements for navigating between the examples in the dataset:

  • display the total number of examples
  • add a button for jumping to a random example
  • (optionally) display "sections", distinguished e.g. by the number of triples in WebNLG, the source of the data in DART, etc.

Fix Excel export with multi-index columns

For some tables (presumably those with hierarchical headers), export to XLSX does not work.

To replicate:

in the web interface, go to HiTab – dev – table 1277 → export to XLSX

  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/pandas/io/formats/excel.py", line 578, in _format_header_mi
    raise NotImplementedError(
NotImplementedError: Writing to Excel with MultiIndex columns and no index ('index'=False) is not yet implemented.
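The NotImplementedError comes from pandas itself: to_excel with MultiIndex columns currently requires index=True. A minimal reproduction and possible workaround sketch (the table data here is made up):

```python
import pandas as pd

# A tiny table with hierarchical (MultiIndex) headers, as in HiTab.
df = pd.DataFrame(
    [[1, 0], [25, 2]],
    columns=pd.MultiIndex.from_tuples([("league", "apps"), ("league", "goals")]),
)

# pandas raises NotImplementedError for MultiIndex columns with index=False,
# so keep the index when exporting (requires an Excel writer engine such
# as openpyxl; the index column could be stripped afterwards if unwanted).
try:
    df.to_excel("/tmp/table.xlsx", index=True)
except ImportError:
    pass  # no Excel writer engine installed in this environment
```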

Improve interactive mode

Currently, only a very basic version of the interactive mode is implemented (#4). The user can enter edit mode, change cell value(s), and re-run the pipelines.

Only the edited cell ids are sent to the backend (together with the new values), and an altered version of the table is created for the current request. This should be relatively safe since the dataset will still contain the original, unmodified table.

TODO:

  • invalidate / hide pre-generated outputs
  • investigate if the current approach is safe enough
  • think about persistence - should we keep the edited values after a new table is fetched?

MultiWOZ turn based dataset

  • whole conversation view with reference instructions to users. See #63
  • Turn based view:
    • Not straightforward using the Hugging Face Dataset & tabgenie: I need a 1-to-n mapping between conversations and turns. Either:
      • Need to pregenerate all prefixes - bad for storage
      • Need to create completely new class DialogueDataset / equivalent to TabularDataset
        • computing of get_example_count will differ - turns instead of conversations (as tables)
        • the turn prefixes will be generated on the fly from the stored conversation
    • system view
    • user view
  • consider adding multiple annotations. See the "all_versions" MultiWOZ dataset: https://huggingface.co/datasets/pietrolesci/multiwoz_all_versions/viewer/pietrolesci--multiwoz_all_versions/test

See also budzianowski/multiwoz#119 (comment)

Favourites not working properly

  • Trying to remove a favourite example triggers
Failed to store action to server: Removed favourite totto-dev-6270-undefined-undefined

on the frontend and

Traceback (most recent call last):
  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/kasner/virtualenv/tabgenie/lib/python3.8/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/kasner/tabgenie/tabgenie/main.py", line 403, in favourite
    assert dataset and split and isinstance(table_idx, int), (dataset, split, table_idx)
AssertionError: ('totto-dev-6270', None, None)

on the backend.

  • After clicking on a favourite example, the app sometimes shows the first example in the split instead. It seems non-deterministic – perhaps a race condition between the JavaScript requests?
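The AssertionError above suggests the favourite id "totto-dev-6270" arrives at the backend as one string and is never split back into dataset, split, and index. A defensive parsing sketch (the function name and the handling of the frontend's "-undefined" suffixes are hypothetical):

```python
def parse_favourite_id(fav_id):
    """Split an id like 'totto-dev-6270' into (dataset, split, table_idx).

    Splitting from the right keeps dataset names containing dashes intact;
    dangling '-undefined' suffixes from the frontend are stripped first.
    """
    while fav_id.endswith("-undefined"):
        fav_id = fav_id[: -len("-undefined")]
    dataset, split, idx = fav_id.rsplit("-", 2)
    return dataset, split, int(idx)

assert parse_favourite_id("totto-dev-6270-undefined-undefined") == ("totto", "dev", 6270)
```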

Extra fields

Find a principled way to represent all the extra information for each table. This may include:

  • title, subtitle, section name, category, ...
  • URL
  • logical forms
  • other related tables
  • etc.

This information should be:

  • loaded for the dataset,
  • displayed separately from the main table (in the top area),
  • optionally included in the input of the model (this will depend on the formatting tags for the model input, see #6).

Task and domain knowledge

Although the datasets have a unified appearance, the domains and tasks are different for each dataset.

This information should be one of the inputs to the table2text generation system.

The system needs to know whether the task is to describe the overall appearance of the data, trends, summaries, commentary, etc.

The goal is to find a principled way to represent this information and present it to the system.

Run baselines

Now that the web app can display pre-generated outputs, it would be useful to have some outputs to display.

Goals:

  • prepare suitable baselines
  • prepare the inputs (tabgenie export should come in useful here)
  • run the baselines
  • import the outputs in the app

Baselines = any existing models which can be applied to table-to-text generation.

It is not necessary for the baselines to be applicable to every dataset (some work only with triples, some only with highlighted cells, some with logical forms, etc.), but it would be cool to run them on as many datasets as possible.

It is also not necessary to actually run the baselines – if they already provide outputs for exactly our datasets, we can simply use those outputs (or, if the datasets are similar, we can try to adapt them).

Proposals:

Further ideas:

Data export

Add an option to export a dataset. Each dataset would be exported in a unified format.

For starters, it should be enough to add a simple JSON export with linearized table (see #6):

[
 { 
   "in" : <linearized table>,
   "out": <reference>
 },
 (...)
]
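As a sketch of producing the proposed format (the `linearize` helper and the sample data are illustrative, not the actual tabgenie linearization):

```python
import json

def linearize(cells):
    """Toy linearization: wrap each cell value in [cell] tags."""
    return " ".join(f"[cell] {v} [/cell]" for v in cells)

dataset = [(["club", "season"], "The table lists club statistics per season.")]
export = [{"in": linearize(cells), "out": ref} for cells, ref in dataset]
print(json.dumps(export, indent=1))
```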

Later, it would also be nice to add an option to export each table as an object and to include the extra fields (#5) in the exported file.

It would also make sense to enable this from the command line.

New generation methods - implementation and visualization

Add methods other than a seq2seq model for generating text from the table.

The methods may include:

  • representing cells as text using simple templates,
  • deriving information from the table using rules or more advanced methods (sums, averages, trends, ...),
  • generating relevant information from the table using a QG + QA model,
  • retrieving additional information from external sources.

The goal is to examine all possible ways to generate the content present in a table caption and find which methods are applicable (or whether the task of generating table captions is feasible at all).

The methods should be launchable on-demand from the app and visualizable alongside the tables.

Formatting the model input

The user should be able to format the inputs to the model. The format would be used either for export or for live interaction with the model.

The input could be based on Jinja. For example, this formatting string:

[title] {{ title }} [/title] {% for cell in cells %} [cell] {{ cell }} [/cell] {% endfor %}

could create the following input for the model:

[title] career statistics [/title] [cell] club [/cell] [cell] season [/cell] [cell] league [/cell] (...)
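A minimal rendering sketch with the jinja2 library (note the {% endfor %} needed to close the loop; the sample data follows the example above):

```python
from jinja2 import Template

template = Template(
    "[title] {{ title }} [/title]"
    "{% for cell in cells %} [cell] {{ cell }} [/cell]{% endfor %}"
)
rendered = template.render(title="career statistics", cells=["club", "season", "league"])
print(rendered)
# → [title] career statistics [/title] [cell] club [/cell] [cell] season [/cell] [cell] league [/cell]
```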

Make "reference" a list

Make the "reference" field a list to account for multiple references per table (e.g. in WebNLG and ToTTo).

Convert setting parameters to flags

Some settings would be easier to use as command line flags instead of parameters in config.yml.

Currently, these are cache_dev_splits and hostname (i.e. the ones which differ between deployments).

The implementation needs to take into account that run in fact calls the default flask run command - I was not able to find a way to integrate extra parameters into this command.

But it should be possible to implement it as tabgenie [EXTRA_ARGS] run [FLASK_ARGS] (although it is a bit clumsy).
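One way to sketch the `tabgenie [EXTRA_ARGS] run` layout is a click group with global options, parsed before the subcommand (option names are illustrative; the real run command would hand off to flask run):

```python
import click

@click.group()
@click.option("--hostname", default="localhost",
              help="Deployment-specific setting, moved out of config.yml.")
@click.pass_context
def cli(ctx, hostname):
    # Global flags land in the context before any subcommand runs.
    ctx.obj = {"hostname": hostname}

@cli.command()
@click.pass_context
def run(ctx):
    # The real command would delegate to `flask run` here.
    click.echo(f"starting on {ctx.obj['hostname']}")
```

Invoked as `tabgenie --hostname example.org run`, which is a bit clumsy, as noted, but keeps the default flask run machinery intact.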

Import and visualize generated outputs

For a particular dataset split, the user will upload a text file.

The file will have as many lines as there are examples in the dataset. Each line will contain an output generated by the model.

The app will then display the generated outputs along with each example.
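A sketch of the loader side, assuming one generated output per line (the function name is hypothetical):

```python
from pathlib import Path
import os
import tempfile

def load_outputs(path, expected_count):
    """Read one model output per line; the line count must match the
    number of examples in the dataset split."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if len(lines) != expected_count:
        raise ValueError(f"expected {expected_count} lines, got {len(lines)}")
    return lines

# Illustrative usage with a throwaway file:
fd, tmp = tempfile.mkstemp()
os.close(fd)
Path(tmp).write_text("out 1\nout 2\nout 3\n", encoding="utf-8")
outputs = load_outputs(tmp, 3)
os.remove(tmp)
```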

Load own HF model

The user will be able to interactively generate output with their own model.

If the model is loaded alongside the app (which is the default idea), it should be fairly easy – the user will just specify the path to the checkpoint plus the model name, and AutoModel and AutoTokenizer will do the rest.

Other options to consider: communicate via REST API, use some existing APIs.

Manual analysis mode

Add features which will facilitate manual error analysis:

  • generate a random sample from the dataset
  • bookmark interesting examples
  • write notes

There are two ways to do this:

  • easier: implement it in JS without any saving and force the user to export the data (the data would be lost after reloading the page)
  • harder: implement it using some sort of cookies / sessions / user accounts (to keep the data even after the app is reloaded or closed)

Keyboard shortcuts

Would it be possible to browse the examples using the keyboard (PgUp/Left, PgDn/Right, or something like that)?

App not starting

I run tabgenie run and get the following:

2023-02-12 18:14:27 INFO Application ready
 * Debug mode: off
2023-02-12 18:14:36 INFO Page loaded
2023-02-12 18:14:36 INFO Initializing totto
2023-02-12 18:14:36 INFO Loading totto / dev
2023-02-12 18:14:38 WARNING Found cached dataset totto (/Users/kategerasimenko/.cache/huggingface/datasets/GEM___totto/totto/1.0.0/e27f5cf45f2aaed97e1626a55d68f1fa3b7a358cbc4fa7c4abb784c3d4e4b20d)
2023-02-12 18:14:38 INFO session=<SecureCookieSession {'favourites': {}}> favourites table_data['session']=<SecureCookieSession {'favourites': {}}>
2023-02-12 18:14:38 ERROR Exception on /table [GET]
Traceback (most recent call last):
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/app.py", line 1823, in full_dispatch_request
    return self.finalize_request(rv)
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/app.py", line 1842, in finalize_request
    response = self.make_response(rv)
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/app.py", line 2153, in make_response
    rv = self.json.response(rv)
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/json/provider.py", line 309, in response
    f"{self.dumps(obj, **dump_args)}\n", mimetype=mimetype
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/json/provider.py", line 230, in dumps
    return json.dumps(obj, **kwargs)
  File "/Users/kategerasimenko/.pyenv/versions/3.9.4/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/Users/kategerasimenko/.pyenv/versions/3.9.4/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/kategerasimenko/.pyenv/versions/3.9.4/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Users/kategerasimenko/Desktop/robota/tabgenie/env/lib/python3.9/site-packages/flask/json/provider.py", line 122, in _default
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
TypeError: Object of type LocalProxy is not JSON serializable
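The last frame shows a flask LocalProxy (here, most likely the session object) being handed to the JSON encoder. Converting the proxy to a plain dict before building the response avoids the TypeError; a minimal reproduction-and-fix sketch (the route and secret key are illustrative):

```python
from flask import Flask, jsonify, session

app = Flask(__name__)
app.secret_key = "dev"  # illustrative only

@app.route("/table")
def table():
    # `session` is a LocalProxy; serializing it directly raises the
    # TypeError above, so copy it into a plain dict first.
    data = {"favourites": dict(session.get("favourites", {}))}
    return jsonify(data)

client = app.test_client()
resp = client.get("/table")
```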

More efficient dataset loading

Find a more efficient way to load the datasets.

Currently, some have to be downloaded directly from their original sources. This makes deployment quite cumbersome, since the datasets have to be packed together with the app.

Also, each data split is loaded into memory, which is not very memory-efficient and takes a lot of time.

Possible solutions:

  • add all the datasets to Huggingface - this would facilitate the deployment significantly
  • use the dataset stream for loading the HF datasets - this would help the efficiency, but prevent jumping to random examples
  • load several examples for each dataset in the background - this would improve the quality of the interaction while still being memory efficient
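The streaming option corresponds to load_dataset(..., streaming=True) in the Hugging Face datasets library. A library-free sketch of the trade-off, using a plain generator as a stand-in for the streamed split:

```python
from itertools import islice

def example_stream():
    """Stand-in for a streamed HF dataset: examples are produced lazily,
    so the full split is never held in memory."""
    for idx in range(1_000_000):
        yield {"table_idx": idx}

# Cheap: take the first few examples without materializing the split.
first = list(islice(example_stream(), 3))

# Expensive: "random" access means skipping forward, which is why
# streaming prevents jumping to arbitrary examples efficiently.
example_500 = next(islice(example_stream(), 500, None))
```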

Python module interface

Enable using Tabgenie as a module in other projects.

Example use-case:

from tabgenie import load_dataset, get_pipeline

# load data
dataset = load_dataset("totto")
pipeline = get_pipeline("rdf_triples")

# process the data
triples = pipeline.process(dataset)

# ... custom processing ...

Persistent URLs

Add a possibility to link a specific table by URL, e.g. https://[TABGENIE_URL]?dataset=webnlg&split=dev&table_idx=55.

Currently, such a request returns just the HTML code of the table, not the whole app.
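A sketch of the routing change in Flask (the parameter names follow the URL above; the rendered string is a stand-in for serving the full app shell):

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

@app.route("/")
def index():
    # Read the deep-link parameters and serve the whole app, letting the
    # frontend load the requested table on startup.
    dataset = request.args.get("dataset", "totto")
    split = request.args.get("split", "dev")
    table_idx = request.args.get("table_idx", default=0, type=int)
    return render_template_string(
        "loading {{ d }}/{{ s }}/{{ i }}", d=dataset, s=split, i=table_idx
    )

client = app.test_client()
resp = client.get("/?dataset=webnlg&split=dev&table_idx=55")
```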

Curate the existing datasets

Go through all the existing datasets and make sure they are added properly.

For each dataset:

  • check license
  • add data card
  • check if the data is loaded properly
  • check if no relevant fields are missing

UX bugs

  • double request (for the target table and table 0) when opening a table from favourites / notes:
    • open the list of favourites / notes
    • click on any item
    • the target table is loaded first for half a second and then reloaded with table 0
    • request tracking shows two requests when clicking on the table link
  • "remove from favourites" button throws a 500:
    • add an item to favourites
    • open favourites
    • delete the item
    • the action fails
  • "remove note" acts weirdly:
    • create two notes
    • try to remove one of them
    • it is not removed
    • in general it acts weirdly – try experimenting and you will see
  • the XLSX table for annotation is not available for export (I thought Ondrej integrated it – was it removed on purpose?)
  • favourites are apparently saved somewhere when the user closes the tab with the application (the "favourite" button is pressed when I reopen the tab), but the "favourites" list is empty
  • the icons for the interactive mode and export are displayed as missing characters in some OS / browser / font combinations; it would be better to replace them with PNG images (the icons may stay the same)

Linking cells to headings

Find a principled way to fetch headings related to the highlighted cells.

At the moment, only the cells themselves and the table title are sent as input to the model.

It would make sense to fetch the row and column heading(s) as well. Moreover, hierarchical cases where a heading covers multiple rows / columns should be handled, too.

Fix custom model API input (UI, templates)

The custom input is currently not working as expected, both on the frontend and the backend side.

The templates have to be loaded properly and be modifiable by the user through the web interface.

Potential bugs in the datasets

This is a list of things I found weird that will need double-checking.

  • scigen - text and reference confused?
  • scigen - for values in bold, _h object in cell value, error when linearizing
  • numericnlg - refs don't look like refs
  • logic2text - look into the highlights provided by contlog
  • logicnlg - restore the highlights, add disclaimer
  • hitab train 66 - error
  • sportsett - why first three cols empty? upd: fixed by #78

Add about infobox

The "About" button is currently just a placeholder.

The button should display a small infobox or infobubble with information about the authors and the project.

Do not store datasets/splits in app.config

If app.config is changed, the app reloads in debug mode.
IMO, app.config is intended to be persistent.

Currently, this breaks the use of Flask sessions in the development server with auto-reloading: auto-reloading means that a new session and a new secret key are used.

Extend CLI interface

TabGenie currently supports the commands tabgenie run, for launching the web interface, and tabgenie export, for exporting the datasets to JSON.

TODO:

  • extend the export command so that it supports all the formats that are already supported through the web interface (CSV, XLSX, HTML)
  • extend the export command so that it supports exporting references in a text file
  • add a list command which lists supported datasets and pipelines (and possibly generated outputs)

Export of merged cells to csv/excel

There's a general problem with pandas when handling merged cells. A table with merged cells (HiTab dev 1277) looks like this in CSV/Excel:

club season division league league fa cup fa cup other other total total
club season division apps goals apps goals apps goals apps goals
chester city 1987-88 third division 1 0 0 0 0 0 1 0
chester city 1988-89 third division 25 2 0 0 5 0 30 2
chester city 1989-90 third division 18 4 2 0 6 1 26 5
chester city total total 44 6 2 0 11 1 57 7
tranmere rovers 1993-94 first division 0 0 0 0 0 0 0 0
port vale 1993-94 second division 2 0 1 0 0 0 3 0

The values are duplicated.
For CSV, it is not clear how these cells should be displayed.
For Excel, the structure should be preserved, but this requires switching away from pandas to different processing.
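For CSV, one convention would be to keep a merged value only in its first cell and blank the repeats. A naive sketch (it compares values rather than spans, so it would also blank legitimately repeated values such as consecutive zeros; a real implementation should track the original cell spans instead):

```python
def blank_horizontal_repeats(row):
    """Blank cells that repeat their left neighbour, so the duplicated
    header 'league league' becomes 'league' followed by an empty cell.
    Caveat: equal-but-unmerged values get blanked too; use the cell
    spans from the source table for a correct implementation."""
    return [v if i == 0 or v != row[i - 1] else "" for i, v in enumerate(row)]

header = ["club", "season", "division", "league", "league", "fa cup", "fa cup"]
print(blank_horizontal_repeats(header))
# → ['club', 'season', 'division', 'league', '', 'fa cup', '']
```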

UX improvements

  • in the table navigation block, the elements with table numbers have larger height than buttons
  • show notes themselves (at least the beginning) in the notes list
  • fix the key column width for properties (now it changes dynamically when toggling the individual properties)
  • the button for regenerating the model output in the interactive mode is too small - change to "Regenerate output" button in the pipeline output box
  • in "toggle view", do we need the view only with pre-generated outputs? is it needed without looking at input data?
  • think of better json structure in the export
  • clickable email addresses in the about section?
  • "export examples with notes" doesn't include notes into the files
  • for datasets without a header (e.g. E2E), a dummy header is inserted when exporting to CSV
  • mobile version? :)

Update README

The README should be updated so that it reads like a normal README, focusing less on development and more on usage.

It should also be checked for outdated information.

Interactive mode

Add an interactive mode in which people would be able to edit existing inputs or create their own.

Idea: a selection box for switching between the highlighting & interactive mode. In the interactive mode, each table cell in the existing datasets would be editable.

A "playground" would then allow the user to create a new empty table MxN and edit fields in it.

UX improvements

  • hide the less important table properties and display them on demand (using a collapsible block)
  • remove the switches for pipelines and outputs in the left panel, unify them with the output boxes in the right panel (the switch will be just a small button available at the related output box)
  • integrate the pipeline switches with the pipeline settings: e.g. add a small cogwheel at the top of the output box
