dhfbk / variationist Goto Github PK

View Code? Open in Web Editor NEW

4.0 3.0 0.0 10.45 MB

Variationist: Exploring Multifaceted Variation and Bias in Written Language Data (ACL 2024 demo track)

Home Page: https://aclanthology.org/2024.acl-demos.33/

License: MIT License

Python 100.00%

bias-in-data computational-linguistics corpus-analysis language-variation nlp

variationist's People

Contributors

Stargazers

Watchers

variationist's Issues

Handle label sparsity for temporal/spatial/quantitative using "granularity"

Some variables (e.g., dates in the standard Twitter format YYYY-MM-DDTHH:MM:SS.000Z) are likely to take a different value for each text, making the final results for the metrics sparse, uninformative, or even useless (e.g., a PMI rank for each exact datetime). It would make sense to run the computation on a given granularity instead (e.g., "year", "year-month", "year-month-day").

I strongly support to leave to the user any preprocessing stuff (we cannot handle any data variant the user is thinking about!), but the case of datetimes is quite standard (we can just support 2-3 formats and document it), and temporal aggregation seems a good feature that could be appreciated!

Extra: the same principle can be applied on spatial data with coordinates, for which a given granularity would instead be an integer denoting kilometers. Using the Haversine formula and creating a set of bounding boxes based on the granularity, the results will make much more sense.

Note: this issue is complementary to the var_subsets feature, here we are still working on the variable values before determining which ones are of interest to the user

Whitespace tokenizer implicitly removes the emojis

The regex at line https://github.com/dhfbk/variationist/blob/main/src/data/tokenization_utils.py#L13 leads to the removal of emojis which are however useful indicators for many use cases (e.g., hate speech, sentiment analysis, etc). In general, the regex at line 13 is somehow limited to hold a fixed set of characters, perhaps we can think about a more general way to do that logic!

Add bi- & tri-gram support

Add support for n-grams (n>1) in tokenization, so metrics can be calculated on them in addition to single tokens.

empty tokens in whitespace tokenizer

whitespace tokenizer adds an extra empty token at the end of each sentence:

E.g.
['After', 'finding', 'peculiar', 'key', 'three', 'smart', 'adventurous', 'kids', 'launch', 'quest', 'uncover', 'whereabouts', 'coveted', 'archaeological', 'treasure', '']

dhfbk / variationist Goto Github PK

variationist's People

Contributors

Stargazers

Watchers

variationist's Issues

Handle label sparsity for temporal/spatial/quantitative using "granularity"

Whitespace tokenizer implicitly removes the emojis

Add bi- & tri-gram support

empty tokens in whitespace tokenizer

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent