
dhfbk / variationist


Variationist: Exploring Multifaceted Variation and Bias in Written Language Data (ACL 2024 demo track)

Home Page: https://aclanthology.org/2024.acl-demos.33/

License: MIT License

Python 100.00%
bias-in-data computational-linguistics corpus-analysis language-variation nlp

variationist's People

Contributors

alanramponi, ca-milla, stefanomenini


variationist's Issues

Handle label sparsity for temporal/spatial/quantitative using "granularity"

Some variables (e.g., dates in the standard Twitter format YYYY-MM-DDTHH:MM:SS.000Z) are likely to take a different value for each text, making the final results for the metrics sparse, uninformative, or even useless (e.g., a PMI rank for each exact datetime). It would make sense to run the computation on a given granularity instead (e.g., "year", "year-month", "year-month-day").

I strongly support leaving any preprocessing to the user (we cannot handle every data variant a user might come up with!), but datetimes are a fairly standard case (we can support just 2-3 formats and document them), and temporal aggregation seems like a feature users would appreciate!
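A minimal sketch of the temporal aggregation idea, assuming the Twitter-style format mentioned above. The function name `truncate_datetime` and the `GRANULARITY_FORMATS` mapping are hypothetical and not part of variationist's API:

```python
from datetime import datetime

# Hypothetical granularity names (taken from the examples in this issue)
# mapped to the strftime pattern that keeps only that level of detail.
GRANULARITY_FORMATS = {
    "year": "%Y",
    "year-month": "%Y-%m",
    "year-month-day": "%Y-%m-%d",
}


def truncate_datetime(value: str, granularity: str) -> str:
    """Collapse a YYYY-MM-DDTHH:MM:SS.000Z string to the given granularity."""
    # Assumption: only this one input format is supported in the sketch.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime(GRANULARITY_FORMATS[granularity])


# All texts from the same month now share one variable value.
print(truncate_datetime("2023-05-17T08:30:00.000Z", "year-month"))
```

With this kind of mapping, metrics would be computed per truncated value (e.g., per month) rather than per exact timestamp.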

Extra: the same principle can be applied to spatial data with coordinates, where the granularity would instead be an integer denoting kilometers. By using the Haversine formula to create a set of bounding boxes of that size, we can aggregate nearby coordinates into the same value, so the results will make much more sense.
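One possible reading of the spatial case, sketched under assumptions: `haversine_km` is the standard Haversine great-circle distance, while `grid_cell` is a hypothetical helper that bins coordinates into roughly square cells whose side is the granularity in kilometers. Neither name comes from variationist, and the actual bounding-box construction may differ:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)
KM_PER_DEG_LAT = math.pi * EARTH_RADIUS_KM / 180.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def grid_cell(lat, lon, granularity_km):
    """Map a coordinate to a (row, col) cell about granularity_km wide."""
    lat_step = granularity_km / KM_PER_DEG_LAT
    row = math.floor((lat + 90.0) / lat_step)
    # Longitude degrees shrink with latitude, so size them at the row's center.
    center_lat = -90.0 + (row + 0.5) * lat_step
    shrink = max(math.cos(math.radians(center_lat)), 1e-6)  # avoid division by zero at the poles
    lon_step = lat_step / shrink
    col = math.floor((lon + 180.0) / lon_step)
    return row, col
```

Texts whose coordinates fall in the same cell would then share one variable value, much like the truncated datetimes in the temporal case.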

Note: this issue is complementary to the var_subsets feature. Here we are still working on the variable values themselves, before determining which ones are of interest to the user.

Add bi- & tri-gram support

Add support for n-grams (n>1) in tokenization, so metrics can be calculated on them in addition to single tokens.
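As an illustration of what this could look like, here is a minimal sketch that builds n-grams from an already tokenized text. The helper name `ngrams` and the choice to join tokens with a space are assumptions, not the project's implementation:

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams, each joined by a single space."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Slide a window of size n over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["three", "smart", "adventurous", "kids"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

Joining each n-gram into a single string means it can be counted the same way as a single token, so existing metrics would not need to distinguish the two.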

Empty tokens in whitespace tokenizer

The whitespace tokenizer adds an extra empty token at the end of each sentence:

E.g.
['After', 'finding', 'peculiar', 'key', 'three', 'smart', 'adventurous', 'kids', 'launch', 'quest', 'uncover', 'whereabouts', 'coveted', 'archaeological', 'treasure', '']
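The tokenizer's source is not shown here, so the following is only a guess at the cause: splitting on a literal space leaves an empty string when the text ends with whitespace. The function names below are hypothetical and serve only to reproduce the symptom and show one way to avoid it:

```python
def split_on_space(text):
    """Reproduces the symptom: a trailing space yields a final '' token."""
    return text.split(" ")


def split_on_whitespace(text):
    """str.split() with no argument drops leading, trailing and repeated whitespace."""
    return text.split()


sentence = "coveted archaeological treasure "
print(split_on_space(sentence))       # ends with an empty token
print(split_on_whitespace(sentence))  # no empty tokens
```

If the separator has to stay explicit, filtering with `[t for t in tokens if t]` would remove the empty tokens as well.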
