Coder Social home page Coder Social logo

xiaomeng-zhu / lieder Goto Github PK

View Code? Open in Web Editor NEW
2.0 1.0 0.0 20.49 MB

Repository for the ACL 2024 paper "LIEDER: Linguistically-Informed Evaluation Suite for Discourse Entity Recognition"

Python 39.17% R 60.83%
acl2024 linguistics-dataset discourse-entity-recognition natural-language-processing natural-language-understanding semantic-analysis

lieder's Introduction

LIEDER: Linguistically-Informed Evaluation for Discourse Entity Recognition

Table of Contents

  1. LIEDER: Linguistically-Informed Evaluation Suite for Discourse Entity Recognition
  2. Replication

Abstract

Discourse Entity (DE) recognition is the task of identifying novel and known entities introduced within a text. While previous work has found that large language models have basic, if imperfect, DE recognition abilities (Schuster and Linzen, 2022), it remains largely unassessed which of the fundamental semantic properties that govern the introduction and subsequent reference to DEs they have knowledge of. We propose the Linguistically-Informed Evaluation for Discourse Entity Recognition (LIEDER) dataset that allows for a detailed examination of language models’ knowledge of four crucial semantic properties: EXISTENCE, UNIQUENESS, PLURALITY, and NOVELTY. We find evidence that state-of-the-art large language models exhibit sensitivity to all of these properties except NOVELTY, which demonstrates that they have yet to reach human-level language understanding abilities.

Design of LIEDER

We outline four crucial semantic properties that language models ought to know for successful DE recognition:

  • EXISTENCE: A language model with human-level understanding abilities should only use definite descriptions to refer to entities that have been introduced into the discourse.
  • UNIQUENESS: A language model should use a singular definite description to refer to a previously introduced entity only when the referent is unique relative to the discourse.
  • PLURALITY: A language model should use a plural definite description only if the set of DEs contains more than one individual of the relevant sort.
  • NOVELTY: A language model should recognize that an occurrence of an indefinite noun phrase is associated with the introduction of a new entity into the discourse.

Each example in LIEDER consists of two sentences: a context sentence followed by a continuation. There are four context types: pos_neg, neg_pos, pos_pos, neg_neg and two continuation types: singular and plural, which result in 8 different combinations (summarized below).

Context type Context Singular Continuation Plural Continuation
pos_neg John owns a dog but Mark doesn't own a dog. The dog is very cute. #The dogs are very cute.
neg_pos John doesn't own a dog but Mark owns a dog. The dog is very cute. #The dogs are very cute.
pos_pos John owns a dog and Mark owns a dog too. #The dog is very cute. The dogs are very cute.
neg_neg John doesn't own a dog and Mark doesn't own a dog either. #The dog is very cute. #The dogs are very cute.

Because metalinguistic judgments elicited from language models may not reflect the full extent of the model's knowledge (Hu & Levy, 2023), we instead compare felicity using the probabilities the model assigns to the continuation given the context. We assume that the probability a model assigns to a felicitous case should be greater than the probability it assigns to an infelicitous one. With 3 felicitous pairs and 5 infelicitous ones, this means we have 15 informative probability comparisons in total.

Comparison Type Requirement
p(sg|pos_neg)>p(sg|pos_pos) uniqueness, novelty
p(sg|neg_pos)>p(sg|pos_pos) uniqueness, novelty
p(sg|neg_pos)>p(sg|neg_neg) existence
p(sg|pos_neg)>p(sg|neg_neg) existence
p(pl|pos_pos)>p(pl|pos_neg) plurality
p(pl|pos_pos)>p(pl|neg_pos) plurality
p(pl|pos_pos)>p(pl|neg_neg) existence, plurality
p(sg|pos_neg)>p(pl|pos_neg) plurality
p(sg|pos_neg)>p(pl|neg_pos) plurality
p(sg|pos_neg)>p(pl|neg_neg) existence, plurality
p(sg|neg_pos)>p(pl|neg_pos) plurality
p(sg|neg_pos)>p(pl|pos_neg) plurality
p(sg|neg_pos)>p(pl|neg_neg) existence, plurality
p(pl|pos_pos)>p(sg|pos_pos) uniqueness, novelty
p(pl|pos_pos)>p(sg|neg_neg) existence

Experiment 1

EXISTENCE & UNIQUENESS

High accuracy on the first two panels suggests that models know EXISTENCE (i.e. singular definites cannot be used to refer to non-existing DEs). exp1_singular

Lower accuracy on the last two panels suggests two possibilities:

  • Hypothesis 1: During training, the models have successfully learned the EXISTENCE requirement, but they failed to learn UNIQUENESS.
  • Hypothesis 2: During training, the models have successfully learned both the EXISTENCE and UNIQUENESS requirements but fail to recognize that two distinct DEs have been introduced in pos_pos contexts, resulting in difficulties in distinguishing the infelicitous pos_pos from felicitous pos_neg and neg_pos. To put it in another way, they fail at the NOVELTY requirement.

Experiments 2 and 3 (Appendix) will focus on teasing these two hypotheses apart.

PLURALITY

High accuracy in all three panels below suggests that models know PLURALITY. exp1_plural

Experiment 2

One way to test if LLMs fail to recognize two different DEs in pos_pos contexts is to use lexical cues that make explicit the distinctness of the first and second entities.
If performance relative to pos_pos contexts increases when the distinction is explicit, then there is evidence that the LLMs fail to recognize the distinction in the implicit case, where the presence of multiple DEs results from the NOVELTY condition on indefinites alone.

Accordingly, we make the following modification to our dataset which we call Explicit Novelty (corresponding to diff in the repo): for each context of the type pos_pos, we add the adjective different to the second indefinite description. exp2_singular_vs_exp1 There is a significant increase (p<0.001) in accuracy from Implicit to Explicit Novelty, suggesting models' difficulty with the NOVELTY requirement.

Replication

To replicate our results, start by cd into the repository and follow the instructions below:

Naming Conventions

All experimental configurations in the config directory and results in the results directory obey the following naming conventions:

  • <exp_version> = {base, diff, two} that corresponds to Experiment 1, 2, and 3 (Appendix) respectively.
  • <model> = {llama2, llama3, babbage-002, davinci-002}1
  • <size> can be 7B, 13B, and 70B for Llama 2 or 8B and 70B for Llama 3.

Running Inference

python scripts/experiments_<spec>.py --config <config_f>

Specifically,

  • scripts/experiments_openai.py for babbage-002 and davinci-002
  • scripts/experiments_llama_family.py for loading Llama models on a CPU
  • scripts/experiments_llama_family_gpu.py for loading Llama models on a GPU 2

For example, if you would like to run Llama2-7B on the base version of LIEDER (corresponding to Experiment 1), run the following command

python scripts/experiments_llama_family.py --config config/base_llama2_7B.json

Note: Please double check that the script that you are using and the config file match in terms of the model series. For example, running scripts/experiments_openai.py using a Llama config file will result in errors.

Analyzing Results

To analyze the results in Experiment 1 and 2, running the following command with the appropriate config file

python scripts/analyze_experiments.py --config <config_f> --metric <ref_or_nonref> --model_series <llama_or_not>

To analyze the results for Experiment 3, simply replace scripts/analyze_experiments.py with scripts/analyze_experiments_two.py.

If you would like to compute accuracy under the direct metric that is used throughout the main body of the paper, the second argument --metric should be ref. Otherwise, if you would like to use the relative metric discussed in the appendix, specify the metric as nonref.

For example, running the following command

python scripts/analyze_experiments_two.py --config config/two_llama2_7B.json --metric ref --model_series llama

produces two files: results/two/two_llama2_7B_summed.csv and results/two/two_llama2_7B_accuracy_ref.csv. The former contains the summed conditional probability of the continuation given the context, and the latter contains the accuracy per comparisons.

View Accuracy with Associated Semantic Property

To see the semantic property that is associated with each entry in the accuracy file, run

python scripts/view_accuracy_with_type.py --accuracy_file <accuracy_file_name>

Generating Plots

Follow analysis.R for plots and significance analysis.

Footnotes

  1. The repository also contains results from GPT-2 and CodeLlama that are not included in the camera-ready version of the paper.

  2. Note that some of the models might be too large to fit on a single GPU. In such cases, follow comments in the script for details.

lieder's People

Contributors

xiaomeng-zhu avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.