
Code for the NAACL 2024 HCI+NLP Workshop paper "LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-explanation" (Wang et al. 2024)

Home Page: https://arxiv.org/abs/2401.12576


llmcheckup's Introduction

LLMCheckup


Dialogical Interpretability Tool for LLMs

💥Running with conda / virtualenv

Note: Please use Python 3.8+ and torch 2.0+

Create the environment and install dependencies

Conda

conda create -n llmcheckup python=3.9
conda activate llmcheckup

venv

python -m venv venv
source venv/bin/activate

⚙️Install the requirements

python -m pip install --upgrade pip
pip install -r requirements.txt
python -m nltk.downloader "averaged_perceptron_tagger" "wordnet" "omw-1.4"

🚀Launch system

python flask_app.py

💟Supported explainability methods

  • Feature Attribution
    • Attention, Integrated gradient, etc.
    • Implemented by the 🐛inseq package (see the minimal example after this list)
  • Semantic Similarity
  • Free-text rationalization
    • Zero-shot CoT
    • Plan-and-Solve
    • Optimization by PROmpting (OPRO)
    • Any custom additional prompt, according to the user's wishes
    • Note: the options above can be freely selected in the interface under "Prompt modification"
  • Data Augmentation
    • Implemented by NLPAug package or few-shot prompting
  • Counterfactual Generation
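
For reference, a minimal inseq attribution call looks roughly like the following sketch; the model name and input text are placeholders, not LLMCheckup's own configuration:

import inseq

# Load a generative model together with an attribution method,
# e.g. "attention" or "integrated_gradients"
model = inseq.load_model("gpt2", "integrated_gradients")

# Attribute the model's generation with respect to the input tokens
out = model.attribute("The claim is supported because")
out.show()  # render the attribution scores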

🤗Models:

In our study, we identified three LLMs for our purposes.

🐳Deployment:

We support different methods for deployment:

✏️Support:

Method          Unix-based   Windows
Original        ✅           ✅
GPTQ            ✅           ✅
bitsandbytes*   ✅           ✅
petals**        ✅           ❌

*: 🪟 For Windows: if you encounter errors while installing bitsandbytes, then try:

python -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl

**: petals is currently not supported on Windows, since it relies on many Unix-specific features. See the issue here. petals is still usable if you run LLMCheckup in Docker or WSL 2.
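
For context, loading a model over the petals swarm on a Unix-based system (or in Docker/WSL 2) looks roughly like the following sketch, modeled on the petals quickstart; the model name is an example from the petals project, not necessarily one used by LLMCheckup:

from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Connect to the public swarm; only a small part of the model weights is downloaded locally
model_name = "petals-team/StableBeluga2"  # example checkpoint, not an LLMCheckup default
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))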

🔍Use case:

Fact checking

Dataset: COVID-Fact

Link: https://github.com/asaakyan/covidfact

Structure

{
    Claim: ...,
    Evidence: ...,
    Label: ...,
}

Commonsense Question Answering

Dataset: ECQA

Link: https://github.com/dair-iitd/ECQA-Dataset

Structure

{
    Question: ...,
    Multiple choices: ...,
    Correct answer: ...,
    Positive explanation: ...,
    Negative explanation: ...,
    Free-flow explanation: ...,
}

📝Input with multi modalities

  • Text
  • Image
    • Image upload
    • Optical Character Recognition
  • Audio
    • A lightweight fairseq S2T (speech-to-text) model from Meta (see the sketch after this list)
    • If you encounter the error soundfile.LibsndfileError: Error opening path_to_wav: Format not recognised. when reading recorded files, try installing ffmpeg.
      • 🐧On Linux: sudo apt install ffmpeg or pip3 install ffmpeg
      • 🪟On Windows: download ffmpeg from here and add its path to the system environment variables
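
A minimal sketch of transcribing a recorded WAV file with such a model via transformers is shown below; the exact checkpoint (facebook/s2t-small-librispeech-asr) and the use of librosa for loading audio are assumptions, not necessarily what LLMCheckup does internally:

import librosa
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

checkpoint = "facebook/s2t-small-librispeech-asr"  # assumed checkpoint for illustration
processor = Speech2TextProcessor.from_pretrained(checkpoint)
model = Speech2TextForConditionalGeneration.from_pretrained(checkpoint)

# S2T models expect 16 kHz mono audio
speech, _ = librosa.load("recording.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])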

llmcheckup's People

Contributors

qiaw99, nfelnlp


llmcheckup's Issues

Refactor project structure

https://docs.python-guide.org/writing/structure/

This is a good starting point: you shouldn't have one single folder holding a dozen code files. Many files in a single folder make the code look unstructured and hard to oversee, and it hurts maintainability because the code isn't broken down into (mostly) independent modules.

I personally also like using the Java structure (even if abbreviated at times).
https://stackoverflow.com/questions/28160379/how-to-create-a-test-directory-in-intellij-13/28161314#28161314

Rename setters and getters

https://github.com/nfelnlp/LLMCheckup/blob/8cab97fde6ec7d54100daccf3628813fd3715a3d/logic/conversation.py#L31
update_name -> set_name; you could also use @property if you prefer that.

https://github.com/nfelnlp/LLMCheckup/blob/8cab97fde6ec7d54100daccf3628813fd3715a3d/logic/conversation.py#L35
update_contents -> set_contents

https://github.com/nfelnlp/LLMCheckup/blob/8cab97fde6ec7d54100daccf3628813fd3715a3d/logic/conversation.py#L39
update_type -> set_type

But setters and getters aren't really needed anyway: you're accessing a (I assume) public class attribute and you aren't doing any logic in the setters or getters. The class itself looks like a dataclass. If you don't know dataclasses, see https://docs.python.org/3/library/dataclasses.html. Dataclasses are really neat; think of them as the Python equivalent of structs.

Also https://github.com/nfelnlp/LLMCheckup/blob/8cab97fde6ec7d54100daccf3628813fd3715a3d/logic/conversation.py#L19C56-L19C56
The "kind" method argument is ambiguous; use type instead. But type is also ambiguous, so use conversation_type or something similar and rename the class property accordingly. A sketch of both suggestions follows below.

inseq assertion error

Prompt:

show me the 5 most important features for data point 730 by attention

Error trace:

  File "C:\Users\87290\DFKI\LLMCheckup\flask_app.py", line 267, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "C:\Users\87290\DFKI\LLMCheckup\logic\core.py", line 536, in update_state
    returned_item = run_action(
  File "C:\Users\87290\DFKI\LLMCheckup\logic\action.py", line 49, in run_action
    action_return, action_status = actions[p_text](
  File "C:\Users\87290\DFKI\LLMCheckup\actions\explanation\feature_importance.py", line 130, in feature_importance_operation
    out_agg = out.aggregate(inseq.data.aggregator.SubwordAggregator)
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\data\attribution.py", line 617, in aggregate
    aggregated.sequence_attributions[idx] = seq.aggregate(aggregator, **kwargs)
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\data\aggregator.py", line 249, in aggregate
    return aggregator.aggregate(
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\data\aggregator.py", line 720, in aggregate
    return super().aggregate(attr, source_spans=source_spans, target_spans=target_spans, **kwargs)
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\data\aggregator.py", line 546, in aggregate
    return super().aggregate(attr, source_spans=source_spans, target_spans=target_spans, **kwargs)
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\data\aggregator.py", line 102, in aggregate
    cls.post_aggregate_hook(aggregated, **kwargs)
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\data\aggregator.py", line 337, in post_aggregate_hook
    cls.is_compatible(attr)
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\data\aggregator.py", line 432, in is_compatible
    assert attr.target_attributions.shape[1] == attr.attr_pos_end - attr.attr_pos_start
AssertionError

Prompt:

primary features of data point 3444 by input gradient

Error trace:

  File "C:\Users\87290\DFKI\LLMCheckup\flask_app.py", line 267, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "C:\Users\87290\DFKI\LLMCheckup\logic\core.py", line 536, in update_state
    returned_item = run_action(
  File "C:\Users\87290\DFKI\LLMCheckup\logic\action.py", line 49, in run_action
    action_return, action_status = actions[p_text](
  File "C:\Users\87290\DFKI\LLMCheckup\actions\explanation\feature_importance.py", line 120, in feature_importance_operation
    out = inseq_model.attribute(
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\models\attribution_model.py", line 445, in attribute
    attribution_outputs = attribution_method.prepare_and_attribute(
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\attr\attribution_decorators.py", line 71, in batched_wrapper
    out = f(self, *args, **kwargs)
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\attr\feat\feature_attribution.py", line 237, in prepare_and_attribute
    attribution_output = self.attribute(
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\attr\feat\feature_attribution.py", line 372, in attribute
    attr_pos_start, attr_pos_end = check_attribute_positions(
  File "C:\Users\87290\anaconda3\envs\llm\lib\site-packages\inseq\attr\feat\attribution_utils.py", line 85, in check_attribute_positions
    raise ValueError("Start and end attribution positions cannot be the same.")
ValueError: Start and end attribution positions cannot be the same.

Remove unnecessary directories

The directories /cache, /data shouldn't be uploaded as source files.

The cache can be generated on the fly, and the data should be available for download somewhere else?

Guard clauses

Show CUDA OOM error on interface

I tried running the default config, which uses a Llama-2-7B-chat-hf model, and ran into the common CUDA out-of-memory error. I wonder if we should show some sort of hint, as well as a recommendation, on the interface about this. Specifically, we could replace the standard response

I'm sorry but could you rephrase the message, please?

with

While decoding the response, I ran into a CUDA out-of-memory error. I suggest choosing a smaller model for your hardware configuration. You can do that by opening the global_config.gin file and editing the value of GlobalArgs.config to an equivalent config with a model of smaller parameter size, e.g. "ecqa_llama_gptq.gin" or "ecqa_pythia.gin".

[2024-01-09 07:29:27,018] INFO in core: getting grammar
[2024-01-09 07:29:27,018] INFO in core: About to decode
[2024-01-09 07:29:28,031] INFO in flask_app: Traceback getting bot response: Traceback (most recent call last):
  File "/home/nfel/PycharmProjects/LLMCheckup/flask_app.py", line 267, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "/home/nfel/PycharmProjects/LLMCheckup/logic/core.py", line 572, in update_state
    parse_tree, parsed_text = self.compute_parse_text(text)
  File "/home/nfel/PycharmProjects/LLMCheckup/logic/core.py", line 376, in compute_parse_text
    parsed_text = self.mprompt_parser.parse_user_input(text)
  File "/home/nfel/PycharmProjects/LLMCheckup/parsing/multi_prompt/prompting_parser.py", line 173, in parse_user_input
    parsed_operation = self.generate_with_prompt(operation_type_prompt, user_input).replace("[E]", "").strip()
  File "/home/nfel/PycharmProjects/LLMCheckup/parsing/multi_prompt/prompting_parser.py", line 96, in generate_with_prompt
    outputs = self.decoder_model.generate(**inputs, generation_config=self.generation_config)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2801, in sample
    outputs = self(
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1034, in forward
    outputs = self.model(
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
    layer_outputs = decoder_layer(
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/nfel/PycharmProjects/LLMCheckup/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 383, in forward
    value_states = torch.cat([past_key_value[1], value_states], dim=2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 9.78 GiB total capacity; 5.66 GiB already allocated; 75.00 MiB free; 6.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

[2024-01-09 07:29:28,031] INFO in flask_app: Exception getting bot response: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 9.78 GiB total capacity; 5.66 GiB already allocated; 75.00 MiB free; 6.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
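
A minimal sketch of how such a hint could be surfaced; the function names (get_bot_response, decode_response) and the exact wording are assumptions, not the actual LLMCheckup code:

import torch

def get_bot_response(user_text, conversation):
    try:
        # decode_response is a hypothetical helper standing in for the parsing/decoding call
        return decode_response(user_text, conversation)
    except torch.cuda.OutOfMemoryError:
        # Surface an actionable hint instead of the generic fallback response
        return (
            "While decoding the response, I ran out of CUDA memory. "
            "Please choose a smaller model for your hardware configuration by opening "
            "global_config.gin and setting GlobalArgs.config to e.g. "
            '"ecqa_llama_gptq.gin" or "ecqa_pythia.gin".'
        )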

requirements.txt has version conflicts

ERROR: Cannot install -r .\requirements.txt (line 10), -r .\requirements.txt (line 73), -r .\requirements.txt (line 90) and huggingface-hub==0.4.0 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested huggingface-hub==0.4.0
datasets 2.10.1 depends on huggingface-hub<1.0.0 and >=0.2.0
sentence-transformers 2.2.0 depends on huggingface-hub
transformers 4.34.1 depends on huggingface-hub<1.0 and >=0.16.4

To replicate:
Download the repository into a fresh venv and execute
pip install -r requirements.txt

Why use gin?

What is wrong with using json / ini files? They are much more popular and easier to read at a glance.

I have discussed this with @nfelnlp and he seems to understand my point but I would genuinely like to know why you've decided against writing a small loader.
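
For comparison, a small JSON-based loader could be as simple as the following sketch; the GlobalArgs fields are assumptions, not the actual keys used in global_config.gin:

import json
from dataclasses import dataclass

@dataclass
class GlobalArgs:
    config: str   # e.g. "ecqa_llama_gptq.gin" in the gin-based setup (assumed field)
    seed: int = 0  # assumed field for illustration

def load_config(path: str) -> GlobalArgs:
    with open(path) as f:
        return GlobalArgs(**json.load(f))

# Usage: args = load_config("global_config.json")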
