
An Exasol extension for using state-of-the-art pretrained machine learning models via the Hugging Face Transformers API.

License: MIT License

Python 96.79% Shell 3.10% Dockerfile 0.11%
exasol exasol-integration huggingface-transformers transformers machine-learning data-science

transformers-extension's Introduction

Exasol Transformers Extension

An Exasol extension to use state-of-the-art pretrained machine learning models via the Transformers API.

This extension is built and tested on Linux and does not have Windows or macOS support. It might work regardless, but proceed at your own risk.

Table of Contents

Information for Users

Information for Contributors

transformers-extension's People

Contributors

ahsimb, dejanmihajlovic, marlenekress79789, tkilias, tomuben, umitbuyuksahin


transformers-extension's Issues

Add model counters to unit tests of prediction UDFs

  • In the prediction UDFs, the input dataframe is filtered based on the model_name, bucketfs_conn, and sub_dir features
  • This filtering is already implemented to prevent loading the same model multiple times in each batch iteration
  • To improve the tests in this direction, the test cases should be enhanced with model counter assertions (see the sketch below)

  • The same applies to the FillingMask UDF
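
A minimal sketch of such an assertion, counting calls to load_models (a method visible in the traceback further below); the fixtures and exact signatures are assumptions:

    from unittest.mock import patch

    # Sketch: rows sharing (model_name, bucketfs_conn, sub_dir) must trigger
    # only one model load. udf_under_test and batch_df are hypothetical test
    # fixtures; load_models/get_predictions_from_batch mirror the UDF code.
    def test_model_loaded_once_per_unique_model(udf_under_test, batch_df):
        with patch.object(udf_under_test, "load_models",
                          wraps=udf_under_test.load_models) as load_counter:
            udf_under_test.get_predictions_from_batch(batch_df)
        n_unique = batch_df[
            ["model_name", "bucketfs_conn", "sub_dir"]].drop_duplicates().shape[0]
        assert load_counter.call_count == n_unique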

ScriptDeployer Client Connection needs to be encrypted

Background

  • When running

    python -m exasol_transformers_extension.deploy scripts --dsn 127.0.0.1:9564 --db-user sys \
    --db-pass exasol --schema TRANSFORMERS \
    --language-alias transformers
    

    the following error message appears with Docker-DB 8.18.1:

    (screenshot of the error message)

    reported by @exa-eswar

  • The probable reason is an outdated pyexasol version or a misconfigured pyexasol connection
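
A minimal sketch of an encrypted pyexasol connection, reusing the DSN and credentials from the command above (pyexasol connections are unencrypted by default):

    import pyexasol

    # Sketch: request a TLS-encrypted client connection, which would
    # match the error reported against Docker-DB 8.x above.
    connection = pyexasol.connect(
        dsn="127.0.0.1:9564",
        user="sys",
        password="exasol",
        encryption=True,  # enable TLS for the client connection
    )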

Convert DownloaderUDF to SET UDF

  • The models to be downloaded can be given in a table
  • This table might include bucketfs_connection_name and model_name as columns
  • To scale the downloader UDF like that, we need to convert it to a SET UDF (see the sketch below)
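
A rough sketch of the SET variant using Exasol's Python UDF context API; the column names follow the proposal above, and the download helper is hypothetical:

    # Sketch of a SET UDF body: Exasol calls run(ctx) once per group, and
    # the UDF iterates over all input rows with ctx.next().
    def run(ctx):
        while True:
            # download_and_upload_model is a hypothetical helper that
            # fetches the model and stores it in BucketFS
            download_and_upload_model(ctx.model_name,
                                      ctx.bucketfs_connection_name)
            if not ctx.next():  # advance to the next input row
                break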

Support Transformers >= 4.22

Background

  • The directory structure into which pre-trained models are downloaded changed with transformers 4.22
  • While older versions downloaded the model files directly into the cache directory, as of version 4.22 transformers uses a more complex directory structure.
  • We need to upload this directory to the bucket as it was downloaded.
  • In order to complete this issue, the ticket exasol/bucketfs-python#1 must be done.

Acceptance Criteria

  • switch to the upload_directory method in bucketfs-utils
  • update the toml file with `transformers = "^4.22"`
    • pay attention that newer versions install nvidia modules, which increase the SLC size by ~2GB

Download models from private repositories

Background

Currently we can only download open models (without any authentication) from the Hugging Face Hub. We should also be able to download models from private repositories by passing an authentication token.
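
A sketch of how such a token could be passed to the transformers API; the repository name is a placeholder, and use_auth_token is the parameter name in the transformers 4.x line (later releases renamed it to token):

    from transformers import AutoModel, AutoTokenizer

    # Sketch: a Hugging Face access token lets from_pretrained fetch models
    # from private repositories. In the extension, the token would come from
    # a UDF parameter or connection object rather than a literal.
    token = "hf_..."  # placeholder, never hard-code real tokens
    model = AutoModel.from_pretrained("my-org/private-model",
                                      use_auth_token=token)
    tokenizer = AutoTokenizer.from_pretrained("my-org/private-model",
                                              use_auth_token=token)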

Handle errors in all operations

  • Errors might occur both during model loading and during labeling.
  • Some ideas for handling errors in all operations:
    1. The obvious one is raising an exception, but this means we lose all the work we might have already done.
    2. We introduce an error column in the output; if we get an error, we put it there and leave the answer columns empty (see the sketch below).
    3. We log them somewhere, either to stdout with output redirection, or we return the input tuple with empty answers.
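
A minimal sketch of option 2, assuming a pandas-based batch loop like the one in the prediction UDFs (function and column names are illustrative):

    import pandas as pd

    # Sketch: record per-row errors in an error column and leave the answer
    # empty, instead of aborting the whole batch on the first exception.
    def predict_batch(batch_df: pd.DataFrame, predict) -> pd.DataFrame:
        answers, errors = [], []
        for row in batch_df.itertuples(index=False):
            try:
                answers.append(predict(row))
                errors.append(None)
            except Exception as exc:  # record the error, keep going
                answers.append(None)
                errors.append(str(exc))
        return batch_df.assign(answer=answers, error_message=errors)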

Add manual setup description to User Guide

  • The user guide explains how to install the extension from the released artifacts
  • We need to add a manual setup description:
    • clone the repo
    • create a virtual environment
    • install via pip install
    • ..

Apply same API call across all prediction UDFs.

  • The prediction UDFs use the API provided by the transformers library.
  • The transformers library provides flexibility: a task can be performed either by calling the object of the related task class (e.g. AutoModelForQuestionAnswering) or via the pipeline method (e.g. pipeline("question-answering")).
  • For consistency, it would be better to use the same API call across all UDFs. Both styles are sketched below.
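
The two styles side by side; the model name is an example, not one prescribed by the ticket:

    from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                              pipeline)

    model_name = "deepset/roberta-base-squad2"  # example model

    # Style 1: task class plus tokenizer, called explicitly
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Style 2: the pipeline wrapper handles tokenization and post-processing
    qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
    result = qa(question="Where is Exasol based?",
                context="Exasol is based in Nuremberg.")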

Deploy Language Container in One Step

Background

  • In the current language container installation, we first download the artifacts from the GitHub repo and then have to run the deployment script.
  • The deployment script could include the download step, so that users are able to install the SLC in one step

Acceptance Criteria

  • Update the language container script to determine the artifact version and download it
  • Update the user guide; enhance the manual and scripted deployment sections

Fix release droid configuration

  • The integration test environment is not set up in the configuration
  • Because of that, the integration tests fail and the release could not succeed

Split SLC into parts to upload release artifacts

Background

  • The size of the script language container is 2250 MB, larger than the GitHub size limit of 2147 MB
  • This limitation prevents uploading the release artifacts to GitHub
  • As a workaround:
    • split the container into smaller parts
    • expect the user to merge them afterwards

Acceptance Criteria

  • split the container before uploading
  • upload only the split parts
  • state in the user guide how to merge them (see the sketch below)
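
A minimal sketch of splitting and rejoining such an artifact; the chunk size is an assumption chosen to stay below the limit quoted above:

    from pathlib import Path

    CHUNK = 2_000_000_000  # bytes per part, safely below the 2147 MB limit

    def split_file(path: Path) -> None:
        # write path.part0, path.part1, ... next to the original file
        with path.open("rb") as src:
            index = 0
            while data := src.read(CHUNK):
                path.with_name(f"{path.name}.part{index}").write_bytes(data)
                index += 1

    def join_parts(parts: list[Path], target: Path) -> None:
        # concatenate the parts in numeric order to rebuild the container
        ordered = sorted(parts, key=lambda p: int(p.suffix.removeprefix(".part")))
        with target.open("wb") as out:
            for part in ordered:
                out.write(part.read_bytes())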

Add parameter specifying GPU device

  • We need a parameter to specify the GPU device
  • The prediction models should be sent to this device
  • The default value should be CPU.
  • Example:
import torch

# pick the GPU named by the new UDF parameter if CUDA is available
device_name = ctx.gpu_device if torch.cuda.is_available() else "cpu"
device = torch.device(device_name)
# ... load the model ...
model.to(device)  # move the model to the selected device
# ... run predictions ...
del model  # free device memory

Update user guide

  • fix errors in the deployment script
  • improve the explanation of the sub_dir parameter

Create Token Classification UDF

Background

  • Token classification labels each token in a given text.
  • Common token classification tasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging.
  • The main purpose of NER is to classify tokens in a text, for example as dates, individuals, and places
  • The main purpose of PoS tagging is to label tokens in a text, for example as verbs, nouns, and punctuation marks.
  • The Transformers API provides the AutoModelForTokenClassification class to perform token classification

Acceptance Criteria

  • Create UDF for token classification
  • Add unit/integration tests
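
For reference, token classification through the pipeline API; the model name is an example:

    from transformers import pipeline

    # Sketch: returns one dict per recognized entity (word, label, score, ...)
    ner = pipeline("token-classification", model="dslim/bert-base-NER")
    print(ner("Exasol was founded in Nuremberg."))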

Add model downloader UDF

Background

  • In order to use the pre-trained models, we need to download them using the transformers API
  • The downloaded models should be stored in BucketFS to be used in prediction operations.

Acceptance Criteria

  • A UDF that
    • downloads the specified model into a tmp folder
    • uploads the downloaded model into BucketFS
  • The UDF should take the BucketFS connection, and the BucketFS path under which to store the model, as input parameters
  • Add unit and integration tests
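
A rough sketch of the two steps; upload_to_bucketfs is a hypothetical helper standing in for the BucketFS client, not an existing API:

    import tempfile
    from transformers import AutoModel, AutoTokenizer

    def download_and_upload(model_name: str, bucketfs_path: str) -> None:
        # Step 1: download model and tokenizer into a temporary folder
        with tempfile.TemporaryDirectory() as tmp_dir:
            AutoModel.from_pretrained(model_name, cache_dir=tmp_dir)
            AutoTokenizer.from_pretrained(model_name, cache_dir=tmp_dir)
            # Step 2: upload the folder to BucketFS (hypothetical helper)
            upload_to_bucketfs(tmp_dir, bucketfs_path)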

Remove setup.py

Background

  • With poetry 1.4.0, poetry build does not create the setup.py anymore: https://github.com/python-poetry/poetry/releases/tag/1.4.0
  • We currently use poetry build to generate the setup.py
  • However, the setup.py isn't needed with newer pip versions, or for releases to PyPI or as wheels
  • For that reason, we can remove the setup.py, and the githook that generates it, from this repo

Acceptance Criteria

  • Update workflows to poetry 1.4.0
  • Remove setup.py Github Workflow
  • Remove setup.py githook
  • Remove setup.py file

Add integration tests for private models

Background

  • The current tests work only with public models
  • To have tests for private models, we need a valid user authentication token
  • We can create a private test repo on the Hugging Face Hub so that we can perform integration tests for private models.

Acceptance Criteria

  • create a private model repo
  • generate a token and put it into the environment so that we can get it from GitHub secrets
  • implement the integration tests

Handle error in prediction of non-cached models

  • When we try to perform prediction with a non-cached model, we get the error below.
  • We should handle this error, and not try to download the model in the prediction UDFs (see the sketch after the traceback).

SQL Error [22002]: VM error: F-UDF-CL-LIB-1127: F-UDF-CL-SL-PYTHON-1002: F-UDF-CL-SL-PYTHON-1026: ExaUDFError: F-UDF-CL-SL-PYTHON-1114: Exception during run
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 601, in _get_config_dict
resolved_config_file = cached_path(
File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 282, in cached_path
output_path = get_from_cache(
File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 470, in get_from_cache
os.makedirs(cache_dir, exist_ok=True)
File "/usr/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/buckets/bfsdefault/default/container/SUB_DIR'

During handling of the above exception, another exception occurred:

TE_QUESTION_ANSWERING_UDF:8 run
File "/usr/local/lib/python3.8/dist-packages/exasol_transformers_extension/udfs/models/base_model_udf.py", line 41, in run
predictions_df = self.get_predictions_from_batch(batch_df)
File "/usr/local/lib/python3.8/dist-packages/exasol_transformers_extension/udfs/models/base_model_udf.py", line 55, in get_predictions_from_batch
for model_df in
File "/usr/local/lib/python3.8/dist-packages/exasol_transformers_extension/udfs/models/base_model_udf.py", line 90, in extract_unique_model_dataframes_from_batch
self.load_models(model_name)
File "/usr/local/lib/python3.8/dist-packages/exasol_transformers_extension/udfs/models/base_model_udf.py", line 127, in load_models
self.last_loaded_model = self.base_model.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 423, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/configuration_auto.py", line 680, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 553, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 641, in _get_config_dict
raise EnvironmentError(
OSError: Can't load config for 'WRONG_MODEL_NAME'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'WRONG_MODEL_NAME' is the correct path to a directory containing a config.json file
(Session: 1744397360052043776)
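
One way to prevent the implicit download attempt onto the read-only BucketFS mount is to load strictly from local files; a sketch, where the path argument is the resolved local model directory:

    from transformers import AutoModel

    def load_cached_model(bucketfs_model_path: str):
        # Sketch: local_files_only=True makes from_pretrained fail fast with
        # a clear error for a missing model instead of attempting a download
        # onto the read-only BucketFS mount.
        return AutoModel.from_pretrained(bucketfs_model_path,
                                         local_files_only=True)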

Create Text Generation UDF

Background

  • The main purpose of text generation is to create a coherent portion of text that continues the given context.
  • The Transformers API provides the AutoModelForCausalLM class to perform text generation

Acceptance Criteria

  • Create UDF for text generation
  • Add unit/integration tests
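
For reference, text generation through the pipeline API; gpt2 is just an example model:

    from transformers import pipeline

    # Sketch: continue the given context with generated text
    generator = pipeline("text-generation", model="gpt2")
    print(generator("Exasol is", max_length=20))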

Inherit prediction UDF classes from the same base class

Background

  • The prediction UDF classes share many methods
  • We should extract the common methods into an abstract class
  • Each UDF class should extend this abstract class
  • Note that this ticket should be dealt with once the implementation of all prediction classes is complete

Acceptance Criteria

[ ] Create abstract class
[ ] Extend UDFs from the abstract class
[ ] Split large methods into small functions
[ ] Rename UDF classes with UDF suffix
[ ] Remove warning in the shared method
[ ] Split get_batched_predictions into functions
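
A minimal sketch of the intended hierarchy; the method names echo the checklist and the traceback above, the details are assumptions:

    from abc import ABC, abstractmethod

    class BaseModelUDF(ABC):
        """Shared batch handling for all prediction UDFs (sketch)."""

        def run(self, ctx) -> None:
            # common loop: read batches, load each unique model once,
            # predict, emit the results
            ...

        @abstractmethod
        def extract_and_predict(self, model_df):
            """Task-specific prediction, implemented by each subclass."""

    class QuestionAnsweringUDF(BaseModelUDF):
        def extract_and_predict(self, model_df):
            ...  # question-answering specific logic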

Move build to AWS CodeBuild

We have a disk space problem.
The size of the SLC is ~2.1 GB, and we have around 4 copies of it:

  • the docker images we build
  • the exported container
  • the uploaded container in BucketFS
  • the extracted container in BucketFS

These copies lead to an OSError: [Errno 28] No space left on device.
In order to overcome this, we should move the build to AWS CodeBuild.

Acceptance Criteria

Fix CI Tests

  • The test test_language_container_deployer_cli_by_downloading_container leads to the following error: BucketFS: root path 'container/language_container' does not exist in bucket 'default' of bucketfs 'bfsdefault'
  • The test should be debugged; possible reasons:
    • the SLC's size exceeds the machine's disk size, or
    • the test does not work as expected.

Add ability to return topk results to Question Answering UDF

Background

  • The current Question Answering UDF returns only one result for each input.
  • It is possible to get multiple results by changing the topk parameter, which is set to 1 by default

Acceptance Criteria

  • Get the topk parameter as UDF input and update the UDF accordingly (see the sketch below)
  • Update unit/integration tests
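
A sketch of the underlying pipeline call; in the transformers 4.x line used here the call parameter is spelled topk (later releases renamed it to top_k), and the model is left at the pipeline default:

    from transformers import pipeline

    # Sketch: ask the question-answering pipeline for the k best answers
    qa = pipeline("question-answering")
    answers = qa(question="Where is Exasol based?",
                 context="Exasol is based in Nuremberg.",
                 topk=3)  # return the three best spans instead of one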

Initial Setup of the Project

  • add GitHub workflows
  • add githooks
  • a short README.rst
  • doc folder including changes folder, guides, etc.
  • poetry setup

Correct model filtering in prediction UDFs

  • Currently we filter the given data based only on model_name in all prediction UDFs
  • But different versions of each model can be stored under different bucketfs_conn connections and sub_dirs.
  • Therefore, model filtering should be extended to consider these details
  • Extend tests accordingly.

Example of such a filtering:

unique_values = batch_df[['model', 'bucketfs_conn', 'sub_dir']].drop_duplicates().values
for model, bucketfs_conn, sub_dir in unique_values:
    model_df = batch_df[(batch_df['model'] == model)
                        & (batch_df['bucketfs_conn'] == bucketfs_conn)
                        & (batch_df['sub_dir'] == sub_dir)]

Support for Zero-Shot Learning models

It would be great to have support for zero-shot learning models. Given a business case where I would like to understand the rough direction of a given dataset, using such models at scale would be highly helpful.
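
For reference, zero-shot classification through the pipeline API; the model name is an example:

    from transformers import pipeline

    # Sketch: assign user-defined labels without task-specific fine-tuning
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    print(classifier("The quarterly revenue grew by 12 percent.",
                     candidate_labels=["finance", "sports", "politics"]))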

Update torch version

Background

  • torch >= 1.13.0 requires
    • nvidia dependencies
    • some other dependencies, such as typing-extensions, opt-einsum, ..
  • These new dependencies increase the SLC size by ~2GB (from ~2.2GB to ~4.2GB)
  • CI tests fail due to the disk space problem.
  • Ticket #8 will resolve this problem.

Acceptance Criteria

  • pin the torch version to 1.11.0, which does not need the additional dependencies stated above.

Create Translation UDF

Background

  • Translation is the task of translating a text from one language to another.
  • The Transformers API provides the AutoModelForSeq2SeqLM class to perform translation

Acceptance Criteria

  • Create UDF for Translation
  • Support both multilingual and single-pair models
  • Add unit/integration tests
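
For reference, translation through the pipeline API; t5-small is an example of a multilingual model, while single-pair models (e.g. the Helsinki-NLP/opus-mt family) use the same interface:

    from transformers import pipeline

    # Sketch: translate English to German with an example model
    translator = pipeline("translation_en_to_de", model="t5-small")
    print(translator("How old are you?"))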

Setup FillingMask Pipeline once for each model

Background

  • The current implementation sets up a filling-mask pipeline for each top_k value of each model
  • The top_k parameter can be set in the call method of the pipeline
  • Thus, the setup call can be done once per model

Acceptance Criteria

  • set up the filling-mask pipeline once per model (see the sketch below)
  • update the tests accordingly
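
A sketch of the intended usage: one pipeline per model, with top_k supplied per call; the model name is an example:

    from transformers import pipeline

    # Sketch: create the fill-mask pipeline once, vary top_k per call
    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    print(unmasker("Exasol is a [MASK] database.", top_k=3))
    print(unmasker("Paris is the [MASK] of France.", top_k=5))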
