
An Exasol extension for using state-of-the-art pretrained machine learning models via the Hugging Face Transformers API.

License: MIT License

Python 96.79% Shell 3.10% Dockerfile 0.11%
exasol exasol-integration huggingface-transformers transformers machine-learning data-science

transformers-extension's Introduction

Exasol Transformers Extension

An Exasol extension to use state-of-the-art pretrained machine learning models via the Transformers API.

This extension is built and tested on Linux and does not have Windows or macOS support. It might work regardless, but proceed at your own risk.

Table of Contents

Information for Users

Information for Contributors

transformers-extension's People

Contributors

ahsimb, dejanmihajlovic, marlenekress79789, tkilias, tomuben, umitbuyuksahin


transformers-extension's Issues

Add model counters to unit tests of prediction UDFs

  • In the prediction UDFs, the input dataframe is filtered based on the model_name, bucketfs_conn, and sub_dir features
  • This filtering is already implemented to prevent loading the same model multiple times in each batch iteration
  • To improve the tests in this direction, the test cases should be enhanced with model counter assertions (see the sketch below)

  • The same applies to the FillingMask UDF
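
A minimal sketch of such an assertion, counting calls to load_models (a method visible in the traceback further below); the fixtures and exact signatures are assumptions:

    from unittest.mock import patch

    # Sketch: rows sharing (model_name, bucketfs_conn, sub_dir) must trigger
    # only one model load. udf_under_test and batch_df are hypothetical test
    # fixtures; load_models/get_predictions_from_batch mirror the UDF code.
    def test_model_loaded_once_per_unique_model(udf_under_test, batch_df):
        with patch.object(udf_under_test, "load_models",
                          wraps=udf_under_test.load_models) as load_counter:
            udf_under_test.get_predictions_from_batch(batch_df)
        n_unique = batch_df[
            ["model_name", "bucketfs_conn", "sub_dir"]].drop_duplicates().shape[0]
        assert load_counter.call_count == n_unique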

ScriptDeployer Client Connection needs to be encrypted

Background

  • When running

    python -m exasol_transformers_extension.deploy scripts --dsn 127.0.0.1:9564 --db-user sys \
    --db-pass exasol --schema TRANSFORMERS \
    --language-alias transformers
    

    the following error message appears with Docker-DB 8.18.1:

    (screenshot of the error message)

    reported by @exa-eswar

  • The probable reason is an outdated pyexasol version or a misconfigured pyexasol connection
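
A minimal sketch of an encrypted pyexasol connection, reusing the DSN and credentials from the command above (pyexasol connections are unencrypted by default):

    import pyexasol

    # Sketch: request a TLS-encrypted client connection, which would
    # match the error reported against Docker-DB 8.x above.
    connection = pyexasol.connect(
        dsn="127.0.0.1:9564",
        user="sys",
        password="exasol",
        encryption=True,  # enable TLS for the client connection
    )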

Convert DownloaderUDF to SET UDF

  • The models to be downloaded can be given in a table
  • This table might include bucketfs_connection_name and model_name as columns
  • To scale the downloader UDF like that, we need to convert it to a SET UDF (see the sketch below)
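
A rough sketch of the SET variant using Exasol's Python UDF context API; the column names follow the proposal above, and the download helper is hypothetical:

    # Sketch of a SET UDF body: Exasol calls run(ctx) once per group, and
    # the UDF iterates over all input rows with ctx.next().
    def run(ctx):
        while True:
            # download_and_upload_model is a hypothetical helper that
            # fetches the model and stores it in BucketFS
            download_and_upload_model(ctx.model_name,
                                      ctx.bucketfs_connection_name)
            if not ctx.next():  # advance to the next input row
                break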

Support Transformers >= 4.22

Background

  • The directory structure into which pre-trained models are downloaded changed with transformers 4.22
  • While older versions downloaded the model files directly into the cache directory, as of version 4.22 transformers uses a more complex directory structure.
  • We need to upload this directory to the bucket as it was downloaded.
  • In order to complete this issue, the ticket exasol/bucketfs-python#1 must be done.

Acceptance Criteria

  • switch to the upload_directory method in bucketfs-utils
  • update the toml file with `transformers = "^4.22"`
    • pay attention that newer versions install nvidia modules, which increase the SLC size by ~2GB

Download models from private repositories

Background

Currently we can only download open models (without any authentication) from the Hugging Face Hub. We should also be able to download models from private repositories by passing an authentication token.
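
A sketch of how such a token could be passed to the transformers API; the repository name is a placeholder, and use_auth_token is the parameter name in the transformers 4.x line (later releases renamed it to token):

    from transformers import AutoModel, AutoTokenizer

    # Sketch: a Hugging Face access token lets from_pretrained fetch models
    # from private repositories. In the extension, the token would come from
    # a UDF parameter or connection object rather than a literal.
    token = "hf_..."  # placeholder, never hard-code real tokens
    model = AutoModel.from_pretrained("my-org/private-model",
                                      use_auth_token=token)
    tokenizer = AutoTokenizer.from_pretrained("my-org/private-model",
                                              use_auth_token=token)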

Handle errors in all operations

  • Errors might occur both during model loading and during labeling.
  • Some ideas for handling errors in all operations:
    1. The obvious one is raising an exception, but this means we lose all the work we might have already done.
    2. We introduce an error column in the output; if we get an error, we put it there and leave the answer columns empty (see the sketch below).
    3. We log them somewhere, either to stdout with output redirection, or we return the input tuple with empty answers.
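
A minimal sketch of option 2, assuming a pandas-based batch loop like the one in the prediction UDFs (function and column names are illustrative):

    import pandas as pd

    # Sketch: record per-row errors in an error column and leave the answer
    # empty, instead of aborting the whole batch on the first exception.
    def predict_batch(batch_df: pd.DataFrame, predict) -> pd.DataFrame:
        answers, errors = [], []
        for row in batch_df.itertuples(index=False):
            try:
                answers.append(predict(row))
                errors.append(None)
            except Exception as exc:  # record the error, keep going
                answers.append(None)
                errors.append(str(exc))
        return batch_df.assign(answer=answers, error_message=errors)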

Add manual setup description to User Guide

  • The user guide explains how to install the extension from the released artifacts
  • We need to add a manual setup description:
    • clone the repo
    • create a virtual environment
    • install via pip install
    • ..

Apply same API call across all prediction UDFs.

  • The prediction UDFs use the API provided by the transformers library.
  • The transformers library provides flexibility: a task can be performed either by calling the object of the related task class (e.g. AutoModelForQuestionAnswering) or via the pipeline method (e.g. pipeline("question-answering")).
  • For consistency, it would be better to use the same API call across all UDFs. Both styles are sketched below.
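
The two styles side by side; the model name is an example, not one prescribed by the ticket:

    from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                              pipeline)

    model_name = "deepset/roberta-base-squad2"  # example model

    # Style 1: task class plus tokenizer, called explicitly
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Style 2: the pipeline wrapper handles tokenization and post-processing
    qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
    result = qa(question="Where is Exasol based?",
                context="Exasol is based in Nuremberg.")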

Deploy Language Container in One Step

Background

  • In the current language container installation, we first download the artifacts from the GitHub repo and then have to run the deployment script.
  • The deployment script could include the download step, so that users are able to install the SLC in one step

Acceptance Criteria

  • Update the language container script to determine the artifact version and download it
  • Update the user guide; enhance the manual and scripted deployment sections

Fix release droid configuration

  • The integration test environment is not set up in the configuration
  • Because of that, the integration tests fail and the release could not succeed

Split SLC into parts to upload release artifacts

Background

  • The size of the script language container is 2250 MB, larger than the GitHub size limit of 2147 MB
  • This limitation prevents uploading the release artifacts to GitHub
  • As a workaround:
    • split the container into smaller parts
    • expect the user to merge them afterwards

Acceptance Criteria

  • split the container before uploading
  • upload only the split parts
  • state in the user guide how to merge them (see the sketch below)
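
A minimal sketch of splitting and rejoining such an artifact; the chunk size is an assumption chosen to stay below the limit quoted above:

    from pathlib import Path

    CHUNK = 2_000_000_000  # bytes per part, safely below the 2147 MB limit

    def split_file(path: Path) -> None:
        # write path.part0, path.part1, ... next to the original file
        with path.open("rb") as src:
            index = 0
            while data := src.read(CHUNK):
                path.with_name(f"{path.name}.part{index}").write_bytes(data)
                index += 1

    def join_parts(parts: list[Path], target: Path) -> None:
        # concatenate the parts in numeric order to rebuild the container
        ordered = sorted(parts, key=lambda p: int(p.suffix.removeprefix(".part")))
        with target.open("wb") as out:
            for part in ordered:
                out.write(part.read_bytes())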

Add parameter specifying GPU device

  • We need a parameter to specify the GPU device
  • The prediction models should be sent to this device
  • The default value should be CPU.
  • Example:
import torch

# pick the GPU named by the new UDF parameter if CUDA is available
device_name = ctx.gpu_device if torch.cuda.is_available() else "cpu"
device = torch.device(device_name)
# ... load the model ...
model.to(device)  # move the model to the selected device
# ... run predictions ...
del model  # free device memory

Update user guide

  • fix errors in the deployment script
  • improve the explanation of the sub_dir parameter

Create Token Classification UDF

Background

  • Token classification labels each token in a given text.
  • Common token classification tasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging.
  • The main purpose of NER is to classify tokens in a text, for example as dates, individuals, and places
  • The main purpose of PoS tagging is to label tokens in a text, for example as verbs, nouns, and punctuation marks.
  • The Transformers API provides the AutoModelForTokenClassification class to perform token classification

Acceptance Criteria

  • Create UDF for token classification
  • Add unit/integration tests
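
For reference, token classification through the pipeline API; the model name is an example:

    from transformers import pipeline

    # Sketch: returns one dict per recognized entity (word, label, score, ...)
    ner = pipeline("token-classification", model="dslim/bert-base-NER")
    print(ner("Exasol was founded in Nuremberg."))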

Add model downloader UDF

Background

  • In order to use the pre-trained models, we need to download them using the transformers API
  • The downloaded models should be stored in BucketFS to be used in prediction operations.

Acceptance Criteria

  • A UDF that
    • downloads the specified model into a tmp folder
    • uploads the downloaded model into BucketFS
  • The UDF should take the BucketFS connection, and the BucketFS path under which to store the model, as input parameters
  • Add unit and integration tests
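
A rough sketch of the two steps; upload_to_bucketfs is a hypothetical helper standing in for the BucketFS client, not an existing API:

    import tempfile
    from transformers import AutoModel, AutoTokenizer

    def download_and_upload(model_name: str, bucketfs_path: str) -> None:
        # Step 1: download model and tokenizer into a temporary folder
        with tempfile.TemporaryDirectory() as tmp_dir:
            AutoModel.from_pretrained(model_name, cache_dir=tmp_dir)
            AutoTokenizer.from_pretrained(model_name, cache_dir=tmp_dir)
            # Step 2: upload the folder to BucketFS (hypothetical helper)
            upload_to_bucketfs(tmp_dir, bucketfs_path)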

Remove setup.py

Background

  • With poetry 1.4.0, poetry build does not create the setup.py anymore: https://github.com/python-poetry/poetry/releases/tag/1.4.0
  • We currently use poetry build to generate the setup.py
  • However, the setup.py isn't needed with newer pip versions, or for releases to PyPI or as wheels
  • For that reason, we can remove the setup.py, and the githook that generates it, from this repo

Acceptance Criteria

  • Update workflows to poetry 1.4.0
  • Remove setup.py Github Workflow
  • Remove setup.py githook
  • Remove setup.py file

Add integration tests for private models

Background

  • The current tests work only with public models
  • To have tests for private models, we need a valid user authentication token
  • We can create a private test repo on the Hugging Face Hub so that we can perform integration tests for private models.

Acceptance Criteria

  • create a private model repo
  • generate a token and put it into the environment so that we can get it from GitHub secrets
  • implement the integration tests

Handle error in prediction of non-cached models

  • When we try to perform prediction with a non-cached model, we get the error below.
  • We should handle this error, and not try to download the model in the prediction UDFs (see the sketch after the traceback).

SQL Error [22002]: VM error: F-UDF-CL-LIB-1127: F-UDF-CL-SL-PYTHON-1002: F-UDF-CL-SL-PYTHON-1026: ExaUDFError: F-UDF-CL-SL-PYTHON-1114: Exception during run
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 601, in _get_config_dict
resolved_config_file = cached_path(
File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 282, in cached_path
output_path = get_from_cache(
File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 470, in get_from_cache
os.makedirs(cache_dir, exist_ok=True)
File "/usr/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/buckets/bfsdefault/default/container/SUB_DIR'

During handling of the above exception, another exception occurred:

TE_QUESTION_ANSWERING_UDF:8 run
File "/usr/local/lib/python3.8/dist-packages/exasol_transformers_extension/udfs/models/base_model_udf.py", line 41, in run
predictions_df = self.get_predictions_from_batch(batch_df)
File "/usr/local/lib/python3.8/dist-packages/exasol_transformers_extension/udfs/models/base_model_udf.py", line 55, in get_predictions_from_batch
for model_df in
File "/usr/local/lib/python3.8/dist-packages/exasol_transformers_extension/udfs/models/base_model_udf.py", line 90, in extract_unique_model_dataframes_from_batch
self.load_models(model_name)
File "/usr/local/lib/python3.8/dist-packages/exasol_transformers_extension/udfs/models/base_model_udf.py", line 127, in load_models
self.last_loaded_model = self.base_model.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 423, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/configuration_auto.py", line 680, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 553, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 641, in _get_config_dict
raise EnvironmentError(
OSError: Can't load config for 'WRONG_MODEL_NAME'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'WRONG_MODEL_NAME' is the correct path to a directory containing a config.json file
(Session: 1744397360052043776)
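
One way to prevent the implicit download attempt onto the read-only BucketFS mount is to load strictly from local files; a sketch, where the path argument is the resolved local model directory:

    from transformers import AutoModel

    def load_cached_model(bucketfs_model_path: str):
        # Sketch: local_files_only=True makes from_pretrained fail fast with
        # a clear error for a missing model instead of attempting a download
        # onto the read-only BucketFS mount.
        return AutoModel.from_pretrained(bucketfs_model_path,
                                         local_files_only=True)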

Create Text Generation UDF

Background

  • The main purpose of text generation is to create a coherent portion of text that continues the given context.
  • The Transformers API provides the AutoModelForCausalLM class to perform text generation

Acceptance Criteria

  • Create UDF for text generation
  • Add unit/integration tests
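
For reference, text generation through the pipeline API; gpt2 is just an example model:

    from transformers import pipeline

    # Sketch: continue the given context with generated text
    generator = pipeline("text-generation", model="gpt2")
    print(generator("Exasol is", max_length=20))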

Inherit prediction UDF classes from the same base class

Background

  • The prediction UDF classes share many methods
  • We should extract the common methods into an abstract class
  • Each UDF class should extend this abstract class
  • Note that this ticket should be dealt with once the implementation of all prediction classes is complete

Acceptance Criteria

[ ] Create abstract class
[ ] Extend UDFs from the abstract class
[ ] Split large methods into small functions
[ ] Rename UDF classes with UDF suffix
[ ] Remove warning in the shared method
[ ] Split get_batched_predictions into functions
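
A minimal sketch of the intended hierarchy; the method names echo the checklist and the traceback above, the details are assumptions:

    from abc import ABC, abstractmethod

    class BaseModelUDF(ABC):
        """Shared batch handling for all prediction UDFs (sketch)."""

        def run(self, ctx) -> None:
            # common loop: read batches, load each unique model once,
            # predict, emit the results
            ...

        @abstractmethod
        def extract_and_predict(self, model_df):
            """Task-specific prediction, implemented by each subclass."""

    class QuestionAnsweringUDF(BaseModelUDF):
        def extract_and_predict(self, model_df):
            ...  # question-answering specific logic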

Move build to AWS CodeBuild

We have a disk space problem.
The size of the SLC is ~2.1 GB, and we have around 4 copies of it:

  • the docker images we build
  • the exported container
  • the uploaded container in BucketFS
  • the extracted container in BucketFS

These copies lead to an OSError: [Errno 28] No space left on device.
In order to overcome this, we should move the build to AWS CodeBuild.

Acceptance Criteria

Fix CI Tests

  • The test test_language_container_deployer_cli_by_downloading_container leads to the following error: BucketFS: root path 'container/language_container' does not exist in bucket 'default' of bucketfs 'bfsdefault'
  • The test should be debugged; possible reasons:
    • the SLC's size exceeds the machine's disk size, or
    • the test does not work as expected.

Add ability to return topk results to Question Answering UDF

Background

  • The current Question Answering UDF returns only one result for each input.
  • It is possible to get multiple results by changing the topk parameter, which is set to 1 by default

Acceptance Criteria

  • Get the topk parameter as UDF input and update the UDF accordingly (see the sketch below)
  • Update unit/integration tests
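
A sketch of the underlying pipeline call; in the transformers 4.x line used here the call parameter is spelled topk (later releases renamed it to top_k), and the model is left at the pipeline default:

    from transformers import pipeline

    # Sketch: ask the question-answering pipeline for the k best answers
    qa = pipeline("question-answering")
    answers = qa(question="Where is Exasol based?",
                 context="Exasol is based in Nuremberg.",
                 topk=3)  # return the three best spans instead of one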

Initial Setup of the Project

  • add GitHub workflows
  • add githooks
  • a short README.rst
  • doc folder including changes folder, guides, etc.
  • poetry setup

Correct model filtering in prediction UDFs

  • Currently we filter the given data based only on model_name in all prediction UDFs
  • But different versions of each model can be stored under different bucketfs_conn connections and sub_dirs.
  • Therefore, model filtering should be extended to consider these details
  • Extend tests accordingly.

Example of such a filtering:

unique_values = batch_df[['model', 'bucketfs_conn', 'sub_dir']].drop_duplicates().values
for model, bucketfs_conn, sub_dir in unique_values:
    model_df = batch_df[(batch_df['model'] == model)
                        & (batch_df['bucketfs_conn'] == bucketfs_conn)
                        & (batch_df['sub_dir'] == sub_dir)]

Support for Zero-Shot Learning models

It would be great to have support for zero-shot learning models. Given a business case where I would like to understand the rough direction of a given dataset, using such models at scale would be highly helpful.
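
For reference, zero-shot classification through the pipeline API; the model name is an example:

    from transformers import pipeline

    # Sketch: assign user-defined labels without task-specific fine-tuning
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    print(classifier("The quarterly revenue grew by 12 percent.",
                     candidate_labels=["finance", "sports", "politics"]))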

Update torch version

Background

  • torch >= 1.13.0 requires
    • nvidia dependencies
    • some other dependencies, such as typing-extensions, opt-einsum, ..
  • These new dependencies increase the SLC size by ~2GB (from ~2.2GB to ~4.2GB)
  • CI tests fail due to the disk space problem.
  • Ticket #8 will resolve this problem.

Acceptance Criteria

  • pin the torch version to 1.11.0, which does not need the additional dependencies stated above.

Create Translation UDF

Background

  • Translation is the task of translating a text from one language to another.
  • The Transformers API provides the AutoModelForSeq2SeqLM class to perform translation

Acceptance Criteria

  • Create UDF for Translation
  • Support both multilingual and single-pair models
  • Add unit/integration tests
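
For reference, translation through the pipeline API; t5-small is an example of a multilingual model, while single-pair models (e.g. the Helsinki-NLP/opus-mt family) use the same interface:

    from transformers import pipeline

    # Sketch: translate English to German with an example model
    translator = pipeline("translation_en_to_de", model="t5-small")
    print(translator("How old are you?"))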

Setup FillingMask Pipeline once for each model

Background

  • The current implementation sets up a filling-mask pipeline for each top_k value of each model
  • The top_k parameter can be set in the call method of the pipeline
  • Thus, the setup call can be done once per model

Acceptance Criteria

  • set up the filling-mask pipeline once per model (see the sketch below)
  • update the tests accordingly
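
A sketch of the intended usage: one pipeline per model, with top_k supplied per call; the model name is an example:

    from transformers import pipeline

    # Sketch: create the fill-mask pipeline once, vary top_k per call
    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    print(unmasker("Exasol is a [MASK] database.", top_k=3))
    print(unmasker("Paris is the [MASK] of France.", top_k=5))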
