Heron

A Library for Vision / Video and Language models

Welcome to the heron repository. Heron is a library that integrates multiple Vision and Language (V&L) models as well as Video and Language models. One of its distinguishing features is support for Japanese V&L models. We also provide pretrained weights trained on various datasets.

Multimodal demo pages built with different LLMs are also available (both in Japanese).

Heron lets you build your own V&L models by combining various modules. The Vision Encoder, Adapter, and LLM are all set in the configuration file, as are the distributed training method and the datasets used for training.

Installation

1. Clone this repository

git clone https://github.com/turingmotors/heron.git
cd heron

2. Install Packages

We recommend installing the required packages in a virtual environment. If you want to install the packages globally, use pip install -r requirements.txt instead.

2-a. Poetry (Recommended)

Using pyenv and Poetry, you can install the required packages as follows:

# install pyenv environment
pyenv install 3.10
pyenv local 3.10

# install packages from pyproject.toml
poetry install

# install local package
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# for development, install pre-commit
pre-commit install

2-b. Anaconda

Using Anaconda, you can install the required packages as follows:

conda create -n heron python=3.10 -y
conda activate heron
pip install --upgrade pip  # enable PEP 660 support

pip install -r requirements.txt
pip install -e .

# for development, install pre-commit
pre-commit install
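
After either installation path, a quick sanity check can confirm that the interpreter and the local package are in place. The following is a minimal sketch, not part of heron: the function name `check_environment` is hypothetical, and treating Python 3.10 as a minimum version (rather than an exact requirement) is an assumption.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 10), package="heron"):
    """Report whether the interpreter and the editable install look usable.

    min_version is an assumption based on the Python 3.10 used in the
    installation steps above.
    """
    return {
        # Compare only (major, minor) against the expected version.
        "python_ok": sys.version_info[:2] >= min_version,
        # find_spec returns None when the top-level package cannot be found.
        "package_installed": importlib.util.find_spec(package) is not None,
    }


print(check_environment())
```

If `package_installed` is reported as false, rerunning pip install -e . inside the activated environment is the first thing to try.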

3. Register for Llama-2 models

To use Llama-2 models, you need to register for them. First, request access to the Llama-2 models on both the HuggingFace page and the Meta website.

Then sign in to your HuggingFace account:

huggingface-cli login
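
In non-interactive settings such as training scripts or CI jobs, the same login can be done from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and the access token is stored in the `HF_TOKEN` environment variable. The helper names `resolve_token` and `login_from_env` are hypothetical and not part of heron.

```python
import os


def resolve_token(env=None, var_name="HF_TOKEN"):
    """Return the access token stored in an environment mapping.

    var_name is an assumption; adjust it to wherever you keep the token.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"{var_name} is not set; create an access token first")
    return token


def login_from_env():
    """Log in to the Hugging Face Hub without an interactive prompt."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import login

    login(token=resolve_token())
```

Calling `login_from_env()` once at the start of a job should have the same effect as the CLI login, provided access to the Llama-2 models has already been granted to the account that owns the token.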

Training

For training, use the YAML configuration files under the projects directory.
For example, [projects/opt/exp001.yml](./projects/opt/exp001.yml) has the following contents:

training_config:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 4
  num_train_epochs: 1
  dataloader_num_workers: 16
  fp16: true
  optim: "adamw_torch"
  learning_rate: 5.0e-5
  logging_steps: 100
  evaluation_strategy: "steps"
  save_strategy: "steps"
  eval_steps: 4000
  save_steps: 4000
  save_total_limit: 1
  deepspeed: ./configs/deepspeed/ds_config_zero1.json
  output_dir: ./output/
  report_to: "wandb"

model_config:
  fp16: true
  pretrained_path: # None or path to model weight
  model_type: git_llm
  language_model_name: facebook/opt-350m
  vision_model_name: openai/clip-vit-base-patch16
  num_image_with_embedding: 1 # if 1, no img_temporal_embedding
  max_length: 512
  keys_to_finetune:
    - visual_projection
    - num_image_with_embedding
  keys_to_freeze: []

  use_lora: true
  lora:
    r: 8
    lora_alpha: 32
    target_modules:
      - q_proj
      - k_proj
      - v_proj
    lora_dropout: 0.01
    bias: none
    task_type: CAUSAL_LM

dataset_config_path:
  - ./configs/datasets/m3it.yaml
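
To get a feel for what the lora block in this configuration implies, the sketch below estimates the LoRA scaling factor (lora_alpha / r) and the number of trainable adapter parameters. It is a back-of-the-envelope calculation, not heron code: the function name is hypothetical, and the hidden size (1024) and layer count (24) used for facebook/opt-350m are assumptions that should be checked against the model's own configuration.

```python
def lora_summary(r, lora_alpha, target_modules, hidden_size, num_layers):
    """Estimate the LoRA scaling factor and trainable parameter count.

    Assumes every target module is a square hidden_size x hidden_size
    projection, as is typical for attention q/k/v projections.
    """
    # Each adapted projection adds A (r x d_in) and B (d_out x r).
    params_per_module = r * (hidden_size + hidden_size)
    total = params_per_module * len(target_modules) * num_layers
    return {"scaling": lora_alpha / r, "trainable_params": total}


summary = lora_summary(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    hidden_size=1024,  # assumed for facebook/opt-350m
    num_layers=24,     # assumed for facebook/opt-350m
)
print(summary)  # {'scaling': 4.0, 'trainable_params': 1179648}
```

The count covers only the LoRA matrices. Parameters matched by keys_to_finetune are trained in addition to these.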

training_config sets the training configuration, model_config sets the model configuration, and dataset_config_path lists the dataset configuration files.
model_type selects the LLM module to use. We plan to add more supported modules in the future.
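
The batch-related values in training_config combine into an effective batch size per optimizer step. The sketch below illustrates the arithmetic, assuming plain data-parallel training in which every GPU processes per_device_train_batch_size samples per forward pass. The GPU count and dataset size are hypothetical values chosen for illustration, and the step count is approximate because the trainer may round partial batches differently.

```python
import math

# Values copied from the training_config example above.
training_config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 1,
}


def effective_batch_size(cfg, num_gpus):
    """Samples consumed per optimizer step across all GPUs."""
    return (
        cfg["per_device_train_batch_size"]
        * cfg["gradient_accumulation_steps"]
        * num_gpus
    )


def optimizer_steps(cfg, num_gpus, num_samples):
    """Approximate number of optimizer steps for the whole run."""
    per_epoch = math.ceil(num_samples / effective_batch_size(cfg, num_gpus))
    return per_epoch * cfg["num_train_epochs"]


# num_gpus and num_samples are hypothetical.
print(effective_batch_size(training_config, num_gpus=8))  # 64
print(optimizer_steps(training_config, num_gpus=8, num_samples=1_000_000))  # 15625
```

Comparing the estimated step count with eval_steps and save_steps helps confirm that evaluation and checkpointing will actually trigger during a run.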

To start training, execute the following command:

./scripts/run.sh

A GPU is required for training; we have tested on Ubuntu 20.04 with CUDA 11.7.

Evaluation

You can get the pretrained weights from the HuggingFace Hub: turing-motors/heron-chat-git-ja-stablelm-base-7b-v0
See also the notebooks.

import requests
from PIL import Image

import torch
from transformers import AutoProcessor
from heron.models.git_llm.git_llama import GitLlamaForCausalLM

device_id = 0

# prepare a pretrained model
model = GitLlamaForCausalLM.from_pretrained('turing-motors/heron-chat-git-ja-stablelm-base-7b-v0')
model.eval()
model.to(f"cuda:{device_id}")

# prepare a processor
processor = AutoProcessor.from_pretrained('turing-motors/heron-chat-git-ja-stablelm-base-7b-v0')

# prepare inputs
url = "https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# plain string: the prompt contains no placeholders, so no f-string is needed
text = "##Instruction: Please answer the following question concretely. ##Question: What is unusual about this image? Explain precisely and concretely what he is doing? ##Answer: "

# do preprocessing
inputs = processor(
    text,
    image,
    return_tensors="pt",
    truncation=True,
)
inputs = {k: v.to(f"cuda:{device_id}") for k, v in inputs.items()}

# set eos token
eos_token_id_list = [
    processor.tokenizer.pad_token_id,
    processor.tokenizer.eos_token_id,
]

# do inference
with torch.no_grad():
    out = model.generate(**inputs, max_length=256, do_sample=False, temperature=0., eos_token_id=eos_token_id_list)

# print result
print(processor.tokenizer.batch_decode(out))
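
The prompt in the example follows a fixed "##Instruction: ... ##Question: ... ##Answer: " layout, and the decoded output contains the prompt followed by the generated answer. The helpers below are a small sketch for building such prompts and pulling the answer out of the decoded text. The function names and the example end-of-sequence strings are hypothetical, and the template is inferred only from the single prompt shown above.

```python
ANSWER_MARKER = "##Answer: "
DEFAULT_INSTRUCTION = "Please answer the following question concretely."


def build_prompt(question, instruction=DEFAULT_INSTRUCTION):
    """Assemble a prompt in the layout used by the example above."""
    return f"##Instruction: {instruction} ##Question: {question} {ANSWER_MARKER}"


def extract_answer(decoded, eos_tokens=()):
    """Return the text after the last answer marker, cut at any EOS string."""
    # If the marker is missing, rpartition leaves the whole text as the answer.
    _, _, answer = decoded.rpartition(ANSWER_MARKER.strip())
    for tok in eos_tokens:
        if tok:  # skip empty strings, which str.split rejects
            answer = answer.split(tok)[0]
    return answer.strip()


prompt = build_prompt("What is unusual about this image?")
# A made-up decoded string, used only to exercise the helper.
decoded = prompt + "A man is ironing on a car.</s><pad>"
print(extract_answer(decoded, eos_tokens=["</s>", "<pad>"]))
```

In the evaluation script, `decoded` would be one element of the list returned by `processor.tokenizer.batch_decode(out)`.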

Pretrained Models

| model | LLM module | adapter | size |
|:---|:---|:---|:---|
| heron-chat-blip-ja-stablelm-base-7b-v0 | Japanese StableLM Base Alpha | BLIP | 7B |
| heron-chat-git-ja-stablelm-base-7b-v0 | Japanese StableLM Base Alpha | GIT | 7B |
| heron-chat-git-ELYZA-fast-7b-v0 | ELYZA | GIT | 7B |
| heron-preliminary-git-Llama-2-70b-v0 *1 | Llama-2 | GIT | 70B |

*1 This model only applies to pre-training of adapters.

Datasets

The LLaVA-Instruct dataset translated into Japanese:
LLaVA-Instruct-150K-JA

Organization

Turing Inc.

License

Released under the Apache License 2.0.
