robin's Issues

Pretrained projector checkpoint

Hi! Thanks for your great work!
Could you kindly share the mm_projector.bin file from your pretrained checkpoints for the cerc-aai/mistral-7b-oh-siglip-so400m-finetune-lora model?

Change config style in file/code

Add the option to run with a config file instead of passing all the arguments in the launch script (keep both options possible).
Try removing if statements that check for things like model_path, model_base, and model_name; move these to the config. (Linked to issues #18 and #26.)
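
A rough sketch of how the config file could work (the --config flag, the YAML format, and the merge behavior are assumptions, not existing code):

```python
# Hypothetical sketch: a YAML config file supplies defaults, while CLI
# arguments keep working and override the config. Flag names are placeholders.
import argparse

import yaml  # PyYAML

parser = argparse.ArgumentParser()
parser.add_argument("--config", default=None, help="optional YAML config file")
parser.add_argument("--model_path", default=None)
parser.add_argument("--model_base", default=None)
args = parser.parse_args()

if args.config is not None:
    with open(args.config) as f:
        config = yaml.safe_load(f) or {}
    # CLI values take precedence; the config only fills in unset arguments.
    for key, value in config.items():
        if getattr(args, key, None) is None:
            setattr(args, key, value)
```

This would also make most of the model_path/model_base/model_name if statements unnecessary, since the resolved values would live in one place.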

Add VE-Type argument

Add an argument that can be passed to deepspeed in the pretrain and finetune launch scripts; this argument will be used to pick which VE (vision encoder) class to use.
Types needed: CLIP, OCLIP, timm.
For OCLIP models it is also necessary to pass the VE's config to enable local models (see load_model in open_clip.py) and to indicate whether the model uses a timm backbone inside OpenCLIP. A possible dispatcher is sketched after the objectives list below.
Working directory: robin/model/multimodal_encoder/
Additional objectives:

  1. remove the encoder_info.py file
  2. remove all conditions with model_name == ...
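
A rough sketch of what the dispatch could look like (the function name and how robin would wire it in are assumptions; the open_clip, timm, and transformers calls themselves are real library APIs):

```python
# Hypothetical sketch of a --ve_type dispatcher; only the library calls are
# real, the wiring into robin is an assumption.
import open_clip
import timm
from transformers import CLIPVisionModel


def build_vision_encoder(ve_type: str, model_name: str, ve_config: str | None = None):
    if ve_type == "clip":
        return CLIPVisionModel.from_pretrained(model_name)
    if ve_type == "oclip":
        # OpenCLIP needs the extra config/pretrained tag to support local
        # checkpoints; some OpenCLIP models wrap a timm backbone internally.
        model, _, _ = open_clip.create_model_and_transforms(
            model_name, pretrained=ve_config
        )
        return model.visual
    if ve_type == "timm":
        return timm.create_model(model_name, pretrained=True, num_classes=0)
    raise ValueError(f"unknown ve_type: {ve_type!r}")
```

With a dispatcher like this, encoder_info.py and the model_name == ... conditions become unnecessary.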

Improve printing quality

Make rank0_print from train.py available in all the training files, and replace all plain print calls with rank0_print.
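
For reference, a minimal version of such a helper (assuming torch.distributed is the backend deepspeed initializes; the actual train.py may gate on a local-rank variable instead):

```python
import torch.distributed as dist


def rank0_print(*args, **kwargs):
    """Print only on the global rank-0 process to avoid N copies of every log line."""
    if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
        print(*args, **kwargs)
```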

Documentation

A major documentation overhaul will be needed before the next release.

Citation

Hello there!
Thanks for the awesome work, which I am currently using in my own work. I would like to cite your repo. How would you prefer I do this?

Evaluation Results of Different Model Configurations

I've conducted additional evaluations for the various model configurations available in the Robin repository. My intention is to provide these results to the community for further insights and potential improvements.

Methodology:
The evaluations were conducted using three benchmarks: MM-Vet, SEED-Benchv1, and MMBench. Below is a summary of the models evaluated along with their corresponding results, obtained using the original LLaVA evaluation scripts:

| Model Name | Image Model | Text Model | MM-Vet | SEED-Benchv1 | MMBench |
|---|---|---|---|---|---|
| liuhaotian/llava-v1.5-7b | CLIP-ViT-L/14 336 | lmsys/vicuna-7b-v1.5 | 31.1 | 58.60 | 64.3 |
| liuhaotian/llava-v1-7b | CLIP-ViT-L/14 336 | lmsys/vicuna-7b-v1.3 | 28.1 | 33.52 | 59.2 |
| liuhaotian/llava-v1.5-7b | CLIP-ViT-L/14 336 | meta-llama/Llama-2-7b-chat-hf | 30.1 | 54.68 | 56.78 |
| agi-collective/mistral-7b-siglip-so400m-finetune-lora | SigLIP-ViT-L/14 384 | mistralai/Mistral-7B-v0.1 | 25.7 | 53.33 | 57.47 |
| agi-collective/mistral-7b-oh-siglip-so400m-frozen-ve-finetune-lora | SigLIP-ViT-L/14 384 | teknium/OpenHermes-2.5-Mistral-7B | 35.8 | 57.39 | 63.8 |

Observations and Considerations:

  • The results vary across different benchmarks and model configurations.
  • It's important to note that while some numbers are lower, this does not necessarily imply inferior model performance; at this level, evaluations can be quite subjective.
  • Users and developers are encouraged to interact with the models directly for a more comprehensive understanding of their capabilities and characteristics.
  • Further exploration and fine-tuning might be beneficial for certain model configurations.
  • Community feedback on these configurations can be valuable for future improvements.

I hope these results are helpful for the ongoing development and refinement of the models in the Robin repository. Your work in creating and maintaining these models is highly appreciated by the community.

Thank you for your dedication to advancing the field of AI.

🤗 transformers compatibility

Hi there!
Great work on training the model and the release!

Llava models should be compatible with the HF transformers library; we recently added Llava 1.5, BakLlava, and Vip-llava as well. If the architecture is exactly the same as Llava, with minor changes (a different LM and a different CLIP encoder), the integration should work out of the box using this conversion script: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava/convert_llava_weights_to_hf.py
You just need to push or save the raw state_dict of your model.
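
For reference, a rough sketch of what that could look like (load_pretrained_model is assumed to mirror the LLaVA loader; paths and names are placeholders):

```python
# Hypothetical sketch: load a finetuned Robin checkpoint and dump the raw
# state_dict in the layout the HF conversion script expects.
import torch

from robin.model.builder import load_pretrained_model  # assumed loader

tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path="agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora",
    model_base="teknium/OpenHermes-2.5-Mistral-7B",
    model_name="mistral-7b-oh-siglip-so400m-finetune-lora",
)
torch.save(model.state_dict(), "pytorch_model.bin")
```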

Here are some examples of converted models: https://huggingface.co/llava-hf

I can also help if needed!
Thanks again and looking forward to hearing from you

multiple images per prompt

Hello there!

Is it possible to prompt with multiple images? If so, it's not obvious to me how this can be done without altering your code. For example:

```python
from robin.serve.pipeline import LlavaMistralPipeline

pipe = LlavaMistralPipeline(
    model_path="agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora",
    model_base="teknium/OpenHermes-2.5-Mistral-7B",
)

messages = [
    {
        "role": "USER",
        "content": "compare and contrast these images",
        "image": ["path/to/img/1.jpg", "path/to/img/2.jpg"],
    },
]
messages = pipe(messages)
```

I can make a PR if this has not been implemented yet :)

Thank you!

Convert checkpoint to usable model

Write a script (Python and/or bash) to convert an intermediate finetuning checkpoint to the final model format, ready for upload to HF and for evaluation.
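
A rough sketch of the conversion, assuming the checkpoint is a PEFT/LoRA adapter on top of a known base model (paths and model names are placeholders):

```python
# Hypothetical sketch: merge a LoRA finetuning checkpoint into its base model
# and save the result in HF format for upload and evaluation.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
model = PeftModel.from_pretrained(base, "checkpoints/checkpoint-1000")
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights

model.save_pretrained("robin-final")
AutoTokenizer.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B"
).save_pretrained("robin-final")
```

The real script would likely also need to handle the mm_projector weights, which a plain LoRA merge does not cover.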

Improve OCR capabilities

Experiment with different architectures/data to improve OCR capabilities

  • examine OCR failure cases in benchmarks
  • come up with theory on why things are failing
  • devise solution
  • implement solution

Create an end-user model class

Create a Python class that is the only thing the end user should interact with.

Creating an instance of this class should look like: model = Robin('agi-collective/[model_name]')

The base LLM, VE type, etc. can be found in the model's config.json.

There should be optional arguments like temperature, conversation type, etc.

There should be a .predict('[text]', '[image_path/URL]') function.

Any other standard or useful arguments are welcome!

The forward pass of training and LlavaMistralPipeline can be good inspiration.

The aim is to simplify the overly complex serve folder, remove all the redundant code in cli, pipeline, etc., do the same in eval, and end up with a simple Robin.predict([text], [image]) call. A skeleton of what this class could look like is sketched below.
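
A possible skeleton, purely as a sketch (constructor arguments, config keys, and internals are assumptions about the eventual design, not existing code):

```python
# Hypothetical skeleton of the end-user class described above.
class Robin:
    def __init__(self, model_path: str, temperature: float = 0.2,
                 conv_mode: str | None = None):
        # The base LLM, VE type, etc. would be read from the checkpoint's
        # config.json instead of being passed explicitly.
        self.model_path = model_path
        self.temperature = temperature
        self.conv_mode = conv_mode
        self._pipeline = None  # built lazily from model_path's config

    def predict(self, text: str, image: str) -> str:
        """Answer one text query about one image (local path or URL)."""
        messages = [{"role": "USER", "content": text, "image": image}]
        return self._run(messages)

    def _run(self, messages) -> str:
        # Would wrap the training forward pass / LlavaMistralPipeline logic.
        raise NotImplementedError


# Intended usage:
# model = Robin("agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora")
# print(model.predict("What is in this image?", "path/to/img.jpg"))
```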

Video understanding

Do a literature review on different methods that can be used to implement video understanding and implement a PoC

Eval download scripts

Create scripts that automatically download all of the data for each benchmark.

Issues running on Nvidia A10

I'm trying to run this on an Ubuntu 20.04 instance with an NVIDIA A10 GPU and am getting this error:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/aiworker/monorepo/scripts/agi_llava.py", line 1, in <module>
    from robin.serve.pipeline import LlavaMistralPipeline
  File "/home/aiworker/miniconda/envs/new1/lib/python3.10/site-packages/robin/__init__.py", line 1, in <module>
    from .model import LlavaLlamaForCausalLM
  File "/home/aiworker/miniconda/envs/new1/lib/python3.10/site-packages/robin/model/__init__.py", line 1, in <module>
    from .language_model.llava_llama import LlavaLlamaForCausalLM, LlavaConfig
  File "/home/aiworker/miniconda/envs/new1/lib/python3.10/site-packages/robin/model/language_model/llava_llama.py", line 22, in <module>
    from transformers import AutoConfig, AutoModelForCausalLM, \
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/home/aiworker/miniconda/envs/new1/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1336, in __getattr__
    value = getattr(module, name)
  File "/home/aiworker/miniconda/envs/new1/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1335, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/aiworker/miniconda/envs/new1/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1347, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/home/aiworker/miniconda/envs/new1/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

It seems to be some incompatibility with the flash-attn package, but I have yet to find the root cause. Any ideas? Perhaps dependencies need to be updated?
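
In case it helps triage, a quick diagnostic sketch (assuming the usual cause of this undefined-symbol error: a flash-attn wheel built against a different torch/CUDA version than the one installed):

```python
# Print the versions involved; a mismatch between the torch that flash-attn
# was built against and the torch installed typically produces undefined
# C++ symbols like the one in the traceback above.
import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as exc:
    print("flash-attn import failed:", exc)
```

If the versions disagree, uninstalling flash-attn and rebuilding it against the installed torch (pip install flash-attn --no-build-isolation) typically resolves this kind of symbol error.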
