
Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.

Home Page: https://huggingface.co/microsoft/Florence-2-large



Florence-2: Microsoft's Cutting-edge Vision Language Models

🕸 LinkedIn • 📙 Kaggle • 💻 Medium Blog • 🤗 Hugging Face



📃 Model Description

Florence-2, released by Microsoft in June 2024, is an advanced, lightweight vision-language foundation model open-sourced under the MIT license. The model is attractive because of its small size (0.23B and 0.77B parameters) and strong performance across a variety of computer vision and vision-language tasks. Despite its small size, it achieves results comparable to those of much larger models, such as Kosmos-2. The model's strength lies not in a complex architecture but in the large-scale FLD-5B dataset, consisting of 126 million images and 5.4 billion comprehensive visual annotations.

Florence-2 model series

Model                      Model size   Model Description
Florence-2-base [HF]       0.23B        Pretrained model with FLD-5B
Florence-2-large [HF]      0.77B        Pretrained model with FLD-5B
Florence-2-base-ft [HF]    0.23B        Finetuned model on a collection of downstream tasks
Florence-2-large-ft [HF]   0.77B        Finetuned model on a collection of downstream tasks
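A minimal loading sketch with the transformers library, using the checkpoint names from the table above (Florence-2 ships its own modeling code, so trust_remote_code=True is required):

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Any of the four checkpoints in the table above can be substituted here.
checkpoint = "microsoft/Florence-2-base"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# trust_remote_code=True is needed because Florence-2 ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch_dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
```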

Tasks

Florence-2 supports many tasks out of the box:

  • Caption
  • Detailed Caption
  • More Detailed Caption
  • Dense Region Caption
  • Object Detection
  • OCR
  • OCR with Region
  • Caption to Phrase Grounding
  • Segmentation
  • Region Proposal

You can try out the model via the HF Space. Each task is selected with a dedicated prompt token, as sketched below.
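A minimal inference sketch following the pattern from the Hugging Face model card, reusing the model, processor, device, and torch_dtype from the loading sketch above; the image URL is only an illustrative example:

```python
import requests
from PIL import Image

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "<OD>"  # object detection; other tasks use e.g. <CAPTION>, <OCR>, <DENSE_REGION_CAPTION>

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# post_process_generation parses the raw output (including location tokens)
# into task-specific structures such as boxes and labels.
parsed = processor.post_process_generation(
    generated_text, task=prompt, image_size=(image.width, image.height)
)
print(parsed)
```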

🕸 Unified Representation

Vision tasks are diverse and vary in terms of spatial hierarchy and semantic granularity. Instance segmentation provides detailed information about object locations within an image but lacks semantic information. On the other hand, image captioning allows for a deeper understanding of the relationships between objects, but without reference to their actual locations.

Figure 1. Illustration showing the level of spatial hierarchy and semantic granularity expressed by each task. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

The authors of Florence-2 decided that instead of training a series of separate models, each capable of executing an individual task, they would unify the representation and train a single model capable of executing over 10 tasks. This, however, required a new dataset.

💎 Dataset

Florence-2's strength doesn't stem from its architecture, but from the massive dataset it was pre-trained on. The authors noted that leading computer vision datasets typically contain limited information: WIT only includes image/caption pairs, while SA-1B only contains images and associated segmentation masks. They therefore decided to build the new FLD-5B dataset, containing a wide range of information about each image: boxes, masks, captions, and grounding. The dataset creation process was largely automated: the authors used off-the-shelf task-specific models, along with a set of heuristics and quality checks, to clean the obtained results. The result was a new dataset containing over 5 billion annotations for 126 million images, which was used to pre-train the Florence-2 model.

An illustrative example of an image and its corresponding annotations in the FLD-5B dataset. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

FLD-5B is not yet publicly available, but the authors announced its upcoming release during CVPR 2024.

Summary of size, spatial hierarchy, and semantic granularity of top datasets. Source: Florence-2 CVPR 2024 poster.

🧩 Architecture and Pre-training Details

Regardless of the computer vision task being performed, Florence-2 formulates the problem as a sequence-to-sequence task. Florence-2 takes an image and text as inputs, and generates text as output. The model has a simple structure. It uses a DaViT vision encoder to convert images into visual embeddings, and BERT to convert text prompts into text and location embeddings. The resulting embeddings are then processed by a standard encoder-decoder transformer architecture, generating text and location tokens.

Overview of the Florence-2 architecture. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

For region-specific tasks, location tokens representing quantized coordinates are added to the tokenizer's vocabulary.

  • Box Representation (x0, y0, x1, y1): Location tokens correspond to the box coordinates, specifically the top-left and bottom-right corners.
  • Polygon Representation (x0, y0, ..., xn, yn): Location tokens represent the polygon's vertices in clockwise order.
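As a rough illustration of how a quantized location token maps back to pixel space, assuming 1,000 coordinate bins relative to the image size and mapping each bin to its center (in practice, processor.post_process_generation performs this conversion for you):

```python
def loc_bin_to_pixel(bin_index: int, image_dim: int, num_bins: int = 1000) -> float:
    """Map a quantized location bin (e.g. from a <loc_52> token) to the pixel
    coordinate at the center of that bin."""
    return (bin_index + 0.5) / num_bins * image_dim

# Hypothetical box "<loc_52><loc_333><loc_932><loc_774>" on a 640x480 image:
x0 = loc_bin_to_pixel(52, 640)   # top-left x
y0 = loc_bin_to_pixel(333, 480)  # top-left y
x1 = loc_bin_to_pixel(932, 640)  # bottom-right x
y1 = loc_bin_to_pixel(774, 480)  # bottom-right y
print(x0, y0, x1, y1)
```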

🦾 Capabilities

Florence-2 is smaller and more accurate than its predecessors. The Florence-2 series consists of two models: Florence-2-base and Florence-2-large, with 0.23 billion and 0.77 billion parameters, respectively. This size allows for deployment even on mobile devices. Despite its small size, Florence-2 achieves better zero-shot results than Kosmos-2 across all benchmarks, even though Kosmos-2 has 1.6 billion parameters.


๐Ÿ‹๐Ÿพโ€โ™‚๏ธ Finetuning

Although Florence-2 supports many tasks out of the box, your task or domain might not be covered, or you may want tighter control over the model's output. In that case, you will need to fine-tune the model; a minimal sketch follows.
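A minimal training-loop sketch in the spirit of the Hugging Face fine-tuning recipe. Here, train_loader is a placeholder for your own DataLoader yielding (prompts, answers, images) batches, and the learning rate is only an assumption:

```python
from torch.optim import AdamW

# Assumes `model`, `processor`, and `device` are set up as in the loading sketch
# (prefer float32 for training). `train_loader` is a hypothetical DataLoader.
optimizer = AdamW(model.parameters(), lr=1e-6)
model.train()

for prompts, answers, images in train_loader:
    inputs = processor(
        text=prompts, images=images, return_tensors="pt", padding=True
    ).to(device)
    labels = processor.tokenizer(
        answers, return_tensors="pt", padding=True, return_token_type_ids=False
    ).input_ids.to(device)

    # With labels supplied, the model returns a cross-entropy loss over the target text.
    outputs = model(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        labels=labels,
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```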

🗂 Resources

Title                              Type           Brief Description   Links
Florence-2 Demo                    Demo           HF Space            Link
Florence-2 DocVQA Demo             Demo           HF Space            Link
Florence-2 Finetuned Demo          Demo           HF Space            Link
Florence-2 Inference Notebook      Notebook       Notebook            Link
Florence-2 Finetuning Notebook     Notebook       Notebook            Link
Vision Language Models Explained   Blog article   Article             Link
Florence-2 Finetuning on DocVQA    Video          Video               Link
Florence-2 Finetuning              Video          Video               Link

🔗 Citations and References

  • @article{xiao2023florence,
      title={Florence-2: Advancing a unified representation for a variety of vision tasks},
      author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
      journal={arXiv preprint arXiv:2311.06242},
      year={2023}
    }

  • Piotr Skalski. (Jun 20, 2024). Florence-2: Open Source Vision Foundation Model by Microsoft. Roboflow Blog

  • Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models
