Microsoft Tests A Small AI Vision Language Model

Florence-2, a type of artificial intelligence (AI) called a vision language model, can support content creation for marketing, advertising, and storytelling by interpreting images in response to text prompts.

Amid Microsoft's push into all things AI, the company released research this month on the small Azure AI model.

The model comes in two small sizes, 0.2 billion and 0.7 billion parameters, yet is said to deliver strong performance on a variety of computer vision and vision-language tasks. The small size and ease of use should make it attractive to developers and creative artists working on website design, advertising campaigns, and marketing content.

Florence-2 supports captioning, object detection, Azure AI Services’ optical character recognition (OCR) and more, and can repeatedly generate detailed image descriptions with little effort.
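Each of those capabilities is selected by a task token in the text prompt. The sketch below illustrates that interface in Python; the token names follow the conventions described in public Hugging Face materials and are assumptions here, not verified against a specific release.

```python
# Illustrative sketch of Florence-2's prompt-based task interface.
# The task tokens below are assumed from public writeups and may
# differ between checkpoints.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<MORE_DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
}

def build_prompt(task: str, extra_text: str = "") -> str:
    """Return the text prompt for a task; grounding-style tasks
    append free-form text after the task token."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unsupported task: {task}")
    return TASK_PROMPTS[task] + extra_text

print(build_prompt("caption"))  # -> <CAPTION>
```

In actual use, the prompt and an image would be passed together to the model through a processor; the point here is only that one model serves many tasks by switching the prompt.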

For example, in one sample output, the model describes an image showing a modern kitchen with a large window on the left side.

The window (#1) has a view of trees and greenery outside. On the left side of the image, there is a blue sofa (#2) with a wooden coffee table in front of it.
Above the table, there are three copper pendant lights (#3) hanging from the ceiling. There is a large island (#4) with a white countertop, and two barstools (#5) are next to the table.

In the center of the kitchen, there is a bottle with green plants (#6) on the table. The floor (#7) is made of light-colored wood, and the walls (#8) are painted in a dark blue color.

The Florence project is being tested as a foundational vision model with a range of perceptual capabilities.

The developers created the FLD-5B dataset, containing 126 million images and 5 billion annotations collected by the Florence data engine.

Florence-2 was pre-trained on this dataset through comprehensive multitask learning and has been successful in captioning, object detection, visual grounding, and referring segmentation, among others.

“Regardless of the computer vision task being performed, Florence-2 formulates the problem as a sequence-to-sequence task,” Hugging Face describes in a writeup published this week. “The model has a simple structure. It uses a DaViT vision encoder to convert images into visual embeddings, and BERT to convert text prompts into text and location embeddings.”
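Because every task is framed as sequence-to-sequence, even spatial outputs such as bounding boxes arrive as plain text. Public writeups describe locations being emitted as quantized coordinate tokens; the sketch below decodes such a sequence under the assumption that each box is four `<loc_N>` tokens on a 0-999 grid, which may differ between releases.

```python
import re

def decode_boxes(seq: str, image_width: int, image_height: int):
    """Turn a Florence-2-style output sequence into pixel bounding boxes.

    Assumes locations appear as <loc_N> tokens quantized to a 0-999
    grid, four tokens (x1, y1, x2, y2) per box; this token format is
    an assumption based on public descriptions of the model.
    """
    coords = [int(n) for n in re.findall(r"<loc_(\d+)>", seq)]
    boxes = []
    # Walk the coordinates four at a time, ignoring any trailing remainder.
    for i in range(0, len(coords) - len(coords) % 4, 4):
        x1, y1, x2, y2 = coords[i:i + 4]
        boxes.append((
            x1 / 1000 * image_width, y1 / 1000 * image_height,
            x2 / 1000 * image_width, y2 / 1000 * image_height,
        ))
    return boxes

# A hypothetical detection sequence for a 1024x768 image:
sample = "blue sofa<loc_52><loc_333><loc_512><loc_777>"
print(decode_boxes(sample, 1024, 768))
```

The same decoding idea applies to the grounded caption above: each numbered region (#1, #2, ...) corresponds to coordinates the model emitted as text.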

It works off a massive pre-training dataset. As Hugging Face reports: “the authors noted that leading computer vision datasets typically contain limited information - WIT only includes image/caption pairs, SA-1B only contains images and associated segmentation masks.”

To address this, Microsoft's developers built the FLD-5B dataset to capture a wider range of information about each image, such as the objects in the frame (windows, tables, and so on) and where they appear.
