Alibaba Cloud Releases Open-source Qwen-VL, A Large Vision Language Model

On August 25th, Alibaba Cloud launched two open-source large vision language models (LVLMs): Qwen-VL and its conversationally fine-tuned variant, Qwen-VL-Chat. Qwen-VL is the multimodal version of Qwen-7B, the 7-billion-parameter model in Alibaba Cloud's Tongyi Qianwen family of large language models. Capable of understanding both image inputs and text prompts in English and Chinese, Qwen-VL can perform tasks such as answering open-ended questions about images and generating image captions.

Qwen-VL is a vision language (VL) model that supports multiple languages, including Chinese and English. Compared to previous VL models, Qwen-VL not only covers basic abilities such as image recognition, description, question answering, and dialogue, but also adds capabilities such as visual localization (grounding) and understanding of text within images.

For example, a foreign tourist who does not read Chinese and cannot find the right hospital department can take a picture of the floor directory and ask Qwen-VL, "Which floor is the orthopedics department on?" or "Where should I go for ENT?" Qwen-VL will reply in text based on the information in the image; this is its image question-answering capability. Similarly, given a photo of Shanghai's Bund and asked to find the Oriental Pearl Tower, Qwen-VL can accurately outline the corresponding building with a detection box, demonstrating its visual localization ability.


Qwen-VL is built on the Qwen-7B language model and introduces a visual encoder into its architecture to accept visual input. Through the design of its training process, the model can perceive and understand visual signals at a fine-grained level. Qwen-VL supports an image input resolution of 448×448, higher than the 224×224 typically supported by previously open-sourced LVLMs. Building on Qwen-VL, the Tongyi Qianwen team has also developed Qwen-VL-Chat, an LLM-based visual AI assistant with alignment mechanisms, which allows developers to quickly build dialogue applications with multimodal capabilities.
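As a rough illustration of how a multimodal chat model of this kind can be invoked, below is a minimal sketch assuming the Hugging Face Transformers interface published alongside the released checkpoint. The `Qwen/Qwen-VL-Chat` model ID refers to the hub listing, while the `from_list_format` and `chat` helpers come from the custom code the checkpoint loads with `trust_remote_code=True`, so exact names may vary between releases; the image path is purely illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The checkpoint ships its own multimodal preprocessing and chat code,
# which is why trust_remote_code=True is required.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Combine an image and a text prompt into a single multimodal query.
# "floor_guide.jpg" is a hypothetical local photo of a hospital floor directory.
query = tokenizer.from_list_format([
    {"image": "floor_guide.jpg"},
    {"text": "Which floor is the orthopedics department on?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```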

Multimodality is one of the important technological directions toward general artificial intelligence. It is widely believed that moving from a single-sense, text-only language model to a multimodal model that supports various forms of input such as text, images, and audio is a significant step toward more capable intelligent models. Multimodality enhances the understanding capabilities of large models and greatly expands their range of applications.

Vision is the primary human sense and the first modality researchers have sought to incorporate into large models. Following the release of its M6 and OFA series of multimodal models, Alibaba Cloud's Tongyi Qianwen team has now open-sourced Qwen-VL, a large vision language model (LVLM) based on Qwen-7B.

Qwen-VL is the industry's first general-purpose model that supports open-domain visual localization in Chinese. Open-domain visual localization (grounding) determines how precise a large model's "vision" is, that is, whether it can accurately locate a requested object in an image. This is crucial for the practical application of VL models in scenarios such as robot control.
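Grounding responses of this kind are typically returned as marked-up text rather than raw pixel coordinates. The sketch below is a hedged illustration assuming the `<ref>…</ref><box>(x1,y1),(x2,y2)</box>` markup and a 0–1000 normalized coordinate grid described in the Qwen-VL materials; if the released model uses a different format, only the regular expression and scaling would need to change.

```python
import re

def parse_boxes(response: str, img_width: int, img_height: int):
    """Extract (label, (x1, y1, x2, y2)) pairs in pixel coordinates from a grounding reply."""
    pattern = r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
    boxes = []
    for label, x1, y1, x2, y2 in re.findall(pattern, response):
        # Assumed convention: coordinates are normalized to a 0-1000 grid,
        # so rescale them to the actual image dimensions.
        boxes.append((
            label.strip(),
            (int(x1) * img_width // 1000, int(y1) * img_height // 1000,
             int(x2) * img_width // 1000, int(y2) * img_height // 1000),
        ))
    return boxes

# Hypothetical reply to "Find the Oriental Pearl Tower" for a 1280x960 photo.
reply = "<ref>Oriental Pearl Tower</ref><box>(410,80),(560,690)</box>"
print(parse_boxes(reply, 1280, 960))
```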

In mainstream multimodal task benchmarks and multimodal conversation evaluations, Qwen-VL performs far beyond general-purpose models of equivalent size.


In standard English evaluations across four major categories of multimodal tasks (zero-shot captioning, VQA, DocVQA, and grounding), Qwen-VL achieved the best performance among open-source LVLMs of similar size. To test multimodal dialogue capability, the Tongyi Qianwen team built a GPT-4-scored benchmark, TouchStone ("Shijinshi" in Chinese), and ran comparative tests on Qwen-VL-Chat and other models. Qwen-VL-Chat achieved the best results among open-source LVLMs in both the Chinese and English alignment evaluations.

Qwen-VL and its visual AI assistant Qwen-VL-Chat have been launched on ModelScope; both are open-source, free, and available for commercial use. Users can download the models directly from ModelScope, or access and invoke Qwen-VL and Qwen-VL-Chat through Alibaba Cloud's DashScope platform. Alibaba Cloud also provides users with comprehensive services covering model training, inference, deployment, and fine-tuning.
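For readers who want to try the weights locally, the following is a minimal sketch of fetching the checkpoint from ModelScope; the `qwen/Qwen-VL-Chat` model ID is assumed to match the hub listing, and hosted access through DashScope instead requires an Alibaba Cloud API key.

```python
from modelscope.hub.snapshot_download import snapshot_download

# Download the checkpoint (weights plus the bundled custom code) to the local cache.
# The model ID is assumed to be "qwen/Qwen-VL-Chat"; adjust if the hub listing differs.
local_dir = snapshot_download("qwen/Qwen-VL-Chat")
print(f"Model files downloaded to: {local_dir}")
```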

In early August, Alibaba Cloud open-sourced the Qwen-7B general-purpose base model and the Qwen-7B-Chat dialogue model, each with 7 billion parameters, making it the first major Chinese technology company to join the ranks of open-source large models. The release immediately attracted widespread attention and climbed Hugging Face's trending list that week. In less than a month, the project received over 3,400 stars on GitHub, and cumulative downloads exceeded 400,000.
