On August 25th, Alibaba Cloud launched two open-source large vision language models (LVLMs), Qwen-VL and its conversationally fine-tuned counterpart Qwen-VL-Chat. Qwen-VL is the multimodal version of Qwen-7B, the 7-billion-parameter edition of Alibaba Cloud’s large language model Tongyi Qianwen. Capable of understanding both image inputs and text prompts in English and Chinese, Qwen-VL can perform tasks such as answering open-ended questions about different images and generating image captions.
Qwen-VL is a vision language (VL) model that supports multiple languages including Chinese and English. Compared to previous VL models, Qwen-VL not only has basic abilities in image recognition, description, question answering, and dialogue, but also adds capabilities such as visual localization and understanding of text within images.
For example, if a foreign tourist who doesn’t understand Chinese goes to a hospital and doesn’t know how to reach the right department, they can take a picture of the floor directory and ask Qwen-VL, “Which floor is the orthopedics department on?” or “Where should I go for ENT?” Qwen-VL will reply in text based on the information in the image; this is its image question-answering capability. Likewise, given a photo of Shanghai’s Bund and asked to find the Oriental Pearl Tower, Qwen-VL can accurately outline the corresponding building with detection boxes, demonstrating its visual localization ability.
Qwen-VL is built on the Qwen-7B language model and adds a visual encoder to its architecture to handle visual input signals. Through the design of its training process, the model learns to perceive and understand visual signals at a fine-grained level. Qwen-VL supports an image input resolution of 448×448, higher than the 224×224 typically supported by previously open-sourced LVLMs. Building on Qwen-VL, the Tongyi Qianwen team has developed Qwen-VL-Chat, an LLM-based visual AI assistant aligned for dialogue, which lets developers quickly build conversational applications with multimodal capabilities.
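As a rough illustration of what such a dialogue application might look like, the sketch below assumes the aligned model is published on the Hugging Face Hub under the ID Qwen/Qwen-VL-Chat and exposes the chat-style interface described in its model card; the image URL and question are placeholders, not part of the announcement.

```python
# Minimal sketch of one multimodal dialogue turn with Qwen-VL-Chat.
# Assumes the checkpoint ID "Qwen/Qwen-VL-Chat" and that its remote code
# provides the from_list_format()/chat() helpers described in the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Combine an image (URL or local path) with a text question in a single query.
query = tokenizer.from_list_format([
    {"image": "https://example.com/floor_guide.jpg"},  # placeholder image
    {"text": "Which floor is the orthopedics department on?"},
])

# First dialogue turn; `history` carries context for follow-up questions.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```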
Multimodality is one of the key technological directions toward general artificial intelligence. It is widely believed that moving from text-only language models, limited to a single “sense,” to multimodal models that accept text, images, audio, and other forms of input marks a significant leap in the evolution of large-model intelligence. Multimodality enhances the understanding capabilities of large models and greatly expands their range of applications.
Vision is humans’ primary sense, and it is also the first modality researchers aim to incorporate into large models. Following the release of the M6 and OFA series of multimodal models, Alibaba Cloud’s Tongyi Qianwen team has now open-sourced Qwen-VL, a large vision language model (LVLM) based on Qwen-7B.
Qwen-VL is the industry’s first general-purpose model to support open-domain visual localization in Chinese. Open-domain visual localization determines how accurate a large model’s “vision” is, that is, whether it can precisely identify the desired objects in an image. This is crucial for applying VL models in practical scenarios such as robot control.
In evaluations of mainstream multimodal tasks and multimodal conversational ability, Qwen-VL achieved performance far beyond that of general-purpose models of equivalent size.
In standard English evaluations across four major multimodal tasks (zero-shot captioning, VQA, DocVQA, and grounding), Qwen-VL achieved the best performance among open-source LVLMs of similar size. To test the model’s multimodal dialogue capability, the Tongyi Qianwen team built a test set, “TouchStone” (Shijinshi), that uses a GPT-4-based scoring mechanism, and ran comparative tests on Qwen-VL-Chat and other models. Qwen-VL-Chat achieved the best results among open-source LVLMs in both the Chinese and English alignment evaluations.
Qwen-VL and the visual AI assistant Qwen-VL-Chat have been released on ModelScope; they are open source, free, and available for commercial use. Users can download the models directly from ModelScope or access and invoke Qwen-VL and Qwen-VL-Chat through Alibaba Cloud’s DashScope service. Alibaba Cloud also provides users with comprehensive services covering model training, inference, deployment, and fine-tuning.
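For readers who want to fetch the open-sourced weights locally, the short sketch below shows one way to pull them with the ModelScope Python SDK. The repository ID “qwen/Qwen-VL-Chat” is an assumption for illustration, so confirm the exact model ID on ModelScope before running it.

```python
# Minimal sketch of downloading the Qwen-VL-Chat weights from ModelScope.
# The model ID below is assumed; check the ModelScope listing for the exact name.
from modelscope import snapshot_download

model_dir = snapshot_download("qwen/Qwen-VL-Chat")
print(f"Model files downloaded to: {model_dir}")
```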
In early August, Alibaba Cloud open-sourced the Qwen-7B base model and the Qwen-7B-Chat dialogue model, each with 7 billion parameters, making it the first large technology company in China to join the ranks of open-source large-model providers. The release of Qwen-7B immediately gained widespread attention and quickly climbed HuggingFace’s trending list that week. In less than a month, it received over 3,400 stars on GitHub, and its cumulative download count has exceeded 400,000.