Google AI announces WIT: a dataset linking text, images, and context from Wikipedia

The dataset, which includes more than 11 million images and their contexts in 108 languages, is intended for artificial intelligence training. It is available to the public, and Google will also hold a WIT-based competition together with the Wikimedia Foundation and the Kaggle website.

An example of an image and its Wikipedia context, as analyzed for Google AI's WIT project. Press photo.

Google is celebrating its 23rd anniversary today. Google AI, one of the company’s divisions, has announced WIT: a dataset that links text, images, and their Wikipedia context, and that is open to the general public for artificial intelligence training.

Google Research has published the details of the WIT announcement: a huge collection of images from Wikipedia matched to text in many languages, intended for artificial intelligence training.

In their post on the Google AI blog, they write:

“Traditionally, these datasets were created either by manually adding captions to images, or by crawling the web and extracting the alt text as captions for images. While the former approach yields higher-quality data, the intensive manual annotation process limits the amount of data that can be generated. Another shortcoming of existing datasets is the lack of coverage in non-English languages. This naturally led us to ask: Is it possible to overcome these limitations and create a high-quality, large, and multilingual dataset with a variety of content?”

“Today we present the Wikipedia-based Image Text (WIT) dataset, created by extracting multiple texts associated with an image from Wikipedia articles and image links in Wikipedia. We applied rigorous filtering to ensure that only high-quality image-text sets were retained. As outlined in ‘WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning’, presented at SIGIR ’21, the result is a collection of 37.5 million entity-rich image-text examples, including 11.5 million unique images and their descriptions in 108 languages. The WIT dataset is available for download and use under a Creative Commons license.”
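
The distinction the quote draws between examples, unique images, and languages can be illustrated with a toy sketch in Python. The record layout, the field names, and the example.org URLs below are hypothetical and chosen only for illustration; they are not the actual WIT schema. The point is that one image can appear in several image-text examples, for instance once per language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageTextExample:
    """One image-text example (hypothetical fields, not the real WIT schema)."""
    language: str       # language code of the Wikipedia page
    image_url: str      # identifies the image; shared across examples
    caption: str        # text associated with the image
    page_title: str     # page-level context
    section_title: str  # section-level context


def summarize(examples):
    """Count examples, unique images, and languages in a collection."""
    return {
        "examples": len(examples),
        "unique_images": len({e.image_url for e in examples}),
        "languages": len({e.language for e in examples}),
    }


# Toy data: the same image is used on an English and a Hebrew page.
examples = [
    ImageTextExample("en", "https://example.org/a.jpg", "A lighthouse", "Lighthouse", "History"),
    ImageTextExample("he", "https://example.org/a.jpg", "מגדלור", "מגדלור", "היסטוריה"),
    ImageTextExample("en", "https://example.org/b.jpg", "A harbor", "Harbor", "Overview"),
]

print(summarize(examples))
```

In this toy collection there are three examples but only two unique images, which mirrors, at a much smaller scale, how 37.5 million examples can cover 11.5 million unique images.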

The unique advantages of the WIT dataset are:

  • Size: WIT is the largest multimodal dataset of image-text examples available to the public.

  • Multilingual: With 108 languages, WIT has at least ten times as many languages as any other dataset.

  • Contextual information: Unlike typical multimodal datasets, which have only one caption per image, WIT includes page-level and section-level contextual information.

  • Entities in the real world: Wikipedia, being a broad knowledge base, is rich in real-world entities, which are represented in WIT.

  • Challenging test set: In our recent work accepted at EMNLP, all the latest models demonstrated significantly lower performance on WIT compared to traditional evaluation sets (e.g., a drop of about 30 points in recall).

A high-quality training set and a challenging evaluation benchmark

The extensive coverage of diverse concepts in Wikipedia means that WIT’s evaluation sets serve as a challenging benchmark, even for modern models. We found that the average recall scores...

