Many commonsense reasoning datasets have been created in recent years. However, existing benchmarks are often built from data drawn from particular regions and overlook cultural differences, so model performance varies across regions.
A recent paper published on arXiv investigates region-specific commonsense reasoning grounded in visual scenes.

Modern data processing algorithms exhibit shortcomings when dealing with diverse regional and cultural data. Image credit: Duncan Harris via Wikimedia (CC BY 2.0)
The researchers propose a new evaluation benchmark consisting of multiple-choice questions paired with images from movies and TV series set in Asian, African, and Western countries. For instance, a model should recognize that people are attending a wedding based on their apparel, which varies by region.
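To make the benchmark format concrete, the sketch below shows one way such an item could be represented in Python. All field names and values here are hypothetical illustrations of the multiple-choice structure the paper describes, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GDVCRItem:
    """Hypothetical representation of one geo-diverse VCR example:
    an image, a question, answer choices, and a region label."""
    image_path: str     # still frame taken from a movie or TV series
    question: str       # natural-language question about the scene
    choices: List[str]  # candidate answers; exactly one is correct
    label: int          # index of the correct answer in `choices`
    region: str         # e.g. "West", "East Asia", "South Asia", "Africa"

# An illustrative (invented) example in the spirit of the wedding scenario:
example = GDVCRItem(
    image_path="frames/ceremony_scene.jpg",
    question="What event are the people in the image attending?",
    choices=["A funeral", "A wedding", "A graduation", "A sports match"],
    label=1,
    region="South Asia",
)
```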
The study shows that current models do not generalize well across regions, especially when the questions involve higher-level reasoning and rich visual contexts. The authors hope these findings will motivate the community to build commonsense reasoning systems that take more inclusive, geo-diverse knowledge into account.
Commonsense is often defined as knowledge shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic location and are only shared locally. For example, wedding ceremonies vary across regions due to different customs shaped by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models’ ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art vision-and-language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than that for the Western region. We analyze the reasons behind this performance disparity and find that the gap is larger on QA pairs that 1) concern culture-related scenarios, e.g., weddings, religious activities, and festivals; and 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at this https URL.
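The evaluation protocol in the abstract boils down to scoring multiple-choice answers separately for each region and comparing the resulting accuracies. Below is a minimal sketch of that comparison, reusing the hypothetical `GDVCRItem` above and treating `predict` as a stand-in for whatever model (e.g., VisualBERT or ViLBERT) produces an answer index; this is not the authors' actual evaluation code.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

def per_region_accuracy(items: Iterable[GDVCRItem],
                        predict: Callable[[GDVCRItem], int]) -> Dict[str, float]:
    """Compute multiple-choice accuracy separately for each region,
    making Western vs. non-Western performance gaps visible."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        total[item.region] += 1
        if predict(item) == item.label:
            correct[item.region] += 1
    return {region: correct[region] / total[region] for region in total}
```

Grouping by question type (e.g., culture-related scenarios vs. low-order recognition) instead of region would expose the second disparity the authors report in the same way.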
Research paper: Yin, D., Li, L. H., Hu, Z., Peng, N., and Chang, K.-W., “Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning”, 2021. Link: https://arxiv.org/abs/2109.06860