Offered Tutorials/Workshops
The following tutorials/workshops are offered for ACMSE 2026:
- Tutorial/Workshop 1: Visual Question Answering and RAG Using Generative AI
- Tutorial/Workshop 2: TBD
Visual Question Answering and RAG Using Generative AI
Presenter: Gourav Bathla (GLA University – India)
Type of Event: Workshop
Date: TBD
Time: TBD (Central Time)
Duration: 2 hours
Room: TBD
Abstract:
Visual Question Answering (VQA) is a multimodal learning task that aims to generate accurate answers to natural-language questions about visual content such as images or videos. Questions may call for binary (yes/no) responses, object counting, attribute recognition, spatial reasoning, or detailed descriptive understanding of specific objects or scenes in the visual input. Traditional VQA approaches typically rely on convolutional neural networks (CNNs) or pre-trained visual backbones such as VGGNet and ResNet for image feature extraction, combined with sequence-based language models such as Long Short-Term Memory (LSTM) networks to encode the textual question. The visual and textual representations are then fused using techniques such as concatenation, attention mechanisms, or multimodal embeddings to predict the final answer. While such architectures achieve reasonable performance on standard VQA benchmarks, their ability to capture complex contextual relationships and perform high-level semantic reasoning remains limited.

Recent advances in large Vision-Language Models (VLMs), including Vision Transformers (ViTs) and foundation models such as Gemini, have significantly improved VQA performance. These models leverage transformer-based architectures, large-scale pretraining on multimodal datasets, and self-attention mechanisms to model long-range dependencies and fine-grained interactions between the visual and textual modalities. As a result, they achieve superior performance on complex, real-world VQA datasets involving compositional reasoning, scene understanding, and open-ended queries.

In this workshop, VQA will be demonstrated using both traditional deep learning architectures and state-of-the-art vision-language models. Participants will gain practical insight into model architectures, multimodal feature-fusion strategies, training pipelines, and performance comparisons, highlighting the evolution of VQA systems from classical neural networks to modern large-scale vision-language models.
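As a rough illustration of the classical pipeline described above, the sketch below combines a frozen ResNet-18 image encoder with an LSTM question encoder and fuses the two representations by concatenation. The dimensions, vocabulary size, and number of answer classes are illustrative assumptions, not workshop materials.

```python
# Minimal sketch of a classical VQA model: frozen CNN backbone + LSTM question
# encoder + concatenation fusion + answer classifier. All sizes are illustrative.
import torch
import torch.nn as nn
from torchvision import models


class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512, num_answers=1000):
        super().__init__()
        # Image encoder: pre-trained ResNet-18 with its classification head removed.
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # output: (B, 512, 1, 1)
        for p in self.cnn.parameters():
            p.requires_grad = False  # use the backbone as a fixed feature extractor

        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

        # Fusion by concatenation, then a small classifier over answer classes.
        self.classifier = nn.Sequential(
            nn.Linear(512 + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image).flatten(1)              # (B, 512)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_feat = h_n[-1]                                   # (B, hidden_dim)
        fused = torch.cat([img_feat, q_feat], dim=1)       # simple concatenation fusion
        return self.classifier(fused)                      # answer logits


# Example forward pass on random data.
model = SimpleVQA()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(1, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```

Attention-based fusion or fine-tuning the visual backbone are common refinements of this baseline and are among the strategies the workshop compares against transformer-based models.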
Keywords:
Visual Question Answering (VQA), Vision-Language Models (VLMs), Retrieval-Augmented Generation (RAG).
Covered Topics:
The covered topics include:
- Questions and image embeddings
- Implementation using ResNet and LSTM
- Demonstration of VQA using Gemini (see the sketch after this list)
- ViT (Vision Transformer)
- RAG (Retrieval-Augmented Generation)
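For the Gemini demonstration topic, a minimal sketch along the following lines is possible, assuming the google-generativeai Python SDK, a valid API key in the GOOGLE_API_KEY environment variable, and an illustrative model name and image file; the actual workshop materials may differ.

```python
# Minimal sketch of visual question answering with Gemini, assuming the
# google-generativeai SDK; model name, file name, and question are illustrative.
import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

image = Image.open("street_scene.jpg")          # any local image file
question = "How many bicycles are visible in this picture?"

# The model accepts a mixed list of image and text parts as a multimodal prompt.
response = model.generate_content([image, question])
print(response.text)
```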
Prerequisites for Participants:
Familiarity with machine learning, neural networks, and Python.


