Hammad Ayyubi

I work at Google DeepMind on Vision-Language Understanding and Generation research.

My research focuses on Computer Vision, Natural Language Processing, and Commonsense Reasoning. In particular, I am interested in building systems that can reason about our world in an interpretable, robust, and trustworthy manner. This involves extensive work with LLMs, agents, tools, and instruction-tuning.

Previously, I completed my PhD at the Dept. of Computer Science, Columbia University, advised by Prof. Shih-Fu Chang.

I have been fortunate to work with some amazing people through internships at Microsoft, Google, and Adobe: Jianwei Yang, Oriana Riva, Tianqi Liu, Arsha Nagrani, Mingda Zhang, Anurag Arnab, and Vlad Morariu.

Before joining Columbia, I completed my Master’s at UC San Diego, advised by Prof. Gary Cottrell; I also worked with Prof. Manmohan Chandraker and Prof. David Kriegman during that time. I earned my Bachelor’s at the Indian Institute of Technology, BHU (IIT BHU).


news

Nov 4, 2025 Our paper DELOC (Document Element Localizer), on grounding in PDFs using Multimodal LLMs, was accepted to EMNLP 2025.
Apr 1, 2025 Our paper PuzzleGPT, on predicting location and time from images, was accepted to NAACL 2025 Findings.
Feb 10, 2025 I have joined Google full-time!
Oct 15, 2024 One paper on Event Graph-based Interpretable VideoQA was accepted to the NeurIPS MAR Workshop, 2024.
Sep 26, 2024 Our paper on Multimodal Reasoning on Generated Images was accepted at NeurIPS’24. Dataset and code here.
Sep 20, 2024 Our paper on Entity-Aware Video Captioning was accepted at EMNLP’24.
Jul 22, 2024 One paper on Procedure Planning was accepted at ECCV’24.
Jul 20, 2024 One paper on insufficient context in Multimodal Reasoning was accepted at ACM MM’24.
Jun 1, 2024 Excited to be starting my summer internship at Adobe!
Dec 9, 2023 I am looking for a MS/UG student to work with me on Multimodal Commonsense Reasoning. If you are interested, please reach out.