Publications | Bin Zhu

Please see Google Scholar for more recent works and arXiv papers.

2026

ArXiv

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, and Bin Zhu

arXiv, 2026

PDF Website
MM

Semantic-Structural Decoupling: Disentangling Semantic Attention from Structural Bias in the Attention Manifold

Pengkun Jiao, Bin Zhu, Jingjing Chen, and Yu-Gang Jiang

In Proceedings of the 34th ACM International Conference on Multimedia, 2026

PDF Website
MM

SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

Yian Li, Yang Jiao, Bin Zhu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, and Yu-Gang Jiang

In Proceedings of the 34th ACM International Conference on Multimedia, 2026

PDF Website
ECCV

Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Ninghao Zhang, Bin Zhu, Shijie Zhou, and Jingjing Chen

In European Conference on Computer Vision, 2026

PDF Website
ECCV

VICAL: Vicinal Consistency Alignment for Long-Tailed Visual Recognition

Jianggang Zhu, Zheng Wang, Bin Zhu, Yi-Ping Phoebe Chen, and Jingjing Chen

In European Conference on Computer Vision, 2026

PDF Website
CVPR Main

RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation

Shijie Zhou, Bin Zhu, Jiarui Yang, Xiangyu Zhao, Jingjing Chen, and Yu-Gang Jiang

In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, 2026

PDF Website
ACL Main

OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, and Jingjing Chen

In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026

PDF Website
ACL Findings

Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models

Ziyao Tang, Pengkun Jiao, Bin Zhu, Huiyan Qi, Jingjing Chen, and Yu-Gang Jiang

In Findings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026

PDF Website
AAAI Oral

Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

Jiarui Yang, Bin Zhu, Jingjing Chen, and Yu-Gang Jiang

In Proceedings of the AAAI Conference on Artificial Intelligence, 2026

PDF Code
TIP

ThinkMatter: Panoramic-Aware Instructional Semantics for Monocular Vision-and-Language Navigation

Guangzhao Dai, Shuo Wang, Hao Zhao, Bin Zhu, Qianru Sun, and Xiangbo Shu

IEEE Transactions on Image Processing, 2026

PDF
ICMR Oral

Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models

Bin Zhu, Yinxuan Gui, Huiyan Qi, Jingjing Chen, Chong-Wah Ngo, and Ee-Peng Lim

In ACM International Conference on Multimedia Retrieval, 2026

PDF Website
ICMR Oral

RoDE: Linear rectified mixture of diverse experts for food large multi-modal models

Pengkun Jiao, Xinlan Wu, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang

In ACM International Conference on Multimedia Retrieval, 2026

PDF Website
ICMR Oral

SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation

Chengxi Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Duolikun Danier, Bin Zhu, Stevan Rudinac, David Bull, and Fan Zhang

In ACM International Conference on Multimedia Retrieval, 2026

PDF Code
TOMM

Efficient Test-Time Retrieval Augmented Generation

Hailong Yin, Bin Zhu, Jingjing Chen, and Chong-Wah Ngo

ACM Transactions on Multimedia Computing, Communications and Applications, 2026

PDF Website
ICME

Region-Aware Optimization for Multi-Person Hand Generation in Text-to-Image Synthesis

Xiaoyu Chen, Bin Zhu, Xue Song, Pengkun Jiao, Yue Yu, and Jingjing Chen

In IEEE International Conference on Multimedia and Expo, 2026

PDF
ICASSP

Benchmarking Gaslighting Attacks Against Speech Large Language Models

Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, and Pan Zhou

In IEEE International Conference on Acoustics, Speech and Signal Processing, 2026

PDF Website
ICASSP

Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation

Ching Lam Cheng, Bin Zhu, and Shengfeng He

In IEEE International Conference on Acoustics, Speech and Signal Processing, 2026

PDF
ICASSP

Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation

Guoshan Liu, Bin Zhu, Yian Li, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang

In IEEE International Conference on Acoustics, Speech and Signal Processing, 2026

PDF
🏆 MMM Oral

Benchmarking Gaslighting Negation Attacks Against Reasoning Models

Bin Zhu, Hailong Yin, Jingjing Chen, and Yu-Gang Jiang

In International Conference on Multimedia Modeling (Best Paper Candidate), 2026

PDF Website
MMM Oral

Dual-LoRA and Quality-Enhanced Pseudo Replay for Multimodal Continual Food Learning

Xinlan Wu, Bin Zhu, Feng Han, Pengkun Jiao, and Jingjing Chen

In International Conference on Multimedia Modeling, 2026

PDF

2025

NC

LLiM: Large Lithium-ion Battery Model for Secure Shared E-bike Battery in Smart Cities

Donghui Ding, Zhao Li, Linhao Luo, Ming Jin, Bin Zhu, Yichen Zhong, Junhao Hu, Peng Cai, and Huiqi Hu

Nature Communications, 2025

PDF Code
ICCV

From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang

In International Conference on Computer Vision, 2025

PDF Website
MM

Look before you decide: Prompting active deduction of MLLMs for assumptive reasoning

Yian Li, Wentao Tian, Yang Jiao, Jingjing Chen, Tianwen Qian, Bin Zhu, Na Zhao, and Yu-Gang Jiang

In Proceedings of the 33rd ACM International Conference on Multimedia, 2025

PDF Website
CVPR

HD-EPIC: A highly-detailed egocentric video dataset

Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, and Dima Damen

In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, 2025

PDF Website
AAAI

Hand1000: Generating realistic hands from text with only 1,000 images

Haozhuo Zhang, Bin Zhu, Yu Cao, and Yanbin Hao

In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

PDF Code Website
AAAI

RAGG: Retrieval-Augmented Grasp Generation Model

Zhenhua Tang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, and Richang Hong

In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

PDF
TMM

FoodLMM: A versatile food assistant using large multi-modal model

Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, and Chong-Wah Ngo

IEEE Transactions on Multimedia, 2025

PDF Code
ArXiv

Don’t Deceive Me: Mitigating Gaslighting through Attention Reallocation in LMMs

Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang

arXiv preprint arXiv:2504.09456, 2025

PDF
WACV

Retrieval augmented recipe generation

Guoshan Liu, Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

PDF
TOMM

CookingDiffusion: Cooking procedural image generation with Stable Diffusion

Yuan Wang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, and Xiang Wang

ACM Transactions on Multimedia Computing, Communications and Applications, 2025

PDF Website
ICMR

Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion

Huiyan Qi, Bin Zhu, Chong-Wah Ngo, Jingjing Chen, and Ee-Peng Lim

In ACM International Conference on Multimedia Retrieval (ICMR), 2025

PDF Website
ICME

Efficient Prompt Tuning for Hierarchical Ingredient Recognition

Yinxuan Gui, Bin Zhu, Jingjing Chen, and Chong-Wah Ngo

In IEEE International Conference on Multimedia and Expo (ICME), 2025

PDF
ASSETS

Exploring Object Status Recognition for Recipe Progress Tracking in Non-Visual Cooking

Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, and Patrick Carrington

In International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), 2025

PDF
CHI-LBW

OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking

Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, and Patrick Carrington

In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2025

PDF

2024

MM Oral

Navigating weight prediction with diet diary

Yinxuan Gui, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang

In Proceedings of the 32nd ACM International Conference on Multimedia, 2024

PDF Website
ECCV

Enhancing recipe retrieval with foundation models: A data augmentation perspective

Fangzhou Song, Bin Zhu, Yanbin Hao, and Shuo Wang

In European Conference on Computer Vision, 2024

PDF Code
TMM

From canteen food to daily meals: Generalizing food recognition to more practical scenarios

Guoshan Liu, Yang Jiao, Jingjing Chen, Bin Zhu, and Yu-Gang Jiang

IEEE Transactions on Multimedia, 2024

PDF
TMM

Efficient Unsupervised Video Hashing with Contextual Modeling and Structural Controlling

Jingru Duan, Yanbin Hao, Bin Zhu, Lechao Cheng, Pengyuan Zhou, and Xiang Wang

IEEE Transactions on Multimedia, 2024

PDF
TOMM

Text-driven video prediction

Xue Song, Jingjing Chen, Bin Zhu, and Yu-Gang Jiang

ACM Transactions on Multimedia Computing, Communications and Applications, 2024

PDF
TOMM

CVLP-NaVD: Contrastive Visual-Language Pre-training Models for Non-annotated Visual Description

Haoran Li, Yanbin Hao, Jiarui Yu, Bin Zhu, Shuo Wang, and Tong Xu

ACM Transactions on Multimedia Computing, Communications and Applications, 2024

PDF
MM Asia

Active Object Segmentation: A New Modality for Egocentric Action Recognition

Jian Ma, Bin Zhu, Kun Li, and Dima Damen

In Proceedings of the 6th ACM International Conference on Multimedia in Asia, 2024

PDF
ECCVW

Video editing for video retrieval

Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, and Dima Damen

In European Conference on Computer Vision Workshop, 2024

PDF

2023

MM

CgT-GAN: CLIP-guided text GAN for image captioning

Jiarui Yu, Haoran Li, Yanbin Hao, Bin Zhu, Tong Xu, and Xiangnan He

In Proceedings of the 31st ACM International Conference on Multimedia, 2023

PDF Code

2022

NeurIPS

EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations

Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen

In Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 2022

PDF Website
MM

Unsupervised video hashing with multi-granularity contextualization and multi-structure preservation

Yanbin Hao, Jingru Duan, Hao Zhang, Bin Zhu, Pengyuan Zhou, and Xiangnan He

In Proceedings of the 30th ACM International Conference on Multimedia, 2022

PDF Code
MM Oral

Mix-DANN and dynamic-modal-distillation for video domain adaptation

Yuehao Yin, Bin Zhu, Jingjing Chen, Lechao Cheng, and Yu-Gang Jiang

In Proceedings of the 30th ACM International Conference on Multimedia, 2022

PDF
ICMR

Cross-lingual adaptation for recipe retrieval with mixup

Bin Zhu, Chong-Wah Ngo, Jingjing Chen, and Wing-Kwong Chan

In Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022

PDF

2021

TMM

Learning from web recipe-image pairs for food recognition: Problem, baselines and performance

Bin Zhu, Chong-Wah Ngo, and Wing-Kwong Chan

IEEE Transactions on Multimedia, 2021

PDF
TIP

Learning to match anchor-target video pairs with dual attentional holographic networks

Yanbin Hao, Chong-Wah Ngo, and Bin Zhu

IEEE Transactions on Image Processing, 2021

PDF

2020

TIP

A study of multi-task and region-wise deep learning for food ingredient recognition

Jingjing Chen, Bin Zhu, Chong-Wah Ngo, Tat-Seng Chua, and Yu-Gang Jiang

IEEE Transactions on Image Processing, 2020

PDF
MM Grand Challenge

Person-level action recognition in complex events via TSD-TSM networks

Yanbin Hao, Zi-Niu Liu, Hao Zhang, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, and Chong-Wah Ngo

In Proceedings of the 28th ACM International Conference on Multimedia Grand Challenge: Human Centric Analysis, 2020

PDF
MM

Cross-domain cross-modal food transfer

Bin Zhu, Chong-Wah Ngo, and Jingjing Chen

In Proceedings of the 28th ACM International Conference on Multimedia, 2020

PDF
CVPR

CookGAN: Causality based text-to-image synthesis

Bin Zhu and Chong-Wah Ngo

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

PDF

2019

CVPR

R2GAN: Cross-modal recipe retrieval with generative adversarial network

Bin Zhu, Chong-Wah Ngo, Jingjing Chen, and Yanbin Hao

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

PDF