publications

Please see Google Scholar for more recent works and arXiv papers.

2025

  1. ICCV
    From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning
    Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang
    In International Conference on Computer Vision, 2025
  2. MM
    Look before you decide: Prompting active deduction of MLLMs for assumptive reasoning
    Yian Li, Wentao Tian, Yang Jiao, Jingjing Chen, Tianwen Qian, Bin Zhu, Na Zhao, and Yu-Gang Jiang
    In Proceedings of the 33rd ACM International Conference on Multimedia, 2025
  3. CVPR
    HD-EPIC: A highly-detailed egocentric video dataset
    Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, and Dima Damen
    In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025
  4. AAAI
    Hand1000: Generating realistic hands from text with only 1,000 images
    Haozhuo Zhang, Bin Zhu, Yu Cao, and Yanbin Hao
    In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
  5. AAAI
    RAGG: Retrieval-Augmented Grasp Generation Model
    Zhenhua Tang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, and Richang Hong
    In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
  6. TMM
    FoodLMM: A versatile food assistant using large multi-modal model
    Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, and Chong-Wah Ngo
    IEEE Transactions on Multimedia, 2025
  7. arXiv
    Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation
    Bin Zhu, Huiyan Qi, Yinxuan Gui, Jingjing Chen, Chong-Wah Ngo, and Ee-Peng Lim
    arXiv preprint arXiv:2501.19017, 2025
  8. arXiv
    Reasoning Models Are More Easily Gaslighted Than You Think
    Bin Zhu, Hailong Yin, Jingjing Chen, and Yu-Gang Jiang
    arXiv preprint arXiv:2506.09677, 2025
  9. arXiv
    Don’t Deceive Me: Mitigating Gaslighting through Attention Reallocation in LMMs
    Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang
    arXiv preprint arXiv:2504.09456, 2025
  10. WACV
    Retrieval augmented recipe generation
    Guoshan Liu, Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang
    In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025
  11. TOMM
    CookingDiffusion: Cooking procedural image generation with Stable Diffusion
    Yuan Wang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, and Xiang Wang
    ACM Transactions on Multimedia Computing, Communications and Applications, 2025
  12. ICMR
    Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion
    Huiyan Qi, Bin Zhu, Chong-Wah Ngo, Jingjing Chen, and Ee-Peng Lim
    In ACM International Conference on Multimedia Retrieval (ICMR), 2025
  13. ICME
    Efficient Prompt Tuning for Hierarchical Ingredient Recognition
    Yinxuan Gui, Bin Zhu, Jingjing Chen, and Chong-Wah Ngo
    In IEEE International Conference on Multimedia and Expo (ICME), 2025
  14. ASSETS
    Exploring Object Status Recognition for Recipe Progress Tracking in Non-Visual Cooking
    Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, and Patrick Carrington
    In International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), 2025
  15. CHI-LBW
    OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking
    Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, and Patrick Carrington
    In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2025

2024

  1. MM Oral
    Navigating weight prediction with diet diary
    Yinxuan Gui, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang
    In Proceedings of the 32nd ACM International Conference on Multimedia, 2024
  2. ECCV
    Enhancing recipe retrieval with foundation models: A data augmentation perspective
    Fangzhou Song, Bin Zhu, Yanbin Hao, and Shuo Wang
    In European Conference on Computer Vision, 2024
  3. TMM
    From canteen food to daily meals: Generalizing food recognition to more practical scenarios
    Guoshan Liu, Yang Jiao, Jingjing Chen, Bin Zhu, and Yu-Gang Jiang
    IEEE Transactions on Multimedia, 2024
  4. TMM
    Efficient Unsupervised Video Hashing with Contextual Modeling and Structural Controlling
    Jingru Duan, Yanbin Hao, Bin Zhu, Lechao Cheng, Pengyuan Zhou, and Xiang Wang
    IEEE Transactions on Multimedia, 2024
  5. TOMM
    Text-driven video prediction
    Xue Song, Jingjing Chen, Bin Zhu, and Yu-Gang Jiang
    ACM Transactions on Multimedia Computing, Communications and Applications, 2024
  6. TOMM
    CVLP-NaVD: Contrastive Visual-Language Pre-training Models for Non-annotated Visual Description
    Haoran Li, Yanbin Hao, Jiarui Yu, Bin Zhu, Shuo Wang, and Tong Xu
    ACM Transactions on Multimedia Computing, Communications and Applications, 2024
  7. MM Asia
    Active Object Segmentation: A New Modality for Egocentric Action Recognition
    Jian Ma, Bin Zhu, Kun Li, and Dima Damen
    In Proceedings of the 6th ACM International Conference on Multimedia in Asia, 2024
  8. ECCVW
    Video editing for video retrieval
    Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, and Dima Damen
    In European Conference on Computer Vision Workshop, 2024

2023

  1. MM
    CgT-GAN: CLIP-guided text GAN for image captioning
    Jiarui Yu, Haoran Li, Yanbin Hao, Bin Zhu, Tong Xu, and Xiangnan He
    In Proceedings of the 31st ACM International Conference on Multimedia, 2023

2022

  1. NeurIPS
    EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations
    Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen
    In Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 2022
  2. MM
    Unsupervised video hashing with multi-granularity contextualization and multi-structure preservation
    Yanbin Hao, Jingru Duan, Hao Zhang, Bin Zhu, Pengyuan Zhou, and Xiangnan He
    In Proceedings of the 30th ACM International Conference on Multimedia, 2022
  3. MM Oral
    Mix-DANN and dynamic-modal-distillation for video domain adaptation
    Yuehao Yin, Bin Zhu, Jingjing Chen, Lechao Cheng, and Yu-Gang Jiang
    In Proceedings of the 30th ACM International Conference on Multimedia, 2022
  4. ICMR
    Cross-lingual adaptation for recipe retrieval with mixup
    Bin Zhu, Chong-Wah Ngo, Jingjing Chen, and Wing-Kwong Chan
    In Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022

2021

  1. TMM
    Learning from web recipe-image pairs for food recognition: Problem, baselines and performance
    Bin Zhu, Chong-Wah Ngo, and Wing-Kwong Chan
    IEEE Transactions on Multimedia, 2021
  2. TIP
    Learning to match anchor-target video pairs with dual attentional holographic networks
    Yanbin Hao, Chong-Wah Ngo, and Bin Zhu
    IEEE Transactions on Image Processing, 2021

2020

  1. TIP
    A study of multi-task and region-wise deep learning for food ingredient recognition
    Jingjing Chen, Bin Zhu, Chong-Wah Ngo, Tat-Seng Chua, and Yu-Gang Jiang
    IEEE Transactions on Image Processing, 2020
  2. MM Grand Challenge
    Person-level action recognition in complex events via TSD-TSM networks
    Yanbin Hao, Zi-Niu Liu, Hao Zhang, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, and Chong-Wah Ngo
    In Proceedings of the 28th ACM International Conference on Multimedia Grand Challenge: Human Centric Analysis, 2020
  3. MM
    Cross-domain cross-modal food transfer
    Bin Zhu, Chong-Wah Ngo, and Jingjing Chen
    In Proceedings of the 28th ACM International Conference on Multimedia, 2020
  4. CVPR
    CookGAN: Causality based text-to-image synthesis
    Bin Zhu and Chong-Wah Ngo
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019

  1. CVPR
    R2GAN: Cross-modal recipe retrieval with generative adversarial network
    Bin Zhu, Chong-Wah Ngo, Jingjing Chen, and Yanbin Hao
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019