Luowei Zhou

I am a research scientist at Google DeepMind.

Prior to Google DeepMind, I spent two years at Microsoft working on vision foundation models. Featured projects include Florence and ClipBERT. I received my Ph.D. degree from the University of Michigan in 2020, under the supervision of Dr. Jason J. Corso. I worked on projects including vision-language pre-training (Unified VLP), YouCook2, and one of the first Visual Transformers (densecap). I received my bachelor's degree from Nanjing University in 2015, where I worked on Multi-Agent RL. I did summer internships at FAIR, MSR, and Salesforce Research. I am one of the winners of the CVPR 2021 Best Student Paper Honorable Mention.

Email  /  CV  /  Google Scholar  /  Twitter  /  Linkedin  /  Github

  • [07/2023] MaMMuT is accepted at TMLR.
  • [02/2023] MIST is accepted at CVPR'23.
  • [09/2022] 3 papers accepted at NeurIPS'22.
  • [07/2022] 2 papers accepted at ECCV'22.
  • [06/2022] Florence & Visual Clues are covered by The Economist! Check out XD's keynote at CVPR'22.
  • [03/2022] 3 papers accepted at CVPR'22.
  • [06/2021] Our ClipBERT won the CVPR'21 Best Student Paper HM!
  • [06/2021] My talk on recent advances in video-and-language pre-training from our CVPR'21 vision+language tutorial is available.
  • [06/2021] Congrats to Team RUC and INRIA on winning our CVPR'21 ActivityNet-Entities challenge. video and report.
  • [06/2021] 1 paper accepted at NeurIPS'21 and 1 paper at ACL'21.
  • [06/2021] We will host Video-And-Language Understanding Evaluation challenge (VALUE) at ICCV'21.
  • [03/2021] 2 papers accepted at CVPR'21.
  • [11/2020] Announcing YouCook Leaderboard, a one-stop shop for YouCook2 info & leaderboard.
  • [10/2020] Recognized as top 10% of high-scoring reviewers at NeurIPS'20.
  • [05/2020] Videos from our CVPR'20 tutorial on Recent Advances in V+L are available!
  • [04/2020] My defense recording "Language-Driven Video Understanding" is on YouTube. Editing credit: Dan Newman.
  • [09/2019] Work with Prof. Justin Johnson on the first Deep Learning for Computer Vision course at UMich. Recordings.
Research

    I'm interested in computer vision and its relations to natural language and deep learning, with a focus on learning visual representations from multimodal supervision. Problems of interest include multimodal learning (e.g., captioning, grounding, VQA), video understanding, unsupervised representation learning, generative models, and Transformers.

    MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

    Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou

    CVPR, 2023

    Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

    Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, Michael Zeng

    NeurIPS, 2022
    PDF / Examples / Covered by The Economist

    Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

    Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji

    NeurIPS, 2022
    PDF / Code

    OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

    Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan

    NeurIPS, 2022

    Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

    Haoxuan You*, Luowei Zhou*, Bin Xiao*, Noel Codella*, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan

    ECCV, 2022
    PDF / Code

    DnA: Improving Few-shot Transfer Learning with Low-Rank Decomposition and Alignment

    Ziyu Jiang, Tianlong Chen, Xuxi Chen, Yu Cheng, Luowei Zhou, Lu Yuan, Ahmed Awadallah, Zhangyang Wang

    ECCV, 2022

    BEVT: BERT Pretraining of Video Transformers

    Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, Lu Yuan

    CVPR, 2022
    PDF / Code

    CLIP-Event: Connecting Text and Images With Event Structures

    Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, Shih-Fu Chang

    CVPR, 2022 (Oral)
    PDF / Code

    RegionCLIP: Region-Based Language-Image Pretraining

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao

    CVPR, 2022
    PDF / Code

    Temporally Guided Articulated Hand Pose Tracking in Surgical Videos

    Nathan Louis, Luowei Zhou, Steven J Yule, Roger D Dias, Milisa Manojlovich, Francis D Pagani, Donald S Likosky, Jason J. Corso

    IJCARS, 2022
    PDF / Code

    Florence: A New Foundation Model for Computer Vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang

    arXiv, 2021
    PDF / Azure Blog / XD's CVPR'22 Keynote / Synced

    Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

    Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, Jingjing Liu

    Best Student Paper Honorable Mention award

    CVPR, 2021 (Oral)
    PDF / Code

    UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training

    Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, Jingjing Liu

    CVPR, 2021
    PDF / Code

    VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

    Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg, Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu

    NeurIPS, 2021
    PDF / Benchmark / ICCV'21 Challenge

    Cluster-Former: Clustering-based Sparse Transformer for Question Answering

    Shuohang Wang, Luowei Zhou, Zhe Gan, Yen-Chun Chen, Yuwei Fang, Siqi Sun, Yu Cheng, Jingjing Liu

    ACL (Findings), 2021

    Language-Driven Video Understanding

    Luowei Zhou

    Dissertation, 2020
    PDF / Defense Recording

    Unified Vision-Language Pre-Training for Image Captioning and VQA

    Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao

    AAAI, 2020 (Spotlight)
    PDF / Code / MSR Blog / VentureBeat

    Grounded Video Description

    Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach

    CVPR, 2019 (Oral)
    PDF / Code / ActivityNet-Entities dataset / CVPR'20 Challenge / CVPR'21 Challenge

    Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition

    Hao Huang, Luowei Zhou, Wei Zhang, Jason J. Corso, Chenliang Xu

    BMVC, 2019

    End-to-End Dense Video Captioning With Masked Transformer

    Luowei Zhou*, Yingbo Zhou*, Jason J. Corso, Richard Socher, Caiming Xiong

    CVPR, 2018 (Spotlight)
    PDF / Code

    Towards Automatic Learning of Procedures from Web Instructional Videos

    Luowei Zhou, Chenliang Xu, Jason J. Corso

    AAAI, 2018 (Oral)
    PDF / Code / YouCook2 dataset / Leaderboard

    Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction

    Luowei Zhou, Nathan Louis, Jason J. Corso

    BMVC, 2018
    PDF / Code / YouCook2-BB dataset

    Multiagent Reinforcement Learning With Sparse Interactions by Negotiation and Knowledge Transfer

    Luowei Zhou, Pei Yang, Chunlin Chen, Yang Gao

    Journal Impact Factor: 19.12

    IEEE Transactions on Cybernetics, 2017
    PDF / Code

    A Balanced Heuristic Mechanism for Multirobot Task Allocation of Intelligent Warehouses

    Luowei Zhou, Yuanyuan Shi, Jiangliu Wang, Pei Yang

    MPE, 2014
