Luowei Zhou

Azure CV Research


I am currently a Senior Researcher at Microsoft. I received my Ph.D. from the University of Michigan in 2020, under the supervision of Dr. Jason J. Corso, and my bachelor's degree from Nanjing University in 2015.

My research focuses on the intersection of computer vision and natural language processing (vision+language), including visual captioning, grounding, and question answering. My work relies heavily on deep learning and machine learning algorithms. My most recent efforts are on automatic video understanding; featured projects include vision-language pre-training (VLP), YouCook2, grounded video description, and densecap. Previously, I worked on multi-agent RL at Nanjing University. I have spent summers interning at Facebook AI Research, MSR, and Salesforce Research.


  • [06/2022] Introducing the Vision Foundation Model Florence (XD's keynote at CVPR'22). Check out the coverage by The Economist on Florence-based Visual Storytelling (paper; demos).
  • [07/2022] Two papers accepted at ECCV'22: Modality-Shared CLIP (MS-CLIP) and Low-Rank Decompose and Align for few-shot transfer (DNA). Details coming soon.
  • [03/2022] Three papers accepted at CVPR'22: CLIP-Event (knowledge for CLIP, Oral), RegionCLIP (zero-shot detection), and BEVT (BERT pre-training for Video Transformers).
  • [06/2021] Our ClipBERT won the CVPR'21 Best Student Paper Honorable Mention!
  • [06/2021] Videos from our CVPR'21 vision+language tutorial are available now. My summary on recent advances in video-and-language pre-training and representation learning.
  • [06/2021] The winner of our CVPR'21 ActivityNet-Entities challenge has been announced. Congrats to the AI M3 team from Renmin University of China and INRIA. See video and report.
  • [06/2021] Check out our NeurIPS 2021 work, the Video-And-Language Understanding Evaluation (VALUE) benchmark, and join our challenge at the CLVL workshop, ICCV'21!
  • [03/2021] Two papers accepted at CVPR'21! ClipBERT (code) is accepted as an oral (w/ three SAs)! Both preprint and code on UC2 (multi-lingual VLP) are available.
  • [02/2021] The 2nd ActivityNet-Entities Object Grounding Challenge at CVPR'21 has started! Click here for more details!
  • [11/2020] Announcing YouCook Leaderboard, a one-stop shop for YouCook2 info & leaderboard.
  • [10/2020] Recognized as one of the top 10% of high-scoring reviewers at NeurIPS'20. Thanks to the organizers for the free registration.
  • [05/2020] Videos from our CVPR'20 tutorial on Recent Advances in V+L are available!
  • [04/2020] My thesis defense titled "Language-Driven Video Understanding" is available on YouTube now. Thanks to Dan for the editing/captions.
  • [04/2020] The CVPR'20 ActivityNet-Entities Object Localization (Grounding) Challenge (a part of the annual ActivityNet Challenge) has officially started! Click here for more details!
  • [04/2020] The YouCook2 text-to-video retrieval task is hosted at the CVPR'20 Video Pentathlon Workshop. Also, check out this awesome demo built by Antoine!
  • [11/2019] Our VLP work is accepted by AAAI'20 (spotlight)! VLP is featured in MSR blog, VentureBeat, and TDS.
  • [09/2019] We introduce our work on Unified Vision-Language Pre-training (VLP), which achieves SotA on image captioning and VQA (datasets: COCO/VQA2.0/Flickr30k) with a single model architecture. Code is available on GitHub. Try it out!
  • [09/2019] I am working with Prof. Justin Johnson on a new class on Deep Learning for Computer Vision at UMich.
  • [04/2019] We released the source code for our CVPR'19 oral paper Grounded Video Description! The evaluation server for our dataset ANet-Entities is live on Codalab!
  • [04/2019] Our grounded video description paper is accepted by CVPR'19 (oral). We made the ActivityNet-Entities dataset (158k bboxes on 52k captions) available on GitHub, including evaluation scripts. Source code is on the way!
  • [03/2018] Our dense video captioning paper is accepted by CVPR 2018 (spotlight).
  • [02/2018] I will join Facebook AI Research (FAIR) as a research intern in summer 2018.
  • [02/2018] I will co-organize the CVPR'18 Workshop on Fine-grained Instructional Video Understanding (FIVER).
  • [11/2017] Our paper on YouCook2 and procedure segmentation is accepted by AAAI 2018 (oral).