Junsong Chen

I am currently a research intern at NVIDIA Research, supervised by Dr. Enze Xie and Prof. Song Han. I was also a Research Assistant at The University of Hong Kong and a PhD candidate at DLUT. During my PhD, I have been fortunate to work under the supervision of Prof. Ping Luo and Prof. Huchuan Lu.

Email  /  CV  /  Google Scholar  /  GitHub


Research Interests


In 2023 and 2024, I have been working on the following research topics:

  • AIGC, e.g., diffusion models in 2D/3D
  • Large Language Models (LLMs)
  • Autonomous Driving, e.g., 3D perception & prediction

Previously, I did research on 2D/3D detection, segmentation, and object tracking. Representative works include MetaBEV for 3D object detection and segmentation, and ARKitTrack for 2D object tracking.


News


  • [06/2024] One paper was accepted by ECCV 2024.
  • [01/2024] One paper was accepted by ICLR 2024 (Spotlight).
  • [12/2023] One paper was accepted by AAAI 2024.
  • [07/2023] One paper was accepted by ICCV 2023.
  • [03/2023] One paper was accepted by CVPR 2023.

Selected Publications


For the latest full list, please check here.

AIGC (Text-to-Image Diffusion)

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen*, Jincheng Yu*, Chongjian Ge*, Lewei Yao*, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li
ICLR, 2024 (Spotlight)
project page / arXiv / Demo / code

PixArt-α is a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art generators (e.g., Imagen, SDXL, and even Midjourney), while its training cost is markedly lower than that of existing large-scale T2I models: PixArt-α takes only 12% of Stable Diffusion v1.5's training time (753 vs. 6,250 A100 GPU days).

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Junsong Chen*, Chongjian Ge*, Enze Xie*†, Yue Wu*, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li
ECCV, 2024
project page / arXiv / code

In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts.

Perception (object detection/segmentation/tracking)

MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation
Chongjian Ge*, Junsong Chen*, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, Ping Luo
ICCV, 2023
project page / arXiv / YouTube / code

MetaBEV is a 3D Bird's-Eye-View (BEV) perception model that is robust to missing or failing sensors, supporting both single-modality (camera/LiDAR) and multi-modal fusion modes with strong performance.

ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data
Haojie Zhao*, Junsong Chen*, Lijun Wang, Huchuan Lu
CVPR, 2023
project page / arXiv / YouTube / code

ARKitTrack is a new RGB-D tracking dataset for both static and dynamic scenes, captured with the consumer-grade LiDAR scanners on Apple's iPhone and iPad. It contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total, along with bounding box annotations and frame-level attributes.