Research Interests
In 2023 and 2024, I have been working on the following research topics:
- AIGC, e.g., diffusion models for 2D/3D generation
- Large Language Models (LLMs)
- Autonomous Driving, e.g., 3D perception and prediction
Previously, I did research on 2D/3D detection, segmentation, and object tracking. Representative works include MetaBEV for 3D object detection and segmentation, and ARKitTrack for RGB-D object tracking.
News
- [06/2024] One paper was accepted by ECCV 2024.
- [01/2024] One paper was accepted by ICLR 2024 (Spotlight).
- [12/2023] One paper was accepted by AAAI 2024.
- [07/2023] One paper was accepted by ICCV 2023.
- [03/2023] One paper was accepted by CVPR 2023.
Selected Publications
For the latest full list, please check here.
AIGC (Text-to-Image Diffusion)
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen*,
Jincheng Yu*,
Chongjian Ge*,
Lewei Yao*,
Enze Xie,
Yue Wu,
Zhongdao Wang,
James Kwok,
Ping Luo,
Huchuan Lu,
Zhenguo Li
ICLR, 2024 (Spotlight)
project page /
arXiv /
Demo /
code
PixArt-α is a Transformer-based text-to-image (T2I) diffusion model whose generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), while its training cost is markedly lower than that of existing large-scale T2I models: PixArt-α takes only 12% of Stable Diffusion v1.5's training time (753 vs. 6,250 A100 GPU days).
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Junsong Chen*,
Chongjian Ge*,
Enze Xie*†,
Yue Wu*,
Lewei Yao,
Xiaozhe Ren,
Zhongdao Wang,
Ping Luo,
Huchuan Lu,
Zhenguo Li
ECCV, 2024
project page /
arXiv /
code
PixArt-Σ is a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. It represents a significant advancement over its predecessor, PixArt-α, offering markedly higher image fidelity and improved alignment with text prompts.
Perception (object detection/segmentation/tracking)
MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation
Chongjian Ge*,
Junsong Chen*,
Enze Xie,
Zhongdao Wang,
Lanqing Hong,
Huchuan Lu,
Zhenguo Li,
Ping Luo
ICCV, 2023
project page /
arXiv /
YouTube /
code
MetaBEV is a 3D bird's-eye-view (BEV) perception model that is robust to sensor missing and failure, supporting both single-modality (camera/LiDAR) and multi-modal fusion modes with strong performance.
ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data
Haojie Zhao*,
Junsong Chen*,
Lijun Wang,
Huchuan Lu
CVPR, 2023
project page /
arXiv /
YouTube /
code
ARKitTrack is a new RGB-D tracking dataset for both static and dynamic scenes, captured with the consumer-grade LiDAR scanners on Apple’s iPhone and iPad. It contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total, along with bounding-box annotations and frame-level attributes.