This week at the Computer Vision and Pattern Recognition (CVPR) conference in Seattle, NVIDIA researchers are presenting a series of groundbreaking visual generative AI models and techniques that are set to redefine the landscape of image generation, 3D scene editing, visual language understanding, and autonomous vehicle perception.

Jan Kautz, NVIDIA’s VP of learning and perception research, described generative AI as a pivotal technological advancement. NVIDIA’s contributions to CVPR span more than 50 research projects, two of which are finalists for the Best Paper Awards: one on the training dynamics of diffusion models and one on high-definition maps for self-driving cars.
A notable achievement for NVIDIA at CVPR is winning the Autonomous Grand Challenge’s End-to-End Driving at Scale track, outperforming more than 450 entries worldwide. The victory, which earned NVIDIA a CVPR Innovation Award, showcases the company’s leading-edge work in applying generative AI to comprehensive self-driving vehicle models.
Among the innovative projects highlighted is JeDi, a new technique that lets users rapidly customize a diffusion model for text-to-image generation using just a few reference images, removing the need for time-consuming fine-tuning on custom datasets.
Another significant advancement is FoundationPose, a new foundation model that sets a performance record by instantly understanding and tracking the 3D pose of objects in videos without per-object training. This technology has the potential to revolutionize augmented reality (AR) and robotics applications.
NVIDIA’s team of researchers has unveiled NeRFDeformer, an innovative method that allows for the editing of 3D scenes captured by Neural Radiance Fields (NeRF) with just a single 2D image. This breakthrough simplifies the process of 3D scene modification, eliminating the need for manual reanimation or complete NeRF reconstruction, which could greatly benefit graphics, robotics, and digital twin technologies.
In collaboration with MIT, NVIDIA has also introduced VILA, a cutting-edge family of vision language models that set new standards in image, video, and text comprehension. VILA’s advanced reasoning abilities enable it to interpret internet memes by integrating visual cues with textual context.
NVIDIA’s visual AI research is making waves across various sectors, with researchers presenting over a dozen CVPR papers on pioneering methods for perception, mapping, and planning in autonomous vehicles. Sanja Fidler, VP of NVIDIA’s AI Research team, is discussing the transformative impact vision language models could have on the future of self-driving technology.
The extensive range of NVIDIA’s research presented at CVPR showcases the vast potential of generative AI to enhance creative processes, streamline manufacturing and healthcare automation, and drive advancements in autonomy and robotics.