Breakthrough! China releases its first Sora-level video model

2024-05-07


This article is reprinted from the Beijing Daily client.


A polished, detailed 60-second video generated from nothing but a text prompt: since this February, Sora, the text-to-video large model, has caused a sensation across the global artificial intelligence industry and beyond. On the morning of April 27, at the Future Artificial Intelligence Pioneer Forum of the 2024 Zhongguancun Forum Annual Meeting, Shengshu Technology and Tsinghua University jointly released Vidu, China's first video model featuring long duration, high consistency, and high dynamics. Vidu not only simulates the real physical world but also shows rich imagination, with capabilities such as multi-shot generation and high spatiotemporal consistency. It is also the first video model worldwide to achieve a major breakthrough since Sora's release; its overall performance benchmarks against the international state of the art, and it is iterating and improving rapidly.


It is understood that the model adopts the team's original U-ViT architecture, which combines Diffusion and Transformer, and supports one-click generation of high-definition video up to 16 seconds long at resolutions up to 1080p.
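For readers curious about what "combining Diffusion and Transformer" means in practice, here is a minimal sketch of the U-ViT idea in PyTorch: the diffusion timestep, the text condition, and the noisy video patches all enter a single Transformer as tokens, with long skip connections linking shallow and deep blocks. All names, sizes, and layer choices below are illustrative assumptions, not Vidu's actual implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm Transformer block (self-attention + MLP)."""
    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class UViTSketch(nn.Module):
    """Hypothetical U-ViT-style noise predictor: timestep, text, and
    noisy patches are all tokens; long skip connections form the 'U'."""
    def __init__(self, dim=256, depth=4, heads=4, patch_dim=192, text_dim=256):
        super().__init__()
        self.patch_in = nn.Linear(patch_dim, dim)  # noisy patches -> tokens
        self.text_in = nn.Linear(text_dim, dim)    # text features -> tokens
        self.time_in = nn.Linear(1, dim)           # timestep -> one token
        self.in_blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])
        self.mid_block = Block(dim, heads)
        # Each long skip concatenates a shallow activation with a deep one.
        self.skip_proj = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(depth)])
        self.out_blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])
        self.patch_out = nn.Linear(dim, patch_dim)  # predicted noise per patch

    def forward(self, patches, text, t):
        # patches: (B, N, patch_dim); text: (B, M, text_dim); t: (B, 1)
        x = torch.cat([self.time_in(t).unsqueeze(1),
                       self.text_in(text),
                       self.patch_in(patches)], dim=1)
        skips = []
        for blk in self.in_blocks:
            x = blk(x)
            skips.append(x)
        x = self.mid_block(x)
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            x = blk(proj(torch.cat([x, skips.pop()], dim=-1)))
        n_extra = 1 + text.shape[1]          # drop the time and text tokens
        return self.patch_out(x[:, n_extra:])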


In the on-site demonstration, Vidu simulated the real physical world, generating scenes with complex details that obey real physical laws, such as plausible lighting and shadows and delicate facial expressions. It also showed rich imagination, generating fictional scenes that do not exist in the real world and creating surreal content with depth and complexity, such as "a ship in a studio sailing toward the camera through the waves".


In addition, Vidu can generate complex dynamic shots. It is no longer limited to simple camera moves such as push-ins, pull-outs, and pans: within a single clip it can switch among long, medium, close, and close-up shots around a unified subject, and it can directly generate effects such as long takes, focus pulls, and transitions, giving the video a genuine shot language.


As a video model developed independently in China, Vidu can also understand and generate distinctly Chinese elements, such as pandas and dragons, in its videos.


It is worth mentioning that the clips in the short film are generated continuously from start to finish, with no obvious frame interpolation. From this "single continuous take" behavior, it can be inferred that Vidu uses a "one-step" generation approach: like Sora, it converts text to video directly and continuously, with the underlying algorithm generating end to end from a single model, without intermediate frame interpolation or other multi-step processing.
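To make the "one-step, end-to-end" claim concrete, here is a hedged sketch of what it implies: a single diffusion model denoises one latent covering the entire clip jointly at every sampling step, so temporal coherence comes from the model itself rather than from a keyframe-plus-interpolation pipeline. The sampler, shapes, and names below are illustrative assumptions (reusing the hypothetical UViTSketch above), not Vidu's published internals.

```python
import torch

@torch.no_grad()
def sample_clip(model, text_tokens, frames=16, patches_per_frame=64,
                patch_dim=192, steps=50, device="cpu"):
    # One latent for the whole clip: every frame's patches are denoised
    # together in each model call, so there is no separate keyframe
    # stage and no frame interpolation afterwards.
    x = torch.randn(1, frames * patches_per_frame, patch_dim, device=device)
    for i in reversed(range(1, steps + 1)):
        t = torch.full((1, 1), i / steps, device=device)
        eps = model(x, text_tokens, t)  # noise predicted for the whole clip
        x = x - eps / steps             # simplified Euler-style update,
                                        # standing in for a real sampler
    return x.view(1, frames, patches_per_frame, patch_dim)

# Usage with the UViTSketch defined earlier (the text tensor is a
# stand-in for an encoded prompt; a real system would decode the
# resulting latent back to pixels):
model = UViTSketch()
clip_latent = sample_clip(model, torch.randn(1, 8, 256))
```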


It is understood that Vidu's rapid breakthrough stems from the team's long-term accumulation in Bayesian machine learning and multimodal large models, along with a number of original research results. The core U-ViT architecture was proposed by the team in September 2022, earlier than the DiT architecture adopted by Sora; it is the world's first architecture to fuse Diffusion and Transformer and was developed entirely in-house.


"After the release of Sora, we found that it happened to be highly consistent with our technological roadmap, which also made us firmly further advance our research," said Zhu Jun, Vice Dean of the Institute of Artificial Intelligence at Tsinghua University and Chief Scientist of Shengshu Technology. Since the release of Sora in February this year, based on a deep understanding of the U-ViT architecture and long-term accumulated engineering and data experience, the team has further broken through key technologies for long video representation and processing in just two months. They have developed and launched the Vidu video big model, significantly improving the coherence and dynamism of videos.




