Date: 2025-02-28

Alibaba has launched Wan 2.1, its latest suite of AI-powered video generation models, making them open-source for academic and commercial use. The new models, hosted on Hugging Face, offer a range of capabilities including text-to-video (T2V) and image-to-video (I2V) generation, setting the stage for advancements in AI-driven content creation.

Wan 2.1 consists of four models in two sizes, 1.3 billion and 14 billion parameters, aimed at different video generation tasks:

• T2V-1.3B and T2V-14B (Text-to-Video models)

• I2V-14B-720P and I2V-14B-480P (Image-to-Video models)

Alibaba’s smallest model, T2V-1.3B, is particularly notable because it can run on consumer-grade GPUs with just 8.19GB of VRAM. The company claims that, with this model, an Nvidia RTX 4090 can generate a five-second 480p video in under four minutes.
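A rough back-of-the-envelope estimate helps show why the 1.3-billion-parameter model suits consumer cards while the 14-billion-parameter models are a heavier lift. The sketch below assumes 16-bit weights (2 bytes per parameter), which is an assumption and not a figure from Alibaba. It counts weights only, so it does not reproduce the 8.19GB total, which presumably also covers activations and other parts of the pipeline. The function names are illustrative.

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate memory for model weights alone, in GB (10**9 bytes).

    bytes_per_param=2 assumes 16-bit weights; this is an assumption,
    not a number published by Alibaba.
    """
    return params_billions * 1e9 * bytes_per_param / 1e9


def slowdown_vs_realtime(generation_seconds, clip_seconds):
    """How many seconds of compute are spent per second of generated video."""
    return generation_seconds / clip_seconds


# Parameter counts taken from the model names in the article.
for name, size in [("T2V-1.3B", 1.3), ("T2V-14B", 14.0)]:
    print(f"{name}: roughly {weight_memory_gb(size):.1f} GB of weights at 16-bit precision")

# Four minutes (the upper bound quoted above) for a five-second clip.
ratio = slowdown_vs_realtime(4 * 60, 5)
print(f"At the four-minute bound: {ratio:.0f} seconds of compute per second of video")
```

On these assumptions the weights of the small model take only a fraction of a 24GB RTX 4090, while 16-bit weights for a 14-billion-parameter model would exceed it on their own.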

The AI models use a diffusion transformer architecture, paired with a variational autoencoder (VAE) to optimise memory usage and improve video quality. The 3D causal VAE, dubbed Wan-VAE, enables the system to handle high-resolution (1080p) video while retaining information from earlier frames, which helps keep scenes consistent over time.
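The "causal" in a 3D causal VAE means that each frame's representation depends only on that frame and the ones before it, never on future frames. The following toy example illustrates that idea along the time axis only, in plain Python. It is not Wan-VAE's actual code: the function name `causal_temporal_filter`, the scalar-per-frame stand-in, the averaging kernel, and the choice to pad with the first frame are all assumptions made for illustration.

```python
def causal_temporal_filter(frames, kernel):
    """Filter over time so that output[t] reads only frames[t-k+1 .. t].

    frames: list of floats, one scalar "feature" per frame (a toy stand-in
            for a real frame tensor).
    kernel: list of weights; kernel[-1] multiplies the current frame.
    """
    if not frames:
        return []
    k = len(kernel)
    # Pad on the left only (here with copies of the first frame), so the
    # window for frame t never reaches into frames after t.
    padded = [frames[0]] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


kernel = [0.5, 0.5]  # average of the previous and the current frame
original = causal_temporal_filter([1.0, 2.0, 3.0, 4.0], kernel)
edited = causal_temporal_filter([1.0, 2.0, 3.0, 100.0], kernel)

# Changing the last frame leaves every earlier output untouched.
print(original[:3] == edited[:3])
```

Because earlier outputs never change when later frames do, a model built this way can process a long video in order while carrying forward only what it has already seen.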

Alibaba says Wan 2.1 outperforms OpenAI’s Sora model in several key areas:

• Better scene generation quality

• Higher single-object accuracy

• More precise spatial positioning

Wan 2.1 is being released under the Apache 2.0 license, which permits use, modification and redistribution for research, academic and commercial purposes, provided the license's notice and attribution conditions are met.

While Wan 2.1 currently focuses on text-to-video and image-to-video generation, Alibaba hints that future versions could expand capabilities to include video-to-audio generation and AI-powered video editing.