Sora and V-JEPA Have Not Learned The Complete Real World Model -- A Philosophical Analysis of Video AIs Through the Theory of Productive Imagination

Jianqiu Zhang
2024-05-07
Abstract: Sora from OpenAI has shown exceptional performance, yet it faces scrutiny over whether its technological prowess equates to an authentic comprehension of reality. Critics contend that it lacks a foundational grasp of the world, a deficiency V-JEPA from Meta aims to amend with its joint embedding approach. This debate is vital for steering the future direction of Artificial General Intelligence (AGI). We enrich this debate by developing a theory of productive imagination that generates a coherent world model based on Kantian philosophy. We identify three indispensable components of the coherent world model capable of genuine world understanding: representations of isolated objects, an a priori law of change across space and time, and Kantian categories. Our analysis reveals that Sora is limited because of its oversight of the a priori law of change and Kantian categories, flaws that are not rectifiable through scaling up the training. V-JEPA learns the context-dependent aspect of the a priori law of change. Yet it fails to fully comprehend Kantian categories and incorporate experience, leading us to conclude that neither system currently achieves a comprehensive world understanding. Nevertheless, each system has developed components essential to advancing an integrated AI productive imagination-understanding engine. Finally, we propose an innovative training framework for an AI productive imagination-understanding engine, centered around a joint embedding system designed to transform disordered perceptual input into a structured, coherent world model. Our philosophical analysis pinpoints critical challenges within contemporary video AI technologies and a pathway toward achieving an AI system capable of genuine world understanding, such that it can be applied for reasoning and planning in the future.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper examines the limitations of current video understanding and generation systems, focusing on Sora from OpenAI and V-JEPA from Meta. Drawing on Kant's theory of productive imagination, the author assesses through philosophical analysis how far each system falls short of constructing a real-world model. Sora can generate highly realistic videos from text prompts, but it has deficiencies in simulating complex physical interactions and causal relationships. V-JEPA, by contrast, learns about the world from the spatio-temporal correlations between video clips, but it fails to fully comprehend Kantian categories and performs poorly on complex scenes.
The paper argues that a complete world model needs three components: representations of isolated objects, an a priori law of change across space and time, and Kantian categories. Sora overlooks the a priori law of change and the Kantian categories, and the paper holds that scaling up training cannot fix these flaws. V-JEPA learns the context-dependent aspect of the a priori law of change, but it lacks a full grasp of the Kantian categories and does not incorporate experience. As a result, neither system achieves a true understanding of the world: unlike humans, neither can construct a coherent world model from fragmented perceptual data.
To address this, the paper proposes a training framework centered on a joint embedding system that transforms disordered perceptual input into a structured, coherent world model. The underlying theory of productive imagination builds on Kant but departs from him in emphasizing the role of experience in forming a coherent world model. The author concludes that AI systems need to integrate these elements better to achieve a comprehensive understanding of the real world, one that can later support reasoning and planning.
In summary, the paper aims to reveal the challenges faced by contemporary video AI technologies through philosophical analysis, pointing out the limitations of Sora and V-JEPA in understanding and simulating the real world, and proposing a theoretical framework to improve existing systems.