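Where Is the New Wave of Egocentric Video Prediction Signaled by PEVA Headed?

PEVA is a new model that predicts egocentric video from human actions

Original article title: PEVA: Predicting Egocentric Video from Human Actions

BAIR Blog, July 1, 2025

RESEARCH: Research paper / Preprint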

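Field Note: Before you read

Three-line summary

PEVA predicts the next frame from past frames and a change in 3D pose
It can generate counterfactual scenarios and long videos
It is a promising video prediction model for agents acting in the real world

Who this is relevant to

Machine learning researchers, robotics engineers, VR/AR developers

Reliability note

Official information from the BAIR Blog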

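Reading: Interpreting the article

Using the original article as material, this section organizes the key points, the editorial view, and the strengths and concerns in an easy-to-read order.

PEVA predicts the next frame from past frames and an action that specifies a change in 3D pose. Given a first frame and a sequence of actions, the model can generate videos of atomic actions, simulate counterfactual scenarios, and generate long videos. The authors describe it as an initial attempt at modeling real-world environments and embodied agent behavior.

Editor's comment

PEVA is a new video prediction model aimed at agents acting in the real world, and its ability to generate counterfactual scenarios stands out. Handling complex action spaces and adapting to real-world diversity are likely to remain open challenges.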

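Assessment

Strengths

PEVA handles viewpoint changes driven by changes in 3D pose
It can simulate counterfactual scenarios
It supports long video generation (the original article shows coherent 16-second rollouts), which broadens its potential real-world applications

Concerns

Handling of complex action spaces may still be incomplete; the original article notes that planning optimizes only one arm's actions while holding the rest of the body fixed
Covering the diversity of the real world will likely require a large amount of training data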

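Impact on Industry and Society

PEVA could have a significant influence on simulation and control in robotics and VR/AR. It looks especially promising as a model for simulating embodied experience with a complex action space.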

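Deep Dive

Background

In recent years, research on world models has advanced, producing techniques that predict future states for use in planning and control, ranging from intuitive physics to multi-step video prediction. However, most of these models work with abstract control signals, and very few are designed for embodied agents that act in the real world. An embodied agent has a physically grounded, complex action space and must act in diverse real-life situations.

What's new

PEVA predicts the next frame from past frames and a change in 3D pose, which sets it apart from earlier world models. Where those models took abstract control signals as input, PEVA takes the embodied agent's own viewpoint (the egocentric view) into account and conditions its predictions on actual physical actions. This enables generating sequences of atomic actions, simulating counterfactual scenarios, and generating long videos.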
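
The original article describes inference as an autoregressive rollout: the model predicts the next frame from a window of context frames plus the current action, the prediction is appended to the context, the oldest frame is dropped, and the loop repeats for each action. A minimal sketch of that loop follows. It assumes a hypothetical `predict_next` callable standing in for the trained diffusion model and plain NumPy arrays standing in for VAE latents; neither is PEVA's actual API, and `toy_predictor` is purely illustrative.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

import numpy as np

Latent = np.ndarray
Action = np.ndarray


def rollout(
    context: Sequence[Latent],
    actions: Sequence[Action],
    predict_next: Callable[[List[Latent], Action], Latent],
) -> List[Latent]:
    """Autoregressively predict one latent frame per action.

    The context is a fixed-size sliding window: each prediction is appended
    and the oldest frame is dropped.
    """
    window: Deque[Latent] = deque(context, maxlen=len(context))
    predictions: List[Latent] = []
    for action in actions:
        nxt = predict_next(list(window), action)
        predictions.append(nxt)
        window.append(nxt)  # a deque with maxlen evicts the oldest frame
    return predictions


def toy_predictor(frames: List[Latent], action: Action) -> Latent:
    """Hypothetical stand-in for the learned model (not PEVA):
    shifts the most recent frame by the mean of the action vector."""
    return frames[-1] + action.mean()


if __name__ == "__main__":
    context = [np.zeros((4, 4)) for _ in range(3)]
    actions = [np.full(48, 1.0), np.full(48, 2.0), np.full(48, -1.0)]
    frames = rollout(context, actions, toy_predictor)
    print([float(f[0, 0]) for f in frames])  # [1.0, 3.0, 2.0]
```

Because each prediction feeds back into the context, errors can accumulate over long horizons, which is why the long-rollout results in the original article are worth checking.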

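Points to watch

How models like PEVA get applied as world models for embodied agents mature
Whether understanding of the relationship between vision and action in the egocentric view advances
Whether long video generation can cope with complex real-world environments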

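Glossary

Embodied agent: An agent that acts while interacting with a physical environment. The term covers systems that operate in the real world, such as humans and robots.

Egocentric view: Visual information seen from the agent's own viewpoint (for example, a person's line of sight). It refers to what the agent actually sees rather than an overall picture of the environment.

Counterfactual scenario: A technique that assumes events that could occur under conditions different from reality and simulates their outcomes.

World model: A model that predicts the state of the physical world and the results of actions. It is used for planning and control.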
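
To make the last two glossary entries concrete: the original article says PEVA can be used for planning by simulating multiple candidate action sequences and scoring them by perceptual similarity to a goal (measured with LPIPS), optimized with the Cross-Entropy Method (CEM). The sketch below shows CEM under loudly simplified assumptions: `toy_world_model` is a hypothetical stand-in in which each action just shifts a state vector, and `energy` uses squared distance in place of LPIPS. It illustrates the sampling and refitting loop only, not the authors' implementation.

```python
import numpy as np


def toy_world_model(state: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a learned world model:
    each action in the sequence shifts the state."""
    return state + actions.sum(axis=0)


def energy(predicted: np.ndarray, goal: np.ndarray) -> float:
    """Squared distance to the goal; a stand-in for a perceptual
    metric such as LPIPS (assumption, not the article's metric)."""
    return float(np.sum((predicted - goal) ** 2))


def cem_plan(state, goal, horizon=4, samples=64, elites=8, iters=15, seed=0):
    """Cross-Entropy Method over action sequences of shape (horizon, dim)."""
    rng = np.random.default_rng(seed)
    dim = state.shape[0]
    mean = np.zeros((horizon, dim))
    std = np.ones((horizon, dim))
    for _ in range(iters):
        # Sample candidate action sequences (the "what if" scenarios).
        noise = rng.standard_normal((samples, horizon, dim))
        candidates = mean + std * noise
        # Simulate each candidate and score it against the goal.
        scores = np.array(
            [energy(toy_world_model(state, c), goal) for c in candidates]
        )
        # Keep the lowest-energy candidates and refit the sampling distribution.
        elite = candidates[np.argsort(scores)[:elites]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


if __name__ == "__main__":
    start = np.zeros(2)
    goal = np.array([1.0, -2.0])
    plan = cem_plan(start, goal)
    print("energy before:", energy(start, goal))
    print("energy after: ", energy(toy_world_model(start, plan), goal))
```

Each sampled sequence plays the role of a counterfactual: the world model answers "what would I see if I did this?", and only the candidates whose predicted outcome lands near the goal shape the next round of sampling.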

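Sources

The original article and the sources consulted for the deep dive. For community posts and preprints, you can check the supporting evidence here.

PEVA: Predicting Egocentric Video from Human Actions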

BAIR Blog

http://bair.berkeley.edu/blog/2025/07/01/peva/

Keywords

PEVA, egocentric video prediction, 3D pose change, counterfactual scenarios

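About the AI summary

AI was used for the summary, classification, and interpretation in this article. We make an effort to verify the content, but it may contain mistranslations, misinterpretations, or fail to reflect updates to the original article. If you are making an important decision, be sure to check the original article as well.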

Article data

Source	Official information
Category	Research paper
Status	Completed article
Publisher	BAIR Blog
Published	2025-07-01

Original article description

Figure: Predicting Ego-centric Video from human Actions (PEVA), https://arxiv.org/abs/2506.21552. Given past video frames and an action specifying a desired change in 3D pose, PEVA predicts the next video frame. Our results show that, given the first frame and a sequence of actions, our model can generate videos of atomic actions (a), simulate counterfactuals (b), and support long video generation (c).

Recent years have brought significant advances in world models that learn to simulate future outcomes for planning and control. From intuitive physics to multi-step video prediction, these models have grown increasingly powerful and expressive. But few are designed for truly embodied agents. In order to create a World Model for Embodied Agents, we need a real embodied agent that acts in the real world. A real embodied agent has a physically grounded complex action space as opposed to abstract control signals. They also must act in diverse real-life scenarios and feature an egocentric view as opposed to aesthetic scenes and stationary cameras.

Why It’s Hard

Action and vision are heavily context-dependent. The same view can lead to different movements and vice versa. This is because humans act in complex, embodied, goal-directed environments.
Human control is high-dimensional and structured. Full-body motion spans 48+ degrees of freedom with hierarchical, time-dependent dynamics.
Egocentric view reveals intention but hides the body. First-person vision reflects goals, but not motion execution; models must infer consequences from invisible physical actions.
Perception lags behind action. Visual feedback often comes seconds later, requiring long-horizon prediction and temporal reasoning.

To develop a World Model for Embodied Agents, we must ground our approach in agents that meet these criteria. Humans routinely look first and act second: our eyes lock onto a goal, the brain runs a brief visual “simulation” of the outcome, and only then does the body move. At every moment, our egocentric view both serves as input from the environment and reflects the intention/goal behind the next movement. When we consider our body movements, we should consider both actions of the feet (locomotion and navigation) and the actions of the hand (manipulation), or more generally, whole-body control.

What Did We Do?

We trained a model to Predict Ego-centric Video from human Actions (PEVA, https://arxiv.org/abs/2506.21552) for Whole-Body-Conditioned Egocentric Video Prediction. PEVA conditions on kinematic pose trajectories structured by the body’s joint hierarchy, learning to simulate how physical human actions shape the environment from a first-person view. We train an autoregressive conditional diffusion transformer on Nymeria, a large-scale dataset pairing real-world egocentric video with body pose capture. Our hierarchical evaluation protocol tests increasingly challenging tasks, providing comprehensive analysis of the model’s embodied prediction and control abilities. This work represents an initial attempt to model complex real-world environments and embodied agent behaviors through human-perspective video prediction.

Method

Structured Action Representation from Motion

To bridge human motion and egocentric vision, we represent each action as a rich, high-dimensional vector capturing both full-body dynamics and detailed joint movements. Instead of using simplified controls, we encode global translation and relative joint rotations based on the body’s kinematic tree. Motion is represented in 3D space with 3 degrees of freedom for root translation and 15 upper-body joints. Using Euler angles for relative joint rotations yields a 48-dimensional action space (3 + 15 × 3 = 48). Motion capture data is aligned with video using timestamps, then converted from global coordinates to a pelvis-centered local frame for position and orientation invariance. All positions and rotations are normalized to ensure stable learning. Each action captures inter-frame motion changes, enabling the model to connect physical movement with visual consequences over time.

Design of PEVA: Autoregressive Conditional Diffusion Transformer

While the Conditional Diffusion Transformer (CDiT) from Navigation World Models uses simple control signals like velocity and rotation, modeling whole-body human motion presents greater challenges. Human actions are high-dimensional, temporally extended, and physically constrained. To address these challenges, we extend the CDiT method in three ways:

Random Timeskips: Allows the model to learn both short-term motion dynamics and longer-term activity patterns.
Sequence-Level Training: Models entire motion sequences by applying loss over each frame prefix.
Action Embeddings: Concatenates all actions at time t into a 1D tensor to condition each AdaLN layer for high-dimensional whole-body motion.

Sampling and Rollout Strategy

At test time, we generate future frames by conditioning on a set of past context frames. We encode these frames into latent states and add noise to the target frame, which is then progressively denoised using our diffusion model. To speed up inference, we restrict attention: within-image attention is applied only to the target frame, and context cross-attention is applied only for the last frame. For action-conditioned prediction, we use an autoregressive rollout strategy. Starting with context frames, we encode them using a VAE encoder and append the current action. The model then predicts the next frame, which is added to the context while dropping the oldest frame, and the process repeats for each action in the sequence. Finally, we decode the predicted latents into pixel-space using a VAE decoder.

Atomic Actions

We decompose complex human movements into atomic actions, such as hand movements (up, down, left, right) and whole-body movements (forward, rotation), to test the model’s understanding of how specific joint-level movements affect the egocentric view. We include some samples here:

Figures (Body Movement Actions): Move Forward, Rotate Left, Rotate Right
Figures (Left Hand Actions): Move Left Hand Up, Move Left Hand Down, Move Left Hand Left, Move Left Hand Right
Figures (Right Hand Actions): Move Right Hand Up, Move Right Hand Down, Move Right Hand Left, Move Right Hand Right

Long Rollout

Here you can see the model’s ability to maintain visual and semantic consistency over extended prediction horizons. We demonstrate some samples of PEVA generating coherent 16-second rollouts conditioned on full-body motion. We include some video samples and image samples for closer viewing here:

Figures: Sequence 1, Sequence 2, Sequence 3

Planning

PEVA can be used for planning by simulating multiple action candidates and scoring them based on their perceptual similarity to the goal, as measured by LPIPS.

Figure: In this example, it rules out paths that lead to the sink or outdoors, finding the correct path to open the fridge.

Figure: In this example, it rules out paths that lead to grabbing nearby plants and going to the kitchen while finding a reasonable sequence of actions that leads to the shelf.

Enables Visual Planning Ability

We formulate planning as an energy minimization problem and perform action optimization using the Cross-Entropy Method (CEM), following the approach introduced in Navigation World Models [arXiv:2412.03572, https://arxiv.org/abs/2412.03572]. Specifically, we optimize action sequences for either the left or right arm while holding other body parts fixed. Representative examples of the resulting plans are shown below:

Figure: In this case, we are able to predict a sequence of actions that raises our right arm to the mixing stick. We see a limitation with our method as we only predict the right arm so we do not predict to move the left arm down accordingly.

Figure: In this case, we are able to predict a sequence of actions that reaches toward the kettle but does not quite grab it as in the goal.

Figure: In this case, we are able to predict a sequence of actions that pulls our left arm in, similar to the goal.

Quantitative Results

We evaluate PEVA across multiple metrics to demonstrate its effectiveness in generating high-quality egocentric videos from whole-body actions. Our model consistently outperforms baselines in perceptual quality, maintains coherence over long time horizons, and shows strong scaling properties with model size.

Figure (Baseline Perceptual Metrics): Baseline perceptual metrics comparison across different models.

Figure (Atomic Action Performance): Comparison of models in generating videos of atomic actions.

Figure (FID Comparison): FID comparison across different models and time horizons.

Figure (Scaling): PEVA has good scaling ability. Larger models lead to better performance.

Future Directions

Our model demonstrates promising results in predicting egocentric video from whole-body motion, but it remains an early step toward embodied planning. Planning is limited to simulating candidate arm actions and lacks long-horizon planning and full trajectory optimization. Extending PEVA to closed-loop control or interactive environments is a key next step. The model currently lacks explicit conditioning on task intent or semantic goals. Our evaluation uses image similarity as a proxy objective. Future work could combine PEVA with high-level goal conditioning and integrate object-centric representations.

Acknowledgements

The authors thank Rithwik Nukala for his help in annotating atomic actions. We thank Katerina Fragkiadaki, Philipp Krähenbühl, Bharath Hariharan, Guanya Shi, Shubham Tulsiani and Deva Ramanan for the useful suggestions and feedback for improving the paper; Jianbo Shi for the discussion regarding control theory; Yilun Du for the support on Diffusion Forcing; Brent Yi for his help in human motion related works and Alexei Efros for the discussion and debates regarding world models. This work is partially supported by the ONR MURI N00014-21-1-2801.

For more details, read the full paper (https://arxiv.org/abs/2506.21552) or visit the project website (https://dannytran123.github.io/PEVA/).
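
The action representation described in the Method section above (3 degrees of freedom for root translation plus Euler angles for 15 upper-body joints, giving 3 + 15 × 3 = 48 dimensions, with each action capturing inter-frame change) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `make_action`, the input layout, and the wrapping of angle differences into [-π, π) are all hypothetical choices, and the normalization and pelvis-centered frame conversion mentioned in the article are omitted.

```python
import numpy as np

NUM_JOINTS = 15                  # upper-body joints, per the article
ACTION_DIM = 3 + NUM_JOINTS * 3  # root translation + Euler angles per joint


def make_action(root_prev, root_next, euler_prev, euler_next) -> np.ndarray:
    """Build one action vector from two consecutive poses.

    root_*:  (3,) root position in a local frame.
    euler_*: (15, 3) relative joint rotations as Euler angles in radians.
    Normalization is intentionally left out of this sketch.
    """
    root_delta = np.asarray(root_next, dtype=float) - np.asarray(root_prev, dtype=float)
    diff = np.asarray(euler_next, dtype=float) - np.asarray(euler_prev, dtype=float)
    # Assumption: wrap angle differences into [-pi, pi) so that crossing
    # the +/-pi boundary does not look like a large jump.
    rot_delta = (diff + np.pi) % (2 * np.pi) - np.pi
    return np.concatenate([root_delta, rot_delta.reshape(-1)])


if __name__ == "__main__":
    action = make_action(
        np.zeros(3),
        np.array([0.1, 0.0, 0.0]),
        np.zeros((NUM_JOINTS, 3)),
        np.full((NUM_JOINTS, 3), 0.2),
    )
    print(action.shape)  # (48,)
```

Subtracting Euler angles component-wise is a simplification; a production pipeline would more likely compose rotations properly before converting back to angles.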