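How Far Have Agent Benchmarks Built from Real Workplace Sessions Come?

EnterpriseClawBench: an enterprise agent benchmark built from real workplace sessions

Original article title: Enterprise agent benchmark EnterpriseClawBench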

arXiv cs.CL, June 23, 2026

Peer review may not be complete. Read this as an early share with the research community, not as a finished, peer-reviewed paper.

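RESEARCH Research paper / Preprint

Field Note: check before reading

Three-line summary

EnterpriseClawBench is an agent benchmark constructed from real workplace sessions
It provides 852 reproducible tasks
The benchmark data is not released because the sessions contain internal enterprise content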

Who this is relevant to

AI developers, enterprise IT staff, researchers

Confidence note

Preprint (may not yet be peer reviewed)

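Reading the Article

Using the original article as material, this section lays out the key points, the editorial view, and the strengths and concerns in an easy-to-read order.

EnterpriseClawBench is an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, it produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. The best reported configuration, Codex with GPT-5.5, reaches only 0.663. Because the sessions contain internal enterprise content, the benchmark data is not released; the authors present the construction and evaluation protocol as their reusable contribution.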

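Editorial comment

EnterpriseClawBench is notable as a benchmark grounded in real workplace sessions. Keep in mind, however, that the benchmark data itself is not released because it contains internal enterprise content.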

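Assessment

Strengths

The benchmark is built from real workplace sessions
It provides 852 reproducible tasks
Results are reported across harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior rather than as a single score

Concerns

The benchmark data is not released because the sessions contain internal enterprise content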

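Impact on Industry and Society

This research offers a new benchmark for evaluating agent performance in real workplace settings, and it may influence how enterprises develop and deploy AI agents.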

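Sources

The original article and the sources consulted for the deeper dive. For community posts and preprints, the supporting evidence can be checked from here.

Enterprise agent benchmark EnterpriseClawBench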

arXiv cs.CL

https://arxiv.org/abs/2606.23654

Keywords

EnterpriseClawBench, agent evaluation, workplace sessions

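About the AI summary

AI is used for the summary, classification, and reading of this article. We make an effort to verify the content, but it may contain mistranslations, misinterpretations, or missed updates to the original article. Before making important decisions, be sure to check the original article as well.

About breaking news: breaking items may be updated as a result of further investigation or body-text extraction. Initial summaries may contain errors or omissions.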

Article data

Source	Preprint
Category	Research paper
Status	Breaking
Outlet	arXiv cs.CL
Published	2026-06-23

Original article description

arXiv:2606.23654v1 Announce Type: new Abstract: Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, the EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness--model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench
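
The abstract argues for reporting several dimensions per harness-model combination instead of one collapsed score. As a rough illustration of what that could look like, here is a minimal Python sketch. It is not the authors' protocol or code: the `TaskRun` record, its field names, and the `report` function are all hypothetical, and visual quality and skill-transfer behavior are left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskRun:
    """One agent run on one task. Every field name here is hypothetical."""
    harness: str              # agent harness that drove the model
    model: str                # underlying model
    rubric_score: float       # semantic rubric score, assumed to lie in [0, 1]
    hard_rules_passed: bool   # True if every hard rule was satisfied
    artifact_delivered: bool  # True if the requested artifact was produced
    cost_usd: float
    runtime_s: float


def report(runs):
    """Group runs by (harness, model) and average each dimension separately."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run.harness, run.model)].append(run)

    table = {}
    for key, rs in groups.items():
        n = len(rs)
        table[key] = {
            "n_tasks": n,
            "rubric_score": mean(r.rubric_score for r in rs),
            "hard_rule_pass_rate": sum(r.hard_rules_passed for r in rs) / n,
            "artifact_delivery_rate": sum(r.artifact_delivered for r in rs) / n,
            "mean_cost_usd": mean(r.cost_usd for r in rs),
            "mean_runtime_s": mean(r.runtime_s for r in rs),
        }
    return table
```

Keying the table on the (harness, model) pair keeps two harnesses that share a model from being merged, and leaving each dimension as its own column avoids choosing weights for a combined score.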