VideoAdAgent Bench is an open evaluation for AI agents that produce video ads end-to-end. Today's strongest video models (Veo 3.1, Kling 3.0, Seedance 2.0) generate stunning short clips, but a clip is not an ad. The right test is whether a creative director would ship the output: brand-consistent product framing, narrative arc, persona casting, audio cohesion, on-screen typography. We built this benchmark to measure that gap directly, graded by the same rubric our team uses to ship ads in production.
Across the head-to-head ad-quality pairings, Creatify Agent · Pro mode wins 95% of pairs vs raw foundation-model baselines (both single-scene and multi-scene) and 85% vs competing agents, scores a clean 1.90 on the production hallucination rubric (vs Lite 2.85, Luma 3.20, raw foundation-model multi-scene 3.65–3.85, HeyGen 4.40), and posts the highest production-quality score (0.990) in the field. Both Creatify tiers rank #1 and #2 on the composite leaderboard.
Creatify Agent + every shipping competitor · 3 evaluation arms (Agent vs Agent, Ad Quality, Production Quality) · multi-judge pairwise (Claude Opus 4.7 + GPT-5, position-swap stable consensus) · 15-frame hallucination & visual-defect scoring (Claude Opus 4.7 vision) · deterministic structural compliance (duration, aspect, audio, on-screen text via OCR).
| System | Composite | Arm 1 score | Arm 2 score | Arm 3 score | Success | Cost / video | Cost / 30s | p50 latency |
|---|---|---|---|---|---|---|---|---|
| Creatify Agent · Pro mode | 0.901 | 0.732 | 0.775 | 0.855 | 100% | $6.00 | $5.99 | 1260s |
| Creatify Agent · Lite mode | 0.836 | 0.716 | 0.757 | 0.778 | 100% | $3.00 | $2.97 | 556s |
| Luma Agent | 0.585 | 0.580 | 0.603 | 0.741 | 100% | $8.24 | $10.95 | 4200s |
| HeyGen V3 Agent | 0.327 | 0.358 | 0.378 | 0.609 | 100% | $0.87 | $1.18 | 681s |
| Seedance 2.0 (multi-scene) | 0.312 | n/a | 0.337 | 0.588 | 85% | $2.40 | $4.76 | 161s |
| Kling 3.0 (multi-scene) | 0.301 | n/a | 0.365 | 0.585 | 100% | $4.99 | $5.41 | 80s |
| Kling 3.0 (single-scene) | 0.296 | n/a | 0.302 | 0.593 | 100% | $2.00 | $5.98 | 81s |
| Veo 3.1 (multi-scene) | 0.287 | n/a | 0.362 | 0.549 | 100% | $12.10 | $19.11 | 85s |
| Seedance 2.0 (single-scene) | 0.285 | n/a | 0.349 | 0.545 | 100% | $1.20 | $2.99 | 124s |
| Veo 3.1 (single-scene) | 0.265 | n/a | 0.314 | 0.530 | 100% | $4.00 | $15.00 | 81s |
Until now, there has not been a public benchmark that grades AI video systems on the end-to-end task of producing a finished ad. Existing video benchmarks (VBench, Video Arena, EvalCrafter) measure per-frame fidelity on single-shot text-to-video output. None of them measure whether the output would function as an ad: hook strength, brand alignment, on-screen typography, hallucination cost, production polish. VideoAdAgent Bench fills that gap.
A stunning eight-to-twelve-second clip is not an ad. An ad has to hold a brand at the right angle, render the product name without garbling letters, build a narrative arc with a hook in the first three seconds, hit a fixed duration and aspect ratio, carry voiceover or music, and survive a production QA review without product hallucinations, anatomy glitches, or text artifacts that would force a re-roll. Today's text-to-video models do not attempt most of these problems. The agents that wrap them either do, badly, or skip them entirely. The point of this benchmark is to measure exactly that gap.
Each brief is structured the way our internal creative team writes one for production: product description, brand guidelines, narrative intent (problem-solution / before-after / demo / testimonial / lifestyle), platform constraints (duration, aspect ratio, required on-screen text strings), and the input assets the agent should ground itself in: logo, product reference image, persona reference. The agent receives the brief and returns a finished MP4. We grade the MP4 the way a creative director grades a campaign before approval: not against a checklist of features, but against whether the output is shippable.
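As a rough illustration of the brief structure described above, the fields could be modelled as a small set of typed records. This is a sketch only: the class names, field names, and example values are hypothetical and are not taken from the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class NarrativeIntent(Enum):
    """The five narrative intents named in the brief description."""
    PROBLEM_SOLUTION = "problem-solution"
    BEFORE_AFTER = "before-after"
    DEMO = "demo"
    TESTIMONIAL = "testimonial"
    LIFESTYLE = "lifestyle"


@dataclass(frozen=True)
class PlatformConstraints:
    duration_s: float                      # target length of the finished ad
    aspect_ratio: str                      # e.g. "9:16"
    required_text: tuple[str, ...] = ()    # strings that must appear on screen


@dataclass(frozen=True)
class Brief:
    # Field names are illustrative assumptions, not the published schema.
    product_description: str
    brand_guidelines: str
    narrative_intent: NarrativeIntent
    constraints: PlatformConstraints
    logo_path: str
    product_image_path: str
    persona_image_path: str


# A fictional brand, in line with the benchmark's use of fictional brands.
example_brief = Brief(
    product_description="Insulated steel water bottle, 750 ml",
    brand_guidelines="Calm tone, teal and white palette, logo bottom-right",
    narrative_intent=NarrativeIntent.DEMO,
    constraints=PlatformConstraints(30.0, "9:16", ("Stay cold all day", "Shop now")),
    logo_path="assets/logo.png",
    product_image_path="assets/product.png",
    persona_image_path="assets/persona.png",
)
```

Keeping the platform constraints in their own record makes it straightforward to hand the same values to the mechanical compliance checks later on.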
The brief set spans every major ad-production pattern plus targeted stress tests against the hardest axes of multi-scene production: anchor storytelling, persona consistency, brand-asset reuse, structured CTA typography, and persuasion-arc compliance. Every brief uses fictional brands and AI-generated product mocks to avoid real-customer IP. The full set is pre-registered (published to a tagged commit before a single output was generated), so no system in the benchmark could have been tuned to the specific briefs.
Strictly agent-vs-agent comparison. Both systems get the same brief and produce a finished ad; LLM judges (Opus 4.7 + GPT-5) see the brief and score 8 ad-rubric dimensions. Win-rate = % of pairs a system wins outright (consensus across both judges + both seat positions). Per-dimension scores = average score the judges gave each system across all pairs (un-swapped via position flag).
| System | Win-rate | Ad Effectiveness | Brand Alignment | Visual Quality | Motion Quality | Hook Strength | CTA Clarity | Narrative Arc | Overall Preference |
|---|---|---|---|---|---|---|---|---|---|
| Creatify Agent · Pro mode | 87% | 0.765 | 0.806 | 0.785 | 0.738 | 0.720 | 0.683 | 0.786 | 0.732 |
| Creatify Agent · Lite mode | 80% | 0.744 | 0.775 | 0.794 | 0.748 | 0.735 | 0.630 | 0.773 | 0.716 |
| Luma Agent | 34% | 0.683 | 0.713 | 0.771 | 0.719 | 0.698 | 0.664 | 0.690 | 0.580 |
| HeyGen V3 Agent | 0% | 0.495 | 0.485 | 0.709 | 0.646 | 0.602 | 0.482 | 0.485 | 0.358 |
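The consensus rule behind the win-rate column can be sketched as follows. We assume here that each pair yields four verdicts (two judges, each seeing both seat orders) and that a win counts only when all four agree after un-swapping. The `Verdict` record and the judge labels are hypothetical names for illustration, not the benchmark's code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Verdict:
    judge: str         # illustrative label, e.g. "opus" or "gpt5"
    swapped: bool      # True if the two systems were shown in reversed seat order
    seat_winner: str   # "first" or "second", as the judge saw the seats


def unswap(v: Verdict, system_a: str, system_b: str) -> str:
    """Map a seat-level verdict back to a system name using the position flag."""
    first, second = (system_b, system_a) if v.swapped else (system_a, system_b)
    return first if v.seat_winner == "first" else second


def consensus_winner(verdicts: Sequence[Verdict], system_a: str, system_b: str) -> Optional[str]:
    """Return the winner only if every judge and every seat order agrees."""
    winners = {unswap(v, system_a, system_b) for v in verdicts}
    return winners.pop() if len(winners) == 1 else None


def win_rate(system: str, outcomes: Sequence[Optional[str]]) -> float:
    """Share of pairs the system won outright; split verdicts count as non-wins."""
    return sum(o == system for o in outcomes) / len(outcomes)


verdicts = [
    Verdict("opus", swapped=False, seat_winner="first"),
    Verdict("opus", swapped=True, seat_winner="second"),
    Verdict("gpt5", swapped=False, seat_winner="first"),
    Verdict("gpt5", swapped=True, seat_winner="second"),
]
clean_win = consensus_winner(verdicts, "agent_a", "agent_b")
```

Requiring agreement across both seat orders filters out verdicts that depend on which video the judge happened to see first, which is why a split counts as a non-win for both systems in this sketch.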
Multi-judge pairwise where judges see the brief and score 8 ad-rubric dimensions: ad_effectiveness, brand_alignment, visual_quality, motion_quality, hook_strength, cta_clarity, narrative_arc, overall_preference. Per-dimension scores below are per-system averages across every pair the system appeared in (un-swapped via position flag). Competing agents (HeyGen V3, Luma) are evaluated separately in Arm 1; this arm contains only Creatify and the raw text-to-video foundation-model baselines.
Two baseline variants per foundation model. Single-scene mirrors how Veo 3.1, Kling 3.0, and Seedance 2.0 are actually used today: the brief becomes a single prompt, the model returns one 8–12s clip. Multi-scene is a fairer variant we built for the baselines: an LLM decomposes the brief into per-scene prompts (the same decomposition step Creatify Agent does internally), each scene is generated separately, and the clips are concatenated. The multi-scene variant gives the foundation models access to the same scene-planning capability the agentic systems have, so the remaining gap is everything beyond scene planning: typography, audio, brand grounding, hallucination control. Both variants appear in the table below; the single-scene rows surface what raw foundation-model output looks like in its natural usage pattern.
| System | Win-rate | Ad Effectiveness | Brand Alignment | Visual Quality | Motion Quality | Hook Strength | CTA Clarity | Narrative Arc | Overall Preference |
|---|---|---|---|---|---|---|---|---|---|
| Creatify Agent · Pro mode | 95% | 0.781 | 0.823 | 0.796 | 0.739 | 0.740 | 0.694 | 0.791 | 0.775 |
| Creatify Agent · Lite mode | 89% | 0.767 | 0.801 | 0.794 | 0.742 | 0.750 | 0.665 | 0.783 | 0.757 |
| Kling 3.0 (multi-scene) | 2% | 0.465 | 0.488 | 0.734 | 0.678 | 0.590 | 0.302 | 0.577 | 0.365 |
| Veo 3.1 (multi-scene) | 2% | 0.462 | 0.487 | 0.747 | 0.693 | 0.571 | 0.306 | 0.552 | 0.362 |
| Seedance 2.0 (single-scene) | 2% | 0.467 | 0.471 | 0.746 | 0.734 | 0.499 | 0.311 | 0.531 | 0.349 |
| Seedance 2.0 (multi-scene) | 4% | 0.439 | 0.458 | 0.756 | 0.697 | 0.596 | 0.295 | 0.480 | 0.337 |
| Veo 3.1 (single-scene) | 0% | 0.415 | 0.417 | 0.720 | 0.682 | 0.513 | 0.262 | 0.482 | 0.314 |
| Kling 3.0 (single-scene) | 0% | 0.417 | 0.426 | 0.669 | 0.648 | 0.470 | 0.297 | 0.466 | 0.302 |
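The multi-scene baseline described earlier (decompose the brief, generate each scene, concatenate) might be wired together roughly like this. The decomposition and generation steps are passed in as callables because the real model clients are not shown in this document. Using ffmpeg's concat demuxer for the final join is our assumption; the text only says the clips are concatenated.

```python
from typing import Callable, List


def run_multi_scene(
    brief: str,
    decompose: Callable[[str], List[str]],   # stands in for the LLM scene planner
    generate_clip: Callable[[str], str],     # stands in for a text-to-video call; returns a file path
) -> List[str]:
    """Generate one clip per scene prompt and return the clip paths in scene order."""
    scene_prompts = decompose(brief)
    return [generate_clip(prompt) for prompt in scene_prompts]


def concat_list_text(clip_paths: List[str]) -> str:
    """Build the list file consumed by ffmpeg's concat demuxer (one `file '...'` line per clip)."""
    return "".join(f"file '{path}'\n" for path in clip_paths)


def ffmpeg_concat_cmd(list_path: str, out_path: str) -> List[str]:
    """Command that joins the listed clips. Stream copy assumes all clips share codec settings."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", out_path]


# Stub callables so the sketch runs without any model access.
def fake_decompose(brief: str) -> List[str]:
    return [f"{brief} / scene {i}" for i in range(1, 4)]


def fake_generate(prompt: str) -> str:
    return f"clip_{prompt[-1]}.mp4"


clips = run_multi_scene("bottle demo", fake_decompose, fake_generate)
list_text = concat_list_text(clips)
cmd = ffmpeg_concat_cmd("clips.txt", "ad.mp4")
```

Nothing in this sketch adds typography, audio, or brand grounding, which matches the point made above: scene planning alone leaves those parts of the gap untouched.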
Arm 3 grades the finished video as a deliverable, separate from how well it answers the brief. Two complementary measurements: (1) a hallucination & visual-defect score from Claude Opus 4.7 vision on 15 frames per video, the dominant signal for whether output is shippable to a paying advertiser; and (2) deterministic mechanical-compliance checks on the rendered MP4 (duration, aspect, audio, on-screen typography), the kind of checks an ad-ops team would run before a campaign launch. These blend into the Arm 3 score with the hallucination component weighted 3× (it is the dominant defect signal). Per-frame visual-quality LLM-judge dims (aesthetic, imaging, dynamic motion) are reported on the radar chart above but excluded from the Arm 3 composite: they penalize multi-scene structure by construction (a single continuous 8-second shot scores higher than a 3-cut 30-second ad regardless of which is the better deliverable). Competing agents (HeyGen V3, Luma) are evaluated head-to-head against Creatify under Arm 1, and their Arm 3 production scores are also reported here for completeness.
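The exact blending formula is not spelled out in the text. One form that is consistent with the published numbers is a 3:1 weighted mean of an inverted hallucination score (1 − composite/10) and the production average. The sketch below assumes that form; it reproduces the reported Arm 3 scores to within rounding for the rows we checked, for example Pro mode's 1.90 and 0.990 giving 0.855.

```python
def arm3_score(
    hallucination_composite: float,   # 0-10 scale, higher = more defects
    production_avg: float,            # 0-1 scale, higher = better
    hallucination_weight: float = 3.0,
) -> float:
    """Weighted mean of defect-free-ness and mechanical compliance (assumed form)."""
    # Flip the defect score so that higher is better, on a 0-1 scale.
    hallucination_quality = 1.0 - hallucination_composite / 10.0
    total_weight = hallucination_weight + 1.0
    return (hallucination_weight * hallucination_quality + production_avg) / total_weight


pro = round(arm3_score(1.90, 0.990), 3)      # 0.855, matches the leaderboard
heygen = arm3_score(4.40, 0.755)             # about 0.609
```

Under this form a one-point drop in the hallucination composite moves the Arm 3 score by 0.075, while a 0.1 change in the production average moves it by 0.025.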
The deciding metric for whether a video is shippable. 15 evenly-spaced frames per video are sent to Claude Opus 4.7 with the brief's product description as reference. Six categories scored 0–10 (higher = more defects). The composite is the headline number; sub-categories surface the defect type behind that score: garbled product labels, hallucinated objects, broken text rendering, physics violations, anatomy errors. Pass-rate = share of briefs scoring < 3.0 (production-grade threshold).
| System | Composite ↓ | Pass-rate ↑ | Product fidelity ↓ | Text quality ↓ | Object hallucin. ↓ | Phys. violations ↓ | Anatomy ↓ |
|---|---|---|---|---|---|---|---|
| Creatify Agent · Pro mode | 1.90 | 90% | 1.75 | 2.00 | 1.20 | 0.90 | 0.70 |
| Creatify Agent · Lite mode | 2.85 | 50% | 2.70 | 2.80 | 2.15 | 1.35 | 0.95 |
| Luma Agent | 3.20 | 45% | 2.50 | 3.05 | 2.40 | 1.65 | 1.10 |
| Kling 3.0 (single-scene) | 3.60 | 35% | 3.15 | 2.95 | 3.00 | 2.30 | 1.65 |
| Seedance 2.0 (multi-scene) | 3.65 | 35% | 4.35 | 2.12 | 1.88 | 1.12 | 0.76 |
| Seedance 2.0 (single-scene) | 3.80 | 25% | 3.35 | 3.95 | 2.95 | 1.90 | 1.10 |
| Kling 3.0 (multi-scene) | 3.80 | 35% | 3.95 | 3.70 | 3.00 | 1.65 | 1.55 |
| Veo 3.1 (multi-scene) | 3.85 | 30% | 4.15 | 3.20 | 2.50 | 1.70 | 1.65 |
| Veo 3.1 (single-scene) | 4.00 | 20% | 3.70 | 3.40 | 3.30 | 2.35 | 1.60 |
| HeyGen V3 Agent | 4.40 | 20% | 4.65 | 4.00 | 2.55 | 1.45 | 1.05 |
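Two mechanical pieces of the hallucination arm, frame selection and the pass-rate threshold, are simple enough to sketch. The text says only that the 15 frames are evenly spaced; sampling the midpoint of each of 15 equal segments is our assumption, and the function names are illustrative.

```python
from typing import List, Sequence


def evenly_spaced_frames(total_frames: int, n: int = 15) -> List[int]:
    """Pick n frame indices, one from the middle of each equal-length segment."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    n = min(n, total_frames)  # very short clips cannot supply n distinct frames
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def pass_rate(composites: Sequence[float], threshold: float = 3.0) -> float:
    """Share of briefs whose hallucination composite is strictly below the threshold."""
    return sum(score < threshold for score in composites) / len(composites)


# A 30 s video at 30 fps has 900 frames; segments are 60 frames wide.
indices = evenly_spaced_frames(900)
rate = pass_rate([1.9, 2.85, 3.0, 4.4])   # 3.0 itself does not pass
```

Midpoint sampling avoids always grading the first and last frames, which are often fades or title cards. Endpoint-inclusive sampling would be an equally valid reading of "evenly-spaced".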
Four no-LLM measurements on the rendered MP4. Each cell is a 0–1 quality score averaged across every brief in the set (1.0 = perfect match on every brief). Duration accuracy: 1.0 when actual length is within ±0.5s of the brief target, linear penalty beyond · Aspect ratio: actual w/h within 5% of brief aspect ratio · Audio cohesion: video contains an audio stream · Text-rendering accuracy: average per-brief fraction of required on-screen strings detected (Tesseract OCR + Claude vision check), the typography signal that separates a deliverable ad from a stylish silent clip.
| System | Duration accuracy | Aspect | Audio | Text rendering | Production avg |
|---|---|---|---|---|---|
| Creatify Agent · Pro mode | 1.000 | 1.000 | 1.000 | 0.958 | 0.990 |
| Creatify Agent · Lite mode | 0.969 | 1.000 | 1.000 | 0.892 | 0.965 |
| Luma Agent | 0.880 | 1.000 | 0.950 | 0.858 | 0.922 |
| HeyGen V3 Agent | 0.130 | 1.000 | 1.000 | 0.892 | 0.755 |
| Kling 3.0 (multi-scene) | 0.715 | 0.050 | 1.000 | 0.150 | 0.479 |
| Kling 3.0 (single-scene) | 0.000 | 0.500 | 1.000 | 0.300 | 0.450 |
| Seedance 2.0 (multi-scene) | 0.588 | 1.000 | 0.000 | 0.206 | 0.449 |
| Veo 3.1 (multi-scene) | 0.237 | 1.000 | 0.000 | 0.175 | 0.353 |
| Veo 3.1 (single-scene) | 0.000 | 1.000 | 0.000 | 0.275 | 0.319 |
| Seedance 2.0 (single-scene) | 0.000 | 1.000 | 0.000 | 0.275 | 0.319 |
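The per-brief checks defined above can be sketched as small scoring functions. The ±0.5s tolerance and the 5% aspect window come from the text. The slope of the linear duration penalty is not stated, so `falloff_s` is a hypothetical parameter, and the case-insensitive substring match stands in for the OCR plus vision check. The audio check is omitted because it requires probing the file.

```python
from typing import Sequence


def duration_accuracy(actual_s: float, target_s: float,
                      tolerance_s: float = 0.5, falloff_s: float = 5.0) -> float:
    """1.0 inside the tolerance, then a linear drop to 0.0 over falloff_s seconds (assumed slope)."""
    error = abs(actual_s - target_s)
    if error <= tolerance_s:
        return 1.0
    return max(0.0, 1.0 - (error - tolerance_s) / falloff_s)


def aspect_score(width: int, height: int, target_w: int, target_h: int,
                 rel_tol: float = 0.05) -> float:
    """1.0 if the rendered w/h ratio is within 5% of the brief's ratio, else 0.0."""
    actual = width / height
    target = target_w / target_h
    return 1.0 if abs(actual - target) / target <= rel_tol else 0.0


def text_rendering_accuracy(required: Sequence[str], detected_text: str) -> float:
    """Fraction of required strings found in the text recovered from the video."""
    if not required:
        return 1.0
    haystack = detected_text.casefold()
    return sum(s.casefold() in haystack for s in required) / len(required)


def production_avg(duration: float, aspect: float, audio: float, text: float) -> float:
    """Plain mean of the four checks; this matches the table's Production avg column."""
    return (duration + aspect + audio + text) / 4.0


on_time = duration_accuracy(30.3, 30.0)       # inside the tolerance
three_over = duration_accuracy(33.0, 30.0)    # 2.5 s past the tolerance
found = text_rendering_accuracy(["Shop now", "20% off"], "SHOP NOW today")
```

The plain mean in `production_avg` is inferred from the table rather than stated in the text: for example, (1.000 + 1.000 + 1.000 + 0.958) / 4 rounds to the 0.990 reported for Pro mode.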
overall_preference + 0.5 × Arm 3 production-quality score.
overall_preference the judges gave each system across all pairs it appeared in.
overall_preference, the headline of the eight.
ffprobe.
ffprobe.
v2 will scale the benchmark along two axes. First, more briefs covering vertical-specific ad patterns: e-commerce, app install, B2B SaaS, healthcare, education. Second, a closed-loop track that measures the value of multi-turn revision, the production experience single-shot bench numbers can't capture. We are also folding campaign-level signals (CTR, ROAS) into a domain-specific AdReward to drive DPO-style alignment of the underlying I2V models, replacing the static rubric with a learned, ad-effectiveness-calibrated judge.
Creatify Agent and this benchmark stand on the shoulders of a community of researchers working on multi-agent video, video-quality evaluation, ad-effectiveness rubrics, and closed-loop quality assurance. Notable threads we drew on:
For the full architecture of the Creatify Agent system, the design choices behind the closed-loop QA gate, and the production examples behind these numbers, read the companion technical report: "Creatify Agent: a closed-loop agent built for video advertising". The benchmark itself (briefs, baseline runners, scorers, and aggregation) is open-source under Apache 2.0 at github.com/creatify-ai/VABench.