VideoAdAgent Bench v1

📄 Technical report creatify-ai/VABench

VideoAdAgent Bench is an open evaluation for AI agents that produce video ads end-to-end. Today's strongest video models — Veo 3.1, Kling 3.0, Seedance 2.0 — generate stunning short clips, but a clip is not an ad. The right test is whether a creative director would ship the output: brand-consistent product framing, narrative arc, persona casting, audio cohesion, on-screen typography. We built this benchmark to measure that gap directly, graded by the same rubric our team uses to ship ads in production.

Across the head-to-head ad-quality pairings, Creatify Agent · Pro mode wins 95% of pairs vs raw foundation-model baselines (both single-scene and multi-scene) and 87% vs competing agents. It scores 1.90 on the production hallucination rubric, where lower is better (vs Lite 2.85, Luma 3.20, raw foundation-model multi-scene 3.65–3.85, HeyGen 4.40), and posts the highest production-polish average (0.990) in the field. The two Creatify tiers rank #1 and #2 on the composite leaderboard.

Creatify Agent + every shipping competitor · 3 evaluation arms (Agent vs Agent, Ad Quality, Production Quality) · multi-judge pairwise (Claude Opus 4.7 + GPT-5, position-swap stable consensus) · 15-frame hallucination & visual-defect scoring (Claude Opus 4.7 vision) · deterministic structural compliance (duration, aspect, audio, on-screen text via OCR).

Top composite: 0.901 (Creatify Agent · Pro mode · blend of all three evaluation arms)
Arm 1, Agent vs Agent: 87% (Pro mode head-to-head win-rate vs HeyGen V3 and Luma)
Arm 2, Ad Quality: 95% (Pro mode head-to-head win-rate vs multi-scene foundation-model baselines)
Arm 3, Production Quality: 0.855 (Pro mode production score: structural compliance + hallucination score)

[Figure: Quality Radar, all dimensions normalized 0–1]

[Figure: Cost vs Quality]

Series shown: Creatify Agent · Lite mode, Creatify Agent · Pro mode, HeyGen V3 Agent, Luma Agent, Veo 3.1 (single-scene), Kling 3.0 (single-scene), Seedance 2.0 (single-scene), Veo 3.1 (multi-scene), Kling 3.0 (multi-scene), Seedance 2.0 (multi-scene)

Headline — all arms as 0–1 scores

System	Composite	Arm 1 score	Arm 2 score	Arm 3 score	Success	Cost / video	Cost / 30s	p50 latency
Creatify Agent · Pro mode	0.901	0.732	0.775	0.855	100%	$6.00	$5.99	1260s
Creatify Agent · Lite mode	0.836	0.716	0.757	0.778	100%	$3.00	$2.97	556s
Luma Agent	0.585	0.580	0.603	0.741	100%	$8.24	$10.95	4200s
HeyGen V3 Agent	0.327	0.358	0.378	0.609	100%	$0.87	$1.18	681s
Seedance 2.0 (multi-scene)	0.312	—	0.337	0.588	85%	$2.40	$4.76	161s
Kling 3.0 (multi-scene)	0.301	—	0.365	0.585	100%	$4.99	$5.41	80s
Kling 3.0 (single-scene)	0.296	—	0.302	0.593	100%	$2.00	$5.98	81s
Veo 3.1 (multi-scene)	0.287	—	0.362	0.549	100%	$12.10	$19.11	85s
Seedance 2.0 (single-scene)	0.285	—	0.349	0.545	100%	$1.20	$2.99	124s
Veo 3.1 (single-scene)	0.265	—	0.314	0.530	100%	$4.00	$15.00	81s

Why we built this benchmark

Until now, there has not been a public benchmark that grades AI video systems on the end-to-end task of producing a finished ad. Existing video benchmarks — VBench, Video Arena, EvalCrafter — measure per-frame fidelity on single-shot text-to-video output. None of them measure whether the output would function as an ad: hook strength, brand alignment, on-screen typography, hallucination cost, production polish. VideoAdAgent Bench fills that gap.

A stunning eight-to-twelve-second clip is not an ad. An ad has to hold a brand at the right angle, render the product name without garbling letters, build a narrative arc with a hook in the first three seconds, hit a fixed duration and aspect ratio, carry voiceover or music, and survive a production QA review without product hallucinations, anatomy glitches, or text artifacts that would force a re-roll. Today's text-to-video models do not attempt most of these problems. The agents that wrap them either do, badly, or skip them entirely. The point of this benchmark is to measure exactly that gap.

Mirroring real ad production

Each brief is structured the way our internal creative team writes one for production: product description, brand guidelines, narrative intent (problem-solution / before-after / demo / testimonial / lifestyle), platform constraints (duration, aspect ratio, required on-screen text strings), and the input assets the agent should ground itself in — logo, product reference image, persona reference. The agent receives the brief and returns a finished MP4. We grade the MP4 the way a creative director grades a campaign before approval: not against a checklist of features, but against whether the output is shippable.
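The brief fields listed above can be pictured as a small typed record. The sketch below is only an illustration: the class names, field names, and the example values are assumptions made for this example, not the schema used in the VABench repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class NarrativeIntent(Enum):
    # The five narrative intents named in the text.
    PROBLEM_SOLUTION = "problem-solution"
    BEFORE_AFTER = "before-after"
    DEMO = "demo"
    TESTIMONIAL = "testimonial"
    LIFESTYLE = "lifestyle"


@dataclass(frozen=True)
class PlatformConstraints:
    duration_s: float                    # target runtime in seconds
    aspect_ratio: str                    # "W:H", for example "9:16"
    required_text: Tuple[str, ...] = ()  # strings that must appear on screen

    def aspect_value(self) -> float:
        """Return width / height as a float."""
        width, height = self.aspect_ratio.split(":")
        return float(width) / float(height)


@dataclass(frozen=True)
class Brief:
    brief_id: str
    product_description: str
    brand_guidelines: str
    narrative_intent: NarrativeIntent
    constraints: PlatformConstraints
    # Input assets the agent should ground itself in (hypothetical paths).
    logo_path: Optional[str] = None
    product_image_path: Optional[str] = None
    persona_image_path: Optional[str] = None


# A made-up brief for a fictional brand, used only to show the shape.
example_brief = Brief(
    brief_id="demo-001",
    product_description="Fictional sparkling water in a teal can",
    brand_guidelines="Bright, playful, teal and white palette",
    narrative_intent=NarrativeIntent.LIFESTYLE,
    constraints=PlatformConstraints(30.0, "9:16", ("Try it today",)),
    logo_path="assets/logo.png",
)
```

Keeping the platform constraints in their own record makes it easy to hand the same values to deterministic checks later (duration, aspect ratio, required strings).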

The brief set spans every major ad-production pattern plus targeted stress tests against the hardest axes of multi-scene production: anchor storytelling, persona consistency, brand-asset reuse, structured CTA typography, and persuasion-arc compliance. Every brief uses fictional brands and AI-generated product mocks to avoid real-customer IP. The full set is pre-registered — published to a tagged commit before a single output was generated — so no system in the benchmark could have been tuned to the specific briefs.

Arm 1 — Agent vs Agent (Creatify vs HeyGen V3, Luma)

A strict agent-vs-agent comparison. Both systems get the same brief and produce a finished ad; LLM judges (Opus 4.7 + GPT-5) see the brief and score 8 ad-rubric dimensions. Win-rate = % of its pairs that a system wins outright (consensus across both judges + both seat positions). Per-dimension scores = the average score the judges gave each system across all pairs (un-swapped via position flag).

System	Win-rate	Ad Effectiveness	Brand Alignment	Visual Quality	Motion Quality	Hook Strength	CTA Clarity	Narrative Arc	Overall Preference
Creatify Agent · Pro mode	87%	0.765	0.806	0.785	0.738	0.720	0.683	0.786	0.732
Creatify Agent · Lite mode	80%	0.744	0.775	0.794	0.748	0.735	0.630	0.773	0.716
Luma Agent	34%	0.683	0.713	0.771	0.719	0.698	0.664	0.690	0.580
HeyGen V3 Agent	0%	0.495	0.485	0.709	0.646	0.602	0.482	0.485	0.358

Arm 2 — Ad Quality (Creatify vs raw T2V baselines)

Multi-judge pairwise where judges see the brief and score 8 ad-rubric dimensions: ad_effectiveness, brand_alignment, visual_quality, motion_quality, hook_strength, cta_clarity, narrative_arc, overall_preference. Per-dimension scores below are per-system averages across every pair the system appeared in (un-swapped via position flag). Competing agents (HeyGen V3, Luma) are evaluated separately in Arm 1; this arm contains only Creatify and the raw text-to-video foundation-model baselines.

Two baseline variants per foundation model. Single-scene mirrors how Veo 3.1, Kling 3.0, and Seedance 2.0 are actually used today: the brief becomes a single prompt, the model returns one 8–12s clip. Multi-scene is a fairer variant we built for the baselines: an LLM decomposes the brief into per-scene prompts (the same decomposition step Creatify Agent does internally), each scene is generated separately, and the clips are concatenated. The multi-scene variant gives the foundation models access to the same scene-planning capability the agentic systems have, so the remaining gap is everything beyond scene planning — typography, audio, brand grounding, hallucination control. Both variants appear in the table below; the single-scene rows surface what raw foundation-model output looks like in its natural usage pattern.
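The multi-scene baseline described above (decompose, generate per scene, concatenate) can be sketched as follows. This is a hedged illustration, not the benchmark's runner: `decompose` and `generate_clip` are hypothetical callables standing in for the LLM planner and the text-to-video API, and joining clips with ffmpeg's concat demuxer is an assumption about how the concatenation could be done.

```python
from typing import Callable, List, Sequence


def generate_scenes(
    brief_text: str,
    decompose: Callable[[str], Sequence[str]],
    generate_clip: Callable[[str], str],
) -> List[str]:
    """Plan scenes from the brief, then render one clip per scene prompt."""
    scene_prompts = decompose(brief_text)
    return [generate_clip(prompt) for prompt in scene_prompts]


def concat_list_text(clip_paths: Sequence[str]) -> str:
    """Build the input file for ffmpeg's concat demuxer, one clip per line."""
    # Assumes paths contain no single quotes, so no escaping is applied.
    return "".join(f"file '{path}'\n" for path in clip_paths)


def concat_command(list_path: str, output_path: str) -> List[str]:
    """ffmpeg invocation that joins the listed clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]


# Stand-in planner and generator so the sketch runs without any model API.
def fake_decompose(brief_text: str) -> List[str]:
    return [f"{brief_text} / scene {i}" for i in (1, 2, 3)]


def fake_generate(prompt: str) -> str:
    return f"clip_{prompt[-1]}.mp4"


clips = generate_scenes("fictional brief", fake_decompose, fake_generate)
list_text = concat_list_text(clips)
command = concat_command("scenes.txt", "ad.mp4")
```

Stream copy (`-c copy`) only works when every clip shares the same codec parameters, which is plausible when all scenes come from the same model; otherwise a re-encode would be needed.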

System	Win-rate	Ad Effectiveness	Brand Alignment	Visual Quality	Motion Quality	Hook Strength	CTA Clarity	Narrative Arc	Overall Preference
Creatify Agent · Pro mode	95%	0.781	0.823	0.796	0.739	0.740	0.694	0.791	0.775
Creatify Agent · Lite mode	89%	0.767	0.801	0.794	0.742	0.750	0.665	0.783	0.757
Kling 3.0 (multi-scene)	2%	0.465	0.488	0.734	0.678	0.590	0.302	0.577	0.365
Veo 3.1 (multi-scene)	2%	0.462	0.487	0.747	0.693	0.571	0.306	0.552	0.362
Seedance 2.0 (single-scene)	2%	0.467	0.471	0.746	0.734	0.499	0.311	0.531	0.349
Seedance 2.0 (multi-scene)	4%	0.439	0.458	0.756	0.697	0.596	0.295	0.480	0.337
Veo 3.1 (single-scene)	0%	0.415	0.417	0.720	0.682	0.513	0.262	0.482	0.314
Kling 3.0 (single-scene)	0%	0.417	0.426	0.669	0.648	0.470	0.297	0.466	0.302

Arm 3 — Video Production Quality (Creatify vs raw T2V baselines)

Arm 3 grades the finished video as a deliverable, separate from how well it answers the brief. Two complementary measurements: (1) a hallucination & visual-defect score from Claude Opus 4.7 vision on 15 frames per video — the dominant signal for whether output is shippable to a paying advertiser; and (2) deterministic mechanical-compliance checks on the rendered MP4 — duration, aspect, audio, on-screen typography — the kind of checks an ad-ops team would run before a campaign launch. These blend into the Arm 3 score with the hallucination component weighted 3× (it is the dominant defect signal). Per-frame visual-quality LLM-judge dims (aesthetic, imaging, dynamic motion) are reported on the radar chart above but excluded from the Arm 3 composite — they penalize multi-scene structure by construction (a single continuous 8-second shot scores higher than a 3-cut 30-second ad regardless of which is the better deliverable). Competing agents (HeyGen V3, Luma) are evaluated head-to-head against Creatify under Arm 1, and their Arm 3 production scores are also reported here for completeness.

Hallucination & Visual-Defect Score — Claude Opus 4.7 vision on 15 frames per video (0–10, lower is better)

The deciding metric for whether a video is shippable. 15 evenly-spaced frames per video are sent to Claude Opus 4.7 with the brief's product description as reference. Six categories scored 0–10 (higher = more defects). The composite is the headline number; sub-categories surface the defect type behind that score — garbled product labels, hallucinated objects, broken text rendering, physics violations, anatomy errors. Pass-rate = share of briefs scoring < 3.0 (production-grade threshold).

System	Composite ↓	Pass-rate ↑	Product fidelity ↓	Text quality ↓	Object hallucin. ↓	Phys. violations ↓	Anatomy ↓
Creatify Agent · Pro mode	1.90	90%	1.75	2.00	1.20	0.90	0.70
Creatify Agent · Lite mode	2.85	50%	2.70	2.80	2.15	1.35	0.95
Luma Agent	3.20	45%	2.50	3.05	2.40	1.65	1.10
Kling 3.0 (single-scene)	3.60	35%	3.15	2.95	3.00	2.30	1.65
Seedance 2.0 (multi-scene)	3.65	35%	4.35	2.12	1.88	1.12	0.76
Seedance 2.0 (single-scene)	3.80	25%	3.35	3.95	2.95	1.90	1.10
Kling 3.0 (multi-scene)	3.80	35%	3.95	3.70	3.00	1.65	1.55
Veo 3.1 (multi-scene)	3.85	30%	4.15	3.20	2.50	1.70	1.65
Veo 3.1 (single-scene)	4.00	20%	3.70	3.40	3.30	2.35	1.60
HeyGen V3 Agent	4.40	20%	4.65	4.00	2.55	1.45	1.05
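Two mechanical pieces of the hallucination scoring described above are the choice of 15 evenly spaced frames and the pass-rate threshold. The sketch below shows one way to compute both. Sampling at the midpoint of each of 15 equal segments is an assumption (the text only says "evenly-spaced"), and the function names are illustrative.

```python
from typing import List, Sequence


def evenly_spaced_indices(total_frames: int, samples: int = 15) -> List[int]:
    """Pick frame indices at the midpoints of `samples` equal segments."""
    if total_frames <= 0 or samples <= 0:
        return []
    if total_frames <= samples:
        # Short video: use every frame rather than repeating indices.
        return list(range(total_frames))
    return [int((i + 0.5) * total_frames / samples) for i in range(samples)]


def pass_rate(composites: Sequence[float], threshold: float = 3.0) -> float:
    """Share of briefs whose hallucination composite is strictly below threshold."""
    if not composites:
        return 0.0
    return sum(score < threshold for score in composites) / len(composites)


indices = evenly_spaced_indices(150)
rate = pass_rate([1.9, 2.5, 3.0, 4.4])  # 3.0 itself does not pass
```

Midpoint sampling avoids the very first and last frames, which are often fades or title cards; sampling from the endpoints would be an equally valid reading of "evenly spaced".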

Production Polish — deterministic structural quality across the brief set (0–1)

Four no-LLM measurements on the rendered MP4. Each cell is a 0–1 quality score averaged across every brief in the set (1.0 = perfect match on every brief). Duration accuracy: 1.0 when actual length is within ±0.5s of the brief target, linear penalty beyond · Aspect ratio: actual w/h within 5% of brief aspect ratio · Audio cohesion: video contains an audio stream · Text-rendering accuracy: average per-brief fraction of required on-screen strings detected (Tesseract OCR + Claude vision check) — the typography signal that separates a deliverable ad from a stylish silent clip.

System	Duration accuracy	Aspect	Audio	Text rendering	Production avg
Creatify Agent · Pro mode	1.000	1.000	1.000	0.958	0.990
Creatify Agent · Lite mode	0.969	1.000	1.000	0.892	0.965
Luma Agent	0.880	1.000	0.950	0.858	0.922
HeyGen V3 Agent	0.130	1.000	1.000	0.892	0.755
Kling 3.0 (multi-scene)	0.715	0.050	1.000	0.150	0.479
Kling 3.0 (single-scene)	0.000	0.500	1.000	0.300	0.450
Seedance 2.0 (multi-scene)	0.588	1.000	0.000	0.206	0.449
Veo 3.1 (multi-scene)	0.237	1.000	0.000	0.175	0.353
Veo 3.1 (single-scene)	0.000	1.000	0.000	0.275	0.319
Seedance 2.0 (single-scene)	0.000	1.000	0.000	0.275	0.319

Methodology — how to read every metric

Composite (0–1, higher is better) — A single number combining Arm 2 (ad quality) and Arm 3 (video production quality) to put every system on one ranked list.
Creatify + raw T2V systems: Composite = 0.5 × Arm 2 win-rate (as a 0–1 fraction) + 0.5 × Arm 3 production-quality score.
HeyGen V3 Agent (no Arm 2/3 data, never paired against a T2V baseline): Composite = 0.5 × Arm 1 score + 0.5 × production polish.

Arm 1 — Capability (Win-rate %). Share of pairs Creatify wins outright against an agentic competitor (HeyGen V3, Luma). Two judges (Claude Opus 4.7 + GPT-5) see the brief and both videos side-by-side, position-swapped; only consensus verdicts count.
Arm 1 score (0–1) is the average overall_preference the judges gave each system across all pairs it appeared in.
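The consensus rule above (two judges, both seat orders, only unanimous verdicts count) can be sketched as below. This is a minimal illustration under stated assumptions: the `Verdict` record and its field names are hypothetical, ties are treated as "no winner", and a pair without consensus stays in the denominator as a non-win, which the text does not specify.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Verdict(NamedTuple):
    judge: str        # illustrative label, e.g. "judge_1"
    swapped: bool     # True when system B was shown in the first seat
    winner_seat: str  # "first", "second", or "tie"


Pair = Tuple[str, str, Sequence[Verdict]]  # (system_a, system_b, verdicts)


def unswap(verdict: Verdict, system_a: str, system_b: str) -> Optional[str]:
    """Map a seat-based verdict back to a system name using the position flag."""
    if verdict.winner_seat == "tie":
        return None
    first, second = (system_b, system_a) if verdict.swapped else (system_a, system_b)
    return first if verdict.winner_seat == "first" else second


def consensus_winner(pair: Pair) -> Optional[str]:
    """Return the winner only if every verdict names the same system."""
    system_a, system_b, verdicts = pair
    winners = {unswap(v, system_a, system_b) for v in verdicts}
    return next(iter(winners)) if len(winners) == 1 else None


def win_rate(pairs: Sequence[Pair], system: str) -> float:
    """Share of the system's pairs that it wins by full consensus."""
    relevant = [p for p in pairs if system in (p[0], p[1])]
    if not relevant:
        return 0.0
    wins = sum(consensus_winner(p) == system for p in relevant)
    return wins / len(relevant)


unanimous = [Verdict("judge_1", False, "first"), Verdict("judge_1", True, "second"),
             Verdict("judge_2", False, "first"), Verdict("judge_2", True, "second")]
# judge_1 picks the first seat in both orders: a position-biased, split verdict.
split = [Verdict("judge_1", False, "first"), Verdict("judge_1", True, "first"),
         Verdict("judge_2", False, "first"), Verdict("judge_2", True, "second")]
pairs = [("X", "Y", unanimous), ("X", "Y", split)]
```

A judge that always prefers the first seat produces contradictory un-swapped verdicts, so the pair yields no consensus winner; that is the position-bias control the swap is meant to provide.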

Arm 2 — Ad Quality (Win-rate %). Same pairwise protocol, paired against Veo 3.1, Kling 3.0, and Seedance 2.0 multi-scene baselines. Judges score eight ad-rubric dimensions; the 0–1 Arm 2 score reported here is overall_preference, the headline of the eight.

The 8 ad-rubric dimensions (Arms 1 + 2). Each scored 0–1 by the two judges per pair.

ad_effectiveness — does the video accomplish the marketing intent (sell the product, drive the action)?
brand_alignment — does it match the brief's brand, persona, and visual tone?
visual_quality — clarity, composition, polish, and absence of visible artifacts.
motion_quality — smooth motion, plausible camera moves, no jitter or glitches.
hook_strength — does the first 3 seconds capture attention and signal the value prop?
cta_clarity — is the call-to-action visible, legible, and well-timed?
narrative_arc — setup → payoff over the runtime, not a single shot.
overall_preference — which video would you pick to ship, all things considered.

Arm 3 — Video Production Quality (0–1). Weighted blend of two components, with hallucination weighted 3× because it dominates "shippable vs not":
(a) Hallucination-free score = 1 − (hallucination composite / 10). The hallucination composite is a 0–10 score from Claude Opus 4.7 vision on 15 frames per video across six categories — product fidelity, object hallucination, text rendering, physics violations, anatomy, and overall quality.
(b) Production polish — four no-LLM measurements on the rendered MP4 (see Production Polish definitions below).
Per-frame visual-quality LLM-judge dims (aesthetic, imaging, dynamic motion) are reported on the radar chart but excluded from the Arm 3 composite: they penalize multi-scene structure by construction.
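As a worked example of the blend above, the sketch below treats the 3× weighting as a weighted mean, (3 × hallucination-free + polish) / 4. The normalization by 4 is an assumption, since the text gives the weight but not the divisor; with the published Pro-mode inputs (hallucination composite 1.90, production polish 0.990) it yields about 0.855, which agrees with the reported Arm 3 score.

```python
def hallucination_free(composite_0_to_10: float) -> float:
    """Convert the 0-10 defect composite into a 0-1 score (higher is better)."""
    return 1.0 - composite_0_to_10 / 10.0


def arm3_score(
    hallucination_composite: float,
    production_polish: float,
    hallucination_weight: float = 3.0,
) -> float:
    """Weighted mean of the hallucination-free score and production polish."""
    free = hallucination_free(hallucination_composite)
    total_weight = hallucination_weight + 1.0  # polish carries weight 1
    return (hallucination_weight * free + production_polish) / total_weight


pro = arm3_score(1.90, 0.990)     # about 0.855
heygen = arm3_score(4.40, 0.755)  # about 0.609
```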

Production Polish components. Each averaged across every brief in the set.

Duration accuracy (0–1) — 1.0 when actual duration is within ±0.5s of the brief target; 50% linear penalty per second beyond tolerance, clipped at 0. Measured via ffprobe.
Aspect (0–1) — share of briefs where actual w/h matches the brief aspect ratio within 5%. Measured via ffprobe.
Audio (0–1) — share of briefs where the rendered MP4 contains at least one audio stream.
Text rendering (0–1) — per-brief fraction of required on-screen strings detected (Tesseract OCR + Claude vision check), averaged across the brief set.
Production avg — mean of the four scores above.
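The four polish components above reduce to a few lines of arithmetic once duration, width, height, and detected text have been measured. The sketch below assumes those measurements already exist (for example from ffprobe and OCR) and uses illustrative function names; case-insensitive substring matching for required strings, and scoring a brief with no required strings as 1.0, are assumptions rather than documented behavior.

```python
from typing import Sequence


def duration_score(actual_s: float, target_s: float,
                   tolerance_s: float = 0.5, penalty_per_s: float = 0.5) -> float:
    """1.0 inside the tolerance, then lose 0.5 per extra second, floored at 0."""
    excess = max(0.0, abs(actual_s - target_s) - tolerance_s)
    return max(0.0, 1.0 - penalty_per_s * excess)


def aspect_ok(width: int, height: int, target_ratio: float,
              rel_tolerance: float = 0.05) -> bool:
    """True when width / height is within 5% of the target ratio."""
    return abs(width / height - target_ratio) <= rel_tolerance * target_ratio


def text_score(required: Sequence[str], detected_text: str) -> float:
    """Fraction of required strings found in the detected on-screen text."""
    if not required:
        return 1.0  # assumption: nothing required means nothing missed
    haystack = " ".join(detected_text.lower().split())
    found = sum(" ".join(s.lower().split()) in haystack for s in required)
    return found / len(required)


def production_avg(duration: float, aspect: float, audio: float, text: float) -> float:
    """Mean of the four polish scores."""
    return (duration + aspect + audio + text) / 4.0


half_credit = duration_score(31.5, 30.0)          # 1.0 s past tolerance
too_short = duration_score(8.0, 30.0)             # clipped at 0
pro_avg = production_avg(1.0, 1.0, 1.0, 0.958)    # about 0.990
```

The last line uses the Pro-mode row of the Production Polish table as input and lands on the reported production average after rounding.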

Hallucination categories (each scored 0–10, higher = more defects, lower is better).

product_fidelity — does the product on screen actually look like the briefed product (right shape, color, label, packaging)?
object_hallucination — extra objects appearing, morphing, or disappearing between frames.
text_quality — on-screen typography rendered legibly (no garbled glyphs, no half-letters).
physical_violations — gravity, occlusion, scale, and physics consistency.
human_anatomy — fingers, faces, limbs, eyes (the classic generative-video failure mode).
overall_quality — judge's holistic verdict.

Pass-rate = share of briefs scoring < 3.0 composite (production-grade threshold).

Cost / video (USD). Average across the brief set. All per-second rates are billed against output video duration, not agent runtime.
Raw T2V baselines: published $/output-second × output duration (Veo 3.1 $0.50/s, Kling 3.0 $0.20/s, Seedance 2.0 $0.10/s).
Multi-scene baselines: sum of per-scene single-shot costs (agent typically plans 3–5 scenes per brief).
Creatify Agent — Lite $3 / Pro $6: typical decomposition per run — image gens + Seedance I2V calls + persona gen + music + agent LLM tokens. Pro mode roughly doubles the LLM/I2V cost via the per-scene QA review gate that catches and regenerates broken scenes.
Cost / 30s = cost normalized to a 30-second deliverable (cost ÷ actual output duration × 30) — fair comparison when T2V baselines cap at 8–12s but real ads need 15–30s.
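The cost arithmetic above is simple enough to check by hand. The sketch below uses the per-second rates quoted in the text; the function names and the scene durations in the multi-scene example are illustrative assumptions.

```python
from typing import Sequence

# Published per-output-second rates quoted in the text (USD).
RATE_PER_SECOND = {"Veo 3.1": 0.50, "Kling 3.0": 0.20, "Seedance 2.0": 0.10}


def single_scene_cost(model: str, duration_s: float) -> float:
    """Cost of one clip: rate times output duration."""
    return RATE_PER_SECOND[model] * duration_s


def multi_scene_cost(model: str, scene_durations_s: Sequence[float]) -> float:
    """Sum of per-scene single-shot costs."""
    return sum(single_scene_cost(model, d) for d in scene_durations_s)


def cost_per_30s(cost_usd: float, duration_s: float) -> float:
    """Normalize a video's cost to a 30-second deliverable."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return cost_usd / duration_s * 30.0


veo_clip = single_scene_cost("Veo 3.1", 8.0)   # 8 s at $0.50/s
veo_30s = cost_per_30s(veo_clip, 8.0)
kling_multi = multi_scene_cost("Kling 3.0", [5.0, 5.0, 10.0])  # made-up scene plan
```

An 8-second Veo 3.1 clip costs $4.00 and normalizes to $15.00 per 30 seconds, which matches the single-scene Veo row in the headline table.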

Success rate. Share of briefs where the run produced a downloadable MP4. Failure modes counted: API timeouts, content-policy rejections, downstream errors.

Latency p50. Median wall-clock seconds from API call to MP4 ready. For agentic systems (Creatify, HeyGen, Luma) this is the full agent run; for raw T2V, queue + render time.
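Success rate and p50 latency as defined above can be computed from a per-brief run log. In this sketch the `Run` record is hypothetical, and taking the median over successful runs only is an assumption, since the text does not say how failed runs enter the latency figure.

```python
from statistics import median
from typing import NamedTuple, Optional, Sequence


class Run(NamedTuple):
    brief_id: str
    latency_s: Optional[float]  # None when no downloadable MP4 was produced


def success_rate(runs: Sequence[Run]) -> float:
    """Share of briefs that produced an MP4."""
    if not runs:
        return 0.0
    return sum(r.latency_s is not None for r in runs) / len(runs)


def latency_p50(runs: Sequence[Run]) -> Optional[float]:
    """Median wall-clock seconds over successful runs, or None if there are none."""
    latencies = [r.latency_s for r in runs if r.latency_s is not None]
    return median(latencies) if latencies else None


# Made-up log: three successes and one failure (for example a timeout).
runs = [Run("b1", 120.0), Run("b2", 160.0), Run("b3", None), Run("b4", 200.0)]
rate = success_rate(runs)
p50 = latency_p50(runs)
```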

Conclusion

1. Creatify wins decisively against every competing agent. Arm 1 head-to-head: Pro mode 87% win-rate and Lite mode 80% against the field (HeyGen V3, Luma). The strongest competitor (Luma) wins only 34% of its agent-vs-agent pairs; HeyGen wins 0%. Creatify's Pro and Lite modes take the top two slots in every arm.

2. Creatify dominates raw T2V baselines on ad quality. Arm 2 (vs Veo 3.1 / Kling 3.0 / Seedance 2.0, single-scene and multi-scene variants): Pro mode 95% win-rate, Lite mode 89%. The best raw foundation-model baseline lands at 4% arm-2 win-rate. Brand-aware persona casting, on-screen typography, and audio cohesion are what separate a 12-second silent clip from a finished branded ad.

3. Creatify produces the cleanest video deliverable. Production-polish average: Pro mode 0.990 · Lite mode 0.965 · best non-Creatify system 0.922 (Luma) · HeyGen 0.755. Hallucination score (lower is better): Pro mode 1.90 · Lite mode 2.85 · Luma 3.20 · raw T2V multi-scene 3.65–3.85 · HeyGen 4.40.

4. Pro mode beats Lite mode on the head-to-head arms. Pro's per-scene QA-review gate adds +7 points of Arm 1 win-rate (87% vs 80%, against other agents), +6 points of Arm 2 win-rate (95% vs 89%, against raw T2V), −0.95 on the hallucination composite (1.90 vs 2.85, where lower means fewer visual defects), and +2.5 points of production polish (0.990 vs 0.965), at ~2× cost (Pro $6 vs Lite $3) and ~2× latency. Lite is the right choice for high-volume creative iteration; Pro is the right choice when brand fidelity and on-screen detail matter most.

5. Creatify delivers a ready-to-ship ad, not a raw clip. Lite at $3/video and Pro at $6/video each produce a fully assembled branded ad with persona, voiceover, music, captions, and required on-screen text. Both tiers show full audio cohesion, a tight duration match, and 89–96% of required on-screen text rendered correctly, vs 15–21% for the multi-scene raw T2V baselines (15–30% across all six baseline rows). That structural-compliance gap is what separates a deliverable ad from a stylish clip you still have to assemble yourself.

The bench is open-source. Every brief, every raw judge verdict, every output video, and every aggregation script is published under Apache 2.0 at github.com/creatify-ai/VABench. Fork it, audit the verdicts, run your own briefs through the same harness, or score your own agent against ours. The goal is to give the AI-video community a shared reference for what "shippable ad" means — and a forcing function to close the gap between a stunning clip and a finished campaign.

What's next

v2 will scale the benchmark along two axes. First, more briefs covering vertical-specific ad patterns: e-commerce, app install, B2B SaaS, healthcare, education. Second, a closed-loop track that measures the value of multi-turn revision — the production experience single-shot bench numbers can't capture. We are also folding campaign-level signals (CTR, ROAS) into a domain-specific AdReward to drive DPO-style alignment of the underlying I2V models, replacing the static rubric with a learned, ad-effectiveness-calibrated judge.

Acknowledgements

Creatify Agent and this benchmark stand on the shoulders of a community of researchers working on multi-agent video, video-quality evaluation, ad-effectiveness rubrics, and closed-loop quality assurance. Notable threads we drew on:

Multi-agent video generation: Mora, VideoDirectorGPT, FilmAgent, StoryAgent, VISTA.
Video-quality benchmarks: VBench / VBench-2.0 for per-frame quality dim definitions; VideoGen-Eval for agent-based evaluation methodology; Artificial Analysis Video Arena for pairwise + position-bias control patterns.
Reward modeling for video: VideoReward / Flow-DPO, UnifiedReward — these inform the trajectory we expect Creatify Agent to take next: turning campaign-level CTR/ROAS into a domain-specific AdReward that drives DPO-style alignment of the underlying generation models.
Ad-specific rubrics: Springboards Creativity Benchmark, AdsQA, the Pitt Ads Dataset, "Decoding the Hook" for first-3s evaluation.
Closed-loop quality assurance: Meta-Harness (Lee, Khattab, Finn et al.) on end-to-end optimization of evaluation harnesses; Learning Beyond Gradients (Weng) on critique-and-revise loops over explicit policies as a durable alternative to gradient-based fine-tuning.

For the full architecture of the Creatify Agent system, the design choices behind the closed-loop QA gate, and the production examples behind these numbers, read the companion technical report: Creatify Agent — a closed-loop agent built for video advertising. The benchmark itself — briefs, baseline runners, scorers, and aggregation — is open-source under Apache 2.0 at github.com/creatify-ai/VABench.