Web Artifact Evaluation Benchmark

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

WebRISE tests whether generated webpages actually work. It compiles requirements into observable UI states, user-intent transitions, and DOM/visual assertions, then checks them through browser execution.

442 tasks
5 input modalities
5,495 transitions
5,271 requirement checks

Benchmark Coverage

Diverse web tasks with explicit and implicit requirements.

WebRISE spans 8 domains and 35 scenarios. Its contracts cover both explicit affordances and implicit state-consistency constraints, so a failure can be traced back to the requirement it violates rather than only to a local event.

8 application domains
35 domain scenarios
2,210 task-modality instances
12,441 DOM/visual assertions
Domain and scenario distribution of WebRISE.
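
The headline counts are internally consistent, and a few derived ratios help size the benchmark. The short check below uses only the numbers quoted on this page. The per-task and per-transition averages are our own derived figures, and they assume that transitions and requirement checks are counted once per task and that assertions are attached to transitions.

```python
# Counts quoted on this page.
tasks = 442
modalities = 5
transitions = 5495
requirement_checks = 5271
assertions = 12441

# Each task is instantiated once per input modality.
instances = tasks * modalities
print(instances)  # 2210, matching the reported task-modality instances

# Derived averages (not reported by the authors).
transitions_per_task = round(transitions / tasks, 2)
checks_per_task = round(requirement_checks / tasks, 2)
assertions_per_transition = round(assertions / transitions, 2)

print(transitions_per_task)       # 12.43
print(checks_per_task)            # 11.93
print(assertions_per_transition)  # 2.26
```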

Overview

Evaluate behavior, not just appearance.

Existing protocols often inspect only local evidence: a screenshot, a fixed script, a checkpoint, or an exploration trace. WebRISE turns requirements into Interaction Contract Graphs (ICGs) and verifies the page through adaptive browser interaction.

Requirement-induced evaluation maps explicit and implicit requirements to states, transitions, DOM/visual assertions, and transition-level evidence.
Example: a shopping-cart transition in which the failing artifact toggles an item but leaves the dependent totals and checkout availability stale.
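
To make the shopping-cart case concrete, here is a minimal sketch of how one contract transition could carry both explicit and implicit assertions. It is illustrative only: the class names (`Assertion`, `Transition`), the requirement IDs, and the dictionary-based page state are hypothetical and do not reflect the WebRISE schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical page state: a plain dict standing in for values read from the DOM.
PageState = Dict[str, object]


@dataclass
class Assertion:
    requirement_id: str                  # requirement this assertion traces back to
    description: str
    check: Callable[[PageState], bool]   # returns True when the page satisfies it


@dataclass
class Transition:
    source: str                          # state before the user intent
    intent: str                          # what the user is trying to do
    target: str                          # expected state afterwards
    assertions: List[Assertion] = field(default_factory=list)


def failed_requirements(transition: Transition, observed: PageState) -> List[str]:
    """Return the IDs of requirements whose assertions fail on the observed state."""
    return [a.requirement_id for a in transition.assertions if not a.check(observed)]


toggle_item = Transition(
    source="cart_item_selected",
    intent="deselect the only item in the cart",
    target="cart_nothing_selected",
    assertions=[
        Assertion("R1-explicit", "item checkbox is unchecked",
                  lambda s: s["item_selected"] is False),
        Assertion("R2-implicit", "total counts selected items only",
                  lambda s: s["total"] == 0.0),
        Assertion("R3-implicit", "checkout is disabled when nothing is selected",
                  lambda s: s["checkout_enabled"] is False),
    ],
)

# The failing artifact toggles the checkbox but leaves dependent state stale.
stale_page = {"item_selected": False, "total": 19.99, "checkout_enabled": True}
correct_page = {"item_selected": False, "total": 0.0, "checkout_enabled": False}

print(failed_requirements(toggle_item, stale_page))    # ['R2-implicit', 'R3-implicit']
print(failed_requirements(toggle_item, correct_page))  # []
```

Because each assertion carries a requirement ID, the stale page is reported as violating two implicit requirements rather than as a single opaque failure.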

Pipeline

From multimodal specifications to executable interaction contracts.

WebRISE constructs one task-level contract that is shared across Text, Markdown, Sketch, Image, and Video inputs. A contract-guided agent executes each transition, and dual DOM/VLM oracles score the resulting evidence.

The ICG specifies what to verify; the adaptive browser agent decides how to realize each transition on the generated page.
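
The split between "what to verify" and "how to realize it" can be sketched as a small loop. This is a toy under stated assumptions, not the WebRISE implementation: the agent is a stub returning canned evidence instead of driving a browser, the VLM oracle is a string check, and the rule that a transition is valid only when both oracles accept it is our assumption. All names (`Evidence`, `ContractStep`, `run_contract`, `fake_agent`) are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple


class Evidence(NamedTuple):
    dom: Dict[str, str]   # simplified DOM snapshot: selector -> text content
    screenshot: str       # text stand-in for the screenshot a VLM would inspect


class ContractStep(NamedTuple):
    intent: str                                   # what to verify (from the contract)
    dom_oracle: Callable[[Dict[str, str]], bool]  # structural check
    vlm_oracle: Callable[[str], bool]             # visual check (stubbed here)


def run_contract(steps: List[ContractStep],
                 agent: Callable[[str], Evidence]) -> List[dict]:
    """Let the agent realize each intent, then score the evidence with both oracles."""
    results = []
    for step in steps:
        evidence = agent(step.intent)  # the agent decides how to perform the intent
        dom_ok = step.dom_oracle(evidence.dom)
        vlm_ok = step.vlm_oracle(evidence.screenshot)
        # Assumption: a transition is valid only when both oracles accept it.
        results.append({"intent": step.intent, "dom": dom_ok, "vlm": vlm_ok,
                        "valid": dom_ok and vlm_ok})
    return results


def fake_agent(intent: str) -> Evidence:
    """Canned evidence standing in for a real browser session."""
    if intent == "open the settings panel":
        return Evidence({"#settings": "Settings"}, "settings panel visible")
    return Evidence({"#theme": "light"}, "page still uses the light theme")


steps = [
    ContractStep("open the settings panel",
                 lambda dom: "#settings" in dom,
                 lambda shot: "settings panel visible" in shot),
    ContractStep("switch to dark theme",
                 lambda dom: dom.get("#theme") == "dark",
                 lambda shot: "dark" in shot),
]

results = run_contract(steps, fake_agent)
for r in results:
    print(r["intent"], "->", "valid" if r["valid"] else "invalid")
```

Keeping the oracles on the contract step and the interaction logic in the agent means the same contract can be replayed against pages with different layouts or selectors.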

Leaderboard

Interactive web generation is still far from solved.

Across 14 MLLMs, the strongest setting reaches only 65.6 T and 66.3 R, leaving roughly one third of transitions and requirement checks unsatisfied. T denotes transition validity, R denotes requirement coverage, and V denotes auxiliary visual quality.

T: Transition validity
R: Requirement coverage
V: Auxiliary visual quality
Safety: HTML safety and robustness diagnostics
442 tasks across five modalities

The same contract is tested under Text, Markdown, Sketch, Image, and Video inputs.

5,495 state transitions

Evaluation targets requirement-induced behavior instead of isolated checkpoints.

5,271 requirement checks

Explicit user functions and implicit product-level constraints are scored separately.

65.6 / 66.3 best T / R scores

Even the strongest setting leaves about one third of required behavior unsatisfied.

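Open-weight models

| Model | Text T | Text R | Text V | MD T | MD R | MD V | Sketch T | Sketch R | Sketch V | Image T | Image R | Image V | Video T | Video R | Video V | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B | 26.8 | 30.5 | 78.2 | 15.5 | 19.2 | 80.8 | 41.2 | 45.4 | 77.0 | 46.6 | 49.6 | 71.7 | 49.5 | 52.2 | 72.8 | 50.5 |
| Qwen3.5-122B-A10B | 38.0 | 41.2 | 56.8 | 42.5 | 45.9 | 72.0 | 38.0 | 42.3 | 74.0 | 40.2 | 43.8 | 70.7 | 42.8 | 47.1 | 71.3 | 51.1 |
| Qwen3.5-27B | 36.3 | 40.0 | 59.9 | 41.7 | 45.5 | 72.1 | 38.6 | 42.7 | 76.8 | 42.6 | 46.7 | 70.6 | 43.1 | 46.9 | 71.8 | 51.7 |
| Qwen3.5-397B-A17B | 45.7 | 49.2 | 64.8 | 51.1 | 54.5 | 75.7 | 46.8 | 50.5 | 78.9 | 48.4 | 51.4 | 72.8 | 49.3 | 52.8 | 72.1 | 57.6 |
| Kimi-K2.5 | 48.5 | 51.9 | 68.9 | 57.0 | 59.6 | 73.8 | 47.8 | 50.4 | 79.9 | 56.9 | 59.1 | 72.6 | 58.6 | 60.3 | 72.9 | 61.2 |
| Qwen3.6-27B | 47.9 | 50.9 | 75.3 | 57.5 | 60.1 | 83.0 | 50.4 | 53.3 | 87.2 | 55.2 | 57.8 | 74.1 | 54.2 | 57.2 | 74.1 | 62.5 |
| Kimi-K2.6 | 44.6 | 47.3 | 83.1 | 51.7 | 54.9 | 87.1 | 47.8 | 51.5 | 86.3 | 58.5 | 60.4 | 73.2 | 63.7 | 65.4 | 73.5 | 63.3 |

Proprietary models

| Model | Text T | Text R | Text V | MD T | MD R | MD V | Sketch T | Sketch R | Sketch V | Image T | Image R | Image V | Video T | Video R | Video V | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 43.3 | 45.5 | 56.6 | 54.3 | 56.3 | 73.9 | 52.3 | 55.0 | 72.2 | 57.7 | 59.5 | 70.2 | 52.6 | 54.9 | 70.7 | 58.3 |
| Gemini 3 Flash | 44.7 | 48.2 | 71.9 | 50.0 | 54.1 | 79.3 | 46.1 | 49.3 | 85.4 | 54.1 | 57.5 | 72.4 | 45.6 | 48.5 | 70.8 | 58.5 |
| Claude Opus 4.7 | 48.8 | 50.9 | 68.3 | 54.5 | 56.5 | 76.2 | 49.7 | 52.4 | 77.4 | 57.0 | 58.5 | 70.5 | 65.0 | 66.1 | 72.7 | 61.6 |
| Gemini 3.1 Pro | 50.7 | 53.6 | 69.7 | 58.9 | 61.5 | 79.2 | 52.2 | 54.9 | 84.8 | 54.5 | 57.1 | 72.2 | 52.0 | 54.9 | 71.6 | 61.9 |
| Qwen3.6-Plus | 49.3 | 51.9 | 68.2 | 51.7 | 54.6 | 74.5 | 53.8 | 56.4 | 86.3 | 57.5 | 59.4 | 73.8 | 61.7 | 63.4 | 74.8 | 62.5 |
| GPT-5.4 | 59.7 | 61.4 | 78.4 | 60.5 | 62.2 | 79.8 | 57.8 | 60.3 | 86.6 | 60.0 | 62.1 | 71.5 | 63.1 | 64.8 | 73.7 | 66.8 |
| GPT-5.5 | 60.3 | 62.3 | 85.6 | 64.4 | 66.1 | 83.3 | 60.6 | 62.9 | 86.1 | 61.8 | 63.4 | 74.1 | 65.6 | 66.3 | 73.9 | 69.1 |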

Bold and underline denote the best and second-best result within each model group.
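
The page does not define how the Overall column is computed. For the rows checked below, the reported value matches the unweighted mean of the 15 per-modality T, R, and V scores, rounded to one decimal. The sketch reproduces that check; treat the averaging rule as an observation about these rows, not as the authors' stated definition.

```python
# Per-modality scores in the order Text, MD, Sketch, Image, Video (T, R, V each),
# followed by the reported Overall value, copied from the leaderboard.
rows = {
    "Qwen3.6-35B-A3B": ([26.8, 30.5, 78.2, 15.5, 19.2, 80.8, 41.2, 45.4, 77.0,
                         46.6, 49.6, 71.7, 49.5, 52.2, 72.8], 50.5),
    "Kimi-K2.6": ([44.6, 47.3, 83.1, 51.7, 54.9, 87.1, 47.8, 51.5, 86.3,
                   58.5, 60.4, 73.2, 63.7, 65.4, 73.5], 63.3),
    "GPT-5.5": ([60.3, 62.3, 85.6, 64.4, 66.1, 83.3, 60.6, 62.9, 86.1,
                 61.8, 63.4, 74.1, 65.6, 66.3, 73.9], 69.1),
}


def mean_score(scores):
    """Unweighted mean over all T/R/V scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


for model, (scores, reported) in rows.items():
    computed = mean_score(scores)
    status = "matches" if abs(computed - reported) < 0.05 else "differs"
    print(f"{model}: computed {computed}, reported {reported} ({status})")
```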

Citation

Cite WebRISE

If you use WebRISE in your research, please cite our arXiv paper.

@misc{meng2026webriserequirementinducedstateevaluation,
      title={WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts}, 
      author={Yuxin Meng and Yuhan Suo and Junjie Wang and Yuhan Sun and Yiyao Yu and Ruixu Zhang and Ruining Hu and Yubin Wang and Shouwei Ruan and Bin Wang and Yuxiang Zhang and Yujiu Yang},
      year={2026},
      eprint={2606.03220},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.03220}, 
}