SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation

Siqi Chen1, Xinyu Dong1, Haolei Xu1, Xingyu Wu1, Fei Tang1, Hang Zhang1, Yuchen Yan1, Linjuan Wu1, Wenqi Zhang1, Guiyang Hou1, Yongliang Shen1†, Weiming Lu1, Yueting Zhuang1
1Zhejiang University
Preprint. Under review.
†Corresponding Author

Overview of SVGenius. SVGenius evaluates (M)LLM capabilities across three progressive dimensions: Understanding (perceptual and semantic QA), Editing (bug fixing, code optimization, and style editing), and Generation (text-to-SVG, multimodal-to-SVG, and style transfer). Built on real-world data from 24 domains with systematic complexity stratification, the benchmark enables comprehensive assessment of SVG processing capabilities. The radar chart shows representative model performance patterns, revealing distinct capability boundaries and degradation as complexity increases.

Abstract

Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, a lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches. Reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications.

Dataset Construction and Validation

To address limitations in prior SVG benchmarks, SVGenius constructs a high-quality, complexity-aware dataset from over 100K real-world SVGs sourced across 24 domains. Following rigorous preprocessing and semantic validation by human annotators, 927 structurally and semantically sound samples are curated. A principled complexity stratification framework is introduced, leveraging normalized metrics—path count, control points, and command diversity—to partition samples into Easy, Moderate, and Complex tiers. Stratified sampling and manual inspection yield a balanced subset of 300 representative SVGs, laying a robust foundation for multi-dimensional evaluation across understanding, editing, and generation tasks.
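As a concrete illustration, the sketch below extracts the three raw features named above (path count, control points, command diversity) from an SVG's path data and combines them into a single score. The normalization caps, equal weighting, and tier thresholds are illustrative assumptions, not the paper's exact stratification formula; the sketch also assumes paths carry the standard SVG namespace.

import re
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"  # assumes standard-namespaced SVGs

def complexity_features(svg_text: str) -> dict:
    """Extract the three raw features used for complexity stratification."""
    root = ET.fromstring(svg_text)
    n_paths, n_points, commands = 0, 0, set()
    for path in root.iter(SVG_NS + "path"):
        d = path.get("d", "")
        n_paths += 1
        # Path commands are single letters (M, L, C, Q, A, ...); numeric
        # tokens in the path data approximate coordinate pairs.
        commands.update(c.upper() for c in re.findall(r"[A-Za-z]", d))
        n_points += len(re.findall(r"-?\d*\.?\d+(?:e-?\d+)?", d)) // 2
    return {"paths": n_paths,
            "control_points": n_points,
            "command_diversity": len(commands)}

def complexity_score(features: dict,
                     caps=(50, 2000, 10),        # assumed normalization caps
                     weights=(1/3, 1/3, 1/3)) -> float:
    """Clip each feature to [0, 1] via its cap, then take a weighted sum.
    Tier cutoffs for Easy/Moderate/Complex would be fit on the corpus."""
    vals = (features["paths"], features["control_points"],
            features["command_diversity"])
    return sum(w * min(v / c, 1.0) for v, c, w in zip(vals, caps, weights))

In practice the caps and weights would be calibrated on the 927 validated samples so that the resulting score distribution separates cleanly into the three tiers shown in the validation figure below.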

SVGenius dataset construction and complexity validation

Left: the systematic pipeline from data collection and cleaning through human filtering to complexity stratification. Center: 24-domain coverage across diverse applications. Right: validation of the complexity modeling, showing clear hierarchical separation across Easy, Moderate, and Complex levels in feature distributions and complexity scores.


We compare construction methods, domain diversity, complexity metrics (paths and control points), and task coverage. SVGenius provides the first comprehensive evaluation across Understanding, Editing, and Generation with systematic complexity modeling. Task abbreviations: PQA (Perceptual QA), SQA (Semantic QA), BF (Bug Fixing), CO (Code Optimization), SE (Style Editing), TTG (Text-to-SVG), ITG (Image-to-SVG), ST (Style Transfer).


Results

SVGenius evaluates 22 diverse (M)LLMs across three SVG processing dimensions under a zero-shot setting, revealing significant capability disparities. Proprietary models lead overall but degrade sharply as complexity rises, while reasoning-enhanced training consistently outperforms pure scaling, especially on complex understanding and generation tasks. Open-source models benefit from scale but remain limited by architectural and training constraints, and specialized models show domain-specific strengths without generalizing beyond them. Crucially, all models exhibit systematic degradation patterns, underscoring fundamental limitations of current approaches and the need for structure-aware, reasoning-rich training paradigms.
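For concreteness, the sketch below shows the shape of such a zero-shot evaluation loop over the eight task categories. The query_model callable, the sample fields, and the per-task scorers are hypothetical stand-ins for whichever (M)LLM API and metrics are used; SVGenius's actual prompts and 18 metrics are not reproduced here.

from typing import Callable

# Dimension -> task categories, as listed in the overview above.
TASKS = {
    "Understanding": ["PQA", "SQA"],
    "Editing":       ["BF", "CO", "SE"],
    "Generation":    ["TTG", "ITG", "ST"],
}

def evaluate(samples: list[dict],
             query_model: Callable[[str], str]) -> dict[str, float]:
    """samples: dicts with 'task', 'prompt', and a per-task 'scorer' callable
    (all hypothetical field names). Returns mean score per task category."""
    per_task: dict[str, list[float]] = {}
    for sample in samples:
        # Zero-shot: the prompt contains only the query and the SVG code,
        # with no in-context exemplars.
        prediction = query_model(sample["prompt"])
        per_task.setdefault(sample["task"], []).append(
            sample["scorer"](prediction))
    return {task: sum(s) / len(s) for task, s in per_task.items()}

Running this loop separately on the Easy, Moderate, and Complex tiers yields the per-complexity breakdown from which the degradation patterns above are read off.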

BibTeX


@misc{chen2025svgeniusbenchmarkingllmssvg,
  title={SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation},
  author={Siqi Chen and Xinyu Dong and Haolei Xu and Xingyu Wu and Fei Tang and Hang Zhang and Yuchen Yan and Linjuan Wu and Wenqi Zhang and Guiyang Hou and Yongliang Shen and Weiming Lu and Yueting Zhuang},
  year={2025},
  eprint={2506.03139},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.03139},
}