SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation

Siqi Chen1, Xinyu Dong1, Haolei Xu1, Xingyu Wu1, Fei Tang1, Hang Zhang1, Yuchen Yan1, Linjuan Wu1, Wenqi Zhang1, Guiyang Hou1, Yongliang Shen1†, Weiming Lu1, Yueting Zhuang1
1Zhejiang University
Preprint. Under review.
†Corresponding Author

Overview of SVGenius. SVGenius evaluates (M)LLM capabilities across three progressive dimensions: Understanding (perceptual and semantic QA), Editing (bug fixing, code optimization, and style editing), and Generation (text-to-SVG, multimodal-to-SVG, and style transfer). Built on real-world data from 24 domains with systematic complexity stratification, the benchmark enables comprehensive assessment of SVG processing capabilities. The radar chart shows representative model performance patterns, revealing distinct capability boundaries and degradation as complexity increases.

Abstract

Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, a lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches. Reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications.

Dataset Construction and Validation

To address limitations in prior SVG benchmarks, SVGenius constructs a high-quality, complexity-aware dataset from over 100K real-world SVGs sourced across 24 domains. Following rigorous preprocessing and semantic validation by human annotators, 927 structurally and semantically sound samples are curated. A principled complexity stratification framework is introduced, leveraging normalized metrics—path count, control points, and command diversity—to partition samples into Easy, Moderate, and Complex tiers. Stratified sampling and manual inspection yield a balanced subset of 300 representative SVGs, laying a robust foundation for multi-dimensional evaluation across understanding, editing, and generation tasks.
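As a concrete illustration, the sketch below extracts the three raw features named above (path count, control points, command diversity) from an SVG's path data and combines them into a single score. The normalization caps, equal weighting, and tier thresholds are illustrative assumptions, not the paper's exact stratification formula; the sketch also assumes paths carry the standard SVG namespace.

import re
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"  # assumes standard-namespaced SVGs

def complexity_features(svg_text: str) -> dict:
    """Extract the three raw features used for complexity stratification."""
    root = ET.fromstring(svg_text)
    n_paths, n_points, commands = 0, 0, set()
    for path in root.iter(SVG_NS + "path"):
        d = path.get("d", "")
        n_paths += 1
        # Path commands are single letters (M, L, C, Q, A, ...); numeric
        # tokens in the path data approximate coordinate pairs.
        commands.update(c.upper() for c in re.findall(r"[A-Za-z]", d))
        n_points += len(re.findall(r"-?\d*\.?\d+(?:e-?\d+)?", d)) // 2
    return {"paths": n_paths,
            "control_points": n_points,
            "command_diversity": len(commands)}

def complexity_score(features: dict,
                     caps=(50, 2000, 10),        # assumed normalization caps
                     weights=(1/3, 1/3, 1/3)) -> float:
    """Clip each feature to [0, 1] via its cap, then take a weighted sum.
    Tier cutoffs for Easy/Moderate/Complex would be fit on the corpus."""
    vals = (features["paths"], features["control_points"],
            features["command_diversity"])
    return sum(w * min(v / c, 1.0) for v, c, w in zip(vals, caps, weights))

In practice the caps and weights would be calibrated on the 927 validated samples so that the resulting score distribution separates cleanly into the three tiers shown in the validation figure below.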

SVGenius dataset construction and complexity validation

Left: the systematic pipeline from data collection and cleaning through human filtering to complexity stratification. Center: 24-domain coverage across diverse applications. Right: validation of the complexity modeling, showing clear hierarchical separation across Easy, Moderate, and Complex levels in feature distributions and complexity scores.


We compare construction methods, domain diversity, complexity metrics (paths and control points), and task coverage. SVGenius provides the first comprehensive evaluation across Understanding, Editing, and Generation with systematic complexity modeling. Task abbreviations: PQA (Perceptual QA), SQA (Semantic QA), BF (Bug Fixing), CO (Code Optimization), SE (Style Editing), TTG (Text-to-SVG), ITG (Image-to-SVG), ST (Style Transfer).


Results

SVGenius evaluates 22 diverse (M)LLMs across three SVG processing dimensions under a zero-shot setting, revealing significant capability disparities. Proprietary models lead overall but degrade sharply as complexity rises, while reasoning-enhanced training consistently outperforms pure scaling, especially on complex understanding and generation tasks. Open-source models benefit from scale but remain limited by architectural and training constraints, and specialized models show domain-specific strengths without generalizing beyond them. Crucially, all models exhibit systematic degradation patterns, underscoring fundamental limitations of current approaches and the need for structure-aware, reasoning-rich training paradigms.
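For concreteness, the sketch below shows the shape of such a zero-shot evaluation loop over the eight task categories. The query_model callable, the sample fields, and the per-task scorers are hypothetical stand-ins for whichever (M)LLM API and metrics are used; SVGenius's actual prompts and 18 metrics are not reproduced here.

from typing import Callable

# Dimension -> task categories, as listed in the overview above.
TASKS = {
    "Understanding": ["PQA", "SQA"],
    "Editing":       ["BF", "CO", "SE"],
    "Generation":    ["TTG", "ITG", "ST"],
}

def evaluate(samples: list[dict],
             query_model: Callable[[str], str]) -> dict[str, float]:
    """samples: dicts with 'task', 'prompt', and a per-task 'scorer' callable
    (all hypothetical field names). Returns mean score per task category."""
    per_task: dict[str, list[float]] = {}
    for sample in samples:
        # Zero-shot: the prompt contains only the query and the SVG code,
        # with no in-context exemplars.
        prediction = query_model(sample["prompt"])
        per_task.setdefault(sample["task"], []).append(
            sample["scorer"](prediction))
    return {task: sum(s) / len(s) for task, s in per_task.items()}

Running this loop separately on the Easy, Moderate, and Complex tiers yields the per-complexity breakdown from which the degradation patterns above are read off.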

BibTeX


@misc{chen2025svgeniusbenchmarkingllmssvg,
  title={SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation},
  author={Siqi Chen and Xinyu Dong and Haolei Xu and Xingyu Wu and Fei Tang and Hang Zhang and Yuchen Yan and Linjuan Wu and Wenqi Zhang and Guiyang Hou and Yongliang Shen and Weiming Lu and Yueting Zhuang},
  year={2025},
  eprint={2506.03139},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.03139},
}