Analyzing LLM-Generated Code According to Four ISO/IEC 5055:2021 Categories

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

The use of large language models (LLMs) for generating code in software development is on the rise. While LLMs demonstrate impressive capabilities, the quality of the generated code needs to be evaluated beyond functional correctness. This paper addresses a gap in current research by evaluating Python code generated by nine state-of-the-art LLMs against the four main code quality categories (maintainability, reliability, performance efficiency, and security) defined by the ISO/IEC 5055:2021 standard. The evaluation spans three popular application domains (high-performance computing, machine learning, and data processing) and employs a stratified prompting approach with three levels of detail (short, medium, and long). Nine algorithms were selected across the three domains, and the generated code was compared against code from a human developer using four widely adopted static code analysis tools. The resulting metrics were organized into the four ISO 5055 categories, and, after preprocessing to ensure accurate evaluation, composite scores were calculated for each category. The results showed that GPT-4-Turbo produced the most reliable, performance-efficient, and secure code, while Gemini generated the most maintainable Python code among the evaluated models. The study concludes that properly prompted and configured LLMs can produce code that meets or even exceeds human-developed code across the four ISO categories. Future work will refine these methodologies and extend them to other programming languages.
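The abstract describes grouping static-analysis metrics into the four ISO 5055 categories and computing a composite score per category after preprocessing. A minimal sketch of that step, assuming min-max normalization as the preprocessing choice and an unweighted mean as the composite (the paper's actual metric names, weights, and normalization are not given here, so all names and values below are invented):

```python
def composite_scores(metrics, categories):
    """Return one composite score per ISO 5055 category.

    metrics    -- dict: metric name -> list of raw values, one per model
    categories -- dict: category name -> list of metric names in that category
    """
    # Min-max normalize each metric across models so different scales
    # (e.g. violation counts vs. complexity) become comparable.
    normalized = {}
    for name, values in metrics.items():
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero for constant metrics
        normalized[name] = [(v - lo) / span for v in values]

    n_models = len(next(iter(metrics.values())))
    scores = {}
    for cat, names in categories.items():
        # Composite score = mean of the category's normalized metrics.
        scores[cat] = [
            sum(normalized[m][i] for m in names) / len(names)
            for i in range(n_models)
        ]
    return scores


# Hypothetical example: two models, three metrics in two categories.
metrics = {
    "security_violations": [2, 8],
    "cwe_findings": [1, 3],
    "cyclomatic_complexity": [4, 10],
}
categories = {
    "security": ["security_violations", "cwe_findings"],
    "maintainability": ["cyclomatic_complexity"],
}
print(composite_scores(metrics, categories))
```

With min-max normalization, lower raw values map to lower composite scores, so a "lower is better" convention (fewer violations) carries through unchanged; the actual study may invert or weight scores differently.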
Original language: English
Journal: IEEE Access
Volume: 13
Pages (from-to): 202482-202499
Number of pages: 18
ISSN: 2169-3536
DOIs
Publication status: Published - 2025

Keywords

  • Analysis
  • ISO
  • LLMs
  • Maintainability
  • Performance
  • Python
  • Quality
  • Reliability
  • Security
