Abstract
The use of large language models (LLMs) for generating code in software development is on the rise. While LLMs demonstrate impressive capabilities, there remains a need to evaluate the quality of generated code beyond functional correctness. This paper addresses a gap in current research by evaluating Python code generated by nine state-of-the-art LLMs against the four main code quality categories (maintainability, reliability, performance efficiency, and security) defined by the ISO/IEC 5055:2021 standard. The evaluation spans three popular application domains (high-performance computing, machine learning, and data processing) and employs a stratified prompting approach with three levels of prompt detail (short, medium, and long). Nine algorithms were selected across the three domains, and the generated code was compared against code written by a human developer using four widely adopted static code analysis tools. The resulting metrics were organized into the four ISO 5055 categories, and, after preprocessing to ensure accurate evaluation, a composite score was calculated for each category to identify which LLMs perform best. The results showed that GPT-4-Turbo produced the most reliable, performance-efficient, and secure code, while Gemini excelled in generating maintainable Python code among the evaluated models. The study concludes that properly prompted and configured LLMs can produce code that meets or even exceeds human-developed code across the four ISO categories. Future work will refine these methodologies and extend them to other programming languages.
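The composite-scoring step described above can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: the metric names, bounds, and category assignments are invented for the example, and the paper's real mapping follows the ISO/IEC 5055 weakness categories reported by its four static analysis tools.

```python
def normalize(value, worst, best):
    """Map a raw metric onto [0, 1], where 1 is the best possible score."""
    if worst == best:
        return 1.0
    score = (value - worst) / (best - worst)
    return max(0.0, min(1.0, score))


def composite_scores(metrics, category_map):
    """Average the normalized metrics assigned to each ISO 5055 category."""
    scores = {}
    for category, names in category_map.items():
        vals = [normalize(*metrics[n]) for n in names if n in metrics]
        scores[category] = sum(vals) / len(vals) if vals else None
    return scores


# Illustrative static-analysis metrics: (raw value, worst bound, best bound).
# These names and bounds are assumptions for the sketch only.
metrics = {
    "cyclomatic_complexity": (12, 30, 1),  # lower is better
    "code_smells": (4, 20, 0),
    "bug_density": (0.5, 5, 0),
    "security_hotspots": (1, 10, 0),
}
category_map = {
    "maintainability": ["cyclomatic_complexity", "code_smells"],
    "reliability": ["bug_density"],
    "security": ["security_hotspots"],
}

print(composite_scores(metrics, category_map))
```

Normalizing before averaging keeps metrics with different scales (e.g. complexity counts vs. defect densities) from dominating a category score, which matches the abstract's note that preprocessing was applied before composite scores were computed.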
| Original language | English |
|---|---|
| Journal | IEEE Access |
| Volume | 13 |
| Pages (from-to) | 202482-202499 |
| Number of pages | 18 |
| ISSN | 2169-3536 |
| DOIs | |
| Publication status | Published - 2025 |
Keywords
- Analysis
- ISO
- LLMs
- Maintainability
- Performance
- Python
- Quality
- Reliability
- Security