Abstract
The use of large language models (LLMs) for generating code in software development is on the rise. While LLMs demonstrate impressive capabilities, there remains a need to evaluate the quality of generated code beyond functional correctness. This paper addresses a gap in current research by evaluating Python code generated by nine state-of-the-art LLMs against the four main code quality categories (maintainability, reliability, performance efficiency, and security) defined by the ISO/IEC 5055:2021 standard. The evaluation spans three popular application domains (high-performance computing, machine learning, and data processing) and employs a stratified prompting approach with three levels of prompt detail (short, medium, and long). Nine algorithms were selected across the three domains, and the generated code was compared against code written by a human developer using four widely adopted static code analysis tools. The resulting metrics were organized into the four ISO 5055 categories, and, after preprocessing to ensure accurate evaluation, a composite score was calculated for each category to identify which LLMs perform best. The results showed that GPT-4-Turbo produced the most reliable, performance-efficient, and secure code, while Gemini excelled in generating maintainable Python code among the evaluated models. The study concludes that properly prompted and configured LLMs can produce code that meets or even exceeds human-developed code across the four ISO categories. Future work will refine these methodologies and extend them to other programming languages.
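The composite-scoring step described above can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: the metric names, bounds, and category assignments are invented for the example, and the paper's real mapping follows the ISO/IEC 5055 weakness categories reported by its four static analysis tools.

```python
def normalize(value, worst, best):
    """Map a raw metric onto [0, 1], where 1 is the best possible score."""
    if worst == best:
        return 1.0
    score = (value - worst) / (best - worst)
    return max(0.0, min(1.0, score))


def composite_scores(metrics, category_map):
    """Average the normalized metrics assigned to each ISO 5055 category."""
    scores = {}
    for category, names in category_map.items():
        vals = [normalize(*metrics[n]) for n in names if n in metrics]
        scores[category] = sum(vals) / len(vals) if vals else None
    return scores


# Illustrative static-analysis metrics: (raw value, worst bound, best bound).
# These names and bounds are assumptions for the sketch only.
metrics = {
    "cyclomatic_complexity": (12, 30, 1),  # lower is better
    "code_smells": (4, 20, 0),
    "bug_density": (0.5, 5, 0),
    "security_hotspots": (1, 10, 0),
}
category_map = {
    "maintainability": ["cyclomatic_complexity", "code_smells"],
    "reliability": ["bug_density"],
    "security": ["security_hotspots"],
}

print(composite_scores(metrics, category_map))
```

Normalizing before averaging keeps metrics with different scales (e.g. complexity counts vs. defect densities) from dominating a category score, which matches the abstract's note that preprocessing was applied before composite scores were computed.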
| Original language | English |
|---|---|
| Journal | IEEE Access |
| Volume | 13 |
| Pages (from-to) | 202482-202499 |
| Number of pages | 18 |
| ISSN | 2169-3536 |
| DOIs | |
| Publication status | Published - 2025 |
Keywords
- Analysis
- ISO
- LLMs
- Maintainability
- Performance
- Python
- Quality
- Reliability
- Security