Deploy, But Verify: Analysing LLM Generated Code Safety

Rasmus Krebs, Somnath Mazumdar

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Abstract

The number of large language models for code generation is rising. However, comprehensive evaluations that focus on reliability and security remain sparse. This study evaluated the quality of Python code generated by five large language models: GPT-4-Turbo, DeepSeek-Coder-33B-Instruct, Gemini Pro 1.0, Codex, and CodeLlama-70B-Instruct. The evaluation considered three diverse application domains with varying prompt lengths for fair comparison. We found that GPT-4-Turbo generated, on average, 4.5% more secure code than a Python developer with three years of experience.
Original language: English
Title of host publication: Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025
Place of Publication: Los Alamitos, CA
Publisher: Institute of Electrical and Electronics Engineers Inc.
Publication date: Mar 2025
Pages: 13-16
ISBN (Print): 9798331524944
ISBN (Electronic): 9798331524937
Publication status: Published - Mar 2025
Event: 33rd Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP 2025 - University of Turin, Torino, Italy
Duration: 12 Mar 2025 to 14 Mar 2025
Conference number: 33
https://pdp2025.org/

Conference

Conference: 33rd Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP 2025
Number: 33
Location: University of Turin
Country/Territory: Italy
City: Torino
Period: 12/03/2025 to 14/03/2025

Keywords

  • Code
  • LLM
  • Python
  • Reliability
  • Safety
  • Security
