Deploy, But Verify: Analysing LLM Generated Code Safety

Rasmus Krebs, Somnath Mazumdar

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer review

Abstract

The number of large language models for code generation is rising. However, comprehensive evaluations that focus on reliability and security remain sparse. This study evaluated the quality of Python code generated by five large language models: GPT-4-Turbo, DeepSeek-Coder-33B-Instruct, Gemini Pro 1.0, Codex, and CodeLlama-70B-Instruct. The evaluation covered three diverse application domains with varying prompt lengths for a fair comparison. We found that GPT-4-Turbo generated, on average, 4.5% more secure code than a Python developer with three years of experience.
Original language: English
Title: Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025
Place of publication: Los Alamitos, CA
Publisher: Institute of Electrical and Electronics Engineers Inc.
Publication date: Mar. 2025
Pages: 13-16
ISBN (Print): 9798331524944
ISBN (Electronic): 9798331524937
DOI
Status: Published - Mar. 2025
Event: 33rd Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP 2025 - University of Turin, Torino, Italy
Duration: 12 Mar. 2025 - 14 Mar. 2025
Conference number: 33
https://pdp2025.org/

Conference

Conference: 33rd Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP 2025
Number: 33
Location: University of Turin
Country/Territory: Italy
City: Torino
Period: 12/03/2025 - 14/03/2025
Internet address

Keywords

  • Code
  • LLM
  • Python
  • Reliability
  • Safety
  • Security
