Guan, B., Wan, Y., Bi, Z. et al. (4 more authors) (2024) CODEIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code. In: Findings of the Association for Computational Linguistics: EMNLP 2024. 2024 Conference on Empirical Methods in Natural Language Processing, 12-16 Nov 2024, Miami, Florida, USA. Association for Computational Linguistics, pp. 9243-9258.
Abstract
Large Language Models (LLMs) have achieved remarkable progress in code generation. It has now become crucial to identify whether code is AI-generated and to determine the specific model used, particularly for purposes such as protecting Intellectual Property (IP) in industry and preventing cheating in programming exercises. To this end, several attempts have been made to insert watermarks into machine-generated code. However, existing approaches are limited to inserting only a single bit of information. In this paper, we introduce CodeIP, a novel multi-bit watermarking technique that embeds additional information to preserve crucial provenance details, such as the vendor ID of an LLM, thereby safeguarding the IP of LLMs in code generation. Furthermore, to ensure the syntactical correctness of the generated code, we propose constraining the sampling process for predicting the next token by training a type predictor. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP in watermarking LLMs for code generation while maintaining the syntactical correctness of code.
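The general idea of a multi-bit, logit-biasing watermark with a grammar mask can be illustrated with a toy sketch. This is not the CodeIP algorithm, whose details are not given in this record. The hashing scheme, the function names (`bit_slot`, `token_label`, `bias_logits`, `extract`, `demo`), the `delta` bias, and the `allowed` token set standing in for a type predictor are all assumptions made for illustration only.

```python
import hashlib
import math
import random


def _h(*parts) -> int:
    """Deterministic 64-bit hash of the given parts (hypothetical keying scheme)."""
    raw = "|".join(map(str, parts)).encode()
    return int.from_bytes(hashlib.sha256(raw).digest()[:8], "big")


def bit_slot(prev: int, n_bits: int) -> int:
    """Message bit position carried by the token that follows `prev`."""
    return _h("slot", prev) % n_bits


def token_label(prev: int, tok: int) -> int:
    """Pseudo-random 0/1 label of candidate `tok` in the context of `prev`."""
    return _h("label", prev, tok) & 1


def bias_logits(logits, prev, message, delta=4.0, allowed=None):
    """Boost tokens whose label equals the message bit selected by `prev`.

    `allowed` is a stand-in for a grammar/type constraint: tokens outside
    the set are masked to -inf so they can never be chosen.
    """
    want = message[bit_slot(prev, len(message))]
    out = []
    for tok, logit in enumerate(logits):
        if allowed is not None and tok not in allowed:
            out.append(-math.inf)
        elif token_label(prev, tok) == want:
            out.append(logit + delta)
        else:
            out.append(logit)
    return out


def extract(tokens, n_bits):
    """Recover the message by majority vote over consecutive token pairs."""
    votes = [[0, 0] for _ in range(n_bits)]
    for prev, tok in zip(tokens, tokens[1:]):
        votes[bit_slot(prev, n_bits)][token_label(prev, tok)] += 1
    return [int(ones > zeros) for zeros, ones in votes]


def demo(message, steps=300, vocab=64, seed=0):
    """Greedy decoding over random toy logits; no real language model involved."""
    rng = random.Random(seed)
    tokens = [0]
    for _ in range(steps):
        logits = [rng.gauss(0.0, 1.0) for _ in range(vocab)]
        biased = bias_logits(logits, tokens[-1], message)
        tokens.append(max(range(vocab), key=biased.__getitem__))
    return tokens
```

In this sketch, calling `extract(demo([1, 0, 1, 1]), 4)` votes on each bit position using only the token sequence and the shared hash, which is the sense in which a multi-bit payload such as a vendor ID could be read back from generated text. A real system would take `logits` from the model and derive `allowed` from a grammar or a trained type predictor rather than from a fixed set.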
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Guan, B., Wan, Y., Bi, Z. et al. (4 more authors) |
Copyright, Publisher and Additional Information: | ACL materials are Copyright © 1963–2025 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License. |
Dates: | |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Distributed Systems & Services |
Depositing User: | Symplectic Publications |
Date Deposited: | 21 Feb 2025 10:59 |
Last Modified: | 21 Feb 2025 10:59 |
Published Version: | https://aclanthology.org/2024.findings-emnlp.541/ |
Status: | Published |
Publisher: | Association for Computational Linguistics |
Identification Number: | 10.18653/v1/2024.findings-emnlp.541 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:223608 |