VII. CONCLUSION
In this paper, we present IBIS, a blockchain-based data provenance, lineage, and copyright management system for AI models. By granting the relevant licenses, IBIS provides verifiable evidence of, and limits the scope of authority over, iterative model retraining and fine-tuning processes. We leverage blockchain-based multi-party signing capabilities to streamline the establishment of legally compliant licensing agreements between AI model owners and copyright holders. We also establish access control mechanisms that safeguard confidentiality by limiting access to authorized parties. Our system implementation is based on the Daml ledger model and the Canton blockchain. Performance evaluations underscore the feasibility and scalability of IBIS across varying user, dataset, model, and license workloads. Potential future work includes exploring alternative on-chain data structures to optimize the performance of graph traversals, and extending IBIS to cover additional stages of the AI lifecycle, such as data cleaning, model testing, and model explanation.
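To make the licensing and access-control pattern above concrete, the following is a minimal Daml sketch, not the actual IBIS contract templates: a hypothetical LicenseAgreement template whose creation requires the authority of both the copyright holder and the model owner (multi-party signing), whose on-ledger visibility is limited to its signatories and observers (access control), and whose single choice records a fine-tuning step by archiving the contract and creating a successor for the new model version. The template name, fields, and choice are illustrative assumptions.

```daml
module LicenseAgreementSketch where

-- Illustrative sketch only; names and fields are assumptions, not IBIS's templates.
template LicenseAgreement
  with
    copyrightHolder : Party   -- must authorize the agreement
    modelOwner      : Party   -- must also authorize it (multi-party signing)
    auditor         : Party   -- may read the contract but not modify it
    datasetId       : Text
    modelId         : Text
    licenseTerms    : Text
  where
    -- Multi-party signing: the contract only exists with both parties' signatures.
    signatory copyrightHolder, modelOwner
    -- Access control: only signatories and observers ever see this contract.
    observer auditor

    -- Record a retraining/fine-tuning step under the same license by archiving
    -- this contract and creating a successor that references the new model version.
    choice RecordFineTuning : ContractId LicenseAgreement
      with
        newModelId : Text
      controller modelOwner
      do
        create this with modelId = newModelId
```

On a Canton ledger, contracts are distributed only to the participant nodes of their stakeholders (signatories and observers), which is the property the confidentiality claim above relies on; exercising RecordFineTuning leaves an archived predecessor on the ledger, giving a traceable chain of model versions under the same agreement.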
Authors:
(1) Yilin Sai, CSIRO Data61 and The University of New South Wales, Sydney, Australia;
(2) Qin Wang, CSIRO Data61 and The University of New South Wales, Sydney, Australia;
(3) Guangsheng Yu, CSIRO Data61;
(4) H.M.N. Dilum Bandara, CSIRO Data61 and The University of New South Wales, Sydney, Australia;
(5) Shiping Chen, CSIRO Data61 and The University of New South Wales, Sydney, Australia.