How the IBIS Framework Manages Dataset, License, and AI Model Attributes

17 Sept 2024

Abstract and I. Introduction

II. Preliminaries

III. Proposed Design: IBis

IV. Detailed Construction

V. Implementation on DAML

VI. Evaluation

VII. Conclusion and References

IV. DETAILED CONSTRUCTION

A. Data Models

The main attributes that the IBIS framework supports can be broadly grouped as follows (see Table I):

• Dataset attributes: A dataset is uniquely identified by a datasetId and sourced from a specific URL. It includes information on the copyright owner, associated license, and models trained on it. Additionally, the dataset’s ownership and creator are tracked through the CO’s copyrightOwnerId.

• License attributes: Each license has a distinct licenseId and encompasses a defined scope, typically a URL/URI. It includes details like copyright ownership, digital signatures of owners, and validity timestamps. The license type identifier typeId aids in determining the applicable LVC smart contract for license validation. Moreover, it lists the datasets covered under the terms of the license.

• Model attributes: AI models are identified by a unique modelId and associated with owners. They utilize datasets for training, which are listed within the model’s attributes. A retrained model references its source model through sourceModelId, and any subsequent models derived from a model are listed as child models in its childModelList. This structure establishes the lineage of models and facilitates the tracking of relationships between models and data within AI services.

We highlight two aspects. First, the data model is extensible, enabling AOs and COs to incorporate additional custom attributes as needed. For example, a storage URL can be included in dataset metadata to indicate where the AO stores the dataset. A license can include custom attributes such as expiration date, exclusivity, and other terms and conditions. Second, a web path is employed to delineate the scope of what is being licensed, considering that the majority of AI models are trained using online data.

TABLE I: Dataset, license, and model attributes.

A running example. Fig. 6 illustrates the logical interrelation among the three data models, within an example scenario where Model-1 is initially trained using three datasets, and subsequently retrained with a fourth dataset to yield Model-2. The two models are linked through Model-2’s sourceModelId attribute and Model-1’s childModelList. A dataset and a model are linked through the model’s datasetList and the dataset’s modelList. A license and a dataset are linked through the license’s datasetList and the dataset’s licenseId.
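The running example can be sketched in Python as three record types with the cross-references described above. This is only an illustration, not the paper's DAML implementation: the identifier-style attribute names follow the text, while url, scope, ownerId, the helper functions, and all sample values are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dataset:
    datasetId: str
    url: str  # assumed field name for the source URL
    copyrightOwnerId: str
    licenseId: Optional[str] = None
    modelList: List[str] = field(default_factory=list)


@dataclass
class License:
    licenseId: str
    typeId: str
    scope: str  # assumed field name; a URL/URI prefix
    copyrightOwnerId: str
    datasetList: List[str] = field(default_factory=list)


@dataclass
class Model:
    modelId: str
    ownerId: str  # assumed field name for the model owner
    datasetList: List[str] = field(default_factory=list)
    sourceModelId: Optional[str] = None
    childModelList: List[str] = field(default_factory=list)


def link_training(model: Model, dataset: Dataset) -> None:
    """Record both directions of the model/dataset relationship."""
    model.datasetList.append(dataset.datasetId)
    dataset.modelList.append(model.modelId)


def link_license(lic: License, dataset: Dataset) -> None:
    """Record both directions of the license/dataset relationship."""
    lic.datasetList.append(dataset.datasetId)
    dataset.licenseId = lic.licenseId


def link_retrained(source: Model, child: Model) -> None:
    """Record the lineage between a source model and a retrained model."""
    child.sourceModelId = source.modelId
    source.childModelList.append(child.modelId)


# Hypothetical sample data: one license covering four datasets.
lic = License("lic-1", "type-A", "https://example.org/data/", "co-1")
datasets: Dict[str, Dataset] = {}
for i in range(1, 5):
    ds = Dataset(f"ds-{i}", f"https://example.org/data/{i}", "co-1")
    link_license(lic, ds)
    datasets[ds.datasetId] = ds

# Model-1 is trained on three datasets; Model-2 is retrained with a fourth.
model_1 = Model("model-1", "ao-1")
model_2 = Model("model-2", "ao-1")
for ds_id in ("ds-1", "ds-2", "ds-3"):
    link_training(model_1, datasets[ds_id])
link_training(model_2, datasets["ds-4"])
link_retrained(model_1, model_2)
```

Keeping each relationship in both directions is what allows the operations in the next subsection to start from a model, a dataset, or a license.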

B. Functional Operations

This section delineates the operations that can be performed by AOs and COs, along with their time complexity analysis. Table II lists the time complexity of operations and the entities authorized to perform them. We assume that the on-chain license registry, DMR, and MMR are implemented as hash maps on a smart contract, resulting in a time complexity of O(1) for searching them.

Fig. 6: Model, dataset, and license relationships.

Obtain dataset licenses. The getDatasetLicense operation retrieves the copyright license of a given dataset identifier datasetId. It initially searches the DMR hash map using datasetId as the key, which incurs a time complexity of O(1). Depending on whether the returned DMR record contains a licenseId, this operation either retrieves the license by licenseId or searches for a relevant license in the license registry as follows:

• Retrieve the license with licenseId: This operation queries the license registry hash map using the licenseId as the key. As the resulting time complexity is O(1), the overall time complexity remains the same.

• Search for a relevant license: This operation extracts the dataset’s copyrightOwnerId and performs a search on the license registry using copyrightOwnerId, which involves a complexity of O(1). This search may yield a list of licenses with the same copyrightOwnerId (albeit with different scopes). Finally, a scan is conducted on the list of licenses to filter out the licenses with irrelevant scopes. Our framework operates on the premise that license scopes do not intersect, ensuring each dataset corresponds to at most one license. As a CO is unlikely to have many licenses with the same AO for practical reasons, one can assume the size of this list to be small. Consequently, we can still assume the overall time complexity to be O(1).
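The two lookup paths can be sketched as follows, with plain Python dictionaries standing in for the on-chain hash maps. The record layout, the secondary index keyed by copyrightOwnerId, and the rule that a dataset falls within a scope when its URL starts with the scope prefix are all assumptions made for illustration; the text does not specify how scope matching is performed.

```python
from typing import Dict, List, Optional

# Hypothetical DMR: datasetId -> dataset record.
dmr: Dict[str, dict] = {
    "ds-1": {"url": "https://example.org/news/a.html",
             "copyrightOwnerId": "co-1", "licenseId": "lic-1"},
    "ds-2": {"url": "https://example.org/blog/b.html",
             "copyrightOwnerId": "co-1", "licenseId": None},
}

# Hypothetical license registry, keyed by licenseId.
licenses_by_id: Dict[str, dict] = {
    "lic-1": {"licenseId": "lic-1", "copyrightOwnerId": "co-1",
              "scope": "https://example.org/news/"},
    "lic-2": {"licenseId": "lic-2", "copyrightOwnerId": "co-1",
              "scope": "https://example.org/blog/"},
}

# Assumed secondary index: copyrightOwnerId -> licenseIds.
licenses_by_owner: Dict[str, List[str]] = {"co-1": ["lic-1", "lic-2"]}


def get_dataset_license(dataset_id: str) -> Optional[dict]:
    record = dmr.get(dataset_id)  # O(1) DMR lookup
    if record is None:
        return None

    # Path 1: the DMR record already references a license.
    license_id = record.get("licenseId")
    if license_id is not None:
        return licenses_by_id.get(license_id)  # O(1) registry lookup

    # Path 2: look up the owner's licenses, then scan the (short) list
    # for the one whose scope covers the dataset URL.
    for candidate_id in licenses_by_owner.get(record["copyrightOwnerId"], []):
        candidate = licenses_by_id[candidate_id]
        if record["url"].startswith(candidate["scope"]):  # assumed matching rule
            return candidate  # scopes do not intersect, so the first hit is the only one
    return None
```

For example, get_dataset_license("ds-2") takes the second path and returns the license whose scope is the blog prefix, because the DMR record for ds-2 carries no licenseId.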

TABLE II: Operations, complexities, and authorizations.

Obtain model licenses. Given a model identifier modelId, the getModelLicenses operation retrieves the licenses of its training datasets. This operation requires executing getDatasetLicense for each of the training datasets of the provided model and its upstream source models. To identify the contributing training datasets, the operation functions as a graph traversal algorithm on a graph with the given model as the root node, the upstream models as intermediate nodes, and the datasets as leaf nodes (as depicted in Fig. 6). The time complexity of a basic graph traversal algorithm is O(|V| + |E|), where V is the set of vertices and E is the set of edges. Therefore, given a graph with a set of M models and a set of D datasets, the time complexity of the graph traversal is O(|D| + |M|). As getDatasetLicense’s complexity is O(1), the graph traversal dominates the overall time complexity.
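A minimal sketch of this traversal is given below. It assumes each model has at most one upstream source model (a single sourceModelId), so the walk follows a chain from the given model to its oldest ancestor. The dictionaries and the precomputed dataset-to-license mapping, which stands in for getDatasetLicense, are hypothetical sample data.

```python
from typing import Dict, List, Optional

# Hypothetical MMR: modelId -> model record.
mmr: Dict[str, dict] = {
    "model-1": {"datasetList": ["ds-1", "ds-2", "ds-3"], "sourceModelId": None},
    "model-2": {"datasetList": ["ds-4"], "sourceModelId": "model-1"},
}

# Stand-in for getDatasetLicense: datasetId -> licenseId.
dataset_license: Dict[str, str] = {
    "ds-1": "lic-1", "ds-2": "lic-1", "ds-3": "lic-2", "ds-4": "lic-3",
}


def get_model_licenses(model_id: str) -> List[str]:
    licenses: List[str] = []
    seen_licenses = set()
    seen_models = set()
    current: Optional[str] = model_id

    # Visit the given model, then each upstream source model in turn.
    while current is not None and current not in seen_models:
        seen_models.add(current)
        model = mmr[current]
        for dataset_id in model["datasetList"]:
            license_id = dataset_license.get(dataset_id)  # O(1) per dataset
            if license_id is not None and license_id not in seen_licenses:
                seen_licenses.add(license_id)
                licenses.append(license_id)
        current = model["sourceModelId"]
    return licenses
```

Each model and each dataset is touched once, which matches the O(|D| + |M|) bound stated above.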

Check license validity. Given license data and environment variables as transaction inputs, the checkLicenseValidity operation verifies the validity of the license. Environment variables include the current date, the operating locations of AOs, and other variables that could potentially contravene the terms and conditions stipulated in the license agreement.

First, we need to locate the corresponding LVC smart contract for validating the license. This can be accomplished using another hash map where the license type typeId serves as the key and the LVC’s address as the value. Consequently, this lookup operation can be performed in constant time, i.e., O(1). Next, we need to invoke the identified LVC contract to determine the license validity. The time complexity of the LVC contract is directly proportional to the number of environment variables that need validation. We abstract this time complexity as O(|E|), where E is the set of environment variables to validate. Therefore, the overall time complexity is O(1) + O(|E|) = O(|E|).
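The two steps, dispatch by typeId and validation against environment variables, can be sketched as below. Python functions stand in for LVC smart contracts, and the license types, field names (validFrom, validUntil, allowedLocations), and environment keys (currentDate, location) are hypothetical examples rather than attributes defined by the framework.

```python
from datetime import date
from typing import Callable, Dict


def validate_time_limited(lic: dict, env: dict) -> bool:
    """Hypothetical LVC: valid only inside the license's date window."""
    return lic["validFrom"] <= env["currentDate"] <= lic["validUntil"]


def validate_region_limited(lic: dict, env: dict) -> bool:
    """Hypothetical LVC: date window plus a location restriction."""
    return (validate_time_limited(lic, env)
            and env["location"] in lic["allowedLocations"])


# typeId -> validator, standing in for the typeId -> LVC address hash map.
lvc_registry: Dict[str, Callable[[dict, dict], bool]] = {
    "time-limited": validate_time_limited,
    "region-limited": validate_region_limited,
}


def check_license_validity(lic: dict, env: dict) -> bool:
    validator = lvc_registry.get(lic["typeId"])  # O(1) lookup
    if validator is None:
        return False  # assumption: an unknown license type is treated as invalid
    return validator(lic, env)  # cost grows with the variables checked


license_data = {
    "typeId": "region-limited",
    "validFrom": date(2024, 1, 1),
    "validUntil": date(2024, 12, 31),
    "allowedLocations": {"AU", "NZ"},
}
```

With this sample license, an environment of {"currentDate": date(2024, 6, 1), "location": "AU"} passes the check, while the same date with location "US" fails it.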

Obtain licensed datasets. The getLicensedDatasets operation retrieves the list of dataset identifiers datasetIds each with a valid license. It entails executing getDatasetLicense and checkLicenseValidity operations for each dataset. Given D datasets, the overall time complexity is O(|D|) × (O(1) + O(|E|)) = O(|D||E|).

Obtain authorized datasets by license. Given a license identifier licenseId, the getDatasetsByLicense operation retrieves datasets covered by the license. This operation performs a search of the DMR using the licenseId, resulting in a time complexity of O(1).

Obtain authorized models by license. Given a license identifier licenseId, the getModelsByLicense operation retrieves the metadata of the models covered by the license. This operation entails executing getDatasetsByLicense, followed by conducting a graph traversal that goes from each dataset to the models trained on it (including child models indirectly trained on it). In the worst case, the overall time complexity is O(|D||M|).
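The traversal from a license to its dependent models can be sketched as follows. The three dictionaries are hypothetical stand-ins for the result of getDatasetsByLicense, the datasets' modelList attributes, and the models' childModelList attributes. The visited set is a choice made in this sketch so that a model reached from several datasets is expanded only once.

```python
from typing import Dict, List

# Stand-in for getDatasetsByLicense: licenseId -> datasetIds.
license_datasets: Dict[str, List[str]] = {
    "lic-1": ["ds-1", "ds-2"],
    "lic-3": ["ds-4"],
}

# datasetId -> models trained directly on it (the dataset's modelList).
dataset_models: Dict[str, List[str]] = {
    "ds-1": ["model-1"], "ds-2": ["model-1"], "ds-4": ["model-2"],
}

# modelId -> models retrained from it (the model's childModelList).
child_models: Dict[str, List[str]] = {"model-1": ["model-2"], "model-2": []}


def get_models_by_license(license_id: str) -> List[str]:
    found: List[str] = []
    seen = set()

    # Start from every model trained directly on a covered dataset.
    stack: List[str] = []
    for dataset_id in license_datasets.get(license_id, []):
        stack.extend(dataset_models.get(dataset_id, []))

    # Follow childModelList links to reach indirectly trained models.
    while stack:
        model_id = stack.pop()
        if model_id in seen:
            continue
        seen.add(model_id)
        found.append(model_id)
        stack.extend(child_models.get(model_id, []))
    return found
```

Here get_models_by_license("lic-1") reports both model-1, trained directly on the covered datasets, and model-2, which inherits the dependency through retraining.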

Obtain model datasets. Given a model identifier modelId, the getModelDatasets operation retrieves the identifiers of its training datasets. This operation extracts the datasetList attribute from the provided model and its upstream source models. If the provided model has a set of M upstream models, then the time complexity is O(|M|).

C. License Renewal

Checking license validity is an ongoing task because a valid license may become invalid under certain circumstances (e.g., when it is revoked or expires), requiring AOs and COs to take appropriate actions to ensure continuous compliance with copyright laws. The following describes how the framework facilitates license renewal checks and renewals.

1) License Renewal Check: The framework supports three types of license renewal checks (LRCs): license-driven, dataset-driven, and model-driven.

In license-driven LRC, AOs or COs conduct a periodic scan of the license registry, performing checkLicenseValidity on each license. If a license fails the validity check, an AO can execute the getModelsByLicense operation to gather the identifiers of datasets and models that depend on the invalid license. These can be added to a blacklist to prevent the use of those datasets and models in future training of new models or retraining. The specifics of how the blacklist is stored and managed fall beyond the scope of this paper.

Dataset-driven and model-driven LRC can be conducted on demand before training a new model. In dataset-driven LRC, an AO can execute the getDatasetLicense operation followed by the checkLicenseValidity operation for each training dataset to identify any dataset needing a license renewal. In contrast, in model-driven LRC, an AO can execute the getModelLicenses operation followed by the checkLicenseValidity operation for each license of the model, determining whether the model needs a license renewal.

2) License Renewal: A license renewal involves adding a new bilaterally signed license to the license registry, rather than updating existing records. This enables AOs and COs to access all historical licenses to prove regulatory compliance and avoid any disputes. After a new license has been added, an AO can execute the getModelsByLicense operation to gather the identifiers of datasets and models that depend on the renewed license. Then the DMR records of datasets are updated to reference the new license. As the list of dependent datasets and models can now be considered eligible for training, their identifiers are also removed from the blacklist.
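The renewal steps can be sketched as follows, under several assumptions: dictionaries and a set stand in for the license registry, DMR, and blacklist; the renewed license is identified by passing the identifier of the license it replaces; and the new license covers the same datasets as the old one. None of these details are fixed by the text, and blacklist management in particular is out of the paper's scope.

```python
from typing import Dict, List, Set

# Hypothetical on-chain state.
license_registry: Dict[str, dict] = {
    "lic-1": {"validUntil": "2024-06-30", "datasetList": ["ds-1", "ds-2"]},
}
dmr: Dict[str, dict] = {
    "ds-1": {"licenseId": "lic-1", "modelList": ["model-1"]},
    "ds-2": {"licenseId": "lic-1", "modelList": ["model-1"]},
}
child_models: Dict[str, List[str]] = {"model-1": ["model-2"], "model-2": []}
blacklist: Set[str] = {"ds-1", "ds-2", "model-1", "model-2"}


def dependent_models(dataset_ids: List[str]) -> Set[str]:
    """Models trained directly or indirectly on the given datasets."""
    seen: Set[str] = set()
    stack = [m for ds in dataset_ids for m in dmr[ds]["modelList"]]
    while stack:
        model_id = stack.pop()
        if model_id not in seen:
            seen.add(model_id)
            stack.extend(child_models.get(model_id, []))
    return seen


def renew_license(old_id: str, new_id: str, new_terms: dict) -> None:
    if new_id in license_registry:
        raise ValueError("licenseId already registered")

    # Add a new record; the old license stays in the registry as history.
    datasets = list(license_registry[old_id]["datasetList"])
    license_registry[new_id] = dict(new_terms, datasetList=datasets)

    # Point the dependent DMR records at the new license.
    for dataset_id in datasets:
        dmr[dataset_id]["licenseId"] = new_id

    # Dependent datasets and models become eligible for training again.
    blacklist.difference_update(datasets)
    blacklist.difference_update(dependent_models(datasets))


renew_license("lic-1", "lic-1b", {"validUntil": "2025-06-30"})
```

After the call, both lic-1 and lic-1b are present in the registry, the DMR records reference lic-1b, and the blacklist is empty.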

D. Operation Atomicity

Several stages (i.e., S.1, S.3, and License Renewal) involve updating multiple records. Apart from ensuring integrity and immutability, another advantage of maintaining the DMR, license registry, and MMR on-chain is that such multiple updates are guaranteed to be atomic. This is because smart contracts ensure atomicity: the actions included in one transaction either all take effect or none of them do. Therefore, care needs to be taken during implementation to ensure that all updates within a stage are included in the same transaction.
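The all-or-nothing behaviour can be illustrated off-chain with a small sketch that applies a batch of updates to a copy of the state and commits the copy only if every update succeeds. The Ledger class and the two update functions are hypothetical; on an actual ledger the platform's transaction semantics provide this guarantee rather than application code.

```python
import copy
from typing import Callable, Dict, List


class Ledger:
    """Toy stand-in for on-chain state holding the three registries."""

    def __init__(self) -> None:
        self.state: Dict[str, dict] = {
            "licenses": {"lic-1": {}},
            "dmr": {"ds-1": {"licenseId": "lic-1"}},
            "mmr": {},
        }

    def transact(self, updates: List[Callable[[dict], None]]) -> None:
        draft = copy.deepcopy(self.state)  # work on a private copy
        for update in updates:
            update(draft)  # any exception aborts before the commit
        self.state = draft  # commit: all updates take effect together


def add_license(state: dict) -> None:
    state["licenses"]["lic-2"] = {}


def relink_missing_dataset(state: dict) -> None:
    # Fails with KeyError because "ds-404" is not in the DMR.
    state["dmr"]["ds-404"]["licenseId"] = "lic-2"


ledger = Ledger()
try:
    ledger.transact([add_license, relink_missing_dataset])
except KeyError:
    pass  # the failed transaction leaves no partial update behind

license_ids_after_failure = sorted(ledger.state["licenses"])
```

Because the second update fails, the first one is discarded as well, so the registry still contains only lic-1. Had the two updates been submitted as separate transactions, the new license would have been recorded without any dataset referencing it.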

Authors:

(1) Yilin Sai, CSIRO Data61 and The University of New South Wales, Sydney, Australia;

(2) Qin Wang, CSIRO Data61 and The University of New South Wales, Sydney, Australia;

(3) Guangsheng Yu, CSIRO Data61;

(4) H.M.N. Dilum Bandara, CSIRO Data61 and The University of New South Wales, Sydney, Australia;

(5) Shiping Chen, CSIRO Data61 and The University of New South Wales, Sydney, Australia.


This paper is available on arXiv under the CC BY 4.0 DEED license.