
Protecting AI model and data privacy without sacrificing performance: a better tradeoff

CEA-List is developing frameworks capable of protecting the privacy of the personal data used to train artificial intelligence models without sacrificing model accuracy or performance.

Federated learning raises critical data privacy questions

Data privacy is a major concern when it comes to designing AI systems. This is because the statistical models used in AI are trained on large, previously labeled datasets, some of which may contain proprietary and/or sensitive data.

Federated learning for collaborative AI model training

With federated learning, multiple stakeholders, both individuals and organizations, can contribute their datasets to the training of a shared AI model. Learning is decentralized: updates to the shared model are computed locally, aggregated on a central server, and then redistributed to all contributors. This approach ensures that each contributor’s data remains private.
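To make the mechanics concrete, here is a minimal sketch of one federated round in Python, assuming NumPy and purely illustrative names (local_update, aggregate) that stand in for a real training step; it is not CEA-List’s code.

import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """Hypothetical local step: each contributor trains on its own data
    and returns only updated weights, never the raw data."""
    gradient = np.mean(local_data, axis=0) - global_weights  # toy 'gradient'
    return global_weights + lr * gradient

def aggregate(updates):
    """Central server averages the contributors' updates."""
    return np.mean(updates, axis=0)

# Example round with three contributors, each holding a private dataset.
global_weights = np.zeros(4)
private_datasets = [np.random.randn(20, 4) + i for i in range(3)]
updates = [local_update(global_weights, d) for d in private_datasets]
global_weights = aggregate(updates)  # then redistributed to all contributors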

However, end users are eventually given access to the AI model trained on this data, which raises security issues. A recently discovered AI security risk, the model inversion attack, could leave training data exposed. A data holder, or a bad actor with access to the central server, could potentially work backward from the model’s outputs to reconstruct the input data used for training.
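As an illustration of the gradient-based variant of this attack, the following sketch (using PyTorch, with illustrative names and, for simplicity, assuming the attacker knows the true label) optimizes a random dummy input until it reproduces the gradient shared by a victim. This is the idea behind gradient-inversion attacks such as GradInversion [1]; it is a toy example, not CEA-List’s code.

import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 2)          # deliberately tiny model
loss_fn = torch.nn.CrossEntropyLoss()

# The victim's private sample and the gradient it would share during training.
x_true = torch.randn(1, 8)
y_true = torch.tensor([1])
true_grads = torch.autograd.grad(loss_fn(model(x_true), y_true), model.parameters())

# The attacker starts from noise and searches for an input whose gradient matches.
x_fake = torch.randn(1, 8, requires_grad=True)
optimizer = torch.optim.LBFGS([x_fake])

def closure():
    optimizer.zero_grad()
    fake_grads = torch.autograd.grad(
        loss_fn(model(x_fake), y_true), model.parameters(), create_graph=True)
    mismatch = sum(((fg - tg) ** 2).sum() for fg, tg in zip(fake_grads, true_grads))
    mismatch.backward()
    return mismatch

for _ in range(30):
    optimizer.step(closure)

print("reconstruction error:", (x_fake.detach() - x_true).norm().item())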

Differential privacy for enhanced security

Differential privacy is a technique that prevents bad actors from determining whether specific data (like data that can identify a person) has been used to train the model.
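The standard formal statement of this guarantee is ε-differential privacy: a randomized training mechanism M satisfies it if, for any two datasets D and D′ that differ in a single record, and for any set S of possible outputs,

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

In other words, adding or removing any one person’s data changes the probability of any observable outcome by at most a factor of e^ε, so an observer cannot reliably tell whether that data was used.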

One of the most widely used differential privacy methods is to add randomly generated noise, either at the aggregation server before the model is sent out, or on the end user’s side. The downside is reduced model performance and, in particular, less accurate predictions. CEA-List’s research has focused on achieving a better tradeoff between data privacy and AI model accuracy.
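A minimal sketch of this noise-based approach, assuming a Gaussian mechanism applied server-side to the aggregated update; the clipping bound and noise level below are illustrative values, not CEA-List’s.

import numpy as np

def dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.1, rng=np.random.default_rng(0)):
    clipped = []
    for u in updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip_norm / (norm + 1e-12)))  # bound each contribution
    mean = np.mean(clipped, axis=0)
    # Calibrated Gaussian noise hides any single contributor's influence,
    # at the cost of a noisier (less accurate) global model.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(updates), size=mean.shape)
    return mean + noise

updates = [np.random.randn(4) for _ in range(10)]
noisy_update = dp_aggregate(updates)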

Images reconstructed using an inversion attack:

Left: the image reconstructed from the model (left of the pair), compared with the original image from the training dataset (right of the pair), which is unknown to the attacker.

Right: a second type of attack targets the gradients exchanged at each learning iteration. From these gradients alone, the attacker can reconstruct the input data very faithfully [1], again with no direct knowledge of the original data.

New SPEED framework eliminates need for a trusted agent

CEA-List’s new federated learning framework, SPEED, improves the privacy-accuracy tradeoff. The research behind this advance was published in the journal Machine Learning [2]. In this framework, neither the end users nor the central server are considered trusted agents. The framework is based on a voting system: each end user trains a simplified version of the model on their own private data; the server then sends out an unlabeled data instance, and each user’s simplified model returns an estimated label for that instance, called a vote.
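The voting step can be sketched as follows (illustrative names only, not the SPEED API): each user’s simplified local model labels the unlabeled instance sent by the server, and those labels are the votes.

import numpy as np
from collections import Counter

class LocalModel:
    """Stand-in for a simplified model trained on one user's private data."""
    def __init__(self, seed):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(size=(4, 3))  # 4 features, 3 classes

    def vote(self, x):
        return int(np.argmax(x @ self.weights))  # predicted label = this user's vote

users = [LocalModel(seed=s) for s in range(5)]
unlabeled_instance = np.random.default_rng(42).normal(size=4)  # sent by the server

votes = [u.vote(unlabeled_instance) for u in users]
label, _ = Counter(votes).most_common(1)[0]  # aggregated label for the instance
print(votes, "->", label)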

SPEED uses homomorphic encryption to process the votes anonymously. End users encrypt their votes, and the server aggregates them without ever decrypting them, so it cannot access any individual vote. Applied to test models, the framework performed on par with the state of the art. For example, SPEED protected the training data without sacrificing model accuracy even though the server is untrusted, something that PATE [3], another differential privacy framework, cannot offer.
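A simplified sketch of the encrypted aggregation, using the python-paillier (phe) library as an additively homomorphic stand-in for the encryption scheme actually used in SPEED; key handling here is deliberately naive, since in practice the server never holds the decryption key.

from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

NUM_CLASSES = 3
votes = [2, 2, 0, 2, 1]  # each end user's predicted label

# Each user sends an encrypted one-hot ballot, so individual votes stay hidden.
encrypted_ballots = [
    [public_key.encrypt(1 if c == v else 0) for c in range(NUM_CLASSES)]
    for v in votes
]

# The server adds ciphertexts without ever decrypting an individual ballot.
encrypted_tally = encrypted_ballots[0]
for ballot in encrypted_ballots[1:]:
    encrypted_tally = [t + b for t, b in zip(encrypted_tally, ballot)]

# Only the aggregated count is ever decrypted (here, by whoever holds the key).
tally = [private_key.decrypt(c) for c in encrypted_tally]
print(tally)  # [1, 1, 3] -> label 2 wins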

Approximate algorithms: Privacy at a reasonable computational cost

Homomorphic encryption relies on approximate algorithms. As their name suggests, these algorithms produce approximate, rather than exact, solutions. Continuous data must first be quantized, that is, mapped onto a discrete set of values, a process that introduces “natural” noise into the data, much like a rounding error. CEA-List researchers recently established a theoretical proof that this intrinsic noise can itself deliver differential privacy guarantees.
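A toy illustration of this quantization noise: mapping continuous values onto a discrete grid introduces a bounded, rounding-like error. The step size below is arbitrary.

import numpy as np

def quantize(x, step=0.05):
    return np.round(x / step) * step  # map each value onto a discrete grid

rng = np.random.default_rng(0)
values = rng.normal(size=10_000)
error = quantize(values) - values

print("max |error|:", np.abs(error).max())  # bounded by step / 2
print("error std:  ", error.std())          # behaves like small random noise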

SHIELD, the framework that implements this approach, is deliberately approximate, which reduces the algorithm’s computational burden while still ensuring differential privacy [4].

Potential use on sensitive data

SPEED and SHIELD could find their niches in healthcare, biometrics, and any other market where personal, sensitive, or proprietary data needs to be kept private. The potential is vast.

The solutions developed at CEA-List are already being used to strengthen the privacy of facial recognition systems, which typically require large training datasets that are often contributed by several organizations.

“We are now looking to expand into an even broader range of use cases, like cybersecurity, through EU-backed projects. When used together, the techniques we have developed could enable the secure sharing of data signatures, for example. This kind of solution could be extremely useful to public-sector organizations like government agencies that deal with threats daily. But the private sector is also showing increasing interest. We are talking to more and more businesses about how they can seize opportunities to extract value from their sensitive data without compromising on data privacy.”

Cédric Gouy-Pailler

CEA-List

References

1. Yin, H., Mallya, A., Vahdat, A., et al. (2021). See through Gradients: Image Batch Recovery via GradInversion. arXiv:2104.07586.

2. Grivet Sébert, A., Pinot, R., Zuber, M., et al. (2021). SPEED: Secure, Private, and Efficient Deep Learning. Machine Learning, 110, 675–694.

3. Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., & Talwar, K. (2017). Semi-Supervised Knowledge Transfer for Deep Learning from Private Training Data. 5th International Conference on Learning Representations (ICLR).

4. Grivet Sébert, A., Zuber, M., Stan, O., et al. (2023). When Approximate Design for Fast Homomorphic Computation Provides Differential Privacy Guarantees. arXiv:2304.02959.
