What's the AI Privacy Dilemma?
How to make AI work while still preserving data privacy.
Imagine you’re watching a trolley running down the tracks, endangering five people. You are close to the switch that would divert the trolley to another track where one other person is standing. Would you sacrifice one person to save the lives of five?
Welcome to the trolley dilemma, one of the best-known examples of being stuck between difficult choices. One way or the other, dilemmas have been a major discussion point in ethical and philosophical debates – and not only in modern times. But what do dilemma discussions have to do with AI, you might ask? Well, everything.
The AI Dilemma
Artificial Intelligence presents us not only with great opportunities and solutions to everyday problems but also with major human challenges. AI might further put us into complex situations where we need to make difficult choices – or as Sune Hannibal Holm, associate professor of philosophy at the University of Copenhagen, put it in his TEDx talk in 2020:
“On the one hand, we can use these systems to prevent serious harms to people and society. On the other hand, using these systems seems to threaten important values such as privacy and autonomy.”
The AI Privacy Dilemma
Let’s take a closer look at one of the issues Holm identified: privacy. According to the UN’s Universal Declaration of Human Rights, privacy is a basic human right.
It is also the focal point of what the AI community has defined as the AI Privacy Dilemma: to create accurate and precise AI models, you need massive amounts of user data for training. But generating, assembling, and using data at this scale makes direct violations of privacy possible – especially because the market for consumer data has become incredibly lucrative for many different players.
Not only is AI training hungry for data – we also create much more data than ever, and it has become much easier to collect and process. Just think of the ever-advancing numbers in the now-famous infographic “What happens in an internet minute?”. In its first edition in 2017, the two marketing specialists Lori Lewis and Chadd Callahan counted – among other things – 156 million e-mails sent, 4.1 million videos watched on YouTube, 3.5 million Google search queries, and 900,000 Tinder swipes. Fast forward just four years and these numbers have risen to 190 million e-mails sent, 4.7 million YouTube videos viewed, 4.1 million Google search queries, and 1.6 million Tinder swipes – and all of this happens in just one minute on the internet.
In case you prefer official statistics, make sure to check out the EU data strategy. According to this paper, the EU projects the global data volume to grow from 33 zettabytes in 2018 to 175 zettabytes in 2025 – an increase to roughly 530 % of the 2018 level. Stored on 512 GB tablets, the 2018 volume would stack all the way to the moon, and the 2025 volume would stretch there and five times back.
A large part of this data is probably collected, sold, and used for purposes you can’t control.
And even if your data is not sold by the company that collects it, don’t forget that data is stolen every single day – as in the case of Yahoo a couple of years back, when information from all 3 billion of its user accounts was leaked, potentially the biggest known data breach in history so far.
As technological devices, digital tools, and services have become integral parts of our everyday lives, we can no longer opt out of using AI – nor would we want to give up all the extra convenience, health, and security that AI applications have brought us. Just imagine no longer being able to search the web accurately or check an online map for the fastest route to your destination. AI is now used, for example, to edit genes and to perform eye surgery, but also to create fortune cookies or to compose music you can enjoy while drinking a beer brewed with the help of AI.
Is there a way out of the AI Privacy Dilemma?
A well-trained AI built on potentially privacy-invading mass data collection, or an inaccurate AI built on no data at all – in many consumer use cases of AI, these seem to be the only options. Many attempts have been made to solve this problem; two of the most prominent rely on data anonymisation and homomorphic encryption.
In the case of data anonymisation, the most popular method is so-called differential privacy. Here the original data is perturbed with carefully calibrated statistical noise so that it can still be processed centrally. Even though this method has comparatively low complexity, high maturity, and a wide range of applications, it is still more of a bypass technology than a solution in itself. The main reason is that personal data may be re-identified – for example by linking different databases. Last but not least, keep in mind that even though anonymisation technology develops at a rapid pace, de-anonymisation techniques are improving just as fast. What counts as “privacy-preserving” today might not be secure tomorrow. This is therefore one of the methods with a lower privacy rating.
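To make the core mechanism concrete, here is a minimal, purely illustrative sketch (not taken from any production library): a counting query is answered with Laplace noise scaled to the query's sensitivity, so that the presence or absence of any single person barely changes the published result. The function name and the epsilon value are assumptions chosen for the example.

```python
import random

def dp_count(values, predicate, epsilon=0.5):
    """Answer a counting query with epsilon-differential privacy."""
    # A count has sensitivity 1: adding or removing one person
    # changes the true answer by at most 1, so Laplace(1/epsilon)
    # noise suffices.
    true_count = sum(1 for v in values if predicate(v))
    # The difference of two iid Exponential(epsilon) draws is
    # Laplace-distributed with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

ages = [23, 35, 41, 29, 52, 38, 61, 27]
print(dp_count(ages, lambda a: a >= 40))  # randomised, but centred on 3
```

Note the trade-off the noise encodes: a smaller epsilon means stronger privacy but a noisier, less accurate answer – the dilemma in miniature.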
Another method that raises hopes is homomorphic encryption. Data is encrypted at the device level and then processed centrally in encrypted form. But even though this is a promising approach, the considerable computing resources needed for the calculations still hinder widespread adoption.
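To illustrate the principle, here is a toy version of the Paillier cryptosystem, a classic additively homomorphic scheme: two values are encrypted, the ciphertexts are combined, and decryption reveals their sum – the processor never sees the plaintexts. This is a sketch for intuition only; the tiny primes are purely for demonstration, and real deployments use keys of 2048 bits or more.

```python
import math
import random

def paillier_keygen(p, q):
    # Toy key sizes for illustration only.
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)          # valid because we fix g = n + 1
    return (n, n + 1), (lam, mu)  # public (n, g), private (lam, mu)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    while True:                   # pick a random r coprime to n
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    n2 = n * n
    ell = (pow(c, lam, n2) - 1) // n   # the L(x) = (x-1)/n step
    return (ell * mu) % n

pub, priv = paillier_keygen(293, 433)
c1, c2 = encrypt(pub, 12), encrypt(pub, 30)
c_sum = (c1 * c2) % (pub[0] ** 2)      # addition under encryption
assert decrypt(pub, priv, c_sum) == 42
```

Multiplying ciphertexts corresponds to adding plaintexts, which is exactly what an aggregator needs in order to compute sums or averages over data it is never allowed to read. The heavy modular exponentiations also hint at why computing cost remains the bottleneck.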
What these methods have in common is that they collect data, transform it in one way or another, and process it centrally. But what if we turned this approach around and didn’t bring the data to the algorithm, but rather the algorithm to the data? Edge computing, and especially federated learning, do exactly this.
Don’t bring the data to the algorithm, bring the algorithm to the data
With federated learning, all data stays on the end devices, and each device trains its own local AI model with its own data. These models are then encrypted and sent to an aggregator, which combines the encrypted models and can remove the encryption only from the aggregated model.
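This scheme is essentially federated averaging (FedAvg). The following simplified simulation sketches it in NumPy, with the encryption and secure-aggregation step omitted for brevity: each simulated device fits a shared linear model to its private data, and the aggregator only ever sees model parameters weighted by sample count, never raw samples. All names and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_train(w, X, y, lr=0.1, epochs=20):
    # Each device refines the shared model on its own data;
    # the raw data never leaves the device.
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    # FedAvg: average the local updates, weighted by sample count.
    total = sum(len(y) for _, y in clients)
    return sum(len(y) / total * local_train(w_global, X, y)
               for X, y in clients)

# Three simulated devices, each holding private samples of y = 3x.
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 1))
    clients.append((X, X @ np.array([3.0])))

w = np.zeros(1)
for _ in range(10):
    w = federated_round(w, clients)
# w converges towards the true coefficient 3.0
```

In a real deployment each local update would additionally be encrypted (or masked) before aggregation, as described above, so that not even the aggregator can inspect an individual device’s model.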
Besides protecting user privacy, this method has the advantage that you no longer have to ship large amounts of raw data, and that it works asynchronously. It can also protect much better against attacks and manipulation than conventional anonymisation methods: for attackers to succeed, they would have to corrupt and control several decentralised devices at the same time.
Another huge benefit of a decentralised approach is that the tide is already turning away from centralised systems: whereas in 2018, 80 % of data was processed centrally, the aforementioned EU data strategy projects that by 2025, 80 % of data will be processed at the device level.
Technological development not only brings us new, difficult choices but sometimes also new methods to face these challenges – if we’re willing to invest in them and use them for the greater good. In our understanding of AI dilemmas, we shouldn’t get stuck in a false dichotomy, seeing only an either-or choice – and thus starve to death like Buridan’s ass.
Instead, we should try to find a third, new way out of the AI Privacy Dilemma by challenging the status quo – so that we can enable great, convenient AI technology while protecting and preserving privacy at the same time.