Has AI changed the value of our personal data?

From SQL to Excel, technologies that help us process and analyze information have driven wave after wave of innovation in the digital economy. Yet none of them fundamentally changed the nature of the data they sought to organize. With the latest AI breakthroughs, things might finally be different.

In the years leading up to the latest AI boom, the idea that the value of user data was exaggerated had been gaining steam. Tech and venture capital leaders routinely downplayed the notion that there was any real danger in the mining or selling of user data, if it was even happening at all. Much of this was pushback against fears about online privacy that grew with the rise of Web 2.0, fears that perhaps reached their zenith with Netflix’s The Social Dilemma.

A 2021 opinion piece by Tim O’Reilly references a quote widespread in tech circles: “data is the new oil”. Clive Humby, who coined the phrase, meant that data, like oil, must be refined before it becomes useful, but many took it to mean that data inherently holds tremendous value, and that individuals should somehow be tapping into it. O’Reilly reasons that data’s value is better represented by sand: ubiquitous, and only truly valuable after industrial-scale processing. As evidence, he cites estimates suggesting an individual user is worth only about $5–10 to Google or Facebook. Concerns about data ownership, he argues, should not focus on preserving privacy for its own sake, but on the misuse of data that sharing enables.

Indeed, many people are content to forfeit their privacy in exchange for free internet services, but that doesn’t settle the question of ‘ownership’. As Benedict Evans points out in his piece “There’s no such thing as data”, the kind of data we are talking about matters too. You can’t use restaurant order data to design a new missile guidance system, for example. Evans is, of course, right: not all data is the same. But that doesn’t mean data has no value, whether in isolation or in aggregate. Of data-processing technologies such as SQL and Excel, Evans argues: “These technologies are not national strategic assets — anyone can have them, but what for?”

Crucially, AI differs from previous technologies in how it interacts with data.

You can’t have powerful AI without vast volumes of data. The Large Language Models (LLMs) that power modern AI applications feed on user data to “train” and become more effective. Because the algorithms behind AI are not fully pre-programmed, but instead learn by recognizing patterns in large datasets, the underlying data itself becomes more valuable. In part because of this, countries are right to treat data as a strategic asset.
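To make that dependence concrete, here is a deliberately tiny sketch of my own (a character-level bigram model, nothing like a production LLM, and not drawn from any of the sources cited here). The “model” is nothing more than patterns extracted from its training text: starve it of data and it produces nothing; feed it more and its output improves.

```python
# A toy illustration of "learning from data": a character-level bigram model.
# The model is just a table of which characters follow which in the corpus,
# so its entire capability comes from the data it was trained on.
from collections import defaultdict
import random

def train(corpus: str) -> dict:
    """Record, for each character, every character that follows it."""
    transitions = defaultdict(list)
    for current, nxt in zip(corpus, corpus[1:]):
        transitions[current].append(nxt)
    return transitions

def generate(transitions: dict, seed: str, length: int = 40) -> str:
    """Continue a single-character seed by sampling observed followers."""
    out = [seed]
    for _ in range(length):
        followers = transitions.get(out[-1])
        if not followers:  # no data for this character: the model is stuck
            break
        out.append(random.choice(followers))
    return "".join(out)

corpus = "the value of data depends on the data itself"
model = train(corpus)
print(generate(model, "t"))
```

Real LLMs replace the lookup table with billions of learned parameters, but the principle stands: the behavior is distilled from the data, which is why the data itself carries so much of the value.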

There has been a narrative shift at the national level too: governments now frame their regulatory decisions as a balance between too little restraint, which lets AI companies gobble up every piece of content online, and too much regulation, which hamstrings those companies to the point where they are no longer competitive. Is this shift a result of the burgeoning “AI revolution”, or is it driven by the growing stakes governments hold in tech regulation, foreign competition, and cybersecurity?

Some in the media have been quick to link the recent lawsuits over the use of copyrighted material in generative AI to LLMs’ broader need for vast amounts of data. The two, however, need not conflict. Journalist Sheera Frenkel, for example, has highlighted that both the US and China treat AI as a strategic asset, and that every regulatory measure potentially hampers their competitiveness:

“The US government sees itself in an arms race, at the moment, against China when it comes to AI. Both China and the United States have a lot of scientists that are invested in this. They have a lot of interest in being the world leaders in artificial intelligence. And so they know that every bit of regulation they put in place potentially holds back those US companies, as opposed to China, where there’s very little regulation on data and where there’s a ton of data online that the Chinese government can easily access and even give to Chinese AI companies if they want to speed ahead in what’s considered the AI arms race between the US and China.”

In reality, these are interconnected yet distinct issues. We shouldn’t assume a black-and-white world in which nations can’t devise meaningful consumer- and creator-first solutions for data ownership while sustaining competitive AI sectors. Copyrighted material likely represents a small fraction, perhaps less than 1%, of all the data used to train LLMs. As for why the tradeoff between data access and international competitiveness has become more pronounced: the answer is likely a mix of both.

Whether the right metaphor for data is oil, sand, or something else entirely, you would be hard-pressed today to argue that it does not matter at all. More voices, from consumers and scientists to artists and individual creators, are asserting its importance. It’s time we explored innovative solutions that empower people to control their data.

Indeed, commentators are right that tallying data’s worth user by user misses the point. But it is precisely because the value of data changes with its context that we should not let existing profit models set its price in a new paradigm. $10 worth of data in an ad-funded model may seem like a drop in the bucket, yet companies, governments, and consumers are all willing to go to battle over it. Clearly, something about today is different. In a world where computer-generated content is cheaper than ever, isn’t human-generated data everything?
