Data Digital Economics

Large datasets are a new source of power. Should we look for ways to give data explicit value?

Heads up: this is a pretty old post; it may be outdated.

I've not read Who Owns the Future, but I have read Jaron Lanier's New York Times op-ed, "Fixing the Digital Economy," and listened to him speak on the subject. His thesis, summarized:

  1. Big data is collected from consumers either in exchange for free services or surreptitiously.
  2. With powerful computers and good software, this data can be aggregated and analyzed to make unintuitive predictions.
  3. If the analysis is accurate and/or timely, these predictions can be used to, effectively, tell the future. This could enable the data's owner to contend with traditional economic and sovereign powers.

Put succinctly: Modern A.I. uses powerful computing to make unintuitive predictions based on aggregation and quality analysis of data.

That's a good observation; Lanier then points out two consequences of this arrangement.

First, the power to effectively predict the future relies on data, good software, and powerful computers: prerequisites that put this ability out of the reach of the average individual and into the hands of a concentrated few.

Second, while algorithms can be written and computers bought, data must be gathered from many other people. This reliance on others, with little to no compensation for them, strikes Lanier as unfair.

Lanier's suggestion, therefore, is that we adopt a series of rules around data online and enact a micropayments scheme to ensure that everyone is appropriately paid for their data.

Lanier's thesis

Wishes and Dreams

I've not read Who Owns the Future, and a good deal of the above summary is my extrapolation based on what Lanier said elsewhere, but the idea of framing A.I. as a source of power that controls the digital economy is worth looking at.

Certainly, the case is convincing: the banking industry with its high-frequency trading, insurance companies maximizing profit while minimizing risk, and Google with its ads all manage to profit and to leverage forms of soft power. They do it with massive datasets.

However, data collection from users, in many cases, compensates those users with free services. While this might be morally dubious, there is compensation.

Data can also be public, as with the stock data that high-frequency traders use.

Surreptitious data collection is the real problem.

Lanier would likely agree that data should only be collected if the user agrees, but his proposed way to get there is … difficult. Referencing Ted Nelson and Project Xanadu, he suggests a micropayments scheme that hinges on fundamental changes to our digital infrastructure:

  1. Data must be canonical. Currently, URLs can change, and "copy and paste" is an everyday phrase. We need a scheme in which a piece of data is the sole version of its kind.
  2. Data must be immutable. Currently, data can be deleted at no cost. Sites disappear from the Internet all the time. You can easily delete an Excel doc on your computer.
  3. Use of data should leave a trace. Currently, sites accessing your purchase history from your credit card company do not notify you.

This vision of the future is very enticing on a technical level, but we're many stages away from being able to implement it.

It's long been a dream of "web 3.0" that the Internet could be a series of "machine-readable" documents. This has not materialized because the standards for how to make data machine-readable are notoriously difficult to pin down.

Machine readability needs to be accomplished before Xanadu, so that data can be verified and the "rules" enforced.
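To make the standards problem concrete, here's a small, hedged sketch: the same purchase described in two ad-hoc schemas. Every field name here is invented for illustration; the point is that neither record is wrong, yet software expecting one shape can't reliably read the other without bespoke mapping code.

```python
# Two hypothetical, ad-hoc ways of describing the same purchase.
record_a = {
    "type": "Purchase",
    "buyer": "alice@example.com",
    "item": {"sku": "12345", "name": "Toaster"},
    "price_usd": 29.99,
}

record_b = {
    "event": "transaction",
    "customer": {"email": "alice@example.com"},
    "product_id": "12345",
    "amount": {"currency": "USD", "value": "29.99"},
}

def same_purchase(a, b):
    # Works only for this one pair of schemas; every new schema pair
    # needs its own mapping, which is the standardization problem in a nutshell.
    return (
        a["buyer"] == b["customer"]["email"]
        and a["item"]["sku"] == b["product_id"]
    )

print(same_purchase(record_a, record_b))  # True, but only because we hand-wrote the mapping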

Standards Tactics

The web is a series of servers and software agreeing to operate around a set of principles. These protocols (or standards) are generally agreed on, but the specifics can vary quite a bit.

Standards like HTML, CSS, HTTP, TLS, SSL, et al. all work because developers opt to implement them. If a site deviates from the standard, it behaves oddly, and users are likely to notice. Unfortunately, there's no similar way to police standards on data integrity.

URLs are generally agreed on, but the specifics get wonky. Everyone agrees on .com, but what comes after the slash? IDs or descriptive URLs? Query parameters or path names? Dates or titles? Even developer principles like REST are highly contentious. If a computer had to guess what lived on a random page based solely on the URL … well, that's how Google makes the big bucks.
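As a small illustration (the addresses are invented), here are three plausible URLs that might all point at the same article. Nothing in the URL standard itself lets a machine conclude that they name the same resource, or which one is canonical.

```python
from urllib.parse import urlparse, parse_qs

# Three hypothetical URLs that could all serve the same article.
urls = [
    "https://example.com/articles/42",
    "https://example.com/2013/06/fixing-the-digital-economy",
    "https://example.com/post?id=42",
]

for url in urls:
    parts = urlparse(url)
    print(parts.path, parse_qs(parts.query))
# Three different paths and queries; whether they are the same resource is anyone's guess.
```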

URLs are just a part of the problem. Perhaps, if data were truly machine-readable, we would have a way of determining whether it was canonical without depending on a standardized URL scheme. Of course, machine readability is still in its early stages.

Data immutability could then be enforced by some sort of cryptographic system similar to Bitcoin's that focuses on ownership and history. Bitcoin works because each piece of currency is constructed out of a permanent record of all previous transactions. Such a system would allow data to be updated while remaining canonical and immutable.
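As a rough illustration of the idea (not Lanier's proposal, and far simpler than Bitcoin's actual design), here's a minimal sketch of an append-only revision chain: each revision carries the hash of the one before it, so the data can be updated but history can't be silently rewritten. The class and field names are my own assumptions.

```python
import hashlib
import json

def revision_hash(prev_hash, content):
    """Hash a revision together with the hash of the revision before it."""
    payload = json.dumps({"prev": prev_hash, "content": content}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class Record:
    """Append-only record: updates add revisions; nothing is ever deleted."""

    def __init__(self, content):
        self.revisions = [{"prev": None, "content": content,
                           "hash": revision_hash(None, content)}]

    def update(self, content):
        prev = self.revisions[-1]["hash"]
        self.revisions.append({"prev": prev, "content": content,
                               "hash": revision_hash(prev, content)})

    def verify(self):
        """Recompute every hash; any tampering with history breaks the chain."""
        prev = None
        for rev in self.revisions:
            if rev["prev"] != prev or rev["hash"] != revision_hash(prev, rev["content"]):
                return False
            prev = rev["hash"]
        return True

record = Record("Alice's purchase history, v1")
record.update("Alice's purchase history, v2")
print(record.verify())                        # True
record.revisions[0]["content"] = "rewritten"
print(record.verify())                        # False: the chain no longer checks out
```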

Ensuring data is always protected against deletion or downtime is a tough proposition. Determining who actually owns data is difficult.

If that could be solved, perhaps a system of owner reputation could be tied to the availability of all of their data. I've been impressed by the Yelp-like system of personal reputation presented in Daniel Suarez's sci-fi Daemon series, in which every person is rated on a dual scale: a five-point score plus the number of reviews. Of course, true identity is necessary, and that's another problem. Regardless, ownership is necessary to deliver notifications on access. So, add a fourth requirement to the list:

  1. Data must be canonical. Data should be accessible by only one URI, and should be verifiable as unique.
  2. Data must be immutable. It should not be possible to delete data. Changes should be recorded.
  3. Use of data should leave a trace. Access and modification should be recorded and the owner notified.
  4. Data must have an owner. A "responsible party" is attached to all data, and they're ultimately responsible for ensuring the rules are followed.
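To show how the four rules might hang together, here's a hedged sketch of what a single datum could look like under such a scheme: one canonical URI, a named owner, an immutable revision history, and an access log that feeds notifications. All names and fields are hypothetical, and the notification stub hand-waves the hardest part.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class OwnedDatum:
    canonical_uri: str                               # rule 1: one URI, one datum
    owner: str                                       # rule 4: a responsible party
    revisions: list = field(default_factory=list)    # rule 2: history, never deletion
    access_log: list = field(default_factory=list)   # rule 3: every use leaves a trace

    def update(self, content, actor):
        self.revisions.append({"content": content, "by": actor,
                               "at": datetime.now(timezone.utc).isoformat()})

    def read(self, actor):
        self.access_log.append({"by": actor,
                                "at": datetime.now(timezone.utc).isoformat()})
        self._notify(actor)
        return self.revisions[-1]["content"]

    def _notify(self, actor):
        # Stand-in for a real notification channel, which, as the next
        # paragraph notes, is its own hard problem.
        print(f"notify {self.owner}: {actor} read {self.canonical_uri}")

datum = OwnedDatum("data://alice/purchase-history", owner="alice")
datum.update("purchase history v1", actor="alice")
print(datum.read(actor="acme-ads"))
```

Even in this toy form, the `_notify` stub is doing all the hard work.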

Even notification is a hard problem. It's not practical to send an email every time someone looks up your email address. A log file would need to live somewhere, and it isn't particularly user-friendly. A new system would have to be built.

This might all be possible in a highly centralized system, the very thing we'd like to avoid. Developing a set of protocols to describe all of this would face a massive adoption problem. I don't see a path toward getting there.