Selling Yourself to A.I.


It had to happen, you know. I've seen this coming for a while now, and here it is.

A.I. is everywhere today. You can't get through a single day without seeing "A.I." or otherwise experiencing its effects. Ever more of our everyday lives are being assaulted by A.I. We're drowning in it.

And in order for A.I. systems like OpenAI's ChatGPT, Google's Gemini, Microsoft's Copilot, Amazon's Rufus*, and all the others to work, their training models require mind-numbingly vast quantities of data.

* Yes, really, Rufus. As in from the movie Never Been Kissed: "No, Jason, it's not a stick of gum. It's Rufus... RUFUS."

All that data for the training has to come from somewhere. And that's what this article is about.


Collecting Training Data


Most existing surface web* content has already been scraped by web crawlers working for search engines and, more recently, A.I. And because new content is constantly being created, the scraping never stops.

* Surface web means content that is freely accessible to anyone, or anything, browsing the web. That is, content that's not locked behind paywalls or firewalled inside corporate or private databases.

Web crawlers have also gobbled up copyrighted works by the tens (hundreds? thousands?) of millions: books, songs, scholarly articles and papers, news publications, and a whole host of others.

And not for nothing, but those copyright holders aren't too pleased about that, either. Big Tech claims "fair use," but their voracious, wholesale copying of anything their scrapers can access sends a different signal. While they might not be making copyrighted content directly available, they are doing very real harm to the creators of those works. But that's a whole 'nother conversation beyond the scope of this article.

The problem that A.I. companies are running into now is a lack of fresh content for training their models.

Yes, new content is always appearing online, but that's a slow trickle compared to the oceans of data they've already collected over the last few years. It's nowhere near enough content to satisfy the ever more complex training models.

One reckless solution some A.I. companies are drifting toward is what I call logophagy -- literally, "word-eating." The term brings to mind the English idiom "eat your words," which means admitting error. Here, it describes an A.I. system that feeds its own outputs back into itself as training data, thus eating its own words, leading to error-prone, less accurate output -- a/k/a "slop." How deliciously ironic.

You've heard the phrase garbage in, garbage out. Well, that's what this is.
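The degradation loop can be sketched with a toy simulation. This is purely illustrative (the "model" is just a token-frequency table, and the token names are made up), but it shows the core mechanism: each generation trains on the previous generation's output, and because generators favor high-probability tokens, diversity collapses over time.

```python
import random
from collections import Counter

def train(corpus):
    """'Train' a toy model: just count token frequencies."""
    return Counter(corpus)

def generate(model, n, top_k=50):
    """Sample n tokens, but only from the model's top_k most
    frequent tokens -- mimicking how real generators favor
    high-probability output."""
    tokens, weights = zip(*model.most_common(top_k))
    return random.choices(tokens, weights=weights, k=n)

random.seed(0)
# Generation 0: a "human" corpus with 500 distinct tokens.
corpus = [f"word{i}" for i in range(500)] * 4
random.shuffle(corpus)

for gen in range(5):
    model = train(corpus)
    corpus = generate(model, n=2000)  # the model eats its own words

# After a few generations, far fewer than 500 distinct tokens survive.
print(len(set(corpus)))
```

Real model collapse is subtler than a frequency table, of course, but the shape of the failure is the same: rare, tail-end content disappears first, and each pass amplifies what was already common.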

A Fresh Twist


Another, more recent approach to addressing the shortage of training data is the gig worker model. Using intermediary, consumer-facing apps like Kled AI or Neon, everyday people can "sell their experiences" to A.I. companies and make a little bit of money doing so.

What experiences, you might ask? Could be anything. A.I. companies advertise "tasks" on these apps, specifying what they want and how much they'll pay for it. Then people accept these tasks and perform what is required.

That might mean uploading a bunch of text messages or email, recording your phone calls, or letting your phone's mic (and/or camera) listen in while you go about a task -- say, visiting a retail store and interacting with staff about some product. Maybe going out on a date. Or to a protest. Whatever the A.I. firm that commissioned the task wants.

This new gig worker model not only grows the overall content pool, it also opens a frontier that previously did not exist: moving from purely static online data collection to bespoke IRL* data collection. It's the first time A.I. companies have had widespread reach to ordinary people to request specific, curated real-life experiences.

That's big. Hell, that's huge. Aldous Huxley might blush.

* IRL = In Real Life, a common initialism used in text messages and online discussion boards.

That is the future. It marks a major inflection point in how A.I. training data is sourced.

It's pretty concerning, too. People are already willfully allowing unprecedented surveillance into their daily lives, both personal and professional. Now imagine how using A.I. training apps can turbocharge corporate surveillance.

TOS / EULA / PP

Ah, our old friends, the Troublesome Trio, partners in crime all, TOS, EULA, and PP have entered the chat.

OK, let me explain that bit of snark: TOS (Terms of Service), EULA (End User License Agreement), and PP (Privacy Policy).

The TOS and EULA are more or less the same thing. The subtleties between them are minor and unimportant. They are both long, novella-length, agonizingly detailed legal documents that describe the conditions under which you may use a product or service, your rights (or lack thereof), and the rights of the provider. They are generally one-sided, strongly favoring the provider, hewing as close as possible to, but not crossing, the line so as not to be declared "unconscionable" by a court of law.

The PP is similar, but it focuses more on how your privacy is preserved (or not) and how your data and interactions with the product or service may be shared with other parties. As privacy concerns have grown, many products and services today include a separate Privacy Policy rather than cramming all that into the TOS or EULA.

So when you begin using that product or service, you are often asked to separately affirm that you have read, understand, and agree to each of those documents.

Given how new these consumer-facing A.I. data collection apps are, you would be wise to actually read those agreements before using them so you know exactly what you are getting into. Don't just blow past those terms like you usually would. I mean, really, who reads that crap? Well, you should, at least this time.

You could very well be perpetually giving up rights to your voice and/or likeness for a one-time payment of a few dollars. What a bargain for the A.I. company.

Furthermore, fulfilling tasks may involve other people (third parties), either directly or incidentally, whose words or likeness could be captured as well. Gig workers performing these tasks have no right to contractually bind those third parties. So, depending on what the A.I. companies ultimately do with the training data collected by the app under the gig worker's authority, especially to the extent it involves third parties, well, that could come back to bite that gig worker on the ass.

Corpo Eyes

I touched on the increased surveillance that these new training methods effectively bring along. Included in that lengthy TOS/EULA and PP is the legal cover for the A.I. firm, and any and all affiliates it may have, to freely access and use that collected data however they see fit. That could go well beyond simply using it for A.I. training -- which is already sketchy enough all by itself.

There are a lot of things this world doesn't need more of, and I'd put increased surveillance, from all quarters, pretty high on that list.

Is it really worth the few minor ducats you'll receive to expose yourself to that?

To be sure, the A.I. company is getting the better end of that deal by a country mile. You are worth more than that.