AI is everywhere today. You can't get through a single day without seeing "AI" or otherwise experiencing the effects of it. Ever more of our everyday lives are being assaulted by AI. We're drowning in it.
And in order for AI systems like OpenAI's ChatGPT, Google's Gemini, Microsoft's Copilot, Amazon's Rufus*, and all the others to work, their training models require mind-numbingly vast quantities of data.
* Yes, really, Rufus. As in the movie Never Been Kissed: "No Jason, it's not stick of gum. It's rufus... RUFUS".
All that data for the training has to come from somewhere. And that's what this article is about.

Most existing surface web* content has already been scraped by web crawlers working for search engines and, more recently, AI. And because new content is constantly being created, the scraping never stops.
* Surface web means content that is readily accessible to anyone, or anything, browsing the web. That is, content that's not locked behind paywalls or firewalled inside corporate or private databases.
To the extent that a fair chunk of humanity's productive output over the last couple of centuries has been photographed, digitized, and put online, web crawlers have scraped that content for search engines and AI training models.
Web crawlers have also gobbled up copyrighted works by the millions: books, songs, scholarly articles and papers, news publications, and a whole host of others.
And those copyright holders aren't too pleased about that. Big Tech claims "fair use," but their voracious, wholesale copying of anything their scrapers can access sends a different signal. While they might not be making copyrighted content directly available verbatim, they are, in a fantastically significant way, negatively affecting the creators of those works. But that's a whole 'nother conversation beyond the scope of this article.
The problem that AI companies are running into now is a lack of fresh content for training their models.
Yes, new content is always appearing online, but that's a pretty slow creek compared to the oceans of data they've already collected over the last few years. It's not nearly enough content to satisfy the ever more complex training models.
One reckless solution some AI companies are drifting toward is what I call logophage -- literally, "word-eating." The term brings to mind the English idiom "eat your words," which means admitting error. Here, it describes an AI system that feeds its own outputs back into itself as training data, thus eating its own words, leading to lower quality, error-prone, less accurate output -- a/k/a "slop". How deliciously ironic.
Maybe you've heard the phrase garbage in, garbage out? Well, that's what this is.
Another, more recent approach to addressing the shortage of training data is the gig worker model. Using intermediary, consumer-facing apps like Kled AI or Neon, everyday people can "sell their experiences" to AI companies and make a tiny bit of money doing so.
What experiences, you might ask? Could be anything. AI companies advertise "tasks" on these apps, specifying what they want and how much they'll pay for it. Then people accept these tasks and perform what is required.
That might mean uploading a bunch of text messages or emails, recording your phone calls, or letting the app listen in via your phone's mic (and/or camera) while you go about a task, like visiting a retail store and interacting with staff about some product. Maybe going out on a date. Or to a protest. Or just walking around commenting on things. Whatever the AI firm that commissioned the task wants.
This new gig worker model not only grows the overall content pool, it also opens a frontier that previously did not exist: moving from purely static online data collection to bespoke IRL* data collection. It's the first time that AI companies have had widespread reach to normal people to make requests for specific, curated real-life experiences.
* IRL = In Real Life, a common initialism used in text messages and online discussion boards.
That's big. Hell, that's huge. Aldous Huxley might blush.
That is the future. It marks a major inflection point in how AI training data is sourced.
It's pretty concerning, too. People are already willingly allowing unprecedented surveillance into their daily lives, both personal and professional. Now imagine how using AI training apps can turbocharge corporate surveillance.
Ah, our old friends, the Troublesome Trio, partners in deception all, ToS, EULA, and PP have entered the chat.
OK, let me explain that bit of snark.
ToS (Terms of Service), EULA (End User License Agreement), and PP (Privacy Policy)
The ToS and EULA are more or less the same thing. The subtleties between them are minor and unimportant. They are both long, novella-length, agonizingly detailed legal documents that describe the conditions under which you may use a product or service, your rights (or lack thereof), and the rights of the provider. They are generally one-sided, strongly favoring the provider and hewing as close as possible to -- but not crossing -- the line that would cause a court of law to declare them "unconscionable" and therefore unenforceable.
The PP is similar, but it focuses more on how your privacy is preserved (or not) and how your data and interactions with the product or service may be shared with other parties. As privacy concerns have grown, many products and services today include a separate Privacy Policy rather than cramming all that into the ToS or EULA.
So when you begin using a product or service, you are often asked to separately affirm that you have read, understand, and agree to each of those documents.
Given how new these consumer-facing AI data collection apps are, you would be wise to actually read those agreements before using them so you know exactly what you are getting into. Don't just blow past those terms like you usually would. I mean, really, who reads that crap? Well, you should, at least this time.
But, in short, you are surrendering all rights, title, and interest in the content you created -- exclusively, worldwide, and in perpetuity -- allowing the AI company and its assignees to use, modify, distribute, commercialize, or otherwise exploit that content for any purpose they choose. What a bargain for the AI company.
Furthermore, fulfilling the tasks may involve other people (3rd parties), either directly or coincidentally, whose words or likeness could be captured as well. Task workers have no right to contractually bind those 3rd parties. So, depending on what the AI companies ultimately do with the training data collected by the app under the gig worker's authority, especially to the extent it involves 3rd parties, well, that could come back to bite that worker on the ass.
In other words, if the 3rd party sued the AI company for (insert reason here) then the AI company could, in turn, come after the second party (the task worker).
Is it really worth the few minor ducats you might receive to expose yourself to that?
I touched on the increased surveillance that these new training methods effectively bring along. Included in that lengthy ToS/EULA and PP is the legal cover for the AI firm, and any affiliates it may have, to freely access and use that collected data however they see fit. That could go well beyond simple AI training.
There are a lot of things this world doesn't need more of, and I'd put increased surveillance from all quarters pretty high on that list.
To be sure, the AI company is getting the better end of that deal by a country mile. You, on the other hand, are getting the equivalent of a Happy Meal at McDonald's.
You are worth more than that.