The Growing Presence of AI Part 2: Data Origination and Latest Advancements

By Isaac Goldowitz

Welcome to part two in our AI series. This series aims to concisely deliver the ins and outs of AI in a digestible fashion. Find part one in the series here.

In this article, we’ll delve into the origins of the data fed into AI, the presence of biases in AI, and the use of copyrighted material.

Where Does the Data to Train AI Originate?

To better understand the challenges and opportunities of artificial intelligence, it’s important to start with its foundation: the data itself. Where that data comes from, how it’s shaped, and the potential issues it carries—like bias or copyright concerns—determine how reliable and ethical AI can be.

Depending on the AI tool used, the source of the data can vary. When we look at OpenAI (the creators of ChatGPT), their data comes from three sources: publicly available information, information from third parties they partner with, and information from users.

Going one step further, FCH Observatory reports that data can be gathered in three ways: (1) internally, e.g., a user’s interactions with an AI are used as training data for that same AI; (2) from external data sources, e.g., third-party data vendors such as Sama (a firm that has provided human-labeled data for multiple AI companies); and (3) through web scraping, i.e., using code to extract public data from websites.
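The third method, web scraping, is simple in principle: fetch a page’s HTML and keep the human-readable text. Here’s a minimal sketch using only Python’s standard library; the hardcoded HTML is a stand-in for a page a real scraper would download over HTTP:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from <p> tags, the kind of
    content a training-data scraper typically keeps."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.chunks.append(data.strip())

# In a real scraper this HTML would come from an HTTP request
# (e.g., urllib.request.urlopen); hardcoded here for illustration.
html = "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(parser.chunks)  # ['First paragraph.', 'Second paragraph.']
```

Production scrapers add politeness (robots.txt checks, rate limits) and far more robust parsing, but the core idea of turning markup into text is this small.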

How Are Biases Present in AI?

OpenAI defines publicly available data as information “freely and openly accessible on the internet,” excluding data from paywalled sources and the dark web. OpenAI also mentions the use of filters to remove material like hate speech, adult content, spam, and sites with personal information. Even so, the information and datasets that remain can become a source of bias, as can external factors like third-party vendors.

Recall that one way AI is trained is through datasets. If the datasets used to train an AI include non-factual data and/or certain biases, the answers built on those datasets will reflect them. I thought it would be interesting to ask ChatGPT whether that is true. Paraphrasing the response, ChatGPT said: “AI models don’t have an independent notion of truth … If the dataset contains inaccuracies, misinformation, or biases, the model can pick those up and reproduce them.”

Again, ChatGPT is a base case we’re using as an example for the AI landscape. These biases may or may not exist—or exist in varying degrees—with other AI platforms. Rutgers reported on a study by Steve Rathje, a postdoctoral researcher at New York University, that found AI algorithms used in healthcare might be biased.

Consider an AI system that is coded to hire individuals based on 30 years of data from resumes of people hired at the firm. When the firm has a history of hiring individuals with certain characteristics, the AI will then favor individuals with those characteristics when filtering resumes. In this example, the data is technically “accurate,” but the AI’s resume selection based on it is still biased.
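A toy version of that resume scenario makes the mechanism concrete. Everything below (the data, keywords, and scoring rule) is invented; the point is that a model scoring candidates by whatever historically correlated with being hired will reproduce the firm’s historical skew:

```python
from collections import Counter

# Hypothetical historical data: (resume keywords, hired?) pairs.
# The firm's past skews toward one school, so even "accurate"
# data encodes a biased hiring pattern.
history = [
    ({"state_college", "python"}, True),
    ({"state_college", "java"}, True),
    ({"state_college", "python"}, True),
    ({"city_university", "python"}, False),
    ({"city_university", "java"}, False),
]

# "Train": count how often each keyword co-occurs with a hire.
hire_counts = Counter()
for keywords, hired in history:
    if hired:
        hire_counts.update(keywords)

def score(resume):
    """Score a resume by summing its hire-associated keyword counts."""
    return sum(hire_counts[k] for k in resume)

# Two candidates with the same skill; only the school differs.
print(score({"state_college", "python"}))    # 5
print(score({"city_university", "python"}))  # 2
```

Both candidates list the same skill; the score gap comes entirely from the school keyword, i.e., from the historical hiring pattern, not from anything the model was explicitly told to prefer.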

When examining external factors, it’s essential to remember that companies control these AIs and are subject to decisions by shareholders, executives, and boards of directors. TechCrunch highlights this by showing how Grok 4 will reference Elon Musk’s opinion on controversial subjects. The New York Times reported that July updates to Grok shifted its responses to the right on government and the economy. This can even be seen at the government level: Business Insider reported that the president issued executive orders requiring that government-used AIs be neutral and non-partisan, including a ban on so-called “woke” biases. And in January, TechCrunch reported that OpenAI removed language about political neutrality from its policy documents. Biases don’t have to originate within an AI itself; the people who control these systems can change them.

The Use of Copyrighted Material in AI

The interaction between copyrighted material and AI remains murky.

One method AI companies use to acquire the large datasets needed for training is “scraping.” Large, general internet scrapes inevitably include copyrighted material. This brings us to a central argument in the AI copyright debate, one that centers on the duplication of copyrighted material for training purposes: AI developers claim this is an example of “fair use,” whereas content creators say it violates their exclusive rights.

One example of this in action is Kadrey v. Meta Platforms, Inc., a class-action lawsuit in which creatives alleged copyright infringement by Meta. The Authors Guild writes that the Meta dataset “LibGen” contains more than 7.5 million books that AI companies have copied or partially copied for their AI systems. In the end, the court concluded that Meta’s use of these works was fair use.

Due to mounting legal pressure, the increasing unreliability of internet data, and the desire to avoid lawsuits like Kadrey, some AI companies are acquiring large datasets through licensing agreements with publishers and content creators.

Image distributors like Shutterstock have formed alliances with AI developers—including OpenAI, Meta, and Amazon—to provide legally sourced, labeled images for training. OpenAI established partnerships with news organizations such as the Associated Press, Axel Springer, and the Financial Times to license their content for model training. Though these partnerships don’t necessarily equate to less internet scraping, they do show an example of how copyrighted material can be accessed differently.

Currently, there are laws and acts proposed that would limit AI in its image-generating or copyright-utilizing capacities. These include the Ensuring Likeness Voice and Image Security (ELVIS) Act, passed March 2024, which protects performers from unauthorized AI-generated cloning of their voice or likeness, and the Generative AI Copyright Disclosure Act of 2024, which requires AI developers to notify the U.S. Copyright Office of the copyrighted works used in training their models.

Other notable legal battles include a class-action lawsuit against Anthropic led by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson. The authors alleged that Anthropic’s training of its Claude model included copyrighted books, some legitimately acquired and others pirated.

In Thomson Reuters v. ROSS Intelligence, ROSS scraped Westlaw headnotes to train its AI-powered legal search engine after losing a licensing negotiation. The court ruled this was not fair use, citing the non-transformative commercial nature of the use, direct market competition, and the absence of licensing efforts.

Lastly, the battle of the titans: Disney and Universal v. Midjourney. Disney and Universal allege Midjourney illegally trained its model on copyrighted images of their characters without permission. They argue this infringes copyright and negatively impacts their licensing business.

This all highlights a shifting legal landscape where courts, lawmakers, and companies are still figuring out the boundaries of fair use, licensing, and accountability in the age of AI.

What’s Next

I don’t yet know whether AI is intrinsically good or bad. I’ll admit that these past two articles have focused on where AI goes wrong, but there are places where AI gets it right. There are also parties working to strengthen ethics and regulation and to develop better systems within AI to combat the issues mentioned above. For example, a researcher at the DeGroote School of Business is developing strategies to mitigate biases in AI.

Regarding biases in medical data, a method developed at the Icahn School of Medicine at Mount Sinai helps identify and reduce biases in datasets, particularly in healthcare applications. Further, Amazon’s SageMaker Clarify and Google’s What-If Tool seek to address biases and unfair model behaviors.

On the open-source and academic side, work has included post-hoc debiasing of computer vision models, fairness metrics and visualization tools, guardrail systems for AI models reported to increase fairness by about 31 percent, and human-in-the-loop systems for auditing and mitigating bias in tabular data.
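Many of these fairness metrics reduce to simple comparisons across groups. Here’s a minimal, invented-data illustration of one common metric, the demographic parity difference: the gap in positive-decision rates between two groups. (This is a generic sketch, not the formula any particular tool above uses.)

```python
# Demographic parity difference: compare the rate of positive
# model decisions (e.g., loan approvals) across two groups.

def selection_rate(decisions):
    """Fraction of positive (1) decisions in a group."""
    return sum(decisions) / len(decisions)

# Hypothetical model outputs (1 = approve), split by a
# protected attribute such as age bracket or zip code.
group_a = [1, 1, 1, 0, 1]  # 80% approved
group_b = [1, 0, 0, 0, 1]  # 40% approved

parity_gap = abs(selection_rate(group_a) - selection_rate(group_b))
print(round(parity_gap, 2))  # 0.4; values near 0 indicate parity
```

An auditing tool would compute gaps like this across many slices of the data and flag the model when any gap exceeds a chosen threshold.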

The next article will compare some of the top AI models, the competitive landscape, AI valuation, and theories like the Model Collapse Theory and Dead Internet Theory.

I’ll also examine more of the positive impact AI is having. Given the field’s rapid advancement, deciding what to share with readers is challenging. As with all new technologies, there’s both fear of and misunderstanding about the unknown. As AI continues to advance, and as new rules and regulations curtail how it accesses and uses data, it may become a genuinely useful tool. See you next quarter for the next article.
