Digiday: Dozens of new third-party scrapers emerging as the new middlemen of content

A $1 billion market has formed — yet content creators receive zero

Inside the four-step pipeline: scrape → process → sell → reuse

"Not a tax — a hostile takeover"… "Napster is back, but iTunes is nowhere in sight"s

A new kind of middleman has emerged in the AI era. The familiar "ad tech tax" of the digital advertising age took a slice of revenue. The new "AI data broker" takes the entire pie — 100% of the content, with nothing paid in return. Worse: in some cases the same content is recycled into competing products that push the original publishers out of the market entirely. How exactly does this mechanism work? Drawing on Digiday's May 4 report on the alarm spreading through the U.S. publishing industry, this explainer walks through the pipeline step by step.

◆ What Is an "AI Data Broker"? — The Industry That Feeds AI

The term "data broker" is not new. Traditionally it referred to firms that collected personal and consumer data and sold it to marketers and advertisers. But the "AI data brokers" Digiday's reporting describes are a different breed. These firms automatically scrape news articles, blog posts, images and videos from across the open web, process them into structured training datasets, and sell them — or expose them via APIs — to AI companies such as OpenAI, Anthropic and Google.

Why does such an intermediary exist? Training a large language model (LLM) like ChatGPT requires billions to trillions of words of text. AI agents that browse, book and buy on behalf of users must continuously read fresh web content to function. Doing all of this collection in-house is technically and legally costly for AI firms. Outsourcing it to a specialist — that is the role AI data brokers fill.

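Related: How Do AI "Data Brokers" Take 100% of the Content? According to a report by U.S. outlet Digiday, 21 to 40 AI data brokers (scrapers) have formed a $1 billion "scraper economy" that collects, processes and sells publisher content without authorization. Content creators receive nothing in return, and industry critics call it "not a tax but an IP-based hostile takeover." (K-EnterTech Hub · Jung Han)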

An anonymous publishing executive cited by Digiday compared them to the demand-side platforms (DSPs) of the ad tech world. If a DSP automates ad-inventory buying on behalf of advertisers, an AI data broker automates content acquisition on behalf of AI companies. The executive counted 30 to 50 such startup DSPs in the content space — "all of them effectively charging a 100% fee."
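
The "100% fee" comparison can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the 30% ad tech take rate and the $100 of gross value are hypothetical placeholders, not figures from Digiday's report. Only the 100% fee comes from the executive's quote.

```python
def publisher_share_cents(gross_cents: int, fee_percent: int) -> int:
    """Value (in cents) left for the publisher after an intermediary's fee."""
    if not 0 <= fee_percent <= 100:
        raise ValueError("fee_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding noise.
    return gross_cents * (100 - fee_percent) // 100


# HYPOTHETICAL inputs, chosen only to show the shape of the comparison.
GROSS_CENTS = 10_000        # $100.00 of value generated by the content
AD_TECH_FEE_PERCENT = 30    # assumed "ad tech tax"; not from the report
BROKER_FEE_PERCENT = 100    # the "100% fee" described by the executive

if __name__ == "__main__":
    print("ad tech intermediary:", publisher_share_cents(GROSS_CENTS, AD_TECH_FEE_PERCENT))
    print("AI data broker:      ", publisher_share_cents(GROSS_CENTS, BROKER_FEE_PERCENT))
```

Whatever take rate is assumed for ad tech, the publisher keeps some remainder; at a 100% fee the remainder is zero regardless of how much value the content generates.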

[Glossary ① Data Broker vs. Scraper] Strictly speaking, a "scraper" refers to the technology or firm that automatically collects content, while a "data broker" sells the collected and processed data downstream. In today's AI market, however, the two functions are usually vertically integrated within the same company, and the terms are used interchangeably. Digiday's reporting also treats "third-party web scraper" and "AI data broker" as effectively synonymous.

◆ How It Works — A Four-Step Pipeline: Scrape → Process → Sell → Reuse

The mechanism by which AI data brokers end up taking 100% of a publisher's content unfolds in four distinct steps. Tracing the flow makes clear why none of the resulting value flows back to the people who created the content.

[STEP 1] Crawling / Scraping — Automated programs (crawlers, bots) operated by the data broker continuously visit websites and download text, images and video wholesale. Publishers' consent is not part of the process. Even when robots.txt explicitly states "no-crawl," a growing number of operators ignore or actively circumvent the directive.
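
For reference, here is roughly what honoring robots.txt looks like from the crawler's side, sketched with Python's standard `urllib.robotparser`. The robots.txt contents, the bot name `ExampleDataBroker` and the `publisher.example` URLs are made up for illustration. The key point is that the check runs inside the crawler: robots.txt is a voluntary convention, so a bot that never performs this check, or ignores the answer, is not technically blocked by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. The bot name and paths
# are illustrative assumptions, not taken from the report.
ROBOTS_TXT = """\
User-agent: ExampleDataBroker
Disallow: /

User-agent: *
Disallow: /subscribers/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://publisher.example/news/story.html"
    for agent in ("ExampleDataBroker", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, story) else "disallowed"
        print(f"{agent}: {verdict}")
```

A compliant crawler would call something like `is_allowed` before every fetch and skip disallowed URLs; the operators described above simply omit that step or present a different user agent.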

[STEP 2] Processing / Structuring — The raw content is cleaned and reformatted for AI training: ads, navigation and other peripheral elements are stripped out; body text, headlines, bylines and publication dates are normalized; images are captioned and videos transcribed. At the end of this step, the original article has been transformed into a structured training dataset.
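
The structuring step can be pictured with a minimal sketch using only Python's standard library. It assumes a simplified page layout (author and date in `<meta>` tags, headline in `<h1>`, body in `<p>`) and a hypothetical output schema. Real pipelines are far more elaborate, and nothing here is drawn from any specific broker's tooling.

```python
from html.parser import HTMLParser

# Elements treated as peripheral (navigation, ads, scripts) and dropped.
SKIP_TAGS = {"nav", "aside", "script", "style", "footer"}


class ArticleExtractor(HTMLParser):
    """Collect headline, byline, date and body text; skip peripheral elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # > 0 while inside a skipped element
        self.current = None    # tag being captured: "h1" or "p"
        self.headline = ""
        self.paragraphs = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "meta" and attrs.get("name") in ("author", "date"):
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif self.skip_depth == 0 and tag in ("h1", "p"):
            self.current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.skip_depth or self.current is None:
            return
        if self.current == "h1":
            self.headline += data
        else:
            self.paragraphs[-1] += data


def structure(html: str) -> dict:
    """Turn raw article HTML into a normalized record (hypothetical schema)."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()

    def clean(text: str) -> str:
        return " ".join(text.split())  # collapse runs of whitespace

    return {
        "headline": clean(parser.headline),
        "byline": parser.meta.get("author", ""),
        "published": parser.meta.get("date", ""),
        "body": [clean(p) for p in parser.paragraphs if p.strip()],
    }


# Made-up sample page used only to exercise the sketch.
SAMPLE = """
<html><head>
<meta name="author" content="Jane Doe">
<meta name="date" content="2025-01-01">
</head><body>
<nav><p>Home | Politics | Sports</p></nav>
<h1>Example   headline</h1>
<p>First paragraph of the story.</p>
<aside>Sponsored: buy now</aside>
<p>Second paragraph.</p>
</body></html>
"""

if __name__ == "__main__":
    print(structure(SAMPLE))
```

Note what survives the transformation: the journalism itself (headline, byline, date, body) is preserved, while everything that earns the publisher money on the page, such as ad slots and site navigation, is discarded.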

[STEP 3] Selling / API — The dataset is then sold to AI companies, or exposed through real-time APIs that supply content on demand. Customers range from large AI labs (OpenAI, Anthropic, Google) to AI search and agent startups. According to Mordor Intelligence, this market has already reached $1 billion in size.

[STEP 4] Reuse / Competing Products — AI firms train chatbots, search engines and summarization services on the data. Users now get article-derived answers directly inside the AI interface — without ever visiting the publisher's site. The original creators lose traffic, advertising revenue and subscriber acquisition.

In this four-step pipeline, the original creator appears in only one

📎 Read full article on K-EnterTech Hub →


About K-EnterTech Forum · K-엔터테크포럼

K-EnterTech Forum (K-ETF, K-엔터테크포럼) is Korea's leading platform for expert insights on entertainment technology, K-Content, Hallyu, and media policy, bridging Korean cultural industries with global technology trends. It studies co-evolution strategies linking K-pop, K-drama, K-food and K-culture with AI, streaming, the creator economy and broadcast technology, and drives policy and industry cooperation agendas through forums and events in Korea and abroad.


Chairman Samseog Ko (고삼석)

Samseog Ko (고삼석) is the founding Chairman (상임의장) of K-EnterTech Forum. He is a Distinguished Professor at Dongguk University's College of Advanced Convergence and a subcommittee member of Korea's National AI Strategy Committee. Drawing on more than 30 years of experience in broadcasting and telecommunications policy and industry, he leads work on the convergence of K-content and global entertainment technology. He is a former Commissioner of the Korea Communications Commission (KCC) and writes a regular column for ZDNet Korea.

📩 familygang@naver.com  |  🌐 entertechfrum.com  |  About Chairman Samseog Ko →