What changed by 2026 and why “legal scraping” is a hotter topic
Web scraping is no longer a niche technique. It is used for market intelligence, price monitoring, data journalism, and QA checks across advertising and e‑commerce. In 2026 the core tension is unchanged: data may look public, but platforms try to control access, extraction speed, and downstream reuse.
The rules are now multi‑layered. “It’s visible in a browser” is not enough. You typically need to consider: (1) unauthorized access / computer misuse rules, (2) contract restrictions (Terms/ToS), (3) copyright, (4) database rights (especially in the EU), (5) personal data regimes (GDPR/UK GDPR and others), (6) competition / unfair practices, and (7) technical signals such as robots.txt.
Publicly accessible data does not mean free to reuse
“Public” usually means available without login: product listings, prices, classifieds, news, open profiles. That describes the access path, not the legal permission to copy and reuse at scale. Even if anyone can open a page, it does not automatically mean you can:
- copy content in bulk and republish it as your own;
- bypass technical controls (anti‑bot, IP blocks, paywalls, tokens);
- collect personal data and profile or resell it;
- breach Terms you accepted (e.g., as a registered user);
- overload servers and create DoS‑like effects.
A practical framing is: “Do we have lawful access?” + “Do we have the right to process/copy these specific elements?” + “Are we acting proportionately and in good faith?”
Three risk tiers for scraping projects
- Lower risk: extracting non‑personal facts (prices, stock, specifications), no bypassing, reasonable request rates, no copying of creative content.
- Medium risk: mixed datasets that may include personal elements (seller names, phone numbers, avatars) or competitive aggregation. Privacy, ToS, database rights, and competition law become central.
- Higher risk: logged‑in scraping, anti‑bot circumvention, account farms, large‑scale personal data harvesting, site mirroring, resale of profiles/contacts.
1) Unauthorized access and “anti‑hacking” rules
Most jurisdictions have rules against unauthorized access to computer systems. A recurring practical takeaway from recent case law: when data is available without authentication, the risk that mere automated viewing will be treated as “hacking” is lower than when you access content behind a password. But risk increases quickly if you:
- circumvent technical barriers in a way that looks like evasion rather than normal browsing;
- use compromised or borrowed credentials;
- generate behavior that resembles an attack (token guessing, aggressive traffic, scanning).
A simple 2026 rule of thumb: do not turn scraping into a security bypass. The more you automate, the more important it is to show “visitor‑like” behavior with strict rate limiting and stable, predictable request patterns.
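As a concrete illustration, here is a minimal Python sketch of “visitor‑like” pacing using the requests library. The delay value, user‑agent string, and contact URL are illustrative assumptions, not recommendations for any particular site.

```python
import time
import requests

# Illustrative values: the delay and identification string are assumptions,
# not recommendations for any specific site.
MIN_DELAY_SECONDS = 2.0
USER_AGENT = "example-monitor/1.0 (+https://example.com/contact)"  # hypothetical contact URL

def fetch_politely(urls):
    """Fetch pages sequentially with a fixed minimum delay and a clear identity."""
    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT
    results = {}
    for url in urls:
        response = session.get(url, timeout=30)
        # If the site signals throttling, back off and skip rather than hammering it.
        if response.status_code == 429:
            time.sleep(60)
            continue
        results[url] = response.text
        time.sleep(MIN_DELAY_SECONDS)  # fixed pause between requests
    return results
```

The fixed pause and the 429 back‑off are what keep automated traffic looking like ordinary visits rather than anything resembling an attack.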
2) Terms of Service: when policy becomes legal exposure
Terms of Service are contractual. If you accepted them (account sign‑up, click‑through, API use), an anti‑scraping clause can support a contract claim.
One nuance: for scraping without logging in, platforms may have a harder time showing that a contract was ever formed. But once you operate under login or use accounts, ToS risk rises sharply. Practically:
- Avoid scraping content that requires login if Terms prohibit it.
- Avoid account farms or fake accounts where rules forbid automation.
- Consider official APIs or data partnerships where feasible.
3) Copyright: facts are safer than expression
Facts (a price, a date, a model number) are typically not protected by copyright, but expression is: descriptions, photos, reviews, curated collections. The common mistake is equating “public” with “licensed to copy”.
- Extract and display factual fields rather than full descriptive text.
- Photos are a frequent trigger for claims; use licensed images or your own content.
- For analytics, prefer transformation: normalization, aggregation, statistics rather than “copy and display” (a brief sketch follows this list).
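To make the “transform rather than copy” idea concrete, here is a small Python sketch assuming you already hold purely factual records; the field names and values are illustrative.

```python
from statistics import mean, median

# Illustrative records containing only factual fields (no descriptions or photos).
records = [
    {"sku": "A-1", "price": 19.99},
    {"sku": "A-2", "price": 24.50},
    {"sku": "A-3", "price": 21.00},
]

# Aggregate statistics are derived outputs, not copies of the source expression.
summary = {
    "count": len(records),
    "avg_price": round(mean(r["price"] for r in records), 2),
    "median_price": median(r["price"] for r in records),
}
print(summary)  # {'count': 3, 'avg_price': 21.83, 'median_price': 21.0}
```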
4) EU database rights: scale matters
In the EU, databases can benefit from sui generis protection. The risk is often not a single record, but extracting a substantial part of a database, or repeated extraction of smaller parts that adds up. In practice:
- Competitive aggregation of “most of the catalog” is high risk.
- Targeted monitoring on limited samples is materially safer.
- Investment in obtaining, verifying, and maintaining the database can strengthen the rightsholder’s position.
5) EU text and data mining (TDM) and opt‑outs
The EU has specific copyright exceptions for text and data mining. The key point for commercial use is that TDM may be permitted where you have lawful access, but rightsholders can expressly reserve their rights (opt out) in an “appropriate manner”, including by machine‑readable means.
This is why robots.txt, metadata, and similar signals have gained weight. Robots.txt is not a “lock”, but ignoring it can undermine a good‑faith narrative and, in some contexts, support an argument that rights were reserved.
6) Robots.txt in 2026: technical standard, growing legal relevance
Robots.txt is part of the Robots Exclusion Protocol (RFC 9309). The standard explicitly notes that robots.txt is not a form of access authorization; it is a policy request to crawlers. Still, it can matter:
- as evidence you ignored an expressed restriction;
- as a marker of responsible behavior (respecting disallow and reducing load);
- as a practical machine‑readable signal for TDM opt‑outs in EU‑linked scenarios.
For commercial scrapers: parse robots.txt, respect sitemaps and user‑agent rules, and keep logs that demonstrate compliance.
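A minimal sketch of that first step, using Python’s standard urllib.robotparser module. The user agent and URLs are placeholders, and passing this check alone is not a full compliance review.

```python
import logging
from urllib import robotparser

logging.basicConfig(level=logging.INFO)

USER_AGENT = "example-monitor/1.0"              # placeholder identity
ROBOTS_URL = "https://example.com/robots.txt"   # placeholder site

parser = robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetches and parses robots.txt

def allowed(url: str) -> bool:
    """Check a URL against robots.txt and log the decision for auditability."""
    ok = parser.can_fetch(USER_AGENT, url)
    logging.info("robots.txt check: agent=%s url=%s allowed=%s", USER_AGENT, url, ok)
    return ok

# Crawl-delay, if declared, can drive the pause between requests (None when absent).
delay = parser.crawl_delay(USER_AGENT)
# Sitemap URLs listed in robots.txt, if any (None when absent).
sitemaps = parser.site_maps()
```

Keeping the logged decisions alongside the fetch logs is what turns “we respected robots.txt” from a claim into evidence.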
7) Personal data: “public” can still mean “regulated”
If you collect identifiers (names, phone numbers, emails, photos, handles, profile IDs), you are likely processing personal data under GDPR/UK GDPR or similar regimes. Then the main boundary is not “scraping” but “processing”. You typically need:
- a lawful basis (often legitimate interests plus a balancing test);
- data minimization (collect only what you truly need);
- transparency (how you inform data subjects);
- security (access controls, encryption, logging);
- retention limits and deletion procedures;
- processes for data subject rights (access, erasure, objection).
Regulators increasingly view large‑scale profile harvesting as high risk. Many businesses are safer focusing on non‑personal public facts, or using personal data only with strict minimization and documented justification.
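As an illustration of minimization in code, here is a sketch assuming scraped records arrive as Python dictionaries; the field names are hypothetical examples of what to keep versus what to drop before storage.

```python
# Hypothetical field names; adapt the allow-list to what you have documented
# as strictly necessary for your purpose.
ALLOWED_FIELDS = {"product_id", "price", "currency", "stock_status", "fetched_at"}
PERSONAL_FIELDS = {"seller_name", "phone", "email", "avatar_url", "profile_id"}

def minimize(record: dict) -> dict:
    """Keep only pre-approved non-personal fields; drop everything else."""
    # Belt and braces: an explicit allow-list plus an explicit deny-list for known identifiers.
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS and k not in PERSONAL_FIELDS}

raw = {
    "product_id": "A-123",
    "price": 19.99,
    "currency": "EUR",
    "seller_name": "Jane Doe",   # personal data: excluded before storage
    "phone": "+49 ...",          # personal data: excluded before storage
}
clean = minimize(raw)  # {'product_id': 'A-123', 'price': 19.99, 'currency': 'EUR'}
```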
8) Unfair competition and “repackaging” someone else’s investment
Even if you avoid ToS and privacy pitfalls, competition‑law exposure can remain. If your product mainly monetizes another platform’s catalog without permission, claims may be framed around free‑riding, unfair practices, or misleading users about the source.
- Risk increases when your value is primarily an interface on top of someone else’s dataset.
- Risk increases when updates are so frequent that you effectively substitute for the platform.
- Risk increases when you hide attribution and branding.
9) A “more legal” scraping checklist for 2026
- Define purpose: why you collect, what fields are strictly necessary.
- Limit scope: samples instead of “everything”, moderate frequency, caching.
- No bypassing: no login circumvention, token abuse, paywall/anti‑bot breaking.
- Respect robots.txt: disallow areas, load limits, sitemaps, user‑agent rules.
- Avoid copying expression: don’t mirror creative text/photos without rights.
- Privacy by design: filter personal fields; if unavoidable, document lawful basis and minimization.
- Logs and auditability: record what was fetched, at what rate, and under which rules (see the sketch after this list).
- Rightsholder channel: clear complaints/takedown path and fast removal.
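For the logging item above, a minimal sketch of a machine‑readable audit entry per request, using Python’s standard logging and json modules; the field names and file name are assumptions.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="scrape_audit.log", level=logging.INFO, format="%(message)s")

def log_fetch(url: str, status: int, robots_allowed: bool, delay_s: float) -> None:
    """Append one machine-readable audit entry per request."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "status": status,
        "robots_allowed": robots_allowed,   # decision recorded at fetch time
        "delay_seconds": delay_s,           # pacing actually applied
    }
    logging.info(json.dumps(entry))

# Example call; values are illustrative.
log_fetch("https://example.com/item/1", 200, True, 2.0)
```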
10) If a site explicitly “forbids scraping”
- Use an official API or partner access.
- Reduce collection to minimal factual fields and avoid personal data.
- Request permission or a license.
- Change the product so it does not depend on full replication (add your own data and analytics).
Where “lawful access” boundaries usually sit
“Lawful access” is a common hinge point. A page can be visible yet still problematic if you bypass geo/age gating, exploit technical misconfigurations to reach “closed” content, or mass‑abuse signed URLs and one‑time tokens.
Jurisdictions: one crawler, multiple regimes
Scraping is often cross‑border. Personal data rules may attach based on data subject location and the controller’s establishment, while contract and competition disputes may follow the platform’s jurisdiction or Terms (if applicable). For commercial projects, maintain a simple risk map: server location, user geography, and the exact fields collected.
DSA and researcher data access: a market signal
The EU Digital Services Act strengthens transparency and enables vetted researcher access to platform data in systemic‑risk contexts. The broader trend is toward managed access (procedures and APIs) rather than endless ad‑hoc scraping. If your product depends heavily on major platforms, plan a path from scraping to formal access.
Documentation that helps in real disputes
- a data collection policy (what/why/from where);
- technical controls description (rate limiting, caching, robots.txt handling);
- a personal data assessment and filtering rules;
- a response procedure (takedown, source blocking, escalation).
Conclusion
In 2026, “legal scraping” is not a single checkbox. It is a package: lawful access, respect for technical rules, minimization, no security bypass, care with personal data, and caution with database‑scale extraction. If you can explain your process as proportionate, transparent, and safe, your position is stronger—even when the data is publicly accessible.