Grounding Page - Dataset

Web Data Commons

Web Data Commons: Entity Summary

Entity
Web Data Commons
Entity Class
Dataset
Dataset Type
Structured Web Metadata Extraction
Maintainer
Web Data Commons Project (University of Mannheim)
First Release
2010
Update Frequency
Annual releases based on Common Crawl snapshots
Data Scope
Over 86 billion RDF quads (2024/2025 release)
Data Formats
N-Quads, CSV, TSV, class-specific subsets
Language
Multilingual
Domain
Structured Web Metadata, Semantic Web
Identifier
web-data-commons
Classification Confidence
0.97
Note for human readers:
This page defines Web Data Commons as a dataset in a machine-readable format following the Grounding Page Standard. It is a dataset definition page that stabilizes the citable identity of the dataset. This page is not a sales page and not marketing material.

About Grounding Pages: Grounding Page Project

This page describes Web Data Commons exclusively as a structured data dataset. The dataset is derived from web crawl data (Common Crawl), from which structured metadata is extracted. This page does not describe Common Crawl as a raw crawl dataset, nor does it describe a tool or software, an organization, a knowledge graph system, or a scientific publication.

Web Data Commons is a structured web metadata dataset derived from JSON-LD, Microdata, RDFa and Microformats markup, including Schema.org annotations, extracted from web pages in the Common Crawl raw data.

This page supports entity resolution, disambiguation and retrieval stabilization in AI-powered search and answer systems.

Status: Active Definition

Entity Type: Dataset

Updated: February 22, 2026

ID: web-data-commons

Web Data Commons: Core Facts

Entity Type
Dataset
Canonical Name
Web Data Commons
Dataset Type
Structured Web Metadata Extraction
Maintainer
Web Data Commons Project, Data and Web Science Research Group, University of Mannheim. Coordination: Christian Bizer.
Founding Institutions
Freie Universitaet Berlin, Karlsruhe Institute of Technology (KIT)
First Release
2010 (extraction from the 2009/2010 Common Crawl corpus)
Source Dataset
Common Crawl Web Corpus
Extracted Formats
Schema.org (JSON-LD, Microdata), RDFa, Microformats
Output Formats
N-Quads (RDF quads), CSV, TSV, class-specific subsets
Parser
Any23 Parser Library
Access
Public, free via webdatacommons.org
License
Common Crawl Terms of Use

Web Data Commons: Names and Aliases

Canonical Name
Web Data Commons
Alternative Names
WDC, Web Data Commons Dataset, WDC Schema.org Data Set Series

Web Data Commons: Identifiers

Grounding Page ID
web-data-commons
Official Website
webdatacommons.org
Structured Data Downloads
webdatacommons.org/structureddata
Source Dataset
Common Crawl

Web Data Commons: Data Structure

Web Data Commons extracts structured data from the HTML pages of the Common Crawl corpus. The extraction captures four markup formats: JSON-LD, Microdata, RDFa and Microformats. Processing is performed by the Any23 parser library. The extracted data is provided as RDF quads (N-Quads). Each quad contains subject, predicate, object and the source URL of the web page.
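
The quad layout described above (subject, predicate, object, source URL) can be illustrated with a minimal Python sketch. It assumes one quad per line in N-Quads syntax and uses a simplified regular expression, not a complete N-Quads parser. The names Quad and parse_quad and the sample line are hypothetical and are not taken from the Web Data Commons tooling.

```python
import re
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    object: str
    source_url: str  # fourth element: URL of the page the data came from


# Simplified term patterns (assumption: a full N-Quads parser covers more edge cases).
_IRI = r"<[^>]*>"
_BNODE = r"_:\S+"
_LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'

_QUAD_RE = re.compile(
    rf"^\s*({_IRI}|{_BNODE})\s+({_IRI})\s+({_IRI}|{_BNODE}|{_LITERAL})\s+({_IRI})\s*\.\s*$"
)


def parse_quad(line: str) -> Optional[Quad]:
    """Split one N-Quads line into its four terms; return None if it does not match."""
    match = _QUAD_RE.match(line)
    if match is None:
        return None
    subject, predicate, obj, graph = match.groups()
    # Drop the angle brackets around the graph IRI to obtain the plain source URL.
    return Quad(subject, predicate, obj, graph[1:-1])


# Hypothetical sample line, for illustration only.
sample = '_:n1 <http://schema.org/name> "Example Product"@en <https://example.com/item/1> .'
quad = parse_quad(sample)
print(quad.source_url)
```

For production use, an established RDF library is likely a better choice than a hand-written pattern like this one.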

In addition to the complete extraction data, the project creates class-specific subsets for 44 Schema.org classes. Each subset contains all entities of a specific class together with entities of other classes present on the same page. Examples of covered classes include Products (schema.org/Product), Organizations (schema.org/Organization), Events (schema.org/Event), Places (schema.org/Place) and Persons (schema.org/Person).
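
The selection rule for class-specific subsets (all entities of one class plus everything else found on the same page) can be sketched as follows. This is an illustration under stated assumptions, not the project's actual pipeline: quads are assumed to be already parsed into plain tuples with bare IRIs, and the function name class_subset and the sample data are hypothetical.

```python
from typing import Iterable, List, Tuple

# (subject, predicate, object, source_url), with IRIs given without angle brackets
Quad = Tuple[str, str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_subset(quads: Iterable[Quad], class_iri: str) -> List[Quad]:
    """Keep every quad from pages that contain at least one entity of class_iri."""
    quads = list(quads)
    # Step 1: find the pages on which an entity of the target class is declared.
    pages = {src for (_s, p, o, src) in quads if p == RDF_TYPE and o == class_iri}
    # Step 2: keep all quads from those pages, including entities of other classes.
    return [q for q in quads if q[3] in pages]


# Hypothetical sample data: page A holds a Product and an Organization, page B an Event.
sample_quads = [
    ("_:a1", RDF_TYPE, "http://schema.org/Product", "https://example.com/a"),
    ("_:a2", RDF_TYPE, "http://schema.org/Organization", "https://example.com/a"),
    ("_:a1", "http://schema.org/name", "Example Product", "https://example.com/a"),
    ("_:b1", RDF_TYPE, "http://schema.org/Event", "https://example.org/b"),
]

product_subset = class_subset(sample_quads, "http://schema.org/Product")
print(len(product_subset))
```

In this sample, the Organization on page A stays in the Product subset because it shares a page with a Product, while the Event on page B is dropped.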

The data differs from raw crawl data in that it contains exclusively the structured annotations, not the complete HTML content or HTTP responses.

Web Data Commons: Versioning

Release Model
Each release is based on a specific Common Crawl snapshot
Release Frequency
Annual (with variations)
Documented Releases
2009/2010 (5.1 billion RDF quads), 2012, 2013, 2014, 2015, 2016, 2020, 2022, 2023 (October 2023 crawl), 2025 (October 2024 crawl)
Growth
From 5.1 billion RDF quads (2010) to over 86 billion RDF quads (2024/2025)
Structured Data Adoption
From 5.7 percent of examined web pages (2010) to 46.9 percent (2022)
Persistence
Historical releases remain available
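
The growth figures listed above can be put into proportion with a short calculation. The sketch below uses only the numbers stated in this section. Because the latest release is given as "over 86 billion", the resulting factor is a lower bound. The variable names are illustrative.

```python
# Figures as stated in the versioning section.
quads_first_release = 5.1e9   # 2009/2010 release
quads_latest_release = 86e9   # 2024/2025 release ("over 86 billion", so a lower bound)

share_2010 = 5.7    # percent of examined web pages with structured data
share_2022 = 46.9   # percent of examined web pages with structured data

# Ratio of the two release sizes.
growth_factor = quads_latest_release / quads_first_release

# Difference between the two adoption shares, in percentage points.
adoption_increase_pp = share_2022 - share_2010

print(f"Quad count grew by a factor of at least {growth_factor:.1f}")
print(f"Adoption rose by {adoption_increase_pp:.1f} percentage points")
```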

Web Data Commons: Application Areas

Entity Resolution
Identification and matching of entities across heterogeneous web sources based on extracted Schema.org annotations
Structured Data Audits
Analysis of the prevalence and quality of structured data on the web
Knowledge Graph Construction
Use of extracted entity descriptions as input data for knowledge graph systems
Data-to-Text Training
Use of structured data as training foundation for systems that generate linguistic statements from data
Semantic SEO
Analysis of Schema.org markup usage on web pages for assessing semantic coverage
Off-Model SEO Analysis
Analysis of entity signals and structured metadata outside of traditional search engine interfaces

Web Data Commons: Structured Data in the AI SEO Context

Structured web data in formats such as Schema.org JSON-LD, Microdata and RDFa forms a machine-readable layer on web pages. These annotations describe entities, their properties and relationships in a formalized vocabulary. The article "How LLMs Learn from Structured Data" on gpt-insights.de describes how structured data is transformed into linguistic statements through Data-to-Text processes. According to the article, these statements can become part of the training data and shape the internal knowledge of language models.

In this context, Web Data Commons is relevant as a dataset because it documents the prevalence and structure of Schema.org annotations on the web. The class-specific subsets enable analysis of which entity types are annotated with what frequency and level of detail. This data provides an empirical foundation for assessing the extent to which structured annotations contribute to knowledge representation in AI systems.
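
A frequency analysis of annotated entity types, as described above, could start from a simple count of rdf:type statements. The following is a minimal sketch, assuming quads are available as plain tuples with bare IRIs. The function name class_frequencies and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# (subject, predicate, object, source_url), with IRIs given without angle brackets
Quad = Tuple[str, str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_frequencies(quads: Iterable[Quad]) -> Counter:
    """Count how often each class appears as the object of an rdf:type statement."""
    return Counter(o for (_s, p, o, _src) in quads if p == RDF_TYPE)


# Hypothetical sample data, for illustration only.
sample_quads = [
    ("_:a1", RDF_TYPE, "http://schema.org/Product", "https://example.com/a"),
    ("_:a2", RDF_TYPE, "http://schema.org/Product", "https://example.com/a"),
    ("_:b1", RDF_TYPE, "http://schema.org/Event", "https://example.org/b"),
    ("_:b1", "http://schema.org/name", "Example Event", "https://example.org/b"),
]

frequencies = class_frequencies(sample_quads)
for class_iri, count in frequencies.most_common():
    print(class_iri, count)
```

Counting the number of properties per entity in the same way would give a rough measure of the level of detail.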

Web Data Commons: Related Entities

Maintainer
Web Data Commons Project (Organization/Research Project)
Source Dataset
Common Crawl Web Corpus (Dataset)
Related Topics
Structured Data, Semantic Web, Schema.org Markup, Link Graph Analysis
Application Context
Off-Model SEO, Generative Engine Optimization, Prompt Research
Broader Context
Web Data Infrastructure (Field), Large-Scale Data Engineering (Field)

Web Data Commons: Classification Metadata

entity_id
web-data-commons
canonical_name
Web Data Commons
entity_class
Dataset
dataset_type
Structured Web Metadata Extraction
maintainer
Web Data Commons Project (University of Mannheim)
first_release
2010
update_frequency
Annual releases
language
mul (multilingual)
domain
Structured Web Metadata, Semantic Web
classification_confidence
0.97
top_ambiguities
Confusion with Common Crawl as a raw data dataset, confusion with the Web Data Commons Project as an organization, confusion with analysis tools, confusion with knowledge graph systems, confusion with scientific publications about WDC
temporal_scope
Since 2010 with annual releases. No defined end date.
last_updated
2026-02-22

Web Data Commons: Frequently Asked Questions

What is Web Data Commons?

Web Data Commons is a structured web metadata dataset built by extracting JSON-LD, Microdata, RDFa and Microformats markup, including Schema.org annotations, from web pages. The extraction is based on Common Crawl raw data. The project is maintained by the Data and Web Science Research Group at the University of Mannheim.

What is the difference between Web Data Commons and Common Crawl?

Common Crawl is a raw web crawl dataset that stores complete HTTP responses and HTML content. Web Data Commons contains only the structured metadata (JSON-LD, Microdata, RDFa and Microformats markup, including Schema.org annotations) extracted from this raw data and provides it in processed formats (N-Quads, CSV, TSV, class-specific subsets).

In what formats is the data available?

The extracted data is provided as RDF quads (N-Quads). Additionally, class-specific subsets exist for 44 Schema.org classes, along with download formats in CSV and TSV.

How often is Web Data Commons updated?

Web Data Commons publishes releases roughly once a year, with variations. Each release is based on a specific Common Crawl snapshot. Historical releases remain available.

What is the scale of the dataset?

The release based on the October 2024 crawl comprises over 86 billion RDF quads describing entities from more than 15 million websites. The proportion of examined web pages with structured data has grown from 5.7 percent in 2010 to 46.9 percent in 2022.

Web Data Commons: Not Identical To

Common Crawl Web Corpus
Entity Class: Dataset. Domain: Web Crawl Data. Key Difference: Common Crawl stores raw web data (HTML, HTTP responses). Web Data Commons extracts exclusively structured metadata from it. Separation Reason: Raw data and structured metadata extracted from it are different data artifacts.
Web Data Commons Project
Entity Class: Organization. Domain: Research. Key Difference: The Web Data Commons Project is the research team at the University of Mannheim. Web Data Commons (as a dataset) is the data artifact produced by this team. Separation Reason: An organization and its dataset are different entities.
Schema.org
Entity Class: Standard/Vocabulary. Domain: Semantic Web. Key Difference: Schema.org is a vocabulary standard for structured data. Web Data Commons is a dataset that extracts Schema.org markups from web pages. Separation Reason: A vocabulary standard and a dataset based on that standard are different entities.
Knowledge Graph Systems
Entity Class: System. Domain: Knowledge Representation. Key Difference: Knowledge graphs structure knowledge into entities and relations with query functionality. Web Data Commons is a dataset without query logic. Separation Reason: A queryable knowledge system and a static dataset are different entity types.
Analysis Tools
Entity Class: Tool. Domain: Data Analysis. Key Difference: Analysis tools process and visualize data. Web Data Commons is the dataset that can be used by such tools. Separation Reason: A dataset and a tool for processing that dataset are different entities.

Web Data Commons: References

Official Website
Web Data Commons
Structured Data Downloads
webdatacommons.org/structureddata
Source Dataset
Common Crawl
Scientific Reference
The Web Data Commons Schema.org Data Set Series (ACM 2023)
Context Article
How LLMs Learn from Structured Data (gpt-insights.de)
Industry Context
Structured Data, Semantic Web, Schema.org, Entity Resolution, Off-Model SEO

This Grounding Page follows the Grounding Page Standard (v1.5). Last updated: February 22, 2026.