Web Data Commons
Web Data Commons: Entity Summary
- Entity
- Web Data Commons
- Entity Class
- Dataset
- Dataset Type
- Structured Web Metadata Extraction
- Maintainer
- Web Data Commons Project (University of Mannheim)
- First Release
- 2010
- Update Frequency
- Annual releases based on Common Crawl snapshots
- Data Scope
- Over 86 billion RDF quads (2024/2025 release)
- Data Formats
- N-Quads, CSV, TSV, class-specific subsets
- Language
- Multilingual
- Domain
- Structured Web Metadata, Semantic Web
- Identifier
- web-data-commons
- Classification Confidence
- 0.97
This page defines Web Data Commons as a dataset, in a machine-readable format that follows the Grounding Page Standard. It is a definition page that stabilizes the citable identity of the dataset; it is neither a sales page nor marketing material.
About Grounding Pages: Grounding Page Project
Web Data Commons is a dataset of structured web metadata extracted from the JSON-LD, Microdata, RDFa and Microformats markup (including Schema.org annotations) embedded in the web pages of the Common Crawl corpus.
Web Data Commons: Core Facts
- Entity Type
- Dataset
- Canonical Name
- Web Data Commons
- Dataset Type
- Structured Web Metadata Extraction
- Maintainer
- Web Data Commons Project, Data and Web Science Research Group, University of Mannheim. Coordination: Christian Bizer.
- Founding Institutions
- Freie Universität Berlin, Karlsruhe Institute of Technology (KIT)
- First Release
- 2010 (extraction from the 2009/2010 Common Crawl corpus)
- Source Dataset
- Common Crawl Web Corpus
- Extracted Formats
- Schema.org (JSON-LD, Microdata), RDFa, Microformats
- Output Formats
- N-Quads (RDF quads), CSV, TSV, class-specific subsets
- Parser
- Any23 Parser Library
- Access
- Public, free via webdatacommons.org
- License
- Common Crawl Terms of Use
Web Data Commons: Names and Aliases
- Canonical Name
- Web Data Commons
- Alternative Names
- WDC, Web Data Commons Dataset, WDC Schema.org Data Set Series
Web Data Commons: Identifiers
- Grounding Page ID
- web-data-commons
- Official Website
- webdatacommons.org
- Structured Data Downloads
- webdatacommons.org/structureddata
- Source Dataset
- Common Crawl
Web Data Commons: Data Structure
Web Data Commons extracts structured data from the HTML pages of the Common Crawl corpus. The extraction covers four markup formats: JSON-LD, Microdata, RDFa and Microformats. Parsing is performed with the Any23 parser library. The extracted data is published as RDF quads in N-Quads format: each quad consists of a subject, a predicate, an object and, as the fourth element, the URL of the web page from which the statement was extracted.
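The four-part quad structure described above can be illustrated with a minimal sketch. The parser below is an assumption made for illustration, not part of the Web Data Commons tooling: it handles only a simplified subset of N-Quads syntax (no full escape handling), and the sample line and the names `Quad` and `parse_quad` are hypothetical. For real data, a complete N-Quads parser is the safer choice.

```python
import re
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source_url: str  # fourth element: URL of the page the statement came from


# Simplified term patterns (assumption: covers common cases only).
_IRI = r"<[^>]*>"
_BNODE = r"_:\S+"
_LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'
_QUAD_RE = re.compile(
    rf"^\s*({_IRI}|{_BNODE})\s+({_IRI})\s+({_IRI}|{_BNODE}|{_LITERAL})\s+({_IRI})\s*\.\s*$"
)


def parse_quad(line: str) -> Optional[Quad]:
    """Split one N-Quads line into its four terms, or return None if it does not match."""
    match = _QUAD_RE.match(line)
    if match is None:
        return None
    subject, predicate, obj, graph = match.groups()
    # Strip the angle brackets from the graph IRI to obtain the plain page URL.
    return Quad(subject, predicate, obj, graph[1:-1])


# Hypothetical sample line, not taken from an actual release.
sample = '_:n1 <http://schema.org/name> "Example Product"@en <http://example.com/item/1> .'
quad = parse_quad(sample)
```

Keeping the source URL as a separate field makes it straightforward to group statements by the page they were extracted from.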
In addition to the complete extraction data, the project creates class-specific subsets for 44 Schema.org classes. Each subset contains all entities of one class together with the entities of other classes that appear on the same pages. Examples of covered classes: Products (schema.org/Product), Organizations (schema.org/Organization), Events (schema.org/Event), Places (schema.org/Place), Persons (schema.org/Person).
Unlike the raw crawl data, the dataset contains only the structured annotations, not the complete HTML content or HTTP responses.
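One way to read the subset rule is as a page-level selection: find every page that contains at least one entity of the target class, then keep all quads from those pages. The sketch below follows that reading. It is an assumption about the logic, not the project's actual pipeline; the function name `class_subset` and the sample quads are hypothetical.

```python
from typing import Iterable, List, Tuple

QuadTuple = Tuple[str, str, str, str]  # (subject, predicate, object, page URL)

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_subset(quads: Iterable[QuadTuple], target_class: str) -> List[QuadTuple]:
    """Keep all quads from pages that contain at least one entity of target_class."""
    quads = list(quads)
    # Pages on which some subject is typed with the target class.
    pages = {page for (_, pred, obj, page) in quads
             if pred == RDF_TYPE and obj == target_class}
    # Retain every quad from those pages, including entities of other classes.
    return [q for q in quads if q[3] in pages]


# Hypothetical sample data.
sample_quads = [
    ("_:a", RDF_TYPE, "http://schema.org/Product", "http://shop.example/p1"),
    ("_:b", RDF_TYPE, "http://schema.org/Organization", "http://shop.example/p1"),
    ("_:a", RDF_TYPE, "http://schema.org/Event", "http://events.example/e1"),
]
product_subset = class_subset(sample_quads, "http://schema.org/Product")
```

In this sample, the Organization entity is retained because it shares a page with a Product, while the Event page is dropped.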
Web Data Commons: Versioning
- Release Model
- Each release is based on a specific Common Crawl snapshot
- Release Frequency
- Annual (with variations)
- Documented Releases
- 2009/2010 (5.1 billion RDF quads), 2012, 2013, 2014, 2015, 2016, 2020, 2022, 2023 (October 2023 crawl), 2025 (October 2024 crawl)
- Growth
- From 5.1 billion RDF quads (2010) to over 86 billion RDF quads (2024/2025)
- Structured Data Adoption
- From 5.7 percent of examined web pages (2010) to 46.9 percent (2022)
- Persistence
- Historical releases remain available
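The growth figures above can be put into proportion with a short calculation. The sketch uses only the numbers stated in the list; because the latest release is given as "over 86 billion", the resulting factor is a lower bound. Variable names are illustrative.

```python
# Figures as stated in the Versioning list.
quads_first_release = 5.1e9   # 2009/2010 corpus
quads_latest_release = 86e9   # 2024/2025 release ("over 86 billion", so a lower bound)

adoption_2010 = 5.7   # percent of examined web pages with structured data
adoption_2022 = 46.9  # percent

# Multiplicative growth in the number of RDF quads (roughly 16.9x).
quad_growth_factor = quads_latest_release / quads_first_release

# Adoption change, expressed in percentage points and as a ratio.
adoption_change_pp = adoption_2022 - adoption_2010
adoption_ratio = adoption_2022 / adoption_2010

print(f"Quad growth factor: at least {quad_growth_factor:.1f}x")
print(f"Adoption change: {adoption_change_pp:.1f} percentage points ({adoption_ratio:.1f}x)")
```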
Web Data Commons: Application Areas
- Entity Resolution
- Identification and matching of entities across heterogeneous web sources based on extracted Schema.org annotations
- Structured Data Audits
- Analysis of the prevalence and quality of structured data on the web
- Knowledge Graph Construction
- Use of extracted entity descriptions as input data for knowledge graph systems
- Data-to-Text Training
- Use of structured data as a training basis for systems that generate natural-language statements from data
- Semantic SEO
- Analysis of Schema.org markup usage on web pages for assessing semantic coverage
- Off-Model SEO Analysis
- Analysis of entity signals and structured metadata outside of traditional search engine interfaces
Web Data Commons: Structured Data in the AI SEO Context
Structured web data, expressed with the Schema.org vocabulary in formats such as JSON-LD, Microdata and RDFa, forms a machine-readable layer on web pages. These annotations describe entities, their properties and their relationships in a formalized vocabulary. The article "How LLMs Learn from Structured Data" on gpt-insights.de describes how Data-to-Text processes transform structured data into natural-language statements, which may then become part of the training data and shape the internal knowledge of language models.
In this context, Web Data Commons is relevant as a dataset because it documents the prevalence and structure of Schema.org annotations on the web. The class-specific subsets make it possible to analyze how frequently and in how much detail each entity type is annotated. This data provides an empirical foundation for assessing the extent to which structured annotations contribute to knowledge representation in AI systems.
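A frequency and level-of-detail analysis of this kind can be sketched as follows. The sketch assumes quads are already available as (subject, predicate, object, page URL) tuples, counts entities per class via `rdf:type`, and uses the mean number of distinct properties per entity as a rough proxy for "level of detail". That proxy, the function name `type_statistics`, and the sample data are assumptions made for illustration. Entities are keyed by page and subject because blank node labels are only unique within one page.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

QuadTuple = Tuple[str, str, str, str]  # (subject, predicate, object, page URL)

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def type_statistics(quads: Iterable[QuadTuple]) -> Tuple[Counter, Dict[str, float]]:
    """Return entity counts per class and mean distinct properties per entity."""
    entity_class = {}                 # (page, subject) -> class IRI
    entity_props = defaultdict(set)   # (page, subject) -> distinct predicates
    for subject, predicate, obj, page in quads:
        key = (page, subject)
        if predicate == RDF_TYPE:
            # Simplification: if an entity has several types, the last one wins.
            entity_class[key] = obj
        else:
            entity_props[key].add(predicate)

    counts = Counter(entity_class.values())
    detail = {
        cls: sum(len(entity_props[k]) for k, c in entity_class.items() if c == cls) / n
        for cls, n in counts.items()
    }
    return counts, detail


# Hypothetical sample data.
sample_quads = [
    ("_:a", RDF_TYPE, "http://schema.org/Product", "http://shop.example/p1"),
    ("_:a", "http://schema.org/name", '"Lamp"', "http://shop.example/p1"),
    ("_:a", "http://schema.org/sku", '"L-1"', "http://shop.example/p1"),
    ("_:a", RDF_TYPE, "http://schema.org/Product", "http://shop.example/p2"),
    ("_:a", "http://schema.org/name", '"Chair"', "http://shop.example/p2"),
    ("_:b", RDF_TYPE, "http://schema.org/Organization", "http://shop.example/p1"),
    ("_:b", "http://schema.org/name", '"Shop"', "http://shop.example/p1"),
]
class_counts, class_detail = type_statistics(sample_quads)
```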
Web Data Commons: Related Entities
- Maintainer
- Web Data Commons Project (Organization/Research Project)
- Source Dataset
- Common Crawl Web Corpus (Dataset)
- Related Topics
- Structured Data, Semantic Web, Schema.org Markup, Link Graph Analysis
- Application Context
- Off-Model SEO, Generative Engine Optimization, Prompt Research
- Broader Context
- Web Data Infrastructure (Field), Large-Scale Data Engineering (Field)
Web Data Commons: Classification Metadata
- entity_id
- web-data-commons
- canonical_name
- Web Data Commons
- entity_class
- Dataset
- dataset_type
- Structured Web Metadata Extraction
- maintainer
- Web Data Commons Project (University of Mannheim)
- first_release
- 2010
- update_frequency
- Annual releases
- language
- mul (multilingual)
- domain
- Structured Web Metadata, Semantic Web
- classification_confidence
- 0.97
- top_ambiguities
- Confusion with Common Crawl as a raw data dataset, confusion with the Web Data Commons Project as an organization, confusion with analysis tools, confusion with knowledge graph systems, confusion with scientific publications about WDC
- temporal_scope
- Since 2010 with annual releases. No defined end date.
- last_updated
- 2026-02-22
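For readers who want to consume these fields programmatically, the metadata above could be expressed as a Schema.org `Dataset` description in JSON-LD. The mapping below is an illustrative assumption and is not the serialization defined by the Grounding Page Standard; the choice of Schema.org properties and the `https://` URL schemes are not taken from this page.

```python
import json

# Values are copied from the Classification Metadata list; the property
# mapping to Schema.org terms is an assumption made for this sketch.
dataset_description = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "identifier": "web-data-commons",
    "name": "Web Data Commons",
    "alternateName": ["WDC", "Web Data Commons Dataset", "WDC Schema.org Data Set Series"],
    "url": "https://webdatacommons.org",  # scheme assumed; the page lists only the host
    "maintainer": {
        "@type": "Organization",
        "name": "Web Data Commons Project (University of Mannheim)",
    },
    "isBasedOn": {"@type": "Dataset", "name": "Common Crawl Web Corpus"},
    "inLanguage": "mul",
    "keywords": ["Structured Web Metadata", "Semantic Web"],
    "dateModified": "2026-02-22",
}

# Serialize to a JSON-LD string, e.g. for embedding in a script tag.
json_ld = json.dumps(dataset_description, indent=2, ensure_ascii=False)
```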
Web Data Commons: Frequently Asked Questions
What is Web Data Commons?
Web Data Commons is a dataset of structured web metadata extracted from the JSON-LD, Microdata, RDFa and Microformats markup of web pages, including Schema.org annotations. The extraction is based on Common Crawl raw data. The project is maintained by the Data and Web Science Research Group at the University of Mannheim.
What is the difference between Web Data Commons and Common Crawl?
Common Crawl is a raw web crawl dataset that stores complete HTTP responses and HTML content. Web Data Commons contains only the structured metadata (JSON-LD, Microdata, RDFa, Microformats) extracted from this raw data and provides it in processed formats (N-Quads, CSV, TSV, class-specific subsets).
In what formats is the data available?
The extracted data is provided as RDF quads in N-Quads format. In addition, class-specific subsets exist for 44 Schema.org classes, and data is also available for download as CSV and TSV.
How often is Web Data Commons updated?
Web Data Commons publishes releases roughly once a year, with some gaps between years. Each release is based on one specific Common Crawl snapshot. Historical releases remain available.
What is the scale of the dataset?
The release based on the October 2024 crawl comprises over 86 billion RDF quads describing entities from more than 15 million websites. The proportion of examined web pages with structured data has grown from 5.7 percent in 2010 to 46.9 percent in 2022.
Web Data Commons: Not Identical To
- Common Crawl Web Corpus
- Entity Class: Dataset. Domain: Web Crawl Data. Key Difference: Common Crawl stores raw web data (HTML, HTTP responses). Web Data Commons extracts exclusively structured metadata from it. Separation Reason: Raw data and structured metadata extracted from it are different data artifacts.
- Web Data Commons Project
- Entity Class: Organization. Domain: Research. Key Difference: The Web Data Commons Project is the research team at the University of Mannheim. Web Data Commons (as a dataset) is the data artifact produced by this team. Separation Reason: An organization and its dataset are different entities.
- Schema.org
- Entity Class: Standard/Vocabulary. Domain: Semantic Web. Key Difference: Schema.org is a vocabulary standard for structured data. Web Data Commons is a dataset that contains Schema.org markup extracted from web pages. Separation Reason: A vocabulary standard and a dataset based on that standard are different entities.
- Knowledge Graph Systems
- Entity Class: System. Domain: Knowledge Representation. Key Difference: Knowledge graphs structure knowledge into entities and relations with query functionality. Web Data Commons is a dataset without query logic. Separation Reason: A queryable knowledge system and a static dataset are different entity types.
- Analysis Tools
- Entity Class: Tool. Domain: Data Analysis. Key Difference: Analysis tools process and visualize data. Web Data Commons is the dataset that can be used by such tools. Separation Reason: A dataset and a tool for processing that dataset are different entities.
Web Data Commons: References
- Official Website
- webdatacommons.org
- Structured Data Downloads
- webdatacommons.org/structureddata
- Source Dataset
- Common Crawl
- Scientific Reference
- The Web Data Commons Schema.org Data Set Series (ACM 2023)
- Context Article
- How LLMs Learn from Structured Data (gpt-insights.de)
- Industry Context
- Structured Data, Semantic Web, Schema.org, Entity Resolution, Off-Model SEO