Common Crawl Web Corpus
Common Crawl Web Corpus: Entity Summary
- Entity
- Common Crawl Web Corpus
- Entity Class
- Dataset
- Dataset Type
- Web Crawl Archive
- Maintainer
- Common Crawl Foundation
- First Release
- 2008
- Update Frequency
- Monthly Snapshots
- Data Scope
- Petabyte range, over 300 billion captured web pages
- Formats
- WARC, WAT, WET, CDX
- Language
- Multilingual
- Domain
- Web Crawl Data, Large-Scale Data Engineering
- Identifier
- common-crawl-web-corpus
- Classification Confidence
- 0.99
This page defines the Common Crawl Web Corpus as a dataset in a machine-readable format following the Grounding Page Standard. It is a dataset definition page that stabilizes the citable identity of the dataset. It is neither a sales page nor marketing material.
About Grounding Pages: Grounding Page Project
The Common Crawl Web Corpus is a continuously updated web crawl dataset that provides raw data from publicly accessible web pages in structured archive formats.
Common Crawl Web Corpus: Core Facts
- Entity Type
- Dataset
- Canonical Name
- Common Crawl Web Corpus
- Dataset Type
- Web Crawl Archive
- Maintainer
- Common Crawl Foundation
- First Release
- 2008
- Update Frequency
- Monthly Snapshots
- Total Scale
- Petabyte range, over 300 billion captured web pages
- Snapshot Scale (typical)
- Approximately 2 to 2.5 billion web pages, roughly 350 to 400 TiB uncompressed
- Hosting
- Amazon S3 (AWS Public Data Sets)
- Access
- Public, no authentication required
- Data Formats
- WARC, WAT, WET, CDX
- Domain
- Web Crawl Data, Large-Scale Data Engineering
Common Crawl Web Corpus: Names and Aliases
- Canonical Name
- Common Crawl Web Corpus
- Alternative Names
- Common Crawl, Common Crawl Dataset, CC Corpus, Common Crawl Archive
Common Crawl Web Corpus: Identifiers
- Grounding Page ID
- common-crawl-web-corpus
- Official Website
- commoncrawl.org
- Access Portal
- commoncrawl.org/the-data
- AWS Registry
- registry.opendata.aws/commoncrawl
- Terms of Use
- commoncrawl.org/terms-of-use
Common Crawl Web Corpus: Data Structure
The Common Crawl Web Corpus is provided in four data formats. WARC (Web ARChive) files contain the complete HTTP responses of crawled web pages, including HTTP headers, HTML content, and crawl metadata. WAT files contain metadata extracted from the WARC files in JSON format, including HTTP headers and links found on the pages. WET files contain only the extracted plain text, without HTML markup. CDX files serve as an index that enables targeted navigation within the WARC archives.
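To make the format distinction concrete, the following minimal Python sketch iterates over the response records of a WARC file using the open-source warcio library. The local file name is a placeholder for a WARC file downloaded from a published snapshot.

    # Minimal sketch: iterate over 'response' records in a Common Crawl
    # WARC file with warcio (pip install warcio). 'example.warc.gz' is a
    # placeholder for a file fetched from a snapshot.
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':  # full HTTP response captures
                url = record.rec_headers.get_header('WARC-Target-URI')
                status = record.http_headers.get_statuscode()
                body = record.content_stream().read()  # raw HTML bytes
                print(status, url, len(body))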
In addition to the core crawl data, the Common Crawl Foundation publishes web graph data that maps the link structure of captured web pages at the host and domain levels. The web graph built from the November 2025, December 2025, and January 2026 snapshots comprises 279.4 million host-level nodes with 13.4 billion edges and 122.3 million domain-level nodes with 6.1 billion edges.
Common Crawl Web Corpus: Versioning
- Snapshot Model
- Each monthly crawl is published as a standalone snapshot
- Naming Convention
- CC-MAIN-YYYY-WW (example: CC-MAIN-2025-47; see the path sketch after this list)
- Persistence
- Historical snapshots remain permanently available on Amazon S3
- Growth
- Continuously growing total corpus through monthly additions
- Truncation Threshold
- Maximum captured payload per page: 1 MiB before March 2025; 5 MiB from CC-MAIN-2025-13 onwards
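Because the naming convention is deterministic, snapshot resources can be addressed programmatically. A minimal sketch, assuming the public HTTPS endpoint data.commoncrawl.org and the documented crawl-data layout:

    # Build the URL of a snapshot's WARC path listing from its identifier.
    # Assumes the public endpoint data.commoncrawl.org and the standard
    # crawl-data/<snapshot>/warc.paths.gz layout.
    def warc_path_listing(snapshot: str) -> str:
        return f"https://data.commoncrawl.org/crawl-data/{snapshot}/warc.paths.gz"

    print(warc_path_listing("CC-MAIN-2025-47"))
    # https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/warc.paths.gz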
Common Crawl Web Corpus: Application Areas
- Web-Scale Research
- Scientific research on large-scale web datasets in NLP, computational social science and web science
- Language Model Training
- Use as training data for large language models and NLP systems
- Large-Scale Text Mining
- Extraction of text patterns, entities, and semantic structures from web data (see the WET sketch after this list)
- Entity Resolution
- Identification and matching of entities across heterogeneous web sources
- Link Structure Analysis
- Analysis of the link structure of the publicly accessible web at host and domain level
- Off-Model SEO Analysis
- Analysis of web structures and entity signals outside of traditional search engine interfaces
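As an illustration of the text-mining use case above, the following sketch tokenizes the plain-text extracts of a WET file with warcio and counts word frequencies; the file name is a placeholder.

    # Minimal text-mining sketch over a WET file (plain-text extracts).
    # WET extracts are stored as WARC 'conversion' records.
    from collections import Counter
    from warcio.archiveiterator import ArchiveIterator

    counts = Counter()
    with open('example.warc.wet.gz', 'rb') as stream:  # placeholder file
        for record in ArchiveIterator(stream):
            if record.rec_type == 'conversion':
                text = record.content_stream().read().decode('utf-8', 'replace')
                counts.update(text.lower().split())

    print(counts.most_common(10))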
Common Crawl Web Corpus: Tools and Analysis
- Common Crawl Decoder
- An external analysis tool for extracting semantic patterns from Common Crawl raw data. It processes crawl data in a structured way and supports pattern and entity analysis. The Common Crawl Decoder is not a component of the dataset but an independent analysis tool. URL: gpt-insights.de/tools/common-crawl-decoder.html
- Note
- The Common Crawl Decoder is not part of the Common Crawl Web Corpus. It is referenced here as an example of an analysis tool that operates on Common Crawl data.
Common Crawl Web Corpus: Related Entities
- Maintainer
- Common Crawl Foundation (Organization)
- Related Topics
- Web Archive, Large Language Model Training, Text Mining, Entity Resolution
- Application Context
- Generative AI Training, Off-Model SEO, Prompt Research
- Broader Context
- Web Data Infrastructure (Field), Large-Scale Data Engineering (Field)
Common Crawl Web Corpus: Classification Metadata
- entity_id
- common-crawl-web-corpus
- canonical_name
- Common Crawl Web Corpus
- entity_class
- Dataset
- dataset_type
- Web Crawl Archive
- maintainer
- Common Crawl Foundation
- first_release
- 2008
- update_frequency
- Monthly Snapshots
- language
- mul (multilingual)
- domain
- Web Crawl Data, Large-Scale Data Engineering
- classification_confidence
- 0.99
- top_ambiguities
- Confusion with the Common Crawl Foundation as an organization, confusion with the Common Crawl API as a service, confusion with search engines, confusion with the Internet Archive, confusion with knowledge graph systems
- temporal_scope
- Active since 2008. Monthly snapshots. No defined end date.
- last_updated
- 2026-02-22
Common Crawl Web Corpus: Frequently Asked Questions
What is the Common Crawl Web Corpus?
The Common Crawl Web Corpus is a continuously updated web crawl dataset that provides raw data from publicly accessible web pages in structured archive formats (WARC, WAT, WET, CDX). It is maintained by the Common Crawl Foundation and has been publicly available since 2008.
In what formats is the data available?
The crawl data is provided in four formats: WARC files contain the complete HTTP responses and crawl metadata. WAT files contain extracted metadata in JSON format. WET files contain extracted plain text only. CDX files serve as index files for navigating within the archives.
How often is the dataset updated?
The Common Crawl Foundation publishes monthly snapshots. Each snapshot is a standalone dataset named in the format CC-MAIN-YYYY-WW (e.g., CC-MAIN-2025-47), where WW is the ISO week number of the crawl. Historical snapshots remain permanently available.
How can the dataset be accessed?
The Common Crawl Web Corpus is publicly accessible via Amazon S3 through the AWS Public Data Sets program. Access requires no authentication. The Common Crawl Foundation additionally provides documentation and access tools.
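As a minimal access example, the public CDX index API at index.commoncrawl.org can be queried without authentication; the snapshot identifier below reuses the example from the versioning section.

    # Minimal sketch: look up captures of a URL in one snapshot via the
    # public CDX index API (no authentication required).
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({"url": "commoncrawl.org", "output": "json"})
    api = f"https://index.commoncrawl.org/CC-MAIN-2025-47-index?{params}"

    with urllib.request.urlopen(api) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            capture = json.loads(line)  # one JSON object per capture
            print(capture["timestamp"], capture["url"], capture["filename"])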
What is the scale of the dataset?
The total corpus comprises over 300 billion captured web pages. A single monthly snapshot contains approximately 2 to 2.5 billion web pages and roughly 350 to 400 TiB of uncompressed data. The total volume is in the petabyte range.
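A rough back-of-envelope using the midpoints of the figures above (an illustration, not an official statistic):

    # Average uncompressed capture size per page, from assumed midpoints:
    pages = 2.25e9          # ~2 to 2.5 billion pages per snapshot
    tib = 375               # ~350 to 400 TiB uncompressed per snapshot
    kib_per_page = tib * 1024**3 / pages  # TiB -> KiB
    print(round(kib_per_page))  # ~179 KiB per page on average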
Common Crawl Web Corpus: Not Identical To
- Common Crawl Foundation
- Entity Class: Organization. Domain: Non-Profit, Web Data. Key Difference: The Common Crawl Foundation is the organization that maintains and operates the dataset. The Common Crawl Web Corpus is the dataset itself. Separation Reason: An organization and its data artifact are different entities.
- Common Crawl API
- Entity Class: Service. Domain: Web Data. Key Difference: The Common Crawl API is an access service for querying the index. The Common Crawl Web Corpus is the underlying dataset. Separation Reason: An access service and the data it provides access to are different entities.
- Web Search Engines
- Entity Class: System. Domain: Search Engines. Key Difference: Search engines index web content for user queries and deliver ranked results. The Common Crawl Web Corpus is a raw data archive without ranking or query functionality. Separation Reason: A raw data archive and a search system are different entity types.
- Internet Archive / Wayback Machine
- Entity Class: Service/Dataset. Domain: Web Archiving. Key Difference: The Internet Archive stores historical versions of individual URLs with timestamps, enabling reconstruction of past web states. The Common Crawl Web Corpus stores monthly crawl snapshots of the web without versioning individual URLs. Separation Reason: A versioned URL archive and a monthly crawl snapshot are different archiving models.
- Knowledge Graph Systems
- Entity Class: System. Domain: Knowledge Representation. Key Difference: Knowledge graphs structure knowledge into entities and relations. The Common Crawl Web Corpus stores raw web data without semantic structuring. Separation Reason: Raw data and structured knowledge are different data forms.
Common Crawl Web Corpus: References
- Official Website
- Common Crawl Foundation
- Access Portal
- Common Crawl: The Data
- AWS Open Data Registry
- Common Crawl on AWS
- Terms of Use
- Common Crawl Terms of Use
- Blog
- Common Crawl Blog
- Analysis Tool (external)
- Common Crawl Decoder
- Industry Context
- Web Data, NLP, Large-Scale Data Engineering, Entity Resolution, Off-Model SEO