What is the Common Crawl Initiative?

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Common Crawl regularly crawls the World Wide Web and archives the pages it visits. The organization then makes those archives and data sets freely available to everyone.

According to the organization, it has collected petabytes of data over seven years of crawling the Web. In 2012 alone, it crawled 3.8 billion documents totaling more than 100 terabytes. That translates to 61 million domains, more than 92 million PDF files, around 6.6 million Word documents, and 1.3 million Excel spreadsheets.

The corpus includes metadata, raw web page content, and extracted text samples, and it is hosted on academic cloud platforms around the world as well as in the public data sets of Amazon Web Services.
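If you want a feel for what is in those data sets, you can pull a slice of the raw crawl directly over HTTP. Below is a minimal Python sketch using the requests library; the WARC path shown is a placeholder, since real paths come from the per-crawl file manifests that Common Crawl publishes on data.commoncrawl.org.

```python
# A minimal sketch of pulling a slice of Common Crawl data over HTTP.
# The crawl path below is hypothetical -- real paths are listed in the
# per-crawl manifests (e.g. warc.paths.gz) on data.commoncrawl.org.
import requests

BASE = "https://data.commoncrawl.org/"
# Hypothetical WARC path; substitute one from a real crawl manifest.
warc_path = "crawl-data/CC-MAIN-2013-20/segments/000/warc/example.warc.gz"

# Full WARC files are around a gigabyte each, so fetch only the first
# megabyte with an HTTP Range request instead of the whole archive.
resp = requests.get(BASE + warc_path, headers={"Range": "bytes=0-1048575"})
resp.raise_for_status()

with open("sample.warc.gz", "wb") as f:
    f.write(resp.content)
print(f"Fetched {len(resp.content)} bytes of gzipped WARC data")
```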

Common Crawl is registered as a non-profit in the state of California.

Common Crawl “crawls” the entire Web about four times a year. It was founded by Gil Elbaz, who is no stranger to the Web and big data: he is the CEO and founder of Factual, a big data company. Before founding Factual, Elbaz founded a company that Google acquired, helping it launch its AdWords business for Web sites and pages. Speaking of Google, Common Crawl also has Peter Norvig on board; Norvig is the search giant’s director of research.

Feeling nostalgic? Looking for pages from Friendster, Multiply, or other sites that have since shut down? Don’t worry: you can use Common Crawl’s URL Search at https://urlsearch.commoncrawl.org to retrieve the source code of your old pages.
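The same kind of lookup can be done programmatically. Here is a minimal Python sketch that assumes Common Crawl’s CDX-style index API at index.commoncrawl.org and the requests library; swap in whichever published crawl label you want to search.

```python
# A minimal sketch of looking up a URL in a Common Crawl index.
# Assumes the CDX-style index API at index.commoncrawl.org; the crawl
# label (CC-MAIN-2013-20) is one of the published crawl identifiers.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2013-20-index"

resp = requests.get(INDEX, params={"url": "example.com", "output": "json"})
resp.raise_for_status()

# The API returns one JSON object per line, one per captured page.
for line in resp.text.strip().splitlines():
    record = json.loads(line)
    # 'filename' and 'offset' locate the raw capture inside a WARC file.
    print(record["timestamp"], record["url"], record.get("filename"))
```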

That is just one of the things Common Crawl can help you do. The organization has been instrumental in helping startups and individuals conduct research, education, training, and analysis, giving them access to data that was previously available only to the big search engines. For instance, Common Crawl’s data has been used by TinEye, a reverse image search service that finds photos similar to the one you have searched for or uploaded.

You can also use Common Crawl to gain insights, such as how many Web pages link to Facebook. That is what one programmer set out to discover, and his work with Common Crawl gave him enough credibility to secure funding for Lucky Oyster, a service that helps users find more information about their social platforms and data. Other use cases that have cropped up over the years include measuring public sentiment by analyzing the emotion and content of online forums and discussions about certain topics, and building a better dictionary from Wikipedia pages.
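To get a sense of how such an analysis works in practice, here is a rough Python sketch that scans one of Common Crawl’s extracted-text (WET) files and counts pages mentioning facebook.com. It assumes the third-party warcio package (pip install warcio) and a WET file already downloaded locally; the filename is a placeholder.

```python
# A rough sketch of the "how many pages link to Facebook" style of
# analysis, run over one extracted-text (WET) file.
from warcio.archiveiterator import ArchiveIterator

total = 0
mentions = 0
with open("sample.warc.wet.gz", "rb") as stream:  # placeholder filename
    for record in ArchiveIterator(stream):
        # WET files store one plain-text 'conversion' record per page.
        if record.rec_type != "conversion":
            continue
        total += 1
        text = record.content_stream().read().decode("utf-8", errors="replace")
        if "facebook.com" in text:
            mentions += 1

print(f"{mentions} of {total} pages mention facebook.com")
```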

Indeed, with all the data Common Crawl has made available, developers and startups now have the power of big data in their hands. It is just a matter of how to process and analyze that data and what to use it for. Common Crawl offers a wealth of ideas on its examples page at https://commoncrawl.org/the-data/examples.

However, Common Crawl’s biggest potential may be that it could give Google a run for its money. The service uses a customized, open-source crawler that skips spammy sites using a basic form of PageRank, and it has already crawled the Web several times over. If the people behind Common Crawl really wanted to compete with Google, they already have the foundations to be a serious contender.
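Common Crawl has not published the details of its ranking, but as a generic illustration of the idea, here is a minimal power-iteration PageRank sketch in Python: pages that attract few inbound links, as spammy sites tend to, end up with low scores.

```python
# A generic power-iteration PageRank sketch (not Common Crawl's actual
# ranking): a page's score is fed by the scores of pages linking to it.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # Dangling page: spread its rank evenly across all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: 'spam' links out but nothing links back to it, so it
# ends up with a low score relative to 'a' and 'b'.
graph = {"a": ["b"], "b": ["a"], "spam": ["a"]}
print(pagerank(graph))
```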

Call Four Cornerstone to get the latest on technology and software that could help you.

Photo by Common Crawl.
