Provides access to Common Crawl WARC files via Amazon Web Services.

Installation

You can install the development version of ccwarcs from GitHub with:

Prerequisites

This package requires an Amazon Web Services (AWS) account and the appropriate security credentials; it will not work without them.

To obtain an AWS account, open the AWS home page and then click Sign Up.

After you have created an AWS account, use the Identity and Access Management (IAM) console to create a user. On the Permissions tab for the new user, grant it the AmazonS3ReadOnlyAccess policy. On the Security credentials tab, create a new access key.

To make the new security credentials available to the ccwarcs package, you will typically take one of two actions:

  1. Set the environment variables (‘AWS_ACCESS_KEY_ID’, ‘AWS_SECRET_ACCESS_KEY’, ‘AWS_DEFAULT_REGION’, and ‘AWS_SESSION_TOKEN’)

  2. Install and configure the AWS Command Line Interface

Additional options are documented at the home page for the aws.signature package.
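As a minimal sketch of the first option, the environment variables can be set for the current R session with base R's Sys.setenv. The key values below are placeholders, not real credentials. The region is an assumption (Common Crawl's public bucket is hosted in us-east-1), and AWS_SESSION_TOKEN is left out because it is only needed for temporary credentials.

```r
# Placeholder values: substitute the access key created in the IAM console.
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "YOUR_ACCESS_KEY_ID",
  AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY",
  AWS_DEFAULT_REGION    = "us-east-1"  # assumed region, adjust if needed
)

# Confirm which of the expected variables are set, without printing secrets.
aws_vars <- c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")
vars_set <- nzchar(Sys.getenv(aws_vars))
names(vars_set) <- aws_vars
print(vars_set)
```

Values set this way last only for the current session. Placing the same variables in ~/.Renviron keeps them out of scripts and makes them available every time R starts.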

Using ccwarcs

Load the library.

library(ccwarcs)

Test whether AWS credentials can be located.
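One way to run this check by hand, assuming the aws.signature package is installed, is its locate_credentials() function, which looks for credentials in the environment variables and the AWS configuration files. This is a sketch of an equivalent check, not necessarily the call that ccwarcs itself provides; the variable credentials_found is introduced here for illustration.

```r
# Sketch: look up AWS credentials with aws.signature, if it is installed.
if (requireNamespace("aws.signature", quietly = TRUE)) {
  creds <- aws.signature::locate_credentials()
  # Report only whether a key and secret were found; never print the values.
  credentials_found <- !is.null(creds$key) && !is.null(creds$secret)
} else {
  # aws.signature is not available, so the check cannot be performed.
  credentials_found <- NA
}
print(credentials_found)
```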

Create a new ccwarcs_options object with default values. If the directory ~/.ccwarcs_cache does not already exist, the package will ask if you would like to create it.

The cdx_sleep option determines the number of seconds to wait between calls to the Common Crawl Index Server. Values smaller than the default of 0.3 may cause the index server to refuse your requests.
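The effect of such a delay can be illustrated with a small base R sketch. The helper fetch_throttled and the stand-in fake_fetch are hypothetical names used only for this example; they are not part of ccwarcs and do not contact the index server.

```r
# Hypothetical helper: call `fetch` on each query, pausing `sleep` seconds
# between consecutive calls (no pause before the first one).
fetch_throttled <- function(queries, fetch, sleep = 0.3) {
  results <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    if (i > 1) Sys.sleep(sleep)
    results[[i]] <- fetch(queries[[i]])
  }
  results
}

# Stand-in for a real index request, so the sketch runs offline.
fake_fetch <- function(q) paste("result for", q)

# Three queries with a 0.3 second pause give two pauses in total.
elapsed <- system.time(
  out <- fetch_throttled(c("a", "b", "c"), fake_fetch, sleep = 0.3)
)[["elapsed"]]
print(elapsed)
```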

The cache option determines where responses from the index server and downloaded web page archives are cached on the local file system. Querying the index server and downloading WARCs over the internet are both slow, so the ccwarcs package caches the results of both to minimize repeat downloads.
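The general idea of an on-disk cache can be sketched in base R as follows. This is an illustration under assumed names (cached_fetch, slow_download, and the demo directory are hypothetical), not the caching code used inside ccwarcs.

```r
# Hypothetical helper: return a cached result if one exists on disk,
# otherwise call `fetch`, store its result, and return it.
cached_fetch <- function(key, fetch, cache_dir) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  # Replace characters that are unsafe in file names.
  file_name <- paste0(gsub("[^A-Za-z0-9._-]", "_", key), ".rds")
  path <- file.path(cache_dir, file_name)
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- fetch(key)
  saveRDS(value, path)
  value
}

# Count how often the (stand-in) download actually runs.
n_downloads <- 0
slow_download <- function(key) {
  n_downloads <<- n_downloads + 1
  paste("contents of", key)
}

# Start from an empty demo cache inside the session's temporary directory.
cache_dir <- file.path(tempdir(), "ccwarcs_cache_demo")
unlink(cache_dir, recursive = TRUE)

first  <- cached_fetch("r-project.org", slow_download, cache_dir)
second <- cached_fetch("r-project.org", slow_download, cache_dir)
print(n_downloads)
```

The second call is answered from the file written by the first, so the stand-in download runs only once.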

The function cdx_fetch_list_of_crawls will retrieve the current, complete list of available crawl archives. Avoid calling this function too often.

Next, retrieve index information about the URL r-project.org from the 2019-35 crawl.

cc_index <- 
  get_cc_index(urls = "r-project.org", crawls = "2019-35", .options = opts)

The data returned corresponds to the following call to the index server: https://index.commoncrawl.org/CC-MAIN-2019-35-index?url=r-project.org&output=json
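To make the structure of that URL explicit, the sketch below assembles it from the crawl identifier and the URL pattern using base R. The helper cdx_query_url is a hypothetical name used for illustration and is not part of ccwarcs.

```r
# Build the index server URL for a given crawl and URL pattern.
# `cdx_query_url` is a hypothetical helper, not part of ccwarcs.
cdx_query_url <- function(url, crawl) {
  paste0(
    "https://index.commoncrawl.org/CC-MAIN-", crawl, "-index",
    "?url=", utils::URLencode(url, reserved = TRUE),
    "&output=json"
  )
}

query <- cdx_query_url("r-project.org", "2019-35")
print(query)
```

With output=json, the index server responds with one JSON object per line, each describing one archived capture.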

The index information is held in a data frame.

A web page archive (WARC) is uniquely described by the values in four columns.

Choose one of the archived versions of the r-project.org web page to retrieve from the archive.

Use the get_warc function to obtain the archived web page.

The web page archive can be converted into an HTML object and manipulated using the rvest package.
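As a rough sketch of that last step, the example below parses an HTML string with rvest and extracts the page title and link targets. The string html_string is a made-up stand-in for the HTML payload of a retrieved archive, and the sketch assumes rvest 1.0.0 or later (for html_element and html_elements).

```r
# Stand-in for the HTML payload of a retrieved archive; in practice this
# string would come from the WARC record rather than being typed in.
html_string <- "<html><head><title>The R Project</title></head>
<body><a href='https://cran.r-project.org'>CRAN</a></body></html>"

if (requireNamespace("rvest", quietly = TRUE)) {
  # Parse the HTML, then pull out the title text and all link targets.
  page       <- rvest::read_html(html_string)
  page_title <- rvest::html_text(rvest::html_element(page, "title"))
  page_links <- rvest::html_attr(rvest::html_elements(page, "a"), "href")
  print(page_title)
  print(page_links)
}
```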

Contributing

Please note that the ‘ccwarcs’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.