Get a List of WARCs from Common Crawl Index Server

The list of WARCs is cached in the directory specified by the cache argument to ccwarcs_options.

get_cc_index(urls, crawls, .options = NULL)

Arguments

urls

A vector of URLs of captured pages, allowing * as a wildcard character

crawls

A vector of IDs of CC crawls to search

Values in crawls are typically character strings in the format YYYY-ww, e.g. 2018-47 for the crawl published in the 47th week of 2018. See https://index.commoncrawl.org/ for a list of crawls, and cdx_fetch_list_of_crawls for programmatic access to this list.

.options

An optional object of class ccwarcs_options
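
The YYYY-ww format described for crawls can be checked before a query is sent. The following is a minimal base R sketch, not part of the package: the helper name is_crawl_id is hypothetical, and it tests only the typical format, so it does not confirm that a crawl with that ID actually exists.

```r
# Hypothetical helper (an assumption, not a package function): TRUE when a
# value looks like "YYYY-ww", i.e. four digits, a hyphen, then two digits.
is_crawl_id <- function(x) {
  grepl("^[0-9]{4}-[0-9]{2}$", x)
}

# Illustrative inputs: only the first matches the typical format.
candidates <- c("2018-47", "2018-7", "201847")
valid <- is_crawl_id(candidates)

# Keep only the values that match before passing them on as crawls.
crawls <- candidates[valid]
```

Because the format is only typical, a check like this is best used as a warning rather than a hard failure; the list at https://index.commoncrawl.org/ remains the authority on which crawl IDs are valid.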

Value

A tibble

Examples

# not run:
# url <- "http://www.celebuzz.com/2017-01-04"
# crawl <- "2018-47"
# results <- get_cc_index(url, crawl)
