The list of WARCs is cached in the directory specified by the cache argument to ccwarcs_options.

get_cc_index(urls, crawls, .options = NULL)

Arguments

urls

A vector of URLs of captured pages, allowing * as a wildcard character

crawls

A vector of IDs of Common Crawl (CC) crawls to search

Values in crawls are typically character strings in the format YYYY-ww, e.g. 2018-47 for the crawl published in the 47th week of 2018. See https://index.commoncrawl.org/ for a list of crawls, and cdx_fetch_list_of_crawls for programmatic access to this list.

.options

An optional object of class ccwarcs_options

Value

A tibble

Examples

# not run:
# url <- "http://www.celebuzz.com/2017-01-04"
# crawl <- "2018-47"
# results <- get_cc_index(url, crawl)
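
Because crawls expects IDs in the YYYY-ww format described under Arguments, it can help to check the strings before querying the index. The following is a minimal base-R sketch; is_crawl_id is a hypothetical helper written for this illustration and is not part of ccwarcs. It checks only the shape of the string (four digits, a hyphen, two digits), not whether such a crawl actually exists.

```r
# Hypothetical helper (not part of ccwarcs): TRUE for strings shaped like
# "YYYY-ww", i.e. four digits, a hyphen, then two digits.
is_crawl_id <- function(x) {
  grepl("^[0-9]{4}-[0-9]{2}$", x)
}

crawls <- c("2018-47", "2018-43", "2018-5")

# Keep only the well-formed IDs; "2018-5" is dropped because the week
# part is not zero-padded to two digits.
valid_crawls <- crawls[is_crawl_id(crawls)]
print(valid_crawls)
```

A vector filtered this way could then be passed as the crawls argument of get_cc_index. To confirm that a well-formed ID refers to a published crawl, compare it against the list at https://index.commoncrawl.org/.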