Get a WARC from the Common Crawl via AWS

The WARC is cached in the directory specified by the cache argument to ccwarcs_options.

get_warc(filename, offset, length, digest, include_headers = FALSE,
  .options = NULL)

Arguments

filename	AWS path to the WARC
offset	Starting byte offset for the chunk
length	Number of bytes in the chunk
digest	Common Crawl digest for requested chunk
include_headers	If TRUE, include the WARC and HTTP headers in the result. See Value.
.options	An optional object of class ccwarcs_options
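
The offset and length arguments identify one record inside a much larger WARC file. How get_warc retrieves that record is not documented here; the sketch below only illustrates what the two numbers mean, assuming the record is requested with a standard HTTP Range header. The helper byte_range_header is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): build the HTTP Range header
# value covering `length` bytes starting at byte `offset`.
# HTTP byte ranges are inclusive at both ends, hence the "- 1".
byte_range_header <- function(offset, length) {
  stopifnot(is.numeric(offset), is.numeric(length), offset >= 0, length > 0)
  # "%.0f" prints doubles without scientific notation, so offsets larger
  # than the 32-bit integer limit are formatted correctly.
  sprintf("bytes=%.0f-%.0f", offset, offset + length - 1)
}

# Values taken from the example on this page.
range_value <- byte_range_header(offset = 938342027, length = 3019)
print(range_value)
```

With the values from the example on this page, the range covers bytes 938342027 through 938345045.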

Value

The HTML contents of the requested WARC, and optionally its WARC and HTTP headers. If include_headers = TRUE, the result is a list with elements headers and payload. Otherwise, the result is a character string containing the HTML of the requested WARC.
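
Because the return type depends on include_headers, downstream code may want a single way to reach the HTML. The following is a minimal sketch using stand-in values rather than a real download; warc_payload is a hypothetical helper, and the mock list simply mirrors the element names described above.

```r
# Hypothetical helper (not part of the package): return the HTML payload
# whether or not headers were requested.
warc_payload <- function(x) {
  if (is.list(x)) x$payload else x
}

# Stand-in values that mimic the two documented return shapes.
with_headers <- list(headers = "WARC/1.0", payload = "<html></html>")
without_headers <- "<html></html>"

same_payload <- identical(warc_payload(with_headers),
                          warc_payload(without_headers))
print(same_payload)
```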

Examples

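if (FALSE) {
  # Request a single record, keeping the WARC and HTTP headers.
  warc <- get_warc(
    filename = paste0("crawl-data/CC-MAIN-2019-35/segments/1566027317516.88/",
                      "warc/CC-MAIN-20190822215308-20190823001308-00434.warc.gz"),
    offset = 938342027, length = 3019,
    digest = "ILB6S7TS5WMLJVJIUBRQA53XRK2I3DN7",
    include_headers = TRUE
  )
  # With include_headers = TRUE the result is a list with elements
  # headers and payload.
  cat(warc$headers)
  cat(warc$payload)
}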
