The WARC is cached in the directory specified by the cache argument to ccwarcs_options.

get_warc(filename, offset, length, digest, include_headers = FALSE,
  .options = NULL)

Arguments

filename

AWS path to the WARC

offset

Starting byte offset for the chunk

length

Number of bytes in the chunk

digest

Common Crawl digest for the requested chunk

include_headers

If TRUE, include the WARC and HTTP headers in the result. See below.

.options

An optional object of class ccwarcs_options
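
As a rough illustration of how offset and length together describe a chunk, the sketch below converts them into an inclusive byte range of the kind used in an HTTP Range header. The helper name byte_range is hypothetical and not part of the package, and it is only an assumption that get_warc builds its request this way internally.

```r
# Hypothetical helper (not part of the package): turn a starting offset and a
# byte count into an inclusive HTTP-style byte range.
byte_range <- function(offset, length) {
  stopifnot(offset >= 0, length > 0)
  # The range is inclusive, so the last byte is offset + length - 1.
  last <- offset + length - 1
  # "%.0f" avoids scientific notation and integer overflow for large offsets.
  sprintf("bytes=%.0f-%.0f", offset, last)
}

byte_range(offset = 938342027, length = 3019)
```

With the offset and length used in the example on this page, the range runs from byte 938342027 to byte 938345045.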

Value

HTML contents of the requested WARC, and optionally the WARC and HTTP headers. If include_headers = TRUE, the result is a list with elements headers and payload. Otherwise, the result is a character vector containing the HTML of the requested WARC.
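
Because the shape of the result depends on include_headers, downstream code may want to handle both cases uniformly. The following is a minimal sketch, assuming the list elements are named headers and payload as described above. The helper warc_payload and the mock values are hypothetical and made up for illustration.

```r
# Hypothetical helper (not part of the package): return the HTML payload
# whether or not the result was requested with include_headers = TRUE.
warc_payload <- function(result) {
  if (is.list(result)) {
    # include_headers = TRUE: a list with elements `headers` and `payload`
    result$payload
  } else {
    # include_headers = FALSE: the HTML itself
    result
  }
}

# Mock values standing in for real get_warc() results (made up for illustration)
with_headers <- list(headers = "WARC/1.0", payload = "<html></html>")
without_headers <- "<html></html>"

# Both shapes yield the same payload
identical(warc_payload(with_headers), warc_payload(without_headers))
```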

Examples

if (FALSE) {
  warc <- get_warc(
    filename = paste0(
      "crawl-data/CC-MAIN-2019-35/segments/1566027317516.88/",
      "warc/CC-MAIN-20190822215308-20190823001308-00434.warc.gz"
    ),
    offset = 938342027,
    length = 3019,
    digest = "ILB6S7TS5WMLJVJIUBRQA53XRK2I3DN7",
    include_headers = TRUE
  )
  cat(warc[[1]])
  cat(warc[[2]])
}