The WARC is cached in a directory specified by the cache argument to ccwarcs_options.
get_warc(filename, offset, length, digest, include_headers = FALSE, .options = NULL)
Argument | Description |
---|---|
filename | AWS path to the WARC |
offset | Starting byte offset for the chunk |
length | Number of bytes in the chunk |
digest | Common Crawl digest for requested chunk |
include_headers | If TRUE, include the WARC and HTTP headers in the result. See below. |
.options | An optional object of class ccwarcs_options |
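
The offset and length arguments together describe one contiguous byte range inside the WARC file. The sketch below is not part of the package; it only shows, in base R, how such a pair would map onto an inclusive HTTP Range header value. byte_range is a hypothetical helper, and whether get_warc issues a request of this form internally is an assumption.

```r
# Hypothetical helper (not part of ccwarcs): turn an offset/length pair
# into an inclusive HTTP Range header value.
byte_range <- function(offset, length) {
  stopifnot(offset >= 0, length > 0)
  # "%.0f" is used instead of "%d" so offsets beyond 2^31 - 1 still format
  sprintf("bytes=%.0f-%.0f", offset, offset + length - 1)
}

# Values taken from the example call to get_warc
byte_range(offset = 938342027, length = 3019)
```

The last byte is offset + length - 1 because both ends of a Range header are inclusive.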
HTML contents of the requested WARC, and optionally the WARC and HTTP headers. If include_headers = TRUE, the result is a list with elements headers and payload. Otherwise the result is a character string (vector) containing the HTML of the requested WARC.
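
Because the return type depends on include_headers, code that accepts either form has to branch on it. A minimal sketch, assuming only the structure described above: warc_payload is a hypothetical helper, and the two example values are made-up stand-ins rather than real get_warc output.

```r
# Hypothetical helper (not part of ccwarcs): return the HTML payload
# whether or not the headers were requested.
warc_payload <- function(x) {
  if (is.list(x)) x[["payload"]] else x
}

# Made-up stand-ins for the two possible result shapes
with_headers <- list(headers = "WARC/1.0 ...", payload = "<html></html>")
without_headers <- "<html></html>"

# Both shapes yield the same payload string
identical(warc_payload(with_headers), warc_payload(without_headers))
```

Accessing the payload by name rather than by position keeps the code independent of the order of the list elements.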
if (FALSE) {
  warc <- get_warc(
    filename = paste0(
      "crawl-data/CC-MAIN-2019-35/segments/1566027317516.88/",
      "warc/CC-MAIN-20190822215308-20190823001308-00434.warc.gz"
    ),
    offset = 938342027,
    length = 3019,
    digest = "ILB6S7TS5WMLJVJIUBRQA53XRK2I3DN7",
    include_headers = TRUE
  )
  cat(warc[[1]])
  cat(warc[[2]])
}