Load libraries.
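
The exact set of packages depends on the rest of the analysis; a typical set for this walkthrough (an assumption, not the original chunk) would be:

``` r
library(ccwarcs)   # retrieve archived pages from Common Crawl
library(dplyr)     # data manipulation (also provides %>%)
library(purrr)     # iterate over pages
library(stringr)   # string helpers
library(rvest)     # parse HTML
library(tidytext)  # tokenization and tf-idf
library(ggplot2)   # plotting
```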

Test whether AWS credentials can be located.
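
ccwarcs reads Common Crawl data from AWS, so the usual AWS credential chain (environment variables, `~/.aws/credentials`, instance metadata) must yield something. One way to check outside the package is `aws.signature::locate_credentials()`; the package may also supply its own helper for this step.

``` r
# Returns a list containing the discovered key/secret (or NULLs if none found).
creds <- aws.signature::locate_credentials()
!is.null(creds$key)
```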

Create a new ccwarcs_options object with default values. If the directory ~/.ccwarcs_cache does not already exist, the package will ask if you would like to create it.
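
A minimal sketch, assuming the constructor is called `ccwarcs_options()` and that the defaults include the cache directory mentioned above:

``` r
# Build an options object with package defaults; on first use you will be
# asked whether ~/.ccwarcs_cache should be created.
opts <- ccwarcs_options()
opts
```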

Get the current list of crawls.
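
Common Crawl publishes its crawl list at `https://index.commoncrawl.org/collinfo.json`; the package presumably wraps a request along these lines (a sketch using jsonlite, not the package's own function):

``` r
# Each crawl has an id of the form CC-MAIN-YYYY-WW plus a CDX index endpoint.
crawls <- jsonlite::fromJSON("https://index.commoncrawl.org/collinfo.json")
head(crawls$id)
```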

Search for archived articles published in 2018.
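
Behind the scenes this is a query against a crawl's CDX index. The sketch below queries one 2018 crawl directly; `www.example.com/2018/*` is a placeholder URL pattern, so substitute the site whose 2018 articles you are after (the package's own search function handles crawl selection and caching for you).

``` r
# Query the CC-MAIN-2018-13 index for captures matching a URL pattern.
query <- paste0(
  "https://index.commoncrawl.org/CC-MAIN-2018-13-index",
  "?url=", utils::URLencode("www.example.com/2018/*", reserved = TRUE),
  "&output=json"
)
# The endpoint returns one JSON object per line.
records <- readLines(query, warn = FALSE) %>%
  map_dfr(jsonlite::fromJSON)
glimpse(records)  # includes filename, offset, and length for each capture
```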

Get the HTML of the archived articles.
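
The package (together with its cache) takes care of this step; the sketch below only illustrates what is involved, assuming `records` is the index result from above. Each capture is a gzip-compressed WARC record that can be fetched by byte range from Common Crawl's public bucket, after which the WARC and HTTP headers are stripped off to leave the HTML.

``` r
fetch_html <- function(filename, offset, length) {
  offset <- as.numeric(offset)
  length <- as.numeric(length)
  resp <- httr::GET(
    paste0("https://data.commoncrawl.org/", filename),
    httr::add_headers(
      Range = sprintf("bytes=%.0f-%.0f", offset, offset + length - 1)
    )
  )
  record <- rawToChar(memDecompress(httr::content(resp, "raw"), type = "gzip"))
  # WARC header, HTTP header, and body are separated by blank lines.
  parts <- strsplit(record, "\r\n\r\n", fixed = TRUE)[[1]]
  paste(parts[-(1:2)], collapse = "\r\n\r\n")
}

pages <- pmap_chr(
  records[, c("filename", "offset", "length")],
  ~ fetch_html(..1, ..2, ..3)
)
```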

If you would like to look at the contents of the archived HTML, you can do something like the following. The code below saves the HTML to a temporary file, then opens it in a web browser.
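
Assuming `pages` holds the archived HTML as character strings (as in the sketch above):

``` r
tmp <- tempfile(fileext = ".html")   # temporary .html file
writeLines(pages[[1]], tmp)          # write the first archived page
browseURL(tmp)                       # open it in the default browser
```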

Looking at one of the web pages (using the code above), determine which part of the page you want to extract. For these articles, the article’s main text can be obtained with the code below.
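
The CSS selector is site-specific; `div.article-body p` below is a placeholder found by inspecting a page with the browser's developer tools, not the selector from the original analysis.

``` r
extract_text <- function(html) {
  html %>%
    read_html() %>%
    html_elements("div.article-body p") %>%  # site-specific selector
    html_text2() %>%
    paste(collapse = "\n")
}

extract_text(pages[[1]])
```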

Extract the article text for all archived HTML pages.
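
A sketch that applies the extraction to every page and keeps the text alongside the capture month (the `timestamp` column comes from the index records; its first six characters are `YYYYMM`):

``` r
articles <- records %>%
  mutate(
    text  = map_chr(pages, extract_text),
    month = substr(timestamp, 1, 6)   # e.g. "201803" for March 2018
  ) %>%
  select(url, month, text)
```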

Use the tidytext package to tokenize the articles into words. Then remove common words (stop_words) and “words” comprising only numbers.
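
With the text in a data frame, tidytext does the rest; `stop_words` is the lexicon data frame shipped with tidytext.

``` r
words <- articles %>%
  unnest_tokens(word, text) %>%             # one row per word per article
  anti_join(stop_words, by = "word") %>%    # drop common English words
  filter(!str_detect(word, "^[0-9.,]+$"))   # drop tokens that are only numbers
```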

For each month, pool all words together (so that each month is treated as one document), then calculate the term frequency-inverse document frequency (tf-idf). Visualize the top 10 words per month.
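
A sketch using `bind_tf_idf()`, with `reorder_within()` and `scale_y_reordered()` to order the bars within each facet:

``` r
monthly_tf_idf <- words %>%
  count(month, word, sort = TRUE) %>%
  bind_tf_idf(word, month, n)

monthly_tf_idf %>%
  group_by(month) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, month)) %>%
  ggplot(aes(tf_idf, word)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~ month, scales = "free_y") +
  labs(x = "tf-idf", y = NULL)
```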

For each article, calculate the tf-idf, then plot the most distinctive words across all articles published in 2018.
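
The same recipe, now treating each article (identified here by its URL) as a document:

``` r
article_tf_idf <- words %>%
  count(url, word, sort = TRUE) %>%
  bind_tf_idf(word, url, n)

article_tf_idf %>%
  slice_max(tf_idf, n = 20, with_ties = FALSE) %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(tf_idf, word)) +
  geom_col() +
  labs(x = "tf-idf", y = NULL,
       title = "Most distinctive words across the 2018 articles")
```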