After checking all the prerequisites, you can finally start writing the code.
install.packages('rvest')
Place the cursor at the end of the line and press the “Run” button above the code editor. The console will show the progress of the package installation.
The installation only needs to happen once, so you can now comment out or delete the previous line:
#install.packages('rvest')
Now you have to load (or import) the library:
library(rvest)
I will use the read_html function to send a GET request to the target website and download the needed HTML document:
book_html <- read_html("https://www.goodreads.com/book/show/61439040-1984")
The result is now stored in the book_html variable, which you can inspect by typing book_html in the console.
If you need at any moment to check out the official documentation for a function you want to use, type in the console:
help(function_name)
RStudio will open an HTTP server with a direct link to the docs. For read_html the output will be:
To get the reviews list, I will use the html_elements function. It will receive as input the CSS selector I found earlier:
reviews <- book_html %>% html_elements('div.review')
The result will be a list of XML nodes, which I will iterate over to get the date and the rating of each review.
R programmers use the pipe operator “%>%” to make code easier to read. It passes the value of the left operand as the first argument of the function on the right.
You can chain several operations this way (as you will see later in this guide), which spares you a lot of intermediate local variables. The previous line of code written without the pipe operator would look like this:
reviews <- html_elements(book_html, 'div.review')
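To see that the two forms are equivalent outside of scraping, here is a minimal sketch with plain numbers. It assumes the magrittr package is available (it is installed as a dependency of rvest, which re-exports the same operator); the variable names are made up for illustration.

```r
library(magrittr)  # provides %>%; loading rvest makes the same operator available

values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# Piped calls are read from left to right
piped <- values %>% sqrt() %>% sum()

# Both expressions compute the same number
identical(nested, piped)
```

The piped version needs no intermediate variable for the square roots, which is the saving the paragraph above refers to.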
To gather the data, I will initialize two vectors outside the loop. By taking a quick look at the website, I can guarantee that both vectors will have the same length.
dates <- vector()
ratings <- vector()
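As a side note, vector() with no arguments creates an empty logical vector, and c() converts it to the type of whatever gets appended. The short sketch below shows that behavior with a made-up date string; the typed_dates and typed_ratings names are hypothetical and only illustrate an alternative that states the intended type up front.

```r
dates <- vector()
class(dates)    # an empty vector() starts out as logical

# Appending a string converts the whole vector to character
dates <- c(dates, "Jan 05, 2023")
class(dates)

# Alternative: declare the intended types explicitly
typed_dates <- character()
typed_ratings <- integer()
```

Either form works with the loop; the typed constructors just make the expected content clearer to the reader.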
Now, while iterating through the reviews list, I look for two values: date and rating. As you saw before, the date is an anchor element that has the reviewDate class.
The rating is a span element with the staticStars class, and it contains five span elements, one for each star. If the user awarded a star, the corresponding span element has the p10 class name, while the rest have the p0 class name.
The code will look like this:
for (review in reviews) {
  # The date is the text of the anchor with the reviewDate class
  review_date <- review %>% html_element('a.reviewDate') %>% html_text()
  dates <- c(dates, review_date)

  # The rating is the number of filled stars (span.p10) inside span.staticStars
  review_rating_element <- review %>% html_element('span.staticStars')
  valid_stars <- review_rating_element %>% html_elements('span.p10')
  review_rating <- length(valid_stars)
  ratings <- c(ratings, review_rating)
}
Note the html_element function; it is not a typo. You can use html_elements when you want to extract a list of XML nodes and html_element for a single one.
In this case, I applied the latter for a smaller section of the HTML document (a review). I also used the html_text function to help me get the text content of the element I found.
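The difference between the two functions is easier to see on a tiny document. The sketch below parses a hand-written HTML string instead of the live page; the markup, the date and the a.noSuchClass selector are made up for illustration and only mimic the class names described above.

```r
library(rvest)

# Hypothetical markup, written only for this illustration
page <- read_html('<div class="review"><a class="reviewDate">Jan 05, 2023</a><span class="p10"></span><span class="p10"></span><span class="p0"></span></div>')

review <- page %>% html_element("div.review")

# html_elements returns every match, so its length is the number of filled stars
stars <- review %>% html_elements("span.p10")
length(stars)

# html_element returns only the first match
date_text <- review %>% html_element("a.reviewDate") %>% html_text()

# When nothing matches, html_element yields a missing node and html_text gives NA
missing_text <- review %>% html_element("a.noSuchClass") %>% html_text()
is.na(missing_text)
```

Because a missing match turns into NA rather than an error, a review without a date would still add one entry to the vector, which keeps the two vectors aligned.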
Finally, I will merge the two vectors in a single data frame to centralize the data:
result <- data.frame(date = dates, rating = ratings)
And the final result will look like this:
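The exact rows depend on what the live page returns. To try the whole flow without a network request, the sketch below runs the same loop against a small hand-written HTML snippet. The markup and its dates are made up for illustration and only assume the class names described earlier (review, reviewDate, staticStars, p10, p0).

```r
library(rvest)

# Hypothetical stand-in for the downloaded page: two reviews with 4 and 2 stars
sample_html <- paste0(
  '<div class="review"><a class="reviewDate">Jan 05, 2023</a>',
  '<span class="staticStars"><span class="p10"></span><span class="p10"></span>',
  '<span class="p10"></span><span class="p10"></span><span class="p0"></span></span></div>',
  '<div class="review"><a class="reviewDate">Feb 11, 2023</a>',
  '<span class="staticStars"><span class="p10"></span><span class="p10"></span>',
  '<span class="p0"></span><span class="p0"></span><span class="p0"></span></span></div>'
)

book_html <- read_html(sample_html)
reviews <- book_html %>% html_elements("div.review")

dates <- vector()
ratings <- vector()

for (review in reviews) {
  # Text of the date anchor
  dates <- c(dates, review %>% html_element("a.reviewDate") %>% html_text())
  # Number of filled stars inside the rating element
  filled <- review %>% html_element("span.staticStars") %>% html_elements("span.p10")
  ratings <- c(ratings, length(filled))
}

result <- data.frame(date = dates, rating = ratings)
print(result)
```

Swapping sample_html for the real URL gives back the scraper from this guide, provided the site still uses the same class names.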