
Using Website Metadata for URL Previews in Rmarkdown

Have you ever noticed those fancy website previews that pop up in platforms like Twitter or Slack when you include a link? How does the platform know exactly what to display?


Recently I was creating a list of R resources and thought it might be interesting to replicate the website previews seen when including links in Twitter or Slack posts. If I am honest, a list of just URLs is neither interesting nor informative, and I did not want to spend all day looking up extra information about each resource.

Most modern websites include a large amount of metadata embedded in the <head></head> tag. Where this metadata complies with the Open Graph Protocol, it is used to generate website previews, turning websites into data-rich documents. For example, if you right-click on this page and select “Inspect”, you will find the following HTML <meta> tags:


<meta property='og:title' content='Using Website Metadata for URL Previews in Rmarkdown'>
<meta property='og:description' content='Have you ever noticed those fancy website previews that popup in some platforms like 
twitter or slack when you include a link. How does the platform know exactly what 
to display?
'>
<meta property='og:site_name' content='Restless Data'>
<meta property='og:type' content='article'>

The Open Graph tags are those where the property attribute starts with “og:”. There are also specific tags for Twitter and Facebook, and we will read all of them. To use this metadata for our own previews within an R Markdown document we will need to do a little web scraping and generate some HTML. Let’s start with the web scraping.

Extracting Metadata

The rvest package makes it really easy to import the content of a web page, simply using read_html(url) |> html_element('head'). However, in this case we only need the metadata, so there is no need to read the entire page. The readLines() function can limit the number of lines read into R, and if you run it within a while loop you can automatically stop reading once you find the end of the header. In general this is not a particularly efficient way to read large files, but it is very fast here as we mostly need to read fewer than 100 lines.

Let’s use the Open Graph Protocol website as an example:

page <- file('https://ogp.me/', open = "r")

continue <- TRUE
doc <- character()

while(continue){
  l <- readLines(page, n = 1)
  # stop at </head>, or at end of file (readLines returns character(0))
  continue <- length(l) > 0 && !grepl('</head>', x = l, ignore.case = TRUE)
  doc <- c(doc, l)
}

close(page)

doc
##  [1] "<!DOCTYPE html>"                                                                                                                           
##  [2] "<html>"                                                                                                                                    
##  [3] "  <head prefix=\"og: https://ogp.me/ns#\">"                                                                                                
##  [4] "    <meta charset=\"utf-8\">"                                                                                                              
##  [5] "    <title>The Open Graph protocol</title>"                                                                                                
##  [6] "    <meta name=\"description\" content=\"The Open Graph protocol enables any web page to become a rich object in a social graph.\">"       
##  [7] "    <script type=\"text/javascript\">var _sf_startpt=(new Date()).getTime()</script>"                                                      
##  [8] "    <link rel=\"stylesheet\" href=\"base.css\" type=\"text/css\">"                                                                         
##  [9] "    <meta property=\"og:title\" content=\"Open Graph protocol\">"                                                                          
## [10] "    <meta property=\"og:type\" content=\"website\">"                                                                                       
## [11] "    <meta property=\"og:url\" content=\"https://ogp.me/\">"                                                                                
## [12] "    <meta property=\"og:image\" content=\"https://ogp.me/logo.png\">"                                                                      
## [13] "    <meta property=\"og:image:type\" content=\"image/png\">"                                                                               
## [14] "    <meta property=\"og:image:width\" content=\"300\">"                                                                                    
## [15] "    <meta property=\"og:image:height\" content=\"300\">"                                                                                   
## [16] "    <meta property=\"og:image:alt\" content=\"The Open Graph logo\">"                                                                      
## [17] "    <meta property=\"og:description\" content=\"The Open Graph protocol enables any web page to become a rich object in a social graph.\">"
## [18] "    <meta prefix=\"fb: https://ogp.me/ns/fb#\" property=\"fb:app_id\" content=\"115190258555800\">"                                        
## [19] "    <link rel=\"alternate\" type=\"application/rdf+xml\" href=\"https://ogp.me/ns/ogp.me.rdf\">"                                           
## [20] "    <link rel=\"alternate\" type=\"text/turtle\" href=\"https://ogp.me/ns/ogp.me.ttl\">"                                                   
## [21] "  </head>"

Now we can use read_html() to read the text and parse the tags; we want all of the meta tags plus the title tag. Wrapping the lines in paste0() produces a character vector of length 1, which read_html() requires, and ensures all the tags’ start and end points line up. The function html_elements() selects only the nodes we are interested in. Some websites don’t have a meta tag for the website title, so we select the title tag just in case.

library(rvest)

nodes <- doc |>
  paste0(collapse = '') |>
  read_html() |>
  html_elements('meta, title')
nodes[1:7]
## {xml_nodeset (7)}
## [1] <meta charset="utf-8">\n
## [2] <title>The Open Graph protocol</title>\n
## [3] <meta name="description" content="The Open Graph protocol enables any web ...
## [4] <meta property="og:title" content="Open Graph protocol">\n
## [5] <meta property="og:type" content="website">\n
## [6] <meta property="og:url" content="https://ogp.me/">\n
## [7] <meta property="og:image" content="https://ogp.me/logo.png">\n

Now that we have all of the nodes, we can extract the parts of the tag with the information we need. The nodeset is a list so using the purrr package makes this extraction a lot easier. But you could also use lapply or sapply. As an end result we need to be able to lookup a metadata label and return the value. While this can easily be done in a data.frame, for lookups I find it a little clearer to convert to a list.

library(purrr)
library(dplyr)

metadata <- map_dfr(c('property', 'name'), function(tag){
  nodes |>
    map_dfr(~ tibble(property = html_attr(.x, tag),
                     content = html_attr(.x, 'content'))) |>
    filter(!is.na(property))
})

metadata <- setNames(as.list(metadata$content), metadata$property)
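To see why the list form is convenient for lookups, here is a minimal hand-built example (the list `meta` and its values are illustrative, mirroring the ogp.me header scraped above): accessing by metadata label returns the content directly.

```r
# A hand-built metadata list (example values) to illustrate the lookup
meta <- list(`og:title` = 'Open Graph protocol',
             `og:type`  = 'website',
             `og:url`   = 'https://ogp.me/')

# Lookup by metadata label; backticks are needed for names containing ':'
meta$`og:title`
meta[['og:type']]
```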

Now that everything is working, we can wrap it up into a function that takes any URL and returns a metadata lookup list. In the function below I also pull out the title tag, which can be used if og:title is missing.

read_meta <- function(url){
  
  page <- file(url, open = "r")
  continue <- TRUE
  doc <- character()
  
  while(continue){
    l <- readLines(page, n = 1)
    # stop at </head>, or at end of file (readLines returns character(0))
    continue <- length(l) > 0 && !grepl('</head>', x = l, ignore.case = TRUE)
    doc <- c(doc, l)
  }

  close(page)
  
  nodes <- doc |>
    paste0(collapse = '') |>
    read_html() |>
    html_elements('meta, title')
  
  metadata <- map_dfr(c('property', 'name'), function(tag){
    nodes |>
      map_dfr(~ tibble(property = html_attr(.x, tag),
                       content = html_attr(.x, 'content'))) |>
      filter(!is.na(property))
  })
  
  # check for a title tag
  title <- nodes |> html_elements(xpath = "/html/head/title") |> html_text()
  title_tag <- if ('og:title' %in% metadata$property) 'title' else 'og:title'
  
  if (length(title) > 0) {
    metadata <- metadata |>
      add_row(property = title_tag,
              content = title)
  }
  
  metadata <- setNames(as.list(metadata$content), metadata$property)
  
  return(metadata)
  
}
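Because read_meta() adds the plain title under whichever key is free, a consumer can resolve a display title with a simple preference order. The helper below is a hypothetical sketch (not part of the post’s pipeline) showing that fallback logic on its own:

```r
# Hypothetical helper: prefer og:title, fall back to the plain <title> text
get_display_title <- function(metadata){
  if (!is.null(metadata$`og:title`)) metadata$`og:title` else metadata$title
}

get_display_title(list(`og:title` = 'Open Graph protocol'))  # og:title present
get_display_title(list(title = 'The Open Graph protocol'))   # fallback
```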

Render a Website Preview Card

Adding a web component like a card is now quite easy thanks to CSS frameworks like Bootstrap. Below is a simple card created using htmltools tags with predefined styles from Bootstrap. Apart from importing the Bootstrap CSS library, no other CSS is needed for this card.

library(htmltools)

card <- function(metadata){
  div(class = 'card text-white bg-secondary mb-3 border-light border-3',
      div(style = 'display: inline-grid; grid-template-columns: 55% 40%; padding-left: 12px',
          div(class = 'card-body text-dark',
              h2(class = 'card-title',
                 a(href = metadata$`og:url`,
                   metadata$`og:title`)),
              p(class = 'card-text text-secondary',
                metadata$`og:description`),
              p(class = 'card-text text-secondary',
                if (!is.null(metadata$author)) {
                  div('Author(s):', br(), metadata$author)
                })),
          div(if (!is.null(metadata$`og:image`)) {
                img(class = 'h-100',
                    style = 'padding: 20px;',
                    src = metadata$`og:image`,
                    alt = metadata$`og:image:alt`)
              } else '')))
}
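If you are unfamiliar with htmltools, each tag function simply builds a tag object that renders to an HTML string, so no templating is needed. A minimal sketch (with made-up classes and text):

```r
library(htmltools)

# Each htmltools tag function returns a tag object; nested calls nest the HTML
tag <- div(class = 'card',
           h2(class = 'card-title', 'Example'),
           p(class = 'card-text', 'A description'))

# as.character() renders the tag tree to an HTML string
as.character(tag)
```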

Before calling the card function we just need to do a little cleaning up of the metadata. First, where multiple authors are listed in the metadata, we collapse them into a single character string to fit in our author div. Second, if the og:image:alt tag is missing, we substitute something meaningful in its place.

card_render <- function(metadata){
  
  # collapse multiple authors into a single string
  if (!is.null(metadata$author) && length(metadata$author) > 1) {
    metadata$author <- metadata$author |> paste(collapse = ', ')
  }
  
  # fall back to a meaningful alt text when og:image:alt is missing
  if (is.null(metadata$`og:image:alt`)) {
    metadata$`og:image:alt` <- paste0('Cover Image for ', metadata$`og:title`)
  }

  card(metadata)
}

Finally, we render the card. With the checks we have added around metadata titles, authors and alternate image text, we should be able to produce a basic summary for any URL, from an R book to a funny movie.

read_meta('https://bookdown.org/yihui/rmarkdown/') |>
  card_render()

R Markdown: The Definitive Guide

The first official book authored by the core R Markdown developers that provides a comprehensive and accurate reference to the R Markdown ecosystem. With R Markdown, you can easily create reproducible data analysis reports, presentations, dashboards, interactive applications, books, dissertations, websites, and journal articles, while enjoying the simplicity of Markdown and the great power of R and other languages.

Author(s):
Yihui Xie, J. J. Allaire, Garrett Grolemund

read_meta('https://www.imdb.com/title/tt1386588/') |>
  card_render()

The Other Guys (2010) - IMDb

The Other Guys: Directed by Adam McKay. With Will Ferrell, Derek Jeter, Mark Wahlberg, Eva Mendes. Two mismatched New York City detectives seize an opportunity to step up like the city's top cops, whom they idolize, only things don't quite go as planned.

This has just been a simple example. Most websites provide a lot of structured information via metadata within the <head> tag, so check out the Open Graph Protocol website to explore all the possibilities.
