datavis | Kevin Wang

Visualising COVID-19 data using Data Studio

Sat, 11 Apr 2020 00:00:00 +0000

Introduction

tl;dr: Google’s Data Studio is a great (and free) option for making data visualisation dashboards. Being a web-based dashboard application, its ability to connect to other Google products, such as BigQuery and GCP resources offers huge advantage over competing products. You can design a dashboard with minimum programming experience and the process is similar to writing PowerPoint slides. However, DS is mostly for the purpose of visualisation, so R/Shiny still has an advantage in computing and modeling. Below is my hacky 2-page report, correct up to April 2020:

I am in the process of learning how to use BigQuery (some progress are documented here in this post). When I was exploring BigQuery’s web interface, I noticed that BigQuery could be connected with Data Studio (DS), a Google product that makes dashboard/BI visualisations.

Some person comments

As I recently have delivered a lecture on Shiny, I thought I will check out how Data Studio works and how it compares with Shiny. Sure, I have used Tableau before, but I have never used this kind of data visualisation software with databases before.

Overall, I think DS is a great web-based tool for quick explorations/visualisation of the data, its ability to connect to BigQuery and other databases is very attractive. I almost can’t believe how smooth the connection was after I have tried to connect the same data into R. While many features that DS offers are not dissimilar to Microsoft Power BI and Tableau, DS is free to use! Which is quite important for, well, people like myself who just lost the privilege to claim student discounts.

The drawback of DS is also clear: it is not Shiny, so it won’t handle complex modelling/computations in the background (that being said, I am sure there is a way that you can connect it to Google Cloud products for backend computations). DS is also web-based, so connecting it with data in your local laptop isn’t quite straight forward. It is also not quite as reproducible since it is not constructed using scripts. But these shouldn’t overshadow how easy it was to use DS and share it.

If a single visualisation is all you are after, not complex modelling, then DS is definitely a faster option than Shiny with a gentler learning curve.

Starting started with a DS report

A DS report is made up of multiple pages, much like PowerPoint slides. This tutorial is quite helpful to learn the basics.

The most powerful part of DS, in my opinion, is how you can add multiple data sources, e.g. SQL databases or Google Sheets. But being part of the Google ecosystem, I think the real power lies in how DS interacts with other Google products like Google Analytics and other third-party data sources.

In the DS report above, I made connections to the COVID-19 data from Johns Hopskins University and the World Bank global population data. I couldn’t find a Google public data documentation page for the latter, but you can make a SQL query using:

SELECT *
FROM `bigquery-public-data.world_bank_global_population.population_by_country` 
LIMIT 10

Having added the these two data into the DS report, you can visualise these data using charts, such as a table on the page 1 of the report:

In DS, every chart is associated with one data source that you have added. Of course, you can put together multiple data in some meaningful way and make a visualisation. This is exactly what I have done to calculate the “confirmed cases per 1 million” statistic.

A bit more data manipulations

In order to make the heatmap at the beginning of this post, a bit more data manipulations are needed. In particular, joining the two data together using “country” as a common key. Once this is done, we can take the ratio of the confirmed cases and the population and calculate the “confirmed cases per 1 million” statistic and making a heatmap visualisation is straightforward after that.

Joining data in DS is called “blending data”, which you can find out more here. This is identical to the operation of the various JOIN operations in SQL if you choose your settings correctly.

Aside: of course, joining data isn’t the magic silver bullet for everything. If you examine closely on page 2 of the report, you will see that some countries have null values in the confirmed cases column, e.g. South Korea. This is because the name of the country was not consistent between the two counties. E.g. “South Korea” in one data and “Korea, South” in another. This has to be fixed by making a new column with matched names. The way that I did it was to create a country_correct column in the population data using the definitions such as

CASE 
    WHEN country IN ("United States") THEN "US" 
    WHEN country IN ("Russian Federation") THEN "Russia" 
    WHEN country IN ("Iran, Islamic Rep.") THEN "Iran" 
    ELSE country
END

You will probably notice that this is SQL code and quite similar to the dplyr function case_when. So I think some knowledge of SQL will be helpful when using DS.

Making the visualisations

This is where I was pleasantly surprised and where a point-and-click dashboard visualisation wins over Shiny. Since the country_region/country_correct is a geography variable, it can be directly plotted onto a world map automatically. Also importantly, linked interactivity is easily achieved between the table and the heatmap on page 2.

Aside: these tasks are possible, butnot always the easiest to do in R, plotly is the most competitive in this area (see this example and this example). Sure, I can pick up the codes in 5 minutes, but think about the learning curve behind learning ggplot/plotly framework, this could be overwhelming for people without a programming background. I can’t tell you how many hours I have spent trying to make a plot look right!

Conclusion

For the R/Shiny fanatics, put down your pitchforks, I am not going to abandon R/Shiny any time soon. Shiny is definitely great for customisation, but DS and other professional dashboard software actually do have a lot of awesome ideas. Advanced R/Shiny users can do many things that DS can do, but this is only assuming you have enough experience and know where to look for the right answers. For example:

The fact that I can readily share this report without setting up my own Shiny server or website is a great example of how Google thought about how their products will be distributed!
I am also not an artist, so I actually dislike spending hours to make my Shiny app look pretty, DS actually makes this process much easier since making a dashboard UI is quite similar to making PowerPoint slides.
Connecting to databases might be annoying and slow at times (seriously, it takes about 20 seconds for me to fetch the COVID-19 data over the National Broadband Network here in Australia), but when DS tries to fetch data, it has the backings of Google’s massive servers, and making changes to my visualisation is often done in less than 2 seconds!

I can see myself using DS to visualise my pre-computed results in R and distribute this kind of interactive reports with others without paying for expensive Shiny servers, assuming that simple visualisations are all that I am after.

COVID-19 visualisations

Fri, 27 Mar 2020 00:00:00 +0000

Introduction

The recent COVID-19 outbreak has caused much disruptions to people’s daily lives. As the policy of self-isolation gets adopted by many countries around the world, many people took to social media to share important resources and data visualisation illustrating the severity of COVID-19. I want to quite clear about my intentions behind this blog post: I am not an epidemiologist/biologist/medical doctor. I will refrain from making any inferences from this data (ironic for a statistician) because it would be irresponsible for me to make commentaries on an ongoing public health crisis in which I am not an expert on. I am only here to show you some interesting R coding and data visualisations.

Between mid-February 2020 and mid-March 2020, I was in Cornell University (New York state) and observing the spread of COVID-19 quite closely. I was increasingly worried about the dramatic increase in the number of cases in US and the potential shutdown of the Australian border. I was on the brink of re-booking all my flights before it is too late. It is around the same time that I was asked by my supervisor back in Australia to design a lecture in Shiny apps, so I thought it will be useful for me to write an app and other visualisations to answer the following questions:

What do the confirmed cases for each country looks like? What is the days-lagged in confirmed cases for each compare when compare to China (i.e. cross-correlation)?
What are the Sydney-bound flights that had confirmed cases? Is there a route that is safer than others?

Shiny app for confirmed cases and added cases

This app attemps to answer the first question, code: https://github.com/kevinwang09/covid19. I can’t afford a server at this point, so you will need to run this app locally by reading through the instructions in the README of that repo.

Based on my simple visualisations at the time (~15 March 2020), I estimated that US’s major outbreak lags behind that of China by about 45 days or so. So it wasn’t so dangerous for me when I was in Cornell around mid-March, however, it was definitely not ideal as the county I was in already had two confirmed cases. Any delays in my departure could spell trouble. This is unfortunately true since at the time of writing, US has overtaken China in confirmed cases and New York state shares the biggest percentage of those confirmed cases.

The structure of the app is quite simple:

The COVID-19 data is fetched using the nCov2019 package using this line of code here
Cumulative confirmed cases are extracted here and time series plot is made here and the cross-correlation plot is made here.
Similarly, the plots for added cases are here and here.

Interactive animation of flights with confirmed case

This is a standalone RMarkdown document: https://kevinwang09.github.io/covid19/confirmed_flights.html#5_plotly_visualisations.

The New South Wales Health website publishes a list of flights with confirmed cases of COVID-19.

The coding beind this visualisation is also quite straight-forward:

The data are scrapped from the NSW Health website using the xml2 and rvest packages. I particularly like the elegance of the coding style using tidyver to scrap this data, though some inspiration came from this StackOverflow thread

url = "https://www.health.nsw.gov.au/Infectious/diseases/Pages/coronavirus-flights.aspx"
raw = xml2::read_html(url)

raw_flights_tbl = raw %>%
  rvest::html_node(xpath = ".//div[@id='ctl00_PlaceHolderMain_contentc1__ControlWrapper_RichHtmlField']/table") %>%
  rvest::html_table() %>% 
  as_tibble() %>% 
  janitor::clean_names()

The geographical locations are then queried through Google Maps API for their longitude and latitudes.

all_geocode = tibble(
  location = c(flights_tbl$origin, flights_tbl$destination) %>% unique,
  geocode = purrr::map(location, ggmap::geocode))

After grabbing these data, it is time to make plots using this chunk of code. Althought a lot of small adjustments were needed, but this example from Plotly’s official website was quite helpful.