Chapter 2 Data Visualization

2.1 Introduction to visualization

Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data. Data visualization requires “information that has been abstracted in some schematic form, including attributes or variables for the units of information.” You can read more about data visualization at https://en.m.wikipedia.org/wiki/Data_visualization and https://en.m.wikipedia.org/wiki/Michael_Friendly.

2.1.1 History of data visualization

In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines graphical displays and sets out principles for effective graphical display. The book states that “Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency.”

2.1.2 Processes and Objectives of visualization

Visualization is the process of representing data graphically and interacting with these representations. The objective is to gain insight into the data. Some of the processes are outlined at http://researcher.watson.ibm.com/researcher/view_group.php?id=143.

2.2 What makes good graphics

Good graphics should:

  1. Show the data
  2. Lead the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production or something else
  3. Avoid distorting what the data have to say
  4. Present many numbers in a small space
  5. Make large data sets coherent
  6. Encourage the eye to compare different pieces of data
  7. Reveal the data at several levels of detail, from a broad overview to the fine structure
  8. Serve a reasonably clear purpose: description, exploration, tabulation or decoration
  9. Be closely integrated with the statistical and verbal descriptions of a data set

2.3 Graphics packages in R

There are many graphics packages in R. Some are aimed at general graphing tasks, while others provide specific graphics for certain analyses. The popular general graphics packages in R are:

  1. graphics : a base R package
  2. ggplot2 : a user-contributed package by Hadley Wickham
  3. lattice : a recommended package that is included with standard R installations

The graphics package is part of base R, and lattice is usually installed along with R. Other packages, such as ggplot2, need to be downloaded and installed into your R library. Examples of more specific packages, which draw graphics for certain analyses, are:

  1. survminer::ggsurvplot()
  2. sjPlot

For this course, we will focus on using the ggplot2 package.

2.4 Introduction to ggplot2 package

The ggplot2 package is an elegant, easy and versatile general graphics package in R. It implements the grammar of graphics concept. The advantage of this concept is that it speeds up the process of learning graphics. It also makes it easier to create complex graphics.

To work with ggplot2, remember

  • start with: ggplot()
  • which data: data = X
  • which variables: aes(x = , y = )
  • which graph: geom_histogram(), geom_point()
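The four steps above can be put together as in the minimal sketch below. It assumes the ggplot2 package is already installed, and it uses mtcars, a dataset that ships with R, purely as a stand-in (it is not one of the datasets used in this chapter).

```r
library(ggplot2)

# start with ggplot(), say which data and which variables, then which graph
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# printing the object draws the plot
p
```

Swapping geom_point() for another geom changes the type of graph while the data and the variables stay the same.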

The official website for ggplot2 is here http://ggplot2.org/.

ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

2.5 Preparation

2.5.1 Set a new project or set the working directory

Before you start a data analysis in RStudio, it is recommended that you first create a new project.

Go to File, then click New Project.

You can create a new R project based on an existing directory. This method is useful because an RStudio project keeps your data, your analysis and your outputs in a clean dedicated folder or set of folders. If you do not want to create a new project, then make sure you are inside the correct directory (the working directory). The working directory is the folder where you store your files.

Type getwd() in your Console to display your working directory. Inside your working directory, you should see and keep:

  1. dataset or datasets
  2. outputs - plots
  3. code (R scripts .R, R Markdown files .Rmd)
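As a rough illustration of the layout above, the sketch below checks the working directory and creates separate folders for data, plots and scripts. The folder names are hypothetical, and the folders are created inside a temporary directory so that nothing in your own project is changed.

```r
# show the current working directory
getwd()

# list the files R can see in that directory
list.files()

# hypothetical folder names for keeping datasets, plots and scripts apart;
# created under tempdir() here so your own files are left untouched
project_dir <- file.path(tempdir(), "my_project")
folders <- file.path(project_dir, c("data", "plots", "scripts"))
for (f in folders) dir.create(f, recursive = TRUE, showWarnings = FALSE)

# confirm the folders now exist
list.files(project_dir)
```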

2.5.2 Questions to ask before making graphs

Ask yourself these questions:

  1. Which variable or variables do I want to plot?
  2. What is the type of each variable?
     • Is it a factor (categorical) variable?
     • Is it a numerical variable?
  3. Am I going to plot
     • a single variable?
     • two variables together?
     • three variables together?

2.5.3 Read data

The common data formats include

  1. comma separated files (.csv)
  2. MS Excel file (.xlsx)
  3. SPSS file (.sav)
  4. Stata file (.dta)
  5. SAS file (.sas7bdat)

Packages that read these formats include the haven package. Below are its functions for reading SAS, SPSS and Stata files.

  1. SAS: read_sas() reads .sas7bdat + .sas7bcat files and read_xpt() reads SAS transport files (version 5 and version 8). write_sas() writes .sas7bdat files.
  2. SPSS: read_sav() reads .sav files and read_por() reads the older .por files. write_sav() writes .sav files.
  3. Stata: read_dta() reads .dta files (up to version 15). write_dta() writes .dta files (versions 8-15).
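The sketch below shows the haven functions listed above in a small round trip. It assumes haven is installed. The data frame toy and the file paths are made up for illustration, and the files are written to a temporary location.

```r
library(haven)

# a small made-up data frame to write out and read back
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# temporary file paths (hypothetical; replace with your own files)
sav_path <- tempfile(fileext = ".sav")
dta_path <- tempfile(fileext = ".dta")

# write the data in SPSS and Stata formats
write_sav(toy, sav_path)
write_dta(toy, dta_path)

# read the files back into R as tibbles
from_spss  <- read_sav(sav_path)
from_stata <- read_dta(dta_path)
from_spss
from_stata
```

With your own data you would only need the read_sav() or read_dta() call, pointing at the file in your working directory.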

Data stored in databases are less common than data files, but they are becoming more important and more widely used. R can also read these data. Some examples of database systems are:

  1. MySQL
  2. SQLite
  3. PostgreSQL
  4. MariaDB
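As one possible way to read from a database, the sketch below uses the DBI and RSQLite packages with a temporary in-memory SQLite database. Both packages are assumptions (the chapter does not name a database package), and the table name cars and the use of mtcars are for illustration only.

```r
library(DBI)

# connect to a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# copy a small data frame into the database as a table named "cars"
dbWriteTable(con, "cars", mtcars[1:5, c("mpg", "cyl", "wt")])

# read part of the table back into R with an SQL query
cars_db <- dbGetQuery(con, "SELECT mpg, cyl, wt FROM cars WHERE cyl = 6")
cars_db

# always close the connection when you are done
dbDisconnect(con)
```

For MySQL, MariaDB or PostgreSQL the same DBI functions apply, but the connection step needs a different driver package and your server details.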

2.5.4 Load the library

The ggplot2 package is one of the core members of the tidyverse package (https://www.tidyverse.org/). So, if we load the tidyverse package, we will have access to the other packages under tidyverse, which include dplyr, readr and ggplot2.

Loading a package will give you access to

  1. help pages
  2. functions
  3. datasets

library(tidyverse)

If you run the code and you see the message there is no package called ‘tidyverse’, then you need to install the tidyverse package. To install the package, type install.packages("tidyverse") in the Console. Once the installation is complete, type library(tidyverse) to load the package.
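One way to check which packages still need installing before loading them is sketched below. The vector needed lists the packages used in this chapter, and the install line is left commented out because it needs an internet connection.

```r
# packages this chapter relies on
needed <- c("tidyverse", "gapminder")

# requireNamespace() returns FALSE, instead of stopping with an error,
# when a package is not installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]
missing_pkgs

# install only what is missing (uncomment to run)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```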

2.5.5 Open dataset

For now, we will use the built-in dataset in the gapminder package. You can read more about gapminder at https://www.gapminder.org/. The gapminder website contains many useful datasets and shows wonderful graphics. It was made popular by Dr Hans Rosling.

To load the package, type

library(gapminder)

Call the gapminder data into R and browse the first 6 observations:

gapminder <- gapminder
head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

We can list the variables and look at the type of each variable in the dataset:

glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...

The gapminder data have

  1. 6 variables
  2. 1704 observations
  3. 2 factor variables (country, continent), 2 integer variables (year, pop) and 2 double-precision numeric variables (lifeExp, gdpPercap)
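The counts above can be checked directly in R. The short sketch below assumes the gapminder package is installed. Note that class() reports the double-precision columns as "numeric".

```r
library(gapminder)

# class of each column in the dataset
var_types <- sapply(gapminder, class)
var_types

# number of variables of each type
table(var_types)

# number of observations (rows) and variables (columns)
dim(gapminder)
```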

We can examine the basic statistics of the dataset by using summary(). This function will list

  1. the frequencies of the levels for factor variables
  2. some descriptive statistics for numerical variables: min, 1st quartile, median, mean, 3rd quartile and max

summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

To learn more about the dataset, we can use the ? operator, which opens its help page.

?gapminder

2.6 Basic plot

We can start by creating a basic plot with these parameters:

  • data = gapminder
  • variables = year, lifeExp
  • graph = scatterplot

In ggplot2, which is a package under the tidyverse package, you use the + sign to connect the functions. In R, your code can span multiple lines, which makes it easier to read.

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp))

Now, you can see that the plot shows:

  1. the relationship between year and life expectancy.
  2. as variable year advances, the life expectancy increases.

Here, ggplot() tells R which data to use, aes() tells R which variables to plot, and geom_point() tells R to make a scatterplot.
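The upward trend seen in the scatterplot can also be checked numerically. The sketch below, which assumes the dplyr and gapminder packages are installed, computes the mean life expectancy across all countries for each year. The name life_by_year is only illustrative.

```r
library(dplyr)
library(gapminder)

# mean life expectancy across all countries, for each year in the data
life_by_year <- gapminder %>%
  group_by(year) %>%
  summarise(mean_lifeExp = mean(lifeExp))
life_by_year
```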

2.7 Adding another variable

Notice that we plotted 2 variables, as specified in aes(). We can add a third variable to make a more complex plot. For example:

  1. data = gapminder
  2. variables = year, life expectancy, continent

Here, the objective of the plot might be to see the relationship between year and life expectancy for each continent.

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp, colour = continent))

What can you see from the scatterplot? You may notice that

  1. European countries have high life expectancy
  2. African countries have lower life expectancy
  3. One Asian country looks like an outlier (very low life expectancy)
  4. One African country looks like an outlier (very low life expectancy)
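To find out which observations lie behind the two apparent outliers, one option is to pull the row with the lowest life expectancy in each of the two continents. This is a sketch that assumes dplyr (version 1.0.0 or later, for slice_min()) and gapminder are installed. The name lowest is only illustrative.

```r
library(dplyr)
library(gapminder)

# the single lowest life expectancy recorded in Asia and in Africa
lowest <- gapminder %>%
  filter(continent %in% c("Asia", "Africa")) %>%
  group_by(continent) %>%
  slice_min(lifeExp, n = 1, with_ties = FALSE) %>%
  select(continent, country, year, lifeExp)
lowest
```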

Now, we will replace the third variable with GDP per capita (variable gdpPercap) and make the size of each point reflect its value.

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp, size = gdpPercap))

ggplot2 will automatically assign a unique level of the aesthetic (a unique colour in the earlier plot, a point size here) to each value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values. The plot suggests that countries with higher GDP per capita have longer life expectancy.

Instead of using colour, we can use shape, which is especially useful when colour plots cannot be printed.

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp, shape = continent))

Now see what happens if you set colour or shape outside the aes() parentheses, as below. The value is then applied to every point instead of being mapped to a variable. For example, let us set the parameter colour to blue.

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp), colour = 'blue')

And then set the parameter shape to a plus sign (which is represented by the number 3).

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp), shape = 3)

You may wonder which number corresponds to which shape. You can type ?pch in the Console. The Help pane will then show an explanation of the shapes available in R, together with the number that corresponds to each shape.
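The shape numbers can also be drawn as a quick reference chart. The sketch below assumes ggplot2 is installed. The data frame shapes and its grid layout are made up for this illustration, and scale_shape_identity() tells ggplot2 to use the numbers directly as shape codes.

```r
library(ggplot2)

# one row per shape code, laid out on a small grid
shapes <- data.frame(
  shape = 0:25,
  x = 0:25 %% 7,
  y = -(0:25 %/% 7)
)

# draw each shape with its number printed underneath
shape_chart <- ggplot(shapes, aes(x = x, y = y)) +
  geom_point(aes(shape = shape), size = 5, fill = "grey") +
  geom_text(aes(label = shape), nudge_y = -0.3) +
  scale_shape_identity() +
  theme_void()
shape_chart
```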

2.8 Making subplots

We can split our plot based on a factor variable and make subplots using facet_wrap(). For example, if we want to make subplots based on continent, then we need to set these parameters:

  • data = gapminder
  • variable year on the x-axis and lifeExp on the y-axis
  • split the plot based on continent
  • the number of rows for the plot is 3

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp)) + 
  facet_wrap(~ continent, nrow = 3)

Now, see what happens if we change the value of nrow.

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp)) + 
  facet_wrap(~ continent, nrow = 2)

2.9 Overlaying plots

Each geom_X() function in ggplot2 draws a different kind of visual object.

This is a scatterplot

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp))

This is a smooth line

ggplot(data = gapminder) +
  geom_smooth(mapping = aes(x = gdpPercap, y = lifeExp))

And we can redraw the smooth plot with a separate line for each continent using the linetype aesthetic. We use log(gdpPercap) to reduce the skewness of the data.

ggplot(data = gapminder) +
  geom_smooth(mapping = aes(x = log(gdpPercap), y = lifeExp, linetype = continent))

Here is another smooth plot, this time mapping continent to colour.

ggplot(data = gapminder) +
  geom_smooth(mapping = aes(x = log(gdpPercap), y = lifeExp, colour = continent))

2.10 Combining geom

We can combine more than one geom to overlay plots. The trick is to add multiple geoms to a single ggplot() call.

ggplot(data = gapminder) +
  geom_point(mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_smooth(mapping = aes(x = log(gdpPercap), y = lifeExp))

The code above repeats the same mapping in each geom. To avoid this, we can pass the mapping to ggplot().

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point() +
  geom_smooth()

And we can expand this so that the scatterplot shows a different colour for each continent.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(mapping = aes(colour = continent)) +
  geom_smooth()

Or expand this so that the smooth plot shows a different colour for each continent.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point() +
  geom_smooth(mapping = aes(colour = continent))

Or do both: a different shape for each continent in the scatterplot and a different colour in the smooth plot.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(mapping = aes(shape = continent)) +
  geom_smooth(mapping = aes(colour = continent))

2.11 Statistical transformation

Let us create a bar chart with the frequency on the y-axis.

ggplot(data = gapminder) +
  geom_bar(mapping = aes(x = continent))

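If we want the y-axis to show proportions, we can use this code. In recent versions of ggplot2 (3.3.0 onwards), the computed variable is written as after_stat(prop). The older form ..prop.. is deprecated.

ggplot(data = gapminder) +
  geom_bar(mapping = aes(x = continent, y = after_stat(prop),
                         group = 1))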
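The proportions on the y-axis can also be computed by hand and then plotted with geom_col(), which draws bars from values already in the data. This is an alternative sketch, not the method used above. It assumes dplyr, ggplot2 and gapminder are installed, and the name continent_prop is only illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# count the rows for each continent and convert the counts to proportions
continent_prop <- gapminder %>%
  count(continent) %>%
  mutate(prop = n / sum(n))
continent_prop

# geom_col() plots the pre-computed proportions as bar heights
ggplot(data = continent_prop) +
  geom_col(mapping = aes(x = continent, y = prop))
```

Computing the table first is a handy way to confirm that the bar heights add up to 1.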

2.12 Customizing title

We can customize many aspects of the plot using the ggplot2 package. For example, from the gapminder dataset, we choose GDP per capita (logged to reduce skewness) and life expectancy, and make a scatterplot. We name the plot mypop.

mypop <- ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point() +
  geom_smooth(mapping = aes(colour = continent))
mypop

You will notice that there is no title in the plot. A title can be added to the plot.

mypop + ggtitle("Scatterplot showing the relationship of GDP in log and life expectancy")

The title can span multiple lines by adding \n where the line should break.

mypop + ggtitle("Scatterplot showing the relationship of GDP in log and life expectancy:\nData from Gapminder")

2.13 Adjusting axes

We can specify the tick marks:

  1. min = 0
  2. max = 12
  3. interval = 1

mypop + scale_x_continuous(breaks = seq(0, 12, 1))

And we can label the x-axis and the y-axis.

mypop + ggtitle("Scatterplot showing the relationship of GDP in log and life expectancy:\nData from Gapminder") +
  ylab("Life Expectancy") + xlab("Per capita GDP in log")

2.14 Choosing theme

The default is the gray theme, theme_gray().

This is the black and white theme

mypop + theme_bw()

This is the classic theme

mypop + theme_classic()

2.15 Saving plot

In R, you can save a plot in different formats. You can also set other parameters such as the dpi and the size of the plot. One of the preferred formats for saving a plot is PDF.

2.16 Saving plot using ggplot2

Here, we will show how to save plots in R. In this example, let us take the plot object named mypop, add a title, an x label and a y label, and choose the classic theme.

myplot <- mypop +
  ggtitle("Scatterplot showing the relationship of GDP in log and life expectancy:\nData from Gapminder") +
  ylab("Life Expectancy") +
  xlab("Per capita GDP in log") +
  scale_x_continuous(breaks = seq(0, 12, 1)) +
  theme_classic()
myplot

We can now see a nice plot. Next, we want to save the plot (currently on the screen) in these formats:

  1. pdf format
  2. png format
  3. jpg format

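The code we can use is:

library(here)
# make sure the plots folder exists before saving into it
dir.create(here("plots"), showWarnings = FALSE)
ggsave(filename = here("plots", "my_pdf_plot.pdf"), plot = myplot)
ggsave(filename = here("plots", "my_png_plot.png"), plot = myplot)
ggsave(filename = here("plots", "my_jpg_plot.jpg"), plot = myplot)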

If we want to customize the plot further before saving it, we can set these parameters:

  1. width = 10 cm (you can use inches instead with units = "in")
  2. height = 6 cm (you can use inches instead with units = "in")
  3. dpi = 150, where dpi is dots per inch

Now, you can run this code:

ggsave(filename = here("plots", "my_pdf_plot2.pdf"), plot = myplot,
       width = 10, height = 6, units = "cm",
       dpi = 150, device = "pdf")
ggsave(filename = here("plots", "my_png_plot2.png"), plot = myplot,
       width = 10, height = 6, units = "cm",
       dpi = 150, device = "png")
ggsave(filename = here("plots", "my_jpg_plot2.jpg"), plot = myplot,
       width = 10, height = 6, units = "cm",
       dpi = 150, device = "jpg")
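After saving, it is worth confirming that the file was actually written. The self-contained sketch below assumes ggplot2 is installed. It uses a stand-in plot built from mtcars and saves it to a temporary folder, so the names demo_plot and out_file are illustrative only.

```r
library(ggplot2)

# a stand-in plot so that this example runs on its own
demo_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# save to a temporary location; ggsave() picks the device from the extension
out_file <- file.path(tempdir(), "demo_plot.pdf")
ggsave(filename = out_file, plot = demo_plot,
       width = 10, height = 6, units = "cm")

# check that the file now exists on disk
file.exists(out_file)
```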