Chapter 2 Data Visualization
2.1 Introduction to visualization
Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data. Data visualization requires “information that has been abstracted in some schematic form, including attributes or variables for the units of information.” You can read more about data visualization here https://en.m.wikipedia.org/wiki/Data_visualization and here https://en.m.wikipedia.org/wiki/Michael_Friendly
2.1.1 History of data visualization
In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines graphical displays and sets out principles for effective graphical display. The book states that “Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency.”
2.1.2 Processes and Objectives of visualization
Visualization is the process of representing data graphically and interacting with these representations. The objective is to gain insight into the data. Some of the processes are outlined here http://researcher.watson.ibm.com/researcher/view_group.php?id=143
2.2 What makes good graphics
Good graphics should:
- Show the data
- Focus on substance rather than on methodology, graphic design, the technology of graphic production or something else
- Avoid distorting what the data have to say
- Present many numbers in a small space
- Make large data sets coherent
- Encourage the eye to compare different pieces of data
- Reveal the data at several levels of detail, from a broad overview to the fine structure
- Serve a reasonably clear purpose: description, exploration, tabulation or decoration
- Be closely integrated with the statistical and verbal descriptions of a data set.
2.3 Graphics packages in R
There are many graphics packages in R. Some packages perform general graphics tasks. Others provide specific graphics for certain analyses. The popular general graphics packages in R are:
- graphics : a base R package
- ggplot2 : a user-contributed package by Hadley Wickham
- lattice : a user-contributed package
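Except for the graphics package, which is part of base R, these packages may need to be downloaded and installed into your R library (lattice is usually bundled with R as a recommended package, whereas ggplot2 must be installed). Examples of more specific packages, which produce graphics for particular analyses, are:
- survminer::ggsurvplot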
- sjPlot
For this course, we will focus on using the ggplot2 package.
2.4 Introduction to ggplot2 package
The ggplot2 package is an elegant, easy and versatile general graphics package in R. It implements the grammar of graphics concept. The advantage of this concept is that it speeds up the process of learning graphics. It also facilitates the process of creating complex graphics.
To work with ggplot2, remember to specify:
- start with: ggplot()
- which data: data = X
- which variables: aes(x = , y = )
- which graph: geom_histogram(), geom_point()
The official website for ggplot2 is here http://ggplot2.org/.
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
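As a minimal sketch of how the four parts listed above fit together, the example below uses mtcars, a dataset that ships with R. The choice of dataset and of the variables wt and mpg is an assumption made only for illustration; it is not the data used later in this chapter.

```r
library(ggplot2)

# Start with ggplot(), then say which data, which variables and which graph.
# mtcars, wt and mpg are illustrative choices, not part of this chapter's data.
p <- ggplot(data = mtcars) +                   # which data
  geom_point(mapping = aes(x = wt, y = mpg))   # which variables, which graph

p  # printing the object draws the plot
```

Replacing geom_point() with another geom changes the type of graph while the data and the mapping stay the same.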
2.5 Preparation
2.5.1 Set a new project or set the working directory
It is recommended that you first create a new project before you start working on data analysis in RStudio.
Go to File, then click New Project.
You can create a new R project based on an existing directory. This method is useful because an RStudio project keeps your data, your analysis, and your outputs in a clean dedicated folder or set of folders. If you do not want to create a new project, then make sure you are inside the correct directory (the working directory). The working directory is the folder where you store the files for your analysis.
Type getwd() in your Console to display your working directory. Inside your working directory, you should see and keep
- dataset or datasets
- outputs - plots
- code (R scripts .R, R Markdown files .Rmd)
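The check described above can be sketched with a few base R functions. The folder name "plots" is a hypothetical example; use whatever folder names your own project has.

```r
# Show the current working directory
getwd()

# List the files and folders inside the working directory
list.files()

# Check whether a folder exists; "plots" is a hypothetical folder name
dir.exists("plots")
```

If the working directory is not the one you expect, opening the RStudio project file (.Rproj) is usually the simplest way to get back to the right folder.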
2.5.2 Questions to ask before making graphs
You must ask yourselves these:
- Which variable or variables do I want to plot?
- What is (or are) the type of that variable?
  - Are they factor (categorical) variables?
  - Are they numerical variables?
- Am I going to plot
  - a single variable?
  - two variables together?
  - three variables together?
2.5.3 Read data
The common data formats include
- comma separated files (.csv)
- MS Excel file (.xlsx)
- SPSS file (.sav)
- Stata file (.dta)
- SAS file
The haven package reads several of these formats. Below are its functions to read SAS, SPSS and Stata files.
- SAS: read_sas() reads .sas7bdat + .sas7bcat files and read_xpt() reads SAS transport files (version 5 and version 8). write_sas() writes .sas7bdat files.
- SPSS: read_sav() reads .sav files and read_por() reads the older .por files. write_sav() writes .sav files.
- Stata: read_dta() reads .dta files (up to version 15). write_dta() writes .dta files (versions 8-15).
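The sketch below shows one way these reading functions might be used. Because this chapter does not supply any data files, the example first writes a tiny made-up data frame to temporary files and then reads them back; the data values, object names and file names are all assumptions for illustration. It uses read_csv() from the readr package for the .csv file. Excel files are commonly read with readxl::read_excel(), which is not shown here.

```r
library(readr)
library(haven)

# A tiny example data frame (made-up values)
df <- data.frame(id = 1:3, age = c(34, 51, 28))

# Temporary file paths, so nothing is written to your project folder
csv_file <- tempfile(fileext = ".csv")
dta_file <- tempfile(fileext = ".dta")
sav_file <- tempfile(fileext = ".sav")

# Write the data in three formats
write_csv(df, csv_file)
write_dta(df, dta_file)
write_sav(df, sav_file)

# Read the files back into R
from_csv <- read_csv(csv_file)
from_dta <- read_dta(dta_file)
from_sav <- read_sav(sav_file)

from_dta
```

In a real project you would replace the temporary paths with the path to your own file, for example a file kept in your working directory.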
Data stored in databases are less common in this setting but are becoming more important. R can also read these data. Some examples of database systems are:
- MySQL
- SQLite
- PostgreSQL
- MariaDB
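As a rough sketch of how R can read from a database, the example below uses the DBI package with the RSQLite driver; both are assumptions, since this chapter does not name a database package. It creates a temporary in-memory SQLite database, copies the built-in mtcars data into it as a table named cars (an illustrative name), and queries it with SQL.

```r
library(DBI)

# Connect to a temporary in-memory SQLite database (needs the RSQLite package)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a data frame into the database as a table named "cars"
dbWriteTable(con, "cars", mtcars)

# Run an SQL query and get the result back as a data frame
heavy_cars <- dbGetQuery(con, "SELECT mpg, wt FROM cars WHERE wt > 5")
heavy_cars

# Close the connection when finished
dbDisconnect(con)
```

Other database systems follow the same pattern with a different driver and connection details, and the resulting data frame can be passed to ggplot() like any other.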
2.5.4 Load the library
The ggplot2 package is one of the core members of the tidyverse (https://www.tidyverse.org/). So, if we load the tidyverse package, we get access to the other packages under tidyverse, which include dplyr, readr and ggplot2.
Loading a package will give you access to
- help pages
- functions
- datasets
library(tidyverse)
If you run the code and see the message there is no package called tidyverse, then you need to install the tidyverse package. To install the package, type install.packages("tidyverse") in the Console. Once the installation is complete, type library(tidyverse) to load the package.
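The install-then-load steps can also be written so that the package is installed only when it is missing. This is a small sketch rather than a required step; the variable name pkg is just an illustrative choice.

```r
# Install a package only when it is missing, then load it.
# `pkg` is an illustrative variable name.
pkg <- "tidyverse"

if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() accept the name stored in a variable
library(pkg, character.only = TRUE)
```

This pattern is handy at the top of a script that will be run on more than one computer.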
2.5.5 Open dataset
For now, we will use the built-in dataset in the gapminder package. You can read more about gapminder from https://www.gapminder.org/. The gapminder website contains many useful datasets and shows wonderful graphics. It was made popular by Dr Hans Rosling.
To load the package, type
library(gapminder)
Call the gapminder data into R and browse the first 6 observations
gapminder <- gapminder
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
We can list the variables and look at the type of the variables in the dataset
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
The gapminder data have
- 6 variables
- 1704 observations
- There are 2 factor variables, 2 integer variables and 2 double (numeric) variables
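The counts above can be checked directly in R. The short sketch below uses only base R functions on the gapminder data.

```r
library(gapminder)

# Number of observations (rows) and variables (columns)
dim(gapminder)

# The class of each variable
var_classes <- sapply(gapminder, class)
var_classes

# How many variables there are of each class
table(var_classes)
```

Knowing the class of each variable helps you decide which plot is appropriate, as asked in the questions in section 2.5.2.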
We can examine the basic statistics of the dataset by using summary(). This function will list
- the frequencies (for factor variables)
- some descriptive statistics (for numeric variables): min, 1st quartile, median, mean, 3rd quartile and max
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
To know more about the data, we can use ? to open its help page
?gapminder
2.6 Basic plot
We can start by creating a basic plot with these parameters
- data = gapminder
- variables = year, lifeExp
- graph = scatterplot
In ggplot2, which is a package under the tidyverse, you use the + sign to connect functions. In R, your code can span multiple lines, which makes the code easier to read.
ggplot(data = gapminder) +
geom_point(mapping = aes(x = year, y = lifeExp))
Now, you can see that the plot shows:
- the relationship between year and life expectancy.
- as the variable year advances, life expectancy increases.
The ggplot() function tells R which data to use, and aes() tells R which variables to plot. The geom_point() function tells R to make a scatterplot.
2.7 Adding another variable
Notice that we plotted 2 variables through aes(). We can add a third variable to make a more complicated plot. For example:
- data = gapminder
- variables = year, life expectancy, continent
For this, the objective of the plot might be to see the relationship between year and life expectancy by continent.
ggplot(data = gapminder) +
geom_point(mapping = aes(x = year, y = lifeExp, colour = continent))
What can you see from the scatterplot? You may notice that
- European countries have high life expectancy
- African countries have lower life expectancy
- One Asian country looks like an outlier (very low life expectancy)
- One African country looks like an outlier (very low life expectancy)
Now, we will replace the 3rd variable with GDP (variable gdpPercap) and map the size of the points to GDP.
ggplot(data = gapminder) +
geom_point(mapping = aes(x = year, y = lifeExp, size = gdpPercap))
ggplot2 will automatically assign a level of the aesthetic (for example a unique colour for each continent, or a point size for each value of gdpPercap) to each value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values. The plot suggests that countries with higher GDP have longer life expectancy.
Instead of using colour, we can use shape, which is useful when colour printing is not available.
ggplot(data = gapminder) +
geom_point(mapping = aes(x = year, y = lifeExp, shape = continent))
But see what happens if you set colour or shape outside the aes() parentheses, as below. The aesthetic is then fixed at one value for all points instead of being mapped to a variable. For example, let us set the parameter colour to blue
ggplot(data = gapminder) +
geom_point(mapping = aes(x = year, y = lifeExp), colour = 'blue')
And then parameter shape to plus (which is represented by number 3).
ggplot(data = gapminder) +
geom_point(mapping = aes(x = year, y = lifeExp), shape = 3)
You may wonder which number corresponds to which shape. You can type ?pch, and the Help pane will show an explanation of the shapes available in R and the number that corresponds to each shape.
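The difference between placing an aesthetic inside and outside aes() can be sketched as follows. The object names p_mapped, p_fixed and p_slip are illustrative only.

```r
library(ggplot2)
library(gapminder)

# Mapping: colour is inside aes(), so it varies with a variable in the data
p_mapped <- ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp, colour = continent))

# Setting: colour is outside aes(), so every point gets the same fixed colour
p_fixed <- ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp), colour = "blue")

# A common slip: a fixed value placed inside aes() is treated as data,
# so the points are not drawn in blue and a legend labelled "blue" appears
p_slip <- ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp, colour = "blue"))

p_slip
```

As a rule of thumb, put an aesthetic inside aes() when it should depend on a variable, and outside aes() when it should be the same for every point.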
2.8 Making subplots
We can split our plots based on a factor variable and make subplots using facet_wrap(). For example, if we want to make subplots based on continent, then we need to set these parameters:
- data = gapminder
- variable year on the x-axis and lifeExp on the y-axis
- split the plot based on continent
- the number of rows for the plot is 3
ggplot(data = gapminder) +
geom_point(mapping = aes(x = year, y = lifeExp)) +
facet_wrap(~ continent, nrow = 3)
Now, see what happens if we change the value of nrow
ggplot(data = gapminder) +
geom_point(mapping = aes(x = year, y = lifeExp)) +
facet_wrap(~ continent, nrow = 2)
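The layout of the subplots can also be controlled by the number of columns, or by using facet_grid() to stack the panels in a single column. The sketch below shows both options; the object names are illustrative.

```r
library(ggplot2)
library(gapminder)

# All five continents side by side in one row, using ncol instead of nrow
p_cols <- ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp)) +
  facet_wrap(~ continent, ncol = 5)

# One panel per row, using facet_grid() with continent on the rows
p_rows <- ggplot(data = gapminder) +
  geom_point(mapping = aes(x = year, y = lifeExp)) +
  facet_grid(continent ~ .)

p_cols
p_rows
```

Placing panels side by side makes it easier to compare values on the y-axis, while stacking them makes it easier to compare values on the x-axis.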
2.9 Overlaying plots
Each geom_X() in ggplot2 draws a different visual object.
This is a scatterplot
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp))
This is a smooth line
ggplot(data = gapminder) +
geom_smooth(mapping = aes(x = gdpPercap, y = lifeExp))
And we can regenerate the smooth plot for each continent using the linetype aesthetic. We use log(gdpPercap) to reduce the skewness of the data.
ggplot(data = gapminder) +
geom_smooth(mapping = aes(x = log(gdpPercap), y = lifeExp, linetype = continent))
Here is another smooth plot, this time mapping continent to colour
ggplot(data = gapminder) +
geom_smooth(mapping = aes(x = log(gdpPercap), y = lifeExp, colour = continent))
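By default geom_smooth() chooses a smoothing method for you. As a sketch of how the smoother can be changed, the example below asks for a straight line per continent (method = "lm") and hides the confidence band (se = FALSE). The object name p_lm is illustrative.

```r
library(ggplot2)
library(gapminder)

p_lm <- ggplot(data = gapminder) +
  geom_smooth(
    mapping = aes(x = log(gdpPercap), y = lifeExp, colour = continent),
    method = "lm",  # fit a straight line instead of the default smoother
    se = FALSE      # hide the confidence band around each line
  )

p_lm
```

Straight lines are easier to compare across continents, but they can hide curvature that the default smoother would show.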
2.10 Combining geom
We can combine more than one geom to overlay plots. The trick is to add multiple geoms to a single ggplot() call, joined with the + sign
ggplot(data = gapminder) +
geom_point(mapping = aes(x = log(gdpPercap), y = lifeExp)) +
geom_smooth(mapping = aes(x = log(gdpPercap), y = lifeExp))
The code above contains duplication: the same mapping is written twice. To avoid this, we can pass the mapping to ggplot().
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
geom_point() +
geom_smooth()
And we can expand this so that the scatterplot shows a different colour for each continent
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
geom_point(mapping = aes(colour = continent)) +
geom_smooth()
Or expand this so that the smooth plot shows a different colour for each continent
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
geom_point() +
geom_smooth(mapping = aes(colour = continent))
Or do both, with a different shape for the points and a different colour for the smooth line of each continent
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
geom_point(mapping = aes(shape = continent)) +
geom_smooth(mapping = aes(colour = continent))
2.11 Statistical transformation
Let us create a bar chart, with the frequency on the y-axis.
ggplot(data = gapminder) +
geom_bar(mapping = aes(x = continent))
If we want the y-axis to show proportions, we can use this code
ggplot(data = gapminder) +
geom_bar(mapping = aes(x = continent, y = ..prop..,
group = 1))
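In recent versions of ggplot2 (3.3.0 and later), the same computed variable can be written as after_stat(prop), and the ..prop.. notation may produce a deprecation warning. The sketch below assumes such a version is installed. The argument group = 1 tells ggplot2 to treat all observations as one group, so the proportions are calculated across all continents and add up to 1. The last lines check the proportions with base R.

```r
library(ggplot2)
library(gapminder)

# Bar chart of proportions using after_stat() (assumes ggplot2 >= 3.3.0)
p_prop <- ggplot(data = gapminder) +
  geom_bar(mapping = aes(x = continent, y = after_stat(prop), group = 1))

p_prop

# The same proportions calculated directly from the data
props <- prop.table(table(gapminder$continent))
props
```

Without group = 1, each bar would be treated as its own group and every bar would have a proportion of 1.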
2.12 Customizing title
We can customize many aspects of the plot using the ggplot2 package. For example, from the gapminder dataset, we take GDP, log it (to reduce skewness), and make a scatterplot against life expectancy. We name the plot mypop
mypop <- ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point() +
  geom_smooth(mapping = aes(colour = continent))
mypop
You will notice that there is no title in the plot. A title can be added to the plot.
mypop + ggtitle("Scatterplot showing the relationship of GDP in log and life expectancy")
A title can span multiple lines by adding \n
mypop + ggtitle("Scatterplot showing the relationship of GDP in log and life expectancy:\nData from Gapminder")
2.13 Adjusting axes
We can specify the tick marks
- min = 0
- max = 12
- interval = 1
mypop + scale_x_continuous(breaks = seq(0, 12, 1))
And we can label the x-axis and y-axis
mypop +
  ggtitle("Scatterplot showing the relationship of GDP in log and life expectancy:\nData from Gapminder") +
  ylab("Life Expectancy") +
  xlab("Per capita GDP in log")
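The title and axis labels can also be set in one place with labs(), which is an alternative to calling ggtitle(), xlab() and ylab() separately. The sketch below rebuilds the plot so that it runs on its own; the object name mypop_labelled, the use of a subtitle and the legend title are illustrative choices.

```r
library(ggplot2)
library(gapminder)

# Rebuild the plot so that this example is self-contained
mypop <- ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point() +
  geom_smooth(mapping = aes(colour = continent))

# Set the title, subtitle, axis labels and legend title in a single call
mypop_labelled <- mypop +
  labs(
    title = "Scatterplot showing the relationship of GDP in log and life expectancy",
    subtitle = "Data from Gapminder",
    x = "Per capita GDP in log",
    y = "Life Expectancy",
    colour = "Continent"
  )

mypop_labelled
```

Keeping all the labels in one labs() call makes them easier to find and edit later.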
2.14 Choosing theme
The default is the gray theme, theme_gray().
This is the black and white theme
mypop + theme_bw()
This is the classic theme
mypop + theme_classic()
2.15 Saving plot
In R, you can save a plot in different formats. You can also set other parameters such as the dpi and the size of the plot. One of the preferred formats for saving a plot is PDF.
2.16 Saving plot using ggplot2
Here, we will show how to save plots in R. In this example, let us use the plot object named mypop and add a title, an x label, a y label and the classic theme.
myplot <- mypop +
  ggtitle("Scatterplot showing the relationship of GDP in log and life expectancy:\nData from Gapminder") +
  ylab("Life Expectancy") +
  xlab("Per capita GDP in log") +
  scale_x_continuous(breaks = seq(0, 12, 1)) +
  theme_classic()
myplot
We can now see a nice plot. Next, we want to save the plot (currently on the screen) to these formats:
- pdf format
- png format
- jpg format
The code we can use is:
library(here)
ggsave(plot = myplot, here("plots","my_pdf_plot.pdf"))
ggsave(plot = myplot, here("plots","my_png_plot.png"))
ggsave(plot = myplot, here("plots","my_jpg_plot.jpg"))
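Saving can fail if the destination folder does not exist yet. As a sketch of one way to guard against that, the example below creates the folder before saving. To avoid writing into your project, it uses a folder inside R's temporary directory; the names out_dir and out_file and the simple plot p are illustrative, and in a real project you would use here("plots") as in the code above.

```r
library(ggplot2)
library(gapminder)

# A simple plot to save (illustrative)
p <- ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point()

# A "plots" folder inside the temporary directory (an assumption for this sketch)
out_dir <- file.path(tempdir(), "plots")

# Create the folder only if it does not exist yet
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Save the plot as a PDF and confirm that the file was written
out_file <- file.path(out_dir, "my_pdf_plot.pdf")
ggsave(filename = out_file, plot = p, width = 10, height = 6, units = "cm")
file.exists(out_file)
```

Naming the filename and plot arguments explicitly, as done here, avoids confusion about the order in which ggsave() expects them.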
If we want to add more customization before saving the plot, for example, we want to set these parameters:
- width = 10 cm (or you can use in)
- height = 6 cm (or you can use in)
- dpi = 150. dpi is dots per inch
Now, you can run this code. The first call uses inches to show the alternative unit, and the other two use centimetres.
ggsave(plot = myplot, here('plots','my_pdf_plot2.pdf'),
width = 10, height = 6, units = "in",
dpi = 150, device = 'pdf')
ggsave(plot = myplot, here('plots','my_png_plot2.png'),
width = 10, height = 6, units = "cm",
dpi = 150, device = 'png')
ggsave(plot = myplot, here("plots","my_jpg_plot2.jpg"),
width = 10, height = 6, units = "cm",
dpi = 150, device = 'jpg')
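For raster formats such as png and jpg, the size in pixels follows from the physical size and the dpi: pixels = size in inches × dpi, where 1 inch = 2.54 cm. The short calculation below works this out for the 10 cm by 6 cm plot at 150 dpi; the variable names are illustrative.

```r
# Plot size and resolution used in the ggsave() calls above
width_cm  <- 10
height_cm <- 6
dpi       <- 150
cm_per_in <- 2.54

# Convert centimetres to inches, then multiply by dots per inch
width_px  <- width_cm  / cm_per_in * dpi
height_px <- height_cm / cm_per_in * dpi

round(c(width = width_px, height = height_px))
# approximately 591 pixels wide and 354 pixels high
```

A higher dpi gives more pixels for the same physical size, which matters for print quality. PDF is a vector format, so the dpi setting has little effect on how sharp the lines and text appear.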