Today’s goals is to compare readability with long versus wide format of data. Plus, we will fill a date column with missing values in a panel dataset. Let’s load the packages we are going to use:
library(tidyverse)
# set a theme to stay for the rest of the plotting
theme_set(theme_minimal())
Load the Data
We will be using COVID-19 vaccination data, which are available at the following repo. I also collected and wrangled the data here for Orizzonti Politici.
We proceed by loading the dataset via its url, so that it’s updated every time we run the script.
url_doses_delivered_ita <-
"https://raw.githubusercontent.com/italia/covid19-opendata-vaccini/master/dati/consegne-vaccini-latest.csv"
read_csv(url_doses_delivered_ita, col_types = cols(area = col_factor())) %>%
rename(
dosi_consegnate = numero_dosi,
) %>%
relocate(data_consegna, .after = 'area') -> doses
doses %>% head(10)
## # A tibble: 10 x 8
## area data_consegna fornitore dosi_consegnate codice_NUTS1 codice_NUTS2
## <fct> <date> <chr> <dbl> <chr> <chr>
## 1 ABR 2020-12-27 Pfizer/B… 135 ITF ITF1
## 2 ABR 2020-12-30 Pfizer/B… 7800 ITF ITF1
## 3 ABR 2021-01-05 Pfizer/B… 3900 ITF ITF1
## 4 ABR 2021-01-07 Pfizer/B… 3900 ITF ITF1
## 5 ABR 2021-01-11 Pfizer/B… 3900 ITF ITF1
## 6 ABR 2021-01-12 Pfizer/B… 4875 ITF ITF1
## 7 ABR 2021-01-13 Moderna 1300 ITF ITF1
## 8 ABR 2021-01-18 Pfizer/B… 3510 ITF ITF1
## 9 ABR 2021-01-20 Pfizer/B… 3510 ITF ITF1
## 10 ABR 2021-01-21 Pfizer/B… 2340 ITF ITF1
## # … with 2 more variables: codice_regione_ISTAT <dbl>, nome_area <chr>
Our data has the following columns: area
indicates the region, data
when deliveries happened, fornitore
is the supplier, dosi_consegnate
how many doses have been delivered.
Now we will perform the same operations on the dataset. Namely, we are going to:
- Create the implicit missing values;
- Create a cumulative sum of doses column for each supplier;
- Create a cumulative sum of doses;
- Choose the most informative format.
Deal with implicit missing values
The quickest way (I know of) is to create a grid with the missing values and then join it with the data.
doses %>%
tidyr::expand(
# only works if it's the first argument
data_consegna = full_seq(data_consegna, 1),
area, fornitore) -> doses_grid
And then perform the join. In this case, we do not have to specify the argument on
and the same dimensions of the joint item and the joined one show the join was executed successfully! (What a tongue twist.)
doses %>%
full_join(doses_grid) %>%
# reorder rows
arrange(area, data_consegna) %>%
# fill NAs
mutate(dosi_consegnate = coalesce(dosi_consegnate, 0)) %>%
arrange(area, data_consegna, fornitore) -> doses_long
Let’s use what we have done so far to make a visualisation:
doses_long %>%
ggplot(aes(data_consegna, dosi_consegnate)) +
geom_col(aes(fill = fornitore)) +
labs(
title = 'Total Vaccine Doses Delivered',
x = NULL,
y = NULL
)
Moderna has so far made only one delivery!
Work with Long Format
Data is already in the long format (normally, we would need to use pivot_longer()
to switch from wide to long format).
We might create a column with the cumulative total number of doses available each day in each region and set up a shiny
app to monitor closely how things are going in each region.
Wide Format
Now, we have to turn doses
into the wide format. First, we need to pivot the data: that’s almost half of the following code and it’s much easier than expected. Then we group the data by area and create a bunch of columns.
doses_long %>%
# pivot to the wider format
pivot_wider(
# create new features out of the levels of the following column
names_from = fornitore,
# values of the new features come from the following column
values_from = dosi_consegnate
) %>%
# for simplicity, let's rename
rename(
dosi_pfizer = 'Pfizer/BioNTech',
dosi_moderna = 'Moderna'
) %>%
group_by(area) %>%
mutate(
# replce NA with zero
dosi_pfizer = coalesce(dosi_pfizer, 0),
dosi_moderna = coalesce(dosi_moderna, 0),
# create columns with totals
totale_pfizer = cumsum(dosi_pfizer),
totale_moderna = cumsum(dosi_moderna),
totale_dosi = totale_moderna + totale_pfizer
) %>%
ungroup() -> doses_wide
doses_wide %>% head(10)
## # A tibble: 10 x 11
## area data_consegna codice_NUTS1 codice_NUTS2 codice_regione_… nome_area
## <fct> <date> <chr> <chr> <dbl> <chr>
## 1 ABR 2020-12-27 <NA> <NA> NA <NA>
## 2 ABR 2020-12-27 ITF ITF1 13 Abruzzo
## 3 ABR 2020-12-28 <NA> <NA> NA <NA>
## 4 ABR 2020-12-29 <NA> <NA> NA <NA>
## 5 ABR 2020-12-30 <NA> <NA> NA <NA>
## 6 ABR 2020-12-30 ITF ITF1 13 Abruzzo
## 7 ABR 2020-12-31 <NA> <NA> NA <NA>
## 8 ABR 2021-01-01 <NA> <NA> NA <NA>
## 9 ABR 2021-01-02 <NA> <NA> NA <NA>
## 10 ABR 2021-01-03 <NA> <NA> NA <NA>
## # … with 5 more variables: dosi_moderna <dbl>, dosi_pfizer <dbl>,
## # totale_pfizer <dbl>, totale_moderna <dbl>, totale_dosi <dbl>
Long vs Wide Format
Each of the two has its own shortcomings and advantages. Let’s see some visual examples:
Cumulative Doses Available
doses_wide %>%
ggplot(aes(
x = data_consegna,
y = totale_dosi,
fill = area)) +
geom_area() +
labs(
title = 'Total Doses Available in Italy',
x = NULL,
y = NULL,
fill = 'Region'
)
This can’t be swiftly achieved with a long format:
doses_long %>%
group_by(area, fornitore) %>%
mutate(totale_dosi = cumsum(dosi_consegnate)) %>%
ggplot(aes(data_consegna, totale_dosi, fill = area)) +
geom_area()
doses_long %>%
group_by(area, fornitore) %>%
mutate(totale_dosi = cumsum(dosi_consegnate)) %>%
ggplot(aes(data_consegna, totale_dosi, fill = area)) +
geom_col()
doses_long %>%
group_by(area, fornitore) %>%
mutate(totale_dosi = cumsum(dosi_consegnate)) %>%
ggplot(aes(data_consegna, totale_dosi, fill = fornitore)) +
geom_col()
geom_area()
fails and geom_col()
is not as elegant. We can use it for faceting, though!
doses_long %>%
group_by(area, fornitore) %>%
mutate(totale_dosi = cumsum(dosi_consegnate)) %>%
ggplot(aes(data_consegna, totale_dosi, fill = fornitore)) +
geom_area() +
facet_wrap(~ area, nrow = 4) +
theme(legend.position = 'bottom') +
labs(
title = 'Deliveries by Region',
x = NULL,
y = NULL,
fill = NULL
)
Vaccine Deliveries Grouped by Area
Let’s use group_by(area)
to get cumulative data for each region.
doses_wide %>%
group_by(area) %>%
summarise(
totale_pfizer = sum(dosi_pfizer),
totale_moderna = sum(dosi_moderna)
) %>%
mutate(
totale_dosi = totale_pfizer + totale_moderna
) %>%
add_row(
area = 'ITA',
totale_pfizer = sum(.$totale_pfizer),
totale_moderna = sum(.$totale_moderna),
totale_dosi = sum(.$totale_dosi),
)
## # A tibble: 22 x 4
## area totale_pfizer totale_moderna totale_dosi
## <chr> <dbl> <dbl> <dbl>
## 1 ABR 37380 1300 38680
## 2 BAS 20775 0 20775
## 3 CAL 62680 0 62680
## 4 CAM 171345 8200 179545
## 5 EMR 213525 7400 220925
## 6 FVG 60715 0 60715
## 7 LAZ 222670 7500 230170
## 8 LIG 77540 0 77540
## 9 LOM 379530 0 379530
## 10 MAR 43880 1500 45380
## # … with 12 more rows
The wide format does not offer much, but can be better integrated with other data to create new features. The long format allows to make quick plotting, but adding a row with totals for the country would be definitely more demaning.
doses_long %>%
group_by(area, fornitore) %>%
summarise(
totale_dosi = sum(dosi_consegnate)
) %>%
ggplot(aes(area, totale_dosi, fill = fornitore)) +
geom_col() +
theme(legend.position = 'bottom')
Vaccine Deliveries Grouped by Date
We will group_by(data)
to obtain how many doses have been delivered in the whole of Italy every day by each supplier:
doses_wide %>%
group_by(data_consegna) %>%
summarise(
totale_pfizer = sum(dosi_pfizer),
totale_moderna = sum(dosi_moderna),
) %>%
mutate(
totale_dosi_consegnate = totale_pfizer + totale_moderna,
totale_dosi = cumsum(totale_dosi_consegnate)
) %>%
ggplot(aes(data_consegna, totale_dosi)) +
geom_area(fill = 'cadetblue')
This does not really tell much more. But, once again, it can be enriched with other data and have multiple objects (rather than layers/dimensions).