Now that we have wrangled the data a bit, we can proceed with some visualisations. We want to plot three things:
- How many vaccinations are administered daily.
- How many doses have been administered so far and their ratio.
- See how regions perform in terms of doses administered and doses received.
Data Wrangling
Let’s load the data:
read_csv(
'https://raw.githubusercontent.com/orizzontipolitici/covid19-vaccine-data/main/data_ita/doses_by_date_ita.csv',
) -> doses_by_date
read_csv(
'https://raw.githubusercontent.com/orizzontipolitici/covid19-vaccine-data/main/data_ita/vaccinations_by_area_ita.csv', col_types = cols(area = col_factor()) ) -> vaccinations_by_area
These two datasets are incompatible: first, data grouped by area needs to be grouped to the Italian level. Then, we need to transform deliveries data into the long format.
Let’s start with grouping vaccination data. Columns beyond the twelfth are not needed: basically, they are the sum of the doses injected each day… which is exactly what we will be computing within summarise()
!
vaccinations_by_area %>%
# the other columns are the same as the summarised one
select(1:12) %>%
group_by(data, fornitore) %>%
summarise(across(where(is.numeric), sum)) %>%
ungroup() -> vaccinations_by_date
Then, let’s move to vaccine data. This seems a bit more challenging at first, as we need to pack the two columns with the total deliveries of each supplier into a single one. pivot_longer()
comes to rescue!
doses_by_date %>%
# we will just need data and the two supplier cols
select(-totale_dosi_consegnate, -totale_dosi) %>%
# to get prettier names into the new column
rename(
'Pfizer/BioNTech' = totale_pfizer,
Moderna = totale_moderna
) %>%
# pivot longer magic:
pivot_longer(c('Pfizer/BioNTech', Moderna),
# will have values Pfizer/BioNTech and Moderna
names_to = 'fornitore',
# will report the deliveries of the day
values_to = 'consegne') %>%
# we need to add cumulative doses available at each date
group_by(fornitore) %>%
mutate(dosi_totali = cumsum(consegne)) -> doses_by_date_long
Vaccinations: First and Second Doses
Now that the two are compatible, we could join them. However, given the structure of the data, we’d better create two new separate tables: for instance, since we cannot observe to who the first and second shots are administered, we can create a new table with this info only.
vaccinations_by_date %>%
select(data, fornitore, prima_dose, seconda_dose) %>%
pivot_longer(cols = !c(data, fornitore),
names_to = 'dose',
values_to = 'count') %>%
group_by(fornitore, dose) %>%
mutate(
total_count = cumsum(count)
) -> vaccinations_long
Vaccines from Moderna are so few right now that we might as well group by it. In the future, comparing how shots are administered by each supplier would bear greater value.
vaccinations_long %>%
group_by(data, dose) %>%
summarise(across(where(is.numeric), sum)) %>%
# swap the order of the factor
mutate(dose = fct_rev(dose)) %>%
ggplot(aes(data, total_count, fill = dose)) +
geom_area() +
scale_fill_viridis_d(begin = 0.95, end = 0.55) +
labs(
title = 'Vaccinations in Italy',
subtitle = 'Cumulative First and Second Doses Administered',
x = NULL,
y = NULL,
fill = NULL
) +
theme(legend.position = 'bottom')
We can also show it as a percentage: it’s a bit more complicated since we have to move from wider to create the new percentage columns and then back to longer for plotting.
vaccinations_long %>%
group_by(data, dose) %>%
summarise(across(where(is.numeric), sum)) %>%
# pivot back to wider to create new columns
pivot_wider(
!total_count,
names_from = dose,
values_from = c(count)
) %>%
# take out redundant text from column names
rename_with( ~ stringr::str_remove(.x, c('count_'))) %>%
mutate(across(
# apply it to these cols
c(prima_dose, seconda_dose),
# replace with
~ .x / (prima_dose + seconda_dose)
)) %>%
# pivot to longer
pivot_longer(
# don't use data
!data,
# names and values
names_to = 'dose',
values_to = 'pct'
) %>%
ggplot(aes(data, pct, fill = dose)) +
geom_area() +
scale_fill_viridis_d(begin = 0.55, end = 0.95) +
labs(
title = 'Vaccinations in Italy',
subtitle = 'Cumulative First and Second Doses Administered',
x = NULL,
y = NULL,
fill = NULL
) +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = 'bottom')
Daily Vaccinations Data
We can also plot a line with daily shots (again, data is grouped by fornitore
):
vaccinations_long %>%
group_by(data, dose) %>%
summarise(across(where(is.numeric), sum)) %>%
ungroup() %>%
ggplot(aes(data, count, col = dose)) +
geom_line() +
# `begin` and `end` are swapped as factors are not swapped as in the graph above
scale_color_viridis_d(begin = 0.55, end = 0.95) +
labs(
title = 'Vaccinations in Italy',
subtitle = 'First and Second Doses Administered',
x = NULL,
y = NULL,
color = NULL
) +
theme(legend.position = 'bottom')
Putting these together
We will be using a call to ggpubr::ggarrange()
:
Comparing First and Second Doses
vaccinations_by_date %>%
filter(fornitore != 'Moderna') %>%
select(data, prima_dose, seconda_dose) %>%
mutate(lead = lead(seconda_dose, 20)) %>%
ggplot(aes(data, cumsum(prima_dose))) +
geom_line(col = 'salmon') +
geom_line(aes(y = cumsum(lead)), col = 'cornflowerblue') +
labs(
title = 'Total First and Second Doses Injected',
subtitle = 'Red is the total number of first doses\nBlue is the total number of second doses',
x = NULL,
y = NULL
)
To whom doses are administered to?
Let’s use vaccinations data to see how shots are distributed.
Male and Female
National Level
vaccinations_by_area %>%
group_by(data) %>%
summarise(across(where(is.numeric), sum)) %>%
select(data, sesso_maschile, sesso_femminile) %>%
pivot_longer(!data, names_to = 'sesso', values_to = 'count') %>%
ggplot(aes(data, count, fill = sesso)) +
geom_col() +
labs(
title = 'Vaccinations by Sex in Italy',
subtitle = 'Blue is male, red is female',
x = NULL,
y = NULL,
fill = NULL
) +
theme(legend.position = 'none')
Let’s see it as percentage:
vaccinations_by_area %>%
# we are not interested in the supplier
group_by(data) %>%
summarise(across(where(is.numeric), sum)) %>%
select(data, sesso_maschile, sesso_femminile) %>%
mutate(
# transform into percentage
across(c(sesso_femminile, sesso_maschile), ~ .x / (sesso_maschile + sesso_femminile)),
# replace NAs with 0
across(where(is.numeric), ~ coalesce(.x, 0L))
) %>%
pivot_longer(!data, names_to = 'sesso', values_to = 'count') %>%
ggplot(aes(data, count, fill = sesso)) +
geom_col() +
labs(
title = 'Vaccinations by Sex, per Area',
subtitle = 'Blue is male, red is female',
x = NULL,
y = NULL,
fill = NULL
) +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = 'none')
At the national level, more women are being vaccinated than men. This may also be due to more of them being health workers.
Regional Level with facet_grid()
We exploit the extra area
information to see these proportions at the local level. Given the high number of regions, it’s won’t look pretty, at all! I shall get back to this with shiny
, soon.
vaccinations_by_area %>%
select(data, area, sesso_maschile, sesso_femminile) %>%
pivot_longer(!c(data, area), names_to = 'sesso', values_to = 'count') %>%
ggplot(aes(data, count, fill = sesso)) +
geom_col() +
facet_wrap(~ area, nrow = 3) +
labs(
title = 'Vaccinations by Sex, per Area',
subtitle = 'Blue is male, red is female',
x = NULL,
y = NULL,
fill = NULL
) +
# to make the dates fit! thanks to `r-graphics.org`!
theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
theme(legend.position = 'none')
We did manage to avoid getting an absolute awful graph, though!
By Category
vaccinations_by_area %>%
# show off some other selection syntax
select(data, area, operatori_sanitari:ospiti_rsa) %>%
# for now, no area
group_by(data) %>%
summarise(across(where(is.numeric), sum)) %>%
pivot_longer(
!data,
names_to = 'categoria',
values_to = 'count'
) %>%
ggplot(aes(data, count, fill = categoria)) +
geom_col() +
theme(legend.position = 'bottom') +
scale_fill_viridis_d(begin = 0.2, end = 0.75) +
theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
labs(
title = 'Vaccinations by category',
x = NULL,
y = NULL,
fill = NULL
)
Who’s doing better?
Let’s use some other data to find the region which is performing better. Let’s load the data:
read_csv('https://raw.githubusercontent.com/orizzontipolitici/covid19-vaccine-data/main/data_ita/totals_by_area_ita.csv') %>%
select(-NUTS2, -nome) %>%
mutate(area = as.factor(area)) -> totals_by_area
And then make a plot:
totals_by_area %>%
filter(area != 'ITA') %>%
ggplot(aes(
x = dosi_ogni_mille,
y = vaccinati_ogni_mille
)) +
geom_vline(xintercept = mean(totals_by_area$dosi_ogni_mille), color = 'grey') +
geom_hline(yintercept = mean(totals_by_area$vaccinati_ogni_mille), color = 'grey') +
geom_point() +
ggrepel::geom_text_repel(aes(label = area)) +
# xlim(0,100) +
# ylim(0,100) +
labs(
title = 'Which Region is Ahead?',
subtitle = 'Doses delivered compared to doses injected',
x = NULL,
y = NULL
)