Tidy Tuesday

Tidy Tuesday Week 37/2023 - The Global Human Day

Fajzel et al. (2023) show that the time spent on activities such as food growth and collection depends strongly on the GDP of the country in question. I therefore clustered the countries by time spent on food growth and collection and by GDP per capita. The 2021 GDP data was retrieved with the WDI package (Arel-Bundock, 2022).

The Code

# Load Required R Libraries
# The following libraries are loaded to provide necessary functions and data for the analysis and visualization.
library(tidyverse)            # For data manipulation and visualization

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidytuesdayR)        # For accessing TidyTuesday datasets
library(showtext)            # For working with fonts

## Loading required package: sysfonts
## Loading required package: showtextdb

library(glue)                # For text formatting
library(ggtext)              # For enhanced text formatting in ggplot2
library(factoextra)          # For clustering analysis

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(cluster)             # For clustering functions
library(patchwork)           # For combining multiple plots
library(WDI)                 # For fetching World Development Indicators (WDI) data
library(sf)                  # For working with spatial data

## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE

library(rnaturalearth)       # For world map data

## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)
## Support for Spatial objects (`sp`) will be deprecated in {rnaturalearth} and will be removed in a future release of the package. Please use `sf` objects with {rnaturalearth}. For example: `ne_download(returnclass = 'sf')`

library(rnaturalearthdata)   # For additional world map data

## 
## Attaching package: 'rnaturalearthdata'
## 
## The following object is masked from 'package:rnaturalearth':
## 
##     countries110

# Load Data
# Data is loaded from the 'tidytuesdayR' package for a specific week (2023, week 37).
tuesdata <- tidytuesdayR::tt_load(2023, week = 37)

## --- Compiling #TidyTuesday Information for 2023-09-12 ----
## --- There are 4 files available ---
## --- Starting Download ---

## 
## 	Downloading file 1 of 4: `all_countries.csv`
## 	Downloading file 2 of 4: `country_regions.csv`
## 	Downloading file 3 of 4: `global_human_day.csv`
## 	Downloading file 4 of 4: `global_economic_activity.csv`

## --- Download complete ---

# Extract Specific Datasets
# Relevant datasets are extracted from the loaded data.
all_countries <- tuesdata$all_countries
country_regions <- tuesdata$country_regions
global_human_day <- tuesdata$global_human_day
global_economic_activity <- tuesdata$global_economic_activity

# Fetch World Development Indicators (WDI) Data
# The code fetches GDP per capita in constant 2015 US$ (indicator NY.GDP.PCAP.KD) for the year 2021.
# The columns are renamed for clarity and to match the join key used later.
wdi <- WDI(indicator = "NY.GDP.PCAP.KD", country = "all", start = 2021, end = 2021) %>%
  rename(country_iso3 = iso3c, GDP = NY.GDP.PCAP.KD)

# Load World Country Data
# World country data is loaded and stored as a spatial object for mapping purposes.
world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)

## [1] "sf"         "data.frame"

# Load Fonts and Define Colors
# Fonts are loaded and colors are defined for text and symbols in visualizations.
font_add_google("Playfair Display", "playfair")
font_add('fa-reg', 'c:/Users/info/OneDrive/Dokumente/fonts/Font Awesome 6 Free-Regular-400.otf')
font_add('fa-brands', 'c:/Users/info/OneDrive/Dokumente/fonts/Font Awesome 6 Brands-Regular-400.otf')
font_add('fa-solid', 'c:/Users/info/OneDrive/Dokumente/fonts/Font Awesome 6 Free-Solid-900.otf')
showtext_auto()
bg <- "white"
col1 <- thematic::okabe_ito()[1]
col2 <- thematic::okabe_ito()[2]
col3 <- thematic::okabe_ito()[3]

# Define Symbols
# Symbols are defined using HTML-style code with appropriate colors and fonts.
twitter <- glue("<span style='color:{col1};font-family:fa-brands;'>&#xf099;</span>")
mastodon <- glue("<span style='color:{col1};font-family:fa-brands;'>&#xf4f6;</span>")
link <- glue("<span style='color:{col1};font-family:fa-solid;'>&#xf0c1;</span>")
data <- glue("<span style='color:{col1};font-family:fa-solid;'>&#xf1c0;</span>")
quote <- glue("<span style='color:{col1};font-family:fa-solid;'>&#xf10d;</span>")
space <- glue("<span style='color:{bg}'>-</span>")
space2 <- glue("<span style='color:{bg}'>--</span>") # Hyphens drawn in the background color act as invisible spacers.

# Define Title
# A formatted title for the analysis is defined using glue.
t <- glue("<b>Clustering The World By Time Spent For Food Growth & <br>Collection And GDP Per Capita</b>")

# Define Caption
# A formatted caption is defined, including social media icons and links.
cap <- glue("{twitter}{space2}@web_design_fh{space2} 
	{space2}{mastodon}{space2}@frankhaenel @fosstodon.org{space2}
	{space2}{link}{space}{space2}www.frankhaenel.de<br>
	{data}{space2}The{space}Human{space}Chronome{space}Project{space2}|{space2}WDI{space}package<br>
	{quote}{space2}Fajzel,{space}William,{space}et{space}al.{space}'The{space}global{space}human{space}day'.{space}Proceedings{space}of{space}the{space}National{space}Academy{space}of{space}Sciences,{space}vol.{space}120,{space}no.{space}25,{space}2023.")

# Define Cluster Labels
# Labels for different clusters are defined.
A <- "High GDP, Low Food Growth & Collection"
B <- "Low GDP, High Food Growth & Collection"
C <- "Moderate GDP, Moderate Food Growth & Collection"

# Data Preparation
# Keep the food growth & collection rows, join GDP by ISO code, and drop rows with missing values.
df <- all_countries %>%
  filter(Subcategory == "Food growth & collection") %>%
  left_join(wdi, by = "country_iso3") %>%
  na.omit()

# Elbow Method Plot
# The elbow method plot is created to determine the optimal number of clusters.
ellbow <- fviz_nbclust(scale(df[, c("hoursPerDayCombined", "GDP")]), kmeans, method = "wss") +
theme(
      plot.title = element_markdown(size = 12, hjust = 0.5, lineheight = 1.3, family = "playfair"),
      plot.subtitle = element_markdown(size = 15, hjust = 0.5, lineheight = 1.3, family = "playfair"),
      plot.caption = element_markdown(size = 9, hjust = 0.5, lineheight = 1.3, color = "lightgrey", family = "playfair"),
      axis.title = element_markdown(size = 8, color = "darkgrey", family = "playfair"),
      axis.text = element_markdown(size = 8, color = "darkgrey", family = "playfair"))

# Gap Statistic Plot
# The gap statistic plot is created to evaluate clustering quality.
gap_stat <- clusGap(scale(df[, c("hoursPerDayCombined", "GDP")]),
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10,
                    B = 50)
gap <- fviz_gap_stat(gap_stat) +
theme(
      plot.title = element_markdown(size = 11, hjust = 0.5, lineheight = 1.3, family = "playfair"),
      plot.subtitle = element_markdown(size = 15, hjust = 0.5, lineheight = 1.3, family = "playfair"),
      plot.caption = element_markdown(size = 9, hjust = 0.5, lineheight = 1.3, color = "lightgrey", family = "playfair"),
      axis.title = element_markdown(size = 8, color = "darkgrey", family = "playfair"),
      axis.text = element_markdown(size = 8, color = "darkgrey", family = "playfair"))

# Seed for Reproducibility
# A random seed is set to ensure reproducibility in k-means clustering.
# It is set after clusGap(), so the gap statistic simulation above is not covered by it.
set.seed(1)

# K-Means Clustering
# K-means clustering is performed on the data.
km <- kmeans(scale(df[, c("hoursPerDayCombined", "GDP")]), centers = 3, nstart = 25)
km

## K-means clustering with 3 clusters of sizes 30, 66, 87
## 
## Cluster means:
##   hoursPerDayCombined        GDP
## 1          -1.1161476  1.8968843
## 2           1.1214480 -0.5433786
## 3          -0.4658752 -0.2418797
## 
## Clustering vector:
##   [1] 1 2 2 2 1 3 3 3 1 1 3 2 1 2 3 2 3 3 3 3 3 3 3 3 3 1 2 3 2 1 1 3 2 2 2 2 3
##  [38] 3 2 2 3 3 3 1 3 1 2 1 3 3 3 3 3 3 2 1 2 1 2 2 1 3 2 2 3 2 2 3 3 2 2 3 1 2
##  [75] 3 2 3 2 2 1 3 3 1 1 1 3 3 1 3 3 3 2 2 1 3 2 3 2 3 3 3 3 3 1 3 2 3 2 2 3 3
## [112] 3 2 1 2 3 2 2 2 3 2 3 3 2 2 2 1 1 3 1 3 3 3 3 3 2 3 3 3 3 3 2 1 2 3 2 3 3
## [149] 1 2 2 3 2 3 2 3 3 3 1 3 2 2 2 2 3 2 2 3 3 3 2 2 3 3 1 3 3 2 2 2 3 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 33.61329 33.58082 25.02299
##  (between_SS / total_SS =  74.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

df$cluster <- km$cluster

# Scatter Plot
# A scatter plot is created to visualize the clustered data.
scatter <- ggplot(data = df, aes(x = GDP, y = hoursPerDayCombined, color = as.factor(cluster))) + geom_point(show.legend = FALSE) +
scale_color_manual(values = c(col1, col2, col3)) +
theme(
      plot.title = element_markdown(size = 11, hjust = 0.5, lineheight = 1.3, family = "playfair"),
      plot.subtitle = element_markdown(size = 15, hjust = 0.5, lineheight = 1.3, family = "playfair"),
      plot.caption = element_markdown(size = 9, hjust = 0.5, lineheight = 1.3, color = "lightgrey", family = "playfair"),
      axis.title = element_markdown(size = 8, color = "darkgrey", family = "playfair"),
      axis.text = element_markdown(size = 8, color = "darkgrey", family = "playfair")) +
labs(title = "Time spent for Food growth and <br>collection vs. GDP per Capita")

# Choropleth Map
# A choropleth map is created to display clustering results on a world map.
map <- world %>% rename(country_iso3 = iso_a3) %>%
    left_join(df, by = "country_iso3") %>%
    ggplot() +
    geom_sf(aes(fill = as.factor(cluster)), linewidth = 0.01) + theme(legend.position="bottom") +
    scale_fill_manual(values=c(col1, col2, col3),
                      name="Clusters",
                      breaks=c("1", "2", "3"),
                      labels=c(A, B, C),na.value = "lightgrey") +
        theme(legend.text = element_markdown(size = 7))

# Combine Plots
# The plots are combined using the 'patchwork' package.
a <- ellbow + gap + scatter
patchwork <- map / a + plot_layout(heights = c(2, 1))

# Add Plot Annotations and Adjust Theme
# Annotations and theme adjustments are applied to the combined plot.
patchwork + plot_annotation(
     subtitle = t, caption = cap
 ) &
     theme(
      plot.subtitle = element_markdown(size = 14, hjust = 0.5, lineheight = 1.3, family = "playfair"),
      plot.caption = element_markdown(size = 9, hjust = 0, lineheight = 1.3, color = "darkgrey", family = "playfair"))

This layout combines the elbow method plot, gap statistic plot, scatter plot, and choropleth map to show how countries cluster by GDP per capita and time spent on food growth and collection.

Elbow Method Plot: Shows the total within-cluster sum of squares for each candidate number of clusters; the bend in the curve suggests a suitable number.
Gap Statistic Plot: Compares the within-cluster dispersion for each number of clusters with the dispersion expected under a reference distribution without cluster structure, as a second check on the number of clusters.
Scatter Plot: Plots GDP per capita (x-axis) against time spent on food growth and collection (y-axis), with one color per cluster.
Choropleth Map: Shades each country on a world map according to its cluster assignment; countries without data are shown in light grey.

Documentation

Introduction

This documentation explains the R code used to analyze and visualize data related to time spent on food growth and collection and GDP per capita for different countries. The analysis includes clustering countries based on these factors and presenting the results using various plots and visualizations.

Required Libraries

The code begins by loading the R libraries required for data manipulation, visualization, clustering, mapping, and font handling:

tidyverse: For data manipulation and visualization.
tidytuesdayR: For accessing TidyTuesday datasets.
showtext: For working with fonts.
glue: For text formatting.
ggtext: For enhanced text formatting in ggplot2.
factoextra and cluster: For clustering analysis.
patchwork: For combining multiple plots.
WDI: For fetching World Development Indicators (WDI) data.
sf, rnaturalearth and rnaturalearthdata: For working with spatial data and world map data.

Data Loading and Preparation

Data is loaded from the TidyTuesday dataset for a specific week (2023, week 37) using tidytuesdayR::tt_load.
Specific datasets are extracted from the loaded data, including all_countries, country_regions, global_human_day, and global_economic_activity.
World Development Indicators (WDI) data for Gross Domestic Product (GDP) per capita for the year 2021 is fetched and renamed for clarity.
World country data is loaded and stored as a spatial object using the ne_countries function from the rnaturalearth package for mapping purposes.

Fonts and Colors

Fonts are loaded, and colors are defined for text and symbols in visualizations. Font Awesome icons are used for social media symbols.

Symbols Definition

Symbols are defined using HTML-style code with appropriate colors and fonts. Symbols include icons for Twitter, Mastodon, links, data, and quotes.

Title and Caption

A formatted title and caption for the analysis are defined using the glue function.

Cluster Labels

Labels for the three clusters are stored in the variables A, B, and C. The map legend later assigns them to cluster numbers 1, 2, and 3, in that order.
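
K-means numbers its clusters arbitrarily, so a label tied to a cluster number can end up on the wrong group when the seed or data changes. The following is a minimal sketch, not the code used above, of deriving labels from the cluster centers instead. The toy data, the column names hours and gdp, and the shortened label texts are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the toy result is repeatable

# Made-up values: three clearly separated groups of three countries each
toy <- data.frame(
  hours = c(0.1, 0.2, 0.15, 0.8, 0.7, 0.75, 1.4, 1.6, 1.5),
  gdp   = c(50000, 42000, 47000, 9000, 11000, 10000, 900, 1500, 1100)
)
km_toy <- kmeans(scale(toy), centers = 3, nstart = 25)

# Cluster ids ordered by their GDP center, richest first
by_gdp <- order(km_toy$centers[, "gdp"], decreasing = TRUE)

# match() turns each cluster id into its GDP rank, which indexes the labels
labels <- c("High GDP", "Moderate GDP", "Low GDP")
toy$cluster_label <- factor(labels[match(km_toy$cluster, by_gdp)], levels = labels)

print(table(toy$cluster_label))
```

With labels built this way, the factor levels rather than the raw cluster numbers can be mapped to colors and legend entries.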

Data Preparation

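The all_countries data is filtered to the subcategory "Food growth & collection" and left-joined with the WDI data by country_iso3. Rows with missing values, such as countries without a 2021 GDP figure, are dropped with na.omit. The resulting data frame is stored as df.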
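
Because na.omit drops rows silently, it can help to list which countries are lost in the join. The sketch below illustrates the idea on made-up data. It uses base R's merge with all.x = TRUE in place of dplyr's left_join so that it runs without extra packages, and the data frames time_use and gdp are hypothetical stand-ins for all_countries and wdi.

```r
# Toy stand-ins for all_countries and wdi (all values are made up)
time_use <- data.frame(
  Subcategory = c("Food growth & collection", "Food growth & collection",
                  "Food growth & collection", "Sleep & bedrest"),
  country_iso3 = c("AAA", "BBB", "CCC", "AAA"),
  hoursPerDayCombined = c(0.2, 1.5, 0.8, 9.1)
)
gdp <- data.frame(
  country_iso3 = c("AAA", "BBB"),
  GDP = c(45000, 1200)
)

# Keep one subcategory, then left-join GDP (all.x = TRUE keeps unmatched rows)
food <- time_use[time_use$Subcategory == "Food growth & collection", ]
joined <- merge(food, gdp, by = "country_iso3", all.x = TRUE)

# Countries that na.omit() will drop because they have no GDP value
dropped <- joined$country_iso3[is.na(joined$GDP)]
prepared <- na.omit(joined)

print(dropped)
print(nrow(prepared))
```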

Elbow Method Plot

An elbow method plot is created with fviz_nbclust (method = "wss"). It shows the total within-cluster sum of squares for each candidate number of clusters; the point where the curve flattens suggests how many clusters to use for k-means.
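
The quantity behind an elbow plot can be computed directly with base R. This is a rough sketch on simulated data, assuming three well-separated groups; it is not the factoextra implementation, and the data and the range of k are chosen only for illustration.

```r
set.seed(1)  # arbitrary seed so the toy data and k-means starts are repeatable

# Simulated data: three 2-D groups of 30 points each (made-up values)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)
x <- scale(toy)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

print(data.frame(k = 1:6, wss = round(wss, 2)))
```

Plotting wss against k, for example with plot(1:6, wss, type = "b"), gives the elbow curve; the number of clusters is read off where the decrease levels out.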

Gap Statistic Plot

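A gap statistic plot is created as a second check on the number of clusters. clusGap compares the within-cluster dispersion for 1 to 10 clusters (K.max = 10) with the dispersion expected under a reference distribution, estimated from 50 simulated reference data sets (B = 50).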
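
The numbers behind a gap statistic plot can also be inspected without plotting. The sketch below runs clusGap on simulated data and applies maxSE from the cluster package to pick a number of clusters. The toy data, the smaller K.max and B values, and the choice of the "firstSEmax" rule are assumptions for illustration.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)  # set before clusGap() so the reference simulations are repeatable

# Simulated data: three 2-D groups of 30 points each (made-up values)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

gap_toy <- clusGap(scale(toy), FUN = kmeans, nstart = 25, K.max = 6, B = 20)

# Gap value and its simulation standard error for each k
gap_tab <- gap_toy$Tab[, c("gap", "SE.sim")]

# One selection rule: smallest k within one SE of the first local maximum
k_hat <- maxSE(gap_tab[, "gap"], gap_tab[, "SE.sim"], method = "firstSEmax")

print(round(gap_tab, 3))
print(k_hat)
```

Other rules, such as "globalmax" or "Tibs2001SEmax", can give a different k on the same table, so the chosen rule is worth stating alongside the result.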

K-Means Clustering

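K-means clustering is performed with the kmeans function on the standardized (scale) values of hoursPerDayCombined and GDP, using 3 centers and 25 random starts. The printed result shows clusters of 30, 66, and 87 countries, with the between-cluster sum of squares accounting for 74.7 % of the total.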
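
Because the clustering runs on standardized values, the printed cluster means are in standard-deviation units rather than hours or dollars. A minimal sketch of converting centers back to the original units follows, using the attributes that scale stores on its result. The toy data and the column names hours and gdp are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed so the toy result is repeatable

# Made-up values: two clearly separated groups of three countries each
toy <- data.frame(
  hours = c(0.1, 0.2, 0.15, 1.4, 1.6, 1.5),
  gdp   = c(50000, 42000, 47000, 900, 1500, 1100)
)
x <- scale(toy)
km_toy <- kmeans(x, centers = 2, nstart = 25)

# Undo the standardization column by column: center * sd + mean
centers_orig <- sweep(km_toy$centers, 2, attr(x, "scaled:scale"), `*`)
centers_orig <- sweep(centers_orig, 2, attr(x, "scaled:center"), `+`)

print(centers_orig)
```

The back-transformed centers equal the per-cluster means of the original columns, which makes them easier to describe in a legend or caption.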

Scatter Plot

A scatter plot is created to visualize the clustered data, with GDP per capita on the x-axis and time spent on food growth and collection on the y-axis. Each cluster is represented by a different color.

Choropleth Map

A choropleth map is created to display clustering results on a world map. Countries are shaded according to their cluster assignment.
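
A join on ISO codes leaves a country grey whenever its code is missing or differs between the two tables. Natural Earth's iso_a3 field is known to hold the placeholder -99 for a few countries in some releases; whether that affects the version used here was not checked. The sketch below shows one way to list unmatched codes before plotting. The code vectors are hypothetical stand-ins for world$iso_a3 and df$country_iso3.

```r
# Hypothetical ISO codes standing in for world$iso_a3 and df$country_iso3
map_codes  <- c("DEU", "-99", "USA", "ATA")
data_codes <- c("DEU", "FRA", "USA")

# Map polygons without a cluster (these are drawn in the na.value color)
unmatched_map <- setdiff(map_codes, data_codes)

# Clustered countries that never reach the map
unmatched_data <- setdiff(data_codes, map_codes)

print(unmatched_map)
print(unmatched_data)
```

Codes that appear in both lists under different spellings are candidates for a manual fix before the join.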

Combine Plots

The plots are combined using the patchwork package to create a single layout containing the elbow method plot, gap statistic plot, scatter plot, and choropleth map.

Plot Annotations and Theme Adjustments

Annotations and theme adjustments are applied to the combined plot: the title text is passed as the subtitle, the caption is added, and both are styled with element_markdown.

Output

The final output is an HTML document containing all the visualizations and annotations created during the analysis. The document provides insights into the clustering of countries based on GDP per capita and time spent on food growth and collection.

References

Fajzel, W., Galbraith, E. D., Barrington-Leigh, C., Charmes, J., Frie, E., Hatton, I. A., Mézo, P. L., Milo, R., Minor, K., Wan, X., Xia, V., & Xu, S. (2023). The Global Human Day. Proceedings of the National Academy of Sciences of the United States of America, 120(25). https://doi.org/10.1073/pnas.2219564120

Arel-Bundock, V. (2022). WDI: World Development Indicators and Other World Bank Data. R package version 2.7.8. https://CRAN.R-project.org/package=WDI