library(tidygraph)
library(ggraph)
EBS Department Network
Are we all connected in the EBS department at Monash University?
This started as the question of how far apart PhD students are from each other in the EBS department and so I thought that surely not all students are connected. What I will be doing in this blog post is to represent the relationship between the PhD student and supervisors using a graph structure. Graph structure is made up of nodes and are connected by edges. First, I will treat both students and staff as nodes, and use edges to represent the relationships. This will allows me to see whether the disconnection within the department is presented or not. If the graph is not all connected, I will be able to observed many cluster. However, if we were all connected, one cluster will be observed.
<- as_tbl_graph(full) |>
graph activate(nodes) |>
mutate(status = ifelse(name %in% full$to, "student", "supervisor"))
ggraph(graph, layout = "stress") +
geom_edge_link() +
geom_node_point(aes(color = status)) +
geom_node_label(aes(label = name), size = 1.5, repel = TRUE, label.padding = 0.15) +
scale_color_manual(values = c("blue", "red"),
labels = c("Student", "Supervisor")) +
labs(color = "") +
theme_void()
Surprise! We are all connected. This is a very interesting finding for me since the EBS department has a variety of fields, such as business analytics, econometrics, actuarial science etc. So, who is the connection to us all? If I were to remove Farshid Vahid-Araghi, Elvis Yang’s supervisor, then there would be a total of two clusters; econometrics and business analytics. Given that Farshid has now retired, the EBS department is endangered to be split apart.
Good news. Remember, during the data process, I have talked about the data not being fully accurate. Actually, Betts Ruscoe’s supervisors are Klaus Ackermann and Bonsoo Koo!!. What does this mean? It means even after Elvis Yang submitted his thesis, the new connection between econometrics and business analytics has been established through Betts and her supervisor Bonsoo.
Let’s explore some more details from this dataset. Who has supervised the most students?
|>
full count(from) |>
arrange(-n) |>
head(10) |>
rename(
Supervisors = from,
`Total students` = n) |>
::kable() kableExtra
Supervisors | Total students |
---|---|
Dianne Cook | 6 |
Jiti Gao | 6 |
Dan Zhu | 5 |
Athanasios Pantelous | 4 |
David Frazier | 4 |
Bin Peng | 3 |
Catherine Forbes | 3 |
George Athanasopoulos | 3 |
Michael Lydeamore | 3 |
Rob Hyndman | 3 |
To be honest, this is not that of a surprise.
References
Graduate research students and supervisors. (2024, October 21). Monash Business School. https://www.monash.edu/business/research/our-researchers/graduate-research-students-and-supervisors?queries_degree_query_posted=1&queries_degree_query=Econometrics+and+Business+Statistics
Pedersen T (2024). ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. R package version 2.2.1, https://CRAN.R-project.org/package=ggraph.
Pedersen T (2024). tidygraph: A Tidy API for Graph Manipulation. R package version 1.3.1, https://CRAN.R-project.org/package=tidygraph.
Staff directory. (2022, July 4). Monash Business School. https://www.monash.edu/business/econometrics-and-business-statistics/our-people/staff-directory
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.
The question I want to answer is whether all the PhD students in our department are connected or not. The data that will be used to answer this question will be web-scraped from the Monash Website.
This website contains information about graduate research students and supervisors. These are the details all the PhD students were asked to send to the admin team which should be accurate information. However, there are some details I should keep in mind. First, PhD does not regularly send updated details to the admin team. Secondly, PhD supervisors can be changed and that could lead to misinformation. There may be better data kept by the Monash Graduate Research Office, however, that will require a lot of work to obtain. Since the data on the website should still be accurate enough and for time saving I will use the web-scraping to obtain the data.
I accept 6 marks deduction for not including Task 1.3.
The download process for the data is as follows:
# Library package
library(tidyverse)
library(rvest)
# Function to scrape phd information from a directory page
<- function(url) {
scrape_phd_info # Read the webpage
<- xml2::read_html(url)
webpage
# Find all phd heading elements
<- webpage %>%
heading_elements html_elements("h3.box-listing-element__heading")
# Find all description elements
<- webpage %>%
description_elements html_elements("p.box-listing-element__description")
# Initialize an empty dataframe to store results
<- data.frame(
phd_data url = character(),
phd_name = character(),
supervisor = character(),
stringsAsFactors = FALSE
)
# For each heading element, find corresponding description and extract information
for (i in 1:length(heading_elements)) {
# Get the phd name
<- html_text(heading_elements[i], trim = TRUE)
phd_name
# Try to find corresponding description element
if (i <= length(description_elements)) {
<- html_text(description_elements[i], trim = TRUE)
description_text
# Extract text after "Supervisors:" if it exists
if (grepl("Supervisors:", description_text)) {
# Extract the supervisors text
<- str_extract(description_text, "(?<=Supervisors:).*")
supervisors_text <- trimws(supervisors_text)
supervisors_text
# Split supervisors by comma, semicolon, or "and"
<- strsplit(supervisors_text, "\\s*[,;]\\s*|\\s+and\\s+")[[1]]
supervisors_list <- trimws(supervisors_list)
supervisors_list
# Create one row per supervisor
for (supervisor in supervisors_list) {
if (nchar(supervisor) > 0) {
<- rbind(
phd_data
phd_data,data.frame(
phd_name = phd_name,
supervisor = supervisor,
stringsAsFactors = FALSE
)
)
}
}else {
} # Add a row with NA for supervisor if none found
<- rbind(
phd_data
phd_data,data.frame(
phd_name = phd_name,
supervisor = NA,
stringsAsFactors = FALSE
)
)
}
}
}
return(phd_data)
}
<- as.vector(glue::glue("https://www.monash.edu/business/research/our-researchers/graduate-research-students-and-supervisors?queries_degree_query=Econometrics+and+Business+Statistics&queries_degree_query_posted=1&result_1619149_result_page={a}",
url a = c("1", "2", "3", "4")))
for (i in seq_along(url)) {
download.file(url[i],
destfile = paste0("data/page", i, ".html"),
quiet = TRUE)
}
<- glue::glue("data/page{a}.html", a = c("1", "2", "3", "4"))
file
<- do.call(rbind, lapply(file, scrape_phd_info)) |>
results rename(from = supervisor,
to = phd_name)
save(results, file = "data/supervisors.Rdata")
<- function(url) {
extract_team_names # Read the webpage
<- xml2::read_html(url)
webpage
# Create empty lists to store results
<- list()
leadership_names <- list()
academic_names
# Find Leadership Team section
<- webpage %>%
leadership_section html_elements("h2:contains('Leadership team')") %>%
html_element(xpath = "following-sibling::*")
# Extract names from leadership section (adjusting selector as needed)
# This assumes the names are in links right after the h2
if (length(leadership_section) > 0) {
<- leadership_section %>%
leadership_names html_elements("a") %>%
html_text() %>%
trimws()
}
# Find Academic Staff section
<- webpage %>%
academic_section html_elements("h2:contains('Academic staff')") %>%
html_element(xpath = "following-sibling::*")
# Extract names from academic section
if (length(academic_section) > 0) {
<- academic_section %>%
academic_names html_elements("a") %>%
html_text() %>%
trimws()
}
# Create a results list
<- list(
results leadership_team = leadership_names,
academic_staff = academic_names
)
return(results)
}
download.file("https://www.monash.edu/business/ebs/our-people/staff-directory", destfile = "data/main.html", quiet = TRUE)
<- extract_team_names("data/main.html") |>
main unlist() |>
as_tibble() |>
rename(supervisor = value) |>
filter(!str_detect(supervisor, "@monash.edu"))
save(main, file = "data/main.Rdata")
I start by first downloading the HTML page from the Monash website, then scrape PhD name using the h3 tags with the box-listing-element__heading class and p tags with the box-listing-element__description class for details. Since we only care about the supervisor’s name, I then use regex to extract that. Next, to cross-reference with the supervisor’s name I have also web-scraped all the EBS staff from the Monash Website. Now, the dataset has to be in a tidy data format which will make it easier for me to do an analysis.
library(fuzzyjoin)
load("data/supervisors.Rdata")
load("data/main.Rdata")
<- results |>
full mutate(from = str_remove(from, "Prof |A/Prof |Dr |Prof. |Dr. |Assoc Prof. |Professor ")) |>
stringdist_left_join(main,
by = c("from" = "supervisor"),
max_dist = 2) |>
mutate(supervisor = ifelse(is.na(supervisor), from, supervisor)) |>
mutate(supervisor = case_when(
== "David T. Frazier" ~ "David Frazier",
supervisor == "Di Cook" ~ "Dianne Cook",
supervisor == "Farshid Vahid-Araghi" ~ "Farshid Vahid",
supervisor == "Susan VanderPlas" ~ "Susan Vanderplus",
supervisor == "Thiyanga S. Talagala" ~ "Thiyanga Talagala",
supervisor .default = supervisor
|>
)) select(to, supervisor) |>
rename(from = supervisor)
To make it easier, I have removed the title from the EBS staff. Then I use the stringdist_left_join to join any staff name that is very close in terms of character distance. However, after going through all of that there are still some staff names with large distances so I have to manually fix them. One comment about the string distance is that, yes, I could have increased the threshold, but, that could also lead to a mismatch.
References
Graduate research students and supervisors. (2024, October 21). Monash Business School. https://www.monash.edu/business/research/our-researchers/graduate-research-students-and-supervisors?queries_degree_query_posted=1&queries_degree_query=Econometrics+and+Business+Statistics
Robinson D (2020). fuzzyjoin: Join Tables Together on Inexact Matching. R package version 0.1.6, https://CRAN.R-project.org/package=fuzzyjoin.
Staff directory. (2022, July 4). Monash Business School. https://www.monash.edu/business/econometrics-and-business-statistics/our-people/staff-directory
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Wickham H (2024). rvest: Easily Harvest (Scrape) Web Pages. R package version 1.0.4, https://CRAN.R-project.org/package=rvest.
I would like to acknowledge the use of AI for generating the web-scraping script. However, I have already deleted the chat log, thus, I accept the deduction for this.