This work was undertaken as part of the “Movement of people for education and loneliness” Data Challenge organised by the Jean Golding Institute for Data-Intensive Research at the University of Bristol and ONS.
The report is broken into three sections:
The following people contributed to this report:
This report was written in R Markdown. The .Rmd version can be found here.
In an attempt to better understand the data provided a visualization program displaying the geographic and time distribution of the z-score provided was developed. The program takes form as a website accessible at the following web address http://51.38.69.55/.
The program is hosted on a low performance private server. Acces to this server provided by the link above is for basic demonstration purposes. For proprer use and user interaction the program should be hosted on an appropriate server.
The source code for the database generation and the visualization is available here.
The following section explains the different features of the site.
All of the data provided in a spreadsheet format and the external data used was transfered to a SQL data base. The database is composed of two tables. The ‘final_data’ table contains the provided data. While the ‘student_migration’ table contains the data on student migration provided by HESA (Higher Education Statistics Agency).
Located in the upper section of the site the control panel allows to select what information will be displayed in the lower section.
In the lower section on the right side is displayed the bar graph(s) representing the data of the cities selected in the control panel where one or multiple condition of interest can as well be selected. The x-axis shows the year of the data collection. The y-axis displays the z-score value(s) of the selected condition(s).
In the lower section on the left is displayed 3 maps. They each displays the data for one year. The available years are 2016, 2017 and 2018. The data for only one condition can be displayed at a time. Different visualization mode are available including choropleth map of the regions or counties in England which uses gradiant coloring of the location. In addition, Scatter map which distinguishes data using different colors and sizes of the data points.
In order to visualize the migration of students from abroad to England the following data-set provided by HESA(Higher Education Statistics Agency) was used: https://www.HESA.ac.UK/data-and-analysis/students/table-11 The data-set information on the number of students moving to each English region, the domicile of the students, as well as their mode and level of study between 2014 and 2018.
Data visualization is an important step which allow a better understanding of any data-set before analysis. Here the data provided is displayed in different form such that a clear view of the data distribution in England is achieved.
Question: Does the movement of students to a different region of England for education associate with increased lonliness in their county or region of origin?
In addition to the general visualisation of the data-sets provided by the JGI Data Competition and others sourced online in Section 1, we wanted to explore the hypothesis that students leaving their county/region to travel to a different region of the country for higher education might be associated with the loneliness score in that county/region.
We choose this question because we believe the effect of student migration on those left behind is often overlooked in terms of analysis/interventions, with the focus usually being placed on preventing loneliness in students arriving in a new location.
Note: All code used within this section is available for review. You can view it by clicking the “Code”" button on the right hand side of the screen within a specific sub-section. For example:
# This is a demonstration of how code is presented within this section!
The data-sets used in this analysis are laid out below, and where appropriate, a link to their source is provided. For familiarity, the first few rows of each data-sets are presented.
Sourced from the JGI Data Competition team.
msoa11cd | msoa11nm | msoa11nmw | st_areasha | st_lengths | loneills_2016 | loneills_2017 | loneills_2018 |
---|---|---|---|---|---|---|---|
E02006534 | Adur 001 | Adur 001 | 6530065 | 16833 | 1.5867 | 1.225 | 1.857 |
E02006535 | Adur 002 | Adur 002 | 5845402 | 16300 | 2.1739 | 1.905 | 2.042 |
E02006536 | Adur 003 | Adur 003 | 10594058 | 17615 | 0.8986 | 1.257 | 1.231 |
E02006537 | Adur 004 | Adur 004 | 2953407 | 15209 | 1.6069 | 1.802 | 2.135 |
E02006538 | Adur 005 | Adur 005 | 1992233 | 16823 | 1.9869 | 1.920 | 2.255 |
E02006539 | Adur 006 | Adur 006 | 8296730 | 17177 | 0.4642 | 1.089 | 1.260 |
Sourced from the HESA website, this data-set describes the number of students moving to each region from a given domicile/town.
X4.way.domicile | Domicile | Level.of.study | Mode.of.study | Academic.Year | Region.of.HE.provider | Number | |
---|---|---|---|---|---|---|---|
19 | All | Aberdeenshire | All | Full-time | 2017/18 | North East | 40 |
20 | All | Aberdeenshire | All | Full-time | 2017/18 | North West | 35 |
21 | All | Aberdeenshire | All | Full-time | 2017/18 | Yorkshire and The Humber | 25 |
22 | All | Aberdeenshire | All | Full-time | 2017/18 | East Midlands | 15 |
23 | All | Aberdeenshire | All | Full-time | 2017/18 | West Midlands | 20 |
24 | All | Aberdeenshire | All | Full-time | 2017/18 | East of England | 30 |
Sourced from here, this data-set matches town names to their respective counties and countries.
Town | County | Country |
---|---|---|
Ampthill | Bedfordshire | England |
Arlesey | Bedfordshire | England |
Bedford | Bedfordshire | England |
Biggleswade | Bedfordshire | England |
Dunstable | Bedfordshire | England |
Flitwick | Bedfordshire | England |
Sourced from here, this data-set matches counties to their respective regions.
County | Region |
---|---|
Bedfordshire | East |
Berkshire | South East |
Bristol | South West |
Buckinghamshire | South East |
Cambridgeshire | East |
Cheshire | West Midlands |
Sourced from here, this data-set matches districts to their respective counties.
District | County |
---|---|
Adur | West Sussex |
Allerdale | Cumbria |
Amber Valley | Derbyshire |
Arun | West Sussex |
Ashfield | Nottinghamshire |
Ashford | Kent |
To create a single data-set which matched towns to their respective counties and regions, we merged data-sets 3 & 4, on the basis of county name. We also removed country information after limiting the new data-set to towns/counties/regions in England.
# Import town.county list, and limit to towns in England
town.county <- read.csv("data/Towns_List.csv")
town.county <- town.county[which(town.county$Country=="England"),]
town.county <- town.county[,c(1:2)]
# Import county.region list
county.region <- read.csv("data/county.region.csv", stringsAsFactors=FALSE, header = TRUE)
colnames(county.region)[1] <- "County"
# Merge above datasets on the basis of county to create town.county.region data
town.county.region <- merge(town.county, county.region)
Town | County | Region |
---|---|---|
Ampthill | Bedfordshire | East |
Arlesey | Bedfordshire | East |
Bedford | Bedfordshire | East |
Biggleswade | Bedfordshire | East |
Dunstable | Bedfordshire | East |
Flitwick | Bedfordshire | East |
In order to prepare for later analyses, we added county and region information to the HESA data, using the newly created data-set in Section 2.3.1 above. The first few lines of the resulting data-set are presented below.
# Read in HESA student migration
data.move <- read.csv("data/Movement data.csv", header = FALSE)
# Clean dataframe of description and correct column names
data.move <- data.move[18:nrow(data.move),]
colnames(data.move) = make.names(as.character(unlist(data.move[1, ])))
data.move = data.move[-1, ]
data.move <- as_tibble(data.move)
# Remove "City of" from Domicile column to aid merging
data.move <- data.move %>%
mutate(Domicile = gsub("City of ", "", Domicile))
# Merge on basis of county name
tmp1 <- merge(data.move, county.region, by.x = "Domicile", by.y = "County")
colnames(tmp1)[1] <- "County"
# Merge on basis on town name
tmp2 <- merge(data.move, town.county.region, by.x = "Domicile", by.y = "Town")
tmp2 <- tmp2[,c(2:9)]
# Combine the two merged datasets created above
countydata <- rbind(tmp1, tmp2)
County | Region | Academic.Year | Level.of.study | Mode.of.study | Region.of.HE.provider | Number |
---|---|---|---|---|---|---|
Bristol | South West | 2017/18 | Other undergraduate | Part-time | East of England | 5 |
Bristol | South West | 2017/18 | Other undergraduate | Part-time | London | 10 |
Bristol | South West | 2015/16 | All postgraduate | Part-time | Wales | 130 |
Bristol | South West | 2017/18 | Other undergraduate | Part-time | West Midlands | 25 |
Bristol | South West | 2015/16 | All postgraduate | Part-time | The Open University in England | 70 |
Bristol | South West | 2015/16 | All postgraduate | Part-time | Total England | 2265 |
We ran into issues around how to convert the academic year (used in the student migration data) to the calendar year (used in the loneliness index data).
We decided to assign the full effect of migration in a given academic year to it’s earliest calendar year, on the basis that the effect of student migration on loneliness is likely to be most acute in the months immediately after their departure. As an example, the 2014/2015 academic year was re-coded as 2014.
However, taking this approach meant that there was mismatch between the student migration data (2014 to 2017) and the loneliness score data (2016 to 2018). As a result, all subsequent analysis are limited to the two years where data was available from both sources: 2016 and 2017.
countydata <- countydata %>%
mutate(
Academic.Year = gsub("2014/15", "2014", Academic.Year),
Academic.Year = gsub("2015/16", "2015", Academic.Year),
Academic.Year = gsub("2016/17", "2016", Academic.Year),
Academic.Year = gsub("2017/18", "2017", Academic.Year)
)
In keeping with our research question, to prepare the data-set to examine the effect on loneliness within a county of students moving out of the county to study, we:
countydata.trimmed.out <- filter(
countydata,
Region.of.HE.provider != "Total England" &
Region.of.HE.provider != "Total United Kingdom" &
Region.of.HE.provider != "The Open University in England" &
Region.of.HE.provider != Region
)
We then summed all those leaving a county to attend a university in a different region of the UK to create an outward migration for education figure. The plot below shows the results of this analysis for 2016.
Note: the plot below shows 47 of the 48 counties in England. We had limited the analysis to England in Section 2.3.2 as we only had loneliness data for this country. Further, the City of London is not shown below, as it was included as part of Greater London in the student migration data.
# Create sum of people moving from each county in 2016
data.outof.domicile <- filter(countydata.trimmed.out, X4.way.domicile == "All", Level.of.study == "All", Mode.of.study == "All") %>%
group_by(Academic.Year, County) %>%
summarise(sum.no = sum(as.numeric(as.character(Number))))
# Filter data to 2016/2017
fig1.data <- data.outof.domicile[which(data.outof.domicile$Academic.Year=="2016" | data.outof.domicile$Academic.Year=="2017"),]
# Create plot
ggplot(fig1.data, aes(x = reorder(County,sum.no), y=sum.no)) +
facet_grid(.~Academic.Year) +
geom_bar(stat="identity") +
coord_flip() +
labs(x = "County (England only)",
y = "Total number of students moving from the specififed county",
title = "Outward movement of students for education \n (county level)") +
scale_y_continuous(breaks=seq(0, 175000, 50000),label = scales::comma, expand=c(0.05,0))+
theme(plot.margin = unit(c(1,1,1,1), "cm"),
plot.title = element_text(hjust = 0.5))
To address our second hypothesis, we create an outward migration for education figure at the regional level. The results of this are presented below.
# Building on the intital analysis done above
data.outof.region <- filter(countydata.trimmed.out, X4.way.domicile == "All", Level.of.study == "All", Mode.of.study == "All") %>%
group_by(Academic.Year, Region) %>%
summarise(no.out = sum(as.numeric(as.character(Number))))
# Filter data to 2016
data.outof.region.plot <- data.outof.region[which(data.outof.region$Academic.Year=="2016" | data.outof.region$Academic.Year=="2017"),]
# Create plot
ggplot(data.outof.region.plot, aes(x = reorder(Region, no.out), y=no.out)) +
facet_grid(Academic.Year~.) +
geom_bar(stat="identity", fill = "#56B4E9") +
coord_flip() +
labs(x = "Region (limited to England)",
y = "Total number of students moving from this region",
title = "Outward movement of students for education (region level)")+
scale_y_continuous(label = scales::comma, expand=c(0.05,0)) +
theme(plot.margin = unit(c(1,1,1,1), "cm"),
plot.title = element_text(hjust = 0.5))
To address our final hypothesis, we created a total migration for education figure (number of students leaving the region plus number of students arriving in the region) per region.
We created a assumptive rather than net migration for education per region figure on the basis that having students moving into a region is more likely to increase loneliness than reduce it, as they will not simply replace the students who left in terms of social contacts, plus they might possibly be homesick/lonely themselves.
# Limit dataset to those travelling to HE provider within England
countydata.trimmed.in <- countydata
countydata.trimmed.in$Region.of.HE.provider <- as.character(countydata.trimmed.in$Region.of.HE.provider)
countydata.trimmed.in$Region.of.HE.provider <- ifelse(countydata.trimmed.in$Region.of.HE.provider == "East of England","East", countydata.trimmed.in$Region.of.HE.provider)
countydata.trimmed.in$Region.of.HE.provider <- ifelse(countydata.trimmed.in$Region.of.HE.provider == "Yorkshire and The Humber","Yorkshire & Humber", countydata.trimmed.in$Region.of.HE.provider)
countydata.trimmed.in <- filter(
countydata.trimmed.in,
Region.of.HE.provider == "West Midlands" |
Region.of.HE.provider == "Yorkshire & Humber" |
Region.of.HE.provider == "South West" |
Region.of.HE.provider == "South East" |
Region.of.HE.provider == "North West" |
Region.of.HE.provider == "North East" |
Region.of.HE.provider == "London" |
Region.of.HE.provider == "East"|
Region.of.HE.provider == "East Midlands",
Region.of.HE.provider != Region
)
countydata.trimmed.in$Region.of.HE.provider <- as.character(countydata.trimmed.in$Region.of.HE.provider)
countydata.trimmed.in$Region.of.HE.provider <- ifelse(countydata.trimmed.in$Region.of.HE.provider == "East of England","East", countydata.trimmed.in$Region.of.HE.provider)
countydata.trimmed.in$Region.of.HE.provider <- ifelse(countydata.trimmed.in$Region.of.HE.provider == "Yorkshire and The Humber","Yorkshire & Humber", countydata.trimmed.in$Region.of.HE.provider)
# Create sum of people moving into each region of England
data.into.region <- filter(countydata.trimmed.in, X4.way.domicile == "All", Level.of.study == "All", Mode.of.study == "All") %>%
group_by(Academic.Year, Region.of.HE.provider) %>%
summarise(no.in = sum(as.numeric(as.character(Number))))
colnames(data.into.region)[2] <- "Region"
# Create dataframe of total population chance due to migration
data.sum.migration <- merge(data.into.region, data.outof.region)
data.sum.migration <- data.sum.migration %>%
mutate(no.total = no.in + no.out)
# Filter data to 2016/2017
fig2.data.1 <- data.into.region[which(data.outof.region$Academic.Year=="2016" | data.outof.region$Academic.Year=="2017"),]
fig2.data.1[4] <- "Inwards"
colnames(fig2.data.1)[4] <- "Direction"
colnames(fig2.data.1)[3] <- "No"
fig2.data.2 <- data.outof.region[which(data.outof.region$Academic.Year=="2016" | data.outof.region$Academic.Year=="2017"),]
fig2.data.2[4] <- "Outwards"
colnames(fig2.data.2)[4] <- "Direction"
colnames(fig2.data.2)[3] <- "No"
fig2.data <- rbind(fig2.data.1,fig2.data.2)
# Plot results
ggplot(fig2.data, aes(x = reorder(Region,No), y= No, fill = Direction)) +
facet_grid(Academic.Year~.) +
geom_bar(stat="identity") +
coord_flip() +
labs(x = "Region (limited to England)",
y = "Total number of students moving into/out of region",
title = "Total population change due to migration for education \n (region level)") +
scale_y_continuous(label = scales::comma, expand=c(0.05,0)) +
scale_fill_manual(values=c("#E69F00", "#56B4E9")) +
theme(plot.margin = unit(c(1,1,1,1), "cm"),
legend.position="bottom",
plot.title = element_text(hjust = 0.5))
The final element of pre-analysis preparation involved adding county and regional data to the loneliness data. This was complicated by the fact that the loneliness data-set presented the loneliness score per district rather than per county. We used the District/County dataset to map districts to their corresponding counties, and subsequently added regional information as well.
data.lonliness <- read.csv("data/msoa_loneliness.csv")
data.lonliness <- data.lonliness %>%
mutate(District = gsub('[[:digit:]]+', '', msoa11nm),
District = gsub(" $","", District, perl=T))
# Add county Level information
district.county <- read.csv("data/district-county.csv")
colnames(district.county)[1] <- "District"
data.lonliness <- merge(data.lonliness,district.county,by="District",all.x=TRUE)
data.lonliness$County <- as.character(data.lonliness$County)
data.lonliness$County <- ifelse(is.na(data.lonliness$County) == TRUE,"0", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Bournemouth", "Dorset", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Stockton-on-Tees", "County Durham", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Poole", "Dorset", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Christchurch", "Dorset", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "East Dorset", "Dorset", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "West Dorset", "Dorset", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "North Dorset", "Dorset", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Purbeck", "Dorset", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Shepway", "Kent", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "St Edmundsbury", "West Suffolk", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "St. Helens", "Merseyside", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Forest Heath", "West Suffolk", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Suffolk Coastal", "East Suffolk", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Waveney", "East Suffolk", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Taunton Deane", "Somerset West and Taunton", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "West Somerset", "Somerset West and Taunton", data.lonliness$County)
data.lonliness$County <- ifelse(data.lonliness$District == "Weymouth and Portland", "Dorset", data.lonliness$County)
# Add regional information
county.region <- read.csv("data/county.region.csv")
colnames(county.region)[1] <- "County"
data.lonliness <- merge(data.lonliness, county.region, by="County",all.x=TRUE)
data.lonliness$Region <- ifelse(data.lonliness$County == "West Suffolk", "East", as.character(data.lonliness$Region))
data.lonliness$Region <- ifelse(data.lonliness$County == "East Suffolk", "East", as.character(data.lonliness$Region))
data.lonliness$Region <- ifelse(data.lonliness$County == "Somerset West and Taunton", "South West", as.character(data.lonliness$Region))
# Summarise by county
mean.lonliness.county <- data.lonliness %>%
group_by(County) %>%
summarise("2016" = mean(loneills_2016),
"2017" = mean(loneills_2017),
"2018" = mean(loneills_2018))
mean.lonliness.county <- gather(mean.lonliness.county, Academic.Year, mean.score, "2016","2017","2018", factor_key = FALSE)
# Merge with movement data
move.lonliness.county <- merge(mean.lonliness.county, data.outof.domicile)
ggplot(move.lonliness.county, aes(x=sum.no,y=mean.score))+
facet_grid(Academic.Year~.)+
geom_point() +
geom_smooth(method='lm') +
scale_x_continuous(labels = comma) +
labs(x = "Total number of students leaving a county",
y = "Mean loneliness score for that county")
move.lonliness.county.2016 <- move.lonliness.county[which(move.lonliness.county$Academic.Year == 2016),]
move.lonliness.county.2017 <- move.lonliness.county[which(move.lonliness.county$Academic.Year == 2017),]
# Create regression models
out.county.linearMod.2016 <- lm(mean.score ~ sum.no, data=move.lonliness.county.2016)
out.county.linearMod.2017 <- lm(mean.score ~ sum.no, data=move.lonliness.county.2017)
# Extract p-values
out.county.p.2016 <- (summary(out.county.linearMod.2016)$coefficients[,4])[2]
out.county.p.2017 <- (summary(out.county.linearMod.2017)$coefficients[,4])[2]
The correlation coefficients for 2016 and 2017 were -0.1865 and -0.1645 respectively, indicating low correlation between the migration for education at the county level and mean loneliness score for that county in either year.
This was supported by further analyses. Linear regression showed poor evidence for an association between mean loneliness score per county and the number of students who have left that county for education reasons (Hypothesis 1). The regression coefficient for 2016 was -7.247910^{-6} (p=0.2094) and for 2017 was -6.016110^{-6} (p=0.2692).
# Summarise by Region
mean.lonliness.region <- data.lonliness %>%
group_by(Region) %>%
summarise("2016" = mean(loneills_2016),
"2017" = mean(loneills_2017),
"2018" = mean(loneills_2018))
mean.lonliness.region <- gather(mean.lonliness.region, Academic.Year, mean.score, "2016","2017","2018", factor_key = FALSE)
# Merge with movement out of region data
out.move.lonliness.region <- merge(mean.lonliness.region, data.outof.region)
ggplot(out.move.lonliness.region, aes(x=no.out,y=mean.score)) +
facet_grid(Academic.Year~.) +
geom_point(aes(colour = Region), size = 2) +
geom_smooth(method='lm', colour = "#56B4E9") +
scale_x_continuous(labels = comma) +
scale_color_brewer(palette="Paired") +
labs(x = "Total number of students moving leaving a region",
y = "Mean loneliness score for that region")
out.move.lonliness.region.2016 <- out.move.lonliness.region[which(out.move.lonliness.region$Academic.Year == 2016),]
out.move.lonliness.region.2017 <- out.move.lonliness.region[which(out.move.lonliness.region$Academic.Year == 2017),]
# Create regession models
out.linearMod.2016 <- lm(mean.score ~ no.out, data=out.move.lonliness.region.2016)
out.linearMod.2017 <- lm(mean.score ~ no.out, data=out.move.lonliness.region.2017)
# Extract p-values
out.p.2016 <- (summary(out.linearMod.2016)$coefficients[,4])[2]
out.p.2017 <- (summary(out.linearMod.2017)$coefficients[,4])[2]
Similar to the county level, there was little evidence for an association between migration for education at the regional level and mean loneliness score for that region (Hypothesis 2). The correlation coefficients for 2016 and 2017 were -0.2008 and -0.2278 respectively, while the linear regression coefficient for 2016 was -2.574310^{-6} (p=0.6045) and for 2017 was -2.81710^{-6} (p=0.5556).
data.sum.migration <- data.sum.migration[,c(1,2,5)]
total.move.lonliness.region <- merge(data.sum.migration, mean.lonliness.region)
ggplot(total.move.lonliness.region, aes(x=no.total,y=mean.score)) +
facet_grid(Academic.Year~.) +
geom_point(aes(colour = Region), size = 2) +
geom_smooth(method='lm', colour = "#56B4E9") +
scale_x_continuous(labels = comma) +
scale_color_brewer(palette="Paired") +
labs(x = "Total number of students moving to/from region",
y = "Mean loneliness score for that region")
total.move.lonliness.region.2016 <- total.move.lonliness.region[which(total.move.lonliness.region$Academic.Year == 2016),]
total.move.lonliness.region.2017 <- total.move.lonliness.region[which(total.move.lonliness.region$Academic.Year == 2017),]
# Create regression models
total.linearMod.2016 <- lm(mean.score ~ no.total, data=total.move.lonliness.region.2016)
total.linearMod.2017 <- lm(mean.score ~ no.total, data=total.move.lonliness.region.2017)
# Extract p-values
total.p.2016 <- (summary(total.linearMod.2016)$coefficients[,4])[2]
total.p.2017 <- (summary(total.linearMod.2017)$coefficients[,4])[2]
As expected based on the results of the previous two analyses, there was little evidence to support our third and final hypothesis, that there is an association between the total population change in a region due to migration for education and the mean loneliness score for that region. The linear regression coefficient for this analysis for 2016 was -3.25810^{-6} (p=0.4246) and for 2017 was -3.431510^{-6} (p=0.3867).
There was no evidence that outward migration for education at either the county or regional level had any effect on the mean loneliness score in the county or region. Similarly, there was no evidence for a relationship between total population change (sum of students moving into and out of a region) due to migration for education in a region and the mean loneliness score for that region.
Is the loneliness of students linked to the distance between their domicile and place of study?
The Support Vector Algorithm (SVM) is used in an attempt to classify student data into three classes: “lonely”, “neutral” or “not lonely”. The result of the classifier would therefore be interpreted as the input data corresponding to a student who is likely to be lonely/not likely to be lonely knowing the domicile and the distance traveled between the domicile and the place of study.
By mapping the loneliness score of the region in England to the student H.E. provider location, training sets are created. They contain the longitude and latitude of the student, the distance (km) between the domicile and the H.E. provider and a binary representation of the loneliness z-score.
The classes are defined as follow:
- class 0: region loneliness z-score <= 0
- class 1: 0 < region loneliness z-score < 1
- class 2: region loneliness z-score >= 1
The Linear Discriminant Analysis (LDA) technique allows to reduce the 3 dimensions( distance traveled, domicile latitude, domicile longitude) data set in a 2D space which can be visualized.
Testing and visualization of the data reveals that the features used in the input, namely, the distance traveled, the domicile latitude and longitude do not describe how lonely a student may be. However the input data lack of resolution in the sense that it only takes into consideration the loneliness z-score of the region in England which induce very large amount of bias. Indeed the visualization tool developed shows that the z-score of the counties in England is highly variable. More importantly the distance traveled is not accurate as the exact location of study is not known, only the region of study is known in the data provided by HESA (Higher Education Statistics Agency) https://www.HESA.ac.UK/data-and-analysis/students/table-11.
Further work would include a repeat of the experiment with data containing the exact migration of the students. The impact of features such as the level of study and mode of study should be explored too.
The source code for the training set generation and the classifier can be found here.