Student Retention Model Helping At-Risk Students

Let’s talk about identifying and helping at-risk students, and about applied data science and the student retention model.

Modeling is about understanding behavior. It’s mostly applied towards commercial goals, but it will warm your heart and fill it with purpose when you apply it towards helping real people and especially our youth.

Applied data science requires good data and good modeling skills, but also a lot of pre- and post-analysis and workflow planning. This goes a long way toward not only identifying who is at risk but also tailoring the best intervention to help them (in our case, students) get back on track. And that’s the big picture.

This isn’t easy: one size doesn’t fit all, and you won’t be able to scale this without some serious work. For example, if a student lives too far away, can we arrange for him or her to do some work from home? If a student studies hard but still fails, which is easy to measure, could there be a language or computer issue? Each situation will be different. Applied data science can help identify those cases accurately and empower field professionals, in this case teachers, to deploy their resources where they count most.

The Student Performance Data Set

Let’s walk through this student retention model using public data from Paulo Cortez of the University of Minho in Portugal, available on the UCI Machine Learning Repository. The data set includes grades, demographic and social attributes, and school events for two class subjects, math and Portuguese.

This isn’t a full applied data science pipeline: the data has already been collected for us, meaning that the work with the initial stakeholders and educational professionals has already begun. So we’ll take what we have and feed it into the modeling and workflow/intervention portions of the task.

An Actual Production Model

Unlike the stock market, where past performance is no guarantee of future returns, in education, unfortunately, past academic behavior is highly correlated with future behavior. Therefore, having a model that can predict a student’s performance as early as possible is highly desirable. In a production scenario, you would run the model as early as possible, even before the class starts, and then regularly throughout the academic year.

The UCI data set we will use here offers interesting social and demographic features, such as family life, social settings, and alcohol consumption, that can help model and predict a student’s aptitude even before a class begins.

A Model with XGBoost to Predict a Student’s Final Grade

Of the two data sets available, we will work with the Portuguese one as it contains more data: 649 students and 33 features. We’ll use all available features, which requires some feature engineering because several of them are categorical. After binarizing and pivoting all of them (one 0/1 column per category), we end up with 59 features. More details on each feature are in the UCI data set notes.

Here are some of the most predictive features for a student’s final grade according to our model:

Feature Name   Description
G1             First class grade
G2             Second class grade
Absences       Number of school absences
Dalc           Workday alcohol consumption
Age            Student’s age (from 15 to 22)
Freetime       Free time after school
Health         Current health
Famrel         Quality of family relationships
Traveltime     Home to school travel time
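To make the "binarizing" step concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper name `one_hot_encode` and the three toy rows are assumptions made up for illustration, not the original notebook code or real records; the column names simply mimic the `schoolsup_yes` / `reason_home` style seen in the reports later on.

```python
def one_hot_encode(rows, categorical):
    """Expand each categorical column into one 0/1 column per observed level."""
    # Collect the levels seen for each categorical column, in sorted order.
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}
    encoded = []
    for row in rows:
        # Non-categorical columns pass through unchanged.
        out = {k: v for k, v in row.items() if k not in categorical}
        for col in categorical:
            for level in levels[col]:
                out[f"{col}_{level}"] = int(row[col] == level)
        encoded.append(out)
    return encoded


# Toy rows for illustration only; not taken from the real data set.
students = [
    {"age": 15, "schoolsup": "yes", "reason": "home"},
    {"age": 17, "schoolsup": "no", "reason": "reputation"},
    {"age": 16, "schoolsup": "no", "reason": "course"},
]

encoded = one_hot_encode(students, categorical=["schoolsup", "reason"])
print(encoded[0])
```

With a pandas DataFrame, `pandas.get_dummies` does the same job in a single call; the plain-Python version is only here to show what happens to each column.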

The outcome variable is the final grade for the class, which ranges from 0 to 20. XGBoost does a great job of learning the students’ behavior and returns an RMSE (root mean squared error) of 1.16, meaning that predictions are typically off by a little over one point on the 20-point scale. I won’t cover XGBoost much here as we are more interested in what comes after the modeling phase. The variable importance chart (which ranks each feature by its importance to the model) confirms that past grades are the strongest predictor of future performance, which highlights the importance of intervening as early as possible to break a student out of a destructive pattern.
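For readers who want to see what sits behind the RMSE number, here is a small sketch of the calculation: the square root of the average squared gap between actual and predicted grades, expressed in grade points. The `rmse` helper and the four grades are made up for illustration and are not the model’s real output.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical final grades (0-20 scale) and model predictions.
actual_grades = [10, 12, 8, 15]
predicted_grades = [11, 12, 10, 14]

print(round(rmse(actual_grades, predicted_grades), 2))  # 1.22
```

Because the errors are squared before averaging, one large miss (the 8 predicted as 10) weighs more than several small ones.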

[Figure: Feature Importance Chart]

Final Grade Predictions and Designing Interventions

An easy way to discover how a particular ‘at-risk’ student could benefit from extra support is to compare that student with his or her peers. This is trivial to automate. We gather all the predictive features (for brevity, we’ll only use those that showed up in the variable importance chart above) for the at-risk students and compare them against the 25th and 75th percentiles of the class.

Let’s see how that works in Python (I am using arbitrary splits here; tweak them depending on the data and the questions asked). Let’s take all the students with a predicted final score of less than 10 out of 20.
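As a rough sketch of that selection step, assuming the model’s predictions are held in a dictionary keyed by student ID (the IDs echo the ones discussed in this post, but the scores are invented for illustration):

```python
# Arbitrary cut-off on the 0-20 scale; tune it to the data and the question.
AT_RISK_THRESHOLD = 10

# Hypothetical predicted final grades, keyed by student ID.
predicted_scores = {163: 8.7, 131: 9.1, 81: 9.4, 42: 13.2, 7: 16.0}

# Keep only the students predicted to finish below the threshold.
at_risk_ids = sorted(
    student_id
    for student_id, score in predicted_scores.items()
    if score < AT_RISK_THRESHOLD
)
print(at_risk_ids)  # [81, 131, 163]
```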

[Figure: Student Predictive Score]

We then loop through each ‘at-risk’ student, compare every feature against the rest of the class, and add it to our report whenever the value is below the 25th percentile or above the 75th percentile of the class, alongside the class average for reference. It is that simple. It will even print out a mini-report for us. Let’s analyze a few of these cases.
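Here is one way that loop could look, as a sketch rather than the original code. The `peer_report` helper, the five-student toy class, and the choice of the standard library’s "inclusive" quartile method are all assumptions for illustration; the output format mirrors the mini-reports in this post.

```python
from statistics import mean, quantiles


def peer_report(student, classmates, features):
    """Flag features outside the class's 25th-75th percentile band."""
    lines = []
    for feature in features:
        values = [c[feature] for c in classmates]
        # Quartile cut points of the whole class for this feature.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        value = student[feature]
        if value < q1:
            label = "Below"
        elif value > q3:
            label = "Above"
        else:
            continue  # within the middle half of the class: not reported
        lines.append(f"{label}: {feature} {value} Class: {round(mean(values), 2)}")
    return lines


# Toy class for illustration only; not real student records.
classroom = [
    {"id": 1, "predicted": 8.9, "G2": 9, "absences": 10, "goout": 5},
    {"id": 2, "predicted": 11.2, "G2": 11, "absences": 2, "goout": 3},
    {"id": 3, "predicted": 12.1, "G2": 12, "absences": 4, "goout": 3},
    {"id": 4, "predicted": 13.4, "G2": 13, "absences": 0, "goout": 2},
    {"id": 5, "predicted": 14.0, "G2": 14, "absences": 3, "goout": 4},
]
features = ["G2", "absences", "goout"]

# Report on every student predicted to finish below 10 out of 20.
for student in (s for s in classroom if s["predicted"] < 10):
    print(f"Student ID: {student['id']}")
    for line in peer_report(student, classroom, features):
        print(line)
```

Note that a feature only shows up when it falls strictly outside the middle half of the class, so a missing feature in a report means "not unusual", not "good".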

Post-Prediction Analysis

Student ID: 163

Below: G2 9 Class: 11.57
Below: famrel 2 Class: 3.93
Above: goout 5 Class: 3.18
Below: Medu 1 Class: 2.51
Above: Walc 5 Class: 2.28
Above: failures 2 Class: 0.22

Student 163 struggles with the quality of family life (low ‘famrel’ score), values his or her friends (high ‘goout’, going out with friends), comes from a home with little formal education (low ‘Medu’, mother’s education), and consumes a lot of alcohol on weekends (high ‘Walc’, weekend alcohol consumption). With no unusual drinking during the week (‘Dalc’ is not flagged) but a strong need for a peer support group (‘goout’), this student could benefit from after-school clubs with strong role models emphasizing education, attended by other students his or her age.

Student ID: 131

Below: G2 9 Class: 11.57
Above: absences 10 Class: 3.66
Above: goout 5 Class: 3.18
Above: failures 3 Class: 0.22
Above: reason_reputation 1 Class: 0.22
Above: Mjob_services 1 Class: 0.21

Student 131 shows some aptitude for studies, as the first grade (G1) does not appear in our report and so was not unusually low. The school was chosen for its reputation, so education is important to the family. The excessive absences (10 versus the class mean of 3.66) and the high ‘goout’ score may indicate that this child could benefit from after-school activities with other kids his or her age that encourage academics.

Student ID: 81

Below: G2 9 Class: 11.57
Below: age 15 Class: 16.74
Above: studytime 3 Class: 1.93
Below: schoolsup_no 0 Class: 0.9
Above: nursery_no 1 Class: 0.2
Above: reason_home 1 Class: 0.23
Above: schoolsup_yes 1 Class: 0.1

Student 81 also did reasonably well on G1, as it doesn’t show up in the report. Interestingly, this student is over 1.5 years younger than the class average (15 versus 16.74) and could benefit from being placed in a more age-appropriate grade.

And Now the Hard Part

Do you see the bigger picture? As data scientists, we like to automate as much of this process as possible. Unfortunately, the last part requires thinking and face-to-face interaction. Every student’s story will be different, and this is where interacting with them one-on-one will break the vicious cycle of a bad grade leading to another bad grade.

All this may not be the job of a data scientist, but taking part and seeing how your work feeds into the workflow of other professionals, whether teachers, social workers, health workers, or law enforcement, will change the way you see your work forever.

And this is how you become a better data scientist.