Student Performance: Regression Modeling

Sep 10, 2023

Exploring Factors Affecting Student Performance

Introduction

The use of student performance data is an important tool for improving student learning. By carefully analyzing this data, educators can identify areas where students need additional support and make changes to their teaching practices to help all students succeed.

This project aims to analyze and improve student performance through data-driven insights and strategies. It uses the Student Performance Dataset, which consists of 10,000 student records. Each record contains several predictors as well as a performance index. By analyzing the data, the project identifies the factors that influence student performance and suggests strategies to improve it.

Dataset Overview

The dataset used in this project is obtained from Kaggle’s “Student Performance (Multiple Linear Regression)” dataset. It contains several variables, including:


- Hours Studied: The total number of hours spent studying by each student.
- Previous Scores: The scores obtained by students in previous tests.
- Extracurricular Activities: Whether the student participates in extracurricular activities (Yes or No).
- Sleep Hours: The average number of hours of sleep the student had per day.
- Sample Question Papers Practiced: The number of sample question papers the student practiced.
- Performance Index: A measure of the overall performance of each student. 
  The performance index represents the student's academic performance and has been rounded to the nearest integer. 
  The index ranges from 10 to 100, with higher values indicating better performance.

Exploratory Data Analysis

The dataset aims to provide insights into the relationship between the predictor variables and the performance index. The predictor variables are studying hours, previous scores, extracurricular activities, and sleep hours. The performance index is a measure of student performance. Researchers and data analysts can use this dataset to explore the impact of each predictor variable on student performance.

Before we begin, let’s prepare the dataset that will be used for this analysis. The first step is to import several Python libraries that will be used.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.gridspec import GridSpec

import statsmodels.formula.api as smf
import statsmodels.api as sm

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import r2_score

from sklearn.model_selection import cross_val_score, cross_validate, KFold

The next step is to load the data from a CSV file with “pd.read_csv” and then display basic information about the dataset.
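The loading step is described but not shown, so here is a minimal sketch of what it might look like. The helper name `load_dataset` and the file name `Student_Performance.csv` are assumptions for illustration; the article does not state the actual path.

```python
import pandas as pd


def load_dataset(path):
    """Read the CSV at `path` and print a quick structural summary."""
    dataset = pd.read_csv(path)
    dataset.info()          # column names, dtypes, and non-null counts
    print(dataset.head())   # first five rows
    return dataset


# Assumed file name; replace it with the actual path to the Kaggle download.
# dataset = load_dataset("Student_Performance.csv")
```

Wrapping the call in a function keeps the summary output next to the load, which makes it easy to rerun after changing the source file.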

Next, we check for NaN values and for duplicate rows in the dataset using the following approach:

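# Count missing values per column, largest first
nan_col = dataset.isna().sum().sort_values(ascending=False)
nan_col

# Express the missing counts as a percentage of all rows
n_data = len(dataset)
percent_nan_col = (nan_col / n_data) * 100
percent_nan_col

# Show every row that has at least one exact duplicate
dataset[dataset.duplicated(keep=False)]

# Keep the first occurrence of each duplicated row, then confirm none remain
dataset = dataset.drop_duplicates(keep="first")
dataset.duplicated().sum()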

After data processing has been performed and the data quality has been ensured for analysis, we can proceed with the data analysis process.

Correlation in Variables

Using the data, we aim to identify the variable that has the strongest relationship with the student Performance Index.

The correlation table indicates a positive linear correlation between the Performance Index and each of Hours Studied, Previous Scores, Extracurricular Activities, Sleep Hours, and Sample Question Papers Practiced. Notably, Previous Scores shows the strongest linear correlation with the Performance Index, underscoring its importance for student performance.
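A correlation table like the one discussed above could be produced roughly as follows. This is a sketch under assumptions: the helper name `correlation_with_target` is hypothetical, the column names follow the dataset overview, and mapping Yes/No to 1/0 is one possible encoding that the article does not confirm.

```python
import pandas as pd


def correlation_with_target(dataset, target="Performance Index"):
    """Return Pearson correlations of each numeric column with `target`."""
    encoded = dataset.copy()
    # Assumption: encode Yes/No as 1/0 so the column can enter the correlation.
    if "Extracurricular Activities" in encoded.columns:
        encoded["Extracurricular Activities"] = encoded[
            "Extracurricular Activities"
        ].map({"Yes": 1, "No": 0})
    corr = encoded.select_dtypes("number").corr()[target].drop(target)
    # Strongest positive correlation first
    return corr.sort_values(ascending=False)
```

Sorting the result puts the most strongly correlated predictor at the top, which is the comparison the paragraph above relies on.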

Averages by Categorical Variable

From the summaries, the group averages for the one categorical variable analyzed, Extracurricular Activities, do not differ greatly between students who participate and students who do not.
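The group averages referred to above can be computed with a single group-by. The sketch below assumes the column is named "Extracurricular Activities" as in the dataset overview; the helper name `mean_by_category` is hypothetical.

```python
import pandas as pd


def mean_by_category(dataset, category="Extracurricular Activities"):
    """Average every numeric column within each level of `category`."""
    return dataset.groupby(category).mean(numeric_only=True)
```

Each row of the result corresponds to one level (Yes or No), so the two rows can be compared column by column.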

Numerical Variables with Categorical Variable

Numerical Variables with Extracurricular Activities

The Extracurricular Activities variable does not exhibit any distinct pattern across the numerical variables related to the Performance Index; they show nearly the same Performance Index values for each numerical variable.

Based on the plot, we test the following hypotheses:

H0 : The average Performance Index of students who participate in Extracurricular Activities is equal to that of students who do not participate.

H1 : The average Performance Index of students who participate in Extracurricular Activities is greater than that of students who do not participate.

The significance level (alpha) is 0.05.

np.var(data_extracurricular), np.var(data_not_extracurricular)
(370.874513922569, 366.5309625072948)

p-value = 0.004787545663371792

If the p-value 0.004787545663371792 is less than 0.05, 
then reject the null hypothesis

From the result above, the p-value is 0.0047875, which is less than 0.05, so H0 is rejected. There is enough evidence to conclude that the average Performance Index of students who participate in Extracurricular Activities is greater than that of students who do not participate.
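The article reports the variances and the p-value but not the call that produced them. One plausible way to run this one-sided test is sketched below. It assumes SciPy's `ttest_ind` with a pooled variance (reasonable here because the two reported variances are close) and `alternative="greater"` to match H1; the helper name `one_sided_ttest` is hypothetical, and SciPy is not among the article's imports.

```python
from scipy import stats


def one_sided_ttest(group_a, group_b, alpha=0.05):
    """Test H1: mean(group_a) > mean(group_b) with a pooled-variance t-test."""
    t_stat, p_value = stats.ttest_ind(
        group_a,
        group_b,
        equal_var=True,          # assumption: similar variances in both groups
        alternative="greater",   # one-sided, matching H1
    )
    reject_h0 = bool(p_value < alpha)
    return t_stat, p_value, reject_h0
```

In the article's notation, the call would be `one_sided_ttest(data_extracurricular, data_not_extracurricular)`.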

Modelling

Regression model using Previous Scores and Performance Index variables

r-squared = 0.8374722131868853

Performance_Index = -15.237815 + 1.014593 × Previous_Scores

The variable with the strongest relationship to the Performance Index is Previous Scores. The intercept means that the expected Performance Index of a student whose previous score is 0 is -15.237815. Each 1-unit increase in the previous score increases the expected Performance Index by 1.014593.

The model explains 83.7% of the variance in the Performance Index. In the data, Performance Index values tend to be lower than the corresponding Previous Scores, and we assume this gap is what makes the intercept negative.
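A single-predictor model of this kind can be fitted with the statsmodels formula API that the article imports. The sketch below is illustrative: the helper name `fit_simple_model` is hypothetical, and replacing spaces with underscores in the column names is an assumption based on the names used in the fitted equation.

```python
import pandas as pd
import statsmodels.formula.api as smf


def fit_simple_model(dataset):
    """Regress Performance_Index on Previous_Scores with ordinary least squares."""
    # Assumption: underscores replace spaces so the names are valid in a formula.
    data = dataset.rename(columns=lambda name: name.replace(" ", "_"))
    return smf.ols("Performance_Index ~ Previous_Scores", data=data).fit()
```

On the returned results object, `params["Intercept"]` and `params["Previous_Scores"]` hold the two coefficients, and `rsquared` holds the share of variance explained.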

Regression model using all predictor variables

scores_ols_all_pred["test_rsquared"].mean()
0.9878134870121024

The model, which incorporates all predictors, exhibits a good fit as it can explain 98.8% of the variance in performance index.

Performance_Index (not participating in Extracurricular Activities) = -33.235116 + 2.856112 × Hours_Studied + 1.018599 × Previous_Scores + 0.482003 × Sleep_Hours

Performance_Index (participating in Extracurricular Activities) = -33.235116 + 0.632041 × Extracurricular_Activities + 2.856112 × Hours_Studied + 1.018599 × Previous_Scores + 0.482003 × Sleep_Hours

Based on the model, with the other predictors held constant, the expected Performance Index of students who participate in Extracurricular Activities is 0.632041 higher than that of students who do not participate.
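To make the fitted equation concrete, the sketch below plugs the reported coefficients into a small function and compares two otherwise identical students. The function name `predict_performance_index` and the input values (7 hours studied, a previous score of 80, 8 hours of sleep) are made up for illustration; only the coefficients come from the model above.

```python
# Coefficients as reported for the fitted model.
COEFFICIENTS = {
    "Intercept": -33.235116,
    "Extracurricular_Activities": 0.632041,
    "Hours_Studied": 2.856112,
    "Previous_Scores": 1.018599,
    "Sleep_Hours": 0.482003,
}


def predict_performance_index(hours_studied, previous_scores, sleep_hours, extracurricular):
    """Evaluate the fitted equation; `extracurricular` is 1 (Yes) or 0 (No)."""
    return (
        COEFFICIENTS["Intercept"]
        + COEFFICIENTS["Extracurricular_Activities"] * extracurricular
        + COEFFICIENTS["Hours_Studied"] * hours_studied
        + COEFFICIENTS["Previous_Scores"] * previous_scores
        + COEFFICIENTS["Sleep_Hours"] * sleep_hours
    )


# Illustrative inputs: two students who differ only in participation.
without_activity = predict_performance_index(7, 80, 8, extracurricular=0)
with_activity = predict_performance_index(7, 80, 8, extracurricular=1)
difference = with_activity - without_activity
```

Whatever values are chosen for the other predictors, `difference` equals the Extracurricular_Activities coefficient, which is exactly what "holding the other predictors constant" means.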

As in the single-predictor model, Performance Index values tend to be lower than the corresponding Previous Scores, and we assume this gap is what makes the intercept negative.

r-squared = 0.9878440597259003

The model performs well: it explains 98.78% of the variance in the Performance Index.

From the regression models analyzed above, we conclude that the predictor with the strongest influence on the Performance Index is Previous Scores. Its coefficient has the smallest standard error among the predictors, so its effect is the most precisely estimated.
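The standard errors mentioned above can be read directly from a fitted statsmodels result. The sketch below collects coefficients, standard errors, t-values, and p-values into one table. The helper name `coefficient_table`, the underscore renaming, and the Yes/No to 1/0 encoding are assumptions; the predictor list follows the fitted equation shown earlier.

```python
import pandas as pd
import statsmodels.formula.api as smf


def coefficient_table(dataset):
    """Fit the multi-predictor OLS model and tabulate its coefficient statistics."""
    # Assumption: underscores replace spaces so the names are valid in a formula.
    data = dataset.rename(columns=lambda name: name.replace(" ", "_"))
    # Assumption: Yes/No is encoded as 1/0.
    data["Extracurricular_Activities"] = data["Extracurricular_Activities"].map(
        {"Yes": 1, "No": 0}
    )
    results = smf.ols(
        "Performance_Index ~ Hours_Studied + Previous_Scores"
        " + Sleep_Hours + Extracurricular_Activities",
        data=data,
    ).fit()
    return pd.DataFrame(
        {
            "coef": results.params,
            "std_err": results.bse,
            "t": results.tvalues,
            "p_value": results.pvalues,
        }
    )
```

Viewing the standard error next to the t-value and p-value is useful because a small standard error is most informative when judged relative to the size of its coefficient.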

Conclusion

In summary, this project emphasized the importance of using student performance data to enhance teaching and learning practices. It identified Previous Scores as the most influential predictor of student performance. The analysis revealed valuable insights into the factors affecting student success, providing educators with actionable information to support their students effectively.

The notebook for this project is available on GitHub:

https://github.com/inesiameita/Statistics-for-Business/blob/main/Statistics%20for%20Business.ipynb

If there are any shortcomings in the analysis and report, please feel free to provide feedback in order to improve the outcome of this project.
Thank you 🚀