Student Performance: Regression Modeling
Exploring Factors Affecting Student Performance
Introduction
The use of student performance data is an important tool for improving student learning. By carefully analyzing this data, educators can identify areas where students need additional support and make changes to their teaching practices to help all students succeed.
This project aims to analyze and improve student performance through data-driven insights and strategies. It uses the Student Performance Dataset, which consists of 10,000 student records. Each record contains several predictors as well as a performance index. By analyzing the data, the project identifies the factors that influence student performance and suggests where strategies to improve it should focus.
Dataset Overview
The dataset used in this project is obtained from Kaggle’s “Student Performance (Multiple Linear Regression)” dataset. It contains several variables, including:
- Hours Studied: The total number of hours spent studying by each student.
- Previous Scores: The scores obtained by students in previous tests.
- Extracurricular Activities: Whether the student participates in extracurricular activities (Yes or No).
- Sleep Hours: The average number of hours of sleep the student had per day.
- Sample Question Papers Practiced: The number of sample question papers the student practiced.
- Performance Index: A measure of the overall performance of each student.
The performance index represents the student's academic performance and has been rounded to the nearest integer.
The index ranges from 10 to 100, with higher values indicating better performance.
Exploratory Data Analysis
The dataset provides insight into the relationship between the predictor variables and the performance index. The predictor variables are hours studied, previous scores, extracurricular activities, sleep hours, and sample question papers practiced. The performance index is the measure of student performance. Researchers and data analysts can use this dataset to explore the impact of each predictor variable on student performance.
Before we begin, let’s prepare the dataset that will be used for this analysis. The first step is to import several Python libraries that will be used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.gridspec import GridSpec
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, cross_validate, KFold
The next step is to load the CSV data with pd.read_csv and display summary information about the dataset.
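The loading code itself is not shown in this write-up. A minimal sketch of what it might look like follows, assuming the CSV uses the column names listed in the Dataset Overview. The file name Student_Performance.csv and the three inline rows are hypothetical stand-ins so the snippet runs without the Kaggle file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV (made-up rows, for illustration only).
# With the real file this would be something like:
#   dataset = pd.read_csv("Student_Performance.csv")  # assumed file name
sample_csv = io.StringIO(
    "Hours Studied,Previous Scores,Extracurricular Activities,"
    "Sleep Hours,Sample Question Papers Practiced,Performance Index\n"
    "7,99,Yes,9,1,91\n"
    "4,82,No,4,2,65\n"
    "8,51,Yes,7,2,45\n"
)
dataset = pd.read_csv(sample_csv)

# Show column types and non-null counts, then the first rows.
dataset.info()
print(dataset.head())
```

Reading from an in-memory buffer keeps the example self-contained; pd.read_csv accepts a file path in exactly the same position.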
Next, we check the dataset for NaN values and duplicate rows using the following approach:
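# Count missing values per column, largest first
nan_col = dataset.isna().sum().sort_values(ascending=False)
nan_col
# Express the missing counts as a percentage of all rows
n_data = len(dataset)
percent_nan_col = (nan_col / n_data) * 100
percent_nan_col
# Show every row that has an exact duplicate elsewhere in the dataset
dataset[dataset.duplicated(keep=False)]
# Keep the first occurrence of each duplicated row, then confirm none remain
dataset = dataset.drop_duplicates(keep="first")
dataset.duplicated().sum()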
With missing values checked and duplicates removed, the data is ready for analysis.
Correlation in Variables
Using the data, we aim to identify the variable that has the strongest correlation with the Performance Index.
The correlation table indicates a positive linear correlation between the Performance Index and each of Hours Studied, Previous Scores, Extracurricular Activities, Sleep Hours, and Sample Question Papers Practiced. Previous Scores has the strongest linear correlation with the Performance Index, which underscores its importance for student performance.
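The correlation table can be produced along the following lines. This is a sketch under stated assumptions: the data below is synthetic (it is not the Kaggle dataset and its coefficients are made up), and the Yes/No column is assumed to be encoded as 1/0 before correlating.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Kaggle dataset), generated only so the sketch runs.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Hours Studied": rng.integers(1, 10, n),
    "Previous Scores": rng.integers(40, 100, n),
    "Sleep Hours": rng.integers(4, 10, n),
    "Extracurricular Activities": rng.choice(["Yes", "No"], n),
})
demo["Performance Index"] = (
    2.9 * demo["Hours Studied"] + demo["Previous Scores"] - 30 + rng.normal(0, 2, n)
)

# Assumed encoding: map the Yes/No column to 1/0 so it can enter the correlation.
demo["Extracurricular Activities"] = demo["Extracurricular Activities"].map(
    {"Yes": 1, "No": 0}
)

# Pearson correlation of each predictor with the target, strongest first.
corr_with_target = (
    demo.corr()["Performance Index"]
    .drop("Performance Index")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Sorting the target column of the correlation matrix makes the strongest predictor appear first, which is the comparison the paragraph above describes.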
Average for each categorical variable
From the summaries, the averages do not differ greatly between students who participate in Extracurricular Activities and those who do not.
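A per-category average like this is typically computed with a groupby. The sketch below assumes the summary is the mean Performance Index per category; the six rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only (NOT taken from the Kaggle dataset).
demo = pd.DataFrame({
    "Extracurricular Activities": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Performance Index": [56, 54, 61, 60, 48, 49],
})

# Mean Performance Index within each category of the categorical variable.
group_means = demo.groupby("Extracurricular Activities")["Performance Index"].mean()
print(group_means)
```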
Numerical Variables with Categorical Variable
- Numerical Variables with Extracurricular Activities
The Extracurricular Activities variable shows no distinct pattern across the numerical variables: participants and non-participants have nearly the same Performance Index values for each numerical variable.
Based on the plot, we test the following hypotheses:
H0 : The average Performance Index of students who participate in Extracurricular Activities is equal to that of students who do not participate.
H1 : The average Performance Index of students who participate in Extracurricular Activities is greater than that of students who do not participate.
The significance level (alpha) is 0.05.
np.var(data_extracurricular), np.var(data_not_extracurricular)
(370.874513922569, 366.5309625072948)
p-value = 0.004787545663371792
If the p-value 0.004787545663371792 is less than 0.05,
then reject the null hypothesis
The p-value (0.0048) is below the 0.05 significance level, so H0 is rejected. There is sufficient evidence to conclude that the average Performance Index of students who participate in Extracurricular Activities is greater than that of students who do not participate.
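The write-up does not show which test produced this p-value. Since the two sample variances are compared first and H1 is one-sided, one plausible reading is a pooled-variance two-sample t-test with a "greater" alternative. The sketch below makes that assumption and uses synthetic samples (not the real data), so its p-value will not match the one reported above.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in samples (NOT the Kaggle data), generated only so the sketch runs.
rng = np.random.default_rng(42)
data_extracurricular = rng.normal(loc=57.0, scale=19.0, size=5000)
data_not_extracurricular = rng.normal(loc=53.0, scale=19.0, size=5000)

alpha = 0.05

# One-sided two-sample t-test: H1 says the first group's mean is greater.
# equal_var=True pools the variances, which is reasonable when the two
# sample variances are close to each other.
t_stat, p_value = stats.ttest_ind(
    data_extracurricular,
    data_not_extracurricular,
    equal_var=True,
    alternative="greater",
)

print(f"t = {t_stat:.3f}, p-value = {p_value:.6f}")
if p_value < alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")
```

If the variances differed noticeably, setting equal_var=False (Welch's t-test) would be the more cautious choice.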
Modelling
- Regression model using Previous Scores and Performance Index variables
r-squared = 0.8374722131868853
Performance_Index = -15.237815 + 1.014593 × Previous_Scores
Previous Scores is the variable most strongly related to the Performance Index. The intercept means that the expected Performance Index of a student with a previous score of 0 is -15.237815, and each 1-unit increase in the previous score raises the expected Performance Index by 1.014593.
The model explains 83.7% of the variance in the Performance Index. In the data, students' Performance Index values tend to be lower than their previous scores, and we assume this drop is what makes the intercept negative.
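A single-predictor model of this form can be fitted with the statsmodels formula API imported earlier. The following is a minimal sketch on synthetic data (not the Kaggle dataset), assuming the columns have been renamed with underscores as in the equation above; the numbers it prints will differ from the reported ones.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Kaggle dataset), generated only so the sketch runs.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({"Previous_Scores": rng.integers(40, 100, n)})
demo["Performance_Index"] = (
    -15 + 1.0 * demo["Previous_Scores"] + rng.normal(0, 8, n)
)

# Fit Performance_Index = b0 + b1 * Previous_Scores by ordinary least squares.
model = smf.ols("Performance_Index ~ Previous_Scores", data=demo).fit()

print(model.params)    # intercept (b0) and slope (b1)
print(model.rsquared)  # share of variance explained by the model
```

model.summary() would additionally report the standard errors and confidence intervals of both coefficients.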
- Regression model using all predictor variables
scores_ols_all_pred["test_rsquared"].mean()
0.9878134870121024
The model, which incorporates all predictors, exhibits a good fit as it can explain 98.8% of the variance in performance index.
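The code behind scores_ols_all_pred is not shown. Given the scikit-learn imports at the top, one way to obtain a "test_rsquared" entry is to wrap a statsmodels formula in a small scikit-learn estimator and pass it to cross_validate. The sketch below is an assumption about that setup, not the author's code: the FormulaOLS class, the 5-fold split, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import KFold, cross_validate


class FormulaOLS(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper so scikit-learn can cross-validate a statsmodels formula."""

    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y):
        # statsmodels formulas need predictors and response in one DataFrame.
        data = pd.DataFrame(X).copy()
        response = self.formula.split("~")[0].strip()
        data[response] = np.asarray(y)
        self.results_ = smf.ols(self.formula, data=data).fit()
        return self

    def predict(self, X):
        return np.asarray(self.results_.predict(X))


# Synthetic stand-in data (NOT the Kaggle dataset), generated only so the sketch runs.
rng = np.random.default_rng(7)
n = 500
X = pd.DataFrame({
    "Hours_Studied": rng.integers(1, 10, n),
    "Previous_Scores": rng.integers(40, 100, n),
    "Sleep_Hours": rng.integers(4, 10, n),
    "Extracurricular_Activities": rng.choice(["Yes", "No"], n),
})
y = pd.Series(
    -33
    + 2.9 * X["Hours_Studied"]
    + 1.0 * X["Previous_Scores"]
    + 0.5 * X["Sleep_Hours"]
    + 0.6 * (X["Extracurricular_Activities"] == "Yes")
    + rng.normal(0, 2, n),
    name="Performance_Index",
)

formula = (
    "Performance_Index ~ Hours_Studied + Previous_Scores"
    " + Sleep_Hours + Extracurricular_Activities"
)

# The scoring key "rsquared" becomes "test_rsquared" in the returned dict.
scores_ols_all_pred = cross_validate(
    FormulaOLS(formula),
    X,
    y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring={"rsquared": "r2"},
)
print(scores_ols_all_pred["test_rsquared"].mean())
```

Averaging R-squared over held-out folds estimates how well the model generalizes, which is a stricter check than the in-sample R-squared of a single fit.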
Performance_Index for students who do not participate in Extracurricular Activities = -33.235116 + 2.856112 × Hours_Studied + 1.018599 × Previous_Scores + 0.482003 × Sleep_Hours
Performance_Index for students who participate in Extracurricular Activities (Extracurricular_Activities = 1) = -33.235116 + 0.632041 × Extracurricular_Activities + 2.856112 × Hours_Studied + 1.018599 × Previous_Scores + 0.482003 × Sleep_Hours
According to the model, with the other predictors held constant, the expected Performance Index of students who participate in Extracurricular Activities is 0.632041 higher than that of students who do not participate.
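To make the two equations concrete, the sketch below plugs one student into them. The coefficients are copied from the fitted equation in the text; the student's values (5 hours studied, previous score 70, 7 hours of sleep) are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the fitted equation in the text.
INTERCEPT = -33.235116
COEF = {
    "Extracurricular_Activities": 0.632041,  # 1 = participates, 0 = does not
    "Hours_Studied": 2.856112,
    "Previous_Scores": 1.018599,
    "Sleep_Hours": 0.482003,
}


def predict_performance_index(student):
    """Evaluate the fitted linear equation for one student (dict of predictor values)."""
    return INTERCEPT + sum(COEF[name] * student[name] for name in COEF)


# Hypothetical student used only for illustration.
student = {
    "Hours_Studied": 5,
    "Previous_Scores": 70,
    "Sleep_Hours": 7,
    "Extracurricular_Activities": 1,
}
with_activities = predict_performance_index(student)
without_activities = predict_performance_index(
    {**student, "Extracurricular_Activities": 0}
)

print(round(with_activities, 6), round(without_activities, 6))
# The gap between the two predictions equals the Extracurricular_Activities coefficient.
print(round(with_activities - without_activities, 6))
```

Whatever values are chosen for the other predictors, the difference between the two predictions stays at 0.632041, which is what "holding the other predictors constant" means here.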
As with the single-predictor model, we assume the negative intercept reflects the drop in Performance Index relative to the previous score.
r-squared = 0.9878440597259003
The model performs well: it explains 98.78% of the variance in the Performance Index.
From the regression models analyzed above, we conclude that the predictor with the greatest influence on the Performance Index is Previous Scores. Its coefficient has the smallest standard error among the predictors, so it is the most precisely estimated, and on its own it already explains 83.7% of the variance in the Performance Index.
Conclusion
In summary, this project emphasized the importance of using student performance data to enhance teaching and learning practices. It identified Previous Scores as the most influential predictor of student performance. The analysis revealed valuable insights into the factors affecting student success, providing educators with actionable information to support their students effectively.
The notebook for this project is available on GitHub:
https://github.com/inesiameita/Statistics-for-Business/blob/main/Statistics%20for%20Business.ipynb
If there are any shortcomings in the analysis and report, please feel free to provide feedback in order to improve the outcome of this project.
Thank you 🚀