Data Analysis of Customer Vehicle Insurance
Probability Project
Introduction
Insurance policies involve companies providing compensation for specific losses or damages in exchange for regular premium payments. This applies to vehicle insurance, where customers pay an annual premium to an insurance provider. In the event of an unfortunate accident involving the vehicle, the insurance company will provide compensation, known as the ‘sum assured’, to the customer. This arrangement ensures coverage and protection for the customer in case of accidents or damages to their vehicle.
This project focuses on conducting a probability analysis using a dataset of vehicle insurance. The dataset includes valuable information on demographics, vehicles, and insurance policies. Demographic factors such as gender, age, and region code type, along with vehicle-related data like vehicle age and damage, are analyzed. The insurance policy details, including premium and sourcing channel, are also considered. The objective of this analysis is to examine the probabilistic relationships between these variables and gain insights into the factors that affect vehicle insurance.
Dataset Overview
The dataset used in this project is Kaggle’s “Health Insurance Cross Sell Prediction” dataset. It contains several variables, including:
- id : Unique ID for the customer
- Gender : Gender of the customer
- Age : Age of the customer
- Driving_License
0 : Customer does not have DL,
1 : Customer already has DL
- Region_Code : Unique code for the region of the customer
- Previously_Insured
1 : Customer already has Vehicle Insurance,
0 : Customer doesn't have Vehicle Insurance
- Vehicle_Age : Age of the Vehicle
- Vehicle_Damage
1 : Customer got his/her vehicle damaged in the past.
0 : Customer didn't get his/her vehicle damaged in the past.
- Annual_Premium : The amount customer needs to pay as premium in the year
- Policy_Sales_Channel : Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.
- Vintage : Number of days the customer has been associated with the company
- Response
1 : Customer is interested,
0 : Customer is not interested
Research Questions
The purpose of this analysis is to gain deeper insights and understanding of vehicle insurance data. The various analyses conducted in this project will be explained in the following points.
Before we begin, let’s prepare the dataset that will be used for this analysis. The first step is to import several Python libraries that will be used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv("train.csv")
dataset.info()
The dataset is loaded with the “pd.read_csv” function, and “dataset.info()” then displays an overview of it, including the column names, data types, and the number of non-null values in each column.
Next, we check the dataset for NaN values and duplicate rows using the following approach:
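# Count missing values per column, largest first
nan_col = dataset.isna().sum().sort_values(ascending=False)
nan_col
# Express the missing counts as a percentage of all rows
n_data = len(dataset)
percent_nan_col = (nan_col / n_data) * 100
percent_nan_col
# Show rows that exactly duplicate an earlier row
dataset[dataset.duplicated()]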
After data processing has been performed and the data quality has been ensured for analysis, we can proceed with the data analysis process.
Descriptive Statistical Analysis
Question 1 : What is the average age in this data?
The average age of the customers in the dataset is 39 years.
Question 2 : What is the average premium that customers need to pay in a year?
The average premium that customers need to pay in a year is Rs. 30,564.
Question 3 : What is the total number of male customers and female customers?
Number of male customers : 206,089
Number of female customers : 175,020
Thus, number of male customers > female customers
Question 4 : What is the average age of male and female customers?
Average age of male customers : 41
Average age of female customers : 36
Thus, average age of male customers > female customers
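The code behind Questions 1 to 4 is not shown in this write-up. A minimal sketch of how such figures can be obtained with pandas follows, assuming the column names from the dataset overview. The `toy` frame and its values are made up for illustration and stand in for `dataset`.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are invented for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Age": [40, 30, 50, 20, 60],
    "Annual_Premium": [30000.0, 28000.0, 35000.0, 26000.0, 31000.0],
})

mean_age = toy["Age"].mean()                              # Question 1
mean_premium = toy["Annual_Premium"].mean()               # Question 2
gender_counts = toy["Gender"].value_counts()              # Question 3
mean_age_by_gender = toy.groupby("Gender")["Age"].mean()  # Question 4

print(mean_age, mean_premium)
print(gender_counts)
print(mean_age_by_gender)
```

Replacing `toy` with the loaded `dataset` would apply the same calls to the real data.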
Question 5 : What is the average annual premium paid by customers who have vehicle insurance and those who do not have vehicle insurance?
Average premium of customers who have vehicle insurance : Rs. 30644.29
Average premium of customers who do not have vehicle insurance : Rs. 30496.82
Thus, the average premium of customers who have vehicle insurance > that of customers who do not have vehicle insurance
Question 6 : What is the average annual premium paid by customers who have driving license and those who do not have driving license?
Average premium of customers who have a driving license : Rs. 30554.92
Average premium of customers who do not have a driving license : Rs. 34999.73
Thus, the average premium of customers who have a driving license < that of customers who do not have a driving license
Question 7 : What is the variance of the annual premiums paid by customers who have vehicle insurance and those who do not have vehicle insurance?
Premium variance of customers who have vehicle insurance : 250286700
Premium variance of customers who do not have vehicle insurance : 335192986
Thus, the premium variance of customers who have vehicle insurance < that of customers who do not have vehicle insurance
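Group-wise means and variances such as those in Questions 5 to 7 can be computed in one pass with `groupby`. The sketch below is an illustration on made-up values, not the original notebook code. One assumption matters for the numbers: pandas `var` uses the sample variance (`ddof=1`) by default, and the write-up does not say whether sample or population variance was reported.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are invented for illustration.
toy = pd.DataFrame({
    "Previously_Insured": [1, 1, 1, 0, 0, 0],
    "Annual_Premium": [10.0, 20.0, 30.0, 10.0, 30.0, 50.0],
})

# Mean and sample variance (ddof=1, the pandas default) per group.
premium_stats = (
    toy.groupby("Previously_Insured")["Annual_Premium"].agg(["mean", "var"])
)
print(premium_stats)
```

Grouping by `Driving_License` instead would give the comparison in Question 6.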
Discrete Variable Analysis
Question 1 : What does the probability mass function of customer age look like?
The resulting plot will show the probability distribution of the customer age in the dataset.
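An empirical probability mass function is simply the relative frequency of each distinct value. As a hedged sketch (toy ages, not the real data), it can be built with `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy ages, invented for illustration.
toy = pd.DataFrame({"Age": [25, 25, 30, 40]})

# Relative frequency of each age, ordered by age: the empirical PMF.
pmf = toy["Age"].value_counts(normalize=True).sort_index()
print(pmf)
```

On the real data, `pmf.plot(kind="bar")` would draw the distribution as a bar chart.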
Question 2 : What is the proportion of the number of male vs female customers? Which gender has the higher proportion?
Based on the table and graph above, it is evident that the proportion of male customers is higher than that of female customers, with a proportion of 0.540761.
Question 3 : What is the proportion of the average annual premium of male vs female customers?
Based on the table and graph above, it is evident that the average annual premium for male customers is slightly higher than that for female customers, with a proportion of 0.501038. However, the difference between the two proportions is negligible.
Question 4 : What is the proportion of customers who have vehicle insurance vs customers who do not have vehicle insurance?
Based on the table and graph above, it is evident that the proportion of customers who have vehicle insurance, 0.45821, is smaller than the proportion of customers who do not have vehicle insurance.
Question 5 : What is the proportion of customers in each vehicle age category?
Based on the table and graph above, it is evident that the proportion of customers varies by vehicle age. Vehicles aged 1–2 years have the highest proportion, 0.525613, while vehicles older than 2 years have the smallest proportion, 0.042001.
Question 6 : What is the proportion of each customer response?
Based on the table and graph above, it is evident that the proportion of customers interested in having vehicle insurance is smaller than the proportion of customers who are not interested; the latter is 0.877437.
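The write-up does not show how the "proportion of the average annual premium" was calculated. One plausible reading, used here purely as an assumption, is each gender's mean premium divided by the sum of the two means. The sketch below illustrates that reading on made-up values.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are invented for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Annual_Premium": [30000.0, 32000.0, 29000.0, 29000.0],
})

# Mean premium per gender, then each mean as a share of the sum of means.
# This definition of "proportion" is an assumption, not confirmed by the source.
mean_by_gender = toy.groupby("Gender")["Annual_Premium"].mean()
premium_share = mean_by_gender / mean_by_gender.sum()
print(premium_share)
```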
Question 7 : What is the probability of customers being interested in having insurance when the age of their vehicle is less than 1 year?
The probability of customers being interested in having insurance when the age of their vehicle is less than 1 year : 0.018898.
Question 8 : What is the probability of customers being interested in having insurance when the customer was previously insured?
The probability of customers being interested in having insurance when the customer was previously insured : 0.00041458
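The word "when" in Questions 7 and 8 can be read either as a joint probability, P(interested and condition), or as a conditional probability, P(interested | condition), and the write-up does not say which was computed. The sketch below shows both on made-up data so the two can be compared. The category label `"< 1 Year"` is an assumption about how `Vehicle_Age` is encoded.

```python
import pandas as pd

# Toy stand-in for the real dataset; values and labels are assumptions.
toy = pd.DataFrame({
    "Vehicle_Age": ["< 1 Year"] * 4 + ["1-2 Year"] * 4 + ["> 2 Years"] * 2,
    "Response": [1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
})

is_new = toy["Vehicle_Age"] == "< 1 Year"
interested = toy["Response"] == 1

# Joint: share of ALL customers who are interested and have a new vehicle.
p_joint = (is_new & interested).mean()
# Conditional: share of interested customers AMONG those with a new vehicle.
p_conditional = interested[is_new].mean()

print(p_joint, p_conditional)
```

The two quantities differ by the factor P(condition), so it is worth stating explicitly which one a reported figure represents.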
Continuous Variable Analysis
Question 1 : Which one is more likely to occur
- Customer above the age of 30 pays annual premium above 30500
- Customer below the age of 30 pays annual premium above 30500
Probability that a customer with age > 30 has an annual premium > 30500 : 0.5738
Probability that a customer with age < 30 has an annual premium > 30500 : 0.5029
A customer with age > 30 paying an annual premium > 30500 is the more likely case, because its probability is higher (0.5738).
Question 2 : Which one is more likely to occur
- Customer above the age of 30 who have previous insurance
- Customer below the age of 30 who have previous insurance
Probability that a customer with age above 30 has previous insurance : 0.3211
Probability that a customer with age below 30 has previous insurance : 0.6577
A customer with age < 30 who has previous insurance is the more likely case, because its probability is higher (0.6577).
Question 3 : Which one is more likely to occur
- Customer above the age of 30 who do not have previous insurance and pays annual premium above 30500
- Customer below the age of 30 who have previous insurance and pays annual premium above 30500
Probability that a customer with age above 30 and no previous insurance pays an annual premium above 30500 : 0.5865
Probability that a customer with age below 30 and previous insurance pays an annual premium above 30500 : 0.5066
A customer with age > 30, no previous insurance, and an annual premium > 30500 is the more likely case, because its probability is higher (0.5865).
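The method used for these probabilities is not shown. One simple option, assumed here, is to estimate each as an empirical conditional probability: the share of rows satisfying the event among the rows satisfying the condition. The original analysis may instead have used a fitted distribution. The helper `cond_prob` and the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [25, 28, 22, 27, 35, 45, 50, 60],
    "Previously_Insured": [1, 1, 0, 1, 0, 0, 1, 0],
    "Annual_Premium": [31000, 25000, 40000, 28000, 35000, 29000, 32000, 41000],
})

def cond_prob(event: pd.Series, given: pd.Series) -> float:
    """Empirical P(event | given): share of `given` rows where `event` holds."""
    return float(event[given].mean())

older = toy["Age"] > 30
younger = toy["Age"] < 30
high_premium = toy["Annual_Premium"] > 30500
insured = toy["Previously_Insured"] == 1

p_q1_older = cond_prob(high_premium, older)             # Question 1, first case
p_q1_younger = cond_prob(high_premium, younger)         # Question 1, second case
p_q2_older = cond_prob(insured, older)                  # Question 2, first case
p_q3_older = cond_prob(high_premium, older & ~insured)  # Question 3, first case

print(p_q1_older, p_q1_younger, p_q2_older, p_q3_older)
```

Note that with strict inequalities on both sides, customers aged exactly 30 fall into neither group.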
Variable Correlation Analysis
Question 1 : What is the correlation between the variables Age and Annual Premium?
Based on the plot, the correlation between age and annual premium is positive but weak, with a value of 0.06751. Annual premium tends to rise slightly with age, but the linear relationship is very weak.
Question 2 : What is the correlation between Age and Annual Premium where Previously_Insured = 1?
Based on the plot, the correlation between age and annual premium among previously insured customers is positive but weak, with a value of 0.02687. Within this group, age has almost no linear relationship with annual premium.
Question 3 : What is the correlation between Age and Annual Premium where Response = 1?
Based on the plot, the correlation between age and annual premium among customers interested in insurance is positive but weak, with a value of 0.12396. This is the strongest of the correlations examined here, yet it still indicates only a weak linear relationship.
Question 4 : What is the correlation between Age and Annual Premium where Vehicle Age is below 1 year?
Based on the plot, the correlation between age and annual premium among customers whose vehicle is less than 1 year old is negative but weak, with a value of -0.08218. Within this group, annual premium tends to fall slightly with age, but the linear relationship is very weak.
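The correlation coefficient used is not named in the write-up. The sketch below assumes Pearson correlation, which is the default of pandas `Series.corr`, and shows the pattern of computing it on the full data and on a filtered subset. The toy values are constructed so that the subset correlations are exactly +1 and -1 while the overall correlation is 0, which illustrates why subsetting can change the picture.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 20, 30, 40, 50],
    "Annual_Premium": [1000, 2000, 3000, 4000, 4000, 3000, 2000, 1000],
    "Response": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Pearson correlation on all rows (Series.corr defaults to method="pearson").
r_all = toy["Age"].corr(toy["Annual_Premium"])

# Same correlation restricted to one subgroup, e.g. Response == 1.
interested = toy[toy["Response"] == 1]
r_interested = interested["Age"].corr(interested["Annual_Premium"])

print(r_all, r_interested)
```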
Hypothesis Testing
Question 1 : The annual premium for previously insured customers is less than or equal to the annual premium for customers not previously insured
- H0 : Annual premium of previously insured customers <= annual premium of customers not previously insured
- H1 : Annual premium of previously insured customers > annual premium of customers not previously insured
Stat value = 2.66743187388792
P-Value = 0.00382183073651225
Since the p-value 0.00382183073651225 is less than 0.05, H0 is rejected in favor of H1
From the above result, the p-value is 0.00382, so H0 is rejected. At the 0.05 significance level, there is enough evidence to conclude that the annual premium for previously insured customers is greater than the annual premium for customers who were not previously insured.
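The test used is not named in the write-up. A common choice for comparing two group means with a one-sided alternative is a two-sample t-test; the sketch below assumes Welch's version (`equal_var=False`) from SciPy and runs it on randomly generated toy samples, not on the real data. The `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Toy samples drawn with a fixed seed; they only illustrate the mechanics.
rng = np.random.default_rng(0)
insured = rng.normal(loc=31000, scale=1000, size=200)
not_insured = rng.normal(loc=30000, scale=1000, size=200)

# H1: mean(insured) > mean(not_insured)
greater = stats.ttest_ind(insured, not_insured, equal_var=False,
                          alternative="greater")
# Opposite one-sided alternative, H1: mean(insured) < mean(not_insured)
less = stats.ttest_ind(insured, not_insured, equal_var=False,
                       alternative="less")

alpha = 0.05
reject_h0 = greater.pvalue < alpha
print(greater.statistic, greater.pvalue, reject_h0)
print(less.pvalue)
```

The two one-sided p-values sum to 1, which explains why a test whose statistic points away from H1 reports a p-value close to 1.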
Question 2 : The annual premium for customers with a driving license is less than or equal to the annual premium for customers without a driving license
- H0 : Annual premium of customers with a driving license <= annual premium of customers without a driving license
- H1 : Annual premium of customers with a driving license > annual premium of customers without a driving license
Stat value = -6.836091756273567
P-Value = 0.9999999999920179
Since the p-value 0.9999999999920179 is greater than 0.05, H0 fails to be rejected
From the above result, the p-value is 0.9999999999920179, so we fail to reject H0. There is not enough evidence to conclude that the annual premium for customers with a driving license is greater than the annual premium for customers without one. Failing to reject H0 does not prove it; it only means the data are consistent with the premium of license holders being less than or equal to that of non-holders.
Question 3 : The proportion of driving license holders with response 1 is greater than or equal to the proportion of non-holders with response 1
- H0 : Proportion of driving license holders with response 1 >= proportion of non-holders with response 1
- H1 : Proportion of driving license holders with response 1 < proportion of non-holders with response 1
Stat value = 6.269198152076968
P-Value = 0.9999999998185439
Since the p-value 0.9999999998185439 is greater than 0.05, H0 fails to be rejected
From the above result, the p-value is 0.9999999998185439, so we fail to reject H0. There is not enough evidence to conclude that the proportion of driving license holders interested in having insurance is smaller than the proportion of non-holders interested in having insurance. Failing to reject H0 does not prove it; it only means the data are consistent with H0.
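A one-sided comparison of two proportions is commonly done with a pooled two-proportion z-test. The sketch below is a hand-rolled version using only the standard library, written under the assumption that this is the kind of test used; the function name `two_proportion_ztest` and the toy counts are hypothetical. It returns the lower-tail p-value, which matches an alternative of the form p1 < p2.

```python
import math

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled z-test for H1: p1 < p2. Returns (z, lower-tail p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Standard normal CDF at z, written with the complementary error function.
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_lower

# Toy counts: 60 of 100 interested in group 1, 40 of 100 in group 2.
z_stat, p_value = two_proportion_ztest(60, 100, 40, 100)
print(z_stat, p_value)
```

Here group 1 has the larger observed proportion, so the statistic is positive and the lower-tail p-value is close to 1, the same pattern as in Question 3. The statsmodels function `proportions_ztest` with `alternative="smaller"` is a library alternative.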
Question 4 : The proportion of previous insurance owners with response 1 is greater than or equal to the proportion of insurance owners with response 0
- H0 = Proportion of previous insurance owners with response 1 >= proportion of insurance owners with response 0
- H1 = Proportion of previous insurance owners with response 1 < proportion of insurance owners with response 0
Stat value = -210.61826302899098
P-Value = 0.0
Since the p-value 0.0 is less than 0.05, H0 is rejected in favor of H1
From the above result, the p-value is reported as 0.0, meaning it is too small to be represented at the printed precision. Therefore, we reject H0 in favor of the alternative hypothesis H1: the proportion of previous insurance owners with response 1 is less than the proportion of insurance owners with response 0.
Question 5 : The variance of annual premium for males and females is the same
- H0 : Variance of annual premium for males = variance of annual premium for females
- H1 : Variance of annual premium for males ≠ variance of annual premium for females
Stat value = 290.05457638998547
P-Value = 5.114293745910101e-65
Since the p-value 5.114293745910101e-65 is less than 0.05, H0 is rejected in favor of H1
From the above result, the p-value is 5.114293745910101e-65, so the result is statistically significant. Such an extremely small p-value indicates that the observed data would be highly unlikely if the two groups had equal variances. Therefore, we reject H0 in favor of the alternative hypothesis H1: the variance of annual premium for males differs significantly from the variance of annual premium for females.
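The write-up does not name the test used to compare the two variances. Levene's test is one common choice because it is fairly robust to non-normal data; the sketch below assumes it and applies SciPy's `levene` to randomly generated toy samples with deliberately different spreads.

```python
import numpy as np
from scipy import stats

# Toy samples with a fixed seed: same mean, different standard deviations.
rng = np.random.default_rng(1)
male_premium = rng.normal(loc=30000, scale=1000, size=300)
female_premium = rng.normal(loc=30000, scale=3000, size=300)

# Levene's test, H0: both groups have equal variance.
# center="median" selects the Brown-Forsythe variant (the SciPy default).
levene_stat, levene_p = stats.levene(male_premium, female_premium,
                                     center="median")

alpha = 0.05
reject_equal_variance = levene_p < alpha
print(levene_stat, levene_p, reject_equal_variance)
```

With very large samples, as in this dataset, even small differences in variance produce tiny p-values, so the ratio of the two variances is worth reporting alongside the test result.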
Summary and Conclusion
- In this project, through descriptive statistical analysis, we obtained valuable information about the dataset, including the average age of customers, average annual premiums, gender proportions, and the proportion of customers with vehicle insurance.
- We also performed discrete and continuous variable analyses to explore probabilistic relationships between variables. This included examining the probability mass function distribution of customer age, gender proportions, average annual premiums by gender, proportions of customers with vehicle insurance, proportions based on vehicle age, and proportions of customer interest in insurance.
- Additionally, we analyzed variable correlations to understand the relationships between age, annual premiums, and other variables. Hypothesis testing was conducted to determine the significance of certain relationships and differences between groups.
In conclusion, this analysis provided valuable insights into the vehicle insurance dataset, offering understanding of customer characteristics, preferences, and relationships between variables. The findings can support informed decision-making, optimization of communication strategies, and improvement of the overall business model and revenue.
For the Jupyter Notebook of this project, you can view it at the following location:
If there are any shortcomings in the analysis and report, please feel free to provide feedback in order to improve the outcome of this project.
Thank you 🚀