Data Wrangling and SQL Project Olist: Brazilian E-Commerce Database

Sales Data Analysis

inesiameita
10 min readJul 8, 2023

--

Introduction and Objectivity

Olist is the largest department store in the Brazilian marketplace, providing a platform for small businesses across Brazil to connect with various channels through a single contract. By leveraging their services, merchants can sell their products on the Olist Store and benefit from streamlined logistics solutions provided by Olist’s logistics partners.

In this project, we will analyze the Olist Store database, the database contains information about thousands of orders placed between 2016 and 2018 in various markets across Brazil.

We will analyze several aspects, including product analysis and sales analysis, to assess their objectivity. The objectives of the analysis are as follows:

  1. Which products are the most popular or sold the most?
  2. Which products have the highest total sales?
  3. What types of payment are most frequently used by customers?
  4. How is the trend of orders placed each month of the year?
  5. How much is the total revenue of Olist during that period?
  6. How is the trend of revenue for each quarter and each month of the year?

Data Overview

The Olist database that we are using for this project is stored in the following relational database.

Exploration and Data Processing

Let’s start the explanation of this project by importing several Python libraries such as pandas, matplotlib, seaborn, and sqlite3 on this project and reading the data from a .db (database) file to create a DataFrame. Here’s :

Then, we execute the SQL query, this code will print the names of all the tables present in the “olist.db” database.

Next step is reading data from the olist dataset table in the SQLite database using the pd.read_sql() function from the Pandas library.

Exploring and processing the data involves various tasks to gain insights and prepare the data for analysis. Here are some common steps in data exploration and processing:
To start processing the data, We merge multiple datasets into a single table. The merged table will result in the following table with 119,143 rows and 48 columns.

And then we drop unused columns in dataset.

Next, we identify NaN values in the data using the following approach:

This code calculates the number of NaN values in each column of the filtered_data and sorts them in descending order. The resulting nan_col variable will contain the column names as the index and the corresponding count of NaN values in each column. After successfully identifying the NaN values, drop or remove rows with missing values from dataset.

Thus, after obtaining the NaN value identification, resulting in clean data with a size of 113,194 rows and 24 columns, we will identify if there are any duplicate data in the dataset we are using. The steps to perform this identification are as follows:

To convert the order_approved_at column in a dataset to datetime data, we provide these steps. This code will convert the values in that column to datetime format.

Identifying inconsistent formats in data is important because it can lead to errors in data analysis and processing. Inconsistent formats refer to different formats for data that should have the same format. It is crucial to ensure consistency and accuracy of the data used in analysis, visualization, and modeling. Therefore, we perform the identification of inconsistent formats in the dataset by:

After data processing has been performed and the data quality has been ensured for analysis, we can proceed with the data analysis process.

Data Analysis

Question 1.
Which products are the most popular or sold the most?

To answer the question, we can perform a groupby operation on the product_category_name_english column and calculate the total number of products sold in each category. We can then sort the results based on the highest number of products sold. Here are the obtained results:

Based on the chart above, it can be observed that the most popular product or the best-selling item is bed bath table, with total sales quantity is 11,684 products. The product category that is most popular and has the highest number of sales because this category sells bedding, home apparel, bathroom, and outdoor lifestyle products that are in high demand by customers for their daily needs.

Question 2.
Which products have the highest total sales?

To answer the question, similar to the previous approach, we can perform a groupby operation on the product_category_name_english column and calculate the total number of sales for each category. Additionally, we can display the total number of products sold and the average price for each category. Finally, we can sort the results based on the highest total number of sales in R$ Million. Here are the obtained results:

Based on the chart above, it can be determined that the product with the highest total sales is health beauty, with a total sales is 1.267 in R$ million. The high number of sales and the relatively high average price for the health beauty product category contribute to its position as the highest-selling category among the other groups. The combination of a large customer demand and higher-priced products results in a higher total sales figure for this category compared to others.

Question 3.
What types of payment are most frequently used by customers?

Based on the chart above, it can be determined that the frequently used payment type is credit card, with number of transactions is 83,528. That is a possible reason, customers may prefer using credit cards because of the convenience they offer. To increase payment usage in other categories, it is indeed possible to implement promotional programs and collaborate with other payment service providers to enhance customer interest. By offering attractive promotions, discounts, or rewards specific to those categories, customers may be incentivized to choose alternative payment methods.

Question 4.
How is the trend of orders placed each month of the year?

To answer this question, we can calculate the total number of orders for each month and group them based on the transaction month, here are the results obtained:

Based on the trend chart, it can be observed that the orders for each month shows an increasing trend, reaching its peak in November of 2017 with 8692 product sales. However, in December of 2017, the total orders experiences a significant decline with 6261 product sales, and increase in January of 2018. It can be observed that towards the end of the year, the total number of orders tends to decrease, while it increases again at the beginning of the year. This pattern may occur due to increased customer shopping interest and needs at the start of the year. These observations suggest that there may be seasonal variations in customer behavior and demand, with a higher inclination towards shopping and increased order numbers at the beginning of the year.

Question 5.
How much is the total revenue of Olist during that period?

To answer this question, we accumulate the total payments received by Olist and then convert it into million Brazilian Reals (R$).

Olist generated a total revenue of R$ 19.4202M for the period from October 2017 to July 2018.

Question 6.
How is the trend of revenue for each quarter and each month of the year?

To answer this question, we would accumulate the total revenue and group it into quartiles and months to observe the trend of increasing or decreasing total revenue in each time range.

Based on the trend chart, it can be observed that the revenue for each quarter shows an increasing trend, reaching its peak in the second quarter of 2018 with 4.1932 R$ Million. However, in the third quarter of 2018, the total revenue experiences a significant decline with 2.4889 R$ Million.

Based on the trend chart, it can be observed that the revenue for each month shows an increasing trend, reaching its peak in November of 2017 with 1.5275 R$ Million. However, in December of 2017, the total revenue experiences a significant decline with 0.9910 R$ Million and increase in January of 2018.

Factors such as New Year’s resolutions, holiday shopping, or sales events during the early months of the year can contribute to this trend. Additionally, customers may have higher spending power at the beginning of the year due to salary bonuses, tax refunds, or other financial factors. These observations suggest that there may be seasonal variations in customer behavior and demand, with a higher inclination towards shopping and increased order numbers at the beginning of the year. As a result, the total revenue obtained during that period is higher.

Summary and Conclusion

  1. According to the provided chart, the product category bed bath table stands out as the most popular and best-selling item, with a total sales quantity of 11,684 products.
  2. The provided chart shows that the product category “health beauty” has the highest total sales, amounting to 1.267 million R$. This category’s success can be attributed to the combination of high customer demand and relatively higher average prices compared to other product categories.
  3. Based on the chart provided, it is evident that the most frequently used payment type is credit card, with a total of 83,528 transactions. This observation aligns with the possible reason that customers prefer using credit cards due to the convenience they offer.
  4. These observations indicate that there are seasonal trends in customer behavior, with a higher inclination towards shopping and increased order numbers at the beginning of the year. This information can be valuable for businesses to plan their marketing strategies and adjust their operations to align with customer demand throughout the year.
  5. During the period from October 2017 to July 2018, Olist accumulated a total revenue of R$ 19.4202 million. This figure represents the total payments received by Olist during that specific timeframe. By accumulating the payments and converting the total into million Brazilian Reals (R$), the revenue generated by Olist can be measured and understood in terms of its financial performance.
  6. The revenue figures show consistent patterns of increasing trends in both quarterly and monthly data, with the highest points occurring at the beginning of the year. These patterns can be attributed to factors such as holiday shopping, New Year’s resolutions, and increased customer purchasing power due to financial factors like bonuses and tax refunds. These observations suggest that there are seasonal variations in customer behavior and demand, with a greater inclination towards shopping and increased revenue at the start of the year.

In conclusion, there is still much more to explore from the Olist database. Moving forward, conducting analyses from various different perspectives can provide valuable insights and opportunities for further exploration. By exploring the dataset from different angles, such as customer segmentation, geographical analysis, product trend analysis, or supply chain analysis, we can gain a deeper understanding of Olist’s business dynamics and uncover new possibilities.

For the Jupyter Notebook of this project, you can view it at the following location:

github.com/inesiameita/Data-Wrangling-and-SQL-Project-Olist-Database

If there are any shortcomings in the analysis and report, please feel free to provide feedback in order to improve the outcome of this project.
Thank you 🚀

--

--