Data Wrangling

Data Wrangling in Python: Case Study of Netflix

Introduction

According to IBM, data wrangling is the process of cleaning, structuring, and enriching raw data to be used in data science, machine learning (ML), and other data-driven applications. It addresses data quality issues such as missing values, duplicates, outliers, and formatting inconsistencies.
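The quality issues named above (missing values, duplicates, formatting inconsistencies) can be illustrated with a minimal sketch. It uses only the Python standard library, and the field names (`title`, `type`, `release_year`) and sample rows are hypothetical, chosen for illustration rather than taken from the project's dataset.

```python
# Hypothetical raw records; titles and values are made up for illustration.
raw_rows = [
    {"title": "  Show A ", "type": "TV Show", "release_year": "2019"},
    {"title": "Show A", "type": "tv show", "release_year": "2019"},  # duplicate once cleaned
    {"title": "Movie B", "type": "Movie", "release_year": ""},       # missing year
    {"title": "Movie C", "type": "movie", "release_year": "2021"},
]


def clean_row(row):
    """Normalize formatting and convert types for one record."""
    year = row["release_year"].strip()
    return {
        "title": row["title"].strip(),                # trim stray whitespace
        "type": row["type"].strip().lower(),          # unify capitalization
        "release_year": int(year) if year else None,  # empty string becomes missing
    }


def wrangle(rows):
    """Clean every row, then drop exact duplicates while keeping order."""
    seen = set()
    cleaned = []
    for row in map(clean_row, rows):
        key = (row["title"], row["type"], row["release_year"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


clean_rows = wrangle(raw_rows)
missing_years = sum(1 for r in clean_rows if r["release_year"] is None)
print(len(clean_rows), "rows kept;", missing_years, "with a missing release year")
```

Cleaning before de-duplicating matters here: the first two rows only become identical once whitespace and capitalization are normalized.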

Objectives

The data wrangling process includes:

The cleaned dataset was exported as a .csv file, available at this link.

Exploratory Data Analysis (EDA)

Titanic Exploratory Data Analysis

History of the Titanic

The RMS Titanic was a British ocean liner that sank on April 15, 1912, during its maiden voyage from Southampton, England, to New York City. It was the largest ship in the world at the time, measuring 882 feet long and 92 feet wide, and displacing 52,310 long tons. The Titanic departed from Southampton on April 10, 1912, stopping at the French port of Cherbourg and the Irish port of Queenstown to pick up more passengers before setting off across the Atlantic Ocean.

Despite receiving several ice warnings, the Titanic continued at full speed, and at about 11:40 p.m. on April 14, 1912, lookout Fred Fleet spotted an iceberg dead ahead. First Officer William Murdoch ordered the ship turned hard to port and signaled the engine room to reverse the engines, but the ship was too large and moving too fast, and the iceberg was too close. The Titanic struck the iceberg, and several compartments began filling with water. The ship’s design included a double bottom and 16 watertight compartments sealed by massive doors that could be closed by a single electric switch on the bridge or automatically by electric water sensors. However, the ship was not designed to withstand a collision that flooded more than four compartments, and the iceberg had caused five compartments to begin filling with water.

The ship began to sink, and the crew began loading people into the lifeboats. There were not enough lifeboats for everyone on board, and many lifeboats left the Titanic only half full. The Titanic sank at about 2:20 a.m. on April 15, 1912, with approximately 1,500 people still on board. The disaster led to the establishment of the International Ice Patrol and the first International Convention for Safety of Life at Sea, which required every ship to have lifeboat space for each person embarked, hold lifeboat drills, and maintain a 24-hour radio watch.

Objective

To conduct an Exploratory Data Analysis on the Titanic dataset. This includes Univariate, Bivariate, Multivariate, Outlier, and Target Analysis of the data. The dataset was obtained from this link

Data Dictionary

| Feature Name | Description | Category (Optional) |
| --- | --- | --- |
| PassengerId | Unique identifier of the passenger | |
| Survived | Survival | 0 = No, 1 = Yes |
| Pclass | Ticket class | 1 = Upper, 2 = Middle, 3 = Lower |
| Name | Name of the passenger | |
| Sex | Gender of the passenger | male, female |
| Age | Age in years | |
| SibSp | Number of siblings / spouses aboard the Titanic | |
| Parch | Number of parents / children aboard the Titanic | |
| Ticket | Ticket number | |
| Fare | Passenger fare | |
| Cabin | Cabin number | |
| Embarked | Port of Embarkation | Southampton (S), Cherbourg (C), Queenstown (Q) |

The image below shows the relationship between Passenger Class, Sex, and Survival. More survivors came from 1st class than from 2nd or 3rd class, and 3rd class had the highest death rate.
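A bivariate comparison like this one comes down to a grouped survival rate. The following is a minimal sketch, assuming rows keyed by the `Pclass`, `Sex`, and `Survived` columns from the data dictionary. The sample rows and the helper name `survival_rate_by` are hypothetical and are not real passenger records.

```python
from collections import defaultdict

# Hypothetical sample rows using column names from the data dictionary.
# These are NOT real passenger records.
passengers = [
    {"Pclass": 1, "Sex": "female", "Survived": 1},
    {"Pclass": 1, "Sex": "male", "Survived": 0},
    {"Pclass": 3, "Sex": "female", "Survived": 1},
    {"Pclass": 3, "Sex": "male", "Survived": 0},
    {"Pclass": 3, "Sex": "male", "Survived": 0},
    {"Pclass": 3, "Sex": "male", "Survived": 1},
]


def survival_rate_by(rows, *keys):
    """Return {group: share of survivors} for each combination of keys."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += 1
        survivors[group] += row["Survived"]  # Survived is 0 or 1
    return {group: survivors[group] / totals[group] for group in totals}


rates = survival_rate_by(passengers, "Pclass", "Sex")
for group, rate in sorted(rates.items()):
    print(group, f"{rate:.0%}")
```

With the dataset loaded into a pandas DataFrame `df`, the same table could be obtained with `df.groupby(["Pclass", "Sex"])["Survived"].mean()`.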

Outlier Handling

Outliers cannot be removed because:

Conclusion

61.62% of the passengers died in the accident, compared to 38.38% who survived. Most passengers who boarded the Titanic were between the ages of 20 and 40, the young adults in the population. Most passengers in 3rd class died, unlike those in 1st and 2nd class. Most men died, unlike most women. Most of the passengers who died had embarked at Southampton. However, one elderly passenger survived, an exceptional case that needs to be looked into.

For further Titanic EDA insights, Click Here

Dunes Sales Exploratory Data Analysis

EDA consists of Data Cleaning (Feature Engineering, identifying and dealing with missing data), Univariate, Bivariate, and Multivariate Analysis, and obtaining insights.

The dataset consists of the following parameters:

For further Dunes EDA insights, Click Here.

Soran Automobiles Exploratory Data Analysis

About Soran Automobiles

Soran Automobiles is an automobile company that specializes in providing high-quality vehicles to customers worldwide. With a focus on innovation and customer satisfaction, Soran Automobiles has become a leading brand in the industry. At Soran Automobiles, we understand the importance of a reliable and efficient vehicle, which is why we offer a wide range of cars with different specifications to meet the needs of our diverse customer base. Our cars are equipped with advanced technology and safety features to ensure a comfortable and secure driving experience.

Problem Overview

The company director is seeking to gain valuable insights that can help:

The Director believes that the analysis will help them make data-driven decisions and take the business to the next level. EDA consists of Feature Engineering, identifying and dealing with missing data, and eliminating duplicates. The dataset consists of the following parameters:

For further Soran Automobiles Analysis, Click Here

Visualization

Qlik HR Dashboard using Tableau

This project analyzes HR data and builds a dashboard that the HR department will use to make decisions. The dataset was obtained from this link and customized to the Kenyan context.

Objectives

EDA was performed on a dataset containing employment data from Qlik. The dataset includes various features such as employee demographics, job titles, departments, and performance ratings. The goal is to gain insights into employee performance, demographics, and trends to inform HR decisions.

The dataset consists of the following 15 features:

Qlik HR Dashboard

For further Qlik EDA insights, Click here

Data Analysis using Tableau: Dula Chains Supply Chain Case Study

About Dula Chains

DulaChains Solution is a leading global retail supply chain business revolutionizing how products are sourced and delivered worldwide. With our unwavering commitment to excellence, innovation, and sustainability, we provide end-to-end supply chain solutions that empower retailers, manufacturers, and suppliers to thrive in an ever-evolving marketplace. Our global reach spans across continents, connecting diverse markets and bridging the gap between suppliers and retailers. We leverage our extensive network to source the highest quality products, and deliver them to our clients' doorsteps in a timely manner. We take care of the entire supply chain process, allowing our clients to focus on their core business objectives.

I recently conducted a Supply Chain Analysis using Tableau, and I am excited to share my findings.

Problem Overview

The executive team wanted a dashboard that provides a comprehensive overview of the company's profits while incorporating forecasts for the coming months. The dashboard will help the executive team to make data-driven decisions and devise effective strategies to implement process improvements and proactively address potential issues before they impact the company's performance.

DulaChains seeks to leverage its data resources to uncover patterns and trends in its supply chain operations by:

The dataset consists of the following parameters:

Key Findings

Data Analysis using Microsoft Excel: Case Study of Sterling E-Commerce

STERLING E-COMMERCE DASHBOARD with Microsoft Excel

I recently conducted a Sales Analysis using Excel, and I am excited to share my findings.

Problem Overview

The management would like to leverage data to gain insights into the business and improve efficiency by:

Key Findings

Machine Learning Projects

Heart Disease Prediction using Supervised Learning: Case of Peterside Hospital, Lagos, Nigeria

About Peterside Hospital

Peterside Hospital is a private hospital located in Lagos, Nigeria. It was founded in 2000 by Dr. Efemena Ikpro, a renowned Nigerian physician. The hospital provides a wide range of medical services, including general medicine, surgery, paediatrics, obstetrics and gynaecology, cardiology, neurology, and oncology. Peterside Hospital is known for its state-of-the-art facilities, modern equipment, and highly skilled medical professionals. The hospital has a team of over 300 medical personnel, including doctors, nurses, and other healthcare professionals, who are trained locally and internationally. The hospital has several specialized clinics, including a diabetes clinic, a fertility clinic, a heart clinic, and a cancer clinic. It also offers services such as health check-ups, laboratory tests, and imaging services.

Objective

To build a machine-learning model that can predict the likelihood of a person having heart disease based on the given features.

The features include:

For further Peterside Hospital Heart Disease Prediction insights Click Here.

Customer Segmentation using Clustering Algorithm K-Means: Case of Selore Nigeria

K-Means clustering is an unsupervised learning technique suited to unlabelled data that lacks predefined categories or groups. The primary objective of the algorithm is to group similar data points together, with the number of groups determined by the value of K. The algorithm is iterative: it assigns each data point to one of the K groups based on the provided features, then updates the groups and repeats until the assignments stop changing.
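The iterative assignment described above can be sketched in plain Python. This is a minimal illustration of the standard assign-and-update loop (Lloyd's algorithm), not the project's actual implementation, which may well have used a library. The function names and the sample (annual income, spending score) pairs are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Group points into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical (annual income, spending score) pairs, made up for illustration.
customers = [(15, 80), (16, 78), (17, 82), (80, 15), (82, 18), (85, 12)]
labels, centroids = k_means(customers, k=2)
print(labels)
print(centroids)
```

The result depends on the starting centroids, which is why the sketch fixes a seed. Library implementations typically rerun the algorithm from several starting points and keep the best result.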

About the business - Selore

Selore Nigeria is a popular electronics retail chain in Nigeria that specializes in the sale of mobile phones, tablets, laptops, and other related accessories. The company was founded in 2022 and has since grown to become one of the largest retailers of mobile phones and related products in the country. Selore Nigeria offers a wide range of mobile phones from top brands such as Samsung, Apple, Nokia, and Huawei, among others. The company is also known for its after-sales support, including repairs and maintenance services, as well as trade-in options for old devices. With its numerous outlets nationwide, Selore Nigeria is a go-to destination for many Nigerians seeking to purchase mobile phones and related accessories.

Business Problem

A retail company wants to understand the spending behaviour of its customers to target them with personalized marketing campaigns. They have collected data on customer demographics (age, gender) and spending habits (annual income, spending score) and want to segment their customers based on their spending behaviour. The objective of this project is to build a customer segmentation model to help the retail company in developing targeted marketing strategies.

Objective

The goal of this project is to segment customers based on their purchasing behaviour and demographic information. We will use unsupervised learning techniques to identify groups of customers with similar behaviour and characteristics. This information can be used to develop targeted marketing campaigns, personalize product recommendations, and optimize pricing strategies.

The retail company has collected data on 200 customers containing:

For further Selore Customer Segmentation insights Click Here.

Computer Vision Projects

Face, Eyes, and Smile Detection

OpenCV is an open-source computer vision library, used here for Face, Eyes, and Smile Detection.

For further details on Face, Eyes, and Smile Detection Click Here

Hand Detection

The MediaPipe Hand Landmarker is an ML model that detects the landmarks of the hands in an image.

To find out more on Hand Detection Click Here

Pose Detection

MediaPipe is a framework of ML solutions; its pose estimation model identifies human body parts and joints in images and videos in real time.

For further details on Pose Detection Click Here