Person coding on a laptop with HTML code on screen, showcasing development work.

SQL Data Cleansing and Analysis project

How can we take raw redundancy data and transform it in SQL Server to enable analysis of trends and drivers?

Data cleaning

Data cleaning is the first step where we take the raw data and fix a lot of the issues with the format of the data and make it more suitable for visualizations and analysis.

The first step is to import the data into MySQL Workbench. This is done by creating a new database (schema) and then import table data.

To view the data in the table:

To work on this data, we want to create a copy of the data so that we keep an original set of the raw data we can go back to.

The steps of data cleansing (on the “layoffs_staging” table) will then be:

— 1. Remove Duplicates
— 2. Standardize Data
— 3. Null Values or blank values
— 4. Remove any columns or rows

Removing Duplicates

We will now be looking for any duplicate data and removing it from the table.

The first setup is to insert a row id to the data in order to aid the removal of any duplicates.

We can then run this query and change the company name to investigate the duplicates

Then to remove the duplicates, we first create a new table (a staging 2 database) which includes row number.

We can then filter on row numbers greater than 1 (the duplicates!)

and Delete them:

Standardizing Data

Next is to make sure that each column contains standard format of data throughout.

For example we can trim spaces off the start and end of entries (white spaces).

Then look at the industry and look at blanks or slightly different versions of the same industry

e.g Crypto / Crypto Currency):

Then check it worked:

Repeat the process or any other issues with format.

For example “Country”:

SELECT DISTINCT country
FROM layoffs_staging2
ORDER BY 1;

Need to remove United States.

And then date format setting of the column (change from text to date format):

Null Values or blank values

Remove null or black values. First try and populate by looking at similar rows (same industry)

An example is Airbnb:

Identify and Update the values by using a JOIN:

Removing rows where two columns are null

Remove any columns or rows

Removing the row number column:

SELECT *
FROM layoffs_staging2;

ALTER TABLE layoffs_staging2
DROP column row_num;

Analysis of Data

Flatlay of a business analytics report, keyboard, pen, and smartphone on a wooden desk.

Here are a few ways we can use SQL to manipulate the data to find trends or snapshots. Otherwise known as Exploratory Data Analysis!

Lets bring up the cleaned data:

Can we see what was the maximum number of layoffs in one day?

This means that 100% of the company was laid off which totaled 12,000 jobs.

What other companies have laid off all employees (ordered by largest first)?

Or we could look at ordering it by the most funds raised:

We could group the company by the sum of laid off:

We can check the date range for the data:

This shows that the data covers 3 years from around the outbreak of Covid 19.

We can see which industries suffered most during this time:

Not really surprising that consumer and retail would be worst effected due to the lockdowns.

Which countries had the most layoffs?

How were the numbers grouped by year?

It should be noted that 2023 is only 3 months of data!

We can do a rolling sum of layoffs until the end of the data period:

We can now see how the numbers built up by month over the period.

Lets look at it by company by year, then rank each year as a top 5:

There you have some examples of how you can use SQL queries to dive in an analyse data. This is just a start and there are many more ways of digging into and presenting the results!