Exploratory Data Analysis of Fortune 1000 Companies

Introduction to Exploring Fortune 1000 Companies Data using Python

As a data analyst, one of the most important skills you can have is the ability to work with data to gain insights and answer questions. Recently, my girlfriend started working for a Fortune 1000 company, which sparked my curiosity about the makeup of these companies. In particular, I was interested in the percentage of CEOs who are women, which states have the most Fortune 1000 companies, and which companies are the most profitable. To answer these questions, I searched Google for a suitable dataset and found one on Kaggle that had exactly what I was looking for. The dataset covers the year 2021 and was scraped from the Fortune website by the dataset author.

Before diving into the analysis, let’s discuss the meaning of exploratory data analysis (EDA). EDA is a valuable process used by data scientists and analysts to investigate and analyze datasets. It helps to identify the main characteristics of the data through various visualization techniques.

To begin the analysis, I used Python with the Pandas, Seaborn, Squarify, and Matplotlib libraries. Pandas was used to read in and summarize the dataset, Matplotlib to draw the pie chart and the table, Squarify to create the treemap, and Seaborn to draw the bar plot.

Here’s the Python code used to import the necessary libraries:

Import libraries used for analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import squarify # makes tree map
import seaborn as sns

Read in data and perform EDA

After importing the necessary libraries, I read in the dataset using the pandas library, and then checked the structure of the data and the first few rows using the following code:

df = pd.read_csv("Fortune_1000.csv") # Read csv file into python

df.dtypes # see structure of data
## company               object
## rank                   int64
## rank_change          float64
## revenue              float64
## profit               float64
## num. of employees      int64
## sector                object
## city                  object
## state                 object
## newcomer              object
## ceo_founder           object
## ceo_woman             object
## profitable            object
## prev_rank             object
## CEO                   object
## Website               object
## Ticker                object
## Market Cap            object
## dtype: object
df.head(n=10)
##               company  rank  ...  Ticker  Market Cap
## 0             Walmart     1  ...     WMT      411690
## 1              Amazon     2  ...    AMZN     1637405
## 2         Exxon Mobil     3  ...     XOM      177923
## 3               Apple     4  ...    AAPL     2221176
## 4          CVS Health     5  ...     CVS       98496
## 5  Berkshire Hathaway     6  ...    BRKA      550878
## 6  UnitedHealth Group     7  ...     UNH      332885
## 7            McKesson     8  ...     MCK       29570
## 8                AT&T     9  ...       T      206369
## 9   AmerisourceBergen    10  ...     ABC       21246
## 
## [10 rows x 18 columns]
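
The dtypes listing above shows prev_rank and Market Cap as object even though both hold numbers, which usually means a few non-numeric entries are mixed in. Here is a minimal sketch of one way to handle that, assuming the problem values are placeholders such as "-" (the toy rows and the placeholder strings are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the real data; the placeholder strings are assumptions.
toy = pd.DataFrame({
    "company": ["A", "B", "C"],
    "prev_rank": ["1", "unknown", "3"],
    "Market Cap": ["411690", "-", "177923"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
for col in ["prev_rank", "Market Cap"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)        # both columns are now numeric
print(toy.isna().sum())  # how many values could not be parsed
```

Checking the NaN counts afterwards shows how many entries were lost in the conversion, which is worth knowing before plotting or ranking on those columns.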

Next, I wanted to find out the percentage of CEOs who are women. To do this, I used the value_counts() method in Pandas to count the number of female and male CEOs, and then created a pie chart using Matplotlib to visualize the results:

[Pie chart of CEO gender: Male 93.2%, Female 6.8%]

From the pie chart, it’s clear that only 6.8% of Fortune 1000 companies are led by female CEOs.
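
The counting and plotting steps described above can be sketched roughly as follows. This is not the original code: the "yes"/"no" coding of the ceo_woman column, the helper names, the toy data, and the chart title are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for df; the "yes"/"no" coding of ceo_woman is an assumption.
toy = pd.DataFrame({"ceo_woman": ["no"] * 9 + ["yes"]})


def ceo_gender_counts(frame):
    """Count CEOs by gender, relabelling the assumed yes/no flags."""
    counts = frame["ceo_woman"].value_counts()
    return counts.rename(index={"no": "Male", "yes": "Female"})


def plot_ceo_gender_pie(counts):
    """Draw a pie chart with a percentage label on each wedge."""
    # Imported here so the counting helper works without Matplotlib installed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.pie(counts.values, labels=counts.index, autopct="%1.1f%%")
    ax.set_title("CEO gender (toy data)")  # hypothetical title
    return ax


counts = ceo_gender_counts(toy)
print(counts)
```

Calling plot_ceo_gender_pie(counts) and then plt.show() renders the chart. The autopct format string is what produces one-decimal labels like the 93.2% and 6.8% reported above.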

Next, I wanted to see which states have the most Fortune 1000 companies. I used the value_counts() method in Pandas to count the number of companies in each state, and then created a treemap using Squarify to visualize the results:

[Treemap: top 20 states by number of Fortune 1000 companies]

The code behind this treemap uses the pandas and squarify libraries to show the 20 states with the most Fortune 1000 companies.

First, it creates a DataFrame, ‘state_count’, by using the value_counts() method to count how many times each state appears in the ‘state’ column of the original DataFrame ‘df’. The nlargest() method then keeps the 20 states with the highest counts, sorted in descending order.

Next, the sizes and labels for the treemap are created. The sizes variable is a list built from the ‘counts’ column of ‘state_count’, while the labels variable is built with a list comprehension that produces one string per state, combining the state and its count.

Finally, squarify draws the treemap. The sizes and labels variables are passed to the squarify.plot() function, along with a color map and an alpha value that sets the opacity of the squares. plt.axis() turns off the axes, and plt.title() sets the title of the plot. In the resulting treemap, the area of each square reflects how many Fortune 1000 companies the state has.
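
Since the treemap code itself is not shown, here is a rough sketch of the steps just described. The names state_count, sizes, and labels come from the description; the helper functions, the toy data, the "state, newline, count" label format, the viridis color map, the alpha value, and the title are assumptions.

```python
import pandas as pd

# Toy stand-in for df; the state codes are illustrative only.
toy = pd.DataFrame({"state": ["CA", "NY", "CA", "TX", "CA", "NY"]})


def top_state_counts(frame, n=20):
    """Return the n most frequent states with a 'counts' column."""
    state_count = (
        frame["state"]
        .value_counts()
        .rename_axis("state")
        .reset_index(name="counts")
    )
    return state_count.nlargest(n, "counts")


def plot_state_treemap(state_count):
    """Draw a treemap whose square areas are proportional to the counts."""
    # Imported here so the counting helper works without the plotting libraries
    import matplotlib.pyplot as plt
    import squarify

    sizes = state_count["counts"].tolist()
    # Assumed label format: state code on one line, count on the next
    labels = [f"{s}\n{c}" for s, c in zip(state_count["state"], state_count["counts"])]
    colors = plt.cm.viridis([i / len(sizes) for i in range(len(sizes))])
    squarify.plot(sizes=sizes, label=labels, color=colors, alpha=0.8)
    plt.axis("off")  # hide the axes, which carry no meaning in a treemap
    plt.title("States with the most companies (toy data)")  # hypothetical title


top2 = top_state_counts(toy, n=2)
print(top2)
```

Calling plot_state_treemap(top_state_counts(df)) followed by plt.show() would draw the chart for the full dataset.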

To wrap up, let’s compare profit and revenue in a bar plot, which gives a quick view of the performance of the five Fortune 1000 companies with the highest revenue. To do this, I melted the data into a tidy (long) format and then used Seaborn to draw the bar plot. I also created a table showing each company’s profit as a percentage of its revenue.

# Profit vs. revenue in bar plot
rev_prof = df.nlargest(5, 'revenue')  # five companies with the highest revenue
# Reshape to long format: one row per company and variable (revenue or profit)
rev_prof = pd.melt(rev_prof, id_vars=['company'], value_vars=['revenue', 'profit'])
sns.set_theme(style="whitegrid")
sns.set_color_codes("pastel")
p1 = sns.barplot(x="value", y="company", hue="variable", data=rev_prof)
p1.set_title("Profit Vs. Revenue for Top 5 Fortune 1000 Companies")
p1.set(xlabel='Millions ($)', ylabel="")

# Make a table of profit as a percentage of revenue to show with the plot
df2 = df.nlargest(5, 'revenue')
# Compute the percentage directly; this keeps the revenue ordering and
# avoids the extra index column that groupby().apply().reset_index() adds
df3 = df2[['company']].copy()
df3['profit_pct_of_revenue'] = (df2['profit'] / df2['revenue'] * 100).round()
# Put the bar plot and the table together in the same figure

plt.subplots_adjust(left=0.2, bottom=0.4)

the_table = plt.table(cellText=df3.values,
          cellLoc = 'center', rowLoc = 'center',
          transform=plt.gcf().transFigure,
          bbox = ([0.3, 0.1, 0.5, 0.2]))
the_table.auto_set_font_size(False)
the_table.set_fontsize(6)

plt.show()
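
To see what the melt step and the percentage calculation do, here is a small worked example. The two companies and their figures are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy figures in millions of dollars; not taken from the dataset.
wide = pd.DataFrame({
    "company": ["A", "B"],
    "revenue": [200.0, 100.0],
    "profit": [20.0, 5.0],
})

# Wide to long: one row per (company, variable) pair, which is the shape
# Seaborn needs in order to draw one bar per variable using hue
tidy = pd.melt(wide, id_vars=["company"], value_vars=["revenue", "profit"])
print(tidy)

# Profit as a percentage of revenue: 20 / 200 = 10% and 5 / 100 = 5%
margin = (wide["profit"] / wide["revenue"] * 100).round()
print(margin.tolist())
```

The melted frame has the columns company, variable, and value, which are the names passed to the y, hue, and x arguments of sns.barplot() in the code above.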

These are just a few examples of the many techniques available for EDA. If you want to dive deeper into this topic, I highly recommend checking out the Python Plot Gallery. It’s an excellent resource to explore the vast array of visualization tools available and discover new ways to gain insights from your data.

Kevin Zolea
Environmental Specialist