Data Analysis and Visualisation - Mapping the Kenyan Startup Ecosystem of 2022 using Python

Introduction

Hello, I Am David Marko. I came across a report on the start-up ecosystem in Kenya graciously provided by DisruptAfrica https://disrupt-africa.com. By the end of the article, I hope to get a few insights into the Kenyan Startup Ecosystem in 2022.

Why?

Source: xkcd.com

I don't have a good answer to that. I got curious to see if there would be any patterns or relevant insights that could be of help sometime in the future. I also could have analyzed in MS Excel, but as a Python developer, my response to anyone who asks would be the famous quote

To a man with a hammer, everything looks like a nail

Requirements

Basic Knowledge of Python
Basic Knowledge of Using Jupyter or Google Colab
Python Libraries (Matplotlib, pandas, NumPy, Jupyter Lab)

Note: You can also download the excel file for the report from here. https://disruptafrica.gumroad.com/l/bqyjt

Installing Libraries

The first step was to open my terminal and install the following libraries by typing the following commands

We will be using Python3.0 and above

pip3 install jupyter
pip3 install numpy matplotlib pandas plotly

Jupyter Labs will help set up an environment for me to run python commands for cleaning and to visualize the provided data. Optionally, you can use Google Colab (https://colab.research.google.com/)

Set up Environment

I created a jupyter notebook instance by using the following command
```
 jupyter notebook
```
This opens your default browser.
Create a new notebook in the folder you selected as indicated below. This will open a new tab for your notebook where we can start running python code.

Note:

Copy and paste the Excel file onto the same folder that you put your notebook in.

Initiating libraries and importing Excel files

First, let's import the libraries onto the notebook

I loaded up the excel file we downloaded into pandas as a Dataframe. To ensure that the file was loaded properly, run df to view the structure of the dataframe.

df = pd.read_excel('Kenyan_Startup_Database_Kenyan_Startup_Ecosystem_Report_2022.xlsx')

The next step was to clean up the dataframe into something that would be easier to plot. This is by removing any empty rows or columns, excel headers or other irrelevant information. After a glance, the only modifications I had to make were as follows:

Drop any empty row or column with no data or NaN values
Maintain the first row as the index
Remove all Sector headers indicated

df.drop('Unnamed: 7', axis=1, inplace=True) #Remove Unnamed: 7 Column without reassigning the dataframe
df = df[df['Sector'].notna()] # Override the dataframe with a new dataframe where the Sector column is not NaN
df

Now I have a clean dataframe df that is ready for plotting.

Data Visualization

Here is where I started to visualize the report using matplotlib while checking for specific insights. Below, I will showcase the insight I got, the code used and the output.

Location

I plotted a simple bar graph where I grouped all locations by size and plotted them onto a bar chart. As seen below, most Startups have headquarters in Nairobi.

 # Group by locations
 locations = df.groupby(['HQ']).size().to_frame('Companies')
 # Initiate the Pie Chart
 fig_locations = px.pie(locations, values='Companies', names=locations.index, title='Locations')
 # Include labels to show the location and percentage
 fig_locations.update_traces(textposition='inside', textinfo='percent+label')
 # Visualize the data
 fig_locations.show()

Year of Launch

To visualize the year created, I chose a scattered bubble chart, grouped the data by sector and plotted. As seen, there was a huge surge in Fintech startups in 2022 and 2021

 # Create a new dataframe grouping data by sector and year of launch
 year_created = df.groupby(['Year of launch', 'Sector']).size().to_frame('Number').reset_index()
 # Convert the Year column into an integer
 year_created['Year of launch'] = year_created['Year of launch'].astype('int')
 # Plot onto a scattered graph
 fig = px.scatter(year_created, x="Year of launch", y="Number",
              size="Number", color="Sector",
                  hover_name="Sector", log_x=True, size_max=40)
 fig.show()

Female Founded Startups.

Here, I extracted only the female-founded companies and below are some interesting insights:

18% of all startups have Female Founders

  # Group the columns
  female_led = df.groupby(['Female founder?']).size().to_frame('Number')
  # Plot onto a graph
  fig_locations = px.pie(female_led, values='Number', names=female_led.index, title='Female Founded?')
  fig_locations.update_traces(textposition='inside', textinfo='percent+label')
  # Preview
  fig_locations.show()

The oldest female-founded company is Komaza (komaza.com) which works in the Conversation Tech and hires the most employees (242)
```
  oldest = df[df['Female founder?'] == 'Yes'].sort_values(['Year of launch'], ascending=[True])
```

The highest percentage of female founders are in the Fintech industry, followed by E-Health, Recruitment and HR. The least being Construction and AI/IoT

  # Group sector where Female founder is Yes
  female_sectors = df[df['Female founder?'] == 'Yes'].groupby(['Sector']).size().to_frame('Number')
  # Create pie chart object
  fig_sectors = px.pie(female_sectors, values='Number', names=female_sectors.index, title='Female Founders by Sector')
  # Add labels onto the pie chart
  fig_sectors.update_traces(textposition='inside', textinfo='percent+label')
  # Preview
  fig_sectors.show()

2016 had the highest spike in the number of female founders compared to any other year. It has been a stagnant trend for the following 6yrs.

  # Group by year of launch only when Female founder? is Yes
  female_founded = df[df['Female founder?'] == 'Yes']
  female_led = female_founded.groupby(['Year of launch']).size().to_frame('Number')
  # Plot onto a line graph
  fig = px.line(female_led, x=female_led.index, y="Number", title='Trend in Female Founded Companies')
  # Preview
  fig.show()

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1673268449914/874a38cd-4cd0-4a40-bc2f-786e4ff939bc.png align="center")

Gender Distribution

Tech startups are majorly male-dominated, but on a sector-to-sector basis, there are variations as some have more dominance than others.

 # Create new dataframes for each gender group
 female_founded = df[df['Female founder?'] == 'Yes']
 male_founded = df[df['Female founder?'] == 'No']
 # Group each dataframe by sector
 female_founded = female_founded.groupby(['Sector']).size().to_frame('Number').reset_index()
 male_founded = male_founded.groupby(['Sector']).size().to_frame('Number').reset_index()
 # Left merge to create a new dataframe on shared sectors
 merged = female_founded.merge(male_founded, how='left', on='Sector')
 # rename columns
 merged.rename(columns = {'Number_x':'Female Founders', 'Number_y':'Male Founders'}, inplace = True)
 merged.sort_values(['Female Founders', 'Male Founders'], ascending=[False, False])
 # Plot on graph
 fig = px.bar(merged, x="Sector", y=["Female Founders","Male Founders"], title="Gender Distribution by Sector")
 fig.show()

This graph however doesn't show sectors with no female founders. I merged and compared to get the results as indicated below.

 only_female = female_founded['Sector'][~female_founded['Sector'].isin(male_founded['Sector'])]
 # There were no sectors exclusively run by Female founders

 only_male = male_founded['Sector'][~male_founded['Sector'].isin(female_founded['Sector'])]
 # This gave me a list of sectors that dont have any female founders
 only_male

Accelerator

A surprising number of startups start from Accelerators and Incubators as seen from the graph and code below

 accelerated = df.groupby(['Accelerated?']).size().to_frame('Number')
 fig_locations = px.pie(accelerated, values='Number', names=accelerated.index, title='Accelerated?')
 fig_locations.update_traces(textposition='inside', textinfo='percent+label')
 fig_locations.show()

There is also an increase in the number of startups joining accelerators over the years, with the most common Startup Sector being Fintech and the most active year being 2020

 accelerator = df[df['Accelerated?'] == 'Yes'].groupby(['Year of launch', 'Accelerated?', 'Sector']).size().to_frame('Number').reset_index()
 accelerator['Year of launch'] = accelerator['Year of launch'].astype('int')
 fig = px.scatter(accelerator, x="Year of launch", y="Number",
              size="Number", color="Number",
                  hover_name="Sector", log_x=True, size_max=40)
 fig.show()

Employees

Kenyan Startups have hired a total of 13,140 employees with Conservation, Agritech and Ecommerce being the dominant sectors in recruitment

 total_employees = df['Number of employees'].sum()
 total_employees

 average_employees = df.groupby(['Sector'])['Number of employees'].mean().to_frame('Average')
 average_employees['Average'] = average_employees['Average'].astype('int')
 # average_employees
 fig = px.bar(average_employees, x=average_employees.index, y='Average')
 fig.show()

Conclusion

Thanks for reading along, I hope to update the article or create a second part once I can curate more information like Funding, Salaries, Grants, News, Mergers or Acquisitions and more.

You can download the notebook and relevant files from here