Data Analysis and Visualisation - Mapping the Kenyan Startup Ecosystem of 2022 using Python

Using Pandas and Plotly

Introduction

Hello, I Am David Marko. I came across a report on the start-up ecosystem in Kenya graciously provided by DisruptAfrica https://disrupt-africa.com. By the end of the article, I hope to get a few insights into the Kenyan Startup Ecosystem in 2022.

Why?

Source: xkcd.com

I don't have a good answer to that. I got curious to see if there would be any patterns or relevant insights that could be of help sometime in the future. I also could have analyzed in MS Excel, but as a Python developer, my response to anyone who asks would be the famous quote

To a man with a hammer, everything looks like a nail

Requirements

  • Basic Knowledge of Python

  • Basic Knowledge of Using Jupyter or Google Colab

  • Python Libraries (Matplotlib, pandas, NumPy, Jupyter Lab)

Note: You can also download the excel file for the report from here. https://disruptafrica.gumroad.com/l/bqyjt

Installing Libraries

The first step was to open my terminal and install the following libraries by typing the following commands

We will be using Python3.0 and above

pip3 install jupyter
pip3 install numpy matplotlib pandas plotly

Jupyter Labs will help set up an environment for me to run python commands for cleaning and to visualize the provided data. Optionally, you can use Google Colab (https://colab.research.google.com/)

Set up Environment

  1. I created a jupyter notebook instance by using the following command

     jupyter notebook
    

    This opens your default browser.

  2. Create a new notebook in the folder you selected as indicated below. This will open a new tab for your notebook where we can start running python code.

Note:

Copy and paste the Excel file onto the same folder that you put your notebook in.

Initiating libraries and importing Excel files

First, let's import the libraries onto the notebook

I loaded up the excel file we downloaded into pandas as a Dataframe. To ensure that the file was loaded properly, run df to view the structure of the dataframe.

df = pd.read_excel('Kenyan_Startup_Database_Kenyan_Startup_Ecosystem_Report_2022.xlsx')

The next step was to clean up the dataframe into something that would be easier to plot. This is by removing any empty rows or columns, excel headers or other irrelevant information. After a glance, the only modifications I had to make were as follows:

  • Drop any empty row or column with no data or NaN values

  • Maintain the first row as the index

  • Remove all Sector headers indicated

df.drop('Unnamed: 7', axis=1, inplace=True) #Remove Unnamed: 7 Column without reassigning the dataframe
df = df[df['Sector'].notna()] # Override the dataframe with a new dataframe where the Sector column is not NaN
df

Now I have a clean dataframe df that is ready for plotting.

Data Visualization

Here is where I started to visualize the report using matplotlib while checking for specific insights. Below, I will showcase the insight I got, the code used and the output.

  1. Location

    I plotted a simple bar graph where I grouped all locations by size and plotted them onto a bar chart. As seen below, most Startups have headquarters in Nairobi.

     # Group by locations
     locations = df.groupby(['HQ']).size().to_frame('Companies')
     # Initiate the Pie Chart
     fig_locations = px.pie(locations, values='Companies', names=locations.index, title='Locations')
     # Include labels to show the location and percentage
     fig_locations.update_traces(textposition='inside', textinfo='percent+label')
     # Visualize the data
     fig_locations.show()
    

  2. Year of Launch

    To visualize the year created, I chose a scattered bubble chart, grouped the data by sector and plotted. As seen, there was a huge surge in Fintech startups in 2022 and 2021

     # Create a new dataframe grouping data by sector and year of launch
     year_created = df.groupby(['Year of launch', 'Sector']).size().to_frame('Number').reset_index()
     # Convert the Year column into an integer
     year_created['Year of launch'] = year_created['Year of launch'].astype('int')
     # Plot onto a scattered graph
     fig = px.scatter(year_created, x="Year of launch", y="Number",
                  size="Number", color="Sector",
                      hover_name="Sector", log_x=True, size_max=40)
     fig.show()
    

  3. Female Founded Startups.

    Here, I extracted only the female-founded companies and below are some interesting insights:

    • 18% of all startups have Female Founders

        # Group the columns
        female_led = df.groupby(['Female founder?']).size().to_frame('Number')
        # Plot onto a graph
        fig_locations = px.pie(female_led, values='Number', names=female_led.index, title='Female Founded?')
        fig_locations.update_traces(textposition='inside', textinfo='percent+label')
        # Preview
        fig_locations.show()
      

    • The oldest female-founded company is Komaza (komaza.com) which works in the Conversation Tech and hires the most employees (242)

        oldest = df[df['Female founder?'] == 'Yes'].sort_values(['Year of launch'], ascending=[True])
      
    • The highest percentage of female founders are in the Fintech industry, followed by E-Health, Recruitment and HR. The least being Construction and AI/IoT

        # Group sector where Female founder is Yes
        female_sectors = df[df['Female founder?'] == 'Yes'].groupby(['Sector']).size().to_frame('Number')
        # Create pie chart object
        fig_sectors = px.pie(female_sectors, values='Number', names=female_sectors.index, title='Female Founders by Sector')
        # Add labels onto the pie chart
        fig_sectors.update_traces(textposition='inside', textinfo='percent+label')
        # Preview
        fig_sectors.show()
      

    • 2016 had the highest spike in the number of female founders compared to any other year. It has been a stagnant trend for the following 6yrs.

        # Group by year of launch only when Female founder? is Yes
        female_founded = df[df['Female founder?'] == 'Yes']
        female_led = female_founded.groupby(['Year of launch']).size().to_frame('Number')
        # Plot onto a line graph
        fig = px.line(female_led, x=female_led.index, y="Number", title='Trend in Female Founded Companies')
        # Preview
        fig.show()
      
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1673268449914/874a38cd-4cd0-4a40-bc2f-786e4ff939bc.png align="center")
  1. Gender Distribution

    Tech startups are majorly male-dominated, but on a sector-to-sector basis, there are variations as some have more dominance than others.

     # Create new dataframes for each gender group
     female_founded = df[df['Female founder?'] == 'Yes']
     male_founded = df[df['Female founder?'] == 'No']
     # Group each dataframe by sector
     female_founded = female_founded.groupby(['Sector']).size().to_frame('Number').reset_index()
     male_founded = male_founded.groupby(['Sector']).size().to_frame('Number').reset_index()
     # Left merge to create a new dataframe on shared sectors
     merged = female_founded.merge(male_founded, how='left', on='Sector')
     # rename columns
     merged.rename(columns = {'Number_x':'Female Founders', 'Number_y':'Male Founders'}, inplace = True)
     merged.sort_values(['Female Founders', 'Male Founders'], ascending=[False, False])
     # Plot on graph
     fig = px.bar(merged, x="Sector", y=["Female Founders","Male Founders"], title="Gender Distribution by Sector")
     fig.show()
    

    This graph however doesn't show sectors with no female founders. I merged and compared to get the results as indicated below.

     only_female = female_founded['Sector'][~female_founded['Sector'].isin(male_founded['Sector'])]
     # There were no sectors exclusively run by Female founders
    
     only_male = male_founded['Sector'][~male_founded['Sector'].isin(female_founded['Sector'])]
     # This gave me a list of sectors that dont have any female founders
     only_male
    

  2. Accelerator

    A surprising number of startups start from Accelerators and Incubators as seen from the graph and code below

     accelerated = df.groupby(['Accelerated?']).size().to_frame('Number')
     fig_locations = px.pie(accelerated, values='Number', names=accelerated.index, title='Accelerated?')
     fig_locations.update_traces(textposition='inside', textinfo='percent+label')
     fig_locations.show()
    

    There is also an increase in the number of startups joining accelerators over the years, with the most common Startup Sector being Fintech and the most active year being 2020

     accelerator = df[df['Accelerated?'] == 'Yes'].groupby(['Year of launch', 'Accelerated?', 'Sector']).size().to_frame('Number').reset_index()
     accelerator['Year of launch'] = accelerator['Year of launch'].astype('int')
     fig = px.scatter(accelerator, x="Year of launch", y="Number",
                  size="Number", color="Number",
                      hover_name="Sector", log_x=True, size_max=40)
     fig.show()
    

  3. Employees

    Kenyan Startups have hired a total of 13,140 employees with Conservation, Agritech and Ecommerce being the dominant sectors in recruitment

     total_employees = df['Number of employees'].sum()
     total_employees
    
     average_employees = df.groupby(['Sector'])['Number of employees'].mean().to_frame('Average')
     average_employees['Average'] = average_employees['Average'].astype('int')
     # average_employees
     fig = px.bar(average_employees, x=average_employees.index, y='Average')
     fig.show()
    

Conclusion

Thanks for reading along, I hope to update the article or create a second part once I can curate more information like Funding, Salaries, Grants, News, Mergers or Acquisitions and more.

You can download the notebook and relevant files from here

Did you find this article valuable?

Support David Marko by becoming a sponsor. Any amount is appreciated!