Co2 Emissions: Who’s Responsible?

Benjamin Riela

21 min read · Apr 20, 2021

Github Repository:

rielaben/carbon_emissions_project

https://static.dw.com/image/55722772_101.jpg

Motivation

I want to preface this analysis by saying that I came into this project knowing very little about Global Warming and the co2 emissions problem. I knew that co2 emissions were a major issue and I wanted to learn more, so this project served as a fantastic opportunity to not only practice my data science skills but to also learn more about a topic that I have been curious about for a long time.

I came into this analysis with two simple questions:

Who is responsible for the current carbon emissions problem?
Who should bear the most weight in fixing the carbon emission problem?

For the purposes of this study, I chose to focus just on carbon dioxide (co2) emissions, because co2 accounts for about 80% of the greenhouse gases that contribute to Global Warming.

Github Repository Structure

I organized my files in my Github repository in the following way:

“data_cleaning.ipynb” reads in the raw data (will be discussed below) from the data sources and performs all necessary data cleaning operations. The output from this file is “cleaned_data.csv”, which is used as input for the following files to perform queries and visualizations.
“co2_codebook.ipynb” displays a dataframe that is a codebook which details and explains all the columns regarding carbon emissions (see image below).
“SQL_analysis.ipynb” reads in “cleaned_data.csv” as input and performs PostgreSQL queries on the data.
“python_analysis.ipynb” also reads in “cleaned_data.csv” as input. This file performs the python data analysis and visualizations.

While I will include some code and screenshots from these files in this article, I will not include everything, in order to keep the article at a reasonable length. If you are able to run the code as you read along, I highly recommend doing so.

If you are not as interested in the Data Sources and Data Processing parts of this project, click here to skip right to the Analysis and Visualization section (where the investigation for who is responsible for the co2 emissions problem starts).

Data Sources

The first data source I use in the project is from Our World in Data. This focused on co2 emissions data and was presented in multiple csv files, with the most important file being “owid-co2-data.csv”. It is presented in this format:

This dataset has 55 columns in it, and each column is explained in the codebook (which you can see if you run the “co2_codebook.ipynb” file in the Github repository). Here is a screenshot from the first couple entries in the codebook:

From this data source, we will primarily be focusing on the “iso_code”, “country”, “year”, “co2”, “co2_per_capita”, “cumulative_co2”, and “population” columns.

The second data source I use in this project is from the World Bank Group. I used data from two pages, which can be seen here and here. The files I used from these csv package downloads are “Metadata_Country_API_NY.GDP.MKTP.CD_DS2_en_csv_v2_2163564.csv”, “API_11_DS2_en_csv_v2_2163688.csv”, and “Metadata_Country_API_11_DS2_en_csv_v2_2163688.csv”. From these files, the columns that I will use are “Country Name”, “Country Code”, “Region”, “Income”, “2018 GDP”.

Data Processing

The first step is to import all the World Bank datasets and combine them together to make a dataset that we can then merge with the carbon emissions dataset. The following 4 screenshots show the steps to consolidate all the data of interest from the World Bank into a single dataframe:

Now with the income data in the desired format, we need to add GDP information. We can start by using the following function to see the percentage of missing values for each column in the gdp_df dataframe:

def percent_missing_cols(df):
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
    return missing_value_df

Since 2018 had about 5.7% of values missing, but 2019 had almost 13% values missing, we can isolate the 2018 GDP as the column to perform GDP analysis on. In this analysis, we’re only using GDP to measure a country’s current economic status, so historical GDP information has no role in this project.
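To make the year selection concrete, here is a minimal sketch that applies the same function to a made-up GDP table and picks the year column with the fewest missing values. The toy values and the names toy_gdp and best_year are assumptions for illustration only, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def percent_missing_cols(df):
    # Share of null values per column, expressed as a percentage
    percent_missing = df.isnull().sum() * 100 / len(df)
    return pd.DataFrame({'percent_missing': percent_missing})

# Hypothetical toy GDP table (values are made up)
toy_gdp = pd.DataFrame({
    'Country Code': ['AAA', 'BBB', 'CCC', 'DDD'],
    '2018': [1.0, 2.0, 3.0, np.nan],
    '2019': [1.1, np.nan, np.nan, np.nan],
})

missing = percent_missing_cols(toy_gdp)

# Pick the year column with the smallest share of missing values
year_cols = ['2018', '2019']
best_year = missing.loc[year_cols, 'percent_missing'].idxmin()
print(missing)
print(best_year)
```

On the real data, the same comparison is what points to 2018 over 2019.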

The following code merges the GDP information with the income information of each country and filters on the 2018 GDP information to create a merged dataframe with all the economic information we need for this analysis:

country_wealth_data = pd.merge(merged_counties, gdp_df, how='inner', on='Country Code').drop("Country Name_y", axis=1).rename(columns={'2018':'2018 GDP', 'Country Name_x':'Country Name'})
country_wealth_data = country_wealth_data[["Country Name", "Country Code", "Region", "Income", "2018 GDP"]]

Now we can finally create a merged dataframe combining the cleaned GDP/income data with the co2 emissions data. The result is a dataframe that contains information for 202 countries from the year data was first available for each country (for some countries this goes back to 1750) until 2019 (the 2020 data was not available at the time of this writing). In total there are 20,098 entries in “merged_df”, with the dataframe spanning 19 columns.

merged_df = pd.merge(country_wealth_data, co2_data, how='inner', left_on='Country Code', right_on='iso_code')
merged_df.head()
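Because this is an inner merge, any country code that appears in only one of the two sources is silently dropped. One way to check for that, sketched here on made-up tables, is to run an outer merge with pandas' indicator option and look at the rows that did not match. The codes and values below are placeholders, not real data.

```python
import pandas as pd

# Hypothetical stand-ins for country_wealth_data and co2_data
wealth = pd.DataFrame({
    'Country Code': ['AAA', 'BBB', 'XXX'],
    'Income': ['High', 'Low', 'High'],
})
co2 = pd.DataFrame({
    'iso_code': ['AAA', 'BBB', 'ZZZ'],
    'year': [2019, 2019, 2019],
    'co2': [1.0, 2.0, 3.0],
})

# The inner merge keeps only codes present in both tables
inner = pd.merge(wealth, co2, how='inner',
                 left_on='Country Code', right_on='iso_code')

# An outer merge with indicator=True labels where each row came from
check = pd.merge(wealth, co2, how='outer',
                 left_on='Country Code', right_on='iso_code',
                 indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Country Code', 'iso_code', '_merge']])
```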

The following code filters the columns so that we only keep the columns of interest, then renames some of them to include units:

# trim down columns 
merged_df = merged_df[['Country Name', 'Country Code', "Region", "Income", '2018 GDP', 'year',
       'co2', 'co2_growth_prct', 'co2_growth_abs',
       'consumption_co2', 'trade_co2', 'trade_co2_share',
       'co2_per_capita', 'consumption_co2_per_capita',
       'share_global_co2', 'cumulative_co2',
       'share_global_cumulative_co2', 'population']]

# rename columns to include units
merged_df = merged_df.rename(columns=
                               {'Country Name':'country',
                                'Country Code': 'country code',
                                'co2':'co2 (M Tonnes)', 
                                'co2_growth_abs':'co2_growth_abs (M Tonnes)',
                                'consumption_co2':'consumption_co2 (M Tonnes)',
                                'trade_co2': 'trade_co2 (M Tonnes)',
                                'co2_per_capita':'co2_per_capita (Tonnes)',
                                'consumption_co2_per_capita':'consumption_co2_per_capita (Tonnes)',
                                'cumulative_co2':'cumulative_co2 (M Tonnes)'
                                })

The “Income” column above includes the word “income” after every income category (for example, “High income”, “Low income”, “Upper middle income”), so the following code removes the trailing “income” to make the column less repetitive:

merged_df['Income'] = merged_df.apply(lambda x: x['Income'].rsplit(' ', 1)[0], axis=1)
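As a small illustration of what rsplit does here, the sketch below strips the last word from a few sample labels. It also shows a vectorized alternative using pandas string methods, which is an option I am suggesting rather than what the notebook uses.

```python
import pandas as pd

# rsplit(' ', 1) splits once, starting from the right, so only the last word is removed
label = 'Upper middle income'
print(label.rsplit(' ', 1)[0])

# Vectorized alternative on a whole column (sample labels only)
income = pd.Series(['High income', 'Upper middle income',
                    'Lower middle income', 'Low income'])
trimmed = income.str.rsplit(' ', n=1).str[0]
print(trimmed.tolist())
```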

Now we should check the data types of the columns:

While the majority of this is fine, we can change the data type for 2 columns. One is the “Income” column. This is because “Income” is an ordinal variable (a ranked categorical variable), and therefore it should be converted to type “category”. This can be done with the following code.

# Create and assign the category ranking
my_categories = pd.CategoricalDtype(categories = ['Low', 'Lower middle', 'Upper middle', 'High'],ordered=True)

# Convert income column from type object to type category
merged_df['Income'] = merged_df['Income'].astype(my_categories)

Now that this is a category, we can make queries such as the following, which returns only countries with incomes of “Upper middle” or higher:

display(merged_df[merged_df['Income'] >= 'Upper middle'][['country', 'Income']].drop_duplicates().reset_index().drop('index', axis=1))
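Here is a minimal, self-contained sketch of how an ordered category makes comparisons like this possible. The four countries are placeholders, not rows from the dataset.

```python
import pandas as pd

# Ordered from lowest to highest income
income_order = pd.CategoricalDtype(
    categories=['Low', 'Lower middle', 'Upper middle', 'High'],
    ordered=True)

# Hypothetical toy rows
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'Income': ['Low', 'High', 'Upper middle', 'Lower middle'],
})
toy['Income'] = toy['Income'].astype(income_order)

# Comparisons follow the category order, not alphabetical order
richer = toy[toy['Income'] >= 'Upper middle']['country'].tolist()
highest = toy['Income'].max()
print(richer, highest)
```

Without the ordered dtype, the same comparison on plain strings would be alphabetical, which would place "High" below "Low".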

The other column we can change the data type of is the “year” column. Although this is not strictly necessary, the conversion allows us to practice working with time series data. With the following code, we can convert the “year” column from type int to type datetime64[ns].

merged_df['year'] = pd.to_datetime(merged_df['year'], format='%Y')

Now we can look at the data types of our column again and see the changes:

As you can see with the last cell above, after that step we are done with the “data_cleaning.ipynb” file and can convert this dataframe into a csv file.

Analysis and Visualization

Now we can get into analyzing and visualizing this data. In this section, I will be going back and forth between SQL queries and python analysis/visualizations, even though these are in separate files in my repository.

To set up our “SQL_analysis.ipynb” file we need to do the following:

import pandas as pd
import numpy as np
%load_ext sql
%sql postgres://jovyan:si330studentuser@localhost:5432/si330
import psycopg2
import sqlalchemy

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
country_data = pd.read_csv("cleaned_data.csv").drop("Unnamed: 0", axis=1)
# convert year back to int to avoid SQL errors
country_data['year'] = pd.DatetimeIndex(country_data['year']).year

#create engine and load dataset into sql table
engine = sqlalchemy.create_engine('postgres://jovyan:si330studentuser@localhost:5432/si330', paramstyle="format")
%sql drop table if exists co2_data
country_data.to_sql('co2_data', engine)

For the “python_analysis.ipynb” file, the following code will set up the same dataframe as we had in the “data_cleaning.ipynb” file. The screenshot is a helper function setting 2019 as the max_year (so that when the 2020 data comes in, we should only have to change max_year to 2020 to get an updated analysis).

Note that because the data was converted to a csv file, we need to reassign “Income” and “year” to categorical and datetime variables as we did above.

import pandas as pd
import numpy as np
%load_ext sql
%sql postgres://jovyan:si330studentuser@localhost:5432/si330
import psycopg2
import sqlalchemy
import matplotlib.pyplot as plt
import unittest
from scipy import stats
import seaborn as sns

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
country_data = pd.read_csv("cleaned_data.csv").drop("Unnamed: 0", axis=1)

# The datatypes of the "Income" and "year" 
# columns do not carry over from the data_cleaning.ipynb file. 
# This is because in converting the data to a csv and then back 
# to a dataframe, the data is turned into text and so the special 
# column types we gave these columns vanish. Therefore we need to 
# reassign the columns as we did in the other file.
my_categories = pd.CategoricalDtype(categories = ['Low', 'Lower middle', 'Upper middle', 'High'],
                                    ordered=True)
country_data['Income'] = country_data['Income'].astype(my_categories)
country_data['year'] = pd.to_datetime(country_data['year'])
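The loss of column types through a csv round trip can be reproduced on a tiny example. The sketch below writes a two-row dataframe to an in-memory buffer (standing in for the csv file), reads it back, and then restores the types. The data is made up.

```python
import io
import pandas as pd

income_order = pd.CategoricalDtype(
    categories=['Low', 'Lower middle', 'Upper middle', 'High'],
    ordered=True)

# Hypothetical toy frame with the two special column types
df = pd.DataFrame({
    'Income': pd.Series(['High', 'Low']).astype(income_order),
    'year': pd.to_datetime(['2018-01-01', '2019-01-01']),
})

# Round trip through csv text
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Both columns come back as plain text columns
dtype_after_csv = str(reloaded['Income'].dtype)
print(reloaded.dtypes)

# Restore the types, as done in the notebook
reloaded['Income'] = reloaded['Income'].astype(income_order)
reloaded['year'] = pd.to_datetime(reloaded['year'])
print(reloaded.dtypes)
```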

The first visualization function in our python analysis plots time series data for a dataframe. This function is very flexible, as you can plot time series data (in this case, data by year) for one dataframe or a list of dataframes (both will be done in this analysis). It also allows you to pick the start year and end year you want to see plotted, assuming there is data available for those years.

def time_series_plot(dfs, col, start_year=False, end_year = max_year, title='Global', max_yval=None, 
                     fill_below=True, list_countries=None):
    fig = plt.figure(figsize=(12, 5))
    list_country_index = 0
#     each df in 'dfs' represents a dataframe for emissions data over time for 
#     a specific country
    for df in dfs:
#         returns year-by-year data for a particular column (such as 'co2 (M Tonnes)')
        df = column_sums_by_year(df, col)
#     if start year is not specified then the start year is the earliest year that data is available
        if start_year==False:
            start_year = min(df.index.year)
#     error checking to see if the inputted years have available data
        if (end_year > max(df.index.year))or(start_year < min(df.index.year)):
            print(f'ERROR: You must choose an end year between {min(df.index.year)} and {max(df.index.year)}')
            return
        
#     get data for year range
        series = df[str(start_year):str(end_year)]
    
#     extract x and y values and plot
        x_values = series.index
        y_values = series
        plt.plot(x_values, y_values)

        if start_year == 1750:
            plt.title(f'{title} {col} emissions from the Industrial Revolution to {end_year}', fontsize=14)
        else:
            plt.title(f'{title} {col} emissions from {start_year} to {end_year}', fontsize=14)

        plt.xlabel('Year', fontsize=14)
        plt.ylabel(f'{col}', fontsize=14)
#         set y and x limits and shade below the line on graph
        if fill_below==True:
            plt.fill_between(x_values, y_values)
        plt.xlim([min(x_values), max(x_values)])
        if max_yval == None:
            plt.ylim([min(y_values), max(y_values)*1.05])
        else:
            plt.ylim([min(y_values), max_yval*1.05])
        
#         set legend if there are multiple countries
        if list_countries!=None:
            plt.legend(list_countries)
        
        plt.grid(True)
    return fig
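The helper column_sums_by_year is defined in the notebook and not shown in this article. As a rough sketch of what such a helper could look like (this is my assumption, not the notebook's actual code), the function below sums a column across all rows for each year and returns a series indexed by date, which is the shape the plotting function relies on.

```python
import pandas as pd

def column_sums_by_year_sketch(df, col):
    # Hypothetical stand-in: total of `col` across all rows for each year
    return df.groupby('year')[col].sum()

# Made-up emissions for two placeholder countries
toy = pd.DataFrame({
    'country': ['A', 'B', 'A', 'B'],
    'year': pd.to_datetime(['2018', '2018', '2019', '2019'], format='%Y'),
    'co2 (M Tonnes)': [1.0, 2.0, 3.0, 4.0],
})

yearly = column_sums_by_year_sketch(toy, 'co2 (M Tonnes)')

# A datetime index allows slicing by year strings
window = yearly.loc['2019':'2019']
print(yearly)
print(window)
```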

The following function calls will return the next two graphs (in order):

time_series_plot([country_data], 'co2 (M Tonnes)').savefig('imgs/global_co2_since_industrial_revolution.png')
time_series_plot([country_data], 'co2 (M Tonnes)', 1919).savefig('imgs/global_co2_1919_2019.png')

You can see on these two graphs how drastic the rise in co2 has been since the Industrial Revolution, and more specifically since the 1950’s.

This brings up the question, which countries produced the most co2 emissions in the most recent year data was available (2019), and how much was produced?

We can start by writing an SQL query to see which regions of the world produced the most co2 emissions in 2019:

%%sql 
select row_number () over (order by round(sum("co2 (M Tonnes)")) desc) as "rank",
"Region", round(sum("co2 (M Tonnes)")) as "2019 co2 Emissions" from co2_data 
where "co2 (M Tonnes)" is not null and year=2019 group by "Region" order by "2019 co2 Emissions" desc;
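For readers more comfortable with pandas than SQL, the same ranking can be sketched with a filter, a groupby, and a sort. The regions and numbers below are placeholders chosen to make the steps easy to follow.

```python
import pandas as pd

# Made-up rows: one from another year and one with a missing value
toy = pd.DataFrame({
    'Region': ['R1', 'R1', 'R2', 'R2', 'R3'],
    'year': [2019, 2019, 2019, 2018, 2019],
    'co2 (M Tonnes)': [10.4, 5.2, 30.0, 99.0, None],
})

# WHERE: keep 2019 rows with a non-null value
recent = toy[(toy['year'] == 2019) & toy['co2 (M Tonnes)'].notna()]

# GROUP BY + SUM + ROUND + ORDER BY
by_region = (recent.groupby('Region')['co2 (M Tonnes)']
             .sum()
             .round()
             .sort_values(ascending=False)
             .reset_index(name='2019 co2 Emissions'))

# ROW_NUMBER: rank follows the sorted order
by_region.insert(0, 'rank', range(1, len(by_region) + 1))
print(by_region)
```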

Now which countries produced the most co2 in 2019 and how much did they emit?

%%sql 
select row_number () over (order by round("co2 (M Tonnes)") desc) as "rank",
"country", round("co2 (M Tonnes)") as "2019 co2 Emissions", "Income" from co2_data 
where "co2 (M Tonnes)" is not null and year=2019 order by "2019 co2 Emissions" desc limit 10;

While these SQL queries are nice, it would also be nice to visualize this with Python. The following functions can help us achieve this.

# This function returns data frame filtered by a specific year with
# countries sorted by descending values by the column of interest
def country_data_by_year(df, year, col):
    return df[pd.DatetimeIndex(df['year']).year==year].sort_values(by=col, ascending=False)

The year_data_graph function returns a bar plot showing a ranking of the input column for a specific year. With this function, the user can input any year, column, and the number of countries they want to see in the bar plot.

def year_data_graph(df, year, col, num_countries_display):
    data_year = country_data_by_year(df, year, col)
    
#     get median value, this will be plotted on the bar graph as well
    column_median_val = data_year[col].median()
    fig = plt.figure(figsize=(12, 5))
    
#     pd.concat is used here because Series.append was removed in pandas 2.0
    plt.bar(pd.concat([data_year['country'][:num_countries_display], pd.Series(['Median All Countries'])]),
            pd.concat([data_year[col][:num_countries_display], pd.Series([column_median_val])]))
    plt.title(f'{year} Top {num_countries_display} Country {col} Levels')
    plt.xlabel('Country Name')
    plt.ylabel(f'{col}')
    plt.xticks(rotation = 90)
    plt.tight_layout()
    return fig

#function call
year_data_graph(country_data, max_year, 'co2 (M Tonnes)', 40).savefig('imgs/2019_top_40_co2.png')

*Note the far right bar is the median co2 emissions total for all countries in the dataset

It’s clear through this graph that China and the U.S. are without question the top co2 producers. This brings up the thought of how they stack up against the rest of the world combined.

To answer this question, we can create a pie chart function. After creating some helper functions and code that can be found in the notebook, below is the pie chart function, followed by a couple of function calls and the results.

def pie_chart(input_nums, labels, title, total_val):
    
#     checking to see that the numbers to plot are correct
    total_val_num = sum(input_nums)
    np.testing.assert_almost_equal(total_val, total_val_num)
    
    zeros_list = []
    for i in input_nums:
        zeros_list.append(0)
    zeros_tuple = tuple(zeros_list)
    explode = zeros_tuple

    fig, ax = plt.subplots()
    ax.pie(input_nums, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)
#     makes sure the elements are drawn in a circle
    ax.axis('equal')
    
    plt.title(title)
    return fig

#function calls
pie_chart([world_without_china_year_co2, df_curr_co2_values[0]], ['Rest of World', 'China'], 
          f"China Co2 Emissions vs Rest of World for {max_year}", 
          total_world_max_year).savefig('imgs/china_vs_rest_world_pie_chart.png')

pie_chart([world_without_china_US_year_co2, df_curr_co2_values[0], df_curr_co2_values[1]], ['Rest of World', 'China', 'US'], 
          f"China and US Emissions vs Rest of World for {max_year}", 
          total_world_max_year).savefig('imgs/china_and_us_vs_rest_of_world_pie_chart.png')

pie_chart([world_without_top_four_co2, df_curr_co2_values[0], df_curr_co2_values[1], df_curr_co2_values[2], df_curr_co2_values[3]], 
          ['Rest of World', 'China', 'US', 'India', 'Russia'], 
          f"China, US, India and Russia Emissions vs Rest of World for {max_year}", 
          total_world_max_year).savefig('imgs/top_4_vs_rest_of_world_pie_chart.png')

I personally found these numbers fascinating, as the top 4 countries (out of 202) were responsible for almost 57% of our world’s co2 emissions. By looking just at these last couple of figures, it seems like having China rein in its co2 emissions would be a huge step in tackling our emissions problem.
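The arithmetic behind these pie charts is simple: subtract the top emitters from the world total to get the "Rest of World" slice, then express each slice as a percentage. The helper names and numbers in this sketch are made up for illustration; the notebook's own helper code is not reproduced here.

```python
def rest_of_world_split(total, top_values):
    # First slice is everything not covered by the listed top emitters
    rest = total - sum(top_values)
    return [rest] + list(top_values)

def shares_percent(parts):
    # Each slice as a percentage of the whole, rounded to one decimal
    total = sum(parts)
    return [round(100 * p / total, 1) for p in parts]

# Hypothetical world total and two top emitters (not real figures)
world_total = 1000.0
top_two = [300.0, 150.0]

parts = rest_of_world_split(world_total, top_two)
shares = shares_percent(parts)
print(parts)
print(shares)
```

Because the slices are built from the total, they always sum back to it, which is the property the assert_almost_equal check inside pie_chart verifies.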

But can we explore more into this? Let’s take a deeper look at China and its co2 emissions growth.

# call to the time series plot (china_co2_by_year was created by a helper function, shown in notebook)
time_series_plot([china_co2_by_year],'co2 (M Tonnes)', title='China').savefig('imgs/china_1899_2019.png')

We can see here that China’s growth in co2 emissions really started to ramp up after the turn of the century. But by exactly how much did it increase? The following code answers this question:

China’s co2 emissions grew by more than 200% from 1999 to 2019, and I find it fascinating how recent this growth has been. According to the Federal Reserve Bank of St. Louis, China’s industrial revolution did not start until about 1980, which makes sense when looking at the above time plot of China’s emissions.
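A growth figure like this is a percent change between two years. As a minimal sketch with made-up values (not China's actual emissions), the calculation looks like this:

```python
def percent_growth(start_value, end_value):
    # Percent change from the start value to the end value
    return 100 * (end_value - start_value) / start_value

# Hypothetical emissions in the first and last year of the window
emissions_start = 100.0
emissions_end = 320.0

growth = percent_growth(emissions_start, emissions_end)
print(f"{growth:.1f}% growth")
```

Note that growth above 200% means the final value is more than three times the starting value, not twice.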

This brings up the idea that up until this point in the analysis, we’ve only looked at data from the most recent year available (2019). What about cumulative co2 emissions?

According to the codebook, the cumulative co2 emissions column measures “cumulative emissions of CO2 from 1751 through to the given year, measured in million tonnes”. Using this statistic gives us a better understanding of how much historical co2 emissions have been produced.
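To illustrate how a cumulative column relates to annual emissions, the sketch below builds a running total per country from made-up yearly values. The column name cumulative_sketch is hypothetical, and the real dataset already ships with its own cumulative column, so this is only meant to show the idea.

```python
import pandas as pd

# Made-up annual emissions for two placeholder countries
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'year': [2017, 2018, 2019, 2018, 2019],
    'co2': [1.0, 2.0, 3.0, 10.0, 20.0],
})

# Running total within each country, in year order
toy = toy.sort_values(['country', 'year'])
toy['cumulative_sketch'] = toy.groupby('country')['co2'].cumsum()
print(toy)
```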

We can first write an SQL query to find this answer:

%%sql 
select row_number () over (order by "cumulative_co2 (M Tonnes)" desc) as "rank",
"country","year","cumulative_co2 (M Tonnes)", "Income"  from co2_data 
where "cumulative_co2 (M Tonnes)" is not null and year = 2019 order by "cumulative_co2 (M Tonnes)" desc limit 10;

And then visualize it with this function call:

year_data_graph(country_data, max_year, 'cumulative_co2 (M Tonnes)', 30).savefig('imgs/top_30_cumulative_co2.png')

Looking at this bar graph tells us a different story than the previous bar graph, as we can see now that the United States historically has by far been the leading co2 emitter since 1751.

Let’s look at some overlaying time series plots to see how the year-by-year co2 emissions of the top 3 emitters for 2019 (the U.S., China, and India) have changed over time.

us_data = co2_levels_per_country('United States')
india_data = co2_levels_per_country('India')
time_series_plot([china_co2_by_year, us_data, india_data],'co2 (M Tonnes)', 
                 title='US, China, India', start_year=1899, max_yval=11000, 
                 list_countries=['China','US', 'India']).savefig('imgs/us_china_india_co2_time_series.png')

It’s impressive to see that the U.S. has actually decreased in emissions since the early 2000’s, but at the same time, we can see how the U.S. has been producing co2 emissions at a high level through most of the 1900’s in addition to the 2000’s (the same cannot be said for China and India).

While China and India have steadily increased their emissions in the past couple of decades, it’s clear that the United States needs to take a large portion of the blame for our current co2 emissions situation because of its high historical emissions.

What percentage of cumulative co2 emissions are the U.S. and other top countries responsible for? To answer this, we will be making pie charts again, but this time with cumulative co2 values. This will show who is responsible for all of the co2 released into the atmosphere over our history since the Industrial Revolution.

#function call 1
pie_chart([world_without_top_1, df_cml_co2_values[0]], ['Rest of World', 'United States'], 
          f"US Cumulative Co2 Emissions vs Rest of World for {max_year}", 
          total_world_max_year_cml).savefig('imgs/us_vs_rest_of_world_cumulative_co2_pie_chart.png')

#function call 2
pie_chart([world_without_top_five_cml_co2, df_cml_co2_values[0], df_cml_co2_values[1], 
           df_cml_co2_values[2], df_cml_co2_values[3], df_cml_co2_values[4]],
          ['Rest of World', 'United States', 'China', 'Russia', 'Germany', 'UK'], 
          f"Top 5 Cumulative Co2 Emissions vs Rest of World for {max_year}", 
          total_world_max_year_cml).savefig('imgs/top_5_vs_rest_of_world_cumulative_co2_pie_chart.png')

It’s again quite striking to see how top-heavy this emissions problem is. As you can see, the U.S. alone is responsible for over a quarter of all co2 emitted into the atmosphere since the Industrial Revolution. It is also interesting to note that while India ranked 3rd for 2019 emissions, it ranks 7th in cumulative emissions.

What can we learn by looking at the amount of cumulative co2 emissions produced when grouped by country income? We can first look at an SQL query:

%%sql 
select row_number () over (order by sum("cumulative_co2 (M Tonnes)") desc) as "rank",
"Income", sum("cumulative_co2 (M Tonnes)") as "Cumulative co2 Emissions" from co2_data 
where "cumulative_co2 (M Tonnes)" is not null and year = 2019 group by "Income" order by "Cumulative co2 Emissions" desc;

The function below produces pie charts that visualize the percentage of co2 emitted by income category (essentially visualizing the query from above). You can change the column of interest to any column and the year to any year there is data on.

def income_levels_pie_chart(df, col, year):
    display(list(df[col]))
    
    labels = list(df[col].index)
    sizes = list(df[col])
    explode = (0, 0, 0, 0)

    fig, ax1 = plt.subplots()
    ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    ax1.set_title(f"{col} By Income Level in {year}")

    return fig

#this function calls the function above to make the graphs
def co2_emissions_by_income(df, col, year):
    
    data_year = country_data_by_year(df, year, col)
#     graphing by sum for each income bracket
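#     only the column of interest is summed, so non-numeric columns (such as 'year') are left out
    sum_data = data_year.groupby('Income')[[col]].sum()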
    fig = income_levels_pie_chart(sum_data, col, year)
    return fig
#function call
co2_emissions_by_income(country_data, 'cumulative_co2 (M Tonnes)', max_year).savefig('imgs/income_cumulative_co2_pie_chart.png')
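The percentages printed on the pie chart come from dividing each income group's total by the overall total. Here is a small sketch of that calculation on made-up numbers, with the income column stored as an ordered category as in the rest of the analysis.

```python
import pandas as pd

income_order = pd.CategoricalDtype(
    categories=['Low', 'Lower middle', 'Upper middle', 'High'],
    ordered=True)

# Hypothetical countries and cumulative totals (not real figures)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'Income': ['High', 'High', 'Low', 'Upper middle'],
    'cumulative_co2 (M Tonnes)': [60.0, 20.0, 5.0, 15.0],
})
toy['Income'] = toy['Income'].astype(income_order)

# observed=True keeps only income groups that actually appear in the data
totals = toy.groupby('Income', observed=True)['cumulative_co2 (M Tonnes)'].sum()
shares = (100 * totals / totals.sum()).round(1)
print(shares)
```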

This pie chart shows that for cumulative co2 emissions through 2019, countries with higher incomes were responsible for an astonishing 91.5% of the total. This may mean that many of the higher income countries were able to become rich in part by emitting lots of co2 through industrialization in previous decades, while developing countries that didn’t have the same technology and resource opportunities were not producing as many emissions. This suggests that richer countries should be more responsible not only for decreasing their emissions, but also for trying to remove the emissions that they historically released into the atmosphere.

At this point, it makes sense to start looking at correlations between the columns in our dataset. A good way to start is with a correlation matrix displayed as a heat map, which we can create using the function below.

*Note that from this point through the rest of the analysis (except for the paired t-test section), I will be working with data from 2018, and this is because there was too much economic data missing during 2019 (as explained here).

# data values for 2018 alone
year_2018 = country_data[pd.DatetimeIndex(country_data['year']).year==2018]

def heat_map(df):
#     correlations are only defined for numeric columns
    corrmat = df.select_dtypes(include='number').corr()
    top_corr_features = corrmat.index
    fig = plt.figure(figsize=(10,10))
    cmap = sns.diverging_palette(230, 20, as_cmap=True)
#     mask the upper triangle so each pair of columns is shown only once
    mask = np.triu(np.ones_like(corrmat))
    g=sns.heatmap(df[top_corr_features].corr(),annot=True,cmap=cmap, vmax=1, vmin=-1, mask=mask)
    return fig

heat_map(year_2018).savefig('imgs/heat_map.png')

The function below creates a scatter plot with two columns, and it displays the Pearson correlation as well as the p-value:

def co2_emissions_scatter(df, x_col, y_col, title=None):
    test_data = df[(df[x_col].notna()) & (df[y_col].notna())]
    new_x = test_data[x_col]
    new_y = test_data[y_col]

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(x = new_x, y = new_y)
    plt.title(title)
    plt.xlabel(x_col)
    plt.ylabel(y_col)

    test = stats.pearsonr(new_x, new_y)
    display("Test Statistics", test)
    
    return fig

What is the relationship between 2018 GDP (measured in USD) and 2018 cumulative co2 emissions?

co2_emissions_scatter(country_data[pd.DatetimeIndex(country_data['year']).year==2018], '2018 GDP', 
                      'cumulative_co2 (M Tonnes)', 
                      'Relationship between 2018 GDP and 2018 cumulative co2 (M Tonnes) emissions').savefig(
    'imgs/2018_gdp_vs_cumulative_co2_scatter.png')

The Pearson correlation coefficient (not shown above, but it can be seen in the Github Repository) for these two columns is approximately 0.97, with a p-value of about 4.05e-58. This means that the correlation is statistically significant at a 5% threshold. The two countries that are separated from the rest are the U.S. (top-right of the graph) and China (middle-right on the graph).
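For intuition about what the Pearson coefficient measures, here is a minimal pure-Python sketch of the formula on made-up data. It is only meant to show the calculation; the analysis itself uses scipy's stats.pearsonr, which also returns the p-value.

```python
import math

def pearson_r(xs, ys):
    # Covariance of x and y divided by the product of their spreads
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: one series rises with x, the other falls
gdp = [1.0, 2.0, 3.0, 4.0]
co2_rising = [10.0, 20.0, 30.0, 40.0]
co2_falling = [40.0, 30.0, 20.0, 10.0]

r_rising = pearson_r(gdp, co2_rising)
r_falling = pearson_r(gdp, co2_falling)
print(r_rising, r_falling)
```

A coefficient near 1 means the two columns rise together almost perfectly in a straight line, and a coefficient near -1 means one falls as the other rises.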

The graph above shows that there is a very strong relationship between how economically strong a country is (represented by 2018 GDP) and the amount of co2 emissions that country has released into the atmosphere. This suggests that the economically strongest countries should bear the most blame for the current co2 emissions situation.

So now we have an idea of where the most pollution currently and historically has been coming from. However, there is one problem with this, and it is that countries differ in size in terms of population. There may be smaller countries that, per capita (per person), are worse polluters than the U.S., China, India, and other major emitters, but it would be hard to identify these countries because we have only looked at emissions in terms of totals per country so far. Therefore, we also need to account for these statistics per capita in addition to total emissions for entire countries.

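The function below runs a t-test comparing two groups of countries, splitting the data frame at the desired income level and testing the column of interest. Although the function is named paired_t_test, the two groups contain different countries, so the call to stats.ttest_ind with equal_var=False performs an independent two-sample (Welch’s) t-test rather than a paired one.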

def paired_t_test(df, income_level, col_of_interest):
    cat1 = df[df['Income'] <= income_level]
    cat2 = df[df['Income'] > income_level]

    current_co2_per_capita_test = stats.ttest_ind(cat1[col_of_interest].dropna(), 
                                                  cat2[col_of_interest].dropna(),equal_var=False)
    print(f"{col_of_interest}: {current_co2_per_capita_test}")

#function call
data_max_year = country_data_by_year(country_data, 2019, 'co2_per_capita (Tonnes)')
paired_t_test(data_max_year, 'Lower middle', 'co2_per_capita (Tonnes)')

Running this code gives us a t-value of -9.35 and a p-value of 4.78e-17. The p-value indicates that the results are significant at a 5% threshold. The t-value is the ratio of the difference between the two group means to the variability within the groups (the 2 groups being “low” and “lower middle” income countries vs. “upper middle” and “high” income countries). Essentially, the higher the magnitude (absolute value) of the t-score is, the more difference there is between the groups; the smaller the magnitude of the t-score is, the more similarity there is between the groups. A magnitude higher than 9 is extremely high and indicates that there is a large difference between these two groups of incomes when it comes to co2 emissions per capita. More information about t-values can be found here.
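To show where a t-value like this comes from, here is a minimal sketch of the unequal-variance (Welch) t statistic, which is the version scipy computes when equal_var=False. The two groups below are made-up numbers, not the per capita data.

```python
import math
import statistics

def welch_t(group_a, group_b):
    # Difference in means divided by the combined standard error
    var_a = statistics.variance(group_a)  # sample variance
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error

# Hypothetical per capita values for lower and higher income groups
lower_income = [1.0, 2.0, 3.0]
higher_income = [4.0, 5.0, 6.0]

t_value = welch_t(lower_income, higher_income)
print(t_value)
```

The sign is negative because the first group has the smaller mean, which matches the negative t-value in the results above, where the lower income group is passed first.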

Now let’s see what the top countries are in terms of co2 emissions per capita. We can start with another SQL query:

%%sql 
select row_number () over (order by "co2_per_capita (Tonnes)" desc) as "rank",
"country", "co2_per_capita (Tonnes)", "Income" from co2_data
where "co2_per_capita (Tonnes)" is not null and year=2018 order by "co2_per_capita (Tonnes)" desc limit 10;

Now let's visualize this in python:

year_data_graph(year_2018, 2018, 'co2_per_capita (Tonnes)', 40).savefig('imgs/2018_top_40_co2_per_capita.png')

So it turns out there are many small countries (such as Sint Maarten, Luxembourg, Faroe Islands, Estonia, and Turkmenistan) that are very high in co2 emissions per person, but because their populations are so small, we were not able to see them before. Some of these top 40 countries (such as Qatar, Trinidad and Tobago, Kuwait, Bahrain, Brunei, UAE, Saudi Arabia, and Oman) are well known for their fossil-fuel reserves, and a large part of their economies relies on fossil fuels. For these countries, there is no direct incentive (besides trying to take care of the environment) to move away from heavy fossil fuel use, so it makes sense that they rank so high in co2 emissions per capita. This graph also shines a light on many prominent developed countries (Australia, United States, Canada, South Korea, Netherlands, Germany, Japan, etc.) for their co2 per capita emissions. This tells us that these countries, despite ranking among the most advanced and richest in the world, are still wasteful in their co2 emissions as well.

Circling back to earlier in this analysis, where do China and India fall in these co2 per capita rankings?

year_2018_per_capita_co2 = country_data_by_year(year_2018, 2018, 'co2_per_capita (Tonnes)')
year_2018_per_capita_co2 = year_2018_per_capita_co2.reset_index().drop('index', axis=1)
display(year_2018_per_capita_co2.loc[year_2018_per_capita_co2['country'] == 'China'])
display(year_2018_per_capita_co2.loc[year_2018_per_capita_co2['country'] == 'India'])
display(len(year_2018_per_capita_co2))

China ranks 41st (index starts at 0) out of 199 countries in terms of co2 emissions per person. So although at the country level China currently emits the most co2 and has released the second most co2 cumulatively, it has almost 1.4 Billion people, so its per-person figures are not as egregious as they may initially seem. India, as shown above, ranks 121st out of 199 countries in terms of co2 per capita. It has 1.3+ Billion people, so in terms of emissions per person, it makes sense that it does not rank close to a country like the U.S. with a population of about 330 Million.
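Since the row index of the sorted dataframe starts at 0, a country's rank is its index position plus one. The sketch below shows that lookup on a made-up three-country table; the helper name rank_of is hypothetical.

```python
import pandas as pd

# Hypothetical per capita values for three placeholder countries
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'co2_per_capita (Tonnes)': [2.0, 9.0, 5.0],
})

# Sort from highest to lowest and renumber the rows from 0
ranked = (toy.sort_values('co2_per_capita (Tonnes)', ascending=False)
          .reset_index(drop=True))

def rank_of(country):
    # Index position is 0-based, so add 1 for a human-readable rank
    position = ranked.index[ranked['country'] == country][0]
    return int(position) + 1

print(rank_of('B'), rank_of('C'), rank_of('A'))
```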

Conclusion

As we have seen, there doesn’t seem to be one clear and obvious answer as to who is responsible for our current emissions situation and who should bear the most weight in cleaning it up. Some countries might try to blame China and India because of the volume of emissions they produce, but with their combined populations making up almost 3 Billion people, those kinds of emissions numbers might be expected with so many people living there. Some countries may try to blame the United States and developed Western nations, who seemingly became rich through many decades of high co2 emissions in the 1900’s and have contributed a large amount historically to the problem we face today. And it is also important to recognize that there are many small countries, along with countries that economically rely on their fossil fuel reserves, where emissions per person are extremely high. There is an argument to be made that if the larger countries need to scale back in co2 emissions, then it is only fair that smaller countries who have been wasteful in emissions must also do the same.

As cliché as it may sound, this really sounds like it is going to take a group effort from many different countries if we want to try to curb global co2 emissions. While we saw that some countries produce significantly more co2 emissions than others, I also saw in completing this project that the actions of just one or two countries will not be able to tackle this alone. I’m very curious to look at the 2020 data (when it becomes available) and to see what the impact of COVID-19 has been on the co2 emissions problem. As travel decreased heavily in 2020, did global co2 emissions finally decrease? If they did, this could be some significant momentum to start heading in the right direction.

Thank you so much for taking a look at my project. I look forward to creating more in-depth projects like this one in the future!