This is a little addendum to my last article, which used Python to analyse the Oxford data set from the UK Met Office.

Since writing that article, I have read some pretty good analyses covering the same kind of thing. Of course, they do not show the code for the analyses, but hey, I’m a code nerd and writing the code to analyse the data is half the fun!

In other analyses, I have noticed R squared values being quoted for linear regression models based on Temperature over Time. Whilst there is nothing particularly wrong with this, it really doesn’t say much about the data. The most you can say is that the measurements do or do not correlate well with time.

To look at correlation you need two variables. One is the dependent variable and the other is the independent variable. To test the hypothesis that Temperatures are dependent on CO2 levels, you should check for a correlation between them.

In this case, Temperatures will be the dependent variable and CO2 levels will be the independent variable.

One of my favourite YouTube bloggers is Josh Starmer. His channel is called StatQuest. If you don’t want to wade through Statistics textbooks, then Josh should be a ‘go to’ guy.

I’ve linked his video on R squared below. Simple and to the point.

I’m going to use the Oxford data set from the UK Met Office and the CO2 data from Mauna Loa to test for a correlation between the two. For Pythonistas, I will include the code, which depends on the previous analysis. You can find that **here** or as an HTML file **here**.

The CO2 data from Mauna Loa starts in 1959 and is the yearly average. This is great because it is almost a linear growth year on year. It does not go down in any year and so I can use it as the independent variable along the X axis.

The Temperature data is the annual average TmaxC from Oxford for the same period. And before you start moaning about averages of averages, remember: this is an exercise in correlation. It is not a definitive pronouncement on the catastrophic, planet-threatening effect that rising levels of CO2 may have on Earth.

The first part of the code is:

```
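# Keep the years that overlap the CO2 record; .copy() avoids editing a view of `grouped`
df_cor = grouped[grouped['Year'] >= 1959].copy()
del df_cor['rolling_average']  # column not needed for this analysis
# Restart the index at zero so the rows align by index with df_co2['PPM']
df_cor.reset_index(inplace=True, drop=True)
df_cor = df_cor.assign(co2=df_co2['PPM'])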
```

All I am doing here is creating a pandas data frame with the temperatures and CO2 levels as columns. I’ve grabbed all the temperature data from 1959 onwards, deleted a column I don’t need and then reset the index to start from zero. Then I’ve added a column with the CO2 data to give me a nice new data frame to poke around with.
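Why bother resetting the index? As far as I can tell it matters because `assign` lines a Series up by index label rather than by position. Here is a minimal sketch of that behaviour, using made-up numbers and hypothetical frame names (`temps`, `co2`) rather than the real data:

```python
import pandas as pd

# Made-up numbers purely for illustration; not the Oxford or Mauna Loa data.
temps = pd.DataFrame({
    'Year': [1957, 1958, 1959, 1960, 1961],
    'TmaxC': [14.1, 13.9, 14.8, 14.0, 14.4],
})
co2 = pd.DataFrame({'PPM': [315.0, 316.0, 317.0]})  # index labels 0, 1, 2

# Filtering keeps the original index labels, so this frame is indexed 2, 3, 4.
subset = temps[temps['Year'] >= 1959].copy()

# assign() aligns on index labels: only label 2 exists in both, and it picks
# up the third CO2 value instead of the first. The other rows become NaN.
misaligned = subset.assign(co2=co2['PPM'])

# With the index reset to 0, 1, 2 every row gets the CO2 value for its year.
aligned = subset.reset_index(drop=True).assign(co2=co2['PPM'])

print(misaligned)
print(aligned)
```

Skipping the reset would not raise an error. It would quietly pair the wrong values, which is worse.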

A lot of code in Python data analysis is for beautifying plots of your data to see what it looks like. Here is the code for that:

```
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(20, 8))
plt.figtext(.5, .9, 'Fig. 13 Scatter plot of Average Yearly TmaxC V Co2 levels PPM', fontsize=18, ha='center')
# x and y are passed by keyword; seaborn 0.12 and later require this
sns.regplot(x=df_cor.co2, y=df_cor.TmaxC, color='green')
plt.xlabel('Co2 PPM 1959 - 2018', fontsize=15)
plt.ylabel('Annual average TmaxC centigrade', fontsize=15);
```

This just sets up some titles and what not for the plot and uses Seaborn ‘regplot’ to show a scatter plot with a regression line.

I think it is *de rigueur* to plot the residuals to see what we are dealing with. So here is what it looks like, complete with a **Lowess** line.
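For anyone wondering what a residual actually is, here is a rough sketch of the calculation with NumPy. The data is synthetic (an assumption for illustration, not the real measurements) and `np.polyfit` stands in for whatever fit the plotting library does behind the scenes. To draw the plot itself, seaborn's `residplot` accepts `lowess=True` (it needs statsmodels installed); that call is shown only as a comment.

```python
import numpy as np

# Synthetic stand-in data (assumption): a weak upward trend plus random noise.
rng = np.random.default_rng(0)
co2 = np.linspace(315, 410, 60)
tmax = 6.0 + 0.02 * co2 + rng.normal(0, 0.6, co2.size)

# Least-squares straight line; for deg=1 polyfit returns [slope, intercept].
slope, intercept = np.polyfit(co2, tmax, 1)

# A residual is the observed value minus the value the line predicts.
residuals = tmax - (slope * co2 + intercept)

print(f'mean residual: {residuals.mean():.6f}')
print(f'largest absolute residual: {np.abs(residuals).max():.3f}')

# One way to plot them with a Lowess line (requires seaborn and statsmodels):
# sns.residplot(x=co2, y=tmax, lowess=True)
```

With a least-squares fit the residuals average out to zero, so what you are looking for in the plot is any pattern or curve in the Lowess line, which would suggest a straight line is the wrong model.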

So now I can use ‘*scipy.stats*‘ to determine what the linear regression tells me about the correlation between the two variables.

```
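from scipy import stats

# Regress TmaxC (dependent) on CO2 (independent); r_value is Pearson's r, so square it for R squared
slope, intercept, r_value, p_value, std_err = stats.linregress(df_cor['co2'], df_cor['TmaxC'])
print(f'slope: {slope:2.4f} \nintercept: {intercept:2.4f}\
\nRsquared Value: {r_value**2:2.4f} \nPvalue: {p_value:2.4f}\nStd Error = {std_err:2.4f}')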
```

That is as simple as using ‘*linregress*’ from the ‘*scipy.stats*’ library, which results in:

slope: 0.0226
intercept: 6.3244
Rsquared Value: 0.4695
Pvalue: 0.0000
Std Error = 0.0032

‘*R-squared (R²) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model. Whereas correlation explains the strength of the relationship between an independent and dependent variable, R-squared explains to what extent the variance of one variable explains the variance of the second variable. So, if the R² of a model is 0.50, then approximately half of the observed variation can be explained by the model’s inputs.*’

From **Investopedia**

The R squared value is 0.4695. That means that CO2 levels explain 46.95% of the variance in TmaxC.
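If you want to see where a number like that comes from, here is a small sketch in plain Python that works R squared out by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The seven data points are made up for illustration and are not the real measurements. For a straight-line fit with one independent variable, this value should match the square of Pearson's r, which is why squaring `r_value` works.

```python
from math import sqrt

# Made-up illustrative values (assumption), not the Oxford or Mauna Loa data.
x = [315.0, 325.0, 340.0, 355.0, 370.0, 390.0, 410.0]  # "CO2"
y = [13.6, 14.1, 13.5, 14.6, 14.2, 15.3, 14.9]         # "TmaxC"

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)  # total sum of squares
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Ordinary least-squares line through the points.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual sum of squares: the variation the line fails to explain.
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - ss_res / syy
r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient

print(f'R squared from residuals: {r_squared:.4f}')
print(f'Pearson r squared:        {r ** 2:.4f}')
```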

**So if you want to know what temperature value a higher CO2 level will give you compared to a lower one? Flip a coin.**

Please remember that this is just a little tutorial on statistical correlation. I have no idea what effect the expansion rate of David Attenborough’s waist size over the same period has had on the increase in CO2 levels. But this is a good link to make you *think.*
