Using Python and machine learning to analyze my running. Part 1
I love running, and I try to take it as seriously as my academic duties allow. Since I started using Python regularly in my research, I have been trying to use it to analyze my running, with the goal of improving it.
This post is the first of a two-part series. Here I describe how I have been using Python to gain some insight into my running. The approach is twofold: first, a description of the methods I am using, and second, an interpretation of the results.
I’ve been using Runkeeper to track my runs and get the usual statistics, such as duration, distance and pace. Runkeeper can export the data in CSV format, which can be imported directly into a pandas dataframe. In another post I will describe how to process the Runkeeper files to obtain the dataframe I am using here.
First, we import the necessary modules and read the dataframe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import date

running_df = pd.read_pickle('running.pickle')
The dataframe is saved and read as a pickle file in order to preserve the data type of each column (in particular, I want to preserve the type datetime.date). The dataframe looks like this:
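This dataframe contains my running sessions since 2016, the year I took up running more seriously. Some columns keep their Spanish names in the code below (for example, Fecha for the date, Tiempo for the duration and Distancia for the distance). The important columns are: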
- Date: The date of the run.
- Time: The duration of the run.
- Pace: The pace, given in minutes per kilometer.
- Type: The type of run. Some of these types are: EZ (runs at conversational pace, the bulk of the training), Carrera (race), Tempo, Fartlek (fast running intermixed with slower running), Subidas (uphill running), Larga (long runs, typically more than 15 km), Velocidad (intervals of fast running), Recuperación (short and slow runs), PFT (physical fitness test).
- Observations: Free-form text describing the training session.
- Heart rate: The heart rate, measured by an activity tracker.
- MHR percentage: The heart rate as a percentage of the maximum heart rate (MHR), which is estimated as 220 minus age. The MHR percentage indicates the heart rate training zone.
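To make the last column concrete, here is a minimal sketch of how an MHR percentage can be computed from the 220 minus age estimate. The function names and the example runner (30 years old, 140 bpm) are made up for illustration; they are not taken from my dataframe.

```python
def max_heart_rate(age):
    """Estimate the maximum heart rate with the 220 - age rule of thumb."""
    return 220 - age


def mhr_percentage(heart_rate, age):
    """Return the heart rate as a percentage of the estimated maximum."""
    return 100.0 * heart_rate / max_heart_rate(age)


# Hypothetical runner: 30 years old, averaging 140 bpm during a run.
example_pct = mhr_percentage(140, 30)
print(round(example_pct, 1))  # 73.7
```

The 220 minus age rule is only a rough estimate, so the resulting percentages are better read as relative indicators than as exact training zones.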
Using pandas and matplotlib methods we can gain some useful insight regarding the progress and quality of the training.
Using a few Python/pandas one-liners, I can get a basic statistical description of my runs since 2016.
today = date.today()
start = running_df.loc[0,'Fecha']
difference = today - start
days_passed = difference.days

print("I have run",running_df.shape[0],"times")
print("Total kilometers:",round(np.sum(running_df['Distancia'].values),2))
print("Mean distance:",round(np.mean(running_df['Distancia'].values),2),"km")
print("Mean duration of each run:",round(np.mean(running_df['Tiempo'].values),1),"minutes")
print("Longest run:",np.max(running_df['Distancia'].values),"km")
print("On average, I run every",days_passed//running_df.shape[0],"days.")
First, we can look at the predominance of the EZ runs, which, according to several sources, should make up around 70% of the training (in kilometers). Below we check whether this holds, first by number of runs and then by distance.
# Share of each run type, by number of runs
tipos = list(set(running_df['Type'].values))
counts = {x: running_df[running_df['Type']==x].shape[0] for x in tipos}
sizes = np.array(list(counts.values()))
percents = 100.*sizes/sizes.sum()

plt.figure(dpi=130)
patches, texts = plt.pie(sizes, shadow=False, startangle=90)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(tipos, percents)]
plt.legend(patches, labels, loc='center left', bbox_to_anchor=(-0.1, 1.), fontsize=8)
plt.show()
# Share of each run type, by total distance
tipos = list(set(running_df['Type'].values))
counts = {x: np.sum(running_df[running_df['Type']==x]['Distance'].values) for x in tipos}
sizes = np.array(list(counts.values()))
percents = 100.*sizes/sizes.sum()

plt.figure(dpi=130)
patches, texts = plt.pie(sizes, shadow=False, startangle=90)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(tipos, percents)]
plt.legend(patches, labels, loc='center left', bbox_to_anchor=(-0.1, 1.), fontsize=8)
plt.show()
The first conclusion we can draw here is that the training should include more easy runs. The same analysis could be repeated using only the last training cycle or the last month.
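As a rough sketch of that idea, the dataframe can be filtered to a date window before computing the shares. The toy dataframe below uses invented values, the column names follow the pie-chart code above, and the helper distance_share_by_type is a hypothetical name. "Last month" is taken here as the last 30 days, which is an assumption.

```python
from datetime import date, timedelta

import pandas as pd

# Toy stand-in for running_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Fecha": [date(2021, 3, 1), date(2021, 4, 20), date(2021, 5, 2), date(2021, 5, 9)],
    "Type": ["Tempo", "EZ", "EZ", "Larga"],
    "Distance": [8.0, 6.0, 4.0, 10.0],
})


def distance_share_by_type(df, start, end):
    """Percentage of total distance per run type for start <= Fecha <= end."""
    in_window = df["Fecha"].apply(lambda d: start <= d <= end)
    per_type = df[in_window].groupby("Type")["Distance"].sum()
    return 100.0 * per_type / per_type.sum()


end = date(2021, 5, 10)           # stand-in for date.today()
start = end - timedelta(days=30)  # "last month" taken as the last 30 days
recent_share = distance_share_by_type(toy_df, start, end)
print(recent_share.round(1))
```

For a training cycle, start and end would simply be the first and last dates of that cycle.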
Now, using matplotlib’s plot_date function, it is easy to visualize each training cycle leading up to a marathon (I have run one marathon per year for the past few years, except last year, due to the pandemic).
The graph above clearly shows the break due to the pandemic and how I have been trying to rebuild my endurance since then.
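The code for this timeline is not listed in this post. A minimal sketch of one way to draw such a graph, assuming it shows the distance of each run against its date, could look like the following. The toy values are invented, and the sketch uses ax.plot, which accepts date values directly, instead of plot_date.

```python
from datetime import date

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for running_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Fecha": [date(2019, 9, 1), date(2019, 10, 6), date(2020, 8, 2), date(2020, 9, 6)],
    "Distance": [12.0, 21.0, 5.0, 8.0],
})

fig, ax = plt.subplots(dpi=130)
# Matplotlib converts datetime.date values on the x axis automatically.
(line,) = ax.plot(list(toy_df["Fecha"]), list(toy_df["Distance"]), "bo")
ax.set_xlabel("Date")
ax.set_ylabel("Distance (km)")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
# In an interactive session, drop the Agg line above and call plt.show() here.
```

With real data, gaps such as the pandemic break show up as stretches of the x axis without points.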
Finally, the following graphs give a little more insight into the ongoing progress. In the first graph, we plot the paces of all the easy runs. Ideally, we should see a decreasing trend. This is not the case because, as I started to get better, I was running the easy runs faster than I should have.
import matplotlib.dates as dates

# Keep only the easy runs that have a recorded pace
ez_df = running_df[running_df['Type']=='EZ'].copy()
ez_df = ez_df[ez_df['Pace']!=0].copy()
x = [dates.datestr2num(str(d)) for d in ez_df['Fecha'].values]
y = ez_df['Pace'].values

plt.figure(dpi=150)
plt.plot_date(x, y, fmt="bo", tz=None, xdate=True)
plt.ylabel("Pace (min/km)")
plt.show()
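Judging a trend by eye in the pace plot above can be misleading. One simple option, which is not part of my original analysis, is to fit a straight line to pace versus time and look at the sign of the slope. The sketch below uses np.polyfit on invented dates and paces.

```python
from datetime import date

import numpy as np

# Invented easy-run dates and paces (min/km), for illustration only.
run_dates = [date(2021, 1, 1), date(2021, 1, 11), date(2021, 1, 21), date(2021, 1, 31)]
paces = np.array([6.0, 5.9, 5.8, 5.7])

# Express each date as the number of days elapsed since the first run.
days = np.array([(d - run_dates[0]).days for d in run_dates], dtype=float)

# Least-squares straight line: pace ~ slope * days + intercept.
slope, intercept = np.polyfit(days, paces, 1)
print("Pace change per 30 days:", round(slope * 30, 2), "min/km")
```

A negative slope means the pace in min/km is going down, that is, the runs are getting faster. A straight line is a crude model, so this is only a first check.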
In the next graph, we plot the MHR percentage of the easy runs over time. It indicates that I have been running them at more comfortable paces, which benefits the overall training.
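Per-run heart rate values tend to be noisy, so a rolling mean can make the tendency easier to read. This is a sketch of that smoothing step, not the code behind my graph. The values are invented, and MHR_pct is a hypothetical column name.

```python
import pandas as pd

# Invented MHR percentages for consecutive easy runs;
# "MHR_pct" is a hypothetical column name.
ez_toy = pd.DataFrame({"MHR_pct": [80.0, 78.0, 76.0, 74.0, 72.0]})

# Mean over a sliding window of 3 runs; the first two rows have no full window yet.
ez_toy["MHR_pct_smooth"] = ez_toy["MHR_pct"].rolling(window=3).mean()
print(ez_toy)
```

The window size trades smoothness for responsiveness: a larger window hides more noise but reacts more slowly to real changes.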
The same analysis can be applied to the other types of runs, or to more specific time windows, to get a more complete picture of the training.
In the next part of this series, I will use machine learning methods to analyze these time series.