October 26th, 2021
An Introduction to Time Series Modeling: Time Series Preprocessing and Feature Engineering
Category: Time Series
By: Adam Murphy
Time is the only nonrenewable resource – Sri Ambati, Founder and CEO, H2O.ai.
Prediction is very difficult, especially if it’s about the future – Niels Bohr, Nobel Prize-Winning Physicist.
Despite its inherent difficulty, every business needs to make predictions. You may want to forecast sales or estimate demand or gauge future inventory levels. Perhaps you want to predict temperature changes or the price of a stock. Whatever it is, you will need data. It will have time on the x-axis and the value you are measuring (demand, temperature, etc.) on the y-axis. It can also have other features, which we will discuss later. We call this time-series data, and there are special tools and techniques to work with it.
We will work with the following dataset in this article. It is taken from the excellent (free) textbook Forecasting: Principles and Practice 2nd Edition and shows the number of electrical equipment orders in Europe from 1996-2012. The data has been normalized (a value of 100 equals 2005 orders), and it has also been adjusted by working days. These adjustments are two examples of preprocessing steps common in time series analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

df = pd.read_csv('electrical_equipment.csv', index_col=0, parse_dates=['date'])
plt.plot(df['date'], df['orders'])
plt.show()
The first five rows:
In this series, we will introduce you to time series modeling – the act of building predictive models on time series data. This first article explains common preprocessing and feature engineering techniques. Subsequent articles discuss models and model diagnostics.
Time Series Modeling Pipeline
The process for creating time series models is quite similar to the standard supervised machine learning pipeline.
We like to think of it in six steps:
- Extract, Transform, Load (ETL) – collect data and store it in a usable format,
- Exploratory Data Analysis (EDA) – explore data and deepen your understanding,
- Preprocessing – clean data and shape into a format time series models expect,
- Feature Engineering – create information-dense features to improve model performance,
- Model Making & Tuning – build and tune a range of models,
- Model Diagnostics – assess the quality of your model(s). This final step is vital and includes many statistical tests unique to time series analysis. If a model fails these tests, you need to go back to previous stages and improve them.
In this article, we focus on Preprocessing and Feature Engineering.
Time Series Preprocessing
Once you have collected (ETL) and explored (EDA) your data, it’s time to clean and shape it into a format time series models expect. Most of the time, you cannot just feed in the date and value columns as they are. Or rather, if you did, you would get poor results in comparison to adequately preprocessed data!
Many traditional models are adaptations of linear models; therefore, it’s recommended that you perform missing value imputation and outlier detection and removal. This is standard practice for supervised machine learning as well.
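As a rough illustration of those two steps, here is a minimal pandas sketch on a made-up monthly series. The data, the 1.5 × IQR outlier rule, and the linear interpolation are all assumptions chosen for the example, not a prescribed method; the right choices depend on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with two gaps and one obvious spike
dates = pd.date_range("2020-01-01", periods=12, freq="MS")
values = [100, 102, np.nan, 106, 108, 500, 112, np.nan, 116, 118, 120, 122]
s = pd.Series(values, index=dates, name="orders")

# 1) Flag outliers with the 1.5 * IQR rule and treat them as missing
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
s_clean = s.mask(is_outlier)

# 2) Fill every gap by interpolating linearly between its neighbours
s_clean = s_clean.interpolate(method="linear")

print(s_clean)
```

Turning outliers into missing values first means a single imputation step handles both problems.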
You can use libraries like scikit-learn and pandas to do this or, if you use H2O Driverless AI, it will perform these steps for you without you having to lift a finger or program any specific steps.
Using the Driverless AI Python client, first set up your client:
import driverlessai

address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'

# make sure to use the same user name and password when signing in through the GUI
dai = driverlessai.Client(address=address, username=username, password=password)
Then create your experiment:
experiment = dai.experiments.create(name='walmart_time_series', test_dataset=test_data, ...)
Now your experiment is running, and H2O Driverless AI is performing the preprocessing steps (among many other things, some of which we will discuss later).
To a novice, time-series data may look random. But datasets contain similar patterns, which we can isolate to analyze individually.
Here are three of the most common patterns:
- Trend – a long-term increase or decrease in the data (linear or non-linear)
- Seasonal – a pattern that repeats at a fixed and known frequency, e.g., every Monday the value increases, or every hour the value decreases
- Cyclical – a rise and fall pattern that is not of a fixed or known frequency, e.g., business boom-and-bust cycles.
Cycles are different from seasonal patterns because we do not know when they will happen or for how long they will last ahead of time. But we know the frequency and length of seasonal patterns.
The process of isolating these components is called trend-seasonal decomposition, or decomposition for short. Usually, we combine the trend and cycle parts and call it the trend-cycle or trend.
We break each time series down into three parts: trend, seasonal, and a remainder term (anything that is not part of the first two). You can combine these elements in an additive (trend + seasonal + remainder) or multiplicative (trend × seasonal × remainder) way. The former suits series with a roughly linear trend and seasonal swings of about constant size; the latter suits non-linear growth (e.g., quadratic or exponential), where the swings scale with the level of the series.
There are several methods you can use to do this but, for brevity, we will demonstrate basic additive decomposition.
from statsmodels.tsa.seasonal import seasonal_decompose

# Index of dataframe must be DateTime
df_datetime_idx = df.set_index('date')

# Create additive seasonal decomposition
result_add = seasonal_decompose(df_datetime_idx, model='additive')

# Plot
fig = result_add.plot()
plt.show()
Here we show electrical equipment orders over several years (top), followed by the trend, seasonal and remainder components (in descending order). The trend shows the general movement of the original data and has a similar shape. The seasonal component exposes the lower-level variation, and the remainder contains other fluctuations the first two parts don’t show.
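To make the three components less of a black box, here is a minimal sketch of classical additive decomposition written with NumPy and pandas. It runs on a synthetic monthly series, and the function name `additive_decompose` is our own; treat it as a simplified illustration of the idea under those assumptions, not as a copy of what `seasonal_decompose` does internally.

```python
import numpy as np
import pandas as pd


def additive_decompose(series: pd.Series, period: int = 12) -> pd.DataFrame:
    """Split a series into trend + seasonal + remainder (even period only)."""
    values = series.to_numpy(dtype=float)
    n = len(values)
    half = period // 2

    # Trend: centered moving average spanning one full period.
    # The two end points get half weight so the window is symmetric.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = np.full(n, np.nan)
    trend[half:n - half] = np.convolve(values, weights, mode="valid")

    # Seasonal: average detrended value at each position in the cycle,
    # shifted so the seasonal effects sum to zero over one period.
    detrended = values - trend
    position = np.arange(n) % period
    effects = np.array([np.nanmean(detrended[position == p]) for p in range(period)])
    effects -= effects.mean()
    seasonal = effects[position]

    # Remainder: whatever the first two components do not explain.
    remainder = values - trend - seasonal
    return pd.DataFrame(
        {"observed": values, "trend": trend, "seasonal": seasonal, "remainder": remainder},
        index=series.index,
    )


# Synthetic data: linear trend plus a fixed monthly pattern that sums to zero
months = np.arange(72)
pattern = np.array([3, 1, -2, -4, -1, 0, 2, 5, 1, -3, -4, 2], dtype=float)
index = pd.date_range("2000-01-01", periods=72, freq="MS")
series = pd.Series(50 + 0.5 * months + pattern[months % 12], index=index)

parts = additive_decompose(series, period=12)
print(parts.iloc[6:12].round(2))
```

The trend is undefined for the first and last half-period because the moving-average window does not fit there, which is also why decomposition plots usually show a shorter trend line than the original series.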
Create Stationary Data
Traditional time series models often assume that the data fed into them is stationary. A dataset is stationary if the underlying process that created it does not change over time. In other words, the data has a constant mean and variance. This property implies it has neither trend nor seasonal components – by definition, these impact values based on the time, e.g., annual seasonality implies higher sales at the end of every year.
One time series created by a stationary (upper) process and another by a non-stationary (lower) process – source
Most datasets you work with will not be stationary. Thus, you need to transform them so that they are. This transformation is essential for the autoregressive/ARIMA models.
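One informal way to see what "constant mean and variance" means is to compare summary statistics across different stretches of a series. The sketch below does this for synthetic noise with and without a trend; the helper name `half_stats` and the data are assumptions for illustration, and this is an eyeball check, not a formal statistical test.

```python
import numpy as np
import pandas as pd


def half_stats(series: pd.Series) -> pd.DataFrame:
    """Mean and variance of the first and second half of a series."""
    middle = len(series) // 2
    halves = {"first_half": series.iloc[:middle], "second_half": series.iloc[middle:]}
    return pd.DataFrame(
        {name: {"mean": part.mean(), "variance": part.var()} for name, part in halves.items()}
    )


rng = np.random.default_rng(seed=0)
t = np.arange(200)

noise = pd.Series(rng.normal(loc=0.0, scale=1.0, size=200))  # no trend
trending = noise + 0.1 * t                                   # same noise plus a trend

print(half_stats(noise).round(2))     # the two halves have similar means
print(half_stats(trending).round(2))  # the mean shifts between the halves
```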
Here are some methods to stationarize your data.
Differencing is the easiest method: you calculate the difference between consecutive elements. It stabilizes the mean and reduces the impact of trends and seasonal behavior, leaving the model free to focus on predicting one point after another.
It is easy to do this using the diff() method in pandas.
df['orders_diff'] = df['orders'].diff()
plt.plot(df['orders_diff'])
plt.show()
Applying differencing completely changes the shape of the plot; the up and downward trends have been removed. However, we can still see seasonality because the positive and negative peaks occur with regular frequency. Thus, this data is still not stationary. Let’s fix this with the following method.
Seasonal differencing is the same as differencing, but you calculate the difference between elements in the same season. For example, if there were weekly seasonality, you would calculate the difference between every Monday, every Tuesday, and so on. This is more effective than ordinary differencing at removing seasonal trends, but it does not work all the time. If you can still see seasonal trends, your data is not stationary, and you may need to apply further differencing.
To distinguish seasonal from ordinary differencing, we sometimes call the latter first-order differencing, i.e., differences at lag 1.
For seasonal differencing, use the same diff() method and set the periods parameter to the length of the seasonal pattern you want to difference. Annual differencing is often the most effective, but let’s look at a few options.
df['3month_diff'] = df['orders'].diff(periods=3)
df['9month_diff'] = df['orders'].diff(periods=9)
df['12month_diff'] = df['orders'].diff(periods=12)
The 3-month and 9-month plots still exhibit seasonal behavior (the peaks and troughs occur at regular intervals), whereas the 12-month chart looks devoid of seasonality. However, it doesn’t look like a random process generated it. To fix this, let’s apply first-order differencing to the 12-month plot.
df['12month_1month_diff'] = df['12month_diff'].diff()
Much better! The peaks/troughs do not follow a consistent pattern, the plot is centered around 0, and the variance looks relatively constant.
Note: you can apply seasonal and first-order differencing in either order and will get the same results. However, if you use seasonal differencing first, you may get a stationary time series straight away and thus have one less preprocessing step to do.
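A quick way to convince yourself that the order does not matter is to try both on the same series. The sketch below uses a synthetic monthly series (an assumption for illustration) and compares the two results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
t = np.arange(60)
# Hypothetical monthly series: trend + yearly pattern + noise
orders = pd.Series(0.3 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(size=60))

seasonal_then_first = orders.diff(periods=12).diff()
first_then_seasonal = orders.diff().diff(periods=12)

# Both orders lose the first 13 values (12 + 1) and agree everywhere else
print(seasonal_then_first.isna().sum())
print(np.allclose(seasonal_then_first.dropna(), first_then_seasonal.dropna()))
```

Both results equal y[t] - y[t-1] - y[t-12] + y[t-13], which is why the order of the two steps is interchangeable.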
In tabular data modeling, you may apply log or square root transformations to features to create a more normal distribution. Since stationary data has constant mean and variance, we can loosely think of each point as being drawn from the same distribution (often approximated as normal). Thus, we can apply the same transformations.
If a dataset is quadratically increasing, applying the square root will transform it into a linear trend. If it grows exponentially, taking the (natural) log will make it linear. You can also test other methods, such as power transformations.
df['sqrt'] = np.sqrt(df['orders'])
df['log'] = np.log(df['orders'])
Since our data is neither quadratically nor exponentially increasing, the transformations do not change the shape, just the scale of the data (look at the y-axis).
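To see the effect on data that does grow quadratically or exponentially, here is a small sketch on two synthetic series (the coefficients are arbitrary assumptions). A straight line has constant first differences, so that is what we check after each transformation.

```python
import numpy as np
import pandas as pd

t = np.arange(1, 21)
quadratic = pd.Series(3.0 * t**2)                # grows like t squared
exponential = pd.Series(2.0 * np.exp(0.25 * t))  # grows like e^(0.25 t)

# After the right transformation, every step is the same size
sqrt_steps = np.sqrt(quadratic).diff().dropna()
log_steps = np.log(exponential).diff().dropna()

print(np.allclose(sqrt_steps, np.sqrt(3.0)))  # sqrt(3 t^2) = sqrt(3) * t
print(np.allclose(log_steps, 0.25))           # log(2 e^(0.25 t)) = log(2) + 0.25 t
```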
Often you will combine transformations and differencing. The transformations help to stabilize the variance, and differencing helps to stabilize the mean. The combination results in more stationary data than one or the other alone.
# First order differencing
df['diff'] = df['orders'].diff()

# First apply log, then differencing
df['diff_log'] = np.log(df['orders']).diff()
On the left, we have first-order differencing. The other two plots show a log transformation followed by first-order differencing with different y-axis scales. The variance for applying first-order differencing alone is 165, but the variance of log + differencing is 0.02 – roughly an 8,000x difference!
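The variance-stabilizing role of the log is easiest to see on a series whose swings grow with its level. The sketch below builds such a series synthetically (exponential growth with proportional seasonal swings, an assumption for illustration) and compares how the spread of the differenced values changes between the first and second half. The helper name `spread_ratio` is ours.

```python
import numpy as np
import pandas as pd

t = np.arange(121)
# Hypothetical series: exponential growth with seasonal swings that scale with the level
y = pd.Series(np.exp(0.02 * t) * (1 + 0.1 * np.sin(2 * np.pi * t / 12)))

diff_only = y.diff().dropna()
log_then_diff = np.log(y).diff().dropna()


def spread_ratio(series: pd.Series) -> float:
    """Standard deviation of the second half divided by that of the first half."""
    middle = len(series) // 2
    return series.iloc[middle:].std() / series.iloc[:middle].std()


print(round(spread_ratio(diff_only), 2))      # well above 1: the swings keep growing
print(round(spread_ratio(log_then_diff), 2))  # about 1: the swings stay the same size
```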
We’ve now discussed the essential preprocessing steps needed for time series modeling. Let’s look at feature engineering.
Time Series Feature Engineering
Our electrical orders dataset has two columns: date and orders. If you are used to building tabular models with tens, hundreds (or even thousands!) of columns, you may be a bit confused about how to make predictions with just two variables.
However, a host of features are within those columns waiting to be uncovered and pumped into your models: here is an overview of the most common ones.
Before we start, the two most used traditional time series models are 1) autoregressive/ARIMA and 2) smoothing. We will briefly explain how they work in the sections below and point out which features lend themselves to which models.
A fundamental assumption of autoregressive models is that we can use past values to predict future ones. So, we need to create features to represent these past values. We call these lag variables, and they lag behind the actual time series by 1, 2, 3, or (many) more time steps.
Use the shift() method in pandas to create lagged variables.
# Create 3 lagged variables
df['t-1'] = df['orders'].shift(1)
df['t-2'] = df['orders'].shift(2)
df['t-3'] = df['orders'].shift(3)
We created three lag variables, t-1, t-2, and t-3, which we can use to predict orders. As is always the case with lag variables, you need to drop the first few rows that contain NaNs to use them. From row 3 onwards, we are good to go.
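Once the lags exist, the series looks like an ordinary supervised learning table: the lag columns are the inputs and the current value is the target. Here is a minimal sketch of that reshaping on a short made-up series (the numbers and the `n_lags` variable are assumptions for illustration).

```python
import pandas as pd

# Hypothetical series standing in for the orders column
data = pd.DataFrame({"orders": [10, 12, 15, 14, 18, 21, 19, 24]})

n_lags = 3
for lag in range(1, n_lags + 1):
    data[f"t-{lag}"] = data["orders"].shift(lag)

# Rows whose lags reach back before the start of the series contain NaN
supervised = data.dropna().reset_index(drop=True)

X = supervised[[f"t-{lag}" for lag in range(n_lags, 0, -1)]]  # oldest lag first
y = supervised["orders"]

print(X.shape, y.shape)
print(supervised.head(2))
```

Each additional lag costs one more row at the start of the series, so very long lags are expensive on short datasets.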
Smoothing models assume they can predict future behavior from the aggregated statistics (typically the average) of past values. Thus, aggregated features such as the average, standard deviation, skewness, min, and max can be valuable additions.
Use the aggregate method in pandas to create aggregated features.
lagged_feature_cols = ['t-3', 't-2', 't-1']

# Drop first 3 rows due to NaNs (copy so we can safely add columns)
df_lagged = df.loc[3:, lagged_feature_cols + ['orders']].copy()

# Create feature df to use for aggregation calculations
df_lagged_features = df_lagged.loc[:, lagged_feature_cols]

# Create aggregated features (row-wise max and min across the lag columns)
df_lagged['max'] = df_lagged_features.aggregate('max', axis=1)
df_lagged['min'] = df_lagged_features.aggregate('min', axis=1)
We created the min and max columns, which are the minimum and maximum values, respectively, from the three lagged columns (t-3, t-2, and t-1).
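An alternative that avoids building the lag columns first is a rolling window. The sketch below, on a short made-up series, shifts the data by one step before rolling so that each window only contains past values; the column names and window length are assumptions for illustration.

```python
import pandas as pd

# Hypothetical series standing in for the orders column
data = pd.DataFrame({"orders": [10, 12, 15, 14, 18, 21, 19, 24]})

# Shift by one step first so each window ends at t-1 and never sees the current value
past = data["orders"].shift(1)
window = past.rolling(window=3)

data["mean_3"] = window.mean()
data["std_3"] = window.std()
data["min_3"] = window.min()
data["max_3"] = window.max()

print(data.dropna())
```

With a window of 3, this yields the same min and max as aggregating over the t-1, t-2, and t-3 columns, and changing the window length is a one-argument edit.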
Trend & Seasonality
Thanks to trend-seasonal decomposition, you can pass the trend and seasonal components of the time series as individual features. Alternatively, you can use just one of them to force the model to focus on this aspect and ignore the other. Macroeconomic forecasts often do this. We care more about the unemployment trend and not so much about the seasonal increase that happens every year after college graduation, for example.
Specific times of the day, week, month, or year can significantly impact a time series. For example, the number of cars on the road is higher between 8am-9am and 5pm-6pm than other times due to rush hour. Thus, it is helpful to extract datetime-specific features such as hour_of_day (1, 2, 3, etc.), day_of_week, week_of_year, month_of_year, and so on. You can also create boolean features such as is_weekday or is_holiday.
If the date column/index is the correct dtype (datetime64), you get access to loads of easy ways to create new date-specific features in pandas.
# Set index to the date column of original df
# drop first 3 rows due to NaNs from lagged vars (see above)
df_lagged.index = df.date[3:]

# Create month and quarter columns
df_lagged['month'] = df_lagged.index.month
df_lagged['quarter'] = df_lagged.index.quarter
Here we created the month and quarter columns, each in a single line, using the .month and .quarter attributes of the datetime index.
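The same approach extends to finer-grained and boolean features such as hour_of_day and is_weekday. Here is a minimal sketch on a handful of made-up timestamps; the is_rush_hour column and its hour ranges are our own illustration based on the rush-hour example, not part of the electrical orders dataset.

```python
import pandas as pd

# Hypothetical timestamps
timestamps = pd.to_datetime([
    "2021-10-26 08:30",  # Tuesday, morning rush hour
    "2021-10-26 14:00",  # Tuesday, mid-afternoon
    "2021-10-29 17:15",  # Friday, evening rush hour
    "2021-10-30 08:30",  # Saturday
])

features = pd.DataFrame(index=timestamps)
features["hour_of_day"] = features.index.hour
features["day_of_week"] = features.index.dayofweek  # Monday=0 ... Sunday=6
features["month_of_year"] = features.index.month
features["is_weekday"] = features.index.dayofweek < 5

# Weekday hours starting at 8am or 5pm count as rush hour in this example
features["is_rush_hour"] = features["is_weekday"] & features["hour_of_day"].isin([8, 17])

print(features)
```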
The best features are ones specific to your problem domain. These require domain-specific knowledge or thorough googling on your behalf to create. But the effort is well worth it. Often finding one such feature can be the difference between a good model and a great one. Unfortunately, you can’t find out everything just by reading blog posts like this one 😉
We could include many other features, but we think this is enough to whet your appetite. If you want to learn more, a tremendous open-source package for automated time series feature extraction is tsfresh. You can see a complete list of all the features they extract here.
There you have it, a quick introduction to preprocessing and feature engineering practices tailored to time series data. You could treat time series as just another supervised machine learning problem, but that isn’t going to give you excellent results. Instead, apply the techniques you’ve learned in this article to create robust models and high-quality forecasts that will empower your business’s growth in the months and years to come.
If you want to give everyone in your company the power to create incredible models with minimal effort, you can start a 14-day free trial of H2O AI Hybrid Cloud.