Time series analysis - explained
posted on 16 Jun 2020 under category tutorial
A times series is a set of data recorded at regular times. For example, you might record the outdoor temperature at noon every day for a year.
The movement of the data over time may be due to many independent factors.
Long term trend: the overall movement or general direction of the data, ignoring any short term effects such as cyclical or seasonal variations. For example, the enrollment trend at a particular university may be a steady climb on average over the past 100 years. This trend may be present despite having a few years of loss or stagnant enrollment followed by years of rapid growth. Cyclical Movements: Relatively long term patterns of oscillation in the data. These cycles may take many years to play out. There is a various cycles in business economics, some taking 6 years, others taking half a century or more. Seasonal Variation: Predictable patterns of ups and downs that occur within a single year and repeat year after year. Temperatures typically show seasonal variation, dropping in the Fall and Winter and rising again in the Spring and Summer. Noise: Every set of data has noise. These are random fluctuations or variations due to uncontrolled factors.
Each factor has an associated data series:
Yt = Tt × Ct × St × Nt
Often only one of the oscillating factors, Ct or St, is needed.
The idea behind forecasting is to predict future values of data based on what happened before. It’s not a perfect science, because there are typically many factors outside of our control which could affect the future values substantially. The further into the future you want to forecast, the less certain you can be of your prediction.
Just look at weather reporting! Figuring out if it will rain tomorrow is not too difficult, but it’s virtually impossible to predict if it will rain exactly a month from now.
Basically, the theory behind a forecast is as follows.
There are a few key terms that are commonly encountered when working with Time Series, these are inclusive of but not exclusively described in the following list:
Trend: A time series is supposed to have an upward or downward trend if the plot shows a constant slope when plotted. It is noteworthy that for a pattern to be a trend, it needs to be consistent through the series. If we start getting upwards and downwards trend in the same plot, then it could be an indication of cyclical behavior.
Sesonality: If a TS has a regular repeating pattern based on the calendar or clock, then we call it to have seasonality. Soemtimes there can be different amount of seasonal variation across years, and in such cases it is ideal to take the log of values in order to get a better idea about seasonality. It is noteworthy that Annual Data would rarely have a seasonal component.
Cyclical TS: Cyclical patterns are irregular based on calendar or clock. In these cases there are trends in the Time Series that are of irregular lengths, like business data, there can be revenue gain or loss due to a bad product but how long the upward or downward trend would last, is unknown beforehand. Often, when a plot shows cyclical behavior, it would be a combination of big and small cycles. Also, these are very difficult to model because there is no specific pattern and we do not know what would happen next.
Autocorrelation/Serial Correlation: The dependence on the past in case of a univariate time series is called autocorrelation.
The two major types of time series are as follows:
Most of the theory that has been developed was developed for stationary time series.
a. **Univariate Time Series**: We usually use regression technique to model non-stationary time series and we also have to consider the time series aspects (like autocorrelation i.e. dependence on the past and differencing, i.e. the technique used in TS to try and get rid of autocorrelation). **If we have a Time Series with trend, seasonality and non-constant variance, then it is best to log the data which would then allow us to stabilize the variance, hence the name 'Variance Stablizing Transformation' This allows us to fit our data in additive models, making it a very useful technique.**
b. **Multivariate Time Series**: Many variables measured through time, typically measured on the same time points. In some other cases, they may be measured on different time points but within an year. Such data would need to be analyzed using multiple regression models. We might also want to fit the time variable in the model (after transformations maybe) but this could allow us to determince whether there was any **trend** in the response and capture it. If you do not find any trend in the data, then the time variable can be chucked out but **modelling the data without checking for the time variable would be a horrible idea**.
More observations is often a good thing because it allows greater degrees of freedom. (Clarification needed, will read a few paper to better this concept)
Panel Data/Logitudinal Data: Panel data is a combination of cross-sectional data and time series. There can be two kinds of time logging for panel data, implicit and explicit. In econometrics and finance studies, the time is implicitly marked which in areas like medical studies explicit marking of time is done on Panel data. The idea is to take a cross-sectional measurements at some point in time and then repeat the measurements again and again to form a time series.
Continuous Time Models: In this kind of data, we have a Point(Time component) process and a Stochastic (Random) process. For instance, the number of people arriving at a particular restaurant is an example of this where the groups of people ae arriving at certain points in time(point process) and the size of the group is random (stochastic process).
NOTE: No matter what a time series data is about, on the axis (while modeling) we always use a sequence of numbers starting from 1, up to the total number of observations. This allows us to focus on the sequence of measurements because we already know the frequency the time interval between any two observations.
A few key features to remember when modelling time series are as follows:
The general concept of dependence on the past is called autocorrelation in statistics. If we are going to use regression models for mdoeling non-stationary data, we need to ensure that they satisfy the underlying assumptions necessary for regression models, i.e. normally distributed, zero mean, constant variance in the residual plot.
The population variance denoted by is calculated using the sample variance. For any given sample Y, it’s variance is calculated as follows:
Note: Variance is a special case of covariance, it just means how the sample of observations varies with itself.
It tells us how two varaibles vary with each other. Simply put, are the large values of X related to large values in Y or are the large values in X related to small values in Y? It is calculated as follows:
Standardized covariance is called correlation, therefore it always lies between -1 and 1. It is represented as
where represented the standard deviation for x and represents the standard deviation for y.
Autocorrelation takes place for univariate time series.
Covariance is with respect to itself in this case, for the past values that have occurred in time and is represented as follows.
In a univariate time series, the covariance between two observations k time periods apart (lag k), is given by
Standardized autocovariance is equal to sample autocorrelation (k time periods apart) which is denoted by and is given as follows:
There are two types of autocorrelation that we look for:
The following methods are used for detecting autocorrelation:
This function does the tests for all possible lags and plots them at the same time as well. It is very time efficient for this reason.
There are certain key points, that need to be considered when we do, Time Series Regression. They are as follows:
Often the smoothing or filtering is done before the data is delivered and this can be bad thing because we don’t know how seasonality was removed and also because we can only predict deseasonal values.
We can also do an STL analysis which would give us the seasonality, trend and remainder plots for any time series.
NOTE: If you find that the range of the seasonality plot is more or less the same as the range of the remained plot, then there isn’t much seasonality. The STL is designed to look for seasonality so it will always give you a seasonality plot and it’ll be for the person reading to determine whether there is actual seasonlity or not.
Both can be used to remove seasonality