Time series data is a sequence of measurements made over time, at regular or irregular intervals, with each observation being a single dimension. Both low- and high-dimensional time series come with challenges that are often not present in cross-sectional data. When working with such data, it helps to have an effective tool for the processing, munging, and evaluation that analysis requires. Be it for time series regression, clustering, forecasting, or dimensionality reduction, I have found that R is usually that tool. In this post, I will highlight some of the key packages for analyzing time series data that beginners to the R programming language should get acquainted with.
To run the code in this post, you will need to download the following data from a Unix terminal. The command fetches a CSV file from the City of Chicago website that contains reported incidents of crime in the city of Chicago from 2001 to the present.
$ wget --no-check-certificate --progress=dot -O chicago_crime_data.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"
Import the data into R and get the aggregate number of reported incidents of theft by day.
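library(data.table)

# read the csv and normalize the column names
dat = fread("chicago_crime_data.csv")
colnames(dat) = gsub(" ", "_", tolower(colnames(dat)))

# parse the date and count reported thefts per day
dat[, date2 := as.Date(date, format="%m/%d/%Y")]
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[1:6]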
Perfect…let’s get started.
A. Data Storage:
The first set of packages that one should be aware of is related to data storage. One could use data frames or tibbles, but there are already a number of data structures that are optimized for time series data.
The xts package offers a number of great tools for data manipulation and aggregation. At its core is the xts object, which is essentially a matrix object that can represent time series data at different time increments. The xts class extends the zoo class, so it inherits much of zoo's functionality.
Here are some functions in xts that are worth investigating:
library(xts)

# create an xts object
mydat2 = as.xts(mydat)
mydat2

# filter by date
mydat2["2015"]               ## 2015
mydat2["201501"]             ## Jan 2015
mydat2["20150101/20150105"]  ## Jan 01 to Jan 05 2015

# replace all values from Aug 25 onwards with 0
mydat2["20170825/"] <- 0
mydat2["20170821/"]

# get the last one month
last(mydat2, "1 month")

# get stats by time frame
apply.monthly(mydat2, sum)
apply.monthly(mydat2, quantile)
period.apply(mydat2, endpoints(mydat2,on='months'), sum)
period.apply(mydat2, endpoints(mydat2,on='months'), quantile)
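For readers who want to try these functions before downloading the full crime file, here is a minimal sketch on simulated data. The series `toy`, its Poisson counts, and the 2015 date range are assumptions made up for illustration; only `xts()`, `endpoints()`, `period.apply()`, and `apply.monthly()` come from the package.

```r
library(xts)

# Simulated daily counts for 2015 (illustrative data, not the Chicago file)
set.seed(42)
days <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
toy  <- xts(rpois(length(days), lambda = 200), order.by = days)
colnames(toy) <- "N"

# Date-string subsetting: one month, then a range of days
jan    <- toy["2015-01"]
first5 <- toy["2015-01-01/2015-01-05"]

# Monthly totals, computed two ways that should agree
monthly_a <- apply.monthly(toy, sum)
monthly_b <- period.apply(toy, INDEX = endpoints(toy, on = "months"), FUN = sum)

nrow(jan)        # 31 daily observations in January
nrow(monthly_a)  # 12 monthly totals
```

The two monthly results should match because `endpoints()` returns the row positions where each month ends, and `period.apply()` applies the function to the rows between consecutive positions.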
B. Dates and Times
The second set of packages that a beginner to time series analysis in R should be aware of relates to dates and times. In general, R has a lot of different facilities and classes for dealing with them, but each tends to be a bit clunky. The lubridate package from Garrett Grolemund and Hadley Wickham really simplifies many of those complexities.
R has a number of date objects with each having its own unique characteristics. The lubridate package makes it easier to work with dates and times by providing functions to identify and parse date-time data, extract and modify components of a date-time, and perform accurate math on date-times.
library(lubridate)

ymd("2010-01-01")
mdy("01-01-2010")
ymd_h("2010-01-01 10")
ymd_hm("2010-01-01 10:02")
ymd_hms("2010-01-01 10:02:30")
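The parsing functions above cover only the first of those three tasks. As a small sketch of the other two, extracting or modifying components and doing date arithmetic, the following uses an arbitrary example date (an assumption chosen because it falls at the end of a month):

```r
library(lubridate)

d <- ymd("2015-01-31")

# Extract components
year(d)    # 2015
month(d)   # 1
wday(d)    # 7 (Saturday, since Sunday is day 1 by default)

# Add one month; %m+% rolls back to the last valid day instead of returning NA
next_month <- d %m+% months(1)
next_month   # "2015-02-28"

# Round down to the start of the month
month_start <- floor_date(d, "month")
month_start  # "2015-01-01"

# Modify a component in place
day(d) <- 15
d            # "2015-01-15"
```

The `%m+%` operator matters for daily series with month-end dates, because February 31 does not exist and plain period addition has no valid date to return.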
C. Regression Analysis
Time series regression is an important part of time series analysis. There are many ways to incorporate time series data into regression analysis, but they generally involve transformations such as shifts (lags) and decay to estimate the impact of certain regressors.
C1. dynlm / ardl
Distributed lag models, along with the related error correction models, are a core component of time series analysis. There are many instances where we want to regress an outcome variable at the current time against values of various regressors at current and previous times. dynlm and ardl (a wrapper for dynlm) are solid for this type of analysis.
Here is a brief example of how dynlm can be utilized. In what follows, I have created a new, randomly generated weather variable and lagged it by one day. The model then regresses incidents of reported theft on the weather from the previous day.
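library(dynlm)

mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]

# simulate a daily weather variable and lag it by one day
mydat[, weather := sample(20:90, .N, replace=TRUE)]
mydat[, weather_lag := shift(weather, 1, type = 'lag')]

# L() needs a time series object, so rebuild the series as zoo with the new columns
mydat2 = zoo(as.matrix(mydat[, .(N, weather, weather_lag)]), order.by = mydat$date2)

# regress thefts on the previous day's weather
mod = dynlm(N ~ L(weather, 1), data = mydat2)
summary(mod)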
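To make explicit what a one-period lag term fits, here is a minimal base R sketch on simulated data. The variable names, the true coefficients (2 and 0.5), and the sample size are assumptions chosen for illustration. It builds the lagged regressor by hand and fits it with `lm()`, which is the same kind of regression that a term such as `L(x, 1)` describes.

```r
set.seed(1)
n <- 500
x <- rnorm(n)

# Lag x by one period: position t holds x[t - 1]; the first value is unknown
x_lag1 <- c(NA, x[-n])

# Outcome depends on the previous period's x (true intercept 2, true slope 0.5)
y <- 2 + 0.5 * x_lag1 + rnorm(n, sd = 0.1)

# lm() drops the first row because x_lag1 is NA there
fit <- lm(y ~ x_lag1)
coef(fit)
```

The estimated slope should land close to 0.5. The value of a helper like `L()` is mainly bookkeeping: it keeps the lag aligned with the time index so the shifted vector does not have to be built and checked by hand.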
C2. dynsim
Another common task when working with distributed lag models involves using dynamic simulations to understand estimated outcomes in different scenarios. dynsim provides a coherent solution for simulation and visualization of those estimated values of the target variable.
Here is a brief example of how dynsim can be utilized. In what follows, I fit a linear model on the lagged weather variable created above, then use dynsim to produce two dynamic simulations, one starting from the lowest lagged value and one from the highest, and plot them.
library(dynsim)

mydat3 = mydat[1:10000]
mod = lm(N ~ weather_lag, data = mydat3)

# two scenarios: the lowest and highest lagged weather values
Scen1 <- data.frame(weather_lag = min(mydat3$weather_lag, na.rm=TRUE))
Scen2 <- data.frame(weather_lag = max(mydat3$weather_lag, na.rm=TRUE))
ScenComb <- list(Scen1, Scen2)

# simulate 20 periods for each scenario and plot the results
Sim1 <- dynsim(obj = mod, ldv = 'weather_lag', scen = ScenComb, n = 20)
dynsimGG(Sim1)
D. Forecasting
When most people talk about time series analysis, they are talking about forecasting. This is one area where R is loaded with great tools. From standard moving average models to complex gradient boosting models, R has many tools designed specifically to forecast from time series data.
The forecast package is the most used package in R for time series forecasting. It contains functions for performing decomposition and forecasting with exponential smoothing, ARIMA, moving average models, and so forth. For aggregated data that is fairly high dimensional, one of the techniques in this package should provide an adequate forecasting model, provided that its assumptions hold.
Here is a quick example of how to use the auto.arima function in R. In general, automatic forecasting tools should be used with caution, but they are a good place to start exploring time series data.
library(forecast)

mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]

# let auto.arima choose the model, then forecast 200 days ahead
fit = auto.arima(mydat[, N])
pred = forecast(fit, h = 200)
plot(pred)
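One way to act on that caution is to hold back the end of the series and compare the forecasts against it. The following is a minimal sketch on a simulated series with a weekly pattern; the data, the 28-day holdout, and the variable names are assumptions for illustration rather than part of the crime example.

```r
library(forecast)

# Simulated daily series with a weekly cycle (illustrative data only)
set.seed(123)
n <- 364
y <- ts(200 + 20 * sin(2 * pi * seq_len(n) / 7) + rnorm(n, sd = 5),
        frequency = 7)

# Hold out the last 28 observations as a test set
h     <- 28
train <- subset(y, end = n - h)
test  <- subset(y, start = n - h + 1)

# Fit on the training portion only, then forecast the held-out horizon
fit <- auto.arima(train)
fc  <- forecast(fit, h = h)

# Error measures for the training set and the test set, side by side
acc <- accuracy(fc, test)
acc
```

If the test-set errors are much larger than the training-set errors, the automatically selected model is probably fitting noise, and it is worth inspecting the series and the chosen orders by hand.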
The smooth package provides functions to perform even more variations of exponential smoothing, moving average models, and various seasonal ARIMA techniques. The smooth and forecast packages are usually more than adequate for most forecasting problems that pertain to high dimensional data.
Here is a basic example that uses the automatic complex exponential smoothing function:
library(smooth)

mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]

# fit complex exponential smoothing automatically, then forecast 200 days ahead
fit = auto.ces(mydat[, N])
pred = forecast(fit, h = 200)
plot(pred)
So for those of you getting introduced to the R programming language, this is a list of extremely useful packages for time series analysis that you will want to get some exposure to.