(Don’t) Blame the weather

This project uses time series analysis and forecasting techniques to explore crime data during the COVID-19 pandemic, while also considering the weather.

Did COVID-19 prevent homicides?

Admittedly, that header is a bit sensational, but when I analyzed Boston’s crime data during Covid-19, the drop in reported offenses was quite astonishing. While verbal dispute offenses sky-rocketed, and more people set out to rob a bank, the overall number of crimes dropped by more than half. So indeed, the lock-down prevented harm by reducing the risk of catching the virus and also decreased the probability of getting hit by a car or bullet.

Furthermore, my most avid blog reader (and dad-in-law), pointed out that I should incorporate the weather into my analysis. So here I go again, analyzing more crime data. This time, I’ll conduct the study in R and use the opportunity to take a deep dive into time series analysis and forecasting. Rob Hyndman and George Athanasopoulos wrote an excellent introduction to the topic, that served as my study guide. The authors kindly make their text freely available online.

Here, I’ll discuss the results in a high-level overview, but I’ll point out technical pitfalls along the way. Please find the extensively commented code on GitHub. Also, don’t hesitate to drop me a line, if something is not clear or you’d like to suggest some improvements.

Data sources on Google BigQuery

When I searched for data sources for both crime and weather data, I stumbled over BigQuery’s phenomenal public data sets. I’ll make use of the Chicago crime data (bigquery-public-data.chicago_crime.crime) and weather data from NOAA, the National Oceanic and Atmospheric Administration, (bigquery-public-data.noaa_gsod.gsod*).

Chicago’s offense counts are strongly seasonal and decline over the years

First, I wanted to repeat some of the analysis for the Chicago data, that I conducted for Boston a while back. It was interesting to repeat the study in R and compare it to the previous analysis in Tableau.

The findings between the two metropoles Boston and Chicago are very similar. The monthly number of offenses is highly seasonal. The number of offenses overall declines over the years.

Monthly offense counts over the years. Each dot represents the total count of crimes of one month. Note the strong seasonality of the data (i.e., the “oscillation” over the years). The blue trendline (LOESS smoothing) emphasizes the decline of offense count over the years. In 2019, the police recorded approximately 40% fewer offenses compared to 2001. Note the deviation from the regular, seasonal pattern in 2020.

Some data engineering to retrieve Chicago’s weather data

Next, let’s pull in the weather data. The NOAA data set with daily weather information (GSOD… Global Surface Summary of the Day) provides one table per year (i.e., gsod1929 through gsod2020). The table “stations” provide information when and where weather data was collected. Interestingly, NOAA reports that there are “more than 9,000 stations.” However, the table lists 26,478 unique station names. Unfortunately, the identifiers, wban and usaf, do not elucidate the matter as they have been used inconsistently over the years. If you’d like some more metadata about the GSOD data, head on over to NOAA’s website.

So how should we select relevant stations? First, I narrowed it down to stations in Illinois that delivered data past 2001. Interestingly, most stations report data for only a few years, or even intermittently, which means that we have to piece data of multiple stations together in order to cover the full period for which we have crime data (2001 to present).

Furthermore, I noticed that selecting “Chicago” on the station name would omit some relevant stations or include some that are outside the area of interest. For example, we would exclude the station “Northerly Island,” which sits on an artificial island in Lake Michigan. That station is highly relevant, while a station such as “Chicago/Waukegan” is quite removed from the area in which crime data for Chicago are collected.

Eventually, I resorted to a geographical selection, and created a “quadrant of interest,” using the longitude and latitude information of the stations table. Here is the complete query to obtain weather data for Chicago:

#standardsql
SELECT
  -- Create a timestamp from the date components.
  timestamp(concat(year,'-',mo,'-',da)) as date,
  -- Replace numerical null values with actual nulls
  -- Round to appropriate decimal
  ROUND(AVG(IF (temp=9999.9, null, temp)),2) AS temperature,
  ROUND(AVG(IF (visib=999.9, null, visib)),2) AS visibility,
  ROUND(AVG(IF (wdsp='999.9', null, CAST(wdsp AS Float64))),2) 
      AS wind_speed,
  ROUND(AVG(IF (gust=999.9, null, gust)),2) AS wind_gust,
  ROUND(AVG(IF (prcp=99.99, null, prcp)),3) AS precipitation,
  AVG(IF (sndp=999.9, null, sndp)) AS snow_depth
FROM
  `bigquery-public-data.noaa_gsod.gsod*`
WHERE
   `stn` IN
 -- Subquery to select stations from stations table
  (SELECT `usaf`
    FROM bigquery-public-data.noaa_gsod.stations
 -- Specify Longitude and Latidude that spans the city of Chicago
    WHERE state='IL' AND (`lat` BETWEEN 41.6 AND 42) 
        AND (`lon` BETWEEN -88 AND -87)
 -- Exclude stations that stopped reporting weather data before 2001
      AND `end` > '20010000'
    ORDER BY `end` ASC) AND CAST(YEAR AS INT64) > 2000
 GROUP BY date;

That query is a mouth-full but pulls reasonably clean data into R. Last but not least, we need to average across all stations for each day to cover the entire period from 2001 to present.

May 2020 is the rainiest month

Let’s plot the temperature data in a seasonal plot to get an idea about the variance across years.

A seasonal subseries plot highlights the variance of the average monthly temperature across the years. Each panel represents the average daily temperature for a particular month (i.e., Jan 2001 through Jan 2020.) The blue horizontal line in each panel shows the average for that month across all years.

OK, maybe not terribly interesting to see that the temperature in Chicago follows a yearly, seasonal pattern. January is the coldest month and July the warmest. However, we notice a stronger variance in the average temperatures during the colder months, while August is just “hot” with a lot less variation. Let’s keep that in mind when we look at the crime data in conjunction with the weather data.

Let’s also quickly look at the seasonal plot for precipitation:

A seasonal subseries plot highlights the variance of the average monthly precipitation across the years. Each panel represents the average total precipitation for a particular month (i.e., Jan 2001 through Jan 2020.) The blue horizontal line in each panel shows the average for that month across all years.

That doesn’t look quite as clear cut as the temperature plot. The plot reveals a bimodal distribution with more rain in the early and late summer months. However, there is a high variance in almost all months. For instance, for September, we see an up to 8-fold difference across the years.

TO BE CONTINUED…

Leave a Reply

Your email address will not be published. Required fields are marked *