DATA ANALYSIS
Datasets with several timezones
How to deal with daylight saving time in Pandas
Questo articolo è disponibile anche in italiano, qui
Dates and times: your best friend and your worst enemy when it comes to handling a dataset.
In my everyday job, I often deal with timeseries analysis and historical data.
One critical aspect in the datetime format, especially if the timestep is hourly or lower: daylight saving time creates two discontinuities that can create errors in the data handling or interpretation if overlooked.
In this story, I’ll show you a practical example and the solution I adopted.
The data
Data was stored in a database; I was working on a subset stored in .csv for easy handling with Jupyter Notebook.
Here is an example:
datetime, variable1, variable2
2020–01–01, 00:00:00+01, 1, 2
2020–01–01, 00:15:00+01, 3, 5
2020–01–01, 00:30:00+01, 7, 9
This was the first time I had to handle data with the “+01” timezone bit in the timestamp, so I decided to just try and let Pandas handle it.
my_data = pd.read_csv(filepath,
index_col = ‘datetime’,
parse_dates=True)
my_data.index.dtypeOUTPUT: dtype(‘O’)
Damn! Pandas wasn’t able to convert the datetime column to datetime format! A quick online search solved the mystery: datetimes were recognized with two different timezones.
Let’s see the data around daylight saving time start:
datetime, variable1, variable2
2020–03–29 01:30:00+01, 65, 7
2020–03–29 01:45:00+01, 39, 11
2020–03–29 03:00:00+02, 24, 4
2020–03–29 03:00:15+02, 12, 9
So, Pandas is not able to understand that the notation simply indicates daylight saving time. What should we do now?
UTC, the machine-friendly timezone
Our goal now is to choose a datetime format easy to handle with Pandas.
The library itself has a solution: the datetime parser has a parameter that allows converting each string in UTC datetime, avoiding the daylight saving time ambiguities.
my_data[‘Datetime’] = pd.to_datetime(
Medium.index,
utc=True,
infer_datetime_format=True)
EUREKA! Now the datetimes are in datetime64 format!
Last step: put the column as the index and I’m ready to roll!
my_data = my_data.set_index(‘datetime’)
Or so I believed…
Back to the future
Working with the data was now really impractical: my datetime filters were written in GMT+1 timezone and I wanted all my data visualization expressed in my timezone!
Luckily, Pandas got me covered again: converting the parsed dates into my timezone was easy as writing one line of code.
my_data.index = my_data.index.tz_convert(‘Europe/Rome’)
That’s all Folks!
This experience taught me how to streamline the datetime parsing with parsing in just three steps:
- First try: parse datetimes directly with pd.read_csv(); if this fails:
- Parse the dates converting them in UTC format, and then
- Convert dates to the desired timezone.
Final catch: we can do point 2 and 3 in the same instruction!
my_data.index = my_data.set_index(pd.to_datetime(
my_data.index,
utc=True,
infer_datetime_format=True).tz_convert(‘Europe/Rome’)