Discussion:
[pystatsmodels] SARIMAX unable to detect weekly time intervals?
k***@gmail.com
2018-10-02 20:24:50 UTC
I am working with statsmodels version 0.8.0 and Python 3.6. I have the
following pandas DataFrame, df, with two columns: 'date' and 'count'. The
'date' column has a datetime dtype and 'count' has an integer dtype.
There is an observation (row) corresponding to each Monday between
2009-12-28 and 2018-09-24, and these Monday dates are the contents of the
'date' column:

df =

date        count
2009-12-28  2
2010-01-04  19
2010-01-11  18
2010-01-18  8
2010-01-25  18
2010-02-01  23
.
2018-09-17  15
2018-09-24  7

I am able to successfully utilize the statsmodels.tsa.statespace.SARIMAX
class to produce past predictions of 'count' - using the .get_prediction()
method - and future predictions of 'count' - using the .get_forecast()
method - when the 'date' column contains the start date of each month
between 2010-01-01 and 2018-09-01; in this case, the day is always set to
'01'. The same code that is successful in this case fails, however, if the
'date' column contains the *last* day of each month and the day is variable
('30', '31', '28', or '29').

According to the documentation, the failure of SARIMAX to work when I used
the last day of each month in the 'date' column is somewhat expected since
the date - when converted to an index for use in SARIMAX - must be in
regular time intervals. It makes sense that since some months have more
days than others, the computer would fail to see that the unit of time is 1
month in that case.

However, in the weekly case the observations are all exactly seven days
apart, so I expected the algorithm to be able to self-detect the unit
of time to be 1 week/7 days. Is there any way for me to get the SARIMAX
object to train and predict on a time unit of 1 week?
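
As a rough illustration of the spacing described above, the sketch below compares the day gaps in a few Monday dates with those in a few month-end dates, then asks pandas what frequency it can infer. The short date lists are illustrative stand-ins, not the full dataset. It only shows what pandas itself reports, which may differ from what a given statsmodels version does with the index.

```python
import pandas as pd

# Illustrative stand-ins: four Mondays and four month-end dates.
weekly = pd.to_datetime(["2009-12-28", "2010-01-04", "2010-01-11", "2010-01-18"])
month_end = pd.to_datetime(["2010-01-31", "2010-02-28", "2010-03-31", "2010-04-30"])

# Distinct gaps (in days) between consecutive observations.
weekly_gaps = sorted(weekly.to_series().diff().dropna().dt.days.unique().tolist())
month_end_gaps = sorted(month_end.to_series().diff().dropna().dt.days.unique().tolist())
print(weekly_gaps)      # [7]
print(month_end_gaps)   # [28, 30, 31]

# An index built from plain dates carries no frequency of its own,
# even though pandas can infer one from the regular spacing.
print(weekly.freq)             # None
print(pd.infer_freq(weekly))   # W-MON
```

So the dates being evenly spaced does not by itself mean the index has a frequency attached.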

Many thanks,
Kathryn
Chad Fulton
2018-10-02 23:20:45 UTC
If possible, it would be great if you could provide some example code.
There are a lot of things that might be happening.

The key point is that for forecasting with dates to work, the index of your
data must have an associated frequency. From what you posted, it appears
that your index may just have timestamp objects but not have a frequency.
You might see what happens if you force the data index to have frequency
'W-MON'.
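
One possible way to force a frequency onto data shaped like the DataFrame in the question is sketched below. It assumes the first six rows from the original post as sample data and uses pandas' `asfreq`. This is only one option and not necessarily what the final solution looked like.

```python
import pandas as pd

# First six rows from the question, used here as sample data.
df = pd.DataFrame({
    "date": pd.to_datetime(["2009-12-28", "2010-01-04", "2010-01-11",
                            "2010-01-18", "2010-01-25", "2010-02-01"]),
    "count": [2, 19, 18, 8, 18, 23],
})

# Moving the column into the index gives timestamps without a frequency.
series = df.set_index("date")["count"]
print(series.index.freq)       # None

# asfreq reindexes onto a regular weekly (Monday) grid and attaches the
# frequency; any missing week would show up as NaN.
weekly = series.asfreq("W-MON")
print(weekly.index.freqstr)    # W-MON
```

If `asfreq` introduces NaN rows, the dates were not all on the expected weekly grid, which is worth checking before fitting a model.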

Finally, we overhauled date/time handling in v0.9, so it may be the case
that your problem would be fixed by upgrading (although I still think you'd
need to set a frequency in your data).

For example, the following code works for me (actually on the latest code
on GitHub, rather than v0.9) - you might try it out to see what happens:

---------------

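import numpy as np
import pandas as pd
import statsmodels.api as sm

# Seeded generator so the simulated data is reproducible
rs = np.random.RandomState(seed=1234)

nobs = 104
# Weekly periods ending on Monday. period_range is used because newer
# pandas versions no longer accept PeriodIndex(start=..., periods=...)
index = pd.period_range(start='2000', periods=nobs, freq='W-MON')
endog = pd.Series(rs.normal(size=nobs), index=index)

# With default arguments SARIMAX fits an AR(1) model
mod = sm.tsa.SARIMAX(endog)
res = mod.fit()

# Forecast up to the weekly period containing each date string
print(res.forecast('2002'))
print(res.forecast('2002-01'))
print(res.forecast('2002-01-01'))
print(res.forecast('2002-01-31'))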

---------------

This produces the following output:

2001-12-25/2001-12-31 -0.220530
2002-01-01/2002-01-07 -0.042914
Freq: W-MON, dtype: float64

2001-12-25/2001-12-31 -0.220530
2002-01-01/2002-01-07 -0.042914
Freq: W-MON, dtype: float64

2001-12-25/2001-12-31 -0.220530
2002-01-01/2002-01-07 -0.042914
Freq: W-MON, dtype: float64

2001-12-25/2001-12-31 -0.220530
2002-01-01/2002-01-07 -0.042914
2002-01-08/2002-01-14 -0.008351
2002-01-15/2002-01-21 -0.001625
2002-01-22/2002-01-28 -0.000316
2002-01-29/2002-02-04 -0.000062
Freq: W-MON, dtype: float64



Best,
Chad
Kathryn Bryant
2018-10-03 13:19:16 UTC
Hi Chad,

Thank you so much for your speedy reply. I will implement your suggestion
of using the freq='W-MON' argument and, if I still have a problem, circle
back with code snippets. (I apologize for not attaching any - I was so
concerned with dumping too much unnecessary detail into my question that
I avoided putting in any code at all. There's clearly a happy medium to
which I should aspire.)

Many thanks,
Kathryn
Kathryn Bryant
2018-10-03 17:01:48 UTC
Hi Chad,

Using a pandas PeriodIndex in my series and setting freq='W-MON' worked
perfectly. Thank you again.
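
The exact code used was not posted. A minimal sketch of one way to move Monday-dated observations onto a weekly PeriodIndex, using the first four rows from the original question, might look like this:

```python
import pandas as pd

# First four rows from the question, indexed by their Monday dates.
dates = pd.to_datetime(["2009-12-28", "2010-01-04", "2010-01-11", "2010-01-18"])
counts = pd.Series([2, 19, 18, 8], index=dates, name="count")

# Map each Monday timestamp to the weekly period that ends on that Monday.
weekly = counts.to_period("W-MON")

print(weekly.index.freqstr)   # W-MON
print(weekly.index[0])        # 2009-12-22/2009-12-28
```

A series indexed like this can then be passed to sm.tsa.SARIMAX in the same way as `endog` in Chad's example.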

Sincerely,
Kathryn
Chad Fulton
2018-10-03 22:07:11 UTC
Glad to hear it!

Chad
