Discussion:
Adfuller uses up all the RAM?
Louis
2014-06-27 04:06:40 UTC
I am doing some analysis on a relatively large data set. To be more
specific, I am running an augmented Dickey-Fuller test on a pandas time
series object. The length of the time series is 5299788. I have 8GB of RAM
on board, with about 5.5GB available when idle.

My problem is that the program eats up all my RAM, so I can't do anything
else while it is running. I can't even kill the process, as the system does
not respond to anything (i.e. Ctrl+Alt+Del). From my preliminary
calculation, the computation needs at least 2 days to complete. My question
is: is there any way to prevent the program from using all the resources,
so I could still do other things while it runs in the background?

My data is only about 200MB.

The code is simple:

import pandas as pd
import statsmodels.tsa.stattools as ts

zn = pd.read_csv(path)
result = ts.adfuller(zn)

Much appreciate your help.
j***@public.gmane.org
2014-06-27 04:42:33 UTC
Post by Louis
I am doing some analysis on some relatively large data set. To be more
specific, I am running an augmented Dickey-Fuller test on a pandas timeseries
object. The length of the time series amounts to 5299788. I have 8GB of RAM
on board with about 5.5GB available when IDLE.
My problem is the program eat away all my RAM. So I couldn't do anything
else while it is running. I couldn't even kill the process as the system is
not responding to anything (i.e. ALT_CTRL_DEL). From my preliminary
calculation the calculation needs at least 2 days to complete. My question
is there any way to prevent the program from using all the resource so I
could still do other stuff while it is running in the background?
My data is only about 200MB.
import pandas as pd
import statsmodels.tsa.stattools as ts
zn = pd.read_csv(path)
result = ts.adfuller(zn)
Much appreciate your help.
I guess you need to set a smaller maxlag, and turn off autolag

The OLS to get the results uses a design matrix of roughly (nobs, maxlag)
>>> nobs = 5299788; int(np.ceil(12. * np.power(nobs / 100., 1 / 4.)))
183
>>> 5299788 * _
969861204

But also, autolag temporarily keeps all of the regression results
instances in memory at the same time.

The entire results are not stored and returned by default.


Would you please open an issue?
https://github.com/statsmodels/statsmodels/issues

We can reduce the number of temporary arrays and results that are kept in
memory.
However, I don't know if there is any way around the (nobs, maxlag) design
matrix for the auxiliary regression.
I'm sure there is, but it might not be obvious to implement without a
low-memory OLS.

Josef
Nathaniel Smith
2014-06-27 08:09:44 UTC
Two quick thoughts:
- you can make a lag matrix in basically no memory by using stride_tricks.
(This requires the time series to be stored in contiguous memory, but
copying the time series is cheap compared to forming an explicit lag
matrix.) I'm not sure whether such a matrix can be passed to linalg
routines without copying, though :-/
- Incremental OLS is pretty easy; I've posted code a few times and could
again if it's useful.
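The stride-tricks idea can be sketched as follows (lagmat_view is a hypothetical helper for illustration, not the statsmodels implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def lagmat_view(x, maxlag):
    """Return a (nobs - maxlag, maxlag + 1) strided VIEW whose rows are
    [x_t, x_{t-1}, ..., x_{t-maxlag}]; no data is copied."""
    x = np.ascontiguousarray(x, dtype=float)  # stride tricks need contiguous memory
    n = x.shape[0] - maxlag
    s = x.strides[0]
    # Base pointer at x[maxlag]; moving one column steps backwards in time.
    return as_strided(x[maxlag:], shape=(n, maxlag + 1), strides=(s, -s))

x = np.arange(6.0)
view = lagmat_view(x, 2)  # rows are [x_t, x_{t-1}, x_{t-2}] for t = 2..5
```

As noted above, this only defers the memory cost: LAPACK routines generally require particular memory layouts, so passing such a view to np.linalg may still trigger a copy.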

-n
j***@public.gmane.org
2014-06-27 11:48:56 UTC
Post by Nathaniel Smith
- you can make a lag matrix in basically no memory by using stride_tricks.
(This requires the time series be stored in contiguous memory, but copying
the time series is cheap compared to forming an explicit lag matrix.) I'm
not sure whether such a matrix can be passed to linalg routines without
copying though :-/
- Incremental OLS is pretty easy, I've posted code a few times and could
again if it's useful.
adfuller is a "textbook" implementation; I had at most a few thousand
observations in mind, as for applications in macroeconomics and finance.

I think we can build the moment matrices in a similar way as for
Yule-Walker. And the autolag regression could be replaced by a single QR or
by using a sweep algorithm on the moment matrix.
(pacf has 3 calculation methods plus versions of them; adfuller has only 1.)

Using OLS has the big advantage that we have all required results, aic,
tvalues, ... immediately available.

The main work for incremental OLS is to write a Results class that
calculates all the interesting things based on just the moment matrix.
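A rough sketch of the moment-matrix direction: accumulate X'X and X'y chunk by chunk and solve once at the end, so only one chunk is ever in memory (incremental_ols is a hypothetical helper; a real Results class would also track residual sums of squares etc. to produce aic, tvalues, and so on):

```python
import numpy as np

def incremental_ols(chunks):
    """OLS from accumulated moment matrices X'X and X'y, processing the
    data in chunks so only one chunk is held in memory at a time."""
    xtx = xty = None
    for X, y in chunks:
        if xtx is None:
            xtx = X.T @ X
            xty = X.T @ y
        else:
            xtx += X.T @ X
            xty += X.T @ y
    beta = np.linalg.solve(xtx, xty)
    return beta, xtx

# Illustration on simulated data with a known coefficient vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=10_000)
chunks = ((X[i:i + 1000], y[i:i + 1000]) for i in range(0, 10_000, 1000))
beta, _ = incremental_ols(chunks)
```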

PRs welcome.


Josef
Thomas Johnson
2014-07-08 19:10:49 UTC
Would this approach also speed up adfuller significantly? If the speed-up
is significant, I'd be willing to put a bounty on this feature.
j***@public.gmane.org
2014-07-08 20:31:22 UTC
Post by Thomas Johnson
Would this approach also speed up adfuller significantly? If the speed-up
is significant I'd be willing to put a bounty on this feature
Some rough estimates:

Without autolag search, the speedup will not be huge. Part of a speedup can
come from optimizing OLS for a specific shape of the design matrix, exog.
For example, if exog has many rows and few columns, then solving the normal
equations will be faster than working with pinv/svd as we do in OLS by
default, but it will be more susceptible to numerical noise for
ill-conditioned exog.
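The trade-off can be sketched like this for a tall-and-skinny design (both approaches give essentially the same estimate when exog is well conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 12))  # tall and skinny, like (nobs, maxlag)
y = X @ np.ones(12) + rng.normal(size=200_000)

# Normal equations: one (12, 12) solve after a single pass over X.
# Fast, but squares the condition number of X.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares, in the spirit of the pinv default in OLS;
# slower here, but more robust for ill-conditioned X.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)
```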

For the autolag search, which requires solving OLS for all lags up to
maxlag, the cost could be reduced from maxlag OLS regressions to maybe the
equivalent of 2 or 3 regressions, plus a small cost for each lag to
calculate AIC or similar.

Plus whatever savings come from more efficient memory usage.

So any savings will depend a lot on the number of observations and on
maxlag.

Josef
Thomas Johnson
2014-07-08 20:38:33 UTC
I have ~43000 observations, and the default maxlag looks like it works out
to 12*(43000/100)^(1/4) ~= 55.
However, I don't use autolag, so it sounds like maybe there won't be a
significant speedup, if I understand you correctly?
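For reference, the rule quoted earlier in the thread works out as follows (default_maxlag is just a name for this sketch):

```python
import numpy as np

def default_maxlag(nobs):
    # The rule quoted earlier in the thread: 12 * (nobs / 100) ** (1/4)
    return int(np.ceil(12. * np.power(nobs / 100., 1 / 4.)))

default_maxlag(43_000)     # 55, matching the estimate above
default_maxlag(5_299_788)  # 183, for the original 5.3M-observation series
```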

I'm also going to try to use numba to see if that helps
Chad Fulton
2014-06-27 16:05:15 UTC
Post by Nathaniel Smith
- Incremental OLS is pretty easy, I've posted code a few times and could
again if it's useful.
If I'm not wrong, I believe the actual recursions for RLS can be performed
via the Kalman filter as well, if need be.
j***@public.gmane.org
2014-06-27 16:15:03 UTC
Post by Chad Fulton
If I'm not wrong, I believe the actual recursions for RLS can be
performed via the Kalman filter as well, if need be.
Since it's univariate, we could just use the acf calculated via FFT or
similar. The estimation is AR by OLS and doesn't need a recursive filter
the way (V)ARMA does.
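A sketch of the FFT route (acf_fft is a hypothetical helper, not the statsmodels function, which has its own options for bias correction and confidence intervals):

```python
import numpy as np

def acf_fft(x, nlags):
    """Sample autocorrelation up to nlags via FFT: O(n log n) time and
    O(n) memory, instead of forming an explicit (nobs, nlags) lag matrix."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xd = x - x.mean()
    nfft = 1 << int(np.ceil(np.log2(2 * n - 1)))  # zero-pad: no circular wrap
    f = np.fft.rfft(xd, nfft)
    acov = np.fft.irfft(f * np.conj(f), nfft)[: nlags + 1] / n
    return acov / acov[0]

x = np.sin(0.3 * np.arange(1000))  # any univariate series
r = acf_fft(x, 5)                  # r[0] == 1.0 by construction
```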

Maybe there are some other tests where Kalman Filter would help.

Josef

Sturla Molden
2014-06-27 21:18:45 UTC
Post by Louis
I am doing some analysis on some relatively large data set. To be more
specific, I am running an augmented Dickey-Fuller test on a pandas timeseries
object. The length of the time series amounts to 5299788. I have 8GB of RAM
on board with about 5.5GB available when IDLE.
If you have a working program, the least expensive solution is almost
always to use a larger computer. There is a tendency to ignore developer
time (salary) in the budget. Even if you are a scientist or individual
developer and will do the coding yourself, you will lose valuable time you
could have spent on other projects. Most universities have HPC facilities
that can be used when desktop computers are too small. If you are in a
company, they will probably have enough money to buy you a computer with
50 GB of RAM if it is needed. And if that investment is too expensive, you
can buy a quota on Google's or Amazon's cloud computing services. If you
just need bigger hardware for a short period of time, it will not be very
expensive.

Sturla