Discussion:
Adfuller uses up all the RAM?
Louis
2014-06-27 04:06:40 UTC
I am doing some analysis on a relatively large data set. To be more
specific, I am running an augmented Dickey-Fuller test on a pandas time
series object. The length of the time series is 5299788. I have 8GB of RAM
on board, with about 5.5GB available when idle.

My problem is that the program eats up all my RAM, so I can't do anything
else while it is running. I can't even kill the process, as the system does
not respond to anything (i.e. Ctrl+Alt+Del). From my preliminary
calculation, the computation needs at least 2 days to complete. My question
is: is there any way to prevent the program from using all the resources,
so I could still do other things while it runs in the background?

My data is only about 200MB.

The code is simple:

import pandas as pd
import statsmodels.tsa.stattools as ts

zn = pd.read_csv(path)
result = ts.adfuller(zn)

Much appreciate your help.
j***@public.gmane.org
2014-06-27 04:42:33 UTC
Post by Louis
I am doing some analysis on some relatively large data set. To be more
specific, I am running an augmented Dickey-Fuller test on a pandas timeseries
object. The length of the time series amounts to 5299788. I have 8GB of RAM
on board with about 5.5GB available when IDLE.
My problem is the program eat away all my RAM. So I couldn't do anything
else while it is running. I couldn't even kill the process as the system is
not responding to anything (i.e. ALT_CTRL_DEL). From my preliminary
calculation the calculation needs at least 2 days to complete. My question
is there any way to prevent the program from using all the resource so I
could still do other stuff while it is running in the background?
My data is only about 200MB.
import pandas as pd
import statsmodels.tsa.stattools as ts
zn = pd.read_csv(path)
result = ts.adfuller(zn)
Much appreciate your help.
I guess you need to set a smaller maxlag, and turn off autolag

The OLS to get the results uses a design matrix of roughly (nobs, maxlag)
>>> nobs = 5299788; int(np.ceil(12. * np.power(nobs / 100., 1 / 4.)))
183
>>> 5299788 * _
969861204

But also, autolag temporarily keeps all of the regression results
instances in memory at the same time.

The entire results are not stored and returned by default.


Would you please open an issue?
https://github.com/statsmodels/statsmodels/issues

We can reduce the number of temporary arrays and results that are kept in
memory.
However, I don't know if there is any way around the (nobs, maxlag) design
matrix for the auxiliary regression.
I'm sure there is, but it might not be obvious to implement without a
low-memory OLS.

Josef
Nathaniel Smith
2014-06-27 08:09:44 UTC
Two quick thoughts:
- you can make a lag matrix in basically no memory by using stride_tricks.
(This requires the time series to be stored in contiguous memory, but
copying the time series is cheap compared to forming an explicit lag
matrix.) I'm not sure whether such a matrix can be passed to linalg
routines without copying, though :-/
- Incremental OLS is pretty easy; I've posted code a few times and could
again if it's useful.
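The stride-tricks idea can be sketched as follows (lagmat_view is a hypothetical helper for illustration, not the statsmodels implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def lagmat_view(x, maxlag):
    """Return a (nobs - maxlag, maxlag + 1) strided VIEW whose rows are
    [x_t, x_{t-1}, ..., x_{t-maxlag}]; no data is copied."""
    x = np.ascontiguousarray(x, dtype=float)  # stride tricks need contiguous memory
    n = x.shape[0] - maxlag
    s = x.strides[0]
    # Base pointer at x[maxlag]; moving one column steps backwards in time.
    return as_strided(x[maxlag:], shape=(n, maxlag + 1), strides=(s, -s))

x = np.arange(6.0)
view = lagmat_view(x, 2)  # rows are [x_t, x_{t-1}, x_{t-2}] for t = 2..5
```

As noted above, this only defers the memory cost: LAPACK routines generally require particular memory layouts, so passing such a view to np.linalg may still trigger a copy.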

-n
j***@public.gmane.org
2014-06-27 11:48:56 UTC
Post by Nathaniel Smith
- you can make a lag matrix in basically no memory by using stride_tricks.
(This requires the time series be stored in contiguous memory, but copying
the time series is cheap compared to forming an explicit lag matrix.) I'm
not sure whether such a matrix can be passed to linalg routines without
copying though :-/
- Incremental OLS is pretty easy, I've posted code a few times and could
again if it's useful.
adfuller is a "textbook" implementation; I had at most a few thousand
observations in mind, as for applications in macroeconomics and finance.

I think we can build the moment matrices in a similar way as for
Yule-Walker. And the autolag regression could be replaced by a single QR or
by using a sweep algorithm on the moment matrix.
(pacf has 3 calculation methods plus versions of them; adfuller has only 1.)

Using OLS has the big advantage that we have all required results, aic,
tvalues, ... immediately available.

The main work for incremental OLS is to write a Results class that
calculates all the interesting things based on just the moment matrix.
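A rough sketch of the moment-matrix direction: accumulate X'X and X'y chunk by chunk and solve once at the end, so only one chunk is ever in memory (incremental_ols is a hypothetical helper; a real Results class would also track residual sums of squares etc. to produce aic, tvalues, and so on):

```python
import numpy as np

def incremental_ols(chunks):
    """OLS from accumulated moment matrices X'X and X'y, processing the
    data in chunks so only one chunk is held in memory at a time."""
    xtx = xty = None
    for X, y in chunks:
        if xtx is None:
            xtx = X.T @ X
            xty = X.T @ y
        else:
            xtx += X.T @ X
            xty += X.T @ y
    beta = np.linalg.solve(xtx, xty)
    return beta, xtx

# Illustration on simulated data with a known coefficient vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=10_000)
chunks = ((X[i:i + 1000], y[i:i + 1000]) for i in range(0, 10_000, 1000))
beta, _ = incremental_ols(chunks)
```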

PRs welcome.


Josef
Thomas Johnson
2014-07-08 19:10:49 UTC
Would this approach also speed up adfuller significantly? If the speed-up
is significant, I'd be willing to put a bounty on this feature.
j***@public.gmane.org
2014-07-08 20:31:22 UTC
Post by Thomas Johnson
Would this approach also speed up adfuller significantly? If the speed-up
is significant I'd be willing to put a bounty on this feature
Some rough estimates:

Without autolag search, the speedup will not be huge. Part of a speedup can
come from optimizing OLS for a specific shape of the design matrix, exog.
For example, if exog has many rows and few columns, then solving the normal
equations will be faster than working with pinv/svd as we do in OLS by
default, but it will be more susceptible to numerical noise for
ill-conditioned exog.
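The trade-off can be sketched like this for a tall-and-skinny design (both approaches give essentially the same estimate when exog is well conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 12))  # tall and skinny, like (nobs, maxlag)
y = X @ np.ones(12) + rng.normal(size=200_000)

# Normal equations: one (12, 12) solve after a single pass over X.
# Fast, but squares the condition number of X.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares, in the spirit of the pinv default in OLS;
# slower here, but more robust for ill-conditioned X.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)
```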

For the autolag search, which requires solving OLS for all lags up to
maxlag, the cost could be reduced from maxlag OLS regressions to maybe the
equivalent of 2 or 3 regressions, plus a small cost for each lag to
calculate AIC or similar.

Plus whatever savings come from more efficient memory usage.

So any savings will depend a lot on the number of observations and on
maxlag.

Josef
Thomas Johnson
2014-07-08 20:38:33 UTC
I have ~43000 observations, and the default maxlag looks like it works out
to 12*(43000/100)^(1/4) ~= 55.
However, I don't use autolag, so it sounds like maybe there won't be a
significant speedup, if I understand you correctly?
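For reference, the rule quoted earlier in the thread works out as follows (default_maxlag is just a name for this sketch):

```python
import numpy as np

def default_maxlag(nobs):
    # The rule quoted earlier in the thread: 12 * (nobs / 100) ** (1/4)
    return int(np.ceil(12. * np.power(nobs / 100., 1 / 4.)))

default_maxlag(43_000)     # 55, matching the estimate above
default_maxlag(5_299_788)  # 183, for the original 5.3M-observation series
```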

I'm also going to try to use numba to see if that helps
Chad Fulton
2014-06-27 16:05:15 UTC
Post by Nathaniel Smith
- Incremental OLS is pretty easy, I've posted code a few times and could
again if it's useful.
If I'm not wrong, I believe the actual recursions for RLS can be performed
via the Kalman filter as well, if need be.
j***@public.gmane.org
2014-06-27 16:15:03 UTC
Post by Chad Fulton
If I'm not wrong, I believe the actual recursions for RLS can be
performed via the Kalman filter as well, if need be.
Since it's univariate, we could just use the acf calculated via FFT or
similar. The estimation is AR by OLS and doesn't need a recursive filter
the way (V)ARMA does.
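A sketch of the FFT route (acf_fft is a hypothetical helper, not the statsmodels function, which has its own options for bias correction and confidence intervals):

```python
import numpy as np

def acf_fft(x, nlags):
    """Sample autocorrelation up to nlags via FFT: O(n log n) time and
    O(n) memory, instead of forming an explicit (nobs, nlags) lag matrix."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xd = x - x.mean()
    nfft = 1 << int(np.ceil(np.log2(2 * n - 1)))  # zero-pad: no circular wrap
    f = np.fft.rfft(xd, nfft)
    acov = np.fft.irfft(f * np.conj(f), nfft)[: nlags + 1] / n
    return acov / acov[0]

x = np.sin(0.3 * np.arange(1000))  # any univariate series
r = acf_fft(x, 5)                  # r[0] == 1.0 by construction
```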

Maybe there are some other tests where Kalman Filter would help.

Josef

Sturla Molden
2014-06-27 21:18:45 UTC
Post by Louis
I am doing some analysis on some relatively large data set. To be more
specific, I am running an augmented Dickey-Fuller test on a pandas timeseries
object. The length of the time series amounts to 5299788. I have 8GB of RAM
on board with about 5.5GB available when IDLE.
If you have a working program, the least expensive solution is almost
always to use a larger computer. There is a tendency to ignore developer
time (salary) in the budget. Even if you are a scientist or individual
developer and will do the coding yourself, you will lose valuable time you
could have spent on other projects. Most universities have HPC facilities
that can be used when desktop computers are too small. If you are in a
company, they will probably have enough money to buy you a computer with
50 GB of RAM if it is needed. And if that investment is too expensive, you
can buy a quota on Google's or Amazon's cloud computing services. If you
just need bigger hardware for a short period of time, it will not be very
expensive.

Sturla