Discussion:
[pystatsmodels] Reusing OLS object for results with different endog
Andrey Portnoy
2018-10-26 00:20:54 UTC
Permalink
Hi all,

Is there an official way of creating new results objects when exog is fixed and only endog changes? I’m using unweighted OLS.

The obvious approach is to create a new OLS object for each case and call fit().

fit(), however, checks for the presence of exog-related objects (like the pseudoinverse), skips recomputing them if they are found, and jumps straight to computing the betas. So refitting from scratch is clearly unnecessary if the goal is to produce a results object with only the endog swapped out.

One way to handle this is to swap wendog directly and rerun fit():

import statsmodels.formula.api as smf

model = smf.ols(formula, data)
old_results = model.fit()

model.wendog = new_endog   # swap the response in place
new_results = model.fit()  # reuses the cached pseudoinverse of exog

But is there an official way of reusing the existing OLS object, rather than setting model.wendog directly?

Thank you,
Andrey Portnoy.
j***@gmail.com
2018-10-26 00:58:35 UTC
Permalink
No, there is still no official way to do this.
I had added the reuse of the attached pinv for use cases like this,
but something like a fit_new_endog method was never implemented:
https://github.com/statsmodels/statsmodels/issues/718

A fit_new_endog would essentially do what you have. The problem is that it
is dangerous for general use, because the results instance might still need
to access the model's endog until enough results attributes (mainly resid,
I would guess) have been cached.
For other models it would be even more fragile, and it wouldn't help in
nonlinear models like GLM anyway. Even the GLS/WLS classes, which whiten
endog and exog in a data-dependent way, need to recompute wexog and pinv.

Replacing endog/wendog is pretty safe in a loop where we take the required
results attributes and then let the results instance go out of scope.
One main application would be a residual bootstrap, which, however, we
still don't have.
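The residual-bootstrap loop described here could look roughly like the following sketch, which uses the unofficial wendog swap from this thread (the data, names, and the number of replications are made up for illustration; as noted, this relies on internals and is not a supported API):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=100)})
data["y"] = 1.0 + 2.0 * data["x"] + rng.normal(size=100)

model = smf.ols("y ~ x", data)
results = model.fit()
# take what we need from the original fit before touching the model
fitted = results.fittedvalues.to_numpy()
resid = results.resid.to_numpy()

boot_params = []
for _ in range(200):
    # resample residuals and rebuild a synthetic response
    model.wendog = fitted + rng.choice(resid, size=resid.size, replace=True)
    res_b = model.fit()  # reuses the cached pinv of exog, as discussed above
    # take only params, then let res_b go out of scope (per the caveat above)
    boot_params.append(res_b.params.to_numpy())
boot_params = np.asarray(boot_params)
```

Only params is extracted inside the loop, which keeps the results instance from ever needing the (already replaced) endog.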

One alternative would be a proper VectorizedOLS model class, which would be
less general than the MultivariateOLS we use for MANOVA (MultivariateOLS is
currently not a full model class):
https://github.com/statsmodels/statsmodels/issues/4771

Josef
Andrey Portnoy
2018-10-26 16:54:52 UTC
Permalink
Thank you for the thorough reply!

In terms of design, do you think it would make sense for results objects to have their own references to endog, instead of going through self.model?
j***@gmail.com
2018-10-26 19:27:05 UTC
Permalink
It would be possible, but I think it is not worth the extra code
complexity. There are not many use cases outside of the current
special case. I cannot come up with any other case.

Initially I thought we would have to make a copy of endog, but references
would work well, i.e. the results object would keep a reference to
model.endog; if that is replaced by assignment, we would still hold a
reference to the original, which wouldn't be garbage collected.
For the new-endog use case in OLS, I think the cached residual should
be enough, but whether that holds in the current implementation would
have to be verified.

The main design problem is that I want the core models to become more
consistent in behavior with each other, so that it becomes easier to
write generic/general extensions for them. There are inherent
statistical differences across models that we still have to work
around, but I would like to avoid "unnecessary" complications. For
example, OLS should be a standard model with all the extras. But
VectorizedOLS can be optimized for computational efficiency but
doesn't get all the additional goodies.
Another issue, from 2015, was partially opened as a counterpoint to making
the models more consistent and writing metaclasses or mixins to add
generic extensions to them:
https://github.com/statsmodels/statsmodels/issues/2203
Development of those special models is slow, with a few exceptions like the
simplified WLS used in GLM and RLM or some special-case helper functions,
because the main interest of contributors and maintainers has been in other
areas.

Josef
Andrey Portnoy
2018-11-13 18:22:34 UTC
Permalink
Thank you!

Just linking a related issue: https://github.com/statsmodels/statsmodels/issues/4771.
j***@gmail.com
2018-11-14 03:20:03 UTC
Permalink
I needed a break from watching home repairs and penalized splines.
So, there is now an initial version in
https://github.com/statsmodels/statsmodels/pull/5382
It has the main results, including a vectorized t_test, but it inherits
from the linear model classes, which include additional methods that won't
work (yet).

It still needs work on API and inheritance, and it needs additional
methods, but I wanted to see how difficult it would be to get the core
statistics part to work.

Josef
Andrey Portnoy
2018-11-14 04:09:23 UTC
Permalink
I’d be happy to work on this.
j***@gmail.com
2018-11-14 04:38:40 UTC
Permalink
That would be great.
The main thing is to check what you need for your use cases and which
other results attributes work correctly or don't.
This is relatively easy given that we have the looped OLS as a comparison.
(I'm not sure yet whether it's better to inherit from RegressionResults,
which includes many methods, or to subclass LikelihoodModelResults and
copy over all the relevant regression results methods.)

from_formula and predict will most likely work with at most minor
adjustments; the other Wald and F tests will most likely require work,
i.e. copying and adjusting as I did for the t_test.
I'm not sure what to do about the summary method, because there are a lot
of vectorized statistics for the top table, and if you run a few hundred
regressions at the same time, a full summary table might be a bit too much
to read.

There might be additional model-specific helper methods depending on the
use case, like multiple-testing p-value correction for the t_test. Those
could be added directly to the results class, or plots instead of long
summary tables, to get a quick overview of the results.
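For the multiple-testing correction mentioned here, statsmodels already has a generic helper, statsmodels.stats.multitest.multipletests, that could be applied to the stacked p-values from a vectorized t_test. A small sketch (the p-values below are made up):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# hypothetical p-values collected from many per-endog t-tests
pvals = np.array([0.001, 0.02, 0.04, 0.2, 0.5])

# Benjamini-Hochberg FDR control across the whole family of tests
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```

multipletests returns the boolean rejection decisions and the corrected p-values, which map back one-to-one onto the grid of regressions.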

Josef
Andrey Portnoy
2018-11-14 05:37:44 UTC
Permalink
My use cases are mostly extraction and aggregation of betas plus influence and fit diagnostics. The setting involves running many regressions, with a different endog each time but almost always the same exog. The shape of the data can be thought of as a 2D grid where each cell holds a dataset of the same shape.

So far the solution has been to wrap a pandas dataframe of results objects in a big grid object that forwards attribute access to the cells. So a call to `grid.rsquared`, say, results in thousands of calls to the individual results objects and produces their output as a dataframe of the expected shape.

This is general enough and flexible, but wildly inefficient, both computationally and in terms of memory usage (we are running out of memory on a machine with half a terabyte of RAM).

So I’m most interested in working on replicating those attributes of RegressionResults and OLSResults that a) can be vectorized, and b) benefit from the exog being repeated.
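Since the exog is shared, the core of such a vectorization is one pseudoinverse applied to a whole matrix of responses. A pure-numpy sketch of the idea (this is not the statsmodels API; shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, m = 100, 3, 500            # n obs, k regressors, m distinct endogs
X = rng.normal(size=(n, k))      # the shared exog
Y = X @ rng.normal(size=(k, m)) + rng.normal(size=(n, m))

pinv = np.linalg.pinv(X)         # computed once, reused for every endog
params = pinv @ Y                # (k, m): one column of betas per endog
resid = Y - X @ params
sigma2 = (resid ** 2).sum(axis=0) / (n - k)    # per-endog error variance

# per-endog standard errors from the shared (X'X)^{-1} diagonal
xtx_inv_diag = np.diag(np.linalg.inv(X.T @ X))
bse = np.sqrt(np.outer(xtx_inv_diag, sigma2))  # (k, m)
```

This produces every cell's betas and standard errors in a couple of matrix products instead of thousands of individual fits, which is what the grid-of-results wrapper is paying for.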

Regarding subclassing: my initial take is that many attributes might not be amenable to vectorization and so would require a for loop. Maybe that should be left to the user? To subclass RegressionResults is to enter a binding contract to implement all attributes and methods in vectorized form, right? Maybe subclassing LikelihoodModelResults and cherry-picking from RegressionResults would be an easier contract to honor.

Summary tables as in OLSResults don’t make much sense to me once we leave single-endog territory.

Plots and multiple test p-value corrections sound like great ideas for examination of results.

Andrey.
Post by Andrey Portnoy
Would be happy to work on this.
That would be great.
The main thing is to check what you need for your use cases and to check which other results attributes work correctly or don't work.
This is relatively easy given that we have the looped OLS as comparison. (I'm not sure yet whether it's better to inherit from RegressionResults which includes many methods or to subclass LikelihoodModelResults and copy all relevant regression results methods.)
from_formula and predict work most likely with at most minor adjustments, other wald and F tests will most likely require work, i.e. copy and adjusts as I did for the t_test.
I'm not sure what to do about the summary method because there are a lot of vectorized statistics for the top-table, and if you run a few hundred regression at the same time, then a full summary table might be a bit too much to read.
There might be additional model specific helper methods depending on the usecase, like multiple testing p-value correction for the t_test. Those could be directly added to the results class, or plots instead of long summary tables to get a quick overview of the results.
Josef
j***@gmail.com
2018-11-14 14:08:33 UTC
Permalink
Post by Andrey Portnoy
My use cases are mostly extraction and aggregation of betas and influence
and fit diagnostics. The setting involves running many regressions with
different endog’s each time, but almost always repeating exog’s. The shape
of the data can be thought of as being a 2D grid where each cell holds a
dataset of the same shape.
Influence diagnostics would need to be vectorized separately. The hat matrix in OLS is independent of endog, so there will be considerable computational savings compared to looping over OLS. My guess is that outlier diagnostics can be vectorized with a 2-dim resid.
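A minimal numpy sketch of that vectorization (synthetic data, variable names of my own choosing): the leverage comes from exog alone, so it is computed once, while the residual-based parts broadcast over a 2-dim endog:

```python
import numpy as np

rng = np.random.default_rng(0)
nobs, k_exog, k_endog = 50, 3, 4
X = np.column_stack([np.ones(nobs), rng.standard_normal((nobs, k_exog - 1))])
Y = rng.standard_normal((nobs, k_endog))      # many endog columns, shared exog

pinv = np.linalg.pinv(X)                      # computed once, reused for every endog
h = np.einsum('ij,ji->i', X, pinv)            # hat-matrix diagonal, independent of endog
resid = Y - X @ (pinv @ Y)                    # 2-dim residuals, one column per endog
scale = (resid ** 2).sum(axis=0) / (nobs - k_exog)
# internally studentized residuals, vectorized across all endog at once
resid_student = resid / np.sqrt(scale * (1.0 - h)[:, None])
```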
Post by Andrey Portnoy
So far the solution has been to wrap a pandas dataframe of results objects
in a big grid object that passes on attribute access to the cells. So a
call to `grid.rsquared`, say, results in thousands of calls to the
individual results objects, and produces the output of these calls as a
dataframe of expected shape.
This is general enough and flexible, but wildly inefficient, both
computationally and in terms of memory usage (we are running out of memory
on a machine with half a terabyte of RAM).
The problem is that there is a large trade-off in this case between lazy computing on demand, which we favor in our models, and memory consumption if many models are estimated.
In general it is better just to keep the relevant results statistic in one loop and let the models be immediately garbage collected. Another option is `remove_data`, which we added for cases when we don't need the attached datasets/nobs arrays any more.
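With statsmodels that loop would fit `OLS` and pull out e.g. `res.params` and `res.rsquared` before dropping the results instance; here is a plain-numpy stand-in (synthetic data) for the same keep-only-what-you-need pattern:

```python
import numpy as np

rng = np.random.default_rng(1)
nobs = 100
X = np.column_stack([np.ones(nobs), rng.standard_normal((nobs, 2))])
endogs = rng.standard_normal((nobs, 500))    # one column per regression

params, rsquared = [], []
for j in range(endogs.shape[1]):
    y = endogs[:, j]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    yc = y - y.mean()
    # keep only the small summaries; the per-fit arrays go out of
    # scope and are garbage collected on the next iteration
    params.append(beta)
    rsquared.append(1.0 - resid @ resid / (yc @ yc))

params = np.asarray(params)        # shape (500, 3)
rsquared = np.asarray(rsquared)    # shape (500,)
```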
Post by Andrey Portnoy
So I’m most interested in working on replicating those attributes of
RegressionResults and OLSResults that a) can be vectorized, and b) benefit
from the exog being repeated.
Regarding subclassing: my initial take is that many attributes might not
be amenable to vectorization, and so would require a for loop. Maybe that
should be left to the user? To subclass RegressionResults is to enter a
binding contract to implement all attributes and calls in vectorized form,
right? Maybe subclassing LikelihoodModelResults and cherry picking from
RegressionResults would be an easier contract to honor.
For subclassing we will see later how "bad" it is. Most attributes can be vectorized, but not some options or methods. Also, I started to add some vectorization to the existing RegressionResults class, but those should not slow down the standard univariate endog OLS case. But those were mainly adding an axis argument to reduce operations.
Post by Andrey Portnoy
To subclass RegressionResults is to enter a binding contract to implement
all attributes and calls in vectorized form, right?

If there are only a few methods or attributes that are not supported, then
those could be replaced by NotImplementedErrors or nans.
(quantile_regression.QuantRegResults is doing this because it wrongly subclasses RegressionResults, which we realized only long after the implementation. QuantReg should have been an M-estimator like robust.RLM, not a least squares subclass.)

I also think we should not add options like cov_type that require sandwich computation to OLSVectorized. Those are endog specific, and if users want those as well, then the savings compared to a full loop will be much smaller relative to the total time.

Some things could be vectorized by switching to 3-dim arrays, but that would increase memory consumption by a large amount and would only work for small datasets: nobs x k_exog x k_endog, or even nobs x k_exog x k_exog x k_endog
(unless we put those in a numba/cython loop)
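A quick back-of-the-envelope check (the sizes here are made up) of why those 3- and 4-dim intermediate arrays get expensive:

```python
import numpy as np

nobs, k_exog, k_endog = 10_000, 10, 1_000
itemsize = np.dtype(np.float64).itemsize             # 8 bytes per element

grid = nobs * k_endog * itemsize                     # the 2-dim endog grid itself
cube = nobs * k_exog * k_endog * itemsize            # nobs x k_exog x k_endog
hyper = nobs * k_exog * k_exog * k_endog * itemsize  # nobs x k_exog x k_exog x k_endog

print(grid / 1e9, cube / 1e9, hyper / 1e9)           # 0.08, 0.8 and 8.0 GB
```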

Josef
Post by Andrey Portnoy
Summary tables as in OLSResults don’t make much sense to me as soon as we
leave the single endog territory.
Plots and multiple test p-value corrections sound like great ideas for
examination of results.
Andrey.
j***@gmail.com
2018-11-14 15:30:37 UTC
Permalink
Post by j***@gmail.com
I also think we should not add options like cov_type that require sandwich
computation to OLSVectorized. Those are endog specific and if users want to
have those also then savings compared to a full loop will be much smaller
relative to the total time.
Maybe there would still be some incentive to make this computationally efficient. If each endog is a time series, then there might be serial correlation, and the usual nonrobust standard errors are likely not appropriate. If the noise variance depends on some explanatory exog variables, then heteroscedasticity robustness would be needed.
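For what it's worth, the per-endog sandwiches themselves can be stacked with one einsum (a sketch on synthetic data; this also makes the k_endog x k_exog x k_exog storage cost visible):

```python
import numpy as np

rng = np.random.default_rng(2)
nobs, k_exog, k_endog = 200, 3, 10
X = rng.standard_normal((nobs, k_exog))
Y = rng.standard_normal((nobs, k_endog))

pinv = np.linalg.pinv(X)                 # (k_exog, nobs), shared across endog
resid = Y - X @ (pinv @ Y)               # (nobs, k_endog)
u2 = resid ** 2
# HC0 sandwich for each endog column j: pinv @ diag(u2[:, j]) @ pinv.T,
# stacked into a (k_endog, k_exog, k_exog) array in one shot
cov_params = np.einsum('ki,ij,li->jkl', pinv, u2, pinv)
```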

Or maybe not: even if we can compute the sandwiches, we would have different cov_params across endog that don't just differ in scale, and we wouldn't be able to avoid storing and keeping track of the full `k_endog x k_exog x k_exog` cov_params array (and we would lose a major advantage of OLSVectorized compared to a full MultivariateOLS).

Josef
Andrey Portnoy
2018-11-14 17:50:42 UTC
Permalink
Logistically how do you think I should contribute? Should I make my pull requests against vectorized_ols in josef-pkt/statsmodels?
j***@gmail.com
2018-11-14 18:12:23 UTC
Permalink
Post by Andrey Portnoy
Logistically how do you think I should contribute? Should I make my pull
requests against vectorized_ols in josef-pkt/statsmodels?
Yes to the second, that would be easiest for now.
I would like to check a few more things, and maybe look at your first
feedback for it. But then I think you can take over completely and open a
PR against master, and I will do mainly review and merging.
If we need more Wald test conversions, then it might be faster if I do
them, given that I'm very familiar with those parts of statsmodels.
Josef
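[Editor's note: the scale-only covariance structure that makes a vectorized t_test cheap can be sketched in plain NumPy. This is illustrative only, not the statsmodels implementation: with a shared exog, the nonrobust cov_params across endog share one inv(X'X) factor and differ only by each endog's residual scale, so standard errors and t-values for all endog come from a single pass.]

```python
import numpy as np

rng = np.random.default_rng(1)
nobs, k_exog, k_endog = 200, 4, 10
exog = np.column_stack([np.ones(nobs), rng.standard_normal((nobs, k_exog - 1))])
endog = rng.standard_normal((nobs, k_endog))

pinv_exog = np.linalg.pinv(exog)
betas = pinv_exog @ endog                      # (k_exog, k_endog)
resid = endog - exog @ betas                   # (nobs, k_endog)
df_resid = nobs - k_exog
scale = (resid ** 2).sum(axis=0) / df_resid    # per-endog error variance

# nonrobust cov_params = scale * inv(X'X); the inv(X'X) factor is shared,
# and for full-rank exog, pinv @ pinv.T equals inv(X'X)
xtx_inv_diag = np.diag(pinv_exog @ pinv_exog.T)
bse = np.sqrt(np.outer(xtx_inv_diag, scale))   # (k_exog, k_endog)
tvalues = betas / bse                          # vectorized t statistics
```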
Andrey Portnoy
2018-11-14 18:14:01 UTC
Permalink
Understood, thank you.