Discussion:
[pystatsmodels] Accessing predicted endog
d***@cimenviro.com
2018-11-14 22:10:59 UTC
Hi

I'm an R user moving to Python, so I've been learning the statsmodels API.
I'm having difficulty accessing the endog data after doing a prediction,

i.e.

ols = smf.ols("np.log(demand) ~ temperature", data=train_data).fit()

y_train_pred = ols.predict(train_data)
y_train = ols.model.endog
print('train error (rmse): \n', np.round(np.sqrt(ols.mse_resid), 2))

y_test_pred = ols.predict(test_data)
y_test = ols.model.endog
print('test error (rmse): \n',
np.round(np.sqrt(smt.eval_measures.mse(y_test, y_test_pred)), 2))

In the last block, y_test is incorrect, as ols.model still contains the
original training endog. I've deliberately transformed endog in the example
to show why it's important to access the transformed endog rather than use
test_data.demand, but it also matters if ols.predict() drops any rows,
e.g. due to NAs.

In R, for example, predict() returns a predict object containing the endog /
exog and fitted values for the original data, and predict(newdata) updates
endog / exog and fitted. Is there some way of doing this with statsmodels?

Regards
David
j***@gmail.com
2018-11-15 00:16:58 UTC
Post by d***@cimenviro.com
Hi
I'm an R user moving to Python, so I've been learning the statsmodels API.
I'm having difficulty accessing the endog data after doing a prediction,
i.e.
ols = smf.ols("np.log(demand) ~ temperature", data=train_data).fit()
y_train_pred = ols.predict(train_data)
y_train = ols.model.endog
print('train error (rmse): \n', np.round(np.sqrt(ols.mse_resid), 2))
y_test_pred = ols.predict(test_data)
y_test = ols.model.endog
print('test error (rmse): \n',
np.round(np.sqrt(smt.eval_measures.mse(y_test, y_test_pred)), 2))
In the last block, y_test is incorrect, as ols.model still contains the
original training endog. I've deliberately transformed endog in the example
to show why it's important to access the transformed endog rather than use
test_data.demand, but it also matters if ols.predict() drops any rows,
e.g. due to NAs.
In R, for example, predict() returns a predict object containing the endog /
exog and fitted values for the original data, and predict(newdata) updates
endog / exog and fitted. Is there some way of doing this with statsmodels?
If I understand correctly (and I have never looked at how R handles your
last question):

statsmodels currently has no general support for endog transformations,
except in a few models where the model itself does the transformation.
The models receive the processed endog and exog from patsy when the formula
interface is used. We reuse patsy for the exog transformation in predict,
but predict does not know anything about what the corresponding observed
endog would be, given that the new data is not in the training data.

Also statsmodels never updates the training endog and exog to include new
observations. (Exceptions to this might be in some time series models that
allow updating and extending the dataset.)

Given that there is no specific support for endog transformation, the
easiest common approach is for the user to create a new variable, e.g.
log_demand or log_price, use it as a regular variable, and apply the
transformation before splitting into train and test (or apply the same
transformation to both).
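As a minimal sketch of that workaround (made-up data; the `demand` /
`temperature` names are taken from the example above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data in the spirit of the example above
rng = np.random.default_rng(0)
df = pd.DataFrame({"temperature": rng.uniform(10, 35, 200)})
df["demand"] = np.exp(0.05 * df["temperature"] + rng.normal(0, 0.1, 200))

# Transform endog explicitly, before splitting, so train and test
# see exactly the same transformation
df["log_demand"] = np.log(df["demand"])
train, test = df.iloc[:150], df.iloc[150:]

res = smf.ols("log_demand ~ temperature", data=train).fit()
y_test_pred = res.predict(test)   # predictions on the log scale
y_test = np.log(test["demand"])   # observed endog, same transform
rmse = np.sqrt(np.mean((y_test - y_test_pred) ** 2))
```

Because log_demand is an ordinary column, there is no hidden LHS
transformation for predict to miss.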

I don't know whether it would even be possible to support this at the moment.
Transformations are handled by patsy, and I have never looked at what
information is available for endog. AFAIK patsy just `eval`uates the LHS
expression and might not know what the function does, but there might be a
way to evaluate the same function on new data.
patsy is increasing the formula information that is available about the
design matrix, and we are slowly adding more support for it, but there is
nothing about endog transformations yet.

Issues about endog transformation have shown up several times, but it is
not clear what we should support and how. There are some starts in time
series models, which need to be linear. For non-time-series models there is
the alternative of using a GLM link function, which avoids the endog
transformation and instead models E(y|x) = exp(x b), as in Poisson.

(A somewhat related aside: most of the time it's easier to figure out what
can be supported in general by starting with specific models. For example,
log transformation of endog shows up in parametric survival models
(accelerated failure time models). In that case there is an option to either
look at the Gaussian model for log Y or at the log-normal model for Y.
That's still only at an experimental stage.)


Josef

Post by d***@cimenviro.com
Regards
David
d***@cimenviro.com
2018-11-15 02:48:59 UTC
Thanks Josef

I'll follow your suggestion to explicitly transform the endog, although
it's nice to be able to do it implicitly.

I just want to confirm one thing: this is how I've been using sklearn +
patsy. Can you confirm this is how statsmodels operates? I believe
build_design_matrices() is important, as cr() is a stateful transform
<https://patsy.readthedocs.io/en/latest/stateful-transforms.html>.

Regards
David

y, X = dmatrices("np.log(demand) ~ cr(temperature, df=3)", data)
lm = LinearRegression().fit(X, y)
lm.predict(X)

...

X = build_design_matrices([X.design_info], newdata)[0]
lm.predict(X)

Note: if newdata is a test / dev set (i.e. it contains a demand column),
you can do the following to transform newdata to X and y

y, X = build_design_matrices([y.design_info, X.design_info], newdata)
lm.predict(X)
j***@gmail.com
2018-11-15 05:10:08 UTC
Post by d***@cimenviro.com
Thanks Josef
I'll follow your suggestion to explicitly transform the endog, although
it's nice to be able to do it implicitly.
I just want to confirm one thing: this is how I've been using sklearn +
patsy. Can you confirm this is how statsmodels operates? I believe
build_design_matrices() is important, as cr() is a stateful transform
<https://patsy.readthedocs.io/en/latest/stateful-transforms.html>.
Yes, all splines are stateful transformations, and we store the design_info
in the model.data instance so we can reuse it in predict.
If users use the formula interface to the model, patsy handles the
stateful transforms and statsmodels just delegates in predict.
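A small sketch of that delegation (made-up data; the knots cr() estimates
on the training data are reused automatically in predict):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up training data
rng = np.random.default_rng(2)
train = pd.DataFrame({"temperature": rng.uniform(10, 35, 100)})
train["demand"] = 2.0 + 0.1 * train["temperature"] + rng.normal(0, 0.1, 100)

# cr() is a stateful patsy transform: its knots are computed from the
# training data and stored with the design_info in model.data
res = smf.ols("demand ~ cr(temperature, df=3)", data=train).fit()

# predict() rebuilds the design matrix for the new data from the stored
# design_info, so the same knots are applied without any user bookkeeping
new = pd.DataFrame({"temperature": [15.0, 25.0]})
pred = res.predict(new)
```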

One consequence is that, to "interpret" the design matrix (or the endog
transformation in this case), we only have the information that patsy
provides to us in the `design_info`.
This is now pretty good for categorical factors, and we also reuse it in
some Wald tests. There is some information about the splines in design_info,
but it is not used in the GAM PR.
And there is no way for statsmodels to know that the endog transformation
is log, so we cannot use our log-link support for it.

A related problem is that the code in patsy is focused on constructing the
design matrix, which is not enough for us to support extras. If we need to
provide statistical support, for example for a log-transformed variable,
then we need additional methods similar to the ones available in our link
classes, specifically inverse transforms and derivatives.
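For illustration, the link classes already carry that kind of extra
information; a quick sketch with the log link:

```python
import numpy as np
from statsmodels.genmod.families import links

log_link = links.Log()

mu = np.array([0.5, 1.0, 2.0])
eta = log_link(mu)             # forward transform: log(mu)
back = log_link.inverse(eta)   # inverse transform: exp(eta)
slope = log_link.deriv(mu)     # derivative of the link: 1 / mu
```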
As another example, GAM wraps or copies code from patsy's splines because
we need to add derivatives and similar methods to support penalization.

In R, formulas have a much longer history and are much more tightly
integrated with the models, while our direct interface to models is
DataFrames, numpy arrays, or anything `array_like`. In the array case we
just get numbers and have no information at all about whether there is
structure behind that data, e.g.
https://github.com/statsmodels/statsmodels/issues/5342
One more example: it would be very helpful to provide some automatic
support for polynomials in the design matrix, i.e. multiple columns that
are a transformation of some underlying explanatory variable. Right now
there is no way for the models to get that information in a general way,
and it will take design and implementation work before we can add those
features.
(GAM requires that penalized splines are provided as a separate argument
and not as part of the design matrix, so we know exactly which splines to
support.)
Post by d***@cimenviro.com
Regards
David
y, X = dmatrices("np.log(demand) ~ cr(temperature, df=3)", data)
lm = LinearRegression().fit(X, y)
lm.predict(X)
...
X = build_design_matrices([X.design_info], newdata)[0]
lm.predict(X)
Note: if newdata is a test / dev set (i.e. it contains a demand column),
you can do the following to transform newdata to X and y
y, X = build_design_matrices([y.design_info, X.design_info], newdata)
lm.predict(X)
We use dmatrix directly:

exog = dmatrix(design_info, exog, return_type="dataframe")

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/model.py#L997

and I used it the same way in a GAM example that doesn't have formula
support yet, at In [35] in
https://gist.github.com/josef-pkt/453de603b019143e831fbdd4dfb6aa30
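A minimal sketch of that dmatrix pattern (made-up data; the variable names
are illustrative):

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Made-up training data
rng = np.random.default_rng(3)
data = pd.DataFrame({"temperature": rng.uniform(10, 35, 50)})

# Build the design matrix once on the training data ...
X = dmatrix("cr(temperature, df=3)", data, return_type="dataframe")

# ... then reuse its design_info on new data, the same pattern as in
# statsmodels' base/model.py
new = pd.DataFrame({"temperature": [12.0, 30.0]})
X_new = dmatrix(X.design_info, new, return_type="dataframe")
```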

AFAIK, we still use the code that was initially added by Skipper and
Nathaniel, and I just keep copying the same pattern.

I didn't know about using it for endog as well, because that is not used in
the statsmodels code.

Josef
