Discussion:
[pystatsmodels] Problems with MLR in statsmodel
Darryl Campbell
2018-12-05 13:47:21 UTC
Hi,

I have been running a multiple linear regression analysis using statsmodels,
and I am trying to build a formula from the results so that I can use it to
make predictions in another programming language.

Before I do that, I have been trying to get it to work in Excel. However,
using the same test data as the MLR analysis, I am unable to build a
formula that makes the same (or even remotely similar) predictions as the
statsmodels model does.
[image: Untitled.png]

The regression formula I have been using to predict the results is as
follows:

0.0112 - 0.1085*X1 + 0.9035*X2 - 0.0567*X3 + 0.0588*X4 + 0.0531*X5 + 0.1489*X6 - 0.1652*X7

Can anybody help me with what I am doing wrong please?

And is there a way that I can get python to output the formula that it
uses to make the predictions?

Any help would be very much appreciated.

As a sidenote - I am rather inexperienced with python in general and the
pasted code below was a template given to me through a course I bought on
Udemy.

-----

For reference, here is the code I am using too:

# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# importing dataset
dataset = pd.read_csv('Test_02.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 14].values
# encoding categorical data
"""from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
X[:, 4] = labelencoder_X.fit_transform(X[:, 4])
onehotencoder = OneHotEncoder(categorical_features = [4])
X = onehotencoder.fit_transform(X).toarray()"""
# avoid dummy variable trap
"""X = X[:, 1:]"""
# splitting test data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
# feature scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_Y = StandardScaler()
Y_train = sc_Y.fit_transform(Y_train)
Y_test = sc_Y.transform(Y_test)"""
# Fitting multiple linear regression to training data
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
# Predicting the results
Y_pred = regressor.predict(X_test)
# Plotting the results into a scatter plot
plt.scatter(Y_test,Y_pred, color = 'red')
plt.plot(Y_test, regressor.predict(X_test), color = 'blue')
plt.title('Correlation to Open Price')
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.show()
# Backward elimination
import statsmodels.api as sm  # OLS lives in statsmodels.api, not formula.api
# add an intercept column of ones (one per row, rather than hardcoding 4996)
X = np.append(arr = np.ones((X.shape[0], 1)).astype(int), values = X, axis = 1)
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]]
regressor_ols = sm.OLS(endog = Y, exog = X_opt).fit()
print(regressor_ols.summary())
X_opt = X[:, [0, 1, 2, 5, 6, 7, 12, 13]]
regressor_ols = sm.OLS(endog = Y, exog = X_opt).fit()
print(regressor_ols.summary())
j***@gmail.com
2018-12-05 15:01:02 UTC
On Wed, Dec 5, 2018 at 9:41 AM Darryl Campbell <
Post by Darryl Campbell
Hi,
I have been running multiple linear regression analysis using statsmodel,
and I am trying to build a formula from the results so that I can use it to
make predictions in another programming language.
Before I do that, I have been trying to get it to work in Excel. However,
using the same test data as the MLR analysis has used I am unable to make a
formula that makes the same (or even remotely similar) predictions as the
model in statsmodel does.
[image: Untitled.png]
The regression formula I have been using to predict the results is as
0.0112 - 0.1085*X1 + 0.9035*X2 - 0.0567*X3 + 0.0588*X4 + 0.0531*X5 + 0.1489*X6 - 0.1652*X7
This looks correct.

For OLS or for linear models the prediction is just `exog dot
results.params` which is what you have after unvectorizing.
Post by Darryl Campbell
Can anybody help me with what I am doing wrong please?
My guess is that you have a mistake in creating the new X.
You should first check that your transformation pipeline recreates the
transformations used in the training sample.
E.g. what I often do:
Take the first, say, 5 observations and recreate the X array for them,
including all transformations.
Then compare the prediction for these with results.fittedvalues[:5] or
results.fittedvalues.iloc[:5].

aside: "stateful transforms"
When the data is transformed before estimating a model, the
transformation might depend on properties and statistics of the original
data.
The transformation for new explanatory variables has to be based on the
same statistics; otherwise we are not using the same kind of data as was
used for the estimation.
Statsmodels does this automatically, with the help of patsy, when formulas
are used. But the user is responsible for handling transformations when
formulas are not used, because then the models only see the
transformed data and know nothing about any preprocessing.
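A small sketch of the formula case (made-up data): patsy's `standardize()` is a stateful transform, so `predict()` on new data automatically reuses the mean and std stored from the training data.

```python
# patsy stateful transform: standardize(x) remembers training statistics.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0], "x": [10.0, 20.0, 30.0, 40.0]})
res = smf.ols("y ~ standardize(x)", data=df).fit()

# x=50 is scaled with the TRAINING mean/std, not statistics of the new
# data, so the exact linear fit extrapolates correctly.
pred = res.predict(pd.DataFrame({"x": [50.0]}))
assert np.allclose(pred.iloc[0], 5.0)
```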

Below you are using scikit-learn transformations, but I don't know whether
your code handles stateful transforms correctly, e.g. uses the same mean
and standard deviation when standardizing the new data.

If you do the prediction in another package like Excel or another
programming language, then you have to make sure that you include the code
for the appropriate transformation pipeline.

I hope that helps. If this is not the source of your problem, then you
might need to make a replicable example, including how you do the
prediction.
Post by Darryl Campbell
And is there a way that I can get python to output the formula that it
uses to make the predictions?
Except for `x dot beta` as above, there is no separate formula for predict.

Josef
[quoted text snipped]
j***@gmail.com
2018-12-05 15:09:18 UTC
Post by j***@gmail.com
[earlier reply snipped]
If you do the prediction in another package like Excel or another
programming language, then you have to make sure that you include the code
for the appropriate transformation pipeline.
To mention an alternative: if the transformations are linear, as in a
standard scaler, then it is possible to transform the parameters to
account for this instead of transforming the new data.
But again, there is nothing statsmodels can do in this case, because the
transformation and data preprocessing happen outside of statsmodels, and
the models know nothing about them.
(For transformations that are performed inside statsmodels, our models
can compensate. I added this, for example, to fit_constrained, where we
internally use a linearly transformed model but the user gets the
parameterization, including predict, in terms of the original explanatory
variables.)

Josef
[quoted text snipped]
Darryl Campbell
2018-12-05 16:05:17 UTC
Thanks ever so much for your help.

I think you hit the nail on the head with your response!

I had been using the original values of Xn without taking into account how
they might have been transformed.

Now I'm totally confused, but I think I can figure things out with a little
work. I hope :)
David Waterworth
2018-12-05 23:12:20 UTC
As an aside, sklearn deals with stateful transforms by using a two-step
process implemented with the functions fit() and transform(). For example,
StandardScaler.fit() is called on the training data to calculate and store the
mean and std, and StandardScaler.transform() actually scales the data. The
function StandardScaler.fit_transform(X) is simply a shortcut for
StandardScaler.fit(X).transform(X).
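A tiny sketch of that two-step API on made-up numbers:

```python
# fit() stores the training statistics; transform() applies them to any data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[7.0]])

sc = StandardScaler()
sc.fit(X_train)                   # stores mean_ = 3.0 and scale_
Z_test = sc.transform(X_test)     # test data scaled with the TRAINING statistics

assert np.isclose(sc.mean_[0], 3.0)
# fit_transform is just fit followed by transform on the same data:
assert np.allclose(sc.fit_transform(X_train), sc.fit(X_train).transform(X_train))
```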

Also note that, as far as I'm aware, you don't normally scale y; if you
start using sklearn pipelines, it isn't actually supported. And don't scale
your categorical variables: that has the effect of removing the intercept.

On Thu, 6 Dec 2018 at 03:05, Darryl Campbell <
[quoted text snipped]