Darryl Campbell
2018-12-05 13:47:21 UTC
Hi,
I have been running a multiple linear regression analysis using statsmodels,
and I am trying to build a formula from the results so that I can use it to
make predictions in another programming language.
Before I do that, I have been trying to get it to work in Excel. However,
using the same test data that the MLR analysis used, I am unable to build a
formula that makes the same (or even remotely similar) predictions as the
statsmodels model does.
[image: Untitled.png]
The regression formula I have been using to predict the results is as
follows:
0.0112+-0.1085*X1+0.9035*X2+-0.0567*X3+0.0588*X4+0.0531*X5+0.1489*X6+-0.1652*X7
Can anybody help me with what I am doing wrong, please?
And is there a way that I can get Python to output the formula that it
uses to make the predictions?
Any help would be very much appreciated.
As a side note, I am rather inexperienced with Python in general, and the
code pasted below was a template given to me through a course I bought on
Udemy.
-----
For reference, here is the code I am using:
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# importing dataset
dataset = pd.read_csv('Test_02.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 14].values
# encoding categorical data
"""from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
X[:, 4] = labelencoder_X.fit_transform(X[:, 4])
onehotencoder = OneHotEncoder(categorical_features = [4])
X = onehotencoder.fit_transform(X).toarray()"""
# avoid dummy variable trap
"""X = X[:, 1:]"""
# splitting test data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
# feature scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_Y = StandardScaler()
Y_train = sc_Y.fit_transform(Y_train)
Y_test = sc_Y.transform(Y_test)"""
# Fitting multiple linear regression to training data
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
# Predicting the results
Y_pred = regressor.predict(X_test)
# Plotting the results into a scatter plot
plt.scatter(Y_test,Y_pred, color = 'red')
plt.plot(Y_test, regressor.predict(X_test), color = 'blue')
plt.title('Correlation to Open Price')
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.show()
# Backward elimination
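# OLS is exposed by statsmodels.api in current releases, not statsmodels.formula.api
import statsmodels.api as sm
# prepend a column of ones for the intercept, sized to the data instead of hard-coded
X = np.append(arr = np.ones((X.shape[0], 1)).astype(int), values = X, axis = 1)
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]]
regressor_ols = sm.OLS(endog = Y, exog = X_opt).fit()
print(regressor_ols.summary())
X_opt = X[:, [0, 1, 2, 5, 6, 7, 12, 13]]
regressor_ols = sm.OLS(endog = Y, exog = X_opt).fit()
print(regressor_ols.summary())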
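On the question of getting Python to output the formula, here is a minimal sketch of one possible approach, not a definitive answer. It assumes synthetic stand-in data (the real Test_02.csv is not shown) and uses NumPy's least-squares solver in place of statsmodels so that it runs on its own. With statsmodels, the same coefficients should be available as regressor_ols.params. The names features, kept, coefs, formula and manual are hypothetical. The sketch labels each term with its original column index in X, because after the selection [0, 1, 2, 5, 6, 7, 12, 13] the seven coefficients belong to columns 1, 2, 5, 6, 7, 12 and 13, not to the first seven columns.

```python
import numpy as np

# Synthetic stand-in for the dataset (assumption: 13 features plus an intercept column)
rng = np.random.default_rng(0)
n_rows, n_features = 500, 13
features = rng.normal(size=(n_rows, n_features))
X = np.column_stack([np.ones(n_rows), features])  # column 0 is the intercept
true_coefs = rng.normal(size=n_features + 1)
Y = X @ true_coefs + rng.normal(scale=0.01, size=n_rows)

# Same column selection as the second OLS fit in the code above
kept = [0, 1, 2, 5, 6, 7, 12, 13]
X_opt = X[:, kept]

# Ordinary least squares via NumPy; with statsmodels the coefficients
# would be read from regressor_ols.params instead
coefs, *_ = np.linalg.lstsq(X_opt, Y, rcond=None)

# Build the formula text, naming each term by its ORIGINAL column index in X
terms = [f"{coefs[0]:.4f}"]
terms += [f"({c:.4f})*X[{k}]" for c, k in zip(coefs[1:], kept[1:])]
formula = " + ".join(terms)
print(formula)

# Evaluate the formula by hand, as a spreadsheet would, and compare it
# with the matrix product of the selected columns and the coefficients
manual = coefs[0] + sum(c * X[:, k] for c, k in zip(coefs[1:], kept[1:]))
print(np.allclose(manual, X_opt @ coefs))
```

If the hand-built formula is fed columns other than the ones the model was fitted on, or is compared against predictions from a different model (for example the scikit-learn regressor fitted on the training split with all features), the numbers would not be expected to match.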