Discussion:
[pystatsmodels] quantreg - operands could not be broadcast together with shapes (3,) (2,)
HPa
2018-11-15 16:58:59 UTC
import pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame([
{'Y': 1, 'A' : 2, 'B' : 3},
{'Y': 4, 'A' : 5, 'B' : 6}])

mod = smf.quantreg('Y ~ A + B', data=df)
res = mod.fit(q=0.5) # The error message comes from here

This gives me the error message:
ValueError: operands could not be broadcast together with shapes (3,) (2,)

If I remove A or B, I don't get the error.
I get the same error on my PC and in Azure Notebooks.

I cannot see what the problem is. Can anybody help?
j***@gmail.com
2018-11-16 06:36:43 UTC
Post by HPa
res = mod.fit(q=0.5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "m:\...\statsmodels\regression\quantile_regression.py", line 179, in fit
    diff = np.max(np.abs(beta - beta0))
ValueError: operands could not be broadcast together with shapes (3,) (2,)

This is https://github.com/statsmodels/statsmodels/issues/2597
Your design matrix exog has fewer rows than columns and, therefore, does
not have full column rank.

QuantReg needs a full-rank, non-singular design matrix. That requirement
will still hold even after the bug that raises this unhelpful exception is fixed.
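
As a rough illustration of the rank argument, the design matrix for 'Y ~ A + B' can be written out by hand and checked with np.linalg.matrix_rank. This is only a sketch: the array below is typed in manually (Intercept, A, B) rather than produced by the formula machinery, and the variable names are made up for the example.

```python
import numpy as np

# Hand-built design matrix for 'Y ~ A + B' using the two rows from the
# example; the columns are Intercept, A, B.
exog = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 5.0, 6.0],
])

n_rows, n_cols = exog.shape
rank = np.linalg.matrix_rank(exog)

# Full column rank requires rank == n_cols, which is impossible when
# there are fewer rows than columns.
has_full_column_rank = rank == n_cols
print(n_rows, n_cols, rank, has_full_column_rank)
```

With two rows the rank can be at most 2, so three columns can never be linearly independent. Dropping A or B leaves two columns, which matches the observation that the error disappears in that case.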

Josef
HPa
2018-11-16 08:41:52 UTC
Thanks Josef,

Actually, in my original dataset, where the problem appears, I have ~6000
rows and four features.
I guess I simplified the problem too much.

Looking at the discussion in issues/2597, I think my problem may be
multicollinearity.

BR Hannu P
j***@gmail.com
2018-11-16 08:48:10 UTC
It is perfect collinearity (up to the numerical threshold in
np.linalg.matrix_rank), so I guess one of your columns is redundant and
needs to be dropped.
statsmodels doesn't drop collinear columns automatically.
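
One possible way to find a redundant column by hand is to add columns one at a time and watch whether np.linalg.matrix_rank increases. The sketch below assumes a plain NumPy design matrix; redundant_columns is a hypothetical helper written for this example, not a statsmodels function, and the data are synthetic.

```python
import numpy as np

def redundant_columns(exog):
    """Hypothetical helper: indices of columns that do not raise the
    rank of the columns kept so far (scanning left to right)."""
    exog = np.asarray(exog, dtype=float)
    kept, redundant = [], []
    for j in range(exog.shape[1]):
        candidate = kept + [j]
        if np.linalg.matrix_rank(exog[:, candidate]) == len(candidate):
            kept.append(j)
        else:
            redundant.append(j)
    return redundant

# Synthetic data: the last column is an exact linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
exog = np.column_stack([np.ones(50), x1, x2, x1 + x2])

redundant = redundant_columns(exog)
print(redundant)
```

Which column gets flagged depends on the column order, since any one of a collinear set can be dropped. The decision about which to keep is a modelling choice.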

Josef
HPa
2018-11-16 09:19:47 UTC
My mistake, problem solved. Thanks Josef.
Below is a brief description; maybe it helps somebody else avoid the same
problem.

///

I have a feature V with values in the range [118..125].
I expected it to have quadratic behaviour, so without thinking it through
I simply added a feature V2 = V ** 2.

Because the V values cover a small range far from the origin, the V**2
values are almost linear in V, and thus corr(V, V2) is ~1.0
=> multicollinearity
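
The effect can be reproduced with a small NumPy sketch. It assumes V takes the integer values 118 to 125 (the real data are not shown in this thread), and it also shows centering V before squaring as one common way to reduce the correlation. Centering is an addition for illustration, not something taken from the original dataset.

```python
import numpy as np

# Assumed stand-in for the real feature: integer values 118..125.
V = np.arange(118, 126, dtype=float)
V2 = V ** 2

# Correlation between V and its square on the raw scale.
corr_raw = np.corrcoef(V, V2)[0, 1]

# Center V first, then square: the range is symmetric around the mean,
# so the linear and quadratic terms become uncorrelated.
Vc = V - V.mean()
corr_centered = np.corrcoef(Vc, Vc ** 2)[0, 1]

# Condition numbers of the design matrices [1, V, V**2] and [1, Vc, Vc**2].
cond_raw = np.linalg.cond(np.column_stack([np.ones_like(V), V, V2]))
cond_centered = np.linalg.cond(np.column_stack([np.ones_like(Vc), Vc, Vc ** 2]))

print(corr_raw, corr_centered)
print(cond_raw, cond_centered)
```

The centered design spans the same column space as the raw one, so the fitted quadratic is the same; only the parameterization and the numerical conditioning change.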

The error message could be more descriptive, as discussed in issues/2597.

BR Hannu P
j***@gmail.com
2018-11-16 12:56:48 UTC
Interesting case: when we only have two points in the support, they
are collinear with any transformation (up to floating-point noise).
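
A small sketch of the two-point case, assuming a made-up variable x that only takes the values 2 and 5: the line through the two points (x, x**2) reproduces the square exactly, so [1, x, x**2] has rank 2.

```python
import numpy as np

# Made-up variable with only two distinct values in its support.
x = np.array([2.0, 5.0, 2.0, 5.0, 5.0, 2.0])
x_squared = x ** 2

# Line through the two points (lo, lo**2) and (hi, hi**2).
lo, hi = np.unique(x)
slope = (hi ** 2 - lo ** 2) / (hi - lo)
intercept = lo ** 2 - slope * lo

# On a two-point support the square is an exact affine function of x.
is_affine = np.allclose(x_squared, intercept + slope * x)

exog = np.column_stack([np.ones_like(x), x, x_squared])
rank = np.linalg.matrix_rank(exog)
print(is_affine, rank)
```

The same construction works for any other transformation in place of the square, because two points always lie on a line.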

I ran into something similar before when I squared dummy variables without
realizing it. It didn't cause incorrect results, however, because I used
matrix_rank and pinv, which still behave correctly in the singular design
case (for that use case):
https://github.com/statsmodels/statsmodels/issues/1061
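
To illustrate why pinv can still give usable results with a singular design, here is a sketch with made-up data. It fits least squares with np.linalg.pinv on a design that contains a dummy variable and its square, and compares the fitted values with a full-rank fit that omits the duplicate column. This is not the code from the linked issue, only a minimal reconstruction of the idea.

```python
import numpy as np

# Made-up dummy variable and response.
d = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
y = np.array([1.0, 3.0, 2.0, 4.0, 5.0, 0.0])

# Singular design: d ** 2 is identical to d for a 0/1 dummy.
exog_singular = np.column_stack([np.ones_like(d), d, d ** 2])
# Full-rank design without the duplicate column.
exog_full = np.column_stack([np.ones_like(d), d])

# pinv returns the minimum-norm least-squares solution.
beta_pinv = np.linalg.pinv(exog_singular) @ y
beta_full, *_ = np.linalg.lstsq(exog_full, y, rcond=None)

fitted_pinv = exog_singular @ beta_pinv
fitted_full = exog_full @ beta_full
print(np.linalg.matrix_rank(exog_singular))
print(np.allclose(fitted_pinv, fitted_full))
```

The individual coefficients on d and d ** 2 are not identified in the singular design, but the fitted values are, because both designs span the same column space.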

There was also a recent discussion in scipy about the correlation coefficient
and its p-value when we only have two points:
https://github.com/scipy/scipy/issues/7730#issuecomment-433062911

Josef