[#6303] systemfit() fails with singularities

2016-03-13 22:29
Submitted by:
Liviu Andronic (landroni)
Assigned to:
Arne Henningsen (arne)
Operating System:
systemfit() fails with singularities

Detailed description
When using systemfit() with "2SLS" on my data I encountered an issue that seems to be related to the presence of singularities. When controlling for both individual and time fixed effects (using dummies), some groups don't have sufficient data for estimation so they can end up as NA. (This can be troublesome enough to make it complicated to detect and address beforehand.)

Here's a reproducible example (modified from Wooldridge) that showcases the issue:
mroz <- read.dta("";)

# generate vars for linear dependency
x.lin.dep <- sample(0:1, 753, replace = TRUE)
mroz[ , "fact1"] <- x.lin.dep
mroz[ , "fact2"] <- !x.lin.dep

# restrict to non-missing wage observations
oursample <- subset(mroz, !

# check linear dependency with lm
x <- lm(log(wage)~educ+exper+I(exper^2)+fact1+fact2, data=oursample)
## [..]
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## [..]
## fact1 0.0398240 0.0646545 0.616 0.53826
## fact2TRUE NA NA NA NA

# check linear dependency with systemfit
aut.2SLS.lindep <- systemfit(log(wage)~educ+exper+I(exper^2)+fact1+fact2, "2SLS",
~motheduc+fatheduc+exper+I(exper^2)+fact1+fact2 , data=oursample)
## t test of coefficients:
## Estimate Std. Error t value Pr(>|t|)
## eq1_(Intercept) 0.74338378 NA NA NA
## eq1_educ 0.06087732 0.03154511 1.9298 0.054295 .
## eq1_exper 0.04407168 0.01345892 3.2745 0.001146 **
## eq1_I(exper^2) -0.00089489 0.00040248 -2.2235 0.026714 *
## eq1_fact1 -0.66290571 NA NA NA
## eq1_fact2TRUE -0.71093750 NA NA NA
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

It seems to me that systemfit() gets confused as it outputs 6 coefficients, when one of them should be missing since it represents a linear dependency. Sometimes I also get a warning message from coeftest():
"Warning message:
In sqrt(diag(se)) : NaNs produced"


Message  ↓
Date: 2016-03-20 15:21
Sender: Liviu Andronic

Thanks for looking into this!

As you mention, it seems like the matrix package works very hard around rounding issues and ends up believing in near-singularity when it deals in fact with exact singularity (when accounting for rounding quirks). I agree that it's not a good idea to silently remove a regressor in case of exact singularity, but perhaps doing so with a warning should prove more useful in many cases.

As mentioned before, lm() silently drops regressors (but clearly identifies undefined coefficients in coef() and summary() output). The plm() fun will also silently drop regressors, although my local version issues a warning and I'll probably re-enable it in SVN plm, too:

# also 2SLS IV
# my local version issues a warning when dropping coefficients because of singularities
oursample.p <- pdata.frame(oursample, index="city")
| motheduc+fatheduc+exper+I(exper^2)+fact1+fact2
, model="pooling", data=oursample.p)
## Model Formula: log(wage) ~ educ + exper + I(exper^2) + fact1 + fact2 | motheduc +
## fatheduc + exper + I(exper^2) + fact1 + fact2
## <environment: 0xd198288>
## Coefficients:
## (Intercept) educ exper I(exper^2) fact1
## 0.03244628 0.06087732 0.04407168 -0.00089489 0.04803179
## Warning message:
## In mylm(y, X, W) :
## Coefficient(s) 'fact2TRUE' could not be estimated and is (are) dropped.

The next version of AER, kindly fixed by Achim, will also work around singularity issues in ivreg() in the same style as lm():

| motheduc+fatheduc+exper+I(exper^2)+fact1+fact2 , data=oursample)
## Call:
## ivreg(formula = log(wage) ~ educ + exper + I(exper^2) + fact1 + fact2 | motheduc + fatheduc + exper + I(exper^2) + fact1 + fact2, data = oursample)
## Coefficients:
## (Intercept) educ exper I(exper^2) fact1 fact2TRUE
## 0.0324463 0.0608773 0.0440717 -0.0008949 0.0480318 NA

IMO the most useful way to address this would be to mirror lm(): add a `singular.ok` argument which when FALSE outputs an error in case of exact singularity, and when TRUE drops a coefficient while issuing a warning. This way the user has a clear idea on why the estimation has failed, or where a potential issue lies when the estimation proceeds. In my experience perfect collinearity issues can prove very confusing to end-users (and indeed, very hard to track down unless you're very experienced), and the lm() way to dealing with this---coupled with a warning---seems to me like the optimal approach.

Date: 2016-03-20 14:15
Sender: Arne Henningsen

Thanks a lot for reporting this problem and for providing a reproducible example!

The huge or even NA standard errors of the intercept and the regressors fact1 and fact2 indicate that there is very high multicollinearity between these three regressors. The matrix of the regressors should be exactly singular but due to rounding errors it seems to be not exactly singular so that it can be inverted and the estimates of the coefficients can be obtained.

When estimating the model by OLS, the estimation terminates with an error message that (correctly) claims that the matrix of the regressors is (nearly) singular:

systemfit(log(wage)~educ+exper+I(exper^2)+fact1+fact2, "OLS", data=oursample)
Error in LU.dgC(a) : cs_lu(A) failed: near-singular A (or out of memory)

When instructing systemfit() not to use the "matrix" package to estimate the model by 2SLS, the estimation terminates with an error message that (correctly) claims that the matrix of the regressors is (exactly) singular:

systemfit(log(wage)~educ+exper+I(exper^2)+fact1+fact2, "2SLS",~motheduc+fatheduc+exper+I(exper^2)+fact1+fact2 , data=oursample, useMatrix = FALSE )
Error in solve.default(crossprod(zMatEq[[i]]), crossprod(zMatEq[[i]], :
Lapack routine dgesv: system is exactly singular: U[7,7] = 0

I don't think that it is a good idea to silently remove some of the regressors in case of too high multicollinearity, particularly in case of a system estimation. Therefore, I suggest that systemfit() returns an informative error message if there is "too high" multicollinearity, where the question remains how to define "too high" multicollinearity. Any thoughts?

Attached Files:


Field Old Value Date By
assigned_tonone2016-03-20 14:16arne
HardwareNone2016-03-20 14:16arne
Operating SystemNone2016-03-20 14:16arne
SeverityNone2016-03-20 14:16arne
Thanks to:
Vienna University of Economics and Business Powered By FusionForge