Help File for REGHDFE.ADO
Title
reghdfe -- Linear regression absorbing multiple levels of fixed effects
Syntax
reghdfe depvar [indepvars] [if] [in] [weight] , absorb(absvars) [options]
Options                     Description
Model
* absorb(absvars)           categorical variables that identify the fixed effects to be absorbed
  absorb(..., savefe)       save all fixed effect estimates with the __hdfe* prefix
  noabsorb                  only absorb the constant; alternative to regress that supports multi-way clustering
  residuals(newvar)         save residuals; predict, d requires this option
  summarize(stats)          equivalent to the postestimation command estat summarize, but more flexible, faster, and saves results in e(summarize)
SE/Robust
+ vce(vcetype [, opt])      vcetype may be unadjusted (default), robust or cluster fvvarlist (allowing two- and multi-way clustering)
Diagnostic
  verbose(#)                amount of debugging information to show (0=None, 1=Some, 2=More, 3=Parsing/convergence details, 4=Every iteration)
  timeit                    show elapsed times by stage of computation
Optimization
+ tolerance(#)              criterion for convergence (default=1e-8)
  maxiterations(#)          maximum number of iterations (default=10,000); if set to missing (.) it will run for as long as it takes
  poolsize(#)               apply the within algorithm in groups of # variables (default 10); a large poolsize is usually faster but uses more memory
  acceleration(str)         acceleration method; options are conjugate_gradient (cg), steep_descent (sd), aitken (a), lsmr (with diagonal preconditioner), and none (no)
  transform(str)            transform operation that defines the type of alternating projection; options are Kaczmarz (kac), Cimmino (cim), Symmetric Kaczmarz (sym). This is ignored with LSMR acceleration
  prune                     prune vertices of degree 1; acts as a preconditioner that is useful if the underlying network is very sparse
  cond                      compute the finite condition number; will only run successfully with few fixed effects (because it computes the eigenvalues of the graph Laplacian)
Speedup Tricks
  cache(save [,opt])        absorb all variables without regressing (destructive; combine it with preserve/restore); suboption keep(varlist) adds additional untransformed variables to the resulting dataset
  cache(use)                run regressions on cached data; vce() must be the same as with cache(save)
  cache(clear)              delete Mata objects to clear up memory; no more regressions can be run after this
  nosample                  will not create e(sample), saving some space and speed
Degrees-of-Freedom Adjustments
  dofadjustments(list)      allows selecting the desired adjustments for degrees of freedom; rarely used
  groupvar(newvar)          unique identifier for the first mobility group
Reporting
  version                   reports the version number and date of reghdfe, and the list of required packages; standalone option
  level(#)                  set confidence level; default is level(95)
  display_options           control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling; particularly useful are the noomitted and noempty options to hide regressors omitted due to collinearity
Undocumented
  keepsingletons            do not drop singleton groups
  old                       will call the latest 3.x version of reghdfe instead (see the old help file)
  rre(varname)              where varname is the residual of a previous regression of y against only the FEs
  check                     compile lreghdfe.mlib if it does not exist or if it needs to be updated; use reghdfe,compile to force an update
  update                    update reghdfe and dependencies from the respective Github repositories; use reghdfe,reload to do so from c:\git\*
* either absorb(absvars) or noabsorb is required.  
+ indicates a recommended or important option.  
the regression variables may contain time-series operators and factor variables; the dependent variable cannot be of the form i.turn, but 42.turn is allowed
fweights, aweights and pweights are allowed; see weight.
Absvar Syntax
absvar                      Description
  i.varname                 categorical variable to be absorbed (the i. prefix is tacit)
  i.var1#i.var2             absorb the interactions of multiple categorical variables
  i.var1#c.var2             absorb heterogeneous slopes, where var2 has a different slope coef. depending on the category of var1
  var1##c.var2              equivalent to "i.var1 i.var1#c.var2", but much faster
  var1##c.(var2 var3)       multiple heterogeneous slopes are allowed together. Alternative syntax: var1##(c.var2 c.var3)
  v1#v2#v3##c.(v4 v5)       factor operators can be combined
To save the estimates of specific absvars, write newvar=absvar.
Please be aware that in most cases these estimates are neither consistent nor econometrically identified.  
Using categorical interactions (e.g. x#z) is faster than running egen group(...) beforehand.
Singleton obs. are dropped iteratively until no more singletons are found (see ancillary article for details).
Slope-only absvars ("state#c.time") have poor numerical stability and slow convergence. If you need those, either i) tighten the tolerance (i.e. set a smaller value than the default) or ii) use slope-and-intercept absvars ("state##c.time"), even if the intercept is redundant. For instance, if absvar is "i.zipcode i.state##c.time" then i.state is redundant given i.zipcode, but convergence will still be much faster.
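The absvar forms in the table above can be tried on the auto dataset shipped with Stata. This is only a sketch: the new variable name fe_rep is a hypothetical choice, and the regressions are picked for illustration rather than for econometric sense.

```stata
sysuse auto, clear
* Categorical interaction absorbed directly, no egen group() needed
reghdfe price weight length, absorb(turn#foreign)
* Slope-and-intercept absvar: a separate intercept and a separate
* slope on weight for each level of foreign
reghdfe price length, absorb(foreign##c.weight)
* Save the estimates of one specific absvar (fe_rep is a hypothetical name)
reghdfe price weight length, absorb(fe_rep=rep78 foreign)
```

After the last command, fe_rep should hold the saved estimates for rep78 only; the caveat about identification of these estimates still applies.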
Description
reghdfe is a generalization of areg (and xtreg,fe, xtivreg,fe) for multiple levels of fixed effects (including heterogeneous slopes), alternative estimators (2sls, gmm2s, liml), and additional robust standard errors (multi-way clustering, HAC standard errors, etc).
Additional features include:
 A novel and robust algorithm to efficiently absorb the fixed effects (extending the work of Guimaraes and Portugal, 2010).
 Coded in Mata, which in most scenarios makes it even faster than areg and xtreg for a single fixed effect (see benchmarks on the Github page).
 Can save the point estimates of the fixed effects (caveat emptor: the fixed effects may not be identified, see the references).
 Calculates the degrees-of-freedom lost due to the fixed effects (note: beyond two levels of fixed effects, this is still an open problem, but we provide a conservative approximation).
 Iteratively removes singleton groups by default, to avoid biasing the standard errors (see ancillary document).
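To see why singleton removal has to be iterative, the following sketch reproduces the idea by hand with plain Stata commands. It is an illustration of the principle, not the routine reghdfe uses internally; the choice of turn and trunk as the two grouping variables is arbitrary.

```stata
* Sketch: iteratively drop singleton groups for two grouping variables
sysuse auto, clear
local ndrop = 1
while `ndrop' > 0 {
    * group sizes under the current sample
    bysort turn: generate long n_turn = _N
    bysort trunk: generate long n_trunk = _N
    quietly count if n_turn == 1 | n_trunk == 1
    local ndrop = r(N)
    quietly drop if n_turn == 1 | n_trunk == 1
    drop n_turn n_trunk
}
display "observations kept: " _N
```

Dropping an observation that is a singleton in one variable can shrink a group of the other variable down to a single observation, which is why the loop repeats until a pass drops nothing.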
Options
Model and Miscellanea
absorb(absvars)
list of categorical variables (or interactions) representing the fixed effects to be absorbed. This is equivalent to including an indicator/dummy variable for each category of each absvar. absorb() is required.
To save a fixed effect, prefix the absvar with "newvar=". For instance, the option absorb(firm_id worker_id year_coefs=year_id) will include firm, worker and year fixed effects, but will only save the estimates for the year fixed effects (in the new variable year_coefs).
If you want to predict afterwards but don't care about setting the names of each fixed effect, use the savefe suboption. This will delete all variables named __hdfe*__ and create new ones as required. Example: reghdfe price weight, absorb(turn trunk, savefe)
residuals(newvar) will save the regression residuals in a new variable.
residuals (without parentheses) saves the residuals in the variable _reghdfe_resid.
This option does not require additional computations, and is required for subsequent calls to predict, d.
summarize(stats) will report and save a table of summary statistics of the regression variables (including the instruments, if applicable), using the same sample as the regression.
summarize (without parentheses) saves the default set of statistics: mean min max.
The complete list of accepted statistics is available in the tabstat help. The most useful are count range sd median p##.
The summary table is saved in e(summarize).
To save the summary table silently (without showing it after the regression table), use the quietly suboption. You can use it by itself (summarize(,quietly)) or with custom statistics (summarize(mean, quietly)).
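As a short illustration of the summarize() option, the sketch below requests two statistics silently and then lists the saved table. It assumes, as stated above, that the table is stored in e(summarize).

```stata
sysuse auto, clear
* Save mean and sd of the regression variables without printing the table
reghdfe price weight length, absorb(rep78) summarize(mean sd, quietly)
* Inspect the saved table
matrix list e(summarize)
```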
suboptions(...)
options that will be passed directly to the regression command (either regress, ivreg2, or ivregress)
SE/Robust
vce(vcetype, subopt)
specifies the type of standard error reported. Note that all the advanced estimators rely on asymptotic theory, and will likely have poor performance with small samples (but again, if you are using reghdfe, that is probably not your case).
unadjusted/ols estimates conventional standard errors, valid even in small samples under the assumptions of homoscedasticity and no correlation between observations.
robust estimates heteroscedasticity-consistent standard errors (Huber/White/sandwich estimators), but still assuming independence between observations.
Warning: in a FE panel regression, using robust will lead to inconsistent standard errors if, for every fixed effect, the other dimension is fixed. For instance, in a standard panel with individual and time fixed effects, we require both the number of individuals and the number of time periods to grow asymptotically. If that is not the case, an alternative may be to use clustered errors, which as discussed below will still have their own asymptotic requirements. For a discussion, see Stock and Watson, "Heteroskedasticity-robust standard errors for fixed-effects panel-data regression," Econometrica 76 (2008): 155-174.
cluster clustervars estimates consistent standard errors even when the observations are correlated within groups.
Multi-way clustering is allowed. Thus, you can indicate as many clustervars as desired (e.g. allowing for intragroup correlation across individuals, time, country, etc).
Each clustervar permits interactions of the type var1#var2 (this is faster than using egen group() for a one-off regression).
Warning: The number of clusters, for all of the cluster variables, must go off to infinity. A frequent rule of thumb is that each cluster variable must have at least 50 different categories (the number of categories for each clustervar appears on the header of the regression table).
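The three clustering forms described above can be sketched on the nlswork dataset. The variable choices are for illustration only; in particular, year has far fewer than 50 categories, so the two-way example shows the syntax rather than a recommended specification.

```stata
webuse nlswork, clear
* One-way clustering
reghdfe ln_wage ttl_exp tenure, absorb(idcode year) vce(cluster idcode)
* Two-way clustering (syntax illustration; year has few categories)
reghdfe ln_wage ttl_exp tenure, absorb(idcode year) vce(cluster idcode year)
* Clustering on an interaction, without egen group()
reghdfe ln_wage ttl_exp tenure, absorb(idcode year) vce(cluster ind_code#year)
* Smallest number of clusters across the cluster variables
display e(N_clust)
```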
The following suboptions require either the ivreg2 or the avar package from SSC. For a careful explanation, see the ivreg2 help file, from which the comments below borrow.
unadjusted, bw(#) (or just , bw(#)) estimates autocorrelation-consistent standard errors (Newey-West).
robust, bw(#) estimates autocorrelation-and-heteroscedasticity consistent standard errors (HAC).
cluster clustervars, bw(#) estimates standard errors consistent to common autocorrelated disturbances (Driscoll-Kraay). At most two cluster variables can be used in this case.
, kiefer estimates standard errors consistent under arbitrary intragroup autocorrelation (but not heteroskedasticity) (Kiefer).
kernel(str) is allowed in all the cases that allow bw(#). The default kernel is bar (Bartlett). Valid kernels are Bartlett (bar); Truncated (tru); Parzen (par); Tukey-Hanning (thann); Tukey-Hamming (thamm); Daniell (dan); Tent (ten); and Quadratic-Spectral (qua or qs).
Advanced suboptions:
, suite(default,mwc,avar) overrides the package chosen by reghdfe to estimate the VCE. default uses the default Stata computation (allows unadjusted, robust, and at most one cluster variable). mwc allows multi-way clustering (any number of cluster variables), but without the bw and kernel suboptions. avar uses the avar package from SSC. It is the same package used by ivreg2, and allows the bw, kernel, dkraay and kiefer suboptions. This is useful almost exclusively for debugging.
, twicerobust will compute robust standard errors not only on the first but on the second step of the gmm2s estimation. Requires ivsuite(ivregress), but will not give the exact same results as ivregress.
Explanation: When running instrumental-variable regressions with the ivregress package, robust standard errors, and a gmm2s estimator, reghdfe will translate vce(robust) into wmatrix(robust) vce(unadjusted). This maintains compatibility with ivreg2 and other packages, but may be inadvisable, as described in ivregress (technical note). Specifying this option will instead use wmatrix(robust) vce(robust).
However, computing the second-step vce matrix requires computing updated estimates (including updated fixed effects). Since reghdfe currently does not allow this, the resulting standard errors will not be exactly the same as with ivregress. This issue is similar to applying the CUE estimator, described further below.
Note: The above comments are also applicable to clustered standard errors.
IV/2SLS/GMM
estimator(2sls|gmm2s|liml|cue)
estimator used in the instrumental-variable estimation
2sls (two-stage least squares, default), gmm2s (two-stage efficient GMM), liml (limited-information maximum likelihood), and cue ("continuously-updated" GMM) are allowed.
Warning: cue will not give the same results as ivreg2. See the discussion in Baum, Christopher F., Mark E. Schaffer, and Steven Stillman. "Enhanced routines for instrumental variables/GMM estimation and testing." Stata Journal 7.4 (2007): 465-506 (page 484). Note that even if this is not exactly cue, it may still be a desirable/useful alternative to standard cue, as explained in the article.
stages(list)
adds and saves up to four auxiliary regressions useful when running instrumental-variable regressions:
first: all first-stage regressions
ols: ols regression (between dependent variable and endogenous variables; useful as a benchmark)
reduced: reduced-form regression (ols regression with included and excluded instruments as regressors)
acid: an "acid" regression that includes both instruments and endogenous variables as regressors; in this setup, excluded instruments should not be significant.
You can pass suboptions not just to the iv command but to all stage regressions with a comma after the list of stages. Example:
reghdfe price (weight=length), absorb(turn) subopt(nocollin) stages(first, eform(exp(beta)) )
By default all stages are saved (see estimates dir). The suboption , nosave will prevent that. However, future replays will only replay the iv regression.
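A minimal sketch of stages() on the auto dataset follows. It assumes the version of reghdfe documented in this help file, with ivreg2 installed as the default IV backend; the choice of headroom as the excluded instrument is purely illustrative.

```stata
sysuse auto, clear
* IV regression plus the first-stage and reduced-form regressions
reghdfe price weight (length = headroom), absorb(rep78) stages(first reduced)
* List the stage regressions that were saved
estimates dir
```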
ffirst
compute and report first-stage statistics; requires the ivreg2 package.
These statistics will be saved on the e(first) matrix. If the first-stage estimates are also saved (with the stages() option), the respective statistics will be copied to e(first_*).
ivsuite(subcmd)
allows the IV/2SLS regression to be run either using ivregress or ivreg2.
ivreg2 is the default, but needs to be installed for that option to work.
Diagnostic
verbose(#)
orders the command to print debugging information.
Possible values are 0 (none), 1 (some information), 2 (even more), 3 (adds dots for each iteration, and reports parsing details), 4 (adds details for every iteration step).
For debugging, the most useful value is 3. For simple status reports, set verbose to 1.
timeit
shows the elapsed time at different steps of the estimation. Most time is usually spent on three steps: map_precompute(), map_solve() and the regression step.
Degrees-of-Freedom Adjustments
dofadjustments(doflist)
selects how the degrees-of-freedom, as well as e(df_a), are adjusted due to the absorbed fixed effects.
Without any adjustment, we would assume that the degrees-of-freedom used by the fixed effects is equal to the count of all the fixed effects (e.g. number of individuals + number of years in a typical panel). However, in complex setups (e.g. fixed effects by individual, firm, job position, and year), there may be a huge number of fixed effects collinear with each other, so we want to adjust for that.
Note: changing the default option is rarely needed, except in benchmarks, and to obtain a marginal speed-up by excluding the pairwise option.
all is the default and almost always the best alternative. It is equivalent to dof(pairwise clusters continuous).
none assumes no collinearity across the fixed effects (i.e. no redundant fixed effects). This is overly conservative, although it is the fastest method by virtue of not doing anything.
firstpair will exactly identify the number of collinear fixed effects across the first two sets of fixed effects (i.e. the first absvar and the second absvar). The algorithm used for this is described in Abowd et al (1999), and relies on results from graph theory (finding the number of connected subgraphs in a bipartite graph). It will not do anything for the third and subsequent sets of fixed effects.
For more than two sets of fixed effects, there are no known results that provide exact degrees-of-freedom as in the case above. One solution is to ignore subsequent fixed effects (and thus overestimate e(df_a) and underestimate the degrees-of-freedom). Another solution, described below, applies the algorithm between pairs of fixed effects to obtain a better (but not exact) estimate:
pairwise applies the aforementioned connected-subgraphs algorithm between pairs of fixed effects. For instance, if there are four sets of FEs, the first dimension will usually have no redundant coefficients (i.e. e(M1)==1), since we are running the model without a constant. For the second FE, the number of connected subgraphs with respect to the first FE will provide an exact estimate of the degrees-of-freedom lost, e(M2).
For the third FE, we do not know exactly. However, we can compute the number of connected subgraphs between the first and third G(1,3), and second and third G(2,3) fixed effects, and choose the higher of those as the closest estimate for e(M3). For the fourth FE, we compute G(1,4), G(2,4) and G(3,4) and again choose the highest for e(M4).
Finally, we compute e(df_a) = e(K1) - e(M1) + e(K2) - e(M2) + e(K3) - e(M3) + e(K4) - e(M4); where e(K#) is the number of levels or dimensions for the #-th fixed effect (e.g. number of individuals or years). Note that e(M3) and e(M4) are only conservative estimates and thus we will usually be overestimating the standard errors. However, given the sizes of the datasets typically used with reghdfe, the difference should be small.
Since the gain from pairwise is usually minuscule for large datasets, and the computation is expensive, it may be a good practice to exclude this option for speed-ups.
clusters will check if a fixed effect is nested within a clustervar. In that case, it will set e(K#)==e(M#) and no degrees-of-freedom will be lost due to this fixed effect. The rationale is that we are already assuming that the number of effective observations is the number of cluster levels. This is the same adjustment that xtreg, fe does, but areg does not use it.
continuous
Fixed effects with continuous interactions (i.e. individual slopes, instead of individual intercepts) are dealt with differently. In an i.categorical#c.continuous interaction, we will do one check: we count the number of categories where c.continuous is always zero. In an i.categorical##c.continuous interaction, we do the above check but replace zero for any particular constant. In the case where continuous is constant for a level of categorical, we know it is collinear with the intercept, so we adjust for it.
Additional methods, such as bootstrap, are also possible but not yet implemented. Some preliminary simulations done by the author showed a very poor convergence of this method.
groupvar(newvar)
name of the new variable that will contain the first mobility group. Requires pairwise, firstpair, or the default all.
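The degrees-of-freedom accounting described in this section can be inspected after estimation. The sketch below assumes the stored results are named e(K1), e(M1), e(K2), e(M2) and e(df_a) as documented in this help file; mob_group is a hypothetical variable name.

```stata
webuse nlswork, clear
reghdfe ln_wage ttl_exp tenure, absorb(idcode occ_code) dofadjustments(pairwise) groupvar(mob_group)
* Compare the reported e(df_a) with the sum of K# - M# over both absvars
local check = e(K1) - e(M1) + e(K2) - e(M2)
display "e(df_a)           = " e(df_a)
display "K1 - M1 + K2 - M2 = `check'"
* Mobility groups identified between the first two absvars
summarize mob_group
```

With two sets of fixed effects and no clustering, the two displayed numbers are expected to coincide, since the connected-subgraphs count is exact in that case.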
Speeding Up Estimation
reghdfe varlist [if] [in], absorb(absvars) cache(save) [options]
This will transform varlist, absorbing the fixed effects indicated by absvars. It is useful when running a series of alternative specifications with common variables, as the variables will only be transformed once instead of every time a regression is run.
It replaces the current dataset, so it is a good idea to precede it with a preserve command.
To keep additional (untransformed) variables in the new dataset, use the keep(varlist) suboption.
cache(use) is used when running reghdfe after a cache(save) operation. Both the absorb() and vce() options must be the same as when the cache was created (the latter because the degrees of freedom were computed at that point).
cache(clear) will delete the Mata objects created by reghdfe and kept in memory after the cache(save) operation. These objects may consume a lot of memory, so it is a good idea to clean up the cache. Additionally, if you previously specified preserve, it may be a good time to restore.
Example:
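sysuse auto
preserve
* Save the cache (the dataset is replaced by the transformed variables)
reghdfe price weight length, a(turn rep) vce(cluster turn) cache(save, keep(foreign))
* Run regressions; absorb() and vce() must match the cache(save) call
reghdfe price weight, a(turn rep) vce(cluster turn) cache(use)
reghdfe price length, a(turn rep) vce(cluster turn) cache(use)
* Clean up
reghdfe, cache(clear)
restore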
fast avoids saving e(sample) into the regression. Since saving the variable only involves copying a Mata vector, the speed-up is currently quite small. Future versions of reghdfe may change this as features are added.
Note that fast will be disabled when adding variables to the dataset (i.e. when saving residuals, fixed effects, or mobility groups), and is incompatible with most postestimation commands.
If you wish to use fast while reporting estat summarize, see the summarize option.
Optimization
tolerance(#)
specifies the tolerance criterion for convergence; default is tolerance(1e-8).
Note that for tolerances beyond 1e-14, the limits of double precision are reached and the results will most likely not converge.
At the other end, if the tolerance is not tight enough, the regression may not identify perfectly collinear regressors. However, those cases can be easily spotted due to their extremely high standard errors.
Warning: when absorbing heterogeneous slopes without the accompanying heterogeneous intercepts, convergence is quite poor and a tight tolerance is strongly suggested (i.e. a smaller value than the default). In other words, an absvar of var1##c.var2 converges easily, but an absvar of var1#c.var2 will converge slowly and may require a tighter tolerance.
maxiterations(#)
specifies the maximum number of iterations; the default is maxiterations(10000); set it to missing (.) to run forever until convergence.
poolsize(#)
Number of variables that are pooled together into a matrix that will then be transformed. The default is to pool variables in groups of 5. Larger groups are faster with more than one processor, but may cause out-of-memory errors. In that case, set poolsize to 1.
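The three options above can be combined in a single call. The sketch below uses the nlswork dataset; the specific values are arbitrary illustrations, not recommendations.

```stata
webuse nlswork, clear
* Tighter tolerance, a cap on iterations, and one variable per pool
reghdfe ln_wage ttl_exp tenure, absorb(idcode year) tolerance(1e-10) maxiterations(1000) poolsize(1)
* Heterogeneous slopes with their accompanying intercepts, as advised
* in the warning above, again with a tighter tolerance
reghdfe ln_wage ttl_exp tenure, absorb(year idcode##c.age) tolerance(1e-10)
```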
Advanced options:
acceleration(str)
allows for different acceleration techniques, from the simplest case of no acceleration (none), to steep descent (steep_descent or sd), Aitken (aitken), and finally Conjugate Gradient (conjugate_gradient or cg).
Note: Each acceleration is just a plug-in Mata function, so a larger number of acceleration techniques are available, albeit undocumented (and slower).
transform(str)
allows for different "alternating projection" transforms. The classical transform is Kaczmarz (kaczmarz), and more stable alternatives are Cimmino (cimmino) and Symmetric Kaczmarz (symmetric_kaczmarz).
Note: Each transform is just a plug-in Mata function, so a larger number of transforms are available, albeit undocumented (and slower).
Note: The default acceleration is Conjugate Gradient and the default transform is Symmetric Kaczmarz. Be wary that different accelerations often work better with certain transforms. For instance, do not use conjugate gradient with plain Kaczmarz, as it will not converge.
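To compare combinations of acceleration and transform, one option is to run the same specification twice with timeit. This is a sketch under the assumption that the full option names listed above are accepted as written; timings depend on the data and machine.

```stata
webuse nlswork, clear
* Default pairing, spelled out explicitly
reghdfe ln_wage ttl_exp tenure, absorb(idcode year) acceleration(conjugate_gradient) transform(symmetric_kaczmarz) timeit
* No acceleration with the classical transform, for comparison
reghdfe ln_wage ttl_exp tenure, absorb(idcode year) acceleration(none) transform(kaczmarz) timeit
```

Both runs should report the same coefficients up to the convergence tolerance; only the time spent absorbing the fixed effects is expected to differ.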
precondition
(currently disabled)
Reporting
level(#)
sets confidence level; default is level(95)
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation options.
Postestimation Syntax
Only estat summarize, predict and test are currently supported and tested.
estat summarize
Summarizes depvar and the variables described in _b (i.e. not the excluded instruments)
predict newvar [if] [in] [, statistic]
May require you to previously save the fixed effects (except for option xb). To see how, see the details of the absorb option.
Equation: y = xb + d_absorbvars + e
statistic                   Description
Main
  xb                        xb fitted values; the default
  xbd                       xb + d_absorbvars
  d                         d_absorbvars
  residuals                 residual
  score                     score; equivalent to residuals
  stdp                      standard error of the prediction (of the xb component)
although predict type newvar is allowed, the resulting variable will always be of type double.
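The decomposition y = xb + d_absorbvars + e can be checked directly. The sketch below assumes, as noted in the residuals() option, that saving the residuals is what enables predict, d; all new variable names are hypothetical.

```stata
sysuse auto, clear
reghdfe price weight length, absorb(rep78) residuals(res_hdfe)
predict double xb_hat, xb
predict double d_hat, d
predict double xbd_hat, xbd
* The three components should add back up to the dependent variable
generate double gap = price - (xb_hat + d_hat + res_hdfe)
summarize gap
```

The variable gap is missing outside the estimation sample (where the residual is missing) and should be numerically zero inside it.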
test
Performs significance tests on the parameters; see the Stata help.
suest
Do not use suest. It will run, but the results will be incorrect. See the workaround below.
If you want to perform tests that are usually run with suest, such as non-nested models, tests using alternative specifications of the variables, or tests on different groups, you can replicate it manually, as described here.
Possible Pitfalls and Common Mistakes
 (note: as of version 2.1, the constant is no longer reported) Ignore the constant; it doesn't tell you much. If you want to use descriptive stats, that's what the summarize() and estat summ commands are for. Even better, use noconstant to drop it (although it's not really dropped as it never existed in the first place!)
 Think twice before saving the fixed effects. They are probably inconsistent / not identified and you will likely be using them wrong.
 (note: as of version 3.0 singletons are dropped by default) It's good practice to drop singletons. dropsingleton is your friend.
 If you use vce(robust), be sure that your other dimension is not "fixed" but grows with N, or your SEs will be wrong.
 If you use vce(cluster ...), check that your number of clusters is high enough (50+ is a rule of thumb). If not, you are making the SEs even worse!
 The panel variables (absvars) should probably be nested within the clusters (clustervars) due to the within-panel correlation induced by the FEs. (this is not the case for *all* the absvars, only those that are treated as growing as N grows)
 If you use analytic or probability weights, you are responsible for ensuring that the weights stay constant within each unit of a fixed effect (e.g. individual), or that it is correct to allow varying weights for that case.
 Be aware that adding several HDFEs is not a panacea. The first limitation is that it only uses within variation (more than acceptable if you have a large enough dataset). The second and subtler limitation occurs if the fixed effects are themselves outcomes of the variable of interest (as crazy as it sounds). For instance, imagine a regression where we study the effect of past corporate fraud on future firm performance. We add firm, CEO and time fixed-effects (standard practice). This introduces a serious flaw: whenever a fraud event is discovered, i) future firm performance will suffer, and ii) a CEO turnover will likely occur. Moreover, after fraud events, the new CEOs are usually specialized in dealing with the aftershocks of such events (and are usually accountants or lawyers). The fixed effects of these CEOs will also tend to be quite low, as they tend to manage firms with very risky outcomes. Therefore, the regressor (fraud) affects the fixed effect (identity of the incoming CEO). Adding particularly low CEO fixed effects will then overstate the performance of the firm, and thus understate the negative effects of fraud on future firm performance.
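Two of the checks suggested above (weights constant within a fixed-effect unit, and enough clusters) can be done with plain Stata commands before estimation. This is a sketch: the weight variable w is hypothetical and constructed here only so that the check has something to run on.

```stata
webuse nlswork, clear
* Hypothetical weight, constant within idcode by construction
generate double w = 1 + mod(idcode, 3)
* Fails with an error if the weight varies within any idcode
bysort idcode (w): assert w[1] == w[_N]
* Number of categories of a prospective cluster variable
quietly tabulate ind_code
display "clusters in ind_code: " r(r)
```

If the displayed count falls well short of the 50+ rule of thumb, clustering on that variable is unlikely to be reliable.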
Missing Features
(If you are interested in discussing these or others, feel free to contact me)
Code, medium term:
 Complete GT preconditioning (v4)
 Improve algorithm that recovers the fixed effects (v5)
 Improve statistics and tests related to the fixed effects (v5)
 Implement a bootstrap option in DoF estimation (v5)
Code, long term:
 The interaction with cont vars (i.a#c.b) may suffer from numerical accuracy issues, as we are dividing by a sum of squares
 Calculate exact DoF adjustment for 3+ HDFEs (note: not a problem with cluster VCE when one FE is nested within the cluster)
 More postestimation commands (lincom? margins?)
Theory:
 Add a more thorough discussion on the possible identification issues
 Find out a way to use reghdfe iteratively with CUE (right now only OLS/2SLS/GMM2S/LIML give the exact same results)
 Not sure if I should add an F-test for the absvars in the vce(robust) and vce(cluster) cases. Discussion on e.g. areg (methods and formulas) and textbooks suggests not; on the other hand, there may be alternatives: A Heteroskedasticity-Robust F-Test Statistic for Individual Effects
Examples
Setup
sysuse auto
Simple case - one fixed effect
reghdfe price weight length, absorb(rep78)
As above, but also compute clustered standard errors
reghdfe price weight length, absorb(rep78) vce(cluster rep78)
Two and three sets of fixed effects
webuse nlswork
reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode year)
reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode year occ)
Advanced examples
Save the FEs as variables
reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(FE1=idcode FE2=year)
Report nested F-tests
reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode year) nested
Do AvgE instead of absorb() for one FE
reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode year) avge(occ)
reghdfe ln_w grade age ttl_exp tenure not_smsa south , absorb(idcode year) avge(AvgByOCC=occ)
Check that FE coefs are close to 1.0
reghdfe ln_w grade age ttl_exp tenure not_smsa , absorb(idcode year) check
Save first mobility group
reghdfe ln_w grade age ttl_exp tenure not_smsa , absorb(idcode occ) group(mobility_occ)
Factor interactions in the independent variables
reghdfe ln_w i.grade#i.age ttl_exp tenure not_smsa , absorb(idcode occ)
Interactions in the absorbed variables (notice that only the # symbol is allowed)
reghdfe ln_w grade age ttl_exp tenure not_smsa , absorb(idcode#occ)
Interactions in both the absorbed and AvgE variables (again, only the # symbol is allowed)
reghdfe ln_w grade age ttl_exp not_smsa , absorb(idcode#occ) avge(tenure#occ)
IV regression
sysuse auto
reghdfe price weight (length=head), absorb(rep78)
reghdfe price weight (length=head), absorb(rep78) first
reghdfe price weight (length=head), absorb(rep78) ivsuite(ivregress)
Factorial interactions
reghdfe price weight (length=head), absorb(rep78)
reghdfe price weight length, absorb(rep78 turn##c.price)
Stored results
reghdfe stores the following in e():
Note: it also keeps most e() results placed by the regression subcommands (ivreg2, ivregress)
Scalars
  e(N)                      number of observations
  e(N_hdfe)                 number of absorbed fixed-effects
  e(tss)                    total sum of squares
  e(rss)                    residual sum of squares
  e(r2)                     R-squared
  e(r2_a)                   adjusted R-squared
  e(r2_within)              Within R-squared
  e(r2_a_within)            Adjusted Within R-squared
  e(df_a)                   degrees of freedom lost due to the fixed effects
  e(rmse)                   root mean squared error
  e(ll)                     log-likelihood
  e(ll_0)                   log-likelihood of fixed-effect-only regression
  e(F)                      F statistic
  e(F_absorb)               F statistic for absorbed effect (note: currently disabled)
  e(rank)                   rank of e(V)
  e(N_clustervars)          number of cluster variables
  e(clust#)                 number of clusters for the #th cluster variable
  e(N_clust)                number of clusters; minimum of e(clust#)
  e(K#)                     Number of categories of the #th absorbed FE
  e(M#)                     Number of redundant categories of the #th absorbed FE
  e(mobility)               Sum of all e(M#)
  e(df_m)                   model degrees of freedom
  e(df_r)                   residual degrees of freedom
Macros
  e(cmd)                    reghdfe
  e(subcmd)                 either regress, ivreg2 or ivregress
  e(model)                  ols, iv, gmm2s, liml or cue
  e(cmdline)                command as typed
  e(dofmethod)              dofmethod employed in the regression
  e(depvar)                 name of dependent variable
  e(indepvars)              names of independent variables
  e(endogvars)              names of endogenous right-hand-side variables
  e(instruments)            names of excluded instruments
  e(absvars)                name of the absorbed variables or interactions
  e(title)                  title in estimation output
  e(clustvar)               name of cluster variable
  e(clustvar#)              name of the #th cluster variable
  e(vce)                    vcetype specified in vce()
  e(vcetype)                title used to label Std. Err.
  e(stage)                  stage within an IV-regression; only if stages() was used
  e(properties)             b V
Matrices
  e(b)                      coefficient vector
  e(V)                      variance-covariance matrix of the estimators
Functions
  e(sample)                 marks estimation sample
Author
Sergio Correia
Fuqua School of Business, Duke University
Email: sergio.correia@duke.edu
User Guide
A copy of this help file, as well as a more in-depth user guide, is in development and will be available at "http://scorreia.com/reghdfe".
Latest Updates
reghdfe is updated frequently, and upgrades or minor bug fixes may not be immediately available in SSC. To check or contribute to the latest version of reghdfe, explore the Github repository. Bugs or missing features can be discussed through email or at the Github issue tracker.
To see your current version and installed dependencies, type reghdfe, version
Acknowledgements
This package wouldn't have existed without the invaluable feedback and contributions of Paulo Guimaraes, Amine Ouazad, Mark Schaffer and Kit Baum. Also invaluable are the great bug-spotting abilities of many users.
In addition, reghdfe is built upon important contributions from the Stata community:
reg2hdfe, from Paulo Guimaraes, and a2reg from Amine Ouazad, were the inspiration and building blocks on which reghdfe was built.
ivreg2, by Christopher F Baum, Mark E Schaffer and Steven Stillman, is the package used by default for instrumental-variable regression.
avar by Christopher F Baum and Mark E Schaffer, is the package used for estimating the HAC-robust standard errors of ols regressions.
tuples by Joseph Luchman and Nicholas Cox, is used when computing standard errors with multi-way clustering (two or more clustering variables).
References
The algorithm underlying reghdfe is a generalization of the works by:
Paulo Guimaraes and Pedro Portugal. "A Simple Feasible Alternative Procedure to Estimate Models with High-Dimensional Fixed Effects". Stata Journal, 10(4), 628-649, 2010.
Simen Gaure. "OLS with Multiple High Dimensional Category Dummies". Memorandum 14/2010, Oslo University, Department of Economics, 2010.
It addresses many of the limitations of previous works, such as possible lack of convergence, arbitrarily slow convergence times, and being limited to only two or three sets of fixed effects (for the first paper). The paper explaining the specifics of the algorithm is a work-in-progress and available upon request.
If you use this program in your research, please cite either the REPEC entry or the aforementioned papers.
Additional References
For details on the Aitken acceleration technique employed, please see "method 3" as described by:
Macleod, Allan J. "Acceleration of vector sequences by multi-dimensional Delta-2 methods." Communications in Applied Numerical Methods 2.4 (1986): 385-392.
For the rationale behind interacting fixed effects with continuous variables, see:
Duflo, Esther. "The medium run effects of educational expansion: Evidence from a large school construction program in Indonesia." Journal of Development Economics 74.1 (2004): 163-197.
Also see:
Abowd, J. M., R. H. Creecy, and F. Kramarz 2002. Computing person and firm effects using linked longitudinal employer-employee data. Census Bureau Technical Paper TP-2002-06.
Cameron, A. Colin & Gelbach, Jonah B. & Miller, Douglas L., 2011. "Robust Inference With Multiway Clustering," Journal of Business & Economic Statistics, American Statistical Association, vol. 29(2), pages 238-249.
Gormley, T. & Matsa, D. 2014. "Common errors: How to (and not to) control for unobserved heterogeneity." The Review of Financial Studies, vol. 27(2), pages 617-661.
Mittag, N. 2012. "New methods to estimate models with large sets of fixed effects with an application to matched employer-employee data from Germany." FDZ-Methodenreport 02/2012.