Get (easily) introduced to Stata

I do not want to make multiple posts on this topic. So, this post will be updated little by little (if I learn something new). Here’s the list of essential commands for creating variables, making basic (and advanced) statistical analyses.

Updates and help

help whatsnew

help xtmixed

This is a special feature of Stata, and a useful one. When we type the above syntax, we have the list of all recent changes, corrections, and new stuff added in Stata. The help function displays a new window for help in syntax with the command we have selected (e.g., regress, factor, sem).

Generate variables

Here’s a list of basic commands

gen my_new_var = original_var

gen age2 = age^2
gen age3 = age^3

gen age_gender = age*gender

gen sampling_weight = wtssall*oversamp

gen log_income = log(income)

replace age = . if age>=70 & real_income<=0

keep if age<70

drop age

gen black_white = 0 if race==2
replace black_white = 1 if race==1

gen gender = .
replace gender = 1 if sex=="male"
replace gender = 2 if sex=="female"

encode gender, generate(gender2)

recode age (min/17=.) (18/26=1) (27/35=2) (36/44=3) (45/55=4) (56/69=5) (70/max=.), generate(agegroup)
tabulate agegroup, gen(agedummy)

recode female (0=1) (1=0), gen(male)

egen zIQtest = std(IQtest)

The commands gen and generate are equivalent. I use gen because it’s shorter. The first line shows how to copy a variable. Then, how to make interaction variable or squared/cubed terms, and how to set missing values (“.”) under specified condition(s). Then, how to drop/remove all observations with age 70+ using the function “keep”, or how to drop a variable using “drop”. And then, how generate new variable. In case of string variables, we need to convert them into numeric variables by replacing, e.g., “male” by 1. Alternatively, we can use the encode function to convert string to numeric. The std function allows us to transform a variable into z-score (i.e., standardizing it). The command recode+generate is an easy way to create categorical variables whereas tabulate+gen is straightforward for generating a dummy variable. We can also use recode function to reverse coding a dummy variable 0;1 (female) into 1;0 (male).

The signs = and == are different. The first is used to generate variables, and the second is used to assign a value to a variable. We can make and set multiple assignment and conditions with & (“and”). If we need to specify “or” we will use |.

Descriptive statistics

Now, here’s a list of useful descriptive statistics for variables IQ and education.

tabulate race sex if !missing(age) & !missing(IQ) & !missing(education)

by race cohort, sort : summarize IQ education [aweight = sampling_weight] if age>18 & age<70

The first line gives the sample size by categories of race and gender, if there is no missing values of IQ, education, and age. The second line displays means, standard deviations, sample size and weighted sample size (optional) within a subset of data (certain age range conditional on gender) by subgroups of both race and cohort, i.e., each categories of these variables, independently and in combination.


scatter IQtest education

twoway (scatter IQtest education) if countryBirth==1 & age<70, ytitle(IQ score) xtitle(educ level) by(, title(scatter of IQ-educ relationship) subtitle(by gender group)) by(sex)

graph twoway (scatter IQtest education if race==0, msymbol(Oh)) (scatter IQtest education if race==1, msymbol(O)), legend(label(1 black) label(2 white))

twoway (scatter IQtest education, msymbol(O) jitter(0)) (lfit IQtest education if race==0) (lfit IQtest education if race==1), ytitle("scatter plot of IQ against education by race") legend(order(2 "black" 3 "white"))

graph matrix IQtest education income age if age <70 [pweight = sampling_weight], half maxis(ylabel(none) xlabel(none)) by(race)

The first line is the easy way to make scatterplot. The second line generates a scatterplot for males and females, separately but in a single picture, and with titles of X and Y. The third line is a scatterplot with data points by racial group, msymbol is used for setting the color and size of data points. The fourth line specifies the fitted line of the scatterplot with “lfit”. The fifth line generates a correlation matrix in the form of scatterplots. The “half” statement requests only the lower triangle of the matrix to be displayed.

Data distribution

Next, we look at the different ways to examine the normal distribution of the variables. We can do it using stem and leaf distribution, P-P and Q-Q plot, box plots, histogram, skewness and kurtosis, Shapiro-Wilk test by subgroups.

stem education if race==1

stem education if race==2

symplot IQtest if bw1==0

symplot IQtest if bw1==1

quantile IQtest, rlopts(clpattern(dash))

qnorm IQtest, grid ylabel(, angle(horizontal) axis(1)) ylabel(, angle(horizontal) axis(2))

graph box education IQtest, over(sex) over(race)

graph box education IQtest [pweight = sampling_weight] if age<=60 & born==1, over(sex) over(race) ytitle("level of IQ or education") title("Education and IQ by race and by sex") subtitle("(500 participants)" " ") note("Source: your data")

histogram IQtest, kdensity normal

kdensity IQtest, normal

twoway histogram IQtest, discrete by(race)

twoway histogram income, freq by(race)

sktest income

by race, sort : swilk IQtest income

Cronbach’s alpha

alpha IQ1-IQ10

alpha IQ1-IQ10, casewise generate(new_var) item std

ssc install kr20

kr20 worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum)
replace wordsum = . if wordsum<0
replace wordsum = . if wordsum>10

replace worda = . if worda<0
replace worda = 0 if worda==9
replace wordb = . if wordb<0
replace wordb = 0 if wordb==9
replace wordc = . if wordc<0
replace wordc = 0 if wordc==9
replace wordd = . if wordd<0
replace wordd = 0 if wordd==9
replace worde = . if worde<0
replace worde = 0 if worde==9
replace wordf = . if wordf<0
replace wordf = 0 if wordf==9
replace wordg = . if wordg<0
replace wordg = 0 if wordg==9
replace wordh = . if wordh<0
replace wordh = 0 if wordh==9
replace wordi = . if wordi<0
replace wordi = 0 if wordi==9
replace wordj = . if wordj<0
replace wordj = 0 if wordj==9

. alpha worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum), item

Test scale = mean(unstandardized items)

                             item-test     item-rest       interitem
Item         |  Obs  Sign   correlation   correlation     covariance      alpha
worda        | 23817   +       0.4347        0.2727        .0344404      0.7072
wordb        | 23817   +       0.4814        0.3710         .034422      0.6951
wordc        | 23817   +       0.5228        0.3587        .0321716      0.6940
wordd        | 23817   +       0.4543        0.3496        .0351039      0.6985
worde        | 23817   +       0.6071        0.4495         .029806      0.6776
wordf        | 23817   +       0.6025        0.4543        .0302186      0.6773
wordg        | 23817   +       0.5509        0.3682        .0310597      0.6935
wordh        | 23817   +       0.5788        0.4086        .0304111      0.6854
wordi        | 23817   +       0.4806        0.3059        .0331632      0.7032
wordj        | 23817   +       0.5835        0.4263        .0305587      0.6821
Test scale   |                                             .0321355      0.7137

. kr20 worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum)

Kuder-Richarson coefficient of reliability (KR-20)

Number of items in the scale = 10
Number of complete observations = 23817

                     Item       Item     Item-rest
  Item   |  Obs   difficulty  variance  correlation
 worda   | 23817    0.8240     0.1450     0.2727
 wordb   | 23817    0.9153     0.0776     0.3710
 wordc   | 23817    0.2199     0.1716     0.3587
 wordd   | 23817    0.9280     0.0668     0.3496
 worde   | 23817    0.7380     0.1934     0.4495
 wordf   | 23817    0.7792     0.1721     0.4543
 wordg   | 23817    0.3230     0.2187     0.3682
 wordh   | 23817    0.2895     0.2057     0.4086
 wordi   | 23817    0.7685     0.1779     0.3059
 wordj   | 23817    0.2384     0.1815     0.4262
   Test  |          0.6024                0.3765

KR20 coefficient is 0.7138

Here is an analysis of (internal consistency) reliability. The Cronbach’s alpha can be written as a function of the number of test items and the average inter-correlation among the items. We can use 2 variables or more. The “casewise” function will delete cases with missing values. The function “item” shows the effect of removing a variable from the scale. The std function allows us to standardize the items in the scale to mean 0 and variance 1 (we will get the average interitem correlation instead of the average interitem covariance). The function “generate” serves to save the generated scale. The last column gives the alpha reliability for the test scale. The alpha of each variable gives the reliability of the test scale when the corresponding variable is removed. However, for dichotomous variable such as IQ tests’ items, we preferably use the Kuder-Richardson reliability (KR-20) which needs to be installed in Stata first. Above, we see that the internal reliability of the Wordsum vocabulary test is 0.71.


A good advice. Never use correlate. Use pwcorr instead. While pwcorr uses pairwise deletion, correlate uses listwise deletion. So this is pretty useless if you have more than 2 variables.

correlate age race gender IQtest income education

by race, sort : correlate age race gender IQtest income education [aweight = sampling_weight] if age>=18, means covariance

by sex, sort : pwcorr IQtest income education age [aweight = sampling_weight] if age<70, obs sig

by sex, sort : spearman IQtest income education age if age<70, stats(rho obs p) star(0.05) bonferroni pw

by sex, sort : ktau IQtest income education, stats(taua taub p) pw

by sex, sort : pcorr IQtest degree income age race [aweight = sampling_weight] if age<70

This is how to generate correlation matrix. We can use sampling weight with aweight. In some analysis, it aweight is not available, sometimes it is pweight which is not available. It is pweight that is considered as the sampling weights (observations weighted according to their different probabilities of being in a sample). I have read that the difference between aweight and pweight is that they generate the same point estimates but different variance (Standard Error) estimates. The calculation of aweight is given here. If we need to display the mean, standard deviation and min/max values, we use “means”. And if we need covariances, we type “covariance”. We can make the correlation for a subset of the data, e.g., people aged 18+. Furthermore, the command “by [categorical_variable], sort :” allows us to repeat the analysis for each of the categories in the variable after “by”. The pwcorr function is a pairwise Pearson correlation, and the obs and sig give the sample size and p-values. Spearman correlation does not allow sampling weight, star allows to specify the level of significance for correlation marked with a star, pw requires pairwise analysis, and bonferroni specifies the Bonferroni’s correction of p-values. ktau is for kendall’s rank correlation. The pcorr function is partial correlation, where the first variable (IQtest) is correlated with degree (all other vars held constant) and then with income (all other vars held constant) and then with age (all other vars held constant) and then with race (all other vars held constant).

Analysis of variance

by race, sort : anova IQtest cohort sex cohort#sex [aweight = sampling_weight] if age<70

Multiple regression

Now, I show how to make regression. Below is an OLS regression with dummy variables of cohort and its interaction with race (this allows us to look at the trend in black-white gap in IQ by cohort categories, although one dummy variable must be dropped in order to avoid perfect multicollinearity, and it will serve as reference value from which all other variables are expressed, and the choice of the reference dummy will affect the intercept but not the coefficient of the independent variables). The first variable is always the dependent variables. Unless we specify “beta” the program will not give us the standardized coefficients. As in all other analyses, we can use a subset of data and apply sampling weight. We can also select which kind of standard errors (robust, bootstrap, jackknife) we need with “vce”.

regress IQtest race gender age education cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 raceCohort2 raceCohort3 raceCohort4 raceCohort5 raceCohort6 if age>=18 & age<=50 [pweight = sampling_weight], vce(robust) beta

predict predicted_values if e(sample), xb
predict residualized_values if e(sample), res
predict standardized_residuals if e(sample), rstandard
graph twoway (lfitci residualized_values predicted_values) (scatter residualized_values predicted_values)
pnorm residualized_values
qnorm residualized_values
histogram residualized_values, kdensity normal
kdensity residualized_values, normal

The “regress” command must be immediately followed by “predict” command if we want to examine the normality of residuals (an assumption generally but wrongly omitted by researchers when doing regressions). In most softwares, it seems that either or both the predicted values and/or residuals are calculated even for cases not analyzed in the regression model; so, the function e(sample) should be specified. xb will generate a variable of predicted/fitted values of the dependent variable whereas res or resid will generate the unstandardized residuals (Y minus Yhat). We use rstandard to request the standardized residuals, and rstudent to request the studentized (jackknifed) residuals. Correlating the residualized and fitted values allows us to examine the normality of the residuals (lfitci can be specified if we are interested in confidence intervals).

It is not necessary to specify vce(robust) when pweight is already activated because Stata automatically estimates the robust SE when pweight is specified in regression, because robust SE, according to the Stata manual, is the correct specification. This is true also for other types of regression (i.e., logit, tobit, xtmixed, …) that uses pweight. Estimation commands that do not have a vce(robust) option (there are a few) do not allow pweights.

test x5=x6=x7=x8=x9=0

After regression, it is possible to perform an F-test (test statistic for F-change) and compare nested models. Say, model 1 has four independent variables x1-x4, and model 2 adds variables x5-x9. Ideally, models 1 and 2 should have identical sample sizes. To know if these variables should be added to the initial model, F-test is performed. We are actually testing the null hypothesis that the regression coefficients x5 through x9 are all zero. Significant p-value indicates they should be added (but as always, p-value depends on sample size, so it is not very useful). It is better to look at the Cohen’s f² effect size.

regress read write math science
estimates store m2
generate samplem2 = e(sample)
regress read write if sample==1
estimates store m1
lrtest m1 m2

We can perform a likelihood ratio test between two nested models, as shown above. The requirement of LRT is that the models need to sample exactly the same cases. The function e(sample), which can be used each time we run a new model, is used for this purpose. The model m1 will then be run only for cases that appear in m2. However, if we have used vce(robust) for robust standard errors, we will get the following message : “LR test likely invalid for models with robust vce”.

Errors-in-variables (EIV) regression

eivreg IQtest education occupation income [aweight = sampling_weight] if race==1, reliab(education .95 occupation .95 income .80)

Here, we have an analysis of error-in-variables regression. The “reliab” (or “r” which is equivalent) function allows us to specify the level of assumed reliability (we can write 0.90 or .90) for each of the independent variables. Standardized coefficients are not available.

Tobit/Truncated regressions

. tobit wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bw1C2 bw1C3 bw1C4 bw1C5 bw1C6 agedummy1 agedummy2 agedummy4 agedummy5 sex [pweight = weight], ll ul

Tobit regression                                  Number of obs   =      22156
                                                  F(  16,  22140) =     103.09
                                                  Prob > F        =     0.0000
Log pseudolikelihood = -47841.919                 Pseudo R2       =     0.0175

             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         bw1 |   2.022845    .155358    13.02   0.000     1.718332    2.327358
cohortdummy2 |   .3783523   .1814003     2.09   0.037     .0227949    .7339097
cohortdummy3 |   1.033257   .1718304     6.01   0.000     .6964568    1.370056
cohortdummy4 |   .7742905   .1708343     4.53   0.000     .4394431    1.109138
cohortdummy5 |     1.0802   .1747871     6.18   0.000     .7376051    1.422795
cohortdummy6 |   1.308605   .1937606     6.75   0.000     .9288205     1.68839
       bw1C2 |  -.2276471   .1918261    -1.19   0.235    -.6036399    .1483457
       bw1C3 |  -.5661765   .1790605    -3.16   0.002    -.9171478   -.2152051
       bw1C4 |  -.6056165   .1763168    -3.43   0.001      -.95121    -.260023
       bw1C5 |  -1.008771   .1797605    -5.61   0.000    -1.361115   -.6564281
       bw1C6 |  -.9942551   .1999675    -4.97   0.000    -1.386206   -.6023045
   agedummy1 |  -.7602832   .0534722   -14.22   0.000    -.8650924   -.6554739
   agedummy2 |   -.247756   .0475324    -5.21   0.000    -.3409229    -.154589
   agedummy4 |   .0323851   .0517914     0.63   0.532    -.0691299       .1339
   agedummy5 |   .0456177   .0623807     0.73   0.465    -.0766528    .1678883
         sex |   .1725331   .0317426     5.44   0.000     .1103153    .2347509
       _cons |   4.008042   .1630947    24.57   0.000     3.688365    4.327719
      /sigma |   2.107777   .0135762                      2.081167    2.134388
  Obs. summary:        140  left-censored observations at wordsum=10
                     20698     uncensored observations
                      1318 right-censored observations at wordsum>=10

Tobit regression is displayed above. ll and ul stand for lower and upper level for value censoring. If the IQ has minimum of 0 and maximum of 10 answers correct, we can set ll(0) ul(10). Although ll ul automatically does the job, Cameron & Trivedi (2009, p. 523) always recommend specification of these values. There is no reason to do it by hand, unless we want to censor at a value less (more) than the maximum (minimum). Standardized beta not available because the tobit regression estimates the expected value of the latent (i.e., not observed) variable Y. In other words, the SD of the latent Y is not given.

truncreg IQtest race gender age education cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 raceCohort2 raceCohort3 raceCohort4 raceCohort5 raceCohort6 if age>=18 & age<=50 [pweight = sampling_weight], ll(80)

Above is the truncated regression for values truncated from below 80. If the (survey) data has sampled persons having at least 80 IQ points, there is a truncation from below, and we would have ll(80). Similarly, if the data selects only the persons with a maximum of 120 points, the data would appear to be truncated from above 120, then we set ul(120). For the same reason as tobit, truncreg does not provide standardized betas.

Logistic regression

logit being_unemployed education IQtest age if age>=25 & age<60 [pweight = sampling_weight], or

Above is a logistic regression. If we don’t request for the odds ratio (“or”), then only the b coefficient is displayed. Unlike other statistical packages such as SPSS, the logit command in Stata only recognizes a dependent variable that is coded 0 and 1. If coded otherwise (e.g., 1 and 2), we can get the message “outcome does not vary; remember: 0 = negative outcome, all other nonmissing values = positive outcome”.

Multilevel regression (or linear mixed effect model)

xtmixed IQtest race aged2-aged9 raceaged2-raceaged9 [pweight=sampling_weight] || cohort21: race, mle cov(un) var pweight(1) pwscale(size)

predict xtfitted if e(sample), fitted
predict xtfixed if e(sample), xb
predict xtres if e(sample), residuals
predict xtu* if e(sample), reffects
estimates store lme

graph twoway (lfitci xtres xtfitted) (scatter xtres xtfitted)
qnorm xtres
swilk xtres
histogram xtres, kdensity normal
kdensity xtres, normal
graph twoway (lfitci xtres xtu1) (scatter xtres xtu1)
qnorm xtu1
swilk xtu1
histogram xtu1, kdensity normal
kdensity xtu1, normal
graph twoway (lfitci xtres xtu2) (scatter xtres xtu2)
qnorm xtu2
swilk xtu2
histogram xtu2, kdensity normal
kdensity xtu2, normal

This is a multilevel regression, also called linear mixed effect model, where IQtest is the dependent var, the race dichotomy, age dummies and race*age (dummies) interaction specified as fixed effect variable in the fixed component to serve as covariate, with cohort (21-category) as random intercepts to allow differences in cohort across values/categories of race and age, and race as random slope to allow race difference to differ across cohorts. mle is for (full) maximum likelihood. reml is for restricted (or residualized) maximum likelihood and is a better estimator than mle if one wants to focus on variance components. cov(un) or cov(unstructured) allows the model to assume the presence of intercept-by-slope covariance. But we can use cov(independent), which allows a distinct variance for each random effect but covariances to be zero, if we assume random intercept and slopes to be uncorrelated. An alternative option is cov(identity) which assumes that all variances are equal and all covariances are zero. Sometimes, the model reaches convergence but Stata displays the message “standard-error calculation failed”. This means our model is incorrectly specified.

reffects function produces two random effects variables, u1 (slope) and u2 (intercepts). estimates store will save the results for later use, e.g., lrtest for comparing the significance between two models.

Sampling weight can be applied using multilevel regression. The problem is that the sampling weight for level-2 cluster variable usually does not exist in nearly all survey data. However, we can specify the second-level sampling weight to be just 1 (or create a variable “one” using gen and weight the level-2 variable using variable “one” which has a value of 1 for all cases), if we can safely assume that the categories of the level-2 cluster variable have no different probabilities of being selected into the sample

Factor analysis

by race, sort : pca education income occupation [aweight = sampling_weight] if age<70, components(3) means

by race, sort : factor education income occupation [aweight = sampling_weight] if age<70, factors(3) blanks(.1)

by race, sort : factor education income occupation [aweight = sampling_weight] if age<70, ml factors(3) blanks(.1)

Above is the syntax for factor analysis and principal component analysis. We can request the descriptive stats with “means” and specify the minimum loading to be displayed with “blanks” and the maximum number of factors to be retained with “factors” for FA and “components” for PCA. We can get maximum likelihood factor analysis by checking the “ml” option.

Good references.

Stata13 manual

Stata FAQ: How can I estimate effect size for xtmixed?
Stata FAQ: How can I identify cases used by an estimation command using e(sample)?
Why should I not do a likelihood-ratio test after an ML estimation (e.g., logit, probit) with clustering or pweights?

Cameron, A. C., & Trivedi, P. K. (2009). Microeconometrics Using Stata. Stata Press books.
Everitt, B. S., & Rabe-Hesketh, S. (2006). Handbook of Statistical Analyses Using Stata. CRC Press.
Hamilton, L. (2012). Statistics with STATA: Version 12. Cengage Learning.
West, B.T., Welch, K.B., & Galecki, A.T. (2014). Linear Mixed Models: A Practical Guide Using Statistical Software, Second Edition. Chapman Hall/CRC Press: Boca Raton, FL.

This entry was posted in Stats, Stats, Softwares, Syntax and tagged . Bookmark the permalink.