I do not want to make multiple posts on this topic. So, this post will be updated little by little (if I learn something new). Here’s the list of essential commands for creating variables, making basic (and advanced) statistical analyses.
Updates and help
help whatsnew help xtmixed
This is a special feature of Stata, and a useful one. When we type the above syntax, we have the list of all recent changes, corrections, and new stuff added in Stata. The help function displays a new window for help in syntax with the command we have selected (e.g., regress, factor, sem).
Here’s a list of basic commands
gen my_new_var = original_var gen age2 = age^2 gen age3 = age^3 gen age_gender = age*gender gen sampling_weight = wtssall*oversamp gen log_income = log(income) replace age = . if age>=70 & real_income<=0 keep if age<70 drop age gen black_white = 0 if race==2 replace black_white = 1 if race==1 gen gender = . replace gender = 1 if sex=="male" replace gender = 2 if sex=="female" encode gender, generate(gender2) recode age (min/17=.) (18/26=1) (27/35=2) (36/44=3) (45/55=4) (56/69=5) (70/max=.), generate(agegroup) tabulate agegroup, gen(agedummy) recode female (0=1) (1=0), gen(male) egen zIQtest = std(IQtest)
The commands gen and generate are equivalent. I use gen because it’s shorter. The first line shows how to copy a variable. Then, how to make interaction variable or squared/cubed terms, and how to set missing values (“.”) under specified condition(s). Then, how to drop/remove all observations with age 70+ using the function “keep”, or how to drop a variable using “drop”. And then, how generate new variable. In case of string variables, we need to convert them into numeric variables by replacing, e.g., “male” by 1. Alternatively, we can use the encode function to convert string to numeric. The std function allows us to transform a variable into z-score (i.e., standardizing it). The command recode+generate is an easy way to create categorical variables whereas tabulate+gen is straightforward for generating a dummy variable. We can also use recode function to reverse coding a dummy variable 0;1 (female) into 1;0 (male).
The signs = and == are different. The first is used to generate variables, and the second is used to assign a value to a variable. We can make and set multiple assignment and conditions with & (“and”). If we need to specify “or” we will use |.
Now, here’s a list of useful descriptive statistics for variables IQ and education.
tabulate race sex if !missing(age) & !missing(IQ) & !missing(education) by race cohort, sort : summarize IQ education [aweight = sampling_weight] if age>18 & age<70
The first line gives the sample size by categories of race and gender, if there is no missing values of IQ, education, and age. The second line displays means, standard deviations, sample size and weighted sample size (optional) within a subset of data (certain age range conditional on gender) by subgroups of both race and cohort, i.e., each categories of these variables, independently and in combination.
scatter IQtest education twoway (scatter IQtest education) if countryBirth==1 & age<70, ytitle(IQ score) xtitle(educ level) by(, title(scatter of IQ-educ relationship) subtitle(by gender group)) by(sex) graph twoway (scatter IQtest education if race==0, msymbol(Oh)) (scatter IQtest education if race==1, msymbol(O)), legend(label(1 black) label(2 white)) twoway (scatter IQtest education, msymbol(O) jitter(0)) (lfit IQtest education if race==0) (lfit IQtest education if race==1), ytitle("scatter plot of IQ against education by race") legend(order(2 "black" 3 "white")) graph matrix IQtest education income age if age <70 [pweight = sampling_weight], half maxis(ylabel(none) xlabel(none)) by(race)
The first line is the easy way to make scatterplot. The second line generates a scatterplot for males and females, separately but in a single picture, and with titles of X and Y. The third line is a scatterplot with data points by racial group, msymbol is used for setting the color and size of data points. The fourth line specifies the fitted line of the scatterplot with “lfit”. The fifth line generates a correlation matrix in the form of scatterplots. The “half” statement requests only the lower triangle of the matrix to be displayed.
Next, we look at the different ways to examine the normal distribution of the variables. We can do it using stem and leaf distribution, P-P and Q-Q plot, box plots, histogram, skewness and kurtosis, Shapiro-Wilk test by subgroups.
stem education if race==1 stem education if race==2 symplot IQtest if bw1==0 symplot IQtest if bw1==1 quantile IQtest, rlopts(clpattern(dash)) qnorm IQtest, grid ylabel(, angle(horizontal) axis(1)) ylabel(, angle(horizontal) axis(2)) graph box education IQtest, over(sex) over(race) graph box education IQtest [pweight = sampling_weight] if age<=60 & born==1, over(sex) over(race) ytitle("level of IQ or education") title("Education and IQ by race and by sex") subtitle("(500 participants)" " ") note("Source: your data") histogram IQtest, kdensity normal kdensity IQtest, normal twoway histogram IQtest, discrete by(race) twoway histogram income, freq by(race) sktest income by race, sort : swilk IQtest income
alpha IQ1-IQ10 alpha IQ1-IQ10, casewise generate(new_var) item std ssc install kr20 kr20 worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum)
replace wordsum = . if wordsum<0 replace wordsum = . if wordsum>10 replace worda = . if worda<0 replace worda = 0 if worda==9 replace wordb = . if wordb<0 replace wordb = 0 if wordb==9 replace wordc = . if wordc<0 replace wordc = 0 if wordc==9 replace wordd = . if wordd<0 replace wordd = 0 if wordd==9 replace worde = . if worde<0 replace worde = 0 if worde==9 replace wordf = . if wordf<0 replace wordf = 0 if wordf==9 replace wordg = . if wordg<0 replace wordg = 0 if wordg==9 replace wordh = . if wordh<0 replace wordh = 0 if wordh==9 replace wordi = . if wordi<0 replace wordi = 0 if wordi==9 replace wordj = . if wordj<0 replace wordj = 0 if wordj==9 . alpha worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum), item Test scale = mean(unstandardized items) average item-test item-rest interitem Item | Obs Sign correlation correlation covariance alpha -------------+----------------------------------------------------------------- worda | 23817 + 0.4347 0.2727 .0344404 0.7072 wordb | 23817 + 0.4814 0.3710 .034422 0.6951 wordc | 23817 + 0.5228 0.3587 .0321716 0.6940 wordd | 23817 + 0.4543 0.3496 .0351039 0.6985 worde | 23817 + 0.6071 0.4495 .029806 0.6776 wordf | 23817 + 0.6025 0.4543 .0302186 0.6773 wordg | 23817 + 0.5509 0.3682 .0310597 0.6935 wordh | 23817 + 0.5788 0.4086 .0304111 0.6854 wordi | 23817 + 0.4806 0.3059 .0331632 0.7032 wordj | 23817 + 0.5835 0.4263 .0305587 0.6821 -------------+----------------------------------------------------------------- Test scale | .0321355 0.7137 ------------------------------------------------------------------------------- . kr20 worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum) Kuder-Richarson coefficient of reliability (KR-20) Number of items in the scale = 10 Number of complete observations = 23817 Item Item Item-rest Item | Obs difficulty variance correlation ---------+------------------------------------------ worda | 23817 0.8240 0.1450 0.2727 wordb | 23817 0.9153 0.0776 0.3710 wordc | 23817 0.2199 0.1716 0.3587 wordd | 23817 0.9280 0.0668 0.3496 worde | 23817 0.7380 0.1934 0.4495 wordf | 23817 0.7792 0.1721 0.4543 wordg | 23817 0.3230 0.2187 0.3682 wordh | 23817 0.2895 0.2057 0.4086 wordi | 23817 0.7685 0.1779 0.3059 wordj | 23817 0.2384 0.1815 0.4262 ---------+------------------------------------------ Test | 0.6024 0.3765 KR20 coefficient is 0.7138
Here is an analysis of (internal consistency) reliability. The Cronbach’s alpha can be written as a function of the number of test items and the average inter-correlation among the items. We can use 2 variables or more. The “casewise” function will delete cases with missing values. The function “item” shows the effect of removing a variable from the scale. The std function allows us to standardize the items in the scale to mean 0 and variance 1 (we will get the average interitem correlation instead of the average interitem covariance). The function “generate” serves to save the generated scale. The last column gives the alpha reliability for the test scale. The alpha of each variable gives the reliability of the test scale when the corresponding variable is removed. However, for dichotomous variable such as IQ tests’ items, we preferably use the Kuder-Richardson reliability (KR-20) which needs to be installed in Stata first. Above, we see that the internal reliability of the Wordsum vocabulary test is 0.71.
A good advice. Never use correlate. Use pwcorr instead. While pwcorr uses pairwise deletion, correlate uses listwise deletion. So this is pretty useless if you have more than 2 variables.
correlate age race gender IQtest income education by race, sort : correlate age race gender IQtest income education [aweight = sampling_weight] if age>=18, means covariance by sex, sort : pwcorr IQtest income education age [aweight = sampling_weight] if age<70, obs sig by sex, sort : spearman IQtest income education age if age<70, stats(rho obs p) star(0.05) bonferroni pw by sex, sort : ktau IQtest income education, stats(taua taub p) pw by sex, sort : pcorr IQtest degree income age race [aweight = sampling_weight] if age<70
This is how to generate correlation matrix. We can use sampling weight with aweight. In some analysis, it aweight is not available, sometimes it is pweight which is not available. It is pweight that is considered as the sampling weights (observations weighted according to their different probabilities of being in a sample). I have read that the difference between aweight and pweight is that they generate the same point estimates but different variance (Standard Error) estimates. The calculation of aweight is given here. If we need to display the mean, standard deviation and min/max values, we use “means”. And if we need covariances, we type “covariance”. We can make the correlation for a subset of the data, e.g., people aged 18+. Furthermore, the command “by [categorical_variable], sort :” allows us to repeat the analysis for each of the categories in the variable after “by”. The pwcorr function is a pairwise Pearson correlation, and the obs and sig give the sample size and p-values. Spearman correlation does not allow sampling weight, star allows to specify the level of significance for correlation marked with a star, pw requires pairwise analysis, and bonferroni specifies the Bonferroni’s correction of p-values. ktau is for kendall’s rank correlation. The pcorr function is partial correlation, where the first variable (IQtest) is correlated with degree (all other vars held constant) and then with income (all other vars held constant) and then with age (all other vars held constant) and then with race (all other vars held constant).
Analysis of variance
by race, sort : anova IQtest cohort sex cohort#sex [aweight = sampling_weight] if age<70
Now, I show how to make regression. Below is an OLS regression with dummy variables of cohort and its interaction with race (this allows us to look at the trend in black-white gap in IQ by cohort categories, although one dummy variable must be dropped in order to avoid perfect multicollinearity, and it will serve as reference value from which all other variables are expressed, and the choice of the reference dummy will affect the intercept but not the coefficient of the independent variables). The first variable is always the dependent variables. Unless we specify “beta” the program will not give us the standardized coefficients. As in all other analyses, we can use a subset of data and apply sampling weight. We can also select which kind of standard errors (robust, bootstrap, jackknife) we need with “vce”.
regress IQtest race gender age education cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 raceCohort2 raceCohort3 raceCohort4 raceCohort5 raceCohort6 if age>=18 & age<=50 [pweight = sampling_weight], vce(robust) beta predict predicted_values if e(sample), xb predict residualized_values if e(sample), res predict standardized_residuals if e(sample), rstandard graph twoway (lfitci residualized_values predicted_values) (scatter residualized_values predicted_values) pnorm residualized_values qnorm residualized_values histogram residualized_values, kdensity normal kdensity residualized_values, normal
The “regress” command must be immediately followed by “predict” command if we want to examine the normality of residuals (an assumption generally but wrongly omitted by researchers when doing regressions). In most softwares, it seems that either or both the predicted values and/or residuals are calculated even for cases not analyzed in the regression model; so, the function e(sample) should be specified. xb will generate a variable of predicted/fitted values of the dependent variable whereas res or resid will generate the unstandardized residuals (Y minus Yhat). We use rstandard to request the standardized residuals, and rstudent to request the studentized (jackknifed) residuals. Correlating the residualized and fitted values allows us to examine the normality of the residuals (lfitci can be specified if we are interested in confidence intervals).
It is not necessary to specify vce(robust) when pweight is already activated because Stata automatically estimates the robust SE when pweight is specified in regression, because robust SE, according to the Stata manual, is the correct specification. This is true also for other types of regression (i.e., logit, tobit, xtmixed, …) that uses pweight. Estimation commands that do not have a vce(robust) option (there are a few) do not allow pweights.
After regression, it is possible to perform an F-test (test statistic for F-change) and compare nested models. Say, model 1 has four independent variables x1-x4, and model 2 adds variables x5-x9. Ideally, models 1 and 2 should have identical sample sizes. To know if these variables should be added to the initial model, F-test is performed. We are actually testing the null hypothesis that the regression coefficients x5 through x9 are all zero. Significant p-value indicates they should be added (but as always, p-value depends on sample size, so it is not very useful). It is better to look at the Cohen’s f² effect size.
regress read write math science estimates store m2 generate samplem2 = e(sample) regress read write if sample==1 estimates store m1 lrtest m1 m2
We can perform a likelihood ratio test between two nested models, as shown above. The requirement of LRT is that the models need to sample exactly the same cases. The function e(sample), which can be used each time we run a new model, is used for this purpose. The model m1 will then be run only for cases that appear in m2. However, if we have used vce(robust) for robust standard errors, we will get the following message : “LR test likely invalid for models with robust vce”.
Errors-in-variables (EIV) regression
eivreg IQtest education occupation income [aweight = sampling_weight] if race==1, reliab(education .95 occupation .95 income .80)
Here, we have an analysis of error-in-variables regression. The “reliab” (or “r” which is equivalent) function allows us to specify the level of assumed reliability (we can write 0.90 or .90) for each of the independent variables. Standardized coefficients are not available.
. tobit wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bw1C2 bw1C3 bw1C4 bw1C5 bw1C6 agedummy1 agedummy2 agedummy4 agedummy5 sex [pweight = weight], ll ul Tobit regression Number of obs = 22156 F( 16, 22140) = 103.09 Prob > F = 0.0000 Log pseudolikelihood = -47841.919 Pseudo R2 = 0.0175 ------------------------------------------------------------------------------ | Robust wordsum | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- bw1 | 2.022845 .155358 13.02 0.000 1.718332 2.327358 cohortdummy2 | .3783523 .1814003 2.09 0.037 .0227949 .7339097 cohortdummy3 | 1.033257 .1718304 6.01 0.000 .6964568 1.370056 cohortdummy4 | .7742905 .1708343 4.53 0.000 .4394431 1.109138 cohortdummy5 | 1.0802 .1747871 6.18 0.000 .7376051 1.422795 cohortdummy6 | 1.308605 .1937606 6.75 0.000 .9288205 1.68839 bw1C2 | -.2276471 .1918261 -1.19 0.235 -.6036399 .1483457 bw1C3 | -.5661765 .1790605 -3.16 0.002 -.9171478 -.2152051 bw1C4 | -.6056165 .1763168 -3.43 0.001 -.95121 -.260023 bw1C5 | -1.008771 .1797605 -5.61 0.000 -1.361115 -.6564281 bw1C6 | -.9942551 .1999675 -4.97 0.000 -1.386206 -.6023045 agedummy1 | -.7602832 .0534722 -14.22 0.000 -.8650924 -.6554739 agedummy2 | -.247756 .0475324 -5.21 0.000 -.3409229 -.154589 agedummy4 | .0323851 .0517914 0.63 0.532 -.0691299 .1339 agedummy5 | .0456177 .0623807 0.73 0.465 -.0766528 .1678883 sex | .1725331 .0317426 5.44 0.000 .1103153 .2347509 _cons | 4.008042 .1630947 24.57 0.000 3.688365 4.327719 -------------+---------------------------------------------------------------- /sigma | 2.107777 .0135762 2.081167 2.134388 ------------------------------------------------------------------------------ Obs. summary: 140 left-censored observations at wordsum=10 20698 uncensored observations 1318 right-censored observations at wordsum>=10
Tobit regression is displayed above. ll and ul stand for lower and upper level for value censoring. If the IQ has minimum of 0 and maximum of 10 answers correct, we can set ll(0) ul(10). Although ll ul automatically does the job, Cameron & Trivedi (2009, p. 523) always recommend specification of these values. There is no reason to do it by hand, unless we want to censor at a value less (more) than the maximum (minimum). Standardized beta not available because the tobit regression estimates the expected value of the latent (i.e., not observed) variable Y. In other words, the SD of the latent Y is not given.
truncreg IQtest race gender age education cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 raceCohort2 raceCohort3 raceCohort4 raceCohort5 raceCohort6 if age>=18 & age<=50 [pweight = sampling_weight], ll(80)
Above is the truncated regression for values truncated from below 80. If the (survey) data has sampled persons having at least 80 IQ points, there is a truncation from below, and we would have ll(80). Similarly, if the data selects only the persons with a maximum of 120 points, the data would appear to be truncated from above 120, then we set ul(120). For the same reason as tobit, truncreg does not provide standardized betas.
logit being_unemployed education IQtest age if age>=25 & age<60 [pweight = sampling_weight], or
Above is a logistic regression. If we don’t request for the odds ratio (“or”), then only the b coefficient is displayed. Unlike other statistical packages such as SPSS, the logit command in Stata only recognizes a dependent variable that is coded 0 and 1. If coded otherwise (e.g., 1 and 2), we can get the message “outcome does not vary; remember: 0 = negative outcome, all other nonmissing values = positive outcome”.
Multilevel regression (or linear mixed effect model)
xtmixed IQtest race aged2-aged9 raceaged2-raceaged9 [pweight=sampling_weight] || cohort21: race, mle cov(un) var pweight(1) pwscale(size) predict xtfitted if e(sample), fitted predict xtfixed if e(sample), xb predict xtres if e(sample), residuals predict xtu* if e(sample), reffects estimates store lme graph twoway (lfitci xtres xtfitted) (scatter xtres xtfitted) qnorm xtres swilk xtres histogram xtres, kdensity normal kdensity xtres, normal graph twoway (lfitci xtres xtu1) (scatter xtres xtu1) qnorm xtu1 swilk xtu1 histogram xtu1, kdensity normal kdensity xtu1, normal graph twoway (lfitci xtres xtu2) (scatter xtres xtu2) qnorm xtu2 swilk xtu2 histogram xtu2, kdensity normal kdensity xtu2, normal
This is a multilevel regression, also called linear mixed effect model, where IQtest is the dependent var, the race dichotomy, age dummies and race*age (dummies) interaction specified as fixed effect variable in the fixed component to serve as covariate, with cohort (21-category) as random intercepts to allow differences in cohort across values/categories of race and age, and race as random slope to allow race difference to differ across cohorts. mle is for (full) maximum likelihood. reml is for restricted (or residualized) maximum likelihood and is a better estimator than mle if one wants to focus on variance components. cov(un) or cov(unstructured) allows the model to assume the presence of intercept-by-slope covariance. But we can use cov(independent), which allows a distinct variance for each random effect but covariances to be zero, if we assume random intercept and slopes to be uncorrelated. An alternative option is cov(identity) which assumes that all variances are equal and all covariances are zero. Sometimes, the model reaches convergence but Stata displays the message “standard-error calculation failed”. This means our model is incorrectly specified.
reffects function produces two random effects variables, u1 (slope) and u2 (intercepts). estimates store will save the results for later use, e.g., lrtest for comparing the significance between two models.
Sampling weight can be applied using multilevel regression. The problem is that the sampling weight for level-2 cluster variable usually does not exist in nearly all survey data. However, we can specify the second-level sampling weight to be just 1 (or create a variable “one” using gen and weight the level-2 variable using variable “one” which has a value of 1 for all cases), if we can safely assume that the categories of the level-2 cluster variable have no different probabilities of being selected into the sample
by race, sort : pca education income occupation [aweight = sampling_weight] if age<70, components(3) means by race, sort : factor education income occupation [aweight = sampling_weight] if age<70, factors(3) blanks(.1) by race, sort : factor education income occupation [aweight = sampling_weight] if age<70, ml factors(3) blanks(.1)
Above is the syntax for factor analysis and principal component analysis. We can request the descriptive stats with “means” and specify the minimum loading to be displayed with “blanks” and the maximum number of factors to be retained with “factors” for FA and “components” for PCA. We can get maximum likelihood factor analysis by checking the “ml” option.
Stata FAQ: How can I estimate effect size for xtmixed?
Stata FAQ: How can I identify cases used by an estimation command using e(sample)?
Why should I not do a likelihood-ratio test after an ML estimation (e.g., logit, probit) with clustering or pweights?
Cameron, A. C., & Trivedi, P. K. (2009). Microeconometrics Using Stata. Stata Press books.
Everitt, B. S., & Rabe-Hesketh, S. (2006). Handbook of Statistical Analyses Using Stata. CRC Press.
Hamilton, L. (2012). Statistics with STATA: Version 12. Cengage Learning.
West, B.T., Welch, K.B., & Galecki, A.T. (2014). Linear Mixed Models: A Practical Guide Using Statistical Software, Second Edition. Chapman Hall/CRC Press: Boca Raton, FL.