* MMA27P3MILOGIT.DO  January 2007 for Stata version 8.0
*                     based on logitmcar.do
clear
capture log close 
log using mma27p3milogit.txt, text replace

********** OVERVIEW OF MMA27P3MILOGIT.DO **********

* STATA Program by A. Colin Cameron and Pravin K. Trivedi (2005) for
* "Microeconometrics: Methods and Applications", Cambridge University Press 

* Chapter 27.8.2 pp. 937-939  Missing Data Imputation in a Logit Model

* This program creates the first three columns of Tables 27.5-27.6.
* It also creates the data sets that SAS analyzes by multiple imputation
* to give the remaining columns of Tables 27.5-27.6.

* There are four cases
*  1: 10% missing rho=0.64 for Table 27.5 and mma27logit1.asc  
*  2: 25% missing rho=0.64 for                mma27logit2.asc  
*  3: 10% missing rho=0.36 for                mma27logit3.asc  
*  4: 25% missing rho=0.36 for Table 27.6 and mma27logit4.asc  

* THIS PROGRAM DIFFERS FROM THE PROGRAM THAT CREATED THE TABLE GIVEN IN THE BOOK.
* IT USES A DIFFERENT SEED LEADING TO DIFFERENT DATA SETS

* The created data are then analyzed using MMA27P4MILOGIT.SAS 
* to construct the remaining columns of Tables 27.5-27.6

********** SETUP ********** 

set more off
version 8.0
set scheme s1mono  /* Graphics scheme */

********** SIMULATION OVERVIEW ********** 

* The data generating process is logit with
*   y = 1(ystar <= 0)   (the sign used in the code below)
*   ystar = constant + x1 + x2 + u, with constant = 1
*   x1, x2 ~ bivariate normal with covariance matrix (1,rho\rho,1)
*   u = sqrt(pi^2/3)*e, where e ~ standard logistic (variance pi^2/3)
*   N = 1000

* The missing data process is
*   10% (or 25%) of x1 are randomly missing
*   10% (or 25%) of x2 are randomly missing
* x1 and x2 need not be missing on the same observation.

* Note that, because the code sets y = 1 when ystar <= 0 and scales u by
* sqrt(pi^2/3), the logit coefficients being estimated (constant, x1, x2)
* all equal -1/sqrt(pi^2/3), approximately -0.551.
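
* A short check of that value, added as a sketch (not in the original program).
* As coded in the program below, u = s*e with s = sqrt(pi^2/3) and
* e = logit(uniform()) a standard logistic draw, and y = 1 if ystar <= 0, so
*   Pr(y=1|x1,x2) = Pr(1 + x1 + x2 + s*e <= 0)
*                 = Pr(e <= -(1 + x1 + x2)/s)
*                 = Lambda(-(1 + x1 + x2)/s)
* where Lambda is the logistic cdf. The constant and both slopes are -1/s.
di %6.3f -1/sqrt(_pi^2/3)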

************ PROGRAM TO CREATE AND ANALYZE MISSING DATA ***********

* This program has four arguments
*   `1' is rho - correlation between x1 and x2
*   `2' is percentage nonmissing (so 100 - `2' is percentage missing)
*   `3' is the number for the data set created
*   `4' is the variance of u set so that R^2 = 0.25 in true OLS regression 
*       (this argument is not referenced in the logit program below)

* The program 
*    creates a missing data set
*    estimates using listwise deletion and mean imputation
*    writes out data set for later multiple imputation by SAS 

capture program drop missing

program define missing

  /* (1) Create complete data set */
  di
  clear 
  set obs 1000                       /* set sample size*/
  matrix covvar = (1,`1' \ `1',1)    /* set covariance matrix for x1, x2*/
  matrix means = (0,0)               /* set mean for x1, x2*/
  drawnorm x1 x2, seed(123) cov(covvar) means(means)  /* draw x1, x2*/
  sum x1 x2                          /* check x1, x2 correctly drawn */
  corr x1 x2
  gen u = sqrt(_pi^2/3)*logit(uniform())     /* draw u: logistic draw scaled by sqrt(pi^2/3) */
  sum u                                      /* check draws of u*/
  gen cons = 1
  gen ystar = x1 + x2 + u + cons      /* generate ystar */
  gen y = 0                           /* generate y */
  replace y=1 if ystar<=0             /* y = 1 when ystar <= 0, so logit coefficients are negative */
  gen id = _n
  sort id
  save x1x2uy.dta, replace

  /* (2) Create data set with some observations missing */
  use x1x2uy.dta, clear       /* randomly set 100-`2' % of x1 missing*/
  keep x1
  gen id=_n
  sample `2'
  sort id
  rename x1 x1missing         /* rename resulting x1 as x1missing*/
  save x1.dta, replace
  use x1x2uy.dta, clear       /* randomly set 100-`2' % of x2 missing*/
  keep x2
  gen id=_n
  sample `2'
  sort id
  rename x2 x2missing         /*rename resulting x2 as x2missing*/
  save x2.dta, replace
  use x1x2uy, clear           /* merge x1missing and x2missing */
  sort id
  merge id using x1
  rename _merge merge1
  sort id
  merge id using x2
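
  /* Sketch of a sanity check (an addition, not in the original program): */
  /* count the observations that are missing in x1, in x2, and in both.   */
  /* With 1000 observations the first two counts should be close to       */
  /* 10*(100-`2'); the third shows how often both are missing together.   */
  count if x1missing==.
  count if x2missing==.
  count if x1missing==. & x2missing==.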

  /* (3) Create the first three columns of Tables 27.5-27.6 */

  /* Logit with no data missing */
  di _n "Column 1: Logit with no data missing"
  logit y x1 x2                  

  /* Logit with listwise deletion of missing data */
  di _n "Column 2: Logit with listwise deletion of missing data"
  logit y x1missing x2missing                

  /* Logit with mean imputation of missing data */
  /* Generate mean imputations of x1 and x2 */
  gen x1meanimpute=x1missing    
  gen x2meanimpute=x2missing
  sum x1missing
  replace x1meanimpute=r(mean) if x1meanimpute==.
  sum x2missing
  replace x2meanimpute=r(mean) if x2meanimpute==.
  di _n "Column 3: Logit with mean imputation of missing data"
  logit y x1meanimpute x2meanimpute 

  /* Save data for later SAS multiple imputation use */
  /* save x1x2missuy.dta, replace */
  outfile y x1missing x2missing using mma27logit`3'.asc, replace
  clear

end

************ RUN THE PROGRAM TO CREATE SEVERAL MISSING DATA SETS ***********

* This program has four arguments
*   `1' is rho - correlation between x1 and x2
*   `2' is percentage nonmissing (so 100 - `2' is percentage missing)
*   `3' is the number for the data set created
*       e.g. the first will be mma27logit1.asc 
*   `4' is the variance of u set so that R^2 = 0.25 in true OLS regression 
*       (this argument is not referenced in the logit program above)

* Table 27.5
missing 0.64 90 1 10   /* Case 1: high correlation and low missing  */
 
* Not tabulated 
missing 0.64 75 2 10   /* Case 2: high correlation and high missing */
 
* Not tabulated
missing 0.36 90 3 10   /* Case 3: low correlation and low missing   */
 
* Table 27.6
missing 0.36 75 4 10   /* Case 4: low correlation and high missing  */

********** CLOSE OUTPUT **********
log close
clear
exit