------------------------------------------------------------------------------------------------------
       log:  c:\Imbook\bwebpage\Section2\mma09p1np.txt
  log type:  text
 opened on:  17 May 2005, 14:16:51

. 
. ********** OVERVIEW OF MMA09P1NP.DO **********
. 
. * STATA Program 
. * copyright C 2005 by A. Colin Cameron and Pravin K. Trivedi 
. * used for "Microeconometrics: Methods and Applications" 
. * by A. Colin Cameron and Pravin K. Trivedi (2005)
. * Cambridge University Press 
. 
. * Chapter 9.2 p.295-297
. * Nonparametric density estimation and nonparametric regression using actual data.
. 
. * (1) Histogram: Figure 9.1 in chapter 9.2.1 (ch9hist)
. * (2) Kernel density estimate as bandwidth varies: Figure 9.2 in chapter 9.2.1 (ch9kd1)
. * (3) Kernel density estimate as kernel varies: Figure 9.4 in chapter 9.3.4 (ch9kdensu1)
. * (4) Lowess regression: Figure 9.3 in chapter 9.4.3 (ch9ksm1)
. * (5) Extra: Nearest neighbours regression: using Lowess and using add-on knnreg
. * (6) Extra: Kernel regression: using add-on kernreg
. 
. * using data on earnings and education (see below)
. 
. * NOTE: This particular program uses version 8.2 rather than 8.0
. *       For kernel density Stata uses an alternative formulation of Epanechnikov
. *       To follow book and e.g. Hardle (1990) use epan2 rather than epan 
. *       epan = epan2 if epan bandwidth is epan2 bandwidth divided by sqrt(5) 
. *       where kernel epan2 is an update to Stata version 8.2
. 
. * To run this program you need file
. *    psidf3050.dat 
. * in your directory
. 
. * To do (5) and (6) you need Stata add-ons knnreg and kernreg
. * In Stata give command  search knnreg  and  search kernreg
. 
. * See also mma9p2npmore.do for more on nonparametric regression (Figures 9.5-9.7)
. 
. ********** SETUP
. 
. di "mma09p1np.do Cameron and Trivedi: Stata nonparametrics with wages and education"
mma09p1np.do Cameron and Trivedi: Stata nonparametrics with wages and education

. set more off

. version 8

. set scheme s1mono  /* Graphics scheme */

. 
. ********** DATA DESCRIPTION
. *
. * The original data are from the PSID Individual Level Final Release 1993 data
. * From www.isr.umich.edu/src/psid  then choose Data Center 
. * 4856 observations on 9 variables for Females 30 to 50 years 
. 
. * Fixed width data
. * intnum    1-4  V30001="1968 INTERVIEW NUMBER" 
. * persnum   5-7  V30002="PERSON NUMBER" 
. * age       8-9  V30809="AGE OF INDIVIDUAL                     93" 
. * educatn  10-11 V30820="G90 HIGHEST GRADE COMPLETED           93" 
. * earnings 12-17 V30821="TOTAL LABOR INCOME                    93" 
. * hours    18-21 V30823="1992 ANNUAL WORK HOURS                93" 
. * sex       22   V32000="SEX OF INDIVIDUAL" 
. * kids     23-24 V32022="# LIVE BIRTHS TO THIS INDIVIDUAL" 
. * [NOTE: DO NOT USE THE kids VARIABLE AS IT IS NUMBER OF BIRTHS 
. *        NOT NUMBER OF KIDS CURRENTLYU IN HOUSEHOLD]
. * married   25   V32049="LAST KNOWN MARITAL STATUS" 
. 
. ********** READ DATA ********** 
. 
. * Data are fixed format so use infix
. infix intnum 1-4 persnum 5-7 age 8-9 educatn 10-11 earnings 12-17  /* 
>   */  hours 18-21 sex 22 kids 23-24 married 25 using psidf3050.dat
(4856 observations read)

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      intnum |      4856    4598.101    2761.971          4       9306
     persnum |      4856    59.21355    79.74856          1        205
         age |      4856    38.46293    5.595116         30         50
     educatn |      4855    16.37714     18.4495          0         99
    earnings |      4856    14244.51    15985.45          0     240000
-------------+--------------------------------------------------------
       hours |      4856    1235.335    947.1758          0       5160
         sex |      4856           2           0          2          2
        kids |      4856     4.48126    14.88786          0         99
     married |      4856    1.920717    1.504848          1          9

. 
. ********** MISSING VALUES, DATA TRANSFORMATIONS and SAMPLE SELECTION
. 
. * For Highest grade codes the missing codes are 98 DK and 99 NA and 0 inappropriate
. * Here treat these as missing
. replace educatn = . if (educatn==0 | educatn==98 | educatn==99)
(290 real changes made, 290 to missing)

. 
. * For marital status the codes are 
. *   1 married; 2 Never married; 3 Widowed; 4 Divorced, annulment;
. *   5 Separated; 8 NA / DK; 9 No histories 85-93
. * Recode 2-5 as not married and treat 8 and 9 as missing
. replace married = . if (married==8 | married==9) 
(52 real changes made, 52 to missing)

. replace married = 0 if married > 1
(1785 real changes made)

. 
. * For kids the missing codes are 98 DK/NA and 99 no birth history
. replace kids = . if (kids==98 | kids==99)
(118 real changes made, 118 to missing)

. * But do not use these data as it is number of births 
. * not number of kids currently in household
. * So I drop kids
. drop kids 

. 
. * Work with positive earnings only
. drop if earnings==0
(1204 observations deleted)

. * Topcode women with very high earnings
. replace earnings=100000 if earnings>100000 
(11 real changes made)

. * Create log hourly wage 
. gen hwage = earnings/hours

. gen lnhwage = ln(hwage)

. 
. * Work with age 36 and nonmissing education data
. keep if age == 36
(3468 observations deleted)

. drop if educatn == .
(7 observations deleted)

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      intnum |       177    4699.853    2765.081         14       9240
     persnum |       177    59.53672    79.73001          1        188
         age |       177          36           0         36         36
     educatn |       177    12.58757    2.841347          3         17
    earnings |       177    17470.55    13513.56         87      70000
-------------+--------------------------------------------------------
       hours |       177    1506.401    698.4145          8       3160
         sex |       177           2           0          2          2
     married |       177    .7457627    .4366669          0          1
       hwage |       177    12.71631    16.58889   .6837607        175
     lnhwage |       177    2.198163    .8281614  -.3801473   5.164786

. 
. * Write data to a text (ascii) file so can use with programs other than Stata
. outfile intnum persnum age educatn earnings hours sex married hwage /*
>    */ lnhwage using mma09p1np.asc, replace

. 
. ********* ANALYSIS: (1)-(3) NONPARAMETRIC DENSITY ESTIMATES 
. 
. set scheme s1mono

. 
. * Here give bin width for histogram and kdensity  
. 
. * Calculate Silberman's plugin estimate of optimal bandwidth in (9.13)
. *   with delta given in Table 9.1 for Epanechnikov kernel
. quietly sum lnhwage, detail

. global sadj = min(r(sd),(r(p75)-r(p25))/1.349)

. di "sadj: " $sadj " iqr/1349: " (r(p75)-r(p25))/1.349 " stdev: " r(sd)
sadj: .65488184 iqr/1349: .65488184 stdev: .82816143

. global bwepan2 = 1.3643*1.7188*$sadj/(r(N)^0.2)

. di "Bandwidth: " $bwepan2 
Bandwidth: .54538542

. 
. * HISTOGRAM ONLY - Figure 9.1
. graph twoway (histogram lnhwage, bin(20) bcolor(*.2)), /*
>   */ scale (1.2) plotregion(style(none)) /*
>   */ title("Histogram for Log Wage") /*
>   */ xtitle("Log Hourly Wage", size(medlarge)) xscale(titlegap(*5)) /*
>   */ ytitle("Density", size(medlarge)) yscale(titlegap(*5)) /*
>   */ legend(pos(10) ring(0) col(1)) legend(size(small)) /*
>   */ legend( label(1 "Histogram") label(2 "Kernel")) 

. graph save ch9hist, replace
(file ch9hist.gph saved)

. graph export ch9hist.wmf, replace
(file c:\Imbook\bwebpage\Section2\ch9hist.wmf written in Windows Metafile format)

. 
. * COMBINED HISTOGRAM AND KERNEL DENSITY ESTIMATE
. graph twoway (histogram lnhwage, bin(20) bcolor(*.2)) /*
>   */ (kdensity lnhwage, width($bwepan2) epan2 clstyle(p1)), /* 
>   */ title("Histogram and Kernel Density for Log Wage") /*
>   */ caption("Note: Kernel is Epanechnikov with bandwidth 0.55") 

. 
. * KERNEL DENSITY ESTIMATE FOR 3 BANDWIDTHS - Figure 9.2
. global bwonehalf = 0.5*$bwepan2

. global btwotimes = 2*$bwepan2

. graph twoway (kdensity lnhwage, width($bwonehalf) epan2 clstyle(p2)) /*
>   */  (kdensity lnhwage, width($bwepan2) epan2 clstyle(p1)) /*
>   */  (kdensity lnhwage, width($btwotimes) epan2 clstyle(p3)), /*
>   */ scale (1.2) plotregion(style(none)) /*
>   */ title("Density Estimates as Bandwidth Varies") /*
>   */ xtitle("Log Hourly Wage", size(medlarge)) xscale(titlegap(*5)) /* 
>   */ ytitle("Kernel density estimates", size(medlarge)) yscale(titlegap(*5)) /* 
>   */ legend(pos(1) ring(0) col(1)) legend(size(small)) /*
>   */ legend( label(1 "One-half plug-in") label(2 "Plug-in") /*
>   */         label(3 "Two times plug-in"))  

. graph save ch9kd1, replace
(file ch9kd1.gph saved)

. graph export ch9kd1.wmf, replace
(file c:\Imbook\bwebpage\Section2\ch9kd1.wmf written in Windows Metafile format)

. 
. * KERNEL DENSITY ESTIMATE FOR 4 DIFFERENT KERNELS - Figure 9.4
. * Calculate Silberman's plugin optimal bandwidths using (9.13)
. * with delta given in Table 9.1 for the different kernels
. 
. * Use sadj calculated earlier for Epanecnnikov
. global bwgauss = 1.3643*0.7764*$sadj/(_N^0.2)

. global bwbiweight = 1.3643*2.0362*$sadj/(_N^0.2)

. global bwrectang = 0.5*1.3643*1.3510*$sadj/(_N^0.2)

. di "Usual Epanechnikov (epan2):      " $bwepan2 
Usual Epanechnikov (epan2):      .54538542

. di "Gaussian:                        " $bwgauss 
Gaussian:                        .24635632

. di "Quartic or biweight:             " $bwbiweight
Quartic or biweight:             .64609832

. di "Uniform or rectangular:          " $bwrectang
Uniform or rectangular:          .21434015

. graph twoway (kdensity lnhwage, width($bwepan2) epan2) /*
>   */ (kdensity lnhwage, width($bwgauss) gauss) /*
>   */ (kdensity lnhwage, width($bwbiweight) biweight) /* 
>   */ (kdensity lnhwage, width($bwrectang) rectangle), /*
>   */ scale (1.2) plotregion(style(none)) /*
>   */ title("Density Estimates as Kernel Varies") /*
>   */ xtitle("Log Hourly Wage", size(medlarge)) xscale(titlegap(*5)) /* 
>   */ ytitle("Kernel density estimates", size(medlarge)) yscale(titlegap(*5)) /* 
>   */ legend(pos(3) ring(0) col(1)) legend(size(small)) /*
>   */ legend( label(1 "Epanechnikov (h=0.545)") label(2 "Gaussian (h=0.246)") /*
>   */         label(3 "Quartic (h=0.646)") label(4 "Uniform (h=0.214)"))

. graph save ch9kdensu1, replace
(file ch9kdensu1.gph saved)

. graph export ch9kdensu1.wmf, replace
(file c:\Imbook\bwebpage\Section2\ch9kdensu1.wmf written in Windows Metafile format)

. 
. * SHOW THAT STATA EPANECHNIKOV = USUAL EPANECHNIKOV
. * Once divide usual Epanechnikov bandwidth by sqrt(5). 
. * (Pagan and Ullah (1999, p.28) have formulae.)
. global bwepan = $bwepan2/sqrt(5)

. graph twoway (kdensity lnhwage, width($bwepan2) epan2) /*
>    */  (kdensity lnhwage, width($bwepan) epan), /*
>    */  title("Epan = Epan2 if bandwidth adjusted") /*
>    */  legend( label(1 "Usual Epanechnikov") label(2 "Stata Epanechnikov"))   

. 
. 
. ********* ANALYSIS: (4) LOWESS NONPARAMETRIC REGRESSION ESTIMATES
. 
. * LOWESS WITH DEFAULT BANDWIDTH of 0.8 
. lowess lnhwage educatn

. 
. * LOWESS REGRESSION WITH BANDWIDTHS of 0.1, 0.4 and 0.8 - Figure 9.3
. graph twoway (scatter lnhwage educatn, msize(medsmall) msymbol(o)) /*
>   */ (lowess lnhwage educatn, bwidth(0.8) clstyle(p2)) /*
>   */ (lowess lnhwage educatn, bwidth(0.4) clstyle(p1)) /* 
>   */ (lowess lnhwage educatn, bwidth(0.1) clstyle(p3)), /*  
>   */ scale (1.2) plotregion(style(none)) /*
>   */ title("Nonparametric Regression as Bandwidth Varies") /*
>   */ xtitle("Years of Schooling", size(medlarge)) xscale(titlegap(*5)) /* 
>   */ ytitle("Log Hourly Wage", size(medlarge)) yscale(titlegap(*5)) /*
>   */ legend(pos(12) ring(0) col(2)) legend(size(small)) /*
>   */ legend( label(1 "Actual data") label(2 "Bandwidth h=0.8") /*
>   */         label(3 "Bandwidth h=0.4") label(4 "Bandwidth h=0.1"))

. graph save ch9ksm1, replace
(file ch9ksm1.gph saved)

. graph export ch9ksm1.wmf, replace
(file c:\Imbook\bwebpage\Section2\ch9ksm1.wmf written in Windows Metafile format)

. 
. ********* ANALYSIS: (5) EXTRA: K-NEAREST NEIGHBORS NONPARAMETRIC REGRESSION
. 
. * NEAREST NEIGHBOURS REGRESSION USING LOWESS
. * Use lowess with mean and noweight options to give running means = centered kNN
. global knnbwidth = 0.3

. di "knn via Lowess uses following % of sample: " $knnbwidth
knn via Lowess uses following % of sample: .3

. lowess lnhwage educatn, bwidth($knnbwidth) mean noweight

. 
. * LOWESS COMPARED TO NEAREST NEIGHBOURS
. graph twoway (lowess lnhwage educatn, bwidth(0.3) mean noweight) /*
>   */ (lowess lnhwage educatn, bwidth(0.3)), /*
>   */ title("Centered kNN versus Lowess") /*
>   */ legend( label(1 "Centered kNN") label(2 "Lowess 0.8"))

. 
. * NEAREST NEIGHBOURS REGRESSION USING KNNREG COMPARED TO USING LOWESS
. * knnreg is a Stata add-on (in Stata search knnreg to find and download)
. * Here we verify that same as lowess knn except knnreg drops endpoints
. global k = round($knnbwidth*_N)

. di "knnreg uses following number of neighbours: " $k
knnreg uses following number of neighbours: 53

. knnreg lnhwage educatn, k($k) gen(knnregpred) ylabel nograph

. lowess lnhwage educatn, bwidth($knnbwidth) gen(knnlowesspred) mean noweight nograph

. * Following shows that the same except knnreg drops endpoints and lowess does not
. sum knnlowesspred knnregpred

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
knnlowessp~d |       177    2.180308    .4522163   1.475512   2.954416
  knnregpred |       125    2.184309    .3412013   1.529874   2.802865

. corr knnlowesspred knnregpred
(obs=125)

             | knnlow~d knnreg~d
-------------+------------------
knnlowessp~d |   1.0000
  knnregpred |   1.0000   1.0000


. 
. ********* ANALYSIS: (6) EXTRA: KERNEL NONPARAMETRIC REGRESSION
. 
. * KERNEL REGRESSION
. * Kercode 1 = Uniform; 2 = Triangle; 3 = Epanechnikov; 4 = Quartic (Biweight);
. *         5 = Triweight; 6 = Gaussian; 7 = Cosinus
. * bwidth(#) defines width of the weight function window around each grid point.
. * npoint(#) specifies the number of equally spaced grid points over range of x.
. * Here bwidth(3) gives e.g. positive weight from x=4 to x=10 if current x0=7  
. kernreg lnhwage educatn, bwidth(3) kercode(3) npoint(100) ylabel gen(kernregpred1 xkernreg)

. graph twoway (lowess lnhwage educatn, bwidth(0.5) clstyle(p2)) /*
>   */ (line kernregpred xkernreg, clstyle(p1)), /*
>   */ title("Lowess versus kernel regression") /*
>   */ legend( label(1 "Lowess") label(2 "Kernreg"))

. 
. ********** CLOSE OUTPUT
. log close
       log:  c:\Imbook\bwebpage\Section2\mma09p1np.txt
  log type:  text
 closed on:  17 May 2005, 14:17:05
----------------------------------------------------------------------------------------------------
