------------------------------------------------------------------------------------------------------ log: c:\Imbook\bwebpage\Section2\mma09p1np.txt log type: text opened on: 17 May 2005, 14:16:51 . . ********** OVERVIEW OF MMA09P1NP.DO ********** . . * STATA Program . * copyright C 2005 by A. Colin Cameron and Pravin K. Trivedi . * used for "Microeconometrics: Methods and Applications" . * by A. Colin Cameron and Pravin K. Trivedi (2005) . * Cambridge University Press . . * Chapter 9.2 p.295-297 . * Nonparametric density estimation and nonparametric regression using actual data. . . * (1) Histogram: Figure 9.1 in chapter 9.2.1 (ch9hist) . * (2) Kernel density estimate as bandwidth varies: Figure 9.2 in chapter 9.2.1 (ch9kd1) . * (3) Kernel density estimate as kernel varies: Figure 9.4 in chapter 9.3.4 (ch9kdensu1) . * (4) Lowess regression: Figure 9.3 in chapter 9.4.3 (ch9ksm1) . * (5) Extra: Nearest neighbours regression: using Lowess and using add-on knnreg . * (6) Extra: Kernel regression: using add-on kernreg . . * using data on earnings and education (see below) . . * NOTE: This particular program uses version 8.2 rather than 8.0 . * For kernel density Stata uses an alternative formulation of Epanechnikov . * To follow book and e.g. Hardle (1990) use epan2 rather than epan . * epan = epan2 if epan bandwidth is epan2 bandwidth divided by sqrt(5) . * where kernel epan2 is an update to Stata version 8.2 . . * To run this program you need file . * psidf3050.dat . * in your directory . . * To do (5) and (6) you need Stata add-ons knnreg and kernreg . * In Stata give command search knnreg and search kernreg . . * See also mma9p2npmore.do for more on nonparametric regression (Figures 9.5-9.7) . . ********** SETUP . . di "mma09p1np.do Cameron and Trivedi: Stata nonparametrics with wages and education" mma09p1np.do Cameron and Trivedi: Stata nonparametrics with wages and education . set more off . version 8 . set scheme s1mono /* Graphics scheme */ . . ********** DATA DESCRIPTION . * . * The original data are from the PSID Individual Level Final Release 1993 data . * From www.isr.umich.edu/src/psid then choose Data Center . * 4856 observations on 9 variables for Females 30 to 50 years . . * Fixed width data . * intnum 1-4 V30001="1968 INTERVIEW NUMBER" . * persnum 5-7 V30002="PERSON NUMBER" . * age 8-9 V30809="AGE OF INDIVIDUAL 93" . * educatn 10-11 V30820="G90 HIGHEST GRADE COMPLETED 93" . * earnings 12-17 V30821="TOTAL LABOR INCOME 93" . * hours 18-21 V30823="1992 ANNUAL WORK HOURS 93" . * sex 22 V32000="SEX OF INDIVIDUAL" . * kids 23-24 V32022="# LIVE BIRTHS TO THIS INDIVIDUAL" . * [NOTE: DO NOT USE THE kids VARIABLE AS IT IS NUMBER OF BIRTHS . * NOT NUMBER OF KIDS CURRENTLYU IN HOUSEHOLD] . * married 25 V32049="LAST KNOWN MARITAL STATUS" . . ********** READ DATA ********** . . * Data are fixed format so use infix . infix intnum 1-4 persnum 5-7 age 8-9 educatn 10-11 earnings 12-17 /* > */ hours 18-21 sex 22 kids 23-24 married 25 using psidf3050.dat (4856 observations read) . summarize Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- intnum | 4856 4598.101 2761.971 4 9306 persnum | 4856 59.21355 79.74856 1 205 age | 4856 38.46293 5.595116 30 50 educatn | 4855 16.37714 18.4495 0 99 earnings | 4856 14244.51 15985.45 0 240000 -------------+-------------------------------------------------------- hours | 4856 1235.335 947.1758 0 5160 sex | 4856 2 0 2 2 kids | 4856 4.48126 14.88786 0 99 married | 4856 1.920717 1.504848 1 9 . . ********** MISSING VALUES, DATA TRANSFORMATIONS and SAMPLE SELECTION . . * For Highest grade codes the missing codes are 98 DK and 99 NA and 0 inappropriate . * Here treat these as missing . replace educatn = . if (educatn==0 | educatn==98 | educatn==99) (290 real changes made, 290 to missing) . . * For marital status the codes are . * 1 married; 2 Never married; 3 Widowed; 4 Divorced, annulment; . * 5 Separated; 8 NA / DK; 9 No histories 85-93 . * Recode 2-5 as not married and treat 8 and 9 as missing . replace married = . if (married==8 | married==9) (52 real changes made, 52 to missing) . replace married = 0 if married > 1 (1785 real changes made) . . * For kids the missing codes are 98 DK/NA and 99 no birth history . replace kids = . if (kids==98 | kids==99) (118 real changes made, 118 to missing) . * But do not use these data as it is number of births . * not number of kids currently in household . * So I drop kids . drop kids . . * Work with positive earnings only . drop if earnings==0 (1204 observations deleted) . * Topcode women with very high earnings . replace earnings=100000 if earnings>100000 (11 real changes made) . * Create log hourly wage . gen hwage = earnings/hours . gen lnhwage = ln(hwage) . . * Work with age 36 and nonmissing education data . keep if age == 36 (3468 observations deleted) . drop if educatn == . (7 observations deleted) . summarize Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- intnum | 177 4699.853 2765.081 14 9240 persnum | 177 59.53672 79.73001 1 188 age | 177 36 0 36 36 educatn | 177 12.58757 2.841347 3 17 earnings | 177 17470.55 13513.56 87 70000 -------------+-------------------------------------------------------- hours | 177 1506.401 698.4145 8 3160 sex | 177 2 0 2 2 married | 177 .7457627 .4366669 0 1 hwage | 177 12.71631 16.58889 .6837607 175 lnhwage | 177 2.198163 .8281614 -.3801473 5.164786 . . * Write data to a text (ascii) file so can use with programs other than Stata . outfile intnum persnum age educatn earnings hours sex married hwage /* > */ lnhwage using mma09p1np.asc, replace . . ********* ANALYSIS: (1)-(3) NONPARAMETRIC DENSITY ESTIMATES . . set scheme s1mono . . * Here give bin width for histogram and kdensity . . * Calculate Silberman's plugin estimate of optimal bandwidth in (9.13) . * with delta given in Table 9.1 for Epanechnikov kernel . quietly sum lnhwage, detail . global sadj = min(r(sd),(r(p75)-r(p25))/1.349) . di "sadj: " $sadj " iqr/1349: " (r(p75)-r(p25))/1.349 " stdev: " r(sd) sadj: .65488184 iqr/1349: .65488184 stdev: .82816143 . global bwepan2 = 1.3643*1.7188*$sadj/(r(N)^0.2) . di "Bandwidth: " $bwepan2 Bandwidth: .54538542 . . * HISTOGRAM ONLY - Figure 9.1 . graph twoway (histogram lnhwage, bin(20) bcolor(*.2)), /* > */ scale (1.2) plotregion(style(none)) /* > */ title("Histogram for Log Wage") /* > */ xtitle("Log Hourly Wage", size(medlarge)) xscale(titlegap(*5)) /* > */ ytitle("Density", size(medlarge)) yscale(titlegap(*5)) /* > */ legend(pos(10) ring(0) col(1)) legend(size(small)) /* > */ legend( label(1 "Histogram") label(2 "Kernel")) . graph save ch9hist, replace (file ch9hist.gph saved) . graph export ch9hist.wmf, replace (file c:\Imbook\bwebpage\Section2\ch9hist.wmf written in Windows Metafile format) . . * COMBINED HISTOGRAM AND KERNEL DENSITY ESTIMATE . graph twoway (histogram lnhwage, bin(20) bcolor(*.2)) /* > */ (kdensity lnhwage, width($bwepan2) epan2 clstyle(p1)), /* > */ title("Histogram and Kernel Density for Log Wage") /* > */ caption("Note: Kernel is Epanechnikov with bandwidth 0.55") . . * KERNEL DENSITY ESTIMATE FOR 3 BANDWIDTHS - Figure 9.2 . global bwonehalf = 0.5*$bwepan2 . global btwotimes = 2*$bwepan2 . graph twoway (kdensity lnhwage, width($bwonehalf) epan2 clstyle(p2)) /* > */ (kdensity lnhwage, width($bwepan2) epan2 clstyle(p1)) /* > */ (kdensity lnhwage, width($btwotimes) epan2 clstyle(p3)), /* > */ scale (1.2) plotregion(style(none)) /* > */ title("Density Estimates as Bandwidth Varies") /* > */ xtitle("Log Hourly Wage", size(medlarge)) xscale(titlegap(*5)) /* > */ ytitle("Kernel density estimates", size(medlarge)) yscale(titlegap(*5)) /* > */ legend(pos(1) ring(0) col(1)) legend(size(small)) /* > */ legend( label(1 "One-half plug-in") label(2 "Plug-in") /* > */ label(3 "Two times plug-in")) . graph save ch9kd1, replace (file ch9kd1.gph saved) . graph export ch9kd1.wmf, replace (file c:\Imbook\bwebpage\Section2\ch9kd1.wmf written in Windows Metafile format) . . * KERNEL DENSITY ESTIMATE FOR 4 DIFFERENT KERNELS - Figure 9.4 . * Calculate Silberman's plugin optimal bandwidths using (9.13) . * with delta given in Table 9.1 for the different kernels . . * Use sadj calculated earlier for Epanecnnikov . global bwgauss = 1.3643*0.7764*$sadj/(_N^0.2) . global bwbiweight = 1.3643*2.0362*$sadj/(_N^0.2) . global bwrectang = 0.5*1.3643*1.3510*$sadj/(_N^0.2) . di "Usual Epanechnikov (epan2): " $bwepan2 Usual Epanechnikov (epan2): .54538542 . di "Gaussian: " $bwgauss Gaussian: .24635632 . di "Quartic or biweight: " $bwbiweight Quartic or biweight: .64609832 . di "Uniform or rectangular: " $bwrectang Uniform or rectangular: .21434015 . graph twoway (kdensity lnhwage, width($bwepan2) epan2) /* > */ (kdensity lnhwage, width($bwgauss) gauss) /* > */ (kdensity lnhwage, width($bwbiweight) biweight) /* > */ (kdensity lnhwage, width($bwrectang) rectangle), /* > */ scale (1.2) plotregion(style(none)) /* > */ title("Density Estimates as Kernel Varies") /* > */ xtitle("Log Hourly Wage", size(medlarge)) xscale(titlegap(*5)) /* > */ ytitle("Kernel density estimates", size(medlarge)) yscale(titlegap(*5)) /* > */ legend(pos(3) ring(0) col(1)) legend(size(small)) /* > */ legend( label(1 "Epanechnikov (h=0.545)") label(2 "Gaussian (h=0.246)") /* > */ label(3 "Quartic (h=0.646)") label(4 "Uniform (h=0.214)")) . graph save ch9kdensu1, replace (file ch9kdensu1.gph saved) . graph export ch9kdensu1.wmf, replace (file c:\Imbook\bwebpage\Section2\ch9kdensu1.wmf written in Windows Metafile format) . . * SHOW THAT STATA EPANECHNIKOV = USUAL EPANECHNIKOV . * Once divide usual Epanechnikov bandwidth by sqrt(5). . * (Pagan and Ullah (1999, p.28) have formulae.) . global bwepan = $bwepan2/sqrt(5) . graph twoway (kdensity lnhwage, width($bwepan2) epan2) /* > */ (kdensity lnhwage, width($bwepan) epan), /* > */ title("Epan = Epan2 if bandwidth adjusted") /* > */ legend( label(1 "Usual Epanechnikov") label(2 "Stata Epanechnikov")) . . . ********* ANALYSIS: (4) LOWESS NONPARAMETRIC REGRESSION ESTIMATES . . * LOWESS WITH DEFAULT BANDWIDTH of 0.8 . lowess lnhwage educatn . . * LOWESS REGRESSION WITH BANDWIDTHS of 0.1, 0.4 and 0.8 - Figure 9.3 . graph twoway (scatter lnhwage educatn, msize(medsmall) msymbol(o)) /* > */ (lowess lnhwage educatn, bwidth(0.8) clstyle(p2)) /* > */ (lowess lnhwage educatn, bwidth(0.4) clstyle(p1)) /* > */ (lowess lnhwage educatn, bwidth(0.1) clstyle(p3)), /* > */ scale (1.2) plotregion(style(none)) /* > */ title("Nonparametric Regression as Bandwidth Varies") /* > */ xtitle("Years of Schooling", size(medlarge)) xscale(titlegap(*5)) /* > */ ytitle("Log Hourly Wage", size(medlarge)) yscale(titlegap(*5)) /* > */ legend(pos(12) ring(0) col(2)) legend(size(small)) /* > */ legend( label(1 "Actual data") label(2 "Bandwidth h=0.8") /* > */ label(3 "Bandwidth h=0.4") label(4 "Bandwidth h=0.1")) . graph save ch9ksm1, replace (file ch9ksm1.gph saved) . graph export ch9ksm1.wmf, replace (file c:\Imbook\bwebpage\Section2\ch9ksm1.wmf written in Windows Metafile format) . . ********* ANALYSIS: (5) EXTRA: K-NEAREST NEIGHBORS NONPARAMETRIC REGRESSION . . * NEAREST NEIGHBOURS REGRESSION USING LOWESS . * Use lowess with mean and noweight options to give running means = centered kNN . global knnbwidth = 0.3 . di "knn via Lowess uses following % of sample: " $knnbwidth knn via Lowess uses following % of sample: .3 . lowess lnhwage educatn, bwidth($knnbwidth) mean noweight . . * LOWESS COMPARED TO NEAREST NEIGHBOURS . graph twoway (lowess lnhwage educatn, bwidth(0.3) mean noweight) /* > */ (lowess lnhwage educatn, bwidth(0.3)), /* > */ title("Centered kNN versus Lowess") /* > */ legend( label(1 "Centered kNN") label(2 "Lowess 0.8")) . . * NEAREST NEIGHBOURS REGRESSION USING KNNREG COMPARED TO USING LOWESS . * knnreg is a Stata add-on (in Stata search knnreg to find and download) . * Here we verify that same as lowess knn except knnreg drops endpoints . global k = round($knnbwidth*_N) . di "knnreg uses following number of neighbours: " $k knnreg uses following number of neighbours: 53 . knnreg lnhwage educatn, k($k) gen(knnregpred) ylabel nograph . lowess lnhwage educatn, bwidth($knnbwidth) gen(knnlowesspred) mean noweight nograph . * Following shows that the same except knnreg drops endpoints and lowess does not . sum knnlowesspred knnregpred Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- knnlowessp~d | 177 2.180308 .4522163 1.475512 2.954416 knnregpred | 125 2.184309 .3412013 1.529874 2.802865 . corr knnlowesspred knnregpred (obs=125) | knnlow~d knnreg~d -------------+------------------ knnlowessp~d | 1.0000 knnregpred | 1.0000 1.0000 . . ********* ANALYSIS: (6) EXTRA: KERNEL NONPARAMETRIC REGRESSION . . * KERNEL REGRESSION . * Kercode 1 = Uniform; 2 = Triangle; 3 = Epanechnikov; 4 = Quartic (Biweight); . * 5 = Triweight; 6 = Gaussian; 7 = Cosinus . * bwidth(#) defines width of the weight function window around each grid point. . * npoint(#) specifies the number of equally spaced grid points over range of x. . * Here bwidth(3) gives e.g. positive weight from x=4 to x=10 if current x0=7 . kernreg lnhwage educatn, bwidth(3) kercode(3) npoint(100) ylabel gen(kernregpred1 xkernreg) . graph twoway (lowess lnhwage educatn, bwidth(0.5) clstyle(p2)) /* > */ (line kernregpred xkernreg, clstyle(p1)), /* > */ title("Lowess versus kernel regression") /* > */ legend( label(1 "Lowess") label(2 "Kernreg")) . . ********** CLOSE OUTPUT . log close log: c:\Imbook\bwebpage\Section2\mma09p1np.txt log type: text closed on: 17 May 2005, 14:17:05 ----------------------------------------------------------------------------------------------------