* MMA20P1COUNT.DO Revised November 2006 (to add Table 20.6)
* Based on March 2005 for Stata version 8.0
log using mma20p1count.txt, text replace
********* OVERVIEW OF MMA20P1COUNT.DO **********
* STATA Program
* copyright C 2005 by A. Colin Cameron and Pravin K. Trivedi
* used for "Microeconometrics: Methods and Applications"
* by A. Colin Cameron and Pravin K. Trivedi (2005)
* Cambridge University Press
* Chapter 20.3 pages 671-4 and 20.7 page 690
* Count data regression example
* It provides
* (1) Frequency distribution for count (Table 20.3)
* (2) Data summary (Table 20.4)
* (3) Poisson regression with various standard errors (Table 20.5)
* (4) Negative binomial regression with various standard errors (Table 20.5)
* (5) Predicted probabilities from models (Table 20.6) * Added Nov 2006
* To use this program you need health expenditure data in Stata data set
* randdata.dta
********** SETUP **********
set mem 20m
set more off
version 8.0
set scheme s1mono /* Used for graphs */
********** DATA DESCRIPTION **********
* Essentially same data as in P. Deb and P.K. Trivedi (2002)
* "The Structure of Demand for Medical Care: Latent Class versus
* Two-Part Models", Journal of Health Economics, 21, 601-625
* except that paper used different outcome (counts rather than $)
* Each observation is for an individual over a year.
* Individuals may appear in up to five years.
* All available sample is used except only fee for service plans included.
* In analysis here only year 2 is used so panel complications are avoided.
* Clustering of individuals within household is ignored here.
* Dependent variable is
* MED med Annual medical expenditures in constant dollars
* excluding dental and outpatient mental
* LNMED lnmeddol Ln(Medical expenditures) given meddol > 0
* Missing otherwise
* DMED binexp 1 if medical expenditures > 0
* Regressors are
* - Health insurance measures
* LC logc log(coinsrate+1) where coinsurance rate is 0 to 100
* IDP idp 1 if individual deductible plan
* LPI lpi 1og(annual participation incentive payment) or 0 if no payment
* FMDE fmde log(max(medical deductible expenditure)) if IDP=1 and MDE>1 or 0 otherwise.
* - Health status measures
* NDISEASE disea number of chronic diseases
* PHYSLIM physlm 1 if physical limitation
* HLTHG hlthg 1 if good health
* HLTHF hlthf 1 if good health
* HLTHP hlthp 1 if good health (omitted is excellent)
* - Socioeconomic characteristics
* LINC linc log of annual family income (in $)
* LFAM lfam log of family size
* EDUCDEC educdec years of schooling of decision maker
* AGE xage exact age
* BLACK black 1 if black
* FEMALE female 1 if female
* CHILD child 1 if child
* FEMCHILD fchild 1 if female child
* If panel data used then clustering is on
* zper person id
********** READ DATA, SELECT AND TRANSFORM **********
use randdata.dta, clear
sum
/* Describe and summarize the original data.
describe
summarize
* The orignal data are a panel.
* The following summarizes panel features for completeness
iis zper
tis year
xtdes
xtsum meddol lnmeddol binexp
*/
* Note that unlike chapter 16 we use all years, not just year 2
* educdec is missing for some observations
drop if educdec==.
* rename variables
rename mdvis MDU
rename meddol MED
rename binexp DMED
rename lnmeddol LNMED
rename linc LINC
rename lfam LFAM
rename educdec EDUCDEC
rename xage AGE
rename female FEMALE
rename child CHILD
rename fchild FEMCHILD
rename black BLACK
rename disea NDISEASE
rename physlm PHYSLIM
rename hlthg HLTHG
rename hlthf HLTHF
rename hlthp HLTHP
rename idp IDP
rename logc LC
rename lpi LPI
rename fmde FMDE
* Define the regressor list which in commands can refer to as $XLIST
global XLIST LC IDP LPI FMDE PHYSLIM NDISEASE HLTHG HLTHF HLTHP /*
*/ LINC LFAM EDUCDEC AGE FEMALE CHILD FEMCHILD BLACK
sum MDU $XLIST
* Write final data to a text (ascii) file so can use with programs other than Stata
outfile MDU LC IDP LPI FMDE PHYSLIM NDISEASE HLTHG HLTHF HLTHP /*
*/ LINC LFAM EDUCDEC AGE FEMALE CHILD FEMCHILD BLACK /*
*/ using mma20p1count.asc, replace
********** (1) FREQUENCIES OF COUNT (Table 20.3, page 672) **********
* Following ggives Table 20.3 (page 672) frequencies
tabulate MDU
* Histogram with kernel density estimate
hist MDU, discrete kdensity
********** (2) DATA SUMMARY (Table 20.4, page 672) **********
* Following gives variables in same order as Table 20.4 (page 672)
sum MDU LC IDP LPI FMDE LINC LFAM AGE FEMALE CHILD FEMCHILD BLACK /*
*/ EDUCDEC PHYSLIM NDISEASE HLTHG HLTHF HLTHP
*********** (3, 4) REGRESSION ANALYSIS **************
* Here just two estimators - Poisson and negative binomial
* but three ways to calculate standard errors
* (A) default ML
* (B) robust (to misspecification of heteroskedasticity)
* (C) cluster-robust needed here as data are actually panel (see chapter 21, 24)
*** Table 20.5 Poisson regression estimates
* Default standard errors assume variance = mean (ignoring overdispersion)
* This is first t-ratio in Table 20.5
poisson MDU $XLIST
estimates store poisml
* Should always control for possible overdispersion
* This is second t-ratio in Table 20.5
poisson MDU $XLIST, robust
estimates store poisrobust
* Should also control here for clustering (see chapter 24)
* as up to four years of data for each person.
* Table 20.5 did not report these results
poisson MDU $XLIST, cluster(zper)
estimates store poiscluster
*** Table 20.5 Negative binomial regression estimates
* Default standard errors assume variance = mean (ignoring overdispersion)
* This is first t-ratio in Table 20.5
nbreg MDU $XLIST
estimates store nbml
* Should always control for possible overdispersion
* This is second t-ratio in Table 20.5
nbreg MDU $XLIST, robust
estimates store nbrobust
* Should also control here for clustering (see chapter 24)
* as up to four years of data for each person.
* Table 20.5 did not report these results
nbreg MDU $XLIST, cluster(zper)
estimates store nbcluster
*** Table 20.6 Predicted probabilities
* This is coded for counts 0, 1, 2, ..., $CELLMAX or more
global CELLMAX 10 // User can change the value here
global MAXLESS1 = $CELLMAX-1 // Need in loops below
** First row of Table 20.6
* Obtain sample frequencies (up to 10 or more)
gen yactual = MDU
replace yactual = $CELLMAX if yactual > 10
tabulate yactual
** Second row of Table 20.6
* Obtain Possion predicted probabilities (up to 9 or more)
* Uses the Poisson recursion Pr[y=k] = Prob[y=k-1]*mu/k for k>=1
quietly poisson MDU $XLIST
predict pmu, n // gives exp(x'b)
gen p0 = exp(-pmu)
gen sump = p0
foreach k of numlist 1(1)$MAXLESS1 {
local j = `k'-1
gen p`k' = p`j'*pmu/`k'
replace sump = sump + p`k'
}
gen p$CELLMAX = 1 - sump
sum p0-p$CELLMAX
** Third row of Table 20.6
* Obtain Negative binomial predicted probabilities (up to 9 or more)
* Uses the Negative binomial recursion
* Pr[y=k] = Prob[y=k-1]*(a1+k-1)*(mu/(mu+a1))/k for K>=1
* where mu = exp(x'b) and a1 = alpha^-1
quietly nbreg MDU $XLIST
predict nmu, n // gives exp(x'b)
scalar a1 = 1/e(alpha) // gives the inverse of alpha
gen n0 = (a1/(a1+nmu))^a1
gen sumn = n0
foreach k of numlist 1(1)$MAXLESS1 {
local j = `k'-1
gen n`k' = n`j'*(a1+`j')*nmu/((a1+nmu)*`k')
replace sumn = sumn + n`k'
}
gen n$CELLMAX = 1 - sumn
* For unknown reason this differs a little from Table 20.6
sum n0-n$CELLMAX
** EXTRA: Chisquare goodness of fit test for Negative binomial
** Defined in (8.27) and CoOmputed by auxiliary regression (8.5)
* First form the differences in actual and predicted for each count
foreach k of numlist 0(1)$MAXLESS1 {
gen y`k' = MDU==`k'
}
gen y$CELLMAX = MDU > $MAXLESS1
foreach k of numlist 0(1)$CELLMAX {
gen dn`k' = y`k' - n`k'
}
sum dn0-dn$CELLMAX
* Then run the auxiliary regression which requires getting the scores.
* The following code is very specific to the regressors in this example.
quietly nbreg MDU $XLIST
predict u salpha, score
#delimit ;
gen s1 = u; gen s2 = u*LC; gen s3 = u*IDP; gen s4 = u*LPI;
gen s5 = u*FMDE; gen s6 = u*PHYSLIM; gen s7 = u*NDISEASE;
gen s8 = u*HLTHG; gen s9 = u*HLTHF; gen s10 = u*HLTHP;
gen s11 = u*LINC; gen s12 = u*LFAM; gen s13 = u*EDUCDEC; gen s14 = u*AGE;
gen s15 = u*FEMALE; gen s16 = u*CHILD; gen s17 = u*FEMCHILD; gen s18 = u*BLACK;
#delimit cr
gen one = 1
quietly regress one dn0-dn$MAXLESS1 s1-s18 salpha, noconstant
scalar CGF = e(N)*e(r2) // This is statistic in (8.27) computed using (8.5)
di "Chis-square goodness of fit statistic: " CGF " p-value: " chi2tail($MAXLESS1,CGF)
************ DISPLAY RESULTS FOR TABLE 20.5 (page 673) ************
* Note for brevity the coefficients for only some of the regressors
* are given in Table 20.5
* First columns of Table 20.5 (page 673) plus cluster-robust
estimates table poisml poisrobust poiscluster, t stats(N ll rank aic bic) b(%10.4f) t(%10.3f)
* Last columns of Table 20.5 (page 673) give bnbml. Also give others.
estimates table nbml nbrobust nbcluster, t stats(N ll rank aic bic) b(%10.4f) t(%10.3f)
* For Poisson correcting for overdispersion is most important.
* For negative binomial overdispersion is already incorporated.
* For both controlling for clustering (in this example with panel data)
* is also needed.
********** CLOSE OUTPUT
log close
clear
exit