Categorical variables and regression




1 Categorical variables and regression

1.1 Dummy coding
1.2 Effects coding
1.3 Contrast coding
1.4 Nonsense coding
1.5 Interactions

1.5.1 Categorical by categorical variable interactions
1.5.2 Categorical by continuous variable interactions







Categorical variables and regression

Categorical variables represent a qualitative method of scoring data (i.e., they represent categories or group membership). They can be included as independent variables in a regression analysis, or as dependent variables in logistic regression or probit regression, but they must be converted to quantitative data before the data can be analyzed. One does so through the use of coding systems. Analyses are conducted such that only g − 1 code variables are used (g being the number of groups). This minimizes redundancy while still representing the complete data set, since no additional information would be gained from coding all g groups: for example, when coding gender (where g = 2: male and female), if we code only females, everyone left over must be male. In general, the group that one does not code is the group of least interest.
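The redundancy argument can be checked directly. The following is a minimal sketch, assuming a small made-up gender sample; the function name `indicator_columns` and the data are illustrative only and are not part of the original text.

```python
def indicator_columns(values, levels):
    """Return one 0/1 indicator column per level, keyed by level name."""
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical sample, invented for illustration.
gender = ["female", "male", "female", "female", "male"]
columns = indicator_columns(gender, ["female", "male"])

# With g = 2 groups, a single code variable (g - 1 = 1) is enough.
female = columns["female"]

# The second column carries no new information: it is 1 minus the first.
male_from_female = [1 - f for f in female]
redundant = male_from_female == columns["male"]

print(female)     # [1, 0, 1, 1, 0]
print(redundant)  # True
```

Because the second column can always be rebuilt from the first, including it alongside an intercept would only duplicate information already in the model.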


There are three main coding systems typically used in the analysis of categorical variables in regression: dummy coding, effects coding, and contrast coding. The regression equation takes the form Y = bX + a, where b is the slope and gives the weight empirically assigned to an explanator, X is the explanatory variable, and a is the Y-intercept. These values take on different meanings depending on the coding system used. The choice of coding system does not affect the F or R² statistics. However, one chooses a coding system based on the comparison of interest, since the interpretation of the b values will vary.
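To make this concrete, here is a minimal sketch that fits Y = bX + a to the same two-group data under two different codings. The scores and the helper `simple_regression` are assumptions made for illustration; the fit is ordinary least squares for a single predictor.

```python
from statistics import mean

def simple_regression(x, y):
    """Least-squares fit of y = b*x + a; returns (a, b, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b, (sxy ** 2) / (sxx * syy)

# Hypothetical scores: three control cases followed by three experimental cases.
y = [4.0, 6.0, 5.0, 7.0, 9.0, 8.0]
dummy_x = [0, 0, 0, 1, 1, 1]       # control = 0, experimental = 1
effects_x = [-1, -1, -1, 1, 1, 1]  # control = -1, experimental = 1

a_dummy, b_dummy, r2_dummy = simple_regression(dummy_x, y)
a_effects, b_effects, r2_effects = simple_regression(effects_x, y)

print(a_dummy, b_dummy)      # 5.0 3.0  (control mean, difference in means)
print(a_effects, b_effects)  # 6.5 1.5  (grand mean, deviation from it)
print(abs(r2_dummy - r2_effects) < 1e-12)  # True
```

The intercept and slope change with the coding, while R² is the same in both fits, which is the point made in the paragraph above.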


Dummy coding

Dummy coding is used when there is a control or comparison group in mind. One is therefore analyzing the data of one group in relation to the comparison group: a represents the mean of the control group, and b is the difference between the mean of the experimental group and the mean of the control group. It is suggested that three criteria be met when specifying a suitable control group: the group should be a well-established group (e.g., it should not be an “other” category), there should be a logical reason for selecting this group as a comparison (e.g., the group is anticipated to score highest on the dependent variable), and finally, the group’s sample size should be substantive and not small compared to the other groups.


In dummy coding, the reference group is assigned a value of 0 on every code variable. The group of interest for comparison with the reference group is assigned a value of 1 on its own code variable, while all other groups are assigned 0 on that particular code variable.


The b values should be interpreted such that the experimental group is being compared against the control group. A negative b value therefore means that the experimental group scored lower than the control group on the dependent variable. To illustrate this, suppose that we are measuring optimism among several nationalities and have decided that French people will serve as a useful control. If we compare them against Italians and observe a negative b value, this suggests that Italians obtain lower optimism scores on average.


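The following table is an example of dummy coding with French as the control group and C1, C2, and C3 respectively being the codes for Italian, German, and Other (neither French nor Italian nor German):

Nationality   C1   C2   C3
French         0    0    0
Italian        1    0    0
German         0    1    0
Other          0    0    1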
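The dummy coding scheme described above can be sketched in a few lines. The optimism scores are invented, and `dummy_codes` and `predict` are hypothetical helpers. Rather than running a full multiple regression, the sketch computes a and b from the group means; this relies on the fact that, with a single categorical predictor, a least-squares fit reproduces each group mean exactly.

```python
from statistics import mean

def dummy_codes(levels, reference):
    """Map each level to its g - 1 dummy codes; the reference level is all zeros."""
    coded = [lv for lv in levels if lv != reference]
    return {lv: [1 if lv == c else 0 for c in coded] for lv in levels}

levels = ["French", "Italian", "German", "Other"]
codes = dummy_codes(levels, reference="French")  # columns: C1, C2, C3

# Hypothetical optimism scores, invented purely for illustration.
scores = {
    "French": [6.0, 7.0, 8.0],
    "Italian": [4.0, 5.0, 6.0],
    "German": [7.0, 8.0, 9.0],
    "Other": [5.0, 6.0, 7.0],
}
group_mean = {lv: mean(v) for lv, v in scores.items()}

# Interpretation from the text: a is the control mean,
# and each b is a group's difference from that mean.
a = group_mean["French"]
b = [group_mean[lv] - a for lv in levels if lv != "French"]

def predict(level):
    """Evaluate a + b1*C1 + b2*C2 + b3*C3 for the given group."""
    return a + sum(bj * cj for bj, cj in zip(b, codes[level]))

print(codes["Italian"])    # [1, 0, 0]
print(b)                   # [-2.0, 1.0, -1.0]
print(predict("Italian"))  # 5.0
```

Here the negative first b value corresponds to the made-up Italian group scoring two points below the French control group.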



Effects coding

In the effects coding system, data are analyzed by comparing one group to all other groups. Unlike dummy coding, there is no control group. Rather, the comparison is made against the mean of all groups combined (a is now the grand mean). Therefore, one is not looking at the data in relation to another group; rather, one is looking at the data in relation to the grand mean.


Effects coding can be either weighted or unweighted. Weighted effects coding simply calculates a weighted grand mean, thereby taking into account the sample size of each group. This is most appropriate in situations where the sample is representative of the population in question. Unweighted effects coding is most appropriate in situations where differences in sample size are the result of incidental factors. The interpretation of b differs between the two: in unweighted effects coding, b is the difference between the mean of the experimental group and the grand mean, whereas in the weighted case it is the mean of the experimental group minus the weighted grand mean.


In effects coding, we code the group of interest with a 1, just as we would in dummy coding. The principal difference is that we code −1 for the group we are least interested in. Since we continue to use a g − 1 coding scheme, it is in fact the −1 coded group that will not produce data, which is why it should be the group of least interest. A code of 0 is assigned to all other groups.


The b values should be interpreted such that the experimental group is being compared against the mean of all groups combined (or the weighted grand mean in the case of weighted effects coding). A negative b value therefore means that the coded group scored lower than the mean of all groups on the dependent variable. Using our previous example of optimism scores among nationalities, if the group of interest is Italians, a negative b value suggests that they obtain a lower optimism score than the grand mean.
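The difference between the two grand means, and the role of the −1 row, can be illustrated with a short sketch. The scores are invented, the column order (French, Italian, German) is an assumption, and a and b are derived from the group means rather than from a fitted regression, which is valid here because a single categorical predictor reproduces the group means exactly.

```python
from statistics import mean

# Hypothetical optimism scores with unequal group sizes, invented for illustration.
scores = {
    "French": [6.0, 8.0],
    "Italian": [4.0, 5.0, 6.0],
    "German": [8.0, 9.0, 10.0, 9.0],
    "Other": [5.0],
}
group_mean = {g: mean(v) for g, v in scores.items()}

# Unweighted grand mean: every group counts equally.
unweighted = mean(group_mean.values())
# Weighted grand mean: every observation counts equally.
weighted = mean(x for v in scores.values() for x in v)

# Unweighted effects codes; "Other" is the group of least interest.
coded = ["French", "Italian", "German"]  # assumed column order
codes = {g: [1 if g == c else 0 for c in coded] for g in coded}
codes["Other"] = [-1, -1, -1]

a = unweighted
b = [group_mean[g] - a for g in coded]

def predict(group):
    """Evaluate a + b1*C1 + b2*C2 + b3*C3 for the given group."""
    return a + sum(bj * cj for bj, cj in zip(b, codes[group]))

print(unweighted, weighted)            # 6.5 7.0
print(b)                               # [0.5, -1.5, 2.5]
print(predict("Other"))                # 5.0
print(group_mean["Italian"] - weighted)  # -2.0
```

The −1 group gets no b of its own, yet its mean is still recovered from the other coefficients. The last line shows the weighted counterpart of the Italian b value; the weighted codes themselves are not constructed in this sketch.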


The following table is an example of effects coding with Other as the group of least interest.



Contrast coding

The contrast coding system allows a researcher to ask specific questions directly. Rather than having the coding system dictate the comparison being made (i.e., against a control group as in dummy coding, or against all groups as in effects coding), one can design a unique comparison catering to one's specific research question. This tailored hypothesis is generally based on previous theory and/or research. The hypotheses proposed are generally as follows: first, there is a central hypothesis, which postulates a large difference between two sets of groups; a second hypothesis suggests that within each set, the differences among the groups are small. Through its a priori focused hypotheses, contrast coding may yield an increase in the power of the statistical test compared with the less directed coding systems described previously.


Certain differences emerge when we compare our a priori coefficients between ANOVA and regression. In ANOVA, it is at the researcher’s discretion whether to choose coefficient values that are orthogonal or non-orthogonal; in regression, it is essential that the coefficient values assigned in contrast coding be orthogonal. Furthermore, in regression, coefficient values must be in either fractional or decimal form. They cannot take on interval values.


The construction of contrast codes is restricted by three rules:



Violating rule 2 still produces accurate R² and F values, meaning that we would reach the same conclusions about whether or not there is a significant difference; however, we can no longer interpret the b values as a mean difference.


To illustrate the construction of contrast codes, consider the following table. The coefficients were chosen to illustrate our a priori hypotheses.

Hypothesis 1: French and Italian persons will score higher on optimism than Germans (French = +0.33, Italian = +0.33, German = −0.66). This is illustrated by assigning the same coefficient to the French and Italian categories and a different one to the Germans. The signs assigned indicate the direction of the relationship (hence the Germans are given a negative sign, indicating their lower hypothesized optimism scores).

Hypothesis 2: French and Italians are expected to differ on their optimism scores (French = +0.50, Italian = −0.50, German = 0). Here, assigning a value of 0 to the Germans reflects their non-inclusion in the analysis of this hypothesis. Again, the signs assigned indicate the proposed relationship.

Nationality   C1      C2
French        +0.33   +0.50
Italian       +0.33   −0.50
German        −0.66    0
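The two sets of coefficients quoted above can be checked for the properties the text relies on. This is a minimal sketch: the orthogonality requirement comes from the text, while the sum-to-zero check is the usual definition of a contrast and is an assumption here. The helper names are hypothetical, and comparing coefficient vectors directly in this way assumes equal group sizes.

```python
import math

# Coefficients quoted in the text, ordered French, Italian, German.
c1 = [0.33, 0.33, -0.66]  # hypothesis 1: French and Italian vs. German
c2 = [0.50, -0.50, 0.0]   # hypothesis 2: French vs. Italian

def sums_to_zero(code, tol=1e-9):
    """True if the coefficients of one code variable sum to zero."""
    return math.isclose(sum(code), 0.0, abs_tol=tol)

def orthogonal(u, v, tol=1e-9):
    """True if the sum of products of two coefficient sets is zero."""
    return math.isclose(sum(ui * vi for ui, vi in zip(u, v)), 0.0, abs_tol=tol)

print(sums_to_zero(c1), sums_to_zero(c2))  # True True
print(orthogonal(c1, c2))                  # True

# A pair that is NOT orthogonal, for comparison (illustrative values only).
d1 = [0.5, -0.5, 0.0]
d2 = [0.5, 0.0, -0.5]
print(orthogonal(d1, d2))                  # False
```

A tolerance is used because decimal coefficients such as 0.33 and 0.66 are not represented exactly in floating point.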



Nonsense coding

Nonsense coding occurs when one uses arbitrary values in place of the designated “0”s, “1”s, and “−1”s seen in the previous coding systems. Although it produces correct mean values for the variables, the use of nonsense coding is not recommended, as it leads to uninterpretable statistical results.


Interactions

An interaction may arise when considering the relationship among three or more variables, and describes a situation in which the simultaneous influence of two variables on a third is not additive. Interactions may arise with categorical variables in two ways: either as categorical by categorical variable interactions, or as categorical by continuous variable interactions.


Categorical by categorical variable interactions

This type of interaction arises when we have two categorical variables. To probe this type of interaction, one would code using the system that addresses the researcher's hypothesis most appropriately. The product of the codes yields the interaction. One may then calculate the b value and determine whether the interaction is significant.
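As an illustration of "the product of the codes", the sketch below dummy-codes two two-level factors and multiplies the codes to form the interaction term. The factors, cell scores, and variable names are invented for illustration. The coefficients are derived from the cell means, which is what a least-squares fit returns for this saturated 2 × 2 model; no significance test is performed.

```python
from statistics import mean

# Hypothetical 2 x 2 design; all scores are invented for illustration.
cells = {
    ("French", "male"): [4.0, 6.0],
    ("French", "female"): [5.0, 7.0],
    ("Italian", "male"): [8.0, 10.0],
    ("Italian", "female"): [5.0, 7.0],
}
code_nat = {"French": 0, "Italian": 1}  # dummy code for nationality
code_sex = {"male": 0, "female": 1}     # dummy code for gender

# One row per cell: (nationality code, gender code, interaction code).
rows = {
    cell: (code_nat[cell[0]], code_sex[cell[1]], code_nat[cell[0]] * code_sex[cell[1]])
    for cell in cells
}

m = {cell: mean(v) for cell, v in cells.items()}
a = m[("French", "male")]                # both codes are 0
b_nat = m[("Italian", "male")] - a       # nationality effect among males
b_sex = m[("French", "female")] - a      # gender effect among French
# The interaction b is a difference of differences.
b_int = (m[("Italian", "female")] - m[("French", "female")]) - b_nat

def predict(cell):
    """Evaluate a + b_nat*N + b_sex*S + b_int*(N*S) for the given cell."""
    n, s, ns = rows[cell]
    return a + b_nat * n + b_sex * s + b_int * ns

print(rows[("Italian", "female")])     # (1, 1, 1)
print(a, b_nat, b_sex, b_int)          # 5.0 4.0 1.0 -4.0
print(predict(("Italian", "female")))  # 6.0
```

A non-zero interaction coefficient here reflects that the nationality difference among females (0.0) is not the same as among males (4.0) in the made-up data.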


Categorical by continuous variable interactions

Simple slopes analysis is a common post hoc test used in regression to analyze interactions; it is similar to the simple effects analysis in ANOVA. In this test, we examine the simple slopes of one independent variable at specific values of the other independent variable. Such a test is not limited to continuous variables, but may also be employed when the independent variable is categorical. Because of the nominal nature of the data, we cannot simply choose values at which to probe the interaction as we would in the continuous case (i.e., in the continuous case, one could analyze the data at high, moderate, and low levels, assigning one standard deviation above the mean, the mean, and one standard deviation below the mean respectively). In our categorical case, we would instead use a simple regression equation for each group to investigate the simple slopes. It is common practice to standardize or center variables to make the data more interpretable in simple slopes analysis; however, categorical variables should never be standardized or centered. This test can be used with all coding systems.
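The idea of fitting one simple regression per group can be sketched as follows. The data and the helper `slope_and_intercept` are assumptions made for illustration; the sketch only estimates the per-group slopes and does not test whether they differ significantly.

```python
from statistics import mean

def slope_and_intercept(x, y):
    """Least-squares slope and intercept of y on x for one group."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return b, y_bar - b * x_bar

# Hypothetical records: (group, continuous predictor, outcome).
data = [
    ("French", 1.0, 2.0), ("French", 2.0, 4.0), ("French", 3.0, 6.0),
    ("Italian", 1.0, 5.0), ("Italian", 2.0, 5.5), ("Italian", 3.0, 6.0),
]

# Fit a separate simple regression within each level of the categorical variable.
simple_slopes = {}
for group in sorted({g for g, _, _ in data}):
    x = [xi for g, xi, _ in data if g == group]
    y = [yi for g, _, yi in data if g == group]
    simple_slopes[group] = slope_and_intercept(x, y)

print(simple_slopes["French"])   # (2.0, 0.0)
print(simple_slopes["Italian"])  # (0.5, 4.5)
```

In this made-up data the slope of the continuous predictor is steeper for one group than the other, which is the pattern a categorical by continuous interaction describes.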







