Definitions, Attenuation, and Moderators

[Page 740 continued]

The next set of criticisms we examine centers on the methods of our review. Dallam et al. (2001) claimed that we

	(a) compounded the lack of standardization of definitions of CSA in the primary studies by including certain studies they believed questionable and excluding others they believed appropriate;
	(b) failed to account for possible attenuation of effect sizes by our use of the Pearson r effect size, as opposed to Cohen's d; and
	(c) mishandled our analyses of moderators of the CSA-symptoms relationship.

We show these criticisms to be factually incorrect in most cases and highly debatable in the rest.

Operational Definition of CSA

Operational definitions of CSA have varied widely, a problem for all reviews of CSA research. Dallam et al. (2001) claimed that we "compounded" this problem by including studies "that did not even purport to examine the effects of CSA" (pp. 718-719), citing three instances:

	Landis (1956),
	Schultz and Jones (1983), and
	Sedney and Brooks (1984).

In fact, Landis consistently wrote of the "child" and the "offender," referring to interactions with a "sexual deviate," someone considerably older. He reported the median age of this experience to be 11.93 for girls and 14.92 for boys, and that for girls "41.9 percent were under the age of 11 when they had their deviate experience" (p. 95). About 90% of females and 80% of males were 18 or younger at the time of their experience. Finally, the Landis study is used as an example of an early CSA study by many other researchers (e.g., Finkelhor, 1979a; Fishman, 1991; Fromuth & Burkhart, 1989; Sarbo, 1985). Clearly, Landis did purport to study what is now called CSA, actually did study CSA, and is recognized for having done so by many other researchers.

As for Schultz and Jones (1983), Dallam et al. (2001) mentioned that they "looked at all types of 'sexual acts' before age 12" (Dallam et al., 2001, p. 719), but failed to add that respondents were asked "if their experience was with a person over the age of 16" (Schultz & Jones, 1983, p. 100) and that Schultz and Jones's discussion and conclusions were based entirely on CSA experiences, not peer sex.

Finally, Dallam et al. (2001) noted that Sedney and Brooks (1984) "examined all types of 'sexual experiences' during childhood" (p. 719), but Sedney and Brooks themselves explained that their "rather broad definition of 'child sexual abuse' was used because of the difficulty posed by a priori decisions about what type of sexual experiences are 'problems' " (p. 215).

Dallam et al. (2001) cited Neumann et al. (1996) as researchers who did not include this study because of its broad definition. However, Dallam et al. (2001) failed to note that a variety of other researchers whom they also cited elsewhere to buttress other arguments referred to Sedney and Brooks as an important early nonclinical study on CSA effects (e.g., Fergusson, Horwood, & Lynskey, 1996; Garnefski & Arends, 1998; Mullen, Martin, Anderson, Romans, & Herbison, 1993).

Last, whereas the Landis (1956) and Schultz and Jones (1983) studies were not included in our meta-analyses of CSA-adjustment relations, the Sedney and Brooks study was. Its effect size (r = .15) was higher than the mean and thus did not bias results in terms of underestimating the CSA-adjustment association. In short, Dallam et al.'s (2001) claim that these studies did not purport to examine CSA effects is incorrect; their inclusion did not bias our meta-analysis of CSA-symptom relations.

Dallam et al. (2001) next argued that some of our studies included sexual experiences that occurred after age 17, again citing three instances:

	Greenwald (1994),
	Landis (1956), and
	Sarbo (1985).

In Sarbo's "purified" sample (p. 45), analyses were done on CSA occurring before age 17; these are the data we used to compute effect sizes. From Greenwald's study, we computed effect sizes on the basis of CSA under age 16 (Greenwald, 1994). In Landis's article, as we already noted, the vast majority of experiences with "deviates" occurred during childhood up to age 18 (89% of male and female cases combined). Thus, two of three instances cited by Dallam et al. (2001) are incorrect, and we argue the third is trivial (again, the Landis data were not part of the meta-analyses, and eliminating them from the analysis of reactions actually substantially increases the proportion of positive and neutral reactions).

It is notable that some of the CSA studies Dallam et al. (2001) cited elsewhere in their critique to bolster their own arguments also include cases up to age 18 or 19 (e.g., Erickson & Rapkin, 1991; Finkelhor et al., 1990; Garnefski & Arends, 1998; Kendall-Tackett et al., 1993).

Dallam et al. (2001) continued by calling our exclusion of Roland, Zelhart, and Dubes (1989) and Jackson, Calhoun, Amick, Maddever, and Habif (1990) - both of which consisted mainly of incest experiences - "quite baffling" (p. 719).

In a footnote right before this criticism, however, they cited Neumann et al. (1996) to argue that we should not have included the Sedney and Brooks (1984) study. However, they failed to note that Neumann et al. also excluded Roland et al. as an outlier, even among clinical studies, a point noted in our original article (Rind et al., 1998, p. 31). Our treatment of this issue was unbiased: We reported results with and without the outliers, finding that the unbiased effect size estimate was not affected by excluding the outliers, and our exclusion of these studies was based solely on statistical grounds.

Finally, Dallam et al. (2001) argued that we "erroneously coded" (p. 719) Silliman (1993) in the wrong direction. Had we not "miscalculated" Silliman, they argued, then the two excluded studies just discussed would not have been outliers. Here is what Silliman (1993) wrote regarding the self-esteem measure in question:

It was hypothesized that the women who recalled sexual abuse during childhood would score significantly more externally on locus of control and score significantly lower on the self-esteem measure than the control group; however, contrary to expectation, the hypotheses were not confirmed (t₆₄ = .16 and -4.50, p > .05, respectively). (p. 1294)

Because it was predicted that abused participants would have lower self-esteem, but they did not, it follows that self-esteem was higher for the abused group, because the large t was in the opposite of the hypothesized direction, indicated by the non-significant p value. Thus, given the results as reported by Silliman (1993), we calculated the effect size appropriately.

However, Dallam et al. (2001) reported that they contacted Silliman and found that self-esteem was lower for the abused group, which means that t = -4.50 would have been statistically significant (p < .05, not p > .05). Dallam et al. (2001) then noted that the entire distribution of effect sizes "would have shifted slightly" (p. 719) and that the two incest studies would not have been outliers.

It would have been informative had they specified how slight the shift would have been. We computed the unbiased effect size estimate in three ways:

	(a) using the corrected value of r = .255 for Silliman and including all 54 samples;
	(b) using the incorrect value for Silliman of r = -.25 and including all 54 samples; and
	(c) using only 51 samples excluding Silliman (1993), Roland et al. (1989) and Jackson et al. (1990).

We obtained, respectively, r_us = .0969, .0948, and .0921. In short, the coding of Silliman and the exclusions have no meaningful impact on the effect size estimate.
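The kind of sensitivity check described above can be sketched in a few lines. This is only an illustration, assuming a sample-size-weighted mean of Fisher-z-transformed correlations; the function name `weighted_mean_r` and the five toy studies are hypothetical and are not the 54 samples from the review.

```python
import math

def weighted_mean_r(rs, ns):
    """Sample-size-weighted mean correlation via Fisher's z (weights n - 3)."""
    weights = [n - 3 for n in ns]
    z_bar = sum(w * math.atanh(r) for w, r in zip(weights, rs)) / sum(weights)
    return math.tanh(z_bar)

# Hypothetical toy data (NOT the review's samples): five studies,
# the last one small and coded in the negative direction.
rs = [0.10, 0.08, 0.12, 0.05, -0.25]
ns = [400, 350, 500, 300, 60]

as_coded = weighted_mean_r(rs, ns)               # last study as originally coded
recoded = weighted_mean_r(rs[:-1] + [0.25], ns)  # sign of the last study flipped
dropped = weighted_mean_r(rs[:-1], ns[:-1])      # last study excluded

print(round(as_coded, 4), round(recoded, 4), round(dropped, 4))
```

Because each study contributes in proportion to its sample size, recoding or dropping one small study moves the pooled estimate only slightly, which is the pattern the three values reported above show.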

Attenuation

Dallam et al. (2001) argued that our use of the Pearson r effect size, as opposed to Cohen's d, in studies with unequal proportions in the comparison groups "created a situation in which clinically large effects could be represented by what appear to be small r values" (p. 720), citing Cohen's (1977) description of d = .80 as a large effect.

We addressed this criticism in a previous rebuttal (Rind, Tromovitch, & Bauserman, 2000a) to Dallam et al. (1999), in which we used R. Rosenthal's (1984) formula that takes into account population prevalences to convert our rs to ds for each sample. Assuming prevalences are 50-50 for CSA and control populations, we obtained the following mean ds for our 14 male and 33 female samples: ds = .22 and .25, respectively. Both of these ds were small, not large, according to Cohen's (1977) guidelines, which suggested that d = .20 is small. These findings contradict the thrust of Dallam et al.'s (2001) argument about attenuation of effect sizes.
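As an illustration of the r-to-d conversion discussed above, the following minimal sketch uses the standard large-sample identity d = r / sqrt(pq(1 - r²)), where p and q are the proportions in the two groups. The function name `r_to_d` is ours, and whether this identity coincides exactly with the formula in R. Rosenthal (1984) is an assumption.

```python
import math

def r_to_d(r, p=0.5):
    """Convert a point-biserial r to Cohen's d.

    p is the proportion of cases in one group (q = 1 - p in the other).
    With p = q = .5 this reduces to d = 2r / sqrt(1 - r^2).
    """
    q = 1.0 - p
    return r / math.sqrt(p * q * (1.0 - r * r))

# An r of .07 observed with a 14-86 split corresponds to a d near .20,
# which is "small" by Cohen's guidelines.
print(f"{r_to_d(0.07, p=0.14):.2f}")  # 0.20
```

The point of the sketch is that d does not depend on the group split, so a small r obtained from unequal groups can be re-expressed as d and judged directly against Cohen's benchmarks.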

Base rate differences for men and women

On the basis of the argument that our effect sizes were attenuated owing to unequal sample size, Dallam et al. (2001) used Becker's (1986) formula to correct our effect size estimates for men (r_u = .07) and women (r_u = .10), finding them to be .10 and .11, respectively. They argued from this that our computations "created the appearance of gender-related differences in CSA adjustment, when effect sizes for men and women were actually equivalent" (p. 721), but we never concluded that the r_us of .07 and .10 for men and women, respectively, were significantly different statistically.

In fact, we reported that the contrast between these effect size estimates "was nonsignificant, z = 1.42, p > .10, two-tailed" (Rind et al., 1998, p. 33). Later, Dallam et al. (2001) provided this same quote to incorrectly argue a different point on moderator analysis involving gender (see below).

What we did report as significantly different was the contrast between male and female effect size estimates for the all-types-of-consent groups, where r_us = .04 and .11, respectively. If we follow Dallam et al. (2001) and apply Becker's correction formula to these values, they become r_cs = .06 and .12 for men and women, respectively. The contrast is still statistically significant (z = 2.68, p < .01, two-tailed), contrary to Dallam et al.'s (2001) claim.

Dallam et al. (2001) further argued that, although we wrote that effect sizes would increase at most by .03 for a 50-50 split (Rind et al., 1998), the effect sizes in some cases increased much more. They made this point despite quoting us two sentences earlier as having written that "an r = .07 based on a 14-86 split [for male samples] increases at most by .03 (to r = .10) in a 50-50 split (Rind, Tromovitch, & Bauserman, 2000a, p. 29)" (Dallam et al., 2001, p. 721), which makes it clear that we were referring to a maximum .03 increase in the overall effect size, not individual effect sizes.
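The size of the correction at issue can be checked numerically. The sketch below assumes the correction takes the form given by Hunter and Schmidt (1990), r_c = a·r / sqrt((a² - 1)r² + 1) with a = sqrt(.25 / (pq)); the function name `correct_for_unequal_split` is ours, and treating this as identical to Becker's (1986) formula is an assumption.

```python
import math

def correct_for_unequal_split(r, p):
    """Re-express r as if the two groups had been split 50-50.

    p is the observed proportion in one group (q = 1 - p).
    """
    a = math.sqrt(0.25 / (p * (1.0 - p)))
    return a * r / math.sqrt((a * a - 1.0) * r * r + 1.0)

# The male overall estimate: r = .07 with a 14-86 split.
print(f"{correct_for_unequal_split(0.07, 0.14):.2f}")  # 0.10
```

With a 50-50 split the factor a equals 1 and the function returns r unchanged, so the adjustment grows only as the split becomes more lopsided.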

For the sake of argument in the discussion above, we corrected our rs on the basis of samples with naturally occurring unequal base rates in the population. It is important to note, however, that many statistical experts believe that correction in this type of situation is inappropriate, including Becker (1986) himself. Although Dallam et al. (2001) cited Becker and used his formula, they failed to note his qualification of valid use of this formula. He noted that it

"is appropriate to correct for unequal sample size when populations represented by the samples can be assumed to be equally numerous" (Becker, 1986, p. 5),

as in randomized or block experiments. But he noted further that when populations are not equal in size, "the inequality should be reflected in estimating degree of relationship" (Becker, 1986, p. 6). As examples, he gave people with schizophrenia versus controls and people with double-recessive genes versus all carriers of a given gene, among others. Regarding these cases, he commented that

"a correction would be inappropriate. Indeed, it can be argued that for such populations, when sample size is constrained to be equal, the resulting r should be corrected in the other direction" (Becker, 1986, p. 6).

Hunter and Schmidt (1990), who provided the same formula that Becker did, similarly noted that its intended use was for situations in which equal sample sizes could or should have been obtained (e.g., as in experiments) but were not. In natural settings with populations inherently unequal in size, however, they argued that correlations should reflect this difference. To illustrate, they noted that the association between race (White vs. Black) and achievement test performance among U.S. students is .45. This value is based on transforming a racial performance difference of 1 SD to Pearson's r, and assumes equal numbers of Whites and Blacks. Noting, however, that Blacks compose only 13% of the total, they argued that "this value must be adjusted to reflect that fact" (p. 276; italics added); they deemed the reverse-adjusted correlation of .32 (i.e., the estimated point-biserial correlation) to be the appropriate value. R. Rosenthal (1984) also opined that, when comparing naturally occurring groups with different base rates, the appropriate correlation is the one reflecting this difference (i.e., the uncorrected correlation). He added that correction for unequal sample sizes is appropriate if one is trying to estimate what one might find in a future study drawn from a population of equal numbers (i.e., p = q = .5), but is inappropriate if the goal is to estimate results from a future study drawn from the same population with unequal numbers, that is, to obtain a real world estimate (R. Rosenthal, personal communication, April 12, 2001).
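The Hunter and Schmidt illustration can be reproduced with the standard identity r = d / sqrt(d² + 1/(pq)), which converts a standardized mean difference d into the point-biserial r for groups of proportions p and q. This is a minimal sketch; the function name `d_to_r` is ours.

```python
import math

def d_to_r(d, p):
    """Point-biserial r implied by a standardized difference d
    when one group makes up proportion p of the population."""
    q = 1.0 - p
    return d / math.sqrt(d * d + 1.0 / (p * q))

print(f"{d_to_r(1.0, 0.50):.2f}")  # 0.45  (equal group sizes)
print(f"{d_to_r(1.0, 0.13):.2f}")  # 0.32  (13% vs. 87%)
```

The same 1 SD difference thus yields a smaller correlation when one group is rare, which is the base-rate dependence of r that the passage describes.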

This statistical opinion supports our non-correction for base-rate differences. Nevertheless, we acknowledge the intuitive appeal of arguing for correction - that effect sizes should be base-rate independent (which Pearson's r is not, but Cohen's d is), especially if one is comparing effect sizes across groups. But as we showed, even from this perspective, group differences between the genders that we identified in our 1998 review remain different upon re-analysis, contrary to Dallam et al.'s (2001) contention, which undermines their reason for raising this issue to begin with.

In short, our handling of Pearson's r in the face of base-rate differences was methodologically proper and produced no important bias, if any at all. Dallam et al. (2001), on the other hand, exhibited bias in their criticisms, selectively ignoring key clarifying quotes by us but citing them elsewhere in their critique to argue different points, and ignoring or overlooking a key caveat by Becker (1986) regarding appropriate use of his correction formula.

Dichotomization

Most studies on CSA, including many of the ones we reviewed (Rind et al., 1998), have classified participants dichotomously as either having experienced or not having experienced CSA. Dallam et al. (2001) argued that CSA as a variable has an underlying continuum and that dichotomizing it attenuates the CSA-adjustment association. They argued that a biserial correlation or a tetrachoric correlation is then appropriate, depending on whether only CSA is dichotomized or both CSA and the dependent measure are.

Biserial and tetrachoric correlations are not product-moment correlations; rather, they are intended to estimate them, but can fail rather badly (Nunnally, 1978; R. Rosenthal, personal communication, April 12, 2001; Sheskin, 1997).

R. Rosenthal (personal communication, April 12, 2001) gave an example in which X values are 0, 0, 1, 1, 2, 2, 3, and 3, and Y values are 0, 1, 0, 1, 2, 3, 2, and 3. Median splits on each produce r = 1.00, and dichotomizing on X only produces r = .89. Both of these rs based on dichotomization, however, are larger, not smaller, than the non-dichotomized r = .80 based on continuous data.
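The three correlations in this example can be verified directly from the eight data pairs. The sketch below is self-contained; the helper names `pearson_r` and `median_split` are ours.

```python
import math

def pearson_r(xs, ys):
    """Plain product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def median_split(values):
    """Code each value 1 if above the median, else 0 (even-length lists only)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    median = (ordered[mid - 1] + ordered[mid]) / 2
    return [1 if v > median else 0 for v in values]

X = [0, 0, 1, 1, 2, 2, 3, 3]
Y = [0, 1, 0, 1, 2, 3, 2, 3]

print(f"{pearson_r(X, Y):.2f}")                              # 0.80 continuous
print(f"{pearson_r(median_split(X), Y):.2f}")                # 0.89 X dichotomized
print(f"{pearson_r(median_split(X), median_split(Y)):.2f}")  # 1.00 both dichotomized
```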

R. Rosenthal added that his example, in which dichotomization increases rather than attenuates the correlation, is not far-fetched and can easily occur when there is a third variable problem, in which X and Y are uncorrelated in levels A and B of a third variable, but A and B differ in their mean X and mean Y scores. Then, combining A and B can yield a large positive or negative overall correlation. We argue that this is just the kind of situation that could obtain in CSA studies.
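The third-variable pattern can be seen in the same eight pairs. In the sketch below we treat the first four pairs as level A and the last four as level B; that grouping is our own illustrative assumption rather than something stated in the example, and the helper name `pearson_r` is ours.

```python
import math

def pearson_r(xs, ys):
    """Plain product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

X = [0, 0, 1, 1, 2, 2, 3, 3]
Y = [0, 1, 0, 1, 2, 3, 2, 3]

# Assumed grouping: level A = first four pairs, level B = last four pairs.
r_within_a = pearson_r(X[:4], Y[:4])
r_within_b = pearson_r(X[4:], Y[4:])
r_combined = pearson_r(X, Y)

print(f"{r_within_a:.2f} {r_within_b:.2f} {r_combined:.2f}")  # 0.00 0.00 0.80
```

Within each level X and Y are unrelated, yet pooling the levels produces r = .80 because the levels differ in their means on both variables.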

In sum, R. Rosenthal continued, although dichotomizing continuous data can decrease the magnitude of Pearson's r under some conditions, it can increase it in others; without having the continuous data, one cannot be sure which situation obtains. This argues against use of the biserial and tetrachoric correlations.

Furthermore, both corrections assume underlying continuous and normal distributions (Glass & Hopkins, 1996; Nunnally & Bernstein, 1994; Sheskin, 1997). For CSA, evidence suggests that these assumptions do not hold in the college population. Continuous CSA is assumed to correspond to severity, where noncontact events are the least severe, touching is a level up, oral sex is even more severe, and intercourse is the most severe. It is clear that, in the case of adult-adult sex, these levels do not form a continuum of "severity" (i.e., seriousness or negativity). Reactions and levels of "severity" cannot be assumed to correspond, unless the sex is forced. In the case of CSA, numerous college studies have found no relation between "severity" and reactions. Finkelhor (1979a), on the basis of his college sample, observed

People ... still use the standard of intercourse to judge the seriousness of a child's sexual experience. In other words, they presume that experiences involving intercourse are the most traumatic. ... However, our data show the opposite; that is, the seriousness of sexual activity as it is usually understood does not seem related to greater trauma in children. ... It suggests that the actual sexual activity involved is less important than its context. (p. 103)

West and Woodhouse (1993) presented interviews of 24 male students, 7 of whom had oral sex or intercourse with adults when they were minors. Six experienced these encounters positively, whereas only 1 reacted negatively. Two had heterosexual encounters, 4 had homosexual encounters, and 1 had both. Negative or neutral reactions among the 24 students occurred almost exclusively in less severe cases involving noncontact sexual approaches or fondling.

Condy, Templer, Brown, and Veaco (1987) examined sexual relations between boys under age 16 and females at least age 16 and 5 years older than the boys. These relations were predominantly of a "severe" nature: Among the college participants, 68% involved intercourse. Relatively few men reacted negatively or felt they had been harmed. Most reported that they consented or even initiated the encounters. Negative feelings and self-reported effects were associated with perceived lack of consent and incest.

In Fromuth and Burkhart's (1987) male college sample, about 30% of the CSA involved oral sex and another 25%, intercourse. Despite this large degree of "severity," only 15% reported negative effects, whereas 39% and 46% reported positive or neutral effects, respectively. Among teenagers aged 13 to 16, only 3% reported negative effects, with 60% reporting positive effects.

In a recent study based on a sample of gay and bisexual males who were mostly college students (Rind, 2001), no relation was found between level of "severity" and reaction for boys aged 12 to 17 involved in contact sex with adult males; what mattered was context, particularly level of willingness. To repeat Finkelhor (1979a), it is the context that is important.

Sheskin (1997) noted that the accuracy of the biserial correlation is "highly dependent on the assumption of normality, and it should not be employed unless there is empirical evidence to indicate that the distribution underlying the dichotomous variable is normal" (p. 583); he provided the same caveat for the tetrachoric correlation.

Similarly, Nunnally (1978) noted that both "these correlations very much depend on a strict assumption of the normality of the continuous variables" (p. 137, italics added) and noted further that when the assumption of normality is not met, estimates can be off by more than 20 points of correlation.

Evidence indicates that CSA is not normally distributed, even if one assumes continuity. For example, consider the case of a male sample in which 14% had and 86% did not have CSA. If we assume a continuum of severity, where not having CSA puts one at the low end, then the low end is the mode - a very dominating one - and the positively skewed distribution deviates markedly from normality.

Sheskin added that

if there is reason to believe the normality assumption for the dichotomous variable has been violated, most sources recommend computing the point-biserial, rather than biserial, correlation, because the latter may be a spuriously inflated estimate of the underlying population correlation.
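To give a sense of how much the biserial correlation can exceed the point-biserial, the sketch below uses the textbook relation r_b = r_pb · sqrt(pq) / y, where y is the height of the standard normal curve at the cut point that separates proportions p and q. The function name `biserial_from_point_biserial` is ours, and the inputs (r = .07 with a 14-86 split) are taken from the male example discussed earlier purely for illustration.

```python
import math
from statistics import NormalDist

def biserial_from_point_biserial(r_pb, p):
    """Biserial r implied by a point-biserial r, assuming the dichotomy
    cuts an underlying NORMAL variable with proportion p above the cut."""
    q = 1.0 - p
    z_cut = NormalDist().inv_cdf(q)  # cut point leaving proportion p above it
    y = NormalDist().pdf(z_cut)      # normal ordinate at the cut point
    return r_pb * math.sqrt(p * q) / y

print(f"{biserial_from_point_biserial(0.07, 0.14):.2f}")  # 0.11
```

The multiplier is always greater than 1 and grows as the split becomes more extreme, so if the normality assumption is wrong the biserial value overstates the association by that factor.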

In summary, the assumption that CSA in the college population is a continuum of "severity" according to increasing physical intimacy, irrespective of context, is not empirically supported; the assumption of normality even if continuous is also not supported. These facts add to the issue of unreliability of the biserial and tetrachoric correlations to argue against their use in this population.

Interpretation of effect sizes

Both sets of critics raised concerns about our discussion of effect sizes. Dallam et al. (2001) claimed that we interpreted r² as the measure of effect size, rather than r, by quoting us as writing "according to Cohen's (1988) guidelines; in terms of variance accounted for, CSA accounted for less than 1% of the adjustment variance." What we actually wrote was

The resulting unbiased effect size estimate ... was r_u = .09. ... This difference in adjustment between SA and control students was small, however, according to Cohen's (1988) guidelines; in terms of variance accounted for, CSA accounted for less than 1% of the adjustment variance. (Rind et al., 1998, p. 31)

The partial quote taken out of context by Dallam et al. (2001) gives the impression that we cited Cohen to conclude that a 1% variance is a small effect size. Clearly, our citation of Cohen refers to our reporting of the unbiased effect size, r_u = .09.

More important, both Ondersma et al. (2001) and Dallam et al. (2001) argued that small effect sizes can have huge personal and social costs. Dallam et al. (2001) cited R. Rosenthal and Rubin's (1982) binomial effect size display (BESD) as one means of indicating this.

In regard to our study, the BESD would categorize 100 students as having had CSA and another 100 as not having had CSA; additionally it would categorize 100 students as being worse adjusted and another 100 as being better adjusted, producing four cells in a 2 X 2 matrix.

Categorization into CSA x Adjustment combinations would be based on Pearson's r, the association between CSA and adjustment (i.e., the effect size). Specifically, for CSA-worse adjusted (Cell A) and for no CSA-better adjusted (Cell D), the cell frequencies would be 100(.500 + r/2); for CSA-better adjusted (Cell B) and for no CSA-worse adjusted (Cell C), the frequencies would be 100(.500 - r/2).

With the overall effect size in our meta-analysis of .09, the four values above would be as follows:

	(A) 54.5,
	(B) 45.5,
	(C) 45.5, and
	(D) 54.5.

From this, one might conclude that exposure to CSA for every 100 persons produces a decrease in good adjustment from 54.5 to 45.5; in other words, 9 persons per 100 exposed to CSA are now more poorly adjusted. This interpretation strongly suggests that the small effect size of .09, according to Cohen's (1988) guidelines, is nevertheless an important effect. As we discuss next, however, this interpretation would be misleading.

This interpretation is based on the assumption that the increase in more poorly adjusted persons is due to CSA. Both Ondersma et al. (2001) and Dallam et al. (2001) implied that it is. Ondersma et al. (2001), by analogy, cited the small effect size (r = .03) in the well-known aspirin versus heart attack experiment to argue that "even miniscule effects can have huge personal and societal costs when one extrapolates to a societal level" (p. 709, italics added).

Dallam et al. (2001) cited Ondersma et al.'s (1999) meta-analysis of 14 studies on smoking and lung cancer, which showed an effect size of .17, to argue that the relationship we found between CSA and symptoms was roughly comparable to "the effect of cigarette smoking on lung cancer in the general population" (Dallam et al., 2001, p. 729, italics added).

The effects for aspirin are true effects - we know this because this study used the experimental design. Similarly, the effects of cigarette smoking are now confirmed as true effects through an enormous amount of research that has identified at least 63 distinct cancer-causing agents in cigarettes. Although less than 15% of regular smokers develop lung cancer (thus the low effect size), smoking is directly responsible for 87% of lung cancer cases (American Lung Association, 2000).

Such a dramatic relation has no empirical parallel in CSA research, vitiating the CSA-maladjustment/smoking-lung cancer analogy. The comparability with CSA is also much less clear than Dallam et al. (2001) asserted, because some, much, or all of the difference between CSA-adjustment combinations reflects confounding rather than actual effects.

To highlight this point, consider the BESD for family environment (FE; better vs. worse) and adjustment (better vs. worse). In our meta-analysis, we found an unbiased effect size estimate of .29.

Thus, in the case of poor family environment, for every 35.5 persons better adjusted, there would be 64.5 worse adjusted, which represents an increase of 29 persons per 100, a substantially greater number compared with CSA. Given that our meta-analysis showed that CSA was confounded with FE, and statistical control often eliminated significant CSA-adjustment relations, the BESD results regarding CSA should not be interpreted causally or compared with similar effect sizes that are known to be causal.
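The BESD arithmetic used in this section is simple enough to reproduce. The sketch below applies the cell formulas 100(.500 + r/2) and 100(.500 - r/2) to the two effect sizes discussed, .09 for CSA and .29 for family environment; the function name `besd` is ours.

```python
def besd(r):
    """Binomial effect size display: cell frequencies per 100 persons per row.

    A = exposed, worse adjusted     B = exposed, better adjusted
    C = unexposed, worse adjusted   D = unexposed, better adjusted
    """
    high = 100 * (0.5 + r / 2)
    low = 100 * (0.5 - r / 2)
    return {"A": high, "B": low, "C": low, "D": high}

for label, r in [("CSA", 0.09), ("family environment", 0.29)]:
    cells = besd(r)
    print(label, {k: round(v, 1) for k, v in cells.items()})
# CSA {'A': 54.5, 'B': 45.5, 'C': 45.5, 'D': 54.5}
# family environment {'A': 64.5, 'B': 35.5, 'C': 35.5, 'D': 64.5}
```

In every BESD the difference between Cells A and B equals 100r, so the display restates the correlation as a difference in rates; it says nothing about whether that difference is causal.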

Additionally, being put into the worse versus better adjusted categories in the CSA example reflects very minor differences between subjects on either side of the median for adjustment and near to it. By contrast, in the heart attack and lung cancer cases, differences in categorization are always hugely important. This further weakens our critics' analogies.

Moderators

We examined several variables as moderators of the CSA-adjustment relationship. Dallam et al. (2001) questioned our examination of these variables, including contact versus noncontact experiences, gender, and willingness or consent.

Contact

Dallam et al. (2001) cited a number of studies that separated contact and noncontact sex to argue that it is "remarkably consistent" (p. 722) that adjustment is poorer for contact groups and to argue that it is very questionable whether our findings apply to more serious forms of CSA. Ondersma et al. (2001) also questioned the use of studies that combine contact and noncontact CSA.

As discussed above, however, much research indicates that it is the context that really seems to matter, not contact versus noncontact or level of contact. Context has much to do with the degree of force versus willingness, or at least absence of coercion. According to Finkelhor (1979a),

Unlike force, sexual activity and duration both are ambiguous in their implications. A longer relationship and one involving intercourse indicate greater intensity. Intensity may be more harmful, but it could also be an indicator in some cases of a positive, or at least an ambivalent, bond. In contrast, presence of force would almost always signal something negative about the relationship. It is a concise symptom of a whole negative context - the reluctance of the child, the pressure exerted by the partner, the difference in power and control. The primary recollection of the child is of the coercion. That there was sex involved is perhaps less important than the fact that there was aggression. (pp. 104-105)

In our college studies, meta-analysis of the 11 studies that confined CSA to contact experiences yields a small unbiased effect size estimate (r_u = .10, 95% CI = .06 to .15), χ²(10, N = 1,776) = 7.64, p > .05.

This estimate is itself consistent (i.e., the effect sizes were homogeneous) and is not different from the overall effect size estimate (r_u = .09) based on a majority of studies including noncontact cases.
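For readers unfamiliar with the homogeneity test reported above, the following sketch shows one common fixed-effect computation: correlations are transformed to Fisher's z, weighted by n - 3, pooled, and compared with the pooled value through a chi-square statistic with k - 1 degrees of freedom. Whether this matches the exact procedure used in the review is an assumption, the function name `fixed_effect_summary` is ours, and the three studies are hypothetical.

```python
import math

def fixed_effect_summary(rs, ns):
    """Pooled r, 95% CI, and homogeneity statistic Q for k correlations."""
    zs = [math.atanh(r) for r in rs]
    ws = [n - 3 for n in ns]                 # inverse variance of Fisher's z
    total_w = sum(ws)
    z_bar = sum(w * z for w, z in zip(ws, zs)) / total_w
    se = 1.0 / math.sqrt(total_w)
    q = sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))
    return {
        "r": math.tanh(z_bar),
        "ci": (math.tanh(z_bar - 1.96 * se), math.tanh(z_bar + 1.96 * se)),
        "Q": q,                              # compare with chi-square, df = k - 1
        "df": len(rs) - 1,
    }

# Hypothetical toy data (NOT the 11 contact-only studies).
summary = fixed_effect_summary([0.08, 0.12, 0.10], [150, 200, 180])
print(summary)
```

With 10 degrees of freedom the .05 critical value of chi-square is about 18.31, so the reported value of 7.64 falls well short of indicating heterogeneity.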

In the Laumann et al. (1994) national study, which included only contact cases of CSA, effect sizes were small for both male (r = .07) and female (r = .05) participants. Laumann et al. created a severity index based on level of contact (kissing or genital touching vs. oral sex or intercourse); the index failed to moderate current adjustment. These results show that the evidence is not as consistent as our critics claim.

Moreover, as critics have argued (e.g., Best, 1997; Dineen, 1998; Jenkins, 1998; Sarnoff, 2001), CSA researchers have extensively used broad definitions of CSA that include noncontact experiences. For example, Boney-McCoy and Finkelhor (1995), commenting on the finding that girls in their study with noncontact CSA were symptomatic, stated,

This finding highlights the noxious correlates of even interrupted forms of predatory sexual behavior. At present, these results warrant advising clinicians and others who work with young people to be aware of the potentially harmful consequences of what may appear to be "minor" sexual victimization experiences. (p. 733)

It seems inconsistent for CSA researchers following a victimological model to uncritically combine contact and noncontact experiences in their own research, then criticize us for including studies that combine both contact and noncontact CSA in a meta-analytic review.

Gender

Dallam et al. (2001) claimed that our finding that gender moderated adjustment was not reflected in our moderator analysis. To back this claim, they noted that we reported that the contrast between the male and female effect size estimates (.07 and .10, respectively) was non-significant.

First, this citation is selective, as they ignored it earlier when they argued that we used these same effect sizes to create the appearance of gender differences, when we clearly did not.

Second, their current claim is simply incorrect. Our regression analysis, the first step in our moderator analysis, showed that gender and the Consent X Gender interaction both moderated the effect sizes across samples. On the basis of this significant interaction, we computed main effects and interaction contrasts, which yielded no main effects (i.e., no difference between .07 and .10) but did yield a significant interaction involving gender and level of consent.

Consent

Dallam et al. (2001) repeatedly encased consent and willing in quotation marks and questioned our use of this construct more vigorously in their other critiques (e.g., Dallam et al., 1999; Spiegel, 2000a, 2000b, in press). Their position is that consent is not possible because CSA is immoral, and therefore scientific use of consent is invalid; Spiegel (2000a) called such use a "moral outrage" (p. 66). They argued that we were wrong to assume that studies that asked students about any sexual experiences with older persons, as opposed to unwanted experiences only, contained much in the way of willing sex.

We focus now on males, because it was with males that willingness moderated effect sizes in our study.

In studies involving male participants who were asked about all types of sexual experiences instead of just unwanted experiences, many have reported encounters that they themselves define as willing.

In Condy et al.'s (1987) study of boys involved with women, where two thirds of the cases involved intercourse, only 14% of the sex acts fell into the "female forced male" category, whereas 49% fell into the "male wanted, female agreed" category and 67% fell into the "female wanted, male agreed" category.

A. Nelson and Oliver (1998), in their college sample, found that 75% of sexual experiences between boys younger than 16 and adults 18 or older and at least 4 years older than the boys were "consensual" (their term).

College studies that have included narratives along with quantitative data provide additional clear evidence for willingness (e.g., Fishman, 1991; Rind, 2001; West & Woodhouse, 1993). In West and Woodhouse's interviews, as discussed previously, the more "severe" the sex was, the more willing the boy tended to be and the more positively the sex tended to be experienced.

In a gay and bisexual male sample of mostly college students, 26 participants out of 129 had age-discrepant sexual relations between ages 12 and 17 with men at least age 18 and at least 5 years older than themselves (Rind, 2001). Twenty-three percent initiated their sexual contacts, while another 69% mutually consented. Positive reactions were strongly related to higher levels of consent (r = .43).

Finally, in a recent large-scale, non-clinical study, Coxell, King, Mezey, and Gordon (1999) examined a sample of 2,474 men aged 18 to 94 in Great Britain recruited from general medical practices. Participants were asked about CSA occurring before age 16 with someone at least 5 years older. In the entire sample, 7.7% of participants had "consensual" CSA - the authors' term - whereas 5.3% had non-consenting CSA. Thus, 59% of CSA was consenting. When asked whether they ever had a psychological problem of at least 2 weeks' duration, non-consenters had statistically significantly more problems than controls (r = .10) but consenters did not (r = .02).

Sandfort (1992) examined a Dutch sample of 283 young adults aged 18 to 23, consisting of both students and working people. CSA was restricted to contact sex before age 16 with someone at least 5 years older. Most men who had experienced CSA were "consenting" (71%) - Sandfort's term. Consenting participants were as well adjusted as controls.

In sum, empirical evidence shows the existence of a substantial degree of consenting CSA among males, supporting our inference that studies asking male respondents about all experiences are likely to contain a nontrivial proportion of willing cases. It is important to recognize that willingness is not the same as informed consent, a point to which we return later.

Dallam et al. (2001) went on to calculate effect sizes for the "objective" measures for the male studies. For example, in the Fishman (1991) study (as indicated in their footnote 9), they only used one scale (sexual adjustment) to come up with their value (r = .07) instead of all scales for which results were reported, as we did, to come up with our value (r = -.04). This selective approach lends itself to data picking and researcher bias, risks well known to meta-analysts.

[Page 745]

Dallam et al. (2001) next contended that we were incorrect to conclude that there was a statistically significant difference in CSA-symptom relations for males in the two different levels of consent categories. They based this argument on correcting the effect sizes and then re-meta-analyzing.

Correction for base rates is unnecessary and even inappropriate, as we argued previously in some detail. For argument's sake, however, we consider their analysis and present our own.

Table 5

	(a) repeats their meta-analysis shown in their Table 7, which was based on their corrected values that they presented in their Table 6;
	(b) presents our own meta-analysis of these same corrected values; and
	(c) presents our meta-analysis of corrected values from our original effect sizes.

As can be seen, our meta-analysis of their corrected values differs from their presentation. In particular, ours shows that the 95% confidence intervals for all types of CSA and for unwanted CSA for males do not overlap, implying that these groups differ. Additionally, our meta-analysis, based on our corrections of our original effect sizes, shows that the 95% confidence intervals do not overlap. This further reinforces the validity of our original results.

Although non-overlapping 95% confidence intervals imply that groups differ, overlapping confidence intervals do not imply that they are statistically the same, contrary to Dallam et al.'s (2001) remark that we "disregarded" (p. 724) overlapping intervals in forming our conclusions.

Consider Glass and Hopkins' (1996) illustration:

	for Group 1, r = .83, n = 30 pairs, 95% CI = .67-.92;
	for Group 2, r = .93, n = 83 pairs, 95% CI = .89-.95.

Although the confidence intervals overlap, the correlations are clearly different, as revealed by testing the significance of the difference between independent correlations (z = 2.11, p < .05, two-tailed). What matters is the 95% confidence interval of the difference between independent correlations, not the confidence intervals of each independent correlation.
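The figures in this illustration can be checked with a short calculation. What follows is a minimal sketch, assuming the standard Fisher r-to-z transformation for both the confidence intervals and the test of the difference; the text does not state which procedure Glass and Hopkins used, and the function names are illustrative only.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via Fisher's r-to-z transformation."""
    z = math.atanh(r)                      # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)            # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def independent_r_test(r1, n1, r2, n2):
    """z test for the difference between two independent correlations."""
    diff = math.atanh(r2) - math.atanh(r1)
    se_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = diff / se_diff
    p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


# Values taken from the illustration in the text.
ci1 = fisher_ci(0.83, 30)
ci2 = fisher_ci(0.93, 83)
z_diff, p_diff = independent_r_test(0.83, 30, 0.93, 83)

# The two intervals overlap if Group 1's upper bound reaches Group 2's lower bound.
intervals_overlap = ci1[1] >= ci2[0]

print(f"Group 1 95% CI: {ci1[0]:.2f}-{ci1[1]:.2f}")
print(f"Group 2 95% CI: {ci2[0]:.2f}-{ci2[1]:.2f}")
print(f"Intervals overlap: {intervals_overlap}")
print(f"Difference test: z = {z_diff:.2f}, p = {p_diff:.3f}")
```

Under these assumptions the sketch reproduces the intervals and the z value quoted above: the individual intervals overlap, yet the test of the difference is significant at the .05 level.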

Far from disregarding the overlapping intervals, we appropriately performed contrasts (analogous to the Glass and Hopkins example just given) demonstrating significant differences. In fact, it was Dallam et al. (2001) who disregarded the relevant statistics (our contrasts) for evaluating differences.

Finally, Dallam et al. (2001) disagreed with our conclusion that "adjustment was associated with level of consent for men, but not for women" (Rind et al., 1998, p. 34), attempting to support their position with the irrelevant point that their corrected values for men and women in the all-levels-of-consent groups were nearly the same. Clearly, adjustment was associated with level of consent for men, indicated by the contrasts shown in Table 5 (Zs = 3.63 and 3.45, ps < .001, two-tailed, for our meta-analyses of their and our corrected values, respectively).
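The significance levels attached to these contrasts follow directly from the reported Z values. As a minimal sketch, assuming each Z is referred to the standard normal distribution (the helper name is hypothetical), the two-tailed probabilities can be computed as follows.

```python
from statistics import NormalDist


def two_tailed_p(z):
    """Two-tailed p value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Contrast Z values as reported in the text for Table 5.
reported_contrasts = {
    "meta-analysis of their corrected values": 3.63,
    "meta-analysis of our corrected values": 3.45,
}

p_values = {label: two_tailed_p(z) for label, z in reported_contrasts.items()}

for label, p in p_values.items():
    print(f"{label}: Z = {reported_contrasts[label]:.2f}, p = {p:.5f}")
```

Both probabilities fall below .001, consistent with the values stated above.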

To sum up, there is ample justification for our handling of contact sex, gender, and consent or willingness as moderator variables. The arguments raised against them by Dallam et al. (2001) are questionable or simply incorrect both in terms of empirical evidence and statistical practice.