Dallam et al. (2001) questioned the external validity of our review based on college samples in several ways (the methodological complaints of Ondersma et al., 2001, essentially overlap Dallam et al.'s and will also be addressed). They claimed that we missed those most affected by CSA, that we underestimated CSA effects by excluding relevant outcomes, and that we overstated the similarity between national and college samples.
These claims range from highly debatable at best to simply wrong. In rebutting them, we present additional analyses to support the wider relevance of the college data, reinforcing the importance of our original findings.
Clinical Samples and External Validity Bias
External validity bias is a major weakness of the clinical perspective. The research of Kinsey, Pomeroy, and Martin (1948) was groundbreaking, not because its results are definitive (they are not), but because it set a new standard for approaching sexual behavior scientifically. Kinsey et al. argued that previous sex research, especially that conducted by psychiatrists and psychoanalysts, was severely limited in generalizability because it focused on clinical case studies - a problem exacerbated by clinicians' seeming lack of awareness of this limitation. They sampled a large and diverse segment of the general population to reduce this bias.
Ford and Beach's (1951) work, a second milestone in sex research, went even further, arguing that even the most comprehensive survey of Americans would not adequately describe human sexuality because of the profound influence of culture. They reviewed data on numerous other cultures to seek patterns, which they elucidated through comparisons with cross-species data. The works of these researchers contradicted much of the prevailing conventional wisdom regarding human sexuality and showed that a comprehensive, scientific understanding of human sexuality is weakest when based primarily on clinical case studies, stronger with broad and diverse sampling within a society, and strongest with broad and diverse sampling of other cultures and species.
External validity bias is most prominent when the sexual behavior of interest is taboo. Then, authoritative opinion is typically deferred to the clinician, whose main goal is to treat, prevent, and cure rather than understand (Greenberg, 1988). Data from nonclinical samples or other cultures or species are then either not sought, labeled irrelevant, or ignored. This bias predominated when female sexual desire, masturbation, and homosexuality were viewed as pathological (Szasz, 1990).
External validity bias has been prominent in research on CSA, a highly taboo form of sex. Victimologists concerned with CSA have frequently exhibited this bias in viewing clinical and legal research as informative well beyond these populations, while paying relatively little attention to nonclinical research and rejecting cross-cultural and cross-species perspectives on CSA (e.g., Olafson, Corwin, & Summit, 1993; Ondersma et al., 2001). But, as Finkelhor (1984) noted, the need to move beyond clinical samples is essential.
The Logic of Using College Samples
In our literature reviews of CSA, we explicitly dealt with the problem of external validity bias by focusing on nonclinical samples (Bauserman & Rind, 1997; Rind & Tromovitch, 1997; Rind et al., 1998). Previous reviews typically drew general conclusions about CSA while restricting themselves mostly to clinical and legal samples.
Before conducting our contested review of college samples, we examined studies based on national probability samples, far more representative of the general population than clinical samples (Rind & Tromovitch, 1997). Because the national samples generally lacked data relevant to issues such as confounding and causality, we extended that review by analyzing college samples, which are much richer in such data (Rind et al., 1998). However, we did not present these samples as representative of the general population. Rather, we argued that the college data were relevant to the general population - an empirically derived conclusion based on various important similarities in prevalence, severity, and correlates - rather than representative of it. When we generalized our results, we did so to the college population.
At the outset of our Psychological Bulletin (1998) review, we stated that our goal was to examine whether CSA typically causes pervasive harm of an intense nature that is equivalent for both sexes in the population of persons with CSA experiences. We argued that this view has been promoted not only by the media, as Dallam et al. (2001) noted in their current critique, but also by mental health professionals.
To illustrate the latter, consider the Sidran Foundation's (1994) characterization of CSA - this foundation focuses on multiple personality disorder and recovered memory and provides related literature to therapists and patients. In its brochure, the foundation equates a child's "sexual activity with an adult" (p. 2) with serious threats to one's life, rape, military combat, natural or accidental disasters, and torture in terms of traumatic impact. One of the Dallam et al. coauthors has separately compared CSA to head injuries from a car accident (Spiegel, 2000b, in press).
If these dramatic analogies provided by mental health professionals are valid, then it should follow that in any population sampled - drug addicts, psychiatric patients, or college students - persons who have experienced CSA should show strong evidence of the assumed properties of CSA (even if some populations show stronger evidence than others). If we do not find such evidence in even one of these populations, then the broad and unqualified claims about the properties of CSA are contradicted. Thus, the representativeness of college samples is in fact irrelevant to the stated goals and conclusions of our study, as we stated in summing up our findings in our original article.
The Relevance of the College Findings
Even though Dallam et al.'s (2001) argument that college samples are biased is not relevant to our basic goals and conclusions, it is nevertheless important to examine the claims that we missed those most affected, underestimated CSA effects by excluding relevant outcomes, and overstated the similarity between national and college samples. Each of these claims is questionable or incorrect.
Missing those most affected
The claim that college samples understate correlates of CSA because they miss those most affected is essentially a claim that college samples produce lower (i.e., biased) estimates of CSA-symptom relations than in the general population. We demonstrated in our original article a strong similarity between college and national samples in prevalence, severity, and correlates (Rind et al., 1998). Although Dallam et al. (2001) disputed this similarity, we support our original conclusions later in this reply.
For now, we consider a second set of samples, those of junior high and high school students cited by Dallam et al. (2001). If junior and senior high school students who experience CSA tend not to make it to college because of the CSA, then we should expect the magnitude of CSA-symptom associations to be substantially higher in these precollege samples because they would include more of those most affected. If this is not the case, then it is less likely that college samples introduce serious bias in terms of effect size estimates. We calculated effect sizes from these samples and meta-analyzed them.
Table 1 presents effect sizes for the samples from the 14 high school and junior high school studies cited by Dallam et al. (2001), computed separately for emotional and behavioral problems. Results are not provided for the Chandy, Blum, and Resnick (1996) study done on Minnesota students because it did not compare CSA students with controls; instead, we included two proxy studies, also done in Minnesota, that did contrast CSA and control students (Hernandez, Lodico, & DiClemente, 1993; Lodico, Gruber, & DiClemente, 1996). When a study reported separate statistics for the two sexes, that study was broken down into two samples. The Kendall-Tackett, Williams, and Finkelhor (1993) study, which included a meta-analysis of a few of its samples, did not provide data on sample sizes, so our meta-analyses were conducted with and without this study. Table 2 presents results of the meta-analyses.
For both emotional and behavioral problems, the overall unbiased effect size estimates, .13 and .11, respectively, were quite similar to that in our meta-analysis on college students (ru = .09). The unweighted mean correlations for emotional and behavioral problems without the Kendall-Tackett et al. (1993) study were only slightly larger (rs = .14). These unweighted values rose to .17 when including Kendall-Tackett et al. The Kendall-Tackett et al. results, however, are anomalous: The mean emotional effect size (r = .57) was 2.86 SDs above the mean of the other effect sizes for emotional problems; the mean behavioral effect size, .63, was even more deviant (z = 3.77). The Kendall-Tackett et al. study was based on sexual abuse treatment samples and is thus not comparable to the other samples, which were nonclinical and nonlegal. Treating this study as an outlier, justified by its sampling and results, we are left with effect sizes nearly the same on average as in the college population. It is important to note that all of these studies examined unwanted CSA only; thus, they may overestimate correlates of sociolegally defined CSA, which includes willing sex involving age discrepancies.
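The pooling and outlier screen used in the analyses above can be sketched as follows. This is an illustrative reconstruction using the standard sample-size-weighted Fisher z approach, not the article's actual code; the helper names and any example figures are our own placeholders.

```python
import math

def meta_analyze(effects):
    """Pool correlation effect sizes (r), weighted by sample size,
    via the Fisher z transformation (a standard meta-analytic method;
    illustrative helper, not the article's actual procedure)."""
    num = den = 0.0
    for r, n in effects:
        z = math.atanh(r)        # Fisher r-to-z transform
        w = n - 3                # inverse-variance weight for Fisher z
        num += w * z
        den += w
    return math.tanh(num / den)  # back-transform pooled z to r

def outlier_z(candidate, others):
    """Number of SDs a candidate effect size lies above the mean of the
    remaining effect sizes -- the kind of screen used to flag the
    Kendall-Tackett et al. results as anomalous."""
    mean = sum(others) / len(others)
    sd = math.sqrt(sum((x - mean) ** 2 for x in others) / (len(others) - 1))
    return (candidate - mean) / sd
```

For example, an effect size of .57 screened against a cluster of effect sizes near .13 yields a z far above conventional outlier cutoffs.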
Thus, Dallam et al.'s (2001) argument that college samples are biased because those with a history of CSA more often do not make it to college is inconsistent with the very research they cite for this claim.
In suggesting that CSA debilitates academic performance, Dallam et al. (2001) cited a number of studies (many included in Table 1) that ignored or only weakly controlled for confounding variables. In our review we cited a study (Eckenrode, Laird, & Doris, 1993) that strongly controlled for confounding variables by categorizing a representative community sample into groups on the basis of CSA, physical abuse, neglect, and combinations of these (Rind et al., 1998). In this study CSA was not associated with academic problems, but physical abuse and neglect were.
Another Dallam et al. (2001) citation was a study based on a nationally representative sample of Swedish 17-year-olds, which examined not only students but also dropouts (Edgardh & Ormstad, 2000). Consistent with Dallam et al.'s (2001) argument, the study-level effect size for the association between CSA and problem areas was larger for female dropouts (r = .22) than students (r = .13). However, in contradiction to their argument that CSA causes students to drop out, Edgardh and Ormstad reported that dropping out was confounded with being in foster care (r = .21 for females). These results highlight the weakness in causal assertions based on correlational data.
Exclusion of relevant outcomes
Both Dallam et al. (2001) and Ondersma et al. (2001) argued that we either excluded relevant outcomes or defined harm too narrowly. Regarding Ondersma et al.'s claim of narrowness, we do not believe that the 18 separate categories of mental health measures included in our review were inadequate to examine harm. Indeed, CSA has been depicted as a "special destroyer of adult mental health," as Seligman (1994, p. 232) noted, and victimologists often provide long lists of symptoms labeled as effects of CSA that have included every category of symptoms examined in our review (e.g., depression, dissociation, eating disorders, and sexual maladjustment).
Dallam et al. (2001) claimed that we underestimated some adverse effects of CSA (i.e., posttraumatic stress disorder [PTSD] and behavioral problems) by using college samples. In fact, we examined all psychological correlates that appeared in at least two studies. Nevertheless, we believe that examining every imaginable correlate of CSA is not as informative as examining the magnitude of the relationship between CSA and psychological adjustment. Creating long lists of correlates capitalizes on statistical dependency between outcome measures and inflates apparent impact, unless specific measures are differentially correlated with CSA. In our review they were quite consistently related, which indicates that one measure (e.g., a general measure of adjustment) can act as a proxy for unmeasured correlates. The key issue is the magnitude of the association. In comparison with national and high school samples, the college samples do not underestimate magnitude.
Regarding PTSD, general measures of adjustment are adequate proxies for PTSD because of similar effect sizes. In Neumann, Houskamp, Pollock, and Briere's (1996) meta-analysis, the effect size for general indices of adjustment (d = .46, or r = .19), based on 11 nonclinical and clinical samples, was comparable to that for PTSD (d = .52, or r = .22), based on 4 clinical samples. A national sample of women reporting child rape, cited by Ondersma et al. (2001), had a PTSD effect size of r = .12 (Saunders, Kilpatrick, Hanson, Resnick, & Walker, 1999). One of our college samples did assess PTSD (Brubaker, 1994); the effect size (r = .10) was comparable to all other measures that we meta-analyzed. In short, available evidence shows that the CSA-PTSD relationship is comparable in magnitude to CSA's relationship with many other adjustment measures.
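The paired d and r values above are linked by a standard conversion. The following sketch (our own illustration, not part of the original analyses) also shows why unequal group sizes, such as low CSA base rates, yield a smaller r for the same d:

```python
import math

def d_to_r(d, n1=None, n2=None):
    """Convert Cohen's d to a point-biserial correlation r.
    With equal group sizes the correction factor is 4; with unequal
    groups it is (n1 + n2)^2 / (n1 * n2), which is larger and thus
    shrinks r for the same d. Illustrative helper, not from the article."""
    a = 4.0 if n1 is None else (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d * d + a)
```

With equal groups, d = .46 converts to roughly r = .22; somewhat smaller r values for the same d, like those reported above, presumably reflect unequal CSA and control group sizes in the underlying samples.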
Dallam et al. (2001) also stated that we underestimated effects involving behavior problems. Table 2 shows this is incorrect in terms of magnitude of association; emotional correlates are adequate proxies. Moreover, high school student studies do not support the statement that CSA "has a particularly negative impact on the behavior of adolescent males" (p. 717), as seen in the meta-analytic results for boys and girls in Table 2. The specific research they cited, rather than contradicting our earlier results, is consistent with those results.
Comparison to national samples
Dallam et al. (2001) made the serious charge that we often either misreported or "presented the [abuse severity] data in a misleading manner" (p. 717) in comparing college and national samples. They claimed that CSA in our college samples was categorized by highest level of abuse severity (where severity increases from noncontact sex such as exhibitionism to sexual touching to intercourse), whereas CSA in our national samples was categorized by simple frequency counts, rendering comparisons between the college and national samples invalid. However, only some college samples were categorized by highest level of severity, as is clearly indicated in our comment:
When possible, we obtained and averaged simple frequency counts to assess the degree of each type, rather than the extent of the most severe type. This contradicts Dallam et al.'s (2001) assertion that our numbers are not comparable, as well as their claim that we misreported the data from López, Carpintero, Hernandez, and Fuertes (1995), in which the 33% figure for exhibitionism was based on non-mutually exclusive categories.
Dallam et al. (2001, Table 1) presented their own table of the prevalence of CSA types, based on classification by highest level of severity. However, they left out the most reliable data available, those of Laumann, Gagnon, Michael, and Michaels (1994), which are based on a face-to-face interviewing format, an important methodological strength. Instead, they cited as more relevant to the general population Finkelhor, Hotaling, Lewis, and Smith (1990), a telephone interview study that found that 62% of men and 49% of women reported actual or attempted intercourse. On the basis of this single result, which doubled and quadrupled the rates we found in the college samples, they claimed our samples vastly underestimated severity. However, this study is an outlier compared with all other national studies, probably because of the ambiguous screening question assessing intercourse, a problem that even Finkelhor et al. (1990) acknowledged.
The screening question on intercourse asked about any experience the respondent now considers sexual abuse, using phrasing such as "any kind of" and "anything like that." This wording is ambiguous and does not exclude non-intercourse CSA; it is clearly not a valid measure of intercourse.
In Table 3 we present percentages from five national samples, including the Edgardh and Ormstad (2000) study cited by Dallam et al. (2001). Finkelhor et al. (1990) is clearly an outlier, with percentages 2 to 12 times larger than the other percentages for men and 3 to 10 times larger for women. Extent of intercourse for men is higher in the college than in the national samples, with or without Finkelhor et al. For women, it is equal in the college and national samples without Finkelhor et al. but half as much including Finkelhor et al.
In conclusion, Dallam et al. (2001) misinterpreted our analysis of severity, failed to note a serious validity issue regarding the Finkelhor et al. results, and focused on a dubious estimate coming from this single study. CSA was not less severe in the college compared to national samples, assuming the common belief that intercourse is the most severe type.
Dallam et al. (2001) also disputed our comparison of effect size estimates in the college and national samples, listing three national samples in their Table 2. Although they described Finkelhor et al. (1990) as "more relevant to the U.S. general population" (p. 717) when using it for prevalence of severity, they did not include this study in their effect size analysis.
Additionally, they did not include the other U.S. national sample (Bigler, 1992) that we included in our meta-analysis of national samples (see Rind & Tromovitch, 1997). In that meta-analysis, we stipulated as an inclusion criterion that a national sample had to report data "separately for male and female respondents" (p. 241), given that one goal of that study was to analyze results separately by gender.
In the López et al. (1995) study, the only data on mental health problems reported separately by gender were presented in López et al.'s Table 7. Dallam et al. (2001) computed an effect size (shown in their Table 2) based not on López et al.'s Table 7 but rather on López et al.'s Table 8, which did not separate the sexes and was dominated by behavioral rather than psychological correlates - the mean effect size without base-rate correction for these measures was .13.
Meta-analyses of the male and female effect sizes resulted in the same effect size estimates for males (ru = .07) and females (ru = .10) that we reported previously (Rind & Tromovitch, 1997; Rind et al., 1998). We did not correct for base rates as Dallam et al. (2001) did, for reasons to be discussed later.
For purposes of debate, however, we examined Dallam et al.'s (2001) Table 2 effect sizes with base-rate corrections. They argued that there was "little support for the claim that [our] findings should be considered generalizable to the population as a whole" (p. 718).
We meta-analyzed the two effect sizes from the college data as well as the five national effect sizes they provided and then contrasted these two effect size estimates.
The contrast was statistically significant (z = 2.74, p < .01, two-tailed), but the effect size was very small (r = .02). We believe this exceedingly small association between sample type (college vs. national) and adjustment, based on selected national samples rather than all those available to Dallam et al. (2001), does not justify dismissing the college data as irrelevant to the population as a whole.
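The contrast reported above can be sketched with the standard Fisher z test for a difference between two pooled correlations, followed by conversion of the z statistic to an effect size r. This is an illustrative sketch; the weights and sample sizes one would pass in are hypothetical placeholders, not the actual study values.

```python
import math

def contrast_pooled_r(r1, w1, r2, w2):
    """z test for the difference between two pooled correlations,
    where w1 and w2 are the total inverse-variance weights (sums of
    n - 3 across the samples pooled into each estimate)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    return (z1 - z2) / math.sqrt(1.0 / w1 + 1.0 / w2)

def contrast_effect_size(z_stat, n_total):
    """Convert a contrast z statistic to an effect size r = z / sqrt(N),
    the conversion behind describing a significant contrast as very small."""
    return z_stat / math.sqrt(n_total)
```

With very large total Ns, even a trivial difference between pooled correlations reaches statistical significance, which is how a significant z of 2.74 can correspond to a contrast effect size of only r = .02.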
In sum, Dallam et al. (2001) did not demonstrate that the college data are biased in terms of underestimating severity of CSA or CSA-adjustment relations. In their effort to demonstrate this, they selectively used particular results while ignoring others - the kind of confirmation bias that quantitative (i.e., meta-analytic) reviews are designed to counter, by taking into account all the data that meet pre-specified criteria rather than just the data that bolster one's argument.
It is important to add that the effect size estimates from the high school, college, and national samples are all quite similar and small in magnitude - a result that bolsters, rather than disconfirms, our original conclusions.