Language, Dialect, and Register:
Sociolinguistics and the Estimation of
Measurement Error in the Testing of
English Language Learners
University of Colorado at Boulder
This article examines the intersection of psychometrics and sociolinguistics in the
testing of English language learners (ELLs); it discusses language, dialect, and
register as sources of measurement error. Research findings show that the dialect of
the language in which students are tested (e.g., local or standard English) is as
important as language as a facet that influences score dependability in ELL testing.
The development, localization, review, and sampling of items are examined as
aspects of the process of test construction critical to properly attaining linguistic
alignment: the correspondence between the features of the dialect and the register
used in a test, and the features of the language to which ELLs are exposed in both
formal and instructional contexts.
Though well recognized, the impact of language on the validity of tests is
yet to be properly addressed. Since testing typically depends on the use of
language, to a large extent an achievement test is a test of language pro-
ficiency (American Educational Research Association/American Psycholog-
ical Association/National Council on Measurement in Education, 1999).
Language remains the prime construct-irrelevant factor in testing—a factor
that an instrument does not intend to measure yet affects test scores (see
Messick, 1989).
Although language is always an issue in testing, it becomes a much more
serious problem when students are not proficient in the language in which
they are tested. Efforts in the field of testing accommodations for English
language learners (ELLs) have rendered results that speak to the difficulty
of addressing this challenge. The effectiveness of the linguistic simplification
of items is limited by factors such as the ELL students’ language back-
grounds (e.g., Abedi & Lord, 2001; Abedi, Lord, Hofstetter, & Baker, 2000).
Moreover, language interacts with mathematics achievement in tests in ways
that are different for ELL students and their non-ELL counterparts (Abedi, 2002).

Teachers College Record, Volume 108, Number 11, November 2006, pp. 2354–2379
Copyright © by Teachers College, Columbia University
The issue of language as a construct-irrelevant factor in ELL testing is
aggravated by inappropriate or inconsistent testing practices and policies.
Information on the linguistic proficiency of ELLs is usually fragmented or
inaccurate (De Avila, 1990), and the criteria and instruments used to classify
students as ELLs are not the same across states (Abedi, 2004; Aguirre-Muñoz
& Baker, 1997). Even attempts to characterize the linguistic proficiency
of ELLs based on the kind of bilingual programs in which they are
enrolled (or whether they are in any bilingual program at all) may be flawed
because these programs vary considerably in type and fidelity of imple-
mentation (Brisk, 1998; Gandara, 1997; Krashen, 1996), and their success is
shaped by a multitude of contextual factors (Cummins, 1999).
Several authors (e.g., LaCelle-Peterson & Rivera, 1994; O. Lee, 1999,
2002, 2005; Lee & Fradd, 1998; Solano-Flores & Nelson-Barber, 2001;
Solano-Flores & Trumbull, 2003) have asserted that existing approaches to
dealing with diversity are limited because they lack adequate support from
current theories of language and culture. This gap between disciplines is
well illustrated by results from a recent review of surveys of ELL testing
practices (Ferrara, Macmillan, & Nathan, 2004). This study revealed that
among the accommodations reported for ELLs are actions of dubious rele-
vance to language—such as providing enhanced lighting conditions—bor-
rowed from the set of accommodations created for students with disabilities
(see Abedi, Hofstetter, & Lord, 2004). Although these irrelevant accom-
modations are well intended and may contribute to enhancing testing con-
ditions for any student, they do not target characteristics that are critical to
the condition of being an ELL, and they ultimately lead to obtaining invalid
measures of academic performance for ELLs.
Although linguists have seriously questioned current ELL testing prac-
tices (e.g., Cummins, 2000; Hakuta & Beatty, 2000; Hakuta & McLaughlin,
1996; Valdés & Figueroa, 1994), this criticism has not brought with it
alternative approaches. Unfortunately, this dearth of alternative approaches
becomes more serious in the context of the No Child Left Behind Act
(2001), which mandates that ELLs be tested in English after a year of living
in the United States or of being enrolled in a program for ELLs. Thus,
ELLs will continue to be tested for accountability purposes in spite
of both the flaws of the new accountability system (see Abedi, 2004) and the
body of evidence from the field of linguistics that shows that individuals
need more time to acquire a second language before they can be assumed to
be fully proficient in that language (Hakuta, Goto Butler, & Witt, 2000).
This article addresses the need for research in the field of language from
which new and improved methods for the testing of ELLs can be derived
(see August & Hakuta, 1997). It addresses the fact that tests, as cultural
artifacts, cannot be culture free (Cole, 1999) and that constructs measured
by tests cannot be thought of as universal and are inevitably affected by
linguistic factors (see Greenfield, 1997). It establishes the intersection of two
disciplines: (1) sociolinguistics, which is concerned with the sociocultural
and psychological aspects of language, including those involved in the ac-
quisition and use of a second language (see Preston, 1999) and (2) psycho-
metrics, which in the context of education is concerned with the design and
administration of tests and the interpretation of test scores with the intent of
measuring knowledge and skills.
This article is organized in two parts. In the first part, I discuss the link
between two key concepts in sociolinguistics: dialect and register; and two
key concepts in psychometrics: sampling and measurement error. These
concepts are critical to the development of new, alternative psychometric
approaches that address the tremendous heterogeneity that is typical of
populations of ELLs.
In the second part, I discuss the notion of linguistic alignment: the cor-
respondence between the dialect and the register used in a test and the
characteristics of the language to which ELLs are exposed. I then discuss
ways in which linguistic alignment can be addressed in different areas of the
testing process.
Current approaches to testing ELLs are mainly based on classifications of
students according to broad linguistic groups, such as students whose first
language is English, or students whose first language is Spanish. This view is
reflected in the designs used traditionally in ELL research. These designs
focus on test score differences between populations of ELLs and main-
stream non-ELLs, or on test score differences between subgroups within a
given population defined by some kind of treatment. For example, in the
field of testing accommodations for ELLs, an ‘‘ideal study . . . is a 2 × 2
experimental design with both English language learners and native speak-
ers of English being randomly assigned to both accommodated and non-
accommodated conditions’’ (Shepard, Taylor, & Betebenner, 1998, p. 11).
In some cases, the classifications used in these studies may be inaccurate
because of the wide variety of types of ELLs or bilingual students (see
Aguirre-Muñoz & Baker, 1997; Casanova & Arias, 1993; Council of Chief
State School Officers, 1992). In addition, these classifications do not always
refer to the students’ academic language proficiencies in either English or
their native language (see Cummins, 2000).
An additional level of analysis can be used that comprises two additional
closely related components: dialect and register (Figure 1). Level refers to
the fact that dialect and register are considered to be subordinate categories
of a language in the sense that there may be many dialects of the same
language and many registers within the same language (see Wardhaugh, 2002).

Figure 1. Levels of Analysis in the Testing of ELLs

Whereas dialect refers to a variation of a language that is characteristic of
the users of that language, register refers to a variation of a language that is
determined by use—a situation or context. Dialects are different ways of
saying the same thing; they reflect social structure (e.g., class, gender, and
origin). Registers are ways of saying different things; they reflect social
processes (e.g., division of labor, specialty, contexts, content areas, and spe-
cific activities; Halliday, 1978). Dialects are associated with the linguistic and
cultural characteristics of the students who belong to the same broad lin-
guistic group; registers are associated with the characteristics of the lan-
guage (especially academic language) used in tests.
This section discusses how the sociolinguistic perspective and the psy-
chometric perspective can be used in combination to examine language,
dialect, and register as sources of measurement error.
The idea of linking psychometrics to sociolinguistics originated in a project
whose main goal was to assemble a sample of responses given by ELL
students, whose first languages were Spanish, Chinese, and Haitian Creole,
to the same set of open-ended science and mathematics items administered
in English and in their native language (Solano-Flores, Lara, Sexton, &
Navarrete, 2001). Our intent was to show, side by side, the students’ re-
sponses to each item in both languages.
In selecting the response samples, we observed that the quality of the
students’ responses was inconsistent across both items and languages. A
given student might perform better in his first language than in English for
some items but better in English than in his first language for other items. If
we wanted to determine whether these students should be tested in English
or in their first language, comparing the mean scores they obtained in each
language would not render valuable information because the score
differences would cancel each other. What we needed was an estimation
of the amount of score variation due to this interaction between student,
item, and language.
To accomplish our goal, we used generalizability (G) theory (Brennan,
1992, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson &
Webb, 1991). G theory is a psychometric theory that deals with measure-
ment error. It distinguishes between two kinds of sources of score variation.
One is student (s), the object of measurement; the other comprises facets—
sources of measurement error. The facets in our study were item (i), rater
(r), and language (l). G theory allowed us to estimate score variance due to
(1) the main effect of student (s); (2) the main effect of the facets (i, r, l); and
(3) the interaction effect of all sources of score variation (si, sr, sl, ir, il, rl,
sir, sil, srl, irl, and sirl,e; the e in ‘‘sirl,e’’ denotes the measurement error
that cannot be accounted for and that is due to unknown sources).
Our analyses revealed that the sil interaction produced the largest score
variation. The performance of ELLs was considerably inconsistent both
across items and across languages. These results indicated that, in addition
to their knowledge of the content area assessed, ELLs had different sets of
strengths and weaknesses in English and in their native language, and in
addition to their intrinsic cognitive demands, test items posed different sets
of linguistic challenges.
A series of studies performed with larger samples of students confirmed
those results (Solano-Flores & Li, 2006). In this new series of studies, we
examined the ρ² and φ coefficients, which respectively express the extent to
which student achievement measures can be generalized depending on
whether they are intended to rank individuals or to index their absolute
level of performance (Shavelson & Webb, 1991). Based on these coeffi-
cients, we were able to determine the number of items that would be
needed to obtain dependable scores by testing ELLs in English and in their
native language.
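The role these coefficients play in deciding how many items are needed can be sketched as follows. For simplicity, this assumes a single-facet student × item design with invented variance components, so the formulas are the textbook one-facet versions rather than those of the multi-facet studies described above.

```python
def rho2(v_s, v_si, n_items):
    # generalizability coefficient (relative decisions: ranking individuals)
    return v_s / (v_s + v_si / n_items)

def phi(v_s, v_i, v_si, n_items):
    # index of dependability (absolute decisions: level of performance)
    return v_s / (v_s + (v_i + v_si) / n_items)

def items_needed(target, coef, *components):
    # smallest number of items for which the coefficient reaches the target
    n = 1
    while coef(*components, n) < target:
        n += 1
    return n

# invented variance components: student, item, student-x-item (+ error)
v_s, v_i, v_si = 0.50, 0.10, 1.00
print(items_needed(0.80, rho2, v_s, v_si))       # 8 items for relative decisions
print(items_needed(0.80, phi, v_s, v_i, v_si))   # 9 items for absolute decisions
```

Because φ also counts the item main effect as error, absolute decisions never require fewer items than relative ones.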
We observed that the number of items needed to obtain dependable
scores varied within the same broad linguistic group. We also observed that
testing ELLs in their native languages does not necessarily produce more
dependable scores than testing them in English. For example, in order to
obtain more dependable measures of academic achievement, some groups
of native Haitian Creole speakers might need to be tested in English, while
others might need to be tested in Haitian Creole. A similar pattern was
observed among native speakers of Spanish.
The considerable score variation due to the interaction of student, item,
and language is consistent with two well-known facts about bilinguals. First,
ELLs have different patterns of language dominance that result from dif-
ferent kinds of language development in English and their native languages
(see Stevens, Butler, & Castellon-Wellington, 2000). Second, ELLs’ linguistic
proficiencies vary tremendously across language modes (i.e., writing,
reading, listening, speaking) and contexts (e.g., at home, at school, with
friends, with relatives); they are shaped by schooling (e.g., bilingual or full
immersion programs) and the way in which language instruction is
implemented (e.g., by emphasizing reading or writing in one language or the
other; Bialystok, 2001; Genesee, 1994; Valdés & Figueroa, 1994).

Table 1. Differences Between Approaches to ELL Testing Based on Item Response
Theory (IRT) and Approaches Based on Generalizability (G) Theory

                       Item Response Theory                 Generalizability Theory
                       Scaling, score differences           Measurement error, score
                       between linguistic groups            variation due to language
                       Differential item functioning        Test score dependability
                       Between groups                       Within groups
Reference groups       Non-ELLs or ELLs who do not          No reference groups
                       receive testing accommodations
Level of analyses      Item
Characteristics of     Clear-cut differences assumed        No clear-cut differences
linguistic groups
An approach based on viewing language as a source of measurement
error addresses the fact that bilingual individuals do not typically replicate
their capacities across languages (Bialystok, 1991; Heubert & Hauser,
1999). This approach differs substantially from other approaches to ELL
testing. For example, approaches based on item response theory (IRT)
allow interpretation of mean score differences between linguistic groups
(e.g., Camilli & Shepard, 1994; Ercikan, 1998; van de Vijver & Tanzer,
1998); they examine bias due to language based on differential item func-
tioning (DIF): the extent to which individuals from different populations
(e.g., ELLs and non-ELLs) have different probabilities of responding
correctly to an item despite having comparable levels of performance on the
underlying measured attribute (Hambleton, Swaminathan, & Rogers,
1991). In contrast, an approach based on G theory does not necessarily
compare groups (Table 1).
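To illustrate the DIF logic with one common procedure, the sketch below computes the Mantel-Haenszel common odds ratio for a single item; this is a standard DIF statistic but not necessarily the method used in the studies cited, and all counts are invented.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across ability strata for one item.
    Each stratum is (a, b, c, d): reference-group correct/incorrect,
    focal-group correct/incorrect, with groups matched on total score."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# invented counts at three score levels (reference vs. focal group)
strata = [(40, 10, 30, 20), (30, 20, 20, 30), (10, 40, 5, 45)]
alpha = mantel_haenszel_odds_ratio(strata)
print(round(alpha, 2))  # 2.39: well above 1, the item favors the reference group
```

An odds ratio near 1 indicates no DIF; values far from 1 flag items on which the two groups, matched on overall performance, differ in their chances of success.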
A detailed discussion of the characteristics of G theory and IRT cannot be
provided in this article for reasons of space. However, it should be said that,
rather than being alternate approaches to the same measurement prob-
lems, the two theories serve different sets of purposes and can be used
complementarily. An approach to ELL testing that integrates the two the-
ories may not take place soon because efforts to address some methodo-
logical and theoretical issues to link them are still in progress (e.g., Briggs &
Wilson, 2004). In addition, although it has been used to examine error
variance due to facets such as rater, task, occasion, and context in the testing
of second-language proficiency (Bachman, Lynch, & Mason, 1995; Bolus,
Hinofotis, & Bailey, 1982; Brown & Bailey, 1984; Y. W. Lee, 2005; Molloy &
Shimura, 2005; Stansfield & Kenyon, 1992), G theory has not been used
before to examine language as a source of measurement error.
A dialect is defined by linguists as a variety of a language that is distin-
guished from other varieties of the same language by its pronunciation,
grammar, vocabulary, discourse conventions, and other linguistic features.
Dialects are rule-governed systems, with systematic deviations from other
dialects of the same language (Crystal, 1997). Research on the linguistic
characteristics of several non-standard-English dialects has found that these
dialects are ‘‘as complex and as regularly patterned as other varieties of
English, which are considered more standard’’ (Farr & Ball, 1999, p. 206).
Thus, although the term dialect is frequently used to refer to the language
used by people from a particular geographic or social group or to mean a
substandard variety of a language, in fact everyone speaks dialects (Preston,
1993). Standard English is one among many English dialects (Wardhaugh, 2002).
Different dialects may originate from contact with other languages or
from the fact that certain features of a language shared by its speakers
evolve among some communities but are kept among others (Wolfram,
Adger, & Christian, 1999). Thus, ELLs from the same broad linguistic
group but from different geographical areas within the United States can be
thought of as speakers of different dialects of their own language and
speakers of different English dialects.
In thinking about dialect and testing, it must be kept in mind that dis-
tinguishing between two dialects of a given language or defining a language
or a dialect may be problematic and even a matter of opinion. For example,
the differences between Mandarin and Cantonese are more profound than
the differences between Danish and Norwegian, yet Mandarin and Can-
tonese are considered dialects of Chinese, whereas Danish and Norwegian
are treated as different languages (see Haugen, 1966). However, although
dialects may be difficult to characterize, what is relevant to our discussion is
the notion that dialect can be an important source of measurement error.
We adopt a pragmatic approach in our research. Rather than trying to
characterize the dialect of each community, we assume that communities of
users of the same language may have dialect differences important enough
to affect student performance on tests.
For the purpose of this discussion, the term community is used here in a
broad sense to refer to a group of users of the same dialect. This group
can be, for example, a school community or a group of individuals who
speak the same language and live in the same neighborhood. It is not
necessarily used as a synonym of speech community, a concept that is somewhat
controversial because it assumes constancy across social contexts in the way in
which individuals use a speech style (see Hymes, 1974; Wardhaugh, 2002).

Figure 2. Commonalities and Differences Between Dialects

Although the dialects spoken by different communities are mutually in-
telligible (Rickford & Rickford, 1995)—they tend to differ in phonetics and
phonology but not in semantics (Halliday, 1978)—in the absence of oppor-
tunities for clarification, body language, and certain physical clues, tests
limit the possibilities for understanding test items. Moreover, because young
ELLs are developing both their first and second languages or because their
own native language is poorly developed, their performance in tests can be
especially sensitive to dialect variations, regardless of the language in which
they are tested. Subtle but important differences in elements such as word
usage, word frequency, syntax, and the use of certain idiomatic expressions
may limit the capacity of standard dialect versions of tests to properly assess
ELL students in either English or their native language (Solano-Flores,
Trumbull, & Kwon, 2003).
Figure 2 illustrates the commonalities and differences that may exist
between dialects of the same language. For the sake of simplicity, this ex-
ample assumes that there are only two dialects, A and B, of the same lan-
guage. The circles represent the sets of linguistic features (e.g., grammar,
vocabulary, word use frequency, and discourse conventions) that define
Dialect A and Dialect B. The intersection of A and B (A ∩ B) represents all
the features that the two dialects have in common; the areas outside the
intersection represent all the features that the dialects do not have in com-
mon and that might pose a challenge for communication.
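The set model in Figure 2 can be made concrete with a toy sketch; the numbered feature labels are invented placeholders, not real linguistic features.

```python
# hypothetical feature inventories for two dialects of the same language
dialect_a = {f"f{k}" for k in range(0, 80)}    # Dialect A (mainstream)
dialect_b = {f"f{k}" for k in range(30, 100)}  # Dialect B (local)

shared = dialect_a & dialect_b                 # the intersection A ∩ B

# a test written in the "standard" (mainstream) dialect draws on A only
test_features = dialect_a
print(len(shared))                      # 50 features both dialects share
print(len(test_features - dialect_a))   # 0: no unshared features for A users
print(len(test_features - dialect_b))   # 30: potential challenges for B users
```

Only the test features outside A ∩ B can burden Dialect B users, which is the asymmetry the discussion below describes.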
In the field of testing, it is common to say that a test is written in the
standard form of a language (as in Standard English) to imply that it can be
understood by all the users of that language. In our example, this is true
only if all the linguistic features of the test are in A ∩ B. However, the reality
might be otherwise. Dialects are associated with various social groups or
classes (Coulmas, 2005); standard is actually used to refer to the mainstream
or most socially acceptable dialect in a society (Wardhaugh, 2002). If Dialect
A is the dialect used by the mainstream segment of a society, then a test

Teachers College Record
written in the standard dialect reflects all the linguistic features of Dialect A but
only the linguistic features of Dialect B that are in A ∩ B. As a consequence,
Dialect B users are more likely than Dialect A users to face linguistic
challenges that are not related to the construct that the test is intended to
measure.
Solano-Flores and Li (2006) have performed a series of G studies in
which both dialect and language have been examined as sources of meas-
urement error. These studies have provided empirical evidence that dialect
can be as important as language in the testing of ELLs.
The participants in these studies were fourth- and fifth-grade ELL stu-
dents whose first language was Haitian Creole. They were assigned to two
treatment groups. In Group 1, students were tested across languages with
the same set of National Assessment of Educational Progress (NAEP) math-
ematics items (drawn from NAEP 1996, 2000) in both Standard English (the
original version of the test) and the standard dialect version of Haitian
Creole, created by professional translators. In Group 2, students from two
communities of Haitian Creole speakers were tested with the same set of
items in two dialect versions of Haitian Creole, standard and local—the
dialect of Haitian Creole used in their communities.
To create the local-dialect versions of the test, a series of test translation
sessions was conducted with a team of teachers from each community.
These teachers were asked to translate the items into a version of Haitian
Creole that reflected the language used in their community and that could
be understood by their own students.
G theory analyses revealed a considerable score variation due to the
interaction of student, item, and dialect for Group 2. Moreover, the mag-
nitude of score variation due to this interaction was as large as the mag-
nitude of score variation due to the interaction of student, item, and
language. These results indicate that ELL students do not necessarily per-
form better if they are tested in a standard version of their native language
than if they are tested in Standard English. In addition, the results indicate
that the dialect of the language in which they are tested (whether it is
English or the first language) is a powerful influence that shapes student
performance. Whether tested in English or in their native language, ELLs
are tested in some dialect of that language. Regardless of what language is
used to test ELLs, dialect can be crucial to obtaining valid measures of their
academic achievement.
Linguists distinguish between the linguistic skills used by ELLs in informal
conversation and the linguistic skills inherent to learning content (Cum-
mins, 1996; Hamayan & Perlman, 1990). Although there has been debate

Language, Dialect, and Register
around the nature of this distinction (see, for example, Cummins, 2000;
Edelsky, 1990; MacSwan & Rolstad, 2003; Rivera, 1984), there is agreement
that school and testing pose greater linguistic demands on ELLs than those
posed by using a second language in informal settings.
The concept of register as a variety of a language is particularly useful in
conceptualizing the challenges that ELLs face in testing:
A register [italics added] can be defined as the configuration of semantic
resources that the member of a culture typically associates with a situ-
ation type. It is the meaning potential that is accessible in a given social
context. Both the situation and the register associated with it can be
described to varying degrees of specificity; but the existence of reg-
isters is a fact of everyday experience—speakers have no difficulty in
recognizing the semantic options and combinations of options that are
‘‘at risk’’ under particular environmental conditions. Since these op-
tions are realized in the form of grammar and vocabulary, the register
is recognizable as a particular selection of words and structures. But it
is defined in terms of meanings; it is not an aggregate of conventional
forms of expression superposed on some underlying content by ‘‘so-
cial factors’’ of one kind or another. It is the selection of meanings that
constitutes the variety to which a text belongs. (Halliday, 1978, p. 111)
Thus, performing well on a standardized test requires ELLs to know more
than the content area assessed by the test or the language in which it is
administered. It also requires them to use the register of that
discipline and the register of tests. This combined register is defined by the
activity in which individuals are engaged at a given time (e.g., taking a test).
Among other things, this register differs from other registers on features
such as semantics (e.g., root has different meanings in colloquial language
and in mathematics); word frequency (e.g., ion is mostly restricted to the
content of science); idiomatic expressions (e.g., the phrase None of the above
is used almost exclusively in multiple-choice tests); notation (e.g., A divided
by B is represented as A/B); conventions (e.g., uppercase letters are used to
denote variables); syntactical structures (e.g., the structure of multiple-
choice items in which an incomplete sentence [the stem] is followed by
several phrases [the options]); and ways of building arguments (e.g., Let A be
an integer number).
No wonder that, although ELLs can become reasonably fluent in con-
versational English in a relatively short time, it takes much longer for them
to become fluent in academic English (Cummins, 1984, 2003; Guerrero,
1997; Hakuta et al., 2000). In addition, it takes much longer for them to
deal successfully with the fact that test items tend to contain dense text and
scant contextual information, use colloquial terms with unusual meanings