Three factors in language variation

Lingua 120 (2010) 1160–1177
Charles Yang
Department of Linguistics & Computer Science, University of Pennsylvania, United States
Received 30 May 2008; received in revised form 28 August 2008; accepted 7 April 2009
Available online 29 May 2009
Universal Grammar and statistical generalization from linguistic data have almost always been invoked as mutually exclusive
means of explaining child language acquisition. This paper shows that such segregation is both conceptually unnecessary and
empirically flawed. We demonstrate the utility of general learning mechanisms in the acquisition of the core grammatical system
through frequency effects in parameter setting, and develop an optimization-based model of productivity with applications to
morphology and syntax in the periphery. These findings in child language support the approach to the evolution of language that
seeks connections between language and other cognitive systems, in particular the consequence of general principles of efficient computation.
© 2009 Elsevier B.V. All rights reserved.
Keywords: Parameter setting; Core vs. periphery; Statistical learning; Morphological processing; Productivity; Dative alternation
1. Introduction
How much should we ask of Universal Grammar? Not too little, for there must be a place for our unique ability to
acquire a language along with its intricacies and curiosities. But asking for too much won’t do either. A theory of
Universal Grammar is a statement of human biology, and one needs to be mindful of the limited structural modification
that would have been plausible under the extremely brief history of Homo sapiens evolution.
In a recent article, Chomsky (2005:6) outlines three factors that determine the properties of the human language faculty:
(1) a. Genetic endowment, ''which interprets part of the environment as linguistic experience . . . and which
determines the general course of the development of the language faculty''.
b. Experience, ''which leads to variation, within a fairly narrow range, as in the case of other subsystems of
the human capacity and the organism generally''.
c. Principles not specific to the faculty of language: ''(a) principles of data analysis that might be used
in language acquisition and other domains; (b) principles of structural architecture and developmental
constraints . . . including principles of efficient computation''.
These factors have been frequently invoked to account for linguistic variation—in almost always mutually
exclusive ways, perhaps for the understandable reason that innate things needn’t be learned and vice versa. Rather than
dwelling on these efforts – see Yang (2004) for an assessment – we approach the problem of variation from the angle of
acquisition by developing a framework in which all three factors are given a fair billing. The study of child language
points to two fairly distinct types of language variation, which appear to invoke two distinct mechanisms of language
One kind of variation derives from the innate and invariant system of Universal Grammar (1a). Such a space of
variation constitutes the initial state of linguistic knowledge, which traditionally has been considered the ‘‘core’’
linguistic system (Chomsky, 1981). The child’s task is one of selection from a narrow range of options (e.g., parameter
values, constraint rankings) that are realized in her linguistic environment. A prominent line of evidence for the genetic
endowment of language comes from the fixed range of linguistic options, some of which are not present in the input data,
but which the child nevertheless spontaneously accesses and gradually eliminates during the course of acquisition.
Quite a different type of variation consists of language specific generalizations which are derived from the linguistic
environment, i.e., experience (1b). This type of variation can be identified with the periphery of the language faculty
(Chomsky, 1981:8): ‘‘marked elements and constructions’’, including ‘‘borrowing, historical residues, inventions’’
and other idiosyncrasies. The child’s task, as we shall see, is one of evaluation: decision making processes that
determine the scope of inductive generalizations based on the input yet still ‘‘within a fairly narrow range’’. We further
suggest that the instantiation of language variation by the child learner follows at least certain principles not specific to
the faculty of language (1c). The mechanism which selects amongst alternatives in the core parameter system in (1a) is
probabilistic in nature and apparently operates in other cognitive and perceptual systems, and had indeed first been
proposed in the study of animal learning and behavior. The acquisition of the periphery system in (1b) reflects general
principles of efficient computation which manipulate linguistic structures so as to optimize the time course of online
processing, very much in the spirit of the evaluation measure in the earlier studies of generative grammar (Chomsky,
1965; Chomsky and Halle, 1968). Both types of learning mechanisms show sensitivity to certain statistical properties
of the linguistic data that have been largely ignored in works that ask too much of Universal Grammar but would be
difficult to capture under approaches that rely solely on experience.
We take up these matters in turn.
2. Variation and selection
2.1. Return of the parameter
Saddled with the dual goals of descriptive and explanatory adequacy, the theory of grammar is primed to offer
solutions to the problem of language variation and acquisition in a single package. This vision is clearly illustrated by
the notion of syntactic parameters (Chomsky, 1981). Parameters unify regularities from (distant) aspects of the
grammar both within and across languages, thereby acting as a data compression device that reduces the space of
grammatical hypotheses during learning. The conception of parameters as triggers, and parameter setting as flipping
switches offers a most direct solution to language acquisition.
There was a time when parameters featured in child language as prominently as in comparative studies. Nina
Hyams’ (1986) ground breaking work was the first major effort to directly apply the parameter theory of variation to
the problem of acquisition. In recent years, however, parameters have been relegated to the background. The retreat is
predictable when broad claims are made that children and adults share the identical grammatical system (Pinker, 1984)
or that linguistic parameters are set very early (Wexler, 1998). Even if we accepted these broad assertions, a
responsible account of acquisition would still require the articulation of a learning process: a child born in Beijing will
acquire a different grammatical system or parameter setting from a child born in New York City, and it would be nice to
know how that happens. Unfortunately, influential models of parameter setting (e.g., Gibson and Wexler, 1994, but see
Sakas and Fodor, 2001) have failed to deliver formal results (Berwick and Niyogi, 1996),1 and it has been difficult to
bridge the empirical gap between child language and specific parameter settings in the UG space (Bloom, 1993;
Valian, 1991; Wang et al., 1992; Yang, 2002). The explanation of child language, which does differ from adult
language, falls upon either performance limitations or discontinuities in the grammatical system, both of which
presumably mature with age and general cognitive development—no thanks to parameters.
1 Baker (2002) and Snyder (2007) both sketched out properties of the parameter space that would make learning more efficient but no specific
learning model has been given.

To return, parameters must provide remedies for both the formal and the empirical problems in child language. The
hope, on our view, lies in paying attention to the factors of experience (1b) and the process of learning (1c), which have
not been addressed with sufficient clarity in the generative approach to acquisition. The variational learning model
(Yang, 2002) is an attempt to provide quantitative connections between the linguistic data and the child’s grammatical
development through the use of parameters. To capture the gradualness of syntactic acquisition, we introduce a
probabilistic component to parameter learning, which is schematically illustrated as follows:2
(2) For an input sentence s, the child
a. with probability Pi selects a grammar Gi,
b. analyzes s with Gi,
c. if successful, rewards Gi by increasing Pi; otherwise punishes Gi by decreasing Pi.
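As a minimal illustration, the scheme in (2) can be sketched as a linear reward-penalty update in the style of Bush and Mosteller's learning models from mathematical psychology, which Yang's variational model builds on. The learning rate GAMMA and the can_analyze parser stub are assumptions made for the sketch, not part of the model's specification:

```python
import random

# A sketch of the variational learner in (2), using a linear
# reward-penalty ("Bush-Mosteller") update. `probs` maps each grammar
# to its current probability (summing to 1); `can_analyze` is a
# hypothetical stand-in for the parser that reports whether grammar g
# successfully analyzes sentence s.

GAMMA = 0.01  # learning rate (assumed value for illustration)

def variational_step(probs, can_analyze, sentence):
    """One learning step: sample a grammar, then reward or punish it."""
    names = list(probs)
    chosen = random.choices(names, weights=[probs[g] for g in names])[0]
    p = probs[chosen]
    if can_analyze(chosen, sentence):
        # reward: move the chosen grammar's probability toward 1,
        # scaling all competitors down proportionally
        probs[chosen] = p + GAMMA * (1 - p)
        for g in names:
            if g != chosen:
                probs[g] *= (1 - GAMMA)
    else:
        # punish: scale the chosen grammar's probability down and
        # redistribute the freed mass uniformly among competitors
        probs[chosen] = p * (1 - GAMMA)
        for g in names:
            if g != chosen:
                probs[g] = GAMMA / (len(names) - 1) + (1 - GAMMA) * probs[g]
    return chosen
```

Both branches preserve the total probability mass, so the probabilities remain a proper distribution; a grammar that is never punished drifts toward probability 1, while its competitors are driven to extinction.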
Learning the target grammar involves the process of selection which eliminates grammatical hypotheses not
attested in the linguistic environment; indeed, the variational model was inspired by the dynamics of Natural Selection
in biological systems (Lewontin, 1983). It is obvious that non-target grammars, which all have non-zero probabilities
of failing in the target grammar environment, will eventually be driven to extinction. The probabilistic nature of
learning allows for alternative grammars – more precisely, parameter values – to co-exist, while the target grammar
gradually rises to dominance over time.3 The reality of co-existing grammars has been discussed elsewhere (Yang,
2002, 2006; Legate and Yang, 2007; Roeper, 2000 and subsequent work) but that line of evidence clearly rests on
establishing the fact that parameter setting is not too early, at least not in all cases; if the child is already on target, the
appeal to non-target parameter values as an explanatory device would be vacuous. This, then, requires us to develop a
framework in which the time course of parameter setting can be quantitatively assessed.
It can be observed that the rise of a grammar under (2) is a function of the probability with which it succeeds under a
sample of the input, as well as that of failure by its competitors. The dynamics of learning can be formalized much like
the dynamics of selection in the evolutionary system. Specifically, we can quantify the ‘‘fitness’’ of a grammar from
the UG Grammar pool as a probability of its failure in a specific linguistic environment:
(3) The penalty probability of grammar Gi in a linguistic environment E is4
ci = Pr(Gi ↛ s | s ∈ E)
In the idealized case, the target grammar has zero probability of failing but all other grammars have positive penalty
probabilities. Given a sufficient sample of the linguistic data, we can estimate the penalty probabilities of the
grammars in competition. Note that such tasks are carried out by the scientist, rather than the learner; these estimates
are used to quantify the development of parameter setting but do not require the tabulation of statistical information by
the child. The present case is very much like the measure of fitness of fruit flies by, say, estimating the probability of
them producing viable offspring in a laboratory: the fly does not count anything.5 Consider two grammars, or two
alternative values of a parameter: target G1 and the competitor G2, with c1 = 0 and c2 > 0. At any time, p1 + p2 = 1.
When G1 is selected, p1 of course will always increase. But when G2 is selected, p2 may increase if the incoming data
is ambiguous between G1 and G2 but it must decrease – with p1 increasing – when unambiguously G1 data is
presented, an event that occurs with the probability of c2. That is, the rise of the target grammar, as measured by p1
2 The formal details of the learning model can be found in Yang (2002). We will use the terms of ‘‘grammars’’ and ‘‘parameters’’ interchangeably
to denote the space of possible grammars under UG. For analytic results of learnability in a parametric space, see Straus (2008).
3 The appeal to non-target and linguistically possible options to explain child language can be traced back to Jakobson (1941/1968) and more
recently (Roeper, 2000; Crain and Pietroski, 2002; Rizzi, 2004, etc.) though these approaches do not provide an explicit role for either linguistic data
or mechanisms of learning.
4 We write s ∈ E to indicate that s is an utterance in the environment E, and G → s to mean that G can successfully analyze s. Formally, the success
of G → s can be defined in any suitable way, possibly even including extra-grammatical factors; a narrow definition that we have been using is
simply parsability.
5 In this sense, the use of the probabilistic information here is distinct from statistical learning models such as Saffran et al. (1996), where
linguistic hypotheses themselves are derived from the statistical properties of the input data by the learner. See Yang (2004) for an empirical
evaluation of that approach.

going to 1, is correlated with the penalty probabilities of its competitor, i.e., c2, which in turn determines the time
course of parameter setting. We turn to these predictions presently.
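The prediction that the time course of parameter setting is governed by the competitor's penalty probability can be made concrete with a small calculation. Under a mean-field approximation of a linear reward-penalty learner with two grammars (target penalty c1 = 0, competitor penalty c2), the target's probability p obeys 1 − p(t) = (1 − p0)(1 − γ·c2)^t, so the expected number of input sentences needed to reach any criterion scales roughly as 1/c2. The learning rate, initial probability, and criterion below are assumed values for illustration only:

```python
import math

# Mean-field sketch: expected number of inputs before the target
# grammar's probability p reaches a criterion, given the competitor's
# penalty probability c2. Derived from 1 - p(t) = (1 - p0)*(1 - g*c2)**t.

GAMMA = 0.01       # assumed learning rate
CRITERION = 0.99   # assumed "parameter set" threshold
P0 = 0.5           # assumed initial probability of the target grammar

def inputs_to_criterion(c2, p0=P0, criterion=CRITERION, gamma=GAMMA):
    """Expected input count until p first reaches the criterion."""
    return math.log((1 - criterion) / (1 - p0)) / math.log(1 - gamma * c2)

# The two benchmarks discussed in the text: ~7% unambiguous data for
# French verb raising (early setting) vs. ~1.2% expletive subjects for
# English (late setting).
early = inputs_to_criterion(0.07)
late = inputs_to_criterion(0.012)
```

On these assumptions the 1.2% environment requires roughly six times as much input as the 7% environment, which is the qualitative early/late contrast developed in section 2.2.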
2.2. Frequency and parameter setting
As a starting point, consider the acquisition of verb raising to tense in French and similar languages. First, what is
the crucial linguistic evidence that drives the French child to the [+] value of the verb raising parameter? Word order
evidence such as (4a), where the position of the finite verb is ambiguous, is compatible with both the [+] and [−] values
of the parameter, and thus has no effect on grammar selection. Only data of the type in (4b) can unambiguously drive
the learner toward the [+] value.
The raising of finite verbs in French and similar languages is a very early acquisition. Pierce (1992) reports that in child
French as early as 1;8, virtually all verbs preceding pas are finite while virtually all verbs following pas are non-finite.
Bear in mind that children in Pierce’s study are still at the two word stage of syntactic development, which is the earliest
stage in which verb raising could be observed from naturalistic production. And this early acquisition is due to the
cumulative effects of utterances such as (4b), which amount to an estimated 7% of child directed French sentences.6
Thus we obtain an empirical benchmark for early parameter setting, that 7% of unambiguous input data is sufficient.7
If all parameters are manifested at least as frequently as 7% of the input, then parameter setting would indeed be
early as widely believed. Fortunately that is not the case, for otherwise we would not be able to observe parameter setting
in action. Let us consider two major cases of syntactic learning: the Verb Second parameter in languages such as
German and Dutch and the obligatory use of grammatical subjects in English. Both have been claimed – incorrectly, as
we shall see – to be very early acquisitions on a par with the raising of finite verbs in French (Wexler, 1998).
In an influential paper, Poeppel and Wexler (1993) make the claim that syntactic parameter setting takes place very
early, a claim which partially represents an agreement between the competence-based and the performance-based
approach (Pinker, 1984; Valian, 1991; Bloom, 1990; Gerken, 1991; cf. Hyams and Wexler, 1993) to grammar
acquisition: both sides now consider the child’s grammatical system to be adult like. Poeppel & Wexler’s study is
based on the acquisition of the V2 parameter. They find that in child German, finite verbs overwhelmingly appear in
the second (and not final) position while non-finite verbs overwhelmingly appear in the final (and not second) position.
But this does not warrant their conclusion that the V2 parameter has been set. A finite verb in the second position
does not mean it has moved to the ‘‘V2’’ position, particularly if the pre-verbal position is filled with a subject, as the
examples from Poeppel and Wexler (1993:3–4) illustrate below:
The structural position of the verb here deserves additional consideration. It is entirely possible that the verb has
gone only as far as T, and the subject would be situated in the Spec of T, and the clausal structure of the raised verb (5)
is not like German but like French. The evidence for V2 can be established only when the verb is unambiguously high
(e.g., higher than T) and the preverbal position is filled.
6 Unless noted through the citation of other references, the frequencies of specific linguistic input from child directed data are obtained from the
CHILDES database (MacWhinney, 1995). The details can be found in Yang (2002), which was the first generative study of language acquisition that
ever used input frequencies in the explanation of child language.
7 We claim 7% to be sufficient but it may not be necessary; an even lower amount may be adequate if the raising of finite verbs is established
before the two word stage, which could be confirmed by comprehension studies such as the preferential looking procedure (Golinkoff et al., 1987).

Table 1
Longitudinal V1 and V2 patterns. All sentences are finite, and the subjects are post-verbal.
[Table body lost in extraction: monthly counts of V1 sentences and of all sentences.]
To evaluate the setting of the V2 parameter, we must examine finite matrix sentences where the subject is
post-verbal. In child German acquisition, as shown in the quantitative study of Stromswold and Zimmerman (1999),
the subject is consistently placed out of the VP shell and is thus no lower than the specifier position of TP. If so, then a
finite verb preceding the subject will presumably be in C, or at least in some node higher than T. Now, if the preverbal,
and thus sentence-initial, position is consistently filled, then we are entitled to claim the early setting of the V2
parameter—or however this property comes to be analyzed. But Poeppel and Wexler's claim is not supported: the
preverbal position is not consistently filled in child language, as shown in Table 1, which is based on Haegeman's (1995: Tables 5 and 6)
longitudinal study of a Dutch child’s declarative sentences:
We can see that in the earliest stages, close to 50% of utterances are V1 patterns, co-existing with V2 patterns, the
latter of which gradually increase in frequency.8 The claim of early V2 setting is therefore not supported.9 As argued
by Lightfoot (1999), Yang (2002) on independent grounds, the necessary evidence for V2 comes from utterances with
the pre-verbal position occupied by the object; such data only comes at the frequency of 1% in child-directed speech,
which results in a relatively late acquisition at the 36–38th month (Clahsen, 1986). Now we have established an
empirical benchmark for the relatively late setting of a parameter.
The quantitative aspects of parameter setting – specifically, the early and late benchmarks – can be further illustrated
by differential development of a single parameter across languages. This leads us to the phenomenon of subject drop by
English-learning children, one of the most researched topics in the entire history of language acquisition. Prior to 3;0
(Valian, 1991), children learning English leave out a significant number of subjects, and also a small but not insignificant
number of objects. However, children learning pro-drop grammars such as Italian and topic-drop grammars such as
Chinese are much closer to adult usage frequency from early on. For instance, Valian (1991) reports that Italian children
between 2;0–3;0 omit subjects about 70% of the time, which is also the rate of pro-drop by Italian adults reported by Bates
(1976) among others. In Wang et al.’s (1992) comparative study of Chinese and American children, they find that 2 year
old American children drop subjects at a frequency of just under 30%,10 which is significantly lower than Chinese
children of the same age group—and obviously significantly higher than English speaking adults. By contrast, the
difference in subject usage frequency between Chinese children and Chinese adults is not statistically significant.
If the claim of early parameter setting is to be maintained, and that certainly would be fine for Italian and Chinese
children, the disparity between adult and child English must be accounted for by non-parametric factors, presumably
by either competence or performance deficiencies. Without pursuing the empirical issues of these alternatives, both
approaches amount to postulating significant cognitive differences, linguistic or otherwise, between the learners
8 The data from 3;1 is probably a sampling oddity: all other months are represented by 5 to 10 recording sessions, but 3;1 had only one.
9 Poeppel and Wexler’s work does show, however, that finite verbs raise to a high position (out of the VP shell), and non-finite verbs stay in the
base position, and that the child grammar has an elaborate system of functional projections, and thus replicating Pierce’s (1992) findings in French
acquisition reviewed earlier. Furthermore, we have no quarrel with their more general claim that the child has access to the full grammatical
apparatus including functional projections. Indeed, even the V1 patterns displayed in Table 1 demonstrate that the structure of the CP is available to
the learner.
10 26.8% to be precise. The literature contains some discrepancies in the rate of subject omission. The criteria used by Wang et al. (1992) seem most
appropriate as they excluded subject omissions that would have been acceptable for adult English speakers. Following a similar counting procedure
but working with different data sets, Phillips (1995) produced similar estimates of subject drop by children.

acquiring different languages: that is, English learning children are more susceptible to competence and/or
performance limitations. This of course cannot be ruled out a priori, but requires independent justification.
The variational learning model provides a different view, in which a single model of learning provides direct
accounts for the cross-linguistic findings of language development. The time course of parameter setting needn’t be
uniform: Italian and Chinese children do set the parameter correctly early on but English children take longer. And the
reason for such differences is due to the amount of data necessary for the setting of the parameter, which could differ
across languages. The following data can unambiguously differentiate the grammars for the Chinese, Italian and
English learning children, and their frequencies in the child-directed speech are given as well.
(6) a. [+topic drop, −pro drop] (Chinese): null objects (11.6%; Wang et al., 1992: Appendix B)
b. [−topic drop, +pro drop] (Italian): null subjects in object wh-questions (10%)
c. [−topic drop, −pro drop] (English): expletive subjects (1.2%)
The reasoning behind (6) is as follows. For the Chinese type grammar, subject omission is a subcase of the more
general process of topic drop, which includes object omission as well. Neither the Italian nor the English type grammar
allows that, and hence object omission is the unambiguous indication of the [+] value of the topic drop parameter.11
However, topic drop is not without restrictions; such restrictions turn out to differentiate the Chinese type grammar
from the Italian type.12 There is a revealing asymmetry in the use of null subjects in topic drop languages (Yang, 2002)
which has not received much theoretical consideration. When topicalization takes place in Chinese, subject drop is
possible only if the topic does not interfere with the linking between the null subject and the established discourse
topic. In other words, subject drop is possible when an adjunct is topicalized (7a), but not when an argument is
topicalized (7b). Suppose the old discourse topic is ‘‘John’’, denoted by e as the intended missing subject, whereas the
new topic is in italics, having moved from its base position indicated by t.
The Italian pro-drop grammar does not have such restrictions. Following Chomsky (1977) and much subsequent work
on topicalization and Wh-movement, the counterpart of (7b) can be identified with an object wh-question. In Italian,
subject omission is unrestricted as its licensing condition is through agreement and thus has nothing to do with the
discourse and information structure. Here again e stands for the omitted subject whereas t marks the trace of movement.
11 The actual amount of data may be higher: even subject drop would be evidence for the Chinese type grammar since the lack of agreement is
actually inconsistent with the pro-drop grammar.
12 As is well known, the pro drop grammar licenses subject omission via verbal morphology. But ‘‘rich’’ morphology, however it is defined, appears
to be a necessary though not sufficient condition, as indicated by the case of Icelandic, a language with ‘‘rich’’ morphology yet obligatory subject.
Thus, the mastery of verbal morphology, which Italian and Spanish learning children typically excel at from early on (Guasti, 2002), is not sufficient
for the positive setting of the pro-drop parameter.

Upon encountering data such as (8), the Chinese type grammar, if selected by the Italian learning child, would fail
and decrease its probability, whose net effect is to increase the probability of the [+] value of the pro drop parameter.
Note that for both the Chinese and Italian learning child, the amount of unambiguous data, namely (6a) and (6b),
occur at frequencies greater than 7%, the benchmark established from the raising of finite verbs in French. We thus
account for the early acquisition of subjects by Italian and Chinese children reported in Valian (1991) and Wang et al. (1992).
The English learning child takes longer as the Chinese type grammar lingers on.13 The rise of the obligatory subject
is facilitated by expletive subjects, but such data appear in low frequency: about 1.2% of child-directed English
utterances. Using the 1% benchmark established on the relatively late acquisition of the V2 parameter (3;0–3;2;
Clahsen, 1986), we expect English children to move out of the subject drop stage roughly around the same time, which
is indeed the case (Valian, 1991). And it is not a coincidence that the subject drop stage ends approximately at the same
time as the successful learning of the V2 parameter.
Finally, the variational model puts the parameter back into explanations of child language: learner’s deviation from
the target form is directly explained through parameters, which are also points of variation across languages. The
child’s language is a statistical ensemble of target and non-target grammars: deviation from the target, then, may bear
trademarks of possible human grammars used continents or millennia away, with which the child cannot have direct
contact. In the case of null subjects, we can analyze the English learning child as probabilistically using the English
type grammar, under which the grammatical subject is always used, in alternation with the Chinese type grammar,
under which subject drop is possible if facilitated by discourse. Thus, the distributional patterns of English child null
subjects ought to mirror those of Chinese adult null subjects as shown in (7). Indeed, the asymmetry of subject drop
under argument/adjunct topicalization for adult Chinese speakers is almost categorically replicated in child English, as
summarized below from Adam's subject drop stage (files 1–20; Brown, 1973):
95% (114/120) of wh-questions with null subjects are adjunct (how, where) questions
(e.g., ''Where e go?'', ''Why e working?'')
97.2% (209/215) of object (who, what) questions contain overt subjects; null-subject object questions (e.g., ''What e doing?'') are correspondingly rare.
Taken together, we have uncovered significant frequency effects in parameter setting. The fact that frequency plays
some role in language learning ought to be a truism; language learning is impressively rapid but it does take time. Yet
the admission of frequency effects, which can only come about through the admission that experience and learning
matter, does not dismiss the importance of the first factor of Universal Grammar (cf. Tomasello, 2003). Quite to the
contrary, frequency effects in parameter setting presented here actually strengthen the argument for Universal
Grammar; frequency effects are effects about specific linguistic structures. The cases of null subjects and verb second
are illustrative because the input data is highly consistent with the target form yet children’s errors persist for an
extended period of time. To account for such input-output disparities, then, would require the learner to process
linguistic data in ways that are quite distant from surface level descriptions. If the space of possible grammars were
something like a phrase structure grammar with rules ''S → NP VP'' and ''S → VP'' applying with probabilities α and β, where α + β = 1, it is
difficult to see how the phenomenon of subject drop is possible with the vast majority of English sentences containing
grammatical subjects. However, if the learner were to entertain ''S → [+topic drop]'' and ''S → [−topic drop]'', as
proposed in the parameter theory, we can capture the empirical findings of children’s null subjects—and null objects as
well. Parameters have developmental correlates, but they would only turn up when both the input and the learning
process are taken seriously.
3. Variation and evaluation
Selection among a universal set of options is by no means the only mode of language acquisition, and it would be
folly to attribute all variation in child language to the genetic component of Universal Grammar.
First, and most simply, the size of the search space and the resulting learning time to convergence can increase
exponentially with the number of parameters; this may undermine the original conception of parameters as a solution
13 The Italian type grammar can be swiftly dismissed by the impoverished English morphology, since sufficient agreement is a necessary condition
for pro-drop; see Yang (2002: Chapter 4) for additional discussion.

for the problem of explanatory adequacy. One phenomenon, one parameter is not recommended practice for syntax
and would not be a wise move for language acquisition either.
Second, it seems highly unlikely that all possibilities of language variation are innately specified; certainly, the
acquisition of particular languages does not always exhibit patterns of competition and selection. Variation in the
sound system is most obvious. While the development of early speech perception shows characteristics of selectionist
learning of phonetic and phonological primitives (Werker and Tees, 1983; Kuhl et al., 1992, see Yang, 2006 for
overview), the specific content of morphology and phonology at any point can be highly unpredictable, partly due to
ebbs and flows of language change over time; see Bromberger and Halle (1989). Innate principles or constraints
packaged in UG notwithstanding, even the most enthusiastic nativist would hesitate to suggest that the English specific
rule for past tense formation (''-d'') is one of the options, along with, say, the -é suffix as in the case of French, waiting
to be selected by the child learner. Indeed, past tense acquisition appears to have a Eureka moment: when the child
suddenly comes to the productive use of the ‘‘add -d’’ rule, over-regularization of irregular verbs (e.g., hold-holded)
starts to take place (Marcus et al., 1992; Pinker, 1999).
Close examination of syntactic acquisition reveals that the child is not only drifting smoothly in the land of
parameters (section 2) but also taking an occasional great leap forward. A clear example comes from the acquisition of
dative constructions. Quantitative analysis of children’s speech (Gropen et al., 1989; Snyder and Stromswold, 1997)
has shown that not all constructions are learned alike, or at the same time. For 11 of the 12 children in Snyder and
Stromswold’s study, the acquisition of the double object construction (‘‘I give John a book’’) precedes that of the
prepositional to-construction (‘‘I give a book to John’’) by an average of just over 4 months. Prior to that point, children
simply reproduce instances of datives present in adult speech. When three-year-olds productively extend these
constructions to novel lexical items, turning pilked the cup to Petey into I pilked Petey the cup (Conwell and Demuth,
2007), they must have learned the alternation on the more mundane pairings of give, lend, send, and others.
The acquisition of past tense and that of datives are obviously different: the specific form of the ‘‘add -d’’ rule is learned
directly from data, whereas the candidate verbs for dative constructions are probably provided by innate and universal
syntactic and semantic constraints (Pesetsky, 1995; Harley, 2002; Hale and Keyser, 2002; Rappaport Hovav and Levin,
2008). But the logical problems faced by the learner are the same. Upon seeing a sample that exemplifies a
construction or a rule, which may contain exceptions (e.g., irregulars), the learner has to decide whether the observed
regularity is a true generalization that extends beyond experience, or a set of lexical exceptions that must be stored in
memory. For the cases at hand, the answer is positive, but the same decision making process ought to return a negative
answer for the ‘‘-ought’’ rule (as in bring-brought) when presented with bring, buy, catch, seek and think. In other words, the
decision making involves recognizing the productivity of language-particular processes.
In the rest of this section, we will extend a mathematical model (Yang, 2005) that specifies the conditions under
which a rule becomes productive. Even though the empirical motivation for that model is based on morphological
learning and processing, there is suggestive evidence that it can extend to the study of syntax as well, as we set out to
explain why the acquisition of double objects precedes that of prepositional (to) dative construction.
3.1. Optimization and productivity
Consider how an English child might learn the past tense rules in her morphology. Suppose she knows only two
words, ring-rang and sing-sang; at this point she might be tempted to conjecture a rule ‘‘ɪ → æ’’ (as in sing-sang).14 The child has
every reason to believe this rule to be productive because, at this particular moment, it is completely consistent with the
learning data. However, as her vocabulary grows, ‘‘ɪ → æ’’ will run into more and more exceptions (e.g., bring-
brought, sting-stung, swing-swung, etc.). Now the learner may decide enough is enough, and the rule will be demoted
to non-productive status: sing and sang would be memorized as instances of lexical exceptions, which is how
English irregular verbs are treated in morphophonology (Halle and Mohanan, 1985; Pinker, 1999). By contrast, the
exceptions to the ‘‘add -d’’ rule – about 150 irregular verbs, depending on how you count – are apparently not enough to
derail its productivity, which is backed up by thousands of regular verbs in English.
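The ‘‘enough is enough’’ decision can be made concrete. As a sketch only, the hypothetical check below implements the threshold from Yang’s (2005) model, under which a rule applying to N lexical items remains productive only if its exceptions number at most N/ln N; the vocabulary counts in the examples are illustrative, not figures from this article.

```python
import math

def is_productive(num_items, num_exceptions):
    """Productivity check after Yang (2005): a rule over N items
    tolerates at most N / ln N exceptions before being demoted
    to non-productive (lexical) status."""
    if num_items < 2:
        return False  # too few items to generalize over
    threshold = num_items / math.log(num_items)
    return num_exceptions <= threshold

# A small "i -> a" class with mostly exceptions: demoted.
print(is_productive(10, 7))      # False
# "add -d": ~150 irregulars against thousands of regulars: survives.
print(is_productive(7000, 150))  # True
```

On this view the same formula explains both outcomes: the irregular-heavy vowel-change class fails the threshold early in acquisition, while the regular rule comfortably passes it.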
14 One way to do so is to make conservative generalizations over a set of structural descriptions that share the same process of structural change.
This is commonly found in inductive learning models in Artificial Intelligence (Mitchell, 1982; Sussman and Yip, 1997), and its first linguistic
application is in Chomsky (1955). How the learner comes to such generalizations is an important issue but one which is of no significant interest to
us for present purposes; our key task is to determine the productivity of such generalizations.

The question is: How much is enough? How many exceptions can a productive rule tolerate? What kind of batting
average15 would the learner hold out for productive rules? Before we lay out our approach to the problem, let us note
that the child learner is superbly adept at recognizing productive processes. For instance, it is well known that English
learners over-regularize in past tense acquisition (Marcus et al., 1992; Pinker, 1999; Yang, 2002), in up to 10% of all
past tense uses. It is perhaps less well known that children do not over-irregularize: errors such as bring-brang are
exceedingly rare, constituting only 0.2% of the past tense production data (Xu and Pinker, 1995). In Berko’s classic
Wug study (1958), four-year-olds reliably supply regular past tense for novel verbs, but only one of 86 children
extended gling and bing to the irregular forms glang and bang, despite maximum similarity to the sing-sang and
ring-rang irregulars. Children generalize productive rules but do not generalize lexical rules.
There is a line of research in morphology that gathers corpus statistics to correlate with behavioral tests for
productivity (Aronoff, 1976; Baayen and Lieber, 1991; Hay, 2003, though see Schütze, 2005a,b for a criticism of some
of the methodologies) but these results are, at best, a statistical summary of the empirical data that ‘‘accord nicely with
[linguists’] intuitive estimate of productivity’’ (Baayen and Lieber, 1991). Even if a criterion for productivity
established from these data summarizations turns out to be true – e.g., a productive rule can tolerate no more than 25%
exceptions – we still need an explanation of why that magic number is 25%, rather than 18% or 32%.
Our approach is a throwback to the notion of an evaluation measure, which dates back to the foundations of
generative grammar (Chomsky, 1955; Chomsky and Halle, 1968, in particular p. 172). It provides an evaluation metric,
and hence a decision procedure, that the learner can deploy to determine whether a linguistic generalization is
warranted or not. It is useful to recall that the evaluation measure ‘‘is not given a priori . . . Rather, a proposal
concerning such a measure is an empirical hypothesis about the nature of language’’ (Chomsky, 1965:37). A model of
productivity, therefore, requires independent motivation. And this is an area where Chomsky’s third factor – in
particular, ‘‘principles of efficient computation’’ – may play an important role in the organization of the language
faculty. Claims of efficient computation require an independently motivated metric of complexity. To this end, we
conjecture that the mental representation of morphological knowledge is driven by the time complexity of online
processing: productivity is the result of maintaining an optimal balance between lexical and productive rules.
Though direct storage of derived forms has been suggested (Baayen, 2003),16 the combinatorial explosion of
morphologically complex languages (Hankamer, 1989; Niemi et al., 1994; cf. Chan, 2008) necessitates a stage-based
architecture of processing that produces morphologically complex forms by rule-like processes (Caramazza, 1997;
Levelt et al., 1999). At the minimum, the stem must be retrieved from the lexicon and then combined with appropriate
rules/morphemes to generate the derived form. Both processes appear to be geared toward real-time efficiency, where a
telling source of evidence comes from frequency effects. One of the earliest and most robust findings in lexical
processing is that high frequency words are recognized and produced faster than low frequency words in both visual
and auditory tasks (Forster and Chambers, 1973; Balota and Chumbley, 1984). Within the component of
morphological computation, it is well established that the processing of exceptions (e.g., irregulars) is strongly
correlated with their frequency (see Pinker and Ullman, 2002 for reviews). Such findings have been considered
problematic for discrete representations in generative morphology – see Seidenberg and Gonnerman (2000), Hay and
Baayen (2006) – but that would be far too hasty; see Yang (2008) for general discussion of probabilistic matters in
generative grammar. When understood in terms of modern computer science algorithms, formal models of linguistic
competence can be directly translated into a performance model (cf. Miller and Chomsky, 1963; Berwick and
Weinberg, 1984); they not only accommodate behavioral results such as frequency effects but also lead
to an evaluation measure for productivity.
Generative theories traditionally hold that the organization of morphology is governed by the Elsewhere Condition
(Kiparsky, 1973; Halle, 1990), which requires the application of the most specific rule/form when multiple candidates
are possible. This provides a way for representing exceptions together with rule-following items. Algorithmically, the
Elsewhere Condition may be implemented as a serial search procedure17:
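A minimal sketch of such a serial search procedure follows; the word list, its frequency ordering, and the orthographic ‘‘-ed’’ standing in for the ‘‘add -d’’ rule are all illustrative assumptions, not material from the article.

```python
# Elsewhere Condition as serial search: specific (irregular) forms
# are checked first, ordered by token frequency as in Forster-style
# search models; the productive rule applies "elsewhere".
IRREGULARS = [
    ("go", "went"),
    ("think", "thought"),
    ("hold", "held"),
    ("sing", "sang"),
]

def past_tense(verb):
    for stem, past in IRREGULARS:  # most specific entries first
        if verb == stem:
            return past            # lexical exception found
    return verb + "ed"             # elsewhere: the productive rule

print(past_tense("sing"))  # sang   (stored exception)
print(past_tense("pilk"))  # pilked (novel verb, regular rule)
```

A search model of this shape predicts frequency effects for exceptions directly: the earlier an irregular sits in the list, the faster it is retrieved, while regulars incur the full traversal before the default rule fires.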
15 For the uninitiated, this is American baseball talk, referring to the percentage of at-bats in which a batter manages a hit. Batters with
sufficiently high batting averages get rich and famous, and the rest languish in the Minors.
16 As noted by Pinker (1999), there has been a lack of reports of storage effects in auditory presentations of regularly inflected items; it is thus
possible that Baayen’s results may be an artifact of orthographic familiarity.
17 This is again a return to an earlier approach in lexical processing, the serial search model of Forster (1976, 1992). The advantage of this model
lies in its ready availability for analytic methods, and its empirical coverage is at least as good as other approaches (Murray and Forster, 2004).