A Test of Coding Procedures for Lexical Data with
Tupí-Guaraní and Chapacuran Languages
Natalia Chousou-Polydouri,∗ Joshua Birchall,† Sérgio Meira,† Zachary O’Hagan,‡ and Lev Michael‡
∗ Laboratoire Dynamique du Langage
† Museu Paraense Emílio Goeldi
‡ University of California, Berkeley
Abstract—Recent phylogenetic studies in historical linguistics
have focused on lexical data. However, the way that such data
are coded into characters for phylogenetic analysis has been
approached in different ways, without investigating how coding
methods may affect the results. In this paper, we compare
three different coding methods for lexical data (multistate
meaning-based characters, binary root-meaning characters, and
binary cognate characters) in a Bayesian framework, using
data from the Tupí-Guaraní and Chapacuran language families
as case studies. We show that, contrary to prior expectations,
different coding methods can have a significant impact on the
topology of the resulting trees.

Keywords—Bayesian phylogenetic inference, cognate coding,
historical linguistics, South American indigenous languages

I. INTRODUCTION

South America, long considered the ethnographically and
linguistically “least known continent” [1], has in recent
decades experienced a surge of descriptive and documentary
linguistic research [2], [3]. The classification of the languages
of this region, and especially those of Amazonia, has, in
contrast, advanced little in the last 50 years [4], [5]. However,
the increasing availability of lexical data on South American
languages, as well as recent successes in applying computational
phylogenetic techniques to data of this type, offers us the
opportunity to push forward our understanding of genealogical
relationships in the region with new datasets and tools [6]–[8].

While it is accepted that lexical data from natural languages
carry phylogenetic signal, the study of lexical evolution per se
has largely been neglected by historical linguistics (with the
exception of lexicostatistics), as the evolution of other domains
of language, such as phonology and morphology, is considered
more informative for subgrouping and less susceptible
to borrowing. In contrast, computational phylogenetic studies
in recent years have focused primarily on lexical evolution,
due to the ease with which relatively short wordlists can be
analyzed with a variety of established phylogenetic methods.

A critical aspect of these methods, and a way in which
they differ, is the manner in which phylogenetic characters
are generated from lexical data. The differing nature of these
characters ultimately reflects different understandings of the
phylogenetic notion of homology [9] in the context of lexical
evolution. However, there has been little discussion of the
implications of different coding methods and what the underlying
assumptions of each are regarding how the lexicon evolves. At
the same time, there has been little work to evaluate if and how
different coding methods affect resulting classifications, with
two exceptions: a parsimony-based empirical test on Indo-European
by Rexová and colleagues [10] and an analytical
investigation by Pagel and Meade based on a maximum
likelihood framework [11]. While Rexová and colleagues find
topological differences when using different coding methods,
Pagel and Meade predict no impact on topology, although
differences in branch lengths and support values are expected.

In this paper, we briefly describe and discuss three major
lexical coding methods and we compare their results in a
Bayesian inference framework, using data from the Tupí-Guaraní
and Chapacuran language families as case studies.

II. DATA

We test the different coding methods on lexical datasets
for two South American language families: a Tupí-Guaraní
dataset of 33 languages for a 547-meaning wordlist [7], and
a Chapacuran dataset of 11 languages for a 126-meaning
wordlist [8]. Each dataset includes data for every language
for which adequate lexical data is available.

III. METHODS

A. Coding procedures

We compare three coding procedures based on different
types of characters: 1) multistate meaning-based characters;
2) binary root-meaning characters; and 3) binary cognate
characters. The first two coding methods are based on a
comparative lexical dataset collected using a wordlist, while
the third necessitates the broader collection of lexical data
including close synonyms.

A typical comparative lexical dataset based on a wordlist
yields inherently multistate characters. Each meaning of the
wordlist is a character. All languages that exhibit cognate
forms for a given meaning are given the same character state
value. In other words, each character is equivalent to the
question “For meaning X, what root (or roots) express X?”
and the coding method essentially tracks lexical replacement.
We refer to this scheme as ‘multistate meaning-based’ coding.
Surprisingly, this coding method has been very rarely used
[10]. Among its advantages are the ease of data collection and
its applicability in instances of little available lexical data. One
potential problem of multistate meaning-based coding is that it