Nature paper on the genetics of Aboriginal Australia

Somewhat belatedly, here is a link to new work of mine and colleagues’ on gene-language coevolution in Pama-Nyungan, the peopling of Sahul, and migration and admixture in the Pleistocene. It was recently published in Nature. There’s a lot in this paper, a Genomic History indeed. There has been some media attention, particularly Michael Erard’s piece on Pama-Nyungan phylogenetics and how important computational work has been to recent advances in Australian language history. There’s also a summary piece in The Conversation, particularly about the genetic side of the paper.

Video of conference talk on Grammar Boot Camps

I run a grammar boot camp every year, where a small group of students write a grammar of a language in a month. Last year it was Ngalia, and this year (starting in a few weeks) it’ll be Cundalee Wangka and Kuwarra. I also ran a year-long grammar group to pilot the idea in 2013, using materials from Tjupan. All four languages are varieties of the Wati subgroup of Pama-Nyungan and all the books are based on fieldwork conducted by Sue Hanson.

At the recent Wanala Conference run by the Goldfields Language Centre, Anaí Navarro, Matthew Tyler and I did a video presentation about the boot camp, its aims, methods, and results. Here’s a link to the video. Be warned that it’s 190 MB and 22 minutes long.

Tasmanian language data

The CHIRILA database contains materials from the Aboriginal languages of Tasmania. The Excel spreadsheets contain all the records from Plomley’s (1976) Tasmanian language data, and additional spreadsheets contain explanatory data about the speakers represented in the text, the regions where data were recorded, and who the recorders were. This is the data used in Bowern (2012).

A word of warning is warranted here. This is not easy data to use; there’s a steep learning curve for understanding the original transcription conventions, Plomley’s groupings, and the abbreviations.



The following files are the complete extant records for the languages of Tasmania,...

Introducing CHIRILA

I am very pleased to announce that the first phase of CHIRILA (Contemporary and Historical Resources for the Indigenous Languages of Australia) has been released. This represents approximately 180,000 words from 155 different Australian languages. It is a subset of the full database (of approx 780,000 items); eventually I hope to be able to release most of the data. Currently, the first phase covers material for which we have explicit permission or which is already in the public domain.

The material is hosted online; please see the web site for more information about the contents of the database, how to download data, what formats are available, and the like. We do not provide a web interface to the data; you download it and use Excel or a database program to read the files.

We hope the data will be useful to researchers, community members, and others with an interest in Australia’s Indigenous language heritage. The site also includes access to the preprint of a paper describing the database (both the online and full versions).

Explorations in Pama-Nyungan Phylogenetics

I recently gave one of the plenary talks at a workshop on phylogenetic algorithms at the Lorentz Center in Leiden (Netherlands). In the talk I gave an overview of a number of recent results from my research program, including the creation of a Pama-Nyungan phylogeny and some of the research results that come from that.

The slides are available from this link.

One of the results that is worth highlighting is the distribution of innovative languages within subgroups. A standard theory argues that languages innovate in the center of their ranges. The innovations diffuse across the language area over time, and therefore areas around the periphery tend to show more archaisms than those in the center. This distribution should also apply to language subgroups, assuming that language splitting occurs through the gradual accretion of isoglosses so that dialects split into separate languages.

If this is true, subgroup areas should show the same distributions, if not in absolute terms, then in large measure. That is, more innovative languages should lie towards the center of subgroups, and more conservative ones should lie around the edges.

It turns out that it is straightforward to plot the most innovative languages in each subgroup, according to how much basic vocabulary they have replaced. In the Chirila database, there are basic vocabulary lists coded by cognacy. To get a sense of how innovative a language is, we can simply sum, for each word in the language, the number of languages that share that cognate and divide it by the total number of language-cognate items. That gives us a sense of the extent to which languages participate in the most archaic vocabulary in the family. Plotting the most innovative language in each subgroup gives us the following map.

As you can see, the most innovative languages are not, in most cases, in the center of the subgroups, but rather on the peripheries.
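For anyone who wants to try the score described above on their own cognate-coded wordlists, here is a minimal sketch in Python. It is one reading of the calculation, not the code used to make the map. The record layout (language, meaning, cognate set), the function name, and the toy language names are all assumptions made for illustration; I also assume that a language counts among the languages sharing its own cognate.

```python
from collections import defaultdict


def retention_scores(records):
    """Score each language by how widely shared its cognates are.

    records: iterable of (language, meaning, cognate_set) triples.
    This layout is hypothetical, not the Chirila file format.
    """
    records = sorted(set(records))  # drop exact duplicate codings

    # Which languages attest each (meaning, cognate_set) pair.
    langs_per_cognate = defaultdict(set)
    for lang, meaning, cog in records:
        langs_per_cognate[(meaning, cog)].add(lang)

    # Denominator: total number of language-cognate items in the data.
    total_items = len(records)

    # Numerator: for each word in the language, the number of languages
    # sharing that cognate (the language itself is included).
    sums = defaultdict(int)
    for lang, meaning, cog in records:
        sums[lang] += len(langs_per_cognate[(meaning, cog)])

    return {lang: s / total_items for lang, s in sums.items()}


# Toy data with invented language names and cognate labels.
toy = [
    ("LangA", "hand", "c1"), ("LangB", "hand", "c1"), ("LangC", "hand", "c2"),
    ("LangA", "water", "c3"), ("LangB", "water", "c3"), ("LangC", "water", "c3"),
]
scores = retention_scores(toy)

# The lowest score marks the most innovative language in the set.
most_innovative = min(scores, key=scores.get)
```

To get one language per subgroup, the same calculation would be followed by taking the minimum within each subgroup's languages. A low score means the language shares few of its cognates with the rest of the sample, so it has replaced more of the widely attested vocabulary.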

What can explain the discrepancy? It’s probably the result of migratory expansions. That is, the languages that are the most innovative are the ones at the ‘ends’ of their subgroup phylogenetic expansions: they are the ones that have undergone the most branching. Another way of thinking about this is that more innovation happens on lineages with more branching events. This echoes a result from other work by Atkinson, Pagel, and colleagues, who also found that lineage splitting speeds up change.

One might think that this result reflects language contact; that is, that languages on the periphery might be in contact with more different languages, which leads to an increase in unidentifiable vocabulary. But these languages are not the only ones which are in contact with languages from other subgroups. In fact, if we map the most conservative languages in each subgroup, they are also often to be found around the periphery.

It may be the case that the center-periphery model still holds in areas where languages have stopped expanding, and that Pama-Nyungan subgroups were (on the whole) not formed by diversification in situ.

It’s also interesting to plot the most and least conservative subgroups:

This is a bit more dodgy. For example, I strongly suspect that Thura-Yura’s place in this list is inflated by Wirangu having (as loans) a number of items that are otherwise found only in Western Pama-Nyungan languages, and by Wirangu overall showing some Pama-Nyungan retentions that are otherwise replaced in the rest of Thura-Yura. The broad trend, however, is that the further east, the less conservative. The correlation between longitude and retention is -0.49. The correlation doesn’t hold for latitude (0.05) or number of languages in the subgroup (-0.02).
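The correlations quoted above can be reproduced in outline with a plain Pearson coefficient. I am assuming here that Pearson's r is the measure in question, which the post does not say. The sketch below uses invented numbers, not the subgroup values behind the figures above, and the function and variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Root sums of squared deviations for each variable.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Invented example: one longitude and one mean retention score per subgroup.
subgroup_longitude = [120.0, 130.0, 140.0, 150.0]
subgroup_retention = [0.8, 0.6, 0.7, 0.4]

r = pearson(subgroup_longitude, subgroup_retention)
```

A negative r in this setup means that retention falls as longitude increases, that is, subgroups further east are less conservative. Swapping in latitude or the number of languages per subgroup for the first list gives the other two comparisons.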

Filed under: Chirila, Historical, Pama-Nyungan
Source: Anggarrgoon

Latest Paper: Quantifying uncertainty in the phylogenetics of Australian numeral systems

Earlier this month, the Yale Pama-Nyungan Lab’s Dr. Claire Bowern and Kevin Zhou published a paper titled “Quantifying uncertainty in the phylogenetics of Australian numeral systems” in the journal Proceedings of the Royal Society B. You can read the paper here.

Using Bayesian phylogenetic methods, Dr. Bowern and Zhou study and analyze the numeral systems of Pama-Nyungan languages in order to reconstruct how those systems may have looked thousands of years ago. What they discover is that the finite numeral systems of Pama-Nyungan languages change over time, losing and gaining numbers as they go. According to the authors, this demonstrates a potential for adaptability and flexibility in languages commonly stereotyped as simple, limited, and incapable of expressing new concepts. They also find that there is tremendous variation over time between the behavior of numeral systems limited at the number five and those with higher limits.

Here is the paper’s abstract:

Researchers have long been interested in the evolution of culture and the ways in which change in cultural systems can be reconstructed and tracked. Within the realm of language, these questions are increasingly investigated with Bayesian phylogenetic methods. However, such work in cultural phylogenetics could be improved by more explicit quantification of reconstruction and transition probabilities. We apply such methods to numerals in the languages of Australia. As a large phylogeny with almost universal ‘low-limit’ systems, Australian languages are ideal for investigating numeral change over time. We reconstruct the most likely extent of the system at the root and use that information to explore the ways numerals evolve. We show that these systems do not increment serially, but most commonly vary their upper limits between 3 and 5. While there is evidence for rapid system elaboration beyond the lower limits, languages lose numerals as well as gain them. We investigate the ways larger numerals build on smaller bases, and show that there is a general tendency to both gain and replace 4 by combining 2 + 2 (rather than inventing a new unanalysable word ‘four’). We develop a series of methods for quantifying and visualizing the results.

Language by source materials

For the curious, here is a map of the languages in the full database, color-coded by number of items. As you can see, there’s considerable variation, but there are also a good number of languages with substantial holdings.


Counts of sources in Australian lexical database, as at August 19, 2015

Filed under: language documentation, Lexicology/lexicography, Pama-Nyungan

Phase one database sources

I have a list of sources that will be released in Phase I of the Australian Lexical Database.

This represents about 170,000 lexical items and about 80 sources other than the Curr (1886) wordlists, which comprise the bulk of the collection.

We have been conservative in what is released because not everyone we have contacted about data has replied, and because we are still in the process of finding contact information for all the relevant stakeholders.

Of the people we contacted, only 5 sources were ‘closed’ (that is, unable to be distributed). The vast majority of researchers and communities gave permission for their languages to be represented in the database, which was really gratifying.

Phase I sources mapped against languages in comparative lexical database



Phase I sources

Filed under: Bardi