Nature paper on the genetics of Aboriginal Australia

Somewhat belatedly, here is a link to new work by me and my colleagues on gene-language coevolution in Pama-Nyungan, the peopling of Sahul, and migration and admixture in the Pleistocene. It was recently published in Nature. There’s a lot in this paper, a Genomic History indeed. There has been some media attention, particularly Michael Erard’s piece on Pama-Nyungan phylogenetics and how important computational work has been to recent advances in Australian language history. There’s also a summary piece in The Conversation, particularly about the genetic side of the paper.

Explorations in Pama-Nyungan Phylogenetics

I recently gave one of the plenary talks at a workshop on phylogenetic algorithms at the Lorentz Center in Leiden (Netherlands). In the talk I gave an overview of a number of recent results from my research program, including the creation of a Pama-Nyungan phylogeny and some of the research results that come from that.

The slides are available from this link.

One of the results that is worth highlighting is the distribution of innovative languages within subgroups. A standard theory argues that languages innovate in the center of their ranges. The innovations diffuse across the language area over time, and therefore areas around the periphery tend to show more archaisms than those in the center. This distribution should also apply to language subgroups, assuming that language splits occur through the gradual accretion of isoglosses so that dialects split into separate languages.

If this is true, subgroup areas should show the same distributions, if not in absolute terms, then in large measure. That is, more innovative languages should lie towards the center of subgroups, and more conservative ones should lie around the edges.

It turns out that it is straightforward to plot the most innovative languages in each subgroup, according to how much basic vocabulary they have replaced. In the Chirila database, there are basic vocabulary lists coded by cognacy. To get a sense of how innovative a language is, we can simply sum, for each word in the language, the number of languages that share that cognate and divide it by the total number of language-cognate items. That gives us a sense of the extent to which languages participate in the most archaic vocabulary in the family. Plotting the most innovative language in each subgroup gives us the following map.

As you can see, the most innovative languages are not, in most cases, in the center of the subgroups, but rather on the peripheries.
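
For anyone who wants to try the measure described above on their own cognate-coded wordlists, here is a minimal sketch in Python of one way to read it. It is not the script behind the map: the (language, meaning, cognate set) triple format, the function names, and the toy data are all invented for illustration. A lower score means a language shares fewer of its cognate sets with other languages, that is, it is more innovative.

```python
from collections import defaultdict

def retention_scores(items):
    """Score each language by how widely its cognate sets are shared.

    items: iterable of (language, meaning, cognate_set) triples
           (a hypothetical format, not the Chirila schema).
    """
    items = set(items)  # ignore accidental duplicate rows
    langs_per_set = defaultdict(set)
    for language, meaning, cognate_set in items:
        langs_per_set[(meaning, cognate_set)].add(language)

    total = len(items)  # total number of language-cognate items
    scores = defaultdict(float)
    for language, meaning, cognate_set in items:
        # add the number of languages attesting this cognate set
        scores[language] += len(langs_per_set[(meaning, cognate_set)])
    return {language: s / total for language, s in scores.items()}

def most_innovative(scores, subgroup_of):
    """Return the lowest-scoring language in each subgroup."""
    best = {}
    for language, score in scores.items():
        subgroup = subgroup_of[language]
        if subgroup not in best or score < scores[best[subgroup]]:
            best[subgroup] = language
    return best

# Toy data: three made-up languages in one made-up subgroup "X".
toy_items = [
    ("A", "hand", 1), ("B", "hand", 1), ("C", "hand", 2),
    ("A", "water", 1), ("B", "water", 1), ("C", "water", 1),
]
toy_scores = retention_scores(toy_items)
print(most_innovative(toy_scores, {"A": "X", "B": "X", "C": "X"}))  # {'X': 'C'}
```

Language C has replaced its word for 'hand', so it shares less vocabulary with the others and comes out as the most innovative in the toy subgroup.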

What can explain the discrepancy? It’s probably the result of migratory expansions. That is, the languages that are the most innovative are the ones at the ‘ends’ of their subgroup phylogenetic expansions. In other words, the most innovative languages are the ones that have undergone the most branching; another way of thinking about this is that more innovation happens on lineages with more branching events. This echoes a result from other work by Atkinson, Pagel, and colleagues, who also found that lineage splitting speeds up change.

One might think that this result reflects language contact; that is, that languages on the periphery might be in contact with more different languages, which leads to an increase in unidentifiable vocabulary. But these languages are not the only ones which are in contact with languages from other subgroups. In fact, if we map the most conservative languages in each subgroup, they are also often to be found around the periphery.

It may be the case that the center-periphery model still holds in areas where languages have stopped expanding, and that Pama-Nyungan subgroups were (on the whole) not formed by diversification in situ.

It’s also interesting to plot the most and least conservative subgroups:

This is a bit more dodgy. For example, I strongly suspect that Thura-Yura’s place in this list is inflated by Wirangu having (as loans) a number of items that are otherwise found only in Western Pama-Nyungan languages, and by Wirangu overall showing some Pama-Nyungan retentions that are otherwise replaced in the rest of Thura-Yura. The broad trend, however, is that the further east, the less conservative. The correlation between longitude and retention is -0.49. The correlation doesn’t hold for latitude (0.05) or number of languages in the subgroup (-0.02).
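
For readers who want to run the same kind of check on their own subgroup data, here is a small sketch of a correlation calculation in Python. It assumes Pearson's r, which is only one of several possible coefficients, and the function name and the numbers in the usage lines are made up to show the calling convention; they are not the subgroup values behind the figures above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical subgroup centroids (degrees east) and retention scores.
toy_longitude = [120.0, 130.0, 140.0, 150.0]
toy_retention = [0.8, 0.7, 0.75, 0.5]
r = pearson(toy_longitude, toy_retention)
print(round(r, 2))
```

A negative value of r here would mean that retention tends to fall as longitude increases, which is the direction of the eastward trend described above.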

Latest Paper: Quantifying uncertainty in the phylogenetics of Australian numeral systems

Earlier this month, the Yale Pama-Nyungan Lab’s Dr. Claire Bowern and Kevin Zhou published a paper titled “Quantifying uncertainty in the phylogenetics of Australian numeral systems” in the journal Proceedings of the Royal Society B. You can read the paper here.

Using Bayesian phylogenetic methods, Dr. Bowern and Zhou study and analyze the numeral systems of Pama-Nyungan languages in order to reconstruct how those systems may have looked thousands of years ago. What they discover is that the finite numeral systems of Pama-Nyungan languages change over time, losing and gaining numbers as they go. According to the authors, this demonstrates a potential for adaptability and flexibility in languages commonly stereotyped as simple, limited, and incapable of expressing new concepts. They also find that there is tremendous variation over time between the behavior of numeral systems limited at the number five and those with higher limits.

Here is the paper’s abstract:

Researchers have long been interested in the evolution of culture and the ways in which change in cultural systems can be reconstructed and tracked. Within the realm of language, these questions are increasingly investigated with Bayesian phylogenetic methods. However, such work in cultural phylogenetics could be improved by more explicit quantification of reconstruction and transition probabilities. We apply such methods to numerals in the languages of Australia. As a large phylogeny with almost universal ‘low-limit’ systems, Australian languages are ideal for investigating numeral change over time. We reconstruct the most likely extent of the system at the root and use that information to explore the ways numerals evolve. We show that these systems do not increment serially, but most commonly vary their upper limits between 3 and 5. While there is evidence for rapid system elaboration beyond the lower limits, languages lose numerals as well as gain them. We investigate the ways larger numerals build on smaller bases, and show that there is a general tendency to both gain and replace 4 by combining 2 + 2 (rather than inventing a new unanalysable word ‘four’). We develop a series of methods for quantifying and visualizing the results.

Language by source materials

For the curious, here is a map of the languages in the full database, color-coded by number of items. As you can see, there’s considerable variation, but there are also a good number of languages with substantial holdings.


Counts of sources in Australian lexical database, as at August 19, 2015

Phase one database sources

I have a list of sources that will be released in Phase I of the Australian Lexical Database.

This represents about 170,000 lexical items and about 80 sources other than the Curr (1886) wordlists, which comprise the bulk of the collection.

We have been conservative in what is released because not everyone we have contacted about data has replied, and because we are still in the process of finding contact information for all the relevant stakeholders.

Of the people we contacted, only 5 sources were ‘closed’ (that is, unable to be distributed). The vast majority of researchers and communities gave permission for their languages to be represented in the database, which was really gratifying.

Phase I sources mapped against languages in comparative lexical database

Phase I sources

Documenting Endangered Languages outreach videos

A new set of videos has been released which provides information on how to apply for a grant to do language documentation. The series is focused on the requirements for the National Science Foundation’s DEL program, but there is much information that would be useful to anyone applying for funding for their language projects. The videos are aimed at community members as much as (if not more than) academic linguists.

I have two of the video segments: components of an application, and 6 things that tank a grant proposal. The first segment is DEL-specific; we walk through the sections of an application. The second one, however, is very general, and applies to just about all grant applications.

In brief, the six things are

  1. A project outside the agency’s mandate (e.g. DEL funds linguistic work on endangered languages)
  2. Project doesn’t meet the agency requirements (e.g. they ask for X, Y, and Z in the application; if that’s not provided, it’ll be rejected)
  3. Unrealistic aims, budget, time frame.
  4. Too vague
  5. Too specific, too narrow for the scope of the budget or time, i.e. not good value for money
  6. Inconsistency in the proposal.

You can watch the video here for further information.

New Publication of Learner’s Guides

I have released two learner’s guides on Leanpub. One is for Yan-nhaŋu, the other for Bardi. They were written several years ago (first version for Yan-nhaŋu was 2006, and 2010 for Bardi) but I have been unable to find a more ‘traditional’ publisher for them. They have both been circulated in the relevant communities in both electronic and paper form. Perhaps ironically, this circulation was one of the reasons that I haven’t been able to find a publisher; the publishers I contacted assumed that I had already saturated the market for the books and that there would be no demand.

The uploaded versions of these Guides are based on the most recent updates; 2010 for Yan-nhaŋu, when I used the guide in a class on Aboriginal languages at Yale, and 2011 for Bardi, when I was last in the field. My negotiations with community members about these guides included permission to publish. Here are the direct links:

Please note the pricing structure: you don’t have to pay for them to download them, but you can. You name your own price. I have suggested $14.99 for each book. The proceeds from these books will go to support the Endangered Language Fund. The ELF supported two trips to work on Bardi (in 2003 and 2011). The royalties are 90% minus 50c, so of a $14.99 book price, $12.99 goes to the ELF.
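
If you would like to see what a different price would send to the ELF, the arithmetic above can be sketched in a few lines of Python. The function name is made up, and the rounding to the nearest cent and the floor at zero for very low prices are my assumptions rather than the host site's documented behavior.

```python
from decimal import Decimal, ROUND_HALF_UP

def elf_proceeds(price, rate=Decimal("0.90"), fee=Decimal("0.50")):
    """Royalty on one sale: 90% of the price minus 50c.

    Assumptions (not from the host site): round to the nearest cent,
    and never return a negative amount.
    """
    raw = Decimal(str(price)) * rate - fee
    raw = max(raw, Decimal("0"))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(elf_proceeds("14.99"))  # 12.99
```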

The Bardi learner’s guide was originally a class project, at Rice in 2006. It was subsequently heavily edited (several times) and expanded, most recently by my former student Laura Kling, who did her senior thesis on Bardi. The Yan-nhaŋu guide was originally written after 5 weeks of fieldwork at Milingimbi, but was expanded after subsequent trips. I owe a big debt to Prof. Jane Simpson for these guides. Both guides used the Warumungu Learner’s Guide as a template (the Yan-nhaŋu guide more closely than the Bardi one), and that made it much easier to write a fairly detailed guide in the short space of time available.

The books use leanpub as the host site. I have been quite impressed with how easy it was to use them. They mostly have technical computing books but it would be nice to see more language-related materials up there. Their pricing structure seems a bit more friendly than Amazon’s (though they don’t have print on demand). is another self-publishing site that has been recommended to me.

Pama-Nyungan language locations

As noted in a previous post, I’ve started to put some of the results of my Pama-Nyungan prehistory grant on my lab web site. One of the recent updates is a language map. The data are not new; this map was released in about 2011 (though with updates since). It is released through a WordPress plugin on the site, which allows easy embedding of maps into sites. I highly recommend it for its ease of use, except for the fact that it doesn’t seem to render in Chrome on a Mac (at least, not on my Mac).

Comments on language locations, names, etc, on the map are very welcome. Please use the comment form on the map’s page.

Languages coded for phylogenetics

I am starting a series of posts on map data from the Pama-Nyungan project. To begin, here is a map showing the languages for which I have coded wordlists suitable for phylogenetic analysis. Note that for some reason, viewing the page in Chrome on OS X results in a jQuery error. It can be viewed in Firefox, or in Chrome on a PC.

The points are coded by how much data are available. The least well attested languages have white points; the middling ones are marked by plain red, while the languages with the most complete datasets have red markers with a square inside.

Some of the points appear to be at sea. This is an irritating result of how Google Earth fails to account for zoom correctly; the points are close to the coast but not actually under water.


Semantic maps in Pama-Nyungan

One of the advantages of a large lexical database is the ability to test large-scale ideas about language behaviors. As a quick experiment this afternoon, I extracted all the colexification patterns from the database. These are all the words that are glossed by multiple distinct words within the same language.* It took 20 minutes to download the file, and about the same to manipulate it with the igraph package in R to produce some cluster visualisations.

Fruchterman-Reingold layout, colexification patterns appearing more than 50 times in Australian languages. (c) Claire Bowern, 2015

*Of course, there are going to be issues with this, particularly in the lack of colexification evidence for some languages. The data are only as complete or as good as the dictionaries that went into the database in the first place.
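
For anyone curious about the counting step, here is a rough sketch in Python rather than R. It is not the code I ran: the (language, form, gloss) entry format, the function names, and the toy entries are invented for illustration, and it assumes that a pattern is counted once for each word form that carries both glosses. The resulting pairs could then be passed as a weighted edge list to a graph library for layout.

```python
from collections import Counter, defaultdict
from itertools import combinations

def colexification_counts(entries):
    """Count how often each pair of glosses shares a word form.

    entries: iterable of (language, form, gloss) triples
             (a hypothetical export format).
    """
    glosses_by_form = defaultdict(set)
    for language, form, gloss in entries:
        glosses_by_form[(language, form)].add(gloss)

    counts = Counter()
    for glosses in glosses_by_form.values():
        # every unordered pair of glosses on one form is a colexification
        for pair in combinations(sorted(glosses), 2):
            counts[pair] += 1
    return counts

def frequent_patterns(counts, threshold):
    """Keep only the patterns that appear more than `threshold` times."""
    return {pair: n for pair, n in counts.items() if n > threshold}

# Toy entries with placeholder language and form labels.
toy_entries = [
    ("lang1", "f1", "fire"), ("lang1", "f1", "firewood"), ("lang1", "f1", "tree"),
    ("lang2", "g1", "fire"), ("lang2", "g1", "firewood"),
    ("lang2", "g2", "water"),
]
toy_counts = colexification_counts(toy_entries)
print(frequent_patterns(toy_counts, 1))  # {('fire', 'firewood'): 2}
```

With the real data the threshold would be 50, as in the figure caption; the toy example uses 1 so that something survives the filter.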

