Sociolinguistic Corpus of Spoken Chilean Spanish

The Sociolinguistic Corpus of Spoken Chilean Spanish, known by its Spanish acronym COSCACH, is a massive electronic database of the speech of Chileans created with cutting-edge technology and solid sociolinguistic methods. Currently, it contains audio and video recordings of 887 speakers performing a series of elicitation tasks, ranging from reciting minimal pairs to a long, unstructured and maximally informal conversation (typically 35 to 50 minutes long).

The ultimate goal of the COSCACH is to make large-scale empirical research possible on a wide range of linguistic issues, while allowing the phenomena under study to be analyzed in terms of a series of social variables: geographical location, socioeconomic level, sex, age, ethnicity (Mapuche / Hispano-Chilean), native language(s) (Spanish only / Spanish and Mapudungun) and rural/urban provenance. To this end, I've also developed a set of tools for working with massive speech corpora, including MaSCoT and Perkins.


The COSCACH corpus at a glance

As of 18 December 2017, the COSCACH consists of:

  • Audio and video recordings of 887 speakers, roughly half male and half female.
  • Orthographic and phonemic transcriptions of 575 of these recordings. Transcriptions are done with Praat, and are segmented and aligned at the utterance level.
  • Approximately 3.7 million words of conversations (word count only includes informants' speech).
  • Some 1.7 million words of reading activities (minimal pairs, individual sentences, narrative texts, etc.).
  • About 700 GB of 24 bit / 48 kHz audio recordings in broadcast WAV format.
  • About 11 TB of videos in MTS format.
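
To give a feel for what 700 GB of 24 bit / 48 kHz audio means in recording time, here is a rough back-of-the-envelope calculation. It is only a sketch: the source does not say whether the recordings are mono or stereo, so the channel count below is an assumption, as are decimal gigabytes and negligible WAV header overhead.

```python
# Rough estimate of recording time from storage size.
# Assumptions (not stated in the corpus description): mono audio,
# decimal gigabytes (1 GB = 1e9 bytes), negligible WAV header overhead.

BITS_PER_SAMPLE = 24
SAMPLE_RATE_HZ = 48_000
CHANNELS = 1            # assumption; use 2 for stereo
TOTAL_BYTES = 700e9     # "about 700 GB"
SPEAKERS = 887


def bytes_per_second(bits: int, rate_hz: int, channels: int) -> int:
    """Uncompressed PCM data rate in bytes per second."""
    return bits // 8 * rate_hz * channels


def hours_of_audio(total_bytes: float, rate_bytes_per_s: int) -> float:
    """Convert a storage size into hours of uncompressed audio."""
    return total_bytes / rate_bytes_per_s / 3600


rate = bytes_per_second(BITS_PER_SAMPLE, SAMPLE_RATE_HZ, CHANNELS)
total_hours = hours_of_audio(TOTAL_BYTES, rate)
print(f"{rate} bytes/s, roughly {total_hours:.0f} hours in total")
print(f"roughly {total_hours / SPEAKERS:.1f} hours per speaker")
```

If the files are in fact stereo, the estimated durations are halved.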

The COSCACH speaker sample has the following structure:

  • 6 socioeconomic levels.
  • 2 sexes.
  • 1 to 3 age groups, depending on the location.
  • Hispano-Chilean speakers from Arica, Antofagasta, La Serena, Santiago, Curicó, Concepción, Temuco, Melipeuco, Tirúa and Chiloé, with more regions to come!
  • Spanish-monolingual Mapuche speakers from Santiago, Temuco, Melipeuco and Tirúa. Also with more regions to come!
  • Spanish-Mapudungun bilingual speakers from throughout Wallmapu (the traditional area inhabited by the Mapuche people in southern Chile).
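
The stratification described in the list above can be pictured as a grid of cells per location. The following sketch simply enumerates those cells; the numeric level labels and the age-group names are placeholders invented for illustration, not the categories actually used in the COSCACH.

```python
from itertools import product

# 6 socioeconomic levels and 2 sexes, as stated in the list above.
# The labels themselves are hypothetical placeholders.
SOCIOECONOMIC_LEVELS = range(1, 7)
SEXES = ("female", "male")


def sample_cells(age_groups):
    """Enumerate every (level, sex, age group) cell for one location."""
    return list(product(SOCIOECONOMIC_LEVELS, SEXES, age_groups))


# A location with a single age group versus one with three.
one_group = sample_cells(("adult",))
three_groups = sample_cells(("younger", "middle", "older"))
print(len(one_group), "cells with 1 age group")
print(len(three_groups), "cells with 3 age groups")
```

Ethnicity and native language add further dimensions in the locations where Mapuche speakers are sampled; they are left out here to keep the sketch short.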

When the COSCACH is completed, it will contain approximately 6.5 million words of interviews and 2.9 million words of reading tasks.


The COSCACH corpus was designed from the ground up to allow the most demanding phonetic research, and so great care was taken to ensure that audio recordings would be future-proof and of the highest quality.

We record audio with Fostex FR-2LE digital recorders and Audix HT5 head-mounted microphones. The Fostex, which is known for its high-quality preamps, has made for a more portable and resilient setup, while the Audix mic has proven over and over that it makes stunningly good recordings. In addition to its almost completely flat frequency response, the fact that it sits just a fraction of an inch from the speaker's mouth allows it to provide a phenomenal signal-to-noise ratio in even the noisiest of environments. If you're doing field recordings with any other type of mic, you're doing it wrong!

Retrieval and analysis

To make large-scale, efficient retrieval and analysis possible, all audio recordings are segmented and transcribed orthographically in Praat. The orthographic transcriptions are then syllabified and phonemically transcribed using Perkins, a program I wrote for this purpose. Retrieval of transcriptions and/or their corresponding audio recordings (at the utterance level) is performed with MaSCoT.
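
As an illustration of what utterance-level alignment makes possible, here is a minimal sketch that pulls time-stamped utterances out of a Praat TextGrid saved in the long text format. It is not how MaSCoT or Perkins work (their internals are not described here); the regular expression, the `Utterance` type and the sample data are all assumptions made for the example, and intervals from every tier are returned together.

```python
import re
from typing import List, NamedTuple


class Utterance(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches one interval in Praat's long TextGrid text format.
# Praat escapes a double quote inside a label by doubling it.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([0-9.eE+-]+)\s*'
    r'xmax = ([0-9.eE+-]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)


def read_utterances(textgrid: str) -> List[Utterance]:
    """Return the non-empty intervals found in a TextGrid string."""
    found = []
    for xmin, xmax, label in INTERVAL_RE.findall(textgrid):
        text = label.replace('""', '"').strip()
        if text:  # skip silent or unlabeled stretches
            found.append(Utterance(float(xmin), float(xmax), text))
    return found


# Made-up sample data, not taken from the corpus.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 2.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "utterances"
        xmin = 0
        xmax = 2.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 1.2
            text = "hola"
        intervals [2]:
            xmin = 1.2
            xmax = 2.5
            text = ""
'''

for utt in read_utterances(SAMPLE):
    print(f"{utt.start:.2f}-{utt.end:.2f}s: {utt.text}")
```

With start and end times in hand, the matching stretch of audio can be cut from the WAV file by converting seconds to sample offsets. Real TextGrid files may be stored in UTF-16, so they need to be decoded accordingly before being passed to a function like this one.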

In addition, for research not related to phonetics or phonology, interview transcriptions are...

  • Extracted.
  • Tagged with the Chilean Spanish version of FreeLing, which lemmatizes, chunks, assigns parts of speech, and so on.
  • Compiled into a searchable corpus using the IMS Open Corpus Workbench.
  • Imported into CQP Web, which provides a powerful and user-friendly interface for working with the corpus.
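
The IMS Open Corpus Workbench reads corpora in a "vertical" text format: one token per line with tab-separated annotations, and XML-style tags on their own lines for structural units. The sketch below shows one way tagged output could be written in that shape. The function name, the metadata attributes, the speaker ID and the part-of-speech labels are hypothetical and chosen for illustration; they are not the COSCACH's actual scheme or real FreeLing output, and special characters inside tokens are not handled.

```python
from typing import Dict, Iterable, List, Tuple
from xml.sax.saxutils import quoteattr

Token = Tuple[str, str, str]  # (word form, part-of-speech tag, lemma)


def to_vertical(
    text_id: str,
    metadata: Dict[str, str],
    sentences: Iterable[List[Token]],
) -> str:
    """Render one interview as vertical text: one token per line."""
    attrs = "".join(
        f" {key}={quoteattr(value)}" for key, value in sorted(metadata.items())
    )
    lines = [f"<text id={quoteattr(text_id)}{attrs}>"]
    for sentence in sentences:
        lines.append("<s>")
        for word, pos, lemma in sentence:
            lines.append("\t".join((word, pos, lemma)))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


# Hypothetical speaker metadata and placeholder tags.
example = to_vertical(
    "spk001",
    {"sex": "female", "location": "Temuco"},
    [[("yo", "PRON", "yo"), ("hablo", "VERB", "hablar")]],
)
print(example, end="")
```

Carrying the social variables as attributes on the enclosing tag is what would let corpus queries be restricted to, say, one location or one sex.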

Currently, we're also working with the developers of PHON to adapt their software and our workflow to sociophonetic research, and we're very excited about the possibilities we're discovering!


Please note that the COSCACH is currently under development, and is not yet ready for use by other researchers. It's slated to be completed by early 2019.