Sociolinguistic Corpus of Spoken Chilean Spanish
(Coscach)

The Sociolinguistic Corpus of Spoken Chilean Spanish, known by its Spanish acronym Coscach, is a massive electronic database of the speech of Chileans, created with cutting-edge technology and solid sociolinguistic methods. Currently, it contains audio and video recordings of more than 600 speakers performing a series of elicitation tasks, ranging from reciting minimal pairs to a long, unstructured and maximally informal conversation.

The ultimate goal of the Coscach is to make it possible for myself and others to do large-scale empirical research on a wide range of linguistic issues, while at the same time allowing the phenomena studied to be analyzed in terms of a series of social variables -- geographical location, socioeconomic level, sex, age, ethnicity (Mapuche / Hispano-Chilean), languages spoken natively (Spanish only / Spanish and Mapudungun) and rural/urban provenance. To this end, I've also developed a set of tools for working with massive speech corpora, including MaSCoT and Perkins.

Please note that the Coscach is currently under development, and is not yet ready for use by other researchers. It's slated to be completed by early 2019.

 

Quick Stats

As of 18 July 2017, the Coscach consists of:

  • 612 speakers, roughly half male and half female.
  • A stratified speaker sample consisting of 6 socioeconomic levels.
  • Hispano-Chilean speakers from Arica, Antofagasta, La Serena, Santiago, Curicó, Concepción, Temuco, Melipeuco, Tirúa and Chiloé, with more regions to come!
  • Spanish-monolingual Mapuche speakers from Santiago, Temuco, Melipeuco and Tirúa. Also with more regions to come!
  • Spanish-Mapudungun bilingual speakers from throughout Wallmapu (the traditional territory of the Mapuche people, in the south of the country).
  • Between five and seven different elicitation activities per speaker.
  • Approximately 1.5 million words of orthographically transcribed reading activities.
  • Some 2.5 million words of transcribed conversations (the word count includes only the informants' speech).
  • About 500 GB of 24-bit / 48 kHz audio recordings in Broadcast WAV format.
  • About 11 TB of video in MTS format.
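
To give a rough sense of what the audio figure above means in recording time, here is a back-of-the-envelope sketch in Python. It assumes uncompressed PCM, decimal gigabytes, and either one or two channels; the channel count is not stated on this page, so both are shown. The function names are made up for illustration.

```python
# Rough estimate of how much uncompressed PCM audio fits in a given amount
# of storage. The channel count is an ASSUMPTION (not stated in the text),
# and file headers / metadata chunks are ignored.


def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


def hours_of_audio(total_bytes: float, sample_rate_hz: int = 48_000,
                   bit_depth: int = 24, channels: int = 1) -> float:
    """Approximate recording time, in hours, that fits in total_bytes."""
    rate = pcm_bytes_per_second(sample_rate_hz, bit_depth, channels)
    return total_bytes / rate / 3600


if __name__ == "__main__":
    size = 500e9  # "about 500 GB", taken here as decimal gigabytes
    for ch in (1, 2):
        print(f"{ch} channel(s): about {hours_of_audio(size, channels=ch):.0f} hours")
```

Under these assumptions, 500 GB works out to a little under 1,000 hours of mono audio, or about half that in stereo. Treat this as an order-of-magnitude figure, not a statistic about the corpus.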

 

Technology and Recording

The Coscach corpus was designed from the ground up to allow the most demanding phonetic research, and so great care was taken to ensure that audio recordings would be future-proof and of the highest quality.

We record audio with Fostex FR-2LE digital recorders and Audix HT5 head-mounted microphones. The Fostex, which is known for its high-quality preamps, has made for a more portable and resilient setup, while the Audix mic has proven over and over that it makes stunningly good recordings. In addition to its almost completely flat frequency response, the fact that it sits just a fraction of an inch from the speaker's mouth allows it to provide a phenomenal signal-to-noise ratio in even the noisiest of environments. If you're doing field recordings with any other type of mic, you're doing it wrong!

To make large-scale, efficient retrieval and analysis possible, all audio recordings are segmented and transcribed orthographically in Praat; the orthographic transcriptions are then syllabified and phonemically transcribed using Perkins, a program I wrote for this purpose, and retrieval is done with MaSCoT.
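
For readers unfamiliar with Praat's output, the sketch below shows one way the orthographic transcriptions could be pulled out of a TextGrid and counted. It is only an illustration, not how Perkins or MaSCoT work internally. It assumes the TextGrid is saved in Praat's long text format and has already been read into a string; the sample data and the names `read_intervals` and `count_words` are invented for the example.

```python
import re

# Matches one interval in Praat's long ("text") TextGrid format.
# Praat escapes a double quote inside a label by doubling it ("").
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<xmin>[-+\d.eE]+)\s*'
    r'xmax = (?P<xmax>[-+\d.eE]+)\s*'
    r'text = "(?P<text>(?:[^"]|"")*)"'
)


def read_intervals(textgrid: str):
    """Return (start, end, label) for every non-empty interval, across all tiers."""
    intervals = []
    for match in INTERVAL_RE.finditer(textgrid):
        label = match.group("text").replace('""', '"').strip()
        if label:  # skip silent / unlabelled stretches
            intervals.append(
                (float(match.group("xmin")), float(match.group("xmax")), label)
            )
    return intervals


def count_words(intervals) -> int:
    """Count whitespace-separated words over all interval labels."""
    return sum(len(label.split()) for _, _, label in intervals)


# Invented sample: one interval tier with two labelled intervals and a pause.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 3.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "informant"
        xmin = 0
        xmax = 3.5
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 1.2
            text = "hola cómo estás"
        intervals [2]:
            xmin = 1.2
            xmax = 2.0
            text = ""
        intervals [3]:
            xmin = 2.0
            xmax = 3.5
            text = "muy bien gracias"
'''

if __name__ == "__main__":
    ivs = read_intervals(SAMPLE)
    print(len(ivs), "labelled intervals,", count_words(ivs), "words")
```

A regular expression is enough for a quick tally like this, but it lumps all tiers together and ignores point tiers and the short TextGrid format. A real pipeline would more likely use a proper TextGrid parser.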

Currently, we're working with the developers of PHON to adapt their software and our workflow to sociophonetic research, and we're very excited about the possibilities we're discovering!

 

Socioeconomic Stratification

The Coscach is a sociolinguistic corpus. Speakers are socioeconomically stratified into one of six groups using the EMIS system, which is a version of the ESOMAR methodology adapted for use in sociolinguistic research.