
Sociolinguistic Corpus of Spoken Chilean Spanish
(COSCACH)

The Sociolinguistic Corpus of Spoken Chilean Spanish, known by its Spanish acronym COSCACH, is a massive electronic database of Chilean speech, created with cutting-edge technology and solid sociolinguistic methods. It currently contains audio recordings of 1,251 speakers performing a series of elicitation tasks, ranging from reciting minimal pairs and reading sentences and meaningful texts to taking part in an extended, unstructured and maximally informal conversation (typically 35-50 minutes long).

The ultimate goal of the COSCACH is to make large-scale empirical research possible on a wide range of linguistic issues, while allowing the phenomena studied to be analyzed in terms of a series of social variables: geographical location, socioeconomic status, sex, age, ethnicity (Mapuche / Hispano-Chilean), native languages spoken (Spanish only / Spanish and Mapudungun) and rural/urban provenance. To this end, I've also developed a set of tools for working with massive speech corpora such as the COSCACH, including MaSCoT and Perkins.

 

The COSCACH corpus at a glance

As of August 2019, the COSCACH consists of:
  • Audio and video recordings of 1,230 Chilean speakers (roughly half male and half female), plus a control group of 21 non-Chilean Spanish speakers.
  • Orthographic and phonemic transcriptions of these recordings. Transcriptions are done with Praat, and are segmented and aligned at the utterance level.
  • 9,125,765 words of running text. This includes only informants' speech.
  • 83,002 minutes of recordings (1,383.4 hours, or 172.9 eight-hour work days).
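
The duration figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the COSCACH tooling. The per-speaker mean is an illustrative assumption: it spreads the total evenly over all 1,251 speakers (1,230 Chilean plus 21 control), which the figures above do not state.

```python
# Figures quoted in the list above (August 2019 snapshot).
TOTAL_MINUTES = 83_002
TOTAL_SPEAKERS = 1_230 + 21  # Chilean speakers plus the non-Chilean control group


def duration_summary(minutes: int, speakers: int, workday_hours: float = 8.0) -> dict:
    """Convert a total duration in minutes into hours, work days and a per-speaker mean."""
    hours = minutes / 60
    return {
        "hours": round(hours, 1),
        "workdays": round(hours / workday_hours, 1),
        # Assumption: recording time is averaged over every speaker in the corpus.
        "mean_minutes_per_speaker": round(minutes / speakers, 1),
    }


summary = duration_summary(TOTAL_MINUTES, TOTAL_SPEAKERS)
print(summary)
```

The hours and work-day values reproduce the numbers given in the list. The per-speaker mean is only a rough orientation, since each speaker completed several tasks of different lengths.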

The COSCACH Chilean Spanish speaker sample has the following structure:

  • 2 types of "lingualism": monolingual in Spanish, bilingual in Mapudungun and Spanish.
  • 2 ethnicities: Hispano-Chilean, Mapuche.
  • 6 socioeconomic levels: A, B, Ca, Cb, D, E (from highest to lowest).
  • 2 sexes.
  • Between 1 and 5 age groups, depending on the location: 16-24, 25-34, 35-49, 50-64, 65+ years.
  • Hispano-Chilean speakers from Arica, Antofagasta, La Serena, Santiago, Curicó, Concepción, Tirúa, Temuco, Melipeuco, Valdivia, Chiloé and Wallmapu.
  • Spanish-monolingual Mapuche speakers from Santiago, Tirúa, Temuco, Melipeuco and Chiloé, with more regions to come.
  • Spanish-Mapudungun bilingual speakers from throughout Wallmapu (the traditional territory of the Mapuche people, in southern Chile).

The non-Chilean native Spanish speaker sample currently includes people from Argentina, Bolivia, Colombia, Cuba, Mexico, Paraguay, Peru and Venezuela.

The following speech elicitation instruments were used with each speaker:

  • Conversational interview based on speakers' own interests, with no pre-defined questions or structure. Seeks to elicit maximally spontaneous speech.
  • Interview about language attitudes based on a pre-defined questionnaire.
  • Reading of meaningful texts.
  • Reading of minimal pairs and wordlists to obtain maximally controlled speech.
  • Sustained pronunciation of vowels for voice quality research.

 

Recording

The COSCACH corpus was designed from the ground up to allow the most demanding phonetic research, and thus great care was taken to ensure that audio recordings would be future-proof and of the highest quality.

Audio was recorded with Fostex FR-2LE digital recorders and Audix HT5 head-mounted microphones. The Fostex is known for its high-quality preamps and low self-noise, while the Audix mics have proven over and over that they make stunningly good recordings. In addition to having an almost completely flat frequency response, the Audix sits just a fraction of an inch from the speaker's mouth, which gives it a phenomenal signal-to-noise ratio in even the noisiest of environments. If you're doing field recordings with any other type of mic, you're doing it wrong!

 

Retrieval and analysis

To make large-scale, efficient retrieval and analysis possible, all audio recordings are segmented and transcribed orthographically in Praat. The orthographic transcriptions are then syllabified and phonemically transcribed using Perkins, a program written specifically for this purpose. Retrieval of transcriptions and/or their corresponding audio recordings (at the utterance level) is performed with MaSCoT.
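
To give a rough idea of what utterance-level retrieval involves, here is a minimal sketch that reads intervals from a Praat TextGrid in its long text format and returns the time spans of utterances containing a given word. It is not MaSCoT or Perkins, and it does not reflect how they work internally. The names `Utterance`, `read_utterances` and `find_utterances` are hypothetical, and the sample TextGrid is made up for illustration.

```python
import re
from typing import NamedTuple


class Utterance(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches one interval in Praat's long TextGrid text format.
# Note: this picks up intervals from every interval tier in the file.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([\d.]+)\s*'
    r'xmax = ([\d.]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)


def read_utterances(textgrid: str) -> list[Utterance]:
    """Return the non-empty intervals found in a long-format TextGrid string."""
    utterances = []
    for start, end, text in INTERVAL_RE.findall(textgrid):
        text = text.replace('""', '"').strip()  # Praat doubles embedded quotes
        if text:  # skip silences and untranscribed stretches
            utterances.append(Utterance(float(start), float(end), text))
    return utterances


def find_utterances(utterances: list[Utterance], word: str) -> list[Utterance]:
    """Return utterances containing `word` as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [u for u in utterances if pattern.search(u.text)]


# Made-up sample data, not taken from the corpus.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 3.0
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "utterance"
        xmin = 0
        xmax = 3.0
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.5
            text = ""
        intervals [2]:
            xmin = 0.5
            xmax = 2.1
            text = "hola, cómo estás"
        intervals [3]:
            xmin = 2.1
            xmax = 3.0
            text = "bien po"
'''

hits = find_utterances(read_utterances(SAMPLE), "po")
for hit in hits:
    print(f"{hit.start:.2f}-{hit.end:.2f}s: {hit.text}")
```

Each hit carries its start and end times, so the matching stretch of audio can be cut from the recording. That pairing of text and time is what utterance-level alignment provides.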

In addition, for research not related to phonetics or phonology, interview transcriptions are...

 

Access

The Sociolinguistic Corpus of Spoken Chilean Spanish (COSCACH) is currently in beta testing, and access is by invitation only. This restriction exists for several reasons: the site needs to be properly tested before it is opened to the scientific community in general; the corpus itself is still under active development; and all interview texts must be thoroughly vetted and, when necessary, redacted, in order to respect speakers' privacy and anonymity. We expect this process to be finished by the end of 2020.

 

The COSCACH Team

Principal Investigator

Dr. Scott Sadowsky

Catholic University of Chile (Santiago, Chile) & Max Planck Institute for the Science of Human History (Jena, Germany)

 

Fieldworkers

FIELDWORKER           RECORDINGS
María José Aninao     343
Beatriz Yáñez         266
Scott Sadowsky        175
Sebastián Zepeda      108
Ruth Contreras        99
Bárbara Galdames      88
Camila Aedo           59
Tiare Araya           30
Edson Salgado         27
Lorena Perdomo        24
Viviana Vergara       18
Andrea Osorio         5
Francisca Morales     4
Matt Muñoz            3
Camila Valdebenito    2
Javiera Solís         2
Ignacia Fuentes       2
Daniela Contreras     2
Catalina Pérez        1
Laura Avendaño        1
Adolfo Bravo          1
Constanza Fajardo     1
Daniela Millalén      1
Camila Moreno         1
Belén Solís           1
TOTAL*                1264

* The number of recordings is greater than the number of speakers in the corpus because some recordings were excluded for technical reasons.

 

Transcribers

TRANSCRIBER           TRANSCRIPTIONS
Belén Solís           324
Ignacia Fuentes       246
Francisco Beltrán     150
Francisco Martínez    124
Majo Zanetta          103
Sebastián Zepeda      89
Mareba Torres         88
Scott Sadowsky        38
Andrea Noria          24
Ruth Contreras        11
Bárbara Galdames      11
Paola Vega            9
Roby Delgado          8
Javier Riquelme       5
Darío Fuentes         5
Paz Otth              4
Francisca Carrasco    4
Carla Cerda           4
Daniela Contreras     3
Daniela Millalén      2
Isabel Cayunao        2
María José Aninao     2
Constanza Fajardo     2
Maggie Mora           2
Guillermo Loyola      1
Maddy Rees            1
Camila Moreno         1
Diego Fuentes         1
TOTAL*                1264

* The number of transcriptions is greater than the number of speakers in the corpus because some recordings were excluded for technical reasons.

 

Postprocessing and database entry

  • Danitza Matus