Sociolinguistic Speech Corpus of Chilean Spanish
(COSCACH)

The Sociolinguistic Speech Corpus of Chilean Spanish, known by its Spanish acronym COSCACH, is a massive electronic database of Chilean Spanish speech created with cutting-edge technology and solid sociolinguistic methods. It contains a total of 9,288,301 tokens, 68,705 types and 1,061,711 utterances derived from 83,002 minutes of audio recordings.

The goal of the COSCACH is to enable large-scale empirical research on a wide range of linguistic issues, while also allowing the phenomena studied to be analyzed in terms of a series of social variables.

The COSCACH consists of audio recordings of 1,237 L1 speakers of Chilean Spanish, plus a control sample of 21 non-Chilean L1 Spanish speakers, all of whom perform a series of elicitation tasks, ranging from reciting minimal pairs and reading sentences and meaningful texts to participating in an extended, unstructured and maximally informal conversation (typically 35-50 minutes long).

Chilean speakers are stratified by six social variables: locality, socioeconomic status (using the EMIS system), sex, age/generation, ethnicity and lingualism (monolingual in Spanish or bilingual in Spanish and Mapudungun). Speakers are further categorized by five variables derived from their locality: urbanness, locality size, region, distance from Santiago and travel time from Santiago.

The transcriptions of the recordings were lemmatized and morphologically tagged using the Chilean Spanish version of FreeLing, and can be queried using the IMS Open Corpus Workbench and CQPweb at corpora.pro.

The COSCACH launch paper, and the one that should be cited when using the COSCACH, is the following:

Sadowsky, Scott. 2022. The Sociolinguistic Speech Corpus of Chilean Spanish (COSCACH). A socially stratified text, audio and video corpus with multiple speech styles. International Journal of Corpus Linguistics. DOI: 10.1075/ijcl.19103.sad.

You can access the COSCACH at corpora.pro.

The COSCACH corpus at a glance

The COSCACH consists of:

Audio and video recordings of 1,237 Chilean speakers (half male and half female), plus a control group of 21 non-Chilean Spanish speakers.
Orthographic and phonemic transcriptions of these recordings. Transcriptions are done with Praat, and are segmented and aligned at the utterance level.
9,288,301 tokens of running text. This includes only informants' speech.
83,002 minutes (1,383.4 hours or 172.9 eight-hour work days) of recordings.
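
As a quick sanity check, the duration figures can be re-derived from the raw counts reported on this page. The sketch below uses only those published numbers; the variable names are illustrative and not part of any COSCACH tool.

```python
# Figures as reported for the COSCACH on this page.
TOKENS = 9_288_301       # running text, informants' speech only
UTTERANCES = 1_061_711
MINUTES = 83_002         # total recording time

hours = MINUTES / 60                     # recording time in hours
work_days = hours / 8                    # eight-hour work days
tokens_per_minute = TOKENS / MINUTES     # rough rate, see note below
tokens_per_utterance = TOKENS / UTTERANCES

print(f"{hours:.1f} hours, {work_days:.1f} work days")
print(f"{tokens_per_minute:.1f} tokens per recorded minute")
print(f"{tokens_per_utterance:.2f} tokens per utterance")
```

Because the token count covers only informants' speech while the minutes cover the recordings as a whole, the tokens-per-minute value is best read as a rough density figure rather than a measure of speech rate.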

The COSCACH Chilean Spanish speaker sample has the following structure:

2 types of "lingualism": Monolingual in Spanish, bilingual in Mapudungun and Spanish.
2 ethnicities: Hispano-Chilean, Mapuche.
6 socioeconomic levels: A, B, Ca, Cb, D, E (from highest to lowest) from the EMIS stratification system.
2 sexes.
Between 1 and 5 age groups, depending on the location: 16-24, 25-34, 35-49, 50-64, and 65+ years of age.
Hispano-Chilean speakers from Arica, Antofagasta, La Serena, Santiago, Curicó, Concepción, Tirúa, Temuco, Melipeuco, Valdivia, Chiloé and Wallmapu.
Spanish-monolingual Mapuche speakers from Santiago, Tirúa, Temuco, Melipeuco and Chiloé.
Spanish-Mapudungun bilingual speakers from throughout Wallmapu (the traditional territory of the Mapuche people, located in the south of the country).
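
For readers who want to work with this stratification programmatically, the sample structure can be modeled as a small validated record. This is only a sketch under stated assumptions: the class, field names, and value labels (for example "F"/"M" and "monolingual"/"bilingual") are hypothetical and do not reflect the metadata format actually used by the COSCACH.

```python
from dataclasses import dataclass

# Category values taken from the list above; the exact labels are assumptions.
LINGUALISM = {"monolingual", "bilingual"}
ETHNICITY = {"Hispano-Chilean", "Mapuche"}
SES_LEVELS = ("A", "B", "Ca", "Cb", "D", "E")   # highest to lowest (EMIS)
SEXES = {"F", "M"}
AGE_GROUPS = ((16, 24), (25, 34), (35, 49), (50, 64), (65, None))


def age_group_label(age: int) -> str:
    """Map an age in years to one of the five age groups."""
    for low, high in AGE_GROUPS:
        if age >= low and (high is None or age <= high):
            return f"{low}+" if high is None else f"{low}-{high}"
    raise ValueError(f"age {age} is below the youngest group (16-24)")


@dataclass(frozen=True)
class Speaker:
    """Hypothetical speaker record with one field per social variable."""
    speaker_id: str
    locality: str
    lingualism: str
    ethnicity: str
    ses: str
    sex: str
    age: int

    def __post_init__(self) -> None:
        # Reject values outside the categories listed on this page.
        if self.lingualism not in LINGUALISM:
            raise ValueError(f"unknown lingualism: {self.lingualism!r}")
        if self.ethnicity not in ETHNICITY:
            raise ValueError(f"unknown ethnicity: {self.ethnicity!r}")
        if self.ses not in SES_LEVELS:
            raise ValueError(f"unknown socioeconomic level: {self.ses!r}")
        if self.sex not in SEXES:
            raise ValueError(f"unknown sex: {self.sex!r}")
        age_group_label(self.age)  # raises if the age is out of range

    @property
    def age_group(self) -> str:
        return age_group_label(self.age)


# Invented example speaker, for illustration only.
example = Speaker("S0001", "Temuco", "monolingual", "Mapuche", "Cb", "F", 41)
print(example.age_group)
```

Keeping the socioeconomic levels in an ordered tuple preserves the highest-to-lowest ranking, which a plain set would lose.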

The non-Chilean native Spanish speaker sample includes people from Argentina, Bolivia, Colombia, Cuba, Mexico, Paraguay, Peru and Venezuela.

The following speech elicitation instruments were used with each speaker:

Conversational interview based on speakers' own interests, with no pre-defined questions or structure. Seeks to elicit maximally spontaneous speech.
Interview about language attitudes based on a pre-defined questionnaire.
Reading of meaningful texts.
Reading of minimal pairs and wordlists to obtain maximally controlled speech.
Sustained pronunciation of vowels for voice quality research.

Recording

The COSCACH corpus was designed from the ground up to allow the most demanding phonetic research, and thus great care was taken to ensure that audio recordings would be future-proof and of the highest quality.

Audio was recorded with Fostex FR-2LE digital recorders and Audix HT5 head-mounted microphones. The Fostex is known for its high-quality preamps and low self-noise, while the Audix mics have proven over and over that they make stunningly good recordings. In addition to having an almost completely flat frequency response, the HT5 sits just a fraction of an inch from the speaker's mouth, which allows it to provide a phenomenal signal-to-noise ratio in even the noisiest of environments. If you're doing field recordings with any other type of mic, you're doing it wrong!

Retrieval and analysis

To make large-scale, efficient retrieval and analysis possible, all audio recordings are segmented and transcribed orthographically in Praat. The orthographic transcriptions are then syllabified and phonemically transcribed using Perkins, a program written specifically for this purpose. Retrieval of transcriptions and/or their corresponding audio recordings (at the utterance level) is performed with MaSCoT.
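
To give a sense of what utterance-level alignment makes possible, here is a minimal sketch that reads interval boundaries and labels from text in Praat's long TextGrid format and looks up utterances by a search string. It is not MaSCoT or Perkins, whose internals are not described here. The tier name, the sample utterances, and the function names are all invented for illustration, and the pattern does not handle labels containing escaped double quotes.

```python
import re
from typing import NamedTuple


class Utterance(NamedTuple):
    start: float   # seconds from the start of the recording
    end: float
    text: str


# One interval entry in Praat's long TextGrid text format.
_INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<start>[\d.]+)\s*'
    r'xmax = (?P<end>[\d.]+)\s*'
    r'text = "(?P<text>[^"]*)"'
)


def utterances(textgrid: str) -> list[Utterance]:
    """Return all non-empty intervals as (start, end, text) records."""
    return [
        Utterance(float(m["start"]), float(m["end"]), m["text"])
        for m in _INTERVAL.finditer(textgrid)
        if m["text"].strip()          # skip silent or unlabeled intervals
    ]


def find(textgrid: str, needle: str) -> list[Utterance]:
    """Return utterances whose transcription contains `needle`."""
    return [u for u in utterances(textgrid) if needle in u.text]


# Invented sample data; "ortho" is a hypothetical tier name.
SAMPLE = '''
    item [1]:
        class = "IntervalTier"
        name = "ortho"
        xmin = 0
        xmax = 4.2
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 1.1
            text = ""
        intervals [2]:
            xmin = 1.1
            xmax = 2.9
            text = "ya po, vamos al tiro"
        intervals [3]:
            xmin = 2.9
            xmax = 4.2
            text = "¿cachái?"
'''

for hit in find(SAMPLE, "cachái"):
    print(f"{hit.start:.1f}-{hit.end:.1f} s: {hit.text}")
```

The start and end times returned for each hit are what would let a tool cut the matching stretch out of the corresponding audio file.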

In addition, for research not related to phonetics or phonology, interview transcriptions are...

Extracted.
Tagged with the Chilean Spanish version of FreeLing, which lemmatizes, chunks, assigns parts of speech, and so on.
Compiled into a searchable corpus using the IMS Open Corpus Workbench.
Imported into CQPweb, which provides a powerful and user-friendly interface for working with the COSCACH transcriptions, their lemmatizations and morphological tags, and all the rich metadata they contain.
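
Once the transcriptions are compiled in this way, they can be searched with queries in the CQP query language, where each token is a bracketed set of attribute constraints. The helper below is a minimal sketch that assembles such query strings. The attribute names `lemma` and `pos`, and the tag patterns `V.*` and `N.*`, are assumptions about how the corpus is indexed and should be checked against the corpus documentation at corpora.pro.

```python
def cqp_token(**constraints: str) -> str:
    """Build one CQP token expression from attribute="regex" constraints.

    With no constraints, returns [] (which matches any single token).
    Values are inserted as-is, so they must not contain double quotes.
    """
    if not constraints:
        return "[]"
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def cqp_sequence(*tokens: str) -> str:
    """Join token expressions into a query over adjacent tokens."""
    return " ".join(tokens)


# A form of the lemma "cachar", then any token, then a noun
# (attribute names and tag patterns are assumed, see above).
query = cqp_sequence(
    cqp_token(lemma="cachar"),
    cqp_token(),
    cqp_token(pos="N.*"),
)
print(query)  # [lemma="cachar"] [] [pos="N.*"]
```

Searching on the lemma rather than the word form retrieves all inflected forms in one query, which is the main practical payoff of the lemmatization step.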

Access

Researchers can freely access the transcribed and tagged version of the COSCACH at corpora.pro. The only requirement is to create an account on the site. Recordings are not currently available to the public, as they must all be further reviewed and, when necessary, redacted in order to protect speakers' privacy and anonymity.

The COSCACH Team

Principal Investigator

Dr. Scott Sadowsky

Catholic University of Chile (Santiago, Chile) & Max Planck Institute for the Science of Human History (Jena, Germany)

Fieldworkers

FIELDWORKER	RECORDINGS
María José Aninao	343
Beatriz Yáñez	266
Scott Sadowsky	175
Sebastián Zepeda	108
Ruth Contreras	99
Bárbara Galdames	88
Camila Aedo	59
Tiare Araya	30
Edson Salgado	27
Lorena Perdomo	24
Viviana Vergara	18
Andrea Osorio	5
Francisca Morales	4
Matt Muñoz	3
Camila Valdebenito	2
Javiera Solís	2
Ignacia Fuentes	2
Daniela Contreras	2
Catalina Pérez	1
Laura Avendaño	1
Adolfo Bravo	1
Constanza Fajardo	1
Daniela Millalén	1
Camila Moreno	1
Belén Solís	1
TOTAL*	1264

* The number of recordings is greater than the number of speakers in the corpus because some recordings were excluded for technical reasons.

Transcribers

TRANSCRIBER	TRANSCRIPTIONS
Belén Solís	324
Ignacia Fuentes	246
Francisco Beltrán	150
Francisco Martínez	124
Majo Zanetta	103
Sebastián Zepeda	89
Mareba Torres	88
Scott Sadowsky	38
Andrea Noria	24
Ruth Contreras	11
Bárbara Galdames	11
Paola Vega	9
Roby Delgado	8
Javier Riquelme	5
Darío Fuentes	5
Paz Otth	4
Francisca Carrasco	4
Carla Cerda	4
Daniela Contreras	3
Daniela Millalén	2
Isabel Cayunao	2
María José Aninao	2
Constanza Fajardo	2
Maggie Mora	2
Guillermo Loyola	1
Maddy Rees	1
Camila Moreno	1
Diego Fuentes	1
TOTAL*	1264

* The number of transcriptions is greater than the number of speakers in the corpus because some recordings were excluded for technical reasons.

Postprocessing and database entry

Danitza Matus
