Sociolinguistic Corpus of Spoken Chilean Spanish
The Sociolinguistic Corpus of Spoken Chilean Spanish, known by its Spanish acronym Coscach, is a massive electronic database of the speech of Chilean young adults created with cutting-edge technology and solid sociolinguistic methods. At present, it contains audio and video recordings of about 220 speakers.
Although the Coscach grew out of my Ph.D. dissertation work on the socioeconomic stratification of vowel allophones in Chilean Spanish, its ultimate goal is to make it possible for me and others to do large-scale empirical research on a wide range of subjects. To this end, I've also developed a set of tools for working with massive speech corpora, including MaSCoT and Perkins.
Please note that the Coscach is currently under development, and is not yet ready for use by other researchers.
A demo of the Coscach is now available! See below for details...
The Coscach currently consists of:
- About 300 speakers, roughly half male and half female.
- Speakers from Santiago, Concepción, Temuco and rural areas of the Araucanía region, with more coming soon.
- Between three and six different elicitation activities per speaker.
- Approximately 600,000 words of orthographically transcribed reading activities.
- Some 800,000 words of transcribed interviews (the word count includes only informants' speech).
- 120 GB of 24-bit / 44.1 kHz audio.
- 195 GB of MP4 video.
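As a rough sanity check on the audio figure above, the following sketch estimates how many hours of recording 120 GB represents. It assumes uncompressed PCM, decimal gigabytes, and a channel count that is not stated here, so both mono and stereo are shown; the function name `recording_hours` is just an illustrative helper.

```python
def recording_hours(total_bytes, sample_rate_hz=44_100, bit_depth=24, channels=1):
    """Estimate hours of uncompressed PCM audio that fit in total_bytes."""
    # Bytes per second = samples per second * bytes per sample * channels.
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return total_bytes / bytes_per_second / 3600


GB = 10**9  # assumption: decimal gigabytes

mono_hours = recording_hours(120 * GB, channels=1)
stereo_hours = recording_hours(120 * GB, channels=2)

print(f"mono: {mono_hours:.0f} h, stereo: {stereo_hours:.0f} h")
# prints "mono: 252 h, stereo: 126 h"
```

If part of the audio is stored in a compressed lossless format such as FLAC, the same disk space would hold more recording time than this estimate suggests.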
Technology and Recording
The Coscach corpus was designed from the ground up to allow the most demanding phonetic research, and so great care was taken to ensure that audio recordings would be future-proof and of the highest quality.
We record audio with Fostex FR-2LE digital recorders and Audix HT5 head-mounted microphones. The Fostex, which is known for its high-quality preamps, has made for a more portable and resilient setup, while the Audix mic has proven over and over that it makes stunningly good recordings. In addition to its almost completely flat frequency response, the fact that it sits just a fraction of an inch from the speaker's mouth allows it to provide a phenomenal signal-to-noise ratio in even the noisiest of environments. If you're doing field recordings with any other type of mic, you're doing it wrong!
To make large-scale, efficient retrieval and analysis possible, all audio recordings are segmented and transcribed orthographically in Praat. The orthographic transcriptions are then syllabified and phonemically transcribed using Perkins, a program I wrote for this purpose. Retrieval is done with MaSCoT.
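To give a feel for what the syllabification step involves, here is a minimal sketch of rule-based syllabification of Spanish orthography. It is not Perkins and does not reflect how Perkins works internally; the names `tokenize` and `syllabify` and the rule set are assumptions made for illustration. It treats ch, ll and rr as single units, keeps a weak vowel (i, u) together with a neighbouring vowel as a diphthong, and sends as many consonants as form a legal onset to the following syllable.

```python
# Accented í/ú break diphthongs, so they are grouped with the strong vowels.
STRONG = set("aeoáéóíú")
WEAK = set("iuü")
VOWELS = STRONG | WEAK
DIGRAPHS = ("ch", "ll", "rr")
# Two-consonant groups that can begin a Spanish syllable.
ONSET_CLUSTERS = {"pr", "br", "tr", "dr", "cr", "gr", "fr",
                  "pl", "bl", "cl", "gl", "fl"}


def tokenize(word):
    """Split a word into letters, keeping ch/ll/rr as single units."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def syllabify(word):
    """Return a list of orthographic syllables for a single Spanish word."""
    units = tokenize(word.lower())
    vowel_idx = [i for i, u in enumerate(units) if u in VOWELS]
    breaks = []
    for p, q in zip(vowel_idx, vowel_idx[1:]):
        between = units[p + 1:q]  # consonants separating the two vowels
        if not between:
            # Adjacent vowels: hiatus only when both are strong.
            if units[p] in STRONG and units[q] in STRONG:
                breaks.append(q)
        elif len(between) >= 2 and between[-2] + between[-1] in ONSET_CLUSTERS:
            breaks.append(q - 2)  # legal cluster starts the next syllable
        else:
            breaks.append(q - 1)  # only the last consonant moves forward
    edges = [0] + breaks + [len(units)]
    return ["".join(units[a:b]) for a, b in zip(edges, edges[1:])]


for w in ("transporte", "chileno", "araucanía", "instrucción"):
    print(w, "->", "-".join(syllabify(w)))
```

This sketch ignores several things a production tool has to handle, such as morphological boundaries (sub-rayar), vocalic y, and loanwords, and it does not attempt the phonemic transcription step at all.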
Currently, we're working with the developers of PHON to adapt their software and our workflow to sociophonetic research, and we're very excited about the possibilities we're discovering!
The Coscach is a sociolinguistic corpus. Each speaker is assigned to one of six socioeconomic groups using the EMIS system, which is a version of the ESOMAR methodology adapted for use in sociolinguistic research.
A public demo of the Coscach is now available. It contains brief extracts of six different elicitation activities as performed by a single speaker. The audio recording is a 24-bit / 44.1 kHz FLAC file. FLAC is a lossless compression format, so it provides exactly the same quality as a WAV file.