Word Frequency List of Chilean Spanish
Scott Sadowsky & Ricardo Martínez Gamboa
For downloads of versions 1.0 and 1.1, see bottom of page.
The Word Frequency List of Chilean Spanish (Lifcach) is a set of 102 frequency lists derived from the sub-corpora of the Corpus Dinámico del Castellano de Chile (Dynamic Corpus of Chilean Spanish, or Codicach), a corpus of contemporary written Chilean Spanish developed by Sadowsky between 1997 and 2002. This corpus contained approximately 450 million words when the Lifcach was created; it currently contains some 800 million words. The Lifcach also contains a non-weighted list of total frequencies (the Total Occurrences column), which is simply the sum of the frequencies of the 102 individual lists; in other words, it is the frequency list of the entire Codicach corpus.
The Codicach is an opportunistic corpus with a bias toward press-based sources; it does not seek to be a BNC-style representative sampling of the language in general. The modular nature of the Codicach and of the 102 individual Lifcach lists, however, allows researchers to use one or more of these lists alone, to combine them as needed, or to create their own frequency lists for Spanish by weighting each of the Lifcach’s individual lists as they see fit.
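As a rough illustration of the weighting idea, the following Python sketch combines several per-sub-corpus lists into one weighted list. The function name, the sub-corpus names, and the counts are all invented for the example; nothing here reflects the actual layout of the Lifcach file.

```python
from collections import Counter

def combine_weighted(lists, weights):
    """Combine per-sub-corpus frequency lists into one weighted list.

    lists:   dict mapping sub-corpus name -> {lemma: frequency}
    weights: dict mapping sub-corpus name -> numeric weight
             (sub-corpora missing from `weights` are left out)
    """
    combined = Counter()
    for name, freqs in lists.items():
        weight = weights.get(name, 0)
        if weight == 0:
            continue  # a weight of zero excludes the sub-corpus entirely
        for lemma, count in freqs.items():
            combined[lemma] += weight * count
    return dict(combined)

# Toy data; the sub-corpus names and counts are invented for illustration.
toy_lists = {
    "press": {"casa": 10, "cantar": 4},
    "literature": {"casa": 2, "cantar": 6, "canto": 1},
}
weighted = combine_weighted(toy_lists, {"press": 0.5, "literature": 2.0})
```

This sketch weights raw counts. Because the sub-corpora differ in size, a researcher might prefer to convert each list to relative frequencies before applying weights.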
The Lifcach contains 476,776 lemmas derived from the approximately 4.5 million types found in the 450 million running words contained in the Codicach at the time the lists were created.
- This version corrects a handful of cases where a word with more than one part of speech was treated as more than one lexeme. Thanks to José Joaquín Atria for tracking this down.
- New statistic: frequency of lemmas per million words of source text.
- New information: total number of tokens per sub-corpus.
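The per-million statistic can presumably be reproduced from a lemma's count and the token total of its sub-corpus. The sketch below assumes the usual definition (count divided by tokens, multiplied by one million); the function name and the figures are invented, and the readme should be consulted for the exact definition used in the Lifcach.

```python
def per_million(lemma_count, subcorpus_tokens):
    """Occurrences of a lemma per million words of source text."""
    if subcorpus_tokens <= 0:
        raise ValueError("subcorpus_tokens must be positive")
    return lemma_count * 1_000_000 / subcorpus_tokens

# Invented figures: 900 occurrences in a sub-corpus of 4,500,000 tokens.
rate = per_million(900, 4_500_000)
```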
What has it been used for?
Here's a partial bibliography of research that has used the Lifcach:
Calude, A. & Pagel, M. (2011). How do we use language? Shared patterns in the frequency of word use across 17 world languages. Philosophical Transactions of the Royal Society B 366 (1567): 1101-1107. Download.
Ibáñez, A., Gleichgerecht, E., Hurtado, E., González, R., Haye, A. & Manes, F. (2010). Early Neural Markers of Implicit Attitudes: N170 Modulated by Intergroup and Evaluative Contexts in IAT. Frontiers in Human Neuroscience 4 (188). Download.
Cornejo, C., Simonetti, F., Ibáñez, A., Aldunate, N., Ceric, F., López, V. & Núñez, R. (2009). Gesture and metaphor comprehension: Electrophysiological evidence of cross-modal coordination by audiovisual stimulation. Brain and Cognition 70: 42-52. Download.
Hurtado, E., Haye, A., González, R., Manes, F. & Ibáñez, A. (2009). Contextual blending of ingroup/outgroup face stimuli and word valence: LPP modulation and convergence of measures. BMC Neuroscience 10 (69). Article. Supplementary Data.
Cornejo, C., Ibáñez, A. & López, V. (2008). Significado, contexto y experiencia: Evidencias conductuales y electrofisiológicas del holismo del significado. In: C. Cornejo & E. Kronmüller (eds.), La pregunta por la mente: Aproximaciones desde Latinoamérica. Chile: J.C. Saez Editor. Download chapter.
Rojo, G. (2008). Lingüística de corpus y lingüística del español. Plenary conference, ALFAL 15 (Montevideo, 18-21 August 2008). Download.
Ibáñez, A., López, V. & Cornejo, C. (2006). ERPs and contextual semantic discrimination: Degrees of congruence in wakefulness and sleep. Brain and Language 98 (3): 264-275. Download.
If you use the Lifcach in your research, please drop me a line.
Creation of the Lifcach
The steps in creating the Lifcach were as follows:
- Type frequency lists based on the running words of each of the 102 sub-corpora of the Codicach were generated.
- Each type frequency list was lemmatized and POS-tagged using the Universitat Politècnica de Catalunya’s MS-Tools v2.0 (for more information on MS-Tools, contact Lluís Padró).
- A slightly more compact version of the Lifcach was created by removing the approximately 300,000 lemmas with a frequency of 1. Eliminating these was considered an acceptable trade-off in exchange for a far more manageable file size.
- The resulting lemma frequency lists were assembled in the attached CSV file and total occurrences were calculated.
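The steps above can be sketched in Python as follows. This is a toy illustration, not the authors' actual procedure. The `TOY_LEMMAS` table is a hypothetical stand-in for the lemmatizer and does not reflect the MS-Tools interface. The sketch also assumes that "a frequency of 1" refers to a lemma's total frequency across all sub-corpora, which the text does not state explicitly.

```python
from collections import Counter

# Hypothetical stand-in for the lemmatizer: a context-free lookup from
# type to (lemma, POS). This is NOT the MS-Tools interface.
TOY_LEMMAS = {
    "casas": ("casa", "N"),
    "casa": ("casa", "N"),
    "cantamos": ("cantar", "V"),
    "canto": ("canto", "N"),  # could also be a form of "cantar"
}

def type_frequencies(running_words):
    """Count each distinct word form (type) in a sub-corpus."""
    return Counter(word.lower() for word in running_words)

def lemma_frequencies(type_freqs, lookup):
    """Map each type to a (lemma, POS) pair and add up the type counts."""
    lemmas = Counter()
    for word_type, count in type_freqs.items():
        # Unknown types are kept as their own lemma with an "UNK" tag.
        lemma, pos = lookup.get(word_type, (word_type, "UNK"))
        lemmas[(lemma, pos)] += count
    return lemmas

def drop_hapaxes(total_freqs):
    """Remove lemmas whose total frequency is 1 (assumed interpretation)."""
    return {key: n for key, n in total_freqs.items() if n > 1}

# Two invented sub-corpora of running words.
sub_corpora = {
    "a": "casas casa canto cantamos".split(),
    "b": "casa canto canto gato".split(),
}
per_corpus = {
    name: lemma_frequencies(type_frequencies(words), TOY_LEMMAS)
    for name, words in sub_corpora.items()
}

# Total occurrences are the sum over all sub-corpora.
totals = Counter()
for freqs in per_corpus.values():
    totals.update(freqs)
compact = drop_hapaxes(totals)
```

Because the lookup sees each type only once and without context, every instance of a form such as canto receives the same lemma, whichever word it actually represents in the running text.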
An important caveat regarding this methodology must be mentioned. The use of type frequency lists instead of running words in the POS tagging and lemmatizing process was a practical necessity, due to the speed of the software used and the computing resources available at the time the Lifcach was created. As a result, the software had to analyze words such as canto without the information required to decide if a given instance of this word was a form of the verb cantar or the noun canto. This elimination of context reduced the accuracy of the lemmatization process, though far less so than would happen with English, thanks to Spanish's complex morphology.
It should also be noted that the lemmatizing and tagging software that was used is based on European Spanish, a national dialect that is somewhat removed from Chilean Spanish.
Do not open the Lifcach in versions of Microsoft Excel before 2007, as they can only load the first 65,536 rows of a file.
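Where no suitable spreadsheet program is available, a CSV file of this size can be read row by row with a short script. The sketch below uses Python's standard csv module on an in-memory sample; the column names `lemma` and `total` are invented for the example and are not the actual Lifcach headers.

```python
import csv
import io

def rows_above(fileobj, column, minimum):
    """Yield rows (as dicts) whose numeric `column` is at least `minimum`."""
    for row in csv.DictReader(fileobj):
        if float(row[column]) >= minimum:
            yield row

# In-memory sample standing in for the real file; headers are hypothetical.
sample = io.StringIO(
    "lemma,total\n"
    "casa,3\n"
    "gato,1\n"
)
frequent = [row["lemma"] for row in rows_above(sample, "total", 2)]
```

To run this against the real file, the `io.StringIO` object would be replaced with a file opened using `open(path, newline="", encoding=...)`, with the encoding and column names taken from the readme.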
The Lifcach is Copyright © 2006-2012 by Scott Sadowsky & Ricardo Martínez Gamboa. It may be freely used for non-profit academic purposes if properly cited. All commercial use or application of the Lifcach is expressly prohibited without express written consent from the authors.
More information can be found in the Lifcach readme file.
Download the Lifcach