Dynamic Corpus of Chilean Spanish
The Dynamic Corpus of Chilean Spanish (Codicach) is an electronic corpus of written Chilean Spanish. It contains about 800 million running words in 1.3 million files and 102 sub-corpora. It has been chunked, lemmatized, and tagged with POS and syntactic relationship information using Connexor's Machinese Syntax program. Search and retrieval are powered by the IMS Open Corpus Workbench.
How to cite: Sadowsky, Scott. 2006. Corpus Dinámico del Castellano de Chile (Codicach). Electronic database. http://sadowsky.cl/codicach.html
The Codicach is an opportunistic corpus that prioritizes size over filling pre-determined quotas of specific text types. This makes it maximally useful for lexical studies and studies of low-frequency phenomena, among other things, but it also means that researchers must carefully select the sub-corpora they use in order to achieve the degree of representativeness they require.
Types of linguistic production included
Nearly all the texts in the Codicach come from written sources. The transcriptions of speeches given in the Chilean Congress are the main exception.
The Codicach seeks to be 100% Chilean. Therefore, an extremely conscientious effort has been made to eliminate non-Chilean texts from it. To cite two examples:
- MASS MEDIA TEXTS: Virtually all newspaper and magazine articles originally sourced from news agencies or foreign media were eliminated. Something on the order of 250,000 such texts were rejected.
- ACADEMIC TEXTS: Papers published in Chilean journals were eliminated if one or more of their authors was affiliated with a non-Chilean university.
In spite of these efforts, it was neither possible nor practical to determine the nationality of authors affiliated with Chilean universities (though the vast majority are, in fact, Chilean), of people who posted in Chilean Usenet groups or on-line forums, or of those who wrote letters to the editors of newspapers. Thus, a very small percentage of the texts that make up the Codicach are from non-Chilean sources.
Well over 98% of texts in the Codicach were published between 1997 and 2003. The main exceptions are the literary and reference works, laws, jurisprudence, and advertising texts the corpus contains, which are from before 1997 in most cases.
The Codicach contains about 800 million tokens of running text (the number varies from about 700 million to 930 million depending on the method of chunking) in some 1.3 million files. With most text types, one file equals one text (newspaper article, academic paper, e-mail message, letter to the editor, court decision, etc.). With certain text types, however (mainly on-line forums and Usenet groups), a single file may contain many texts or messages.
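To illustrate why the token count depends on the chunking method, here is a minimal sketch that counts the same sentence in two ways. Both functions and their names are hypothetical and are not the procedures actually used on the Codicach; they only show that treating punctuation as separate tokens inflates the count relative to splitting on whitespace.

```python
import re


def count_whitespace_tokens(text: str) -> int:
    """Count tokens by splitting on whitespace only (punctuation stays attached)."""
    return len(text.split())


def count_word_and_punct_tokens(text: str) -> int:
    """Count runs of word characters and individual punctuation marks separately."""
    return len(re.findall(r"\w+|[^\w\s]", text))


if __name__ == "__main__":
    sample = "Dijo: «No sé», y se fue."
    print(count_whitespace_tokens(sample))
    print(count_word_and_punct_tokens(sample))
```

On any text containing punctuation, the second count is higher than the first, which is the kind of gap that separates the lower and upper figures quoted above.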
The Codicach is divided into 102 sub-corpora. Searches can be performed on any sub-corpus or set thereof.
The Codicach's sub-corpora are organized by functional text types. These text types, along with the number of words and files that each contains, are detailed on the following two pages:
Number of words in the Codicach by text type
Number of files in the Codicach by text type
The texts that make up the Codicach were subjected to a series of procedures designed to make the corpus as "pure" as possible. The main processing steps were as follows:
Conversion to a standardized file format and encoding
All files --HTML, DOC, RTF, PDF, etc.-- were converted to ISO-8859-1 plain text with Windows/DOS line breaks. The exact procedure used varied by file type.
Characters such as apostrophes, quotation marks, fraction symbols and the copyright symbol were converted to a standard representation (e.g. non-smart quotes and apostrophes, fractions made up of numbers separated by a slash, etc.).
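The conversion step can be sketched roughly as follows. This is only an illustration under stated assumptions: the replacement table is hypothetical (the actual mappings used for the Codicach are not documented here, and "(c)" for the copyright symbol is a guess), and the function assumes the text has already been extracted from its original HTML, DOC, RTF or PDF container.

```python
# Hypothetical replacement table; the real one is not documented.
REPLACEMENTS = {
    "\u2018": "'",    # left single smart quote
    "\u2019": "'",    # right single smart quote / apostrophe
    "\u201c": '"',    # left double smart quote
    "\u201d": '"',    # right double smart quote
    "\u00bc": "1/4",  # fraction symbols become digits separated by a slash
    "\u00bd": "1/2",
    "\u00be": "3/4",
    "\u00a9": "(c)",  # assumed representation of the copyright symbol
}


def normalize_text(text: str) -> bytes:
    """Return ISO-8859-1 bytes with standardized characters and CRLF line breaks."""
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    # Collapse every line-break style to "\n" first, then emit Windows/DOS breaks.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    # Characters that do not exist in ISO-8859-1 are replaced with "?".
    return text.encode("iso-8859-1", errors="replace")
```

For example, normalize_text("\u201cHola\u201d\n") yields the bytes for "Hola" in straight quotes followed by a CRLF line break.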
Elimination of non-Chilean texts
As indicated above, as many non-Chilean texts as possible were eliminated from the Codicach.
Elimination of metadata
File metadata that made it through the conversion process were later eliminated to the extent possible.
Elimination of "template text"
The average on-line text is surrounded by a tremendous amount of "template text" -- page headers, weather forecasts, lists of related articles and sites, publisher information, headlines, advertising, etc. A particularly strenuous effort was made to develop scripting solutions to remove all such text.
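The scripts actually used for this step are not described, so the following is only a sketch of one common approach, swapped in for illustration: within a batch of files downloaded from the same site, any line that recurs in a large share of the files is treated as template text and dropped. The function name and the min_share parameter are hypothetical.

```python
from collections import Counter


def strip_template_lines(docs, min_share=0.5):
    """Drop lines that occur in at least `min_share` of the documents in `docs`."""
    line_doc_counts = Counter()
    for doc in docs:
        # Count each distinct non-empty line once per document.
        line_doc_counts.update(
            {line.strip() for line in doc.splitlines() if line.strip()}
        )
    threshold = min_share * len(docs)
    template = {line for line, n in line_doc_counts.items() if n >= threshold}
    return [
        "\n".join(
            line for line in doc.splitlines() if line.strip() not in template
        )
        for doc in docs
    ]
```

A heuristic like this would only catch template text that is repeated verbatim; headers with changing dates or weather forecasts would need site-specific patterns.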
Elimination of duplicate and near-duplicate files
Duplicate texts were of course eliminated from the Codicach. Near-duplicates, however, presented a very difficult problem. A typical case is a single news article downloaded by a web crawler on different days, with each copy carrying different headlines, comments, timestamps or other differences unrelated to the text itself. While most such "template text" could be removed with specialized scripts, as mentioned above, small differences often remained, making identical texts appear different.
Over the approximately six years of active downloading that went into assembling the Codicach, several billion words of such texts were downloaded. However, normal duplicate detection techniques were incapable of eliminating them, as they are only duplicates to a discerning human reader familiar with modern text presentation conventions -- not to a naive algorithm.
The solution to this problem came in the form of a custom-built plug-in for the ABC-View software that was generously written and donated by the program's author, Nils Haeck. This plug-in uses fuzzy logic to detect near-identical texts. It was used on the Codicach with a setting of 50, meaning that all texts with fewer than 50 characters of difference between them were eliminated.
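The internals of the plug-in are not described here, so the following sketch does not reproduce it. It swaps in Python's standard difflib module to approximate the idea of a character-difference threshold: two texts count as near-duplicates when fewer than max_diff characters differ between them. The function names and the way differences are counted are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def char_difference(a: str, b: str) -> int:
    """Approximate the number of characters that differ between two texts."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def drop_near_duplicates(texts, max_diff=50):
    """Keep a text only if it differs by at least `max_diff` characters
    from every text already kept."""
    kept = []
    for text in texts:
        if all(char_difference(text, other) >= max_diff for other in kept):
            kept.append(text)
    return kept
```

Comparing every text against every kept text is quadratic in the number of texts, so a sketch like this would only be practical on small batches (for instance, files from one source on adjacent dates), not on the whole corpus at once.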
Due to copyright restrictions on the source texts, the Codicach cannot be put on-line at the moment. It is nonetheless available for use: if you're an academic researcher and are interested in using the Codicach, drop me a line.
Here's a partial bibliography of publications and research projects that have used the Codicach.
González, C. (submitted). Estrategias gramaticales de expresión de la evidencialidad en el español de Chile.
Soto, G., Sadowsky, S. & Martínez, R. (2010). Sobre el caso del caso. Las construcciones del tipo 'el caso + nominal' en un corpus de textos periodísticos chilenos. Boletín de Lingüística 22 (33).
Urrejola, K. (2010). "De que es raro, es raro": un análisis gramatical y pragmático-discursivo de estructuras independientes introducidas por "de que". B.A. thesis. Universidad Católica de Chile. December 2010.
Hugo Rojas, E. (2010). Las formas de segunda persona singular como estrategias evidenciales. VI Conference of the Latin American Association for Discourse Studies (ALED). Universidad de Chile. Santiago, Chile.
González, C. & Hugo Rojas, E. (2010). "Cuando te lo piden, uno no siempre sabe qué decir": "Uno" y "tú" como estrategias evidenciales en el español de Chile. IV International Congress of Letters "Transformaciones Culturales: Debates de la teoría, la crítica y la lingüística en el Bicentenario". Universidad de Buenos Aires. Buenos Aires, Argentina. November 2010.
González, C. (2010). Evidencialidad en el español de Chile. 26th International Conference on Romance Linguistics and Philology. Universidad de Valencia. Valencia, Spain. September 2010.
González, C. (2010). El condicional de rumor: ¿modalidad epistémica o evidencialidad? Antecedentes para su discusión en el español de Chile. XII Conference of the Argentinian Society of Linguistics. Universidad Nacional de Cuyo. Mendoza, Argentina. April 2010.
Soto, G. (2009). Vigencia y significado del pretérito anterior. Un estudio a partir del español escrito en Chile. Estudios Filológicos 44: 227-241.
González, C. (2009). Formas gramaticales de expresión del significado evidencial en el español de Chile. XVIII Conference of the Chilean Linguistics Society. Universidad de Chile, Santiago, Chile. October 2009.
González, C. (2009). Distribución de los significados de los verbos en condicional. XVIII Conference of the Chilean Linguistics Society. Universidad de Chile. Santiago, Chile. October 2009.
González, C. & Hugo Rojas, E. (2009). "Uno" y "tú" en el continuo evidencial. Significados y distribución en un corpus de español de Chile. Poster presentation, XVIII Conference of the Chilean Linguistics Society. Universidad de Chile. Santiago, Chile. October 2009.
González, C. (2009). Una aproximación inicial a las formas de expresión del significado evidencial en el español de Chile. 2nd National Colloquium on Grammar, Pragmatics and Discourse. Universidad de Concepción. Concepción, Chile. August 2009.
Hugo Rojas, E. (2009). Las formas de segunda persona singular como estrategias evidenciales. Descripción y análisis de significado evidencial en un corpus de español de Chile. B.A. thesis. Universidad Católica de Chile. December 2009.
Soto, G., Martínez, R. & Sadowsky, S. (2006). Condicionantes pragmático-discursivos de le por les. IV Conference of the Latin American Association for Discourse Studies (ALED Chile). Universidad Católica de Valparaíso. Valparaíso, Chile. November 2006.
Soto, G., Martínez, R. & Sadowsky, S. (2005). Verbos y sustantivos en textos científicos. Análisis de variación en un corpus de textos de ciencias aplicadas, naturales, sociales y humanidades. Philologia Hispalensis (Sevilla) 19: 169-187.
Soto, G., Sadowsky, S. & Martínez, R. (2005). El caso del caso: Esquemas gramaticales de productividad restringida, marcos cognitivos y discurso. IV Conference of the Latin American Association for Discourse Studies (ALED). Universidad Católica de Chile. Santiago, Chile. 2005.
Martínez, R., Sadowsky, S. & Soto, G. (2004). El le invariable en el español escrito en Chile. Incidencias sintácticas y genéricas en el fenómeno. III Conference of the Latin American Association for Discourse Studies (ALED Chile). Universidad Austral de Chile. Valdivia, Chile. September - October 2004.
Research Grant DID-2001 (SOC-01/01-2) for Social Science, Humanities and Education Research. Universidad de Chile, 2001-2003. "El discurso científico escrito en ciencias naturales y sociales: un estudio comparativo de los textos de especialistas y estudiantes universitarios".
Research Grant, Universidad de Chile Research Fund, School of Philosophy and Humanities. "La función discursiva de los sintagmas nominales preverbales en el español oral y escrito en Chile". 2005-2006.
In addition, an early version of the Codicach was used to create the Word Frequency List of Chilean Spanish (Lifcach).