Corpus del Español Actual [OER]

The Corpus del Español Actual (CEA; Corpus of Contemporary Spanish) contains 540 million words, which have been lemmatized and tagged with detailed part-of-speech information.

The CEA is made up of the following texts: the Spanish part of the eleven-language parallel corpus Europarl: European Parliament Proceedings Parallel Corpus, v. 6 (1996-2010); the Spanish portion of the trilingual Wikicorpus, v. 1.0, which was extracted from a snapshot of Wikipedia (2006); and the Spanish part of the seven-language parallel corpus MultiUN: Multilingual UN Parallel Text 2000-2009, a corpus made up of the resolutions of the United Nations.

The CEA was tagged using an online Spanish dictionary containing 635,000 wordforms, which was automatically generated from a dictionary of 86,000 single-word lemmas (e.g., unir, inmoralidad, allí) and 26,000 multiword lemmas (e.g., muerte cerebral, carga de profundidad, de armas tomar) (Subirats 1989, 1992, 1994a, 1994b; Mogorrón 1994; Garrido 1999; Bobes 2000). Tag disambiguation was carried out with intersecting finite-state automata using lexical and syntactic information (Subirats 1998, Subirats and Ortega 2000, 2001, Ortega in progress).
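To make the tagging description more concrete, the following is a minimal toy sketch, not the CEA's actual implementation. It assumes a tiny hypothetical lexicon (LEXICON), a made-up tag set (DET, PRON, NOUN, VERB), and three invented syntactic rules. Each rule accepts or rejects a whole tag sequence, and keeping only the sequences that every rule accepts loosely mimics intersecting the languages of several finite-state automata.

```python
from itertools import product

# Hypothetical toy lexicon: wordform -> set of (lemma, tag) analyses.
# The real dictionary and tag set are not shown in this entry.
LEXICON = {
    "la": {("la", "DET"), ("la", "PRON")},
    "casa": {("casa", "NOUN"), ("casar", "VERB")},
    "une": {("unir", "VERB")},
}

def lookup(tokens):
    """Return every dictionary analysis for each token (unknown -> UNK)."""
    return [sorted(LEXICON.get(t.lower(), {(t, "UNK")})) for t in tokens]

def forbid_bigram(first, second):
    """Build a rule that rejects tag sequences containing first+second."""
    def rule(tags):
        return all((a, b) != (first, second) for a, b in zip(tags, tags[1:]))
    return rule

# Invented constraints; each one plays the role of one automaton.
CONSTRAINTS = [
    forbid_bigram("DET", "VERB"),
    forbid_bigram("PRON", "NOUN"),
    forbid_bigram("VERB", "VERB"),
]

def disambiguate(tokens):
    """Keep only the readings accepted by all constraints (their intersection)."""
    readings = []
    for path in product(*lookup(tokens)):
        tags = [tag for _, tag in path]
        if all(rule(tags) for rule in CONSTRAINTS):
            readings.append(list(path))
    return readings

if __name__ == "__main__":
    for reading in disambiguate(["la", "casa", "une"]):
        print(reading)
```

In this sketch the three-word input "la casa une" keeps a single reading, while "la casa" alone stays ambiguous between determiner + noun and pronoun + verb, which illustrates why syntactic context is needed in addition to the dictionary lookup.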

Reference info

http://spanishfn.org/tools/cea/english

Other metadata: author: Subirats, Carlos; author: Ortega, Marc; publisher: Computational Linguistics Laboratory, Universidad Autónoma de Barcelona

OER type: Metadata and online reference

Submitted by Fernando Martínez de Carnero
06/12/2015
in the project Strumenti e tecnologie per insegnare le lingue