The first tool we developed in this area is a dashboard dedicated to content evaluation. It gathers functions to analyze textual contents and to draw from them some useful indicators. The dashboard is accessible from a few content pages of CommonSpaces through a "TA" link, which is available only to authenticated users: the link below is an exception.
The Text Analysis Dashboard
We are developing a Text Analysis (TA) Dashboard relying mainly on spaCy, an open source Python library that implements a pipeline of basic Natural Language Processing (NLP) operations. spaCy is not a research product, but it integrates a set of state-of-the-art techniques, mostly based on statistical and machine-learning approaches, in a very clean and open architecture.
Currently spaCy is equipped with statistical language models for 10 European languages, including Greek and Lithuanian, but it supports at a more basic level tens of world languages. We are starting to test the TA Dashboard with English and Italian texts, using the content of CS Learning Paths and Help Pages.
Objectives and target users
This tool aims to allow the analysis of web pages and documents that include mainly text, starting with the contents of this platform and then continuing with other web resources.
Some of the functions that we will support are provided already by well-known applications, such as Voyant, an open source platform available also online, and TaLTaC; while achieving the completeness and sophistication of those applications is not our target, we aim to go beyond the purely lexical level of their analysis by addressing also the syntactic structure and possibly the meaning of the text.
We think that the first users of the Text-analysis Dashboard could be the content creators, who will get an evaluation of the readability of their text, based on indicators of syntactic complexity and lexical level; this could help them find spelling errors and encourage the use of plainer terms or of a more varied vocabulary.
A teacher could use the same indicators to assess the writing ability of a student in terms of lexicon richness, syntax mastery and, possibly, liveliness of style. Any other reader could enjoy a synthetic picture of the content, provided by some genre or topic categorization and possibly by a short summary.
Teachers, other educators and educational managers could be interested in evaluating educational material and other contents in order to assess the suitability of a text for a target audience or an educational goal.
The Dashboard elements
Currently, the Text-analysis Dashboard includes several elements, each hosted in its own window panel.
General text properties and indicators
This panel displays, mixed together:
- a set of numerical values representing some basic properties of the text, such as its length in terms of characters, tokens (sequences of characters representing words, numbers, punctuation marks) and sentences;
- a set of indicators, that is of values derived through simple statistical computations, which, depending on the user viewpoint, can be assumed as representative of additional, less objective text properties; for example, a very high value of indicators related to the dependency depth (nesting of dependency relationships) could correspond to a complex syntactical structure.
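As an illustration only, the dependency-depth indicators mentioned above could be computed along the following lines. This is a minimal sketch, not the Dashboard's actual code: the function name `token_depths` and the hand-made list of head indices are assumptions introduced for the example. It relies on the convention, also used by spaCy, that the root of a sentence is its own head.

```python
from statistics import mean


def token_depths(heads):
    """Return the depth of each token in a dependency tree.

    heads[i] is the index of the head of token i; the root token is its
    own head. The root has depth 0.
    """
    depths = []
    for i in range(len(heads)):
        depth = 0
        current = i
        # Walk up the tree until the root (a token that is its own head).
        while heads[current] != current:
            current = heads[current]
            depth += 1
        depths.append(depth)
    return depths


# Hand-made example (an assumption, not real parser output):
# tokens: 0 The, 1 cat, 2 that, 3 I, 4 saw, 5 sleeps
heads = [1, 5, 4, 4, 1, 5]
depths = token_depths(heads)

max_depth = max(depths)    # nesting of the deepest token
mean_depth = mean(depths)  # average nesting over the sentence
```

A high maximum or mean depth, compared with other texts, would then be read as a hint of more complex syntax; which thresholds count as "very high" is left to the user's viewpoint, as the text says.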
This panel also lists the most frequent word forms, that is unique tokens, irrespective of their lexical categories.
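A list of the most frequent word forms can be obtained with a plain counter over the tokens. The sketch below assumes that the text has already been tokenized; folding case and skipping punctuation-only tokens are choices made for this example, and the Dashboard may handle them differently. The name `most_frequent_forms` is hypothetical.

```python
from collections import Counter


def most_frequent_forms(tokens, n=10):
    """Return the n most frequent word forms with their counts.

    Forms are compared case-insensitively; tokens without any letter or
    digit (punctuation marks) are skipped. Both are assumptions.
    """
    forms = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]
    return Counter(forms).most_common(n)


tokens = ["The", "cat", "saw", "the", "dog", ",", "and", "the", "dog", "ran", "."]
top = most_frequent_forms(tokens, n=2)
```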
At this stage, text summarization is a bit rudimentary: the summary assembles a selection of sentences or sentence fragments that a very simple algorithm extracts from the text itself, based on their lexical representativeness.
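The exact algorithm is not described here. As one possible reading of "lexical representativeness", the following sketch scores each sentence by the average frequency of its words in the whole text and keeps the best-scoring sentences in their original order. The function `summarize`, the regular-expression tokenization and the scoring formula are all assumptions made for this example.

```python
import re
from collections import Counter


def summarize(sentences, max_sentences=2):
    """Select the sentences whose words are most frequent in the text."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)

    def score(words):
        # Average frequency, so that long sentences are not favoured
        # merely because they contain more words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(tokenized[i]),
        reverse=True,
    )
    chosen = sorted(ranked[:max_sentences])  # restore the original order
    return [sentences[i] for i in chosen]


sentences = [
    "spaCy parses text.",
    "The dashboard shows text indicators.",
    "Lunch was good.",
]
summary = summarize(sentences, max_sentences=2)
```

This sketch does no stop-word filtering, so very common function words can dominate the scores on real texts; a practical version would probably count only content words or lemmas.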
Absolute frequency of lemmas
The term lemma refers to the normalized form of a possibly inflected word, used as entry for the lookup in vocabularies and dictionaries. In fact, the absolute frequency is not a frequency in the common sense of the term, but simply a count; it is computed for the tokens belonging to three lexical categories: verb, noun and adjective.
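The count described above can be sketched as follows, assuming that each token has already been reduced to a (lemma, POS-tag) pair. The tags VERB, NOUN and ADJ are those of the Universal Dependencies tag set, which spaCy also uses for coarse-grained tags; the function name `lemma_counts` and the sample data are hypothetical.

```python
from collections import Counter

# The three lexical categories for which lemmas are counted.
CONTENT_POS = ("VERB", "NOUN", "ADJ")


def lemma_counts(tagged_tokens):
    """Count lemmas separately for verbs, nouns and adjectives.

    tagged_tokens is an iterable of (lemma, pos) pairs; tokens with any
    other POS tag are ignored.
    """
    counts = {pos: Counter() for pos in CONTENT_POS}
    for lemma, pos in tagged_tokens:
        if pos in counts:
            counts[pos][lemma] += 1
    return counts


# Hand-made sample, not real tagger output.
tagged = [
    ("cat", "NOUN"), ("be", "AUX"), ("sleep", "VERB"), ("cat", "NOUN"),
    ("small", "ADJ"), ("sleep", "VERB"), ("run", "VERB"),
]
counts = lemma_counts(tagged)
```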
Each lemma is annotated with its absolute frequency in the text and with a lexical level, if known:
- for English, the lexical levels a1, a2, b1, b2, c1 and c2 have been derived directly from the word lists of the Oxford Learner's Dictionary;
- for Italian, we hypothesized the lexical levels a, b and c by mapping to them the three families of lemmas of the Nuovo vocabolario di base della lingua italiana (De Mauro and Chiari, 2016): crucial words, high usage words and high availability words.
The lexical levels correspond in some way to the levels of proficiency in the learning of a language as L2 (second language), as defined by the Common European Framework of Reference for Languages (CEFR). Please note, however, that the definitions of the proficiency levels by CEFR itself do not include explicit word lists.
In the frequency lists we have highlighted the (assumed) CEFR levels with colors: green for a, a1 and a2; orange for b, b1 and b2; red for c, c1 and c2. Lemmas in black can correspond to a wide range of cases: more difficult or rarely used words, neologisms, acronyms, proper nouns, obsolete words or poetic forms, errors of our text analysis algorithms and data, typos or other misspellings. Please note also that, while POS-tagging of English text is biased toward US English, the CEFR-related lists are based on British English.
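The color scheme just described amounts to a simple lookup with black as the fallback. The sketch below only restates that mapping; the names `LEVEL_COLORS` and `level_color` are hypothetical, and representing an unknown level as `None` is an assumption.

```python
# Mapping of (assumed) CEFR lexical levels to highlight colors.
LEVEL_COLORS = {
    "a": "green", "a1": "green", "a2": "green",
    "b": "orange", "b1": "orange", "b2": "orange",
    "c": "red", "c1": "red", "c2": "red",
}


def level_color(level):
    """Return the highlight color for a lexical level.

    Lemmas whose level is unknown (None or not in the table) are shown
    in black.
    """
    return LEVEL_COLORS.get(level, "black")


colors = [level_color(lv) for lv in ("a2", "b", "c1", None)]
```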
Lexical categorization, syntactic structure and named entities
The biggest panel in the Dashboard merges in a graph the representations of three types of analysis results; the unifying data, providing a common scaffolding for those results, is the original text, seen as a plain character sequence. Warning: the visualization of the graph requires a few seconds.
In our implementation the analysis algorithms, although very different, are all based on very complex data; these have been compiled off-line by machine-learning algorithms based on statistical approaches. In any case, the results are accurate within a certain percentage; moreover, the accuracy largely depends on the quality of the statistical data, which in turn largely depends on the size of the training data, such as manually annotated text corpora, that have been presented to the learning algorithms as good examples.
POS-tagging, that is assigning Part-Of-Speech Tags, is a way of annotating the tokens of a text with lexical categories. The set of available categories (POS-tags) can vary from language to language, but most languages have a common core.
The traditional syntactic analysis, based on a recursive decomposition of a text in syntactic constituents, such as noun phrases, verb phrases and relative clauses, was the standard in secondary education some 20-40 years ago; perhaps a similar situation persists in some schools. Our Dashboard implements a different approach to build syntactic trees, based on dependency relationships; this approach is akin to what in the old times was called logical analysis and was taught mainly as a prerequisite for the learning of classical languages.
Dependency relationships are now considered more natural to understand and much more effective in learning grammar. However, the graphical representation of the trees produced by the analysis is a bit convoluted in our implementation: since the tokens are laid out in the linear order of the sentence, dependency arcs can link elements that are very distant on the screen.
The set of available types of dependency relationships can vary from language to language. A common core has been defined by the Universal Dependencies (UD) Project.
Named entity recognition
Named entity recognition (NER) algorithms try to locate in the text and to classify named entities, i.e. references by means of proper names to persons, organizations, times and places. NER is a function that draws on encyclopedic knowledge: it requires the training of the algorithms with large corpora of documents related to a varied set of subjects, including geography, history, science and technology, and current events.
While in most cases POS-tagging and dependency analysis provide fairly good results in analyzing well-formed texts, currently the NER implementation of our Dashboard is very weak: it provides disappointing results even when faced with English language documents. The fact is that, to limit the use of computing resources, we always use the lighter version of the spaCy language models, although there are larger versions for English and some other languages.
Some implementation details
Like the rest of the CS Platform, the TA Dashboard software is based on the Python language, the Django server-side framework, several Django extensions and many other Python libraries.
The spaCy NLP pipeline is integrated as a second web server, being accessed as a webservice by CS, through a couple of API endpoints. It was initially set up as a slightly customized installation of NLPBuddy, Open Source Text Analysis Tool, and is still accessible at nlp.commonspaces.eu; then, we further customized it to expose some of the basic functionality of spaCy, to avoid integrating CS directly with spaCy itself and its language models.