The first tool that we developed in this area is the prototype of a Text Analysis (TA) Dashboard, dedicated to content evaluation. It gathers functions that analyze textual content and derive some useful indicators from it. The dashboard is accessible from a few content pages of CommonSpaces through a "TA" link, which is available only to authenticated users: the link below is an exception, since it should work also for anonymous users.
Being aware of the structure of CommonSpaces objects, such as OERs, units of Learning Paths and Projects, the TA Dashboard is able to extract from them meaningful text content to be processed. However, it can be used to analyze the text of any web page, even outside CommonSpaces. See the section The TA Bookmarklet at the bottom of this page.
In our plan, the TA Dashboard is just the first tool of a suite of Content Analysis (CA) tools, to be accessed from a Content Analysis Dashboard. Later in this page you will find the roadmap for the development of the CA tools.
The Text Analysis Dashboard
The TA Dashboard relies mainly on spaCy, an open source Python library that implements a pipeline of Natural Language Processing (NLP) basic operations. spaCy is not a research product, but it integrates a set of state-of-the-art techniques, mostly based on statistical and machine-learning approaches, in a very clean and open architecture.
Currently spaCy is equipped with statistical language models for 10 European languages, including Greek and Lithuanian, and it supports tens of other world languages at a more basic level. We are starting to test the TA Dashboard with English and Italian texts, using the content of CS Learning Paths and Help Pages.
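As a minimal sketch of how a spaCy pipeline is driven (assuming only that the spaCy package is installed), the following uses a blank English pipeline plus the rule-based sentencizer, which needs no statistical model download; the Dashboard itself relies on the full statistical models such as "en_core_web_sm":

```python
import spacy

# A blank pipeline provides tokenization only; the rule-based
# "sentencizer" component adds sentence boundaries without requiring
# a trained model. The real TA Dashboard uses full language models.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("spaCy runs a pipeline of NLP steps. Each step annotates the text.")
tokens = [t.text for t in doc]
sentences = [s.text for s in doc.sents]
```

With a full model loaded, the same `doc` object would also carry POS tags, dependency relations and named entities.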
Objectives and target users
This tool aims to allow the analysis of web pages and documents that include mainly text, starting with the contents of this platform and then continuing with other web resources.
Some of the functions that we will support are already provided by well-known applications, such as Voyant, an open source platform also available online, and TaLTaC; while achieving the completeness and sophistication of those applications is not our target, we aim to go beyond the purely lexical level of their analysis by also addressing the syntactic structure and possibly the meaning of the text.
We think that the first users of the Text-analysis Dashboard could be the creators of the contents, who will get an evaluation of the readability of the text, based on indicators of syntactic complexity and lexicon level; this could help in finding spelling errors and stimulate the use of plainer terms, or of a more varied lexicon.
A teacher could use the same indicators to assess the writing ability of a student in terms of lexicon richness, syntax mastery and, possibly, liveliness of style. Any other reader could enjoy a synthetic picture of the content, provided by some genre or topic categorization and possibly by a short summary.
Teachers, other educators and educational managers could be interested in evaluating educational material and other contents, in order to assess the suitability of a text for a target audience or an educational goal.
The Dashboard elements
Currently, the Text-analysis Dashboard includes several elements, each hosted in its own window panel.
General text properties and indicators
This panel displays two kinds of data:
- a set of numerical values representing some basic properties of the text, such as its length in terms of characters, tokens (sequences of characters representing words, numbers, punctuation marks) and sentences;
- a set of indicators, that is, values derived through simple statistical computations which, depending on the user's viewpoint, can be taken as representative of additional, less objective text properties; for example, a very high value of the indicators related to dependency depth (the nesting of dependency relationships) could correspond to a complex syntactic structure.
This panel also lists the most frequent word forms, that is, unique tokens, irrespective of their lexical categories.
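The kind of counts and indicators shown in this panel can be sketched with the standard library alone; the crude regex tokenizer and sentence splitter below stand in for the real NLP pipeline, so the figures are only approximations:

```python
import re
from collections import Counter

def text_stats(text):
    """Basic properties and one simple indicator, of the kind shown in
    the General text properties panel. A regex tokenizer replaces the
    real NLP pipeline here, so counts are approximate."""
    tokens = re.findall(r"\w+|[^\w\s]", text)           # words and punctuation
    words = [t.lower() for t in tokens if t.isalpha()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "tokens": len(tokens),
        "sentences": len(sentences),
        # type/token ratio: a simple indicator of lexical variety
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "most_frequent": Counter(words).most_common(5),  # top word forms
    }
```

For example, `text_stats("The cat sat. The cat ran!")` reports 2 sentences, 8 tokens and "the" as the most frequent word form.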
At this stage, text summarization is a bit rudimentary: the summary assembles a selection of sentences or sentence fragments that a very simple algorithm extracts from the text itself, based on their lexical representativeness.
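A frequency-based extractive summarizer of this general kind can be sketched as follows; this is an illustration of the idea of lexical representativeness, not the Dashboard's actual algorithm:

```python
import re
from collections import Counter

def summarize(text, n=2):
    """Pick the n sentences most representative of the whole text:
    each sentence is scored by the average corpus frequency of its
    words. This mirrors frequency-based extractive summarization,
    not the Dashboard's actual algorithm."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-zA-Z]+", text.lower()))
    def score(sent):
        ws = re.findall(r"[a-zA-Z]+", sent.lower())
        return sum(freq[w] for w in ws) / (len(ws) or 1)
    top = sorted(sentences, key=score, reverse=True)[:n]
    # keep the selected sentences in their original order
    return " ".join(s for s in sentences if s in top)
```

On a toy input such as "Cats sleep a lot. Cats eat fish. Dogs bark loudly.", the sentence with the highest average word frequency ("Cats eat fish.") wins.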
Absolute frequency of lemmas
The term lemma refers to the normalized form of a possibly inflected word, used as the entry for lookup in vocabularies and dictionaries. In fact, the absolute frequency is not a frequency in the common sense of the term, but simply a count; it is computed for the tokens belonging to three lexical categories: verb, noun and adjective.
Each lemma is annotated with its absolute frequency in the text and with a lexical level, if known:
- for English, the lexical levels a1, a2, b1, b2, c1 and c2 have been derived directly from the word lists of the Oxford Learner's Dictionary;
- for Italian, we hypothesized the lexical levels a, b and c by mapping to them the three families of lemmas of the Nuovo vocabolario di base della lingua italiana (De Mauro and Chiari, 2016): crucial words, high usage words and high availability words.
The lexical levels correspond in some way to the levels of proficiency in the learning of a language as L2 (second language), as defined by the Common European Framework of Reference for Languages (CEFR). Please note, however, that the definitions of the Proficiency Levels by CEFR itself do not include explicit word lists.
In the frequency lists we have highlighted the (assumed) CEFR levels with colors: green for a, a1 and a2; orange for b, b1 and b2; red for c, c1 and c2. Lemmas in black can correspond to a wide range of cases: more difficult or rarely used words, neologisms, acronyms, proper nouns, obsolete words or poetic forms, errors of our text analysis algorithms and data, typos or other misspellings. Please note also that, while POS-tagging of English text is biased toward US English, the CEFR-related lists are based on British English.
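The annotation of lemma counts with lexical levels can be sketched as below; the tiny level map is a toy stand-in for the real word lists (e.g. the Oxford Learner's Dictionary levels for English), and its entries are illustrative only:

```python
from collections import Counter

# Toy stand-in for the real CEFR-related word lists; entries are
# illustrative only, not taken from the actual Oxford or De Mauro data.
CEFR_LEVELS = {"go": "a1", "house": "a1", "suggest": "b1", "notion": "c1"}

def lemma_frequencies(lemmas):
    """Count lemma occurrences and attach a CEFR-like level when known.
    `lemmas` is assumed to be the list of lemmatized verbs, nouns and
    adjectives already extracted by the NLP pipeline."""
    counts = Counter(lemmas)
    return [
        (lemma, count, CEFR_LEVELS.get(lemma))  # None -> shown in black
        for lemma, count in counts.most_common()
    ]
```

A lemma absent from the lists gets level `None`, matching the "black" cases described above.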
Lexical categorization, syntactic structure and named entities
The biggest panel in the Dashboard merges in a graph the representations of three types of analysis results; the unifying data, providing a common scaffolding for those results, is the original text, seen as a plain character sequence. Warning: the visualization of the graph requires a few seconds.
In our implementation the analysis algorithms, although very different, all rely on very complex data, compiled off-line by machine-learning algorithms based on statistical approaches. In any case, the results are accurate only to a certain degree; moreover, the accuracy largely depends on the quality of the statistical data, which in turn largely depends on the size of the training data, such as the manually annotated text corpora that have been presented to the learning algorithms as good examples.
POS-tagging, that is, the assignment of Part-Of-Speech tags, is a way of annotating the tokens of a text with lexical categories. The set of available categories (POS-tags) can vary from language to language, but most languages share a common core.
The traditional syntactic analysis, based on a recursive decomposition of a text into syntactic constituents, such as noun phrases, verb phrases and relative clauses, was the standard in secondary education some 20-40 years ago; perhaps a similar situation persists in some schools. Our Dashboard implements a different approach to build syntactic trees, based on dependency relationships; this approach is akin to what in the old times was called logical analysis and was taught mainly as a prerequisite for the learning of classical languages.
Dependency relationships are now considered more natural to understand and more effective for learning grammar. However, the graphical representation of the trees produced by the analysis is a bit convoluted in our implementation since, with reference to the linear succession of the tokens in a sentence, it can link elements that are very distant on the screen.
The set of available types of dependency relationships can vary from language to language. A common core has been defined by the Universal Dependencies (UD) Project.
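The dependency-depth indicator mentioned earlier can be computed directly from a parsed sentence. The sketch below assumes the parse is given as a list of head indices, one per token, with the root pointing to itself; this input format is an assumption for illustration, not spaCy's native representation:

```python
def dependency_depth(heads):
    """Depth of a dependency tree given, for each token, the index of
    its head token; the root points to itself. The maximum nesting of
    dependency relationships is one of the syntactic-complexity
    indicators mentioned above."""
    def depth(i, seen=()):
        if heads[i] == i or i in seen:   # root reached (or cycle guard)
            return 0
        return 1 + depth(heads[i], seen + (i,))
    return max(depth(i) for i in range(len(heads)))
```

For "She reads long books", with "reads" as root, "She" and "books" attached to "reads", and "long" attached to "books", the head list is `[1, 1, 3, 1]` and the depth is 2.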
Named entity recognition
Named entity recognition (NER) algorithms try to locate in the text and to classify named entities, i.e. references by means of proper names to persons, organizations, times and places. NER is a function that draws on encyclopedic knowledge: it requires training the algorithms with large corpora of documents on a varied set of subjects, including geography, history, science and technology, and current events.
While POS-tagging and dependency analysis in most cases provide fairly good results on well-formed texts, currently the NER implementation of our Dashboard is very weak: it provides disappointing results even with English-language documents. The reason is that, to limit the use of computing resources, we always use the lighter versions of the spaCy language models, although larger versions exist for English and some other languages.
Some implementation details
Like the rest of the CS Platform, the TA Dashboard software is based on the Python language, the Django server-side framework, several Django extensions and many other Python libraries.
The spaCy NLP pipeline is integrated as a second web server, accessed as a web service by CS through a couple of API endpoints. It was initially set up as a slightly customized installation of NLPBuddy, an open source text analysis tool, and is still accessible at nlp.commonspaces.eu; we then customized it further to expose some of the basic functionality of spaCy, to avoid integrating CS directly with spaCy itself and its language models.
The TA Bookmarklet
What is a Bookmarklet?
Creating the TA Bookmarklet
To create the TA Bookmarklet in a browser such as Chrome or Firefox, perform this once-only operation:
- add a new bookmark to the bookmarks bar and give it the name "TA Bookmarklet",
- if you access CommonSpaces at the web address including the Up2U domain name (cs.up2university.eu), the URL of the TA Bookmarklet should have a slightly different value:
(*) this is why we call it "bookmarklet" and not simply "bookmark".
Using the TA Bookmarklet
To use the TA Bookmarklet:
- log in to CommonSpaces and possibly navigate inside it;
- open other tabs of the same browser and navigate the web inside them;
- when you are viewing an interesting page, worth being analyzed and containing text of moderate size, click on the TA Bookmarklet and wait.
You must be aware of a few facts:
- currently, the TA Bookmarklet is not able to analyze large amounts of text, due to limited computation resources; trying with big pages, you risk killing both the TA Bookmarklet application and CommonSpaces itself (**);
- in any case, the response will come with considerable delay; please be patient and don't retry if you get no response;
- the TA Bookmarklet does its best to "clean" the page, by extracting the content from the main body of the web page and cutting off what it deems of little or no interest, such as the page header and footer, the menu bar, side columns and ads; however, it employs simple heuristics, which could match your page poorly and fall short of your expectations.
(**) A future version of the TA Bookmarklet could be able to extract and analyze a specific portion of the page content that you will select by highlighting some text with the mouse and the keyboard.
The Content Analysis Dashboard
The first implementation of the CA Dashboard is now accessible from the Library menu; since it is aimed at the individual user, it takes into consideration only contents authored by her/him. Another version, contextualised to a project, will be available to its members.
The role of the CA Dashboard
The CA Dashboard supports the definition of content collections and their consolidation in a kind of corpora. The individual items to be included in a collection can be selected from CommonSpaces (CS) content objects, such as OERs, Learning Paths and Documents in Project Folders, or taken from outside: web pages, online documents, or documents local to the user's device. In any case:
- a corpus is a collection of contents that has undergone a natural language pre-processing phase (a pipeline of analysis and annotation tasks), whose result has been saved to disk, in order to get some summary data and to spare some computation when more specific processing will be needed;
- all items in a corpus should contain text in a single language (analyzing mixed-language text is outside our goals).
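The idea of pre-processing once and caching the result on disk can be sketched as follows; the `analyze` callable stands in for the real NLP pipeline, and the on-disk layout (one JSON file per item) is an assumption, not the actual storage format:

```python
import json
from pathlib import Path

def preprocess_and_save(items, corpus_dir, analyze):
    """Run an analysis pipeline once per item and cache the result on
    disk, so later tools can reuse it without recomputing. `items`
    maps item names to their extracted text; `analyze` stands in for
    the real NLP pipeline; the JSON-file-per-item layout is assumed
    for illustration only."""
    corpus = Path(corpus_dir)
    corpus.mkdir(parents=True, exist_ok=True)
    for name, text in items.items():
        out = corpus / f"{name}.json"
        if not out.exists():          # spare recomputation on later runs
            result = {"text": text, "annotations": analyze(text)}
            out.write_text(json.dumps(result))
```

Re-running the function on the same corpus directory skips items already analyzed, which is the "spare some computation" behavior described above.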
The CA Dashboard makes it easy to select CS content objects from the following categories:
- for the individual user: authored LPs, shared LPs, personal LPs, authored OERs, shared OERs, personal OERs (that is, OER "stubs" created by using the CommonSpaces bookmarklet) and Documents added to any Project Folder;
- for a project group: LPs and OERs created inside the Project, LPs and OERs shared inside the Project, Documents present in the Project's folder and sub-folders, message streams in the Project Forum.
The tools provided by the CA Dashboard
Tools already existing or under development are:
- the TA Dashboard, which has been described above;
- the Keywords in Context (KWIC) Analysis tool.
Keywords in Context Analysis
The tool thus named performs two basic tasks:
- counting the number of occurrences (tokens) of each unique word in a corpus (frequency), sorting words by frequency and listing the most frequent ones (keywords);
- listing all occurrences in the text of the top N keywords, each word in its context, that is, together with the L tokens preceding and following it in the text; these sublists thus contain token sequences corresponding to text "windows" of size 1+2*L; currently the value of L is 5.
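The two tasks above can be sketched with the standard library; the crude regex tokenizer replaces the real pipeline, and the keyword is upper-cased here only to make it stand out in each window:

```python
import re
from collections import Counter

def kwic(text, top_n=3, window=5):
    """Keywords in Context: list each of the top_n most frequent words
    together with `window` tokens of context on each side (the text
    above mentions L = 5). A regex tokenizer stands in for the real
    NLP pipeline."""
    tokens = re.findall(r"\w+", text.lower())
    keywords = [w for w, _ in Counter(tokens).most_common(top_n)]
    concordance = {}
    for kw in keywords:
        lines = []
        for i, tok in enumerate(tokens):
            if tok == kw:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                lines.append(" ".join(left + [tok.upper()] + right))
        concordance[kw] = lines
    return concordance
```

Each entry of the result maps a keyword to the list of its occurrences, every occurrence shown inside its window of at most 1+2*L tokens.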
Additional planned tools
Some tools are foreseen but still in a specification stage. Please suggest features for them and submit ideas for additional tools!
We are ready to provide measures of similarity between pre-processed texts; but serious doubts remain about:
- the definition itself of such measures: currently, a similarity score would be computed by a method of the spaCy library, based on the available language models, in particular on their component word vectors, which result from model training performed with large annotated corpora; these language models have been built with state-of-the-art learning algorithms and work reasonably well for tokenization, POS-tagging, parsing and, to some extent, named-entity recognition (NER); but we don't know which kind of "similarity" they are able to capture;
- the user functionality: does the user want to compare two individual content items, two corpora, or each item inside a corpus with all the other items? Knowing the most useful functionality would help in designing the user interface (UI).
As to the similarity definition, our guess is that the similarity scores provided by spaCy (just numbers in the range 0-1) could work well for categorizing documents by genre or domain, by comparison with already categorized documents. In other cases, for example in discovering plagiarism, presumably different approaches have to be studied and specific algorithms need to be developed.
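To illustrate what a score "in the range 0-1" means, the sketch below computes cosine similarity over simple bag-of-words count vectors; spaCy's similarity works on dense trained vectors instead, so this is an analogy, not its actual method:

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors: close to
    1.0 for texts with identical word distributions, 0.0 for texts with
    no words in common. spaCy's similarity uses dense trained vectors
    instead, but returns a score of the same 0-1 kind."""
    a, b = (Counter(re.findall(r"\w+", t.lower())) for t in (text_a, text_b))
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Count-based scores like this mostly capture shared vocabulary, which already hints at why such numbers may suit genre or domain categorization better than plagiarism detection.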