Work modules

Module 3. Creation of resources

For the development and exploration of techniques to address the challenges posed in the project, it is essential to have data. This requires not only the extraction and compilation of information on the scenarios under study, but also the curation, annotation and enrichment of that information to obtain quality data.

Milestones

  • Study and selection of data collection techniques.
  • Analysis and selection of data filtering and cleaning tools.
  • Definition of annotation guidelines appropriate to the source, domain and task under study.
  • Implementation of data extraction techniques.
  • Implementation of data curation tools.
  • Construction of annotated datasets from the extracted, compiled and curated information using the defined annotation guidelines.

Deliverables

  • Annotation guidelines made available to the scientific community to facilitate the annotation of other datasets.
  • Datasets on digital content from different information sources.
  • Repository compiling the annotation guidelines and datasets generated for each domain.

Task 3.1 Extraction and compilation

The collection and storage of information is fundamental to the development of practical Artificial Intelligence solutions. Data plays a central role in this project, as it is the basis for studying the dynamics of digital content exchanged between entities. This task aims to extract information about the identified scenarios from various data sources, such as social networks, newspapers, forums and medical reports. For this purpose, the most appropriate data collection techniques will be studied and selected for each source and domain, in order to ensure that the collected data are accurate.
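
As an illustration of the kind of collection routine envisaged, the sketch below fetches documents from a list of article URLs and stores the raw content together with basic metadata. The URLs, field names and output file are hypothetical placeholders rather than project-specific choices.

    import json
    import time

    import requests

    # Hypothetical list of article URLs for one of the sources under study.
    SEED_URLS = [
        "https://example-newspaper.org/articles/1",
        "https://example-newspaper.org/articles/2",
    ]

    def collect(urls, out_path="raw_articles.jsonl", delay=1.0):
        """Fetch each URL and store the raw content with basic metadata."""
        with open(out_path, "w", encoding="utf-8") as out:
            for url in urls:
                try:
                    response = requests.get(url, timeout=10)
                    response.raise_for_status()
                except requests.RequestException as exc:
                    print(f"Skipping {url}: {exc}")
                    continue
                record = {
                    "url": url,
                    "retrieved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                    "html": response.text,
                }
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                time.sleep(delay)  # simple rate limiting towards the source

    if __name__ == "__main__":
        collect(SEED_URLS)

Source-specific collectors (for instance, clients for social network APIs) would replace the fetching step, but could share the same storage format so that later curation steps remain uniform.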

Task 3.2 Data curation

According to the activities planned in T3.1, digital contents will be obtained from different information sources. These contents must be filtered in order to retain precisely the information required, discarding those that do not reach the quality levels appropriate for the project. In addition, data cleaning activities are needed in which characters and markup structures that provide no useful information are removed. An example of this is retrieving information from web pages, where HTML tags and other elements must be removed, since they hinder human reading without the intervention of interpreter programs such as web browsers (e.g. Chrome or Mozilla Firefox). Finally, it is necessary to ensure that the retrieved contents maintain a certain coherence: regardless of how they were digitised in the original source, when stored in their final structure they should follow an order, structure, format and access conditions that make it easy for computer programs to retrieve their content for subsequent NLP processes.
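
By way of example, the following sketch illustrates one possible cleaning step under these requirements: markup is stripped using the Python standard library, whitespace is normalised, and documents below a minimum length are discarded. The length threshold is an arbitrary stand-in for the project's actual quality criteria.

    import re
    from html.parser import HTMLParser

    class _TextExtractor(HTMLParser):
        """Collects visible text, skipping script and style blocks."""

        def __init__(self):
            super().__init__()
            self._skip_depth = 0
            self._chunks = []

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip_depth > 0:
                self._skip_depth -= 1

        def handle_data(self, data):
            if self._skip_depth == 0:
                self._chunks.append(data)

        def text(self):
            return " ".join(self._chunks)

    def clean_html(raw_html, min_chars=200):
        """Strip markup, normalise whitespace and apply a minimal quality filter.

        Returns None when the cleaned text is shorter than `min_chars`,
        a placeholder threshold standing in for the project's quality criteria.
        """
        extractor = _TextExtractor()
        extractor.feed(raw_html)
        text = re.sub(r"\s+", " ", extractor.text()).strip()
        return text if len(text) >= min_chars else None

    sample = "<html><body><h1>Title</h1><p>Some article text.</p></body></html>"
    print(clean_html(sample, min_chars=10))

A full curation pipeline would additionally handle character-encoding repair, deduplication and storage in the agreed final structure, but it would follow the same filter-and-normalise pattern.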

Task 3.3 Annotation and enrichment

Machine learning techniques have become one of the fundamental strategies in any natural language modelling process, including automatic content processing. Through these techniques, algorithms allow computers to learn from experience. This experience is embodied in training data (datasets), which in the case of supervised learning require prior annotation. The success of the predictions made by the language model depends directly on the quality and size of the training data, especially when deep learning algorithms are used. One of the most relevant advantages of this type of algorithm is that it does not require an a priori design of features; however, this property makes it dependent on larger datasets than traditional machine learning algorithms.

In this task we will focus on the construction of datasets annotated with advanced semantic features from the previously extracted, compiled and curated collections of resources. This will be done using tools that access other advanced resources, such as open databases and other reusable and redistributable information collections, as well as annotation tools specific to the different use cases. In addition, the quality of the resources will be assured through the development of problem-specific annotation guidelines and annotation quality metrics. Moreover, given that certain types of annotation are difficult to determine, in many cases challenging even human experts, research will be carried out on the semi-automatic generation of datasets, so that the cost of corpus creation is reduced considerably and the larger number of training examples improves the accuracy of the machine learning and deep learning methods used.
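
As a concrete example of an annotation quality metric of the kind mentioned above, the sketch below computes Cohen's kappa, a standard measure of inter-annotator agreement; the label set and the two annotation sequences are purely illustrative.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa between two annotators labelling the same items."""
        if len(labels_a) != len(labels_b):
            raise ValueError("annotators must label the same items")
        n = len(labels_a)
        # Observed agreement: fraction of items given identical labels.
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement: expected overlap of the two label distributions.
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        expected = sum(
            (counts_a[label] / n) * (counts_b[label] / n)
            for label in set(counts_a) | set(counts_b)
        )
        return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    # Hypothetical labels produced by two annotators following the same guidelines.
    annotator_1 = ["POS", "NEG", "NEG", "POS", "NEU"]
    annotator_2 = ["POS", "NEG", "POS", "POS", "NEU"]
    print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")

Values above roughly 0.6 are commonly taken to indicate substantial agreement, which could serve as one acceptance criterion before a dataset is released to the community.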