Module 4 – Coolang

Module 4. Design of methods and algorithms

This module is the most extensive in the project, as it comprises the tasks with the greatest research burden. Following the detailed analysis of the domains, scenarios, current techniques and information sources, as well as the specific challenges to be solved, this module will identify the expected response of future solutions. It will explore new methods and techniques that go beyond the state of the art and provide a breakthrough in the search for solutions to the problems identified. To model digital content, all the information related to the context in which this content is produced and consumed is relevant: its textual content, the profiles of the digital entities involved in the exchange, and the network structure maintained by these entities. This requires a novel approach to model construction, one that can deal with heterogeneous features (text, network structure, temporal evolution, additional features present…). From feature engineering to end-to-end solutions, the holistic reality of digital content must be considered in the search for models that support inference processes (classification and decision making) for the detection of both beneficial content (hope speech and secure content) and malicious content (hate speech and fake news).

Milestones

Identification of relevant heterogeneous features and study of the application of methods for the extraction, selection and transformation of these features.
Language modeling based on the above features.
Study and design of the most appropriate representation to model the content.
Review of the state of the art in supervised and semi-supervised learning methods, applied to the generation of inference models.
Study and design of methods and strategies to mitigate bias in machine learning techniques.
Study and design of methods and resources for mutation prediction and content viralization.

Deliverables

Report with relevant heterogeneous features and application methods.
Language modeling representation.
Knowledge graphs and GNNs derived from content modeling.
Report on the most suitable supervised and semi-supervised learning methods.
Report on methods and strategies to mitigate bias in machine learning techniques.
Report on methods and resources for mutation prediction and content viralization.

Task 4.1 Methods for the extraction, selection and transformation of features

This task will determine the most relevant heterogeneous features to be used in the subsequent modeling. The aim is to extract the high-level semantics associated with the content, characterizing the relationships between digital entities in aspects such as contradiction, congruence, polarization, bias, emotions, irony, etc. Features can have diverse formats and origins. Specifically, we have three main sources of data: social media, knowledge bases, and corpora of available or generated data. From these sources we will select and filter according to the scenario under study. For example, in the case of fake news analysis, networks such as Twitter and the online press would be sources to consider, but so would news corpora and geographic and factual knowledge bases. In the case of hate speech, the search for toxic content can be scoped to a specific community by topic (racism, gender-based violence, bullying) or by age (networks most used by teenagers, such as TikTok, Twitch or Instagram). In other words, the contextualization of the specific scenario serves as a premise for the selection of sources and the extraction of features (profile information, messages, timestamps, network structure…). From the extracted features, those considered decisive for solving a given problem will be selected, and in some cases they will need to be transformed to suit the purpose for which they are used.
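As an illustration of the three steps named in this task (extraction, selection, transformation), the following is a minimal sketch in Python. It is not the project's method: the `Post` record, its field names, the variance-based selection rule and the log transformation are all assumptions chosen only to make the pipeline concrete.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Post:
    """Hypothetical record for one message; every field name is illustrative."""
    text: str
    author_followers: int
    created_at: datetime
    reply_count: int


def extract_features(post: Post) -> dict[str, float]:
    """Extraction: derive text, profile, temporal and interaction features."""
    tokens = post.text.lower().split()
    return {
        "n_tokens": float(len(tokens)),
        "lexical_diversity": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "author_followers": float(post.author_followers),
        "hour_of_day": float(post.created_at.hour),
        "reply_count": float(post.reply_count),
    }


def select_features(rows: list[dict[str, float]], min_variance: float = 1e-9) -> list[str]:
    """Selection (assumed rule): keep features that vary across the sample."""
    kept = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance > min_variance:
            kept.append(name)
    return kept


def transform_features(row: dict[str, float], kept: list[str]) -> dict[str, float]:
    """Transformation (assumed rule): log-scale the heavy-tailed follower count."""
    out = {name: row[name] for name in kept}
    if "author_followers" in out:
        out["author_followers"] = math.log1p(out["author_followers"])
    return out


# Two made-up posts, published at the same hour with the same number of replies.
posts = [
    Post("hope hope for everyone", 10, datetime(2023, 1, 1, 9, tzinfo=timezone.utc), 2),
    Post("this is a different message", 1000, datetime(2023, 1, 2, 9, tzinfo=timezone.utc), 2),
]
rows = [extract_features(p) for p in posts]
kept = select_features(rows)
prepared = [transform_features(r, kept) for r in rows]
print(kept)
```

In this toy sample the hour of day and the reply count are constant, so the selection step drops them. In a real scenario the selection criterion would depend on the problem being solved, as the task description states.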

Task 4.2 Methods for content modelling from heterogeneous features

Once these heterogeneous features have been extracted, selected and transformed, the objective of this task is to model the content through language models that take into account the linguistic characteristics of the different domains and scenarios. Features at all levels (lexical, syntactic, semantic, discourse and pragmatic) will be considered. Current methods of digital content analysis aim to integrate data of diverse nature (message text, text associated with an image, network structure, user data, lexicons and specialized ontologies…) to enrich the features available to machine learning algorithms. On the one hand, the network structure and the temporal evolution are key for the detection of hoaxes; on the other hand, the gender of the receiver of a message, their age and details of their profile help in the detection of offensive messages. In addition, the type of multimedia content, links, lexical diversity, and the way in which the content propagates through the network facilitate the discovery of constructive content. Thus, the heterogeneity of the content determines its semantics.

Task 4.3 Methods for knowledge generation

Knowledge generation requires organizing, normalizing, tabulating and categorizing large amounts of data to generate additional information. It usually needs a considerable volume of data, often drawn from various sources, e.g., publications, patents, and web resources such as forums and social networks. Knowledge discovery may be associated with a specific context (e.g., it may make use of controlled languages or ontologies of a given domain). Thus, knowledge generation must face both quantitative issues, due to the volume of data, and qualitative ones, due to the data processing required. Once obtained, it provides domain knowledge that can be converted, for example, into a rule-based system or other inference engines based on machine learning. In the scope of the present proposal, this task will integrate the different digital models obtained into a homogeneous formalism, so that they can be accessed uniformly and manipulated efficiently. Examples are the knowledge graphs, ontologies or databases derived from the content modeling.
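A minimal sketch of what such a homogeneous formalism could look like is given below, assuming a knowledge graph stored as subject-predicate-object triples. The `KnowledgeGraph` class, the `normalize` helper and the example entities are hypothetical and serve only to show normalization and uniform access over data from different sources.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Normalize entity names so that records from different sources merge."""
    return " ".join(name.strip().lower().split())


class KnowledgeGraph:
    """Toy triple store: subject -> set of (predicate, object) pairs."""

    def __init__(self) -> None:
        self._out = defaultdict(set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._out[normalize(subject)].add((predicate, normalize(obj)))

    def objects(self, subject: str, predicate: str) -> list:
        """Uniform access: every object linked to `subject` by `predicate`."""
        pairs = self._out.get(normalize(subject), ())
        return sorted(o for (p, o) in pairs if p == predicate)


kg = KnowledgeGraph()
# The same fact arriving from two hypothetical sources with different spelling.
kg.add("  User_A ", "posted", "Claim 1")
kg.add("user_a", "posted", "claim  1")
kg.add("user_a", "follows", "User_B")
print(kg.objects("USER_A", "posted"))
```

Because names are normalized on insertion, the two spellings of the same fact collapse into a single triple, which is the kind of integration step this task describes.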

Task 4.4 Learning methods for the generation of automated, interpretable and explainable inference models

This task defines the set of algorithms used to integrate the results inferred from the knowledge bases. A key pillar of knowledge generation is how to represent the knowledge once it has been obtained, so that it can be used conveniently and in different contexts. To this end, data must be related through the association of attributes and features, yielding a data model expressive enough to represent the knowledge obtained while remaining computationally tractable. Thus, once the formalism or formalisms for knowledge representation (knowledge graphs, ontologies, databases, …) have been established, it is necessary to define the set of algorithms that allow inferences to be drawn from them and, therefore, new knowledge to be discovered. Particularly relevant in the context of the current proposal are knowledge graphs handled with Graph Neural Networks (GNNs). A GNN is a neural network that can be applied directly to graphs. It provides a formalism that is particularly well suited to prediction tasks. For example, given a graph configuration, a GNN can predict the edge most likely to be added to the graph. This type of inference makes intensive use of machine learning techniques, in particular those rooted in neural networks and deep learning. Therefore, the objective of this task is to determine suitable machine learning algorithms, either supervised (in the case of classical learning) or semi-supervised (for end-to-end models pre-trained on large datasets). In the latter case, a suitability analysis of fine-tuning techniques for each scenario (detection of beneficial and malicious content) is required in order to design the final models adapted to each problem.
In addition, to address the opacity or lack of transparency of machine learning algorithms, appropriate explainability and interpretability techniques will be studied and applied for each scenario, in order to understand the decisions the models make and to determine whether their predictions are biased. The topology of GNNs can also be particularly useful, since it is a direct consequence of the semantics of the problem being modeled and can therefore help to interpret the results obtained from the network, according to the semantics of the nodes and edges involved in predicting the solution.
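The edge prediction example mentioned in this task can be sketched as follows. This is a deliberately simplified illustration, not a trained GNN: it performs a single round of mean-aggregation message passing with no learned weights and scores candidate edges by the dot product of the resulting node vectors. The node names, feature vectors and function names are assumptions.

```python
def mean_aggregate(features: dict, adjacency: dict) -> dict:
    """One message-passing step: average a node's vector with its neighbours'."""
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in adjacency.get(node, ())]
        updated[node] = [sum(col) / len(group) for col in zip(*group)]
    return updated


def score(embeddings: dict, u: str, v: str) -> float:
    """Dot-product similarity used as an (untrained) link score."""
    return sum(a * b for a, b in zip(embeddings[u], embeddings[v]))


def most_likely_new_edge(features: dict, edges: list) -> tuple:
    """Return the non-existing edge with the highest score."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    embeddings = mean_aggregate(features, adjacency)
    nodes = sorted(features)
    candidates = [
        (u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if v not in adjacency.get(u, set())
    ]
    return max(candidates, key=lambda edge: score(embeddings, *edge))


# Illustrative two-dimensional node features and an undirected edge list.
features = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.5, 0.5], "d": [0.0, 1.0]}
edges = [("a", "b"), ("a", "c")]
predicted = most_likely_new_edge(features, edges)
print(predicted)
```

Because every score can be traced back to the nodes and edges that contributed to each aggregated vector, even this toy version hints at how graph topology can support interpretation. A real model would learn the aggregation weights from data.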

Task 4.5 Study of methods and strategies to mitigate bias in machine learning techniques

Systems built on biased models can lead to discriminatory treatment of certain groups, and the ability to assess the neutrality of an algorithm or model is increasingly required. To support fair problem solving, this task works on facilitating access to bias evaluation metrics and mitigation algorithms, which makes it highly relevant. As a result, we expect to obtain tools that are publicly available to the scientific community, helping to encourage more equitable solutions to machine learning problems.
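To make the notion of a bias evaluation metric concrete, the sketch below computes two commonly used group fairness gaps for a binary classifier: the demographic parity difference (gap in the rate of positive predictions between groups) and the false positive rate gap. The text does not name specific metrics, so this choice, the function names and the toy labels are assumptions.

```python
def rate_by_group(values: list, groups: list) -> dict:
    """Mean of binary `values` within each group."""
    buckets = {}
    for value, group in zip(values, groups):
        buckets.setdefault(group, []).append(value)
    return {group: sum(vs) / len(vs) for group, vs in buckets.items()}


def demographic_parity_difference(y_pred: list, groups: list) -> float:
    """Largest gap between groups in the share of positive predictions."""
    rates = rate_by_group(y_pred, groups)
    return max(rates.values()) - min(rates.values())


def false_positive_rate_gap(y_true: list, y_pred: list, groups: list) -> float:
    """Largest gap between groups in the rate of wrongly flagged negatives."""
    negatives = [(p, g) for t, p, g in zip(y_true, y_pred, groups) if t == 0]
    rates = rate_by_group([p for p, _ in negatives], [g for _, g in negatives])
    return max(rates.values()) - min(rates.values())


# Made-up example: 1 = message flagged as offensive, groups = author community.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

dp_gap = demographic_parity_difference(y_pred, groups)
fpr_gap = false_positive_rate_gap(y_true, y_pred, groups)
print(dp_gap, fpr_gap)
```

A value of zero for either gap would mean the classifier treats both groups alike on that criterion. Mitigation algorithms, which the task also covers, typically aim to shrink such gaps while preserving accuracy.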

Task 4.6 Predicting the dissemination and evolution of content

The objective of this task is to predict the web of influences among digital entities, their relationships and digital content, which determines the “viral behavior” and mutations of such content. Various theories, models and techniques will be applied in order to determine the patterns of behavior that lead information to become viral. It is here that network structure information (the network of entities through which the content propagates) and temporal information are of greatest interest.
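One family of models that could serve here is diffusion simulation over the entity network. The sketch below implements the independent cascade model, a standard diffusion model in which each newly activated entity gets one chance to pass the content to each neighbour with probability `p`. The task does not commit to this model, so the model choice, the graph, the probability and the function names are assumptions.

```python
import random


def independent_cascade(graph: dict, seeds: list, p: float, rng: random.Random) -> set:
    """Simulate one cascade and return the set of entities that were reached."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:  # each loop iteration is one time step of propagation
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in active and rng.random() < p:
                    active.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return active


def expected_reach(graph: dict, seeds: list, p: float, runs: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the average cascade size."""
    rng = random.Random(seed)
    total = sum(len(independent_cascade(graph, seeds, p, rng)) for _ in range(runs))
    return total / runs


# Illustrative directed "who can influence whom" network.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
reach = expected_reach(graph, ["a"], p=0.5)
print(reach)
```

The estimate depends on both ingredients highlighted in the task: the network structure (who can reach whom) and the temporal unfolding of the cascade, step by step.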