Work modules
Module 4. Design of methods and algorithms
This module is the most extensive in the project, as it comprises the tasks with the greatest research burden. Following the detailed analysis of the domains, scenarios, current techniques, information sources and specific challenges to be solved, in this module we will identify the expected response of future solutions. The module will explore new methods and techniques that advance beyond the state of the art and provide a breakthrough in the search for solutions to the problems indicated. To model digital content, all the information related to the context where this content is produced and consumed is relevant: its textual content, the profiles of the digital entities involved in the exchange, and the network structure maintained by these entities. This requires a novel approach to the construction of models, which must deal with heterogeneous features (text, network structure, temporal evolution, additional features present…). From feature engineering to end-to-end solutions, the holistic reality of digital content must be considered in the search for models that support inference processes (classification and decision making) for the detection of both beneficial (hope speech and secure content) and malicious (hate speech and fake news) content.
Milestones
- Identification of relevant heterogeneous features and study of the application of methods of extraction, selection and transformation of these features.
- Language modeling based on the above features.
- Study and design of the most appropriate representation to model the content.
- Review of the state of the art in supervised and semi-supervised learning methods, applied to the generation of inference models.
- Study and design of methods and strategies to mitigate bias in machine learning techniques.
- Study and design of methods and resources for mutation prediction and content viralization.
Deliverables
- Report with relevant heterogeneous features and application methods.
- Language modeling representation.
- Knowledge graphs and GNNs derived from content modeling.
- Report on the most suitable supervised and semi-supervised learning methods.
- Report on methods and strategies to mitigate bias in machine learning techniques.
- Report on methods and resources for mutation prediction and content viralization.
Task 4.1 Methods for the extraction, selection and transformation of features
This task will determine the most relevant heterogeneous features to be used in the subsequent modeling, taking into account that the aim is to extract the high-level semantics associated with the content, characterizing the relationships between digital entities in aspects such as contradiction, congruence, polarization, bias, emotions, irony, etc. Features can have diverse formats and origins. Specifically, we have three main sources of data: social media, knowledge bases, and corpora of available or generated data. From these sources we will select and filter according to the scenario under study. Thus, for example, in the case of fake news analysis, networks such as Twitter and online press would be sources to consider, but also news corpora and geographic and factual knowledge bases. In the case of hate speech, the search for toxic content can be restricted to a specific community by topic (racism, gender-based violence, bullying) or age (networks most used by teenagers, such as TikTok, Twitch or Instagram). In other words, the specific scenario provides the context that guides the selection of sources and the extraction of features (profile information, messages, timestamps, network structure…). From the extracted features, those considered decisive for solving a given problem will be selected, and in some cases they will need to be transformed to suit the purpose to which they are applied.
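The extract, select and transform steps described in this task can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the message schema (`text`, `timestamp`, `followers`, `reply_ids`), the five features chosen, and the use of a variance threshold as the selection criterion. The proposal does not prescribe any of them.

```python
import math
from datetime import datetime, timezone

def extract_features(message):
    """Map one message record (hypothetical schema) to a flat feature dict."""
    tokens = message["text"].lower().split()
    posted = datetime.fromtimestamp(message["timestamp"], tz=timezone.utc)
    return {
        # Textual features
        "n_tokens": float(len(tokens)),
        "lexical_diversity": len(set(tokens)) / len(tokens) if tokens else 0.0,
        # Profile feature, log-transformed to compress its range
        "log_followers": math.log1p(message["followers"]),
        # Temporal feature
        "hour_utc": float(posted.hour),
        # Network feature: number of direct replies
        "n_replies": float(len(message["reply_ids"])),
    }

def select_by_variance(rows, min_variance=1e-9):
    """Keep only the features whose variance across rows exceeds min_variance."""
    keep = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance > min_variance:
            keep.append(name)
    return [{name: row[name] for name in keep} for row in rows]

# Two toy messages posted within the same hour, so "hour_utc" is constant.
messages = [
    {"text": "Breaking news breaking now", "timestamp": 0,
     "followers": 10, "reply_ids": [1, 2]},
    {"text": "Calm and kind words help", "timestamp": 60,
     "followers": 1000, "reply_ids": []},
]
rows = select_by_variance([extract_features(m) for m in messages])
print(sorted(rows[0]))
```

In this toy input the constant feature `hour_utc` is dropped by the selection step, while the textual, profile and network features are kept. A real selection step would more likely rely on a task-aware criterion such as relevance to the target label.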
Task 4.2 Methods for content modelling from heterogeneous features
Once these heterogeneous features have been extracted, selected and transformed, the objective of this task is to model the content for language models that take into account the linguistic characteristics of the different domains and scenarios. In this case, features of all levels (lexical, syntactic, semantic, discourse and pragmatic) will be taken into account. Current methods of digital content analysis aim to integrate data of diverse nature (message text, text associated with an image, network structure, user data, lexicons and specialized ontologies…) to enrich the features available to machine learning algorithms. For instance, the network structure and temporal evolution are key for the detection of hoaxes, while the gender of the receiver of a message, their age and details of their profile help in the detection of offensive messages. In addition, the type of multimedia content, links, lexical diversity, or the way in which the content is propagated through the network facilitate the discovery of constructive content. Thus, the heterogeneity of the content determines its semantics.
Task 4.3 Methods for knowledge generation
Knowledge generation requires organizing, normalizing, tabulating and categorizing large amounts of data to generate additional information. It often requires a considerable volume of data, frequently drawn from various sources, e.g., publications, patents, and web resources such as forums and social networks. Knowledge discovery may be associated with a specific context (e.g., it may make use of controlled languages or ontologies of a given domain). Thus, knowledge generation must face both quantitative issues, due to the volume of data, and qualitative ones, due to the necessary data processing. Once obtained, it provides domain knowledge that can be converted, for example, into a rule-based system or other inference engines based on machine learning. In the scope of the present proposal, this task will integrate the different digital models obtained into a homogeneous formalism, so that they can be uniformly accessed and efficiently manipulated. Such is the case of knowledge graphs, ontologies or databases derived from the content modeling.
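As an illustration of what a homogeneous formalism with uniform access might look like, the following sketch stores facts from different sources as (subject, predicate, object) triples, the basic building block of a knowledge graph. The `TripleStore` class, the identifier prefixes (`user:`, `msg:`, `place:`) and the predicates are hypothetical and chosen only for this example.

```python
class TripleStore:
    """Minimal in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple for triple in self._triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )

store = TripleStore()
# Facts that could come from a social network ...
store.add("user:ana", "posted", "msg:1")
store.add("user:ben", "shared", "msg:1")
# ... from text analysis of the message ...
store.add("msg:1", "mentions", "place:Madrid")
# ... and from a geographic knowledge base.
store.add("place:Madrid", "located_in", "country:Spain")

# The same query interface serves every source.
who_touched_msg = store.match(obj="msg:1")
print(who_touched_msg)
```

Because every source is reduced to the same triple shape, a single `match` call can answer questions that span social, textual and factual data.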
Task 4.4 Learning methods for the generation of automated, interpretable and explainable inference models
This task defines the set of algorithms used for the integration of the results inferred from the available knowledge bases. A key pillar of knowledge generation is how to represent this knowledge once obtained, so that it can be used conveniently and in different contexts. For this to occur, it is necessary to relate data by associating attributes and characteristics, which results in a data model with enough expressive capacity to represent the knowledge obtained, while remaining computationally manageable. Thus, once the formalism or formalisms for knowledge representation (knowledge graphs, ontologies, databases…) have been established, it is necessary to define the set of algorithms that allow inferences to be made from them and, thus, to discover new knowledge. Particularly relevant in the context of the present proposal are knowledge graphs combined with Graph Neural Networks (GNNs). A GNN is a neural network that can be applied directly to graphs. It provides a formalism that is particularly suitable for prediction tasks. For example, with a given configuration, a GNN can predict the arc that is most likely to be added to the graph. This type of inference requires intensive use of machine learning techniques, particularly those rooted in neural networks and deep learning. Thus, the objective of this task is to determine supervised (in the case of classical learning) or semi-supervised (for end-to-end models pre-trained on large data sets) machine learning algorithms. In the case of the latter, an analysis of the suitability of tuning techniques for each scenario (detection of beneficial content and malicious content) is required in order to design the final models adapted to each problem.
In addition, to deal with the opacity of machine learning algorithms, the most appropriate explainability and interpretability techniques will be studied and applied to each scenario, in order to understand the decisions made by the models and to determine whether their predictions are biased. Here the topology of the GNNs can be particularly useful: it is a direct consequence of the semantics of the problem being modelled and can therefore help to interpret the results obtained from the network, according to the semantics of the nodes and arcs that participate in the prediction reached.
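The arc-prediction example mentioned in this task can be sketched with a deliberately simplified stand-in for a GNN. It performs one round of mean neighbourhood aggregation (the message-passing step at the core of GNNs) with no learned weights, then scores each missing arc by the dot product of the resulting node embeddings. The toy graph, the node features and the scoring rule are assumptions. A real GNN would learn its aggregation weights from data.

```python
def aggregate(features, adjacency):
    """One message-passing round: average each node with its neighbours."""
    embeddings = {}
    for node, vector in features.items():
        group = [vector] + [features[n] for n in adjacency.get(node, ())]
        embeddings[node] = [sum(column) / len(group) for column in zip(*group)]
    return embeddings

def score(u, v, embeddings):
    """Dot product of two node embeddings; higher means a more likely arc."""
    return sum(a * b for a, b in zip(embeddings[u], embeddings[v]))

def most_likely_arc(features, edges):
    """Return the missing undirected arc with the highest score."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    embeddings = aggregate(features, adjacency)
    nodes = sorted(features)
    candidates = [
        (u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if v not in adjacency.get(u, ())
    ]
    return max(candidates, key=lambda arc: score(arc[0], arc[1], embeddings))

# Toy path graph a - b - c - d with two-dimensional node features.
features = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [1.0, 0.0], "d": [0.0, 1.0]}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(most_likely_arc(features, edges))  # ('a', 'c')
```

The result is also easy to interpret from the topology: `a` and `c` share the neighbour `b` and have similar features, which is the kind of node-and-arc-level explanation this task refers to.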
Task 4.5 Study of methods and strategies to mitigate bias in machine learning techniques
Systems built on biased models can lead to discriminatory treatment of certain groups, and the ability to assess the neutrality of an algorithm or model is increasingly required. To help solve problems fairly, this task works on facilitating access to bias evaluation metrics and mitigation algorithms, which makes it a very relevant task. As a result, we expect to obtain tools that are publicly available to the scientific community, helping to encourage the resolution of machine learning problems in a more equitable way.
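As an example of the kind of bias evaluation metric this task refers to, the sketch below computes two common group-fairness measures for a binary classifier: the demographic parity difference (gap in positive prediction rates between groups) and the gap in true positive rates. The choice of these two metrics and the toy data are assumptions. The proposal does not commit to specific metrics.

```python
def positive_rate(predictions, groups, group):
    """Share of positive predictions among the members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)

def demographic_parity_difference(predictions, groups):
    """Largest gap in positive prediction rate between any two groups."""
    rates = [positive_rate(predictions, groups, g) for g in set(groups)]
    return max(rates) - min(rates)

def true_positive_rate_gap(predictions, labels, groups):
    """Largest gap in recall on the positive class between any two groups.

    Assumes every group contains at least one positive label.
    """
    rates = []
    for group in set(groups):
        hits = [p for p, y, g in zip(predictions, labels, groups)
                if g == group and y == 1]
        rates.append(sum(hits) / len(hits))
    return max(rates) - min(rates)

# Toy data: 1 = message flagged as offensive, two author groups A and B.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
labels      = [1, 1, 0, 0, 1, 1, 0, 0]
groups      = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(demographic_parity_difference(predictions, groups))   # 0.5
print(true_positive_rate_gap(predictions, labels, groups))  # 0.5
```

A value of 0 on either metric would indicate that both groups are treated alike by that criterion. Here group A is flagged three times as often as group B even though both groups contain the same number of truly offensive messages.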
Task 4.6 Predicting the dissemination and evolution of content
In this task the objective is to predict the web of influences between digital entities, their relationships and digital content, which determines the “viral behavior” and mutations of such content. Various theories, models and techniques will be applied to identify the patterns of behavior that cause information to go viral. It is here that the network structure information (the network of entities through which the content propagates) and temporal information are of greatest interest.
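One classical way to combine network structure and time steps when studying viral behavior is the independent cascade model, sketched below. It is offered only as an example of the family of diffusion models that could be considered. The follower graph, the uniform propagation probability `prob` and the function names are assumptions, not choices made by the proposal.

```python
import random

def independent_cascade(graph, seeds, prob, rng):
    """Simulate one cascade; return the set of entities reached.

    graph maps each entity to the entities it can pass content to.
    At each time step, every newly activated entity gets a single chance
    to activate each inactive neighbour, succeeding with probability prob.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in active and rng.random() < prob:
                    active.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return active

def expected_reach(graph, seeds, prob, runs=1000, seed=0):
    """Monte Carlo estimate of the average number of entities reached."""
    rng = random.Random(seed)
    total = sum(len(independent_cascade(graph, seeds, prob, rng))
                for _ in range(runs))
    return total / runs

# Hypothetical follower graph: a's content can reach b and c, and so on.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

print(independent_cascade(graph, ["a"], 1.0, random.Random(0)))
print(expected_reach(graph, ["a"], 0.5))
```

With `prob` set to 1.0 the content reaches every entity reachable from the seed, and with 0.0 it never leaves the seed. Intermediate values give an estimate of how far a piece of content is expected to spread from a given starting point.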