Two scientific challenges in the global approach to data management for machine learning

29 Jan 2024 | Data lifecycle, Uncategorized

Designing a component based on machine learning requires mastering the life cycle of the data and knowledge used for its training and qualification. If the data is not reliable, the AI that processes it cannot be either. Thus, to guarantee trust, data and knowledge management must be put in place throughout the life cycle of the learning-based solution. This objective leads to two main scientific challenges:

  • The definition of a data and knowledge engineering process that is not tied to any specific tool;
  • Control of the data life cycle, including the characterization and maintenance of data quality.


Definition of a data and knowledge engineering process

Data engineering is an iterative process requiring a deep understanding of the data and the problem to be solved. It plays a crucial role in preparing data for machine learning and directly impacts model performance. This engineering includes several stages:

  1. Data collection:
    1. First, the task that machine learning must perform (classification, prediction, content generation, etc.) must be specified.
    2. The relevant information to collect must be identified from the definition of the operational design domain (ODD: Operational Design Domain) of the system/solution.
    3. The type of data needed to build an effective model is then specified: the data can be structured (tables, databases, etc.) or unstructured (text, images, etc.). The appropriate data sources must also be identified.
    4. Next, we must define the sample size, that is to say the quantity of data necessary to train and evaluate the model reliably. The sample size will depend on the type of problem, the complexity of the model, and the variability of the data.
    5. The next step is to ensure that the data collected is of high quality, which means minimizing measurement errors, outliers, missing data and other similar issues. It is also important to check compliance with regulations such as data confidentiality or the absence of bias.
    6. Finally, since data can become obsolete over time, a data maintenance strategy must be put in place.
  2. Data analysis: After this collection stage, data analysis allows us to understand the structure, quality and availability of the data. This may require data visualization activities, statistical analysis, or the identification of missing or outlier values (steps 2, 4 and 5 are illustrated in the first sketch after this list).
  3. Variable selection: This consists of selecting the variables relevant for training the model, using expert knowledge, statistical analysis or automated tools.
  4. Data cleaning: Once anomalies and missing values have been identified, they must be handled with appropriate techniques: replacing missing values with an estimate (mean, median, etc.), deleting the rows or columns that contain them, eliminating outliers, normalization, etc.
  5. Data transformation: Depending on the nature of the data and the type of learning model, it may be useful to transform the data using techniques such as dimensionality reduction or normalization. Creating new variables from existing ones, by applying mathematical operations or combining existing variables, can also improve model performance.
  6. Data labeling: If the data requires labeling (for supervised learning, for example), how the data will be labeled must be specified, for instance through the use of active learning techniques (see the second sketch after this list).
  7. Separation of the data sets: The data must then be divided into training, validation and test sets. The training set is used to train the model, the validation set to adjust its hyperparameters, and the test set to evaluate its final performance (steps 7 to 10 are illustrated in the last sketch of this section).
  8. Training the model: This involves choosing an appropriate algorithm, tuning the model’s hyperparameters, providing the training data, and evaluating the performance of the model as it learns.
  9. Model validation and tuning: After initial training of the model, it is important to validate it through the validation set. This step allows you to adjust the model hyperparameters, select the best features and improve the overall performance of the model.
  10. Model evaluation: Once the model is validated and adjusted, it is evaluated using the test set to estimate its performance on previously unseen data. Evaluation can be carried out using various performance measures such as accuracy, precision, recall, F-measure, etc.
  11. Finally, iteration: This engineering process is often iterative, meaning it may require multiple cycles of collecting, cleaning, transforming, training, validating and adjusting the model to gradually improve its performance and obtain more precise results.
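As an illustration of steps 2, 4 and 5, here is a minimal sketch based on pandas and scikit-learn. The file name, the column names and the techniques chosen (median imputation, interquartile-range filtering of outliers, standardization) are assumptions made for the example, not prescriptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: the file name and the column names are placeholders.
df = pd.read_csv("training_data.csv")  # columns: age, income, label

# -- Data analysis: structure, quality and availability --
df.info()                          # types and non-null counts
print(df.describe())               # basic statistics
print(df.isna().mean())            # share of missing values per column

# -- Data cleaning: impute missing values, remove outliers --
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# -- Data transformation: normalize the numeric variables --
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
```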

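For the labeling step (item 6), the following minimal uncertainty-sampling sketch works on synthetic data; active learning admits other query strategies, and the classifier, the size of the seed set and the number of rounds are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic pool standing in for the real, collected but unlabeled data.
X_pool, y_pool = make_classification(n_samples=500, n_features=10, random_state=0)

labeled = list(range(10))                                  # small labeled seed set
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                                        # 20 labeling rounds
    model.fit(X_pool[labeled], y_pool[labeled])

    # Uncertainty sampling: query the example the model is least sure about.
    proba = model.predict_proba(X_pool[unlabeled])
    query = unlabeled[int(np.argmax(1 - proba.max(axis=1)))]

    # A human annotator would provide the label; the synthetic label stands in for it.
    labeled.append(query)
    unlabeled.remove(query)
```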
In conclusion, the data engineering process includes collection, exploration, cleaning, transformation, data set separation, model training, validation and tuning, model evaluation and iteration. It must rely on a rigorous methodology that is not tied to any particular tool, in order to design quality machine learning: finding and preparing the variables that will be used to train the model so as to obtain a more informative and relevant representation of the data.
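To make the last stages of this sequence concrete (separation of the data sets, training, tuning on the validation set, evaluation on the test set), here is a minimal scikit-learn sketch; the synthetic data, the classifier and the hyperparameter values explored are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Separation into training (60%), validation (20%) and test (20%) sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune one hyperparameter on the validation set.
best_depth, best_score = None, -1.0
for depth in (2, 5, 10, None):
    candidate = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    score = f1_score(y_val, candidate.predict(X_val))
    if score > best_score:
        best_depth, best_score = depth, score

# Final evaluation on the test set, never used during training or tuning.
final = RandomForestClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
y_pred = final.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F-measure:", f1_score(y_test, y_pred))
```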


Mastery of the data life cycle including the characterization and maintenance of data quality

As mentioned above, in the context of learning, the data lifecycle is generally divided into several stages:

  1. Data collection;
  2. Data preparation, where the raw data is analyzed, cleaned, transformed and, if necessary, labeled;
  3. Model training: In this step, the machine learning model is trained from the prepared data. This involves selecting a learning algorithm suited to the problem, setting the model parameters, separating the data into training and test sets, and then passing the training data through the model to adjust its parameters;
  4. Model evaluation: Once the model is trained, it is evaluated using the test data, which makes it possible to measure its performance and its ability to generalize to new data;
  5. Model optimization: Based on these results, the model can be optimized by adjusting its hyperparameters, changing its structure, adding or removing features, etc.;
  6. Deployment and monitoring: Once the model has been trained and optimized, it can be deployed on new data. However, the model must be regularly monitored to ensure that it keeps working properly and maintains good performance; it can sometimes be retrained as new data becomes available (a minimal monitoring sketch follows this list).
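As an illustration of this monitoring, here is a minimal sketch in which the performance of the deployed model is measured on successive labeled batches and retraining is triggered when it falls below a threshold; the threshold, the synthetic batches and the choice of the F-measure as the monitored metric are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

F1_ALERT_THRESHOLD = 0.80     # illustrative value, to be chosen per application

# Initial training (synthetic data stands in for the real, prepared data).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

seen_X, seen_y = X, y
for batch in range(6):        # successive batches of production data
    # Each batch uses a different random seed, standing in for a change in the data.
    X_new, y_new = make_classification(n_samples=200, n_features=10, random_state=batch + 1)
    score = f1_score(y_new, model.predict(X_new))
    print(f"batch {batch}: F1 = {score:.2f}")

    if score < F1_ALERT_THRESHOLD:
        # Performance drop detected: retrain on all data available so far.
        seen_X = np.vstack([seen_X, X_new])
        seen_y = np.concatenate([seen_y, y_new])
        model = RandomForestClassifier(random_state=0).fit(seen_X, seen_y)
```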

This data life cycle is iterative: new data can be collected, and new models can be developed and improved over time.

The quality of the data is thus decisive. If the training data is of poor quality, of low relevance, or affected by errors, missing values, noise or bias, the model risks providing inaccurate or biased answers. It is therefore necessary to evaluate data quality at each of the stages described above, that is to say throughout the data life cycle, and to propose methods and tools for improving or maintaining the required level of quality.

To this end, tool-supported methods must be put in place to allow data profiling based on a clear specification of the attributes/criteria that contribute to the quality of the data, and therefore of the learning model: accuracy, relevance, completeness, reliability, absence of bias, validity, etc. The evaluation of each of these attributes must rely on clearly defined metrics, indicators or methods, which may depend on the type of data and/or the underlying learning approach. Finally, once the evaluation has been carried out, tool-supported approaches are needed to guarantee the required level of quality, or even to offer correction methods in the event of model “drift”.
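As an illustration, here is a minimal profiling sketch that computes three of the attributes mentioned above (completeness, validity, uniqueness) for each column of a dataset; the data and the validity rule are hypothetical, and a real profiling tool would cover more attributes (accuracy, bias, reliability, etc.) with metrics adapted to the type of data.

```python
import pandas as pd

# Hypothetical raw data; in practice this would be the collected dataset.
df = pd.DataFrame({
    "age":   [25, 31, None, 47, 200],                 # 200 violates the validity rule
    "email": ["a@x.fr", "b@x.fr", "b@x.fr", None, "c@x.fr"],
})

# Validity rules, defined per attribute by domain experts (illustrative here).
validity_rules = {"age": lambda s: s.between(0, 120)}

profile = {}
for col in df.columns:
    completeness = 1 - df[col].isna().mean()           # share of non-missing values
    uniqueness = df[col].nunique() / len(df)           # share of distinct values
    if col in validity_rules:
        valid = validity_rules[col](df[col])
        validity = valid[df[col].notna()].mean()       # valid among the known values
    else:
        validity = None
    profile[col] = {"completeness": completeness,
                    "uniqueness": uniqueness,
                    "validity": validity}

print(pd.DataFrame(profile).T)                         # one row per column
```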