Facilitating Data Science with a Machine Learning Platform

Jerrold Stolk

19 Nov, 2023

Machine Learning: a platform perspective

Businesses are making increasing use of Machine Learning. Predicting behavior and automating decisions, two common applications of Machine Learning, have great value for organizations. Often these projects begin as separate innovation initiatives, but how do you ensure that the models built remain available and accurate? How do you keep an overview of which models are in development? More importantly, how do you facilitate Data Scientists so they can work quickly and securely? A mature Machine Learning Platform plays a crucial role in this.

The goals of this platform

More and more companies are discovering the added value of Machine Learning, and are employing one or more Data Scientists to develop models. A Data Scientist, like any software developer, must be supported with the right tools.

A Machine Learning Platform aims to organize and facilitate the data science process. Machine Learning processes need support in three areas: security, computing power and efficient model development (version control). We group these together under the heading 'organizing'.

'Facilitation' looks primarily at the process. Machine learning has fixed steps that must be supported within the organization, including data acquisition, training and deployment.

Organizing

What is needed in order to organize?

Security

Who has access to which code, models and data? How do we keep this data within a secure environment? Data science tasks performed in work environments that are not (fully) managed, such as local laptops, pose a major risk of losing personal or privacy-sensitive data. A Machine Learning Platform provides a secure working environment that shields data from unwanted access.

Computing power

Computing power is needed for two purposes: training models and hosting models. The requirements for the two environments are often different. A training environment is used for model learning. Training is characterized by a limited period of intensive computation that can often be parallelized: the work is split up and run simultaneously on multiple computing cores.
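To make the idea of splitting up training concrete, here is a minimal sketch in Python. It assumes a hypothetical `train_and_score` function that stands in for a single training run, and it uses a thread pool only to keep the example portable. For real, CPU-bound training a process pool or a compute cluster would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def train_and_score(learning_rate: float) -> tuple[float, float]:
    """Hypothetical stand-in for one training run; returns (learning_rate, score)."""
    # Toy score that peaks at learning_rate = 0.1; a real run would fit a model here.
    score = 1.0 - abs(learning_rate - 0.1)
    return learning_rate, score


def parallel_search(candidates: list[float], workers: int = 4) -> tuple[float, float]:
    """Run one training job per candidate concurrently and return the best result."""
    # Assumption: threads keep the sketch simple; swap in a process pool or a
    # cluster scheduler when the training work is CPU-bound.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(train_and_score, candidates))
    return max(results, key=lambda result: result[1])


best_rate, best_score = parallel_search([0.01, 0.05, 0.1, 0.5])
```

Each candidate is an independent job, which is what makes this kind of workload easy to spread over multiple cores or machines.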

Hosting models involves making the trained model available so that it is ready for use. There are two categories in this:

Batch Inference: Some use cases are set up so that a calculation is performed periodically on a larger amount of data. An example is KNMI's weather model, which runs eight times a day. As with training models, this requires intensive computing power for a limited period of time.

Real-Time Inference: Other use cases are characterized by real-time calculation of outcomes. An example is dynamic pricing, where a price is calculated instantly based on a person's characteristics and behavior. This requires continuously available computing power, with the capacity to handle peak loads. In practice a container solution is often used for this, with the option to add computing power dynamically.
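The difference between the two categories can be illustrated with a small Python sketch. The pricing formula, the function names and the record fields below are assumptions made up for illustration; they do not describe a real pricing model or a specific platform API.

```python
def predict_price(base_price: float, demand_factor: float) -> float:
    """Hypothetical stand-in for a trained pricing model."""
    return round(base_price * (1.0 + 0.2 * demand_factor), 2)


def batch_inference(records: list[dict]) -> list[float]:
    """Score a whole dataset in one scheduled run, for example a few times a day."""
    return [predict_price(r["base_price"], r["demand_factor"]) for r in records]


def realtime_inference(request: dict) -> dict:
    """Score a single request immediately, as an always-on service would."""
    price = predict_price(request["base_price"], request["demand_factor"])
    return {"customer_id": request["customer_id"], "price": price}
```

The model is the same in both cases. What differs is how it is called: once over many records on a schedule, or once per incoming request behind a service that has to stay available.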

Version control

A data science process is iterative. It is not a linear process in which you plan the details in advance and then work them out. Instead, it is a process of discovery, adjustment, adaptation and testing. Version tracking is essential in this context, and this tracking is done in two areas:

Version control for code: A Data Scientist usually writes code to train a model. In addition, code is written to streamline model inference. It must be possible to preserve versions of this code, so that it is always possible to go back to a previous version.
Version control for models: Each time a model is trained, it has new characteristics. When the model is further adjusted, the scores must be compared, and to do this it is necessary to be able to go back to an earlier version of a model. This also requires version control.
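As a rough illustration of model version control, the sketch below keeps an in-memory record of each trained model together with the revision of the code that produced it. `ModelRegistry` and its fields are hypothetical names chosen for this example. A real platform would persist this information and store the model artifacts as well.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelVersion:
    version: int
    code_revision: str  # revision of the training code, e.g. a commit hash
    params: dict        # settings used for this training run
    score: float        # validation score, higher is better


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry; a real one would persist its entries."""
    versions: list = field(default_factory=list)

    def register(self, code_revision: str, params: dict, score: float) -> ModelVersion:
        """Store a newly trained model as the next version."""
        entry = ModelVersion(len(self.versions) + 1, code_revision, params, score)
        self.versions.append(entry)
        return entry

    def get(self, version: int) -> ModelVersion:
        """Go back to an earlier version (versions start at 1)."""
        return self.versions[version - 1]

    def best(self) -> ModelVersion:
        """Compare scores across versions and return the best one."""
        return max(self.versions, key=lambda v: v.score)
```

Linking each model version to a code revision ties the two areas of version control together: for any score you can look up both the model and the code that produced it.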

Facilitating

Now that we've seen how a Machine Learning Platform organizes, it's time to look at how the data science process is facilitated. What phases should be supported?

Data acquisition

Correct data is essential for a well-functioning model. Data acquisition covers obtaining this data and making it available for Machine Learning purposes. A Machine Learning Platform plays an enabling role in this. Datasets must be discoverable and of good quality before they can be used directly in Machine Learning processes. A data platform, the role of Data Engineers and good cooperation between the different disciplines are also important in this process.
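What "good quality" means depends on the use case, but a basic check can be automated. The Python sketch below counts missing values in the columns a model needs and decides whether the dataset is usable. The function names and the 10% threshold are assumptions for illustration only.

```python
def quality_report(rows: list[dict], required: list[str]) -> dict:
    """Count missing values per required column in a dataset."""
    missing = {column: 0 for column in required}
    for row in rows:
        for column in required:
            if row.get(column) is None:
                missing[column] += 1
    return {"rows": len(rows), "missing": missing}


def is_usable(report: dict, max_missing_ratio: float = 0.1) -> bool:
    """Accept the dataset only if every required column is mostly filled in."""
    if report["rows"] == 0:
        return False
    return all(
        count / report["rows"] <= max_missing_ratio
        for count in report["missing"].values()
    )
```

A check like this can run each time a dataset is published, so that Data Scientists only find datasets that meet the agreed minimum.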

Model Training

Model Training is the central step in the Machine Learning process. This step results in a model that can be validated, tested and published. To carry this out, a Data Scientist needs a wide range of tools and frameworks. These can include development environments in Python or R and deep learning frameworks such as PyTorch. A Machine Learning Platform provides this environment. These tools and frameworks can be offered in a GUI, in a managed environment or in a linked, proprietary environment.

In practice, Data Scientists prefer environments that can be set up flexibly, with open source tools and frameworks for Model Training. The platform in turn makes it possible to record and monitor the model training process.

Model Deployment

Once a model has been trained and validated, the time has come to put it into action. This results in a model that can be used repeatedly. The computing setup required for this repeated use differs between Batch Inference and Real-Time Inference.

Both environments require a process to track versions and roll out new versions. A Machine Learning Platform helps track and automate this process so that new versions can be delivered at the push of a button. After deployment, the deployed versions can be monitored for performance (is the model online?) and for model fit (do the results match the truth?). Either measure can be a reason to make adjustments. This makes the above steps an iterative and repetitive process.
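The two kinds of monitoring mentioned above can be sketched in a few lines of Python. This is a simplified illustration: the function names, the use of mean absolute error as the measure of model fit and the threshold value are all assumptions, not a description of a specific platform.

```python
def mean_absolute_error(predictions: list[float], actuals: list[float]) -> float:
    """Average absolute difference between predicted and observed values."""
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must be non-empty and equally long")
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(predictions)


def needs_attention(
    is_online: bool,
    predictions: list[float],
    actuals: list[float],
    max_error: float,
) -> list[str]:
    """Return the reasons, if any, to adjust or retrain a deployed model."""
    reasons = []
    if not is_online:
        reasons.append("model offline")  # performance check
    if mean_absolute_error(predictions, actuals) > max_error:
        reasons.append("model fit degraded")  # model-fit check
    return reasons
```

An empty result means the deployed version can stay in place. Any reason in the list is a trigger to start the next iteration of training and deployment.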

"With a Machine Learning Platform, the power of your Data Scientists and Engineers is fully leveraged."

Without using such a platform, your organization runs risks in the area of data security and the transparency of your Machine Learning processes.
