In the previous post of our series, we talked about Data Engineering and Exploratory Data Analysis. In this post, we will cover Modeling, the area with the most weight (36%) on the exam.
One very important subject in the test (and also when using Machine Learning to solve real problems) is how to measure your results. The first step is always to split your data into a train and a test set: the test set should be representative of the kind of data you expect to see when the model goes live, and the model should never "see" this data during training or hyperparameter tuning. After that, you can either split your training set again into train and validation sets (using the train set to fit the model and the validation set to tune hyperparameters) or achieve the same goal with K-Fold Cross-Validation.
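To make this concrete, here is a minimal sketch using scikit-learn (just an illustration, not something specific to the exam): we hold out a test set, run 5-fold cross-validation on the training data only, and touch the test set exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data just for the example
X, y = make_classification(n_samples=1000, random_state=42)

# Hold out a test set the model never sees during training or tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)

# 5-fold cross-validation on the training set only, e.g. for hyperparameter tuning
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy per fold:", scores)

# Final evaluation on the untouched test set
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```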
Also important is which metric(s) to use. This choice depends heavily on your problem (e.g., regression or classification), the kind of data you have (especially the imbalance between classes), and the business context (for example, is a false negative worse than a false positive?). We don't have space in this post to cover most metrics in detail, but here are the ones you're expected to know well before taking the test: accuracy, precision, recall (also called sensitivity), specificity, false positive and false negative rates, ROC AUC, PR AUC, F1 score, MAE (Mean Absolute Error), and MSE (Mean Squared Error).
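As a quick reference, here is a sketch of how several of these classification metrics can be computed with scikit-learn, using made-up labels and predicted probabilities purely for illustration:

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]          # ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]          # hard predictions
y_prob = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]  # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall (sensitivity):", recall_score(y_true, y_pred))
print("Specificity:", tn / (tn + fp))
print("False positive rate:", fp / (fp + tn))
print("False negative rate:", fn / (fn + tp))
print("F1 score:", f1_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_prob))
print("PR AUC:", average_precision_score(y_true, y_prob))
```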
Everyone who works with ML knows that a model is only as good as the data used to train it. But anyone who has had to manually label their data (instead of using a third-party labeled dataset) knows that this process is long, tedious, and hard to coordinate between multiple people. To help with this, AWS has a labeling service called SageMaker Ground Truth.
With SageMaker Ground Truth you can create labeling jobs (indicating what should be labeled and including good and bad examples) and distribute them to workers in three ways: using employees of your own company, using a licensed vendor, or using Amazon Mechanical Turk, which outsources the job to workers around the world. Beyond this, the most distinctive feature of Ground Truth is its automated data labeling: after some labels have been created by humans, Ground Truth trains an ML model to predict new labels (using only the ones with confidence over 95%), reducing the cost and time needed to label all your data.
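As a rough illustration, a labeling job can also be created programmatically through the boto3 `create_labeling_job` API. Everything below (bucket names, role, workteam, and Lambda ARNs) is a placeholder: a real job must point at actual resources, and AWS publishes region-specific built-in Lambdas for each task type.

```python
import boto3

sm = boto3.client("sagemaker")

# All names and ARNs below are placeholders, not real resources.
sm.create_labeling_job(
    LabelingJobName="demo-image-classification",
    LabelAttributeName="class",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://my-bucket/input.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://my-bucket/labels/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthRole",
    HumanTaskConfig={
        # A private workteam (your own employees); vendors and Mechanical Turk
        # are selected by pointing at a different workteam ARN.
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/template.liquid"},
        # Replace with the built-in Lambda ARNs AWS provides for your region:
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:000000000000:function:PRE-ImageMultiClass",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:000000000000:function:ACS-ImageMultiClass"
        },
        "TaskTitle": "Classify images",
        "TaskDescription": "Select the class that best describes each image",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
    },
    # Uncomment to enable the automated data labeling feature described above:
    # LabelingJobAlgorithmsConfig={"LabelingJobAlgorithmSpecificationArn": "<built-in algorithm ARN>"},
)
```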
After talking about how to evaluate your models and how to label your dataset, we will now discuss model training. To train models on AWS you use SageMaker, a platform where you can use any number and type of machines to train your models, paying only for the time you spend training.
There are three ways to create and train models using SageMaker. The first is to use SageMaker's built-in algorithms, which is the easiest and most optimized option. These built-in algorithms are AWS implementations of well-known ML algorithms, and right now there are 17 of them, including models for regression and classification with tabular data (Linear Learner, XGBoost), image classification, segmentation, topic modeling, natural language processing, forecasting, and clustering. The list of all algorithms and a detailed explanation of each one can be found here.
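For example, here is a minimal sketch of training the built-in XGBoost algorithm with the SageMaker Python SDK (v2); the role ARN and S3 paths are placeholders you would replace with your own:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

# Resolve the container image for the built-in XGBoost algorithm
container = image_uris.retrieve(
    "xgboost", region=session.boto_region_name, version="1.5-1"
)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgb-output/",
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# For the built-in XGBoost, CSV data has the label in the first column, no header
train_input = TrainingInput("s3://my-bucket/train/", content_type="text/csv")
estimator.fit({"train": train_input})
```

You only pay for the training instances while the job runs; the trained model artifact lands in the S3 output path.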
But suppose you want to train a custom neural network that is not available among the built-in algorithms. In that case, you can use the second option: script mode. In script mode, you provide a Python script containing your code and define the machines you want to use, similar to what you do with the built-in algorithms. The main limitation of this option is that your script must use one of the five available machine learning frameworks: TensorFlow, PyTorch, Apache MXNet, scikit-learn, or SparkML.
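A script-mode sketch using the SageMaker PyTorch estimator is shown below; `train.py` is a hypothetical user script, and the role and S3 paths are placeholders:

```python
from sagemaker.pytorch import PyTorch

# The framework container runs your script on the instances you choose;
# hyperparameters are passed to the script as command-line arguments.
estimator = PyTorch(
    entry_point="train.py",                # your training script (hypothetical)
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 10, "lr": 1e-3},
)
estimator.fit({"training": "s3://my-bucket/train/"})
```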
Finally, if you want to run a model using another framework or another language (like R), you'll have to use the Bring Your Own Algorithm option. In this case, you create a Docker container with your code, following AWS's specifications, and then the rest of the process works like the other options. This option involves more work on the developer's side but offers the most flexibility.
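From the SDK's point of view, a custom container looks just like a built-in one: you point a generic estimator at your own image instead of an AWS-provided one. A minimal sketch, assuming you have already built an image that follows SageMaker's container conventions and pushed it to ECR (the image URI and role below are placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    # Your own container image in ECR (placeholder URI)
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
# SageMaker mounts the channel data inside the container and expects the
# container to write its model artifact to the conventional output directory.
estimator.fit({"train": "s3://my-bucket/train/"})
```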
Once you have created a data pipeline, you need to be able to sanitize and explore that data before feeding it into your ML models. This process is called Exploratory Data Analysis.
For a visual approach to data exploration, Amazon also offers a BI tool called QuickSight, very similar to Tableau or Power BI, which lets you create dashboards and graphs to describe your data. QuickSight can connect to multiple data sources such as Athena, Redshift, RDS databases, or even on-premises databases.
This was the third post in our 4-part series on AWS's Machine Learning Specialty Certification. In the next and final post, we will explore the fourth area of the test: Machine Learning Implementation and Operations, covering AWS's high-level AI services and how to deploy your models using SageMaker.
Featured Image by Christopher Gower on Unsplash.