In the first part of this series, we explored the idea of MLOps, some of its frameworks, and other tools that can be useful when applying this concept on a daily basis. Although we saw some examples there, the topic deserves a hands-on guide on how to build an MLOps pipeline in a data science project.
The dataset we’ll be using here is available on Kaggle, and it’s called Credit Card Fraud Detection. Moreover, the data processing step we are going to use is based on this notebook: Credit Fraud || Dealing with Imbalanced Datasets, available on Kaggle as well.
Here we are going to use GitHub to host our repository, Data Version Control (DVC) to manage and reproduce the data science pipeline, and Continuous Machine Learning (CML) to create automatic reports for our ML experiments. In addition, we are going to store in an S3 bucket all the data that is too big to be tracked by Git, such as our model’s artifacts and the dataset.
Following the MLOps principles, we are going to use GitHub Actions to automate our entire development workflow, and we will integrate the repository with Heroku in order to automatically deploy our API and serve our model. With that covered, in the next few sections we will dive into this very simple MLOps pipeline and see how we can implement the concepts we learned before.
First things first: we can’t do anything without being able to access our dataset, so we have to develop a routine that downloads the data from our S3 bucket.
The following code snippet does the trick for us. The method takes as parameters the name of your bucket (bucket_name), the object key (key), and the destination (dst) where you’d like to store the data locally.
```python
import boto3
import botocore


def download_data_from_s3(bucket_name, key, dst):
    try:
        s3 = boto3.resource('s3')
        s3.Bucket(bucket_name).download_file(key, dst)
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            print("The object does not exist.")
        else:
            raise
```
To make our lives easier, we’ve created a script called download_dataset.py to be used as the first step of our pipeline managed by DVC.
In this script, we are just defining the information we need to use the download_data_from_s3 method in order to download the dataset.
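As a rough sketch of what that script could look like (the bucket name, object key, and import path below are placeholders, not necessarily those used in the repository):

```python
# src/download_dataset.py -- minimal sketch; the bucket name, key, and import
# path are placeholder assumptions, not necessarily those of the real project.
from utils import download_data_from_s3  # assumes the helper lives in a utils module

BUCKET_NAME = "my-credit-fraud-bucket"  # placeholder bucket name
KEY = "creditcard.csv"                  # object key of the dataset in S3
DST = "data/creditcard.csv"             # local path expected by the DVC pipeline

if __name__ == "__main__":
    download_data_from_s3(BUCKET_NAME, KEY, DST)
```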
The next part of our pipeline is data pre-processing, followed by the model training and validation steps.
For the first part, we’ve created a script called preprocessing.py, and within it we have several data processing methods we can use, such as data sampling, outlier detection, missing-value treatment, and dimensionality reduction. It’s important to highlight that all these methods were extracted from this public notebook available on Kaggle.
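As a flavor of what one of those routines might look like, here is a minimal IQR-based outlier filter. It is only an illustrative sketch; the actual methods in preprocessing.py follow the referenced notebook:

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose `column` value falls outside the IQR-based fences.

    Illustrative sketch of an outlier-treatment routine, not the exact
    implementation used in preprocessing.py.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]
```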
Our development pipeline ends with the train.py script, and you can access the full code here. There, we put all the steps together and link them: after downloading the dataset, we go through the preprocessing step in order to treat outliers and split the original data into subsets.
After that, we define a model to be trained, along with the hyperparameter search space that a GridSearchCV method uses to find the best hyperparameter values for the chosen model. Then, we train the model using these hyperparameters and go through a validation step to check whether it performs well on the validation dataset.
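To make that step concrete, here is a minimal sketch of what such a search could look like. The estimator, parameter grid, and synthetic data below are placeholders rather than what train.py actually uses:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data so the sketch is self-contained; in train.py the splits
# come from the preprocessing step instead.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=42)

# Placeholder estimator and search space -- train.py may use a different model/grid.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}

grid_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="roc_auc",  # a reasonable choice for imbalanced fraud data
    cv=5,
)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_  # refitted with the best hyperparameters found
```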
The last thing we do in the train.py script is gather some model metrics, and we save both the trained model and its performance metrics after testing it on the validation dataset.
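Continuing the sketch above, saving those artifacts could look like the following. The file names mirror the outputs declared in the DVC pipeline, but the exact metrics gathered are only illustrative:

```python
import json
import pickle

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, f1_score, precision_score, recall_score

y_pred = best_model.predict(X_valid)

# Persist the trained model so DVC can track it as a pipeline output.
with open("data/model.pickle", "wb") as f:
    pickle.dump(best_model, f)

# Save the metrics that DVC/CML will diff between branches (metric names are illustrative).
metrics = {
    "precision": precision_score(y_valid, y_pred),
    "recall": recall_score(y_valid, y_pred),
    "f1": f1_score(y_valid, y_pred),
}
with open("data/metrics.json", "w") as f:
    json.dump(metrics, f, indent=4)

# Save the confusion matrix plot that CML publishes in the PR report
# (ConfusionMatrixDisplay.from_predictions requires scikit-learn >= 1.0).
ConfusionMatrixDisplay.from_predictions(y_valid, y_pred)
plt.savefig("data/confusion_matrix.png")
```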
Until now, we haven’t seen an MLOps pipeline, but a traditional data science one. So it’s time to configure DVC in order to make our work reproducible in any other environment.
Therefore, all we have to do is create a YAML file (DVC expects it to be named dvc.yaml) containing the following configuration.
```yaml
stages:
  download_data:
    cmd: python src/download_dataset.py
    deps:
      - src/download_dataset.py
    outs:
      - data/creditcard.csv
  train_clf:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/creditcard.csv
    outs:
      - data/confusion_matrix.png
      - data/model.pickle
    metrics:
      - data/metrics.json:
          cache: false
```
By doing that, we’re dividing the pipeline into stages, where each one is responsible for the execution of a specific part of the project. Moreover, we have also defined the dependencies and outputs of each stage.
In this article, we have two stages: download_data and train_clf. The first one executes the download_dataset.py script, and the second one runs train.py. So, we put them in the correct order of execution and specify their requirements. For more details, you can check the DVC Repro Documentation.
Now that the pipeline is ready, we can execute it in any other environment by simply opening a terminal in the project’s workspace and typing the command dvc repro.
So far, we have built a pipeline and made it ready to run anywhere, linking its dependencies and specifying its outputs. We can now integrate it with GitHub Actions in order to generate insights to be used in a decision-making process, for example, whether or not to approve a Pull Request.
GitHub Actions uses YAML syntax to define events, jobs, and steps. These YAML files are stored in your code repository, in a directory called .github/workflows. Our workflow looks like this:
```yaml
name: credit-fraud-detection-flow
on: [ push ]
jobs:
  experiment-run-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [ 3.8 ]
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_DEFAULT_REGION }}
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v1
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install --no-cache-dir -r requirements.txt
      - name: Run Experiment
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          dvc repro
          echo "## Metrics" >> report.md
          git fetch --prune
          dvc metrics diff master --show-md >> report.md
          # Publish confusion matrix diff
          echo -e "## Plots\n### Confusion matrix" >> report.md
          cml-publish data/confusion_matrix.png --md >> report.md
          cml-send-comment report.md
      - name: Run tests
        run: |
          pytest --cov-report=term-missing --cov=src --cov-fail-under=0
          pytest --cov-report=html --cov=src
      - name: Upload coverage report
        uses: actions/upload-artifact@v2
        with:
          name: coverage-report
          path: htmlcov/
          retention-days: 5
```
We begin by naming the workflow and defining when it will be triggered. Then, we specify the jobs and, for each job, the steps to be executed. You can think of the steps as small, self-contained units of work executed on the runner: each one either runs a command or uses a ready-made action, and here we make use of six actions in total (checkout, setup-cml, setup-dvc, configure-aws-credentials, setup-python, and upload-artifact).
One of the jobs we have in our workflow is called experiment-run-tests, and within it, we use the actions specified above so that we can run the DVC pipeline we created previously. By doing that, our pipeline is executed remotely and its outputs (confusion_matrix.png and metrics.json) are used by CML to generate an automatic report about our model’s performance.
Now, whenever someone modifies the code and opens a Pull Request, in addition to the code diff, we will have visual information about the model’s performance after these modifications, such as the change in metrics compared to the master branch and the confusion matrix plot, as illustrated in Fig. 1 and Fig. 2.
Now that we have set up our GitHub repository to automatically run experiments and tests after any modification to the code, we only need to create a routine to automatically deploy our model and serve it through an API.
In order to simulate a production environment, where the model is available through API calls, we’ve developed a script called app.py, which contains a very simple API built with FastAPI.
The predict method of the API takes a CSV file as input and returns the classification scores and the predictions.
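A minimal sketch of such an endpoint is shown below; the endpoint path, model location, and response format are assumptions, not necessarily those of the actual app.py:

```python
# src/app.py -- minimal sketch of the prediction API; endpoint path, model
# location, and response format are assumptions, not the real script's choices.
# File uploads in FastAPI also require the python-multipart package.
import pickle

import pandas as pd
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

# Load the trained model produced by the DVC pipeline.
with open("data/model.pickle", "rb") as f:
    model = pickle.load(f)


@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Read the uploaded CSV into a DataFrame and score it with the trained model.
    df = pd.read_csv(file.file)
    predictions = model.predict(df)
    scores = model.predict_proba(df)[:, 1]
    return {
        "predictions": predictions.tolist(),
        "scores": scores.tolist(),
    }
```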
The automatic deployment of our API will be handled by Heroku. All we have to do is create a Procfile containing the following line and save it in our workspace:
```
web: uvicorn src.app:app --host=0.0.0.0 --port=${PORT:-5000}
```
Then, we have to create an app on Heroku and set the deployment method to work with our GitHub account, as shown in Fig. 3.
The next and final step is to define which repository will be connected to the Heroku app, and then enable the Automatic Deploy option for the master branch of the repository. We present this last step in Fig. 4.
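Once the app is deployed, you can exercise the endpoint with a simple request. The URL, endpoint path, and file path below are placeholders for your own Heroku app and data:

```python
import requests

# Placeholder URL and endpoint path -- replace with your own Heroku app's address.
API_URL = "https://your-heroku-app.herokuapp.com/predict"

# Placeholder CSV of transactions to score.
with open("data/sample_transactions.csv", "rb") as f:
    response = requests.post(API_URL, files={"file": f})

print(response.json())  # predictions and scores returned by the model
```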
That’s it! At this point, you have built a very simple, but complete, MLOps pipeline, and we hope it has cleared up at least some of your doubts about how to apply MLOps concepts in a real-life Data Science project. All the scripts we used here are available in this GitHub repository.
If this hands-on article left you more curious about MLOps and its applications, I strongly recommend watching this MLOps tutorial on YouTube. I’m sure it will help you build your first MLOps pipeline.