[ MLFlow With MinIO (Special Guest Apache Spark) ]
For the past few days, I’ve been exploring MLFlow and its application to our Machine Learning pipeline.
MLFlow has four main components: Tracking, Projects, Models, and Model Registry. The core of MLFlow is its Tracking service, which allows you to log code, different versions of your model(s), metrics, and other artefacts.
MLFlow helps developers manage and reproduce experiments and models with their own choice of tools and platforms, whether it be an Apache Spark ML pipeline, a TensorFlow model, or a scikit-learn pipeline deployed to their own instance, Amazon SageMaker, GCP, etc.
However, the only way to really grasp the features of MLFlow is to play with it. It's better to set it up locally, as that gives you an idea of the orchestration and provisioning of resources you'll need when you decide to use it in production.
In this post, I am going to demonstrate how to make MLFlow work with MinIO, since in most cases you really want to store your artefacts in a service like Amazon S3. We are going to set it up locally using Docker: essentially, we train our model (with Spark ML) in one container while the MLFlow server runs in another container, which uses MinIO as its artefact store.
- Create a Docker network
docker network create mlflow
- Run MinIO with your desired access/secret keys. Here, we also create a bucket named ml-bucket by creating the corresponding directory on the host volume
mkdir -p /buckets/ml-bucket

docker run --rm --net mlflow --name s3 \
-e "MINIO_ACCESS_KEY=xxxx" \
-e "MINIO_SECRET_KEY=xxxx" \
-p "9000:9000" \
-v "/buckets:/data:consistent" \
minio/minio:RELEASE.2020-07-27T18-37-02Z server /data
- Start the MLFlow server. In this step, we're going to manually install MLFlow in a Python container. The important part is to set the environment variable MLFLOW_S3_ENDPOINT_URL to point to your MinIO server.
docker run --rm --net mlflow --name mlflow \
-p "5000:5000" -it python:buster /bin/bash

### In the container

# install mlflow and related packages
pip install awscli boto3 mlflow==1.10.0

export MLFLOW_S3_ENDPOINT_URL=http://s3:9000
export AWS_ACCESS_KEY_ID=xxxx
export AWS_SECRET_ACCESS_KEY=xxxx
- Still inside the MLFlow container, create an experiment with artefact store pointing to ml-bucket
mlflow experiments create --experiment-name demo --artifact-location s3://ml-bucket/
- Start MLFlow server
mlflow server --host 0.0.0.0 --default-artifact-root s3://ml-bucket/
- Set the S3 endpoint before you start training. You may also set the tracking URI and experiment name as environment variables, or you could do this via MLFlow's API.
pip install awscli boto3 mlflow==1.10.0

export MLFLOW_S3_ENDPOINT_URL=http://s3:9000
export MLFLOW_TRACKING_URI=http://mlflow:5000
export MLFLOW_EXPERIMENT_NAME=demo
- (Optional) If using the S3A client in your local setup, you may want to set fs.s3.impl in your Hadoop config to use S3A for better write performance.
- Run some Apache Spark training code
python3 /src/train_spark.py
python3 /src/train_spark.py --num_trees 25 --max_depth 10
python3 /src/train_spark.py --num_trees 30 --max_depth 15
python3 /src/train_spark.py --num_trees 30 --max_depth 30
And that’s it. When you go to your local MLFlow UI, you should see the different runs as well as the info on where the artefacts are stored.
List of experiments
View experiment
Logging an artefact from Apache Spark's PipelineModel
In some cases, you may want to extract data from one of your stages and save it as an artefact. To do that, just use the mlflow.log_artifact API.