
What will be covered in this article:

  • How SageMaker works
  • How to prepare a model for SageMaker
  • How to use AWS Lambda to trigger model training and deployment automatically

The source code for this article can be found at https://github.com/xg1990/aws-sagemaker-demo.

1. SageMaker Introduction

SageMaker is a fully-managed service by AWS that covers the entire machine learning workflow, including model training and deployment.

API levels

It has three levels of API we can work with:

  • High-Level API: python-sagemaker-sdk

All you need to do is define the model training/prediction/data input/output functions, then submit/deploy the source code with the necessary configurations (e.g. instance types). The SDK takes care of the rest (e.g. loading data from S3, creating the training job, publishing the model endpoint).

  • Mid-Level API: boto3

Besides defining the source code, you also need to upload the source code to S3 yourself, specify the S3 URL of the source code, and explicitly set up all the other configurations.

  • Low-Level API: awscli+docker

Essentially, SageMaker does everything within a container. Users can create their own Docker container and make it do whatever they want. These containers are called Algorithms in SageMaker.

In this article, I will cover the usage of python-sagemaker-sdk and boto3. Defining your own Docker container (the low-level API) is only necessary when your ML model is not based on any of the SageMaker-supported frameworks: Scikit-Learn, TensorFlow, PyTorch, etc.

SageMaker Modules

There are many modules provided by SageMaker. In this article, the following modules will be used:

  • Notebook instances: fully managed Jupyter Notebook instances where you can test your machine learning code with access to all other AWS services (e.g. S3)
  • Training jobs: the place to manage model training jobs
  • Models: the place to manage trained models
  • Endpoints: fully managed web services that take requests (HTTP or others) as input and return predictions as responses
  • Endpoint configurations: configurations for endpoints

2. Prepare SageMaker Model

To train and deploy a machine learning model on SageMaker, we need to prepare a Python script that defines the behaviours of our model through the following Python functions:

Main script

  • defined within if __name__ == '__main__'.
  • It will be run by SageMaker with several command-line arguments and environment variables passed into it.
  • It should load training data from the --train directory and output the trained model (usually a binary file) into --model-dir.
  • Example code looks like this:
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hyperparameters sent by the client are passed as
    # command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=50)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--learning-rate', type=float, default=0.05)
    # Data, model, and output directories
    parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    args, _ = parser.parse_known_args()
    # Another function that does the real work
    # (and keeps the code cleaner)
    run_training(args)

Besides the main script, other functions are defined for model deployment, which include the following:

  • model_fn: loads the saved model binary file(s)
  • input_fn: parses input data from different interfaces (HTTP request or direct Python function call)
  • predict_fn: takes the parsed input and makes predictions with the model loaded by model_fn
  • output_fn: encodes the prediction made by predict_fn and returns it to the corresponding interface (HTTP response or Python function return)

model_fn

The model_fn function should look into model_dir and load the saved model binary file into memory for prediction tasks. Example code (a minimal sketch, assuming a scikit-learn model serialized with joblib; adapt it to however your main script saves the model) looks like this:
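
import os

import joblib  # for scikit-learn 0.20 you may need sklearn.externals.joblib

def model_fn(model_dir):
    # 'model.joblib' is an assumed filename -- use whatever name your
    # main script used when saving the model into --model-dir.
    return joblib.load(os.path.join(model_dir, 'model.joblib'))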

input_fn

input_fn takes two arguments: request_body and request_content_type. Usually, input_fn should check the type of the input data (request_content_type) before parsing it. You can also do data pre-processing here. Example code (a minimal sketch, assuming JSON input) looks like this:
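
import json

import numpy as np

def input_fn(request_body, request_content_type):
    # Only JSON is handled in this sketch; add other content types as needed.
    if request_content_type == 'application/json':
        return np.array(json.loads(request_body))
    raise ValueError(f'Unsupported content type: {request_content_type}')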

predict_fn

predict_fn takes two arguments: input_data and model. input_data is the output of input_fn; model is the output of model_fn. Usually, we may need to do some data transformation before prediction. Example code (a minimal sketch, assuming a scikit-learn model) looks like this:
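
def predict_fn(input_data, model):
    # input_data is whatever input_fn returned; model comes from model_fn.
    # Any extra transformation (scaling, reshaping, ...) would go here.
    return model.predict(input_data)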

output_fn

output_fn also takes two arguments: prediction and accept. prediction is the output of predict_fn; accept is the output format the client expects (e.g. application/json). A minimal sketch, assuming JSON output, looks like this:
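
import json

def output_fn(prediction, accept):
    # Serialize the prediction in the format the client asked for.
    if accept == 'application/json':
        return json.dumps(prediction.tolist())
    raise ValueError(f'Unsupported accept type: {accept}')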

All of the above functions should be put into one Python script (let's say it is train_and_deploy.py); then we can use python-sagemaker-sdk to test our model for SageMaker in our local environment.

3. Model development with python-sagemaker-sdk

Besides the train_and_deploy.py script, we need to prepare another Python script that runs python-sagemaker-sdk. A minimal sketch (assuming a scikit-learn model and SageMaker Python SDK v1; role_arn and the S3 path are placeholders) looks like this:
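
from sagemaker.sklearn import SKLearn

role_arn = '<your-sagemaker-execution-role-arn>'  # placeholder

sklearn_estimator = SKLearn(
    entry_point='train_and_deploy.py',
    role=role_arn,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    framework_version='0.20.0',
    hyperparameters={'epochs': 50, 'batch-size': 64, 'learning-rate': 0.05},
)
# Train on data stored in S3, then publish the model as an endpoint.
sklearn_estimator.fit({'train': 's3://<path-to-training-data>'})
predictor = sklearn_estimator.deploy(initial_instance_count=1,
                                     instance_type='ml.t2.medium')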

We use from sagemaker.sklearn import SKLearn if our model is based on Scikit-learn. If the model is based on TensorFlow, we can use from sagemaker.tensorflow import TensorFlow instead.

The meaning of these arguments can be found in the SageMaker official documentation for Scikit-learn, TensorFlow, and PyTorch.

Debug Locally

One thing to pay attention to: every time we call sklearn_estimator.fit or sklearn_estimator.deploy, the SageMaker SDK starts a new Docker instance and runs the corresponding job, which is very slow for debugging when the job runs on the server side. In this case, we can set train_instance_type='local' or instance_type='local' to run locally instead, which is much faster (make sure Docker is set up on your local machine). For example, building on the sketch above:
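
# Same estimator as above, but trained inside a local Docker container.
# (A sketch; a 'file://' input also works in local mode if the data is local.)
sklearn_estimator = SKLearn(
    entry_point='train_and_deploy.py',
    role=role_arn,
    train_instance_count=1,
    train_instance_type='local',
    framework_version='0.20.0',
)
sklearn_estimator.fit({'train': 'file://./data/train'})
predictor = sklearn_estimator.deploy(initial_instance_count=1,
                                     instance_type='local')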

Test Published Endpoint

Once the trained model is published as an endpoint service, we can test the endpoint with new data. A minimal sketch using the boto3 runtime client (the endpoint name and payload are placeholders):
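
import json

import boto3

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='<your-endpoint-name>',   # placeholder
    ContentType='application/json',
    Body=json.dumps([[1.0, 2.0, 3.0]]),    # example payload; depends on your model
)
print(json.loads(response['Body'].read()))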

4. Using a Lambda Function to Control SageMaker

Now we have successfully trained and deployed a model on SageMaker. But that is not enough.

In the real world, we receive new data every day and need to retrain the machine learning model periodically.

Moreover, model training usually takes a long time, and we need to make sure that once a training job is done, the trained model is automatically deployed to the existing endpoint (SageMaker does not do this automatically).

There are two solutions for this: Step Functions and Lambda.

  • Solution 1: Step Functions. We can define a Lambda function that checks the status of a training job, then use a Step Function to call that Lambda function periodically (e.g. every hour); once training is done, it calls another Lambda function to deploy the model. This solution is well documented at https://github.com/aws-samples/serverless-sagemaker-orchestration
  • Solution 2: Lambda. Once a model training job finishes, the trained model is written to an S3 bucket. The S3 PUT event can be associated with a Lambda function that triggers model deployment.

This article will demonstrate solution 2: how to use a Lambda function and S3 events to manage SageMaker.

The lambda_handler takes the argument event, which contains information about how the Lambda was triggered. By interpreting the event, we can perform different actions within one Lambda handler. The following is example code of the Lambda function we use:


def lambda_handler(event, context):
    response = event

    # Handle S3 events (e.g. a training job writing its model artifact)
    if 'Records' in event:
        for record in event['Records']:
            if 's3' in record:
                handle_s3_event(record['s3'])
            else:
                # unrecognised events
                pass

    # Handle direct task invocations (e.g. from a CloudWatch schedule)
    if 'task' in event:
        if event['task'] == 'retrain':
            # start retraining the model
            retrain_the_model()
            response = "OK"
        elif event['task'] == 'prediction':
            response = make_prediction()

    return {
        'statusCode': 200,
        'body': response,
    }

Three different events are handled here:

  • If the input event is an S3 event, it calls handle_s3_event(record['s3']) to handle it
  • Otherwise, if the event is a model retrain task, it calls retrain_the_model()
  • And if the event is a prediction task, it calls make_prediction() and returns the result as the response

Trigger model retrain task periodically

A Lambda function can be triggered periodically by a CloudWatch Event. We can add a CloudWatch Events trigger and set up a Rule:

  • The Event Source should be Schedule, with a cron-style schedule expression (very similar to crontab on Linux)
  • Set Targets to the Lambda function we use to control SageMaker.
  • Configure input can be Constant (JSON text), e.g. {"task": "retrain"}, so that the lambda_handler above knows what to do with it.

Model Retrain

This is an example of the model retraining function:

import boto3
from datetime import datetime

src_path = 's3://<path-to-source-code-train_and_deploy.py>.tar.gz'
role_arn = '<your-sagemaker-execution-role-arn>'

def retrain_the_model():
    now_str = datetime.utcnow().strftime('%Y-%m-%d-%H-%M-%S-%f')
    training_job_name = f'<put-training-job-name-here>-{now_str}'
    sm = boto3.client('sagemaker')
    resp = sm.create_training_job(
        TrainingJobName=training_job_name,
        AlgorithmSpecification={
            'TrainingInputMode': 'File',
            'TrainingImage': '783357654285.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3',
        },
        RoleArn=role_arn,
        InputDataConfig=[
            {
                'ChannelName': 'train',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': 's3://<path-to-training-data>',
                        'S3DataDistributionType': 'FullyReplicated',
                    }
                },
            },
        ],
        OutputDataConfig={
            'S3OutputPath': 's3://<path-to-output_dir>'
        },
        ResourceConfig={
            'InstanceType': 'ml.m4.xlarge',
            'InstanceCount': 1,
            'VolumeSizeInGB': 30,
        },
        StoppingCondition={
            'MaxRuntimeInSeconds': 600
        },
        HyperParameters={
            'sagemaker_program': 'train_and_deploy.py',
            'sagemaker_region': '<your-aws-region>',
            'sagemaker_job_name': training_job_name,
            # must point to the gzipped source package uploaded to S3
            'sagemaker_submit_directory': src_path,
        },
        Tags=[],
    )

This function will create a model training job.

Note that we cannot use python-sagemaker-sdk within the Lambda environment, so the best option is to use boto3. That means we need to upload train_and_deploy.py to S3 as a gzipped tar package ourselves, and set everything up explicitly as in the code above. A minimal sketch of the packaging step (the bucket and key are placeholders) might look like this:
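
import tarfile

import boto3

# Package the entry-point script the way SageMaker framework containers
# expect it: a gzipped tar at the sagemaker_submit_directory S3 path.
with tarfile.open('sourcedir.tar.gz', 'w:gz') as tar:
    tar.add('train_and_deploy.py')

s3 = boto3.client('s3')
s3.upload_file('sourcedir.tar.gz', '<your-bucket>',        # placeholders
               '<path-to-source-code>/sourcedir.tar.gz')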

Model Deployment

Once a model training job is done, we need to deploy the trained model and update the existing endpoint. SageMaker doesn't do this for you automatically, and we cannot ask Lambda to wait for the training job, as training may take several hours.

Instead, once a training job is done, the model artifact written to S3 fires a PUT event, which notifies our Lambda function that training has finished and deployment can start.

Within an S3 PUT event, the key of the S3 object is provided, which is usually related to the unique ID of our training job, as shown in the following code:


def handle_s3_event(s3):
    bucket = s3['bucket']['name']
    fn = s3['object']['key']
    # The key looks like <prefix>/<training-job-name>/output/model.tar.gz,
    # so the third-from-last path component is the training job name.
    jobid = fn.split("/")[-3]
    return deploy_model(jobid)

Once the model training job ID is known, we can then call the deploy_model function to deploy our model, which looks like this:


import boto3

endpoint_name = '<your-endpoint-name>'
src_path = 's3://<path-to-source-code-train_and_deploy.py>.tar.gz'
role_arn = '<your-sagemaker-execution-role-arn>'

def deploy_model(training_job_name):
    sm = boto3.client('sagemaker')
    # Register the trained model artifact as a SageMaker Model
    model = sm.create_model(
        ModelName=training_job_name,
        PrimaryContainer={
            'ContainerHostname': 'model-Container',
            'Image': '783357654285.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3',
            'ModelDataUrl': f's3://<path-to-your-output-dir>/{training_job_name}/output/model.tar.gz',
            'Environment': {
                'SAGEMAKER_PROGRAM': 'train_and_deploy.py',
                'SAGEMAKER_REGION': '<your-aws-region-name>',
                'SAGEMAKER_SUBMIT_DIRECTORY': src_path,
            },
        },
        ExecutionRoleArn=role_arn,
    )
    # Create a new endpoint configuration pointing at the new model
    endpoint_config = sm.create_endpoint_config(
        EndpointConfigName=training_job_name,
        ProductionVariants=[
            {
                'VariantName': 'AllTraffic',
                'ModelName': training_job_name,
                'InitialInstanceCount': 1,
                'InstanceType': 'ml.t2.medium',
            },
        ],
    )
    # Swap the existing endpoint over to the new configuration
    sm.update_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=training_job_name)

...

With that, everything is set up: the model retrains periodically, and the endpoint is always backed by the most recently trained model.

Wrap Up

SageMaker is especially helpful when you need to retrain your model periodically and serve it as a web service. For training, SageMaker can automatically start a high-performance EC2 instance and finish model training quickly at minimal cost. For serving, SageMaker takes care of auto-scaling and makes sure your endpoint is always available.

To use SageMaker for machine learning, the most important step is to prepare a script that defines the behaviours of your model. You also retain full control of the whole system by creating your own Docker container.

Other things to be aware of

  • This workflow does not include model performance testing, which is also important in a machine learning pipeline;
  • S3 events do not guarantee 100% delivery. If model training is critical, Step Functions are a better choice;
  • Make sure your AWS role has enough permissions to control the necessary resources;
  • SageMaker can also run batch prediction jobs, and there are many other features that remain to be explored.