AWS SageMaker demonstration for machine learning model deployment

Jyoti Yadav
6 min read · Jul 25, 2021


Nowadays, almost every company relies on cloud infrastructure, so it has become essential for data scientists to be comfortable in that domain as well. This article demonstrates an XGBoost model trained and deployed on AWS SageMaker. The example dataset relates to a bank product, and the target variable indicates whether a customer buys it or not.

The code shows how a model can be deployed on AWS for production purposes. Every instruction is provided in the notebook below.

AWS Instance Creation

Before we begin with the process, a few preparatory steps need to be completed. Please follow the steps below to create the instance in the AWS console (a programmatic alternative using boto3 is sketched right after the list):

  1. Create an AWS account (no money required for the free tier)
  2. Log in to the AWS Management Console
  3. Search for AWS SageMaker
  4. Left-hand side panel navigation → Notebook → Notebook instances → Create notebook instance
  5. Create an IAM role while creating the notebook instance (this is an essential part of the process because it controls what the notebook is allowed to access)
  6. The instance creation will take some time. Once it is ready, open it and start writing the code below
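
For anyone who prefers to script this, the same notebook instance can also be created with boto3 from any machine with AWS credentials configured. This is only a minimal sketch; the instance name and the IAM role ARN below are hypothetical placeholders that you would replace with your own.

import boto3

sm_client = boto3.client('sagemaker')
# NotebookInstanceName and RoleArn below are placeholders, not real values
response = sm_client.create_notebook_instance(
    NotebookInstanceName='demo-notebook',
    InstanceType='ml.t2.medium',
    RoleArn='arn:aws:iam::123456789012:role/my-sagemaker-role',
    VolumeSizeInGB=5)
print(response['NotebookInstanceArn'])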

Article Sections

  1. Set up environment
  2. Download and split dataset
  3. Model
  4. Deployment
  5. Predictions
  6. Delete endpoint

1. Set up environment

There are some libraries that need to be present before we begin the entire process (a minimal install cell is sketched after the list below).

  1. sagemaker — the built-in SDK used to perform the modeling and deployment
  2. boto3 — helps the notebook instance connect to other AWS services, such as S3
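
Both usually come pre-installed on SageMaker notebook instances, but if they are missing or outdated in your kernel they can be installed from a notebook cell. This is a minimal sketch; versions are deliberately not pinned here.

# install/upgrade the SDKs from within the notebook, if needed
!pip install --upgrade sagemaker boto3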

Specific function imports:

  1. get_image_uri — since an AWS built-in model will be used, it has to be fetched as a container image through this function
  2. csv_serializer — used at prediction time, since the input will be supplied as CSV (serialization of the input)

import urllib
import os
import pandas as pd
import numpy as np
import sagemaker
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.session import s3_input, Session
from sagemaker.predictor import csv_serializer

This step automatically creates the S3 bucket in AWS:

  1. specify a bucket name that is still available (bucket names are globally unique)
  2. fetch the region name

(Note: an S3 (Simple Storage Service) bucket is a scalable object storage location. The region name is the location where the operations are performed, e.g. N. Virginia (us-east-1). Bucket names are global, but each bucket is created in a specific region.)

bucket_name = 'ba-data-112233'
#get region name
my_region = boto3.session.Session().region_name
print(my_region)

The code below connects to S3 (using boto3) and creates a bucket for the project. Note that regions other than us-east-1 require a LocationConstraint when creating a bucket.

# get access to the s3 service
s3 = boto3.resource('s3')
# create the bucket
try:
    if my_region == 'us-east-1':
        # us-east-1 does not accept a LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)
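
As a quick optional sanity check, the available buckets can be listed to confirm that the new one exists:

# optional: list the buckets to confirm the new one was created
for b in s3.buckets.all():
    print(b.name)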

The bucket will be used to store everything produced during the process:

  1. the original data file
  2. the train and test data (created by the splitting step)
  3. the trained model artifacts
  4. predictions

# set an output path where the trained model will be saved
prefix = 'xgboost_model'
output_path ='s3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)

2. Download and split the dataset

The data is downloaded from a GitHub page using the urllib library and then loaded into a pandas dataframe.

# download the data file onto the notebook instance
try:
    urllib.request.urlretrieve('https://raw.githubusercontent.com/jyotiyadav99111/AWS_Bank_Applkication-/main/bank_data.csv', 'bank_data.csv')
    print('Data downloaded successfully')
except Exception as e:
    print('Downloading error: ', e)

# load the dataset into a pandas dataframe
try:
    df = pd.read_csv('./bank_data.csv')  # local path on the notebook instance
    print('Dataframe created successfully')
except Exception as e:
    print('Dataframe creation error: ', e)

The dataset is split into training and testing sets using NumPy, keeping the features and the target together in each file rather than producing separate x_train, y_train, x_test, and y_test objects (an sklearn-style alternative is sketched below).

# Train and test data split
train_data, test_data = np.split(df.sample(frac=1, random_state=1729), [int(0.8 * len(df))])
print(train_data.shape, test_data.shape)
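
For comparison, the same 80/20 split could be done with scikit-learn's train_test_split, assuming scikit-learn is available on the instance. This is a sketch only; the NumPy version above is what the rest of the notebook uses.

# equivalent split using scikit-learn (not used further below)
from sklearn.model_selection import train_test_split
train_data_alt, test_data_alt = train_test_split(df, test_size=0.2, random_state=1729)
print(train_data_alt.shape, test_data_alt.shape)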

The built-in XGBoost algorithm expects the target variable to be the first column of the CSV data; the first line of code below takes care of that. The training and testing sets then have to be stored in the S3 bucket so that they can be fetched easily every time.

# as per the AWS documentation, the target variable should be the first column
# save the training data as csv
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
# upload the data to the s3 bucket under the 'train' folder
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
# keep a reference to the s3 input, as its path will be needed when training the model
s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

Repeat it for the testing dataset

pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
s3_input_test = sagemaker.TrainingInput(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')

3. Model

The AWS built-in models are packaged as containers (images). They have to be pulled and loaded into the instance for use. One can replace ‘xgboost’ with another algorithm name to fetch the corresponding image.

# any built-in algorithm can be fetched this way (not necessarily xgboost)
xgboost_container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='latest')
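
Note that get_image_uri is deprecated in version 2 of the SageMaker Python SDK. A roughly equivalent call there would look like the sketch below; the version string is an assumption and depends on what is available in your region.

# rough SDK v2 equivalent of the call above
from sagemaker import image_uris
xgboost_container = image_uris.retrieve('xgboost', boto3.Session().region_name, version='1.2-1')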

These hyperparameters were decided based on pre-training the model on the dataset on a local machine, since tuning on the cloud with a free-tier setup could be a little slow (and costly). A rough sketch of such local tuning follows the code below.

# these hyperparameters have already been tuned on a local machine, as tuning on AWS would be a little slow and costly
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "binary:logistic",
    "num_round": 50
}
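
This is roughly how such local pre-tuning might look, assuming the xgboost and scikit-learn packages are installed locally, the same bank_data.csv file is available, and all features are numeric. The parameter grid is only illustrative, not the one actually used.

# minimal local tuning sketch (illustrative grid; run on a local machine, not on SageMaker)
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

df_local = pd.read_csv('bank_data.csv')
X = df_local.drop(['y_no', 'y_yes'], axis=1)
y = df_local['y_yes']

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.1, 0.2],  # corresponds to 'eta' in the built-in container
    'gamma': [0, 4],
    'min_child_weight': [1, 6],
    'subsample': [0.7, 1.0]
}
search = GridSearchCV(XGBClassifier(objective='binary:logistic', n_estimators=50),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)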

Once the parameters have been specified, feed them into the Estimator to set up the training job. This Estimator call stays largely the same for most of the machine learning algorithms. The last three arguments enable managed spot training with time limits, which reduces the cloud compute cost by roughly 50%.

# this is a general method that can be used for any of the built-in ML algorithms; you just need to specify the corresponding container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          train_instance_count=1,
                                          train_instance_type='ml.m5.2xlarge',
                                          train_volume_size=5,  # in GB
                                          output_path=output_path,
                                          train_use_spot_instances=True,
                                          train_max_run=300,
                                          train_max_wait=600)
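
A side note: in version 2 of the SageMaker Python SDK the train_* arguments were renamed, so the equivalent construction would look roughly like this (same values, different parameter names):

# rough SDK v2 equivalent of the estimator above
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.m5.2xlarge',
                                          volume_size=5,  # in GB
                                          output_path=output_path,
                                          use_spot_instances=True,
                                          max_run=300,
                                          max_wait=600)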

Once the basic architecture is in place, the training and validation data can be fed into the model. The fit call takes the S3 paths to the train and test data, which are stored in ‘s3_input_train’ and ‘s3_input_test’.

# final training of the model with the chosen parameters
estimator.fit({'train': s3_input_train, 'validation': s3_input_test})

4. Deployment

The trained model now needs to serve predictions, which is done by deploying it to an endpoint.

xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')
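
Deployment takes a few minutes. As an optional check, the endpoint status can be queried via boto3; note that in newer SDK versions the predictor attribute is endpoint_name rather than endpoint.

# optional: check that the endpoint is in service
sm_client = boto3.client('sagemaker')
print(sm_client.describe_endpoint(EndpointName=xgb_predictor.endpoint)['EndpointStatus'])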

5. Predictions

After the endpoint is built, predictions can be made with the following code. The first line extracts only the features from the test dataset. Since the input will be sent as CSV, a CSV serializer has to be set on the predictor. The last line converts the returned prediction string into a NumPy array.

test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values #load the data into an array
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array

The code below has been taken from the AWS documentation. It builds the confusion matrix.

cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = cm.iloc[0,0]; fn = cm.iloc[1,0]; tp = cm.iloc[1,1]; fp = cm.iloc[0,1]; p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))

6. Delete the endpoint

Once the model has been created and everything has been accomplished, it is a good idea to delete the endpoint and the stored files so that they do not keep incurring charges.

sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()
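
If the bucket itself is no longer needed, it can also be removed once its objects have been deleted (a small optional addition):

# optional: delete the now-empty bucket itself
bucket_to_delete.delete()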

For the full code, please refer to the link.

Conclusion

The entire process involves quite a few steps, and each step has its own significance. This was a simple example of how a model can be deployed on AWS; deep learning models can be deployed in a similar manner with somewhat more advanced methods.
