
· 11 min read

The key takeaways from our data platform journey are:

  • Focus on a low-maintenance data platform with high ROI.
  • Prioritise developer-friendly design, intuitive configurations, and ease of change implementation.
  • Embrace declarative and stateless architecture for simplicity and maintenance.
  • Implement GitOps principles for version control, collaboration, and automation.

The Challenge

The importance of a data platform

Importance of a data platform (image source: internet)

In the dynamic world of eCommerce, a robust data platform is paramount to our success. As New Aim evolves, so does the complexity of our operations, demanding significant time and effort. Imagine overseeing 5000 products throughout their lifecycle: from procurement to marketing, fulfilment, and after-sales. The sheer magnitude of these tasks renders manual decision-making virtually impossible.

Recognising this challenge, we've come to realise the indispensable role of a data platform. It serves as a crucial tool, allowing us to efficiently manage inventory, determine optimal selling prices, and enhance listings and marketing strategies. Our Last-mile Postage Optimiser (LPO) technology also enables us to select suitable postage options and choose the best courier services for seamless order fulfilment. By harnessing the power of automation, we alleviate the burden of manual work in customer service and equip our New Aim leadership team with the insights needed to make well-informed decisions.

Our implementation of a new data platform aims to streamline daily operations, foster automation, and empower our team with actionable insights. These efforts are vital in driving the growth and prosperity of our company.

The Complexity of a Data Platform

Elements of ML (image source: https://cloud.google.com/blog/products/ai-machine-learning/what-it-really-takes-to-build-ml-into-your-business )

Developing a robust data platform is a formidable undertaking, as illustrated in this diagram borrowed from Google's real-world ML systems blog. At New Aim, our data platform acts as a vital support system, enabling our data scientists and analysts to deliver valuable data services to the company. While stakeholders may simply receive the desired spreadsheet or dashboard, the behind-the-scenes effort required to produce accurate and timely results is substantial.

These endeavours involve the dedicated work of platform engineers, who meticulously establish the necessary infrastructure and allocate vital resources. Our data engineers ensure the smooth movement and cleansing of data for optimal usability. Furthermore, the expertise of machine learning engineers is paramount in transforming the Python notebooks or SQL queries of our data scientists and analysts into production-ready pipelines, seamlessly integrating them into the larger system.

Considering the intricate nature of constructing a data platform, and the demanding production environment, it’s an exercise in skill and teamwork to develop new features and uphold existing services.

The Success of a Data Platform

ROI of a data platform

When considering the key elements of a successful data platform, several factors come to mind: scalability, reliability, security, and governance. However, from our perspective at New Aim, we place particular emphasis on ROI—the return on investment.

Measuring the investment in our platform team entails assessing the team's size, expertise, and resource allocation. Conversely, the return is determined by the multitude of use cases and services we deliver and sustain in a production environment.

We believe a successful data platform team is one that efficiently maintains a wide range of data services. That’s why we prioritise ROI.

Constructing a proper and contemporary data platform demands a substantial financial commitment. Justifying this expenditure to senior management is essential. Ultimately, the value of a data platform is realised through the services it offers to stakeholders.

Strategies in our Data Platform

General Guideline: Simplicity

When it comes to constructing our data platform at New Aim, simplicity is our guiding principle. We aim to create a platform that is easily comprehensible and navigable, reducing the cognitive burden on our team when implementing changes. This commitment to simplicity is reflected in the following ways:

  1. Developer Friendly: Our goal is to ensure that our data platform is accessible to developers at all levels of experience. We encourage junior and entry-level team members to confidently contribute, while also welcoming input from other team members such as IT professionals, data analysts, and stakeholders.
  2. Transparency and Openness: To accommodate diverse contributors, it is essential that our platform is transparent and open. We strive to provide clear visibility into its workings and empower users to understand the available capabilities.
  3. Streamlined Change Process: A smooth change process is vital to ensure that only approved and thoroughly tested modifications are deployed in a production environment. We prioritise a seamless workflow that maintains the integrity of our platform.

Key Principle 1: Declarative Approach

Based on our commitment to simplicity, one of the key principles we embrace when building our data platform is the adoption of a declarative approach. Declarative programming enables us to abstract the control flow and concentrate on describing the desired outcome, rather than providing intricate step-by-step instructions.

We've learned valuable lessons from the experiences of other companies. For instance, many organisations heavily rely on stored procedures for data processing in their data warehouses. However, these stored procedures often involve complex logic, making it challenging to understand, debug, and reproduce issues when they arise.

To address these challenges, we have made a deliberate choice to minimise the use of stored procedures or Python, and instead, encourage the utilisation of YAML and SQL wherever possible. This approach offers enhanced simplicity and maintainability.

Key Principle 2: Stateless & Idempotent

Another crucial principle we adhere to is the adoption of stateless and idempotent approaches.

  • Stateless: When evaluating third-party tools or developing our own services, we prioritise solutions that are stateless. This means they operate independently of past interactions, eliminating the need for a database backend or other stateful resources. By avoiding these complexities, we save valuable time and effort in managing our platform.

  • Idempotency: Furthermore, we place significant emphasis on idempotency within our data pipelines and transformations. Idempotent operations can be applied multiple times without altering the result beyond the initial application. By designing our pipelines and transformation jobs to be idempotent, we can easily retry them in case of issues, effectively resolving most operational problems and preserving valuable time.
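
As a hypothetical illustration (in Python with pandas, rather than the SQL and dbt stack described later, and with made-up paths and column names), an idempotent daily load can overwrite the output partition for its run date, so re-running the job after a failure produces exactly the same result:

import pandas as pd

def load_daily_orders(run_date: str, source_csv: str, output_dir: str) -> None:
    # Read the raw extract and keep only the rows for this run date.
    df = pd.read_csv(source_csv)
    df = df[df['order_date'] == run_date]
    # Overwrite the partition for this date: running the job twice for the
    # same date yields the same file instead of duplicated rows.
    df.to_parquet(f'{output_dir}/order_date={run_date}.parquet', index=False)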

These principles not only simplify our architecture but also enhance the robustness and reliability of our data platform.

Key Decisions made

Apply GitOps Everywhere

GitOps in a nutshell (image source: https://blogs.vmware.com/cloud/2021/02/24/gitops-cloud-operating-model/)

One of our pivotal decisions was to implement the GitOps framework throughout our data platform, merging the principles of Infrastructure as Code (IaC) with DevOps best practices. This approach revolves around utilising Git repositories as the single source of truth for our system configurations, employing Git pull requests and CI/CD pipelines to validate and automatically deploy changes.

This decision has yielded numerous advantages for our data platform at New Aim:

  1. Version Control: Every system alteration is meticulously tracked, allowing us to easily revert any unintended modifications. It also empowers our team to take calculated risks and iterate rapidly.
  2. Traceability: We maintain a comprehensive record of change authors, timestamps, and reasons, enhancing accountability and facilitating troubleshooting.
  3. Thorough Testing: We subject changes to a battery of tests, including functionality, performance, and data integrity testing, ensuring their validity before deployment. This meticulous testing process enhances the overall quality and mitigates the risk of introducing errors.
  4. Collaboration: GitOps facilitates peer review processes before merging changes into production, promoting collaboration, knowledge sharing, and ensuring that decisions are made by qualified and responsible individuals.
  5. Automation: By embracing GitOps, we have streamlined deployment and resource management procedures. In many cases, changes can be implemented without directly accessing the production environment, reducing the potential for human error and enabling efficient automation.

The adoption of GitOps has granted us heightened control, transparency, and operational efficiency in managing our data platform.

In our data platform, GitOps is implemented across various areas to streamline management and maintainability. Here's how we apply GitOps in different components:

  1. Infrastructure: We utilise Infrastructure as Code (IaC) and CI/CD pipelines to define and create our cloud infrastructure. This enables version control of configurations and automated deployment of changes.
  2. Data Pipeline: Our data pipeline tools are configured using separate JSON files. These files define pipeline behaviour and are executed based on predefined schedules or triggers.
  3. Transformations: Data transformation queries are stored in a dedicated SQL repository. These transformations run on a managed Spark cluster using Databricks, and we employ dbt (Data Build Tool) to manage the transformation workflows.
  4. Orchestration: Argo Workflows orchestrate various jobs, including data transformation tasks. Workflow specifications are defined in YAML files, facilitating management and version control.
  5. Kubernetes Workloads: YAML files define configurations for Kubernetes workloads, such as microservices or applications. ArgoCD synchronises these workloads into the production environment.

Furthermore, when developing in-house software, we follow a similar approach of separating execution and configuration. Our skilled development team focuses on building flexible execution components, while other team members can leverage these tools by modifying configuration files.

By segregating components into execution and configuration parts, our data platform becomes modular, flexible, and easy to manage. We prioritise tools that support independent configuration management, enabling changes without impacting the source code.

SQL as First-Class Citizen for Data Transformation

Within our data platform, we prioritise the use of SQL for data transformations. We view transformations as a means of expressing business rules, and collaboration with stakeholders is crucial during their development. Our aim is to ensure that these transformations are easily comprehensible and maintainable across the organisation.

While languages like Scala may offer complexity and sophistication, if they lack understanding among others in the company, it can result in issues and erode trust in the data platform. Additionally, managing various runtime environments for non-SQL languages like Python adds overhead and complexity.

By employing SQL, we leverage its declarative and stateless nature. SQL is widely understood, and many stakeholders are familiar with reading and writing SQL queries. It also simplifies management and maintenance, particularly with the rise of tools like dbt-core, which greatly facilitate the management of SQL transformations.

Declarative and Stateless Tools First

Another pivotal decision we made was the careful selection of tools and software for our data platform.

An important criterion was the ability to separate configuration from the tools themselves, adopting a declarative approach. This empowers us to independently manage and maintain configurations, ensuring flexibility and user-friendliness.

Furthermore, we considered the statelessness of the tools. We favoured solutions that do not rely on a backend database for storing secrets or managing workflow status. This choice minimises the maintenance costs associated with database reliability, backups, disaster recovery plans, and software upgrades.

Consequently, we opted for tools like Argo Workflow, dbt, and Meltano, which operate efficiently without the need for a backend database. These tools enable streamlined execution of jobs and workflows, without the additional overhead and maintenance burden of a database.

It's worth noting that tools like Airflow, Airbyte, and Superset are excellent and effective in their respective use cases. However, we decided against using them primarily due to the higher maintenance costs from our perspective.

Achievements and Limitations

Limitations

With the decisions and principles we have embraced, our data platform has delivered remarkable outcomes. It is, however, important to acknowledge that our solution has its limitations and trade-offs.

Our primary focus has been on optimising Full-Time Equivalent (FTE) efficiency, which means that we may have compromised on other factors such as cost and flexibility. For instance, by not utilising Airflow, we have sacrificed certain powerful features that it offers. We continuously review our decisions and may incorporate those features if our strategy changes.

We have chosen to minimise the usage of SaaS, given our adoption of the GitOps framework. While comprehensive SaaS solutions with user-friendly interfaces can provide a low-maintenance platform, they often come with high license fees and limited system-level flexibility.

Another limitation we currently face is stream processing. Our design and principles have primarily focused on batch processing, and we are actively working on extending these principles to incorporate stream processing. It is an ongoing effort and a work in progress for us.

What have we achieved?

We’ve come a long way. The numbers do not lie. Each member of our data platform and engineering team is responsible for maintaining and supporting:

  • Over 20 ingestion pipelines from diverse data sources, including both internal and custom pipelines developed in-house.
  • 300+ data transformation queries, primarily in SQL with some in Python. We release changes almost daily without disruptions.
  • More than 8 API endpoints to serve downstream data consumers.
  • Additionally, more than 3 data services to cater to specific use cases like postage optimisation (Last-mile Postage Optimiser), demand forecasting, and product recommendation.

What’s truly remarkable is that, on average, each person on our team manages all of these resources: more than 300 data workloads per person. This speaks volumes about the efficiency and productivity achieved through our low-maintenance data platform.

I couldn't be prouder of our accomplishments at New Aim. It exemplifies the value of our decisions and the effectiveness of our approach.

· 10 min read

What will be covered in this article:

  • How SageMaker works
  • How to prepare a model for SageMaker
  • How to use AWS Lambda to trigger model training and deployment automatically

The source code for this article can be found at https://github.com/xg1990/aws-sagemaker-demo.

1. SageMaker Introduction

SageMaker is a fully-managed service by AWS that covers the entire machine learning workflow, including model training and deployment.

API levels

It has three levels of API we can work with:

  • High-Level API: python-sagemaker-sdk

All you need to do is define the model training/prediction/data input/output functions, then submit/deploy the source code with the necessary configuration (e.g. instance types). The SDK takes care of the rest (e.g. loading data from S3, creating the training job, publishing the model endpoint).

  • Mid-Level API: boto3

Besides defining the source code, you also need to upload the source code to S3 yourself, specify its S3 URL, and explicitly set up all other configuration.

  • Low-Level API: awscli+docker

Essentially, SageMaker does everything within a container. Users can create their own Docker container and make it do whatever they want. These containers are called Algorithms in SageMaker.

In this article, I will cover the usage of python-sagemaker-sdk and boto3. Defining your own Docker container (the low-level API) is only necessary when your ML model is not based on one of the SageMaker-supported frameworks: Scikit-Learn, TensorFlow, PyTorch, and so on.

SageMaker Modules

There are many modules provided by SageMaker. In this article, the following modules will be used:

  • Notebook instances: fully managed Jupyter notebook instances where you can test your machine learning code with access to all other AWS services (e.g. S3)
  • Training jobs: where model training jobs are managed
  • Models: where trained models are managed
  • Endpoints: fully managed web services that take requests (HTTP or others) as input and return predictions as responses
  • Endpoint configurations: configurations for endpoints

2. Prepare SageMaker Model

To train and deploy a machine learning model on SageMaker, we need to prepare a Python script that defines the behaviour of our model via the following Python functions:

Main script

  • It is defined within if __name__ == '__main__'.
  • It is run by SageMaker with several command-line arguments and environment variables passed into it.
  • It should load training data from the --train directory and output the trained model (usually a binary file) into --model-dir.
  • Example code looks like this:
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=50)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--learning-rate', type=float, default=0.05)

    # Data, model, and output directories
    parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))

    args, _ = parser.parse_known_args()

    # Another function that does the real work (and keeps the code cleaner)
    run_training(args)

Besides the main script, the following functions are defined for model deployment:

  • model_fn: loads the saved model binary file(s)
  • input_fn: parses input data from the calling interface (HTTP request or Python function call)
  • predict_fn: takes the parsed input and makes predictions with the model loaded by model_fn
  • output_fn: encodes the prediction made by predict_fn and returns it to the corresponding interface (HTTP response or Python function return)

model_fn

The model_fn function should look into model_dir and load the saved model binary file into memory for prediction tasks.
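
For a scikit-learn model, a minimal sketch (assuming the training script saves the model with joblib as model.joblib) might look like this:

import os

import joblib

def model_fn(model_dir):
    # Load the model binary saved by the training script back into memory.
    # The file name 'model.joblib' is an assumption; it must match whatever
    # the training code writes into model_dir.
    return joblib.load(os.path.join(model_dir, 'model.joblib'))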

input_fn

input_fn takes two arguments: request_body and request_content_type. Usually, input_fn should check the type of the input data (request_content_type) before parsing it. You can also do data pre-processing here.
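
A minimal sketch that accepts JSON input (assuming the model expects a two-dimensional array of features) might look like this:

import json

import numpy as np

def input_fn(request_body, request_content_type):
    # Only JSON payloads are handled in this sketch; anything else is rejected.
    if request_content_type == 'application/json':
        data = json.loads(request_body)
        return np.array(data)
    raise ValueError(f'Unsupported content type: {request_content_type}')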

predict_fn

predict_fn takes two arguments: input_data and model. input_data is the output of input_fn, and model is the output of model_fn. Usually, we may need to do some data transformation before prediction.
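
For a scikit-learn model, a minimal sketch can simply delegate to the model's predict method (any extra feature transformation would go here too):

def predict_fn(input_data, model):
    # input_data is whatever input_fn returned; model is what model_fn loaded.
    return model.predict(input_data)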

output_fn

output_fn also takes two arguments: prediction and accept. prediction is the output of predict_fn, and accept is the expected output format (e.g. application/json) from the client side.
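
A minimal sketch that serialises the prediction to JSON (assuming the client accepts application/json; the exact return convention can vary between container versions) might look like this:

import json

import numpy as np

def output_fn(prediction, accept):
    # Encode the prediction in the format the client requested.
    if accept == 'application/json':
        return json.dumps(np.asarray(prediction).tolist())
    raise ValueError(f'Unsupported accept type: {accept}')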

All of the above functions should be put into a single Python script (let's say it is train_and_deploy.py). We can then use python-sagemaker-sdk to test our model for SageMaker in our local environment.

3. Model development with python-sagemaker-sdk

Besides the train_and_deploy.py script, we need to prepare another Python script to drive python-sagemaker-sdk.
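
A minimal sketch, assuming a Scikit-learn model and using placeholder role and S3 paths rather than values from the original repository, might look like this:

from sagemaker.sklearn import SKLearn

sklearn_estimator = SKLearn(
    entry_point='train_and_deploy.py',
    role='<your-sagemaker-execution-role-arn>',
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    framework_version='0.20.0',
    hyperparameters={'epochs': 50, 'batch-size': 64},
)

# Train on data staged in S3, then publish the trained model as an endpoint.
sklearn_estimator.fit({'train': 's3://<path-to-training-data>'})
predictor = sklearn_estimator.deploy(initial_instance_count=1,
                                     instance_type='ml.t2.medium')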

We use from sagemaker.sklearn import SKLearn because our model is based on Scikit-learn. If the model is based on TensorFlow, we can use from sagemaker.tensorflow import TensorFlow instead.

The meaning of these arguments can be found in the official SageMaker documentation for Scikit-learn, TensorFlow, and PyTorch.

Debug Locally

One thing to pay attention to: every time we call sklearn_estimator.fit or sklearn_estimator.deploy, the SageMaker SDK starts a new instance and runs the corresponding job, which is very slow if it runs on the server side just for debugging. In this case, we can set train_instance_type='local' or instance_type='local' to run locally, which is much faster (make sure Docker is set up on your local machine).

Test Published Endpoint

Once the trained model is published as an endpoint service, we can test the endpoint with new data.
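
A minimal sketch using the boto3 runtime client, with a placeholder endpoint name and payload, might look like this:

import json

import boto3

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='<your-endpoint-name>',
    ContentType='application/json',
    Body=json.dumps([[1.0, 2.0, 3.0]]),  # placeholder feature vector
)
print(response['Body'].read().decode())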

4. Using Lambda Function to control SageMaker

Now we have successfully trained and deployed a model on SageMaker. But that is not enough.

In the real world, we receive new data every day and need to retrain the machine learning model periodically.

Moreover, model training usually takes a long time, and we need to make sure that once the training job is done, the trained model is automatically deployed to the existing endpoint (SageMaker does not do this automatically).

There are two solutions for this: Step Functions and Lambda.

  • Solution 1: Step Functions. We can define a Lambda function to check the status of a training job, then use a Step Functions state machine to call it periodically (e.g. every hour); once training is done, another Lambda function deploys the model. This solution is well documented at https://github.com/aws-samples/serverless-sagemaker-orchestration
  • Solution 2: Lambda. Once a model training job finishes, the trained model is written to an S3 bucket. The S3 PUT event can be associated with a Lambda function to trigger model deployment.

This article demonstrates Solution 2: using a Lambda function and S3 events to manage SageMaker.

The lambda_handler takes the argument event, which contains information about how the Lambda was triggered. By interpreting the event, we can perform different actions within one handler. The following is an example of the Lambda function we use:


def lambda_handler(event, context):
    response = event

    # Handle S3 events
    if 'Records' in event:
        for records in event['Records']:
            if 's3' in records:
                handle_s3_event(records['s3'])
            else:
                # unrecognised events
                pass

    # Handle scheduled tasks
    if 'task' in event:
        if event['task'] == 'retrain':
            # start retraining the model
            retrain_the_model()
            response = "OK"
        if event['task'] == 'prediction':
            response = make_prediction()

    return {
        'statusCode': 200,
        'body': response,
    }

Three different events are handled here:

  • If the input is an S3 event, it calls handle_s3_event(records['s3']) to handle it.
  • Otherwise, if the event is a model retrain task, it calls retrain_the_model().
  • And if the event is a prediction task, it calls make_prediction() and returns the result as the response.

Trigger model retrain task periodically

A Lambda function can be triggered periodically by CloudWatch Events. We can add a CloudWatch Events trigger and set up a Rule:

  • Event Source should be Schedule, with a customised event pattern (very similar to crontab on Linux).
  • Set Targets to the Lambda function we use to control SageMaker.
  • Configure input can be Constant (JSON text), for example {"task": "retrain"}, so that the lambda_handler knows what to do with it.

Model Retrain

This is an example of model re-training function.

from datetime import datetime

import boto3

src_path = 's3://<path-to-source-code-train_and_deploy.py>.tar.gz'
role_arn = '<your-sagemaker-execution-role-arn>'  # IAM role with SageMaker permissions

def retrain_the_model():
    now_str = datetime.utcnow().strftime('%Y-%m-%d-%H-%M-%S-%f')
    training_job_name = f'<put-training-job-name-here>-{now_str}'
    sm = boto3.client('sagemaker')
    resp = sm.create_training_job(
        TrainingJobName=training_job_name,
        AlgorithmSpecification={
            'TrainingInputMode': 'File',
            'TrainingImage': '783357654285.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3',
        },
        RoleArn=role_arn,
        InputDataConfig=[
            {
                'ChannelName': 'train',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': 's3://<path-to-training-data>',
                        'S3DataDistributionType': 'FullyReplicated',
                    }
                },
            },
        ],
        OutputDataConfig={
            'S3OutputPath': 's3://<path-to-output_dir>'
        },
        ResourceConfig={
            'InstanceType': 'ml.m4.xlarge',
            'InstanceCount': 1,
            'VolumeSizeInGB': 30,
        },
        StoppingCondition={
            'MaxRuntimeInSeconds': 600
        },
        HyperParameters={
            'sagemaker_program': 'train_and_deploy.py',
            'sagemaker_region': '<your-aws-region>',
            'sagemaker_job_name': training_job_name,
            'sagemaker_submit_directory': src_path,
        },
        Tags=[],
    )

This function will create a model training job.

Please note that we cannot use python-sagemaker-sdk within the Lambda environment, so the best option is boto3. That means we need to upload train_and_deploy.py to S3 in a gzipped tar package ourselves, and set everything up explicitly as in the code above.
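
A hypothetical sketch of packaging and uploading the script (the bucket and key below are placeholders) might look like this:

import tarfile

import boto3

# Bundle the training/inference script the way the SageMaker SDK normally would.
with tarfile.open('sourcedir.tar.gz', 'w:gz') as tar:
    tar.add('train_and_deploy.py')

# Upload the archive; its S3 URI is what sagemaker_submit_directory should point to.
boto3.client('s3').upload_file(
    'sourcedir.tar.gz', '<your-bucket>', 'sagemaker/source/sourcedir.tar.gz')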

Model Deployment

Once a model training job is done, we need to deploy the trained model and update the existing endpoint. SageMaker doesn't do this automatically, and we cannot ask Lambda to wait for the training job, as training may take several hours.

Once a training job is done, an S3 PUT event will be triggered, which can notify our lambda function that training is done and we can do deployment now.

Within an S3 PUT event, the key of the S3 object is provided, which usually contains the unique ID of our training job, as in the following code:


def handle_s3_event(s3):
    bucket = s3['bucket']['name']
    fn = s3['object']['key']
    # The training job name is the third-to-last segment of the artefact key:
    # <output-dir>/<training-job-name>/output/model.tar.gz
    jobid = fn.split("/")[-3]
    return deploy_model(jobid)

Once the model training job ID is known, we can then call the deploy_model function to deploy our model, which looks like this:


endpoint_name = '<your-endpoint-name>'
src_path = 's3://<path-to-source-code-train_and_deploy.py>.tar.gz'
role_arn = '<your-sagemaker-execution-role-arn>'  # IAM role with SageMaker permissions

def deploy_model(training_job_name):
    sm = boto3.client('sagemaker')

    # Register the trained model artefact as a SageMaker Model
    model = sm.create_model(
        ModelName=training_job_name,
        PrimaryContainer={
            'ContainerHostname': 'model-Container',
            'Image': '783357654285.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3',
            'ModelDataUrl': f's3://<path-to-your-output-dir>/{training_job_name}/output/model.tar.gz',
            'Environment': {
                'SAGEMAKER_PROGRAM': 'train_and_deploy.py',
                'SAGEMAKER_REGION': '<your-aws-region-name>',
                'SAGEMAKER_SUBMIT_DIRECTORY': src_path,
            },
        },
        ExecutionRoleArn=role_arn,
    )

    # Create a new endpoint configuration that points at the new model
    endpoint_config = sm.create_endpoint_config(
        EndpointConfigName=training_job_name,
        ProductionVariants=[
            {
                'VariantName': 'AllTraffic',
                'ModelName': training_job_name,
                'InitialInstanceCount': 1,
                'InstanceType': 'ml.t2.medium',
            },
        ],
    )

    # Switch the existing endpoint over to the new configuration
    sm.update_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=training_job_name)

...

With that, everything is set up: the model retrains periodically, and the endpoint is always backed by the latest trained version.

Wrap Up

SageMaker is especially helpful when you need to retrain your model periodically and serve it as a web service. For training, SageMaker can automatically start a high-performance EC2 instance and finish model training quickly at minimal cost. For serving, SageMaker takes care of auto-scaling and keeps your endpoint available.

To use SageMaker for machine learning, the most important step is to prepare a script that defines the behaviour of your model. You also have full control of the whole system by creating your own Docker container.

Other things to be aware of

  • We have not included model performance testing in this workflow, which is also important in a machine learning pipeline;
  • S3 events don't guarantee 100% delivery. If model training is critical, Step Functions is the better choice;
  • Make sure your AWS role has enough permissions to control the necessary resources;
  • SageMaker can also run batch prediction jobs, and there are many other features left to explore.

· One min read

I recently studied for and passed the Google Cloud Professional Data Engineer certification exam. This is a 12-page exam study guide that I personally compiled and used, in the hope of covering all the key points. Below you’ll find a preview of the study guide, but you can click here to download the full PDF file.

I hope this helps anyone looking to sit their certification! Note: This is a good way to recap, but you may still need to read original documentation for each product to learn new concepts etc. And remember — there’s no substitute for hands-on experience with the tools. 😀

(Preview: pages 1–12 of the study guide)

PDF file: https://github.com/xg1990/GCP-Data-Engineer-Study-Guide/blob/master/GCP%20Data%20Engineer.pdf

Thanks to Graham Polley for helping me prepare my first Medium article!