DataTalks.Club

Data Science on AWS

by Chris Fregly, Antje Barth

The book of the week from 28 Jun 2021 to 02 Jul 2021

With this practical book, AI and machine learning practitioners will learn how to successfully build and deploy data science projects on Amazon Web Services. The Amazon AI and machine learning stack unifies data science, data engineering, and application development to help level up your skills. This guide shows you how to build and run pipelines in the cloud, then integrate the results into applications in minutes instead of days. Throughout the book, authors Chris Fregly and Antje Barth demonstrate how to reduce cost and improve performance.

Questions and Answers

Antje Barth

Hi everyone! Chris Fregly and I are here to answer your questions!

Antje Barth

We’re also running a live, 4hrs hands-on workshop tonight, starting at 6pm CET. The workshop is based on our book and code repo: Build an End-To-End Pipeline With BERT, TensorFlow, and Amazon SageMaker. You can RSVP here.

Lalit Pagaria

Thanks for sharing the link. Is it possible to share the link with other communities?

Antje Barth

Sure, please do!

Kshitiz

Hi Antje Barth, would this session be recorded? It will get pretty late in IST.

Ken Lee

Hi same question here Antje Barth, it would be pretty late in SGT too

Antje Barth

We do have previous workshop recordings shared here:
https://youtube.datascienceonaws.com/
For example, the May workshop

Chris Fregly

yesterday’s workshop here: https://youtu.be/R0dyrBQMAnQ

Ken Lee

That’s great stuff! Thanks

Agrita Ga

Hi Antje Barth and Chris Fregly. 👏
Is there currently any standard (or trend) for MLOps best practices in the context of the AWS tech stack?

Antje Barth

I think it’s going to be a phased approach
1/ Moving away from manually building models to
2/ Orchestrating the individual model building workflow steps (ie. data preprocessing, training, evaluating) in a pipeline, and automating tasks within each step
3/ Automatically re-running training and/or deployment pipelines on triggers such as model decay, or code changes
The tools and technologies will likely vary based on each use case and team experience. AWS recently launched SageMaker Projects, which automatically sets up a CI/CD automation for your ML pipelines. So whenever you commit new code into a code repo, it would re-run the model building and deployment pipeline.
On manual approval, you could deploy into a staging environment, on a second approval, after running integration tests, you could deploy into a production environment. This is just a sample setup.
We also describe this approach in our book!
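
The three phases and the approval gates described above can be sketched in plain Python. This is only a minimal illustration, not how SageMaker Projects is implemented: the step functions, the `Pipeline` class, `should_rerun`, and `deploy_target` are all hypothetical names made up for this example, and the "model" is a toy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical workflow steps. Real steps would launch processing,
# training, and evaluation jobs instead of doing arithmetic.
def preprocess(state: Dict) -> Dict:
    state["features"] = [x * 2 for x in state["raw"]]
    return state


def train(state: Dict) -> Dict:
    # Toy "model": the mean of the features.
    state["model"] = sum(state["features"]) / len(state["features"])
    return state


def evaluate(state: Dict) -> Dict:
    state["accuracy"] = 0.9 if state["model"] > 0 else 0.1
    return state


@dataclass
class Pipeline:
    """Phase 2: run the individual steps in a fixed order."""
    steps: List[Callable[[Dict], Dict]]
    log: List[str] = field(default_factory=list)

    def run(self, state: Dict) -> Dict:
        for step in self.steps:
            state = step(state)
            self.log.append(step.__name__)
        return state


def should_rerun(code_changed: bool, live_accuracy: float, min_accuracy: float) -> bool:
    """Phase 3: re-run on a code change or when the live model has decayed."""
    return code_changed or live_accuracy < min_accuracy


def deploy_target(staging_approved: bool, tests_passed: bool, prod_approved: bool) -> str:
    """Return the furthest environment the model may be deployed to."""
    if not staging_approved:
        return "none"
    if tests_passed and prod_approved:
        return "production"
    return "staging"


pipeline = Pipeline(steps=[preprocess, train, evaluate])
result = pipeline.run({"raw": [1, 2, 3]})
```

In this sketch, `should_rerun(False, 0.7, 0.8)` is true because the live accuracy fell below the minimum, and `deploy_target(True, True, False)` stops at staging until the second approval is given.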

Bayram Kapti

Hi Antje Barth & Chris Fregly! Thanks for sharing your thoughts here with us!
While every problem requires a different approach and solution, it’s said that most of the time, fundamental algorithms in DS are the most effective solutions for actually getting results.
1-) Do you agree? How has it been working for you so far?
2-) How does it work at #aws? Are there systems, procedures built at Amazon to make sure best solution is used when approaching a problem?

Antje Barth

You should always go for the solution that solves your problem the fastest and in the most efficient way. For example, if you need a simple image recognition model, why build the model from scratch yourself, if there’s plenty of algorithms and models in the industry that have already solved this problem? Re-use and customize it to your needs. You should focus your time on solving the actual business problem, not figuring out how to train a model or managing infrastructure.
Coming back to AWS, I would always recommend looking at the pre-built AI services first, ie. Amazon Rekognition for object/video detection, Amazon Comprehend for NLP, Amazon Textract for document text extraction, Amazon Personalize for personalized recommendations etc. If they already solve your problem, great.
If you need to build and train your own model, go to the next level, Amazon SageMaker. This service gives you the freedom to either use built-in algorithms, write your own code, or even bring your own Docker images with your custom libraries. This layer still takes care of all of the infrastructure management for you.
AWS always works back from the customer in solving customers’ challenges.

Bayram Kapti

3-) Also, what does your day look like working at AWS?

Antje Barth

Getting up, a lot of coffee ☕😅, checking emails and Slack, a few calls with team members, working on demo code, presenting at a conference, running a workshop, dropping in here and answering #book-of-the-week questions! 🙂

Alex

Hi there Antje Barth & Chris Fregly! Pleasure talking to you 🙂
What are the three most important things that someone using AWS (instead of any of the other main cloud providers) for ML/AI purposes can brag about?
It’s known that top-tier cloud providers offer very similar services, so how would you convince someone who doesn’t know which one to start using to get into AWS? And more specifically, to AWS for data science?

Antje Barth

AWS always approaches everything they do by focusing on their customers. This customer obsession drives 90 percent of the product roadmap, leading to the invention of a bevy of machine learning services at all three layers of the stack that I mentioned earlier:

For example, at the bottom layer of the stack AWS builds their own custom silicon designed to accelerate deep learning workloads with AWS Inferentia.
In 2017, AWS launched Amazon SageMaker – a fully managed service that enables developers to quickly build, train, and deploy machine learning models at scale.
And at the top layer of the stack, AWS offers the broadest set of artificial intelligence services – with little to no ML experience required.
This includes entirely new categories of services enabled through machine learning including Amazon Kendra to reinvent enterprise search, Contact Lens for Amazon Connect to embed intelligence into contact center operations, Amazon HealthLake to help healthcare organizations make sense of their data, and machine learning services purpose built for industrial and manufacturing companies just to name a few.
Also, AWS is the only cloud provider to offer a choice of Intel, AMD, and Arm processors. And for running machine learning at scale and in production, AWS introduced the Inf1 instances for EC2, which are powered by AWS Inferentia chips.
So I’d say it’s the combination of the depth and breadth of services, together with the fast pace in innovation, always working backwards from customer challenges, and the experience and maturity after 15 years in business paired with millions of active customers and tens of thousands of partners globally.

David Cox

Very excited to see this book here! Riffing off Alex’s post, my question is why AWS? Why contain everything within the AWS ecosystem?

Chris Fregly

I use the AWS ecosystem everyday - works very well for me! 🙂

Chris Fregly

We work with a lot of customers that use open source container (ie. Kubernetes), analytics (ie. Apache Spark), and machine learning (ie. TensorFlow/PyTorch/Scikit-Learn) technologies with AWS. AWS handles the “undifferentiated heavy lifting” of managing your own infrastructure. Would you like to spend your time debugging lower-level infrastructure issues? Or would you like to spend your time solving higher-level business problems? I personally like to solve business problems, so I use AWS for my analytics and ML infrastructure needs.

Heeren Sharma

Hi Antje Barth, thanks for the heads up regarding the workshop. I have not played around with SageMaker but have experience with Azure ML. One challenge that I see time and again is integration among services, e.g. for a full-blown data pipeline which leverages ML real-time scoring (and maybe training), something like a feedback loop. How well does SageMaker integrate with the other AWS services (e.g. with a Glue job, etc.)? Another quick follow-up question: how well does code integration work, e.g. can one connect to Jupyter servers from a local IDE, and are there extensions? Many thanks.
Disclaimer: I am primarily coming from the AWS world, but for the past year I have been working with Azure ML for a client project. So I am curious how AWS solves these problems 🙂

Chris Fregly

Amazon SageMaker integrates with many of the AWS services - and we are always adding new integrations based on customer feedback. More specifically, SageMaker Pipelines integrate with EventBridge to trigger a Start Pipeline event or respond to state changes within your pipeline execution. Also, SM Pipelines supports a flexible CallbackStep that lets you call any service directly using Python, Java, etc.
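
To make the EventBridge integration a bit more concrete, here is a sketch of an event pattern that reacts to a failed pipeline execution, together with a deliberately simplified local matcher. The `source`, `detail-type`, and `currentPipelineExecutionStatus` values are assumptions that should be checked against the EventBridge documentation for SageMaker events, and `matches` is a hypothetical helper that only covers exact-value matching, not the full EventBridge pattern language.

```python
from typing import Any, Dict

# Assumed event pattern for pipeline executions that end in a failed state.
# Field names and values are assumptions; verify them in the AWS docs.
FAILED_PIPELINE_PATTERN: Dict[str, Any] = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Building Pipeline Execution Status Change"],
    "detail": {"currentPipelineExecutionStatus": ["Failed"]},
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: every pattern key must exist in the event,
    leaf lists hold the allowed values, nested dicts are matched recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical sample events for illustration only.
failed_event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Model Building Pipeline Execution Status Change",
    "detail": {"currentPipelineExecutionStatus": "Failed"},
}
succeeded_event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Model Building Pipeline Execution Status Change",
    "detail": {"currentPipelineExecutionStatus": "Succeeded"},
}
```

A rule with a pattern like this could route failed executions to a notification target, while successful ones are ignored.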

Heeren Sharma

Many thanks! Sounds awesome.

Fernando Lichtschein

Hello! Is there a workshop that can be done by an IT student using an AWS Educate classroom account that can demonstrate a real pipeline? Ideally it should not involve too much knowledge of AWS services.

Antje Barth

I’m not super familiar with the Educate account, but we recently launched a Coursera specialization in collaboration with DeepLearning.AI, called Practical Data Science.
It comes with 3 courses and 10 weeks of lecture videos, quizzes, and on-demand, self-paced hands-on labs! (No personal AWS account required; you get access to a demo account): practical data science
This might be a good path to start!

Ken Lee

Looks great! Thanks Antje Barth will start on this

Fernando Lichtschein

Great, thank you!

Asmita

Hi Antje Barth and Chris Fregly, will the book be helpful for beginners who are currently learning AWS and do not have much hands-on experience? What advice would you give on how to start working on AWS?

Antje Barth

If you are completely new to AWS, you might want to start with a general, intro-level AWS course such as the AWS Cloud Practitioner, or AWS Fundamentals course.
As for our book, we always start with an introduction and discussion of the relevant concepts first before we go into the AWS specifics to help readers follow along. So as long as you bring some basic cloud and ML knowledge, you should be able to follow along easily.

Philip Dießner

Hello Antje Barth and Chris Fregly.
Thanks for being here! Where do you see your book on the way to becoming AWS ML Certified? Is it + some more practice and experience a good basis? Or are there areas where other learning resources should be used?

Antje Barth

The book definitely covers a good amount of ground towards the ML certification. In addition, I’d spend some time hands-on, familiarizing yourself with the services and concepts. And as a final check, scanning through the specific ML cert preparation materials!

Livsha Klingman

Hi Antje Barth & Chris Fregly! Thank you for being open for Q&As!
I am a trained data scientist, but my employment is pushing me more into the data engineering spectrum, due to the complexity & velocity with which data is now flooding the ‘market’, and the greater need for pipelining data to apply ML/AI applications. After successfully building a few pipelines, both in Azure and AWS, I find myself quite bewildered by the array of AWS platform tools and the almost insurmountable amount of syntax surrounding each. Does your book provide a method of selecting the correct tools, or combinations of tools, for each given data scenario, to optimize the pipeline, as well as maximize the data prep needed for ML in a way that speaks not just to professionally trained data engineers? The documentation ‘out in the market’ I find quite overwhelming coming from a data science stance, as it is designed for professional data engineers at large. My role is purely ML/AI goal-oriented data pipelining.

Chris Fregly

the book is focused on ML/AI-oriented data pipelines, yes. the first couple chapters show the breadth of services available for different use cases. the remaining chapters dive deep into building an end-to-end pipeline for a single ML/AI use case. we preferred this approach because it provides a specific slice through the broad set of AWS options across the analytics, machine learning, streaming, and security services on AWS.

Livsha Klingman

Thank you!

Ganeshkumar

Hi Antje Barth and Chris Fregly , thank you so much for Q&As.

  1. Currently AWS does not offer drag and drop feature in UI, which may simplify things while model creation. When can we expect such feature in Sagemaker?
  2. In your view how much of data architecture skill is required for data scientist to excel in building an effective model for consumption?

Chris Fregly

AWS offers drag-and-drop for the data/feature engineering step of the pipeline. This service is called SageMaker Data Wrangler and it’s integrated with SageMaker Studio. we cover this in the book. I can’t comment on the roadmap for SageMaker, but we prioritize features based on customer feedback - and i will pass this request on to the SageMaker team!

Chris Fregly

having a good data foundation is very important to building effective models. everything starts with your data.

Ganeshkumar

Thanks very much !

Livsha Klingman

Hi again Antje Barth & Chris Fregly!
Taking full advantage of this opportunity… Due to my novice data engineering stance and my ‘on-the-job’ training, I basically devised a ‘split personality’ method: Part 1 focuses on the engineering, the most optimal ETL/ELT pipeline to a final data structure, and then and only then Part 2 focuses on the specific data requirements to implement the data science or analytical ML algorithms or visualization applications.
My question is… Is this really, in the long run (with the final goal in mind), inhibiting my optimized process? I see the benefits of a final comprehensive data structure at the end of the ETL/ELT process, but maybe the final data manipulation output (ML or AI) specifications should really be incorporated into the initial pipeline? Does it make a difference whether processing BIG data or not? Within the AWS platform or not?

Chris Fregly

definitely depends on the dataset (where is the data located?), model type (how difficult/time-consuming to re-generate the features for this model type?), organization structure (are you sharing this data with other teams outside of yours?), security needs (do you need to lock down certain features to certain teams?), etc. data pipelines can get complex, for sure. if the data transformations are being re-used by many teams, you might want to use SageMaker Feature Store to store and manage your transformed features - feature-store

Chris Fregly

SageMaker Feature Store is supported by SageMaker Pipelines, btw.
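
The idea of computing a transformation once and letting several teams read the result can be illustrated with a tiny in-memory stand-in. This is not the SageMaker Feature Store API: `InMemoryFeatureStore`, `review_features`, and the `"reviews"` group name are hypothetical and exist only for this sketch.

```python
from typing import Any, Dict, Iterable, Optional


class InMemoryFeatureStore:
    """Toy feature store: feature groups hold records keyed by record id."""

    def __init__(self) -> None:
        self._groups: Dict[str, Dict[str, Dict[str, Any]]] = {}

    def put(self, group: str, record_id: str, features: Dict[str, Any]) -> None:
        # Store a copy so later changes by the caller do not leak in.
        self._groups.setdefault(group, {})[record_id] = dict(features)

    def get(self, group: str, record_id: str,
            names: Optional[Iterable[str]] = None) -> Dict[str, Any]:
        record = self._groups[group][record_id]
        if names is None:
            return dict(record)
        # Consumers can read only the features they are interested in.
        return {name: record[name] for name in names}


def review_features(text: str) -> Dict[str, Any]:
    """Hypothetical shared transformation, computed once per record."""
    words = text.split()
    return {"num_words": len(words), "num_chars": len(text)}


store = InMemoryFeatureStore()
store.put("reviews", "r1", review_features("great book on aws"))

# Two different consumers read the same stored features.
training_view = store.get("reviews", "r1")
dashboard_view = store.get("reviews", "r1", names=["num_words"])
```

The transformation runs once at write time, and each consumer selects the subset of features it needs at read time.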

Livsha Klingman

I am going to look into this - thank you for the direction and your time!

Livsha Klingman

Another… Antje Barth & Chris Fregly!
Do you have any ideas how I can boost my knowledge and professionalism within my constraints of my full time job?
When considering the whole picture, the pipeline AND the final output, what should the list of focus points and considerations be, in order of importance? What are the typical pitfalls to be aware of when integrating into the AWS platform?

Chris Fregly

data is the most important. useful data transformations that produce useful features.

Livsha Klingman

Does this mean that the pipeline should be adapted for the features involved and not adapt (slice) the data after the data loading?
The ideal, for producing a data science goal, would be to pipe the necessary features and not the whole data structure, like I am currently doing, only filtering out data (features) after all of the important data has been cleaned and stored in the correct structure?

Livsha Klingman

Thanks so much for your time and availability… Antje Barth & Chris Fregly
How do you see the tools in AWS providing an optimal medium for building & training a model? Is using the AWS infrastructure and their given tools really more effective for modeling, than experimenting with the vast assortment of other algorithms, NLP, CNNs and general deep and unsupervised learning tools?

Chris Fregly

Amazon SageMaker - and the broad AWS set of analytics tools such as Amazon Athena and Redshift - support all available algorithms for NLP, CNNs, deep learning, supervised learning, unsupervised learning, reinforcement learning - everything! AWS provides the managed infrastructure that allows you to focus more on the business problem and less on the infrastructure. You can also scale out your data processing and model training/tuning/deploying with a simple API call or click of a button. In other words, AWS does not limit your options in any way. You can easily convert the Python code running on your laptop to a scalable SageMaker Job. If it runs locally, it can run on the SageMaker infrastructure.

Livsha Klingman

Amazing - almost too good to be true… so what’s the snag? Why am I not crossing paths with others using AWS in this way?

Sara Lane

The AI field is constantly evolving, and Amazon keeps adding new features to its AI suite. I think that many people are actually taking advantage of these features, and I think that usage will grow as time goes on.

Tim Becker

Hello Antje Barth and Chris Fregly, I would like to ask you a few questions concerning the book. Of course, the book focuses on AWS, but is the covered knowledge transferable to other platforms? For example, if my company is using Google.

Chris Fregly

Sure, the book discusses the concepts/problems at a high level, then dives deep into how to implement the solution on AWS. Throughout the book, we provide best practices. Some of the cost-savings and performance-related tips are AWS specific, but the concepts are transferable across all environments.

Tim Becker

Do you have some recommendations on how to practice cloud services for free?

Chris Fregly

SageMaker supports the free tier. Just make sure to clean up the resources when you’re done to avoid extra cost beyond the free tier, i.e. shut down the notebooks, etc.

Tim Becker

What kind of performance can we expect from models developed with SageMaker Autopilot?

Chris Fregly

SageMaker Autopilot builds a set of model pipeline candidates using various algorithms and feature-engineering steps using Automated ML (AutoML). Performance varies depending on your dataset and type of problem (classification, regression, NLP, multi-layer perceptron, etc). The cool part is you can just point Autopilot to your dataset and try it out. Your mileage may vary, etc.
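
The general idea of trying combinations of feature-engineering steps and algorithms and ranking the resulting candidates can be sketched on toy data. This is a hand-rolled illustration of the AutoML concept, not how Autopilot works internally: the transforms, the two toy "algorithms", and the `search` function are all assumptions made for this example.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], float]


def identity(x: float) -> float:
    return x


def square(x: float) -> float:
    return x * x


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Baseline candidate: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model: Model, xs: Sequence[float], ys: Sequence[float]) -> float:
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def search(train: Tuple[Sequence[float], Sequence[float]],
           valid: Tuple[Sequence[float], Sequence[float]]) -> List[Tuple[float, str, str]]:
    """Fit every (transform, algorithm) candidate and rank by validation error."""
    train_x, train_y = train
    valid_x, valid_y = valid
    results = []
    for transform, fit in product([identity, square], [fit_mean, fit_line]):
        model = fit([transform(x) for x in train_x], train_y)
        score = mse(model, [transform(x) for x in valid_x], valid_y)
        results.append((score, transform.__name__, fit.__name__))
    return sorted(results)


# Toy dataset where the target is the square of the input.
leaderboard = search(([0, 1, 2, 3, 4], [0, 1, 4, 9, 16]), ([5, 6], [25, 36]))
best_score, best_transform, best_algorithm = leaderboard[0]
```

On this toy dataset the candidate that squares the input before fitting a line ranks first, which mirrors the point that results depend on the dataset and the type of problem.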

Tim Becker

thank you for your answers, is testing of autopilot included in the free trial?

Ken Lee

Hello Antje Barth and Chris Fregly. What is your advice for running a proper machine learning project that accumulates as little technical debt as possible? A lot of the tutorials found online teach us HOW to build a model, but do not cover much beyond that. Or can we find the answer in the book?

Chris Fregly

In an ideal world, you would have clean data, perfect feature transformations, the best algorithm, and the optimal set of hyper-parameters. This way, you would just train your models and push them to production. This is rarely the case. Anything beyond the ideal scenario will incur some technical debt. The ideal state would be a perfectly-tuned experimentation pipeline that lets you easily try, track, and compare different combinations of data, feature transformations, algorithms, and hyper-parameters. This is what SageMaker Pipelines (integrated with SageMaker Experiments) offers as a managed service to let you focus on the business problem.

Chris Fregly

short answer: the less infrastructure code you need to write, the less technical debt you will incur as you are focusing only on the business problem.

Chris Fregly

and yes, the book covers a clean, end-to-end implementation of the common machine learning task of text classification using natural language processing (NLP) and BERT.

Ken Lee

Great! Thanks for the tips! Appreciate it

Ken Lee

also particularly curious on the Model Monitoring aspects of the deployed model, aware that ML models invariably suffer from performance degradation. Which AWS services cover the monitoring part? Also regarding the containerization of application on AWS, do you have a good resource to point us to or it is already covered in the book?

Chris Fregly

SageMaker Model Monitor is the AWS service that monitors models for drift and degradation in production. Model Monitor uses an open source library called deequ (a library on top of Apache Spark) to continuously monitor the live model-prediction inputs for statistical drift against the baseline training dataset - as well as model prediction drift. For other types of statistical analysis, check out SageMaker Clarify and the accompanying open source library smclarify.

Chris Fregly

Containerization of applications should be very well-documented as containerization is so common these days. If you have trouble finding examples, ping me and I can point you to some references. If you mean containerization of the trained models, we cover this in Chapter 10 when we talk about deploying models.

Chris Fregly

Oh, and it’s worth highlighting that SageMaker Model Monitor is integrated with Amazon CloudWatch to automatically notify you if your model starts to drift or degrade outside of a given threshold. This is a very powerful mechanism to help wake up the data scientist at 2am! 🙂
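
The idea of comparing live inputs against a training baseline and alerting past a threshold can be shown with a very small check. This sketch swaps in a simple mean-shift score measured in baseline standard deviations; it does not use deequ or the Model Monitor API, and `baseline_stats`, `drift_score`, `check_drift`, and the default threshold of 2.0 are assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def baseline_stats(values: Sequence[float]) -> Dict[str, float]:
    """Summarize one feature of the training dataset."""
    return {"mean": mean(values), "std": stdev(values)}


def drift_score(baseline: Dict[str, float], live: Sequence[float]) -> float:
    """Shift of the live mean from the baseline mean, in baseline std units."""
    shift = abs(mean(live) - baseline["mean"])
    if baseline["std"] == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / baseline["std"]


def check_drift(baseline: Dict[str, float], live: Sequence[float],
                threshold: float = 2.0) -> bool:
    """Return True when the live window drifted past the threshold."""
    return drift_score(baseline, live) > threshold


# Hypothetical feature values for illustration only.
baseline = baseline_stats([10, 12, 11, 9, 13])
stable_window = [11, 10, 12]
shifted_window = [20, 21, 19]

alerts = {
    "stable": check_drift(baseline, stable_window),
    "shifted": check_drift(baseline, shifted_window),
}
```

In a managed setup the boolean would instead be a metric with an alarm on it, so that a notification fires when the threshold is crossed.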

Ken Lee

Great stuff! Thanks Chris Fregly 👍👍👍

Ricky McMaster

Hi Antje Barth & Chris Fregly, many thanks for doing this!
My first question is on tool choices - I see from the preface that you cover both Kinesis and Kafka, but generally speaking you understandably stick to AWS products. Do you discuss (or can you imagine) use cases where it’s best to opt for a more customised solution? For example, more than one company has had scaling issues with Athena - how do you address this topic?

Chris Fregly

AWS offers Managed Streaming for Apache Kafka (MSK) which uses open source Apache Kafka as an alternative to Amazon Kinesis. However, I also see customers that prefer to manage their own Apache Kafka clusters on EC2 directly. These are all valid options - and the choice depends on many factors including the skillset of the team, etc.

Chris Fregly

I’m not familiar with the specific scaling issues that you mention. In those cases, the customer should work with AWS Support to diagnose the issue and find the right solution forward. Amazon Athena is based on - and compatible with - the open source Presto query engine, so the customer could choose to manage their own Presto cluster on EC2 with minimal changes to their data pipelines.

Ricky McMaster

Ok thanks for your take on this Chris Fregly, appreciated. The Presto approach you cite is something I’m a little familiar with.

Ricky McMaster

Secondly, do you see any scope at AWS for developing its own framework(s) to rival either TensorFlow or PyTorch?

Chris Fregly

Amazon has contributed Apache MXNet to the Apache Software Foundation, which is similar to TensorFlow and PyTorch. Just like any large enterprise, we use many different machine learning, deep learning, and reinforcement learning frameworks for our business needs.

Ken Lee

Correct me if I’m wrong but I recall that Amazon is a major contributor to PyTorch, alongside Facebook?

Alper Demirel

Hi, first of all thanks for being here. What should be considered when migrating an ML project from local or another cloud platform to AWS? What are your suggestions about this?

Antje Barth

Depends on the complexity of the workloads you want to migrate.
It could be as simple as using your local IDE, configuring a connection to AWS, importing the Python SDKs, and launching jobs with Amazon SageMaker.
For fully operational ML workload migrations, the plan would be more like:
1/ Plan the migration

  • Validate source code and datasets
  • Identify target build, train, and deployment instance types and sizes
  • Create capability list and capacity requirements
  • Identify network requirements
  • Identify the network or host security requirements for the source and target applications
  • Determine a backup strategy
  • Determine availability requirements
  • Identify the application migration or switchover strategy

2/ Configure the infrastructure

  • Network, security, storage

3/ Upload the data and code

  • Migrate datasets to provisioned S3 buckets
  • Package ML training/hosting code as Python packages and push to provisioned code repos

4/ Migrate the application, and cut over

Alper Demirel

Thank you very much, what you said means a lot to me. It will work great for me 🙏
