DataTalks.Club

Serverless Analytics with Amazon Athena

by Anthony Virtuoso

The book of the week from 28 Mar 2022 to 01 Apr 2022

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL, without needing to manage any infrastructure.

This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 Data Lake using Athena, including how to structure your tables using open-source file formats like Parquet. You’ll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you’ll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting KPI breaches using CloudWatch Metrics, running customizable connectors with AWS Lambda, and more. Moving on, you’ll work through easy integrations, troubleshooting and tuning common Athena issues, and the most common reasons for query failure. You will also review tips to help diagnose and correct failing queries in your pursuit of operational excellence. Finally, you’ll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server.

By the end of this book, you’ll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today’s ML modeling exercises.

Questions and Answers

adanai

Hello Anthony, thank you for the QnA! I am relatively new to AWS and working to specialize in Analytics+ML in AWS.
What level of experience in AWS is expected when starting with this book?

Anthony Virtuoso (Author)

As long as you have an AWS account, basic SQL knowledge, and know how to modify IAM policies, you can do everything in this book. We really tried to make it accessible to beginners and intermediate folks alike. Only one or two chapters at the end are a bit more advanced and require comfort with developing Java code.

adanai

What does an ETL pipeline look like? What are its components, and why does Athena natively not require one (or avoid one)?

Anthony Virtuoso (Author)

In this case we typically mean that you do not need to run jobs or queries that read literally ALL your data and then rearrange it or pre-join it with other data to make it (a) small enough, (b) in the right format, or (c) denormalized across data sources. These techniques can still be helpful with Athena, but Athena does its best to make data that is not well organized, optimized, or formatted using best practices accessible without all the upfront work and maintenance of ETL.
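
As an illustration of the point above (not taken from the book), here is a minimal sketch of how raw delimited files already sitting in S3 can be exposed to SQL by declaring a table over them, with no job that rewrites the data first. The helper name `external_table_ddl`, the database, table, column names, and the bucket path are all hypothetical; the statement it builds follows the `CREATE EXTERNAL TABLE` form that Athena accepts for delimited text, and it assumes each file has a single header row.

```python
def external_table_ddl(database, table, columns, s3_location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for delimited text files in S3.

    Only metadata is declared here; the files under s3_location are left
    untouched and are read as-is at query time.
    """
    # Table locations are S3 prefixes (folders), so make sure the path ends in "/".
    if not s3_location.endswith("/"):
        s3_location += "/"
    column_lines = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}'\n"
        # Assumption: every file starts with one header line to skip.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical names, for illustration only.
ddl = external_table_ddl(
    database="demo_db",
    table="raw_orders",
    columns=[("order_id", "string"), ("amount", "double"), ("order_date", "string")],
    s3_location="s3://example-bucket/raw/orders",
)
print(ddl)
```

The printed statement could be pasted into the Athena query editor. Converting the same data to a columnar format such as Parquet later would be an optimization and not a prerequisite for querying it.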

adanai

Would I, as a book reader, need to incur costs in executing the examples in this book?

Anthony Virtuoso (Author)

Yes, and we typically mention it in the chapters. While authoring the book we spent < $30 total on AWS fees, and keep in mind we had to run through the exercises 5 or 6 times to refine them.
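
To get a rough feel for why exercise costs can stay small, here is a back-of-the-envelope sketch (not from the book). It assumes Athena's pay-per-query model of billing by data scanned, and the specific figures used (5 USD per TB, rounding up to whole megabytes, a 10 MB minimum per query, binary units) are assumptions that should be checked against the current Athena pricing page for your region. The function name `estimate_query_cost` is hypothetical.

```python
import math

# Assumptions, not authoritative: verify against the current pricing page.
PRICE_PER_TB_USD = 5.00
MIN_BILLED_MB = 10
BYTES_PER_MB = 1024 ** 2
BYTES_PER_TB = 1024 ** 4


def estimate_query_cost(bytes_scanned):
    """Estimate the charge in USD for one query from the bytes it scanned."""
    # Round up to whole megabytes, then apply the assumed per-query minimum.
    billed_mb = max(math.ceil(bytes_scanned / BYTES_PER_MB), MIN_BILLED_MB)
    return billed_mb * BYTES_PER_MB / BYTES_PER_TB * PRICE_PER_TB_USD


# Example: 100 queries that each scan 1 GiB of data.
total = sum(estimate_query_cost(1024 ** 3) for _ in range(100))
print(f"Estimated total: ${total:.2f}")
```

Under these assumptions, scanning less data (for example by partitioning or using a columnar format) directly lowers the estimate. Other services used in the exercises, such as S3 storage or Lambda, are billed separately and are not modeled here.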

adanai

I recently started learning about and working with Tableau (small data - 100K records). All I did was click and drop elements and write a few queries for calculated fields. The results seemed satisfactory.
What are the ways in which the Athena integration makes working with Tableau better? Is it the scale that makes the difference?

Anthony Virtuoso (Author)

Aside from the visualizations that you get with Tableau, I think you'll find the experience with Athena is pretty similar, with one key difference: if you ever need to scale up from 100k records to 100 billion or somewhere in between, Athena will handle that just as well, and you might start to see other differentiation in terms of price and performance as you scale up. But I'd agree that at 100k records, probably even Excel can be used for a bunch of quick analysis.

Rui Ramos

Hi, reinforcing adanai's question: what level of AWS expertise is required to start on this book? Also, does the book contain any use cases for this service? Are there practical examples to try out?

Anthony Virtuoso (Author)

We tried to include practical examples in every chapter in the form of exercises. There are also lots of stories about cases where we’ve seen people use Athena for X or Y. As for AWS knowledge, the prerequisite is pretty small, just that you (a) have an AWS account, (b) know some basic SQL, and (c) are able to modify IAM policies in your AWS account, since each chapter requires different levels of access (though you could just do the entire thing with admin access if the account you’re using isn’t a production account or is a ‘burner’ as we say).

Matias Rebolledo Dezerega

Hi Anthony Virtuoso (Author)! I’m now starting to work with AWS and my question is about pipelines: can you build an ETL in Athena, or just work with an ELT like in Google BigQuery?
Does the book have examples of building a data lake by loading data from different sources (like APIs, connectors, etc.)? Thanks!!

Anthony Virtuoso (Author)

There is a chapter in the book with some basic examples of how to create what many would call an ETL pipeline, but if your needs include chains of jobs (probably anything more than 3 or 4 jobs), I’d recommend looking at Glue ETL, as it has the concept of scheduled jobs as well as a dependency modeler that can run chains of jobs based on triggers like data arriving in S3, completion of another job, or a schedule.
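
To make the idea of a "chain of jobs" concrete, here is a small generic sketch (not the Glue API and not from the book) that orders jobs so each one runs only after the jobs it depends on have completed. The job names and the `run_order` helper are hypothetical; the ordering itself uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return job names ordered so every job comes after its prerequisites.

    `dependencies` maps each job to the set of jobs that must finish first.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical four-step pipeline: each job waits on the previous one.
jobs = {
    "convert_to_parquet": {"land_raw_files"},
    "build_daily_summary": {"convert_to_parquet"},
    "refresh_dashboard_table": {"build_daily_summary"},
}

order = run_order(jobs)
for step, job in enumerate(order, start=1):
    print(step, job)
```

Once a pipeline grows past a handful of steps, tracking this ordering, plus retries and schedules, by hand becomes tedious, which is the kind of work a managed dependency modeler takes over.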

Matias Rebolledo Dezerega

Thanks!

Anthony Virtuoso (Author)

Great questions! Keep ’em coming, I’ll check back in later today.

Tim Becker

Hi Anthony Virtuoso (Author), really interesting topic! I was wondering who AWS charges for Athena and Lake Formation?

Anthony Virtuoso (Author)

What do you mean by who?

Tim Becker

Ah sorry, I meant how; sometimes my autocorrect behaves strangely.

Tim Becker

Could you please explain the difference between Athena and redshift (for beginners)?

Anthony Virtuoso (Author)

Redshift targets data warehouse use cases, while Athena is more geared towards ad hoc analysis without ETL. Keep in mind this is an extremely reductive explanation, and the two products can both do many of the same things, just with different price/performance and strengths.

Tim Becker

Is it possible to use Terraform or CloudFormation to set up the data lake? If possible, would you recommend it?

Denis L.

Hi Anthony Virtuoso (Author), thanks for answering the questions. Once ACID transactions are out of public preview for Athena and Lake Formation, do you see it potentially replacing Redshift (+ Spectrum)? It seems there are 2 competitive solutions that are converging. What is your take on that?

Anthony Virtuoso (Author)

It won’t replace Redshift + Spectrum, as those services do a lot more than just ACID. They have a different performance profile, cost, and SQL feature set.

Anthony Virtuoso (Author)

I do think you’ll see more overlap as time goes on, since AWS is striving to make it easy for customers to move between and/or use a combination of services as one seamless offering, since each is fit for a specific purpose. For example, can you use a hammer to break concrete? Yes, but a sledgehammer would be better. And you can use a sledgehammer to drive nails, but a hammer would be better.

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book

To see other books, check the book of the week page.
