Questions and Answers
Hello Anthony, thank you for the QnA! I am relatively new to AWS and working to specialize in Analytics+ML in AWS.
What level of experience in AWS is expected when starting with this book?
As long as you have an AWS account, basic SQL knowledge, and know how to modify IAM policies, you can do everything in this book. We really tried to make it accessible to beginners and intermediate folks alike. Only one or two chapters at the end are a bit more advanced and require some comfort with developing Java code.
What does an ETL pipeline look like? What are its components, and why does Athena natively not require one (or avoid it)?
In this case we typically mean that you do not need to run jobs/queries that read literally ALL your data and then rearrange it or pre-join it with other data to make it (a) small enough, (b) in the right format, or (c) denormalized across data sources. These techniques can still be helpful with Athena, but Athena does its best to make data that is not well organized, optimized, or formatted using best practices accessible without all the upfront work and maintenance of ETL.
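To make the "no upfront ETL" idea concrete, here is a minimal sketch (not taken from the book) of registering a table over raw CSV files in S3 and querying them in place. The database, table, column, and bucket names are hypothetical, and the sketch assumes the boto3 Athena client's `start_query_execution` call.

```python
"""Sketch: define a table over raw files in S3 and query it in place with Athena."""

# Hypothetical names, used only for illustration.
DATABASE = "demo_db"
RAW_DATA_LOCATION = "s3://example-bucket/raw/orders/"
RESULTS_LOCATION = "s3://example-bucket/athena-results/"

# The DDL only registers metadata; the files in S3 are not copied or rewritten.
CREATE_TABLE_SQL = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS {DATABASE}.raw_orders (
    order_id string,
    customer_id string,
    amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '{RAW_DATA_LOCATION}'
"""

QUERY_SQL = (
    f"SELECT customer_id, SUM(amount) AS total "
    f"FROM {DATABASE}.raw_orders GROUP BY customer_id"
)


def build_query_request(sql, database=DATABASE, output=RESULTS_LOCATION):
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output},
    }


def run_query(athena_client, sql):
    """Submit a query and return its execution id (does not wait for completion)."""
    response = athena_client.start_query_execution(**build_query_request(sql))
    return response["QueryExecutionId"]
```

With boto3 installed and credentials configured, `run_query(boto3.client("athena"), CREATE_TABLE_SQL)` would submit the DDL. Queries run asynchronously, so you would poll `get_query_execution` until the DDL finishes before submitting `QUERY_SQL`.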
Would I, as a book reader, need to incur costs in executing the examples in this book?
Yes, and we typically mention it in the chapters. While authoring the book we spent < $30 total on AWS fees, and keep in mind we had to run through the exercises 5 or 6 times to refine them.
I recently started learning about and working with Tableau (small data - 100K records). All I did was click and drop elements and write a few queries for calculated fields. The results seemed satisfactory.
In what ways does the Athena integration make working with Tableau better? Is it the scale that makes the difference?
Aside from the visualizations that you get with Tableau, I think you’ll find the experience with Athena is pretty similar. The one key difference is that if you ever need to scale up from 100k records to 100 billion, or somewhere in between, Athena will handle that just as well, and you might start to see other differentiation in terms of price and performance as you scale up. But I’d agree that at 100k records, probably even Excel can be used for a bunch of quick analysis.
Hi, reinforcing adanai’s question: what level of AWS expertise is required to start on this book? Also, does the book contain any use cases for this service? Are there practical examples to try out?
We tried to include practical examples in every chapter in the form of exercises. There are also lots of stories about cases where we’ve seen people use Athena for X or Y. As for AWS knowledge, the prerequisite is pretty small: just that you (a) have an AWS account, (b) know some basic SQL, and (c) are able to modify IAM policies in your AWS account, since each chapter requires different levels of access (though you could just do the entire thing with admin access if the account you’re using isn’t a production account or is a ‘burner’, as we say).
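For readers unsure what "modify IAM policies" involves, below is a rough sketch of a policy document for running Athena queries against one bucket. It is an illustrative assumption, not the policy the book uses: the bucket name is hypothetical, the action list is a plausible minimum, and individual chapters may need more or less access.

```python
import json

BUCKET = "example-bucket"  # hypothetical bucket holding data and query results


def build_athena_policy(bucket=BUCKET):
    """Return an IAM policy document (as a dict) for basic Athena usage."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RunAthenaQueries",
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "athena:StopQueryExecution",
                ],
                "Resource": "*",  # assumption: scope to specific workgroups in practice
            },
            {
                "Sid": "ReadTableMetadata",
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
                "Resource": "*",
            },
            {
                "Sid": "ReadDataWriteResults",
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:PutObject",
                ],
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Print JSON that could be pasted into the IAM console's policy editor.
    print(json.dumps(build_athena_policy(), indent=2))
```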
Hi Anthony Virtuoso (Author)! I’m just starting to work with AWS and my question is about pipelines: can you build an ETL in Athena, or only work with ELT like in Google BigQuery?
Does the book have examples of building a data lake that loads data from different sources (like APIs, connectors, etc.)? Thanks!!
There is a chapter in the book with some basic examples of how to create what many would call an ETL pipeline, but if your needs include chains of jobs (probably anything more than 3 or 4 jobs), I’d recommend looking at Glue ETL, as it has the concept of scheduled jobs as well as a dependency modeler that can run chains of jobs based on triggers like data arriving in S3, completion of another job, or a schedule.
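As a rough illustration of the job chaining described above, the sketch below builds Glue trigger definitions: one scheduled trigger for the first job, then a conditional trigger per later job that fires when the previous one succeeds. The job names, trigger names, and cron expression are hypothetical, the parameters assume the shape of boto3's Glue `create_trigger` call, and S3-arrival triggers are not covered here.

```python
def build_chain_triggers(job_names, schedule="cron(0 2 * * ? *)"):
    """Return create_trigger keyword arguments that run job_names in order."""
    if not job_names:
        return []

    # The first job starts on a schedule (assumed here: daily at 02:00 UTC).
    triggers = [
        {
            "Name": f"start-{job_names[0]}",
            "Type": "SCHEDULED",
            "Schedule": schedule,
            "Actions": [{"JobName": job_names[0]}],
            "StartOnCreation": True,
        }
    ]

    # Each later job starts only after the job before it has succeeded.
    for previous, current in zip(job_names, job_names[1:]):
        triggers.append(
            {
                "Name": f"{previous}-then-{current}",
                "Type": "CONDITIONAL",
                "Predicate": {
                    "Conditions": [
                        {
                            "LogicalOperator": "EQUALS",
                            "JobName": previous,
                            "State": "SUCCEEDED",
                        }
                    ]
                },
                "Actions": [{"JobName": current}],
                "StartOnCreation": True,
            }
        )
    return triggers


def create_triggers(glue_client, job_names):
    """Create every trigger in the chain using a boto3 Glue client."""
    for params in build_chain_triggers(job_names):
        glue_client.create_trigger(**params)
```

For example, `create_triggers(boto3.client("glue"), ["extract", "transform", "load"])` would register three triggers, assuming Glue jobs with those names already exist.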
Thanks!
Great questions! Keep ’em coming, I’ll check back in later today.
Hi Anthony Virtuoso (Author), really interesting topic! I was wondering who AWS charges for Athena and Lake Formation?
What do you mean by who?
Ah sorry, I meant how. Sometimes my autocorrect behaves strangely.
Could you please explain the difference between Athena and Redshift (for beginners)?
Redshift targets data warehouse use cases, while Athena is more geared towards ad hoc analysis without ETL. Keep in mind this is an extremely reductive explanation, and the two products can both do many of the same things, just with different price/performance and strengths.
Is it possible to use Terraform or CloudFormation to set up the data lake? If possible, would you recommend it?
Hi Anthony Virtuoso (Author), thanks for answering the questions. Once ACID transactions are out of public preview for Athena and Lake Formation, do you see it potentially replacing Redshift (+ Spectrum)? It seems there are 2 competitive solutions that are converging. What is your take on that?
It won’t replace Redshift + Spectrum, as those services do a lot more than just ACID. They have a different performance profile, cost, and SQL feature set.
I do think you’ll see more overlap as time goes on, since AWS is striving to make it easy for customers to move between and/or use a combination of services as one seamless offering, since each is fit for a specific purpose. For example, can you use a hammer to break concrete? Yes, but a sledgehammer would be better. And you can use a sledgehammer to drive nails, but a hammer would be better.