Questions and Answers
Matt Palmer, can you share examples of how folks at Replit use Databricks products (for ETL or otherwise)? I know you partnered with MosaicML to train your LLM (https://www.youtube.com/watch?v=ju73sWVtvU0), but I'm not sure about anything else.
I can’t comment on our tech stack outside of what you’ll find online 🙂
Databricks is a great option for ETL, but Snowflake has good options too.
Depending on your situation, an open-source stack could also work well.
In the field of software engineering, we rely on established best practices, design patterns, and architectural principles to develop robust and maintainable systems.
However, borrowing some of these principles when designing an ETL pipeline is not straightforward (at least in my opinion).
I am interested in applying layered architectures, such as business layers, repository layers, and routing layers, to my ETL pipelines, along with any other well-known design patterns.
My goal is to create a structure that resembles a well-known and proven architecture, or at least something better designed to avoid coupling and allow easier maintenance, instead of ending up with one large function containing all the code, or multiple small functions complementing each other without a clear design.
In your opinion, can these patterns and practices from traditional software engineering be effectively translated to data pipelines in tools like Airflow or Prefect?
If not, why not? What are the alternatives?
Additionally, do you have any suggestions or recommendations for resources that touch on these topics (software design of ETL pipelines)?
Does your book discuss any of these questions/concerns?
Matt Palmer
I would research the “modern data stack” for best practices in data engineering.
This guide will feature some of those principles, but it’s written more for folks who are just learning about ETL.
There are newer tools that leverage best practices in software development to create robust, scalable solutions.
IMO, things like pure Airflow or Prefect do not accomplish that alone. Again, researching the "modern data stack" and new tools in the ETL space, reading things like r/dataengineering, and chatting in Slack groups are great places to start.
If you live somewhere with industry events or if you can chat with practitioners, that’s also an excellent source of knowledge
I would be wary of "data influencers," as many of them are paid or operating on some sort of agenda.
Just be skeptical, which it sounds like you already are
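As a concrete illustration of the layering the original question asks about, here is a minimal sketch of one way to split a pipeline into extract, transform, and load tasks with Prefect, so each stage can be tested and swapped independently. This is not from the book, and the data and function names are hypothetical.

```python
from prefect import flow, task


@task
def extract_orders() -> list[dict]:
    # Extraction layer: pull raw records from a source system.
    # (Hypothetical inline data stands in for an API or database call.)
    return [
        {"order_id": 1, "quantity": 2, "unit_price": 9.99},
        {"order_id": 2, "quantity": 1, "unit_price": 24.50},
    ]


@task
def transform_orders(raw: list[dict]) -> list[dict]:
    # Transformation layer: business logic lives here, isolated from I/O.
    return [
        {**row, "order_total": row["quantity"] * row["unit_price"]}
        for row in raw
    ]


@task
def load_orders(rows: list[dict]) -> None:
    # Load layer: write to the destination (printing stands in for a warehouse write).
    for row in rows:
        print(row)


@flow
def orders_pipeline() -> None:
    # The flow only wires the layers together, keeping coupling low.
    load_orders(transform_orders(extract_orders()))


if __name__ == "__main__":
    orders_pipeline()
```

Keeping the orchestration in the flow and the logic in small, single-purpose tasks is one way to avoid the "one large function" problem the question describes.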
Hi Matt Palmer, thanks a lot for doing this. Just curious as to whether the book covers how the nature of ETL (and ELT) has changed due to the rise of the cloud data platforms.
For example, since cloud DWHs provide more volume and speed at the cost of less rigorous data integrity measures, the gap has been filled to a large extent by frameworks such as dbt. Therefore data integrity is being handled more and more by these tools, rather than the data platforms themselves. And would you see this as a positive development?
Hi Ricky! The focus of the book is very much on the basics of ETL/ELT, but you bring up an interesting point.
The need for platforms like dbt has largely grown from the volume and velocity of data made possible by cloud data platforms.
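For context, frameworks like dbt let you declare integrity checks (unique, not_null, and so on) as tests alongside your models. The sketch below is only a rough Python illustration of what such a check verifies, not dbt syntax, and the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical result of a warehouse query; in practice this would come
# from the cloud data platform rather than an inline DataFrame.
orders = pd.DataFrame(
    {"order_id": [1, 2, 3], "customer_id": [10, 11, 12]}
)

# The kind of integrity checks a dbt test would declare: unique and not_null.
assert orders["order_id"].is_unique, "order_id should be unique"
assert orders["customer_id"].notna().all(), "customer_id should not contain nulls"
```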
Hi, what is the data destination of the data pipelines in this book? Azure Synapse Analytics or Databricks Delta Lake?
I think the book does not have a GitHub repo for code, but most of the technologies come from Databricks. So Databricks Delta Lake could be the answer to your question, at least when it comes to ETL or ELT.
Thanks for the reply :D
Does the book cover the theory behind ELT/ETL and use cases, or does it showcase examples using specific tools/platforms?
Book released Mar 2024? Oh, not sure my local library has it in yet…
What data engineering technologies does this book cover?
By technologies, I understand two things:
- Specific tools, services, etc., like Databricks, dbt, and GCP's Dataproc. If that's the question, then the answer is things that Databricks offers, like Delta Lake. (The cover kinda gives it away 🙂)
- Theoretical solutions to the problem of data engineering. If that's the question, then Matt's answer covers it.
Talking about "effective and innovative": for point 1, it really depends on what you are trying to do when selecting the right tool or service provider (my company is a Databricks customer, though we are being pitched by others too). For the theoretical side, there can be ETL, ELT, reverse-ETL, shm-ETL, but in the end, what you have to do from first principles will always involve:
- Extract the data into the system somehow.
- Transform it into something useful.
- Load the useful stuff out to where it can be useful.
which Matt covered in the book.
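To make that first-principles skeleton concrete, here is a minimal Python sketch of the three steps. The file and column names are hypothetical, and pandas stands in for whatever extraction and loading tools you actually use.

```python
import pandas as pd

# Hypothetical source and destination, just for illustration.
SOURCE_CSV = "raw_orders.csv"
DESTINATION_PARQUET = "clean_orders.parquet"


def extract() -> pd.DataFrame:
    # Extract the data into the system somehow.
    return pd.read_csv(SOURCE_CSV)


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform it into something useful.
    return raw.dropna(subset=["order_id"]).assign(
        order_total=lambda df: df["quantity"] * df["unit_price"]
    )


def load(clean: pd.DataFrame) -> None:
    # Load the useful stuff out to where it can be useful.
    clean.to_parquet(DESTINATION_PARQUET, index=False)


if __name__ == "__main__":
    load(transform(extract()))
```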
Does it take into account the most effective and innovative technologies?
Sorry if my question is not thoroughly thought through. In the book’s table of contents, terms such as Data Transformation Patterns, Data Update Patterns, and Design Patterns are listed.
Q: If you could provide bullet points naming those data transformation, data update, and design patterns, what would they be?
Thank you for the answer
EDIT: Got the book thanks to the link shared : ) and now I can see the names of the patterns
Hi again Matt Palmer, just wondering what thoughts you had (or what you put in the book) about change data capture, and how that’s evolved? It’s an old topic, but given the huge increase in variety and volume of data sources, I would imagine there are quite a few more approaches than those written about by (for example) Joe Caserta and Ralph Kimball ~20 years ago.
There are! And it seems like CDC is being supported out of the box by more and more tools.
Even tech like Delta Lake (and Iceberg, I'm sure) supports simple CDC.
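For reference, reading row-level changes out of a Delta table via the change data feed looks roughly like this in PySpark. The `events` table name is hypothetical, and the sketch assumes the table was created with `delta.enableChangeDataFeed = true` on a Spark session already configured with the Delta Lake extensions.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured for Delta Lake.
spark = SparkSession.builder.appName("delta-cdc-sketch").getOrCreate()

# Read all changes since version 0 of a (hypothetical) `events` table that was
# created with TBLPROPERTIES (delta.enableChangeDataFeed = true).
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("events")
)

# Each row carries _change_type (insert / update_preimage / update_postimage /
# delete) plus _commit_version and _commit_timestamp metadata columns.
changes.show()
```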