Questions and Answers
- Is there any tool comparable to dbt?
- Have you tested dbt vault? If yes, how well does dbt support it?
- Some databases are supported by dbt Labs, some by other companies, and some by the community. Can you recommend dbt for all databases, or would you say that with some one should avoid using dbt?
- Can you say anything regarding dbt certificates, in particular how complex they are?
- Since version 1.3, dbt supports Python alongside SQL. Do you have some examples where this is really useful?
Hello Shalltear, thank you very much for all these questions. So, starting from the beginning:
- Yes, there are tools comparable to dbt; one that I mentioned earlier is https://www.matillion.com. For sure each tool has its own specificities, but others like Mode Analytics, or even Ibis (also discussed earlier), could in a sense be used as an alternative to dbt - at least for some features;
- I've never tested AutomateDV (formerly dbt vault), but it is definitely supported by dbt. At a high level, if you think about Data Vault with hubs, links, and satellites, your models would be built under this paradigm. A good doc to read: Data Vault 2.0 with dbt Cloud;
- Good point. I've used dbt with BigQuery (for the book), but also Redshift and Databricks - those I'm comfortable with. I've also heard good things about using dbt with Postgres or SQL Server, but indeed each database has its particularities, so the suitability of dbt for a specific database depends on the use case and requirements. It's essential to evaluate compatibility and performance before deciding to use dbt with a specific database;
- The dbt certification is a new thing; it didn't exist when we started writing this book. From what I've heard it is not that complex, but it is quite broad, since it can cover all the features within dbt. Unfortunately I can't give you more details than that for now, but I know that with the comprehensive training provided by dbt (and our book 😅) you'll be one step ahead;
- In the book we focused only on SQL, but yes, dbt supports Python alongside SQL, allowing more advanced data transformations and manipulations. For example, you can use Python functions and libraries within dbt to perform complex calculations, data cleansing, or custom aggregations, providing greater flexibility in your data transformations (see the sketch below).
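For instance, a minimal sketch of a dbt Python model (assuming the Snowflake/Snowpark adapter; the model and column names are made up for illustration, not from the book):

```python
# models/customer_orders_enriched.py
# Illustrative dbt Python model. Assumes dbt-snowflake, where dbt.ref()
# returns a Snowpark DataFrame that can be converted to pandas.
import pandas as pd


def model(dbt, session):
    # Works like the {{ config(...) }} block in a SQL model
    dbt.config(materialized="table")

    # Pull an upstream dbt model into pandas
    orders = dbt.ref("stg_orders").to_pandas()

    # Python logic as an example: pandas makes rolling aggregations concise
    orders["ORDER_DATE"] = pd.to_datetime(orders["ORDER_DATE"])
    orders = orders.sort_values(["CUSTOMER_ID", "ORDER_DATE"])
    orders["ROLLING_AVG_AMOUNT"] = (
        orders.groupby("CUSTOMER_ID")["AMOUNT"]
        .transform(lambda s: s.rolling(3, min_periods=1).mean())
    )

    # The returned DataFrame is materialized as the model's table
    return orders
```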
Hope I could help with your questions!
Thank you
Shalltear I’d like to add a few points here:
Is there any tool comparable to dbt?
- Matillion is more of a fully-fledged ETL/ELT tool, comparable to Airbyte / Fivetran / Meltano - with integrations with dbt for handling the ‘Transformation’ part of the pipeline
- An alternative to dbt, for transformations only, is Dataform.
◦ It used to support the three major data warehouses (Snowflake, Redshift, BigQuery), but back in 2020 Dataform was acquired by Google. If you're working with GCP, you can use it in a pretty similar way to dbt Cloud: it has its own UI, schedulers, etc.
◦ Otherwise, you can use Dataform core to connect to Redshift or Snowflake, but it's like working with dbt-core: you'll have to deal with the CLI and schedule your runs on your own (e.g. Airflow/Mage)
◦ Also, instead of Python, you’ll have JavaScript support.
◦ But, honestly? Stick with dbt, your CV/resume will thank you
Have you tested dbt vault?
- Nope
Some databases are supported by dbt Labs, some by other companies, and some by the community
- The adapters for BigQuery, Snowflake, Databricks and Redshift are production-grade. I haven't worked with the others in production.
- I'd advise against using an OLTP DB such as Postgres for OLAP purposes. Just because you "can", doesn't mean you should. Transactional queries are meant to be extremely fast and handle high concurrency; analytical queries, OTOH, are the exact opposite
- If you're on AWS and don't wanna spend a sh*t ton of money on setting up a Redshift cluster that will be idle most of the time, go with Redshift Serverless; if you're on-premises (for God-knows-what-reason), try ClickHouse, but avoid the sacrilegious use of Postgres for OLAP at scale
Can you say anything regarding dbt certificates, in particular how complex they are?
I think you'll find this article useful: it tells how the author failed and then passed the exam, and what she did to accomplish it
Since version 1.3, dbt supports Python next to SQL. Do you have some examples where this is really useful?
- When you're dealing with dynamic union queries, you would have to write long and ugly stored procedures to handle them, OR you could use a Python function to keep it clean and tidy
- When your data lake is more like a data swamp, with schemas mixed up all over the place (for example, saving schemaless JSONs as strings just because back then we thought it'd be easier 🤦🏻), the SQL statements become so complex that they're illegible and hard to maintain, so using Python makes it much cleaner
- TL;DR: when the SQL is becoming too complex to read and to maintain (sketch of the dbt-native Jinja route below)
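For the dynamic-union case specifically, a Jinja loop in plain dbt SQL can also keep things tidy without stored procedures - a minimal sketch, with made-up source and table names:

```sql
-- models/orders_unioned.sql
-- Illustrative: unions several yearly raw tables without
-- hand-writing (or string-building) each SELECT.
{% set years = [2021, 2022, 2023] %}

{% for year in years %}
select
    *,
    {{ year }} as source_year
from {{ source('raw', 'orders_' ~ year) }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```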
Hi iobruno, thank you very much. Good points, as always. Appreciate it.
You're very welcome Shalltear
Another, similar to dbt, tool is sqlmesh: https://sqlmesh.com
Thank you everyone for the great sets of questions, the winners will be announced soon 🙌🏻!! It was a pleasure for us to join this book of the week, and we hope you find the book interesting!! Any other questions or feedback, ping me or Hélder Russa directly - we are staying in this amazing DataTalks Slack!
Thank you everyone! It was a pleasure. And yes, we will be around, definitely.
Thank you very much!
Congrats!!! 🙂
Rui Machado and Hélder Russa thanks for the effort you put into the book!
One of the main architecture patterns realized by dbt is moving away from stored procedures towards tables/views - this is great for design readability and lineage exploration. However, in some cases business logic is so complex that wiring it into a set of views can become a nightmare. So, a couple of questions:
- What is your personal take on the "modular" approach of dbt?
- Does your book provide any tips & tricks which would help with migrating from stored procedures or handling complex transformation logic?
Hello Valery. Indeed a good discussion topic.
The modular approach of dbt has its advantages in terms of design readability and lineage exploration. However, it's true that in some cases writing complex business logic as a set of views/tables can be challenging. I would say in these cases a balanced approach, considering the complexity of the business logic and the benefits of modularity, is essential to make effective use of dbt's capabilities.
We don't cover direct migration from stored procedures to views, yet we introduce and explain the benefits of modularity and how to manage it in particular scenarios.
Valery!! Great to have a question from you!! 🙌🏻 🙌🏻 🙌🏻
Thanks for the answers, Helder.
Rui Machado happy to be part of this conversation! 🙌
How does Dataform compare to dbt? Do you think the choice of JS instead of Python makes adoption easier due to its prominence among web and full-stack devs, or is the pandas crowd more likely to use these tools?
I find JS to be a weird choice, given how widely Python is viewed as the main language in the data space.
We had a tough time on our team when one developer wrote some JavaScript transforms and then left the team. We ended up having to re-write them because the rest of the team didn't want to spend the time learning enough JavaScript to debug it.
Hello again Toxicafunk. I would say there are some similarities in terms of features between dbt and Dataform.
Regarding JS and Python: I would say that JS indeed has broader adoption among software engineers, yet not that much among data practitioners. On the other hand, Python has broader adoption across all kinds of data practitioners and even among some software engineers - the Django framework, for example.
My first question: dbt is a nice tool. But I have also seen some disadvantages, like debugging sometimes not being easy, or using snapshots for an SCD table when you have to do a backfill and the dataset now contains the unique identifier more than once. From your perspective, could you provide a list of the greatest advantages and, even more important, a list of the greatest disadvantages of using dbt? I think this would help many people decide whether they should introduce dbt or not.
Hey Shalltear! Thank you for your question! Definitely a great way to start. I can give you a set of dbt’s pros and cons that we actually cover in our book. (This answer assumes you know what dbt is)
Some Pros:
- This is my favorite: It’s SQL-friendly. If you’re comfortable with SQL, dbt will be very natural to you.
- dbt encourages modular and reusable code, which can make your life easier when building complex data models (see the sketch after this list).
- It integrates with Git, so you can keep track of changes and roll back if something goes wrong.
- dbt automatically generates documentation, which is a huge time-saver and helps with data governance.
- It has built-in testing functions to ensure your transformations are accurate, which helps maintain data integrity.
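To make the modularity and testing points concrete, here's a minimal sketch (model and column names are made up, not from the book):

```sql
-- models/marts/fct_orders.sql
-- ref() wires this model to its upstream staging models, so dbt
-- builds the dependency graph and lineage for you.
select
    o.order_id,
    o.customer_id,
    o.order_date,
    p.amount
from {{ ref('stg_orders') }} o
join {{ ref('stg_payments') }} p
    on o.order_id = p.order_id
```

```yaml
# models/marts/schema.yml
# Built-in tests run with `dbt test` and guard data integrity.
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```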
Challenges of dbt (not going to say cons, as there are ways to overcome them):
- dbt doesn't handle the extract and load in ELT; you'll need other tools for that, such as Airflow, which means more moving parts to manage.
- When things don’t go as planned, debugging in dbt can be a bit of a headache, especially if you’re used to more visual debugging tools.
- For those not familiar with Jinja or advanced SQL, there might be a learning curve - Not very hard though, you will pick it up quite fast
- If you’re coming from a GUI-based tool, dbt’s code-centric approach might feel less intuitive.
thank you
I have three questions:
1- What are, in your opinion, the best tools and packages to use with dbt?
2- Does dbt have any concurrency?
3- What are going to be the next trends/changes in analytics engineering?
My opinions:
Packages
- dbt-utils (must have for tests/quality checks/freshness/etc)
- dbt-expectations (more advanced statistical testing)
- dbt-checkpoint (formerly pre-commit-dbt, excellent for coverage reports, automating linting rules, etc)
https://github.com/dbt-checkpoint/dbt-checkpoint
Concurrency
dbt generates a dependency graph and will run models concurrently that do not depend on one another.
If we had a parent model and two children:
A -> [B, C]
A would be run first, and then B and C could be run concurrently. There's flexibility with how much concurrency you use, and it depends a bit on the backend warehouse/db you're using (quick sketch below).
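In dbt terms that shape is just two models that both ref() the parent (names made up):

```sql
-- models/b.sql  (models/c.sql would look the same)
-- B and C both depend on A via ref(), giving the graph A -> [B, C],
-- so dbt runs A first, then B and C in parallel.
select *
from {{ ref('a') }}
```

The degree of parallelism comes from the `threads` setting in `profiles.yml`, or can be overridden per run with `dbt run --threads 8`.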
Trends
I think a number of people are saying LLMs will be the future of modelling. I’m skeptical of this claim because of the difficulty of mapping business processes to code but that’s just my 2 cents.
Hello Luis!
Paul pretty much answered the questions. On one side, dbt-utils and dbt-expectations (which is inspired by the Great Expectations data quality tool) are potentially the most used and useful dbt packages
On the other side, yes, dbt allows concurrency at the model execution level. For that you need to define the execution threads, saying how many models can be parallelized. Yet, important to know, and as Paul mentioned, this also needs to respect the dependency graph. If you have three models (A, B and C), and B and C depend on A, then even if you increase the number of threads it will always execute A first and then B and C in parallel.
I think one of the main challenges the analytics engineering field might face is related to the boundaries of the work, where there might be some gray area between what a data engineer does and what a data analyst or even a scientist does. It is a new field, and I believe that in the future we will see more clarity on this.
Next question: Comparing dbt core with dbt Cloud - is it worth using dbt Cloud?
Choosing between dbt Core and dbt Cloud depends on your team’s technical skills, resource availability, and project requirements. dbt Core suits teams that prefer managing infrastructure and workflows via command-line, offering a robust, open-source option without added costs. On the other hand, dbt Cloud provides a managed, user-friendly platform with collaborative features, scheduling, and built-in CI/CD, streamlining operations at a monthly cost. For teams prioritizing ease and reduced maintenance, dbt Cloud is ideal, while budget-conscious or technically proficient teams may lean towards dbt Core.
thank you
Next question: Can you, or does your book, provide a list of dos and don'ts and something like "10 best tips and tricks"?
Not the author but speaking from my own experience.
Some mistakes we made.
Incorrectly configuring incremental models, so all of your tables were doing full refreshes for a full year (see the sketch after this list).
Setting the test config block to limit test failures to 100 and only throw an error with > 100 test failures. Partly due to how the failures get saved to a table and you specify the max row count, but definitely our mistake! 🤣
So many warnings were ignored that should have been failures!
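For reference, a minimal incremental-model sketch (names illustrative); the `is_incremental()` guard is what prevents the accidental full refreshes:

```sql
-- models/events_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    user_id,
    event_type,
    created_at
from {{ source('raw', 'events') }}

{% if is_incremental() %}
-- Only process rows newer than what's already in the target table;
-- without this guard every run rebuilds everything.
where created_at > (select max(created_at) from {{ this }})
{% endif %}
```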
As for dos:
Use the great functionality dbt has for data quality, checking freshness and generating/persisting docs (for example, the configs below).
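Both are plain YAML config; a sketch with made-up model/source names (the `severity`, `warn_if`/`error_if` and `freshness` keys are dbt's own):

```yaml
# models/schema.yml - fail loudly instead of warning forever
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique:
              config:
                severity: error
                warn_if: ">0"
                error_if: ">10"
```

```yaml
# models/sources.yml - checked by `dbt source freshness`
version: 2
sources:
  - name: raw
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: events
```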
In the book we do cover a few tips and recommendations. Although we don't really have a "top 10" chapter, across the book you will find many best practices 🙂 Not just on dbt, but also on data modelling, SQL, Jinja, among others
thanks
Hi, I want to ask: how do you set up dbt core for a production env workflow?
Not the author, but I've used Airflow for orchestration by packaging each project separately into a Docker container to be triggered via Airflow.
It works nicely and allows for versioning different projects, dbt versions, dbt dependencies, etc. (rough sketch below).
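A rough sketch of that pattern (assumes Airflow 2.4+ with the apache-airflow-providers-docker package installed; the image name and dbt command are made up):

```python
# dags/dbt_daily.py
# Each dbt project is baked into its own image, pinning its own dbt
# version and packages; Airflow just runs the container on a schedule.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="dbt_daily",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_build = DockerOperator(
        task_id="dbt_build",
        image="my-registry/analytics-dbt:latest",  # hypothetical image
        command="dbt build --target prod",
    )
```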
Hey Low! Thank you for your question. The answer is not that easy, as it depends on the availability, security, scalability and other factors you might want to take into consideration. However, I will build on top of Paul and say that, in fact, when setting up dbt Core for production, incorporating Airflow and Kubernetes elevates data operations. Airflow orchestrates dbt tasks, managing dependencies and precise scheduling, providing a control tower for data workflows. Kubernetes handles the containerisation of each project, functioning as a self-adjusting engine for efficient performance.
thank you so much for both of your responses to my questions 😄
I have a few questions and hope you can help me with them,
- Can you please provide insights into the specific sections or chapters of the book that would be most beneficial for someone with a foundational understanding of DBT and SQL (Snowflake), and how DBT assists in ensuring data quality and reliability?
- In your opinion, what are the key takeaways or advanced techniques that readers with basic knowledge of DBT and SQL (Snowflake) can expect to gain from reading this book?
- How does the book cater to readers like me who want to deepen their knowledge and proficiency in DBT and SQL (Snowflake) while focusing on practical implementation?
Hey Sumith, thank you for your questions!! The book overall is quite accessible, but I would highlight our "Data Modelling with SQL and dbt" chapter as the main one to take into consideration in your case.
I would say readers will gain enough understanding of analytics engineering, the importance of data modelling, and overall SQL and dbt, which in my opinion will help analysts and engineers boost their data skills.
Although we go deep in some chapters, our intention was always to set the direction when it comes to understanding what is worth deep diving into. Take it as a comprehensive breadth-first book; then, for deep dives, there are other books that will complement it.
Thank you Rui Machado for sharing so much useful information.
Question: I tried a simple tutorial. But my question is: where exactly in my DE day-to-day job can I use DBT? Where are experienced DE folks using it, and how?
For analytics or general data quality or transformation, I get it. But is there anything a DE can leverage from it?
Hey Waqas, I would say that dbt is great for making data transformations more efficient, automated and reusable. Imagine taking all your ETL package transformations into a SQL-friendly environment full of packages that understand your needs (creating surrogate keys, testing data, documenting data, seeing lineage, among others - quick example below).
It also helps you work better with analysts because you can set up the core transformations and they can build on top of them to gain insights. dbt also integrates well with version control and CI/CD workflows, which is a big win for reliability. Plus, dbt is a trusted ally when it comes to ensuring the accuracy of your data through testing and automatically keeping your documentation up to date.
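As a quick example of those package niceties, generating a surrogate key with dbt-utils (model and column names made up):

```sql
-- models/dim_customer.sql
-- dbt_utils.generate_surrogate_key hashes the listed columns into a
-- deterministic key, replacing hand-rolled md5/concat logic.
select
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'source_system']) }}
        as customer_key,
    customer_id,
    source_system,
    customer_name
from {{ ref('stg_customers') }}
```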
Hi there. I have a few questions.
- There are some who say analytics engineering is a passing trend that will be phased out sooner or later as a sub-field or field in data. I'd love to know your thoughts on that if possible, thank you.
- Secondly, what in your opinion are the emerging trends in analytics engineering at the moment that are of importance to note for someone in this field looking to boost their career?
- Thirdly and finally, what general tips would you give someone who’s about to go into an analytics engineering interview?
Thank you for your time.
Hey Josh! Great one.
I believe analytics engineering is bigger than what we call it 🙂 For example, when I started working with data we had roles such as data miner and business intelligence developer, among others. My personal view is that at the moment companies are facing the need to have people who understand data in such a way that they can create semantic models on top of it to facilitate insight creation. Analytics engineers are in fact data lovers that live in between data engineers and scientists. With this in mind, the role might evolve and its name might change, but the need for this kind of responsibilities will be around for a long time.
My piece of advice would be: invest in owning the basics - data modelling, the main transformation patterns, and how to ensure data reliability and understanding. dbt can help there, but it's only a tool 🙂
Thank you so much for the response Rui Machado 🙏🏾
Hi there, I got a question as somebody who has been working on his first project with dbt, Snowflake and Airflow for the last year.
Along the way I've seen how things are done the dbt way.
Sometimes, my team intuitively solved things in a way which is not the dbt-native solution.
Do you show any ways of thinking the dbt way when facing various problems?
Great point Aviv!! Embracing the dbt way means thinking in modular SQL models, writing tests alongside transformations, and documenting your work for clarity and collaboration. It’s about trusting the process, which means structuring your work so it’s maintainable, understandable, and ready for automation.
Thanks! I was also talking about the usage of the different materialisation types for our models :)
Hi,
Thanks for taking time to answer.
My question:
Schematically, I believe this is one question, hence putting them together.
Can any organization leverage DBT? Who is DBT for vs not for? Things to know before deciding to use DBT? And common pitfalls for developers?
Great question Arun!! I believe dbt can be beneficial for any organization, but a bit more so for those with SQL-savvy teams that want to automate data transformations and adopt technical best practices. However, for organizations that rely heavily on ETL processes or don't have strong SQL skills, dbt might require a certain learning curve. Before getting involved with dbt, you should learn how to treat data like code, which can be a challenge by itself.
“Data like code”! Interesting! Thank you. Will explore more on this.
What functionality do you expect to be added/would like to be added to dbt in the future?
Hello Tim, two great features ahead (or that are getting some refinements) are the dbt Semantic Layer and referencing models across projects - this last one will enable some new data architectures such as data mesh (snippet below).
Nonetheless, the dbt Semantic Layer has been around for a while; it now leverages MetricFlow, where you can define and manage your company's metrics.
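For the cross-project piece, the syntax is a two-argument ref - it already works for packages in dbt core, and dbt Cloud extends it to cross-project (mesh) references; the project and model names here are made up:

```sql
-- models/marts/combined_revenue.sql
-- Reference a model owned by another project/package ("finance")
select *
from {{ ref('finance', 'fct_revenue') }}
```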
:thank_you: actually looking into implementing semantic layer in my org
all the best with the book release!
Thank you very much!
Rui Machado and Hélder Russa, thank you so much. I have been following analytics engineering trends closely. Just a few questions:
1. I am a data analyst with experience in SQL. Do you think I need to learn data engineering first before I learn DBT, to be ready for an analytics engineering career?
2. Besides this book, what other resources can you share for someone interested in an analytics engineering career?
3. The future of analytics engineering and AI? How will LLM models assist the future analytics engineer?
Love this one Polite!! Great set of questions.
Starting with 1) With your SQL experience, you’re already on the right path. Also, dbt can be a great entry point to learn data engineering concepts while focusing on analytics. It’s designed to bridge the gap between data analysis and engineering, making it a powerful tool for analysts. You can gradually dive deeper into data engineering as you go along.
2) I would say the dbt community and blog itself but there are also great courses already available. Plus I always default to Kimball and Inmon for anyone to learn more about data modelling which is always helpful.
3) When it comes to AI, always remember: garbage in, garbage out. With the increasing reliance on quality data for AI, analytics engineers will play a pivotal role in designing efficient data pipelines and ensuring data quality - of course, often overlapping with data engineers' and scientists' responsibilities.
What is your take on NoSQL? Especially for analytics or transformation. Is it best practice to always try SQL modeling first, because most of the time it makes more sense?
Thank you Sven, as a data modelling fanatic I love your question. Personally, I believe the choice between NoSQL and SQL for analytics and transformation depends on the specific project requirements. It's a best practice in my opinion to start with SQL modeling for its maturity and transactional integrity, but consider NoSQL for scalability and unstructured data when needed. Nowadays, with certain file formats and databases, you can benefit from both worlds (we briefly touch on that in the book).
Thanks! I also dug deeper into scalability for NoSQL, but I came to the conclusion that most of the time it won't be a problem. And as you mentioned, now you have DBs that combine the best of both worlds.
I was just wondering how to start, but I think you have the same opinion: go with SQL until you run into problems? 😀
Hi,
Thanks for being here.
- What are the best practices for using DBT in Analytics Engineering?
- Does your book cover lesser known DBT topics such as seeding, snapshotting and debugging?
- What are the prerequisites for reading this book?
Hey Toxicafunk!
- Ehehe, a great question to find the answer to in our book!! There are a few in fact; I would highlight modularisation, data modelling and testing properly
- Yes we do 🙂 We tried to go deep in coverage of dbt features
- Not much, if you know a bit of SQL you should be fine 🙂 We tried to make the book very accessible and easy to follow.
Hello
- Does the book cover dimensional modeling with dbt?
- Besides running quality tests and documentation, what are the biggest use cases for using dbt?
- At which size of team/company/tables using dbt become necessary?
Thanks
Hello Koffi!
- Yes it does, we show how you can implement star schemas / snowflake schemas and data vault with dbt.
- I can add what I wrote before:
a. Some Pros:
▪︎ This is my favorite: It’s SQL-friendly. If you’re comfortable with SQL, dbt will be very natural to you.
▪︎ dbt encourages modular and reusable code, which can make your life easier when building complex data models.
▪︎ It integrates with Git, so you can keep track of changes and roll back if something goes wrong.
▪︎ dbt automatically generates documentation, which is a huge time-saver and helps with data governance.
▪︎ It has built-in testing functions to ensure your transformations are accurate, which helps maintain data integrity.
b. Challenges of dbt (not going to say cons, as there are ways to overcome them):
i. dbt doesn’t handle the extract and load in ELT; you’ll need other tools for that, such as Airflow which means more moving parts to manage.
ii. When things don’t go as planned, debugging in dbt can be a bit of a headache, especially if you’re used to more visual debugging tools.
iii. For those not familiar with Jinja or advanced SQL, there might be a learning curve - Not very hard though, you will pick it up quite fast
iv. If you're coming from a GUI-based tool, dbt's code-centric approach might feel less intuitive.
- Any size 🙂 As long as you handle transformations, dbt can be useful
Thanks
Hello Rui Machado and Hélder Russa thank you for your support.
These are the following things I want to know:
- What are the prerequisites to explore and read the book? I am very excited.
- I know data pipelines using Airflow and dbt; I have worked on data validation and testing, customized validation, etc. I want to learn to explore dbt outside of Airflow data pipelines, like analytics engineering, data validation automation, etc. Will it be helpful to explore the out-of-the-box modeling?
Hi Bilal, thank you for your questions. Not much - if you know a bit of SQL you should be fine 🙂 We tried to make the book very accessible and easy to follow. Yes!! Modelling is always useful; it will allow you to create semantic layers that are fit for purpose (multiple purposes, actually).
So far, great set of questions everyone 🙂
Cheers, everyone. The title “Analytics Engineering with SQL and DBT” made me come here : ) Congratulations on the release of the title!
I know the book focuses on dbt as the primary tool for data transformation.
However, do you know if there are other tools available as an alternative to dbt?
Also, what are your thoughts on Ibis and its potential as an alternative to dbt? Do you see Ibis as something worth trying, or do you believe that these tools have different strengths that don't complement each other? (my use case is an ETL pipeline in Python)
Hello William, thank you for your questions. While writing the book I came across Matillion. Never tried it tbh, but it might be good to take a look as well.
Now concerning Ibis, I just took a look to get familiarized. I find its native compatibility with Spark interesting - that's already a huge plus. Yet at its core it can be just a matter of choice between a more SQL-like syntax or a pandas-like syntax. However, if you are looking for a bigger ecosystem and feature set - testing, documentation, or even some flavor of governance (like data lineage) - dbt might be more interesting.
Regarding your use case, if you are already in Python, and considering time to market, just test Ibis; it shouldn't be that hard to set up locally compared with dbt core, where you would also have to re-do all your code.
In a scenario where a company already has a set of defined routines for data transformation.
I would like to know if your book covers these topics:
- What are the specific benefits of moving to dbt in this scenario?
- How can companies evaluate whether or not moving to dbt is a worthwhile investment?
- What are some potential challenges that companies may face when migrating to dbt?
Hey again William, we don't have a dedicated chapter that covers the benefits or challenges of dbt adoption within a company; it is sort of spread through the book. This type of decision also requires understanding the company's reality. If there is already a transformation tool and you want to move to dbt, I would say to evaluate whether the current one already supports features like:
- Collaborative development and version control;
- Scalability;
- Data documentation, and testing in parallel with that documentation;
- Ease of developing DRY code (see the macro sketch below);
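As an illustration of the DRY point, a tiny dbt macro (names made up) that replaces copy-pasted SQL across models:

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

Any model can then call `{{ cents_to_dollars('amount_cents') }}`, and a change to the logic happens in exactly one place.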
Along with that, it will also be important to do a cost-benefit analysis of the time you will gain (or not) by adopting dbt - not only the potential time gained in development, but also the native data observability capabilities that dbt provides, which the whole company can benefit from.
And lastly, the main challenges, for sure. dbt isn't a complex tool to learn, but it always requires a learning curve. Plus, it is also important to understand the current systems/databases inside the company, as well as to ensure existing data formats are compatible with dbt.
First question: Does the book focus on dbt core or dbt Cloud? I have experience using both, but in the case of dbt core I'm having trouble setting up and configuring a scheduler, which becomes very important when you want to create a specific materialization. It would be great if you could provide some guidance or point us to some resources. 😀
Second question: What is your opinion about the Modern Data Stack?
Thank you!
Hello Eduardo! Interesting questions. The book uses dbt Cloud to present dbt. We decided to go for a more pragmatic approach, since our goal was to show the full extent of dbt and its features, and with dbt Cloud we were able to demonstrate that in a simpler way. Nevertheless, a good amount of what is done inside dbt Cloud and presented in the book uses the dbt CLI, which in the backend runs on top of dbt core.
Using only dbt core, you can use Airflow as a scheduler. It will certainly work.
The second question, wow, it would be a whole new book 😅. Overall, exciting times, yet more complex and dynamic. I remember back in 2013, my "modern data stack" was SSIS, SSAS and SSRS from Microsoft. We had clearer boundaries between tools and their purposes. Now times are different: we have way more requirements related to data security, governance, processing, scalability, etc. Other tools, better prepared and leveraging the cloud, have emerged. I love that, but it makes the choice a bit more complex. As an example, if you look into the AWS data stack, you have some tool overlap - Glue vs EMR is a good example, which makes the choice a bit more challenging. But from another point of view, we have more offer and competition, which pushes data tool developers to keep leveling up their products and shipping more cool features. dbt is a good example of that, with the semantic layer for instance.
Thank you so much Hélder Russa