Questions and Answers
Hi Manoj Kukreja. Congratulations on your book!
Does this book also cover in detail how data lakes differ from data warehouses, and the use cases for either one or both of them?
Hi Isaac Toluwani. For many years, data warehouses were the de-facto standard for OLAP use cases, until the challenges around the volume, variety, and velocity of data took over. Over the last few years, we have been in an era where analytics has moved over to data lakes. But this has created a unique challenge, which I refer to as “the power struggle” in my book. Moving from data warehouses to data lakes has forced us to sacrifice a few good things, such as ACID compliance, indexing, and caching. And that is precisely why the modern lakehouse architecture is taking over. My book not only promotes the new architecture but also explains how it differs from other architectures like Lambda and Kappa.
Nice.. Thanks
Hi Manoj Kukreja. The book description says that the book goes into detail about the Microsoft Azure cloud for data engineering purposes. What about people who are not using that cloud service, but instead use other ones like GCP or AWS? Could the examples and code in this book also be implemented on other cloud service providers? Thank you, and congrats on your book.
Hi Tony Gunawan. All cloud vendors perform data processing using open-source frameworks like Spark, Hadoop, and Kafka, just packaged into services under different names. My book tries to teach readers big data concepts rather than enforce a particular technology or cloud service. Most of the examples, particularly the Databricks Spark ones, can be run on the cloud platform of your choice.
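To illustrate that portability, here is a minimal PySpark sketch; the storage account and path are hypothetical, and only the URI scheme (abfss:// on Azure, s3a:// on AWS, gs:// on GCP) would change between clouds.

```python
# A minimal sketch, assuming a SparkSession is available (e.g., on Databricks).
# The storage path below is hypothetical; swap the URI scheme for your cloud,
# and the Spark code itself stays the same.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("portable-example").getOrCreate()

raw_path = "abfss://bronze@mystorageaccount.dfs.core.windows.net/sales/"  # hypothetical

sales_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

sales_df.groupBy("country").count().show()
```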
Cool. Thanks for answering, Manoj Kukreja
Hi Manoj Kukreja, congratulations on your book. From the description, your book would provide steps to deploy data pipelines in a repeatable and continuous way. May I infer that it would cover DataOps as well? Would it provide the steps to implement DataOps using Azure? Would it also cover similar steps in a cloud-agnostic way?
Hi Anand. Yes, these topics are covered in great detail towards the end of my book, in Chapter 11, “Infrastructure Provisioning,” and Chapter 12, “Continuous Integration and Deployment (CI/CD) of Data Pipelines.” Since this book is centered around Azure, I have used services like ARM templates and Azure DevOps. Having said that, these services use the same deployment principles as any other cloud-agnostic tool such as Terraform or Ansible.
Hello Manoj Kukreja 👋🏻
Asking some questions that I am personally struggling with at this point in time. Hope to get your viewpoints on them.
- Do you consider infra first vs. use case first as sort of a chicken-and-egg problem? Should organizations hack together some models and define some business use cases, and then move on and create a DE-backed infrastructure to support them, or should the first and foremost thing be to create an infrastructure of consistent and reliable data and then work on data science use cases?
- With more and more push towards data compliance, what role do you think DEs or solution architects should play to ensure that data availability does not become a bottleneck for data scientists?
I might have some more questions which I’ll try to get your opinions on. Hope that’s alright. Would love to hear other people’s opinions as well.
Hi A
Answer to Q1
Back in 2011, my team and I were among the early adopters of big data. In my experience, starting a data science practice without a proper data engineering back-end is a mistake that several organizations have made in the past and, unfortunately, have paid dearly for. Data scientists are not the best data engineers, yet they are forced to perform that work in its absence; in many cases, they don’t get proper time to do the work they are originally supposed to be doing. In my opinion, these days it is mandatory to have a solid DE back-end that enforces proper governance, master data management, and data-sharing techniques like data mesh.
Answer to Q2
Data engineers play a huge role in ensuring data compliance. Newer regulations such as GDPR and CCPA are very strict about enforcement. Once again, having a DE practice that takes care of data security, standardization, quality, and cataloging is becoming a necessity for any organization that dreams of adopting effective data science principles. Instead of being a bottleneck, it ends up making the job of the data scientist easier. It lets them focus on what they do best.
Hello Manoj Kukreja. Thanks for the opportunity to get your amazing book for free. I read the topics, and I understand that you propose the Delta Lake architecture.
- Which data modeling technique is the most accurate for this kind of architecture? For example: Data Vault 2.0, canonical tables, Kimball, OBT, a mix of techniques, etc.
- In which layer of the architecture would you apply this technique? Do you consider it necessary to order and model data starting in the bronze layer?
Thanks for sharing the knowledge and experience. Regards
juan manuel franzante I don’t think it is advisable to rank the modeling techniques based on accuracy. Overall, I can say that use of the Kimball model is generally preferred and widespread. The modeling techniques are usually applied at the gold layer (see the sketch after the list below). However, in some cases, particularly those related to denormalization, you may end up implementing them at the silver layer as well. The bronze layer represents the state of the data in the shape or form in which it was delivered or ingested from the sources. There are many reasons why you should not model data in the bronze layer of a lakehouse:
- Having the exact state of data is important for auditing
- You may need to replay data in the future, in which case you will need the pre-existing state
- The format of data in bronze is usually a mixture, so applying a particular modeling technique is often not even technically possible
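As a rough illustration of modeling at the gold layer, here is a minimal PySpark sketch that derives a Kimball-style dimension and fact table from cleansed silver tables. The table and column names (silver.orders, silver.customers, gold.dim_customer, gold.fact_sales) are hypothetical, and an active SparkSession is assumed.

```python
# A minimal sketch, assuming an active SparkSession and cleansed silver tables.
# All table and column names below are illustrative, not from the book.
from pyspark.sql import functions as F

orders = spark.table("silver.orders")
customers = spark.table("silver.customers")

# Dimension: one row per customer with conformed attributes
dim_customer = customers.select("customer_id", "name", "country", "segment")
dim_customer.write.format("delta").mode("overwrite").saveAsTable("gold.dim_customer")

# Fact: measures at order grain, keyed to the customer dimension
fact_sales = (
    orders
    .join(dim_customer.select("customer_id"), "customer_id")
    .select(
        "order_id",
        "customer_id",
        F.col("order_ts").cast("date").alias("order_date"),
        "quantity",
        "amount",
    )
)
fact_sales.write.format("delta").mode("overwrite").saveAsTable("gold.fact_sales")
```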
But in my opinion, Kimball is based on an older paradigm, from when storage and compute were expensive. I think it’s important to have a modeling technique that offers more advantages for the current cloud paradigm, like Data Vault. I totally agree with applying these in the silver or gold layer. Thanks for answering!
I do agree that it’s the oldest paradigm, and that is precisely where good practice differs from reality. Most times on projects you are forced to do things a certain way based on what the customer is comfortable supporting and/or has the skills for.
Hi Manoj Kukreja, thanks for this really interesting book and for the opportunity to ask questions.
- What is a delta lake?
- In your book, do you cover an end-to-end project?
- Do you cover streaming and batch processing of data?
- Why did you choose Spark and Azure?
Tim Becker What is a delta lake?
Delta Lake is a framework that works on top of Spark to provide some useful features to a data lake, such as ACID transactions, time travel, schema evolution, etc.
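For a flavor of those features, here is a minimal sketch; it assumes an active SparkSession with the Delta Lake (delta-spark) package installed, and the path is hypothetical.

```python
# A minimal sketch of the Delta Lake features mentioned above; the path is hypothetical.
from pyspark.sql import Row

path = "/tmp/delta/customers"  # hypothetical location

# ACID writes: each commit is atomic, so readers never see a half-written table
spark.createDataFrame([Row(id=1, name="Alice")]).write.format("delta").mode("overwrite").save(path)
spark.createDataFrame([Row(id=2, name="Bob")]).write.format("delta").mode("append").save(path)

# Time travel: read the table as it looked at an earlier version
spark.read.format("delta").option("versionAsOf", 0).load(path).show()

# Schema evolution: append a batch that carries an extra column
spark.createDataFrame([Row(id=3, name="Carol", country="CA")]) \
    .write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```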
In your book, do you cover an end-to-end project?
My book covers an end-to-end project for an online electronics retailer that wants to streamline its inventory, shipping, finance, and marketing operations using analytics.
Do you cover streaming and batch processing of data?
The project covers both streaming and batch operations.
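As a rough illustration of how the two modes relate, here is a minimal sketch; it assumes an active SparkSession with Delta Lake installed, and all paths are hypothetical.

```python
# A minimal sketch contrasting batch and streaming reads of the same source.
# All paths are hypothetical; an active SparkSession with Delta Lake is assumed.
events_path = "/data/bronze/events"

# Batch: process everything that has already landed
batch_df = spark.read.format("delta").load(events_path)
batch_df.groupBy("event_type").count().show()

# Streaming: pick up new data incrementally as it arrives
query = (
    spark.readStream.format("delta").load(events_path)
    .filter("event_type = 'purchase'")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/data/checkpoints/purchases")
    .start("/data/silver/purchases")
)
```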
Why did you choose Spark and Azure?
Azure is one of the most prevalent cloud platforms for big data storage and computing operations. Similarly, Spark is the most widely used distributed compute platform. Serious big data operations using Spark can most effectively be supported using highly scalable platforms like Azure.
How do you implement monitoring in practice?
In the book, I have pretty much relied on Azure Monitor. However, I would recommend using Datadog for enterprise monitoring.
What kinds of tools do you use to automate it, and what are the best practices and common mistakes?
Azure DevOps and ARM templates
I am currently implementing monitoring for ML models and I believe there could be a lot of similarities. Do you have any advice?
I have often used Prometheus (Kubernetes service monitoring) to collect metrics from endpoints.
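As a rough sketch of that pattern applied to model monitoring, the snippet below exposes a couple of metrics for Prometheus to scrape using the prometheus_client Python package; the metric names and the drift value are purely illustrative.

```python
# A minimal sketch: expose model-serving metrics on an HTTP endpoint that
# Prometheus can scrape. Metric names and values are illustrative only.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
DRIFT_SCORE = Gauge("model_feature_drift", "Latest feature-drift score")

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://<host>:8000/metrics
    while True:
        PREDICTIONS.inc()                  # would be called per prediction
        DRIFT_SCORE.set(random.random())   # placeholder for a real drift calculation
        time.sleep(5)
```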
thanks a lot 🙂 I will look into it
Hello Manoj Kukreja, Congrats on your book!
Depending on the size of a company/project (and the types of data to be aggregated), a data or delta lake seems to possibly introduce a lot more complexity than needed, especially when considering very small use cases. Do you give recommendations in the book on when to extend from, e.g., a DWH to a data lake? Or would you always start with a data lake(house) architecture to take advantage of the scalability?
Thanks, Philip Dießner. You have a valid point. The problem is not visible, and can be easily resolved, if the data is small. Overall, it is not about whether Delta Lake introduces complexity or not; it’s about whether we can survive without it. In a normal DWH/database environment, atomic updates to row data are easily possible. Data lakes are file/object stores, and there is no concept of atomicity. This means every change (CDC) is delivered to you as a duplicate row. In the days before Delta Lake, whenever we got CDC we had to run compute-intensive operations (sometimes lasting hours, depending on the size of the data) to deduplicate the data. All of that disappeared when Delta Lake was introduced. In many respects, it has proven to be a lifesaver for data engineers.
There are legit cases where the complexities of a data lake are unwanted, but those cases typically have a limited volume and variety, and almost non-existent velocity, of data.
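To make the CDC point above concrete, here is a minimal sketch of applying a change batch as an atomic upsert with Delta Lake’s MERGE; the paths, table layout, and key column are hypothetical, and an active SparkSession with the delta-spark package is assumed.

```python
# A minimal sketch: apply a CDC batch as a single atomic MERGE instead of a
# full deduplication pass. Paths and the key column are hypothetical.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/data/silver/customers")
cdc_batch = spark.read.format("delta").load("/data/bronze/customers_cdc")

(
    target.alias("t")
    .merge(cdc_batch.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```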
Thanks for your answer and sharing your experience.
Thanks everyone for the great questions, please don’t hesitate to ask more in the future.