DataTalks.Club

Fundamentals of Data Engineering

by Joe Reis, Matthew Housley

The book of the week from 15 Aug 2022 to 19 Aug 2022

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you’ll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You’ll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.

This book will help you:

  • Get a concise overview of the entire data engineering landscape
  • Assess data engineering problems using an end-to-end data framework of best practices
  • Cut through marketing hype when choosing data technologies, architecture, and processes
  • Use the data engineering lifecycle to design and build a robust architecture
  • Incorporate data governance and security across the data engineering lifecycle

Questions and Answers

Jasmin Classen

Hi Joseph Reis and @matthew housley, thanks so much for answering questions here. I’d be curious about the following things:

  1. How would you describe the target audience of this book? (e.g. Software Engineers as well as Data Scientists, beginner experience with Data Engineering vs. more advanced)
  2. I’m very interested in the „cut through the marketing hype“ part of the book. How would you sum up this section exactly, how do you yourselves decide which technologies would be worth trying in an enterprise context?
  3. Does this book touch upon Data Engineering in context of providing data for Data Science projects?
    Thank you again and have a nice day!
Joseph Reis

Hey Jasmin,

  1. Another way to answer this question is through learning outcomes - the reader will come away with a good understanding of the foundations of data engineering, which they can apply in their respective disciplines (SWE, DS/DA, data engineering, etc)
  2. To determine the best technologies worth trying in an enterprise context, please read Chapter 4 🙂
  3. Yep, it provides a holistic context for providing data for DS projects. This is covered throughout the book, via the Data Engineering Lifecycle
Matthew Housley

Jasmin Classen One thing I’d add to Joe’s response to 2 above: study classes of technologies rather than single technologies. For example, instead of just looking at Apache Kafka, try to understand streaming event platforms as a class (Kafka, Pulsar, Kinesis, Pub/Sub).

Daniel

Hi Joseph Reis and Matthew Housley,
thanks for doing this. I’d be interested in …

  • How do you stay up-to-date with the breadth and depth of techniques that are available/upcoming in data science?
  • From your experience: What are frequent mistakes when developing robust data systems?
  • Any cloud provider you prefer/ would recommend (if you had to choose a single one)?
    Many thanks in advance!
    Cheers
    Daniel
Joseph Reis

Your first question - Staying up to date is tricky, as there’s so much going on. I personally read a ton of newsletters, articles, and whitepapers on a weekly basis. For example, on most weekends, I’ve got 50+ articles/papers queued on my iPad to read. In general, I try to focus 80% of my time on areas in my peripheral area of expertise, with another 20% on outliers or stuff that might impact the field I work in.

Matthew Housley

My follow-on to what Joe said about staying up to date is related to my response to Jasmin Classen’s question about technologies. Try to connect the dots between different technology developments. Try to understand common engineering threads between current technology announcements and other technologies that you are familiar with.

Matthew Housley

Regarding frequent mistakes when developing robust data systems:
We frequently see companies “rolling their own” stacks, when they could rely more on off the shelf solutions. (Note that off the shelf includes open source.) This leads to excessive complexity, maintenance overhead, massive tech debt and slow ongoing delivery.

Matthew Housley

Small startups will often point to Google or Facebook as examples of why you should build your own technology from the ground up, but this ignores the fact that many off the shelf technologies didn’t exist when these companies were trying to solve problems.

Joseph Reis

re: cloud providers - they’re all great 😉

Matthew Housley

Build a custom solution when you discover a problem that is not solved by something already in the ecosystem.

Joseph Reis

> Small startups will often point to Google or Facebook as examples of why you should build your own technology from the ground up, but this ignores the fact that many off the shelf technologies didn’t exist when these companies were trying to solve problems.

Joseph Reis

We sometimes call this “cargo-cult data engineering”

Cesar Garcia

Is that the reason for the proliferation of endless orchestrators in the ecosystem?

Matthew Housley

Haha, that’s part of it.

Matthew Housley

We’ve seen several startups create their own orchestrators. (I’m even aware of an orchestrator written in Clojure…)

Matthew Housley

Although I think that orchestration is so central to data engineering that there are legitimately many different visions. Unfortunately, there’s only so much mindshare available, so the small players end up on the sidelines.

Roman Zabolotin

Hey Joseph Reis and @Matt Housley,
Thank you for sharing your ideas with us in your book.
Can you give a piece of advice to those who are on the data engineering path?
What practice project would be best to advance in this path? Can you give us some sketch examples?

Joseph Reis

There’s no single right answer to this question, but the approach I’ve seen fail over and over is doing projects that you think will land you a job. It might, but what did you learn from the project?
The best practice project is the one you’ll be interested in working on after the “homework” is over. Meaning, pick something that truly interests you.

Matthew Housley

Also, decide what kind of job you’re looking for. Are you trying to work for a Silicon Valley giant? A rapidly growing startup? A large enterprise?

Matthew Housley

For the latter two, focus on cloud skills, both data specific ones and general cloud infrastructure skills. Consider working on cloud certifications to stand out.

Ricky McMaster

Hi Joseph Reis and Matthew Housley thanks so much for doing this!
In your recent appearance on the Data Engineering Podcast you spoke of the need for a new conversation on data modelling. A reevaluation of dimensional modelling was recently offered (in brief) by Holistics’ Analytics Setup Guidebook; given how much has changed technologically since the ubiquity of cloud data warehouses (which even the more recent additions to the canon such as Agile Data Warehouse Design do not consider), could you imagine a new (or modified) set of modelling principles usefully stretching to book-length?
Also, can you cite any interesting discussions of the topic? For example, Maxime Beauchemin’s contention that slowly changing dimensions could be considered obsolete given the cheap columnar storage provided by cloud DWH’s, which allows options like daily partitioned dimensions that were previously considered anti-patterns?
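
To make the contrast in this question concrete, here is a minimal Python sketch of the two dimension styles being compared: a slowly changing dimension (Type 2, with validity ranges) and a daily partitioned dimension that stores a full copy per day. The customer data, field names, and helper functions are hypothetical and are not taken from the book or from Maxime Beauchemin’s writing.

```python
from datetime import date

# SCD Type 2 style: one row per version, with a validity range.
# valid_to is exclusive; None means "still current". (Hypothetical data.)
SCD2_ROWS = [
    {"customer_id": 1, "tier": "free",
     "valid_from": date(2022, 8, 1), "valid_to": date(2022, 8, 3)},
    {"customer_id": 1, "tier": "pro",
     "valid_from": date(2022, 8, 3), "valid_to": None},
]

# Daily partitioned style: a full snapshot of the dimension for every day.
# Storage is duplicated, but a lookup is a plain key access.
DAILY_PARTITIONS = {
    date(2022, 8, 1): {1: "free"},
    date(2022, 8, 2): {1: "free"},
    date(2022, 8, 3): {1: "pro"},
}


def tier_from_scd2(rows, customer_id, as_of):
    """Find the version whose validity range contains as_of."""
    for row in rows:
        if row["customer_id"] != customer_id:
            continue
        starts_before = row["valid_from"] <= as_of
        still_valid = row["valid_to"] is None or as_of < row["valid_to"]
        if starts_before and still_valid:
            return row["tier"]
    return None


def tier_from_daily(partitions, customer_id, as_of):
    """Read the snapshot for the requested day; no range logic needed."""
    return partitions.get(as_of, {}).get(customer_id)
```

Both lookups return the same answer for a given day; the trade-off is range logic and update bookkeeping on one side versus duplicated storage on the other, which is the cost that cheap columnar storage is said to reduce.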

Joseph Reis

These are all good ideas. Modeling needs to incorporate much more than batch paradigms. I’m personally excited about data modeling for streaming and event data, among other innovations.
Data Vault is pretty dope too.

Joseph Reis

Graph is another area I’m excited about, and I think there’s quite a bit to borrow from graph models with respect to traditional relational data modeling.

Matthew Housley

I think there are a lot of ideas out there on the present and future of data modeling. However, so far I haven’t seen someone commit to a new global vision of data modeling principles that can be adapted to current database and data lake technologies. That could potentially be a great book (or multiple books), but would require a huge amount of work.

Ricky McMaster

Thanks for the responses guys. So far I haven’t needed to do much with streaming data but modelling for that sounds interesting, along with graph - will check that out. I guess the neo4j O’Reilly book would be a good place to start?
Admittedly I’ve only really used the Kimball approach so I don’t really know what Data Vault could offer in addition.
Ok so a new book would be difficult but definitely useful then - here’s hoping that happens 🤞

Ricky McMaster

Sorry Joseph Reis and Matthew Housley, a follow-up on this one: I’ve been using the Kimball approach exclusively as it seemed best suited to what was necessary (and I liked the iterative approach in comparison to Inmon). However, it seems Inmon is personally quite persuaded by Lakehouse…
In any case, you mention Data Vault above: would you say any of the three main dimensional modelling approaches lend themselves best to the cloud DWH landscape?

Joseph Reis

There might be some confusion with how these various modeling techniques are described. Inmon originated the concept of the data warehouse (and later extended this to the data lakehouse). The data warehouse as defined by Inmon uses data marts to present data for various business use cases. These data marts might be modeled using Kimball’s dimensional modeling techniques.
Inmon defines a data warehouse as “a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process.”
Kimball took the notion of dimensional modeling, and applied it to a data mart only. He called this a data warehouse, which caused decades of confusion and infighting about the meaning of a data warehouse (I define it as Bill Inmon originally did).
Data Vault is considered the son of the data warehouse (according to Inmon). It’s a way of organizing your data from various sources in a very coherent way. You can add Kimball star schema on top of the Data Vault.
There’s not a one-size-fits-all answer to the cloud DWH, sadly, since you can also go for wide, quasi-denormalized tables, incorporating nested data structures.
As I mentioned before…confusing, isn’t it?

Ricky McMaster

Hmmmm yeah maybe my perception of Inmon is not in fact the reality 😉 … I need to research him and Data Vault for sure.

Joseph Reis

welcome to a giant can of worms 🙂

Ricky McMaster

Hahahahaha thanks 🙂

Ricky McMaster

> Data Vault is considered the son of the data warehouse (according to Inmon). It’s a way of organizing your data from various sources in a very coherent way. You can add Kimball star schema on top of the Data Vault.
Very good to know, thanks a lot. This is exactly the sort of thing I wanted to find out.

Joseph Reis

sadly, this stuff is often buried underneath hundreds of pages of text in various books

Ricky McMaster

Your hours (weeks?) of pain on our behalf is much appreciated

Joseph Reis

months and years and decades :)

Ricky McMaster

Yes I can imagine… based on that, I guess even some stuff that’s very dated in technical terms, e.g. The DWH ETL Toolkit, might still have useful ideas/techniques?

Joseph Reis

The underlying techniques are still very valid

Ricky McMaster

Very good to know

Matthew Housley

One problem with traditional data modeling techniques is that they emphasize normalization, but absolute adherence to normalization doesn’t really make sense for today’s data and databases.

Matthew Housley

For example, third normal form (3NF) prohibits arrays and other forms of nesting. With the rise of NoSQL, trying to remove all JSON nesting in analytics data really doesn’t make sense. In addition, modern columnar databases offer extraordinary performance on nested data and arrays; data engineers are foolish not to take advantage of these capabilities.

Matthew Housley

On the opposite end of the spectrum, I’ve frequently seen people advocate for denormalizing everything. Does this mean that you should try to put all of your data into one huge table? It’s a nonsensical prescription.

Matthew Housley

My suggestion is that data engineers learn traditional approaches to modeling, then make adjustments based on the data they’re handling, the use case, and the technology involved.

Matthew Housley

Normalization should be viewed not as a strict set of rules, but as a knob we can turn to suit the problem at hand.

Matthew Housley

For example, data that is naturally nested should potentially remain that way in tables for analytics, ML, etc. On the other hand, data that’s reused in many places should potentially live in its own table and get joined into other tables as required.
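
As a rough illustration of normalization as a “knob”, the Python sketch below keeps naturally nested data (the line items of an order) nested, while data reused in many places (product attributes) lives in its own lookup and is joined in when needed. All names and values are hypothetical; this is only one way to picture the idea, not a prescription from the book.

```python
# Reused data lives in its own "table" (here a dict keyed by product_id).
PRODUCTS = {
    "p1": {"name": "widget", "unit_price": 2.5},
    "p2": {"name": "gadget", "unit_price": 10.0},
}

# Naturally nested data stays nested: each order carries its own items array.
ORDERS = [
    {"order_id": 1, "items": [{"product_id": "p1", "qty": 4},
                              {"product_id": "p2", "qty": 1}]},
    {"order_id": 2, "items": [{"product_id": "p1", "qty": 2}]},
]


def order_totals(orders, products):
    """Join each nested item to the shared product lookup and sum per order."""
    totals = {}
    for order in orders:
        totals[order["order_id"]] = sum(
            item["qty"] * products[item["product_id"]]["unit_price"]
            for item in order["items"]
        )
    return totals
```

Changing a product price touches one place only, while reading an order never requires reassembling its items from a separate table.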

Cesar Garcia

Hey Joseph Reis and Matt Housley! Thanks for creating this book on Data Engineering. These are my questions for you:

  • What are the areas less developed in the Data Engineering ecosystem?
  • Do you envision that Data Engineering practices could pervade other areas besides big corps, like Open Government Data?
  • What would be the biggest knowledge gaps for someone with sysadmin background to work as Data Engineer?
    Thanks for your time!
    César
Joseph Reis

Less developed areas in the DE ecosystem: collaboration with upstream/downstream stakeholders is a big mess right now, and the immaturity of skills and competencies on data engineering teams holds back the full potential of data engineering best practices. That’s hopefully where a book like FoDE can help bridge the skills gap.

Matthew Housley

I would love to see data engineering principles applied to open government data, but this is not something easily achieved given that government IT tends to be way behind corporate IT, and even further behind startups and tech companies. In the US, a huge initiative at the federal level would be required to make this happen.

Cesar Garcia

Thanks for your responses! I am preparing some publications on applying data engineering principles to improve open government data quality.
I totally agree about the inertia in public administration, so I am proposing a complementary approach: could citizens/civic initiatives collaboratively construct a quality checking infrastructure as some sort of checks and balances system?
To offer a concrete example, I am envisioning some kind of shared Meltano infrastructure, that automatically extracts data from Open Government Data Portals and then triggers a quality check using Great Expectations. These Great Expectations rules could be versioned and shared across different cities, states, countries, etc. These checks could also trigger some metrics on response times, data drift, etc.
If you’d like to know more about this topic, please let me know so I can send you the links when published
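
The idea of shared, versioned quality rules can be pictured with a small Python sketch. This is a plain-Python stand-in only: it does not use the Meltano or Great Expectations APIs, and the rule format, column names, and sample rows are all hypothetical.

```python
# A versioned rule set that several cities could share (hypothetical format).
RULES_V1 = [
    {"column": "population", "check": "not_null"},
    {"column": "population", "check": "min", "value": 0},
]


def run_checks(rows, rules):
    """Apply each rule to every row; return (row_index, column, check) failures."""
    failures = []
    for rule in rules:
        column = rule["column"]
        for index, row in enumerate(rows):
            value = row.get(column)
            if rule["check"] == "not_null" and value is None:
                failures.append((index, column, "not_null"))
            elif rule["check"] == "min" and value is not None and value < rule["value"]:
                failures.append((index, column, "min"))
    return failures


# Rows as they might arrive from an open data portal extract (made-up values).
SAMPLE_ROWS = [
    {"city": "A", "population": 10},
    {"city": "B", "population": None},
    {"city": "C", "population": -5},
]
```

Because the rules are plain data, they can be kept under version control and reused across extracts, which is the property the proposal relies on.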

Joseph Reis

seems reasonable

Ning Wang

It would be great if there were a unified governance system. Maybe smart contracts finally have a suitable use case that can benefit everyone in the world. 🙂

Nakul Bajaj

Hi Joseph Reis and Matthew,
Thanks for creating the book.
I wanted to ask when building data ingestion pipelines and/or data ingestion patterns, what design principles can be utilised when creating pipelines and design patterns?
I also wanted to ask, if there are frameworks to help choose between open source pipeline orchestrators on the basis of their deployment, maintenance vs managed services. Will the book cover those?
Finally, what is the role of the data quality framework in data engineering? Also, any tips on best practices for ELT and ETL? And which one is better for modern data warehouses such as Snowflake or BigQuery?

Joseph Reis

> I wanted to ask when building data ingestion pipelines and/or data ingestion patterns, what design principles can be utilised when creating pipelines and design patterns?

Joseph Reis

Think in terms of push/pull and sync/async. Ensure reliability of payloads in terms of schedule and structure.
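
One way to picture “reliability of payloads in terms of schedule and structure” is a pair of small checks at the ingestion boundary, sketched below in Python. The expected fields, the one-hour gap, and the function names are assumptions made for illustration; a real pipeline would take them from its own contract with the source.

```python
from datetime import datetime, timedelta

# Structure: the fields and types a payload is expected to carry (hypothetical).
EXPECTED_FIELDS = {"event_id": str, "amount": float, "created_at": str}


def structure_errors(payload, expected=EXPECTED_FIELDS):
    """Return a list of problems with the payload's shape; empty means OK."""
    errors = []
    for name, expected_type in expected.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            errors.append(f"wrong type for {name}")
    return errors


def is_late(last_arrival, now, max_gap=timedelta(hours=1)):
    """Schedule: flag a source whose last delivery is older than max_gap."""
    return now - last_arrival > max_gap
```

The same two checks apply whether the data is pushed to you or pulled by you; only the place where they run changes.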

Joseph Reis

> I also wanted to ask, if there are frameworks to help choose between open source pipeline orchestrators on the basis of their deployment, maintenance vs managed services. Will the book cover those?

Joseph Reis

yes, we cover this extensively in Chapter 4 - choosing the right technologies.

Joseph Reis

> Finally, what is the role of the data quality framework in data engineering? Also, any tips on best practices for ELT and ETL? And which one is better for modern data warehouses such as Snowflake or BigQuery?

Joseph Reis

Data quality is part of data management, one of the undercurrents of the Data Engineering Lifecycle. Definitely incorporate data quality into your workflows.
As for ELT vs ETL for Snowflake or BQ, both can handle either one. We don’t pick a side with ELT vs ETL, and suggest using the best approach for the job.
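
For readers less familiar with the distinction, the sketch below contrasts the two orderings using Python and the standard-library sqlite3 module as a stand-in for a warehouse. The table names, sample rows, and cleaning step are hypothetical, and SQLite is used only so the example runs anywhere; it is not a claim about how Snowflake or BigQuery should be used.

```python
import sqlite3

# Raw rows as they might arrive from a source: amounts are untidy strings.
RAW_ROWS = [("2022-08-15", " 12.50 "), ("2022-08-16", "7")]


def etl(conn, rows):
    """ETL: transform in application code, then load the cleaned result."""
    cleaned = [(day, float(amount.strip())) for day, amount in rows]
    conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)


def elt(conn, rows):
    """ELT: load the raw data as-is, then transform with SQL in the database."""
    conn.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales AS "
        "SELECT day, CAST(TRIM(amount) AS REAL) AS amount FROM raw_sales"
    )


def total_sales(conn):
    """Read the final table, which both approaches produce."""
    return conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Both paths end with the same `sales` table; the difference is where the transformation runs and whether the raw data is kept alongside it.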

Matthew Housley

Regarding data ingestion: use off-the-shelf ingestion tools when they exist. Writing custom ingestion code is often a waste of time if someone else has already done this work. You’ll find that you still spend a lot of time on ingestion pipelines where there are no existing solutions.

Matthew Housley

In the long term, I would like to see data providers move toward a “data sharing” paradigm where they land data in object storage or a cloud columnar database. This would free up a lot of time for data engineers to focus on higher value tasks. Google is a leader in this area, and we’ve recently seen companies like Stripe move in this direction.

Joseph Reis

And Snowflake

Matthew Housley

Regarding open source pipeline orchestrators: this area is changing extremely fast. I would suggest watching the current market leaders (Airflow, Dagster, Prefect) and talking to other practitioners. There will likely be interesting new entrants in 2022 and 2023, though I would be careful about being an early adopter.

Joseph Reis

Cron 😉

Joseph Reis

jk

Matthew Housley

😱

Ricky McMaster

Yeah this was something I was thinking about asking - what you both reckon about the data platforms like Snowflake and Databricks. Now I know.

Nakul Bajaj

Thanks so much for the reply, Joseph Reis and Matthew Housley.
Love Airflow, used it as a managed service mostly.
Started testing out Prefect.
Any serverless orchestrators or serverless frameworks suggested? And your thoughts on these compared to deployments?

Matthew Housley

Thank you!

Joseph Reis

Howdy Matt!

Matthew Housley

I just joined the channel.

Rosona

Hey Joseph Reis and Matthew Housley . From the “look in the book” preview, this looks like a bird’s eye view, readable like a book book instead of a “bam, here’s a GitHub repo, get your hands dirty” book. I’m here to ask if that changes mid book or if I’ve understood the vibe and writing style correctly.
Also, thanks for coming here to answer questions.

Joseph Reis

Part 1 sets the tone. Part 2 gets decently technical, but in an approachable way.
There are no github repos or code exercises.

Matthew Housley

We essentially wrote a book to complement other technical resources. Data engineering is so vast that we felt like we had to take a bird’s eye view. But we would suggest reading this book in conjunction with general engineering books (e.g. Designing Data Intensive Applications) and specific technology books based on the stack you’re working with (e.g. Learning Spark or Stream Processing with Apache Flink).

Rosona

Excellent! Thanks for the very thorough answers.

Gur Hevroni

Hi Joseph Reis and Matthew Housley, thanks for writing the book and for taking questions here!
I have a couple of questions:

  1. Do you cover in the book what a data engineer role entails in different organizational settings (corporate, startup, etc.)? For example, I’m curious to know what would be the first order of business for the first data engineer hire in a 50-100 employee startup, compared to one joining one of the big-tech companies.
  2. What is the set of skills required to become a successful data engineer? Would it be more similar to the skill set of a software engineer? a data scientist? a combination of both/others?
    I’m looking forward to learning more about your book!

Joseph Reis

  1. Yes, we discuss this quite thoroughly throughout the book. At the end of each chapter, there’s a “who you’ll work with” section. We also cover early vs mature companies, and how they should work with DE
  2. There’s not a single answer to the question. A successful DE is the one who can keep the business and stakeholders happy :)

Laura

Hi Joseph Reis and Matthew Housley, nice to meet you here. I really enjoyed listening to your Super Data Science podcast episode. 👋
I have a question as well:
What’s your advice for a Data Scientist trying to get more professional experience with cloud technologies? Doing some hobby projects and watching YouTube tutorials is a nice start, but I feel that I am still missing a lot of knowledge on how to actually run things in production. Do you have any advice?

Joseph Reis

Nothing like digging into cloud products, tinkering, and breaking things 🙂

Joseph Reis

On another note, to get a comprehensive view of a cloud (AWS, GCP, Azure), you might want to consider getting one of their dedicated certs. While some people might bash cloud certs, we think they’re useful for providing a baseline context and skills for that particular cloud. There’s what you think you know about a cloud, and then there’s the way the cloud wants you to use it. These aren’t always clear, and a cert can help clarify these things.

Matthew Housley

I second that, and I’ll add that most of the clouds have data engineering and machine learning specific certs. For example: Google Cloud Platform Certified Professional Data Engineer or the Google Machine Learning Engineer certifications.

Joseph Reis

We’re answering questions now. Feel free to drop new questions. We’ll answer until 5pm MT

Cesar Garcia

This is a side question but… why are there almost no scholarly references about data engineering? In the first chapter you talk about the story of data engineering (coming from DBs to Big Data to the current situation) and the term is quite new. But, are academics still using Big Data terms? Are there people outside O’Reilly / Packt writing about this topic? Is everything moving so fast that people don’t have time to publish about it?

Joseph Reis

Academics tend to be quite behind industry (we also teach at a university, so we know of which we speak). Data engineering is still a relatively new field, so it will take some time for the term and practices to solidify.
That said, read ch. 11 if you really want to be confused 🙂

Cesar Garcia

I will do it for sure! Thanks for the tip!

Matthew Housley

The mandate of academia is much more focused on theoretical problems. So, for example, computer science researchers are interested in general questions about distributed systems, whereas tech company engineers want to build working distributed systems to solve concrete problems.

Matthew Housley

A typical feedback loop between academia and Silicon Valley goes like this. Engineers build a distributed system and discover some kind of behavior. They write a paper about it. Academic researchers take the problem and develop a theoretical framework to explain it. Engineers then use this to improve their distributed systems.

Matthew Housley

There is definitely value in this feedback loop, but it’s a slow process.

Matthew Housley

This is not to say that every academic researcher is strictly focused on theoretical problems, but this is the tendency. Professors who want to do concrete experimentation have to work very hard to stay up to date.

Matthew Housley

From my perspective, Martin Kleppmann has done a good job of staying up to date, partially based on his experience in the trenches.

Matthew Housley

Regarding incentives to publish: many technology employers don’t offer incentives for employees to publish. Worse still, employees may not be allowed to publish due to NDAs.

Matthew Housley

Google has done a good job of contributing back to the community through publications, but I’m sure they keep many things under wraps, and it’s possible that they’re less open now as a huge, mature company.

Joseph Reis

Re: q&a. We will try to answer all questions by 5pm MT every day of this book club. Ask away!

Ricky McMaster

Second question: I totally agree with your point about the importance of data engineers understanding business requirements from the stakeholders’ perspective. However, do you have any advice for junior engineers who do not have previous experience in more stakeholder-oriented roles such as data analysis, and whose main academic and career focus hitherto has been overwhelmingly technical? Would you consider it useful for such roles to be embedded in business unit teams?

Joseph Reis

Definitely. Embedding is a great way to build useful non-DE skills that actually make you a better DE. Learning the downstream user’s requirements builds empathy.

Matthew Housley

It’s a tricky problem to solve. To some extent, you’re at the mercy of the company that employs you, especially as a junior engineer. You’re stuck with their organizational dysfunction. Having said that, conversations go a long way. Learn to have conversations with stakeholders you work with. Also, try to meet people in many roles across the company and talk to them about how they use data.

Ricky McMaster

Cool, good to know thanks. For me personally I come from a BI background so I’m definitely sold on the need for these conversations; it’s good to have it confirmed though that it’s important to prioritise the topic for data engineers, whether formally (e.g. embedding) or not.
I’ve often experienced quite a division between business and tech, but ultimately everyone loses if it persists.

Grzegorz Sajko

When it comes to breadth vs depth when acquiring tech skills - when do you recommend going deeper? What are your thoughts on specialization? As the tech landscape is changing very fast, is it better to be a jack-of-all-trades?

Matthew Housley

I think you need to develop both general purpose and specific tech skills. If I’m looking to hire someone, I look for general purpose coding skills and what I’ll call “data intuition.”

Matthew Housley

I frequently find that software engineers who have mostly worked on services that handle single calls and events struggle when asked to handle bulk data. There’s a learning curve in transitioning from thinking of data as individual elements to viewing data as a large set.

Matthew Housley

This often leads to monstrosities at startups such as “ETL pipelines” that are a bunch of single event microservices stitched together.
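
The difference between event-level and bulk thinking can be sketched with Python and the standard-library sqlite3 module. The table, the sample events, and both function names are hypothetical; SQLite stands in for whatever engine is actually in use.

```python
import sqlite3


def make_events_db():
    """Build a tiny in-memory events table (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("a", 5), ("b", 3), ("a", 7)],
    )
    return conn


def totals_one_at_a_time(conn):
    """Event-level habit: one query per entity, stitched together in code."""
    totals = {}
    user_ids = [row[0] for row in conn.execute("SELECT DISTINCT user_id FROM events")]
    for user_id in user_ids:
        (total,) = conn.execute(
            "SELECT SUM(amount) FROM events WHERE user_id = ?", (user_id,)
        ).fetchone()
        totals[user_id] = total
    return totals


def totals_as_a_set(conn):
    """Bulk habit: describe the whole result once and let the engine do it."""
    return dict(conn.execute("SELECT user_id, SUM(amount) FROM events GROUP BY user_id"))
```

Both functions return the same totals, but the first issues a query per user and its cost grows with the number of users, while the second is a single set-based statement.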

Matthew Housley

I’ve seen many, many people make the transition from event level thinking to bulk data thinking, but there’s a learning curve, and ideally a potential hire has already made that transition.

Matthew Housley

If someone has general data skills, it’s relatively easy to retrain them on new data tools. If someone is good at Spark, they can probably learn other data frameworks with relative ease. If they know a realtime framework such as Flink, they can probably make the transition to Beam.

Matthew Housley

So, my concrete advice is that you should experiment with data frameworks that you’re interested in. It’s critical that you move beyond frameworks that are traditionally single machine oriented (Pandas, R) to bulk data processing tools such as Spark and Beam. Also, get very good at SQL using database engines such as BigQuery, Snowflake or Redshift to solve analytics and data transformation problems. (SparkSQL is also great.) Even if you work primarily in Spark, SQL is an extremely useful tool, and it shows a potential employer that you have developed data intuition.

Matthew Housley

Also, as I mention in another reply, learn cloud infrastructure and orchestration skills. You can’t learn everything, but knowing one cloud and one orchestration tool is a good start for developing proficiency with other such tools.

Grzegorz Sajko

thanks! This is my blind spot (single machine oriented / not cloud), because most personal projects don’t need scale.

Alexey Grigorev

Why are so many data scientists suddenly interested in data engineering? Is data science no longer the sexiest job?

Erald David

Curious to hear your perspective on this, Alexey Grigorev
Could this be because:

  1. Many data scientists realized that a lot of the companies that hired them don’t have sufficient data, so they decided to learn data engineering as well (sort of becoming full-stack), or
  2. A lot of would-be data scientists realized they had fallen in love with the engineering side of data science, so they flocked to a more “engineering”-y position like data engineer or even analytics engineer (like this great post from Benn Stancil)?

Roman Zabolotin

As for me, I think that I like working with data and I like programming too.
I think that working as a data scientist is more like research, math, science and so on.
But if your interest is programming, then data engineering will suit you more 😊

serdar

I am not even a data scientist, but I am trying to make a change. In my last couple of DS interviews, I heard interviewers complaining about the mediocre skills of the DS people. I guess those who really understand the nitty-gritty stuff find a new direction with DE. Just a thought..

GerryK

Maybe DS needs sector expertise related to data. DE has more software / tools/ infrastructure.

Matthew Housley

See the thread above on becoming “recovering data scientists.” https://datatalks-club.slack.com/archives/C01H403LKG8/p1660777107448379?thread_ts=1660703899.890469&cid=C01H403LKG8

Matthew Housley

But I’ll add that increasingly, data teams are expected to put models into production.

Matthew Housley

At the height of the data science craze a few years back, data scientists would create a simple model on their laptop, hand some magical insight to a business stakeholder, and move on to the next project.

Matthew Housley

At least, that was the dream. In practice, businesses discovered that the value of their data initiatives was limited if they couldn’t get models into production.

Matthew Housley

This led to a much greater emphasis on data engineering and ML engineering. As such, there seem to be more data engineering openings than data science openings, and data engineer demand far outstrips supply. (We’ll see if this trend holds with ongoing economic shifts.)

Matthew Housley

This blog post provides a great visualization of the foundations required for successful data science.

Matthew Housley

At any rate, I would argue that the prestige of data science and ML has actually increased in the last several years, but this has led to companies emphasizing better support structures for these initiatives, hence the demand for data engineers.

Ning Wang

My lab mates in college are mostly data scientists, but I was an SWE before and am a data engineer now (I am more interested in coding than building/tweaking data models). Personally I think with basic/standard tools, it could be very challenging to analyze most raw data. DEs build systems (standard or dedicated) to further process data so that DSs can get A LOT more value from the data and be A LOT more productive.

Avinash M

Joseph Reis would you suggest this book to a newbie in data engineering field like me?

Joseph Reis

yep!

Avinash M

Thank you!

Aayush

Thanks for this great book!

  • How, according to you, should one approach Data Engineering at reasonable scale, i.e., at places that have data but not necessarily at terabyte scale? Do you think there is a need to implement a modern data architecture at such organizations?
  • At what point does one realize that they have reached a place where they now need specialised Data Engineers to look after their data needs?
  • Currently, there is a heavy reliance on tool-based learning for Data Engineers - learn SQL, learn Spark, Kafka and you are good to go. Do you think the heavy focus on tools rather than the fundamentals of DE is a concern?

Kevin Kho

That third question is a very good question. Though SQL is universal so it’s not the same level as Spark and Kafka. Wanna hear their thoughts.

Matthew Housley

Regarding the first bullet point:

Matthew Housley

A really great thing about modern data engineering is that many of our tools now scale smoothly from gigabytes to petabytes.

Matthew Housley

Off the shelf tools such as Spark, BigQuery, Snowflake, etc. can quickly process a few megabytes or run a 1 PB query. It’s just a question of allocating resources at the correct scale for the problem at hand.

Matthew Housley

The relative simplicity of using these tools means that I generally advise organizations with small data problems to spin up one of these options rather than using self hosted Postgres (for example). The reason is that using a cloud based tool decreases operational overhead, and delivers better uptime and consistency, key considerations at any data scale.

Matthew Housley

Regarding the second bullet point: because scaling data is now relatively easy, the main considerations for dedicated data engineering talent have shifted somewhat.

Matthew Housley

I would ask these questions to assess the need for dedicated data engineers:

Matthew Housley
  1. What are the quality expectations for the data? How difficult will it be to maintain quality? Are you ingesting data that is relatively dirty and complex, requiring complex pipelines for cleaning?
  2. What is the expected service level agreement for the data?
  3. How many different data sources are you handling? (Note that this is a separate question from the size of the data.)
  4. How quickly will new data sources be onboarded in the future?
  5. How sensitive is the data? Are you handling data with proprietary company secrets or sensitive personal information?
Matthew Housley

Any such requirements can be a motivation for hiring dedicated data engineers even if the data is only gigabytes in size. One major problem with the modern data stack and the cloud is that it made data too easy, so that people with no security and engineering qualifications ended up causing data breaches or providing incorrect business data to business stakeholders.

Matthew Housley

Regarding the last bullet point, see my response to Grzegorz Sajko above.

Srik

Joseph Reis and Matthew Housley - Do the contents of the book vary by the region in which it is released? Example - Amazon.com indicates FoDE is 440 pages, amazon.in lists it as 452 pages.

Srik

btw kindle edition has 740 pages

Matthew Housley

Not to my knowledge — I believe the editions are just formatted a bit differently. The Kindle version paginates into smaller pages, and the Indian version may have extra back or front matter pages. (I actually have a copy of this edition. I’ll have to check.)

Srik

Thank you

Arianna Cooper

Hi Joseph Reis and Matthew Housley! 😄 Given the years of experience and knowledge you both have in the industry, what have you both seen to be the best practices when it comes to senior level data engineers growing their early career level folks on their teams? What are the green or red flags that companies have demonstrated that show you that they really value (or don’t value) growing their junior and mid level data engineers? (I ask since I just graduated!)

Joseph Reis

Green flags - Showing you best practices and coaching you while you’re growing. Assigning you work and telling you what needs to be done, letting you figure out how. Stepping in to coach you when you’re astray or beyond stuck.
Red flags - No support or direction.

Arianna Cooper

Gotcha. Thank you so much for your input, really appreciate it!!

Ning Wang

Hi, Joseph Reis and Matthew Housley. Thanks for writing the book! As a data engineer myself, I think the job is relatively new compared to many other computing jobs, and many people may not have the right idea about what it is and what it can/should do. A bird’s-eye view could be very helpful for data engineers and people in related roles.
My question is: from some of your previous answers, it seems to me that you are supporters of “data sharing”. So what are the main values of “data sharing” in your mind? In high volume and high efficiency scenarios, what are the main points that justify “data sharing” vs efficiency? In the architecture at my company, we care a lot about performance and efficiency, which makes data sharing trickier in my mind, so I am wondering what you think.
Finally, good luck with your book!

Ning Wang

And btw, the “Cut through marketing hype” part reminds me why I left my previous company. LOL.

Matthew Housley

Data sharing is generally used to share data in a database system that separates compute and storage. For example, BigQuery and Snowflake support flavors of data sharing, and Amazon S3 can be used for data sharing in a data lake environment.

Matthew Housley

The basic idea is that you grant another user, team or organization access to specific data, such as a dataset in BigQuery, a “share” in Snowflake or a specific set of objects in S3.

Matthew Housley

The users on the other end then spin up their own compute to consume the data as they wish. This saves you the trouble of having to grant access to accounts and clusters; they get read only access to the stored data itself.
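
As a rough illustration of the S3 flavor of this idea, here is a minimal sketch that builds a bucket policy granting another AWS account read only access to one prefix. The bucket name, prefix, account ID and the helper name build_read_only_share_policy are all hypothetical; the policy only allows listing and reading, so the consumer brings their own compute and never touches your pipelines or clusters.

```python
import json


def build_read_only_share_policy(bucket: str, prefix: str, consumer_account_id: str) -> dict:
    """Build an S3 bucket policy that lets one AWS account list and read a prefix.

    All argument values used below are made-up placeholders.
    """
    principal = {"AWS": f"arn:aws:iam::{consumer_account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted to the shared prefix.
                "Sid": "ListSharedPrefix",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is granted on the objects under the prefix; no write actions.
                "Sid": "ReadSharedObjects",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = build_read_only_share_policy("example-data-lake", "shared/partner_x", "111122223333")
print(json.dumps(policy, indent=2))
```

The printed JSON is what you would attach to the bucket as its policy. BigQuery datasets and Snowflake shares follow the same principle with their own grant mechanisms, which are not shown here.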

Matthew Housley

Data sharing is a great way to publish datasets publicly. For example, a good deal of government data is available through shared BigQuery public datasets.

Matthew Housley

In addition, data sharing facilitates cross organizational collaboration. For example, you may be working with a partner ad tech company that needs access to certain data to create models. Data engineers build pipelines to appropriately scrub the data and load it into the shared datasets.

Matthew Housley

Finally, data sharing is interesting from the perspective of Zhamak Dehghani’s data mesh concept. Individual teams build pipelines to prepare data for external consumption, then share the data product across the company without granting any access to pipelines.

Ning Wang

Thanks for the details! Totally makes sense.

Alexey Grigorev

For data scientists who want to go into data engineering, what should they do apart from reading your book?

Doink
enroll in #course-data-engineering 😉 ?
Joseph Reis

What Doink said 😉

Joseph Reis

Also, build your network of people who can get you in front of great job and project opportunities.

Joseph Reis

Thanks everyone!

Ricky McMaster

Another one: something I have experienced more often than not is years-old, accumulating technical debt as a result of poorly designed and maintained operational/application databases. Sometimes, this is compounded by a switch to a cloud data warehouse, but with even less data integrity given how flexible they are in this respect.
I definitely acknowledge tools such as dbt filling the gap in data quality maintenance, but meanwhile do you detect a genuine rediscovery of solid relational database modelling principles, which are in large part decades old?

Joseph Reis

Data modeling is making a comeback for sure. Learn it, love it.

Matthew Housley

Unfortunately, technical debt is just a part of the job. Even for forward looking organizations that embrace new technologies, it takes time to retire old systems, and you don’t necessarily have the authority to accelerate this process. Interfacing with old systems is often a data engineering responsibility.

Ricky McMaster

> Unfortunately, technical debt is just a part of the job.
Tell me about it… I actually don’t come from a technical academic background, and I would love to know if data modelling was given more of a priority in computer science degrees generally these days.

Joseph Reis

I’ve never seen CS degrees teach data modeling, at least as it pertains to analytics. A database class might teach relational algebra and the normal forms of relational data modeling.

Ricky McMaster

> A database class might teach relational algebra and the normal forms of relational data modeling.
Yup - I feel like the correct procedures for even 3NF modelling are often overlooked though (at least from my perspective in Germany).
I don’t doubt that it’s part of my job to deal with the legacy issues, but I’d love to see modelling making a comeback for sure.

Joseph Reis

It’s hard. Laziness about putting in the hard work is widespread.

Ricky McMaster

Haha well there is that

Alexey Grigorev

How important is it for data engineers to know how to build a dashboard? Usually it’s more a job for an analyst

Joseph Reis

While building a dashboard isn’t a direct requirement for a DE, if a DE knows how to build a dashboard, I think it helps the DE understand how to structure the data for delivery to the analyst. Zero downside for a DE to know dashboard basics

Matthew Housley

I second that. And learn to communicate with the people building dashboards so you can be better at your job.

GerryK

Hi Joseph Reis and Matthew Housley,

  1. Are you touching on data validation during the ETL/ELT?
  2. Do you use coding examples?
  3. Are you touching on unit test /integration / functional tests for data pipelines?
Joseph Reis
  1. Yes, this is covered throughout the book
  2. No coding examples
  3. We touch on testing throughout the book
GerryK

Sounds good!

Arnthor S

Hi Joseph Reis and Matthew Housley, thanks for doing this! I’m currently reading Designing Data-Intensive Applications, and wondering what the differences between the books are and if I should read FoDE when I’m finished with DDIA? Any important/interesting chapters/areas in this book that are missing in DDIA that you can highlight? I think I read somewhere here (can’t find it now) an answer from Joseph Reis where you recommended reading this book first and then DDIA, so maybe I should pause DDIA and start this one now?

Joseph Reis

DDIA shows you the ins and outs of building distributed systems. FoDE is oriented toward the big picture of data engineering. You can think of FoDE as the prequel to DDIA, and very much orthogonal to DDIA’s direction. If you’re a data engineer, I’d say read both, starting with FoDE.
A common thing I hear about DDIA is it throws you off the deep end quite quickly. FoDE will make your experience with DDIA much nicer.
Sidenote - DDIA author Martin Kleppmann was one of our tech reviewers

Matthew Housley

DDIA is important for two reasons in data engineering. First, if you work for a big tech company, you may be responsible for working on the guts of large scale distributed data systems. This book explains how the sausage is made. Second, if you work more with off the shelf technologies, you still need to understand how they work so you can debug problems.

Arnthor S

Thanks!

Sergio Rozada

Hi Joseph Reis and Matthew Housley, quick question here, what do you think about ML Engineers with core capabilities of doing Data Engineering? Unicorns? Better to have competences split into different roles?

Joseph Reis

Split the competencies into different roles. All in one is definitely a unicorn.

Matthew Housley

There’s also an ongoing debate about which responsibilities belong under ML engineering versus data engineering. The exact boundaries will be determined by the company you work for.

Varun Nayyar

Hey Joseph Reis and Matt.
Thanks for visiting here and taking the time out to answer our questions.
I just have a few teeny ones.

  1. Would you recommend this book to someone with a little to no knowledge about data engineering practices?
  2. Which cloud provider would you recommend as there is a lot of diverse advice in the market right now, keeping in mind future relevance?
  3. What is a good starting point for data engg. Other than python and pandas? Or something entirely different?
  4. How influential are data engineers in contributing to growth of projects? Is their role undermined by someone say a data scientist?
    Thanks for your presence here.
    Wish you great luck for your book.
Joseph Reis
  1. Definitely. This book is geared toward people new to the field. That said, very senior DE’s have said they’ve also learned a ton of new stuff from FoDE.
  2. I can’t suggest a particular cloud provider. Pick the one you’ll get most traction with (job prospects, joy, etc)
Joseph Reis
  3. A good starting point for a DE? Read our book and find out 😉
  4. The influence of DE’s largely depends on the team they’re operating in. DS’s shouldn’t undermine the role of a DE, and vice versa.
Matthew Housley

Regarding 2: GCP has great technology, but still somewhat limited mindshare. Azure appeals to companies that utilize Microsoft products; they are growing extremely fast and investing heavily in their data offerings. AWS still seems to be the mindshare leader for Silicon Valley engineers.

Matthew Housley

Regarding 4: many projects initially need data engineering more than they need data science. For example, customer facing analytics for a large number of customers can have a big impact on the growth of a SaaS platform, even if these analytics are relatively simple.

Matthew Housley

In the days of peak data science a few years back, companies hired data scientists like crazy, but they were stymied by a lack of data engineering. Now, there’s a recognition that data engineering acts as a catalyst for the success of data science and machine learning.

Varun Nayyar

Thanks for the responses.
Looking forward to reading your book.

Joseph Reis

Great questions so far. Keep ‘em coming.

Eric Sims

Not directly related to your book, but how did you each find yourselves working in data engineering? I imagine it wasn’t called DE then, and you probably didn’t start out expecting to write a book about it someday. Where did you start? Where did you think you were going? And why did you end up here?

Matthew Housley

I have a PhD in math, and my research was on the pure side of the discipline. That is, theoretical research problems generally not related to statistics, data, computer science, etc. When I started, “big data engineer” was a hot title, with an emphasis on managing on-premises Hadoop clusters and writing MapReduce jobs. However, EMR (Amazon’s managed Hadoop service) and Redshift (Amazon’s managed columnar database) were starting to take off.

Matthew Housley

Both Joseph Reis and I refer to ourselves as recovering data scientists because we started out in data science but organically became data engineers because we needed to build data pipelines in order to deliver projects we were working on.

Matthew Housley

In my case, I worked on a number of cloud oriented data projects and spotted that as a long term career growth opportunity.

Joseph Reis

I’ve only really worked with data in some capacity or another. My path has been very circuitous in the details, but the general direction has always been there for over 20 years.

Philip Dießner

Hello Matthew Housley and Joseph Reis Thanks for being here!
Following up on a previous question, what are good ways to learn about data modeling, especially when to use one or another modeling method? (Besides reading the section in your book 😉 )

Joseph Reis

A lot of trial and error. I suggest learning the big ideas, such as dimensional and relational modeling, and applying those to real world datasets. For example, you can use BigQuery’s numerous public datasets and experiment with various ways to model them.
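
One way to start experimenting, sketched here with Python's built-in sqlite3 module rather than BigQuery, is to take a flat table and remodel it dimensionally. The table names, columns and sample rows below are made up for illustration; the point is only to show a flat order feed being split into two dimension tables and one fact table, then queried with joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Flat, denormalized source rows (hypothetical sample data):
# order_id, order_date, customer, country, product, quantity, unit_price
flat_orders = [
    (1, "2022-08-15", "Alice", "DE", "Keyboard", 2, 30.0),
    (2, "2022-08-15", "Bob", "US", "Monitor", 1, 200.0),
    (3, "2022-08-16", "Alice", "DE", "Monitor", 1, 200.0),
]

# A tiny star schema: two dimensions and one fact table.
cur.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT UNIQUE, country TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE, unit_price REAL);
CREATE TABLE fact_order (
    order_id INTEGER PRIMARY KEY,
    order_date TEXT,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity INTEGER
);
""")

for order_id, order_date, customer, country, product, qty, price in flat_orders:
    # Each customer and product is stored once; repeats are ignored via UNIQUE.
    cur.execute("INSERT OR IGNORE INTO dim_customer (name, country) VALUES (?, ?)", (customer, country))
    cur.execute("INSERT OR IGNORE INTO dim_product (name, unit_price) VALUES (?, ?)", (product, price))
    # The fact row stores only keys and measures.
    cur.execute(
        """INSERT INTO fact_order (order_id, order_date, customer_id, product_id, quantity)
           SELECT ?, ?, c.customer_id, p.product_id, ?
           FROM dim_customer c, dim_product p
           WHERE c.name = ? AND p.name = ?""",
        (order_id, order_date, qty, customer, product),
    )

revenue_by_country = cur.execute("""
    SELECT c.country, SUM(f.quantity * p.unit_price)
    FROM fact_order f
    JOIN dim_customer c USING (customer_id)
    JOIN dim_product p USING (product_id)
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(revenue_by_country)  # [('DE', 260.0), ('US', 200.0)]
```

Trying the same dataset in a fully normalized relational layout, and comparing how the queries read, is a cheap way to build intuition for when each approach fits.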

Alber Novo
  • Joseph Reis in a previous thread, you shared how you stay up to date by reading a lot. Could you share some of the sources you have found reliable?
  • Matthew Housley, you mentioned getting a cloud certification in another thread. In your opinion, which one would you recommend as a starting point, AWS or GCP? (Just to clarify, I don’t work in the area, but I’m considering working in this area eventually.)
  • And a question about the book: in chapter 11 you mentioned emerging tools boosting spreadsheets with OLAP systems. Could you mention some of the candidates for this new class of tools you’ve seen? Thank you for writing this book and taking time to answer questions here :)
Matthew Housley

Hmm… this is a tough question to answer. Personally, I prefer GCP’s data tools, but AWS certs will potentially give you access to more jobs given their massive mindshare.

Matthew Housley

Regarding our speculation on spreadsheets and new interactive data paradigms: there are various SaaS scalable spreadsheets, such as https://www.gigasheet.com/. We’ll see how they do in the marketplace.

Matthew Housley

Beyond that, I suspect that we’ll see a lot of interesting developments in terms of interactive analytics paradigms in the next few years.

Alber Novo

Thanks Matthew Housley. If it’s tough for you to give an opinion, you can imagine how hard it is for me to decide 😅. I appreciate your input; I’m also inclined to start with GCP.

Idil Ismiguzel

Hi Matthew Housley and Joseph Reis, thanks for being here and writing this wonderful book. You mention that the modern data engineering profession exists to serve downstream data consumers such as data scientists and ML engineers, and that the boundaries between these three roles are often blurry and depend on the organization’s data maturity. I was wondering about your opinion on the main overlaps between ML Engineering and Data Engineering. In which areas should an ML Engineer be as expert as a Data Engineer, even when the organization splits these two roles?

Muhammad Awon

I was trying to figure this out today but couldn’t find much help, glad you asked the question.
Thanks.

Joseph Reis

Look at the similarities - both DEs and MLEs need to move data between systems and storage. Additionally, they need to store and serve it.
The differences are the MLE needs to know ML pretty well, as that’s the use case the MLE is serving. The DE might also need to know ML, if that’s also the use case the DE is serving. And herein is where the lines get blurry…The big difference is the MLE focuses more on the ML lifecycle, which is quite distinct from the DE lifecycle

Matthew Housley

And MLEs build a lot of data pipelines. Each organization has to decide which pipelines are owned by MLEs and which by DEs.

Joseph Reis

is MLDE a job title yet?

Idil Ismiguzel

Thank you for your answers!

Joseph Reis

Thanks for your questions and interest!

Jk Jensen

Thanks for putting in the effort to make a quality contribution to the space Matthew Housley and Joseph Reis! I’m curious what you see as the biggest problems to be solved in the Data Engineering space with regard to privacy? I am recently coming from the privacy infrastructure world and I would love to see an increased focus in this community on protecting the data we use.

Joseph Reis

There is no shortage of problems, that’s for sure. Privacy is becoming more top of mind, but it’s still a relatively young consideration for DE’s. If this is your specialty, you’ve got a bright future ahead of you!

Matthew Housley

In some respects, the modern data stack has increased issues with privacy by making data too easy. Basically, anyone handling sensitive data needs appropriate training, experience and best practices to do so. Cutting corners can lead to disaster.

Matthew Housley

Beyond that, I’m excited to see the continued emergence of automated sensitive data detection tools. One example is GCP DLP (data loss prevention).

Matthew Housley

Always follow best practices and be on the lookout for sensitive data, but automated tools can help to prevent human error and oversights.
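
To make the idea of automated detection concrete, here is a toy sketch that scans records for values that look like email addresses or US Social Security numbers. It does not use GCP DLP or any real detection service; the patterns, the scan_records helper and the sample rows are hypothetical and far simpler than what production tools do.

```python
import re

# Hypothetical, deliberately simple patterns; real detectors use far richer rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_records(records):
    """Return (row_index, column, kind) for each string value matching a pattern."""
    findings = []
    for i, record in enumerate(records):
        for column, value in record.items():
            if not isinstance(value, str):
                continue  # only text fields are scanned in this sketch
            for kind, pattern in PATTERNS.items():
                if pattern.search(value):
                    findings.append((i, column, kind))
    return findings


rows = [
    {"id": 1, "note": "call back tomorrow"},
    {"id": 2, "note": "reach me at jane.doe@example.com"},
    {"id": 3, "note": "SSN 123-45-6789 on file"},
]
findings = scan_records(rows)
print(findings)  # [(1, 'note', 'email'), (2, 'note', 'us_ssn')]
```

A check like this could run as a pipeline step that flags or quarantines suspicious rows before they reach shared datasets, as a backstop to human review rather than a replacement for it.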

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book

To see other books, check the book of the week page.
