Data Engineering Zoomcamp

Learn Data Engineering in 9 weeks

18 Nov 2023 by Valeriia Kuka

In this article, we take a closer look at the Data Engineering Zoomcamp, a free nine-week course that covers the fundamentals of data engineering. It is perfect for people who already know how to code and want to learn about building data systems.

We will describe different aspects of this course so you can learn more about it:

  • Who is the course for?
  • Course curriculum: what topics and technologies are covered by the course
    • Course project for your portfolio
  • Course assignments and scoring
    • Homework and getting feedback
    • Learning in public approach
  • DataTalks.Club community

The next cohort of the course starts in January 2024. You can find more information about the course here. If you’re ready to join, sign up here.

Who is the course for?

Before we get into the details, it’s important to know what skills you should have to join the course comfortably.

Here are the main prerequisites for the course:

  • Skilled in coding
  • Comfortable with the command line
  • Basic SQL

We’ll mainly use Python, but if you’re not familiar with it yet, it’s not a problem. If you’re already programming in another language, you’ll have no trouble picking up Python here.

Course Curriculum

The course curriculum is structured to guide you through the essential elements of data engineering, beginning with foundational concepts and advancing to complex topics.

Course overview

You’ll spend the first six weeks learning and practicing each part of the data engineering process. In the final three weeks, you will apply what you’ve learned to build an end-to-end data pipeline from the ground up.

  • Week 1: Introduction & Prerequisites
  • Week 2: Workflow Orchestration
  • Week 3: Data Warehouse
  • Week 4: Analytics engineering
  • Week 5: Batch processing
  • Week 6: Streaming
  • Weeks 7, 8, 9: Project

Let’s quickly go over each week, focusing on the main points and the tech you’ll use.

Week 1: Introduction & Prerequisites

Tech: Docker, Postgres, Google Cloud Platform (GCP), Terraform

Focus: Week 1 is dedicated to setting up the key tools and technologies you’ll be using throughout the course.

  • Local Part: Learn to package your application and its dependencies with Docker and run a fully functioning database using PostgreSQL.
  • Cloud Part: Discover Google Cloud Platform and Terraform, focusing on automating the setup and management of your cloud architecture.
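
To make the local part more concrete, here is a minimal sketch of loading a CSV file into a Postgres database running in Docker, using pandas and SQLAlchemy from Python. The container command, connection details, table name, and file name are placeholders rather than the course’s exact setup:

    import pandas as pd
    from sqlalchemy import create_engine

    # Assumes Postgres is already running locally, for example started with:
    #   docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret postgres:15
    engine = create_engine("postgresql://postgres:secret@localhost:5432/postgres")

    # Read any tabular dataset and push it into a database table
    df = pd.read_csv("trips.csv")
    df.to_sql("trips", engine, if_exists="replace", index=False)

    # Quick sanity check that the data landed
    print(pd.read_sql("SELECT COUNT(*) AS n FROM trips", engine))

Once a script like this runs end to end, you have exercised both halves of the local setup: a containerized database and a Python environment that talks to it.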

Week 2: Workflow Orchestration

Tech: Mage.AI, Google Cloud Storage (GCS)

Focus: Week 2 is about task automation and coordination within complex data pipelines.

  • Learn what Data Lakes are and how to set them up using GCS.
  • Use Mage to connect multiple steps in the workflow into a single process.
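
As a flavor of the data lake part, here is a minimal sketch of uploading a local file to a Google Cloud Storage bucket with the official Python client. The bucket and object names are placeholders, and it assumes your GCP credentials are already configured:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-data-lake-bucket")  # hypothetical bucket name

    # Store a local Parquet file as an object in the "raw" area of the lake
    blob = bucket.blob("raw/trips/2023-11.parquet")
    blob.upload_from_filename("trips.parquet")

    print(f"Uploaded to gs://{bucket.name}/{blob.name}")

In the course, a step like this would typically live as one stage of a Mage pipeline rather than as a standalone script.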

Week 3: Data Warehouse

Tech: BigQuery

Focus: Week 3 focuses on efficient data management.

  • Learn to use Google’s BigQuery for effective data management.
  • Study data storage techniques like partitioning and clustering, and understand their differences.
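
To illustrate partitioning and clustering, here is a minimal sketch that creates a partitioned, clustered BigQuery table from Python by running a DDL statement. The dataset, table, and column names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    ddl = """
    CREATE OR REPLACE TABLE my_dataset.trips_partitioned
    PARTITION BY DATE(pickup_datetime)  -- one partition per day
    CLUSTER BY vendor_id                -- co-locate rows with the same vendor
    AS
    SELECT * FROM my_dataset.trips_raw
    """

    client.query(ddl).result()  # blocks until the query job finishes
    print("Created partitioned, clustered table")

The short version of the difference: partitioning lets queries skip whole partitions (you scan only the days you filter on), while clustering sorts data within each partition so filters on the clustered columns read less data.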

Week 4: Analytics engineering

Tech: dbt

Focus: Week 4 emphasizes transforming raw data into actionable insights.

  • Get introduced to dbt, a tool for data transformation.
  • Explore dbt models, their testing, documentation, and deployment options.
  • Conclude with a section on data visualization using Google Data Studio or Metabase.
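
dbt models themselves are written in SQL, but the day-to-day workflow is easy to picture. Below is a minimal sketch of driving that workflow from Python via the dbt command line; it assumes an initialized dbt project and profile, and the selector name "staging" is a placeholder:

    import subprocess

    # Build the selected models, run the tests, then generate the docs site
    for cmd in (
        ["dbt", "run", "--select", "staging"],
        ["dbt", "test"],
        ["dbt", "docs", "generate"],
    ):
        subprocess.run(cmd, check=True)

These three commands map directly onto the week’s topics: building models, testing them, and documenting them.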

Week 5: Batch processing

Tech: Spark

Focus: Week 5 explores the collection and processing of data blocks.

  • Dive into Spark SQL, DataFrames, and Spark internals.
  • Learn data manipulation techniques like joins.
  • Explore cloud-based options for running Spark more efficiently.
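
As a taste of the DataFrame API, here is a minimal PySpark sketch that joins a fact dataset to a lookup table and aggregates the result. The file paths and column names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-demo").getOrCreate()

    trips = spark.read.parquet("data/trips/")  # fact data
    zones = spark.read.csv("data/zones.csv", header=True, inferSchema=True)  # lookup table

    # Join each trip to its pickup zone, then count trips per zone
    result = (
        trips.join(zones, trips.pickup_zone_id == zones.zone_id, how="inner")
             .groupBy("zone_name")
             .count()
    )
    result.show(10)

DataFrame operations like the join and groupBy above are lazy: Spark only builds an execution plan until show() forces it to run, which is exactly the kind of behavior the week’s look at Spark internals explains.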

Week 6: Streaming

Tech: Kafka, Stream Processing

Focus: Week 6 transitions you from batch to stream processing, enabling real-time data handling.

  • Understand Kafka fundamentals, including advanced techniques like stream joins and windowing.
  • Cover Faust and PySpark for Python-based stream processing.
  • Learn how to use KSQL and ksqlDB for data manipulation.
  • Complete the week by setting up your environment with Docker.
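
To get a feel for Kafka from Python, here is a minimal sketch of a producer and a consumer using the kafka-python library (one option among several, and not necessarily the exact client used in the course). It assumes a broker is reachable at localhost:9092, for example from the Docker setup at the end of the week, and the topic name and message fields are placeholders:

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: serialize Python dicts to JSON and publish them to a topic
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("rides", {"ride_id": 1, "distance_km": 3.4})
    producer.flush()

    # Consumer: read the topic back from the beginning and print one message
    consumer = KafkaConsumer(
        "rides",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value)
        break  # stop after a single message for the demo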

Weeks 7, 8, 9: Project

Duration: 2 weeks for development, 1 week for peer review

Objective: The project focuses on applying your acquired skills to build a data engineering pipeline from scratch. Completing this hands-on project not only validates your skills but also enhances your portfolio, offering a competitive edge in job searches.

Peer Review: To complete the project, you are required to evaluate projects from at least three of your peers. Failure to do so will result in your project being marked as incomplete. For detailed peer review criteria, check this link.

Project Requirements:

  • Select a Dataset: Choose a dataset that intrigues you.
  • Data Lake Pipeline: Create a pipeline to process the selected dataset and store it in a data lake.
  • Data Warehouse Pipeline: Develop another pipeline to move the data from the lake to a data warehouse.
  • Data Transformation: Transform and prepare the data in the warehouse to make it dashboard-ready.
  • Dashboard Creation: Build a dashboard featuring at least two tiles that visually represent your data.

The course description on GitHub provides a detailed overview of the topics covered each week. You can see the video lectures, slides, code, and community notes for each week of the course to dive into the content. By the end of the course, you will have acquired the fundamental skills necessary for a career as a data engineer.

If you’re ready to join the next cohort of the course, submit this form to register and stay updated.

Course assignments and scoring

Homework and getting feedback

To reinforce your learning, you can submit a homework assignment at the end of each week. It’s reviewed and scored by course instructors. Your scores are added to an anonymous leaderboard, creating friendly competition among course members and motivating you to do your best.

Anonymous leaderboard with scored homework

For support, we have an FAQ section with quick answers to common questions. If you need more help, our Slack community is always available for technical questions, clarifications, or guidance. Additionally, we host live Q&A sessions called “office hours” where you can interact with instructors and get immediate answers to your questions.

A screenshot of a FAQ document

Learning in public approach

A unique feature is our “learning in public” approach, inspired by Shawn @swyx Wang’s article. We believe that everyone has something valuable to contribute, regardless of their expertise level.

An extract from Shawn @swyx Wang's article about learning in public

Throughout the course, we actively encourage and incentivize learning in public. By sharing your progress, insights, and projects online, you earn additional points for your homework and projects.

Anonymous leaderboard from the previous cohort of the course. On the right, you can see the bonus points for learning in public

This not only demonstrates your knowledge but also builds a portfolio of valuable content. Sharing your work online also helps you get noticed by social media algorithms, reaching a broader audience and creating opportunities to connect with individuals and organizations you may not have encountered otherwise.

DataTalks.Club community

DataTalks.Club has a supportive community of like-minded individuals in our Slack. It is the perfect place to enhance your skills, deepen your knowledge, and connect with peers who share your passion. These connections can lead to lasting friendships, potential collaborations in future projects, and exciting career prospects.

Course channel in our Slack community

Conclusion

The Data Engineering Zoomcamp offers a nine-week program covering the essentials of data engineering. As a student, you will explore the different aspects of data engineering and the technologies used for each of them. The curriculum covers everything from workflow orchestration to data warehousing and analytics engineering, emphasizing practical knowledge and its application in real-world scenarios.

We focus on hands-on learning through consistent homework assignments and a final project, preparing graduates to advance in their data engineering careers. The supportive DataTalks.Club community helps students stay motivated and finish the course.

Again, the next cohort starts on January 15, 2024! Take your chance to start 2024 by learning data engineering.

Register for the Data Engineering Zoomcamp.

Subscribe to our weekly newsletter and join our Slack.
We'll keep you informed about our events, articles, courses, and everything else happening in the Club.

