Taxi Data FAQ

This guide covers common issues and solutions when working with NYC Taxi data in the Data Engineering Zoomcamp, including download, format handling, and data processing.

Table of contents

  1. Data Download Issues
    1. 403 Forbidden error from TLC website
  2. Data Format Handling
    1. Handling CSV.GZ compressed files
    2. Working with Parquet format data
  3. Tools & Dependencies
    1. wget command not recognized
    2. Certificate verification error on MacOS
  4. Data Reference & Documentation
    1. NYC Taxi data dictionaries

Data Download Issues

403 Forbidden error from TLC website

Problem: Getting a 403 Forbidden error when trying to download 2021 data from the official TLC website.

Root Cause: The official TLC website may have access restrictions or temporary issues.

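Solution: Use the course's backup data source, the DataTalksClub/nyc-tlc-data repository on GitHub, which hosts copies of the taxi data as release assets.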
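The backup files follow a regular naming scheme, so a download URL can be built rather than typed by hand. The sketch below assumes that the pattern of the yellow-taxi URL quoted in the CSV.GZ section of this guide also holds for other taxi colors and months; `backup_url` is a hypothetical helper name, not part of the course code.

```python
# Release-asset pattern assumed from the yellow-taxi URL quoted in this guide.
BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"


def backup_url(color: str, year: int, month: int) -> str:
    """Build the backup download URL for one month of taxi data (hypothetical helper)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    # Months are zero-padded in the filenames, e.g. 2021-01.
    filename = f"{color}_tripdata_{year}-{month:02d}.csv.gz"
    return f"{BASE_URL}/{color}/{filename}"


print(backup_url("yellow", 2021, 1))
```

If a generated URL returns 404, the file for that color or month may not be published in the backup repository.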


Data Format Handling

Handling CSV.GZ compressed files

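Problem: Taxi data files are distributed in *.csv.gz (gzip-compressed CSV) format, and ingestion scripts written for plain .csv files can fail on them.

Root Cause: A script that saves the download under a hard-coded .csv name drops the .gz extension, so pandas no longer recognizes the file as compressed.

Solution: Derive the filename from the URL so the original extension is preserved: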

Original approach (problematic):

csv_name = "output.csv"  # Hard-coded name drops .gz, so pandas cannot infer the compression

Improved approach:

# Parse filename from URL
csv_name = url.split("/")[-1]  # Gets "yellow_tripdata_2021-01.csv.gz"

Why this works:

  • The last path segment of the URL is the filename: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
  • pandas read_csv() infers gzip compression from the .gz extension and reads the file directly
  • No manual decompression is needed in the script
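As a slightly more defensive variant of the split("/") approach, the sketch below parses the URL first so that a query string, if one is ever appended, does not end up in the filename. `filename_from_url` is a hypothetical helper name introduced for illustration.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


url = (
    "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"
    "yellow/yellow_tripdata_2021-01.csv.gz"
)
csv_name = filename_from_url(url)
print(csv_name)  # yellow_tripdata_2021-01.csv.gz

# With the .gz extension preserved, pd.read_csv(csv_name) can infer
# gzip compression on its own; no explicit gunzip step is needed.
```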

Working with Parquet format data

Problem: Need to process taxi data available in Parquet format.

Solutions:

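Option 1: Use the gzipped CSV version of the file instead

# Decompress the gzipped CSV file (gunzip handles .gz archives, not Parquet files)
gunzip green_tripdata_2019-09.csv.gz
# Then read the resulting CSV with pandas as usual
pd.read_csv('green_tripdata_2019-09.csv')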

Option 2: Handle Parquet directly in the script

Modify your ingest_data.py script:

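import os
import pandas as pd

def main(params):
    # ... existing code ...
    # url and csv_name are assumed to be defined in the existing code above

    parquet_name = 'output.parquet'

    # Download the Parquet file
    os.system(f"wget {url} -O {parquet_name}")

    # Convert Parquet to CSV (pd.read_parquet needs pyarrow or fastparquet installed)
    df = pd.read_parquet(parquet_name)
    df.to_csv(csv_name, index=False)

    # Continue with normal processing...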
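If the intermediate CSV is not needed, one alternative is to load either format straight into a DataFrame and choose the reader from the file extension. This is a different technique from the convert-to-CSV step above, shown only as a minimal sketch; `detect_format` and `load_dataframe` are hypothetical helper names, and reading Parquet assumes pyarrow or fastparquet is installed.

```python
def detect_format(filename: str) -> str:
    """Classify a data file as 'parquet' or 'csv' from its extension."""
    name = filename.lower()
    if name.endswith(".parquet"):
        return "parquet"
    if name.endswith((".csv", ".csv.gz")):
        return "csv"
    raise ValueError(f"Unsupported file type: {filename}")


def load_dataframe(path: str):
    """Load a taxi data file into a pandas DataFrame."""
    # Imported lazily so detect_format can be used without pandas installed.
    import pandas as pd

    if detect_format(path) == "parquet":
        return pd.read_parquet(path)  # needs pyarrow or fastparquet
    return pd.read_csv(path)  # gzip is inferred from a .gz extension


print(detect_format("output.parquet"))
print(detect_format("green_tripdata_2019-09.csv.gz"))
```

Inside main(), `df = load_dataframe(parquet_name)` would then replace the read_parquet and to_csv pair, provided the rest of the script works on a DataFrame rather than on a CSV file.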

Tools & Dependencies

wget command not recognized

Problem: “wget is not recognized as an internal or external command” error.

Root Cause: wget is not installed or not in your system PATH.

Solutions by Operating System:

Ubuntu/Debian:

sudo apt-get install wget

MacOS:

brew install wget

Windows (Multiple Options):

Option 1: Using Chocolatey

choco install wget

Option 2: Manual Installation

  1. Download the wget binary from GnuWin32
  2. Add the folder containing wget.exe to your system PATH

Option 3: EternallyBored.org

  1. Download the latest wget binary from eternallybored.org
  2. Extract the archive; if the built-in Windows extractor fails, use 7-Zip
  3. Rename wget64.exe to wget.exe if necessary
  4. Move wget.exe to the Git\mingw64\bin\ directory

Alternative Solutions:

Use Python wget library:

pip install wget
python -m wget <url>

Use Python requests library:

pip install requests
# Then use requests in your Python script
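For the requests option, a minimal sketch of a download helper is shown below. It streams the response to disk in chunks so that a large file is not held in memory. `download_file` is a hypothetical name, and the 60-second timeout and 1 MiB chunk size are arbitrary assumptions.

```python
import requests


def download_file(url: str, dest: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream a URL to a local file and return the local path."""
    with requests.get(url, stream=True, timeout=60) as response:
        # Fail loudly on errors such as 403 or 404 instead of saving an error page.
        response.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
    return dest
```

A call such as `download_file(url, url.split("/")[-1])` would save the file under its original name, which keeps the .csv.gz extension intact.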

Manual download:

  • Download files directly through your web browser

Certificate verification error on MacOS

Problem: “ERROR: cannot verify website certificate” when using wget on MacOS.

Solutions:

Option 1: Use Python wget

python -m wget <url>

Option 2: Skip certificate check

wget <website_url> --no-check-certificate

For Jupyter Notebooks:

!wget <website_url> --no-check-certificate

Data Reference & Documentation

NYC Taxi data dictionaries

Problem: Need to understand the structure and meaning of taxi data columns.

Solution: The NYC TLC publishes an official data dictionary (PDF) for each dataset on its Trip Record Data page:

  • Yellow Taxi Trips
  • Green Taxi Trips

These documents contain:

  • Column definitions and descriptions
  • Data types and formats
  • Valid value ranges
  • Historical changes to the schema