Taxi Data FAQ
This guide covers common issues and solutions when working with NYC Taxi data in the Data Engineering Zoomcamp, including download, format handling, and data processing.
Data Download Issues
403 Forbidden error from TLC website
Problem: Getting a 403 Forbidden error when trying to download 2021 data from the official TLC website.
Root Cause: The official TLC website may have access restrictions or temporary issues.
Solution: Use our backup data source:
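- Backup URL: Yellow Taxi Data 2021-01 (https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz)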
- Important: The file is gzip-compressed (.gz), so decompress it with a gzip-aware tool
- Note: The standard unzip command won't work for .gz files; use gunzip instead
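Where gunzip is not available (for example in a plain Windows shell), the same decompression can be done with Python's standard library. This is a minimal sketch; the helper name gunzip_file is illustrative and not part of the course code:

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def gunzip_file(src: str, dst: Optional[str] = None) -> Path:
    """Decompress a .gz file and keep the original (hypothetical helper)."""
    src_path = Path(src)
    if dst is None and src_path.suffix != ".gz":
        raise ValueError(f"Expected a .gz file, got: {src}")
    # Default output name: drop the trailing ".gz" (foo.csv.gz -> foo.csv)
    dst_path = Path(dst) if dst else src_path.with_suffix("")
    with gzip.open(src_path, "rb") as f_in, open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # copies in chunks, so memory use stays low
    return dst_path
```

For example, gunzip_file("yellow_tripdata_2021-01.csv.gz") would write yellow_tripdata_2021-01.csv next to the compressed file.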
Data Format Handling
Handling CSV.GZ compressed files
Problem: Taxi data files are distributed in *.csv.gz format and need proper handling.
Root Cause: Compressed files require special handling in data processing scripts.
Solution: Use dynamic filename parsing from URL:
Original approach (problematic):
csv_name = "output.csv" # This won't work correctly with .csv.gz files
Improved approach:
# Parse filename from URL
csv_name = url.split("/")[-1] # Gets "yellow_tripdata_2021-01.csv.gz"
Why this works:
- The URL structure:
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
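- The pandas read_csv() function can read .csv.gz files directly. It infers gzip compression from the .gz extension, which a hardcoded name such as output.csv would hide
- No manual decompression is needed in the script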
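Splitting on "/" is enough for the URL above. If a download link ever carries a query string or fragment, a slightly more defensive variant can be sketched with the standard library. The function name filename_from_url is illustrative, not part of the course script:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring ?query and #fragment."""
    return PurePosixPath(urlparse(url).path).name


url = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"
csv_name = filename_from_url(url)
print(csv_name)  # yellow_tripdata_2021-01.csv.gz
```

Because csv_name keeps its .gz suffix, passing it to pd.read_csv(csv_name) lets pandas detect the compression on its own.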
Working with Parquet format data
Problem: Need to process taxi data available in Parquet format.
Solutions:
Option 1: Use the gzipped CSV version of the file instead
# Decompress the gzipped CSV in the shell (this is a .csv.gz file, not Parquet)
gunzip green_tripdata_2019-09.csv.gz
# Then read it with pandas as usual (Python)
pd.read_csv('green_tripdata_2019-09.csv')
Option 2: Handle Parquet directly in the script
Modify your ingest_data.py script:
def main(params):
    # ... existing code (defines url and csv_name; imports os and pandas as pd) ...
    parquet_name = 'output.parquet'
    # Download parquet file
    os.system(f"wget {url} -O {parquet_name}")
    # Convert parquet to CSV (read_parquet needs pyarrow or fastparquet installed)
    df = pd.read_parquet(parquet_name)
    df.to_csv(csv_name, index=False)
    # Continue with normal processing...
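If the script has to accept both formats, one possible approach is to choose the pandas reader from the file extension rather than converting everything to CSV. The sketch below assumes pandas is installed; detect_format and load_frame are hypothetical names and do not appear in ingest_data.py:

```python
def detect_format(filename: str) -> str:
    """Classify a taxi data file by its name: 'parquet' or 'csv'."""
    name = filename.lower()
    if name.endswith(".parquet"):
        return "parquet"
    if name.endswith((".csv", ".csv.gz")):
        return "csv"
    raise ValueError(f"Unsupported file type: {filename}")


def load_frame(path: str):
    """Load a file into a DataFrame with the reader that matches its extension."""
    import pandas as pd  # imported here so detect_format works without pandas

    if detect_format(path) == "parquet":
        return pd.read_parquet(path)  # needs pyarrow or fastparquet installed
    return pd.read_csv(path)  # gzip is inferred from a .gz suffix
```

With this in place, load_frame("output.parquet") and load_frame("green_tripdata_2019-09.csv.gz") would both return a DataFrame, and the intermediate CSV conversion becomes optional.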
Tools & Dependencies
wget command not recognized
Problem: “wget is not recognized as an internal or external command” error.
Root Cause: wget is not installed or not in your system PATH.
Solutions by Operating System:
Ubuntu/Debian:
sudo apt-get install wget
MacOS:
brew install wget
Windows (Multiple Options):
Option 1: Using Chocolatey
choco install wget
Option 2: Manual Installation
- Download binary from GnuWin32
- Add to your system PATH
Option 3: EternallyBored.org
- Download the latest wget binary from eternallybored.org
- Extract using 7-zip if Windows utility fails
- Rename wget64.exe to wget.exe if necessary
- Move it to the Git\mingw64\bin\ directory
Alternative Solutions:
Use Python wget library:
pip install wget
python -m wget <url>
Use Python requests library:
pip install requests
# Then use requests in your Python script
Manual download:
- Download files directly through your web browser
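If installing anything is inconvenient, Python's standard library can also fetch the file. Note that this sketch swaps in urllib.request rather than wget or requests, and the download helper is an illustrative name:

```python
import shutil
import urllib.request


def download(url: str, dest: str) -> str:
    """Stream `url` into the local file `dest` and return `dest`."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all at once
    return dest
```

A call such as download(url, "yellow_tripdata_2021-01.csv.gz") would then stand in for the wget command in the ingestion script.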
Certificate verification error on MacOS
Problem: “ERROR: cannot verify website certificate” when using wget on MacOS.
Solutions:
Option 1: Use Python wget
python -m wget <url>
Option 2: Skip certificate check
wget <website_url> --no-check-certificate
For Jupyter Notebooks:
!wget <website_url> --no-check-certificate
Data Reference & Documentation
NYC Taxi data dictionaries
Problem: Need to understand the structure and meaning of taxi data columns.
Solution: Official data dictionaries are available:
Yellow Taxi Trips:
Green Taxi Trips:
These documents contain:
- Column definitions and descriptions
- Data types and formats
- Valid value ranges
- Historical changes to the schema