Transform Dates and Ensure Data Quality With Databricks & PySpark

Introduction

Messy date formats can wreak havoc on your data pipelines, causing errors, failed joins, and unreliable insights. Whether you’re dealing with inconsistent formats like YYYY-MM-DD, MM/DD/YYYY, or text-based dates like 'Oct 25th', cleaning and standardizing dates is crucial for ensuring data quality.

In this blog post, we’ll explore how to handle date formatting challenges using Databricks and PySpark. From parsing text-based dates to converting Excel serial numbers, you’ll learn practical techniques to clean your data and build reliable pipelines.

Why Fixing Date Formats Matters in Databricks

In distributed data environments like Databricks, your data often comes from multiple sources, each with its quirks. Here’s what you might encounter:

  • One source uses the ISO 8601 format: YYYY-MM-DD.

  • Another source sticks to U.S. formatting: MM/DD/YYYY.

  • And then there’s the wildcard—text-based dates like 'Oct 25th' or integers like 45000 representing Excel serial dates.

If these formats aren’t standardized, you’ll run into issues with joins, aggregations, and downstream reporting. Fixing date formats up front ensures your pipelines run smoothly and your insights are accurate.

How to Handle Date Formatting in PySpark

1. Parsing Standard Date Formats

To convert standard formats like MM/DD/YYYY into a consistent format (YYYY-MM-DD), PySpark provides the to_date function. Here’s how:

from pyspark.sql.functions import to_date

# Sample data
data = [('12/26/2024',)]
df = spark.createDataFrame(data, ['raw_date'])

# Convert MM/DD/YYYY to YYYY-MM-DD
df = df.withColumn('formatted_date', to_date('raw_date', 'MM/dd/yyyy'))
df.show()

Databricks example output: converting U.S.-formatted dates to ISO-standard dates.
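In real pipelines, a single column often mixes several of these formats. One common pattern is to try each expected format and keep the first successful parse with coalesce. The sketch below uses assumed sample formats; note that, depending on your Spark version and whether ANSI mode is enabled, a string matching none of the patterns either comes back as null or raises an error (on Spark 3.5+ you can use try_to_date to always get null).

from pyspark.sql.functions import coalesce, to_date

# Hypothetical column mixing U.S. and ISO formats
data = [('12/26/2024',), ('2024-12-26',)]
df_mixed = spark.createDataFrame(data, ['raw_date'])

# Try each expected format; coalesce keeps the first non-null parse
df_mixed = df_mixed.withColumn(
    'formatted_date',
    coalesce(
        to_date('raw_date', 'MM/dd/yyyy'),  # U.S. style
        to_date('raw_date', 'yyyy-MM-dd')   # ISO 8601
    )
)
df_mixed.show()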


2. Formatting Dates for Reporting

Need to display dates in a specific format? Use PySpark’s date_format function. For instance, converting a date to 'December 26, 2024':


from pyspark.sql.functions import date_format

# Format the date for display
df = df.withColumn('display_date', date_format('formatted_date', 'MMMM dd, yyyy'))
df.show()
display(df)  # display() renders an interactive table in Databricks notebooks

Databricks example output: formatting dates for display with date_format.

3. Handling Unix Epoch Time

Unix Epoch time stores dates as the number of seconds since January 1, 1970 (UTC). PySpark’s from_unixtime function converts these values into readable date/time strings.

from pyspark.sql.functions import from_unixtime

# Sample data with Epoch timestamps
data = [(1706303400,)]
df = spark.createDataFrame(data, ['epoch_time'])

# Convert epoch seconds to a readable timestamp string (yyyy-MM-dd HH:mm:ss)
df = df.withColumn('readable_date', from_unixtime('epoch_time'))
df.show()
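Keep in mind that from_unixtime returns a formatted string, not a timestamp type. If you need a true timestamp or date column, here’s a minimal sketch assuming Spark 3.1+ (for timestamp_seconds); the epoch_ms column in the commented line is hypothetical, for sources that deliver milliseconds instead of seconds.

from pyspark.sql.functions import timestamp_seconds, to_date, col

# timestamp_seconds (Spark 3.1+) produces a proper TimestampType column
df = df.withColumn('event_ts', timestamp_seconds('epoch_time')) \
       .withColumn('event_date', to_date('event_ts'))

# For epoch milliseconds, divide by 1000 first (hypothetical epoch_ms column)
# df = df.withColumn('event_ts', timestamp_seconds(col('epoch_ms') / 1000))
df.show()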

4. Cleaning Partial Dates Like 'Oct 25th'

Text-based dates like 'Oct 25th' can be tricky to handle. PySpark allows you to clean and parse these using functions like regexp_replace and concat.

from pyspark.sql.functions import regexp_replace, concat, lit, to_date

# Sample data
data = [('Oct 25th',)]
df = spark.createDataFrame(data, ['raw_date'])

# Strip the 'th' suffix, append an assumed year, then parse
df = df.withColumn('clean_date', regexp_replace('raw_date', 'th', '')) \
       .withColumn('complete_date', concat('clean_date', lit(' 2024'))) \
       .withColumn('parsed_date', to_date('complete_date', 'MMM dd yyyy'))
df.show()

Databricks example output: converting partial text dates to an ISO-standard date.
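Real feeds rarely stick to a single suffix, so you’ll usually want to strip any of st, nd, rd, or th. Here’s a slightly more defensive sketch; the hard-coded year is still an assumption you should document (see the best practices below).

from pyspark.sql.functions import regexp_replace, concat, lit, to_date

# Sample data with mixed ordinal suffixes
data = [('Oct 25th',), ('Dec 3rd',), ('Jan 1st',)]
df_partial = spark.createDataFrame(data, ['raw_date'])

# Remove any ordinal suffix that follows the day number, append an assumed year, then parse
df_partial = df_partial \
    .withColumn('clean_date', regexp_replace('raw_date', r'(\d+)(st|nd|rd|th)', '$1')) \
    .withColumn('complete_date', concat('clean_date', lit(' 2024'))) \
    .withColumn('parsed_date', to_date('complete_date', 'MMM d yyyy'))
df_partial.show()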

5. Converting Excel Serial Dates

Excel stores dates as serial numbers, where 1 represents January 1, 1900. Subtracting 2 in the conversion below accounts for the 1-based numbering and for Excel incorrectly treating 1900 as a leap year. Here’s how to convert these serials to readable dates:

from pyspark.sql.functions import expr

# Sample data
data = [(45000,)]  # Represents March 15, 2023
df = spark.createDataFrame(data, ['serial_date'])

# Convert Excel serial dates to readable dates
df = df.withColumn('excel_date', expr("date_add('1900-01-01', CAST(serial_date - 2 AS INT))"))
df.show()

Databricks example output: converting Excel serial dates to ISO-standard dates.

Best Practices for Handling Dates in Databricks

  1. Standardize Early: Convert all dates into a consistent format (like YYYY-MM-DD) at the start of your pipeline.

  2. Validate Inputs: Catch null or invalid dates before processing to avoid errors downstream (see the validation sketch after this list).

  3. Document Your Assumptions: Be explicit about how you handle partial or messy dates.

  4. Test for Edge Cases: Include leap years, ambiguous formats, and outliers in your tests.
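For point 2, here’s a minimal validation sketch: after parsing, rows that arrived non-null but ended up null either had no real date or used a format you didn’t anticipate, so flag or quarantine them before they break joins downstream. The column and table names here are assumptions for illustration.

from pyspark.sql.functions import col

# Rows that had a value but failed to parse are suspect
invalid_dates = df.filter(
    col('raw_date').isNotNull() & col('formatted_date').isNull()
)

invalid_count = invalid_dates.count()
if invalid_count > 0:
    print(f"Found {invalid_count} rows with unparseable dates")
    # Optionally route them to a quarantine table (hypothetical name)
    # invalid_dates.write.mode('append').saveAsTable('quarantine.bad_dates')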

Conclusion

Whether you’re cleaning Unix Epoch time, Excel serial dates, or inconsistent formats like 'Oct 25th', PySpark in Databricks offers powerful tools to simplify the process. By mastering these techniques, you’ll ensure high-quality data pipelines that are scalable, reliable, and easy to maintain.

💡 What’s the trickiest date format you’ve dealt with? Share it in the comments on YouTube; I’d love to hear your examples!

If this guide helped, share it with your team and check out my YouTube channel for more practical PySpark tutorials!

