From Google Sheet to S3: My First Data Pipeline

Automating the messy work so I can focus on the fun stuff.

By Richy Arthurs | September 18, 2025

Like a lot of side projects, this one started with a problem: a disorganized Google Sheet. It was full of manual data entry, inconsistencies, and a growing sense of dread whenever I had to open it. I knew I couldn't build a reliable dashboard or get meaningful insights from it. That's when I decided it was time to build my first real data pipeline to automate the entire process.

The Spark: A Messy Spreadsheet

My data was simple—a list of daily sales records. But as more people entered data, it became a mess. Sometimes a date was a string, sometimes a number. The product names had typos. Trying to pull a clean report was a nightmare. I needed a way to automatically grab the latest data, clean it, and put it somewhere reliable. My goal was a system that could run on its own, without me having to touch a single thing.

The Blueprint: My Architecture

I planned a simple but powerful serverless pipeline. I wanted each step to be a self-contained unit, triggered by the completion of the previous one. Here's a quick overview of what I built:

Google Sheets (The Source) → AWS Lambda (The Trigger) → Amazon S3 (The Data Lake)

The Journey: Key Takeaways

1. The Power of a Trigger

The most satisfying part was setting up a Google Apps Script trigger. Every time new data was added to the sheet, a signal was sent to my AWS Lambda function. No more manual button clicks or scheduled jobs—the pipeline just started on its own. It felt like true automation.
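Here's a minimal sketch of the receiving side, assuming the Apps Script trigger POSTs the new rows as JSON to the Lambda (the bucket name and key layout below are placeholders, not my exact setup):

    import json
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    RAW_BUCKET = "my-sales-data-lake"  # placeholder bucket name

    def lambda_handler(event, context):
        # The Apps Script trigger sends the newly added rows as a JSON array in the request body.
        rows = json.loads(event["body"])

        # Key each raw dump by arrival time so nothing ever gets overwritten.
        now = datetime.now(timezone.utc)
        key = f"raw/sales/{now:%Y/%m/%d}/{now:%H%M%S}.json"

        s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(rows))
        return {"statusCode": 200, "body": json.dumps({"stored": key, "rows": len(rows)})}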

2. The Data Lake as a Safety Net

Instead of loading the data directly into a final database, I first dropped it into an S3 bucket. This "data lake" approach was a game-changer. It meant I could store the raw, un-transformed data, giving me a safety net in case my transformation logic was flawed. It also keeps the pipeline flexible (if I change the cleaning logic later, I can simply re-run it over the raw files) and S3 scales without me having to think about storage at all.
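In practice that safety net is easy to cash in. Whenever I want to test new cleaning logic, I can replay everything sitting under the raw/ prefix; here's a rough sketch, using the same placeholder bucket name as above:

    import json

    import boto3

    s3 = boto3.client("s3")
    RAW_BUCKET = "my-sales-data-lake"  # placeholder, same bucket the Lambda writes to

    def replay_raw_data(transform):
        # Re-run a transformation function over every raw file in the data lake.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=RAW_BUCKET, Prefix="raw/sales/"):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=RAW_BUCKET, Key=obj["Key"])["Body"].read()
                transform(json.loads(body))

    # Example: count how many raw rows have been captured so far.
    replay_raw_data(lambda rows: print(len(rows), "rows"))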

3. The Sweet Taste of Clean Data

The transformation step with AWS Glue was equal parts frustration and satisfaction. Writing the script to handle different date formats and clean up typos was tedious, but seeing the final, clean data land in Redshift made it worth it. Now I have a single source of truth that I can query with confidence.
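The real job runs in Glue, but the heart of the cleanup boils down to something like this, sketched here in plain pandas with made-up column names and typo fixes so it's easy to follow:

    import pandas as pd

    # Hypothetical mapping of known product-name typos to canonical spellings.
    PRODUCT_FIXES = {"widgit": "widget", "gadet": "gadget"}

    def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
        # Dates arrive as strings in several formats; coerce what we can and
        # turn anything unparseable into NaT instead of failing the whole load.
        df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")

        # Normalize product names: trim whitespace, lowercase, then fix known typos.
        df["product"] = (
            df["product"].astype(str).str.strip().str.lower().replace(PRODUCT_FIXES)
        )

        # Amounts sometimes show up as strings like "1,200"; strip separators first.
        df["amount"] = pd.to_numeric(
            df["amount"].astype(str).str.replace(",", "", regex=False), errors="coerce"
        )
        return df

The actual Glue job does the same kind of normalization in PySpark and then handles the load into Redshift.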

What's Next?

This project taught me so much about data pipelines, serverless architecture, and the importance of good data hygiene. It might be a small project, but it's a solid foundation for future work.

What do you think of my first pipeline? I'd love to hear your thoughts or suggestions on what I should build next!
