📨 🚀 AWS Crypto Data Pipeline: Transform Crypto Data with AWS Lambda + Polars - Part 2 of 3

From messy JSON to clean, queryable datasets, Polars makes it fast and easy.

In Part 1, we set up a pipeline that ingests raw data from a public API using AWS Lambda and drops it into an S3 "raw zone".

Now that your raw data has landed smoothly in S3, it’s time to clean it up and structure it for analysis. Want to keep learning? Subscribe below 🙂

In Part 2 of our AWS Data Engineering Pipeline series, we are diving straight into AWS Lambda + Polars. This combination allows us to transform crypto data seamlessly, all while staying in the AWS Free Tier!

🧠 Why Polars? Faster Than Pandas, and Adoption Is Growing

Polars is an efficient, multi-threaded DataFrame library that's gaining traction for data transformation work. Unlike Pandas, which is largely single-threaded and can slow down on large datasets, Polars parallelizes work across CPU cores and delivers noticeable performance improvements.

If you want to read about Polars' official benchmarks, see: Link
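
To get a feel for the API before we wire it into Lambda, here's a minimal sketch of a Polars transformation on a tiny made-up DataFrame; the column names and the FX rate are illustrative only and not part of the CoinGecko payload:

import polars as pl

# Tiny, made-up sample; real CoinGecko rows have many more fields.
df = pl.DataFrame({
    "coin": ["bitcoin", "ethereum", "dogecoin"],
    "current_price": [65000.0, 3200.0, None],
})

# The lazy API lets Polars build a query plan and execute it in parallel.
result = (
    df.lazy()
    .drop_nulls("current_price")  # drop rows with missing prices
    .with_columns((pl.col("current_price") * 0.93).alias("price_eur"))  # hypothetical FX rate
    .collect()
)
print(result)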

✍️ Step-by-Step: Transform Crypto Data with AWS Lambda + Polars

We’re transforming raw crypto data (from coingecko.com) stored in S3 using AWS Lambda + Polars. Polars provides fast data cleaning, and AWS Lambda gives us a serverless architecture, meaning no server management and pay-as-you-go pricing.

✅ Step 1: S3 Bucket Structure for Crypto Data

Organize your raw and clean data in S3 (the raw prefix was already created in Part 1; a quick way to check what has landed follows the list below):

s3://desidataduo-crypto-data/raw/coins/YYYY-MM-DD/
s3://desidataduo-crypto-data/clean/coins/YYYY-MM-DD/
  • Raw Data: JSON files from your crypto API (e.g., CoinGecko).

  • Clean Data: Parquet files after applying transformations (ideal for analytics).
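
Before transforming anything, it's easy to sanity-check that Part 1's files actually landed. Here's a minimal boto3 sketch, assuming the bucket name and date-partitioned layout above:

import boto3
from datetime import date

s3 = boto3.client("s3")
bucket = "desidataduo-crypto-data"
prefix = f"raw/coins/{date.today().isoformat()}/"

# List whatever Part 1's ingestion Lambda dropped under today's raw prefix.
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])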

✅ Step 2: Package Polars for Lambda

AWS Lambda doesn’t come with Polars pre-installed, so we’ll need to create a Lambda Layer to include it. Here’s how we will do that in AWS CloudShell (no dependency on which OS you use 🙂):

  1. Open AWS CloudShell in your AWS Console.

  2. Run the following commands one by one to install Polars and create a zip file for the Lambda layer:

# Step 1: Create a directory for the layer
mkdir polars_layer

# Step 2: Navigate to the new directory
cd polars_layer

# Step 3: Install Polars into the 'python' folder
pip install polars -t python/

# Step 4: Create a zip file for the Lambda layer
zip -r polars_layer.zip python/
  3. Now it's time to download the Polars zip file: click the Actions button in CloudShell and enter the path and file name → polars_layer/polars_layer.zip

    (Screenshots: the Actions button for download, and the zip file path, which might be different for you.)


  4. Upload polars_layer.zip to AWS Lambda as a custom layer (or publish it with boto3, as sketched after this list):

    • Go to Lambda > Layers > Create Layer.

    • Upload the zip file and choose compatible Python runtimes (e.g., Python 3.11).
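
If you'd rather stay in CloudShell instead of clicking through the console, the same layer can be published with boto3's publish_layer_version, sketched below. The layer name is a placeholder, and very large zips (roughly over 50 MB) have to be uploaded to S3 first and referenced via S3Bucket/S3Key in Content:

import boto3

lambda_client = boto3.client("lambda")

# Read the zip built in the previous step (path relative to your home directory).
with open("polars_layer/polars_layer.zip", "rb") as f:
    zip_bytes = f.read()

resp = lambda_client.publish_layer_version(
    LayerName="polars-layer",                 # placeholder name
    Description="Polars for the crypto transform Lambda",
    Content={"ZipFile": zip_bytes},
    CompatibleRuntimes=["python3.11"],
)
print(resp["LayerVersionArn"])

Either way, remember to attach the published layer to your Lambda function afterwards (Lambda > your function > Layers > Add a layer).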

✅ Step 3: Lambda Script to Transform (Clean) Crypto Data

We will use this script in our Lambda function to transform the crypto data:

import boto3
import polars as pl
import io
from datetime import date
from botocore.exceptions import ClientError

# AWS S3 setup
s3 = boto3.client('s3')
bucket = 'desidataduo-crypto-data'
today = date.today().isoformat()
input_key = f"raw/coins/{today}/top_100_coins.json"
output_key = f"clean/coins/{today}/top_100_coins.parquet"

def lambda_handler(event, context):
    try:
        # 1. Read raw crypto data from S3
        raw_obj = s3.get_object(Bucket=bucket, Key=input_key)
        raw_data = raw_obj['Body'].read()
        
        # 2. Load JSON into Polars DataFrame
        df = pl.read_json(raw_data)

        # 3. Transform the data
        df_clean = df.with_columns([
            pl.col("last_updated").str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.fZ").alias("last_updated_dt"),
            pl.col("ath_date").str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.fZ").alias("ath_date_dt"),
            pl.col("atl_date").str.strptime(pl.Datetime, format="%Y-%m-%dT%H:%M:%S%.fZ").alias("atl_date_dt")
        ]).drop_nulls(["current_price", "total_volume"])

        # 4. Write cleaned data to Parquet in-memory
        out_buffer = io.BytesIO()
        df_clean.write_parquet(out_buffer)
        out_buffer.seek(0)

        # 5. Upload Parquet to S3
        s3.put_object(Bucket=bucket, Key=output_key, Body=out_buffer.getvalue())

        return {
            "status": "success",
            "input_key": input_key,
            "output_key": output_key,
            "rows_cleaned": df_clean.shape[0]
        }

    except ClientError as e:
        if e.response['Error']['Code'] == 'NoSuchKey':
            return {
                "status": "error",
                "message": f"S3 key not found: {input_key}"
            }
        else:
            raise
    except Exception as e:
        return {
            "status": "error",
            "message": str(e)
        }

Below is a screenshot of the .parquet file created using the code above:
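
To spot-check the cleaned output yourself (from CloudShell or any environment with S3 access), a short sketch like the one below reads the Parquet file back into Polars; it reuses the bucket and key pattern from the script above:

import io
from datetime import date

import boto3
import polars as pl

s3 = boto3.client("s3")
bucket = "desidataduo-crypto-data"
key = f"clean/coins/{date.today().isoformat()}/top_100_coins.parquet"

# Pull the Parquet file back down and inspect a few of the cleaned columns.
obj = s3.get_object(Bucket=bucket, Key=key)
df = pl.read_parquet(io.BytesIO(obj["Body"].read()))
print(df.shape)
print(df.select(["current_price", "total_volume", "last_updated_dt"]).head())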

✅ Step 4 (Optional): Automate Data Processing with EventBridge

To automate this Lambda function, use Amazon EventBridge, similar to Part 1:

  1. Create an EventBridge rule to trigger the Lambda function on a schedule (e.g., every day at midnight) or when new raw data arrives in the S3 bucket (a boto3 sketch of the scheduled option follows this list).

  2. This will automate the process of cleaning and transforming crypto data without manual intervention.
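
For reference, here's a boto3 sketch of the scheduled option; the rule name and function name below are placeholders (use whatever you named your transform function), and the same setup works just as well through the console:

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

rule_name = "crypto-clean-daily"           # placeholder rule name
function_name = "crypto-transform-polars"  # placeholder: your transform Lambda

# Daily at midnight UTC (EventBridge cron expressions are always in UTC).
rule = events.put_rule(
    Name=rule_name,
    ScheduleExpression="cron(0 0 * * ? *)",
    State="ENABLED",
)

fn_arn = lambda_client.get_function(FunctionName=function_name)["Configuration"]["FunctionArn"]

# Allow EventBridge to invoke the function, then attach it as the rule target.
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId=f"{rule_name}-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": fn_arn}])

If you rerun this, add_permission will complain that the statement already exists; that's safe to ignore or catch.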

🧠 What You’ve Achieved

  • Cleaned and transformed raw crypto data into clean, ready-to-analyze Parquet format.

  • Leveraged Polars for fast, efficient data cleaning, a skill worth picking up as adoption grows.

  • Created a Lambda- and Polars-based serverless data pipeline, minimizing infrastructure overhead.

  • Stayed within the AWS Free Tier without needing AWS Glue or other costly services.

  • Interested in learning more about Polars? Here is their official website.

🔥 Coming Up Next: Explore, Query & Visualize

In Part 3, we’ll show you how to turn that clean Parquet crypto data into insights using Amazon Athena, with no infrastructure setup needed.

You’ll learn:

  • How to set up SQL access to your S3 data

  • How to explore it interactively using Athena or a similar service

  • And how to build dashboards or reports on top of it

Stay tuned — the final piece of the pipeline is where the real magic happens ✨

💬 Let’s Talk

What are your thoughts on using Polars for data transformation? Is it faster than Pandas in your use case? Drop a comment or reach out to me on LinkedIn.