The spreadsheet arrived at 4:30 PM on a Friday. "Can we import these 15,000 customer records from the legacy CRM this weekend?"

The legacy system had dumped a JSON export that looked like it was assembled by a committee of confused robots. Field names were inconsistent (customer_name, custName, CUSTOMER.NAME), dates were in three different formats, and some records had nested objects while others had flat string representations of the same data.

I needed to validate it, normalize it, extract the relevant fields, transform the values, and compress it for the import pipeline. Five distinct steps, one messy JSON file, and a Monday morning deadline.

That's when I stopped thinking of JSON manipulation as "just formatting" and started treating it as a data transformation pipeline.

The Five-Stage Transformation Pipeline

Every JSON data transformation task follows the same pattern. I've used it for CRM imports, ETL jobs, API response normalization, and configuration file migration.

Raw JSON → [1. Validate] → [2. Format] → [3. Extract] → [4. Transform] → [5. Compress] → Output

Let me walk through each stage with the nightmare CRM data I had to fix.

Stage 1: Validate — Check Your Input

Before you touch a single value, validate the structure. Bad JSON in = garbage out.

The legacy export had three types of errors scattered across 15,000 records:

// Error 1: Trailing comma in an array
{"records": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"},]}

// Error 2: Unquoted key
{"customer_name": "Acme Corp", custLevel: "gold"}

// Error 3: Duplicate key (last one wins — not the same as the first)
{"id": 1, "name": "Alice", "name": "Alicia"}

A validator catches these upfront. Without validation, you'll get mysterious parse failures in the middle of your pipeline. The JSON Formatter has a validation step that highlights errors with line numbers, so you can fix the source before proceeding.

Stage 2: Format — Normalize the Structure

Once it's valid, format it consistently. This step is about making the JSON readable so you can spot structural patterns.

// After formatting, I noticed the inconsistency:
[
  {
    "customer_name": "Acme Corp",
    "contact": {
      "email": "billing@acme.com",
      "phone": "555-0100"
    },
    "orders": [
      {"order_id": "ORD-001", "total": 499.99, "date": "2024-01-15"}
    ]
  },
  // ... but some records had a different structure:
  {
    "customer_name": "Beta Inc",
    "contact_email": "info@beta.com",  // ← flat, not nested
    "contact_phone": "555-0200",
    "order_ids": "ORD-002,ORD-003",     // ← comma-separated string
    "order_totals": "299.99,149.50",
    "order_dates": "2024-02-01,2024-02-15"
  }
]

Formatting revealed that about 30% of records used a flat structure while the rest used nested objects. I couldn't just write one mapper—I needed a normalization step.

Stage 3: Extract — Pull Out What You Need

Extraction is about selecting the fields you actually need and filtering out noise. For the CRM import, I needed a uniform structure:

import json

with open("crm_export.json") as f:
    data = json.load(f)

normalized = []
for record in data:
    entry = {"name": record.get("customer_name") or record.get("custName") or record.get("CUSTOMER.NAME")}
    
    # Handle flat vs nested contact info
    contact = record.get("contact", {})
    entry["email"] = contact.get("email") if isinstance(contact, dict) else record.get("contact_email")
    entry["phone"] = contact.get("phone") if isinstance(contact, dict) else record.get("contact_phone")
    
    # Normalize orders from array or CSV string
    orders = record.get("orders", record.get("order_ids", []))
    if isinstance(orders, str):
        order_ids = orders.split(",")
        totals = record.get("order_totals", "").split(",")
        dates = record.get("order_dates", "").split(",")
        entry["order_count"] = len(order_ids)
        entry["total_spent"] = sum(float(t) for t in totals if t)
    elif isinstance(orders, list):
        entry["order_count"] = len(orders)
        entry["total_spent"] = sum(o.get("total", 0) for o in orders)
    
    normalized.append(entry)

This stage is where you make the hard decisions about what to keep and what to drop.

Stage 4: Transform — Convert Values to the Right Format

Extraction gives you the right fields. Transformation gives you the right values. This is where data types get fixed, formats get standardized, and business logic gets applied.

from datetime import datetime

def transform_record(entry: dict) -> dict:
    """Apply business transformations to a normalized record."""
    # Standardize name capitalization
    entry["name"] = entry["name"].strip().title() if entry["name"] else "Unknown"
    
    # Convert phone to E.164 format
    if entry.get("phone"):
        cleaned = re.sub(r'[\s\-\(\)\.]', '', entry["phone"])
        if cleaned.startswith("1") and len(cleaned) == 11:
            entry["phone"] = f"+{cleaned}"
        elif len(cleaned) == 10:
            entry["phone"] = f"+1{cleaned}"
    
    # Calculate customer tier from total spent
    total = entry.get("total_spent", 0)
    if total > 10000:
        entry["tier"] = "platinum"
    elif total > 5000:
        entry["tier"] = "gold"
    elif total > 1000:
        entry["tier"] = "silver"
    else:
        entry["tier"] = "bronze"
    
    return entry

Pro tip: always keep a "raw" copy of each field alongside the transformed one during debugging. I use a _raw prefix:

entry["phone_raw"] = entry.get("phone")
entry["phone"] = transform_phone(entry["phone"])

This saved me when a regex change broke 500 phone numbers. I just re-ran the transformation from the raw fields.

Stage 5: Compress — Prepare for Transport

The final stage shrinks the payload for the target system. Our import API had a 5MB request limit, and the normalized data was 8.2MB.

# Stage 5a: Strip null/default fields
clean = {k: v for k, v in entry.items() if v is not None and v != "" and v != 0}

# Stage 5b: Minify before sending
import json
payload = json.dumps(all_records, separators=(',', ':'))
# 8.2 MB → 2.9 MB

# Stage 5c: If still too big, batch
BATCH_SIZE = 500
batches = [all_records[i:i+BATCH_SIZE] for i in range(0, len(all_records), BATCH_SIZE)]
for batch in batches:
    send_to_api(json.dumps(batch, separators=(',', ':')))

If your target format isn't JSON (say you need CSV for a database import or XML for a legacy system), do the conversion here too.

Putting It All Together

Here's the complete pipeline script I used for that CRM import:

import json, re
from datetime import datetime
from typing import List, Dict

def transform_pipeline(raw_path: str, output_path: str) -> int:
    """Run the full validation-transformation-compression pipeline."""
    
    # Stage 1 & 2: Load, validate, format
    with open(raw_path) as f:
        raw = f.read()
    
    try:
        data = json.loads(raw)  # Validation
    except json.JSONDecodeError as e:
        print(f"VALIDATION FAILED at line {e.lineno}: {e.msg}")
        return 1
    
    # Handle nested data
    records = data if isinstance(data, list) else data.get("records", data.get("data", []))
    
    # Stage 3: Extract + Stage 4: Transform
    transformed = [transform_record(extract_record(r)) for r in records]
    
    # Stage 5: Compress and write
    compressed = json.dumps(transformed, separators=(',', ':'), ensure_ascii=False)
    
    with open(output_path, 'w') as f:
        f.write(compressed)
    
    original_kb = len(raw) / 1024
    final_kb = len(compressed) / 1024
    
    print(f"DONE: {len(records)} records processed")
    print(f"Size: {original_kb:.1f}KB → {final_kb:.1f}KB ({100 - (final_kb/original_kb*100):.0f}% reduction)")
    
    return 0

if __name__ == "__main__":
    transform_pipeline("crm_export.json", "crm_import.json")

The import went live Sunday afternoon. Monday morning, the CRM had all 15,000 records imported cleanly.

The Tool Chain You Actually Need

For quick transformations, you don't need a full ETL tool. A good JSON formatter covers stages 1, 2, and 5 (validate, format, compress). Stages 3 and 4 (extract, transform) usually need code. But if your tool can show you the data in different formats—tree view, table view, or as code output—extraction becomes much easier.

For checking intermediate results, I often validate and format a sample first in the JSON Formatter, then write the transformation script against that cleaned output. The tree view helps me understand the structure at a glance, and the compression feature lets me check final payload sizes before writing the batch logic.

FAQ

Q: Do I need a separate ETL tool like Airbyte or Fivetran for JSON transformations?

A: Not for one-off tasks or pipelines under 100K records. A Python script + a JSON formatter handles most cases. ETL tools make sense when you need scheduled, incremental loads with monitoring.

Q: How do I handle JSON keys that change case between environments?

A: Normalize all keys to a single case (I prefer snake_case) in the extraction stage. Use a case-insensitive lookup: next((v for k, v in record.items() if k.lower() == target_key.lower()), None).

Q: What's the most common mistake in JSON data transformation?

A: Assuming the sample data represents all edge cases. I've had pipelines that worked on 1,000 test records but broke on record 10,001 in production because a field that was always present suddenly went missing.

Q: Should I transform JSON before or after inserting into a database?

A: Transform before inserting. Database ETL is harder to debug and roll back. Transform in application code, write the clean data, and keep the raw JSON as a backup column for audit purposes.

Q: How do I handle deeply nested JSON in a transformation pipeline?

A: Flatten step by step. First extract the top-level structure, then recursively process nested objects. I use a depth parameter—flatten to 2 levels for the target system, keep deeper nesting as a JSON string column.

Q: Can I do JSON transformation entirely in the browser?

A: For stages 1, 2, and 5, yes. The JSON Formatter handles validation, beautification, and compression locally. Stages 3-4 (extract and transform) typically need code, but running the formatter first gives you clean, valid input to script against.

Q: What about transformation logging and rollback?

A: Log every transformed record with its original source. I add a _meta block: {"_meta": {"source_file": "crm_export.json", "source_index": 1337, "transformed_at": "2026-06-05T14:30:00Z"}}. This makes rollbacks trivial—just re-extract from the source.

Q: How do I validate transformations didn't corrupt my data?

A: Run a round-trip check: transform a small batch, then reverse the transformation and verify the output matches the input (within acceptable precision). I do this for numeric fields especially—currency conversions can introduce rounding errors.


The next time you get a messy JSON dump at 4:30 PM on a Friday, stop and think in stages: validate first, format to see what you're dealing with, extract what you need, transform to the right format, then compress for transport. The JSON Formatter handles the first two and the last stage in one clean interface, entirely in your browser. No setup, no server, no data leaving your machine.