Stage 3: Extract — Pull Out What You Need

Question

Accepted Answer

Extraction is about selecting the fields you actually need and filtering out noise. For the CRM import, I needed a uniform structure: ```python import json with open("crm_export.json") as f: data = json.load(f) normalized = [] for record in data: entry = {"name": record.get("customer_name") or record.get("custName") or record.get("CUSTOMER.NAME")} contact = record.get("contact", {}) entry["email"] = contact.get("email") if isinstance(contact, dict) else record.get("contact_email") entry["phone"] = contact.get("phone") if isinstance(contact, dict) else record.get("contact_phone") orders = record.get("orders", record.get("order_ids", [])) if isinstance(orders, str): order_ids = orders.split(",") totals = record.get("order_totals", "").split(",") dates = record.get("order_dates", "").split(",") entry["order_count"] = len(order_ids) entry["total_spent"] = sum(float(t) for t in totals if t) elif isinstance(orders, list): entry["order_count"] = len(orders) entry["total_spent"] = sum(o.get("total", 0) for o in orders) normalized.append(entry) ``` This stage is where you make the hard decisions about what to keep and what to drop. Extraction gives you the right fields. Transformation gives you the right *values*. This is where data types get fixed, formats get standardized, and business logic gets applied. ```python from datetime import datetime def transform_record(entry: dict) -> dict: """Apply business transformations to a normalized record.""" entry["name"] = entry["name"].strip().title() if entry["name"] else "Unknown" if entry.get("phone"): cleaned = re.sub(r'[\s\-\.]', '', entry["phone"]) if cleaned.startswith("1") and len(cleaned) == 11: entry["phone"] = f"+{cleaned}" elif len(cleaned) == 10: entry["phone"] = f"+1{cleaned}" total = entry.get("total_spent", 0) if total > 10000: entry["tier"] = "platinum" elif total > 5000: entry["tier"] = "gold" elif total > 1000: entry["tier"] = "silver" else: entry["tier"] = "bronze" return entry ``` **Pro tip: always keep a "raw" copy** of each field alongside the transformed one during debugging. I use a `_raw` prefix: ```python entry["phone_raw"] = entry.get("phone") entry["phone"] = transform_phone(entry["phone"]) ``` This saved me when a regex change broke 500 phone numbers. I just re-ran the transformation from the raw fields. The final stage shrinks the payload for the target system. Our import API had a 5MB request limit, and the normalized data was 8.2MB. ```python clean = {k: v for k, v in entry.items() if v is not None and v != "" and v != 0} import json payload = json.dumps(all_records, separators=(',', ':')) BATCH_SIZE = 500 batches = [all_records[i:i+BATCH_SIZE] for i in range(0, len(all_records), BATCH_SIZE)] for batch in batches: send_to_api(json.dumps(batch, separators=(',', ':'))) ``` If your target format isn't JSON (say you need CSV for a database import or XML for a legacy system), do the conversion here too. Here's the complete pipeline script I used for that CRM import: ```python import json, re from datetime import datetime from typing import List, Dict def transform_pipeline(raw_path: str, output_path: str) -> int: """Run the full validation-transformation-compression pipeline.""" with open(raw_path) as f: raw = f.read() try: data = json.loads(raw)  # Validation except json.JSONDecodeError as e: print(f"VALIDATION FAILED at line {e.lineno}: {e.msg}") return 1 records = data if isinstance(data, list) else data.get("records", data.get("data", [])) transformed = [transform_record(extract_record(r)) for r in records] compressed = json.dumps(transformed, separators=(',', ':'), ensure_ascii=False) with open(output_path, 'w') as f: f.write(compressed) original_kb = len(raw) / 1024 final_kb = len(compressed) / 1024 print(f"DONE: {len(records)} records processed") print(f"Size: {original_kb:.1f}KB → {final_kb:.1f}KB ({100 - (final_kb/original_kb*100):.0f}% reduction)") return 0 if __name__ == "__main__": transform_pipeline("crm_export.json", "crm_import.json") ``` The import went live Sunday afternoon. Monday morning, the CRM had all 15,000 records imported cleanly. For quick transformations, you don't need a full ETL tool. A good JSON formatter covers stages 1, 2, and 5 (validate, format, compress). Stages 3 and 4 (extract, transform) usually need code. But if your tool can show you the data in different formats—tree view, table view, or as code output—extraction becomes much easier. For checking intermediate results, I often validate and format a sample first in the [JSON Formatter](https://devformatters.com/json-formatter.html), then write the transformation script against that cleaned output. The tree view helps me understand the structure at a glance, and the compression feature lets me check final payload sizes before writing the batch logic. **Q: Do I need a separate ETL tool like Airbyte or Fivetran for JSON transformations?** A: Not for one-off tasks or pipelines under 100K records. A Python script + a JSON formatter handles most cases. ETL tools make sense when you need scheduled, incremental loads with monitoring. **Q: How do I handle JSON keys that change case between environments?** A: Normalize all keys to a single case (I prefer snake_case) in the extraction stage. Use a case-insensitive lookup: `next((v for k, v in record.items() if k.lower() == target_key.lower()), None)`. **Q: What's the most common mistake in JSON data transformation?** A: Assuming the sample data represents all edge cases. I've had pipelines that worked on 1,000 test records but broke on record 10,001 in production because a field that was always present suddenly went missing. **Q: Should I transform JSON before or after inserting into a database?** A: Transform before inserting. Database ETL is harder to debug and roll back. Transform in application code, write the clean data, and keep the raw JSON as a backup column for audit purposes. **Q: How do I handle deeply nested JSON in a transformation pipeline?** A: Flatten step by step. First extract the top-level structure, then recursively process nested objects. I use a depth parameter—flatten to 2 levels for the target system, keep deeper nesting as a JSON string column. **Q: Can I do JSON transformation entirely in the browser?** A: For stages 1, 2, and 5, yes. The [JSON Formatter](https://devformatters.com/json-formatter.html) handles validation, beautification, and compression locally. Stages 3-4 (extract and transform) typically need code, but running the formatter first gives you clean, valid input to script against. **Q: What about transformation logging and rollback?** A: Log every transformed record with its original source. I add a `_meta` block: `{"_meta": {"source_file": "crm_export.json", "source_index": 1337, "transformed_at": "2026-06-05T14:30:00Z"}}`. This makes rollbacks trivial—just re-extract from the source. **Q: How do I validate transformations didn't corrupt my data?** A: Run a round-trip check: transform a small batch, then reverse the transformation and verify the output matches the input (within acceptable precision). I do this for numeric fields especially—currency conversions can introduce rounding errors. --- The next time you get a messy JSON dump at 4:30 PM on a Friday, stop and think in stages: validate first, format to see what you're dealing with, extract what you need, transform to the right format, then compress for transport. The [JSON Formatter](https://devformatters.com/json-formatter.html) handles the first two and the last stage in one clean interface, entirely in your browser. No setup, no server, no data leaving your machine.

JSON Data Transformation: From Raw API Data to Production-Ready Format

The Five-Stage Transformation Pipeline

Stage 1: Validate — Check Your Input

Stage 2: Format — Normalize the Structure