I'm Harish Thota — a Data Engineer with 4+ years of experience including 2.5 years in Australia, building batch pipelines on Azure, Databricks, and Microsoft Fabric. I don't just move data. I make sure the right data reaches the right people — correctly, every time.
Three things that separate my work from a standard engineering CV — in plain terms, not buzzwords.
Every tool listed here has been used in a production system — not a tutorial, not a side project.
Anonymised snippets from real production pipelines — showing the quality-gate and transformation patterns I apply across every project.
```python
# Watermark-based incremental merge into Delta table
# Pattern used across 5 ADF pipelines @ Grad Careers
from pyspark.sql import functions as F
from delta.tables import DeltaTable


def run_incremental_merge(spark, config):
    watermark = get_last_watermark(
        config["watermark_table"], config["pipeline_name"]
    )

    # Pull only rows changed since last run
    new_rows = spark.read.jdbc(
        url=config["source_jdbc"],
        table=config["source_table"],
        properties=config["jdbc_props"],
    ).filter(F.col("modified_date") > watermark)

    row_count = new_rows.count()
    if row_count == 0:
        log("No new rows — skipping merge")
        return

    # Merge into Delta (upsert, not overwrite)
    delta_table = DeltaTable.forPath(spark, config["delta_path"])
    (
        delta_table.alias("tgt")
        .merge(new_rows.alias("src"), "tgt.id = src.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

    update_watermark(
        config,
        new_watermark=new_rows.agg(F.max("modified_date")).collect()[0][0],
    )
    log(f"Merged {row_count} rows successfully")
```
Eliminates full daily reloads. Reruns only touch affected date windows — cutting compute and preventing cascade failures on partial source outages.
```python
# 12-check quality gate — blocks refresh on any failure
# Applied after every Silver layer write
from pyspark.sql import functions as F

CHECKS = [
    ("row_count_nonzero", lambda df: df.count() > 0),
    ("no_null_booking_id", lambda df: df.filter(
        F.col("booking_id").isNull()).count() == 0),
    ("pk_uniqueness", lambda df: df.count()
        == df.select("booking_id").distinct().count()),
    ("timestamp_valid", lambda df: df.filter(
        F.col("start_ts") > F.col("end_ts")).count() == 0),
    ("row_tie_out", lambda df: source_count() == df.count()),
    # ... 7 additional checks
]


def run_quality_gate(df, pipeline_name):
    failures = []
    for check_name, check_fn in CHECKS:
        try:
            passed = check_fn(df)
        except Exception:
            passed = False  # a crashing check counts as a failure
        if not passed:
            failures.append(check_name)

    if failures:
        log_failure(pipeline_name, failures)
        raise PipelineQualityError(
            f"BLOCKED: {len(failures)} check(s) failed"
        )
    log("All 12 checks passed ✓")
```
Hard-stops the pipeline on any single failure. Stakeholders never see a partial or corrupted refresh — the dashboard simply doesn't update until data is clean.
```python
# Conformed layer: 3 source systems → single Delta table
# 137k+ finance transactions @ AIML
from pyspark.sql import functions as F


def build_conformed_transactions(spark):
    # Normalise each source to canonical schema
    src_a = read_source_a(spark).transform(normalise_a)
    src_b = read_source_b(spark).transform(normalise_b)
    src_c = read_source_c(spark).transform(normalise_c)

    conformed = src_a.unionByName(src_b).unionByName(src_c)

    # Apply 22-category business rules
    conformed = conformed.withColumn(
        "reason_code",
        apply_classification_rules(
            F.col("txn_type"), F.col("amount"), F.col("timestamp")
        ),
    ).withColumn(
        "is_exception", F.col("reason_code").isin(EXCEPTION_CODES)
    )

    # Write to Delta with schema enforcement
    (
        conformed.write.format("delta")
        .mode("overwrite")
        .option("mergeSchema", "false")
        .save(CONFORMED_PATH)
    )

    return conformed.count()
```
Merges 3 incompatible source schemas into a single governed Delta table. Schema enforcement prevents upstream changes silently corrupting downstream reports.
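The enforcement itself is done by Delta Lake at write time: with `mergeSchema` off, a write whose columns don't match the table's schema fails instead of silently widening it. The principle can be illustrated with a plain-Python guard, a simplified sketch only (the schema and error type below are hypothetical, not the production code):

```python
# Illustrative canonical schema; real enforcement is Delta's, at write time
CANONICAL_SCHEMA = {
    "txn_id": str,
    "txn_type": str,
    "amount": float,
    "reason_code": str,
}


class SchemaEnforcementError(Exception):
    pass


def enforce_schema(record, schema=CANONICAL_SCHEMA):
    # Reject extra columns (an upstream system added a field)
    extra = set(record) - set(schema)
    if extra:
        raise SchemaEnforcementError(f"Unexpected columns: {sorted(extra)}")
    # Reject missing columns (an upstream system dropped a field)
    missing = set(schema) - set(record)
    if missing:
        raise SchemaEnforcementError(f"Missing columns: {sorted(missing)}")
    # Reject type drift
    for col, expected in schema.items():
        if not isinstance(record[col], expected):
            raise SchemaEnforcementError(
                f"Column {col!r}: expected {expected.__name__}, "
                f"got {type(record[col]).__name__}"
            )
    return record
```

Either way, the failure surfaces at load time in the pipeline's hands, not weeks later as a wrong number in a finance report.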
```sql
-- Validation suite (excerpt: 4 of 8 checks)
-- Manufacturing supply chain @ Hyundai Mobis

-- CHECK 1: Duplicate part records
SELECT part_id, warehouse_id, COUNT(*) AS dup_count
FROM staging.inventory_snapshot
GROUP BY part_id, warehouse_id
HAVING COUNT(*) > 1;

-- CHECK 2: Negative stock levels
SELECT part_id, warehouse_id, stock_qty
FROM staging.inventory_snapshot
WHERE stock_qty < 0;

-- CHECK 3: Missing supplier mapping
SELECT s.part_id
FROM staging.inventory_snapshot s
LEFT JOIN ref.supplier_catalog c
  ON s.part_id = c.part_id
WHERE c.part_id IS NULL;

-- CHECK 4: Future-dated receipts
SELECT part_id, receipt_date
FROM staging.inventory_snapshot
WHERE receipt_date > GETDATE();
```
Each query returns rows only when a defect exists. Empty result = passed. Any rows = blocked. Zero KPI disputes after implementing this validation suite.
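Checks in this shape can be driven by one small runner: execute each defect query, and block the load if any returns rows. A minimal sketch using sqlite3 (table and check names here are illustrative stand-ins; the production suite runs against the warehouse):

```python
import sqlite3

# Illustrative defect queries: rows returned == defects found
DEFECT_CHECKS = {
    "duplicate_parts": """
        SELECT part_id, warehouse_id, COUNT(*) AS dup_count
        FROM inventory_snapshot
        GROUP BY part_id, warehouse_id
        HAVING COUNT(*) > 1
    """,
    "negative_stock": """
        SELECT part_id, warehouse_id, stock_qty
        FROM inventory_snapshot
        WHERE stock_qty < 0
    """,
}


def run_defect_checks(conn):
    # Empty result = passed; any rows = check name reported with defect count
    failures = {}
    for name, sql in DEFECT_CHECKS.items():
        rows = conn.execute(sql).fetchall()
        if rows:
            failures[name] = len(rows)
    return failures
```

The caller raises (and skips the downstream refresh) whenever the returned dict is non-empty, which is what turns a passive SQL suite into a hard gate.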
Three production systems. Each started with a broken process. Each case study includes the full architecture, problem narrative, code patterns, and results.
The rostering system ran full daily reloads with zero quality checks. Failures were silent. 4,000+ timestamp defects corrupted the demand fact table used every morning. I rebuilt the pipeline from scratch with a 12-check quality gate and a Bronze → Silver → Gold Medallion architecture.
Read full case study — architecture + code →

137,000+ finance transactions reconciled manually every month from 3 disconnected systems. Sign-off variances were untraceable. I replaced the entire process with Databricks PySpark conformed layers, a 22-category rules engine, and Power BI drill-through for auditors.
Read full case study — architecture + code →

Warehouse managers and Finance were using different numbers for the same KPIs — sourced from incompatible exports. I built an ADF pipeline and star schema dimensional model (6 dimensions, 3 fact tables) with 8 Power BI dashboards as the single source of truth.
Read full case study — architecture + code →

4+ years of progressive data engineering — university research, IT consulting, career services, and automotive manufacturing.
Built and maintain batch pipelines processing 200,000+ annual training bookings. Implemented the organisation's first Medallion architecture with CI/CD governance and a 12-check quality-gate suite that blocks bad data before it reaches dashboards.
Engineered conformed data layers for monthly government reporting across 137,000+ finance transactions from 3 source systems. Replaced 3-day manual reconciliation with automated Databricks pipelines.
Built ADF pipelines and star schema dimensional model for warehouse and supplier performance. 8 Power BI dashboards reduced root-cause analysis from 1 hour to under 15 minutes.
Open to Data Engineer roles across Australia. Full working rights on a Subclass 485 visa — no sponsorship required. Happy to talk this week.