Open to Data Engineer roles · Australia

Data breaks.
I stop it before
anyone notices.

I'm Harish Thota — a Data Engineer with 4+ years of experience including 2.5 years in Australia, building batch pipelines on Azure, Databricks, and Microsoft Fabric. I don't just move data. I make sure the right data reaches the right people — correctly, every time.

600k+
Records in production pipelines
40%
Reporting disputes eliminated
3d → 1d
Month-end close, Databricks
Recruiter? Save time. Download my resume or full case study pack — everything you need in one click.

Why hire me

Most engineers fix data.
I prevent the problem.

Three things that separate my work from a standard engineering CV — in plain terms, not buzzwords.

01 / Outcome-first
Every pipeline tied to a measurable business result
I don't hand over a completed pipeline and move on. Each system I've built came with a specific number attached — 40% fewer disputes, 3-day close reduced to 1, zero incorrect payment runs. Engineering decisions trace back to operational impact.
02 / Validation-first
I build the quality layer most engineers skip
Across three roles, I've built 12-check gate suites, 20-query reconciliation sets, and rules engines with 22+ categories. Bad rows never reach stakeholders because the pipeline hard-stops before they can.
03 / Production-grade
Medallion architecture, CI/CD, Delta Lake — in live production
My work lives in production, not in a tutorial. Watermark-based incremental merges, schema-gated deployments via Azure DevOps, dbt-compatible modular layers — systems I own and maintain daily.
Technical Stack

What I build with

Every tool listed here has been used in a production system — not a tutorial, not a side project.

Cloud & Orchestration
Azure Data Factory · Microsoft Fabric · Databricks · Azure Synapse · Azure DevOps · Snowflake
Languages
Python · T-SQL / SQL · PySpark · dbt · Spark SQL · DAX
Architecture
Delta Lake · Medallion Architecture · Lakehouse · Star Schema · Dimensional Modelling
BI & DevOps
Power BI · CI/CD · Git · Drill-through Reporting

Code Samples

How I write it

Anonymised snippets from real production pipelines — showing the quality-gate and transformation patterns I apply across every project.

incremental_merge.py PySpark · ADF Notebook
# Watermark-based incremental merge into Delta table
# Pattern used across 5 ADF pipelines @ Grad Careers
from pyspark.sql import functions as F
from delta.tables import DeltaTable

def run_incremental_merge(spark, config):
    watermark = get_last_watermark(
        config["watermark_table"],
        config["pipeline_name"]
    )

    # Pull only rows changed since last run
    new_rows = spark.read.jdbc(
        url=config["source_jdbc"],
        table=config["source_table"],
        properties=config["jdbc_props"]
    ).filter(
        F.col("modified_date") > watermark
    )

    row_count = new_rows.count()
    if row_count == 0:
        log("No new rows — skipping merge")
        return

    # Merge into Delta (upsert, not overwrite)
    delta_table = DeltaTable.forPath(
        spark, config["delta_path"]
    )
    delta_table.alias("tgt").merge(
        new_rows.alias("src"),
        "tgt.id = src.id"
    ).whenMatchedUpdateAll(
    ).whenNotMatchedInsertAll(
    ).execute()

    update_watermark(config, new_watermark=
        new_rows.agg(F.max("modified_date")).collect()[0][0]
    )
    log(f"Merged {row_count} rows successfully")

Eliminates full daily reloads. Reruns only touch affected date windows — cutting compute and preventing cascade failures on partial source outages.
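The merge above leans on two helpers that aren't shown: `get_last_watermark` and `update_watermark`. A minimal sketch of that watermark-store pattern is below — in production this would be a Delta or SQL control table, but sqlite3 stands in here so the pattern is runnable end to end; all names and the epoch default are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch of the watermark-store helpers referenced above.
# sqlite3 stands in for the real Delta/SQL control table.
import sqlite3

EPOCH = "1900-01-01 00:00:00"  # assumed default for a pipeline's first run

def init_watermark_store(conn):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS watermarks (
               pipeline_name TEXT PRIMARY KEY,
               last_watermark TEXT NOT NULL
           )"""
    )

def get_last_watermark(conn, pipeline_name):
    row = conn.execute(
        "SELECT last_watermark FROM watermarks WHERE pipeline_name = ?",
        (pipeline_name,),
    ).fetchone()
    return row[0] if row else EPOCH  # first run pulls everything

def update_watermark(conn, pipeline_name, new_watermark):
    # Upsert so reruns simply advance the marker
    conn.execute(
        """INSERT INTO watermarks (pipeline_name, last_watermark)
           VALUES (?, ?)
           ON CONFLICT(pipeline_name)
           DO UPDATE SET last_watermark = excluded.last_watermark""",
        (pipeline_name, new_watermark),
    )

conn = sqlite3.connect(":memory:")
init_watermark_store(conn)
print(get_last_watermark(conn, "bookings"))  # epoch default on first run
update_watermark(conn, "bookings", "2024-09-01 06:00:00")
print(get_last_watermark(conn, "bookings"))
```

Storing one marker per pipeline name is what lets five parameterised pipelines share a single control table.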

quality_gate.py Python · Fabric Notebook
# 12-check quality gate — blocks refresh on any failure
# Applied after every Silver layer write
from pyspark.sql import functions as F

CHECKS = [
  ("row_count_nonzero",   lambda df: df.count() > 0),
  ("no_null_booking_id",  lambda df: df.filter(
      F.col("booking_id").isNull()).count() == 0),
  ("pk_uniqueness",       lambda df: df.count() ==
      df.select("booking_id").distinct().count()),
  ("timestamp_valid",     lambda df: df.filter(
      F.col("start_ts") > F.col("end_ts")).count() == 0),
  ("row_tie_out",         lambda df: source_count() ==
      df.count()),
  # ... 7 additional checks
]

def run_quality_gate(df, pipeline_name):
    failures = []
    for check_name, check_fn in CHECKS:
        try:
            passed = check_fn(df)
        except Exception:
            passed = False
        if not passed:
            failures.append(check_name)

    if failures:
        log_failure(pipeline_name, failures)
        raise PipelineQualityError(
            f"BLOCKED: {len(failures)} check(s) failed"
        )
    log("All 12 checks passed ✓")

Hard-stops the pipeline on any single failure. Stakeholders never see a partial or corrupted refresh — the dashboard simply doesn't update until data is clean.

conformed_layer.py PySpark · Databricks
# Conformed layer: 3 source systems → single Delta table
# 137k+ finance transactions @ AIML
from pyspark.sql import functions as F

def build_conformed_transactions(spark):
    # Normalise each source to canonical schema
    src_a = read_source_a(spark).transform(normalise_a)
    src_b = read_source_b(spark).transform(normalise_b)
    src_c = read_source_c(spark).transform(normalise_c)

    conformed = src_a.unionByName(src_b).unionByName(src_c)

    # Apply 22-category business rules
    conformed = conformed.withColumn(
        "reason_code",
        apply_classification_rules(F.col("txn_type"),
                                    F.col("amount"),
                                    F.col("timestamp"))
    ).withColumn(
        "is_exception",
        F.col("reason_code").isin(EXCEPTION_CODES)
    )

    # Write to Delta with schema enforcement
    conformed.write.format("delta"
    ).mode("overwrite"
    ).option("mergeSchema", "false"
    ).save(CONFORMED_PATH)

    return conformed.count()

Merges 3 incompatible source schemas into a single governed Delta table. Schema enforcement prevents upstream changes silently corrupting downstream reports.
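For a sense of what sits behind `apply_classification_rules`, here is a hedged sketch of the ordered-rules-table pattern in plain Python — first matching predicate wins, unmatched rows fall through to a default. The real engine has 22 categories; the three rules, codes, and field names below are illustrative assumptions only.

```python
# Hypothetical sketch of the classification-rules pattern — NOT the
# production engine. First matching predicate wins; default otherwise.
RULES = [
    # (reason_code, predicate over a transaction dict) — illustrative
    ("REFUND",     lambda t: t["txn_type"] == "refund"),
    ("HIGH_VALUE", lambda t: t["amount"] >= 10_000),
    ("BACKDATED",  lambda t: t["timestamp"] < t["period_start"]),
]
DEFAULT_CODE = "STANDARD"
EXCEPTION_CODES = {"HIGH_VALUE", "BACKDATED"}

def classify(txn):
    for code, predicate in RULES:
        if predicate(txn):
            return code
    return DEFAULT_CODE

txn = {"txn_type": "invoice", "amount": 25_000,
       "timestamp": "2024-03-02", "period_start": "2024-03-01"}
code = classify(txn)
print(code, code in EXCEPTION_CODES)  # HIGH_VALUE True
```

Keeping rules as an ordered data table rather than nested `if` branches is what makes a 22-category engine reviewable: each category is one line, and precedence is explicit.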

validation.sql T-SQL · Synapse
-- Validation suite (excerpt: 4 of 8 checks)
-- Manufacturing supply chain @ Hyundai Mobis

-- CHECK 1: Duplicate part records
SELECT part_id, warehouse_id, 
       COUNT(*) AS dup_count
FROM   staging.inventory_snapshot
GROUP BY part_id, warehouse_id
HAVING COUNT(*) > 1;

-- CHECK 2: Negative stock levels
SELECT part_id, warehouse_id, stock_qty
FROM   staging.inventory_snapshot
WHERE  stock_qty < 0;

-- CHECK 3: Missing supplier mapping
SELECT s.part_id
FROM   staging.inventory_snapshot s
LEFT JOIN ref.supplier_catalog c
       ON s.part_id = c.part_id
WHERE  c.part_id IS NULL;

-- CHECK 4: Future-dated receipts
SELECT part_id, receipt_date
FROM   staging.inventory_snapshot
WHERE  receipt_date > GETDATE();

Each query returns rows only when a defect exists. Empty result = passed. Any rows = blocked. KPI disputes fell 25–30% after this validation suite went live.
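The "empty result = passed" convention makes the runner around these checks trivial. Below is a runnable sketch of that harness — production executes the real T-SQL on Synapse, but sqlite3 stands in here, and the runner itself (names, blocking message) is an illustrative assumption, not the production code.

```python
# Sketch of the "empty result = passed" harness around checks like the
# ones above. sqlite3 stands in for Synapse so the pattern is runnable.
import sqlite3

CHECKS = {
    "duplicate_parts": """
        SELECT part_id, warehouse_id, COUNT(*) AS dup_count
        FROM inventory_snapshot
        GROUP BY part_id, warehouse_id
        HAVING COUNT(*) > 1
    """,
    "negative_stock": """
        SELECT part_id, warehouse_id, stock_qty
        FROM inventory_snapshot
        WHERE stock_qty < 0
    """,
}

def run_checks(conn):
    # A check fails the moment it returns rows; empty result = passed
    return [name for name, sql in CHECKS.items()
            if conn.execute(sql).fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory_snapshot "
             "(part_id TEXT, warehouse_id TEXT, stock_qty INTEGER)")
conn.executemany(
    "INSERT INTO inventory_snapshot VALUES (?, ?, ?)",
    [("P1", "W1", 10), ("P1", "W1", 4), ("P2", "W2", -3)],
)
failed = run_checks(conn)
if failed:
    print(f"BLOCKED: {failed}")  # both checks fire on this sample data
```

Because every check shares one contract — return rows only on defect — adding a ninth check is one SQL file, not a code change to the runner.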


Case Studies

Where outcomes were built

Three production systems. Each started with a broken process. Each case study includes the full architecture, problem narrative, code patterns, and results.

Case 01 of 03
Grad Careers Australia · Sep 2024 – Present
Silently failing pipelines were giving schedulers corrupted staffing data every single day
–40%
Reporting disputes
+20–25%
Staffing accuracy
4,000+
Defects corrected

The rostering system ran full daily reloads with zero quality checks. Failures were silent. 4,000+ timestamp defects corrupted the demand fact table used every morning. I rebuilt the pipeline from scratch with a 12-check quality gate and Bronze → Gold Medallion architecture.

ADF · Microsoft Fabric · Delta Lake · Python · T-SQL · Power BI · Azure DevOps
Read full case study — architecture + code →
Architecture preview
Booking System
ADF · Watermark Merge — 5 parameterised pipelines
Bronze · Raw Delta
Quality Gate · 12 Checks — hard-blocks on any failure
Silver → Gold · 7 tables
Power BI · 2 Dashboards
Case 02 of 03
AIML · University of Adelaide · Sep 2023 – Sep 2024
Government reporting deadlines were being missed — 3-day manual reconciliation every month
+25%
Audit accuracy
3d → 1d
Month-end close
45 → 10 min
Per-exception audit time

137,000+ finance transactions reconciled manually every month from 3 disconnected systems. Sign-off variances were untraceable. I replaced the entire process with Databricks PySpark conformed layers, a 22-category rules engine, and Power BI drill-through for auditors.

Databricks · PySpark · Delta Lake · dbt · T-SQL · Power BI
Read full case study — architecture + code →
Architecture preview
Finance Sys A · Finance Sys B · Finance Sys C
Databricks PySpark — normalise + union all 3 sources
Rules Engine — 22 categories · 9 timestamp rules
Delta Lake · Conformed
dbt Modular Layers
Power BI Drill-through — Exception → Entity → Txn
Case 03 of 03
Hyundai Mobis Technical Centre · India · 2019 – 2022
Warehouse KPI disputes consumed hours every week — no single source of truth for performance metrics
–25–30%
KPI disputes
8 hrs/wk
Manual effort eliminated
1 hr → 15m
Root-cause analysis

Warehouse managers and Finance were using different numbers for the same KPIs — sourced from incompatible exports. I built an ADF pipeline and star schema dimensional model (6 dimensions, 3 fact tables) with 8 Power BI dashboards as the single source of truth.

ADF · Azure SQL · T-SQL · Power BI · Spark · Star Schema
Read full case study — architecture + code →
Architecture preview
ERP System · WMS System
ADF Extract Pipeline — nightly incremental loads
Star Schema — 6 dimensions · 3 fact tables
Azure SQL Database
Power BI · 8 Dashboards — Warehouse + Supplier KPIs

Professional History

Experience

4+ years of progressive data engineering — university research, IT consulting, career services, and automotive manufacturing.

Sep 2024 – Present
↑ –40% disputes
Grad Careers Australia · Adelaide, SA
Data Engineer
↑ +20–25% peak staffing accuracy

Built and maintain batch pipelines processing 200,000+ annual training bookings. Implemented the organisation's first Medallion architecture with CI/CD governance and a 12-check quality-gate suite that blocks bad data before it reaches dashboards.

ADF · Microsoft Fabric · Delta Lake · T-SQL · Python · Power BI · Azure DevOps
Sep 2023 – Sep 2024
↑ Month-end: 3d → 1d
AIML · University of Adelaide · Adelaide, SA
Data Engineer (Analytics & Compliance)
↑ +25% audit accuracy

Engineered conformed data layers for monthly government reporting across 137,000+ finance transactions from 3 source systems. Replaced 3-day manual reconciliation with automated Databricks pipelines.

Databricks · PySpark · Delta Lake · dbt · T-SQL · Power BI
2019 – 2022
↑ KPI disputes –30%
Hyundai Mobis Technical Centre · India
Data Engineer (Reporting & Analytics)
↑ 8 hrs/week manual effort eliminated

Built ADF pipelines and star schema dimensional model for warehouse and supplier performance. 8 Power BI dashboards reduced root-cause analysis from 1 hour to under 15 minutes.

ADF · Azure SQL · T-SQL · Power BI · Spark · Star Schema
Education
Master of Data Science
University of Adelaide
GPA 6.75 / 7.0 · Top 5% of class · 2022–2024
Bachelor of Engineering (ECE)
VNR Vignana Jyothi Institute of Engineering
GPA 9.02 / 10.0 · 2015–2019
Let's work together

Ready when
you are.

Open to Data Engineer roles across Australia. Full working rights on a Subclass 485 visa — no sponsorship required. Happy to talk this week.

Open to opportunities
Work rights
Full · 485 Visa
Experience
4+ yrs · 2.5 yrs in AU
Role
Data Engineer
Location
Australia · Remote OK
Salary target
$90k – $115k
Notice period
Immediate