I'm Harish Thota — a Data Engineer with 4+ years of experience including 2.5 years in Australia, building batch pipelines on Azure, Databricks, and Microsoft Fabric. I don't just move data. I make sure the right data reaches the right people — correctly, every time.
Three things that separate my work from a standard engineering CV — in plain terms, not buzzwords.
Every tool listed here has been used in a production system — not a tutorial, not a side project.
Anonymised snippets from real production pipelines — showing the quality-gate and transformation patterns I apply across every project.
```python
# Watermark-based incremental merge into Delta table
# Pattern used across 5 ADF pipelines @ Grad Careers
from pyspark.sql import functions as F
from delta.tables import DeltaTable


def run_incremental_merge(spark, config):
    watermark = get_last_watermark(
        config["watermark_table"], config["pipeline_name"]
    )

    # Pull only rows changed since last run
    new_rows = spark.read.jdbc(
        url=config["source_jdbc"],
        table=config["source_table"],
        properties=config["jdbc_props"],
    ).filter(F.col("modified_date") > watermark)

    row_count = new_rows.count()
    if row_count == 0:
        log("No new rows — skipping merge")
        return

    # Merge into Delta (upsert, not overwrite)
    delta_table = DeltaTable.forPath(spark, config["delta_path"])
    (
        delta_table.alias("tgt")
        .merge(new_rows.alias("src"), "tgt.id = src.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

    update_watermark(
        config,
        new_watermark=new_rows.agg(F.max("modified_date")).collect()[0][0],
    )
    log(f"Merged {row_count} rows successfully")
```
Eliminates full daily reloads. Reruns only touch affected date windows — cutting compute and preventing cascade failures on partial source outages.
```python
# 12-check quality gate — blocks refresh on any failure
# Applied after every Silver layer write
from pyspark.sql import functions as F

CHECKS = [
    ("row_count_nonzero", lambda df: df.count() > 0),
    ("no_null_booking_id", lambda df: df.filter(
        F.col("booking_id").isNull()).count() == 0),
    ("pk_uniqueness", lambda df: df.count()
        == df.select("booking_id").distinct().count()),
    ("timestamp_valid", lambda df: df.filter(
        F.col("start_ts") > F.col("end_ts")).count() == 0),
    ("row_tie_out", lambda df: source_count() == df.count()),
    # ... 7 additional checks
]


def run_quality_gate(df, pipeline_name):
    failures = []
    for check_name, check_fn in CHECKS:
        try:
            passed = check_fn(df)
        except Exception:
            passed = False  # a crashing check counts as a failure
        if not passed:
            failures.append(check_name)

    if failures:
        log_failure(pipeline_name, failures)
        raise PipelineQualityError(
            f"BLOCKED: {len(failures)} check(s) failed"
        )
    log("All 12 checks passed ✓")
```
Hard-stops the pipeline on any single failure. Stakeholders never see a partial or corrupted refresh — the dashboard simply doesn't update until data is clean.
```python
# Conformed layer: 3 source systems → single Delta table
# 137k+ finance transactions @ AIML
from pyspark.sql import functions as F


def build_conformed_transactions(spark):
    # Normalise each source to canonical schema
    src_a = read_source_a(spark).transform(normalise_a)
    src_b = read_source_b(spark).transform(normalise_b)
    src_c = read_source_c(spark).transform(normalise_c)

    conformed = src_a.unionByName(src_b).unionByName(src_c)

    # Apply 22-category business rules
    conformed = conformed.withColumn(
        "reason_code",
        apply_classification_rules(
            F.col("txn_type"), F.col("amount"), F.col("timestamp")
        ),
    ).withColumn(
        "is_exception", F.col("reason_code").isin(EXCEPTION_CODES)
    )

    # Write to Delta with schema enforcement
    (
        conformed.write.format("delta")
        .mode("overwrite")
        .option("mergeSchema", "false")
        .save(CONFORMED_PATH)
    )

    return conformed.count()
```
Merges 3 incompatible source schemas into a single governed Delta table. Schema enforcement prevents upstream changes silently corrupting downstream reports.
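The enforcement itself is done by Delta Lake at write time: with `mergeSchema` off, a write whose columns don't match the table's schema fails instead of silently widening it. The principle can be illustrated with a plain-Python guard, a simplified sketch only (the schema and error type below are hypothetical, not the production code):

```python
# Illustrative canonical schema; real enforcement is Delta's, at write time
CANONICAL_SCHEMA = {
    "txn_id": str,
    "txn_type": str,
    "amount": float,
    "reason_code": str,
}


class SchemaEnforcementError(Exception):
    pass


def enforce_schema(record, schema=CANONICAL_SCHEMA):
    # Reject extra columns (an upstream system added a field)
    extra = set(record) - set(schema)
    if extra:
        raise SchemaEnforcementError(f"Unexpected columns: {sorted(extra)}")
    # Reject missing columns (an upstream system dropped a field)
    missing = set(schema) - set(record)
    if missing:
        raise SchemaEnforcementError(f"Missing columns: {sorted(missing)}")
    # Reject type drift
    for col, expected in schema.items():
        if not isinstance(record[col], expected):
            raise SchemaEnforcementError(
                f"Column {col!r}: expected {expected.__name__}, "
                f"got {type(record[col]).__name__}"
            )
    return record
```

Either way, the failure surfaces at load time in the pipeline's hands, not weeks later as a wrong number in a finance report.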
```sql
-- Validation suite (excerpt: 4 of 8 checks)
-- Manufacturing supply chain @ Hyundai Mobis

-- CHECK 1: Duplicate part records
SELECT part_id, warehouse_id, COUNT(*) AS dup_count
FROM staging.inventory_snapshot
GROUP BY part_id, warehouse_id
HAVING COUNT(*) > 1;

-- CHECK 2: Negative stock levels
SELECT part_id, warehouse_id, stock_qty
FROM staging.inventory_snapshot
WHERE stock_qty < 0;

-- CHECK 3: Missing supplier mapping
SELECT s.part_id
FROM staging.inventory_snapshot s
LEFT JOIN ref.supplier_catalog c
  ON s.part_id = c.part_id
WHERE c.part_id IS NULL;

-- CHECK 4: Future-dated receipts
SELECT part_id, receipt_date
FROM staging.inventory_snapshot
WHERE receipt_date > GETDATE();
```
Each query returns rows only when a defect exists. Empty result = passed. Any rows = blocked. Zero KPI disputes after implementing this validation suite.
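Checks in this shape can be driven by one small runner: execute each defect query, and block the load if any returns rows. A minimal sketch using sqlite3 (table and check names here are illustrative stand-ins; the production suite runs against the warehouse):

```python
import sqlite3

# Illustrative defect queries: rows returned == defects found
DEFECT_CHECKS = {
    "duplicate_parts": """
        SELECT part_id, warehouse_id, COUNT(*) AS dup_count
        FROM inventory_snapshot
        GROUP BY part_id, warehouse_id
        HAVING COUNT(*) > 1
    """,
    "negative_stock": """
        SELECT part_id, warehouse_id, stock_qty
        FROM inventory_snapshot
        WHERE stock_qty < 0
    """,
}


def run_defect_checks(conn):
    # Empty result = passed; any rows = check name reported with defect count
    failures = {}
    for name, sql in DEFECT_CHECKS.items():
        rows = conn.execute(sql).fetchall()
        if rows:
            failures[name] = len(rows)
    return failures
```

The caller raises (and skips the downstream refresh) whenever the returned dict is non-empty, which is what turns a passive SQL suite into a hard gate.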
Three production systems. Each started with a broken process. Each case study includes the full architecture, problem narrative, code patterns, and results.
The rostering system ran full daily reloads with zero quality checks. Failures were silent. 4,000+ timestamp defects corrupted the demand fact table used every morning. I rebuilt the pipeline from scratch with a 12-check quality gate and a Bronze → Silver → Gold Medallion architecture.
Read full case study — architecture + code →

137,000+ finance transactions reconciled manually every month from 3 disconnected systems. Sign-off variances were untraceable. I replaced the entire process with Databricks PySpark conformed layers, a 22-category rules engine, and Power BI drill-through for auditors.
Read full case study — architecture + code →

Warehouse managers and Finance were using different numbers for the same KPIs — sourced from incompatible exports. I built an ADF pipeline and star schema dimensional model (6 dimensions, 3 fact tables) with 8 Power BI dashboards as the single source of truth.
Read full case study — architecture + code →

4+ years of progressive data engineering — university research, IT consulting, career services, and automotive manufacturing.
Built and maintain batch pipelines processing 200,000+ annual training bookings. Implemented the organisation's first Medallion architecture with CI/CD governance and a 12-check quality-gate suite that blocks bad data before it reaches dashboards.
Engineered conformed data layers for monthly government reporting across 137,000+ finance transactions from 3 source systems. Replaced 3-day manual reconciliation with automated Databricks pipelines.
Built ADF pipelines and star schema dimensional model for warehouse and supplier performance. 8 Power BI dashboards reduced root-cause analysis from 1 hour to under 15 minutes.
Open to Data Engineer roles across Australia. Full working rights on a Subclass 485 visa — no sponsorship required. Happy to talk this week.