feat(warehouses): Add support to backfill manually exported missing electricity data#381

Open
martyngigg wants to merge 5 commits into main from backfill_missing_electricity_data

@martyngigg martyngigg commented Jul 2, 2026

Member

Summary

Updates the electricity data ingestion script to accept files that have been manually exported from the RDM system. These files appear to use a different CSV format from the automatically created files. A new option to override the backfill glob patterns allows a single run of the script against a smaller subset of exported data, rather than having to run the entire backfill again.
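
The glob override described above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: `select_backfill_files`, `DEFAULT_BACKFILL_GLOBS`, and the file names are hypothetical, and only the idea of an optional pattern replacing the defaults comes from the description.

```python
from fnmatch import fnmatchcase

# Hypothetical default patterns; the real script's patterns are not shown on this page.
DEFAULT_BACKFILL_GLOBS = ("*.csv", "*.xlsx")


def select_backfill_files(file_names, backfill_glob=None):
    """Return the file names to ingest, preserving their original order.

    If backfill_glob is given it replaces the default patterns entirely,
    so a single run can target a small subset of exported files.
    """
    patterns = (backfill_glob,) if backfill_glob else DEFAULT_BACKFILL_GLOBS
    return [
        name
        for name in file_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


files = ["2025-01.csv", "2025-02.csv", "manual_export_2025-03.csv", "notes.txt"]
full_backfill = select_backfill_files(files)
manual_only = select_backfill_files(files, "manual_export_*")
print(full_backfill)
print(manual_only)
```

With the override, only the manually exported file is selected, so the rest of the backfill does not need to be rerun.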

This also now requires pandas >= 3.0 due to a breaking change in 3.0 around DST error handling. From the release notes: "Additionally, pandas no longer throws pytz exceptions for timezone operations leading to ambiguous or nonexistent times. These cases will now raise a ValueError."
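
For readers unfamiliar with the two DST edge cases named in that quote, the sketch below classifies a naive local timestamp using only the standard library `zoneinfo` module rather than pandas. It is an illustration of the underlying problem, not the ingestion script's logic: the function name is hypothetical, the `Europe/London` zone is an assumption, and it needs Python 3.9+ with a time zone database available on the system.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(naive, tz_name="Europe/London"):
    """Classify a naive wall-clock time as 'ok', 'ambiguous' or 'nonexistent'."""
    tz = ZoneInfo(tz_name)
    early = naive.replace(tzinfo=tz, fold=0)
    late = naive.replace(tzinfo=tz, fold=1)
    if early.utcoffset() == late.utcoffset():
        # Both interpretations agree, so the wall time is unambiguous.
        return "ok"
    # The offsets differ, so the time falls in a gap or an overlap.
    # A time in a gap does not survive a round trip through UTC.
    round_trip = early.astimezone(timezone.utc).astimezone(tz).replace(tzinfo=None)
    return "ambiguous" if round_trip == naive else "nonexistent"


spring_forward = classify_local_time(datetime(2026, 3, 29, 1, 30))  # clocks skip 01:00-02:00
fall_back = classify_local_time(datetime(2026, 10, 25, 1, 30))      # 01:00-02:00 occurs twice
ordinary = classify_local_time(datetime(2026, 7, 1, 12, 0))
print(spring_forward, fall_back, ordinary)
```

These are the cases for which, per the release notes quoted above, pandas 3.0 raises a `ValueError` during localization.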

This has been checked locally by first running the script as it is on main, then running the version on this branch and confirming that no duplication occurred.

Fixes #362

Summary by CodeRabbit

  • New Features

    • Improved Keycloak setup to automatically provision and use a persistent local admin when required.
    • SharePoint electricity ingestion now supports configurable backfill via an optional glob pattern.
  • Bug Fixes

    • Keycloak configuration now correctly avoids depending on the temporary bootstrap admin during later provisioning.
    • SharePoint electricity imports handle CSV sections and timestamps more reliably (including DST edge cases), and use safer upsert semantics to keep rdm_data consistent.

…e bootstrapped admin (#377)

### Summary

Keycloak recommends disabling any bootstrapped admin account and creating
a new permanent one - there is a banner if you log in with the temporary
one.

These changes configure the master realm with a new admin account - the
credentials are in the Vault.
Also includes a refactor to simplify resource handling and reduce code duplication.
@martyngigg martyngigg requested review from a team as code owners July 2, 2026 08:58
coderabbitai Bot commented Jul 2, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c0754213-bd44-49e5-8f48-8006a35481ca

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Keycloak now boots a temporary admin only when needed, creates a permanent local admin, and uses that account for LDAP and realm setup. The electricity SharePoint ingest updates CSV parsing and section detection, and refactors the DLT write path and backfill file selection.

Changes

Keycloak bootstrap and admin rotation

| Layer / File(s) | Summary |
|---|---|
| Bootstrap and local admin variables (infra/ansible/group_vars/all/all.yml, infra/ansible/group_vars/keycloak.yml) | keycloak_bootstrap.admin_user is set to "temp-admin", and keycloak_local_admin is added with user and password from secrets. |
| Bootstrap probe and admin creation (infra/ansible/roles/keycloak/tasks/main.yml) | Adds a token-endpoint probe, gates bootstrap execution, creates the permanent local admin, disables the temporary bootstrap admin, and imports setup-target-realm.yml. |
| LDAP and realm auth credentials (infra/ansible/roles/keycloak/tasks/setup-ldap.yml, infra/ansible/roles/keycloak/tasks/setup-target-realm.yml) | LDAP and target realm setup now use keycloak_local_admin credentials instead of keycloak_bootstrap. |

Electricity SharePoint ingest and DLT write updates

| Layer / File(s) | Summary |
|---|---|
| CSV parsing and section detection (warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py) | Updates CSV constants, section detection, CSV parsing, and the Excel reader comment while keeping UTC timestamp handling. |
| DLT upsert and backfill handling (warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py) | Switches rdm_data to decorator-level upsert semantics and adds backfill_glob for backfill file selection. |
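
Why upsert semantics keep rdm_data free of duplicates when a backfill overlaps data that is already loaded can be shown with a small standalone sketch. It uses the standard library `sqlite3` module instead of DLT, so it illustrates the semantics only and is not the pipeline's write path. Keying on `date_time` and the `power` column name are assumptions made for the example.

```python
import sqlite3


def upsert_readings(conn, rows):
    """Insert (date_time, power) rows, overwriting any row with the same date_time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rdm_data (date_time TEXT PRIMARY KEY, power REAL)"
    )
    conn.executemany(
        "INSERT INTO rdm_data (date_time, power) VALUES (?, ?) "
        "ON CONFLICT(date_time) DO UPDATE SET power = excluded.power",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
upsert_readings(conn, [("2025-03-01T00:00Z", 1.0), ("2025-03-01T00:15Z", 1.2)])
# The second run overlaps the first: one existing timestamp plus one new one.
upsert_readings(conn, [("2025-03-01T00:15Z", 1.3), ("2025-03-01T00:30Z", 1.1)])

row_count = conn.execute("SELECT COUNT(*) FROM rdm_data").fetchone()[0]
updated = conn.execute(
    "SELECT power FROM rdm_data WHERE date_time = ?", ("2025-03-01T00:15Z",)
).fetchone()[0]
print(row_count, updated)
```

The overlapping timestamp is updated in place, so the table holds three rows rather than four after both runs.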

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Out of Scope Changes check | ⚠️ Warning | The Keycloak Ansible changes are unrelated to the electricity backfill objective and appear out of scope. | Split the Keycloak bootstrap/admin changes into a separate PR and keep this change focused on the electricity ingestion backfill work. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: adding backfill support for manually exported missing electricity data. |
| Linked Issues check | ✅ Passed | The electricity ingest changes address #362 by supporting manual exports and backfill of missing data. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
infra/ansible/roles/keycloak/tasks/main.yml (1)

65-171: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Make Keycloak bootstrap resumable after a partial run

local_admin_user_exists only checks whether the permanent admin can log in. If the play fails after bootstrap-admin user has created temp-admin but before the permanent admin is fully created or disabled, the next run will enter bootstrap-admin again and Keycloak will reject it because that user already exists. Detect the existing bootstrap account, or treat the "already exists" case as success, so reruns can continue from the incomplete state.
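
The "treat already exists as success" option suggested above is a general idempotency pattern. The sketch below shows it in Python rather than Ansible, purely as an illustration. `AlreadyExistsError`, `ensure_created`, `create_account`, and the `permanent-admin` name are all hypothetical and do not correspond to Keycloak or Ansible APIs.

```python
class AlreadyExistsError(Exception):
    """Hypothetical error raised when the account is already present."""


def ensure_created(create, name):
    """Run create(name) and report whether the account was newly created.

    An "already exists" failure is treated as success, so a rerun after a
    partial failure can continue to the next step instead of aborting.
    """
    try:
        create(name)
        return True
    except AlreadyExistsError:
        return False


# State left behind by an earlier run that failed after creating temp-admin.
existing = {"temp-admin"}


def create_account(name):
    if name in existing:
        raise AlreadyExistsError(name)
    existing.add(name)


# The rerun does not fail on temp-admin and still creates the permanent admin.
created_temp = ensure_created(create_account, "temp-admin")
created_admin = ensure_created(create_account, "permanent-admin")
print(created_temp, created_admin)
```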

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/ansible/roles/keycloak/tasks/main.yml` around lines 65 - 171, The
Keycloak bootstrap flow is not resumable because the Bootstrap Keycloak admin
step always reruns when local_admin_user_exists is false, even if the temp
bootstrap account already exists. Update the task sequence around the Bootstrap
Keycloak admin / Create permanent admin user / Disable temp-admin bootstrap
account steps to detect the existing bootstrap account or treat the “already
exists” response from bootstrap-admin as success. Ensure reruns can proceed to
the permanent admin creation and role mapping instead of failing on an
already-created temp-admin user.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py`:
- Around line 76-81: The manual electricity sharepoint ingest is leaving the
surviving power column under the raw header instead of the landing schema field,
which causes downstream nulls. Update the normalization in
electricity_sharepoint.py so the manual header/value mapping in the ingestion
path uses the existing landing column name isis_elec_total_power_mwx; check the
parsing/renaming logic around CSV_PREAMBLE_ANCHOR and the manual row handling,
and ensure the renamed column is emitted consistently in the shared staging
output.
- Around line 94-95: Replace the assert-based column count check in the
electricity_sharepoint ingestion flow with explicit CSV-shape validation and
recovery. In the section that builds cols from df.columns, detect the unexpected
number of columns with a normal conditional, log or record the malformed export,
and skip that section instead of raising AssertionError. Keep the behavior
aligned with the existing recoverable timestamp parse handling in the same
ingest path so one bad file does not abort the whole ingestion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1a8d8b9a-bbd5-488f-92ab-17478424bf73

📥 Commits

Reviewing files that changed from the base of the PR and between eefbc13 and 8f81dff.

📒 Files selected for processing (6)
  • infra/ansible/group_vars/all/all.yml
  • infra/ansible/group_vars/keycloak.yml
  • infra/ansible/roles/keycloak/tasks/main.yml
  • infra/ansible/roles/keycloak/tasks/setup-ldap.yml
  • infra/ansible/roles/keycloak/tasks/setup-target-realm.yml
  • warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py

Comment on lines +94 to +95
cols = [c for c in df.columns]
assert len(cols) == 3

Contributor

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Replace the assertion with recoverable CSV-shape validation.

assert can be stripped in optimised Python, and an uncaught AssertionError will abort ingestion for one malformed export instead of skipping that section like timestamp parse failures.

Proposed fix
     cols = [c for c in df.columns]
-    assert len(cols) == 3
+    if len(cols) != 3:
+        logger.warning(
+            f"Error loading section of {file_name!r}: expected 3 columns, got {len(cols)}. "
+            "There will be gaps in the data."
+        )
+        return None
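
The skip-instead-of-abort behaviour behind the proposed fix can be sketched without pandas. This is an illustrative stand-in using the standard library `csv` module: `parse_section`, `parse_all`, and the column headers are hypothetical, and only the three-column expectation is taken from the reviewed code.

```python
import csv
import io
import logging

logger = logging.getLogger(__name__)
EXPECTED_COLUMNS = 3  # the column count checked in the reviewed code


def parse_section(text, file_name):
    """Parse one CSV section; return its data rows, or None if the shape is wrong."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or len(rows[0]) != EXPECTED_COLUMNS:
        got = len(rows[0]) if rows else 0
        logger.warning(
            "Skipping section of %r: expected %d columns, got %d",
            file_name, EXPECTED_COLUMNS, got,
        )
        return None
    return rows[1:]  # drop the header row


def parse_all(sections, file_name):
    """Collect rows from every well-formed section, skipping malformed ones."""
    parsed = []
    for text in sections:
        rows = parse_section(text, file_name)
        if rows is not None:
            parsed.extend(rows)
    return parsed


good = "Time,Name,Value\n2025-03-01 00:00,meter,1.0\n"
bad = "Time,Value\n2025-03-01 00:15,1.2\n"
result = parse_all([good, bad], "export.csv")
print(result)
```

Unlike an `assert`, the explicit conditional still runs under `python -O`, and one malformed section leaves a gap rather than stopping the whole run.
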
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    cols = [c for c in df.columns]
-    assert len(cols) == 3
+    cols = [c for c in df.columns]
+    if len(cols) != 3:
+        logger.warning(
+            f"Error loading section of {file_name!r}: expected 3 columns, got {len(cols)}. "
+            "There will be gaps in the data."
+        )
+        return None

This ensures we can keep the order of the output frame correct.
Also, add a requirement for pandas >= 3 to ensure DST localization
errors are raised as ValueErrors, not pytz exceptions.
https://pandas.pydata.org/pandas-docs/stable/whatsnew/v3.0.0.html#time-zones-are-now-represented-by-standard-library-zoneinfo-instead-of-pytz-by-default
@martyngigg

Member Author

@coderabbitai review

coderabbitai Bot commented Jul 2, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py`:
- Line 172: The UTC conversion is being written to a literal COL_DATE_TIME
column instead of the actual date_time field, so update the assignment in
electricity_sharepoint.py to write back through the existing date_time/column
constant used by the ingestion flow. Locate the transformation around the df_raw
and to_utc logic and ensure the converted timestamps replace the source
date_time values before Excel ingestion continues.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fe08a736-ae63-47b7-ba19-74243b579f85

📥 Commits

Reviewing files that changed from the base of the PR and between 8f81dff and 5fa1a9f.

📒 Files selected for processing (1)
  • warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py

# See comment at the top of this describing the format
df_raw = pd.read_excel(file_content, engine=EXCEL_ENGINE, skiprows=EXCEL_SKIP_ROWS)
df_raw = df_raw.rename(columns={"Time": COL_DATE_TIME})
df_raw["COL_DATE_TIME"] = to_utc(df_raw[COL_DATE_TIME])

Contributor

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Assign the UTC conversion back to date_time.

Line 172 writes a new literal COL_DATE_TIME column, so Excel ingests leave the actual date_time field unconverted.

Proposed fix
-    df_raw["COL_DATE_TIME"] = to_utc(df_raw[COL_DATE_TIME])
+    df_raw[COL_DATE_TIME] = to_utc(df_raw[COL_DATE_TIME])
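
The difference between the quoted and unquoted key is easy to miss, so here is a minimal illustration using a plain dict in place of a DataFrame. It assumes `COL_DATE_TIME` holds the string `"date_time"`, which the review text implies but the page does not show. Appending `"Z"` stands in for the real `to_utc` conversion.

```python
COL_DATE_TIME = "date_time"  # assumed value of the constant

row = {"date_time": "2025-03-01 00:00"}

# Quoted name: adds a new key literally called "COL_DATE_TIME".
buggy = dict(row)
buggy["COL_DATE_TIME"] = buggy[COL_DATE_TIME] + "Z"

# Unquoted constant: overwrites the existing date_time value.
fixed = dict(row)
fixed[COL_DATE_TIME] = fixed[COL_DATE_TIME] + "Z"

print(sorted(buggy))
print(fixed)
```

In the buggy variant the original `date_time` value is left unconverted and a stray extra column appears, which matches the behaviour described in the review comment.
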
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    df_raw["COL_DATE_TIME"] = to_utc(df_raw[COL_DATE_TIME])
+    df_raw[COL_DATE_TIME] = to_utc(df_raw[COL_DATE_TIME])

Provides an easier method of debugging issues
Development

Successfully merging this pull request may close these issues.

ISIS electricity consumption data source has stopped producing data

1 participant