feat(warehouses): Add support to backfill manually exported missing electricity data#381
feat(warehouses): Add support to backfill manually exported missing electricity data#381martyngigg wants to merge 5 commits into
Conversation
…e bootstrapped admin (#377) ### Summary Keycloak recommend disabling any bootstrapped admin account and creating a new permanent one - there is a banner if you log in with the temporary one. These changes configure the master realm with a new admin account - the credentials are in the Vault.
Also includes a refactor to simplify resource handling and reduce code duplication
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughKeycloak now boots a temporary admin only when needed, creates a permanent local admin, and uses that account for LDAP and realm setup. The electricity SharePoint ingest updates CSV parsing and section detection, and refactors the DLT write path and backfill file selection. ChangesKeycloak bootstrap and admin rotation
Electricity SharePoint ingest and DLT write updates
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
infra/ansible/roles/keycloak/tasks/main.yml (1)
65-171: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftMake Keycloak bootstrap resumable after a partial run
local_admin_user_existsonly checks whether the permanent admin can log in. If the play fails afterbootstrap-admin userhas createdtemp-adminbut before the permanent admin is fully created or disabled, the next run will enterbootstrap-adminagain and Keycloak will reject it because that user already exists. Detect the existing bootstrap account, or treat the "already exists" case as success, so reruns can continue from the incomplete state.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/ansible/roles/keycloak/tasks/main.yml` around lines 65 - 171, The Keycloak bootstrap flow is not resumable because the Bootstrap Keycloak admin step always reruns when local_admin_user_exists is false, even if the temp bootstrap account already exists. Update the task sequence around the Bootstrap Keycloak admin / Create permanent admin user / Disable temp-admin bootstrap account steps to detect the existing bootstrap account or treat the “already exists” response from bootstrap-admin as success. Ensure reruns can proceed to the permanent admin creation and role mapping instead of failing on an already-created temp-admin user.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py`:
- Around line 76-81: The manual electricity sharepoint ingest is leaving the
surviving power column under the raw header instead of the landing schema field,
which causes downstream nulls. Update the normalization in
electricity_sharepoint.py so the manual header/value mapping in the ingestion
path uses the existing landing column name isis_elec_total_power_mwx; check the
parsing/renaming logic around CSV_PREAMBLE_ANCHOR and the manual row handling,
and ensure the renamed column is emitted consistently in the shared staging
output.
- Around line 94-95: Replace the assert-based column count check in the
electricity_sharepoint ingestion flow with explicit CSV-shape validation and
recovery. In the section that builds cols from df.columns, detect the unexpected
number of columns with a normal conditional, log or record the malformed export,
and skip that section instead of raising AssertionError. Keep the behavior
aligned with the existing recoverable timestamp parse handling in the same
ingest path so one bad file does not abort the whole ingestion.
---
Outside diff comments:
In `@infra/ansible/roles/keycloak/tasks/main.yml`:
- Around line 65-171: The Keycloak bootstrap flow is not resumable because the
Bootstrap Keycloak admin step always reruns when local_admin_user_exists is
false, even if the temp bootstrap account already exists. Update the task
sequence around the Bootstrap Keycloak admin / Create permanent admin user /
Disable temp-admin bootstrap account steps to detect the existing bootstrap
account or treat the “already exists” response from bootstrap-admin as success.
Ensure reruns can proceed to the permanent admin creation and role mapping
instead of failing on an already-created temp-admin user.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1a8d8b9a-bbd5-488f-92ab-17478424bf73
📒 Files selected for processing (6)
infra/ansible/group_vars/all/all.ymlinfra/ansible/group_vars/keycloak.ymlinfra/ansible/roles/keycloak/tasks/main.ymlinfra/ansible/roles/keycloak/tasks/setup-ldap.ymlinfra/ansible/roles/keycloak/tasks/setup-target-realm.ymlwarehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py
| cols = [c for c in df.columns] | ||
| assert len(cols) == 3 |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Replace the assertion with recoverable CSV-shape validation.
assert can be stripped in optimised Python, and an uncaught AssertionError will abort ingestion for one malformed export instead of skipping that section like timestamp parse failures.
Proposed fix
cols = [c for c in df.columns]
- assert len(cols) == 3
+ if len(cols) != 3:
+ logger.warning(
+ f"Error loading section of {file_name!r}: expected 3 columns, got {len(cols)}. "
+ "There will be gaps in the data."
+ )
+ return None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cols = [c for c in df.columns] | |
| assert len(cols) == 3 | |
| cols = [c for c in df.columns] | |
| if len(cols) != 3: | |
| logger.warning( | |
| f"Error loading section of {file_name!r}: expected 3 columns, got {len(cols)}. " | |
| "There will be gaps in the data." | |
| ) | |
| return None |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py`
around lines 94 - 95, Replace the assert-based column count check in the
electricity_sharepoint ingestion flow with explicit CSV-shape validation and
recovery. In the section that builds cols from df.columns, detect the unexpected
number of columns with a normal conditional, log or record the malformed export,
and skip that section instead of raising AssertionError. Keep the behavior
aligned with the existing recoverable timestamp parse handling in the same
ingest path so one bad file does not abort the whole ingestion.
This ensures we can keep the order of the output frame correct. Also, add requirement for pandas >= 3 to ensure DST localization errors are produced as ValueErrors no pytz.TimeError exceptions. https://pandas.pydata.org/pandas-docs/stable/whatsnew/v3.0.0.html#time-zones-are-now-represented-by-standard-library-zoneinfo-instead-of-pytz-by-default
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py`:
- Line 172: The UTC conversion is being written to a literal COL_DATE_TIME
column instead of the actual date_time field, so update the assignment in
electricity_sharepoint.py to write back through the existing date_time/column
constant used by the ingestion flow. Locate the transformation around the df_raw
and to_utc logic and ensure the converted timestamps replace the source
date_time values before Excel ingestion continues.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fe08a736-ae63-47b7-ba19-74243b579f85
📒 Files selected for processing (1)
warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py
| # See comment at the top of this describing the format | ||
| df_raw = pd.read_excel(file_content, engine=EXCEL_ENGINE, skiprows=EXCEL_SKIP_ROWS) | ||
| df_raw = df_raw.rename(columns={"Time": COL_DATE_TIME}) | ||
| df_raw["COL_DATE_TIME"] = to_utc(df_raw[COL_DATE_TIME]) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Assign the UTC conversion back to date_time.
Line 172 writes a new literal COL_DATE_TIME column, so Excel ingests leave the actual date_time field unconverted.
Proposed fix
- df_raw["COL_DATE_TIME"] = to_utc(df_raw[COL_DATE_TIME])
+ df_raw[COL_DATE_TIME] = to_utc(df_raw[COL_DATE_TIME])📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| df_raw["COL_DATE_TIME"] = to_utc(df_raw[COL_DATE_TIME]) | |
| df_raw[COL_DATE_TIME] = to_utc(df_raw[COL_DATE_TIME]) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py`
at line 172, The UTC conversion is being written to a literal COL_DATE_TIME
column instead of the actual date_time field, so update the assignment in
electricity_sharepoint.py to write back through the existing date_time/column
constant used by the ingestion flow. Locate the transformation around the df_raw
and to_utc logic and ensure the converted timestamps replace the source
date_time values before Excel ingestion continues.
Provides an easier method of debugging issues
Summary
Updates the electricity data ingestion script to accept files that have been manually exported from the RDM system.They seem to come in different format of CSV file from the automatically created files. A new option to overwrite the backfill glob patterns allows a single run of the script against a smaller subset of exported data rather than having to run the entire backfill again.
This also now requires Pandas >= 3.0 due to a breaking change in 3.0 around DST error handling: Additionally, pandas no longer throws pytz exceptions for timezone operations leading to ambiguous or nonexistent times. These cases will now raise a ValueError.
This has been checked locally by first running the script as it is on
mainand then running with the version on this branch and checking no duplication has occurred.Fixes #362
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
rdm_dataconsistent.