Spark 4+ Array Indexing Compatibility
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala (1)
Lines 198-203: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The `path.contains('[')` log-path selector misses Spark 4 `get(…)` expressions.

In `flattenGroup`, when a `path` arrives from `flattenStructArray` or `flattenNestedArrays` on Spark 4, it looks like ``get(parent`field`, 0).`` (no `[` present), so `path.contains('[')` evaluates to `false` and the log entry uses `col(...)` syntax even though the path is a SQL expression that `col` cannot evaluate. This only affects the `stringFields` log output (line 209), not actual execution, but the logged "flattening code" becomes incorrect and misleading for Spark 4 paths.

🔧 Proposed fix

```diff
- if (path.contains('['))
+ if (path.contains('[') || path.contains('('))
    stringFields += s"""expr("$path`${field.name}` AS `$newFieldName`")"""
  else
    stringFields += s"""col("$path`${field.name}`").as("$newFieldName")"""
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala` around lines 198 - 203, The log-path selector in flattenGroup currently checks only path.contains('[') so Spark 4 SQL-style paths like get(parent`field`, 0) are misclassified; update the condition that decides whether to append an expr(...) vs col(...) to stringFields to also detect SQL/get-style expressions (for example check path.contains("get(") or otherwise detect '(' in the path) and use the expr(...) branch for those cases; locate the logic in flattenGroup where stringFields is appended (related symbols: flattenGroup, flattenStructArray, flattenNestedArrays, path, stringFields, fields, getNewFieldName) and change the conditional so Spark 4 get(...) expressions are logged with expr(...) rather than col(...).
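The widened condition can be illustrated in isolation. The sketch below is Spark-free and hypothetical: `pathToLogSelector` is an invented name for the branch decision, and the real code appends full `expr(...)`/`col(...)` strings rather than returning a tag.

```scala
// Hypothetical, simplified version of the flattenGroup log-path selector.
// A path containing '[' (Spark 3 bracket indexing) or '(' (Spark 4
// get(...) calls) is a SQL expression and must be logged via expr(...);
// anything else is a plain column reference, safe to log via col(...).
def pathToLogSelector(path: String): String =
  if (path.contains('[') || path.contains('(')) "expr" else "col"

// Spark 3 style bracket path: classified as expr by both old and new checks
assert(pathToLogSelector("STRUCT1.ARR[0].") == "expr")
// Spark 4 style get(...) path: missed by the '['-only check, caught by '('
assert(pathToLogSelector("get(STRUCT1.`ARR`, 0).") == "expr")
// Plain struct path: still logged via col(...)
assert(pathToLogSelector("STRUCT1.FIELD.") == "col")
```

Checking for `'('` rather than the literal `"get("` keeps the condition robust if other SQL function calls ever appear in a flattening path.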
📒 Files selected for processing (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala
```diff
            case _ =>
              val newFieldNamePrefix = s"${fieldNamePrefix}${i}"
              val newFieldName = getNewFieldName(s"$newFieldNamePrefix")
              fields += expr(s"$path`${structField.name}`[$i]").as(newFieldName, structField.metadata)
-             stringFields += s"""expr("$path`${structField.name}`[$i] AS `$newFieldName`")"""
+             stringFields += s"""expr("${getArrayIndexExpr(fieldName, i)} AS `$newFieldName`")"""
```
`fields` still uses bracket indexing for primitive arrays: the Spark 4 fix is incomplete here.

In `flattenStructArray`'s primitive branch (line 132), only `stringFields` (the log representation) was updated to use `getArrayIndexExpr`, but the actual Column added to `fields`, which is what `df.select` executes, still uses bracket indexing:

```scala
fields += expr(s"$path`${structField.name}`[$i]").as(newFieldName, structField.metadata)
```

On Spark 4 this will throw `[INVALID_ARRAY_INDEX]` exactly as the comment above describes, because `fields` is what gets evaluated during execution. `stringFields` is only used for the `logger.info` call at line 209.

Compare with the correctly fixed `flattenNestedArrays` at line 154, where `fields` itself uses `getArrayIndexExpr`.

🐛 Proposed fix

```diff
            case _ =>
              val newFieldNamePrefix = s"${fieldNamePrefix}${i}"
              val newFieldName = getNewFieldName(s"$newFieldNamePrefix")
-             fields += expr(s"$path`${structField.name}`[$i]").as(newFieldName, structField.metadata)
+             fields += expr(getArrayIndexExpr(fieldName, i)).as(newFieldName, structField.metadata)
              stringFields += s"""expr("${getArrayIndexExpr(fieldName, i)} AS `$newFieldName`")"""
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/SparkUtils.scala`
around lines 129 - 133, In flattenStructArray's primitive-array branch the
actual Column added to fields still uses bracket indexing
(expr(s"$path`${structField.name}`[$i]")) which fails on Spark 4; change the
Column expression to use getArrayIndexExpr(fieldName, i) like stringFields does
so fields and stringFields are consistent; update the fields += expr(...) call
inside flattenStructArray to build the Column via
expr(getArrayIndexExpr(fieldName, i)) and keep the existing
.as(getNewFieldName(...), structField.metadata) so the selection matches the
logger output.
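Both findings hinge on `getArrayIndexExpr` producing a version-appropriate indexing expression. Its implementation is not shown in this review; the following is a hedged sketch assuming it switches on the Spark major version (the parameter names and the explicit version argument are inventions for illustration). The motivation: under Spark 4's default ANSI mode, `arr[i]` throws `INVALID_ARRAY_INDEX` for out-of-range indices, while `get(arr, i)` returns NULL, which matches the pre-4 behavior the flattening logic relies on for variable-length OCCURS arrays.

```scala
// Hypothetical version-aware builder for an array index SQL expression.
// Spark 4 (ANSI mode by default): arr[i] throws INVALID_ARRAY_INDEX when
// i is out of bounds, so get(arr, i), which returns NULL instead, is used.
// Spark 3 and earlier: plain bracket indexing already returns null.
def getArrayIndexExpr(fieldExpr: String, i: Int, sparkMajorVersion: Int): String =
  if (sparkMajorVersion >= 4) s"get($fieldExpr, $i)"
  else s"$fieldExpr[$i]"

assert(getArrayIndexExpr("`ARR`", 0, 4) == "get(`ARR`, 0)")
assert(getArrayIndexExpr("`ARR`", 0, 3) == "`ARR`[0]")
```

Because both `fields` (executed) and `stringFields` (logged) are built from the same helper output, the selection and the logged flattening code stay consistent, which is exactly what the proposed fix above restores.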
JaCoCo code coverage report - 'cobol-parser'
JaCoCo code coverage report - 'spark-cobol'