
feat: add TextAnalytics custom component#29

Open
andreil-spec wants to merge 1 commit into main from
feat/coderabbit-test-text-analytics-component

Conversation

@andreil-spec
Collaborator

@andreil-spec andreil-spec commented Feb 12, 2026

Add a new CloudGeometry component that analyzes text input and returns statistics including word count, character count, sentence count, reading time estimate, average word/sentence length, and most frequent words with configurable stop words and minimum word length filtering.

Summary by CodeRabbit

  • New Features
    • Added Text Analytics component for analyzing text content, providing metrics including word/character counts, sentence/paragraph counts, average word/sentence length, top word frequencies, and optional reading time estimation. Supports customizable stop words and minimum word length filtering.

Add a new CloudGeometry component that analyzes text input and returns
statistics including word count, character count, sentence count,
reading time estimate, average word/sentence length, and most frequent
words with configurable stop words and minimum word length filtering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@langbuilder/src/backend/base/langbuilder/components/cloudgeometry/text_analytics.py`:
- Around line 189-190: Remove the unnecessary f-string prefixes on the two
string literals (f"**Text Analytics Summary**" and f"") in text_analytics.py and
replace them with plain string literals ("**Text Analytics Summary**" and "");
do the same for the other occurrences flagged (the similar pair around lines
203-204) so only strings with actual interpolations remain as f-strings.
- Around line 133-134: The current assignment uses truthy checks that override
valid zero inputs: replace the fallbacks using explicit None checks so a
user-set 0 is preserved; specifically, change the assignments for min_len and
top_n in the method where min_len = self.min_word_length or 3 and top_n =
self.top_n_words or 10 to use conditional expressions that test "is None" (e.g.,
set min_len to 3 only if self.min_word_length is None, and set top_n to 10 only
if self.top_n_words is None) so the class fields min_word_length and top_n_words
retain 0 when explicitly provided.
🧹 Nitpick comments (3)
langbuilder/src/backend/base/langbuilder/components/cloudgeometry/text_analytics.py (3)

184-186: Double computation: _compute_analytics() is called independently by each output method.

When both analyze_text and build_summary outputs are connected, the entire analysis runs twice. Consider caching the result (e.g., with functools.lru_cache or a simple instance-level cache) to avoid redundant work on large texts.

Sketch: simple instance-level caching
+    def _compute_analytics(self) -> dict:
+        if hasattr(self, "_cached_analytics"):
+            return self._cached_analytics
         text = self._get_text()
         ...
+        self._cached_analytics = analytics
         return analytics

98-100: Tokenizer drops all non-ASCII characters (accented letters, CJK, Cyrillic, etc.).

The regex [a-zA-Z0-9]+ silently discards words like "café", "naïve", or any non-Latin text. If this component is intended for English-only text, a brief docstring or input info note would be helpful. Otherwise, consider using \w+ or a Unicode-aware pattern.
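To see the effect in isolation (a standalone sketch, not code from the component):

```python
import re

sample = "café naïve Москва hello42"

# ASCII-only pattern: accented and non-Latin characters act as separators,
# so 'café' and 'naïve' are fragmented and 'Москва' disappears entirely
ascii_tokens = re.findall(r"[a-zA-Z0-9]+", sample)
# -> ['caf', 'na', 've', 'hello42']

# Unicode-aware pattern: \w matches Unicode word characters in Python 3
unicode_tokens = re.findall(r"\w+", sample)
# -> ['café', 'naïve', 'Москва', 'hello42']
```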


102-105: Sentence count may over-count for abbreviations and ellipses.

Text like "Dr. Smith went to Washington... He arrived at 5 p.m." would be split on every ., producing inflated sentence counts. The [.!?]+ pattern collapses consecutive punctuation (good for ...), but isolated periods in abbreviations still cause false splits. This is a known limitation worth documenting, or consider a heuristic like requiring whitespace + uppercase after the delimiter.
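The whitespace-plus-uppercase heuristic can be sketched as follows (standalone example, not the component's code):

```python
import re

text = "He arrived at 5 p.m. sharp... Then he left."

# Naive: every run of [.!?] ends a sentence, so "p.m." causes false splits
naive = [s for s in re.split(r"[.!?]+", text) if s.strip()]
# -> 4 pieces: ['He arrived at 5 p', 'm', ' sharp', ' Then he left']

# Heuristic: split only where the terminator is followed by whitespace
# and an uppercase letter
better = [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s.strip()]
# -> 2 sentences: ['He arrived at 5 p.m. sharp...', 'Then he left.']
```

Note the heuristic still mis-splits abbreviations followed by capitalized names ("Dr. Smith"), so documenting the limitation is worthwhile either way.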

Comment on lines +133 to +134
min_len = self.min_word_length or 3
top_n = self.top_n_words or 10
Contributor


⚠️ Potential issue | 🟠 Major

Falsy-value trap: 0 is a valid user input that will be silently overridden.

self.min_word_length or 3 falls back to 3 when the user explicitly sets min_word_length=0 (meaning "no minimum"), and self.top_n_words or 10 falls back to 10 when the user sets top_n_words=0. Use an explicit None check instead.
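The trap is easy to demonstrate in isolation:

```python
min_word_length = 0  # user explicitly requests "no minimum"

# Truthy fallback: 0 is falsy, so the default silently wins
assert (min_word_length or 3) == 3

# Explicit None check: 0 survives; only a missing value falls back
assert (min_word_length if min_word_length is not None else 3) == 0
```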

Proposed fix
-        min_len = self.min_word_length or 3
-        top_n = self.top_n_words or 10
+        min_len = self.min_word_length if self.min_word_length is not None else 3
+        top_n = self.top_n_words if self.top_n_words is not None else 10

Comment on lines +189 to +190
f"**Text Analytics Summary**",
f"",
Contributor


⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefixes on strings without placeholders.

Ruff F541 flagged these lines. The f"" prefix is misleading when there are no interpolations.

Proposed fix
         lines = [
-            f"**Text Analytics Summary**",
-            f"",
+            "**Text Analytics Summary**",
+            "",
             f"- Words: {analytics['word_count']}",
             ...
         ]
         ...
         if analytics["top_words"]:
-            lines.append(f"")
-            lines.append(f"**Top Words:**")
+            lines.append("")
+            lines.append("**Top Words:**")

Also applies to: 203-204

🧰 Tools
🪛 Ruff (0.15.0)

[error] 189-189: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 190-190: f-string without any placeholders

Remove extraneous f prefix

(F541)


@andreil-spec
Collaborator Author

@coderabbitai Review all security issues and flag them if any

@coderabbitai
Contributor

coderabbitai bot commented Feb 12, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the text analytics component implementation
cat langbuilder/src/backend/base/langbuilder/components/cloudgeometry/text_analytics.py

# Also read the __init__.py file to understand the exports
cat langbuilder/src/backend/base/langbuilder/components/cloudgeometry/__init__.py

Length of output: 10069


@andreil-spec I've completed a comprehensive security review of the TextAnalytics component. Here are my findings:

🚨 Security Issues Identified

1. CRITICAL: Resource Exhaustion / Denial of Service

Location: text_analytics.py - input_text parameter (Lines 28-33)

Issue: No input length validation on the text being analyzed. An attacker could provide:

  • Extremely large text inputs (megabytes/gigabytes)
  • Text designed to exhaust memory during tokenization
  • Input causing excessive CPU usage in regex operations

Impact:

  • Memory exhaustion from storing large text and word lists
  • CPU exhaustion from regex operations on unbounded input
  • Counter object in _get_word_frequencies (Line 137) consuming excessive memory
  • Application becomes unresponsive or crashes

Recommendation:

MAX_TEXT_LENGTH = 1_000_000  # 1MB character limit

def _get_text(self) -> str:
    raw = self.input_text
    if hasattr(raw, "text"):
        raw = raw.text
    text = str(raw)
    
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"Input text exceeds maximum length of {MAX_TEXT_LENGTH} characters")
    
    return text

2. MEDIUM: Integer Parameter Validation

Location: Lines 34-50 (top_n_words and min_word_length)

Issue: No validation on integer inputs:

  • top_n_words could be negative or excessively large (e.g., 1 billion)
  • min_word_length could be negative or zero

Impact:

  • Negative top_n_words could cause unexpected behavior in Counter.most_common()
  • Excessively large top_n_words wastes resources returning unnecessary data
  • Negative min_word_length bypasses filtering logic (Line 134)

Recommendation:

def _get_word_frequencies(self, words: list[str]) -> list[dict]:
    stop_words = self._get_stop_words()
    # Explicit None checks preserve a user-set 0; clamping then enforces safe ranges
    min_len = max(0, self.min_word_length if self.min_word_length is not None else 3)
    top_n = min(max(1, self.top_n_words if self.top_n_words is not None else 10), 100)  # clamp to 1-100

    # ... rest of implementation

3. LOW: Regex Performance Concern

Location: Line 95 - _count_paragraphs

Issue: The pattern r'\n\s*\n' could be slow with maliciously crafted input containing thousands of consecutive whitespace characters.

Impact: Minor CPU performance degradation with crafted input

Recommendation:

def _count_paragraphs(self, text: str) -> int:
    # Limit whitespace matching to prevent regex slowdown
    paragraphs = re.split(r'\n\s{0,100}\n', text)
    paragraphs = [p.strip() for p in paragraphs if p.strip()]
    return len(paragraphs)

✅ Security Positives

  1. Division by Zero Protected: All division operations properly check for zero denominators (Lines 118, 119, 139)
  2. No Code Injection: Text is treated as data only, no execution paths
  3. No Sensitive Data Exposure: Component doesn't handle credentials or sensitive information
  4. Safe Regex Patterns: Tokenization regex is not vulnerable to catastrophic backtracking

📋 Recommended Actions

Priority 1 (Critical):

  • Add maximum input text length validation (1MB recommended)
  • Add early validation in analyze_text() before processing

Priority 2 (Medium):

  • Validate and clamp integer parameters to reasonable ranges
  • Add rate limiting if this component is exposed via API

Priority 3 (Low):

  • Optimize paragraph counting regex
  • Consider adding component-level timeout for large text processing

Would you like me to generate a complete implementation with all security fixes applied?

@andreil-spec
Collaborator Author

@coderabbitai please do add the required changes

@coderabbitai
Contributor

coderabbitai bot commented Feb 12, 2026


@andreil-spec I'll implement the security fixes for the TextAnalytics component. Let me plan the required changes:

Task: Implement Security Fixes for TextAnalytics Component

  • Add input length validation to prevent resource exhaustion
  • Add integer parameter validation and clamping
  • Optimize regex pattern for paragraph counting
  • Add constant definitions for security limits

💡 Use @coderabbitai plan to create a plan for code changes

@CloudGeometry CloudGeometry deleted a comment from coderabbitai bot Feb 13, 2026
@CloudGeometry CloudGeometry deleted a comment from coderabbitai bot Feb 13, 2026
@CloudGeometry CloudGeometry deleted a comment from coderabbitai bot Feb 13, 2026
@andreil-spec
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Feb 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai bot commented Feb 13, 2026

Walkthrough

Introduces TextAnalyticsComponent, a new text analysis component that processes input text to compute metrics including word count, character counts, sentence/paragraph counts, reading time estimates, and top word frequencies. Also updates the cloudgeometry module to expose this new component for public access.

Changes

  • Module Exposure — langbuilder/src/backend/base/langbuilder/components/cloudgeometry/__init__.py: Added TextAnalyticsComponent to TYPE_CHECKING imports, the dynamic imports map, and the __all__ list for public module exposure.
  • New Component Implementation — langbuilder/src/backend/base/langbuilder/components/cloudgeometry/text_analytics.py: Implements TextAnalyticsComponent with inputs for text, word filtering, reading time, and stop words; outputs analytics data and a summary message. Includes private helpers for tokenization, metric computation, frequency analysis, and reading time estimation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 20.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Merge Conflict Detection — ⚠️ Warning. Merge conflicts detected (30 files):

⚔️ .cg-aix-sdlc/docs/VALIDATION-REPORT.md (content)
⚔️ .cg-aix-sdlc/docs/VALIDATION-REPORT.md~HEAD (content)
⚔️ .cg-aix-sdlc/docs/architecture/README.md (content)
⚔️ .cg-aix-sdlc/docs/architecture/c4-container.md (content)
⚔️ .cg-aix-sdlc/docs/architecture/c4-context.md (content)
⚔️ .cg-aix-sdlc/docs/architecture/data-architecture.md (content)
⚔️ .cg-aix-sdlc/docs/architecture/deployment-topology.md (content)
⚔️ .cg-aix-sdlc/docs/architecture/patterns-and-principles.md (content)
⚔️ .cg-aix-sdlc/docs/architecture/system-architecture.md (content)
⚔️ .cg-aix-sdlc/docs/inventory/_profiles/foundation-context.json (content)
⚔️ .cg-aix-sdlc/docs/inventory/_profiles/tech-profile.json (content)
⚔️ .cg-aix-sdlc/docs/inventory/api-surface.md (content)
⚔️ .cg-aix-sdlc/docs/inventory/configuration-index.md (content)
⚔️ .cg-aix-sdlc/docs/inventory/core-metadata.json (content)
⚔️ .cg-aix-sdlc/docs/inventory/database-schemas.md (content)
⚔️ .cg-aix-sdlc/docs/inventory/integration-map.md (content)
⚔️ .cg-aix-sdlc/docs/inventory/repository-map.md (content)
⚔️ .cg-aix-sdlc/docs/inventory/service-catalog.md (content)
⚔️ .cg-aix-sdlc/docs/inventory/technology-stack.md (content)
⚔️ .cg-aix-sdlc/docs/product/EXECUTIVE-SUMMARY.md (content)
⚔️ .cg-aix-sdlc/docs/product/PRODUCT-POSITIONING.md (content)
⚔️ .cg-aix-sdlc/docs/product/README.md (content)
⚔️ .cg-aix-sdlc/docs/product/business-model.md (content)
⚔️ .cg-aix-sdlc/docs/product/capabilities-matrix.md (content)
⚔️ .cg-aix-sdlc/docs/product/competitive-analysis-template.md (content)
⚔️ .cg-aix-sdlc/docs/product/feature-catalog.md (content)
⚔️ .cg-aix-sdlc/docs/product/integration-ecosystem.md (content)
⚔️ .cg-aix-sdlc/docs/product/roadmap-inputs.md (content)
⚔️ .cg-aix-sdlc/docs/progress.json (content)
⚔️ langbuilder/src/backend/base/langbuilder/components/cloudgeometry/__init__.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'feat: add TextAnalytics custom component' clearly and concisely summarizes the main change: introducing a new TextAnalyticsComponent to the cloudgeometry package.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@langbuilder/src/backend/base/langbuilder/components/cloudgeometry/text_analytics.py`:
- Around line 92-96: The _get_text method currently returns unbounded text and
must enforce a size guard to prevent resource-exhaustion; convert raw to a
string (as done now), then check its length against a defined MAX_TEXT_LENGTH
constant (add module-level MAX_TEXT_LENGTH, e.g. an explicit byte/char limit)
and if the input exceeds it either raise a specific exception (ValueError or a
custom InvalidInputError) or truncate safely before returning; implement the
check inside _get_text (after the hasattr(raw, "text") handling) and reference
the MAX_TEXT_LENGTH symbol in the validation and error message.
🧹 Nitpick comments (3)
langbuilder/src/backend/base/langbuilder/components/cloudgeometry/text_analytics.py (3)

150-157: _compute_analytics() runs the full analysis from scratch on every call — doubled work when both outputs are used.

Both analyze_text() and build_summary() independently invoke _compute_analytics(), re-running all regex work, tokenization, and frequency counting. If both outputs are wired in a flow, this is redundant.

Consider caching the result for the lifetime of a single component run (e.g., via a simple instance-level cache or by computing once and feeding the result into both output methods).

♻️ Sketch: compute-once pattern
+    def _get_or_compute_analytics(self) -> dict:
+        if not hasattr(self, "_cached_analytics"):
+            self._cached_analytics = self._compute_analytics()
+        return self._cached_analytics
+
     def analyze_text(self) -> Data:
         """Return full analytics as a Data object."""
-        analytics = self._compute_analytics()
+        analytics = self._get_or_compute_analytics()
         self.status = f"{analytics['word_count']} words, {analytics['sentence_count']} sentences"
         return Data(data=analytics)

     def build_summary(self) -> Message:
         """Return a human-readable summary message."""
-        analytics = self._compute_analytics()
+        analytics = self._get_or_compute_analytics()

155-155: Avoid allocating a full string copy just to count non-space characters.

text.replace(" ", "") creates a complete copy of the input string only to measure its length. For large inputs this is wasteful. A subtraction is O(n) with no extra allocation.

♻️ Proposed fix
-        char_no_spaces = len(text.replace(" ", ""))
+        char_no_spaces = char_count - text.count(" ")
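The equivalence of the two expressions is easy to verify on a small input (standalone check):

```python
text = "a b  c   d"  # 4 non-space characters, 6 spaces

# Both expressions count non-space characters, but the second one
# avoids allocating a full copy of the string
assert len(text.replace(" ", "")) == 4
assert len(text) - text.count(" ") == 4
```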

188-206: Minor: use list.extend for the top-words section and drop unnecessary f prefixes.

Ruff PERF401 flags the append-in-loop pattern on line 205–206. Combining this with the f"" prefix cleanup:

♻️ Proposed fix
         lines = [
-            f"**Text Analytics Summary**",
-            f"",
+            "**Text Analytics Summary**",
+            "",
             f"- Words: {analytics['word_count']}",
             f"- Characters: {analytics['character_count']} ({analytics['character_count_no_spaces']} without spaces)",
             f"- Sentences: {analytics['sentence_count']}",
             f"- Paragraphs: {analytics['paragraph_count']}",
             f"- Avg word length: {analytics['average_word_length']} chars",
             f"- Avg sentence length: {analytics['average_sentence_length']} words",
         ]

         if self.include_reading_time and "reading_time" in analytics:
             lines.append(f"- Reading time: {analytics['reading_time']['display']}")

         if analytics["top_words"]:
-            lines.append(f"")
-            lines.append(f"**Top Words:**")
-            for item in analytics["top_words"]:
-                lines.append(f"  - \"{item['word']}\" — {item['count']}x ({item['percentage']}%)")
+            lines.append("")
+            lines.append("**Top Words:**")
+            lines.extend(
+                f"  - \"{item['word']}\" — {item['count']}x ({item['percentage']}%)"
+                for item in analytics["top_words"]
+            )

Comment on lines +92 to +96
def _get_text(self) -> str:
    raw = self.input_text
    if hasattr(raw, "text"):
        raw = raw.text
    return str(raw)
Contributor


⚠️ Potential issue | 🔴 Critical

Missing input-length validation — resource-exhaustion / DoS risk.

_get_text accepts arbitrarily large inputs with no size guard. Downstream processing (regex tokenization, Counter, text.replace(...) copy, paragraph-splitting regex) will consume unbounded CPU and memory. Add an early length check here, since this is the single entry point for all text.

🛡️ Proposed fix
+    MAX_TEXT_LENGTH = 1_000_000  # class-level constant
+
     def _get_text(self) -> str:
         raw = self.input_text
         if hasattr(raw, "text"):
             raw = raw.text
-        return str(raw)
+        text = str(raw)
+        if len(text) > self.MAX_TEXT_LENGTH:
+            msg = f"Input text exceeds maximum allowed length of {self.MAX_TEXT_LENGTH:,} characters."
+            raise ValueError(msg)
+        return text
