Enabling OpenAIParser breaks iCalendar parsing #372

@AndriusV4

Environment

  • Python version: 3.12.8
  • circuit_maintenance_parser version: 2.10.0

Observed Behavior

Enabling OpenAIParser causes the ICal parser to fail on maintenance emails in iCalendar format.

Steps to Reproduce

  1. Set the PARSER_OPENAI_API_KEY environment variable so that OpenAIParser is added in a CombinedProcessor.
  2. Initialize any provider that sends emails in iCalendar format.
  3. Load an iCalendar email. The bug triggers only when the 'text/calendar' part does not contain the exact subject string; I have tested 3 different providers and none of them include the exact subject string in the iCal part.
  4. Try to generate Maintenance objects from it.

Example: the iCal email is parsed fine by default, but parsing breaks as soon as PARSER_OPENAI_API_KEY is set:

circuit-maintenance-parser --data-file arelion.eml --data-type email --provider-type arelion

WARNING:circuit_maintenance_parser.provider:Not matching any exclude filter expression for Planned Work PWIC277299 Notification from Arelion.
Circuit Maintenance Notification #0
{
  "account": "*",
  "circuits": [
    {
      "circuit_id": "*",
      "impact": "OUTAGE"
    }
  ],
  "end": 1772546400,
  "maintenance_id": "PWIC277299.1",
  "organizer": "mailto:support@arelion.com",
  "provider": "arelion.com",
  "sequence": 8,
  "stamp": 1770376260,
  "start": 1772510400,
  "status": "CONFIRMED",
  "summary": "PWIC277299.1: Planned work notification from Arelion",
  "uid": "PWIC277299.1@arelion.com"
}

export PARSER_OPENAI_API_KEY="*"

circuit-maintenance-parser --data-file arelion.eml --data-type email --provider-type arelion 

WARNING:circuit_maintenance_parser.provider:Not matching any exclude filter expression for Planned Work PWIC277299 Notification from Arelion.
ERROR:circuit_maintenance_parser.parsers.openai:Expecting value: line 1 column 1 (char 0)
Provider processing failed: Failed creating Maintenance notification for Arelion.
Details:
- Processor SimpleProcessor from Arelion failed due to: Content line could not be parsed into parts: 'Planned Work PWIC277299 Notification from Arelion': Planned Work PWIC277299 Notification from Arelion
- Processor CombinedProcessor from Arelion failed due to: 6 validation errors for Maintenance
account

When enabling OpenAIParser, I expected it to be used only as a last resort, when the other parsers fail to extract all the data required for the Maintenance notification object.
After checking the source code, however, I found that when OpenAIParser is appended to the list of processors, the data parts are also modified, and I assume those parts are later used by all parsers, not just the LLM one. #335

if os.getenv("PARSER_OPENAI_API_KEY"):
    self._processors.append(CombinedProcessor(data_parsers=[EmailDateParser, OpenAIParser]))
    # Add subject to all html or text/* data_parts if not already present.
    self.add_subject_to_text(data)

The add_subject_to_text() method appends a newline plus the subject to the end of every text/* or html data part that does not already contain that exact subject string.
This breaks parsing with the icalendar library, because email subjects are free-form and do not conform to the RFC 5545 content line format.
Example text/calendar data part:

BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Arelion//Change Management Tool
BEGIN:VEVENT
<...>
END:VEVENT
END:VCALENDAR

becomes:

BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Arelion//Change Management Tool
BEGIN:VEVENT
<...>
END:VEVENT
END:VCALENDAR
Planned Work PWIC277299 Notification from Arelion

The iCalendar parser tries to split each content line at the ':' separator, and fails on the last line, which was appended:

Traceback (most recent call last):
  File "/Users/user/Desktop/lab/python/circuit-maintenance-parser/circuit_maintenance_parser/parser.py", line 86, in parse
    result = self.parser_hook(raw, content_type)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/lab/python/circuit-maintenance-parser/circuit_maintenance_parser/parser.py", line 119, in parser_hook
    gcal = Calendar.from_ical(raw)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Library/Caches/pypoetry/virtualenvs/circuit-maintenance-parser-1EhK8r3--py3.12/lib/python3.12/site-packages/icalendar/cal.py", line 331, in from_ical
    name, params, vals = line.parts()
                         ^^^^^^^^^^^^
  File "/Users/user/Library/Caches/pypoetry/virtualenvs/circuit-maintenance-parser-1EhK8r3--py3.12/lib/python3.12/site-packages/icalendar/parser.py", line 353, in parts
    raise ValueError(
ValueError: Content line could not be parsed into parts: 'Planned Work PWIC277299 Notification from Arelion': Planned Work PWIC277299 Notification from Arelion
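
The failure mode can be illustrated without the icalendar library. The sketch below is a simplified stand-in, not the library's implementation: append_subject and first_invalid_content_line are hypothetical helpers, the VEVENT body is reduced to a single UID line, and the check only looks for the ':' separator that every content line needs, whereas a real parser also validates property names and parameters.

```python
from typing import Optional

CALENDAR = "\n".join(
    [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//Arelion//Change Management Tool",
        "BEGIN:VEVENT",
        "UID:PWIC277299.1@arelion.com",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
)
SUBJECT = "Planned Work PWIC277299 Notification from Arelion"


def append_subject(text: str, subject: str) -> str:
    """Hypothetical helper: add the subject on a new line unless already present."""
    if subject in text:
        return text
    return f"{text}\n{subject}"


def first_invalid_content_line(ical_text: str) -> Optional[str]:
    """Return the first content line lacking a ':' separator, or None."""
    for line in ical_text.splitlines():
        if not line or line[0] in " \t":
            continue  # skip blank lines and folded continuation lines
        if ":" not in line:
            return line
    return None


if __name__ == "__main__":
    print(first_invalid_content_line(CALENDAR))
    print(first_invalid_content_line(append_subject(CALENDAR, SUBJECT)))
```

The untouched calendar yields no offending line, while the modified one reports the appended subject, which is the same line named in the ValueError above.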

To work around this in my local environment, I changed

if part.type.startswith("text/") or part.type.startswith("html"):

to

if part.type.startswith(("text/", "html")) and part.type != "text/calendar":

so that the subject is not added to text/calendar parts, but I am not sure this is the best solution. (Maybe enabling the LLM parser shouldn't be able to interfere with other parsers at all?) If it is, let me know and I will raise a PR to implement it.
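
Both ideas can be sketched in isolation. This is a minimal illustration under stated assumptions, not the library's code: Part, should_append_subject, and parts_for_llm are hypothetical names, and part content is modeled as a plain string. The first function mirrors the local change; the second shows one way the subject could be added to copies intended only for the LLM parser, leaving the original parts untouched for the other parsers.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Part:
    """Hypothetical stand-in for an email data part (MIME type plus text)."""

    type: str
    content: str


def should_append_subject(part: Part, subject: str) -> bool:
    """Decide whether the subject may be appended to this part."""
    if part.type == "text/calendar":
        return False  # iCalendar content must stay RFC 5545 compliant
    if not part.type.startswith(("text/", "html")):
        return False  # only text-like parts are candidates
    return subject not in part.content


def parts_for_llm(parts: List[Part], subject: str) -> List[Part]:
    """Return parts for the LLM parser; the input parts are never mutated."""
    return [
        replace(p, content=f"{p.content}\n{subject}")
        if should_append_subject(p, subject)
        else p
        for p in parts
    ]


if __name__ == "__main__":
    subject = "Planned Work PWIC277299 Notification from Arelion"
    original = [
        Part("text/calendar", "BEGIN:VCALENDAR\nEND:VCALENDAR"),
        Part("text/plain", "Dear customer,"),
    ]
    for part in parts_for_llm(original, subject):
        print(part.type, repr(part.content))
```

Whether the copies are worth the extra memory is a design choice for the maintainers; the sketch only shows that the subject can reach the LLM parser without changing what the ICal parser sees.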
