Description
Environment
- Python version: 3.12.8
- circuit_maintenance_parser version: 2.10.0
Observed Behavior
Enabling OpenAIParser causes the ICal parser to fail on maintenance emails in iCalendar format.
Steps to Reproduce
- Enable OpenAIParser so that it is added to the CombinedProcessor (by setting the PARSER_OPENAI_API_KEY env variable)
- Init any type of provider that sends emails in iCalendar format
- Load an iCalendar email (The email must not have the exact subject string in the 'text/calendar' part to trigger this, but I have tested 3 different providers and none of them send the exact subject string in the iCal part)
- Try to generate Maintenance objects from it
Example of the iCal email being parsed fine, then breaking as soon as PARSER_OPENAI_API_KEY is set:
circuit-maintenance-parser --data-file arelion.eml --data-type email --provider-type arelion
WARNING:circuit_maintenance_parser.provider:Not matching any exclude filter expression for Planned Work PWIC277299 Notification from Arelion.
Circuit Maintenance Notification #0
{
  "account": "*",
  "circuits": [
    {
      "circuit_id": "*",
      "impact": "OUTAGE"
    }
  ],
  "end": 1772546400,
  "maintenance_id": "PWIC277299.1",
  "organizer": "mailto:support@arelion.com",
  "provider": "arelion.com",
  "sequence": 8,
  "stamp": 1770376260,
  "start": 1772510400,
  "status": "CONFIRMED",
  "summary": "PWIC277299.1: Planned work notification from Arelion",
  "uid": "PWIC277299.1@arelion.com"
}
export PARSER_OPENAI_API_KEY="*"
circuit-maintenance-parser --data-file arelion.eml --data-type email --provider-type arelion
WARNING:circuit_maintenance_parser.provider:Not matching any exclude filter expression for Planned Work PWIC277299 Notification from Arelion.
ERROR:circuit_maintenance_parser.parsers.openai:Expecting value: line 1 column 1 (char 0)
Provider processing failed: Failed creating Maintenance notification for Arelion.
Details:
- Processor SimpleProcessor from Arelion failed due to: Content line could not be parsed into parts: 'Planned Work PWIC277299 Notification from Arelion': Planned Work PWIC277299 Notification from Arelion
- Processor CombinedProcessor from Arelion failed due to: 6 validation errors for Maintenance
account
When enabling OpenAIParser, I expected it to be used only as a 'last resort', when the other parsers fail to extract all of the data required for the Maintenance notification object.
But after checking the source code, I found that when the OpenAIParser is appended to the list of processors, the data parts are also modified, and I assume those parts are later used by all parsers, not just the LLM one. #335
circuit-maintenance-parser/circuit_maintenance_parser/provider.py
Lines 133 to 137 in 759cad1
    if os.getenv("PARSER_OPENAI_API_KEY"):
        self._processors.append(CombinedProcessor(data_parsers=[EmailDateParser, OpenAIParser]))
        # Add subject to all html or text/* data_parts if not already present.
        self.add_subject_to_text(data)
The add_subject_to_text() method appends a newline plus the subject to the end of any text/* or html data part that does not already contain that exact subject string.
But this breaks the parser of the Python icalendar library, since email subjects are free-form and do not comply with the RFC 5545 content-line format.
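The behavior described above can be sketched roughly as follows. This is only an illustration of the described effect, not the library's code: the Part dataclass and the append_subject function are hypothetical stand-ins, and a plain string content field is an assumption about how a data part stores its payload.

```python
from dataclasses import dataclass


@dataclass
class Part:
    """Hypothetical stand-in for an email data part (not the library's class)."""

    type: str
    content: str


def append_subject(parts, subject):
    """Append the subject to every text/* or html part that lacks it."""
    for part in parts:
        if part.type.startswith("text/") or part.type.startswith("html"):
            if subject not in part.content:
                part.content += "\n" + subject


subject = "Planned Work PWIC277299 Notification from Arelion"
ical = Part("text/calendar", "BEGIN:VCALENDAR\nVERSION:2.0\nEND:VCALENDAR")

# text/calendar starts with "text/", so it is modified like any other text part.
append_subject([ical], subject)
last_line = ical.content.splitlines()[-1]
print(last_line)
```

Because the condition only looks at the "text/" prefix, the calendar payload ends with a line that is not iCalendar data.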
Example text/calendar data part:
BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Arelion//Change Management Tool
BEGIN:VEVENT
<...>
END:VEVENT
END:VCALENDAR
becomes:
BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Arelion//Change Management Tool
BEGIN:VEVENT
<...>
END:VEVENT
END:VCALENDAR
Planned Work PWIC277299 Notification from Arelion
The iCalendar parser tries to split each content line into its name, parameters, and value (separated by ':'), but fails on the last line that was added:
Traceback (most recent call last):
File "/Users/user/Desktop/lab/python/circuit-maintenance-parser/circuit_maintenance_parser/parser.py", line 86, in parse
result = self.parser_hook(raw, content_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Desktop/lab/python/circuit-maintenance-parser/circuit_maintenance_parser/parser.py", line 119, in parser_hook
gcal = Calendar.from_ical(raw)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Library/Caches/pypoetry/virtualenvs/circuit-maintenance-parser-1EhK8r3--py3.12/lib/python3.12/site-packages/icalendar/cal.py", line 331, in from_ical
name, params, vals = line.parts()
^^^^^^^^^^^^
File "/Users/user/Library/Caches/pypoetry/virtualenvs/circuit-maintenance-parser-1EhK8r3--py3.12/lib/python3.12/site-packages/icalendar/parser.py", line 353, in parts
raise ValueError(
ValueError: Content line could not be parsed into parts: 'Planned Work PWIC277299 Notification from Arelion': Planned Work PWIC277299 Notification from Arelion
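To see why the appended line is rejected, here is a minimal stdlib-only sketch of a content-line shape check. It is a deliberate simplification of the RFC 5545 rule (name, optional parameters, ':', value) and is not how the icalendar library implements parsing: the regular expression and the invalid_content_lines helper are hypothetical, and line folding and quoted parameter values containing ':' are not handled.

```python
import re

# Simplified RFC 5545 content-line shape: NAME[;params]:value.
# Folded continuation lines and quoted parameters with ":" are not handled.
CONTENT_LINE = re.compile(r"^[A-Za-z0-9-]+(;[^:]*)?:.*$")


def invalid_content_lines(raw):
    """Return the non-empty lines that lack the NAME:value shape."""
    return [
        line for line in raw.splitlines()
        if line and not CONTENT_LINE.match(line)
    ]


original = (
    "BEGIN:VCALENDAR\n"
    "VERSION:2.0\n"
    "PRODID:-//Arelion//Change Management Tool\n"
    "END:VCALENDAR"
)
# Same payload with the email subject appended as an extra line.
modified = original + "\nPlanned Work PWIC277299 Notification from Arelion"

bad_original = invalid_content_lines(original)
bad_modified = invalid_content_lines(modified)
print(bad_original, bad_modified)
```

Only the appended subject line is reported, since it has no ':' separating a property name from a value.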
To overcome this in my local environment I changed
if part.type.startswith("text/") or part.type.startswith("html"):
to
if part.type.startswith(("text/", "html")) and part.type != "text/calendar":
to skip the subject addition for text/calendar parts, but I am not sure this is the best solution. (Maybe enabling the LLM parser shouldn't be able to interfere with the other parsers at all?) If this fix is acceptable, let me know and I will raise a PR to implement it.
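The proposed condition can be checked in isolation with a small sketch. The should_append_subject helper is a hypothetical name introduced here for illustration, and the sketch assumes the part type is a bare MIME type string without parameters.

```python
def should_append_subject(part_type):
    """Return True for text/* or html parts, except text/calendar."""
    return part_type.startswith(("text/", "html")) and part_type != "text/calendar"


# Evaluate the guard against a few representative part types.
candidates = ("text/plain", "text/html", "html", "text/calendar", "application/pdf")
results = {part_type: should_append_subject(part_type) for part_type in candidates}
print(results)
```

If the type string can carry parameters (for example "text/calendar; method=REQUEST"), the equality comparison would not match and the part would still be modified, so a prefix check on "text/calendar" may be safer in that case.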