chore(loki.source.file): add file rotation stress test#5137
chore(loki.source.file): add file rotation stress test#5137
Conversation
This reverts commit 859423c.
4b49538 to
d681829
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces comprehensive stress tests for file rotation in the loki.source.file component to identify and reproduce issues during log file rotation. The tests simulate three common rotation strategies (rename, copytruncate, and delete) under varying load conditions and verify that logs are captured reliably.
Key changes:
- Adds a new test file with infrastructure to simulate realistic file rotation scenarios
- Implements stress tests at two intensity levels: QuickSmoke (fast, runs in all test modes) and HighVolume (60s duration, skipped in short mode)
- Currently sets permissive success rate thresholds (95-97%) to allow for known issues, with plans to increase to 100% as fixes are implemented
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| } | ||
| return fmt.Errorf("stat failed: %w", err) | ||
| } | ||
| _ = info |
There was a problem hiding this comment.
This line assigns the info variable but never uses it. The underscore assignment suggests the value is intentionally discarded. Consider removing the info variable assignment entirely and using the blank identifier directly in the os.Stat call if the file info is not needed.
| // Small delay to simulate real-world scenario | ||
| time.Sleep(10 * time.Millisecond) |
There was a problem hiding this comment.
The delay of 10 milliseconds is hardcoded but the comment describes it as "simulating real-world scenario." Consider making this configurable through the testConfig structure so different test scenarios can vary the delay to test different timing conditions. This would improve test flexibility and make the purpose of the delay more explicit.
| gracePeriod := cfg.rotationInterval * 2 | ||
| if gracePeriod < 2*time.Second { | ||
| gracePeriod = 2 * time.Second | ||
| } | ||
| if gracePeriod > 10*time.Second { | ||
| gracePeriod = 10 * time.Second | ||
| } |
There was a problem hiding this comment.
The gracePeriod calculation uses specific bounds (minimum 2 seconds, maximum 10 seconds) that appear to be arbitrary. These magic numbers should be extracted as named constants with explanatory comments about why these specific values were chosen. This would improve code maintainability and make it easier to adjust these values in the future if needed.
| componentCancel() | ||
|
|
||
| // Wait a bit more for final log delivery | ||
| time.Sleep(500 * time.Millisecond) |
There was a problem hiding this comment.
The hardcoded sleep duration of 500 milliseconds is a magic number. Consider extracting this as a named constant with a descriptive name like finalLogDeliveryWait to make its purpose clearer and easier to adjust if needed.
| } | ||
|
|
||
| // Give writers a moment to create initial files | ||
| time.Sleep(100 * time.Millisecond) |
There was a problem hiding this comment.
The hardcoded sleep duration of 100 milliseconds is a magic number. Consider extracting this as a named constant with a descriptive name like initialFileCreationWait to make its purpose clearer and easier to adjust if needed.
| time.Sleep(100 * time.Millisecond) | |
| const initialFileCreationWait = 100 * time.Millisecond | |
| time.Sleep(initialFileCreationWait) |
We have discovered some possible issues during the file rotation in loki.source.file.
This adds a stress test that reproduces the issue.
We currently set a very permissive required success rate, even as low as 97% of log lines to succeed. But as we introduce more fixes, we will want this to go up to 100% ideally.