Commit a5b5da7 - Hackathon day 2
1 parent 7efbe5e - 8 files changed: 118 additions & 0 deletions

---
title: "Monitoring Buddy Hackathon - Day 2"
date: 2025-06-23T19:38:47+02:00
slug: monitoring-buddy-hackathon-day-2
pin: true

resources:
- src: "**.{png,jpg,webp}"
  title: "Image #:counter"

tags:
- Grafana

summary:
  Progress on the second day of the hackathon, exploring what the combination of Grafana MCP and an LLM can do to find problems in the OpenTelemetry Demo setup.
  We also took a break for an outdoor climbing activity.

---

After a day of rest on Sunday, as I am used to, I started the second day early.

The first idea to explore is what the combination of Grafana MCP and an LLM can do to help find problems in the [OpenTelemetry Demo](https://opentelemetry.io/docs/demo/) setup.
![OpenTelemetry Demo](opentelemetry_demo_setup.png)
The OpenTelemetry Demo is a webshop with several services, such as a product catalog, checkout, and ad service, each developed in a different programming language to showcase the capabilities of OpenTelemetry.
This is a perfect use case to explore the capabilities of the Grafana MCP and LLM.

This setup includes flagd, a feature flag service that is used to enable or disable failures on request.
I enabled one of the failures, so I could see if I could find the service that caused the problem.
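As a rough sketch of how such a toggle works (the exact file in the demo may differ, so treat the layout and values here as assumptions), a flagd flag definition for a failure toggle looks something like this, where switching `defaultVariant` to "on" activates the failure:

```json
{
  "flags": {
    "productCatalogFailure": {
      "description": "Illustrative sketch: let the product catalog fail on request",
      "state": "ENABLED",
      "variants": { "on": true, "off": false },
      "defaultVariant": "on"
    }
  }
}
```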

I asked some more questions, but unfortunately I forgot to take a screenshot of the conversation.
But since I enabled observability for my app, I still get the details from the logs and traces.

> We had a problem with the OpenTelemetry webshop in the past half hour, can you find which service caused this? service graph metrics are enabled

> I could not find a Grafana dashboard specifically named "webshop" or "service graph". However, I found some relevant service graph metrics available in Prometheus, such as:
>
> - traces_service_graph_request_client_seconds_bucket
> - traces_service_graph_request_client_seconds_count
> - traces_service_graph_request_client_seconds_sum
> - traces_service_graph_request_failed_total
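To make the follow-up question concrete: a PromQL sketch of my own (not something the LLM produced) that sums the failed requests per client/server edge over the last half hour would be:

```promql
# Failed requests per edge in the service graph over the past 30 minutes;
# only edges with at least one failure are returned.
sum by (client, server) (
  increase(traces_service_graph_request_failed_total[30m])
) > 0
```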

Then I asked `based on traces_service_graph_request_failed_total, can you tell which service it was?`

> Based on the traces_service_graph_request_failed_total metric, the "server" label indicates which service the failed requests targeted. The services recorded with failures are:
>
> - accounting
> - adservice
> - checkoutservice
> - flagd
> - frontend
> - frontend-proxy
> - imageprovider
> - productcatalogservice
>
> If you’d like to see failure counts for a particular service or need to explore related data in Grafana, let me know and I can provide a ready-to-use dashboard link or query!

So this is getting closer, but not the answer I hoped for. I wanted to know which service was the last in the chain of services that caused the problem.

So I tried again with some more details:
`based on traces_service_graph_request_failed_total, can you tell which service it was? client label says which service send the request, server label which service handled the request. So there is a chain of services, can you give the chain and tell which one was the last in the chain with errors?`

The answer:

> Based on the traces_service_graph_request_failed_total metric, here is the observed chain of failing requests (client -> server):
>
> 1. user/loadgenerator/frontend-web → frontend-proxy → frontend
> 2. frontend → productcatalogservice, checkoutservice
> 3. checkoutservice → productcatalogservice
> 4. frontend-proxy → frontend
> 5. paymentservice/adservice → flagd
>
> The final "server" in each chain represents the service where errors were last recorded. From the data, the last services in the chain with observed failed requests are:
>
> - frontend
> - productcatalogservice
> - checkoutservice
> - flagd
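What I wanted the LLM to work out can also be expressed directly in PromQL. A sketch of my own: take the services that receive failing requests (the `server` label) and subtract the services that also forward failing requests (the `client` label), which leaves the ends of the failing chains:

```promql
# Services that are the target of failing requests ...
sum by (service) (
  label_replace(increase(traces_service_graph_request_failed_total[30m]),
                "service", "$1", "server", "(.*)")
)
# ... minus services that themselves make failing downstream calls,
# leaving the last services in the failing chains.
unless
sum by (service) (
  label_replace(increase(traces_service_graph_request_failed_total[30m]),
                "service", "$1", "client", "(.*)")
)
```

This ignores timing and ordering, but it narrows the list down to the leaf services of the failing chains.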

At first the LLM did not come up with the answer I wanted, so I had to ask some more questions. That **came closer, but was still not the answer I hoped for**.
Hopefully with some more steering and input I will be able to get the answer I hope for.
But first we had a break ...

# Break - outdoor climbing activity

As a team outing activity we went to a place where we could go climbing.
Here are some pictures of me in action:

![Via ferrata climbing in action](via_ferrata.png)

After a refreshing dip in the swimming pool, it was time to continue with the hackathon.

# Continuing with the hackathon

After the break I continued with the hackathon. Let's see what happens if I make the ['@SystemMessage'](https://docs.quarkiverse.io/quarkus-langchain4j/dev/ai-services.html#_system_message) instruction more specific.
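As a sketch of the direction (not my exact code; the interface name and the instruction wording are my own), a Quarkus LangChain4j AI service with a more specific `@SystemMessage` could look like this:

```java
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// Hypothetical AI service: the name and instruction text are illustrative.
@RegisterAiService
public interface MonitoringBuddy {

    @SystemMessage("""
        You are a monitoring assistant with access to Grafana MCP tools.
        To find failing services, query the service graph metric
        traces_service_graph_request_failed_total in Prometheus.
        The 'client' label is the calling service and the 'server' label is
        the service that handled the request. Follow the chain of failing
        edges and report only the last service in that chain.
        """)
    String chat(@UserMessage String question);
}
```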
The OpenTelemetry Demo is running again and I toggled the flagd feature flag to introduce some failures in the 'product catalog service'.
Let's see if my monitoring buddy can find the problem:

![Find error in chat](find_error_chat.png)

So it can find some problems, but not the problem I hoped for, even with the more specific instruction.
Then I thought: let's try to steer the conversation even more by providing a specific Prometheus query:

![Find error in chat with more steering](find_error_chat_with_more_steering.png)

But even based on that, it did not recognize the problem. Maybe this is because the status code is only in the query, but not in the response of the query.

![Prometheus metrics with errors](prometheus_metrics_with_errors.png)
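The exact query is in the screenshot, but the effect is easy to illustrate: if the status code only appears in the selector and is then aggregated away, the returned series no longer carry it, so the LLM never sees it in the response. A hypothetical example (metric and label names assumed, in the style of span metrics):

```promql
# The selector filters on the error status ...
sum by (service_name) (
  increase(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[30m])
)
# ... but status_code is not in the "by" clause, so the result series only
# carry a service_name label and the error status is invisible in the output.
```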

After a while I tried it again, asking the LLM if it could find the endpoint:
![Second attempt](find_error_chat_2nd_attempt.png)
So in this attempt it was able to give the right answer.

Sending the right 'tool' request from the LLM back to the application is a key part of the process.
If the requests are not correct, the LLM will not be able to find the right data. **Having the right insights into that process is crucial to steer the LLM in the right direction.**

At the moment it is quite some work to see which 'tool' requests the LLM has asked for, and in the monitoring I cannot see the execution of the tool actions.
When I can see what kind of requests are sent to request data from the tools, I can see whether the LLM is using the right queries and the right data.
If that data is correct, then I might be able to steer the LLM with more background information.

The next idea is to make that more visible, both in the observability data and in the responses to the user.
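One way to make this visible, assuming the tools were implemented as LangChain4j `@Tool` methods in the application (in my setup they come from the Grafana MCP server, so this is an illustration of the idea rather than my exact code): log every tool invocation and wrap it in an OpenTelemetry span.

```java
import dev.langchain4j.agent.tool.Tool;
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import io.quarkus.logging.Log;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class PrometheusTools {

    // Hypothetical tool: any query the LLM asks for becomes visible both
    // in the logs and as a span in the traces.
    @Tool("Run a PromQL query against Prometheus")
    @WithSpan("llm-tool.prometheus-query")
    public String query(@SpanAttribute("promql.query") String promql) {
        Log.infof("LLM requested PromQL query: %s", promql);
        return runAgainstPrometheus(promql);
    }

    private String runAgainstPrometheus(String promql) {
        // Placeholder for the actual HTTP call to the Prometheus API.
        return "{}";
    }
}
```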