---
title: "Monitoring Buddy Hackathon - Day 2"
date: 2025-06-23T19:38:47+02:00
slug: monitoring-buddy-hackathon-day-2
pin: true

resources:
  - src: "**.{png,jpg,webp}"
    title: "Image #:counter"

tags:
  - Grafana

summary:
  Progress on the second day of the hackathon, exploring what Grafana MCP and an LLM can do to find problems in the OpenTelemetry Demo setup.
  And we had a break with an outdoor climbing activity.

---

After a day of rest on Sunday, as I am used to, I started the second day early.

The first idea to explore is what the combination of Grafana MCP and an LLM can do to help find problems in the [OpenTelemetry Demo](https://opentelemetry.io/docs/demo/) setup.

The OpenTelemetry Demo is a webshop with several services, like a product catalog, checkout, and ad service, each developed in a different programming language to showcase the capabilities of OpenTelemetry.
This makes it a perfect use case to explore the capabilities of Grafana MCP and an LLM.

The setup includes flagd, a feature flag service that is used to enable or disable failures on request.
I enabled one of the failures to see if I could find the service that caused the problem.
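
For context, this is roughly how a service in the demo can consult flagd to decide whether to inject a failure. It is only a minimal sketch assuming the OpenFeature Java SDK with the flagd provider; the flag name `productCatalogFailure` and the wiring are assumptions on my side, not code taken from the demo:

```java
import dev.openfeature.contrib.providers.flagd.FlagdProvider;
import dev.openfeature.sdk.Client;
import dev.openfeature.sdk.OpenFeatureAPI;

public class FailureFlagCheck {

    public static void main(String[] args) {
        // Register flagd as the feature flag provider (the demo services do something similar).
        OpenFeatureAPI api = OpenFeatureAPI.getInstance();
        api.setProvider(new FlagdProvider());
        Client client = api.getClient();

        // Hypothetical flag name: when it is switched on, the service starts failing requests.
        boolean injectFailure = client.getBooleanValue("productCatalogFailure", false);
        System.out.println("Inject failure? " + injectFailure);
    }
}
```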

I asked some more questions, but unfortunately I forgot to take a screenshot of the conversation.
But since I enabled observability for my app, I can still get the details from the logging and traces.

> We had a problem with the OpenTelemetry webshop in the past half hour, can you find which service caused this? service graph metrics are enabled

>
> "I could not find a Grafana dashboard specifically named "webshop" or "service graph." However, I found some relevant service graph metrics available in Prometheus, such as:
>
> - traces_service_graph_request_client_seconds_bucket
> - traces_service_graph_request_client_seconds_count
> - traces_service_graph_request_client_seconds_sum
> - traces_service_graph_request_failed_total

Then I asked `- based on traces_service_graph_request_failed_total, can you tell which service it was?`

> Based on the traces_service_graph_request_failed_total metric, the "server" label indicates which service the failed requests targeted. The services recorded with failures are:
>
> - accounting
> - adservice
> - checkoutservice
> - flagd
> - frontend
> - frontend-proxy
> - imageprovider
> - productcatalogservice
>
> If you’d like to see failure counts for a particular service or need to explore related data in Grafana, let me know and I can provide a ready-to-use dashboard link or query!"

So this is getting closer, but not the answer I hoped for. I wanted to know which service was the last in the chain of services that caused the problem.
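
For comparison, this is roughly the aggregation I had in mind myself. It is a minimal sketch that runs the PromQL directly against the Prometheus HTTP API; the `localhost:9090` endpoint and the 30 minute window are assumptions about my local setup:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ServiceGraphFailures {

    public static void main(String[] args) throws Exception {
        // Group the failed-request counter by the client/server labels so every
        // edge of the service graph (who called whom) stays visible in the result.
        String promql = "sum by (client, server) "
                + "(increase(traces_service_graph_request_failed_total[30m]))";

        URI uri = URI.create("http://localhost:9090/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8));

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).GET().build(),
                        HttpResponse.BodyHandlers.ofString());

        // Each returned series is one client -> server edge with its failure count;
        // the edge whose server never appears as a client again is the end of the chain.
        System.out.println(response.body());
    }
}
```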

So I tried it with some more details:
`based on traces_service_graph_request_failed_total, can you tell which service it was? client label says which service send the request, server label which service handled the request. So there is a chain of services, can you give the chain and tell which one was the last in the chain with errors?`

The answer:
> "Based on the traces_service_graph_request_failed_total metric, here is the observed chain of failing requests (client -> server):
>
> 1. user/loadgenerator/frontend-web → frontend-proxy → frontend
> 2. frontend → productcatalogservice, checkoutservice
> 3. checkoutservice → productcatalogservice
> 4. frontend-proxy → frontend
> 5. paymentservice/adservice → flagd
>
> The final "server" in each chain represents the service where errors were last recorded. From the data, the last services in the chain with observed failed requests are:
>
> - frontend
> - productcatalogservice
> - checkoutservice
> - flagd

At first the LLM did not come up with the answer I wanted, so I had to ask some more questions. That **came closer, but was still not the answer I hoped for**.
Hopefully, with some more steering and input, I will be able to get the answer I hope for.
But first we had a break ...

# Break - outdoor climbing activity

As a team outing we went to a place where we could go climbing.
Here are some pictures of me in action:




After a refreshing dip in the swimming pool, it was time to continue with the hackathon.

# Continuing with the hackathon

After the break I continued with the hackathon. Let's see what happens if I make the ['@SystemMessage'](https://docs.quarkiverse.io/quarkus-langchain4j/dev/ai-services.html#_system_message) instruction more specific.
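
For illustration, a more specific system message could look something like this. It is only a minimal sketch assuming the quarkus-langchain4j AiService approach from the linked docs; the interface name and the exact wording are not the real code of my monitoring buddy:

```java
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface MonitoringBuddy {

    @SystemMessage("""
            You are a monitoring assistant for the OpenTelemetry Demo webshop.
            Use the available Grafana tools to query Prometheus.
            For failures, prefer the traces_service_graph_request_failed_total metric.
            The 'client' label is the calling service and the 'server' label is the called service.
            Follow the chain of failing edges and report the last service in that chain.
            """)
    String chat(@UserMessage String question);
}
```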
The OpenTelemetry Demo is running again and I toggled the flagd flag to introduce some failures in the 'product catalog service'.
Let's see if my monitoring buddy can find the problem:



So it can find some problems, but not the problem I hoped for, even with the more specific instruction.
Then I thought, let's try to steer the conversation even more by just providing a specific Prometheus query:



But even based on that, it did not recognize it. Maybe this is because the status code is only in the query, but not in the response to the query.



After a while I tried it again, asking the LLM if it could find the endpoint:



So in this attempt it was able to give the right answer.

Sending the right 'tool' request from the LLM back to the application is a key part of the process.
If the requests are not correct, the LLM will not be able to find the right data. **Having the right insight into that process is crucial to steer the LLM in the right direction.**

At the moment it is quite some work to see which 'tool' requests the LLM has made, and in the monitoring I cannot see the execution of the tool actions.
Once I can see what kind of requests are sent to the tools to fetch data, I can check whether the LLM is using the right queries and the right data.
If that data is correct, then I might be able to steer the LLM with more background information.
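
One way I am thinking about getting that visibility is to instrument the tool layer itself, so every tool call shows up in the logs and on the active span. A minimal sketch assuming a local LangChain4j `@Tool` method and the OpenTelemetry API; in my real setup the tools come from the Grafana MCP server, so the `queryPrometheus` method and its wiring are hypothetical:

```java
import dev.langchain4j.agent.tool.Tool;
import io.opentelemetry.api.trace.Span;
import org.jboss.logging.Logger;

public class PrometheusTools {

    private static final Logger LOG = Logger.getLogger(PrometheusTools.class);

    // Hypothetical tool the LLM can call; the real queries go through the Grafana MCP server.
    @Tool("Run a PromQL instant query against Prometheus")
    public String queryPrometheus(String promql) {
        // Make the exact tool request visible in both the logs and the current trace.
        LOG.infof("LLM tool call: queryPrometheus(%s)", promql);
        Span.current().setAttribute("llm.tool.name", "queryPrometheus");
        Span.current().setAttribute("llm.tool.promql", promql);

        String result = runQuery(promql);
        LOG.infof("LLM tool result: %d characters", result.length());
        return result;
    }

    private String runQuery(String promql) {
        // Placeholder: the real implementation would call the Prometheus HTTP API or the MCP server.
        return "{}";
    }
}
```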

The next idea is to make that more visible, both in the observability data and in the responses to the user.