3.2. Mission A: Investigate a fully instrumented system
In this mission, you'll investigate a fully instrumented microservices application in Grafana Cloud.
This is the OpenTelemetry Demo - a production-grade system whose services export OpenTelemetry traces, metrics and logs.
Your goal in this mission is to use Grafana Cloud to understand the system, identify patterns, and see how OpenTelemetry's semantic conventions become incredibly useful when operating at scale, across many languages and frameworks.

Step 1: Get ready
Log on to the environment to get started:
- Go to the Reference Grafana URL that you have been given (Hint: the URL looks like https://abcd12appenv.grafana.net).
- If you are presented with a choice of sign-in options, click Sign in with SSO.
- At the Authentication login screen, enter the username (not email) and password that you received by email, or from your instructor.
Step 2: Discover your services
In this step, you'll use OpenTelemetry resource attributes to understand what services are running, where they're deployed, and how they're configured.
Explore workloads and infrastructure
OpenTelemetry can tell us a lot about workloads, and their underlying infrastructure. Explore this environment and see if you can answer these questions:
- How many services are running? (Hint: use the Entity Catalog)
- Which version of each service is running? (Hint: find a trace and use the service.version attribute, or use the Entity Catalog and add Service Version as a column)
- In which cloud provider and region are these services deployed? (Hint: search for traces and look in the resource attributes, or find the information in the Entity Catalog)
- What is the name of the Kubernetes node which the checkoutservice is running on? (Hint: this service is called from other services, so if you are searching Drilldown Traces, don't forget to change the filter to "All spans", not "Root spans")
Why it's important: Resource attributes give you a complete inventory of your infrastructure - what's running, where it's running, and how it's configured. This forms the foundation for service discovery and helps you understand the topology of your distributed system.
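Under the hood, all of those answers come from the same place: resource attributes, which are plain key-value pairs attached to every exported signal. Here is a minimal sketch of that idea in Python; the service names, versions and node names below are made up for illustration - the real values come from the Entity Catalog or a trace's resource attributes panel:

```python
# Illustrative resource attributes, shaped like those on exported telemetry.
# The concrete values are invented for this sketch.
resources = [
    {"service.name": "checkoutservice", "service.version": "1.12.0",
     "cloud.provider": "aws", "cloud.region": "us-east-1",
     "k8s.node.name": "node-a"},
    {"service.name": "cartservice", "service.version": "1.12.0",
     "cloud.provider": "aws", "cloud.region": "us-east-1",
     "k8s.node.name": "node-b"},
]

# "How many services are running?" is a distinct count of service.name.
service_count = len({r["service.name"] for r in resources})

# "Which node is checkoutservice on?" is a lookup on k8s.node.name.
node = next(r["k8s.node.name"] for r in resources
            if r["service.name"] == "checkoutservice")

print(service_count, node)
```

The Entity Catalog columns you used above are exactly these attributes, surfaced as a table.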
Step 3: Explore semantic conventions
Now that you know what services exist, let's explore how OpenTelemetry standardizes the way telemetry is captured and exported.
Semantic conventions are agreed-upon naming standards for attributes, spans, and metrics. They make telemetry portable and queryable across any service, regardless of language or framework.
- Navigate to Drilldown -> Traces.
- Find traces for the ditl-demo-frontend-client service.
- Open an example trace and examine the span attributes:
  - HTTP spans: Look for http.request.method, http.route, http.response.status_code
  - RPC spans: Find rpc.system.name, rpc.method
  - Database spans: Check for db.system.name, db.query.text, db.client.connection.pool.name
- Compare a couple of services. Notice how OpenTelemetry auto-instrumentation uses consistent attribute, span and metric naming, irrespective of the language or framework.
- Navigate to Drilldown -> Metrics.
- Answer the question: Which services use gRPC, and which use HTTP?
  - Hint: OpenTelemetry conventions define standard metric names, like http.server.request.duration and rpc.server.call.duration
  - Try using Drilldown Metrics to find the known metrics for HTTP servers and RPC servers, and note which label values you see.
  - Remember: In Grafana Cloud, OpenTelemetry resource attributes are promoted to Prometheus labels.
  - Check your analysis by inspecting traces from each service and looking at their spans - are they decorated with rpc.service and rpc.method, or http.request.method and http.route?
Why it's important: OpenTelemetry's semantic conventions make your telemetry portable and queryable across any service, regardless of the languages or frameworks your teams are using.
In Grafana Cloud: By instrumenting your workloads with OpenTelemetry and adopting its semantic conventions, you gain a standardized inventory of your workloads and services. In Grafana Cloud, the Entity Catalog view is populated from your OpenTelemetry-instrumented services, among other sources.
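The check you just did by eye - "which convention's attributes does this span carry?" - can be written down directly, since span attributes are just key-value maps. The sketch below is illustrative only (it is not part of any SDK), and the attribute values are assumptions for the example:

```python
def classify_span(attributes: dict) -> str:
    """Classify a span by which semantic-convention attributes it carries.

    Mirrors the exercise above: database spans carry db.* attributes,
    RPC spans carry rpc.*, HTTP spans carry http.*. A sketch, not an
    exhaustive classifier.
    """
    if any(k.startswith("db.") for k in attributes):
        return "database"
    if any(k.startswith("rpc.") for k in attributes):
        return "rpc"
    if any(k.startswith("http.") for k in attributes):
        return "http"
    return "internal"

# Attribute sets modelled on what you'll see in the trace view.
print(classify_span({"http.request.method": "GET", "http.route": "/api/cart"}))
print(classify_span({"rpc.system": "grpc", "rpc.method": "GetCart"}))
print(classify_span({"db.system.name": "valkey", "db.query.text": "HGET cart"}))
```

Because the attribute names are standardized, this one function works for spans from any language or framework - which is the whole point of the conventions.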
Step 4: Understand context propagation
Now let's see how OpenTelemetry connects the dots across your distributed system. Context propagation is the mechanism that allows traces to span multiple services, creating a complete picture of a request's journey.
Follow a request across services
- In Drilldown Traces, change the Filters to All spans and then search for traces including the cartservice.
- Click on a Trace to expand the view. Notice how the trace view shows the end-to-end flow of the trace that included calls to cartservice. The request flow will look something like this:
  ditl-demo-frontend-client → frontendproxy → cartservice → flagd
  Notice how a single trace ID combines all of these interactions into a single flow.
- Check out the trace timeline - notice how you can see the latency of each service hop.
Why it's important: Context is the essential piece of information that makes distributed tracing work. Without passing (propagating) context between services, you'd only see disconnected, single-service spans.
Context propagation ensures that each service passes linking information to the next service. This allows Grafana Cloud to stitch the spans into one trace, so you can see how a single request can touch many downstream services.
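In the default configuration, that linking information travels as the W3C traceparent HTTP header. Real OTel SDK propagators build and parse it for you; the stdlib-only sketch below just shows the header format, using the example IDs from the W3C Trace Context specification:

```python
import re

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    """Extract the trace context a downstream service would continue from."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError("malformed traceparent")
    return {"trace_id": m.group(1),
            "span_id": m.group(2),
            "sampled": m.group(3) == "01"}

# frontend -> cartservice: the same trace_id travels in the header, which is
# what lets Grafana Cloud join both services' spans into one trace.
header = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
ctx = parse_traceparent(header)
print(ctx["trace_id"])
```

Every hop in the flow you just examined (frontendproxy, cartservice, flagd) carried this same trace_id, with a fresh span_id per hop.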
Step 5: Correlate signals
Beyond connecting traces across services, OpenTelemetry enables correlation between different types of signals - traces, logs, and metrics. This allows you to jump seamlessly from one signal type to another, when you're investigating issues.
Navigate from traces to logs
- In Drilldown Traces, find a trace from the cartservice.
- On a span, click the Logs for this span blue pill button.
- A Logs query opens in a split view, showing the specific log lines from the given trace.
Why it's important: Correlating signals is crucial to helping you make sense of what an application is doing. When you troubleshoot applications that are fully instrumented with OpenTelemetry, you can navigate from performance metrics, to specific requests and traces for that service, and then down to individual events logged by your application during a request. This correlation happens because these signals (metrics, logs, traces) carry the same attributes.
Real-world example: Finding log messages for failing spans. With OTel, you can answer: why did a specific request fail, or why was it slow? What happened?
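The reason the Logs for this span button works is that each log line carries the active trace_id. Here is a hedged, stdlib-only sketch of that idea; in practice OpenTelemetry logging instrumentation injects the real trace context, whereas this example hard-codes an assumed trace ID:

```python
import io
import logging

class TraceContextFilter(logging.Filter):
    """Stamp each log record with the active trace_id.

    In a real service, OTel logging instrumentation reads the trace_id
    from the current span context; here it is hard-coded for the sketch.
    """
    def filter(self, record):
        record.trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # assumed active trace
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceContextFilter())

log = logging.getLogger("cartservice")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

log.info("failed to fetch cart")
line = buf.getvalue().strip()
print(line)
```

A log line stamped like this can be matched back to its span with a simple label query, which is exactly the query the split view runs for you.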
Step 6: Analyze performance and troubleshoot
Now that you understand how to discover services, interpret semantic conventions, follow distributed traces, and correlate signals, let's put it all together to analyze performance and troubleshoot issues.
Visualize service dependencies
- From the main menu, click on Observability -> Entity Catalog to open the Entity Catalog.
- In the Environment dropdown, clear any existing selections and choose production.
- Now you should see all the production services that make up our Astronomy Shop.
- Click on the Service Map tab to see the service topology in a single view.
- Find a service with high error rates, identified by a red circle around the entity.
- For the failing service, answer this question: is it the service itself that is failing, or one of its dependencies?
Analyze service latency with standard metrics
Earlier in this workshop, you worked with metrics generated from trace spans in Grafana Cloud. This approach provides flexibility and fidelity, since you retain the full request context from trace spans as well as metrics for alerting.
Additionally, OpenTelemetry automatically instruments many common HTTP and gRPC server libraries to emit standardized latency metrics, such as http.server.request.duration and rpc.server.call.duration. These metrics are available in Grafana Cloud Metrics, with consistent naming (remember: periods in names are converted to underscores in Prometheus).
- Navigate to Drilldown -> Metrics.
- Search for the metric rpc_server_duration_milliseconds_bucket.
- In the job panel, click on the Select button to see the histogram broken down by service. Note: Grafana Cloud promotes many other resource attributes to Prometheus metric labels, automatically writing the complex join queries (involving target_info) for you in the background.
- Pick a service and click Add to filters. You can break down the metric even further, using standard OpenTelemetry resource attributes, like Kubernetes Pod name (k8s.pod.name) or service version (service.version).
- How many instances of this service were running in the last hour? What are the pod names?
OTel has a convention for mapping service details to Prometheus-style labels: service.namespace and service.name combine into the job label (like production/checkoutservice), so you can filter metrics using standard Prometheus queries like {job="production/checkoutservice"}.
For more info, see https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/#resource-attributes-1
Explore runtime environment metrics
Beyond application-level metrics, OpenTelemetry automatically instruments runtime environments to emit standardized metrics about the underlying platform - whether that's the JVM, .NET CLR, Node.js V8 engine, Go runtime, or others.
These metrics follow OpenTelemetry semantic conventions, allowing you to gain visibility into runtime performance characteristics that you might typically track, like memory usage, garbage collection, thread counts, and CPU utilization - all standardized across different languages and platforms.
- Navigate to Drilldown -> Metrics.
- Search for runtime metrics by trying patterns like:
  - jvm_memory_* for Java services
  - process_runtime_* for various runtime metrics (.NET, Python)
  - go_* for Go-specific metrics (like goroutines)
- Select a metric (e.g., jvm_memory_used_bytes) and in the job panel, click Select to see a breakdown of this metric by namespace and service.
- Add a filter for a specific service and explore how you can break down the metric using standard attributes like jvm.memory.pool.name or jvm.memory.type.
- Try answering this question: Which Java service is using the most heap memory?
- Try exploring metrics for other runtimes to understand the health of the workloads in this system.
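To make the heap question concrete, here is an illustrative sketch with invented jvm_memory_used_bytes samples. The service names and byte values are assumptions for the example; the real numbers come from Drilldown Metrics:

```python
# Hypothetical jvm_memory_used_bytes samples: (job label, memory type, bytes).
# Values are made up to show the shape of the "most heap" question.
samples = [
    {"job": "production/adservice", "jvm_memory_type": "heap", "value": 310_000_000},
    {"job": "production/adservice", "jvm_memory_type": "non_heap", "value": 90_000_000},
    {"job": "production/fraud-detection", "jvm_memory_type": "heap", "value": 512_000_000},
]

# Sum heap usage per service, ignoring non-heap pools - the same
# filter-and-group-by you apply in the Drilldown Metrics UI.
heap_by_service: dict[str, int] = {}
for s in samples:
    if s["jvm_memory_type"] == "heap":
        heap_by_service[s["job"]] = heap_by_service.get(s["job"], 0) + s["value"]

top = max(heap_by_service, key=heap_by_service.get)
print(top)
```

Because the metric and its labels follow the semantic conventions, the same breakdown works for every JVM service in the system, regardless of who wrote it.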
Why it's important: Runtime metrics give you deep visibility into how your applications are performing at the platform level. With OpenTelemetry's standardized approach, you can build unified dashboards and alerts that work across your entire polyglot application landscape - no need to learn different instrumentation libraries or metric naming conventions for each language.