9 comments

  • chfritz 3 hours ago

    This is something we are actively discussing in the Cloud Robotics Working Group right now. We've already had a number of sessions on this topic with various guest speakers. You can watch the recordings here: https://cloudroboticshub.github.io/meetings. Feel free to attend the upcoming meetings.

    • Lazaruscv 2 hours ago

      Thanks a lot for sharing this resource! I wasn’t aware of the Cloud Robotics Working Group; those sessions look super relevant. I’ll definitely check out the recordings and join future meetings. Our angle is very aligned: we’re exploring how AI/automation can help with the time sink of debugging large-scale ROS/ROS 2 systems, especially when logs and bag files pile up. It’d be valuable to hear what the community feels is still missing, even with the current set of tools. Do you think there’s space for a layer focused purely on automated error detection and root-cause suggestions?

      • chfritz an hour ago

        "automated error detection" -- how do you want to do that? How would you define "error". Clearly you are not just proposing to detect "error" lines in the log, because that's trivial. But if you don't, then how would you define and detect errors and auto-root-cause them? Maybe we can discuss at one of the next meetings.

  • msadowski 9 hours ago

    Full disclosure, I work at Foxglove right now. Before joining, I spent over seven years consulting and had more than 50 clients during that period. Here are some thoughts:

    * Combing through the syslogs to find issues is an absolute nightmare, even more so if you are told that the machine broke at some point last night

    * Even if you find the error, its timestamp isn't necessarily when something broke; the failure could have happened way before, and you only discovered it because the system hit a state that needed the broken part

    * If combing through syslog is hard, try rummaging through multiple mcap files by hand to see where a fault happened

    * The hardware failing silently is a big PITA - this is especially true for things that read analog signals (think PLCs)

    Many of the above issues can be solved with the right architecture or tooling, but often the teams I joined didn't have it, and lacked the capacity to develop it.

    At Foxglove, we make it easy to aggregate and visualize the data and have some helper features (e.g., events, data loaders) that can speed up workflows. However, I would say that having good architecture, procedures, and an aligned team goes a long way in smoothing out troubleshooting, regardless of the tools.

    • Lazaruscv 2 hours ago

      This is super insightful; thank you for laying it out so clearly. Your point about the error surfacing way after it first occurred is exactly the sort of issue we’re interested in tackling. Foxglove is doing a great job with visualization and aggregation; what we’re thinking is more of a complementary diagnostic layer that:

      • Correlates syslogs with mcap/bag file anomalies automatically

      • Flags when a hardware failure might have begun (not just when it manifests); a rough sketch of what we mean is below this list

      • Surfaces probable root causes instead of leaving teams to manually chase timestamps
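
      To make the second bullet a bit more concrete, this is roughly the level of thing we have in mind, not a finished tool: a small Python sketch that flags per-topic message dropouts in an mcap recording, assuming the `mcap` Python package's reader interface. The 2-second gap threshold, the file name, and the `find_dropouts` name are all made-up placeholders.

          # Rough sketch, not a real tool: flag gaps in per-topic message flow in an
          # mcap file as a cheap "something may have started failing here" signal.
          from collections import defaultdict

          from mcap.reader import make_reader  # pip install mcap

          GAP_THRESHOLD_NS = 2_000_000_000  # treat >2 s of silence on a topic as a dropout

          def find_dropouts(mcap_path):
              last_seen = {}                # topic -> last log_time (ns)
              dropouts = defaultdict(list)  # topic -> [(gap_start_ns, gap_end_ns), ...]
              with open(mcap_path, "rb") as f:
                  reader = make_reader(f)
                  for _schema, channel, message in reader.iter_messages():
                      prev = last_seen.get(channel.topic)
                      if prev is not None and message.log_time - prev > GAP_THRESHOLD_NS:
                          dropouts[channel.topic].append((prev, message.log_time))
                      last_seen[channel.topic] = message.log_time
              return dropouts

          # "robot_run.mcap" is a placeholder path
          for topic, gaps in find_dropouts("robot_run.mcap").items():
              for start, end in gaps:
                  print(f"{topic}: silent for {(end - start) / 1e9:.1f}s starting at t={start}")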

      From your experience across 50+ clients, which do you think is the bigger time sink: data triage across multiple logs/files, or interpreting what the signals actually mean once you’ve found them?

      • msadowski 2 hours ago

        In my case, it’s definitely the data triage. Once I see the signal, I usually have ideas on what’s happening, but I’ve been doing this for 11 years.

        Maybe there could be value in signal interpretation for engineers coming from a purely software background, but I reckon it would be hard for such a team to build robots.

        • Lazaruscv an hour ago

          Our current thinking is to focus heavily on automating triage across syslogs and bag/mcap files, since that’s where the hours really get burned, even for experienced folks. For interpretation, we see it more as an assistive layer (e.g., surfacing “likely causes” or linking to past incidents), rather than trying to replace domain expertise.

          Do you think there are specific triage workflows where even a small automation (say, correlating error timestamps across syslog and bag files) would save meaningful time?
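
          To give a feel for the level of automation we mean (a sketch under assumptions, not a real implementation): take error timestamps pulled out of syslog and anomaly timestamps pulled out of a bag/mcap file, both as epoch seconds, and pair each error with the nearest anomaly inside a tolerance window. The `correlate` name and the 5-second window are just illustrative.

              # Sketch: pair syslog error timestamps with nearby bag/mcap anomaly timestamps.
              import bisect

              def correlate(syslog_error_times, bag_anomaly_times, window_s=5.0):
                  """Return (error_time, nearest_anomaly_time) pairs within window_s seconds."""
                  anomalies = sorted(bag_anomaly_times)
                  pairs = []
                  for t in sorted(syslog_error_times):
                      i = bisect.bisect_left(anomalies, t)
                      # The closest candidate is the anomaly just before or just after t.
                      candidates = anomalies[max(i - 1, 0):i + 1]
                      if candidates:
                          nearest = min(candidates, key=lambda a: abs(a - t))
                          if abs(nearest - t) <= window_s:
                              pairs.append((t, nearest))
                  return pairs

              # e.g. a syslog error at t=102.0 lines up with a bag anomaly at t=100.5
              print(correlate([10.0, 102.0], [55.0, 100.5]))  # -> [(102.0, 100.5)]

          Even something that blunt, run over a night's worth of logs, might point you at the right ten minutes of data instead of the whole run.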

  • dapperdrake 18 hours ago

    Am willing to help with this as well. The math can be iffy.

    • Lazaruscv 2 hours ago

      Really appreciate the offer; we’d love to take you up on it. A lot of what we’re exploring right now comes down to signal analysis and anomaly detection in robotics data, which gets math-heavy fast (especially when combining time-series data from multiple sources). We’re setting up short user interviews with roboticists/devs to better map the pain points. Would you be open to a quick chat about the trickiest math/log parsing issues you’ve faced? It could help us avoid reinventing the wheel.