Lessons learned from profiling an algorithm in Rust

(blog.mapotofu.org)

104 points | by urcyanide 12 hours ago ago

17 comments

  • pcwalton 8 hours ago ago

    I'm guessing that f32::clone showing up in the profile isn't actually a call to f32::clone, because you have optimizations on (if it actually is a call to a "movd xmm0,dword ptr [rdi]; ret" instruction pair, that's a bug in the compiler). Rather it's the result of the compiler choosing to attribute seemingly-random lines to f32::clone, because when lines from multiple functions are fused into one instruction the compiler will just pick one, and it happened to pick f32::clone to write into the debug info. You really want to look at instruction-level profiling when you're profiling at that level instead of the individual functions, since debug info is going to be very unreliable.

    • urcyanide 4 hours ago ago

      I also see the profiling shows that the "iter::next" takes a large percentage in the flamegraph. Are they the same reason?

    • eftychis 6 hours ago ago

      Seconded. This could have been essentially anything and everything else bunched together. Or we have a compiler or debug symbol bug in our hands.

  • wrs 9 hours ago ago

    I’m a Rust newbie, wondering how f32::clone could show up in a profile. Wouldn’t that be an inline no-op under any kind of optimization? I mean, cloning a float is, at worst, a MOV instruction, no?

    • skavi 44 minutes ago ago

      Profiles aren’t always entirely accurate. That’s kinda fine, since (IMO) they’re mainly useful for pointing you in the correct direction.

      Once you’ve found a small enough chunk of logic that’s worth optimizing, I’d recommend relying more on benchmarks and disassembly.

    • MiguelX413 9 hours ago ago

      Floats aren't stored in the same kinds of registers.

      • epcoa 3 hours ago ago

        How would that make any difference to what is being discussed?

  • carlmr 10 hours ago ago

    Great writeup with easy to understand steps. One thing it's lacking though is in the conclusion. I'd like to see a comparison to the C++ implementation.

    • skavi 31 minutes ago ago

      https://parallel-rust-cpp.github.io/

      This goes through seven iterations of optimization an algorithm in rust, comparing it to the equivalent c++ at each stage.

    • efnx 8 hours ago ago

      Yes, exactly. How close does it come after all those optimisations?

    • urcyanide 4 hours ago ago

      It has the same QPS as the C++ version for GIST dataset. While Rust has more SIMD, C++ has const generic. I guess there is still some space for future improvement.

  • mwkaufma 9 hours ago ago

    I don't understand why half of these aren't optimized by the compiler automatically. (x - y).norm_squared()? Why is f32::clone() not just an inline mov? Begging a lot of questions.

  • andrewaylett 10 hours ago ago

    That's really interesting -- I do enjoy a good optimisation.

    I was looking at one of the diffs, and thinking at a sufficiently advanced compiler should be able to generate the same efficient code for both -- and indeed it does, if you turn the optimiser on: https://godbolt.org/z/hjP5qjabz

      - let shift = if (i / 32) % 2 == 0 { 32 } else { 0 };
      + let shift = ((i >> 5) & 1) << 5;
    • NovaX 9 hours ago ago

      I'm confused because isn't the bitwise version the inverted logic? If the LSB is 1 then it is an odd value, which should be zero, yet that is shifted to become 32. The original modulus is for an even value becoming 32. Shouldn't the original code or compiler invert it first? I'd expect

          let shift = ((~(i >> 5) & 1) << 5);
      
      EDIT: The compiler uses "vpandn" with the conditional version and "vpand" with the bitwise version. The difference is it includes a bitwise logical NOT operation on the first source operand. It looks like the compiler and I are correct, the author's bitwise version is inverted, and the incorrect code was merged in the author's commit. Also, I think this could be reduced to just (~i & 32).
    • urcyanide 4 hours ago ago

      Thanks for pointing out, this can be optimized by the compiler when enabling opt-level=3