Building a compile-time SIMD optimized smoothing filter

(scientificcomputing.rs)

83 points | by PaulHoule 2 days ago

22 comments

  • keldaris 2 days ago

    This looks like a nice case study for when you're already using Rust for other reasons and just want to make a bit of numerical code go fast. However, as someone mostly writing C++ and Julia, this does not look promising at all - it's clear that the Julia implementation is both more elegant and faster, and it seems much easier to reproduce that result in C++ (which has no issues with compile time float constants, SIMD, GPU support, etc.) than Rust.

    I've written very little Rust myself, but when I've tried, I've always come away with a similar impression that it's just not a good fit for performant numerical computing, with seemingly basic things (like proper SIMD support, const generics without weird restrictions, etc.) considered afterthoughts. For those more up to speed on Rust development, is this impression accurate, or have I missed something and should reconsider my view?

    • adgjlsfhk1 a day ago

      This sort of thing is where Julia really shines. https://github.com/miguelraz/StagedFilters.jl/blob/master/sr... is the Julia code; it's only 65 lines and uses some fairly clean generated code to get optimal performance for all floating-point types.

    • gauge_field a day ago

      In terms of speed, Rust is up there with C/C++. See e.g. https://benchmarksgame-team.pages.debian.net/benchmarksgame/... I have also ported several algorithms from C, and they matched the performance.

      Regarding SIMD support, the only things missing are stable support for AVX-512 and some more exotic feature extensions for deep learning, e.g. avx_vnni. Those are implemented and waiting to be included in upcoming stable versions.

      GPU support: this is still an issue because not enough people are working on it, but there are projects trying to improve this: see https://github.com/tracel-ai/cubecl .

      Const generics: yeah, there are a few annoying issues. They are limited to a small set of types; for instance, you can't use an enum as a const generic parameter. Also, you can't use generic parameters in const expressions on stable Rust: see the unstable feature generic_const_exprs.
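
      To make that stable-Rust restriction concrete, here is a minimal sketch (mine, not from the article; the type names are made up): a plain const generic length is fine, but deriving a new length from it needs the unstable generic_const_exprs feature.

      ```
          // Plain const generic parameters work on stable Rust:
          struct Window<const N: usize> {
              coeffs: [f64; N],
          }

          // Using the parameter inside a const expression does not; this variant
          // only compiles with the unstable generic_const_exprs feature and fails
          // on stable with "generic parameters may not be used in const operations":
          //
          //     struct PaddedWindow<const N: usize> {
          //         coeffs: [f64; 2 * N + 1],
          //     }
      ```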

      My main reasons for using Rust in numerical computing:

      - The type system. Some find it weird; I find it explicit and easier to understand.

      - Cargo (nicer cross-platform defaults, since I tend to develop on both Windows and Linux)

      - Unconditional code generation with #[target_feature(enable = "feature_list")]. This means I don't have to set a different set of flags for each compilation unit when building; it is enough to put that attribute on top of the function making use of SIMD (see the sketch below).
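
      A minimal sketch of that workflow (my own illustration, not from the talk; axpy_avx2 and axpy are made-up names): the attribute on the inner function is enough for LLVM to assume AVX2 there, and a small runtime check keeps the call sound without any crate-wide RUSTFLAGS.

      ```
          #[cfg(target_arch = "x86_64")]
          #[target_feature(enable = "avx2")]
          unsafe fn axpy_avx2(a: f32, x: &[f32], y: &mut [f32]) {
              // Plain elementwise loop; with AVX2 assumed for this function only,
              // LLVM is free to use 256-bit vector instructions here.
              for (yi, xi) in y.iter_mut().zip(x) {
                  *yi += a * xi;
              }
          }

          #[cfg(target_arch = "x86_64")]
          fn axpy(a: f32, x: &[f32], y: &mut [f32]) {
              if is_x86_feature_detected!("avx2") {
                  // Sound: we just verified the CPU supports the feature we enabled.
                  unsafe { axpy_avx2(a, x, y) }
              } else {
                  for (yi, xi) in y.iter_mut().zip(x) {
                      *yi += a * xi;
                  }
              }
          }
      ```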

      I agree that if you want to be fast and exploratory in developing algorithms, and you can sacrifice a little bit of performance, Julia is a better choice.

      • bobmcnamara a day ago

        TBH, Intel ISAs have never been very stable; the mixed AVX flavors are just the latest example.

  • jonathrg a day ago

    Tight loops of SIMD operations seem like something that might be more convenient to just implement directly in assembly, so you don't need to babysit the compiler like this.
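
    For illustration, here is a rough sketch of that approach using Rust's inline assembly (my example, not from the article; add4_inplace is a made-up name). You pin down the exact instructions for one hot operation, at the cost of doing this per architecture.

    ```
        // Adds four f32 lanes of `b` into `a` with a single packed addps,
        // bypassing the auto-vectorizer entirely.
        #[cfg(target_arch = "x86_64")]
        fn add4_inplace(a: &mut [f32; 4], b: &[f32; 4]) {
            unsafe {
                core::arch::asm!(
                    "movups xmm0, [{pa}]",
                    "movups xmm1, [{pb}]",
                    "addps xmm0, xmm1",
                    "movups [{pa}], xmm0",
                    pa = in(reg) a.as_mut_ptr(),
                    pb = in(reg) b.as_ptr(),
                    out("xmm0") _,
                    out("xmm1") _,
                );
            }
        }
    ```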

    • adgjlsfhk1 a day ago

      Counterpoint: that makes you rewrite it for every architecture and every datatype. A high-level language makes it a lot easier to get something more readable, runnable everywhere, and datatype-generic.
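
      As a sketch of the datatype-generic point in Rust terms (my illustration; it assumes the num-traits crate, which the comment doesn't mention): a single generic definition covers f32, f64, and anything else implementing Float.

      ```
          use num_traits::Float;

          // Three-point moving average, written once for every Float type.
          fn smooth3<T: Float>(x: &[T]) -> Vec<T> {
              let third = T::from(3.0).unwrap(); // NumCast conversion; always succeeds for 3.0
              x.windows(3)
                  .map(|w| (w[0] + w[1] + w[2]) / third)
                  .collect()
          }
      ```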

      • brrrrrm a day ago

        You almost always have to for good perf on non-trivial operations

      • jonathrg a day ago

        For libraries, yes that is a concern. In practice you're often dealing with one or a handful of instances of the problem at hand.

      • menaerus a day ago

        If you're ready to settle for much lower IPC, then yes.

        • adgjlsfhk1 a day ago

          The Julia version of this (https://github.com/miguelraz/StagedFilters.jl) is very clearly generating pretty great code. For `smooth!(SavitzkyGolayFilter{4,3}, rand(Float64, 1000), rand(Float64, 1000))`, 36 out of 74 of the inner-loop instructions are vectorized fmadds (of one kind or another). There are a few register spills that seem plausibly unnecessary, and some dependency chains that I think are an inherent part of the algorithm, but I'm pretty sure there isn't an additional 2x of speed to be had here.

          • menaerus a day ago

            I can write SIMD code that's 2x the speed of the non-vectorized code but I can also rewrite that same SIMD code so that 2x becomes 6x.

            Point being, not only can you get 2x on top of the initial 2x SIMD implementation, but usually much more.

            Whether or not you see SIMD in the codegen is not a testament to how good the implementation really is.

            IPC is the relevant measure here.

            • adgjlsfhk1 a day ago

              IPC looks like 3.5 instructions per clock (and for bigger inputs the speed will be memory-bound rather than compute-bound).

              • menaerus 18 hours ago

                If true, this would be somewhere between an ok-ish and a decent result, but since you didn't say anything about the experiment, how you ran it, or the hardware, it's hard to say more.

                I assume you didn't use https://github.com/miguelraz/StagedFilters.jl/blob/master/te... since that benchmark is pretty much flawed for more than a few reasons:

                - It operates on a dataset that cannot fit even in the L3 cache of most consumer machines: ~76 MB (Float64) and ~38 MB (Float32)

                - It intermixes Python with the Julia code

                - Not even a single repetition, so no statistical significance

                - It does not show what happens with different data distributions, e.g. random data; it only uses a monotonically increasing sequence

                • ChrisRackauckas 13 hours ago

                  Did you link the wrong script? The script you show runs everything to statistical significance using Chairmarks.@b.

                  Also, I don't understand what the issue would be with mixing Python and Julia code in the benchmark script. The Julia side JIT-compiles the invocations, which we've seen removes pretty much all non-Python overhead and actually makes the resulting SciPy calls faster than doing so from Python itself in many cases; see for example https://github.com/SciML/SciPyDiffEq.jl?tab=readme-ov-file#m... where invocations from Julia very handily outperform SciPy+Numba from the Python REPL. Of course, that is a higher-order function so it's benefiting from the Julia JIT in other ways as well, but the point is that in previous benchmarks we've seen the overhead floor is so low (~100ns IIRC) that it didn't affect benchmarks negatively for Python and actually improved many Python timings in practice. Though it would be good to show in this case what exactly the difference is in order to isolate any potential issue, I would be surprised if it's more than ~100ns of overhead for an invocation like this, and with 58ms being the benchmark size, that cost is well below the noise floor.

                  Trying different datasets is of course valid, though. There's no reason to reject a benchmark just because it doesn't fit into L3 cache; there are many use cases for that. But that does not mean all use cases are likely to see such a result.

    • anonymoushn a day ago

      I'm happy to use compilers for register allocation, even though they aren't particularly good at it

    • menaerus a day ago

      Definitely. And I also found the title quite misleading. It's auto-vectorization that this presentation is trying to cover. Compile-time SIMD, OTOH, would mean something totally different, e.g. computation at compile time using constexpr SIMD intrinsics.
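
      For a Rust-flavored taste of that "computation at compile time" sense (my sketch, not from the presentation; Rust has no const SIMD intrinsics, so this is plain const evaluation): a const fn can bake smoothing-filter coefficients into the binary with zero runtime cost.

      ```
          // Binomial (Pascal's triangle) smoothing weights, computed by the compiler.
          const fn binomial_weights<const N: usize>() -> [u64; N] {
              let mut w = [0u64; N];
              if N == 0 {
                  return w;
              }
              w[0] = 1;
              let mut row = 1;
              while row < N {
                  let mut j = row;
                  while j > 0 {
                      w[j] += w[j - 1];
                      j -= 1;
                  }
                  row += 1;
              }
              w
          }

          // Evaluated entirely at compile time: [1, 4, 6, 4, 1].
          const W5: [u64; 5] = binomial_weights::<5>();
      ```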

      I'd also add that it's not only about babysitting the compiler; you're also leaving a lot of performance on the table. Auto-vectorized code, generally speaking, unfortunately cannot beat manually vectorized code (whether through intrinsics or asm).

  • TinkersW a day ago

    I only clicked through the slides and didn't watch the video, but ugh, all I see is scalar SIMD in the assembly output (the ss suffix means scalar; it would be ps if it were a packed vector op).

    And they are apparently relying on the compiler to generate it...just no.

    Use intrinsics, it ain't that hard.
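
    For reference, a small sketch of my own of what the packed version looks like with intrinsics (scale4 is a made-up name): the `_mm_mul_ps` call below compiles to a packed mulps, whereas operating on one lane at a time is what produces the ss-suffixed instructions the slides show.

    ```
        #[cfg(target_arch = "x86_64")]
        fn scale4(v: [f32; 4], s: f32) -> [f32; 4] {
            use core::arch::x86_64::*;
            let mut out = [0.0f32; 4];
            // SSE is part of the x86_64 baseline, so no runtime feature check is needed.
            unsafe {
                let x = _mm_loadu_ps(v.as_ptr());      // load 4 lanes
                let y = _mm_mul_ps(x, _mm_set1_ps(s)); // one packed multiply (mulps)
                _mm_storeu_ps(out.as_mut_ptr(), y);
            }
            out
        }
    ```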

  • jtrueb a day ago

    I think scientific computing in Rust is getting too much attention without contribution lately. Many essential language features are not stabilized: SIMD, generic const exprs, intrinsics, function optimization overrides, and reasonable floating-point overrides related to fast math are all a long way from stabilization. To get better performance, code ends up full of informal compiler hints that guide the compiler towards an optimization like autovectorization or branch elision (see the sketch below). The semantics around strict floating-point standards are stifling, and intrinsics have become less accessible than they used to be.
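
    As a concrete (and hypothetical, not from the comment) example of the kind of informal hint meant here: asserting slice lengths up front often lets LLVM drop per-iteration bounds checks, which is frequently what unlocks autovectorization of the loop body.

    ```
        fn saxpy(a: f32, x: &[f32], y: &mut [f32]) {
            // The assert is the "hint": after it, LLVM can prove the indexing below
            // is in bounds and is much more likely to vectorize the loop.
            assert_eq!(x.len(), y.len());
            for i in 0..x.len() {
                y[i] += a * x[i];
            }
        }
    ```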

    Separately, is Julia hitting a different linear algebra backend? Rust's ndarray with blas-src on Accelerate is pretty fast, but the Rust implementation is a little slower on my MacBook. This is a benchmark of a dot product.

    ```

        use divan::Bencher;
        use ndarray::Array;

        const M10: usize = 10_000_000;
        #[divan::bench]
        fn ndarray_dot32(b: Bencher) {
            b.with_inputs(|| (Array::from_vec(vec![0f32; M10]), Array::from_vec(vec![0f32; M10])))
                .bench_values(|(a, b)| {
                    a.dot(&b)
                });
        }
    
        #[divan::bench]
        fn chunks_dot32(b: Bencher) {
            b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
                .bench_values(|(a, b)| {
                    a.chunks_exact(32)
                        .zip(b.chunks_exact(32))
                        .map(|(a, b)| a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>())
                        .sum::<f32>()
                });
        }
    
        #[divan::bench]
        fn iter_dot32(b: Bencher) {
            b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
                .bench_values(|(a, b)| {
                    a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>()
                });
        }
        
        ---- Rust ----
        Timer precision: 41 ns (100 samples)
        flops             fast    │ slow    │ median  │ mean
        ├─ chunks_dot32   3.903 ms│ 9.96 ms │ 4.366 ms│ 4.411 ms
        ├─ chunks_dot64   4.697 ms│ 16.29 ms│ 5.472 ms│ 5.516 ms
        ├─ iter_dot32     10.37 ms│ 11.36 ms│ 10.93 ms│ 10.86 ms
        ├─ iter_dot64     11.68 ms│ 13.07 ms│ 12.43 ms│ 12.4 ms
        ├─ ndarray_dot32  1.984 ms│ 2.91 ms │ 2.44 ms │ 2.381 ms
        ╰─ ndarray_dot64  4.021 ms│ 5.718 ms│ 5.141 ms│ 4.965 ms
    
        ---- Julia ----
        native_dot32:
        Median: 1.623 ms, Mean: 1.633 ms ± 341.705 μs
        Range: 1.275 ms - 12.242 ms
    
        native_dot64:
        Median: 5.286 ms, Mean: 5.179 ms ± 230.997 μs
        Range: 4.736 ms - 5.617 ms
    
        simd_dot32:
        Median: 1.818 ms, Mean: 1.830 ms ± 142.826 μs
        Range: 1.558 ms - 2.169 ms
    
        simd_dot64:
        Median: 3.564 ms, Mean: 3.567 ms ± 586.002 μs
        Range: 3.123 ms - 22.887 ms
    
        iter_dot32:
        Median: 9.566 ms, Mean: 9.549 ms ± 144.503 μs
        Range: 9.302 ms - 10.941 ms
    
        iter_dot64:
        Median: 9.666 ms, Mean: 9.640 ms ± 84.481 μs
        Range: 9.310 ms - 9.867 ms
    
        All: 0 bytes, 0 allocs
    
    
    ```

    https://github.com/trueb2/flops-bench

  • feverzsj a day ago

    I think most Rust projects still depend on clang's vector extensions when SIMD is required.