Alternatives to cosine similarity

(tomhazledine.com)

36 points | by tomhazledine 2 days ago

16 comments

  • seanhunter 21 hours ago

    It pains me that the author (and many others) refer to cosine similarity as a distance metric. The cosine of the angle between two vectors is a quick measure of how closely their directions align; however, it is not a distance metric (which would measure the distance between their endpoints).

    Thought experiment: If I go outside and point at the moon, the cosine of the angle between the position vector of my finger and the position vector of the moon relative to me is 1.[1] However, my finger is very much not on the moon. The distance between the two vectors is very large even though the angle between them is zero.

    That's why it's cosine similarity, not cosine distance. If your embedding methodology is trained such that the angle between vectors is a good enough proxy for distance, then it will be[2]. But it's not a distance measure (a small numeric sketch follows the footnotes below).

    [1] because the angle is 0 and cos 0 = 1.

    [2] A self-fulfilling prophecy, but one that actually is within the power of the people making the embedding to make true, presumably because training disperses the embeddings so that their magnitudes are roughly equal: you end up with a kind of high-dimensional sphere of embeddings, with most of the actual vectors ending on the outside of the sphere, not too many points far in the interior, and not too many spiking way out the sides. It seems OpenAI also normalizes all the vectors so they are unit vectors and the magnitude doesn't matter. But it's still not a distance measure.
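
    To make that concrete, here is a quick numeric sketch in TypeScript (the vectors and helper functions are made up purely for illustration): two vectors pointing the same way have cosine similarity 1 even though the Euclidean distance between their endpoints is enormous.

        // "My finger" and "the moon": same direction, wildly different magnitudes.
        const finger = [0.2, 0.3, 0.1];
        const moon = finger.map((x) => x * 1_000_000);

        const dot = (a: number[], b: number[]) =>
          a.reduce((s, x, i) => s + x * b[i], 0);
        const norm = (a: number[]) => Math.sqrt(dot(a, a));

        const cosineSimilarity = (a: number[], b: number[]) =>
          dot(a, b) / (norm(a) * norm(b));
        const euclideanDistance = (a: number[], b: number[]) =>
          Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));

        console.log(cosineSimilarity(finger, moon)); // ~1: the angle is zero
        console.log(euclideanDistance(finger, moon)); // huge: the endpoints are far apart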

    • xplt 15 hours ago

      On point.

      Just because the L2 norm yields the same rankings as cosine similarity in the particular case of normalized embeddings used to retrieve relevant documents doesn't mean that any other L-norm, or any other measure commonly used in (un)supervised learning or information retrieval, is a viable alternative for the problem at hand (which, by the way, had to be guessed too).
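
      For unit vectors the link is exact: ||a - b||^2 = 2 - 2 * cos(a, b), so L2 distance is a monotone transform of cosine similarity and produces identical rankings. A quick check in TypeScript (the helpers are mine, purely for illustration):

          // For normalized vectors, squared L2 distance = 2 - 2 * cosine similarity,
          // so sorting by either gives the same order.
          const dot = (a: number[], b: number[]) =>
            a.reduce((s, x, i) => s + x * b[i], 0);
          const normalize = (a: number[]) => {
            const n = Math.sqrt(dot(a, a));
            return a.map((x) => x / n);
          };

          const a = normalize([1, 2, 3]);
          const b = normalize([2, 1, 0]);

          const squaredL2 = a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0);
          console.log(squaredL2, 2 - 2 * dot(a, b)); // same number, up to float error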

      Looking at the "history" of this development (e.g. bag-of-words model, curse of dimensionality etc.) provides a solid explanation for why we've ended up using embeddings and cosine similarity for retrieval.

      Still, I'm curious to see advances and new approaches in this area. This might sound snarky, but I still commend the author for doing what I haven't managed to do so far: writing down their view of the world and putting it out for public scrutiny.

    • seanhunter 21 hours ago

      BTW the author mentions Mahalanobis distance[1]. This is a good one to know about but it isn't useful in this application. As I understand it (having used it a bit and even implemented the algorithm a couple of times) Mahalanobis distance multiplies by the inverse of the covariance matrix. What that does is essentially undo covariance if the dimensions of your space are not equivalent. So if moving in one dimension is highly correlated with moving in another dimension, Mahalanobis distance corrects for that effect.

      [1] https://en.wikipedia.org/wiki/Mahalanobis_distance
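
      A minimal 2-D sketch of that correction (the helper below is hypothetical, written only to illustrate the formula sqrt(diff^T * Cov^-1 * diff)): the difference vector is whitened by the inverse covariance before its length is measured, so correlated dimensions aren't double-counted.

          // Mahalanobis distance in 2-D: sqrt(diff^T * CovInverse * diff).
          type Vec2 = [number, number];
          type Mat2 = [[number, number], [number, number]];

          function mahalanobis2d(x: Vec2, mean: Vec2, cov: Mat2): number {
            // Invert the 2x2 covariance matrix analytically.
            const [[a, b], [c, d]] = cov;
            const det = a * d - b * c;
            const inv: Mat2 = [[d / det, -b / det], [-c / det, a / det]];

            const dx = x[0] - mean[0];
            const dy = x[1] - mean[1];

            // Quadratic form diff^T * inv * diff.
            const q =
              dx * (inv[0][0] * dx + inv[0][1] * dy) +
              dy * (inv[1][0] * dx + inv[1][1] * dy);
            return Math.sqrt(q);
          }

          // With strong correlation between the two dimensions, a point that
          // follows the correlation is "closer" than one that fights it.
          const cov: Mat2 = [[1.0, 0.9], [0.9, 1.0]];
          console.log(mahalanobis2d([1, 1], [0, 0], cov)); // ~1.03: along the correlation
          console.log(mahalanobis2d([1, -1], [0, 0], cov)); // ~4.47: against it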

    • saltcured 20 hours ago

      Isn't that normalization in high dimensions pretty much the whole point?

      You don't care about differing radius at all, and the angle between the unit radial vectors directly correlates to the "great circle distance" between the points on the surface, right?

      • seanhunter 6 hours ago

        Correct, but that doesn't make cosine similarity a distance metric. It just means that in this particular case arc length is your distance metric and you're using trigonometry to get it.
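
        Concretely, for unit vectors the arc length is just the arccosine of the cosine similarity; a small TypeScript sketch (the clamp guards against floating-point values drifting just outside [-1, 1]):

            // Angular (great-circle) distance between two unit vectors.
            const dot = (a: number[], b: number[]) =>
              a.reduce((s, x, i) => s + x * b[i], 0);

            const arcLength = (a: number[], b: number[]) =>
              Math.acos(Math.min(1, Math.max(-1, dot(a, b))));

            console.log(arcLength([1, 0], [0, 1])); // ~1.5708, i.e. pi/2 for vectors 90 degrees apart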

  • ntonozzi 21 hours ago

    One important factor this article neglects to mention is that modern text embedding models are trained to maximize the distance between dissimilar texts under a specific metric. This means the embedding vector is not just latent weights plucked from the last layer of a model; it is specifically trained to be used with a particular distance function, which is cosine distance for all the models I'm familiar with.

    You can learn more about how modern embedding models are trained from papers like Towards General Text Embeddings with Multi-stage Contrastive Learning (https://arxiv.org/abs/2308.03281).

    • janalsncm 20 hours ago

      Yes, exactly. It’s one of the reasons that an autoencoder’s compressed representation might not work that well for similarity. You need to explicitly push similar examples together and dissimilar examples apart, otherwise everything can get smashed close together.
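
      A toy version of that push/pull is a triplet margin loss over the embedding vectors (the function and margin below are illustrative, not how any particular model is trained): the loss only reaches zero once the anchor sits closer to the positive example than to the negative one by at least the margin.

          // Toy triplet margin loss: pull anchor toward the positive,
          // push it away from the negative, until the gap exceeds `margin`.
          const dist = (a: number[], b: number[]) =>
            Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));

          const tripletLoss = (
            anchor: number[],
            positive: number[],
            negative: number[],
            margin = 1.0,
          ) => Math.max(0, dist(anchor, positive) - dist(anchor, negative) + margin);

          console.log(tripletLoss([0, 0], [0.1, 0], [5, 0])); // 0: already well separated
          console.log(tripletLoss([0, 0], [0.1, 0], [0.2, 0])); // > 0: training would push these apart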

      The next level of understanding is asking how “similar” and “dissimilar” are chosen. As an example, should texts about the same topic be considered similar? Or maybe texts from the same user (regardless of what topic they’re talking about)?

  • scentoni 21 hours ago

    Cosine similarity of two normalized vectors is just the length of the projection of one vector on the other. That gives a very intuitive geometric meaning to cosine similarity. https://en.wikipedia.org/wiki/Vector_projection
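
    For unit vectors that projection length is just the dot product, which is the cosine of the angle between them; a tiny sketch (helper names are mine, for illustration only):

        // For unit vectors a and b, the (signed) length of the projection of a
        // onto b is a . b, which is exactly the cosine of the angle between them.
        const dot = (a: number[], b: number[]) =>
          a.reduce((s, x, i) => s + x * b[i], 0);

        const a = [Math.SQRT1_2, Math.SQRT1_2]; // unit vector at 45 degrees
        const b = [1, 0]; // unit vector along x

        console.log(dot(a, b)); // ~0.7071 = cos(45 degrees) = projection length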

  • hansvm 17 hours ago

    The biggest argument for using cosine similarity is that hardware, software, and research have co-evolved to make it fast, robust, and well-understood.

    As one simple example of that, most modern compilers can recognize the loop-carried dependency on a single accumulator and add some extra accumulators to the generated assembly for anything that looks like an inner product. For even slightly more complicated patterns, though, that optimization is unlikely to have been implemented at the compiler level, so you'll have to do it yourself.

    The author benchmarked, among other things, Chebyshev distance. Here are two example (Zig) implementations, one with an extra accumulator to break that dependency chain, making it more than 3x faster on my machine.

        // 742ns per vec (1536-dim random uniform data)
        fn chebyshev_scalar_traditional_ignoreerrs(F: type, a: []const F, b: []const F) F {
            @setFloatMode(.optimized);
    
            var result: F = 0;
    
            for (a, b) |_a, _b|
                result = @max(result, @abs(_a - _b));
    
            return result;
        }
    
        // 226ns per vec (1536-dim random uniform data)
        fn chebyshev_scalar_sharing2_ignoreerrs(F: type, a: []const F, b: []const F) F {
            @setFloatMode(.optimized);
    
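            // Two independent accumulators break the loop-carried dependency on a
            // single `result`, letting the CPU overlap work from adjacent iterations.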
            var result0: F = 0;
            var result1: F = 0;
    
            var i: usize = 0;
    
            while (i + 1 < a.len) : (i += 2) {
                result0 = @max(result0, @abs(a[i] - b[i]));
                result1 = @max(result1, @abs(a[i + 1] - b[i + 1]));
            }
    
            if (a.len & 1 == 1)
                result0 = @max(result0, @abs(a[a.len - 1] - b[b.len - 1]));
    
            return @max(result0, result1);
        }
    
    This is apples to oranges, but if their Chebyshev implementation were 3x faster after JIT-ing, it'd handily beat everything else.

  • kordlessagain 21 hours ago

    Good to see a mention of the Jaccard index for sets here. I did a lot of work on generating keyterm sets from texts via knowledge transfer and comparing texts using both their Jaccard similarity (or Jaccard distance) and their cosine similarity. I used Ada and Instructor. In the tests I ran, the two measures frequently gave similar values and were useful for ranking the result set, allowing one or the other to be reweighted slightly when needed.

    Code: https://github.com/FeatureBaseDB/DoctorGPT
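
    For reference, the Jaccard index itself is tiny to implement; here is an illustrative TypeScript sketch over keyterm sets (not the linked code):

        // Jaccard similarity: |A ∩ B| / |A ∪ B| over two keyterm sets.
        function jaccardSimilarity<T>(a: Set<T>, b: Set<T>): number {
          let intersection = 0;
          for (const item of a) if (b.has(item)) intersection++;
          const union = a.size + b.size - intersection;
          return union === 0 ? 1 : intersection / union; // treat two empty sets as identical
        }

        // Jaccard distance is simply the complement.
        const jaccardDistance = <T>(a: Set<T>, b: Set<T>) => 1 - jaccardSimilarity(a, b);

        const docA = new Set(["embedding", "cosine", "similarity"]);
        const docB = new Set(["embedding", "jaccard", "similarity"]);
        console.log(jaccardSimilarity(docA, docB)); // 2 / 4 = 0.5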

  • janalsncm 20 hours ago

    You technically could use other distance metrics but embeddings are generated from models trained to maximize similarity under specific metrics. Usually that is cosine similarity.

    A trivial example of how it matters is the pair of vectors (0,1) and (0,2), which have cosine distance 0 but Euclidean distance 1.

    Finally, it’s notable that the author is testing via JavaScript. I am not sure if you’ll be able to take advantage of vectorized (SIMD/BLAS) optimizations.

  • thesehands 21 hours ago

    This paper is always useful to remember when someone tells you to just use the cosine similarity: https://arxiv.org/abs/2403.05440

    Sure, it’s useful, but make sure it’s appropriate for your embeddings, and remember to run evals to check that things are ‘similar’ in the way you want them to be similar.

    E.g., is "red" similar to "green" just because they are both colors?

  • kristjansson 21 hours ago

    > I was expecting a little variation in execution time for each comparison, but what I wasn't expecting was the bimodal nature of the results. For each function, there were two distinct groups of execution times. These peaks existed for all the function types at consistent relative spacings, which suggests that certain vector comparisons took longer than others no matter which distance function was being used

    Surely pre- and post- JIT?

    • sgt101 20 hours ago

      Maybe the author could test with realistic data set sizes. Recently I've worked with 26M and 300M data sets. Perhaps something along those lines.
