SolidStart - Hacker News

hliyan 5 months ago ago

This reminds me of one of the most interesting bugs I've faced: I was responsible for developing the component that provided away market data to the core trading system of a major US exchange (which allows the trading system to determine whether an order should be matched in-house or routed to another exchange with a better price).

Throughputs were in the multiple tens of thousands of transactions per second and latencies were in single digit milliseconds (in later years these would drop to double digit microseconds, but that's a different story). Components were written in C++, running on Linux. The machine that ran my component and the trading engine were neighbors in a LAN.

We put my component through a full battery of performance tests, and for a while, we seem to be meeting the numbers. Then one day, with absolutely zero code changes from my end or the trading engine's end, the the latency numbers collapsed. We checked the hardware configs and the rate at which the latest test was run. Both identical.

It took, I think, several days to solve the mystery: in the latest test run, we had added one extra away market to a list of 7 or 8 markets for which my component provided market data to the trading system. We had added markets before without an issue. It's a negligible change to the market data message size, because it only adds a few bytes: market ID, best bid price & quantity, best offer price & quantity. In no way should such a small change result in a disproportionate collapse in the latency numbers. It took a while for us to realize that before the addition of these few bytes, our market data message (a binary packed format), neatly fit into a single ethernet frame. Those extra few bytes pushed it over the 1600 (or 1500?) mark and caused all market data message frames (which were the bulk of messages on the system, next to orders), to fragment. The frame fragmentation and reassembly overhead was enough to clog up the pipes at the rates we were pumping data.

In the short run, I think we managed to do some tweaks and get the message back under 1600 bytes (by omitting markets that did not have a current bid/offer, rather than sending NULLs). I can't recall what we did in the long run.

[-]

5 months ago ago

[deleted]

5 months ago ago

[deleted]

kazinator 5 months ago ago

Another thing to look into in this kind of situation is enabling jumbo frames.

[-]

ay 5 months ago ago

“You had an MTU problem. You enable jumbo frames. Now you have two MTU problems”

Unless you control the entire set of possible paths (can be many!) and set all the MTUs to match well, this (while maybe on surface helping with the problem, depending on many things) can set one up with a nasty footgun, whereby black hole will show in the most terrible moment of high traffic. See my PMTUD/PLPMTUD rant elsewhere in this thread.

[-]

vlovich123 5 months ago ago

Given this is a trading system where application latencies are measured in microseconds, the default would be to assume that jumbo frames are totally a valid approach.

[-]

5 months ago ago

[deleted]

Veserv 5 months ago ago

MTU discovery would be so much easier if the default behavior was truncate and forward when encountering a oversized packet. The endpoints can then just compare the bytes received against the size encoded inside of the packet to trivially detect truncation and thus get the inbound MTU size.

This allows you to do MTU discovery as a endpoint protocol with all the authentication benefits that provides and allows you to send a single large probe packet to precisely identify the MTU size. It would also allow you to immediately and transparently identify MTU reductions due to route changes or any other such cause instead of packets just randomly blackholing or getting responses from unknown, unauthenticated endpoints.

[-]

zamadatix 5 months ago ago

Truncation for a dedicated probe packet type: you lose the information it's a probe when you go through a tunnel of some sort (VPN, L2TP, IPsec, MPLS, VPLS, VXLAN, PBB, q-in-q, whatever). You're also dealing with different layers e.g. a client could send an L3 packet probe and now you're expecting a layer 2 PBB/q-in-q node to recognize IP packet types and treat them specially (layering violation).

Truncation for all packet types: data in transit can occasionally get split for other reasons. Right now that's just made into loss, if we had built every protocol layer on the idea it should forward anyways then any instances of this type of loss also become MTU renegotiations, at best. At worst we're having to forward generally corrupted packets which can cause all sorts of other problems. It'd be another layering violation to require that e.g. an L2 switch must adjust the UDP checksum when it's intentionally truncating a packet, but that'd be the only way to avoid that. Tunnels (particularly secure) are also tricky here (you need to run multiple separate layers of this continuously to avoid truncation information not propagating to the right endpoints). It also doesn't allow for truly unidirectional protocols e.g. a UDP video stream as there is no allowance for out of session signaling to be possible.

The above is for "if we have started networking day 1 with this plan in mind". There are of course additional problems given we didn't. I'm also not sure I follow how allowing any intermediate node to truncate a packet is any more authenticated.

The (still ugly) beauty of using PMTUD-style approach over truncation or probe+notification is it doesn't try to make assumptions about how anything in the middle could ever work for the rest of time, and that makes it both simple (despite sounding like a lot of work) and reliable. You and your peer just exchange packets until you find the biggest size that fits (or that you care to check for) and you're off! MTU changes due to a path change? No problem, it's just part of your "I had a connection and the other side seems to have stopped responding. How do I attempt to continue" logic (be that retry a new session or attempt to be smart about it). It also plays nice with the ICMP too large messages - if they are there you can choose to listen, if they are not it still "just works".

Or, like the article says, safe minimums can be more practical.

[-]

Veserv 5 months ago ago

You truncate for all packet types.

Data in transit is almost never split for reasons other than fragmentation to avoid MTU problems. Any such split necessarily defines a fragmentation and reconstruction protocol so it still "preserves" the original send length information needed for truncation detection. If they have gone truly crazy and implemented a entire stream protocol transparently backing their flows then their transparent inner point-to-point layer would need to be aware of truncation in much the same way it would need to be aware of MTU limits anyways.

Forwarding generally corrupted packets should not be a problem unless your middleboxes are aggressively engaging in layering violations. From the perspective of a middlebox that is not engaging in layering violations you just have headers with blobs of data. Truncating the blob of data is basically uninteresting; at most you recalculate your integrity tags at your appropriate layer. You do not and should not recompute anything at higher layers. Furthermore, your endpoints must already be robust to blobs of garbage that pass your integrity tag checking because it is trivial for malicious actors to send you blobs of garbage with correctly calculated integrity tags. And, even if you were fully isolated, you can still get correlated bit errors that result in a correct integrity tag despite payload bit errors. Every client implementation that is not grossly incompetent must already be robust to getting garbage. You only get problems when your middleboxes start mucking around and trying to be too smart and violating your point-point transport abstraction.

You still get unidirectional protocols because you should manage truncation information out-of-band of any of your protocols. UDP or any other protocol should not communicate back to the sender that truncation happened. You do that some other way or even do not bother to do it at all. This is extra channel information that you can choose to communicate to let the other endpoint know about channel properties to make better data encoding decisions. You can transmit that in-band, out-of-band, on a different protocol, whatever. This is a higher level property of the communication channel between you and the other side.

Truncation is better authenticated because the packet reaches the other, known, authenticated endpoint who is the entity who can inform you, over a authenticated channel, that the transport channel has problems. You do not get nonsense like ICMP too large messages which come from unknown, unauthenticated entities. Furthermore, truncated messages can still be authenticated as long as you authentication tag the base header which should never be in the truncated section (you still need to have a minimum MTU below which you should always reject, but that number is small and much smaller than existing MTUs).

[-]

zamadatix 5 months ago ago

> Data in transit is almost never split for reasons other than fragmentation to avoid MTU problems

Fragmentation is a specific (unrelated) term, it's not interchangeable with a split. You can have (depending on the protocols involved):

- A runt due to a collision

- A link drop during transmit

- A problem during cut-through type transport

You can do various things to combat some of these (such as fragment-free instead of cut-through in collision domains) but you can't guarantee every phy IP ends up riding over can or should avoid these constraints.

> Forwarding generally corrupted packets should not be a problem unless your middleboxes are aggressively engaging in layering violations. From the perspective of a middlebox that is not engaging in layering violations you just have headers with blobs of data.

If "delivery of something somewhere" is your only definition of a problem, perhaps :p.

> Furthermore, your endpoints must already be robust to blobs of garbage that pass your integrity tag checking because it is trivial for malicious actors to send you blobs of garbage with correctly calculated integrity tags.

Not only the endpoints to garbage in the data payloads but equally the gear to garbage in the network headers. Be it full authentication or just error detection, you don't want to just forward things with a corrupted network header and hope it doesn't cause an issue or security violation. Things like CRCs or HMACs are done per layer precisely for this kind of reason, going to truncation requires dropping that safe handling.

> Every client implementation

As a side note: the concerns have less to do with the clients, they have full context and control of their sessions in software land with little concerns from concerns in being the physical transport layer. Most all of these considerations need to be thought from the intermediate boxes doing the transport/truncation instead.

> You still get unidirectional protocols because you should manage truncation information out-of-band of any of your protocols

Unidirectional protocols cannot be expected to punt directionality to a separate session. In general, any time the answer to a network conundrum (such as the two generals) sounds as easy as "just move that to a separate channel which has the information" you have either duplicated the problem in that channel or added functionality which might not be physically available (or directionally available for security use case reasons, or scalably available for multicast, or something else for a use case that isn't 'inside out' from what might pop in mind as a 'standard' session).

> Truncation is better authenticated because the packet reaches the other, known, authenticated endpoint who is the entity who can inform you, over a authenticated channel, that the transport channel has problems.

I'm still not sure I follow - how is the message between endpoints still authenticated if middleboxes can modify the bytes, breaking an HMAC and/or CRC (if any), and it still gets delivered? Having authenticated an endpoint exists at an address you've sent a packet to before does not automatically authenticate any packet which arrives.

You also skipped over any of the implications for network tunnels (secure/insecure) - is MTU discovery just not supposed to work in those use cases?

I think you can absolutely make a domain specific protocol which is happy to use truncation for MTU discovery, I just don't think anything which is supposed to be as universally usable as IP can.

[-]

Veserv 5 months ago ago

Your first point appears to be about physical layer concerns. My suggestion was not meant to operate at that level. The proposed model assumes the physical layer guarantees point-point delivery of a distinct packet between adjacent nodes in the network with MTU limits manifesting as either discarding or rejecting the trailing portions of the packet.

I said there were no “problems” if there are no layering violations because you argued that recalculating checksums would be a layering violation. Either we say layering violations are unacceptable at which point my argument stands. Or we say layering violations are par for the course and you can just recalculate the checksums if you need to.

Unidirectional protocols with no back channel must assume the network channel parameters such as MTU. Adding truncation information which can be picked up at a different layer is just strictly more information you can feed into your protocol if it is designed to handle that. You can just not use it and act as if truncation is dropping if you want to. This is just strictly more data you can use for decisions.

You can get still get authenticated transport in the presence of truncation if your protocol generates a authentication tag for the “original” length and puts it at the start of the message. Then you can authenticate the length field and verify truncation otherwise you can drop it.

I did not bother with tunnels because I do not see how it is a distinct problem. Tunnels already need to figure out how to manage their MTUs. Either the tunnel is transparently managing how it fragments data and can be enhanced to support truncation (though it does not need to, it can just drop truncation/malformed as they currently do) or it tells tunnel parameters to the endpoints so that the endpoints keep themselves in bound at which point the endpoints can detect whatever the MTU of the tunnel is.

And again, you can always just ignore truncated packets and act as if they are malformed which everybody already does. This is strictly more functionality which does not require changing all existing systems which can be used to support more efficient MTU discovery by systems and networks that supported it. And if they do not, you just fallback to the current, crusty way.

[-]

zamadatix 5 months ago ago

> Your first point appears to be about physical layer concerns. My suggestion was not meant to operate at that level.

The proposal doesn't operate at that level but it must be compatible with the operations of that level. I.e. that the physical layer can also cause truncation of layers riding on top of it needs to be accounted for in the way those upper layers consider what truncation means. The same is true for possible intermediate layers (which sorta aligns with the later conversations regarding tunnels, which are basically just more complicated forms of intermediate layers).

> The proposed model assumes the physical layer guarantees point-point delivery of a distinct packet between adjacent nodes in the network with MTU limits manifesting as either discarding or rejecting the trailing portions of the packet.

Then proposed isn't applicable to IP since an upper layer protocol cannot make guarantees about the behavior of lower level protocols it may be transported on.

In addition, discarding trailing portions of the packet still results in the aforementioned problems with consistency checks and forwarding behavior limitations for lower level layers which did abide by this behavior.

> Unidirectional protocols with no back channel

One cannot guarantee bidirectional protocols will be able/allowed to form a back channel either, I just used unidirectional as a more clear-cut example.

> You can just not use it and act as if truncation is dropping if you want to. This is just strictly more data you can use for decisions.

Well sure, the same is true of the ICMP method or an active probing method. The concern is less with sessions you don't care to PMTUD in the first place and more with how the truncation design affects the designs of such other use cases.

> You can get still get authenticated transport in the presence of truncation if your protocol generates a authentication tag for the “original” length and puts it at the start of the message. Then you can authenticate the length field and verify truncation otherwise you can drop it.

I totally agree one can include an HMAC tag in your client<->client protocol to validate unmodified packets are authentic. This is regardless of whether truncation, ICMP packet too big, active PMTUD probing, or any other method is in place as, to this point, this is only about validating delivered packets which did fit in the MTU.

What isn't clicking is when a truncated message arrives how a (now invalid) HMAC helps you authenticate if this packet was completely spoofed by a malicious actor or really truncated by a middlebox. All you know is it was supposed to be longer and now something claims it needs to be shorter, how do you know that's not because of the same malicious actor who was supposed to be sending the fake ICMP packet too big rather than a middlebox really trying to signal the packet truly needed to be truncated?

> I did not bother with tunnels because I do not see how it is a distinct problem.

As highlighted earlier, tunnels may either encapsulate other protocols or encapsulate protocols which are expecting truncation. If the only things which existed in the world were client network interfaces it wouldn't be a problem, once more network devices become involved then you have to consider the impact on those too. The main thing to keep in mind is very few network middleboxes or tunnel protocols have the ability to do fragmentation on behalf of tunneled data, particularly if they are hardware based or based on protocols without such a feature (such as Ethernet) since this eats up TONS of hardware to do so (especially at high speeds). E.g. take an IPv6 VXLAN tunnel of an Ethernet frame on a 400 Gbps interface, how is an pure L3 intermediate carrier router doing truncation supposed to know not to update the UDP (a layer up the stack) checksum so the truncated Ethernet payload actually gets delivered to the client destination from the egress VTEP? It's not even that the egress VTEP needs some way to signal to the ingress VTEP how much the truncation was, it's that the original client which was VXLAN encapsulated by the ingressing VTEP needs its packet delivered to the remote client so the remote client can see the truncation and re-negotiate (in band or out of band) with the client to send smaller frames. This signaling will not occur because of the aforementioned UDP checksum being broken by an intermediate router. Just removing all checksums and allowing all modifications to headers and delivering whatever arrives would create not only high incidences of the propagation of deformed traffic but also security risks.

This brings us back to the example of secure tunnels, like IPsec, which have the same problem but in a much more succinct form. All parts of the payload of an IPsec tunnel are basically random noise after you truncate it, so there is no way to even attempt to consider sending the truncated payload to the intended destination. It's not the responsibility of the IPsec encapsulator to perform the encapsulation and the IPsec receiver usually doesn't have a path to communicate with the original client (not that it even knows who that is).

If you redesign everything about how network tunneling works under some severe limitations and assumptions then it may be possible to solve some (or maybe all if I can figure out what I'm missing regarding authentication of packets claiming MTU changes) of these problems but I'm not sure I could ever see the set of requirements needed as easier than the other MTU approaches. That doesn't necessarily mean I think there is an overall perfect answer all, just that I think PMTUD and its variants are definitely the easier path.

[-]

Veserv 5 months ago ago

I just do not understand the problems you are stating. Let me present a concrete example.

We have A <-> B <-> C. A wishes to transmit a packet of 0x1000 bytes containing a Ethernet, IPv4, and then bespoke protocol, P, which is a header containing a length, MAC on the length + header, MAC on entire packet, encrypted payload, in that order. A then prepares transmit descriptors pointing at the packet and with size 0x1000 bytes.

C prepares receive descriptors pointing to buffers with a maximum capacity of size 0x1000 bytes per packet. B prepares receive descriptors pointing to buffers with a maximum capacity of size 0x500 (1280) bytes per packet.

A transmits the packet to B. The physical coding layer transmits the bytes terminating in the FCS. B receives bytes and does a running computation of the FCS. Upon reaching 0x500 bytes, it stops storing data into memory, stores the current FCS into memory, then continues receiving the data and computing the FCS until the data stream ends. Upon determining that the FCS matches, it marks the descriptor as valid for consumption and stores that the descriptor contains 0x500 bytes of data. The transmit engine of B then configures a transmit descriptor pointing at the packet and with size 0x500 bytes.

C then receives the 0x500 byte packet from B and observes that the FCS matches the 0x500 byte FCS and marks the descriptor as valid for consumption and stores that the descriptor contains 0x500 bytes of data. C then processes the packet observing that the P header indicates a length of 0x1000 bytes, but only 0x500 bytes are available. It attempts to authenticate the P header MAC using a secret known only to A and C. As the truncation only hit the encrypted payload at the tail, the P header MAC and the header data it is authenticating have not been modified by the truncation process. As such, C is able use the higher layer secret it shares with A to successfully authenticate the header data and determine that the header containing a length field with the value of 0x1000 bytes could have only been written by A and has not been tampered with. It then rejects the rest of the packet, but stores that the inbound MTU is only 0x500 bytes.

[-]

zamadatix 5 months ago ago

In this process one can only show they are unable to authenticate the packet's length matches the length the header said it should have been i.e. you can only authenticate that nobody tried to claim the MTU should change. You have not provided any authentication to the parts of the message signaling the MTU is now supposed be 0x500.

The authentication header can only helps you authenticate when the MTU stayed the same as expected during delivery, it cannot help you authenticate the signals claiming MTU was supposed to be something else as those modifications, inherently, do not come from nodes partaking in the authentication header. The malicious middleman could falsely truncate a single packet to 0x500 bytes just as easily as they could falsely create an ICMP packet claiming the MTU is 0x500 bytes, in both cases the only thing you know for sure is "someone is trying to claim that last packet was too big".

[-]

Veserv 5 months ago ago

Can you please explain how a malicious node on your path truncating packets to 0x500 bytes is distinct from a 0x500 byte path MTU?

The distinction between this and a false ICMP packet is that a valid ICMP packet from a node in your path comes from a unauthenticated source. You can not generally distinguish this from a forged ICMP packet from a malicious entity not on your path.

In contrast, the model I propose results in the authenticated endpoint learning of the path MTU in a way that can only be altered by a node in the path refusing to send data beyond a certain size. The authenticated endpoint can then use a authenticated channel to feedback the data allowing the source to get authenticated path MTU information that could only come from the authenticated endpoint.

[-]

zamadatix 5 months ago ago

> Can you please explain how a malicious node on your path truncating packets to 0x500 bytes is distinct from a 0x500 byte path MTU?

Sure, it doesn't even actually have to be truly in-path in all cases, though that makes things easier. I receive a copy of your message (be it I'm actually inline, tap, spoofed arp for a second, shared media, etc - pick your poison of the day) and I create a truncated version of that packet to send on the line. I don't even need to truncate every single one yet, since you haven't re-added the a probing mechanism like found in PLPMTUD to allow the MTU to be raised again yet :).

Putting all of the talk about the length authentication aside, I still don't see how the truncated headers was supposed to be an improvement over the ICMP approach of just sending the headers back as part of the payload. The ICMP destination unreachable message signaling the packet was too large already includes the original IP header + at minimum the first 8 bytes (though in practice RFC1812 increased that so you'll get significantly more) of data precisely so the client is able to map the request to the specific underlying session. If having the 5 tuple + identification number + TCP sequence number you sent match up with what you sent isn't already enough to trust it came from someone who got a copy of your message then you can add the HMAC anywhere in your headers without needing to replace the ICMP based signaling approach completely to get it sent back to you too? I guess because you want the message the message sent was too large to go to the remote side first instead of the sender?

[-]

Veserv 5 months ago ago

You appear to be approaching my proposal as if it were a complete RFC intended to describe a complete end-to-end MTU discovery protocol. I am describing a primitive that can be used to efficiently implement MTU discovery. I did not describe probing because it is trivial to implement on top of that primitive in whatever way you so desire. A simple mechanism would be to just use your PLPMTUD mechanism, but instead of submitting continuously increasing packets, you can just submit a single large one and get the precise MTU in one step which is delivered back in whatever way (or equivalent) your PLPMTUD prescribes getting feedback.

You are describing a attack where somebody is either getting the unique copy of your packet by maliciously getting themselves into your path, at which point they can just drop all your packets and DoS you in a much more robust manner, or for some stupid reason you just send copies of packets to random malicious entities when they ask for it. I mean, I can imagine some knuckle-head designed a protocol and network where that is the case, but you should probably first not do something that stupid. Assuming you can not avoid it, you can still detect the case because if they can not interject into your actual path, the endpoint will receive both a authenticated packet and a authenticated truncated packet. In such a case, you can first of all just use the larger packet size since the larger size did get through. Second, you now know that somebody malicious or incompetent is getting a copy of your packets. I leave how you want to handle that up to the endpoints.

And, just as a aside, I did not mention this as I assumed it was obvious, but I believe you should still have a minimum MTU size no smaller than the current safe MTU. I am not arguing that somebody should be able to truncate you down to some egregiously small MTU. So worst case you just have to use the safe minimum MTU which is no worse than existing non-probing techniques. You could also just treat truncation as dropping if you have tons of malicious actors which defaults you to current probing techniques. My proposal can trivially fall back to the current models.

I disagree with ICMP signaling on multiple levels. First of all, you need to accept data from unauthenticated endpoints. Oh, but it is okay because they need to be on your path to get the uniquifiers. Oh, except you just pointed out how malicious actors can get copies so that is not true. You are now required to accept large amounts of data from malicious actors if you want to do ICMP based probing.

Second, I think having source addresses in headers is a design mistake and we should move away from protocols that demand it be present. We can do source authentication cryptographically which is more secure and prevents ossified middleboxes from engaging in layering violations or interfering with flows.

[-]

zamadatix 5 months ago ago

It's less about whether your proposal is ready to be submitted for standards track and more about pointing out alternative proposals always seem easier when you look at the primitive in isolation compared to a fully fleshed out solution in place today. But yes, I imagine the needed mechanism would be similar (if not identical to) the existing solutions deployed in PMTUD and PLPMTUD.

Yes, such a person in your path is problematic for both solutions. The point is not that such a person is wholly unproblematic with the ICMP solution, it's that truncation provides no additional authentication in such scenarios as was claimed. If you can already assume a fully trusted and validated path without the possibility of anyone interfering then you don't really need to worry about spoofed ICMP either, not that the involvement of the MTU discovery algorithm had much to do it with that result. Pragmatically the answer here is one loses trust of any payload which is not whole and matching contained signatures, but that still remains true when one uses truncation instead.

Agreed on minimally safe MTU, another case where truncation would not actually be providing a change from the current implementation.

In regards to the discussion around the downsides of the current ICMP approach the claim is it's no less safe than truncation, not that it was more safe. Truncation was said to have brought authentication of packets claiming MTU change, in reality it provides the same level of veracity as ICMP in, yet again, a very similar manner. In regards to size of data to accept one needs to accept single ICMP destination unreachable type packets of up to 576 bytes to handle the MTU use case (other ICMP packet types and use cases may allow for more but you don't have to support that for handling MTU notifications). It's not an accident 576 bytes is also the minimum maximum packet size define for IPv4. This means the largest unauthenticated packet you need to parse is the same size as it would be in the truncation case: 1 of whatever the protocol defines as the minimum maximum packet size.

Protocols without a cleartext source address (or much besides a destination address) in the header are definitely an interesting topic, and there are certainly use cases for this, but these really end up at the same kind of conversation except you get to skip the PMTUD portions and go straight to doing PLPMTUD inside the encrypted portion. With PLPMTUD inside the encrypted portion clients never have to listen to anything which came in a packet with an invalid signature. If one introduces truncation in this scenario you lose that as clients now need to also implicitly trust that a small packet with a broken signature was genuinely an MTU hint rather than a maliciously spoofed copy or whatnot.

The only thing I'll say against such high secure protocols is they would not necessarily be something for everyone: they'd mandate a lot of things which may not be pragmatic for many use cases. E.g. when one wants to trade source anonymity for RPF protections. Or when one doesn't get value from encryption but gets value from load balancing/directing on header info in a high performance scenario. Or easy handling of multiple generations of protocols with opposing design goals in transport equipment. This is all to say I think such protocols would be valuable but I'd stop short of saying it's what IP needs to/should have done.

ay 5 months ago ago

From the pragmatic standpoint: manually hard coding a safe minimum is the only approach which consistently works.

PMTUD somehow missed that packet networks ditching the OOB mechanisms of circuit switched networks was a good thing. By adding an OOB mechanism of attempted MTU discovery. Unauthenticated.

Yes, matching the 5-tuple from the original payload somewhat helps against the obvious security problem with this. (It was a fun 3-4 years while it was being added to systems across the ‘net while everyone was blocking the ICMP outright to avoid the exploitation. The burps of that one might still find in some security guidelines)

But the number of the network admins who understand what do they have to configure in their ACLs and why, is scarily small compared to the overall pool size.

Here’s another hurdle: for about two decades, to generate ICMP you have to punt the packet from hardware forwarding to the slow path. Which gets rate-limited. Which gives one a fantastic way to create extremely entertaining and hard to debug problems: a single misbehaving or malicious flow can disable the ICMP generation for everyone else.

Make hardware that can do it in fast path ? Even if you don’t punt - you still have to rate-limit to prevent the unauthenticated amplification attack (28 bytes of added headers is not comparable with some of the DNS or NTP scenarios, but not great anyway)

So - practically speaking, it can’t be relied on, other than a source for great stories.

PLPMTUD is a little better, in a sense that it attempts to limit itself to inband probes, but then there is the delicate dance of loss customarily being used to signal the congestion.

So this mechanism isn’t too reliable either, in very painful ways for the poor soul on call dealing with the outcomes. Ask me how I know.. ;-)

Now, let’s add to this the extremely pragmatic and evil hack that is the TCP MSS clamping, coming back from the first PPPoE days; which makes just enough of the eyeball traffic work to make this a “small problem with unimportant traffic that no one cares for anyway”.

So yes, safe minimums are a practical solution.

Until one start to build the tunnels, that is. A wireguard tunnel inside IPSec tunnel. Because policy. Inside VXLAN tunnel inside another IPSec tunnel, because SD-WAN. Which traverses NAT64, because transition and address scarcity.

At which point the previously safe minimums might not be safe anymore and we are back to square 1. I suspect when folks will start running QUIC over wireguard/ipsec/vxlan + IPv6 en masse we will learn that (surprise!) 1200 was not a safe value after all.

So, with this in mind, I posit it’s nice to attempt to at least fantasize about the universe where MTU determination would be done entirely inline, even if hypothetical - if we had the benefit of today’s hindsight and could time travel - could we have made it better ?

P.s. unidirectional protocols could be taken care of by fountain codes not unlike the I-, P- and B- frames in video world, with similar trade offs, moreover, I feel the unequal probability of loss depending on a place in the packet might allow for some interesting tricks.

[-]

zamadatix 5 months ago ago

Agree wholeheartedly on the pragmatic standpoint of just using minimums.

With regard to the problems of out of band signaling in plain PMTUD I fully agree with all your well stated points, doubly so on PLPMTUD! PLPMTUD is my preferred variation of PMTUD and I was glad to see the datagram form utilized in QUIC (especially since it's really a generic secure network tunneling protocol, not just the HTTP variant). I'm also glad QUIC's security model naturally got rid of MSS clamping... it was somewhat pragmatic in one view... but concerning/problematic in others :D. Of course it's not like TCP/mss clamping have exactly gone away though :/.

Also fully agree on both PLPMTUD still not being as reliable/fast as one would like (though I still think it's the best of the options) + safe minimums never seeming to stay "safe". At least IPv6 attempted to hedge this by putting pressure on network admins, saying "everyone is expecting 1280". Of course... we all know that doesn't mean every client ends up with 1280, particularly if they are doing their own VPN tunnel or something, but at least it gives us network guys an extra wall of "well, the standard says we need to allow expectation of 1280 and the rate of bad things which happen will be much higher lower than that".

You seem to have some really neat perspectives on networking, do you mind if I ask about what you do/where you got your experience? I came up through the customer side and eventually over time morphed my way into NOS development at some network OEMs and it feels like I run into fewer and fewer folks who deal with the lower layers of networking as time has went on. I think the most "fun" parts are trying to design overlay/tunneling systems which are hardware compatible with existing ASICs or protocols but are able to squeeze some more cleverness out of the usage (or, as you put it, if we had the benefit of today’s hindsight and could time travel - could we have made it better). The area I'd say I've been least involved in, but would like to, is anything to do with time sensitive networking or lossless ethernet use cases.

[-]

ay 5 months ago ago

> "everyone is expecting 1280"

This works great until there is an app that is expecting 1280 and there is an operator that gives you 1280, and you have to run this app over an encrypted GENÈVE tunnel that attempts to add half a kilobyte of metadata :-). RADIUS with EAP or DHCP with a bunch of options can be a good example of a user app like this. Unfortunately this is a real-world problem.

The smaller mismatch but nonetheless painful is the 20 byte difference between IPv4 and IPv6 header sizes. It trips up every NAT64 deployment.

> where you got your experience?

A long path along all the OSI layers :-). Fiber and UTP networks install between ~95 and 2000. CCIE R&S#5423 in ‘99 and from 2000 almost 10 years in TAC and one of the first CCIE in Europe. Then some years working on IPv6 transition. Large scale IPv6 WiFi. Some folks know me by “happy eyeballs”; some by a “nats are good” YouTube video (scariest thing it’s still funny a decade later). These days - relops at fd.io VPP + internal CI/CD pipeline for a bunch of projects using VPP; and as a side gig - full-cycle automation of the switched fleet (~500 boxes) at #CLEUR installations. One of the recent fun projects was [0] - probably industry first of this scale, for an event network: more than 15K WiFi clients on IPv6Mostly. Though we were benefitting from work of a lot of folks that pushed the standardization and did smaller/more controlled deployments, specifically to shout huge thanks to Jen Linkova and Ondřej Caletka.

If you like low level network stuff, you might like VPP - and given it’s Apache licensed, pretty easy to use it for your own project.

[0] https://www.ietf.org/proceedings/122/slides/slides-122-iepg-...

[-]

zamadatix 5 months ago ago

Agreed, still not perfect by any means.

One minor Ethernet MTU thing I would change with a time machine is to have the network header portion of the MTU be more like 802.11. I.e. instead of sized exactly to the headers of the day it intentionally was larger to allow variation over time. It wouldn't really do anything for most of the MTU concerns discussed here or for clients but I think it would have been helpful for the evolution of wired protocols.

Happy eyeballs! Yes, I loved that one! I was always a huge IPv6 nerd as well, though I didn't get started until shortly after that. The "nats are good" video isn't ringing any bells but if you have a link I'd definitely give it a watch as it sounds right up my humour alley.

Unfortunately all of that Cisco affiliation means we are forever blood enemies and can never speak again... ;). I kid, I came up through the Nortel heritage originally so I'm bound by contract to make such statements.

I've heard great things about the Fast Data Project, I'll definitely have to look into it some before the Oblivion remake comes out :). Maybe after this current project at work I'll finally get to mess with software based dataplanes properly.

It was great running into you here, I hope to catch you around more now that I know to look!

[-]

ay 5 months ago ago

> more like 802.11

L2 is “relatively simple” in a sense that it’s usually under the same administrative control; unlike with L3. And even then, if you have a look at all the complexity between the maintaining the interop in the wireless space… it’s amazing it works as well as it does, with so much functionality being conditional.

> "nats are good" video isn't ringing any bells

https://youtu.be/v26BAlfWBm8?feature=shared - it was a bit of a meme back at the time in making the “X fanboy” videos.

> I came up through the Nortel heritage originally

My networking cradle is Netware 4.1, and in those times it was a zoo of protocols anyway. I really liked conceptually the elegance of Nortel management being SNMP-first. Makes me smile hearing all these “API-first!” claims today.

> It was great running into you here

Indeed, nice to meet you too ! :-)

I do a fair bit of lurking. yesterday was a bit of an anomaly since the whole “truncation as a means to do PMTUD” was a subject of my idle ponder for more than a decade, so it struck the chord :-)

[-]

zamadatix 5 months ago ago

"Go buy some weed and smoke it" LOL

ay 5 months ago ago

With IPv4, clearing the DF bit in all egress packets and hacking on top of QUIC could give just enough of a wiggle room to make it possible to explore this between a pair of cooperating hosts even in today’s Internet.

Anti-DDoS middle boxes will be almost certainly unhappy with lone fragments and UDP in general, so it’s a bit of a thorny path.

The big question is what to do with IPv6, since the intermediary nodes will only drop. This bit unfortunately makes the whole exercise pretty theoretical, but it can be fun nonetheless to explore.

Feel free to contact me at my github userid at gmail, if this is a topic of interest.

[-]

zamadatix 5 months ago ago

Most carrier/enterprise/hardware IPv4 routers, particular those on the internet, will not actually perform IPv4 fragmentation on behalf of the client traffic even though it's allowed by the IPv4 standard. Typically fragmentation is reserved for boxes which already have another reason to care about it (such as needing to NAT or inspect the packets) or the client endpoints themselves. I.e. the internet will (sparing security middleboxes) allow arbitrary IPv4 fragments through but it won't typically turn a 8000 byte packet into 6 fragments to fit through a 1500 byte MTU limitation on behalf of the clients. E.g. if you send a 1500 byte IPv4 ping without DF set to a cellular modem or someone with a DSL modem using PPPoE it'll almost always get dropped by the carrier rather than fragmented.

Of course nothing is stopping you from labbing it up at home. Firewalls and software routers can usually be made to do refragmentation.

[-]

ay 5 months ago ago

Of course on the carrier boxes the fragmentation is done also not inline, so its behavior will depend on the aggressiveness of the CoPP configuration, and will be subject to the same pitfalls as the ICMP packet too big generation.

Thanks for keeping me straight here!

Based on the admittedly old study at [0] seems like some carriers just don’t bother to fragment, indeed - but by far not all of them.

Firewalls might do virtual reassembly, so the trick with the initial fragment won’t fly there.

This MTU subject is interesting for me because I have a little work in progress experiment: https://gerrit.fd.io/r/c/vpp/+/41914/1/src/plugins/pvti/pvti... (the code itself is already in, but has a few crashy bugs still and I need to take make it not suck performance wise, but that is my attempt to revisit the issue of MTU for tunnel use case. The thesis is that keeping the 5-tuple will make “chunking”/“de-chunking” at tunnel endpoints much much simpler on the endpoints of the tunnel.

The source of inspiration was a very practical setup at [1], which is, while looking horrible in theory (locally fragmented GRE over L2TP), actually gives a decent performance with 1500-byte end to end MTU over the tunnel.

The open question is which inner MTU will be sane, taking into account the increased probability of loss with bigger inner MTU… intuitively seems like something like ~2.5K should just double the loss probability (because it’s 2x packets) and might be a workable compromise in 2025….

One could also do the same trick over QUIC, of course, but i wanted something tiny and easier to experiment with - and the ability to go over IPSec or wireguard as well as a secured underlay.

[0] https://labs.ripe.net/author/emileaben/ripe-atlas-packet-siz...

[1] https://github.com/ayourtch/linode-ipv6-tunnel

[-]

zamadatix 5 months ago ago

Very interesting! It's like the best of the fragment-pre-encrypt world (everything appears as single packet 5 tuples to middleboxes) and fragment-post-encrypt world (transported packet data remains untouched) debate seen on IPsec deployments.

Like you mention you could do this under QUIC but then you'd be hamstrung to some of the design mandates such as encryption. This is way better as it's just datagrams doing your one goal - hiding that you're transporting fragments.

[-]

ay 5 months ago ago

Yeah, that was precisely the set of trade offs :-)

OTOH, I heard folks calling to banish the “no messing with a flow within 5-tuple” principle, so my hack may not have an overly long shelf life.

[-]

zamadatix 5 months ago ago

Next up: Everything just ends up being QUIC because you can't fuck with what you can't see inside :).

[-]

ay 5 months ago ago

Potentially. However, anecdotally a lot of service providers treat UDP to stricter rate limiting than TCP because it’s unauthenticated nature, so there is a back-pressure factor there.

Also: RFC9000 for QUIC is almost 50% longer than RFC9293 that is the new one for TCP - so, I would expect the implementation would be probably more complex ?

In the absence of that, everything will go over HTTP :-)

ikiris 5 months ago ago

And how do you tell the difference between cut off packets, and a mtu drop? What about crcs / frame checks? Do you regenerate the frames? Do you do this at routed interfaces? What if there's just layer 2 only involved?

[-]

LegionMammal978 5 months ago ago

> And how do you tell the difference between cut off packets, and a mtu drop?

You don't, apart from enforcing a bare-minimum MTU for sanity's sake. If your jumbo-size packets are getting randomly cut off by a middlebox, then they probably aren't stable at that size anyway.

Veserv 5 months ago ago

Packets do not get “cut-off” normally. That is kind of the point. Some protocols allow transparent fragmentation, but the fragments need to encode enough information for reconstruction, so you can still detect “less data received than encoded on send”.

You do not need bit error detection because you literally truncated the packet. The data is already lost. But in the process you learned it was due to MTU limits which is very useful. Protocols are already required to be robust to garbage that fails bit error detection anyways, so it is not “required” to always have valid integrity tags. You could transparently re-encode bit error detection on the truncated packet if you so desire to ensure data integrity of the “MTU resulted in truncation” packet that you are now forwarding, but again, not necessary.

Any end-to-end protocol that encodes the intended data size in-band can use this technique across truncating transport layers. And any protocol which does so already requires implementations to not blindly trust the in-band value otherwise you get trivial buffer overflows. So, all non-grossly insecure client implementations should already be able to safely handle MTU truncation if they received it (they would just not be able to use that for MTU discovery until they are updated). The only thing you need is routers to truncate instead of drop and then you can slowly update client implementations to take advantage of the new feature since this middlebox change should not break any existing implementations unless they are inexcusably insecure.

[-]

ikiris 5 months ago ago

I don’t think you understand what normally looks like if you start forwarding damaged frames like this because you can’t tell the difference. That was the point.

[-]

Veserv 5 months ago ago

I literally have no idea what you are talking about. You can send garbage packets that conform to no known protocol on the internet. You can get more bit errors or perfect bit errors that make your bit error detection pass while still forwarding corrupt payloads. Transport protocols and channels must be and are robust to this.

“Damaged” frames and frame integrity only matter if you need the contents of the entire packet to remain intact. Which you explicitly do not when truncating.

The only new problem that arises is that maybe the in-band length information or headers get corrupted resulting in misinterpreting the truncation that actually occurred. And again, you already need to be robust to garbage. And you can just change my proposal to recompute the integrity tag on the truncated data if you think that really matters.

[-]

ikiris 5 months ago ago

I agree you have no idea what I’m talking about

cryptonector 5 months ago ago

> Path MTU discovery has not been enthusiastically embraced

Ugh. I don't understand this. Especially passive PMTUD should just be rolled out everywhere. On Linux it still defaults to disabled! https://sourcegraph.com/search?q=context%3Aglobal+repo%3A%5E...

[-]

whiatp 5 months ago ago

PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages. The second worst offender is, IMO (data center and ISP) routers that generate ICMP replies in their CPU, meaning large packets hit a rate limited exception punt path out of the switch ASIC over to the cheapest CPU they could find to put in the box. If too many people are hitting that path at the same time, (maybe) no reply for you.

More rare cases, but really frustrating to debug was when we had an L2 switch in the path with lower MTU than the routers it was joining together. Without an IP level stack, there is no generation of ICMP messages and that thing just ate larger packets. The even stranger case was when there was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then tried to send those over a VPNish network interface that absolutely couldn't handle that. Even if the network in the middle bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.

[-]

toast0 5 months ago ago

> The even stranger case was when there was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then tried to send those over a VPNish network interface that absolutely couldn't handle that. Even if the network in the middle bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.

This is an old Linux tcp offloading bug; large receive offload smooshes the inbound packet, then it's too big to forward.

I had to track down the other side of this. FreeBSD used to resend the whole send queue if it got a too big message, even if the size did not change. Sending all at once made it pretty likely for the broken forwarder to get packets close enough to do LRO, which resulted in large enough packet sending to show up as network problems.

I don't remember where the forwarder seemed to be, somewhere far away, IIRC.

cryptonector 5 months ago ago

> PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages.

Passive PMTUD does NOT depend on ICMP messages.

immibis 5 months ago ago

L2 not generating errors is expected behaviour - all ports on the L2 network are supposed to have the same MTU set

Hikikomori 5 months ago ago

They recently started supporting pmtud on tgw. But it wasn't a big deal really as it adjusted mss instead.

JackSlateur 5 months ago ago

path mtu discovery is worthless because the sending host does not control the path used

So it is not compatible with anycast, for instance, which is massively used everywhere

In the end, having no answer is better than having a most likely wrong answer

mkj 5 months ago ago

Would that help with UDP, or only TCP?

[-]

ajb 5 months ago ago

That particular one, only TCP. There is a different one for UDP applications: https://www.rfc-editor.org/rfc/rfc8899

Because UDP is only a very thin layer, each layer on top (eg, QUIC) has to implement PLPMTUD; although, recently IETF standardised a way to extend UDP to have options and PLPTMUD is also specified for that: https://datatracker.ietf.org/doc/draft-ietf-tsvwg-udp-option...

5 months ago ago

[deleted]

cryptonector 5 months ago ago

You can implement passive PMTUD with UDP if you like. It's more work for you, but it's perfectly doable.

posnet 5 months ago ago

"Jumbogram", an IPv6 packet with the Jumbo Payload option set, allowing for an frame size of up to 2³²-1 bytes.

At 10Gbps it would take 3.4 seconds just to serialize the frame.

[-]

hugmynutus 5 months ago ago

Luckily 400Gb/s nics are already on the market [1]

[1] https://docs.broadcom.com/doc/957608-PB1

nullc 5 months ago ago

Is there any convenient way to tell linux distributions that the local subnet can handle 9k jumbos (or whatever) but that anything routed out must be 1500?

I currently have this solved by just sticking hosts on two vlans, one that has the default route and another that only has the jumbo capable hosts. ... but this seems kinda stupid.

[-]

fbouynot 5 months ago ago

Yes you can set your interface MTU at 9000 and assign a 1500 MTU to the routes themselves.

[-]

throw0101b 5 months ago ago

> […] and assign a 1500 MTU to the routes themselves.

See "mtu" option in ip-route(8):

* https://man.archlinux.org/man/ip-route.8.en#mtu

The BSDs also have an "-mtu" option in route(8):

* https://man.freebsd.org/cgi/man.cgi?route(8)

* https://man.openbsd.org/route

jeffbee 5 months ago ago

The efficiency argument applies to private flows mostly. In terms of overall network traffic, the huge majority takes place between peers that share a local or private network. Internetworking as such has a relatively small share of total flows. So large frame sizes are beneficial in the context where they are also not problematic, and path MTU discovery is not beneficial in the context where it has many drawbacks. It seems as though the current state is pretty much optimal.

[-]

avidiax 5 months ago ago

If you've ever tried to enable jumbo packets on your LAN, you'd soon learn that it causes lots of problems.

First, every L2 "dumb" switch that doesn't support your jumbogram size just silently drops the packet, which is no good.

Then, you have to figure out what size of jumbogram every device on your network supports, and select the minimum. In many cases, you'll have clients that don't support it at all.

And I hope all your OSes support setting an MTU per route, and you enjoy setting special routes on all of your clients, since Path MTU discovery, even where it is enabled and supported, at the very least adds latency to every connection, if it even works at all.

And god help you once you try to scale up your sweet jumbo frame solution. Plenty of routers have strict ICMP rate limits either imposed in software or hardware (because ICMP may be handled in an anemic CPU). So those ICMP fragmentation needed packets aren't reliably returned to your clients. It's even worse if your ISP doesn't block jumbograms outright. You will soon learn which of your ISPs peerings do or don't support jumbograms and whether they do or don't emit or forward ICMP.

The only advisable way to use jumbo frames is if you are running a datacenter and you have a group of machines that can be properly configured for route-based MTU and that would benefit from jumbo frames, and every piece of hardware you buy is carefully specced to support it.

[-]

jeffbee 5 months ago ago

Yeah, but that's really my point. The byte-weighted flow of Ethernet frames on Earth is overwhelmingly happening inside the data centers of a handful of huge organizations.

nayuki 5 months ago ago

> The speed of light in glass or fiber-optic cable is significantly slower, at approximately 194,865 kilometers per second. The speed of voltage propagation in copper is 224,844 kilometres per second.

If I understand correctly, the speed of light in an electrical cable doesn't depend on the metal that carries current, but instead depends on the dielectric materials (plastic, air, etc.) between the two conductors?

[-]

tonyarkles 5 months ago ago

If I’m interpreting what you’re asking correctly, yes. The velocity factor of a cable doesn’t spend on the metal it’s made of but rather the insulator material and the geometry of the cable.

For fibre the velocity factor depends on the refraction index of the fibre.

lucb1e 5 months ago ago

Huh? Maybe I'm completely misreading the question, but when they say fiber-optic cable, they do mean optic. It's not an "electrical cable"; there is no metal needed in optic communication cables (perhaps for stiffness or whatnot, but not for the communication)

[-]

Hikikomori 5 months ago ago

>The speed of voltage propagation in copper is 224,844 kilometres per second.

This part?

[-]

lucb1e 5 months ago ago

What about it?

[-]

Hikikomori 5 months ago ago

They're separate statements, one about speed in fiber and the other about speed in copper?

roeles 5 months ago ago

> The system dispensed with a passive common bus and replaced it with an active switching hub to which hosts were attached.

I get the impression that the standard still allows hubs to exist, but that you just don't see them in practice.

I would be interested if anyone has ever used a 100mbit hub.

beeburrt 5 months ago ago

That font size is tiny. If this is your site, maybe consider a larger font size

[-]

nayuki 5 months ago ago

The site specifies a base font size of 12px. The better practice is to not specify a base font size at all, just taking it from the user's web browser instead. Then, the web designer should specify every other font size and box dimension as a scaled version of the base font size, using units like em/rem/%, not px.

lucb1e 5 months ago ago

It's the same size as HN: 12px. HN looks larger to me for some reason, but I can't figure out why: when I overlay a quote someone posted here over the website with half transparency in GIMP, the text is clearly the same height. Some letters are wider, some narrower, but the final length of the 8 words I sampled is 360px on HN vs. 358px on that website (so differences basically cancel out)

This is on Firefox/Debian, in case that means something for installed fonts. I see that site's CSS specifies Verdana and Arial, names that sound windowsey to me but I have no idea if my system has (analogous versions to) those

tomthecreator 5 months ago ago

There's a PDF version linked at the top of the article, it's actually much better typeset.

usefulcat 5 months ago ago

Given the subject of TFA, this seems appropriate in a meta sort of way.

jiehong 5 months ago ago

No commitee want to change it, because nobody agrees. And nothing changes.

Can’t we accept to start a change that may take a decade or more to go forward? Instead of not starting that change.

2OEH8eoCRo0 5 months ago ago

Do you count the frame preamble?

yb303 5 months ago ago

tldr- a document written in 2024 that does fit on my phone

The Size of Packets