You might also want to mention AMWA NMOS, which is increasingly used alongside SMPTE 2110 in setups like this. NMOS (Networked Media Open Specifications) defines open, vendor-neutral APIs for device discovery, registration, connection management, and control of IP media systems. In practice, it's what lets 2110 devices automatically find each other, advertise their streams, and be connected or reconfigured via software.
The specs are fully open source and developed in the open, with reference implementations available on GitHub (https://github.com/AMWA-TV)
The specs define REST APIs, JSON schemas, certificate provisioning, and service discovery mechanisms (DNS-SD / mDNS), providing an open control framework for IP-based media systems.
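If anyone wants a feel for what those APIs look like in practice, here's a minimal sketch (Python) that lists senders from an IS-04 Query API and stages a connection on a receiver via IS-05. The URL paths follow the published specs, but the registry/receiver addresses and the receiver ID are made up, and a real controller would also handle transport parameters, errors, and auth.

    # Minimal sketch of talking to an NMOS registry (IS-04) and a media node (IS-05).
    # The hostnames and the receiver ID are hypothetical; the URL paths follow the
    # AMWA IS-04 Query API and IS-05 Connection API specs.
    import requests

    REGISTRY = "http://registry.example.local:8080"       # hypothetical IS-04 registry
    RECEIVER_NODE = "http://receiver.example.local:8080"   # hypothetical media node

    # IS-04: list every sender the registry knows about.
    senders = requests.get(f"{REGISTRY}/x-nmos/query/v1.3/senders").json()
    for s in senders:
        print(s["id"], s["label"], s["transport"])

    # IS-05: stage the first sender onto a receiver, activating immediately.
    receiver_id = "7f9d1b2c-0000-0000-0000-000000000000"   # hypothetical
    patch = {
        "sender_id": senders[0]["id"],
        "master_enable": True,
        "activation": {"mode": "activate_immediate"},
    }
    requests.patch(
        f"{RECEIVER_NODE}/x-nmos/connection/v1.1/single/receivers/{receiver_id}/staged",
        json=patch,
    )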
There’s also AES70, or OCA (https://news.ycombinator.com/item?id=46934318). More popular in audio than video, something of a competitor to NMOS (although there are parts of NMOS that were very much inspired by OCA). There are open source C++, Python, JavaScript and Swift implementations as well as some commercial ones.
Fun to see.
PipeWire has some decent AES67 support for network audio, and some really fun hardware has already been tested with it. Afaik no SMPTE 2110 (which is video), but I don't really know.
I know it's not the use case, but I do wish compressed formats were better supported. Not really necessary for production, but these are sort of the only de facto broadly capable network protocols we have for AV, so it would expand the potential uses a lot IMO. There may be some very proprietary JPEG XS compression, but generally the target seems to be uncompressed.
https://gitlab.freedesktop.org/pipewire/pipewire/-/wikis/AES...
Actually SMPTE 2110 can host video (2110-20), audio (2110-30) and ancillary data (2110-40) essences, and each essence can be delivered independently of the others.
ST 2110-22 standardizes compressed video using JPEG XS. While there is a patent pool for XS, otherwise the format is standardized and open.
It would be nice to see an essence type defined for AVC, but the quality tradeoffs of AVC/HEVC are really not appropriate for the domain ST 2110 is aiming at, which is the contribution-side video network of a broadcast operation.
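To make the "independent essences" point concrete, this is roughly what the SDP for a 2110-20 video flow and a 2110-30 audio flow looks like (wrapped in Python only for printing). The multicast addresses, ports, payload types, and PTP grandmaster ID are invented; the attribute names follow the -20/-30 conventions and RFC 7273.

    # Illustrative SDP fragments (not from a real device) for two independently
    # routable ST 2110 essences: a 2110-20 uncompressed video flow and a 2110-30
    # PCM audio flow, both referenced to the same PTP grandmaster.
    import textwrap

    VIDEO_SDP = textwrap.dedent("""\
        m=video 50000 RTP/AVP 96
        c=IN IP4 239.10.10.1/32
        a=rtpmap:96 raw/90000
        a=fmtp:96 sampling=YCbCr-4:2:2; width=1920; height=1080; exactframerate=60000/1001; depth=10; colorimetry=BT709; PM=2110GPM; SSN=ST2110-20:2017
        a=mediaclk:direct=0
        a=ts-refclk:ptp=IEEE1588-2008:EC-46-70-FF-FE-00-00-01:0
    """)

    AUDIO_SDP = textwrap.dedent("""\
        m=audio 50010 RTP/AVP 97
        c=IN IP4 239.10.10.2/32
        a=rtpmap:97 L24/48000/8
        a=ptime:1
        a=mediaclk:direct=0
        a=ts-refclk:ptp=IEEE1588-2008:EC-46-70-FF-FE-00-00-01:0
    """)

    print(VIDEO_SDP)
    print(AUDIO_SDP)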
There are alternative "consumer grade" and "prosumer grade" IP video solutions out there.
There is Teleport, which is growing up in the OBS space but is quite capable (we've used it in production for quite a while).
https://github.com/fzwoch/obs-teleport
And of course the underpinning of 2110 itself is RTP, which is a standard network protocol, which does have AVC defined as a payload type in RFC 6184
https://www.rfc-editor.org/rfc/rfc6184
Really it's a bit odd to wish that ST 2110 had compressed video when it's really just a specific profile of RTP with some broadcast specific bits on top while RTP itself does support lots of payloads.
ST 2110-10, which provides the timing, just standardizes PTP and the meaning of the RTP timestamps (notably a specific epoch), but there's nothing stopping you from using PTP-based timestamps for your RTP payloads otherwise.
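Concretely, the -10 rule boils down to: the RTP timestamp is the PTP time (seconds since the PTP epoch) multiplied by the media clock rate, truncated to 32 bits. A toy sketch of that mapping, using the system clock as a stand-in for a real PTP-disciplined clock (so TAI/UTC offset handling is waved away):

    # Sketch of the ST 2110-10 timestamp rule: RTP timestamp = PTP seconds since
    # the epoch times the media clock rate, modulo 2^32. A real implementation
    # reads PTP time from the NIC/driver; time.time() here is only illustrative.
    import time

    VIDEO_CLOCK_HZ = 90_000   # 2110-20 video media clock
    AUDIO_CLOCK_HZ = 48_000   # typical 2110-30 audio media clock

    def rtp_timestamp(ptp_seconds: float, clock_hz: int) -> int:
        """Map an absolute PTP time to a 32-bit RTP timestamp."""
        return int(ptp_seconds * clock_hz) % 2**32

    now = time.time()   # stand-in for PTP time (UTC, not TAI, hence "toy")
    print("video ts:", rtp_timestamp(now, VIDEO_CLOCK_HZ))
    print("audio ts:", rtp_timestamp(now, AUDIO_CLOCK_HZ))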
ST 2110 is not a "plug and play" system by itself. There is a whole family of standards that adds such capabilities, the NMOS IS specs, but none of that is attempting to make "peer to peer" (so to speak) ST 2110 a thing, so actually using it for anything other than a broadcast system is far from trivial, and you'd be better off using something else. NMOS's goal is to make auto-configuration of ST 2110 flows a thing, which it has broadly succeeded in doing.
ST 2110-22 is codec agnostic. It just standardises CBR compression, for which JPEG-XS is a good fit today.
For plug-and-play, IPMX (https://ipmx.io/about/) is looking to be a pretty promising approach that combines ST 2110 with NMOS, auth, encryption, and other useful features. It's targeted at the ProAV market but IMO should be mostly suitable for consumer use.
Ooh, you're right, and it just adopts the IETF RTP payload types for that. Cool.
Also forgot about IPMX.
Wish I was seeing more ST 2110 and IPMX open source work about. Would really love for good, or at least common, protocols to be broadly usable and available via some good libraries.
JPEG XS did indeed get FFmpeg support a couple of months ago. https://www.phoronix.com/news/FFmpeg-Merges-JPEG-XS
It's worth mentioning NDI (Network Device Interface) as well, which is widely used in Pro-AV for transporting compressed video and audio over IP.
Generally NDI is not widely used in professional AV. There's a fair bit in prosumer, and a _little_ bit in the low end of pro. But the fact that it's a proprietary protocol (they can claim it's “open” all they want, but the SDK is closed source, there is no spec and they sue people who make open reimplementations), has poor image quality (it's roughly MPEG-2 intraframe, i.e., not very good), has poor latency and isn't very reliable makes it a no-go for most larger installations.
I would argue that it is heavily used in professional AV, but not as much in high-end installations and usually not in broadcast setups.
It is certainly not open and doesn't compare to ST2110. I was mentioning it for compressed video streaming.
Just to add options, there is also the relatively new OpenMediaTransport ( https://www.openmediatransport.org/ ), which aims to be a licensing-free, open alternative to NDI. At the moment there are a number of programs supporting it, but sadly not many cameras, standalone converters, or audio gear. I'd like to see that change.
Nice to see an article from my industry! ST 2110 is such a complex standard, and a lot of the hardware mentioned has been molded to deal with it.
Most 2110 kit relies on narrow timing. That means packets arriving and leaving in a window on the order of 10 microseconds. Doing that reliably in software for your typical 100 Gbit interface is challenging.
Challenging but certainly doable with kernel bypass technologies and dedicated CPU cores.
It generally also needs help from the NIC, to pace out the packets at the right timestamps. (Typically delivered as part of said bypass technologies.)
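For a sense of how tight that is, here's a back-of-the-envelope for a single uncompressed 1080p60 4:2:2 10-bit flow, ignoring RTP/UDP/IP overhead and the exact -21 sender model:

    # Rough inter-packet spacing for one uncompressed ST 2110-20 flow
    # (1080p, 60 fps, 4:2:2, 10-bit). Order-of-magnitude only.
    width, height, fps = 1920, 1080, 60
    bits_per_pixel = 2 * 10          # 4:2:2 at 10 bits per component
    payload_bytes = 1200             # assumed RTP payload size per packet

    bits_per_second = width * height * fps * bits_per_pixel
    packets_per_second = bits_per_second / (payload_bytes * 8)
    gap_us = 1e6 / packets_per_second

    print(f"{bits_per_second / 1e9:.2f} Gbit/s, "
          f"{packets_per_second:,.0f} packets/s, "
          f"~{gap_us:.1f} us between packets")

That works out to roughly 2.5 Gbit/s and a packet every ~4 µs for a single HD flow, before you stack dozens of flows onto one interface.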
It's SMPTE, not SMTPE.
Doh! Will fix on my article at least.
I'm amused but not entirely surprised to see that live video production hasn't meaningfully progressed since I was involved 30+ years ago.
Yes, the technology has evolved – digital vs analog (partly – for example analog comms here because digital (optical) "isn't redundant" (lol, what?)); higher resolution; digital overlays and effects, etc. But the basic process of a bunch of humans winging it and yelling to each other hasn't changed at all.
This is an industry ripe for massive disruption, and the first to do it will win big.
That rack cabling is a bit rough. Appreciate it's a live event (I've worked on them myself) but come on :)
> like why they use bundles of analog copper wire for audio instead of digital fiber
Good article. Got me to read the article because I was curious why...
Today's broadcasting truly is a beautifully evolved piece of engineering. For a guy like me who dabbles in time / frequency synchronisation, this is a great example of a use case that physically malfunctions without sync.
...but the first time I learned about SMPTE was from Frank Zappa's song "Baby Snakes" - interestingly, it mentions both SMPTE and sync. Every time SMPTE is mentioned, this plays in my head.
Unfortunate typo in the headline, reproduced here and once in the article. This is not about email or spam.
Neat post. I wonder what the drift is on those clocks.
Seems like the classic legacy overengineered thing: it costs 100x what it should because it's a niche system, is 10x more complex than needed due to unnecessary perfectionism, and uses 10-100x more people than needed due to employment inertia.
A more reasonable thing is to just use high quality cameras, connect to the venue fiber Internet connection, use normal networked transport like H.265 with MPEG-TS over RTP (sports fans certainly don't care about recompression quality loss...), do time sync by having A/V sync and good clocks on each device and aligning based on audio loud enough to be recorded by all devices, then mix, reencode and distribute on normal GPU-equipped datacenter servers using GPU acceleration.
The sort of systems which demand 100% reliability tend to be like that. "Disruption" in the middle of live sports broadcast is unpopular with customers.
While I think you are oversimplifying the timing issue, you are not the first to think that about 2110.
https://stop2110.org/
The engineer on the truck seemed to have the most annoyance with the PTP aspect of 2110, but it seemed nobody questioned the move to 2110, and at least as far as broadcast equipment goes, they're all in on 2110. As a small(ish) YouTuber, NDI is more exciting to me, but I'm not mixing dozens or hundreds of sources for a real time production, and can just re-record if I get a sync issue over the network.
Perfect is the enemy of the good, as always—reading through that site, it seems like no solution is perfect, and the main tradeoff from that author's perspective is bandwidth requirements for UHD.
It looks like most places are only hitting 1080p still, however. And the truck I was looking at could do 1080, but runs the NHL games at 720p.
> it seems like no solution is perfect, and the main tradeoff from that author's perspective is bandwidth requirements for UHD.
The “no standalone switch can give enough bandwidth” issue has generally been solved since that page was written. You can buy 1U switches now off-the-shelf with 160x100G (breaking out from 32x800G). One of the main drivers of IP in this space is that you can just, like, get an Ethernet switch (and scale up in normal Ethernet ways) instead of having to buy super-expensive 12G-SDI routers that have hard upper limits on number of ins/outs.
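Rough numbers on how far one of those ports goes (active video only, 4:2:2, 10-bit, ignoring protocol overhead and blanking):

    # Very rough per-flow bandwidth and switch-port capacity, to show why an
    # off-the-shelf Ethernet switch replaces a large SDI router.
    def flow_gbps(width, height, fps, bits_per_pixel=20):   # 4:2:2, 10-bit
        return width * height * fps * bits_per_pixel / 1e9

    hd = flow_gbps(1920, 1080, 60)     # ~2.5 Gbit/s
    uhd = flow_gbps(3840, 2160, 60)    # ~10 Gbit/s

    print(f"HD flows per 100G port:   ~{int(100 // hd)}")
    print(f"UHD flows per 100G port:  ~{int(100 // uhd)}")
    print(f"HD flows across 160x100G: ~{int(160 * (100 // hd))}")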
Of course, most random YouTubers are not going to need this. But they also are not in the market for broadcast trucks.
Yes, it's a huge benefit. Of course, without an NMOS SDN solution, actually reliably routing so much data over a network (especially if incrementally designed) is a huge pain in the ass. But thankfully we have those systems now.
We sort of traded the big expensive SDI switchers for big expensive SDNs
Also, I guess we traded a ton of coax cable for somewhat more manageable single-mode fiber. :-)
I never fully understood why SDI over fiber remains so niche, e.g. UHD people would rather do four chunky 3G-SDI cables instead of a much cheaper and easier-to-handle fiber cable (when the standards very much do exist). But once your signal is IP, then of course fiber is everywhere and readily available, so there seems to be no real blocker there.
I don't know, but is there a maximum compression weight for fiber? Because in some of these broadcast centers they've got cable trays of SDI that are so heavy and packed that removing a dead line is a fire hazard (the friction of pulling the line could cause a fire).
They'd obviously need a lot less and the lines are a lot lighter but maybe folks figured if they could avoid repeating that scenario in their design, it might be a good idea :-P
You can build fiber basically arbitrarily solid. A normal patch cable won't be that solid, but the more rugged trunk cables are something like (just pulling from a data sheet for something I used a while back):
* Outer diameter: 6 mm
* Max tensile load: 900 N
* Crush resistance: 750 N / 10 cm
* Max proof stress: >= 0.69 GPa
To be clear, this is not a specially rugged cable by any means. This is just a normal G12 cable for general use. You can get stuff that's much more solid. It's certainly much lighter than the equivalent SDI copper cable.
2110 is certainly popular in the industry. There's no one way to get video out of a sports venue and across the network to takers, though. Where I work, different workflows have SDI, NDI, SRT, RIST, and our own internal stuff uses MPEG-TS over UDP and gets routed by a distributed system that determines next-hop routing through our network at each hop. The encoding might be H.264, HEVC, or even JPEG2000.
NDI is indeed quite good for prosumer cases. As a Newtek (now Vizrt) shop, our Tricasters speak it natively and that's a great reason we've made use of it.
That being said, if you aren't already in the Newtek/Vizrt ecosystem, might I recommend exploring Teleport, a free and open source NDI alternative that plugs into OBS and has also served us very well.
That's certainly true to an extent. Other commenters have already highlighted necessary complexities. There is absolutely a lot of very entrenched "ways-of-working" that add unnecessary complexity, as with every domain. Not everything is a technical problem though and the social / process side of this sort of setup is what can make it work at all.
The approach that you're hinting mostly describes the general direction of remote production (https://video.matrox.com/en/media/guides-articles/what-is-re...). The big traditional players are already across that (https://www.grassvalley.com/ampp/, https://www.rossvideo.com/use-cases/remote-production/), AWS also has a plethora of services to lock you into their stack (https://aws.amazon.com/media-services/), and there's interesting new players too (https://www.tryiris.ai). There's a heap of different workflows out there, and OB trucks like the one highlighted here are just one of those.
Sounds like you've got it made then: produce the equivalent that fits in a minivan and laugh all the way to the bank.
We're going to need a lot of popcorn to keep us fed while we wait.
> do time sync by having A/V sync and good clocks on each device and aligning based on audio loud enough to be recorded by all devices
Why do you need good clocks? For audio, even with simultaneously playing speakers, you only need to synchronize within a couple of ms unless you need coherence or are a serious audiophile. If you want to maintain sync for an hour, I suppose you need a decently good clock.
But as long as you have any sort of wire, basically any protocol can synchronize well enough. Although synchronizing based on visual and audible sources is certainly an interesting idea. (Audio alone is a complete nonstarter for a sporting event: the speed of sound is low and the venues are large. You could easily miss by hundreds of ms.)
> then mix, reencode and distribute on normal GPU-equipped datacenter servers using GPU acceleration
Really? Even ignoring latency, we’re talking quite a few Gbps sustained. A hiccup would suck, and if you’re not careful, you could easily spend multiple millions of dollars per day in egress and data handling fees if you use a big cloud. Just use a handful of on-site commodity machines.
Frame sync. In order to reduce latency, these systems tend to be unbuffered, which means that the frames have to arrive at a very specific time, and you can't afford significant jitter or (worse) phase drift. If you have one source at 25.000FPS and one at 25.001FPS eventually you're going to be a frame out between them.
Let's do the math, conservatively. Suppose there's an event and the intent is to broadcast at 60 fps (which is on the high side) and that you want to be able to switch between cameras or even composite multiple camera feeds together without skipping frames or interpolating between frames. That gives a budget of 16.7ms per frame. (Hey, this is a lot like making a video game! Fortunately input latency is not such a big deal here because the viewers aren't playing.)
Suppose you give a budget of 4ms to composite the frame. Now you have 12.7ms from the end of the previous frame in which to collect the current frame from each camera and do whatever fancy processing you want to do (drawing first down lines, adding ads, etc). Of course, you can always cheat a bit by pipelining frames, but this adds latency, and maybe you would prefer to avoid that. Let's say you don't want to pipeline and you budget 8.7ms for all this fancy work, which gives you a 4ms window in which to receive all your incoming frames, which need to be in exact lockstep from all cameras. (This is very, very conservative, since, again, this is not a videogame and it's probably fine to buffer all inputs for a few tens or even hundreds of ms. I'm ignoring the time to transfer each frame -- I'm assuming we're counting from the end of the frame transfer time. If it takes a full frame to transfer a frame, then you cannot possibly avoid one frame of transfer latency anyway.)
So you need all those fancy cameras to stay in sync to plus or minus 4ms. That's a piece of cake with basically any modern technology, where "modern" means, I don't know, the last 30 years? NTP can do this. PTP can do this even with a fully software implementation and no assistance from the switches whatsoever. A cellphone can do this. A fancy GPSDO can do orders of magnitude better than this. A decent RTC will take a whole 200 seconds to drift by a problematic amount. The only actual fancy tech needed is the ability for the host controller on each camera to discipline the camera's frame clock, which I imagine any camera worth its salt can do.
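A quick sanity check of those numbers (same assumptions as above):

    # Quick sanity check of the numbers above (all approximate).
    frame_budget_ms = 1000 / 60      # ~16.7 ms per frame at 60 fps
    sync_window_ms = 4.0             # the +/- 4 ms receive window assumed above

    # How long a 20 ppm oscillator takes to drift by that window:
    drift_ppm = 20
    seconds_to_drift = (sync_window_ms / 1000) / (drift_ppm * 1e-6)   # 200 s

    # How long two free-running sources at 25.000 and 25.001 fps (the example a
    # few comments up) take to slip a whole frame relative to each other:
    slip_seconds = 1 / abs(25.001 - 25.000)                           # 1000 s

    print(f"frame budget at 60 fps: {frame_budget_ms:.1f} ms")
    print(f"20 ppm clock drifts {sync_window_ms} ms in ~{seconds_to_drift:.0f} s")
    print(f"25.000 vs 25.001 fps slip one frame in ~{slip_seconds:.0f} s")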
I don't see why a $30k clock is useful here, or why very fancy protocols are needed. I do see why there's a need to get everyone to agree on a protocol, though.
I did once watch an event where I was genuinely impressed by the synchronization, though: a parade at a theme park. There were hundreds or thousands of fixed speakers and hundreds of mobile speakers in the parade, and all of them stayed perfectly synchronized, playing parts of the same music, to within the precision of my ears. I'm guessing the design goal was better than 1ms synchronization error, over at least half an hour, across acres of space, in a potentially adverse RF environment (at least the ISM bands would have been horribly polluted by everyone's phones). And possibly the mobile speakers would even have needed to compensate for their own locations due to the speed of sound being kind of low and the actual parade speed possibly being a bit unpredictable.
If I were designing that, I might have used GPSDOs on each mobile element or possibly some kind of wireless clock distribution -- a 20ppm clock is not even close to good enough.
But event broadcasting doesn't have these problems per se because, anywhere there's a camera, there's already a reliable, high-bandwidth data link of some sort so the camera feed can get to where it's going in real time.
Surprisingly, the timing requirements for digital seem to be slightly looser than they were for analog, at least if I heard the engineer correctly on site. It was something like 1.5 microseconds in the old days, but can be like 10 microseconds now. I could be wrong there.
No, you are right. And it is because digital has a much wider 'lock' range than analog. Analog only works 'in the moment' whereas digital can take the history of the signal so far into account and so not lose lock. If it gets too extreme it will still happen though so cumulative problems will still show up only much later.
> Why do you need good clocks? For audio, even with simultaneously playing speakers, you only need to synchronize within a couple of ms unless you need coherence or are a serious audiophile. If you want to maintain sync for an hour, I suppose you need a decently good clock.
There are many microphones involved in a production, and humans are quite good at detecting desync between audio/video when watching a presenter talk. You cannot fix desynchronization further down the chain if the desynchronization is variable for each source.
You also need synchronization to mix sources (common in any production) without incurring the latency and resampling of asynchronous sample rate conversion.
As someone who's spent a lot of time in this space and is quite interested in lowering the cost of entry and finding ways to simplify it, I'm afraid you've vastly oversimplified the problem.
> sports fans certainly don't care about recompression quality loss...
I think that's quite an assumption. In a modern video chain you'd need to decompress and recompress the video from a camera many, many times on the way to distribution. Every filter or combining element would need onboard decoding and encoding, which would introduce significant latency, make it very difficult to maintain quality, and add even more energy requirements than the systems we already deploy.
High quality cameras aren't any good if they throw away their quality at the source before they have an opportunity to be mixed in with the rest of the contribution elements. You certainly wouldn't compress the camera feeds down to what you'd expect to see on a consumer video feed (about 20Mbps for 4K on HEVC).
> normal networked transport like H.265 with MPEG-TS over RTP
If you want to, you can do that already using SMPTE ST 2110-22 which loops in the RTP payload standards defined by the IETF. ST 2110 itself is already using RTP as its core protocol by the way (for everything).
> do time sync by having A/V sync
What do you mean by this? In order to synchronize multiple elements you need a common source of time. Having "good clocks" on each device is not enough: they need to be synchronized to the level that audio matches up correctly, which is much more precise than video, as audio uses sample frequencies in the 48 kHz-96 kHz range, whereas video of course is typically just 60 Hz. Each clock needs a way to _become_ good by aligning itself to some global standard. If you don't have a master clock like PTP, your options are... what... GPS? I mean you _could_ equip each device with its own GPS receiver, but if the cameras can't get a reliable GPS lock then you're out of luck.
> aligning based on audio loud enough to be recorded by all devices
Do you mean physically? Like actual audio being emitted into the space where the devices are? Because some of the devices will be in the stadium, where there are very, very loud noises on account of the crowd, and some of them will be in the backroom where that audio is not audible. Then you need to factor in the speed of sound, which is absolutely significant in a stadium or other large venue. None of this is particularly practical.
If you mean an audio signal sent to each device over a cable, well, are we talking SDI (copper)? If so, we wouldn't use audio for that, we would use what's called black burst. But what generates the black burst? Typically, it's the grandmaster clock. The black bursts on SDI need to be very precise, and that requires a dedicated piece of real-time hardware.
If you mean sending it over ethernet, you now need to ensure you factor in the routing delays that will inevitably happen over an open unplanned network. To deal with those delays, we typically do two things. One, we use automatically planned networks, where the routers are aware of the media flows going over each link, and the topology is automatically rearranged in order to minimize or eliminate router buffering (aka software defined networks, typically using NMOS IS standards to handle discovering and accounting for the media essences).
> they need to be synchronized to the level that audio matches up correctly, which is much more precise than video, as audio uses sample frequencies in the 48 kHz-96 kHz range, whereas video of course is typically just 60 Hz
Typically video equipment expects the individual pixels to line up, save for some buffering (~1–10µs), not just the individual frame. So your synchronization requirement for video is in the gigahertz range (or about megahertz, if you take the buffering into account), not 60 Hz. (Of course, what matters is normally the absolute offset, not the frequency, but they tend to be somewhat inversely related.)
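To put rough numbers on that (using the standard SMPTE raster totals, which include blanking): per-pixel alignment means nanoseconds, per-audio-sample alignment means tens of microseconds, and per-frame alignment means milliseconds.

    # Rough numbers behind the claim above: pixel clocks for common rasters
    # (total raster, including blanking) and the audio sample period, to show
    # the scale of "per pixel" vs "per audio sample" alignment.
    rasters = {
        "1080p60 (2200x1125 total raster)": (2200, 1125, 60),
        "2160p60 (4400x2250 total raster)": (4400, 2250, 60),
    }
    for name, (h_total, v_total, fps) in rasters.items():
        pixel_clock_hz = h_total * v_total * fps
        print(f"{name}: pixel clock {pixel_clock_hz / 1e6:.1f} MHz, "
              f"{1e9 / pixel_clock_hz:.2f} ns per pixel")

    for rate in (48_000, 96_000):
        print(f"{rate} Hz audio: {1e6 / rate:.1f} us per sample")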