You have an NVR (Network Video Recorder) that can manage 60-70 IP cameras. It works great with local disks, and you have methods to manage more than one of them, so it's actually possible to manage some hundreds of cameras in a single environment. But then the requirements grow:
- Recording 4K instead of HD
- Increasing retention from 30 days to 90
- Managing more than 10,000 cameras
- Missing absolutely no video, no matter what
Done the normal way, these numbers require about a thousand of those NVRs. That is no small challenge, and even if it were possible to manage all of them in a single environment, it still wouldn't work, because of the reality of large environments: everything fails all the time. So the only option is to embrace that failure and design a solution that can withstand anything!
The NVR mentioned above is, like most such appliances, actually hardware plus software running on top of it. Our basic idea was that if we could separate the two and run the software virtualized, that would solve all the problems at once. Or, in fact, transform them into something else: running a converged container cluster. We encourage you to go ahead and read the specifics of that problem and our approach. But assuming it is dealt with, our problem now is to modify the software to run on that cluster.
Adaptation to the container world
As any experienced systems engineer will tell you, no cloud is all unicorns and rainbows, no matter who runs it. A cloud is a flexible way to provision resources at great speed. But this flexibility cuts both ways: those resources can break and go away as fast as they are provisioned. It's simply the nature of clouds for applications to:
- be killed for some reason
- get locked-up
- lose the host machine
- lose network connectivity
- have IO stalls
- slow down
or encounter any other transient, unexpected error. Applications need to be aware of these possibilities and handle failures gracefully whenever they can.
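Most of these failures are transient, and a common way to handle them gracefully is to retry with exponential backoff. Here is a minimal sketch of that pattern; the function name and parameters are illustrative, not taken from the actual system:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry a flaky operation, doubling the wait after each failure.

    `operation` is any zero-argument callable that raises on transient
    errors. Names here are illustrative assumptions, not the real API.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except (IOError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Exponential backoff with jitter, so that many instances
            # retrying at once don't hammer the recovering resource.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

Jitter matters at scale: without it, thousands of recorder instances would all retry in lockstep after a shared outage.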
For example, when a storage hiccup makes it impossible to write to the drive for a while, one of the classic remedies is to let the writer application (i.e., the video recorder) wait it out while accumulating the incoming data in memory.
Needless to say, this is easier said than done. To say the least, we had to experiment with some arcane garbage collectors to make it happen, even after designing and developing the necessary parts. And that was just one of the adaptations.
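The wait-and-buffer idea above can be sketched as a writer that holds chunks in memory while the disk is stalled and drains them once writes succeed again. This is a simplified illustration under our own naming; the real recorder also has to bound memory carefully and cooperate with the garbage collector:

```python
from collections import deque


class BufferingWriter:
    """Accumulate incoming chunks in memory while the disk is stalled.

    Sketch only: `sink` stands in for the actual disk write path and may
    raise OSError during a storage hiccup.
    """

    def __init__(self, sink, max_buffered_bytes=256 * 1024 * 1024):
        self.sink = sink
        self.buffer = deque()          # chunks waiting for the disk
        self.buffered_bytes = 0
        self.max_buffered_bytes = max_buffered_bytes

    def write(self, chunk):
        self.buffer.append(chunk)
        self.buffered_bytes += len(chunk)
        if self.buffered_bytes > self.max_buffered_bytes:
            # The stall outlasted our memory budget; now it's a real fault.
            raise MemoryError("storage stall outlasted the in-memory buffer")
        self.flush()

    def flush(self):
        # Drain as much as the disk will take; keep the rest for later.
        while self.buffer:
            try:
                self.sink(self.buffer[0])
            except OSError:
                return  # disk still stalled: hold data, retry on next write
            self.buffered_bytes -= len(self.buffer.popleft())
```

Note that data is only removed from the buffer after the write succeeds, so a stall in the middle of a flush loses nothing.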
Cluster-wide fault tolerance
Not all faults can be handled, though. If a node has a hardware problem or the network goes down, it doesn't matter how well you designed your application: it stops working. To handle these types of problems, the entire system should be split into failure domains. Simply put, a failure domain captures our assumption about which things will fail independently.
For example, if we assume that two nodes cannot fail at the same time, then each node is a failure domain. The next step is to run the same application in two different failure domains, i.e. on two nodes, so that even if one fails, the other continues to work. Easy, right?
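The placement rule, no two instances of the same application in the same failure domain, can be sketched as a simple round-robin assignment. This is our own illustrative sketch, not the scheduler the system actually uses:

```python
def place_replicas(streams, failure_domains, replicas=2):
    """Assign each stream's recorder instances to distinct failure domains.

    Round-robin sketch: every stream gets `replicas` instances, each in a
    different domain, so a single domain failure never loses a stream.
    """
    if replicas > len(failure_domains):
        raise ValueError("need at least as many failure domains as replicas")
    placement = {}
    for i, stream in enumerate(streams):
        # Pick `replicas` consecutive domains, wrapping around the list,
        # so instances of the same stream never share a failure domain.
        placement[stream] = [
            failure_domains[(i + r) % len(failure_domains)]
            for r in range(replicas)
        ]
    return placement
```

In container orchestrators this idea usually appears as anti-affinity or topology spread rules rather than hand-rolled placement code.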
For some applications, that's exactly the case. For a video stream recording application, though, it is a bit more involved: we need very high data throughput for each stream, as well as huge amounts of capacity to store them. Adding new instances of the same application and hoping everything will load-balance is not a real option.
So what we did was develop a stripped-down version of the application, called the replica NVR, which:
- records the same streams as the original counterpart
- runs in a different failure domain
- stores the contents on a local drive
- supplies the missing periods back to the original NVR when necessary
Since local drives don't have much capacity and are not fault-tolerant to begin with, this replica only holds the last few hours of streams, a window within which the original NVR can restart and pull them back into long-term storage. That way, the last requirement, not missing any data, is met in a realistic and more cost-efficient way.
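"Supplying the missing periods" boils down to interval arithmetic: the restarted NVR compares what it actually stored against the time window it should have covered, and asks the replica for the gaps. A minimal sketch, with our own illustrative names, of that gap computation:

```python
def missing_periods(recorded, window_start, window_end):
    """Compute the gaps in the primary NVR's recording inside a time window.

    `recorded` is a list of (start, end) intervals the primary actually
    stored; the returned gaps are what the replica NVR would be asked to
    supply. Sketch only; the real system works on actual stream data.
    """
    gaps = []
    cursor = window_start
    for start, end in sorted(recorded):
        if start > cursor:
            # Nothing was stored between `cursor` and `start`.
            gaps.append((cursor, min(start, window_end)))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        # The outage ran to the end of the window.
        gaps.append((cursor, window_end))
    return gaps
```

For example, a primary that recorded `(0, 10)` and `(15, 20)` inside a `0..30` window would request the periods `(10, 15)` and `(20, 30)` from its replica.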
We now have a solution with which we can:
- run any number of virtual NVRs, from a few to thousands
- use any amount of capacity required, totalling hundreds of petabytes
- automatically assign as many encoders as are installed
- increase or decrease the number of nodes, or any other resource, whenever necessary
- be sure we won't miss a second of content, even after a number of outages
In effect, this is one of the largest and most cost-effective network video recording solutions ever deployed. And thanks to all the technologies involved, it's now just a matter of deciding how many cameras you need.