Deep learning is extremely resource-intensive. Microsoft now describes how such workloads can be scaled globally across hundreds of thousands of GPUs.
According to a recent research paper, Microsoft operates a globally distributed service for scheduling machine learning tasks. The system is called Singularity, and the paper lists 26 authors, including Mark Russinovich, CTO of Microsoft's cloud division Azure. The primary goal is to cut costs.
The fundamental problem Microsoft wants to solve is that training deep learning systems requires ever-larger clusters, which are extremely expensive. To make the investment pay off, these systems should run at full capacity, for example by distributing all tasks within the Azure cloud onto whatever resources are currently free.
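The idea of packing jobs onto currently free capacity can be illustrated with a minimal scheduler sketch in Python. The `Region` class and the least-loaded placement policy here are illustrative assumptions, not Microsoft's actual design:

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A pool of accelerators in one (hypothetical) cloud region."""
    name: str
    total_gpus: int
    used_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus

def place_job(regions, gpus_needed):
    """Greedy placement: pick the region with the most free GPUs.

    Returns the chosen region's name, or None if nothing fits.
    """
    candidates = [r for r in regions if r.free_gpus >= gpus_needed]
    if not candidates:
        return None
    best = max(candidates, key=lambda r: r.free_gpus)
    best.used_gpus += gpus_needed
    return best.name

regions = [Region("eu-west", 512, used_gpus=400),
           Region("us-east", 1024, used_gpus=200)]
print(place_job(regions, 256))  # prints "us-east" (most headroom)
```

A real global scheduler would of course also weigh data locality, network topology, and preemption cost; the sketch only shows the capacity-matching core of the idea.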
Microsoft writes: "Singularity pursues a key objective: reducing the cost of AI by maximizing the total useful throughput on a given fixed pool of accelerators on a global scale". On the size of the overall system, Microsoft adds: "Singularity is designed from the ground up to be scalable across a global fleet of hundreds of thousands of GPUs and other AI accelerators".
This is achieved primarily via two mechanisms: sophisticated preemption with subsequent migration on the one hand, and elasticity on the other. For the former, Singularity takes a complete snapshot of a job's memory state, which can then be transferred to another node and resumed immediately. In addition, the team relies on a technique the paper calls replica splicing, which lets jobs shrink and grow elastically and thus run on a variable number of accelerators.
For evaluation, Microsoft uses Nvidia DGX-2 servers connected via InfiniBand. Each server has two Xeon Platinum 8168 CPUs with 20 cores each, 692 GB of RAM, and eight V100 GPUs.