Given the sheer size of Megatron 530B, training and deploying it into production aren’t easy feats, even for enterprises with massive resources. The model was originally trained across 560 Nvidia DGX A100 servers, each hosting 8 Nvidia A100 80GB GPUs. Microsoft and Nvidia say that they observed between 113 and 126 teraflops per GPU while training Megatron 530B, which would put the training cost in the millions of dollars. (A teraflop is one trillion floating-point operations per second, a common measure of the performance of hardware including GPUs.)
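To put those figures in perspective, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not a number reported by Microsoft or Nvidia: it assumes every GPU sustains the same rate and simply multiplies the per-GPU figures by the GPU count. The function and variable names are illustrative.

```python
# Figures quoted in the article.
SERVERS = 560
GPUS_PER_SERVER = 8
TFLOPS_PER_GPU_RANGE = (113, 126)  # observed low and high, per GPU


def cluster_throughput_pflops(servers, gpus_per_server, tflops_per_gpu):
    """Aggregate throughput in petaflops.

    Assumption: every GPU sustains the same rate, so the total is a
    straight multiplication (1 petaflop = 1,000 teraflops).
    """
    total_gpus = servers * gpus_per_server
    return total_gpus * tflops_per_gpu / 1000


total_gpus = SERVERS * GPUS_PER_SERVER
low, high = (
    cluster_throughput_pflops(SERVERS, GPUS_PER_SERVER, rate)
    for rate in TFLOPS_PER_GPU_RANGE
)
print(f"{total_gpus} GPUs, roughly {low:.0f} to {high:.0f} petaflops aggregate")
```

Under that assumption, the 4,480 GPUs would deliver somewhere between about 506 and 564 petaflops in aggregate during training.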
Nvidia is pitching its DGX SuperPOD as the preferred solution. A line of servers and workstations, SuperPODs are preconfigured DGX A100 systems built using A100 GPUs and Nvidia Mellanox InfiniBand for the compute and storage fabric.
But a single SuperPOD can cost anywhere from $7 million to $60 million, depending on the size of deployment. (A single DGX A100 starts at $199,000.) Nvidia’s SuperPOD subscription service is substantially cheaper: a SuperPOD runs $90,000 per month. However, considering that Megatron 530B was trained on Nvidia’s Selene supercomputer, which comprises four SuperPODs with 560 DGX A100 servers, the expense is beyond what most companies can afford.
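For a rough sense of how the subscription compares with buying outright, the short Python sketch below divides the quoted purchase prices by the quoted monthly fee. It is a deliberately naive comparison: it assumes the subscription and the purchase cover a comparable configuration, which the quoted figures do not confirm, and it ignores power, staffing, financing and depreciation. The names used are illustrative.

```python
# Prices quoted in the article, in U.S. dollars.
SUBSCRIPTION_PER_MONTH = 90_000
PURCHASE_RANGE = (7_000_000, 60_000_000)  # low-end and high-end SuperPOD


def breakeven_months(purchase_price, monthly_fee):
    """Months of subscription fees that add up to the purchase price.

    Assumption: both options cover comparable hardware and no other
    costs (power, staff, financing) are counted.
    """
    return purchase_price / monthly_fee


for price in PURCHASE_RANGE:
    months = breakeven_months(price, SUBSCRIPTION_PER_MONTH)
    print(f"${price:,}: {months:.0f} months (~{months / 12:.1f} years)")
```

On these assumptions, subscription fees would take about six and a half years to match the price of the cheapest SuperPOD, and far longer for the largest configurations.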
Even tech giants like Google parent company Alphabet have run up against budget constraints while training AI models. When Google subsidiary DeepMind’s researchers designed a model to play StarCraft II, they purposefully didn’t try multiple ways of architecting a key component because the training costs would’ve been too high. Similarly, OpenAI didn’t fix a mistake when it implemented GPT-3 — a language model with less than half as many parameters as Megatron 530B — because the cost of training made retraining the model infeasible.
Still, in a recent interview with Next Platform, Catanzaro says he thinks that it’s entirely possible a company will invest a billion dollars in compute time to train a model within the next five years. A University of Massachusetts Amherst study showed that using 2019-era approaches, training an image recognition model with a 5% error rate would cost $100 billion.
While no enterprise has yet come close, DeepMind reportedly set aside $35 million to train an AI system to learn Go. OpenAI is estimated to have spent $4.6 million to $12 million training GPT-3. And AI21 Labs, which developed a language model roughly the size of GPT-3, raised $34.5 million in venture capital before launching its commercial service.