"Because we wanted to continuously improve our invertible representation of chemical space, we needed a platform that would enable rapid experimentation along with ease of management," said John Parkhill, director of machine learning at Terray. "DGX Cloud offered us a solution that worked seamlessly with the ease and simplicity of cloud. Its high-speed network, purpose-built for multi-node training, was particularly crucial for our needs. Because we’re dealing with datasets of terabytes or larger, we require significant computational resources to train our models effectively."
"Additionally, the ability to rapidly conduct trial-and-error experiments is highly valuable in our model development research, as identifying the most effective hyperparameters is often a challenging task. Fast job execution on DGX Cloud enabled us to quickly identify failures and make the necessary adjustments to the models. For instance, I could perform numerous ablation studies, such as disabling model features, to determine if, for example, altering elements of the transformer’s tokenizer is impactful or inconsequential," said Williams.
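As a rough illustration of the kind of ablation study Williams describes, the minimal Python sketch below generates one run configuration per disabled feature. The `ModelConfig` class and its feature flags are hypothetical names chosen for illustration; they are not Terray's actual model settings or code.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical feature switches, for illustration only.
    use_positional_encoding: bool = True
    use_subword_tokenizer: bool = True
    use_dropout: bool = True


def ablation_configs(baseline):
    """Yield (feature_name, config) pairs, each with exactly one enabled feature turned off."""
    for f in fields(baseline):
        if getattr(baseline, f.name) is True:
            yield f.name, replace(baseline, **{f.name: False})


baseline = ModelConfig()
runs = dict(ablation_configs(baseline))

for name, cfg in runs.items():
    # Each entry would be submitted as a separate training job and its
    # validation metric compared against the baseline run.
    print(f"ablate {name}: {cfg}")
```

Comparing each variant against the baseline shows whether a given feature, such as a tokenizer setting, is impactful or inconsequential. Fast job turnaround matters here because the number of runs grows with the number of features tested.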
"Our process for setting up training jobs went from the hassle of manually pushing code to remote machines and ensuring synchronization to the simplicity of pressing ‘run’ on DGX Cloud. We didn't even have to modify our existing code by much. With the Base Command Platform, the orchestration of multi-node training jobs was essentially automated for us. This enabled us to scale in a way that would have been impossible."
Having a fixed allocation of nodes on DGX Cloud also created greater efficiencies. "It's a very miserable experience constantly asking for GPU instances from traditional cloud services that they seem to be unable to make available. If I need a new node for an experiment I'm working on, I would not know if and when I would be able to get one. With DGX Cloud, I don't need to worry about that," said Williams.
"As a data scientist, my boundary is no longer a small GPU workstation; it's the entire cloud capacity of Terray. DGX Cloud with Base Command Platform lets me go from a single node to a 32-GPU cluster with push-button simplicity," Parkhill added. "DGX Cloud gives us the level of abstraction our developers need so they can focus on innovation instead of infrastructure."
Terray takes a hybrid approach: it trains and builds its models on DGX Cloud, then deploys them and runs inference on its on-premises cluster with NVIDIA RTX™ A6000 GPUs. When workloads spike, DGX Cloud provides the elasticity to absorb the extra demand.
"NVIDIA AI experts were key to our success," Williams said. "We had a dedicated expert inspecting our logs to ensure everything ran smoothly and identifying any issues. By identifying straightforward optimizations in PyTorch and CUDA® that we hadn't thought of, they significantly improved the efficiency of our workloads. Additionally, they assisted in developing scripts that provided valuable insights into telemetry data, allowing us to monitor memory activity and enhance performance. The support from NVIDIA AI experts allowed us to shift our focus from optimizing the process to conducting experiments, as this is primarily an R&D project."