BIP ATL News & Media Platform


OpenAI-led consortium seeks to address AI processing bottlenecks

May 10, 2026  Twila Rosenbaum

An OpenAI-led consortium of tech giants has unveiled a new networking protocol designed to address network congestion, a persistent problem that has become critical with the massive data demands of AI processing. The group includes AMD, Broadcom, Intel, Microsoft, and Nvidia.

The new protocol, called Multipath Reliable Connection (MRC), is specifically built for training models on clusters with 100,000+ GPUs. Rather than forcing data down a few lanes that can become easily congested, MRC distributes traffic across hundreds of network paths simultaneously, ensuring smoother and faster data transfer.
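The benefit of spreading traffic widely can be shown with a toy simulation (illustrative Python only, not MRC's actual algorithm): spraying the same volume of traffic over hundreds of paths keeps the busiest path, a rough proxy for congestion, far cooler than funneling it down a few lanes.

```python
import random

def busiest_path_load(num_chunks, num_paths, seed=42):
    """Spray data chunks uniformly at random across paths and
    return the load on the busiest path (a congestion proxy)."""
    rng = random.Random(seed)
    loads = [0] * num_paths
    for _ in range(num_chunks):
        loads[rng.randrange(num_paths)] += 1
    return max(loads)

# Same total traffic, different path counts.
few_lanes = busiest_path_load(100_000, 4)     # a few easily congested lanes
many_paths = busiest_path_load(100_000, 256)  # traffic sprayed across many paths
print(few_lanes, many_paths)
```

With four lanes the hottest path carries roughly a quarter of all traffic; spread over 256 paths, no single path ever comes close to that load.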

“Network congestion, link, and device failures are the most common sources of delay and jitter in transfers,” OpenAI wrote in a blog post announcing the project. “These problems get more frequent, and harder to solve, as the size of the cluster increases.”

The company highlighted that a single failure could cause a training job to crash, forcing a restart from a saved checkpoint, or stall progress for many seconds while the network recomputed routes. Such interruptions are costly in both GPU cycles and time. “The larger the job we run, the greater the impact of any single link flap or failure. These workloads act as a form of ‘failure amplifier,’ so preventing this has become critical,” the company said.
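A back-of-the-envelope estimate shows why this matters at scale. The cluster size, checkpoint interval, and restart time below are hypothetical figures chosen for illustration, not numbers from OpenAI.

```python
def gpu_hours_lost(num_gpus, checkpoint_interval_min, restart_min, failures):
    """Estimate GPU-hours wasted by failures: on average, half a
    checkpoint interval of work is lost per failure, plus the restart
    time, and every GPU in the job idles for that duration."""
    wasted_min = (checkpoint_interval_min / 2 + restart_min) * failures
    return num_gpus * wasted_min / 60

# Hypothetical: 100,000-GPU job, 30-minute checkpoint interval,
# 10-minute restart, one failure.
print(gpu_hours_lost(100_000, 30, 10, 1))  # ~41,667 GPU-hours lost
```

Even one failure per day on a job this size wastes tens of thousands of GPU-hours, which is why the article calls large jobs a "failure amplifier."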

OpenAI led the development of the protocol, with significant technical contributions from each consortium partner. The project is hosted and coordinated by the Open Compute Project (OCP). Nvidia is integrating its Spectrum-X Ethernet as a key component of MRC, and says it is running MRC in production at some of the world’s largest AI training clusters, including OpenAI’s, for training frontier LLMs such as ChatGPT and Codex. Spectrum-X is also used in Microsoft’s Fairwater and Oracle Cloud Infrastructure’s Abilene data centers, two of the largest AI factories purpose-built for training and deploying leading-edge frontier LLMs.

MRC maximizes GPU utilization by load-balancing traffic across all available paths, dynamically steering flows away from overloaded paths in real time. Conventional network fabrics can take seconds or even tens of seconds to stabilize after a failure, according to OpenAI, whereas MRC maintains high GPU utilization through slowdowns, congestion, failures, and other disruptive events. Administrators also gain fine-grained visibility and control over traffic paths, with network monitoring consolidated in a single pane of glass.
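The dynamic steering described above can be sketched as a least-loaded path selector. This is a heavy simplification; real fabrics make such decisions in hardware using live telemetry, but the principle is the same.

```python
def steer(path_loads):
    """Return the index of the least-loaded path, steering new
    traffic away from congested or failed paths."""
    return min(range(len(path_loads)), key=path_loads.__getitem__)

loads = [40, 12, 99, 7]       # current utilization per path
assert steer(loads) == 3      # new traffic goes to the coolest path
loads[3] = float("inf")       # mark path 3 as failed
assert steer(loads) == 1      # traffic is rerouted immediately
```

Because the decision depends only on current load, a failed path (infinite cost) is avoided on the very next steering decision rather than after a lengthy route recomputation.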

OpenAI says MRC’s multi-plane network design can connect more than 100,000 GPUs using only two tiers of Ethernet switches, rather than the three or four tiers currently required by standard 800 Gb/s networks. The MRC specification was published today through the Open Compute Project, along with an accompanying research paper.
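The two-tier claim can be sanity-checked with generic leaf-spine arithmetic. The calculation below assumes radix-R switches with ports split evenly between hosts and spine uplinks; MRC's actual multi-plane design may differ, so treat this as a rough plausibility check rather than the published topology.

```python
def two_tier_capacity(radix):
    """Max endpoints in a non-blocking two-tier leaf-spine fabric:
    up to `radix` leaves, each using radix/2 ports for hosts and
    radix/2 ports for spine uplinks."""
    return radix * (radix // 2)

# High-radix switches let two tiers exceed 100,000 endpoints;
# lower-radix switches would need a third (or fourth) tier.
print(two_tier_capacity(512))  # 131072
print(two_tier_capacity(128))  # 8192
```

The quadratic growth in radix is why flatter fabrics hinge on high-radix switches: halving the radix cuts two-tier capacity by a factor of four.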


Source: Network World News


