"The infrastructural and architectural shifts triggered by the development of AI-based workflows will have significant implications for the colocation market".
Recent advances in artificial intelligence (AI) could have a significant impact on a wide range of industries. Demand for reliable and secure cloud services capable of supporting the required training and inference workloads depends directly on how widely AI-based apps and processes are adopted. This article explains the most important impacts that these developments may have on data-center infrastructure.
New server hardware will be required for AI workloads
Faster interconnections
Artificial intelligence workloads are often associated with multi-node computing, where distributed systems interact to perform complex calculations. Such distributed systems require high-bandwidth, low-latency communication links to minimize the bottleneck created by data exchange between computing nodes. For a given CPU/GPU/TPU system architecture, this friction can be reduced in the following ways:
- by deploying networks based on standards such as InfiniBand, which are equipped with dedicated high-speed fiber-optic connections (inter-rack);
- by increasing the density of memory and computing chipsets on server boards (intra-board and intra-rack).
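The effect of link bandwidth and latency on multi-node work can be illustrated with a rough estimate. The sketch below uses the standard cost model for a ring all-reduce, a common way of combining gradients across nodes; the article does not name a specific communication pattern, so this choice is an assumption, as are the payload size, node count, and link figures.

```python
def ring_allreduce_seconds(payload_bytes: float, nodes: int,
                           link_gbit_per_s: float, latency_us: float) -> float:
    """Estimate the time for one ring all-reduce across `nodes` workers.

    Cost model: each node transfers 2*(N-1)/N of the payload, spread over
    2*(N-1) steps that each pay one link latency.
    """
    if nodes < 2:
        return 0.0  # a single node has nothing to exchange
    bytes_per_s = link_gbit_per_s * 1e9 / 8
    steps = 2 * (nodes - 1)
    transfer = (steps / nodes) * payload_bytes / bytes_per_s
    return transfer + steps * latency_us * 1e-6


if __name__ == "__main__":
    gradients = 10e9  # hypothetical: 10 GB exchanged per training step
    # Hypothetical links: (label, bandwidth in Gbit/s, latency in microseconds)
    for name, gbit, lat in [("slower link", 25, 50.0), ("faster link", 400, 2.0)]:
        t = ring_allreduce_seconds(gradients, nodes=16,
                                   link_gbit_per_s=gbit, latency_us=lat)
        print(f"{name}: {t:.2f} s per all-reduce")
```

With these illustrative numbers the exchange takes roughly 6 s on the slower link and under 0.4 s on the faster one, which is why the interconnect can dominate step time when nodes must synchronize frequently.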
Specialized processors for artificial intelligence
Training large artificial intelligence models requires a great deal of computing power. Recent large language models have about 100 billion parameters and need about 1,000 petaflop/s-days of compute to train (one petaflop/s-day is 10^15 operations per second sustained for one day).
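To put the training figure in perspective, the short sketch below converts petaflop/s-days into total operations and into wall-clock days for a cluster. The cluster size and the sustained per-accelerator throughput are hypothetical values chosen only for illustration; real sustained throughput depends heavily on the model and software stack.

```python
SECONDS_PER_DAY = 86_400


def pfs_days_to_flop(pfs_days: float) -> float:
    """Convert petaflop/s-days to a total number of floating-point operations."""
    return pfs_days * 1e15 * SECONDS_PER_DAY


def training_days(pfs_days: float, accelerators: int,
                  sustained_tflops_each: float) -> float:
    """Days needed if every accelerator sustains the given throughput."""
    cluster_pflops = accelerators * sustained_tflops_each / 1000
    return pfs_days / cluster_pflops


if __name__ == "__main__":
    budget = 1000  # petaflop/s-days, the figure quoted in the text
    print(f"total work: {pfs_days_to_flop(budget):.2e} FLOP")
    # Hypothetical cluster: 1,024 accelerators at 100 TFLOP/s sustained each.
    print(f"training time: {training_days(budget, 1024, 100.0):.1f} days")
```

Under these assumptions the run takes just under 10 days; halving the cluster size doubles that time.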
Training and inference workloads for artificial intelligence models consist largely of matrix and tensor computations. Traditional CPUs are not very efficient at these calculations, so specially designed processors are used to reduce the time and resources such workloads need. Graphics processing units (GPUs) were the first to be adopted because of their ability to speed up artificial intelligence workflows. GPUs were originally designed for rendering 3D graphics, but they can also perform "general-purpose" calculations. AI-dedicated application-specific integrated circuits (ASICs), including tensor processing units (TPUs), were later created to increase speed further. All of these processors are well suited to AI-related computation, but they are expensive and consume a lot of electrical power. Danseb Consulting (a data-center specialist) notes that relationships with chip developers will be necessary. A recently interviewed colocation provider states: “to support AI, you need to have a partnership with NVIDIA”.
Higher computing power density
The combination of high processor density on server boards and in racks with powerful GPUs and AI-focused processors raises the required rack power density to a new level. A rack of GPUs can consume up to 50 kW [1], far above the current average of ~10 kW per rack [2]. While ASICs (such as TPUs) are being developed to be more power efficient than GPUs, data-center operators must be prepared for increased power density, which has practical implications. Danseb Consulting believes that “the compute required for AI creates a significant opportunity for the data-center industry”, as evidenced by the recent growth in demand for racks of more than 30 kW for AI applications.
This architectural shift will significantly impact data-center infrastructure
Modernization of power distribution
As rack power densities increase, data-center infrastructure must deliver more power; otherwise a facility can run out of power capacity while its data halls are still only half full. Data-center operators are therefore advised to:
- Contact the relevant utility companies in order to discuss the increased requirements for the electrical power grid;
- Carry out improvements to power distribution systems (including uninterruptible power supply (UPS) systems, back-up generators, transformers, power conditioners).
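The way a fixed power budget limits how much of a data hall can be filled can be sketched with simple arithmetic. The hall size and power budget below are hypothetical, chosen so that the hall is exactly full at the ~10 kW average density mentioned earlier.

```python
def racks_supported(hall_power_kw: float, rack_kw: float, rack_positions: int):
    """Return (racks that can be powered, fraction of floor positions used)."""
    by_power = int(hall_power_kw // rack_kw)  # racks the power budget allows
    usable = min(by_power, rack_positions)    # cannot exceed the floor space
    return usable, usable / rack_positions


if __name__ == "__main__":
    # Hypothetical hall: 2,000 kW of IT power and 200 rack positions.
    for rack_kw in (10, 30, 50):
        racks, fill = racks_supported(2000, rack_kw, 200)
        print(f"{rack_kw} kW racks: {racks} racks powered, "
              f"{fill:.0%} of floor used")
```

In this example, 50 kW racks exhaust the power budget after 40 racks, leaving 80% of the floor empty unless the power distribution is upgraded.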
Modernization of cooling technology
The increased rack power density will increase the amount of heat that must be dissipated. Today, most data centers use traditional air-cooling systems that can support a power density of up to 20 kW per rack [3]. As rack density exceeds this limit, data centers will likely need to upgrade their cooling systems, for example by improving their cold sources (chillers or cooling towers). Upgraded heat exchangers can also raise cooling-system efficiency by optimizing the heat transfer between hot and cold sources.
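One way to see why air cooling runs into limits is to estimate the airflow a rack needs. The sketch below applies the sensible-heat balance P = ρ · V · c_p · ΔT, assuming that essentially all electrical power drawn by the rack ends up as heat. The air properties are approximate textbook values, and the 12 K temperature rise across the rack is an assumption for illustration.

```python
AIR_DENSITY = 1.2         # kg/m^3, approximate for air near room temperature
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K), approximate for dry air


def airflow_m3_per_s(rack_kw: float, delta_t_k: float) -> float:
    """Volumetric airflow needed to carry away rack_kw of heat.

    Solves P = rho * V * cp * dT for V, with P converted from kW to W.
    """
    return rack_kw * 1000 / (AIR_DENSITY * AIR_SPECIFIC_HEAT * delta_t_k)


if __name__ == "__main__":
    for rack_kw in (10, 20, 50):
        # The 12 K inlet-to-outlet temperature rise is an assumed value.
        flow = airflow_m3_per_s(rack_kw, delta_t_k=12.0)
        print(f"{rack_kw} kW rack: {flow:.2f} m^3/s of air")
```

Required airflow scales linearly with rack power, so a 20 kW rack already needs about 1.4 m³/s under these assumptions, and a 50 kW rack needs two and a half times that through the same footprint.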
Data-center operators may also consider emerging technologies such as:
- Immersion cooling, where servers are immersed in a dielectric fluid, eliminating the need for air conditioning infrastructure;
- Direct-to-chip liquid cooling, which involves running liquid coolant through cold plates (often with internal microchannels) mounted directly on the processors, removing heat at its source.
Liquid-cooled systems are considered capable of supporting a power density of up to 100 kW per rack [5]. Alibaba estimates that immersion cooling can reduce power consumption by 36% compared to air-cooled equipment operating at a power usage effectiveness (PUE) of 1.5.
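The density limits quoted in this section can be turned into a simple planning check, shown here as a sketch. The thresholds are the per-rack figures from the text; the function names and the "beyond quoted limits" label are hypothetical. The second helper applies the definition of PUE (total facility power divided by IT power) to show what a PUE of 1.5 means in practice.

```python
AIR_LIMIT_KW = 20      # per-rack limit quoted for traditional air cooling
LIQUID_LIMIT_KW = 100  # per-rack limit quoted for liquid-cooled systems


def cooling_option(rack_kw: float) -> str:
    """Pick the simplest cooling approach whose quoted limit covers the rack."""
    if rack_kw <= AIR_LIMIT_KW:
        return "air"
    if rack_kw <= LIQUID_LIMIT_KW:
        return "liquid"
    return "beyond quoted limits"


def facility_power_kw(it_load_kw: float, pue: float) -> float:
    """Total facility power, from PUE = total facility power / IT power."""
    return it_load_kw * pue


if __name__ == "__main__":
    for rack_kw in (10, 30, 50, 120):
        print(f"{rack_kw} kW rack -> {cooling_option(rack_kw)}")
    # At a PUE of 1.5, 1,000 kW of IT load draws 1,500 kW at the facility level.
    print(facility_power_kw(1000, 1.5))
```

A real selection would also weigh retrofit cost, floor loading, and available water or coolant loops, none of which are modeled here.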
The implications for the colocation market are significant
These architectural and infrastructural changes have a major impact on the colocation market. We'll take a closer look at these changes below:
- Implementing advanced cooling systems requires careful planning, especially for facilities that are already in operation, as well as significant investment (mainly for hyperscale facility upgrades);
- Retrofitting existing data centers to meet the needs of AI workloads can be difficult and laborious, particularly in areas or facilities with limited space (for additional cooling and power conditioning) and limited electricity availability (as local substations may already be operating at full capacity);
- New use cases will require low-latency processing of artificial intelligence output, and some of them may also benefit from using edge facilities;
- Dark fiber interconnections within and between data centers will provide users with the ability to efficiently connect their artificial intelligence workflows and computing nodes using the selected protocol.
As a result, rapid improvements in artificial intelligence (AI) may have a positive impact on the development of many industries.