🖥️ Data Centres — AI Infrastructure

AI-Ready Data Centre Design: 100 kW+ Rack Density Infrastructure

Designing for 100 kW per rack is not an incremental upgrade to a conventional data centre — it is a ground-up reimagining of every infrastructure system. Power density at this level breaks the assumptions behind standard floor loading, electrical distribution, cooling architecture, structural design, and fire protection. This is the engineering framework for getting it right.

📅 Jun 2025 ⏱ 18 min read ✍️ KVRM Engineering Team 📐 TIA-942 / ASHRAE TC 9.9 / IEC 60364

The data centre industry spent twenty years optimising for 5–10 kW per rack. The hyperscale era pushed that to 15–20 kW. Now, a single rack of NVIDIA H100 or H200 GPU servers — four servers, 8 GPUs each — draws 60–80 kW. A fully configured DGX H100 pod across two racks hits 120 kW. GB200 NVL72 rack-scale systems from NVIDIA are specified at 120 kW per rack with future roadmap items targeting 160 kW and beyond.

Every infrastructure system in a conventional data centre was designed around a fundamentally different energy density assumption. Floor slabs, structural bays, electrical busways, UPS sizing, cooling plant, fire suppression zones — all of it was specified for a world where 20 kW per rack was considered high-density. Designing an AI-ready facility for 100 kW+ is not an upgrade. It is a new building type that happens to share a name with its predecessor.

This article provides the MEP engineering framework for that new building type — covering every system that must be rethought when rack density increases fivefold.

Understanding the Density Generations

Before designing for a target density, it is essential to understand what density the facility will actually see across its operational life — not just at initial fit-out, but at the 3-year and 5-year hardware refresh cycles.

Generation | Typical Hardware | kW per Rack | Cooling Method | Power Density (kW/m²)
--- | --- | --- | --- | ---
Legacy (pre-2015) | 1U servers, spinning disk | 3–8 | Air — CRAC / raised floor | 5–12
Standard (2015–2020) | 2U servers, flash storage | 8–20 | Air — CRAH / hot-aisle containment | 12–30
High Density (2020–2023) | GPU servers (A100, H100 air) | 20–50 | Air + rear-door HX / in-row | 30–75
AI Dense (2023–2025) | H100 SXM, H200, DGX systems | 50–100 | DLC cold plate mandatory | 75–150
Next-Gen AI (2025+) | GB200 NVL72, Blackwell | 100–160 | Full liquid cooling / immersion | 150–240

Design for the hardware refresh, not the day-one fit-out: A facility designed for 50 kW/rack today will receive 100 kW/rack hardware at its first major refresh — typically 2–3 years after commissioning. Infrastructure that cannot accommodate that refresh will require expensive modification or will constrain the client’s AI capability at exactly the moment they need to scale. Design to the 5-year density horizon, not the immediate procurement list.

Structural Design: Floor, Column Grid, and Overhead Loads

Structural design is the constraint that is hardest to fix after construction. Every other system can be upgraded or supplemented — a floor slab that cannot take the load cannot be reinforced without taking the hall offline. Structural requirements must be established and resolved before any other design work proceeds.

Floor Loading

// Floor loading comparison — conventional vs AI-dense

// Standard data centre rack (20 kW air-cooled)
Rack weight      : ~300 kg  (server + rack + PDU)
Footprint        : 0.6 m × 1.0 m = 0.6 m²
Equivalent UDL   : ~500 kg/m²  (industry-standard slab rating: 1,200 kg/m² UDL)

// AI rack (100 kW — H100 DGX with CDU)
Rack + servers   : ~550 kg
CDU (filled)     : ~200 kg
Coolant in rack  : ~50 kg
Total            : ~800 kg on 0.6 m² → ~1,333 kg/m²

// Immersion tank (100 kW single-phase)
Tank + fluid + servers : ~1,400 kg on 0.6 m² → ~2,330 kg/m²

// Structural engineer briefing requirement:
// Specify 2,500 kg/m² UDL for AI/immersion zones
// Standard 1,200 kg/m² UDL for ancillary / office zones
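The same arithmetic as a minimal Python sketch, useful for checking additional rack configurations — the masses and footprints are the illustrative values from the comparison above, not measured data:

# Equivalent uniformly distributed load (UDL) from rack mass and footprint.
# Masses are the illustrative figures from the comparison above.
RACK_CONFIGS = {
    "standard_air_20kW": {"mass_kg": 300,  "footprint_m2": 0.6},
    "ai_dlc_100kW":      {"mass_kg": 800,  "footprint_m2": 0.6},  # rack + CDU + coolant
    "immersion_100kW":   {"mass_kg": 1400, "footprint_m2": 0.6},  # tank + fluid + servers
}

def equivalent_udl(mass_kg: float, footprint_m2: float) -> float:
    """Rack mass spread over its own footprint, in kg/m²."""
    return mass_kg / footprint_m2

for name, cfg in RACK_CONFIGS.items():
    print(f"{name:18s}: {equivalent_udl(cfg['mass_kg'], cfg['footprint_m2']):5.0f} kg/m²")
# standard_air_20kW:   500 kg/m² — within a 1,200 kg/m² slab
# ai_dlc_100kW:       1333 kg/m² — exceeds a standard slab
# immersion_100kW:    2333 kg/m² — needs the 2,500 kg/m² AI-zone slab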

Column Grid and Bay Sizing

Conventional data centres use 1.2 m modular raised floor grids aligned to a 600 mm tile module. AI facilities with liquid cooling have no raised floor requirement — pipes, power, and network route overhead. The structural bay spacing should be designed around the cooling infrastructure layout, not the legacy tile grid. A 9 m × 9 m structural bay comfortably accommodates two liquid-cooled rack rows (10 racks per row, 600 mm rack pitch, 1,200 mm cold aisle, 1,500 mm hot aisle) with CDU alcoves at each row end. Confirm bay sizing with the liquid cooling system vendor at concept design stage.

Overhead Structural Capacity

In liquid-cooled AI halls, all infrastructure runs overhead: power busway, coolant supply and return headers, network cable trays, and fibre management. The combined weight of a fully loaded 800A busway, 150 mm diameter coolant pipes (filled), and cable tray can reach 180–250 kg per linear metre of run. Structural beams and hangers must be designed for this loading — it is routinely 3–5× the overhead loading assumed in a conventional data centre design.
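As a quick sanity check on hanger design loads, a two-line sketch — the 2 m hanger spacing is an assumed illustrative value; actual spacing is set by the structural engineer:

run_mass_kg_per_m = 250   # upper bound from above: busway + filled pipes + tray
hanger_spacing_m = 2.0    # assumed spacing — project-specific
print(f"Design load per hanger point: {run_mass_kg_per_m * hanger_spacing_m:.0f} kg")  # 500 kg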

Power Distribution: From Utility to Rack

Power distribution for 100 kW+ racks requires rethinking every tier of the electrical system — the utility supply capacity, transformer sizing, UPS architecture, LV switchgear, busway specification, and the rack-level power delivery.

Facility Power Density Calculation

// Facility-level power budget — 1,000 m² AI data hall

// Assumption: 10 rows × 10 racks/row = 100 racks at 100 kW each
IT load            = 100 × 100 kW      = 10,000 kW  (10 MW)
Cooling (CDUs)     = 100 × 8 kW        =    800 kW
Lighting + misc.   =                           100 kW
UPS losses (3%)    = 10,000 × 0.03     =    300 kW
Transformer losses =                           150 kW
──────────────────────────────────────────────────────
Total facility draw=                       11,350 kW
PUE                = 11,350 / 10,000   =     1.14
// Note: this budget excludes the heat-rejection plant (chillers / dry coolers /
// CRAH fans) — that power adds to the true facility draw and PUE.

// Power density: 11,350 kW / 1,000 m² = 11.35 kW/m²  (floor area)
// Compare: conventional DC ≈ 0.5–1.5 kW/m²
// AI-dense DC is 8–20× higher power density per floor area
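The same budget as a reusable Python function, handy for testing density scenarios — the loss fractions are the illustrative assumptions above, and heat-rejection plant is still excluded:

def facility_budget(racks: int, kw_per_rack: float,
                    cdu_kw_per_rack: float = 8.0,
                    lighting_misc_kw: float = 100.0,
                    ups_loss_frac: float = 0.03,
                    transformer_loss_kw: float = 150.0) -> dict:
    """Facility power budget mirroring the worked example above."""
    it_kw = racks * kw_per_rack
    total_kw = (it_kw + racks * cdu_kw_per_rack + lighting_misc_kw
                + it_kw * ups_loss_frac + transformer_loss_kw)
    return {"it_kw": it_kw, "total_kw": total_kw, "pue": round(total_kw / it_kw, 3)}

print(facility_budget(racks=100, kw_per_rack=100))
# {'it_kw': 10000, 'total_kw': 11350.0, 'pue': 1.135}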

Transformer Specification

At 10+ MW per hall, transformer sizing and K-Factor rating become critical. GPU server power supplies are highly non-linear loads — high harmonic content (predominantly 5th and 7th order) demands K-rated transformers. For a 10 MW AI hall, a typical deployment uses four 6.3 MVA cast resin transformers (K-20 rated) arranged as two 2N pairs, stepping the 11 kV or 33 kV utility supply down to 415 V. Each pair feeds an independent UPS bus (A-bus and B-bus), giving full 2N power-path redundancy to dual-corded GPU servers — with each bus sized to carry the entire ~11.4 MW facility load alone.
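A minimal sketch of the K-Factor calculation (UL 1561 form: K = Σ Ih²·h², with harmonic currents in per-unit of total RMS). The harmonic spectrum below is purely illustrative — use measured or vendor-supplied spectra for a real transformer selection:

def k_factor(harmonics: dict[int, float]) -> float:
    """K = sum(Ih_pu² × h²), with Ih_pu normalised to total RMS current."""
    i_rms_sq = sum(i ** 2 for i in harmonics.values())
    return sum((i ** 2 / i_rms_sq) * h ** 2 for h, i in harmonics.items())

# Harmonic order → current relative to fundamental (illustrative values only)
spectrum = {1: 1.00, 5: 0.33, 7: 0.20, 11: 0.09, 13: 0.07}
print(f"K-factor ≈ {k_factor(spectrum):.1f}")   # ≈ 6.4 for this spectrum
# Richer spectra (heavy single-phase SMPS content) push selection toward K-13 / K-20.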

K-Factor and transformer sizing: KVRM’s detailed guide to power transformer selection — including K-Factor calculation methodology, cast resin vs. oil-immersed comparison, and GIS substation interface design — is covered in our companion article: Power Transformer Selection for GIS-Integrated Substations →

UPS Architecture for AI Loads

AI training workloads impose a particularly demanding UPS duty cycle — sustained near-100% load during training runs lasting hours or days, then rapid ramp-down during model evaluation or job transitions. This load profile stresses UPS battery systems differently from conventional IT loads. Three requirements specific to AI UPS design:

  • 01

    Modular UPS — Efficiency at Variable Load

    During ramp-up phases (early project, before full rack population), a 10 MW hall may run at 20–30% of design load. A monolithic UPS at 20% load operates at 88–91% efficiency — wasting hundreds of kilowatts. Modular UPS architecture allows individual modules to be switched offline as load decreases, maintaining 95–97% efficiency across the full load range. Specify modular UPS from the outset; retrofitting modularity into a monolithic UPS is not possible.

  • 02

    Battery Autonomy Calibrated to Generator Transfer Time

    AI GPU servers cannot tolerate a power interruption — a training job running on 512 GPUs will lose all in-progress computation on any supply break. UPS battery autonomy must exceed the maximum credible generator transfer time by at least a 25% safety margin. A generator that achieves full load pick-up in 8 seconds under ideal conditions can credibly take 40–50 seconds after a failed first start attempt, so specify 60 seconds minimum battery autonomy — not the commonly specified 10 minutes. Excess battery capacity adds cost and weight without improving AI workload protection. A sizing sketch follows this list.

  • 03

    BESS Integration for Extended Backup

    Battery Energy Storage Systems (BESS) using lithium iron phosphate (LFP) chemistry are increasingly specified alongside traditional VRLA UPS batteries to extend backup duration in AI halls where, during extended utility outages, generator refuelling logistics can impose 20–30 minute gaps in on-site power. LFP BESS provides 2–4× the energy density of VRLA at the same footprint, with a 10–15 year life versus 3–5 years for VRLA. NFPA 855 compliance and thermal runaway containment must be addressed in the fire protection design.
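The autonomy-sizing sketch for requirement 02 above — the 45-second worst-case transfer time is an assumed figure (one failed start attempt plus retry); take the real value from generator commissioning tests:

def required_autonomy_s(worst_transfer_s: float,
                        margin: float = 0.25,
                        floor_s: float = 60.0) -> float:
    """Battery autonomy: worst credible generator transfer + margin, min 60 s."""
    return max(worst_transfer_s * (1.0 + margin), floor_s)

# Best-case pick-up 8 s; assumed credible worst case ~45 s with one failed start.
print(f"Specify {required_autonomy_s(45.0):.0f} s battery autonomy")   # 60 s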

Busway and Rack Power Delivery

Parameter | Conventional DC (20 kW/rack) | AI-Dense DC (100 kW/rack)
--- | --- | ---
Busway rating | 100–250 A tap-off busway | 800–1,600 A busway; 400 A tap-offs per rack
PDU per rack | 32 A single-phase or 16 A 3-phase | 63 A or 100 A 3-phase; dual-feed (A+B)
Cable cross-section | 4–10 mm² per circuit | 35–70 mm² per circuit; cable derating essential
Voltage drop (3-phase) | <3% acceptable | <2% — GPU PSUs sensitive to input voltage variation
Power factor | 0.85–0.92 typical | 0.93–0.97 (modern GPU PSUs) — PF correction less critical
Neutral sizing | 50% of phase conductor | 100% of phase conductor — harmonic neutral current
Earthing and bonding | Standard PE | Enhanced equipotential bonding; SRG for signal reference
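A sketch of the voltage-drop check behind the <2% requirement — the cable R and X values are typical handbook figures (assumptions); verify against the actual cable datasheet and derating tables:

import math

def vdrop_pct(i_a: float, length_m: float, r_ohm_per_km: float,
              x_ohm_per_km: float = 0.08, pf: float = 0.95,
              v_ll: float = 415.0) -> float:
    """Three-phase line-to-line voltage drop as a percentage of nominal."""
    sin_phi = math.sqrt(1.0 - pf ** 2)
    dv = (math.sqrt(3) * i_a * (length_m / 1000.0)
          * (r_ohm_per_km * pf + x_ohm_per_km * sin_phi))
    return 100.0 * dv / v_ll

# 100 A three-phase rack feed, 40 m run, 35 mm² copper (~0.63 Ω/km at
# operating temperature — assumed typical value).
print(f"Voltage drop ≈ {vdrop_pct(100, 40, 0.63):.2f} %")   # ≈ 1.04 % → passes <2%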

Cooling Architecture for 100 kW+ Racks

This is the section where AI-ready design diverges most decisively from conventional data centre practice. Air cooling is not viable above 30–35 kW/rack for sustained high-utilisation loads. It is not a matter of optimising airflow — the physics of air as a heat transfer medium simply cannot remove 100 kW from a 600 mm × 1,000 mm footprint without creating conditions that are operationally unworkable.

At 100 kW per rack, the choice is not between air cooling and liquid cooling. The choice is between which liquid cooling architecture — direct-to-chip, immersion, or rear-door liquid — and how to integrate it with the residual heat rejection plant.

Cooling Architecture Decision Matrix

Architecture | Max Rack Density | Server OEM Support | Facility Complexity | Retrofit Feasibility | India Deployment Status
--- | --- | --- | --- | --- | ---
Air only (CRAH) | 25 kW max (sustained) | All | Low | Existing halls | Standard — not AI-ready
Rear-door HX + air | 30–40 kW | All | Low–Medium | Good — clamps to existing racks | Available; limited deployments
DLC cold plate (H100/H200) | 50–100 kW | Major OEMs (Dell, HPE, Lenovo, NVIDIA) | Medium | Moderate — needs CDU room + piping | Active deployments — hyperscalers
DLC + air (hybrid) | 80–120 kW | OEM hybrid configs | Medium–High | Moderate | Emerging
Single-phase immersion | 100 kW+ (no hard ceiling) | Submer, GRC, LiquidStack tanks | High | Low — structural, CDU, fluid management | Pilot deployments; growing

Thermal Envelope Design

For DLC cold plate systems at 100 kW/rack, the thermal design must account for the split between liquid-cooled and air-cooled components within the same server. A typical H100 SXM server removes approximately 70–75% of total heat via the cold plate (GPU, HBM, VRM) and the remaining 25–30% via residual air cooling (PCIe cards, storage, motherboard components, memory DIMMs). This residual air heat — still 25–30 kW per rack — must be managed by in-row air cooling or rear-door heat exchangers. Assuming the room requires no air cooling in a DLC deployment is the most common design error in first-generation AI data centre specifications.

// Thermal split — H100 SXM DLC server (700W TDP per GPU, 8 GPUs)

GPU TDP            = 8 × 700W = 5,600W per server
Total server power = ~7,000W (GPU + memory + compute + PSU losses)

Liquid-cooled      = ~5,100W  (73%)  → removed by cold plate / CDU
Air-cooled         = ~1,900W  (27%)  → rejected to room air

// Scaled to a full 100 kW rack (same ~73/27 split)
Rack IT power      = ~100,000W total
Liquid-cooled heat = ~ 73,000W → to CDU secondary loop
Residual air heat  = ~ 27,000W → to room air  ← CANNOT be ignored

// Room cooling still required: 27 kW × 100 racks = 2,700 kW of air cooling
// Equivalent to ~64 standard 42 kW CRAH units — a substantial cooling plant

Network Infrastructure: Bandwidth at AI Scale

AI training at scale is as much a networking challenge as a compute challenge. The GPU interconnect bandwidth within a rack and between racks is the performance bottleneck that determines training throughput — and it imposes specific infrastructure requirements that are absent from conventional data centre network design.

InfiniBand HDR / NDR Fabric

GPU-to-GPU communication in distributed training uses InfiniBand at 200 Gb/s (HDR) or 400 Gb/s (NDR) per port. A 100-rack AI cluster requires a non-blocking fat-tree fabric with hundreds of spine and leaf InfiniBand switches. Switch power draw: 1.5–3.5 kW per switch. Include in the electrical load schedule — a 100-switch fabric adds 150–350 kW to the facility load.
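A rough fabric-sizing sketch under assumed parameters (64-port switches, 32 GPUs per rack). A strictly non-blocking two-tier fat tree tops out at radix²/2 end ports, so for 3,200 GPUs treat the result as a lower bound before adding the third tier a cluster this size needs:

import math

def fat_tree_lower_bound(ports_needed: int, radix: int = 64) -> tuple[int, int]:
    """Leaf/spine counts for a two-tier fat tree, half of each leaf's ports down."""
    down_per_leaf = radix // 2
    leaves = math.ceil(ports_needed / down_per_leaf)
    spines = math.ceil(leaves * down_per_leaf / radix)   # total uplinks / spine radix
    return leaves, spines

gpus = 100 * 32                     # 100 racks × 4 servers × 8 GPUs (assumed layout)
leaves, spines = fat_tree_lower_bound(gpus)
total = leaves + spines
print(f"{leaves} leaf + {spines} spine = {total} switches (two-tier lower bound)")
print(f"Fabric power ≈ {total * 1.5:.0f}–{total * 3.5:.0f} kW")
# → 150 switches, ≈ 225–525 kW — before the third switching tier is added.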

Cable Length — Latency and Loss Budget

InfiniBand active optical cables (AOCs) at 400 Gb/s have a maximum length of 100 m. The physical layout of the GPU cluster — rack positions, switch locations, cable routing paths — must be designed so that no GPU-to-switch cable exceeds this length. Overcrowded overhead cable trays also introduce bend radius violations that degrade optical signal integrity. Cable management is a structural and layout design issue, not purely an IT concern.

Storage Network (Parallel File System)

AI training workloads read training datasets from parallel file systems (Lustre, GPFS, WekaIO) at aggregate bandwidths of 1–10 TB/s for large clusters. The storage network — typically 100 GbE or HDR InfiniBand connecting GPU nodes to storage nodes — must be sized independently of the GPU fabric and served from dedicated network switches, not shared with the GPU interconnect fabric.

Management Network Separation

The out-of-band (OOB) management network — BMC/iDRAC access, power monitoring, liquid cooling controls, environmental sensors — must be physically separated from the in-band GPU interconnect fabric. A cooling control system failure that takes down the management network should not impact training jobs, and a training job failure should not impact cooling system monitoring. Use separate physical switches — VLANs are not sufficient isolation for safety-critical cooling control.

Floor Layout and Spatial Design

The spatial organisation of an AI data hall is fundamentally different from a conventional data centre. The constraints are driven by liquid cooling geometry, overhead infrastructure density, and the physical requirements of GPU cluster interconnect architecture.

Row Configuration for Liquid Cooling

Liquid-cooled AI racks are typically arranged in single-row clusters of 8–12 racks, each cluster served by a dedicated CDU alcove at one end. This contrasts with the conventional long-row layout (20–30 racks per row) of air-cooled halls. Shorter rows reduce hydraulic header length (improving pressure balance), simplify CDU placement, and allow individual cluster isolation for maintenance without affecting adjacent clusters. The trade-off is reduced floor-space utilisation — but at 100 kW/rack, deployed capacity per square metre is still dramatically better than in a conventional hall.

Aisle Width Requirements

AI data halls with liquid cooling require wider aisles than conventional designs. Minimum aisle widths for 100 kW+ liquid-cooled deployments: cold aisle 1,200 mm (for liquid manifold access and server extraction from open-top immersion tanks or side-access DLC racks); hot aisle 1,500 mm (for CDU positioning and large-form-factor DGX system delivery). These are minimums — 1,500 mm cold aisle and 2,000 mm hot aisle are strongly preferred for operational manageability.

Overhead Infrastructure Zoning

With all infrastructure overhead in a liquid-cooled hall, the ceiling zone must be carefully zoned to prevent conflicts: Zone A (highest) — structural steel and HVAC ducts; Zone B — power busway (800–1,600A); Zone C — coolant supply and return headers (150–200 mm pipe); Zone D — network cable tray (InfiniBand AOC and fibre); Zone E (lowest) — flexible drop hoses to racks. A minimum ceiling height of 4.5 m (clear) is required to accommodate all overhead zones plus maintenance access. Buildings designed for conventional data centres at 3.5 m clear ceiling height cannot accommodate this stack without structural modification.

Fire Protection: New Risks at High Density

AI-dense data centres present a materially different fire risk profile from conventional facilities. Three new risk factors must be addressed explicitly in the fire protection design.

  • 01

    Lithium-Ion Battery Thermal Runaway — Servers and BESS

    GPU servers contain lithium polymer batteries (in certain DGX configurations and management cards). BESS installations use LFP cells. Both present thermal runaway risk — a self-sustaining exothermic reaction that clean agent suppression systems cannot extinguish (they can only suppress the fire above the cell, not cool the cell itself). Fire zone design must provide physical separation between BESS and IT load zones, with BESS in a dedicated fire compartment with independent suppression and drainage for water application during thermal runaway events.

  • 02

    Dielectric Fluid Fire Risk (Immersion Systems)

    Mineral oil and synthetic ester dielectric fluids are combustible. AFFF foam must never be used in immersion rooms — it contaminates the fluid and renders entire tanks unusable. Specify water mist or a clean agent (FM-200 / Novec 1230) for immersion zones. Secondary containment (bunded floors, blind sumps) must capture any fluid spill before it reaches areas with ignition sources. The thermal risk is equally real: if a CDU pump fails while servers continue dissipating 100 kW into the tank, fluid temperature can approach flash point within minutes.

  • 03

    High-Voltage DC (HVDC) Arc Risk

    Some AI server architectures — including NVIDIA’s next-generation GB200 designs — use 48V and 400V HVDC bus architectures within the rack, bypassing traditional AC-to-DC conversion. HVDC arcs are sustained differently from AC arcs (AC naturally extinguishes at zero-crossing; DC does not) and can cause more severe burn damage. Protection coordination for HVDC systems requires specialist electrical review — conventional circuit breaker trip curves and arc flash calculations are not directly applicable to HVDC bus protection.

Monitoring and DCIM for AI Infrastructure

An AI data hall at 100 kW/rack is a high-speed thermal system — the thermal mass of liquid-cooled servers is much lower than air-cooled equivalents, meaning temperature excursions develop faster and consequences of cooling failure are more immediate. Monitoring must be real-time and response must be automated, not operator-dependent.

CDU Real-Time Telemetry

Supply and return temperatures (primary and secondary), flow rates, pump speeds, pressure differentials, and heat exchanger approach temperature — all sampled at 1-second intervals. Trend analysis identifies CDU fouling (rising approach temperature at constant load) weeks before it causes IT thermal throttling. Integrated with DCIM via BACnet or Modbus TCP.
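A polling sketch assuming a pymodbus-style Modbus TCP client and a hypothetical register map — every CDU vendor's map and scaling differ, so take addresses from the vendor's Modbus documentation:

from pymodbus.client import ModbusTcpClient

# Hypothetical CDU register map: name → (holding register address, scale factor).
REGISTERS = {
    "supply_temp_c":  (100, 0.1),
    "return_temp_c":  (101, 0.1),
    "flow_lpm":       (102, 0.1),
    "pump_speed_pct": (103, 1.0),
}

def poll_cdu(host: str) -> dict[str, float]:
    """Read one sample of CDU telemetry over Modbus TCP."""
    client = ModbusTcpClient(host)
    client.connect()
    sample = {}
    for name, (address, scale) in REGISTERS.items():
        result = client.read_holding_registers(address, count=1, slave=1)
        sample[name] = result.registers[0] * scale
    client.close()
    return sample

print(poll_cdu("192.0.2.10"))   # e.g. {'supply_temp_c': 30.1, 'return_temp_c': 41.5, ...}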

GPU Telemetry Integration (DCGM)

NVIDIA Data Centre GPU Manager (DCGM) exposes GPU junction temperature, thermal throttle status, power draw, and memory temperature per GPU via REST API. Integrating DCGM data with the DCIM platform closes the loop between IT thermal performance and facility cooling system response — enabling automatic CDU setpoint adjustment when GPU junction temperatures approach limits.
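A scraping sketch assuming NVIDIA's dcgm-exporter is running on each GPU node (Prometheus text format on its default port 9400) — the hostname and alert threshold below are illustrative:

import urllib.request

URL = "http://gpu-node-01:9400/metrics"   # hypothetical node hostname
GPU_TEMP_METRIC = "DCGM_FI_DEV_GPU_TEMP"  # dcgm-exporter GPU temperature field
ALERT_C = 85.0                            # assumed threshold — set per GPU spec

with urllib.request.urlopen(URL) as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith(GPU_TEMP_METRIC):
            labels, _, value = line.rpartition(" ")   # format: "NAME{labels} value"
            if float(value) >= ALERT_C:
                print(f"HOT GPU: {labels} = {value} °C → raise CDU flow / alert DCIM")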

Power Metering at Rack and PDU Level

Branch-level current metering at every PDU outlet provides per-server power draw data. At 100 kW/rack, a single server drawing 10% above its rated power is 700 W of unexpected heat — invisible without metering but detectable immediately with per-outlet monitoring. Metering points at the utility incomer, UPS output, and PDU input provide three-tier PUE and efficiency visibility.
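A sketch of the per-outlet anomaly check described above — the rated power and readings are illustrative sample data:

RATED_W = 7_000          # rated server draw (illustrative, per the thermal example)
TOLERANCE = 1.10         # flag anything more than 10% above rating

outlet_readings_w = {"A01": 6_900, "A02": 7_750, "A03": 7_050}   # sample PDU data

for outlet, watts in outlet_readings_w.items():
    if watts > RATED_W * TOLERANCE:
        print(f"Outlet {outlet}: {watts} W — {watts - RATED_W} W above rating, investigate")
# → Outlet A02: 7750 W — 750 W above rating, investigate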

Automated Cooling Response

At 100 kW/rack, GPU junction temperatures can reach thermal throttle limits within 60–90 seconds of a CDU pump failure. Manual operator response is too slow. The BMS must be programmed with automatic responses: CDU pump failure → activate standby pump within 5 seconds; CDU supply temperature rise above setpoint → increase secondary pump speed and alert operator; leak detection alarm → isolate affected manifold zone and alert immediately. These sequences must be tested as part of commissioning, not assumed.
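The response sequences above, sketched as illustrative control logic — in practice this lives in the BMS/PLC, not Python, and the class below is a stand-in for the real CDU interface:

def alert(message: str, urgent: bool = False) -> None:
    print(("URGENT: " if urgent else "ALERT: ") + message)

class CduInterface:
    """Stand-in for the real CDU control points exposed to the BMS."""
    def start_standby_pump(self) -> None: print("standby pump started")
    def raise_secondary_pump_speed(self) -> None: print("secondary pump speed raised")
    def isolate_manifold_zone(self, zone: str) -> None: print(f"zone {zone} isolated")

def on_event(event: str, cdu: CduInterface, zone: str = "A1") -> None:
    # Each branch mirrors one commissioning-tested sequence from the text above.
    if event == "PUMP_FAIL":
        cdu.start_standby_pump()               # target: standby running within 5 s
        alert("CDU pump failover engaged", urgent=True)
    elif event == "SUPPLY_TEMP_HIGH":
        cdu.raise_secondary_pump_speed()
        alert("CDU supply temperature above setpoint")
    elif event == "LEAK_DETECTED":
        cdu.isolate_manifold_zone(zone)
        alert(f"Leak detected — manifold zone {zone} isolated", urgent=True)

on_event("PUMP_FAIL", CduInterface())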

AI-Ready Design Checklist

# | Design Parameter | AI-Ready Requirement | Discipline
--- | --- | --- | ---
01 | Floor slab UDL | 2,500 kg/m² for AI / immersion zones; 1,200 kg/m² ancillary | Structural
02 | Clear ceiling height | Minimum 4.5 m clear — the overhead zone stack requires this | Structural / Arch.
03 | Cooling technology | DLC cold plate or immersion mandatory above 30 kW/rack sustained | Mechanical
04 | Residual air cooling | Size for 25–30% of IT load even with DLC — do not assume zero | Mechanical
05 | Power busway rating | 800–1,600 A overhead busway; 400 A rack tap-offs | Electrical
06 | Transformer K-Factor | K-20 rated; neutral conductor 100% of phase size | Electrical
07 | UPS architecture | Modular; 2N power path; battery autonomy calibrated to generator transfer | Electrical
08 | CDU redundancy | N+1 minimum; hot-swap isolation valves on all branches | Mechanical
09 | Overhead infrastructure zoning | 5-zone stack: structural / busway / coolant / network / drops | All
10 | Aisle widths | 1,200 mm cold aisle minimum; 1,500 mm hot aisle minimum | Layout
11 | InfiniBand fabric power | Include IB switch power in the electrical load schedule | Electrical
12 | BESS fire compartmentation | Separate fire compartment from IT zone; independent suppression | Fire / Structural
13 | Leak detection | Sensing cable along the full perimeter of all coolant headers; zone-identified alarms | Mechanical / BMS
14 | DCIM integration | CDU telemetry + GPU DCGM + PDU branch metering all in the DCIM platform | Controls / IT
15 | Design density horizon | Specify for 5-year hardware refresh density, not day-one fit-out | All

Conclusion: A New Building Type

An AI data centre designed for 100 kW+ per rack is not a data centre with better cooling. It is a new category of critical infrastructure — with structural loading requirements that approach industrial process facilities, power density that exceeds most manufacturing plants, cooling systems closer to chemical process plant than to HVAC, and networking infrastructure with the complexity of a major telecommunications exchange.

The engineering teams that will deliver these facilities successfully are those that treat the design as genuinely novel — not as an extrapolation of conventional data centre practice. Every assumption inherited from the 10 kW/rack era must be explicitly re-examined. Most of them no longer hold.

India’s AI data centre buildout is accelerating. The facilities that will house the next generation of Indian AI capability are being commissioned in the next 24 months. The engineers designing them today are setting the infrastructure baseline for a decade of AI development.

Designing an AI-Ready Data Centre?

KVRM provides complete MEP and structural engineering for AI-dense data centres — from concept power density modelling through detailed design of liquid cooling, electrical distribution, and fire protection for 100 kW+ rack deployments across India and the Gulf region.

Request a Free Consultation →
KVRM Engineering Team

AI Data Centre MEP · High-Density Infrastructure · Liquid Cooling · TIA-942
