AI Chip Supply Chain

The demand for Artificial Intelligence (AI) has driven an explosion in the need for powerful hardware, specifically cutting-edge AI chips. Companies like NVIDIA and AMD dominate the GPU market and have seen huge revenue jumps on the back of demand for their newest chips, driven mostly by data centers.

However, neither NVIDIA nor AMD manufactures its own chips; both rely on a complex, multi-stage global supply chain to bring highly sought-after processors like the H100 and B100 to market.

To meet the demands of huge data center buildouts, this supply chain must be both secure and scalable. It can be broken down into three main stages: Raw Materials, Chip Design, and Manufacturing.

1. The Foundation: Raw Materials and the Quest for Purity

The journey of an AI chip begins with the earth itself. The primary ingredient for semiconductor wafers is sand, specifically silica (silicon dioxide), most commonly found as the mineral quartz.

To create the silicon wafers used in GPUs, the extracted silica must be refined: it is heated in furnaces, reduced to metallic silicon, and then purified further. Crucially, purity is paramount. Producing the 9N (99.9999999% pure, i.e., impurities below roughly one part per billion) electronic-grade silicon needed for high-quality wafers requires starting from extremely pure quartz, known as High Purity Quartz (HPQ).

While the Earth’s crust holds huge deposits of silicon-bearing minerals, most are not pure enough for semiconductor production, and the high-grade HPQ that is pure enough is scarce.

  • Key Source: The Spruce Pine mining district in North Carolina, United States, is estimated to supply 70% to 90% of the world's HPQ. This concentration has led to fears of a massive chip shortage if the supply from this area were ever compromised.
  • Other Dependencies: Besides silicon, similar supply networks exist for the other materials chip production depends on, including copper (for electrical interconnects), aluminum (for packaging and heat dissipation), and tin (for the solder used to bond chips during assembly).
  • China’s Efforts: China has been trying to reduce its reliance on other countries for silicon, recently reporting the discovery of more than 35 million tons of HPQ in Qinghe County, Hebei Province, and in Xinjiang.

2. The Blueprint: Chip Design and Intellectual Property (IP)

Modern chip design is incredibly complex. For instance, NVIDIA’s H100 GPU contains around 80 billion transistors, and newer Blackwell-class chips such as the B100 exceed 200 billion. Designing at this scale demands extreme engineering skill to manage signal routing, power delivery through dense copper interconnects, heat dissipation, and transistor placement on a single piece of silicon.

Chip design companies fall into two major categories:

| Classification | Description | Examples |
| --- | --- | --- |
| Fabless | Design chips but outsource the actual manufacturing. | NVIDIA, AMD, Apple, Amazon (Inferentia/Trainium), Google (TPU), Microsoft (Azure Maia 100) |
| Integrated Device Manufacturers (IDMs) | Design their chips in-house and manufacture them as well. | Samsung, Intel |

The Power of IP

Modern chip design does not start from scratch. Engineers assemble large, complex structures from preverified building blocks, such as memory cells and logic gates, drawn from what is known as a standard cell library. These cells are combined like Lego bricks.
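As a loose software analogy (not real chip-design tooling), the sketch below builds a one-bit full adder out of a tiny "library" of preverified gate functions; the gate set and the NAND-based mappings here are illustrative assumptions, but the compose-rather-than-redesign workflow mirrors how standard cells are used:

```python
# Illustrative only: modeling standard-cell composition in Python.
# Each "cell" is a preverified primitive; larger blocks are built by
# wiring cells together rather than redesigning transistors from scratch.

def nand(a: int, b: int) -> int:
    """NAND primitive cell."""
    return 1 - (a & b)

def xor(a: int, b: int) -> int:
    """XOR built from four NAND cells (a classic gate-level mapping)."""
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder assembled from the library cells above."""
    s1 = xor(a, b)
    total = xor(s1, cin)
    carry = 1 - (nand(a, b) & nand(s1, cin))  # (a AND b) OR (s1 AND cin)
    return total, carry

# Quick check across all input combinations.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, c = full_adder(a, b, cin)
            assert s + 2 * c == a + b + cin
```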

As process nodes shrink (e.g., to 5 nm and 3 nm), the complexity of implementing even basic building blocks (like adders and multiplexers) increases dramatically. This is why companies like ARM, Synopsys, and Cadence guard their design IP closely and license it to chipmakers in exchange for royalties and license fees. Because these key IP providers are American (Synopsys, Cadence) or based in allied countries (the UK-based ARM), this reliance makes it extremely difficult for countries like China to achieve independence in chip design.

3. The Assembly Line: Manufacturing and the Foundries

Once the design is complete, the manufacturing process begins, combining the refined raw materials with the finalized chip designs from fabless companies like NVIDIA and AMD. These chips ultimately power the training of AI models at frontier labs such as OpenAI and Anthropic.

The companies that manufacture these advanced chips are called foundries.

Key Foundries

Only a few companies globally can manufacture advanced chips:

  • TSMC (Taiwan)
  • Samsung (South Korea)
  • Other players, such as Intel (US) and SMIC (China), have been expanding their ability to make cutting-edge chips.

Manufacturing Steps

Chip manufacturing involves a long series of complex steps, including wafer fabrication, oxidation, photolithography, etching, deposition, metal wiring, EDS (electrical die sorting, the wafer-level testing step), and packaging.

The Ultimate Bottleneck: Extreme Ultraviolet (EUV) Scanners

While all these steps are vital, the biggest bottleneck in the supply chain for foundries is the photolithography stage.

Photolithography is the process by which circuit patterns encoded on a photomask (derived from the fabless company’s design) are printed onto the silicon wafer. At the most advanced nodes, this delicate, high-precision process requires specialized equipment known as Extreme Ultraviolet (EUV) scanners.

The investment is immense: a single EUV machine costs hundreds of millions of dollars, with the latest high-NA models reportedly exceeding $300 million. What makes this a massive bottleneck is that essentially only one company in the world produces EUV machines: ASML, a Dutch firm. For any company that wants to operate an advanced foundry, acquiring these machines is non-negotiable.

4. The Global Race and Future Outlook

The reliance on a small number of foundries (in Taiwan and South Korea) and a single source for critical manufacturing equipment (ASML) has turned the AI chip supply chain into a geopolitical battleground.

China, recognizing this vulnerability, has repeatedly emphasized plans to reduce its dependence on foreign countries and build a more self-sufficient domestic semiconductor supply chain.

  • China’s Efforts: Chinese companies like Huawei, partnered with SMIC, are actively trying to compete. Huawei reportedly plans to manufacture about 600,000 of its Ascend 910C GPUs in 2026 and to expand production with upcoming models. However, China’s domestic lithography technology still lags: Huawei’s announced Ascend 950, for example, is estimated to deliver only roughly 6% of the performance of NVIDIA's next-generation chips.
  • US/Western Strategy: US companies like NVIDIA, AMD, Google, and Microsoft plan to keep innovating by leveraging their access to the most advanced foundries in Taiwan and South Korea, pushing process nodes down to 2 to 3 nanometers.

The AI chip supply chain acts like a sophisticated ecosystem where unique resources (HPQ) meet extreme complexity (200 billion transistor designs), all dependent on one indispensable machine (the ASML EUV scanner). This interconnected system determines which nations and companies lead the innovation of next-generation AI models.

5. Beyond the Chip: The Critical Energy Demands of AI

Once the highly complex AI chips (like the H100 or B100) are manufactured, they must be housed in massive data centers and supplied with enormous amounts of power to function. This consumption introduces one of the biggest bottlenecks and geopolitical challenges outside of manufacturing itself: energy supply.

Frontier AI models, such as GPT, Grok, Claude, and Gemini, all require significant power to operate in data centers globally.

The Cost of Training: A Small City’s Consumption

The energy required just to train a large language model (LLM) is staggering. For example, OpenAI’s GPT-4 model, estimated to have 1.7 trillion parameters and trained on roughly 13 trillion tokens, required approximately 20 septillion (2 × 10^25) floating-point operations.
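As a rough sanity check, the widely used heuristic of FLOPs ≈ 6 × parameters × tokens reproduces this figure, provided the parameter count is the number of parameters active per token (reportedly around 280 billion under GPT-4’s rumored mixture-of-experts design) rather than the full 1.7 trillion. Both inputs below are outside estimates, not disclosed values:

```python
# Back-of-the-envelope training compute using the common heuristic
# FLOPs ≈ 6 × (active parameters) × (training tokens).
# Both inputs are outside estimates, not numbers OpenAI has disclosed.

active_params = 2.8e11  # ~280B parameters active per token (rumored MoE routing)
tokens = 13e12          # ~13 trillion training tokens (estimate)

flops = 6 * active_params * tokens
print(f"{flops:.1e} FLOPs")  # 2.2e+25, i.e. roughly 20 septillion
```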

To execute this immense computation, OpenAI likely utilized up to 25,000 A100 GPUs over a period of about 3 months.

  • Hardware Demand: Each A100 GPU consumes up to 400 watts of power.
  • Server Topology: These GPUs are typically grouped into powerful systems: eight A100 GPUs are combined into a single NVIDIA HGX server, which draws between 3 and 6 kW. Newer, faster systems like the HGX H100 or B200 can draw more than 10 kW.
  • Total Training Energy: The 25,000 A100 GPUs correspond to about 3,125 HGX servers. Multiplying their power draw across the 90-day training period gives approximately 44 gigawatt-hours (GWh) of consumption (see the sketch after this list).
  • Context: To put this in perspective, the energy spent training a single LLM like GPT-4 could easily represent the monthly energy consumption of a small city of 50,000 people.
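A minimal sketch of that arithmetic follows; the average per-server draw is the one assumed input (taken here as roughly 6.5 kW, near the top of the range above once CPUs, networking, and fans are included):

```python
# Reproducing the ~44 GWh training-energy estimate. The average per-server
# power draw is an assumption; the other inputs come from the text above.

gpus = 25_000
servers = gpus / 8        # 8 GPUs per HGX server -> 3,125 servers

server_kw = 6.5           # assumed average draw per HGX A100 server (kW)
hours = 90 * 24           # ~3 months of continuous training

energy_gwh = servers * server_kw * hours / 1e6  # kWh -> GWh
print(f"{energy_gwh:.1f} GWh")                  # ~43.9 GWh, close to the ~44 GWh above
```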

The Inference Challenge: Deployment Energy Overwhelms Training

While training consumes huge bursts of power, deploying the model for general public use (inference) ends up requiring even more energy than the initial training.

OpenAI, for example, is reported to have over 700 million weekly active users on ChatGPT, receiving over 2.5 billion prompts per day.

  • Query Cost: The energy consumption for processing a single GPT-4 query is estimated at around 0.3 watt-hours (Wh).
  • Daily Draw: Servicing 2.5 billion prompts per day requires 750 megawatt-hours of energy daily.
  • Comparative Cost: Over a 90-day period of continuous serving, the energy cost totals approximately 67 GWh, surpassing the estimated 44 GWh for the initial training (see the sketch after this list).
  • Compounding Needs: AI companies often serve multiple models concurrently (e.g., Anthropic serves Claude Opus and Sonnet alongside older versions kept available for compatibility), causing energy needs to compound rapidly.
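The daily and 90-day figures follow from the same back-of-the-envelope arithmetic; the only assumed input is the published per-query estimate:

```python
# Daily and 90-day inference energy from the figures above; the 0.3 Wh
# per-query number is itself a published estimate, not a measured value.

prompts_per_day = 2.5e9
wh_per_query = 0.3

daily_mwh = prompts_per_day * wh_per_query / 1e6   # Wh -> MWh
serving_gwh = daily_mwh * 90 / 1e3                 # MWh -> GWh over 90 days

print(f"{daily_mwh:.0f} MWh/day")   # 750 MWh per day
print(f"{serving_gwh:.1f} GWh")     # ~67.5 GWh, vs ~44 GWh for training
```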

Cooling, Infrastructure, and Geopolitical Strategy

In addition to the operational energy for the chips themselves, data centers consume significant extra power for cooling and other overhead, captured by the Power Usage Effectiveness (PUE) metric. If a data center has a PUE of 1.2, the GPT-4 training estimate (44 GWh) rises to 52.8 GWh, and the deployment estimate (67 GWh) rises to roughly 80 GWh.
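In equation form, PUE is the ratio of total facility energy to the energy delivered to the computing hardware itself, so the chip-level figures scale linearly:

$$
\text{PUE} = \frac{E_{\text{facility}}}{E_{\text{IT}}}
\quad\Rightarrow\quad
E_{\text{facility}} = \text{PUE} \times E_{\text{IT}},
\qquad
1.2 \times 44 = 52.8~\text{GWh},
\quad
1.2 \times 67 \approx 80~\text{GWh}
$$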

This growing demand is straining energy grids:

  • US Demand: In 2023, US data centers consumed 176 terawatt-hours (TWh) of electricity, representing about 4.4% of total US consumption. That share is projected to grow to 8% to 10% by 2030 (see the sketch after this list).
  • US Infrastructure Buildout: To meet this future need, major companies are planning massive self-generated power facilities. OpenAI is planning the Stargate facility with a capacity of up to 5 gigawatts, and xAI purchased methane gas turbines for its Colossus facility. However, US energy buildout is driven primarily by private corporations and faces regulatory obstacles, lawsuits, and local-government pushback over permits and grid strain.
  • China’s Centralized Approach: China is adopting a centralized, state-level approach to energy buildout, potentially sidestepping the bureaucratic delays seen in the US. China has rapidly expanded renewable and nuclear power, installing over 609 GW of solar and 441 GW of wind capacity by the end of 2023, with 27 nuclear reactors under construction.
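A quick check on those grid-share numbers, under the simplifying (and conservative) assumption that total US consumption stays flat through 2030:

```python
# Sanity check on the grid-share figures quoted above, assuming total US
# electricity consumption stays flat (it is in fact expected to grow).

dc_twh_2023 = 176
share_2023 = 0.044

us_total_twh = dc_twh_2023 / share_2023              # ~4,000 TWh nationwide
low, high = 0.08 * us_total_twh, 0.10 * us_total_twh
print(f"{low:.0f}-{high:.0f} TWh")                   # ~320-400 TWh by 2030
```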

While energy alone is not the sole bottleneck, its supply must grow in parallel with advances in chip technology (like the GB200 GPU), model architectures (like mixture of experts), and cooling systems; together, these factors will determine who leads the next generation of AI innovation.