Nvidia CEO Jen-Hsun Huang used the opening keynote of the company’s annual GPU Technology Conference to announce a massive new processor designed specifically for deep learning. The Tesla P100 is the first shipping product to use Nvidia’s new Pascal architecture, and is made up of 15.3 billion transistors, which the company says makes it the largest microchip ever fabricated.
The Tesla P100 is built on a new 16nm FinFET manufacturing process and uses 16GB of HBM2 graphics memory integrated onto the same substrate as the processor, giving it memory bandwidth of up to 720GBps. Peak performance is rated at 21.2 Teraflops for half-precision instructions, 10.6 Teraflops for single-precision and 5.3 Teraflops for double-precision workloads. Up to eight Tesla P100 chips can be interconnected using Nvidia’s NVLink bus.
The Tesla P100 is claimed to deliver over 12x the performance of Nvidia’s previous-generation Maxwell architecture in neural network training scenarios. Specific applications, such as the AMBER molecular dynamics code, run faster on one Tesla P100 server node than on 48 dual-socket CPU server nodes, according to Nvidia.
Huang also said that the company has decided to “go all-in on AI”, and that deep learning and artificial intelligence are its fastest-growing business area. He named several areas of research, including finding a cure for cancer and understanding climate change, which require computing resources that can scale infinitely.
Massachusetts General Hospital has set up a clinical datacentre that will use Nvidia’s AI processing technology to help diagnose diseases, starting with the fields of radiology and pathology. The hospital will use its archive of 10 billion medical images to create a deep learning neural network.
The Tesla P100 will initially be available in Nvidia’s new DGX-1 “deep learning supercomputer” in June, and in servers from a number of manufacturers beginning in early 2017. The DGX-1 will have eight Tesla P100 chips for a combined 170 Teraflops of half-precision performance, and is claimed to be able to deliver the deep learning throughput of 250 traditional x86 servers in a single 3U server enclosure.
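The combined DGX-1 figure follows from the per-chip numbers quoted for the Tesla P100. A minimal Python sketch of that arithmetic is shown here; the variable and function names are illustrative only, and the figures are Nvidia's stated peak ratings rather than measured throughput.

```python
# Peak ratings per Tesla P100, in teraflops, as quoted by Nvidia.
P100_PEAK_TFLOPS = {"half": 21.2, "single": 10.6, "double": 5.3}

# The DGX-1 is described as housing eight Tesla P100 chips.
CHIPS_PER_DGX1 = 8


def combined_peak_tflops(per_chip_tflops, chips):
    """Theoretical aggregate peak, assuming perfect scaling across chips."""
    return per_chip_tflops * chips


# Aggregate half-precision peak for one DGX-1 enclosure.
dgx1_half = combined_peak_tflops(P100_PEAK_TFLOPS["half"], CHIPS_PER_DGX1)

# Each step down in precision doubles the quoted peak rate.
half_to_single = P100_PEAK_TFLOPS["half"] / P100_PEAK_TFLOPS["single"]
half_to_double = P100_PEAK_TFLOPS["half"] / P100_PEAK_TFLOPS["double"]

print(f"DGX-1 half-precision peak: {dgx1_half:.1f} teraflops")
print(f"half vs single: {half_to_single:.1f}x, half vs double: {half_to_double:.1f}x")
```

Eight chips at 21.2 Teraflops each come to 169.6 Teraflops, which rounds to the 170 Teraflops Nvidia quotes. This is a theoretical peak that assumes perfect scaling across all eight chips.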