Making a Frosted Cake and Processor in Memory vs. Memory in Processor

There are two ways to make a frosted cake.

The first is to construct a self-supporting structure out of frosting with a hole in the middle. Then bake a cake. Then carefully insert the cake into the frosting structure without collapsing the frosting or destroying the cake.

[Image: chocolate cake with icing]

The second is to bake a cake and smear on frosting.

Combining Memory and CPU

Similarly, there are two ways to combine memory and CPU. Embedding memory into an existing CPU seems the most obvious. Embedding a CPU into an existing DRAM seems a little more difficult, and it is. However, the embedded-memory approach turns out to be stupendously expensive, while the less obvious approach is cheaper than dirt.

Combining memory and CPU was first proposed by Fish and presented in the Moore/Fish patent filed in August of 1989. The immediately obvious benefit of such a combination was reducing the time required for the CPU to access memory.

Subsequent to the issuance of the patent, DARPA (Defense Advanced Research Projects Agency) and JPL (NASA Jet Propulsion Laboratory) funded about a dozen of the best-known computer architects in the country to implement the technology. Architects included David Patterson of Berkeley, inspiration for the SPARC microprocessor and author of the most widely used text in the area, "Computer Architecture: A Quantitative Approach"; Thomas Sterling of Caltech, creator of the massively parallel Beowulf cluster; and Peter Kogge of Notre Dame, inventor of the Kogge-Stone adder. Over a period of 10 years DARPA spent as much as $500M on projects related to solving this problem.

The DARPA architects succeeded in combining CPU and memory in several of their efforts. They confirmed that bringing CPU and memory together on a single chip significantly improved performance. However, none of the projects left so much as a ripple in the computer world. DARPA succeeded in creating stupendously expensive, difficult-to-make designs with undersized memories. The last of the projects were defunded in December of 2011.

Why DARPA and JPL Failed

DARPA and JPL failed because they chose to make frosted cakes backwards. They called what they did PIM (Processor in Memory), but they were really doing MIP (Memory in Processor): they embedded memory into existing CPUs. CPUs are normally built on complex processes that emphasize performance and are unconcerned about leakage. Dynamic RAM (DRAM), the most common main memory technology, remembers data by storing charge on small capacitors. DRAM processes must be very low leakage or the memories will forget.

The DARPA and JPL architects solved the problem of combining the two technologies by using a new DRAM technology that could be "embedded" into existing CPU logic. The solution had the following disadvantages:

  1. It required developing a new and complex manufacturing process that was even more expensive than the existing CPU processes.
  2. In order to be embedded, the resulting DRAMs required compromises such that their maximum size was about 1/4 that of a normal DRAM.

Why TOMI Technology Succeeded

TOMI Technology makes CPUs from transistors fabricated on existing DRAMs, using unmodified DRAM processes. The TOMI approach has the following advantages over the DARPA approach:

  1. TOMI Technology is manufactured using the same unmodified process on the same production lines as DRAMs.
  2. Since TOMI CPUs are combined with existing commodity DRAMs, no compromise need be made on the memory size.

Problems Overcome by the TOMI Invention

  1. A transistor on a CPU process can be connected by up to a dozen layers of metal. A DRAM process can use only 3 layers.
    1. This means that the transistors in a legacy CPU such as one made by AMD or INTEL cannot be connected together with the limited layers of metal available in a DRAM process.
    2. TOMI required the invention of a very simple CPU that could be connected with limited layers of metal but had the performance necessary to move and calculate on large amounts of data at high speed. For comparison, each core of a multi-core INTEL Xeon has about 250 million transistors. A 64-bit TOMI Celeste has just over 400,000. Simplification was accomplished by extensive benchmarking with MapReduce and graph analysis code to determine the most frequently used instructions. Further reductions came from the invention of a new addressing technique and a way to efficiently connect CPU buses with internal memory buses.
    3. As a result, a TOMI Celeste core is small enough to be laid out as a bit-slice and connected by hand within a few weeks using only 3 layers of metal.

  2. A transistor on a CPU process can be up to 20% faster than a DRAM transistor. It turns out, however, that performance on legacy CPUs such as those made by AMD and INTEL is limited by heat, not transistor speed. Up to 60% of the energy turned into heat by a legacy CPU is static power loss due to leakage. This means that a CPU that normally consumes 100 watts is losing 60 of those watts just to leaky transistors. TOMI Technology is built using DRAM transistors that leak about 0.001% as much as CPU transistors. Therefore TOMI transistors can run just a few degrees above ambient temperature, compared to the 200°F die temperature of legacy CPUs. Transistors running near ambient temperature can be up to twice as fast as those running at 200°F. Furthermore, since the TOMI architecture is implemented with 1/1000th the transistors of legacy CPUs, dynamic power is similarly reduced.
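The power arithmetic above can be sketched numerically. The wattages and percentages are the ones quoted in the text; the TOMI figures are rough estimates that simply apply the stated leakage and transistor-count ratios:

```python
# Static vs. dynamic power, using the figures quoted above.
legacy_total_w = 100.0         # a CPU that "normally consumes 100 watts"
static_fraction = 0.60         # "up to 60%" of the heat is leakage
legacy_static_w = legacy_total_w * static_fraction    # 60 W lost to leakage
legacy_dynamic_w = legacy_total_w - legacy_static_w   # 40 W of switching power

# DRAM transistors leak ~0.001% as much as CPU transistors, and TOMI
# uses ~1/1000th the transistor count, which scales dynamic power down
# by roughly the same factor (a simplifying assumption).
tomi_static_w = legacy_static_w * 0.00001
tomi_dynamic_w = legacy_dynamic_w / 1000.0

print(f"Legacy: {legacy_static_w:.0f} W static + {legacy_dynamic_w:.0f} W dynamic")
print(f"TOMI estimate: {tomi_static_w + tomi_dynamic_w:.4f} W")
```

The estimate ignores clock-rate and workload differences; it only shows how the two ratios compound.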

Other Benefits of TOMI Technology

TOMI CPUs are immediately adjacent to the main memory DRAM. Therefore access to main memory is very nearly as fast as access to the cache of a legacy CPU. A cache is a high-speed memory placed on a CPU to hold data fetched from main memory. Because data is often reused in computer programs, the cache allows once-read data to be re-read from the high-speed cache instead of the slow main memory. On a legacy CPU, a main memory read can take 200-500X the time of a cache read.

Cache effectiveness assumes that data will be reused by a program. This is true of legacy applications such as Microsoft Office. However, the most important high performance applications today are those that handle what is called Big Data. Big Data refers to amounts of information too large to handle with traditional tools such as single processors and SQL databases. Big Data analysis attempts to predict the future by analyzing patterns in past behavior. Examples include analyzing ecommerce transactions in real time in order to adjust advertising or pricing on the fly.

Two of the most popular techniques for managing Big Data are MapReduce and Graph Analysis. MapReduce sorts huge amounts of data and Graph Analysis locates patterns. The patterns in many Big Data applications are nearly random. This means that the relationships to be analyzed are not likely to be adjacent in memory. Furthermore, once a relationship is identified it will not be referred to again very often.
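To make the MapReduce pattern concrete, here is a minimal single-machine sketch using word count, the conventional teaching example. Real MapReduce systems distribute the map, shuffle, and reduce steps across many machines; this sketch only shows the shape of the computation:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # The "map" step: emit a (key, value) pair for every word.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # The "shuffle" and "reduce" steps: group pairs by key and sum.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big patterns", "data patterns repeat"]
result = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
print(result)  # {'big': 2, 'data': 2, 'patterns': 2, 'repeat': 1}
```

The reduce step is essentially a giant sort-and-group over data far larger than any cache, which is why the memory access pattern matters so much.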

A cache is only effective if the data it holds is accessed repeatedly. Big Data graph analysis destroys cache effectiveness by constantly requiring access to main memory and incurring the 200-500X speed penalty.
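The effect can be illustrated with a toy experiment: walking an array in order is cache-friendly, while following a shuffled index sequence (a stand-in for the pointer-chasing of graph analysis) defeats the cache. The timing gap is modest in interpreted Python but dramatic in compiled code:

```python
import random
import time

# Walk a large array sequentially, then in shuffled order. The shuffled
# walk touches memory in a near-random pattern, as graph analysis does,
# so each access is likely to miss the cache.
N = 1_000_000
data = list(range(N))
shuffled = list(range(N))
random.shuffle(shuffled)

def walk(indices):
    total = 0
    for i in indices:
        total += data[i]
    return total

t0 = time.perf_counter(); seq_sum = walk(range(N)); t_seq = time.perf_counter() - t0
t0 = time.perf_counter(); rnd_sum = walk(shuffled); t_rnd = time.perf_counter() - t0
assert seq_sum == rnd_sum  # same work, different access order
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rnd:.3f}s")
```

Both walks do identical arithmetic; any difference in time comes purely from the memory access order.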

TOMI Technology positions the CPUs within a few microns of main memory thereby eliminating the speed penalty.

Finally, DRAM transistors are about the cheapest transistors that can be made. For example, 1 billion INTEL Xeon transistors made on a CPU process cost more than $200. Four billion transistors on a 4G DRAM made on a memory process cost less than $1.50. A TOMI 8-core 64-bit CPU adds about 4 million transistors to a 4 billion transistor 4G DRAM.
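The cost comparison works out as follows, using the figures quoted above:

```python
# Cost per transistor, using the figures quoted above.
xeon_cost, xeon_transistors = 200.0, 1e9   # >$200 for 1 billion CPU-process transistors
dram_cost, dram_transistors = 1.50, 4e9    # <$1.50 for 4 billion DRAM-process transistors

cpu_per_transistor = xeon_cost / xeon_transistors    # $2.0e-7 each
dram_per_transistor = dram_cost / dram_transistors   # $3.75e-10 each
ratio = cpu_per_transistor / dram_per_transistor
print(f"CPU-process transistors cost ~{ratio:.0f}x more")  # ~533x

# A TOMI 8-core 64-bit CPU adds ~4 million transistors to a
# 4-billion-transistor DRAM:
overhead = 4e6 / 4e9
print(f"TOMI CPU adds {overhead:.2%} to the transistor count")  # 0.10%
```

In other words, the added CPU is a rounding error on the DRAM's transistor budget.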

Designs made with TOMI Technology are therefore very inexpensive.

"I actually have to go on record as saying that, at some time, this (TOMI) would be the way to go." EDN

- Dr. David Patterson (RISC pioneer, inspiration for SPARC, IRAM inventor)

"...delighted, even envious" WIRED

- Dr. Thomas Sterling (Creator of the Beowulf supercomputer, DARPA Exascale project, Gilgamesh inventor)

"The entity that controls [TOMI] probably controls computer architecture to the end of silicon." WIRED

- Russell Fish