Coder Social home page Coder Social logo

neurosim / 3d_neurosim_v1.0 Goto Github PK

View Code? Open in Web Editor NEW
19.0 2.0 9.0 10.32 MB

Benchmark framework of 3D integrated CIM accelerators for popular DNN inference, support both monolithic and heterogeneous 3D integration

Python 11.20% C++ 86.60% C 1.92% Makefile 0.28%
inference-engine 3d-integration ieee iedm xiaochen-peng

3d_neurosim_v1.0's People

Contributors

neurosim avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

3d_neurosim_v1.0's Issues

different result

I try to get the resullt of the 2D 7nm SRAM. I use 8-bit VGG-8 network on CIFAR-10 dataset. The VGG-8 network model is from DNN_NeuroSim_V1.4.

I set memcelltype = 1, novelMapping = true, SARADC = true, validated = false, synchronous = false, pipeline = false, M3D = false, technode = 7, featuresize = 18e-9, wireWidth = 1, levelOutput = 16, cellBit = 1, heightInFeatureSizeSRAM = 16, widthInFeatureSizeSRAM = 34.43, widthSRAMCellNMOS = 1, numColMuxed = 8

But I get the readDynamicEnergy is: 9.62642e+07pJ. It is different with the result in 'Benchmarking Monolithic 3D Integration for Compute-in-Memory Accelerators: Overcoming ADC Bottlenecks and Maintaining Scalability to 7nm or Beyond ' which is:
Area: 8.36mm^2, TOPS/W: 30.30, TOPS: 1.95, Power Density: 7.72e-03 W/mm^2, latency: 600us, dynamic energy: 35uJ

Do you have any suggestions to help me get the results similar to those in the paper?

My result is here.

------------------------------ Summary --------------------------------

ChipArea : 9.46458e+06um^2
Chip total CIM array : 3.52389e+06um^2
Total IC Area on chip (Global and Tile/PE local): 931046um^2
Total ADC (or S/As and precharger for SRAM) Area on chip : 2.04312e+06um^2
Total Accumulation Circuits (subarray level: adders, shiftAdds; PE/Tile/Global level: accumulation units) on chip : 1.80574e+06um^2
Other Peripheries (e.g. decoders, mux, switchmatrix, buffers, pooling and activation units) : 1.16078e+06um^2

Chip layer-by-layer readLatency (per image) is: 603729ns
Chip total readDynamicEnergy is: 9.62642e+07pJ
Chip total leakage Energy is: 6.02362e+06pJ
Chip total leakage Power is: 7531.8uW
Chip buffer readLatency is: 314434ns
Chip buffer readDynamicEnergy is: 236904pJ
Chip ic readLatency is: 65154.7ns
Chip ic readDynamicEnergy is: 3.45468e+06pJ

************************ Breakdown of Latency and Dynamic Energy *************************

----------- ADC (or S/As and precharger for SRAM) readLatency is : 173409ns
----------- Accumulation Circuits (subarray level: adders, shiftAdds; PE/Tile/Global level: accumulation units) readLatency is : 10241.2ns
----------- Other Peripheries (e.g. decoders, mux, switchmatrix, buffers, IC, pooling and activation units) readLatency is : 420079ns
----------- ADC (or S/As and precharger for SRAM) readDynamicEnergy is : 8.11379e+07pJ
----------- Accumulation Circuits (subarray level: adders, shiftAdds; PE/Tile/Global level: accumulation units) readDynamicEnergy is : 8.23443e+06pJ
----------- Other Peripheries (e.g. decoders, mux, switchmatrix, buffers, IC, pooling and activation units) readDynamicEnergy is : 6.8919e+06pJ

************************ Breakdown of Latency and Dynamic Energy *************************


----------------------------- Performance -------------------------------
Chip Operation Temperature (K): 313
Energy Efficiency TOPS/W (Layer-by-Layer Process): 12.0428
Throughput TOPS (Layer-by-Layer Process): 2.04038
Throughput FPS (Layer-by-Layer Process): 1656.37
Compute efficiency TOPS/mm^2 (Layer-by-Layer Process): 0.21558
Power Density W/mm^2 (Layer-by-Layer Process): 0.0179011
-------------------------------------- Hardware Performance Done --------------------------------------

My 'Param.cpp' is here.

Param::Param() {
	/***************************************** user defined design options and parameters *****************************************/
	operationmode = 2;     		// 1: conventionalSequential (Use several multi-bit RRAM as one synapse)
								// 2: conventionalParallel (Use several multi-bit RRAM as one synapse)
	
	memcelltype = 1;        	// 1: cell.memCellType = Type::SRAM
								// 2: cell.memCellType = Type::RRAM
								// 3: cell.memCellType = Type::FeFET
	
	accesstype = 1;         	// 1: cell.accessType = CMOS_access
								// 2: cell.accessType = BJT_access
								// 3: cell.accessType = diode_access
								// 4: cell.accessType = none_access (Crossbar Array)
	
	transistortype = 1;     	// 1: inputParameter.transistorType = conventional
	
	deviceroadmap = 2;      	// 1: inputParameter.deviceRoadmap = HP
								// 2: inputParameter.deviceRoadmap = LSTP
								
	globalBufferType = false;    // false: register file
								// true: SRAM
	globalBufferCoreSizeRow = 128;
	globalBufferCoreSizeCol = 128;
	
	tileBufferType = false;      // false: register file
								// true: SRAM
	tileBufferCoreSizeRow = 32;
	tileBufferCoreSizeCol = 32;
	
	peBufferType = false;        // false: register file
								// true: SRAM
	
	chipActivation = true;      // false: activation (reLu/sigmoid) inside Tile
								// true: activation outside Tile
						 		
	reLu = true;                // false: sigmoid
								// true: reLu
								
	novelMapping = true;        // false: conventional mapping
								// true: novel mapping
								
	SARADC = true;              // false: MLSA
	                            // true: sar ADC
	currentMode = true;         // false: MLSA use VSA
	                            // true: MLSA use CSA
	
	pipeline = false;            // false: layer-by-layer process --> huge leakage energy in HP
								// true: pipeline process
	speedUpDegree = 8;          // 1 = no speed up --> original speed
								// 2 and more : speed up ratio, the higher, the faster
								// A speed-up degree upper bound: when there is no idle period during each layer --> no need to further fold the system clock
								// This idle period is defined by IFM sizes and data flow, the actual process latency of each layer may be different due to extra peripheries
	
	validated = false;			// false: no calibration factors
								// true: validated by silicon data (wiring area in layout, gate switching activity, post-layout performance drop...)
								
	synchronous = false;			// false: asynchronous
								// true: synchronous, clkFreq will be decided by sensing delay
								
	M3D = false;                 // false: run 2D simulation
								// true: run M3D simulation
								
	/*** algorithm weight range, the default wrapper (based on WAGE) has fixed weight range of (-1, 1) ***/
	algoWeightMax = 1;
	algoWeightMin = -1;
	
	/*** conventional hardware design options ***/
	clkFreq = 1e9;                      // Clock frequency
	temp = 300;                         // Temperature (K)
	// technode: 130	 --> wireWidth: 175
	// technode: 90		 --> wireWidth: 110
	// technode: 65      --> wireWidth: 105
	// technode: 45      --> wireWidth: 80
	// technode: 32      --> wireWidth: 56
	// technode: 22      --> wireWidth: 40
	// technode: 14      --> wireWidth: 25
	// technode: 10, 7   --> wireWidth: 18
	technode = 7;                      // Technology
	featuresize = 18e-9;                // Wire width for subArray simulation
	wireWidth = 18;                     // wireWidth of the cell for Accuracy calculation
	globalBusDelayTolerance = 0.1;      // to relax bus delay for global H-Tree (chip level: communication among tiles), if tolerance is 0.1, the latency will be relax to (1+0.1)*optimalLatency (trade-off with energy)
	localBusDelayTolerance = 0.1;       // to relax bus delay for global H-Tree (tile level: communication among PEs), if tolerance is 0.1, the latency will be relax to (1+0.1)*optimalLatency (trade-off with energy)
	treeFoldedRatio = 4;                // the H-Tree is assumed to be able to folding in layout (save area)
	maxGlobalBusWidth = 2048;           // the max buswidth allowed on chip level (just a upper_bound, the actual bus width is defined according to the auto floorplan)
										// NOTE: Carefully choose this number!!!
										// e.g. when use pipeline with high speedUpDegree, i.e. high throughput, need to increase the global bus width (interface of global buffer) --> guarantee global buffer speed

	numRowSubArray = 128;               // # of rows in single subArray
	numColSubArray = 128;               // # of columns in single subArray
	
	/*** option to relax subArray layout ***/
	relaxArrayCellHeight = 0;           // relax ArrayCellHeight or not
	relaxArrayCellWidth = 0;            // relax ArrayCellWidth or not
	
	numColMuxed = 8;                    // How many columns share 1 ADC (for eNVM and FeFET) or parallel SRAM
	levelOutput = 16;                   // # of levels of the multilevelSenseAmp output, should be in 2^N forms; e.g. 32 levels --> 5-bit ADC
	cellBit = 1;                        // precision of memory device 
	
	/*** parameters for SRAM ***/
	// due the scaling, suggested SRAM cell size above 22nm: 160F^2
	// SRAM cell size at 14nm: 300F^2
	// SRAM cell size at 10nm: 400F^2
	// SRAM cell size at 7nm: 600F^2
	heightInFeatureSizeSRAM = 16;        // SRAM Cell height in feature size  
	widthInFeatureSizeSRAM = 34.43;        // SRAM Cell width in feature size  
	widthSRAMCellNMOS = 1;                            
	widthSRAMCellPMOS = 1;
	widthAccessCMOS = 1;
	minSenseVoltage = 0.1;
	
	/*** parameters for analog synaptic devices ***/
	heightInFeatureSize1T1R = 4;        // 1T1R Cell height in feature size
	widthInFeatureSize1T1R = 12;         // 1T1R Cell width in feature size
	heightInFeatureSizeCrossbar = 2;    // Crossbar Cell height in feature size
	widthInFeatureSizeCrossbar = 2;     // Crossbar Cell width in feature size
	
	resistanceOn = 6e3;               // Ron resistance at Vr in the reported measurement data (need to recalculate below if considering the nonlinearity)
	resistanceOff = 6e3*150;           // Roff resistance at Vr in the reported measurement dat (need to recalculate below if considering the nonlinearity)
	maxConductance = (double) 1/resistanceOn;
	minConductance = (double) 1/resistanceOff;
	
	readVoltage = 0.5;	                // On-chip read voltage for memory cell
	readPulseWidth = 10e-9;             // read pulse width in sec
	accessVoltage = 1.1;                // Gate voltage for the transistor in 1T1R
	resistanceAccess = resistanceOn*IR_DROP_TOLERANCE;            // resistance of access CMOS in 1T1R
	writeVoltage = 2;					// Enable level shifer if writeVoltage > 1.5V
	
	/*** Calibration parameters ***/
	if(validated){
		alpha = 1.44;	// wiring area of level shifter
		beta = 1.4;  	// latency factor of sensing cycle
		gamma = 0.5; 	// switching activity of DFF in shifter-add and accumulator
		delta = 0.15; 	// switching activity of adder 
		epsilon = 0.05; // switching activity of control circuits
		zeta = 1.22; 	// post-layout energy increase
	}		
	
	/***************************************** user defined design options and parameters *****************************************/
	
	
	
	/***************************************** Initialization of parameters NO need to modify *****************************************/
	
	if (memcelltype == 1) {
		cellBit = 1;             // force cellBit = 1 for all SRAM cases
	} 
	
	/*** initialize operationMode as default ***/
	conventionalParallel = 0;
	conventionalSequential = 0;
	BNNparallelMode = 0;                
	BNNsequentialMode = 0;              
	XNORsequentialMode = 0;          
	XNORparallelMode = 0;         
	switch(operationmode) {
		case 6:	    XNORparallelMode = 1;               break;     
		case 5:	    XNORsequentialMode = 1;             break;     
		case 4:	    BNNparallelMode = 1;                break;     
		case 3:	    BNNsequentialMode = 1;              break;     
		case 2:	    conventionalParallel = 1;           break;     
		case 1:	    conventionalSequential = 1;         break;     
		default:	printf("operationmode ERROR\n");	exit(-1);
	}
	
	/*** parallel read ***/
	parallelRead = 0;
	if(conventionalParallel || BNNparallelMode || XNORparallelMode) {
		parallelRead = 1;
	} else {
		parallelRead = 0;
	}
	
	/*** Initialize interconnect wires ***/
	switch(wireWidth) {
		case 175: 	AR = 1.60; Rho = 2.20e-8; break;  // for technode: 130
		case 110: 	AR = 1.60; Rho = 2.52e-8; break;  // for technode: 90
		case 105:	AR = 1.70; Rho = 2.68e-8; break;  // for technode: 65
		case 80:	AR = 1.70; Rho = 3.31e-8; break;  // for technode: 45
		case 56:	AR = 1.80; Rho = 3.70e-8; break;  // for technode: 32
		case 40:	AR = 1.90; Rho = 4.03e-8; break;  // for technode: 22
		case 25:	AR = 2.00; Rho = 5.08e-8; break;  // for technode: 14
		case 18:	AR = 2.00; Rho = 6.35e-8; break;  // for technode: 7, 10
		case -1:	break;	// Ignore wire resistance or user define
		default:	exit(-1); puts("Wire width out of range"); 
	}
	
	if (memcelltype == 1) {
		wireLengthRow = wireWidth * 1e-9 * heightInFeatureSizeSRAM;
		wireLengthCol = wireWidth * 1e-9 * widthInFeatureSizeSRAM;
	} else {
		if (accesstype == 1) {
			wireLengthRow = wireWidth * 1e-9 * heightInFeatureSize1T1R;
			wireLengthCol = wireWidth * 1e-9 * widthInFeatureSize1T1R;
		} else {
			wireLengthRow = wireWidth * 1e-9 * heightInFeatureSizeCrossbar;
			wireLengthCol = wireWidth * 1e-9 * widthInFeatureSizeCrossbar;
		}
	}
	Rho *= (1+0.00451*abs(temp-300));
	if (wireWidth == -1) {
		unitLengthWireResistance = 1.0;	// Use a small number to prevent numerical error for NeuroSim
		wireResistanceRow = 0;
		wireResistanceCol = 0;
	} else {
		unitLengthWireResistance =  Rho / ( wireWidth*1e-9 * wireWidth*1e-9 * AR );
		wireResistanceRow = unitLengthWireResistance * wireLengthRow;
		wireResistanceCol = unitLengthWireResistance * wireLengthCol;
	}
	/***************************************** Initialization of parameters NO need to modify *****************************************/
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.