# **TNPU:** Supporting Trusted Execution with Tree-less Integrity Protection for Neural Processing Unit

<u>Sunho Lee</u>, Jungwoo Kim, Seonjin Na, Jongse Park, and Jaehyuk Huh



## **Vulnerabilities of integrated NPU**

- NPUs are widely deployed as part of a <u>System-on-a-Chip</u> (SoC).



## **Trusted Execution Environment (CPU)**

- Access control
- Counter-based memory protection



## **Trusted Execution Environment (NPU)**

- CPU: On-chip hardware and related software
- TNPU: + NPU-related hardware/software



TNPU therefore needs 1) access control and 2) memory protection for the NPU.



## Validate Access from NPU-MMU

- Access control
  - CPU MMU: Traditional validation table
  - NPU IOMMU



## Validate Access from NPU

- Access control: <u>Extended validation table (EEPCM)</u>
  - CPU MMU: Traditional validation entries
  - NPU IOMMU: Additional validation entries
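The access check above can be sketched as a single lookup table that holds separate entries for CPU MMU and NPU IOMMU requests. This is a minimal illustration only: the class names, the entry fields, and the 4 KiB page size are assumptions made for the example, not the actual EEPCM layout.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for this sketch


@dataclass(frozen=True)
class ValidationEntry:
    """One entry of the hypothetical extended validation table."""
    enclave_id: int  # enclave that owns the physical page
    virt_page: int   # virtual page the enclave is expected to map here
    requester: str   # "cpu" (MMU) or "npu" (IOMMU)


class ValidationTable:
    def __init__(self):
        # (physical page number, requester) -> ValidationEntry
        self._entries = {}

    def register(self, phys_page, entry):
        self._entries[(phys_page, entry.requester)] = entry

    def validate(self, requester, enclave_id, virt_addr, phys_addr):
        """Return True only if the translation matches a registered entry."""
        entry = self._entries.get((phys_addr // PAGE_SIZE, requester))
        return (entry is not None
                and entry.enclave_id == enclave_id
                and entry.virt_page == virt_addr // PAGE_SIZE)


table = ValidationTable()
# Traditional entry, checked on CPU MMU translations.
table.register(0x10, ValidationEntry(enclave_id=1, virt_page=0x80, requester="cpu"))
# Additional entry, checked on NPU IOMMU translations.
table.register(0x11, ValidationEntry(enclave_id=1, virt_page=0x81, requester="npu"))

cpu_ok = table.validate("cpu", 1, 0x80 * PAGE_SIZE, 0x10 * PAGE_SIZE)
npu_ok = table.validate("npu", 1, 0x81 * PAGE_SIZE + 64, 0x11 * PAGE_SIZE + 64)
npu_wrong_enclave = table.validate("npu", 2, 0x81 * PAGE_SIZE, 0x11 * PAGE_SIZE)
npu_on_cpu_page = table.validate("npu", 1, 0x80 * PAGE_SIZE, 0x10 * PAGE_SIZE)
```

In this sketch an NPU request is accepted only when an NPU-specific entry exists for the page, so a page registered for CPU access alone stays unreachable from the NPU.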



## **Naive Memory Protection to NPU**

- Memory protection
  - Counter-based encryption & integrity protection
  - Counter Freshness Validation
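To make the three ingredients concrete, here is a minimal software sketch of counter-based protection: a per-cacheline counter feeds both the encryption pad and the MAC, and a hash tree over the counters provides freshness. Everything here is an illustrative assumption: the key, the SHA-512 pad standing in for AES counter mode, and the names `ProtectedMemory` and `tree_root` are not taken from any real hardware design.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-real-use"  # hypothetical on-chip secret
LINE = 64                           # cacheline size in bytes


def _u64(value):
    return value.to_bytes(8, "little")


def _pad(index, counter):
    # Stand-in for an AES-CTR one-time pad; SHA-512 yields exactly 64 bytes.
    return hashlib.sha512(KEY + _u64(index) + _u64(counter)).digest()


def _mac(index, counter, ciphertext):
    msg = _u64(index) + _u64(counter) + ciphertext
    return hmac.new(KEY, msg, hashlib.sha256).digest()


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def tree_root(counters):
    """Root of a binary hash tree over all counters."""
    level = [hashlib.sha256(_u64(c)).digest() for c in counters]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


class ProtectedMemory:
    def __init__(self, num_lines):
        self.counters = [0] * num_lines       # off-chip, untrusted
        self.lines = {}                       # off-chip: index -> (ciphertext, mac)
        self.root = tree_root(self.counters)  # on-chip, trusted

    def _check_counters(self):
        # Counter freshness validation against the trusted root.
        if tree_root(self.counters) != self.root:
            raise ValueError("stale or tampered counter")

    def write(self, index, plaintext):
        if len(plaintext) != LINE:
            raise ValueError("expected one full cacheline")
        self._check_counters()
        self.counters[index] += 1
        counter = self.counters[index]
        ciphertext = _xor(plaintext, _pad(index, counter))
        self.lines[index] = (ciphertext, _mac(index, counter, ciphertext))
        self.root = tree_root(self.counters)

    def read(self, index):
        self._check_counters()
        counter = self.counters[index]
        ciphertext, mac = self.lines[index]
        if not hmac.compare_digest(mac, _mac(index, counter, ciphertext)):
            raise ValueError("integrity check failed")
        return _xor(ciphertext, _pad(index, counter))


mem = ProtectedMemory(num_lines=4)
mem.write(2, b"A" * LINE)
snapshot = (mem.counters[2], mem.lines[2])  # attacker records old off-chip state
mem.write(2, b"B" * LINE)
fresh = mem.read(2)

mem.counters[2], mem.lines[2] = snapshot    # replay the old counter, data, and MAC
try:
    mem.read(2)
    replay_detected = False
except ValueError:
    replay_detected = True
```

The replayed line carries a valid MAC for its old counter, so only the tree check catches it. Real hardware walks a single tree path per access and caches counters on chip rather than recomputing the whole tree as this sketch does.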



## **Naive Memory Protection to NPU**

- Average <u>19.2%</u> performance degradation
- Reason: Counter-cache miss rate (7.9%)



*Figure: normalized execution time and counter cache miss rate.*

A novel <u>memory protection technique</u> for NPU is necessary!



## **NPU Execution Model**

- Execution: *mvin*\* $\rightarrow$ *preload* $\rightarrow$ *compute* $\rightarrow$ *mvout*\*\*
  - Software controls NPU data movement through explicit commands
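The command sequence above can be illustrated with a toy software model. The `ToyNPU` class, the dictionary-based DRAM and scratchpad, and the weight-stationary `compute` are assumptions made for this sketch; they only mirror the order of the four commands, not any real NPU instruction set.

```python
def matmul(a, b):
    """Plain-Python matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class ToyNPU:
    def __init__(self, dram):
        self.dram = dram     # off-chip memory: name -> matrix
        self.spm = {}        # on-chip scratchpad memory: name -> matrix
        self.weights = None  # operand currently held in the compute array

    def mvin(self, name):
        # Move a tensor from DRAM into the scratchpad.
        self.spm[name] = [row[:] for row in self.dram[name]]

    def preload(self, name):
        # Load weights from the scratchpad into the compute array.
        self.weights = self.spm[name]

    def compute(self, in_name, out_name):
        # Multiply a scratchpad input by the preloaded weights.
        self.spm[out_name] = matmul(self.spm[in_name], self.weights)

    def mvout(self, name):
        # Move a result from the scratchpad back to DRAM.
        self.dram[name] = [row[:] for row in self.spm[name]]

    def run(self, program):
        for command, *args in program:
            getattr(self, command)(*args)


dram = {"X": [[1, 2], [3, 4]], "W": [[2, 0], [0, 2]]}
npu = ToyNPU(dram)
# Software issues the commands, so it knows exactly which data moves and when.
npu.run([
    ("mvin", "W"),
    ("mvin", "X"),
    ("preload", "W"),
    ("compute", "X", "Y"),
    ("mvout", "Y"),
])
result = dram["Y"]
```

Because every transfer between DRAM and the scratchpad is an explicit command, the software has full knowledge of the data movement, which is what makes software-managed bookkeeping feasible.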



\*mvin: move-in, \*\*mvout: move-out, \*\*\*SPM: Scratchpad Memory

## **Tensor-based Computing**

- Tensor-granular computation
  - A per-tensor version number is sufficient because memory is accessed in tensor units



## **Tree-less Integrity Protection**

- Counter → Version number controlled by <u>software</u>
  - Security granularity: Cacheline → <u>Tensor</u>
  - Storage requirement: Only <u>0.14KB</u> on average
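A rough sketch of the idea: the per-cacheline counters and their tree are replaced by one software-managed version number per tensor, which is mixed into the MAC of every block of that tensor. The key, the `TensorStore` class, and the choice to keep versions in a trusted dictionary are assumptions for illustration; encryption is omitted to keep the focus on integrity and freshness.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-real-use"  # hypothetical on-chip secret
BLOCK = 64                          # MACs are still computed per 64-byte block


def block_macs(base_addr, version, data):
    """One MAC per block, each bound to the block address and tensor version."""
    macs = []
    for offset in range(0, len(data), BLOCK):
        msg = ((base_addr + offset).to_bytes(8, "little")
               + version.to_bytes(8, "little")
               + data[offset:offset + BLOCK])
        macs.append(hmac.new(KEY, msg, hashlib.sha256).digest())
    return macs


class TensorStore:
    def __init__(self):
        self.versions = {}  # trusted, software-managed: tensor name -> version
        self.dram = {}      # untrusted: tensor name -> (base_addr, data, macs)

    def mvout(self, name, base_addr, data):
        # Writing a tensor bumps its single version number.
        version = self.versions.get(name, 0) + 1
        self.versions[name] = version
        self.dram[name] = (base_addr, data, block_macs(base_addr, version, data))

    def mvin(self, name):
        # Reading recomputes the MACs with the expected version; no tree walk.
        base_addr, data, macs = self.dram[name]
        expected = block_macs(base_addr, self.versions[name], data)
        if len(macs) != len(expected) or not all(
                hmac.compare_digest(a, b) for a, b in zip(macs, expected)):
            raise ValueError("integrity or freshness check failed")
        return data


store = TensorStore()
store.mvout("act0", 0x1000, b"\x01" * 256)
snapshot = store.dram["act0"]        # attacker records the old tensor and MACs
store.mvout("act0", 0x1000, b"\x02" * 256)
fresh = store.mvin("act0")

store.dram["act0"] = snapshot        # replay the old tensor
try:
    store.mvin("act0")
    replay_detected = False
except ValueError:
    replay_detected = True
```

The trusted state in this sketch is one integer per live tensor, in contrast to one counter per cacheline plus a tree, which is in line with the small storage requirement stated above.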



Problem: Can the NPU execute a layer operation all at once? (i.e., many large tensors do not fit into the SPM)

## **Challenge: Intra-layer Computing**

- A tensor is split into one or multiple tiles for intra-layer computing
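As a small worked example of why tiling is needed, the following sketch splits a tensor into row bands that each fit a given scratchpad budget. The function name, the row-band tiling scheme, and the 128 KiB budget are assumptions chosen for illustration; real NPU compilers use more elaborate tiling.

```python
def split_into_tiles(rows, cols, elem_bytes, spm_budget_bytes):
    """Return (start_row, end_row) bands that each fit within the budget."""
    row_bytes = cols * elem_bytes
    rows_per_tile = spm_budget_bytes // row_bytes
    if rows_per_tile == 0:
        raise ValueError("a single row does not fit into the budget")
    return [(start, min(start + rows_per_tile, rows))
            for start in range(0, rows, rows_per_tile)]


# A 1024 x 512 Float16 tensor occupies 1 MiB, larger than the assumed budget.
tiles = split_into_tiles(rows=1024, cols=512, elem_bytes=2,
                         spm_budget_bytes=128 * 1024)
num_tiles = len(tiles)
```

Each band is moved in, processed, and moved out separately, so different parts of one tensor can be at different update stages within a single layer.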





A tile-granular version number is necessary for intra-layer computing!



## **Tile-granular Version Number**

- A tensor is split into one or multiple tiles for intra-layer computing



## **Tensor/Tile Version Number**

- Tensor/Tile version number
  - Granularity: Cacheline → Tensor/<u>Tile (Intra-layer)</u>
  - Storage requirement: Only <u>1.3KB</u> on average
- **expand**, **merge**: Granularity translation operation
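The two granularity translation operations can be sketched as follows. The semantics shown are assumptions for illustration: here `expand` copies the tensor version to every tile, and `merge` is allowed only once all tiles carry the same version. The class and method names other than `expand` and `merge` are hypothetical.

```python
class VersionTable:
    """Software-managed version numbers at tensor or tile granularity."""

    def __init__(self):
        self.tensor = {}  # tensor name -> single version number
        self.tiles = {}   # tensor name -> per-tile versions (while expanded)

    def expand(self, name, num_tiles):
        # Tensor granularity -> tile granularity, entering intra-layer computing.
        version = self.tensor.pop(name)
        self.tiles[name] = [version] * num_tiles

    def bump_tile(self, name, tile):
        # A tile was written back to memory, so its version advances.
        self.tiles[name][tile] += 1

    def merge(self, name):
        # Tile granularity -> tensor granularity, once the layer is finished.
        versions = self.tiles[name]
        if len(set(versions)) != 1:
            raise ValueError("tiles are at different versions")
        self.tensor[name] = versions[0]
        del self.tiles[name]


table = VersionTable()
table.tensor["conv1_out"] = 3
table.expand("conv1_out", num_tiles=4)

table.bump_tile("conv1_out", 0)
table.bump_tile("conv1_out", 1)
try:
    table.merge("conv1_out")      # only half of the tiles have been updated
    partial_merge_rejected = False
except ValueError:
    partial_merge_rejected = True

table.bump_tile("conv1_out", 2)
table.bump_tile("conv1_out", 3)
table.merge("conv1_out")
merged_version = table.tensor["conv1_out"]
```

Per-tile entries exist only while a tensor is being processed inside a layer, which keeps the table small outside of intra-layer computing.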



## **Evaluation Environment**

- Cycle-level simulation based on a modified SCALE-Sim\*
- Two edge-level system-on-a-chip configurations
  - Samsung Exynos 990 (Small NPU), ARM Ethos N77 (Large NPU)
- Workloads: 14 models from MLPerf and DeepBench

|           | Small NPU<br>(Samsung Exynos 990)   | Large NPU<br>(ARM Ethos N77)     |
|-----------|-------------------------------------|----------------------------------|
| PE        | 32 x 32                             | 45 x 45                          |
| Bandwidth | 11 GB/s (4 channels)                | 22 GB/s (4 channels)             |
| Frequency | 2.75 GHz<br>(both processor/memory) | 1 GHz<br>(both processor/memory) |
| SPM       | 480KB in total                      | 1MB in total                     |
| Precision | Float16                             | Float16                          |

\* A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim (ISPASS 2020)

## **Evaluation Result (Single NPU)**

- Performance improvement: <u>8.75%</u>
  - Data traffic reduction: <u>7.67%</u>
- Remaining performance degradation: <u>8.80%</u> (compared to the unsecure system)
  - Source: stored hash values (message authentication codes; MACs)



## **Evaluation Result (Multiple NPUs)**

- Scalability: Slope (TNPU) < Slope (Baseline)
- Performance improvement: $8.75\% \rightarrow 11\%$ (single NPU $\rightarrow$ 3 NPUs)



## Summary

- Challenge
  - Counter tree overhead
- Idea
  - Counter → Tensor/tile-granular version number
- Result
  - Trusted execution environment for NPU
  - Performance improvement: <u>8.75%</u> (single), <u>11%</u> (3-NPU)
- Further work
  - Stored-hash-value (MAC) optimization

## Thank you