# Supporting Secure Multi-GPU Computing with Dynamic and Batched Metadata Management

**Seonjin Na<sup>1</sup>**, Jungwoo Kim<sup>2</sup>, Sunho Lee<sup>2</sup>, Jaehyuk Huh<sup>2</sup> <sup>1</sup>Georgia Institute of Technology, <sup>2</sup>KAIST





## Importance of Multi-GPU Computing





## Secure GPU Computing Efforts in Academia and Industry



# Lack of data protection mechanism optimized for multi-GPU systems



## Our Goal: Efficient Data Protection for Multi-GPU System



## Contents

Introduction

Background and Motivation

Key insights and Main Idea

Evaluation

### Background: Protecting Transferred Data through Interconnect

#### **Confidentiality & Integrity**

Authenticated en/decryption

#### **Freshness**

Replay attack protection



### Background: Protecting Transferred Data through Interconnect

#### **Confidentiality & Integrity**

Authenticated en/decryption

#### **Freshness**

Replay attack protection



## Authenticated En/Decryption with Pre-Computation [1]



#### Store pre-generated Enc.pad/ Auth. pad





Stored in on-chip area

# Prior Pad Table Management (Private) [1]

Maintains same # of pad entries for all commu. pairs in a system



## Performance Impact of # of Pad Table Entries (Private) [1]



## Performance Breakdown Analysis

- Secure multi-GPU incurs average 19.5% performance degradation
  - Auth. en/decryption: 8.2% slowdown, Metadata traffic: 11.3% slowdown

+8.2%

+ 11.3%

# Performance bottlenecks of secure communication 1. Authenticated en/decryption

2. Additional security metadata traffic



## Contents

Introduction

Background and Motivation

Key insights and Main Idea

Evaluation

#### Key Insight 1: Dynamic Behavior of Communication Patterns



#### Key Insight 2: Burstiness of Communication in Multi-GPU System

Analyze distribution of cycles for gathering 16 transmitted data blocks
 Cycle distribution

## **Our Key Observations**

- 1. Dynamic behavior of communication patterns
  - 2. Burstiness of communication



#### Main Idea of This Work

Challenge 1: authenticated en/decryption overhead

Dynamic pad table management



Increase opportunity to hide authenticated en/decryption latency

Challenge 2: additional bandwidth by security metadata

Batched MAC generation& verification



Reduce security metadata traffic

# Dynamic Pad Table Management

on GPU1

Dynamically adjust pad table entries based on communication pattern



### Batched MAC Generation & Verification

Generate coarse-grained MAC to reduce metadata bandwidth





Traffic 64 Data + 64 Metadata (for decryption) + 1 Batch info + 1 MAC & ACK

#### Contents

Introduction

Background and Motivation

Key insights and Main Idea

Evaluation

## **Evaluation Methodology**

- Simulator: MGPUSim [ISCA '19]
- Workloads: 17 apps from various benchmark suites
  - AMD APP SDK, DNN Mark, Hetero Mark, Polybench, SHOC benchmark suites
- System configuration: Models 4 GPU system (AMD R9Nano GPU)

| GPU Configuration                                   |                            |
|-----------------------------------------------------|----------------------------|
| Compute Unit                                        | 64 CUs per GPU, 1.0 GHz    |
| L1 Inst / Vector / Scalar Caches<br>Shared L2 Cache | 16 KB / 32KB / 16KB<br>2MB |
| DRAM                                                | 4GB HBM Memory, 512 GB/s   |
| CPU-GPU, GPU-GPU Interconnect                       | 32 GB/s, 50 GB/s           |
| Security Configuration                              |                            |
| Authenticated encryption/decryption                 | 40 cycles [1,2]            |

## Performance Comparison Result

- Compared with two different mechanisms
  - **Private**<sup>[1]</sup>: Uses same number of pad entries for all pairs
  - Cached<sup>[1]</sup>: Allocates pad table entries like LRU cache

All configs have the same total # of pad table entries







## More Results in the Paper

- Scalability study to the number of GPUs
- Sensitivity study to authenticated encryption/decryption latency
- Hardware overhead of our design

## Summary

#### Problem

Secure communication degrades multi-GPU system performance

#### Key Idea

- Dynamic pad table management exploit dynamic communication patterns
- Batched MAC generation leverage burstiness nature of communication

#### Evaluation results

Reduces perf. overhead by 11.6%, 8.4% compared to Private, Cached

# Backup Slides

# Performance Comparison Result

- Compared with two different mechanisms
  - **Private**: Uses fixed number of pad entries
  - Cached: Manages pad table entries like LRU cache





