Edge AI TensorRT Jetson CUDA

YOLOv8 on Jetson Orin Nano: From PyTorch to TensorRT INT8

Hokwang Choi · April 2026 · 18 min read

TL;DR

Hardware: Jetson Orin Nano 8GB (MAXN_SUPER) + Samsung PRO Plus 256GB microSD
Model: YOLOv8n, 640x640 input, batch size 1
Measurement: GPU-only latency via CUDA events and trtexec
Results: TensorRT INT8 achieves 3.49ms / 286 FPS (6x faster than PyTorch)
Code: github.com/hokwangchoi/jetson-orin-nano-benchmarks

Hardware Setup

Jetson Orin Nano Developer Kit
Jetson Orin Nano 8GB Developer Kit

The Jetson Orin Nano is NVIDIA's entry-level edge AI platform. I'm using the 8GB developer kit with external storage for model files and faster I/O.

DeviceJetson Orin Nano 8GB Developer Kit
GPU1024 CUDA cores, 32 Tensor Cores (Ampere)
Memory8GB LPDDR5 unified (CPU + GPU shared)
StorageSamsung PRO Plus 256GB microSD (160MB/s read)
L4T Version36.4.7
JetPack6.2
TensorRT10.3
CUDA12.6

Power Configuration

The Orin Nano supports multiple power modes. I use MAXN SUPER for benchmarking -- this unlocks maximum GPU and memory clocks with no power cap.

# Check available power modes
nvpmodel -p

# Set MAXN SUPER mode (mode ID varies by JetPack version)
sudo nvpmodel -m 2  # MAXN_SUPER on JetPack 6.2

# Lock clocks to maximum frequency
sudo jetson_clocks

# Verify
nvpmodel -q
# NV Power Mode: MAXN_SUPER

In production, you'd choose a power mode based on thermal/power budget. But for benchmarking, MAXN SUPER gives us the hardware's true capability.

What We Measure (And What We Don't)

Inference latency has multiple components. Most benchmarks conflate them -- we don't.

+-------------+--------------------+--------------+-----------+ | | Host Latency (end-to-end) | +-------------+--------------------+--------------+-----------+ | Python | H2D Transfer | GPU Compute | D2H | | Overhead | (Host -> Device) | (Kernels) | Transfer | +-------------+--------------------+--------------+-----------+ | ~0.1-1ms | ~0.5ms | <-- THIS ONE | ~0.2ms | +-------------+--------------------+--------------+-----------+

We report GPU Compute Time only -- the actual kernel execution. This excludes:

Why? Because GPU compute is the optimization target. Transfers are fixed by data size; Python overhead vanishes in C++ deployment. Reporting end-to-end latency from Python would conflate things you can't optimize with things you can.

Measurement Tools

RuntimeToolWhat It Captures
PyTorch CUDA Events GPU kernel execution only
TensorRT trtexec GPU Compute Time GPU kernel execution only
Both Nsight Systems Per-kernel timing, memory ops, Tensor Core usage
Power tegrastats System power draw during inference

CUDA Events for PyTorch

time.perf_counter() measures wall-clock time including CPU overhead. CUDA events record timestamps on the GPU timeline:

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
output = model(input)
end_event.record()

torch.cuda.synchronize()
gpu_time_ms = start_event.elapsed_time(end_event)  # GPU-only

trtexec Output

TensorRT's trtexec reports timing breakdown automatically:

=== Performance summary ===
Throughput: 119.5 qps
Latency: min = 8.21 ms, max = 9.12 ms, mean = 8.42 ms  <- Host latency
GPU Compute Time: min = 8.30 ms, mean = 8.36 ms        <- We report this
H2D Latency: min = 0.35 ms, mean = 0.48 ms
D2H Latency: min = 0.15 ms, mean = 0.22 ms

Results

Runtime Precision GPU Latency FPS TOPS Util% Power
PyTorchFP3220.88 ms480.424.2%11.4 W
TensorRTFP328.36 ms1201.0410.4%14.3 W
TensorRTFP164.43 ms2251.969.8%14.1 W
TensorRTINT83.49 ms2862.496.2%12.5 W

TOPS = Achieved Tera Operations Per Second. Util% = percentage of hardware's theoretical peak (40 TOPS INT8, 20 TFLOPS FP16 on Orin Nano MAXN SUPER).

Figure 1: GPU latency across runtimes and precisions (lower is better)

The Optimization Pipeline

Step 1: PyTorch Baseline

Ultralytics provides a clean API. The model runs on GPU but with no inference optimization:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
results = model.predict("image.jpg")

Step 2: Export to ONNX

ONNX is the intermediate representation. We export to ONNX for two reasons:

yolo export model=yolov8n.pt format=onnx imgsz=640 opset=17

# Inspect with Netron (optional)
# Upload yolov8n.onnx to https://netron.app

Step 3: Build TensorRT Engines

TensorRT compiles the ONNX graph into an optimized engine. This includes:

# FP32 -- baseline TensorRT
trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n_fp32.engine

# FP16 -- uses Tensor Cores
trtexec --onnx=yolov8n.onnx --fp16 --saveEngine=yolov8n_fp16.engine

# INT8 -- maximum compression
trtexec --onnx=yolov8n.onnx --int8 --saveEngine=yolov8n_int8.engine
Note: Engines are hardware-specific. An engine built on Orin Nano won't run on Xavier or desktop GPUs. Rebuild for each target.

Understanding FLOPs, TOPS, and Hardware Utilization

Vendors quote hardware performance in TOPS (Tera Operations Per Second). The Orin Nano claims 40 TOPS INT8 in MAXN SUPER mode. But what does that mean in practice?

FLOPs: Model Complexity

FLOPs (Floating Point Operations) measures the computational work for one inference. It's a property of the model architecture, independent of hardware:

A multiply-accumulate (MAC) counts as 2 FLOPs. Most of these operations come from convolutions: FLOPs = 2 x Cin x Cout x K² x H x W.

TOPS: Achieved Throughput

TOPS measures how many operations per second the hardware actually executes:

TOPS = Model FLOPs / Latency (seconds) / 10¹² Example: YOLOv8n with 5ms latency TOPS = 8.7 x 10⁹ / 0.005 / 10¹² = 1.74 TOPS

Utilization: The Reality Check

Utilization compares achieved TOPS to hardware's theoretical peak:

Utilization = Achieved TOPS / Peak TOPS x 100% Orin Nano MAXN SUPER peaks: - INT8: 40 TOPS - FP16: 20 TFLOPS - FP32: ~10 TFLOPS (estimated)

Why is utilization typically low (5-20%)?

Utilization is a diagnostic tool: low utilization with acceptable latency is fine. Low utilization with poor latency suggests optimization opportunities -- memory access patterns, kernel fusion, or larger batch sizes where applicable.

How Quantization Works

FP16: Tensor Core Acceleration

FP16 halves memory bandwidth and enables Tensor Cores -- specialized matrix multiply units that compute 4x4 FP16 matrix ops in a single cycle. The Orin Nano has 32 Tensor Cores.

For most vision models, FP16 has no accuracy loss. The dynamic range (±65504) is sufficient for normalized image data and typical weight distributions.

INT8: Calibration Required

INT8 reduces precision from 32 bits to 8 bits -- a 4x reduction. This requires mapping the FP32 value range to [-128, 127]:

FP32 range: [-3.2, 4.1] -> scale = 4.1/127 = 0.032 -> INT8 = round(FP32 / scale) Example: 2.5 -> round(2.5 / 0.032) = 78 (INT8)

The scale factors are determined by calibration -- running representative data through the network and measuring activation ranges.

TensorRT's Default INT8

Without a calibration dataset, trtexec --int8 uses min-max calibration:

For production, provide calibration images:

trtexec --onnx=yolov8n.onnx --int8 \
    --calib=calibration_images/ \
    --saveEngine=yolov8n_int8_calibrated.engine

Power Monitoring with tegrastats

The benchmark script samples power during inference using tegrastats:

# Sample output
RAM 3456/7620MB (lfb 1x1MB) SWAP 0/3810MB CPU [25%@1510,22%@1510,...]
GR3D_FREQ 99% VDD_IN 8500mW VDD_CPU_GPU_CV 3200mW VDD_SOC 1800mW

Key fields:

We sample at 100ms intervals during the benchmark window and report mean power. This is approximate -- for precise energy measurement, use a power meter.

Profiling with Nsight Systems

For detailed kernel analysis, we generate Nsight Systems profiles:

nsys profile --trace=cuda,nvtx -o yolov8_int8 \
    trtexec --loadEngine=yolov8n_int8.engine --duration=5

# Open in GUI
nsys-ui yolov8_int8.nsys-rep

Reading the Timeline

The Nsight timeline shows multiple rows of activity:

Nsight Systems timeline showing YOLOv8 INT8 inference
Figure 1: Nsight Systems timeline for YOLOv8n INT8 (MAXN_SUPER) -- click to enlarge
Row What It Shows --------------------------------------------------------- CPU (0-5) CPU core activity (should be minimal) CUDA HW (Orin) Hardware-level kernel/memory activity +- Kernel GPU compute kernels (yellow/orange) +- Memory DMA transfers (magenta) Streams Software command queues +- Stream 19 Main inference (86.0% of work) +- Stream 18 Memory HtoD transfers (7.1%) +- Stream 20 Memory DtoH transfers (6.9%) TensorRT High-level enqueue spans [2.564 ms] CUDA API Driver calls (cudaEventSynchronize, etc.)

CUDA Streams: Software vs Hardware

A common confusion: CUDA Stream ≠ Streaming Multiprocessor (SM).

ConceptTypePurpose
CUDA Stream Software queue Order operations, enable overlap
Streaming Multiprocessor Physical hardware Execute threads (128 CUDA cores each)

TensorRT creates multiple streams to overlap compute and memory transfers:

Stream 19: [Kernel A][Kernel B][Kernel C]------------> (compute) Stream 18: ------[Memcpy HtoD]------[Memcpy HtoD]---> (input upload) Stream 20: ------------[Memcpy DtoH]------[Memcpy DtoH]> (output download) | v All run on same 8 SMs -- GPU scheduler distributes work

Pipeline Overlap: Why It's Fast

The Orin has separate hardware engines for compute and memory:

+---------------------------------------------------------+ | Orin Nano SoC | +------------------------+--------------------------------+ | DMA Engine | GPU Compute (8 SMs) | | (Memory Copy) | (1024 CUDA cores) | +------------------------+--------------------------------+ | Memcpy HtoD ----> | | | Memcpy DtoH <---- | sm80_xmma_fprop (conv) | | | void culn... (layernorm) | | Runs in PARALLEL! | generatedNati... (fused) | +------------------------+--------------------------------+

This enables triple buffering -- three inferences overlap:

Without pipelining: [HtoD][Compute][DtoH] [HtoD][Compute][DtoH] [HtoD][Compute][DtoH] |<------ 7ms ------>| |<------ 7ms ------>| Throughput: 143 FPS With pipelining (what TensorRT does): [HtoD][Compute][DtoH] [HtoD][Compute][DtoH] [HtoD][Compute][DtoH] |<-5ms->|<-5ms->|<-5ms->| Throughput: 200 FPS (compute-bound, not transfer-bound)

Kernel Names Explained

Zooming into the timeline reveals individual CUDA kernels:

Kernel NameWhat It Does
sm80_xmma_fprop_impl... Convolution via Tensor Cores (xmma = matrix multiply-accumulate)
void CUTENSOR_NAMES... cuTENSOR library ops (optimized tensor operations)
generatedNati... TensorRT-generated fused kernel (multiple ops combined)
void culn... Layer normalization
Reformatting Co... Memory layout transform (NHWC <-> NCHW for Tensor Cores)
/model.*/conv... NVTX markers showing which YOLOv8 layer is executing

Timing Breakdown

Different rows show different timing spans:

NVTX span [~4ms] |<------------------------------------------------->| | | | CPU prep | GPU kernels [~3.5ms] | Sync overhead | | [~0.2ms] | | [~0.3ms] | | |<-------------------->| | | | Stream 19 work | | | | | | | TensorRT enqueue [2.564ms] | | | |<---------------------->| | |
MetricValueWhat It Measures
trtexec GPU Compute 3.49 ms CUDA events around kernels (averaged, most accurate)
TensorRT enqueue ~2.5-3.5 ms Single enqueue() call (varies per inference)
NVTX full span ~3.5-4 ms End-to-end including CPU overhead

For benchmarking, use GPU Compute Time -- it's what the GPU actually spent. For real-time latency (robotics, AV), use NVTX span -- it's camera-to-result time.

Jetson Memory Architecture

Unlike desktop GPUs with separate VRAM, Jetson uses unified memory:

Desktop GPU (Discrete): +-------------+ PCIe 16GB/s +-------------+ | CPU + RAM | <---------------> | GPU + VRAM | | (Host) | Slow copy! | (Device) | +-------------+ +-------------+ Jetson Orin Nano (Unified): +-------------------------------------------------+ | Shared LPDDR5 (8GB, 68 GB/s) | +----------------------+--------------------------+ | +----------+----------+ v v +--------+ +--------+ | CPU | | GPU | | A78 x6 | | 8 SMs | +--------+ +--------+ HtoD/DtoH on Jetson = cache flush + address mapping (fast!)

What We Learned

Our Nsight analysis reveals TensorRT is already well-optimized:

The low utilization (4-8%) is expected for YOLOv8n -- it's too small to saturate the GPU. Larger models or batch sizes would show higher utilization.

Further Optimization: CUDA Programming

After TensorRT INT8, what's left? Custom CUDA kernels for operations TensorRT doesn't optimize.

1. Fused Pre/Post-Processing

Image preprocessing (resize, normalize, channel reorder) typically runs on CPU. A custom CUDA kernel can fuse these into a single GPU pass:

__global__ void preprocessKernel(
    const uint8_t* input,   // HWC uint8 [0,255]
    float* output,          // CHW float [0,1] normalized
    int H, int W
) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    // BGR -> RGB, uint8 -> float, normalize
    int idx = y * W + x;
    output[0 * H * W + idx] = input[idx * 3 + 2] / 255.0f;  // R
    output[1 * H * W + idx] = input[idx * 3 + 1] / 255.0f;  // G
    output[2 * H * W + idx] = input[idx * 3 + 0] / 255.0f;  // B
}

2. GPU-Accelerated NMS

Non-Maximum Suppression (NMS) is often a CPU bottleneck. NVIDIA provides efficientNMSPlugin for TensorRT, but custom implementations can further optimize for specific IoU thresholds and box counts.

3. Stream Pipelining

Overlap data transfer with compute using CUDA streams:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Pipeline: while frame N computes, frame N+1 transfers
cudaMemcpyAsync(d_input1, h_frame1, size, cudaMemcpyHostToDevice, stream1);
context->enqueueV3(stream1);  // Inference on frame 1

cudaMemcpyAsync(d_input2, h_frame2, size, cudaMemcpyHostToDevice, stream2);
// Frame 2 transfer overlaps with frame 1 inference

4. Kernel Profiling

Use Nsight Compute for kernel-level optimization:

ncu --set full -o profile trtexec --loadEngine=yolov8n_int8.engine

This shows occupancy, memory throughput, and instruction mix per kernel -- essential for identifying optimization opportunities.

Appendix: Power Mode Comparison

The Orin Nano supports multiple power modes. Here's how 15W mode compares to MAXN_SUPER:

Precision 15W Mode MAXN_SUPER Speedup
TensorRT FP3211.04 ms8.36 ms1.32x
TensorRT FP166.13 ms4.43 ms1.38x
TensorRT INT84.90 ms3.49 ms1.40x

MAXN_SUPER draws ~3-4W more power but delivers 30-40% better performance. For battery-powered or thermally-constrained deployments, 15W mode offers a reasonable tradeoff -- still achieving 204 FPS with INT8.

Conclusion

Reproduce It

git clone https://github.com/hokwangchoi/jetson-orin-nano-benchmarks.git
cd jetson-orin-nano-benchmarks/vision-benchmarks/scripts

# Install dependencies
pip3 install ultralytics numpy

# Run benchmark
python3 benchmark_yolov8.py

Results are saved to assets/results/. Nsight profiles are in assets/results/nsight_profiles/.

What's Next: VLM for Autonomous Driving

Object detection is just one piece of the perception stack. The next benchmark will explore Vision-Language Models (VLMs) on Jetson -- models that can reason about scenes, not just detect objects.

For autonomous driving applications, VLMs enable:

Candidates include Qwen2.5-VL-3B and NVIDIA Cosmos Reason 2B -- both designed for edge deployment. Stay tuned.