We all know it’s important to use GPU resources efficiently, especially during inference. One easy and highly effective way to do this is to reorder some of your inference logic to exploit PyTorch’s asynchronous GPU operations. The payoff is largest when data I/O represents a significant portion of your inference time. Let’s take a look at a pseudocode example: first a version that makes no attempt to exploit asynchronous GPU operations, then a modified version that does.

model = init_pytorch_model().cuda()
while True:
    # Reading in the next batch takes a significant
    # amount of I/O time. During this time the GPU
    # is idle.
    batch = get_next_batch().cuda()
    result = model(batch)

    # Pulling the result onto the CPU forces all
    # GPU computation to complete, so we are not
    # benefiting from async operations.
    output_result(result.cpu())
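The overlap we're about to exploit exists because PyTorch enqueues CUDA kernels asynchronously: a call like model(batch) returns to Python almost immediately, and the host only blocks when it actually needs the values, for example when calling .cpu(). Here is a minimal sketch that makes this visible (the matrix size is arbitrary, just large enough that the kernel takes a measurable amount of time, and it assumes a CUDA-capable machine):

import time
import torch

assert torch.cuda.is_available()

x = torch.randn(8192, 8192, device="cuda")

# Warm up so we aren't timing one-off CUDA initialization.
torch.mm(x, x)
torch.cuda.synchronize()

# The matmul is only enqueued here; the call returns almost
# immediately even though the GPU is still working.
start = time.perf_counter()
y = torch.mm(x, x)
launch_time = time.perf_counter() - start

# Pulling the result onto the CPU forces the host to wait for
# the GPU to finish, so this measures the real compute time.
start = time.perf_counter()
y_cpu = y.cpu()
sync_time = time.perf_counter() - start

print(f"launch returned after {launch_time * 1000:.2f} ms")
print(f".cpu() blocked for {sync_time * 1000:.2f} ms")

You should see the first number come out as a small fraction of the second, and that slack is exactly what the reordered loop below takes advantage of.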

In the first loop, assume that essentially all of the wall-clock time is spent in get_next_batch() and model(batch). If get_next_batch() takes 1 second and model(batch) takes 2 seconds, each iteration takes 3 seconds and the GPU is busy for only 2 of them, so the best GPU utilization we can achieve is ~66%. Let’s reorganize the code so that we get close to 100%:

model = init_pytorch_model().cuda()
prev_result = None
while True:
    # Reading in the next batch still takes a
    # significant amount of I/O time, but because
    # we haven't pulled prev_result onto the CPU
    # yet, the GPU computation launched on the
    # previous iteration runs concurrently with
    # this I/O (except on the first iteration).
    curr_batch = get_next_batch().cuda()

    # Now that the batch I/O has completed, synchronize
    # and output the previous results.
    if prev_result is not None:
        output_result(prev_result.cpu())

    # Start the GPU computation for curr_batch.
    prev_result = model(curr_batch)

This simple reordering of operations takes us from ~66% GPU utilization to ~100% and increases the number of inferences per second by ~50%: each iteration now takes ~2 seconds instead of ~3.
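To make the pattern concrete, here is a sketch of the reordered loop with real PyTorch calls. DummyModel, load_batch, save_result, and batch_paths are hypothetical stand-ins for your own model and I/O, and torch.inference_mode() is just standard inference hygiene rather than part of the trick; the shape of the loop is what matters.

import torch
import torch.nn as nn

# Hypothetical stand-ins for your real model and I/O.
class DummyModel(nn.Sequential):
    def __init__(self):
        super().__init__(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))

def load_batch(path):
    # Stand-in for slow disk or network I/O that yields a CPU tensor.
    return torch.randn(256, 1024)

def save_result(result_cpu):
    # Stand-in for writing results back out.
    pass

model = DummyModel().cuda().eval()
batch_paths = [f"batch_{i}.pt" for i in range(100)]  # hypothetical work list

prev_result = None
with torch.inference_mode():
    for path in batch_paths:
        # The slow CPU-side load overlaps with the GPU work we
        # launched on the previous iteration.
        curr_batch = load_batch(path).cuda()

        # .cpu() is where we wait for the previous model() call to
        # finish and copy its output back to the host.
        if prev_result is not None:
            save_result(prev_result.cpu())

        # Enqueue the GPU work for the current batch; this call
        # returns without waiting for the result.
        prev_result = model(curr_batch)

    # With a finite stream of batches, remember to flush the last result.
    if prev_result is not None:
        save_result(prev_result.cpu())

Watching utilization in nvidia-smi while switching between the two loop structures is a quick way to confirm the overlap is actually happening.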