Tutorial
C++
1. Install prerequisites.
To run this example app, you need CMake 3.23 or above (if you decide to use CMake) and a C++ compiler that supports C++14 (tested on MSVC v14.43, GCC 9, Clang 11 or above).
# debian-based distros
sudo apt-get install build-essential cmake ninja-build

2. Install Optimium Runtime. Please click here to install the runtime.
3. Add Optimium Runtime as a dependency.
find_package(Optimium-Runtime REQUIRED)
target_link_libraries(MyExecutable PRIVATE Optimium::Runtime)
# Use C++14
set(CMAKE_CXX_STANDARD 14)

pkg-config is supported for non-CMake users.
You can get compiler options via `pkg-config --libs --cflags optimium-runtime`.

Optimium Runtime requires C++14 to compile correctly. For that, use

set(CMAKE_CXX_STANDARD 14)

to set the C++ language version globally, or

set_target_properties(<TARGET> PROPERTIES CXX_STANDARD 14)

to use C++14 only for your CMake target.
IMPORTANT! If you're using Android, please refer to the code below.
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE BOTH)
find_package(Optimium-Runtime REQUIRED)
target_link_libraries(MyExecutable PRIVATE Optimium::Runtime)

In addition, if you're using Android you must add android:extractNativeLibs="true" to your AndroidManifest.xml file.

<application ...
android:extractNativeLibs="true"
...>

4. Initialize the runtime
Before loading a model, you must initialize the runtime, either by calling rt::initialize() explicitly or by declaring an rt::AutoInit variable (both approaches are described below).
You can optionally specify a scheduling policy, and also adjust the verbosity level and output path of the logger.
To configure logging, call the logging::setLogLevel() and/or logging::addLogWriter() functions before initializing the runtime.
#include <Optimium/Runtime.h>
#include <Optimium/Runtime/Logging/LogSettings.h>
#include <Optimium/Runtime/Logging/AndroidLogWriter.h>
#include <Optimium/Runtime/Logging/ConsoleLogWriter.h>
#include <Optimium/Runtime/Logging/FileLogWriter.h>
int main(...) {
// change verbosity to debug level
rt::logging::setLogLevel(rt::LogLevel::Debug);
// add console log writer.
rt::logging::addLogWriter(std::make_unique<rt::logging::ConsoleWriter>());
// add file log writer to "output.log" file.
rt::logging::addLogWriter(std::make_unique<rt::logging::FileWriter>("output.log"));
// add Android Logcat log writer. this logger is only available for Android.
rt::logging::addLogWriter(std::make_unique<rt::logging::AndroidWriter>());
// Explicitly initialize and finalize the runtime.
rt::initialize();
// ... (load model, run inference, etc.)
rt::finalize();
}

Initialization lifecycle
The runtime must be initialized before any model loading or inference, and finalized after all work is done. There are two approaches:
Approach 1: Explicit initialize() / finalize() calls
Call rt::initialize() at the start of your program and rt::finalize() at the end. This gives you full control over the runtime lifecycle.
int main(...) {
rt::initialize();
// ... load models, run inference ...
rt::finalize();
}

Approach 2: rt::AutoInit (RAII)
rt::AutoInit calls rt::initialize() in its constructor and rt::finalize() in its destructor. Because rt::finalize() shuts down the runtime entirely, it is critical that AutoInit outlives all inference operations. If declared as a local variable inside a function, the runtime will be finalized when the variable goes out of scope; any inference running at that point will fail.
For this reason, rt::AutoInit should be declared as a static global variable, or at the very top of main() before any other runtime operations:
// Recommended: static global; runtime lives for the entire process lifetime.
static rt::AutoInit Init;
int main(...) {
// ... load models, run inference ...
// rt::finalize() is called automatically when the process exits.
}

// Also OK: top of main(); runtime lives until main() returns.
int main(...) {
rt::AutoInit Init;
// ... load models, run inference ...
// rt::finalize() is called when Init goes out of scope.
Warning: Do NOT declare rt::AutoInit inside a narrow scope (e.g. inside a loop or a helper function). If the destructor runs while inference requests are still in progress, the runtime will be finalized prematurely, causing undefined behavior or crashes.
Check available devices
It is recommended to check available devices before loading a model.
DeviceNotFoundError is a common error when you load a model without checking whether or not the required device is present.
You can get the list of available devices from the rt::getLocalInfo() function.
int main(...) {
// ...
// Iterate over all devices to check that the required device exists.
rt::HostInfo Local = rt::getLocalInfo();
bool Found = false;
for (rt::DeviceID ID : Local.Devices) {
if (ID.getPlatform() == rt::PlatformKind::Native) {
Found = true;
break;
}
}
if (!Found) {
std::cout << "error: cannot find needed device."
<< std::endl;
}
}

HostInfo contains the following members:
- int ID: Host identifier
- StringRef Name: Host name
- DeviceKind Architecture: Host CPU architecture
- OSKind OS: Operating system (Linux, Android, Windows, MacOS, IOS)
- ArrayRef<DeviceID> Devices: Available devices on this host
DeviceID provides the following methods:
- PlatformKind getPlatform(): Platform kind (Native, XNNPack, CUDA, Vulkan, OpenCL, SNPE, QNN)
- DeviceKind getDeviceKind(): Device kind (x86, x64, ARM, ARM64, RISCV64, NVIDIA, Mali, Adreno, Hexagon, etc.)
- uint32_t getIndex(): Device index among the same kind on the host
- uint32_t getHostID(): Host ID
- HostInfo getHostInfo(): Get host info for this device
- const Capability &getCapability(): Get hardware capabilities (X86Capability, ARMCapability, etc.)
- std::string toString(): String representation
- static DeviceID from(PlatformKind, DeviceKind, uint8_t Index, uint8_t Host): Create a DeviceID
5. Load a model
Model represents a loaded ML model. You can load a model via the rt::loadModel() function.
By default, Optimium Runtime automatically selects the devices described in the model.
To run the model on a specific device, specify it manually.
You can configure various options through the ModelLoadOptions struct.
Unlike other AI inference engines such as TensorFlow Lite, an Optimium model is a folder rather than a single file. Because the folder itself is the model, pass the path to that folder when loading it with Optimium Runtime, and always copy the model together with its entire folder.
int main(...) {
// ...
// load a model with default options (auto-detected devices).
rt::Model Model = rt::loadModel("path/to/model");
// load a model with manual configurations.
rt::ModelLoadOptions Options;
// set devices for running the model.
Options.Devices = { ... };
// if true, fail when the exact device is not available (no fallback to localhost).
Options.Strict = false;
// enable memory optimization (share buffers between non-overlapping tensors).
Options.EnableMemoryOptimization = true;
// enable runtime checks for debugging. (NOT YET IMPLEMENTED)
Options.EnableRuntimeChecks = false;
// treat denormal floats as zero.
Options.DisableDenormals = true;
// path to save intermediate tensors for debugging.
Options.IntermediateSavePath = "";
// passphrase for encrypted models.
Options.Passphrase = "";
// set the number of threads to be used for running the model.
Options.ThreadCount = 4;
// set which cores the threads used for inference will be assigned to.
Options.Cores = {0, 1, 2, 3};
// set the scheduler type.
// Simple: sequential execution (default)
// Exclusive: first-come first-served with resource queuing
// Pipeline: pipelined multi-stage execution across devices
Options.Scheduler = rt::SchedulerType::Simple;
rt::Model Model = rt::loadModel("path/to/model", Options);
// ..
}

6. Listing model information
You can find information about the model using some informative functions.
To get information about the tensor, use Model.getTensorInfo() function with its name. You can also access by index using Model.getInputTensorInfo(index) and Model.getOutputTensorInfo(index).
Also, you can get the list of input or output tensor names via Model.getInputNames() for inputs and Model.getOutputNames() for outputs.
Additionally, Model.getName() returns the model's name, Model.getTensors() returns all tensor infos, and Model.getOperations() returns a list of OpInfo (containing the operation name and device).
int main(...) {
// ...
// print model name
std::cout << "model name: " << Model.getName() << std::endl;
// print list of input tensor info
std::cout << "input tensors" << std::endl;
for (rt::StringRef Name : Model.getInputNames())
std::cout << Model.getTensorInfo(Name) << std::endl;
// print list of output tensor info
std::cout << "output tensors" << std::endl;
for (rt::StringRef Name : Model.getOutputNames())
std::cout << Model.getTensorInfo(Name) << std::endl;
// access by index
const rt::TensorInfo &FirstInput = Model.getInputTensorInfo(0);
const rt::TensorInfo &FirstOutput = Model.getOutputTensorInfo(0);
// list operations
for (const auto &Op : Model.getOperations())
std::cout << "op: " << Op.Name << " on device: " << Op.Device << std::endl;
}

TensorInfo, the return value of the functions above, represents information about each tensor. It contains the tensor's name, type, shape, alignment, padding, and quantization scheme (if quantized). It also has OptOut and Constant flags.
You can access member variables to get details of the tensor.
int main(...) {
// ...
const rt::TensorInfo &Info = Model.getTensorInfo("input_0");
std::cout << "name of tensor: "
<< Info.Name << std::endl;
std::cout << "shape of tensor: "
<< Info.Shape << std::endl;
std::cout << "alignment of tensor: "
<< Info.Alignment << std::endl;
std::cout << "type of tensor: "
<< Info.Type << std::endl;
std::cout << "padding of tensor: "
<< Info.Padding << std::endl;
std::cout << "tensor size in bytes: "
<< Info.getTensorSize() << std::endl;
if (Info.Scheme)
std::cout << "scheme of tensor: "
<< *(Info.Scheme) << std::endl;
// ...
}

Dynamic Shape Models
Models with dynamic (symbolic) shapes support shape inference. You can compute the output shape from input shapes using Model.inferShape():
int main(...) {
// ...
// infer output shape from input shapes (by-name)
std::map<std::string, rt::TensorShape> InputShapes;
InputShapes["input_0"] = rt::TensorShape({1, 3, 224, 224});
rt::TensorShape OutputShape = Model.inferShape("output_0", InputShapes);
// infer output shape from input shapes (by-index)
std::vector<rt::TensorShape> InputShapeList = { rt::TensorShape({1, 3, 224, 224}) };
rt::TensorShape OutputShape2 = Model.inferShape("output_0", rt::make_array(InputShapeList));
// ...
}

7. Creating a request
InferRequest represents a single inference that the model runs. Users can create multiple InferRequests and execute the same model without interfering with other requests.
Additionally, users can achieve target throughput by queueing multiple requests efficiently.
A request is created by calling the Model.createRequest() function.
int main(...) {
// ...
rt::InferRequest Request = Model.createRequest();
// ...
}

8. Prepare inputs and outputs
Before running the inference, you should prepare input and output tensors for the request.
To create a tensor, use the rt::tensor() function.
int main(...) {
// ...
// Create float32 tensor shaped 32x32.
rt::TypedTensor<float> f32_tensor = rt::tensor<float>({32, 32});
// Create generic tensor shaped 8.
rt::Tensor i16_tensor = rt::tensor(rt::ElementType::I16, {8});
// Create float16 tensor shaped 32x32 with user-provided buffer.
rt::float16* Data = new rt::float16[32 * 32];
rt::TypedTensor<rt::float16> f16_tensor = rt::tensor<rt::float16>({32, 32}, Data);
}

When you create a tensor with a user-provided buffer, extra care is required.
Optimium Runtime does not take ownership of the provided buffer, so the buffer must not be freed before the tensor is destroyed.
The runtime also assumes the buffer you provide is valid; passing an invalid buffer (e.g. a buffer smaller than expected, or an invalid pointer) can cause severe errors.
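The minimum valid buffer size follows directly from the shape and the element size. A language-neutral sketch of that sizing rule in Python (the element sizes used here are illustrative assumptions, not Optimium API):

```python
def required_bytes(shape, elem_size):
    """Minimum buffer size in bytes for a dense tensor:
    product of all dimensions times the element size."""
    total = elem_size
    for dim in shape:
        total *= dim
    return total

# A 32x32 float16 tensor (2 bytes per element) needs at least 2048 bytes.
print(required_bytes((32, 32), 2))  # 2048
```

Checking this figure against the buffer you allocate is a cheap way to avoid the undersized-buffer errors described above.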
After creating a tensor, you can access tensor memory using Tensor.data() function. Putting data in the tensor can be done by trivial functions like memcpy or std::copy.
If you want to fill tensor with a scalar value, use TypedTensor.fill() function.
You can also save and load tensors to/from files using Tensor.save() and Tensor.load().
int main(...) {
// ...
// Example for load data from existing data.
// Assume variable 'Data' contains data.
std::vector<float> Data;
rt::TypedTensor<float> Tensor = rt::tensor<float>({32, 32});
std::copy(Data.begin(), Data.end(), Tensor.data());
// Example for load data from file.
// <fstream> is required.
std::ifstream File("path/to/file", std::ios::in | std::ios::binary);
File.read(reinterpret_cast<char *>(Tensor.data()), Tensor.getTensorSize());
// Fill scalar value. 'Tensor' must be 'TypedTensor' type.
Tensor.fill(1.0f);
// Save tensor to file.
Tensor.save("tensor.bin");
// Load tensor from file.
Tensor.load("tensor.bin");
// Release tensor data explicitly.
Tensor.release();
// ...
}

Optimium Runtime provides types that are not natively supported in C++; those types are defined in headers located under Optimium/Runtime/Types.
Tensor element types and their wrapped C++ types are listed below.
| Element Type | C++ Type |
|---|---|
| ElementType::F16 | rt::float16 |
| ElementType::BF16 | rt::bfloat16 |
| ElementType::TF32 | rt::tfloat32 |
| ElementType::QS8 | rt::qs8 |
| ElementType::QU8 | rt::qu8 |
| ElementType::QS16 | rt::qs16 |
| ElementType::QU16 | rt::qu16 |
| ElementType::QS32 | rt::qs32 |
Additional element types available:
I8, U8, I16, U16, I32, U32, I64, U64, F16, F32, F64, Bool, String
When checking a tensor's data type, Optimium Runtime only recognizes these C++ types for the corresponding element types. Other data types are not recognized and result in a compilation error.
9. Running an inference
Running an inference involves two functions: infer and wait. Request.infer() asks the runtime to start the inference and returns immediately; Request.wait() waits for the previously requested inference to finish (whether it succeeds or fails).
Request.infer() requires two arguments, inputs and outputs, and has two variants, one accepts rt::ArrayRef (reference type to array-like value) and the other accepts std::map.
The variant that accepts rt::ArrayRef must receive all tensors the model requires, in the same order as the model's inputs or outputs.
int main(...) {
// ...
// running inference with input and output list
std::vector<rt::Tensor> Inputs;
std::vector<rt::Tensor> Outputs;
// Create empty tensors.
for (rt::StringRef Name : Model.getInputNames()) {
const rt::TensorInfo& Info = Model.getTensorInfo(Name);
Inputs.push_back(rt::tensor(Info.Type, Info.Shape));
}
for (rt::StringRef Name : Model.getOutputNames()) {
const rt::TensorInfo& Info = Model.getTensorInfo(Name);
Outputs.push_back(rt::tensor(Info.Type, Info.Shape));
}
// rt::make_array is helper function that creates rt::ArrayRef.
Request.infer(rt::make_array(Inputs), rt::make_array(Outputs));
Request.wait();
// ...
}

The variant that accepts std::map must receive all tensors the model requires, with keys matching the model's input or output tensor names.
int main(...) {
// ...
// running inference with input and output map
std::map<std::string, rt::Tensor> Inputs;
std::map<std::string, rt::Tensor> Outputs;
// Create empty tensors
for (rt::StringRef Name : Model.getInputNames()) {
const rt::TensorInfo& Info = Model.getTensorInfo(Name);
Inputs[Name] = rt::tensor(Info.Type, Info.Shape);
}
for (rt::StringRef Name : Model.getOutputNames()) {
const rt::TensorInfo& Info = Model.getTensorInfo(Name);
Outputs[Name] = rt::tensor(Info.Type, Info.Shape);
}
Request.infer(Inputs, Outputs);
Request.wait();
// ...
}

Note that you do not need to create tensors for every inference. Tensors can be reused between requests and/or models as long as they are not used simultaneously. (e.g. You cannot use request A's output tensor X as request B's input while request A is running; it is fine to use tensor X after request A has finished.)
The Request.wait() function has an optional timeout argument specifying how long to wait. It returns false if the inference finished before the timeout was reached, and true if it did not. If called with no argument (or zero), it waits indefinitely.
Note that Request.wait() should always be called to check for errors that occurred during inference. Starting a new inference on a request that is in a fault state may cause undefined behavior.
To check the state of the request, use Request.getStatus() function. The possible states are InferStatus::Ready, InferStatus::Running, and InferStatus::Fault.
int main(...) {
// ...
// waits until inference is finished.
Request.infer(...);
Request.wait();
// waits 500 milliseconds to finish inference
using namespace std::chrono_literals;
Request.infer(...);
if (Request.wait(500ms))
std::cout << "inference not finished after 500ms"
<< std::endl;
else
std::cout << "inference was finished within 500ms"
<< std::endl;
// check the state of the request
std::cout << "status of request: "
<< Request.getStatus()
<< std::endl;
}

Callbacks
You can register a callback to be called when the inference finishes:
int main(...) {
// ...
Request.addCallback([](rt::InferStatus Status, std::exception_ptr Err) {
if (Status == rt::InferStatus::Fault) {
try { std::rethrow_exception(Err); }
catch (const std::exception &E) {
std::cerr << "Inference error: " << E.what() << std::endl;
}
}
});
Request.infer(...);
Request.wait();
}

Cancellation
You can cancel an in-progress inference:
Request.infer(...);
// ... later:
Request.cancel();

10. Profiling
InferRequest supports profiling mode for performance measurement:
int main(...) {
// ...
rt::ProfileOptions Options;
Options.Repeat = 100; // number of repetitions
Options.WarmUp = 10; // warm-up count
Options.WarmUpTime = std::chrono::microseconds(1000000); // warm-up duration
Options.StopThreshold = std::chrono::microseconds(0); // stop threshold
Options.CheckPeriod = 0; // check period
Options.EventBufferSize = rt::kDefaultEventBufferSize; // event buffer size (default: 4MB)
// profile() is blocking (unlike infer() which is async)
Request.profile(Inputs, Outputs, Options);
// access profiling events
for (const auto &Event : Request.getProfileEvents()) {
std::cout << "Event: " << rt::toString(Event.Kind)
<< " at " << Event.TimeStamp.time_since_epoch().count()
<< std::endl;
}
// access model metadata from the request
std::cout << "Model: " << Request.getModelName() << std::endl;
for (const auto &Op : Request.getModelOperations())
std::cout << " Op: " << Op.Name << std::endl;
}

You can also use ProfileEventRecorder for custom lock-free event recording:
auto Recorder = std::make_shared<rt::ProfileEventRecorder>(1024 * 1024);
Request.setRecorder(Recorder);
Request.profile(Inputs, Outputs, Options);
// Recorder now contains events with lock-free access

Profile event kinds include: ModelExecuteBegin/End, LayerExecuteBegin/End, LaunchBegin/End, DeviceExecuteBegin/End, CopyBegin/End, QueueBegin/End, WaitBegin/End.
Intermediate Tensor Access
When EnableMemoryOptimization is disabled in ModelLoadOptions, or IntermediateSavePath is set, you can access intermediate tensors for debugging:
rt::ModelLoadOptions Options;
Options.EnableMemoryOptimization = false;
rt::Model Model = rt::loadModel("path/to/model", Options);
auto Request = Model.createRequest();
Request.infer(Inputs, Outputs);
Request.wait();
// access intermediate tensor
rt::Tensor IntermediateTensor = Request.getTensor("intermediate_tensor_name");

11. Error handling
Optimium Runtime uses exception-based error handling. All exceptions inherit from std::runtime_error.
The following exception types are available:
- InvalidArgumentError: Invalid argument passed
- InvalidStateError: Unexpected internal state
- InvalidOperationError: Operation not allowed in current state
- TypeError: Type mismatch
- ShapeError: Shape mismatch or incompatible shapes
- ExtensionError: Extension loading or initialization failure
- DeviceError: Device not found or operation failure
- ModelError: Model loading or compilation failure
- RequestError: Request operation error
- InferError: Inference execution failure
- OutOfResourceError: Resource allocation failure
- ContainerError: Model container is invalid or corrupted
- RemoteError: Remote communication error
- IOError: I/O operation error
- NetworkError: Network communication error
- OSError: Operating system error
- NotImplementedError: Feature not yet implemented
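Because every exception derives from std::runtime_error, a single handler on the base type can catch any runtime error. A stand-in sketch of that pattern in Python (the class names mirror the list above but are hypothetical illustrations, not the actual optimium bindings):

```python
# Stand-in hierarchy: a shared base class lets callers catch every runtime error at once.
class OptimiumError(RuntimeError):
    pass

class DeviceError(OptimiumError):
    pass

class ModelError(OptimiumError):
    pass

def load_model(path):
    # Hypothetical failure path for illustration only.
    raise DeviceError(f"cannot find needed device for {path}")

try:
    load_model("path/to/model")
except OptimiumError as e:
    # Both DeviceError and ModelError land here via the shared base class.
    print(f"runtime error: {e}")
```

The same idea applies in C++: catching std::runtime_error (or a mid-level type such as DeviceError) handles every more specific error beneath it.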
Additionally, the Result<T> template is used internally for monadic error propagation.
12. Extensions
You can load hardware backend extensions dynamically:
int main(...) {
// ...
rt::loadExtension("path/to/extension.so");
// ...
}

Built-in extensions include XNNPack, Vulkan, OpenCL, SNPE, QNN, and CUDA.
Python
Do not put any Optimium Runtime related objects in the global scope or create circular references to them.
This can lead to memory leaks or undefined behavior due to differences between the C++ and Python memory management models.
1. Install Optimium Runtime. Please click here to install the runtime.
2. Import Optimium Runtime
To use Optimium Runtime, import the optimium.runtime package. In Python, unlike C++, initialization happens at import time.
You can also modify the verbosity level or the log output path.
import optimium.runtime as rt
def main():
# change verbosity to debug level
rt.logging.set_loglevel(rt.LogLevel.DEBUG)
# enable logger that writes logs on console
rt.logging.enable_console_log()
# enable logger that writes logs on file
rt.logging.enable_file_log("output.log")

If you want to defer initialization of Optimium Runtime, set the environment variable OPTIMIUM_RT_DEFER_INIT before importing the runtime, and call rt.initialize() before using the runtime.
import os
os.environ["OPTIMIUM_RT_DEFER_INIT"] = "TRUE"
import optimium.runtime as rt
rt.initialize()  # must be called before using any runtime components.

Additional environment variables
- OPTIMIUM_RT_DEFER_INIT: Defer automatic initialization
- OPTIMIUM_RT_DEBUG: Enable debug logging
- OPTIMIUM_RT_ENABLE_LOG: Enable logging
- OPTIMIUM_RT_LOGFILE: Log file path (if set, logs to file; otherwise console)
Version information
version = rt.get_version()
print(f"Optimium Runtime v{version.major}.{version.minor}.{version.patch}")
print(f"Build: {version.build_info}")

Check available devices
It is recommended to check available devices before loading a model.
DeviceNotFoundError is a common error when you load a model without checking whether or not the required device is present.
You can get a list of available devices from rt.get_local_info() function.
def main():
# ...
# Iterate over all devices to check that the required device exists.
local = rt.get_local_info()
found = False
for dev in local.devices:
if dev.platform == rt.PlatformKind.NATIVE:
found = True
if not found:
print("cannot find needed device")

HostInfo has the following properties:
- id: Host identifier (int)
- name: Host name (str)
- architecture: Host CPU architecture (DeviceKind)
- os: Operating system (OSKind: LINUX, ANDROID, WINDOWS, MACOS, IOS)
- devices: Available devices (Sequence[DeviceID])
DeviceID has the following properties:
- platform: Platform kind (PlatformKind: NATIVE, XNNPACK, CUDA, VULKAN, OPENCL, SNPE, QNN)
- device_kind: Device kind (DeviceKind: X86, X64, ARM, ARM64, RISCV32, RISCV64, NVIDIA, MALI, ADRENO, HEXAGON, etc.)
- index: Device index among the same kind on the host
- host_id: Host ID
- host_info: HostInfo for this device
- capability: Hardware capabilities (X86Capability, ARMCapability, SPIRVCapability, CUDACapability, RISCVCapability, HexagonCapability)
3. Load a model
Model represents a loaded ML model. You can load a model via the rt.load_model() function.
By default, Optimium Runtime automatically finds and uses the devices described in the model. To run the model on a specific device, specify it manually.
You can configure various options by passing them as keyword arguments to the rt.load_model() function.
Unlike other AI inference engines such as TensorFlow Lite, an Optimium model is a folder rather than a single file. Because the folder itself is the model, pass the path to that folder when loading it with Optimium Runtime, and always copy the model together with its entire folder.
def main():
# ...
# load a model with auto-detected devices.
model = rt.load_model("path/to/model")
# load a model with manual configurations.
model = rt.load_model("path/to/model",
devices=[...],
# if true, fail when exact device is unavailable (no fallback).
strict=False,
# enable memory optimization (share buffers between non-overlapping tensors).
memory_optimization=True,
# treat denormal floats as zero.
disable_denormals=True,
# path to save intermediate tensors for debugging.
intermediate_save_path=None,
# passphrase for encrypted models.
passphrase=None,
# number of threads for inference.
threads=4,
# which cores the threads will be assigned to.
cores=[0, 1, 2, 3],
# scheduler type: SIMPLE (default), EXCLUSIVE, or PIPELINE.
scheduler_type=rt.SchedulerType.SIMPLE)

Scheduler Types
- rt.SchedulerType.SIMPLE (default): Sequential execution with exclusive per-request resources. Low latency.
- rt.SchedulerType.EXCLUSIVE: First-come first-served with resource queuing.
- rt.SchedulerType.PIPELINE: Pipelined multi-stage execution across devices.
4. Listing model information
You can find information about the model using some informative methods.
To get information about a tensor, use model.get_tensor() method with its name. You can also access by index using model.get_input_tensor(index) and model.get_output_tensor(index).
You can get the list of input or output tensor names via the model.input_names property for inputs and the model.output_names property for outputs.
Additional model properties: model.name returns the model name, model.tensors returns all tensor infos, model.operations returns a list of OpInfo (with name and device properties), and model.is_dynamic indicates whether the model has dynamic shapes.
def main():
# ...
# print model name
print(f"model name: {model.name}")
# print list of input tensor info
print("input tensors")
for name in model.input_names:
print(model.get_tensor(name))
# print list of output tensor info
print("output tensors")
for name in model.output_names:
print(model.get_tensor(name))
# access by index
first_input = model.get_input_tensor(0)
first_output = model.get_output_tensor(0)
# list operations
for op in model.operations:
print(f"op: {op.name} on device: {op.device}")

TensorInfo, the return value of the properties and methods above, represents information about each tensor. It contains the tensor's name, type, shape, alignment, padding, and quantization scheme (if quantized). It also has opt_out and constant flags.
You can access properties to get details of the tensor.
def main():
# ...
info = model.get_tensor("input_0")
print(f"name of tensor: {info.name}")
print(f"shape of tensor: {info.shape}")
print(f"alignment of tensor: {info.alignment}")
print(f"type of tensor: {info.type}")
print(f"padding of tensor: {info.padding}")
print(f"tensor size in bytes: {info.size}")
print(f"opt out: {info.opt_out}")
print(f"constant: {info.constant}")
if info.scheme:
print(f"quantization scheme of tensor: {info.scheme}")
if info.scheme.per_channel:
print(f"per-channel on axis: {info.scheme.axis}")
for i in range(len(info.scheme)):
param = info.scheme[i]
print(f" channel {i}: scale={param.scale}, zero_point={param.zero_point}")
else:
param = info.scheme.get_param()
print(f"per-tensor: scale={param.scale}, zero_point={param.zero_point}")

Dynamic Shape Models
Models with dynamic (symbolic) shapes support shape inference. You can compute the output shape from input shapes using model.infer_shape():
def main():
# ...
# infer output shape from input shapes (by-name dict)
output_shape = model.infer_shape("output_0", {
"input_0": rt.TensorShape(1, 3, 224, 224)
})
print(f"inferred output shape: {output_shape}")
# infer output shape from input shapes (by-index list)
output_shape = model.infer_shape("output_0", [
rt.TensorShape(1, 3, 224, 224)
])

TensorShape provides the following:
- rank: Number of dimensions
- dynamic: Whether the shape has symbolic dimensions
- size: Total element count (raises on dynamic shapes)
- strides: Row-major element strides
- is_compatible(other): Check if shapes are compatible
- Indexing with shape[i] (supports negative indices)
- len(shape) returns the rank
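As an illustration of what size and strides mean for a static shape, here is a plain-Python sketch of the usual row-major convention (a language-neutral illustration, not the optimium API itself):

```python
def row_major_strides(shape):
    """Element strides for a dense row-major layout:
    the last dimension is contiguous, each earlier stride is
    the product of all later dimensions."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def element_count(shape):
    """Total number of elements, defined only for fully static shapes."""
    total = 1
    for dim in shape:
        total *= dim
    return total

print(row_major_strides((1, 3, 224, 224)))  # [150528, 50176, 224, 1]
print(element_count((1, 3, 224, 224)))      # 150528
```

This is why size raises on dynamic shapes: the products above are undefined while any dimension is still symbolic.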
Expr represents a symbolic dimension:

- Expr(42): Constant dimension
- Expr("batch"): Symbolic/named dimension
- is_const(), is_symbol(), value, symbol properties
- Arithmetic: +, -, *, /, %, min(), max()
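To make the constant-versus-symbolic distinction concrete, here is a toy stand-in (not the real Expr class) showing the intended semantics: arithmetic on two constants folds to a constant, while anything involving a symbol stays symbolic:

```python
class Dim:
    """Toy symbolic dimension: either a concrete int or a named symbol."""
    def __init__(self, v):
        self.value = v if isinstance(v, int) else None
        self.symbol = v if isinstance(v, str) else None

    def is_const(self):
        return self.value is not None

    def __mul__(self, other):
        # Fold when both sides are constants; otherwise stay symbolic.
        if self.is_const() and other.is_const():
            return Dim(self.value * other.value)
        return Dim(f"({self.symbol or self.value}*{other.symbol or other.value})")

print((Dim(2) * Dim(3)).value)         # 6
print((Dim("batch") * Dim(3)).symbol)  # (batch*3)
```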
5. Creating a request
InferRequest represents a single inference that the model runs. Users can create multiple InferRequests and execute the same model without interfering with other requests.
Additionally, users can achieve target throughput by queueing multiple requests efficiently.
A request is created by calling the model.create_request() method.
def main():
# ...
request = model.create_request()

6. Prepare inputs and outputs
Before running the inference, you should prepare input and output tensors for the request.
To create a tensor, use the rt.tensor() function.
# ...
import numpy as np
def main():
# ...
# create float32 tensor shaped 32x32 (uninitialized)
f32_tensor = rt.tensor(shape=(32, 32), dtype=rt.ElementType.F32)
# you can use numpy-style alias for dtype
f32_tensor = rt.tensor(shape=(32, 32), dtype=rt.float32)
# create int16 tensor filled with 123
i16_tensor = rt.tensor(123, shape=(8,), dtype=rt.int16)
# create tensor from list
tensor_from_list = rt.tensor([[1,2,3], [4,5,6], [7,8,9]]) # default dtype is rt.float32
# create tensor from numpy (with copy by default)
arr = np.random.random((32, 32)).astype(np.float16)
tensor_from_np = rt.tensor(arr)
# create tensor from numpy without copy (zero-copy, shares memory)
tensor_zero_copy = rt.tensor(arr, copy=False)

The rt.tensor() function supports these creation modes:
- rt.tensor(shape=(3, 4), dtype=rt.float32): Uninitialized tensor
- rt.tensor(0.0, shape=(3, 4), dtype=rt.float32): Fill with scalar value
- rt.tensor([[1, 2], [3, 4]]): From nested list (default dtype: float32)
- rt.tensor(numpy_array): From numpy array (copies by default)
- rt.tensor(numpy_array, copy=False): Zero-copy from numpy (shares memory)
Available dtype aliases:
| Alias | ElementType |
|---|---|
| rt.int8 | ElementType.I8 |
| rt.uint8 | ElementType.U8 |
| rt.int16 | ElementType.I16 |
| rt.uint16 | ElementType.U16 |
| rt.int32 | ElementType.I32 |
| rt.uint32 | ElementType.U32 |
| rt.int64 | ElementType.I64 |
| rt.uint64 | ElementType.U64 |
| rt.float16 | ElementType.F16 |
| rt.float32 | ElementType.F32 |
| rt.float64 | ElementType.F64 |
| rt.bfloat16 | ElementType.BF16 |
| rt.tfloat32 | ElementType.TF32 |
| rt.bool_ | ElementType.BOOL |
| rt.str_ | ElementType.STRING |
| rt.qint8 | ElementType.QS8 |
| rt.quint8 | ElementType.QU8 |
| rt.qint16 | ElementType.QS16 |
| rt.quint16 | ElementType.QU16 |
| rt.qint32 | ElementType.QS32 |
You can also convert between ElementType and numpy dtypes:
# ElementType -> numpy dtype
np_dtype = rt.ElementType.F32.to_dtype()
# numpy dtype -> ElementType
elem_type = rt.ElementType.from_dtype(np.float32)

You can convert an rt.Tensor to a numpy.ndarray with the tensor.to_numpy() method.
Note that rt.Tensor does not provide a way to access tensor data directly; convert it to a numpy.ndarray to access the data.
def main():
# ...
tensor = rt.tensor(...)
# not supported
# val = tensor[...]
arr = tensor.to_numpy()
val = arr[...]

rt.Tensor provides the following properties:
- shape: TensorShape of the tensor
- type: ElementType of the tensor
And the following methods:
- to_numpy(): Convert to numpy array (zero-copy, shares memory)
- fill(value): Fill the tensor with a scalar value
Tensor print configuration
You can configure how tensors are printed:
# set the maximum number of elements displayed per dimension (default: 8, -1 = unlimited)
rt.config.set_print_threshold(10)
# set the decimal precision for floating point output (default: 6, range 0-15)
rt.config.set_print_precision(4)
7. Running an inference
Running an inference involves two calls: infer and wait. The request.infer() method asks the runtime to start the inference and returns immediately; the request.wait() method blocks until the previously requested inference finishes (whether it succeeds or fails).
The request.infer() method takes two arguments, inputs and outputs, and has two variants: one accepts sequences (e.g. list, tuple) and the other accepts dicts.
The sequence variant must contain every tensor the model requires, in the same order as the model's inputs or outputs.
def main():
    # ...
    # running inference with input and output lists
    inputs = []
    outputs = []
    # create empty tensors
    for name in model.input_names:
        info = model.get_tensor(name)
        inputs.append(rt.tensor(shape=info.shape, dtype=info.type))
    for name in model.output_names:
        info = model.get_tensor(name)
        outputs.append(rt.tensor(shape=info.shape, dtype=info.type))
    request.infer(inputs, outputs)
    request.wait()
The dict variant must contain every tensor the model requires, keyed by the model's input or output tensor names.
def main():
    # ...
    # running inference with input and output dicts
    inputs = {}
    outputs = {}
    # create empty tensors
    for name in model.input_names:
        info = model.get_tensor(name)
        inputs[name] = rt.tensor(shape=info.shape, dtype=info.type)
    for name in model.output_names:
        info = model.get_tensor(name)
        outputs[name] = rt.tensor(shape=info.shape, dtype=info.type)
    request.infer(inputs, outputs)
    request.wait()
The request.infer() method also accepts numpy.ndarray objects as tensors; you can pass them without converting to rt.Tensor.
def main():
    # ...
    # running inference with numpy arrays
    # assume the model accepts two float32 tensors and
    # returns one float32 tensor.
    inputs = [
        np.random.random((32, 32)).astype(np.float32),
        np.random.random((32, 32)).astype(np.float32)
    ]
    outputs = [
        np.zeros((32, 32), dtype=np.float32)
    ]
    request.infer(inputs, outputs)
    request.wait()
Note that you do not need to create tensors for every inference. Tensors can be reused between requests and/or models as long as they are not used simultaneously (e.g. you cannot use request A's output tensor X as request B's input while request A is running; it is fine to use tensor X after request A has finished).
The request.wait() method takes an optional timeout argument, the number of microseconds to wait. It returns False if the inference finished before the timeout was reached, and True if it did not. If called with no argument (or zero), it waits indefinitely.
Note that request.wait() should always be called so that any error that occurred during inference is reported. Starting a new inference while the request is in a fault state may cause undefined behavior.
To check the state of the request, use the request.status property. The possible states are InferStatus.READY, InferStatus.RUNNING, and InferStatus.FAULT.
def main():
    # ...
    # wait until inference is finished
    request.infer(...)
    request.wait()
    # wait 500000 microseconds (500ms) for inference to finish
    request.infer(...)
    if request.wait(500000):
        print("inference not finished after 500ms")
    else:
        print("inference finished within 500ms")
    # check the status of the inference
    print(f"current state of request: {request.status}")
Callbacks
You can register a callback to be called when the inference finishes:
def on_complete(status, error):
    if status == rt.InferStatus.FAULT:
        print(f"Inference error: {error}")

request.set_callback(on_complete)
request.infer(inputs, outputs)
request.wait()
Cancellation
You can cancel an in-progress inference:
request.infer(inputs, outputs)
# ... later:
request.cancel()
8. Profiling
InferRequest supports profiling mode for performance measurement:
def main():
    # ...
    # profile() is blocking (unlike infer(), which is asynchronous)
    request.profile(inputs, outputs,
                    repeat=100,
                    warmup=10,
                    warmup_time=1000000,   # microseconds
                    stop_threshold=0,      # microseconds
                    check_period=0,
                    event_buffer_size=0)   # 0 means use default (4MB)
    # access raw profiling events
    for event in request.get_profile_events():
        print(f"Event: {event.kind} at {event.timestamp}")
    # access computed durations (nanoseconds, grouped by model/layer/copy)
    durations = request.get_profile_durations()
    print(f"Model: {durations.model_name}")
    print(f"Model durations (ns): {durations.model_durations}")
    for name, layer_dur in durations.layer_durations.items():
        print(f" Layer '{name}': {layer_dur}")
    for name, copy_dur in durations.copy_durations.items():
        print(f" Copy '{name}': {copy_dur}")
ProfileEvent
Profile event kinds (ProfileEventKind enum): MODEL_EXECUTE_BEGIN/END, LAYER_EXECUTE_BEGIN/END, LAUNCH_BEGIN/END, DEVICE_EXECUTE_BEGIN/END, COPY_BEGIN/END, QUEUE_BEGIN/END, WAIT_BEGIN/END.
ProfileEvent properties:
- `kind` - ProfileEventKind
- `timestamp` - Timestamp in microseconds (int)
- `operation` - Operation index (for LAYER_EXECUTE and WAIT events)
- `thread_id` - Thread ID (for LAYER_EXECUTE events)
- `tensor_id` - Tensor ID (for COPY events)
- `source_device` - Source DeviceID (for COPY events)
- `dest_device` - Destination DeviceID (for COPY events)
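Durations are computed by pairing each BEGIN event with its matching END event. A minimal sketch of that pairing, using (kind, timestamp) tuples as stand-ins for ProfileEvent objects (the event values below are made up):

```python
def pair_durations(events, kind):
    """Match BEGIN/END events of one kind and return their durations."""
    stack, durations = [], []
    for name, ts in events:
        if name == f"{kind}_BEGIN":
            stack.append(ts)      # remember when this span started
        elif name == f"{kind}_END":
            durations.append(ts - stack.pop())
    return durations

# hypothetical event stream; timestamps in microseconds
events = [
    ("MODEL_EXECUTE_BEGIN", 100),
    ("LAYER_EXECUTE_BEGIN", 110),
    ("LAYER_EXECUTE_END", 150),
    ("MODEL_EXECUTE_END", 160),
]
print(pair_durations(events, "LAYER_EXECUTE"))  # [40]
print(pair_durations(events, "MODEL_EXECUTE"))  # [60]
```

In practice you would iterate request.get_profile_events() instead of a hand-written list, or simply call get_profile_durations(), which does this pairing for you.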
ProfileDurations
request.get_profile_durations() returns a ProfileDurations object that automatically pairs begin/end events and computes durations:
- `model_name` - Name of the model (str)
- `model_durations` - List of model-level durations in nanoseconds (List[int])
- `layer_durations` - Dict mapping operation names to duration lists in nanoseconds (Dict[str, List[int]])
- `copy_durations` - Dict mapping tensor names to duration lists in nanoseconds (Dict[str, List[int]])
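With durations grouped per operation, summary statistics are straightforward to compute. A sketch over a hypothetical layer_durations dict (the nanosecond samples below are made up; a real dict comes from request.get_profile_durations().layer_durations):

```python
from statistics import mean

# hypothetical per-layer duration samples in nanoseconds
layer_durations = {
    "conv_0": [1200, 1100, 1300],
    "relu_0": [200, 250, 210],
}

# print min/max/mean for each layer
for name, values in layer_durations.items():
    print(f"{name}: min={min(values)} ns, "
          f"max={max(values)} ns, mean={mean(values):.1f} ns")
```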
BasicProfiler Utility
For convenient profiling, use BasicProfiler:
from optimium.runtime.profile import BasicProfiler
profiler = BasicProfiler(
    "path/to/model",
    memory_optimization=True,
    disable_denormals=False,
    threads=1,
    cores=None,
    intermediate_save_path=""
)
profiler.profile(
    repeat=100,
    warmup=10,
    warmup_time=1000000,  # microseconds
)
# access statistics
print(f"Model: min={profiler.model_stat.min}, "
      f"max={profiler.model_stat.max}, "
      f"mean={profiler.model_stat.mean:.2f}")
for stat in profiler.layer_stats:
    print(f" Layer {stat.name}: mean={stat.mean:.2f}")
# save results to JSON
profiler.dump_json("profile_results.json")
BasicProfiler can also connect to a remote server:
profiler = BasicProfiler(
    "path/to/model",
    remote_address="192.168.1.100",
    remote_port=32264
)
For multiple input/output batches, use BatchProfiler:
from optimium.runtime.profile import BatchProfiler
profiler = BatchProfiler("path/to/model")
profiler.profile(
    repeat=100,
    inputs=[batch1_inputs, batch2_inputs, ...],
    outputs=[batch1_outputs, batch2_outputs, ...]
)
# access per-batch and total statistics
for i, stat in enumerate(profiler.batch_stat):
    print(f"Batch {i}: mean={stat.model_stat.mean:.2f}")
print(f"Total: mean={profiler.total_stat.model_stat.mean:.2f}")
Intermediate Tensor Access
When memory_optimization=False or intermediate_save_path is set, you can access intermediate tensors:
model = rt.load_model("path/to/model", memory_optimization=False)
request = model.create_request()
request.infer(inputs, outputs)
request.wait()
# access intermediate tensor
intermediate = request.get_tensor("intermediate_tensor_name")
print(intermediate.to_numpy())
9. Remote Inference
Optimium Runtime supports remote inference via network connection:
import optimium.runtime as rt
from optimium.runtime import remote
# connect to remote server
session = remote.connect(
    "192.168.1.100",
    port=32264,
    enable_secure_connection=True,
    enable_compression=True
)
# check remote host info
print(f"Remote host: {session.host_info}")
# load model on remote
model = session.load_model(
    "path/to/model",
    threads=4,
    cores=[0, 1, 2, 3],
    scheduler_type=rt.SchedulerType.SIMPLE
)
# model info
print(f"Model: {model.name}")
for info in model.input_tensor_infos:
    print(f" Input: {info.name} {info.shape} {info.type}")
for info in model.output_tensor_infos:
    print(f" Output: {info.name} {info.shape} {info.type}")
# run profiling (asynchronous - call wait() after)
inputs = {"input_0": rt.tensor(shape=(1, 3, 224, 224), dtype=rt.float32)}
outputs = {"output_0": rt.tensor(shape=(1, 1000), dtype=rt.float32)}
model.profile(inputs, outputs, repeat=100, warmup=10)
model.wait()
# get profile events and durations
events = model.get_profile_events()
durations = model.get_profile_durations()
# get specific tensor result
result = model.get_tensor("output_0")
Remote model properties:
- `name` - Model name
- `input_tensor_infos` - Input tensor info list
- `output_tensor_infos` - Output tensor info list
- `tensor_infos` - All tensor infos
- `operations` - Operation name list
- `is_dynamic` - Whether the model has dynamic shapes
Remote model methods:
- `profile(inputs, outputs, repeat=1, *, warmup=0, warmup_time=0, stop_threshold=0, check_period=0, event_buffer_size=0)` - Asynchronous profiling (accepts dict or sequence, with Tensor or numpy)
- `wait(timeout=0)` - Wait for profiling to complete (timeout in microseconds)
- `cancel()` - Cancel in-progress profiling
- `get_tensor(name)` - Get tensor result after completion
- `get_profile_events()` - Get raw profiling events
- `get_profile_durations()` - Get computed ProfileDurations (nanosecond durations grouped by model/layer/copy)
- `is_saved_tensor(name)` - Check if tensor is saved
- `infer_shape(output_name, input_shapes)` - Infer output shape from input shapes (dict)
Remote session's load_model accepts these additional keyword arguments: devices, strict, memory_optimization, disable_denormals, intermediate_save_path, passphrase, threads, cores, scheduler_type.
10. Error handling
Optimium Runtime maps C++ exceptions to Python exceptions:
- `rt.InvalidArgumentError` (inherits from `ValueError`)
- `rt.InvalidStateError`
- `rt.InvalidOperationError`
- `rt.TypeError` (inherits from `TypeError`)
- `rt.ShapeError`
- `rt.ExtensionError`
- `rt.DeviceError`
- `rt.ModelError`
- `rt.RequestError`
- `rt.InferError`
- `rt.OutOfResourceError`
- `rt.ContainerError`
- `rt.RemoteError`
- `rt.IOError` (inherits from `IOError`)
- `rt.NetworkError` (inherits from `IOError`)
- `rt.OSError` (inherits from `OSError`)
- `rt.NotImplementedError` (inherits from `NotImplementedError`)
All inherit from rt.RuntimeException (which inherits from RuntimeError).
try:
    model = rt.load_model("nonexistent/path")
except rt.ContainerError as e:
    print(f"Container error: {e}")
except rt.ModelError as e:
    print(f"Model error: {e}")
11. Loading extensions
You can load hardware backend extensions dynamically:
rt.load_extension("path/to/extension.so")