Tutorial

To run the examples below, you need CMake 3.23 or above (if you decide to use CMake) and a C++ compiler that supports C++14 (tested on MSVC v14.43, GCC 9, Clang 11 or above).


C++

1. Install prerequisites.

To run this example app, you need CMake 3.23 or above (if you decide to use CMake) and a C++ compiler that supports C++14 (tested on MSVC v14.43, GCC 9, Clang 11 or above).

# debian-based distros
sudo apt-get install build-essential cmake ninja-build

2. Install Optimium Runtime. Please click here to install the runtime.

3. Add Optimium Runtime as a dependency.

find_package(Optimium-Runtime REQUIRED)
target_link_libraries(MyExecutable PRIVATE Optimium::Runtime)

# Use C++14
set(CMAKE_CXX_STANDARD 14)
pkg-config is supported for non-CMake users.
You can get compiler options via `pkg-config --libs --cflags optimium-runtime`.
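For a non-CMake build, a sketch of a direct compiler invocation using that command (the compiler and file names are illustrative):

# illustrative example; substitute your own compiler and source files
g++ -std=c++14 main.cpp $(pkg-config --cflags --libs optimium-runtime) -o MyExecutable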

Optimium Runtime requires C++14 to compile correctly. Use set(CMAKE_CXX_STANDARD 14) to set the C++ language version globally, or set_target_properties(<TARGET> PROPERTIES CXX_STANDARD 14) to apply C++14 only to your CMake target.
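For example, the target-scoped variant (the target name is illustrative):

# apply C++14 only to this target
set_target_properties(MyExecutable PROPERTIES CXX_STANDARD 14)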

IMPORTANT! If you're using Android, refer to the code below.

set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE BOTH)
find_package(Optimium-Runtime REQUIRED)
target_link_libraries(MyExecutable PRIVATE Optimium::Runtime)

In addition, on Android you must add android:extractNativeLibs="true" to your AndroidManifest.xml file.

 <application ...
              android:extractNativeLibs="true"
              ...>

4. Initialize the runtime

Before loading a model, you must initialize the runtime, either by declaring an rt::AutoInit variable or by calling rt::initialize() explicitly (both approaches are described below).

You can optionally specify a scheduling policy and modify the logger's verbosity level and output path.

To configure logging, use the logging::setLogLevel() and/or logging::addLogWriter() functions before initializing the runtime.

#include <Optimium/Runtime.h>
#include <Optimium/Runtime/Logging/LogSettings.h>
#include <Optimium/Runtime/Logging/AndroidLogWriter.h>
#include <Optimium/Runtime/Logging/ConsoleLogWriter.h>
#include <Optimium/Runtime/Logging/FileLogWriter.h>

int main(...) {
    // change verbosity to debug level
    rt::logging::setLogLevel(rt::LogLevel::Debug);

    // add console log writer.
    rt::logging::addLogWriter(std::make_unique<rt::logging::ConsoleWriter>());

    // add file log writer to "output.log" file.
    rt::logging::addLogWriter(std::make_unique<rt::logging::FileWriter>("output.log"));

    // add Android Logcat log writer. this logger is only available for Android.
    rt::logging::addLogWriter(std::make_unique<rt::logging::AndroidWriter>());

    // Explicitly initialize and finalize the runtime.
    rt::initialize();

    // ... (load model, run inference, etc.)

    rt::finalize();
}

Initialization lifecycle

The runtime must be initialized before any model loading or inference, and finalized after all work is done. There are two approaches:

Approach 1: Explicit initialize() / finalize() calls

Call rt::initialize() at the start of your program and rt::finalize() at the end. This gives you full control over the runtime lifecycle.

int main(...) {
    rt::initialize();

    // ... load models, run inference ...

    rt::finalize();
}

Approach 2: rt::AutoInit (RAII)

rt::AutoInit calls rt::initialize() in its constructor and rt::finalize() in its destructor. Because rt::finalize() shuts down the runtime entirely, it is critical that AutoInit outlives all inference operations. If declared as a local variable inside a function, the runtime will be finalized when the variable goes out of scope, and any inference running at that point will fail.

For this reason, rt::AutoInit should be declared as a static global variable, or at the very top of main() before any other runtime operations:

// Recommended: static global; the runtime lives for the entire process lifetime.
static rt::AutoInit Init;

int main(...) {
    // ... load models, run inference ...
    // rt::finalize() is called automatically when the process exits.
}

// Also OK: top of main(); the runtime lives until main() returns.
int main(...) {
    rt::AutoInit Init;

    // ... load models, run inference ...
    // rt::finalize() is called when Init goes out of scope.
}

Warning: Do NOT declare rt::AutoInit inside a narrow scope (e.g. inside a loop or a helper function). If the destructor runs while inference requests are still in progress, the runtime will be finalized prematurely, causing undefined behavior or crashes.
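To make the pitfall concrete, a sketch of what not to do:

// BAD: AutoInit in a helper function. The runtime is finalized as soon as
// the function returns, even if other requests are still running elsewhere.
void runOnce() {
    rt::AutoInit Init;
    rt::Model Model = rt::loadModel("path/to/model");
    rt::InferRequest Request = Model.createRequest();
    Request.infer(...);
    Request.wait();
}   // Init goes out of scope here: rt::finalize() is called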

Check available devices

It is recommended to check available devices before loading a model.

DeviceNotFoundError commonly occurs when you load a model without first checking that the required device is present.

You can get the list of available devices from the rt::getLocalInfo() function.

int main(...) {
    // ...

    // Iterate over all devices to check that the required device exists.
    rt::HostInfo Local = rt::getLocalInfo();

    bool Found = false;
    for (rt::DeviceID ID : Local.Devices) {
        if (ID.getPlatform() == rt::PlatformKind::Native) {
            Found = true;
            break;
        }
    }

    if (!Found) {
        std::cout << "error: cannot find needed device."
                  << std::endl;
    }
}

HostInfo contains the following members:

  • int ID - Host identifier
  • StringRef Name - Host name
  • DeviceKind Architecture - Host CPU architecture
  • OSKind OS - Operating system (Linux, Android, Windows, MacOS, IOS)
  • ArrayRef<DeviceID> Devices - Available devices on this host

DeviceID provides the following methods:

  • PlatformKind getPlatform() - Platform kind (Native, XNNPack, CUDA, Vulkan, OpenCL, SNPE, QNN)
  • DeviceKind getDeviceKind() - Device kind (x86, x64, ARM, ARM64, RISCV64, NVIDIA, Mali, Adreno, Hexagon, etc.)
  • uint32_t getIndex() - Device index among the same kind on the host
  • uint32_t getHostID() - Host ID
  • HostInfo getHostInfo() - Get host info for this device
  • const Capability &getCapability() - Get hardware capabilities (X86Capability, ARMCapability, etc.)
  • std::string toString() - String representation
  • static DeviceID from(PlatformKind, DeviceKind, uint8_t Index, uint8_t Host) - Create a DeviceID
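For example, a hedged sketch that builds a DeviceID with from() and prints it; the platform and device kinds chosen here are illustrative, so substitute the ones your hardware reports:

// illustrative kinds; pick the platform/device your host actually exposes
rt::DeviceID CPU = rt::DeviceID::from(
    rt::PlatformKind::Native, rt::DeviceKind::ARM64, /*Index=*/0, /*Host=*/0);
std::cout << CPU.toString() << std::endl;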

5. Load a model

Model represents an ML model. You can load a model via the rt::loadModel() function.

Optimium Runtime automatically searches for the devices described in the model.

To run the model on a particular device, specify it manually.

You can configure various options through the ModelLoadOptions struct.

Unlike other AI inference engines such as TensorFlow Lite, Optimium uses a folder as its model format. Since the folder itself is the model, pass the path to that folder when loading it with Optimium Runtime. Always copy the model together with its entire folder.

int main(...) {
    // ...

    // load a model with default options (auto-detected devices).
    rt::Model Model = rt::loadModel("path/to/model");


    // load a model with manual configurations.
    rt::ModelLoadOptions Options;

    // set devices for running the model.
    Options.Devices = { ... };

    // if true, fail when the exact device is not available (no fallback to localhost).
    Options.Strict = false;

    // enable memory optimization (share buffers between non-overlapping tensors).
    Options.EnableMemoryOptimization = true;

    // enable runtime checks for debugging. (NOT YET IMPLEMENTED)
    Options.EnableRuntimeChecks = false;

    // treat denormal floats as zero.
    Options.DisableDenormals = true;

    // path to save intermediate tensors for debugging.
    Options.IntermediateSavePath = "";

    // passphrase for encrypted models.
    Options.Passphrase = "";

    // set the number of threads to be used for running the model.
    Options.ThreadCount = 4;

    // set which cores the threads used for inference will be assigned to.
    Options.Cores = {0, 1, 2, 3};

    // set the scheduler type.
    // Simple: sequential execution (default)
    // Exclusive: first-come first-served with resource queuing
    // Pipeline: pipelined multi-stage execution across devices
    Options.Scheduler = rt::SchedulerType::Simple;

    rt::Model Model = rt::loadModel("path/to/model", Options);
    // ..
}

6. Listing model information

You can inspect the model using a few informational functions.

To get information about a tensor, use the Model.getTensorInfo() function with its name. You can also access tensors by index using Model.getInputTensorInfo(index) and Model.getOutputTensorInfo(index).

You can also get the lists of input and output tensor names via Model.getInputNames() and Model.getOutputNames().

Additionally, Model.getName() returns the model's name, Model.getTensors() returns all tensor infos, and Model.getOperations() returns a list of OpInfo (containing the operation name and device).

int main(...) {
    // ...

    // print model name
    std::cout << "model name: " << Model.getName() << std::endl;

    // print list of input tensor info
    std::cout << "input tensors" << std::endl;
    for (rt::StringRef Name : Model.getInputNames())
        std::cout << Model.getTensorInfo(Name) << std::endl;

    // print list of output tensor info
    std::cout << "output tensors" << std::endl;
    for (rt::StringRef Name : Model.getOutputNames())
        std::cout << Model.getTensorInfo(Name) << std::endl;

    // access by index
    const rt::TensorInfo &FirstInput = Model.getInputTensorInfo(0);
    const rt::TensorInfo &FirstOutput = Model.getOutputTensorInfo(0);

    // list operations
    for (const auto &Op : Model.getOperations())
        std::cout << "op: " << Op.Name << " on device: " << Op.Device << std::endl;
}

TensorInfo, the return value of the functions above, represents information about each tensor. It contains the tensor's name, type, shape, alignment, padding, and quantization scheme (if quantized). It also has OptOut and Constant flags.

You can access member variables to get details of the tensor.

int main(...) {
    // ...

    const rt::TensorInfo &Info = Model.getTensorInfo("input_0");

    std::cout << "name of tensor: "
              << Info.Name << std::endl;
    std::cout << "shape of tensor: "
              << Info.Shape << std::endl;
    std::cout << "alignment of tensor: "
              << Info.Alignment << std::endl;
    std::cout << "type of tensor: "
              << Info.Type << std::endl;
    std::cout << "padding of tensor: "
              << Info.Padding << std::endl;
    std::cout << "tensor size in bytes: "
              << Info.getTensorSize() << std::endl;

    if (Info.Scheme)
        std::cout << "scheme of tensor: "
                  << *(Info.Scheme) << std::endl;

    // ...
}

Dynamic Shape Models

Models with dynamic (symbolic) shapes support shape inference. You can compute the output shape from input shapes using Model.inferShape():

int main(...) {
    // ...

    // infer output shape from input shapes (by-name)
    std::map<std::string, rt::TensorShape> InputShapes;
    InputShapes["input_0"] = rt::TensorShape({1, 3, 224, 224});
    rt::TensorShape OutputShape = Model.inferShape("output_0", InputShapes);

    // infer output shape from input shapes (by-index)
    std::vector<rt::TensorShape> InputShapeList = { rt::TensorShape({1, 3, 224, 224}) };
    rt::TensorShape OutputShape2 = Model.inferShape("output_0", rt::make_array(InputShapeList));

    // ...
}

7. Creating a request

InferRequest represents a single inference that the model runs. Users can create multiple InferRequests and execute the same model without interfering with other requests.

Additionally, users can achieve a target throughput by queueing multiple requests efficiently (see the sketch after the example below).

Create a request by calling the Model.createRequest() function.

int main(...) {
    // ...

    rt::InferRequest Request = Model.createRequest();

    // ...
}
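To make the throughput point concrete, a sketch that keeps two requests in flight at once; infer() and wait() are described in step 9, and tensor setup is omitted here (each request must use its own input and output tensors while running):

    rt::InferRequest First = Model.createRequest();
    rt::InferRequest Second = Model.createRequest();

    // issue both inferences; infer() returns immediately (see step 9)
    First.infer(...);
    Second.infer(...);

    // both requests are now in flight; wait for each to finish
    First.wait();
    Second.wait();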

8. Prepare inputs and outputs

Before running the inference, you should prepare input and output tensors for the request.

To create a tensor, use the rt::tensor() function.

int main(...) {
    // ...

    // Create float32 tensor shaped 32x32.
    rt::TypedTensor<float> f32_tensor = rt::tensor<float>({32, 32});

    // Create generic tensor shaped 8.
    rt::Tensor i16_tensor = rt::tensor(rt::ElementType::I16, {8});

    // Create float16 tensor shaped 32x32 with user-provided buffer.
    rt::float16* Data = new rt::float16[32 * 32];
    rt::TypedTensor<rt::float16> f16_tensor = rt::tensor<rt::float16>({32, 32}, Data);
}

When you create a tensor with a user-provided buffer, you must take extra care.

Optimium Runtime does not take ownership of the provided buffer, so the buffer must not be freed before the tensor is finalized.

Also, the runtime always assumes you provide a valid buffer. Providing an invalid buffer (one smaller than expected, an invalid pointer address, etc.) can cause severe errors.

After creating a tensor, you can access its memory through the Tensor.data() function. You can copy data into the tensor with standard functions such as memcpy or std::copy.

If you want to fill a tensor with a scalar value, use the TypedTensor.fill() function.

You can also save and load tensors to/from files using Tensor.save() and Tensor.load().

int main(...) {
    // ...

    // Example: load data from an existing buffer.
    // Assume variable 'Data' contains the data.
    std::vector<float> Data;

    rt::TypedTensor<float> Tensor = rt::tensor<float>({32, 32});
    std::copy(Data.begin(), Data.end(), Tensor.data());

    // Example: load data from a file.
    // <fstream> is required.
    std::ifstream File("path/to/file", std::ios::in | std::ios::binary);
    File.read(reinterpret_cast<char *>(Tensor.data()), Tensor.getTensorSize());

    // Fill with a scalar value. 'Tensor' must be a 'TypedTensor'.
    Tensor.fill(1.0f);

    // Save tensor to file.
    Tensor.save("tensor.bin");

    // Load tensor from file.
    Tensor.load("tensor.bin");

    // Release tensor data explicitly.
    Tensor.release();

    // ...
}

Optimium Runtime provides types that are not natively supported in C++; these types are defined in headers under Optimium/Runtime/Types.

Tensor element types and their wrapped C++ types are listed below.

Element Type         C++ Type
ElementType::F16     rt::float16
ElementType::BF16    rt::bfloat16
ElementType::TF32    rt::tfloat32
ElementType::QS8     rt::qs8
ElementType::QU8     rt::qu8
ElementType::QS16    rt::qs16
ElementType::QU16    rt::qu16
ElementType::QS32    rt::qs32

Additional element types available: I8, U8, I16, U16, I32, U32, I64, U64, F16, F32, F64, Bool, String

Optimium Runtime only recognizes these C++ types for the corresponding tensor types when it checks a tensor's data type. Other data types are not recognized and result in a compilation error.

9. Running an inference

Running an inference is done by calling two functions: infer and wait. The Request.infer() function asks the runtime to start the inference and returns immediately; the Request.wait() function waits for the previously requested inference to finish (regardless of success or failure).

Request.infer() requires two arguments, inputs and outputs, and has two variants: one accepts rt::ArrayRef (a reference type for array-like values) and the other accepts std::map.

The variant that accepts rt::ArrayRef must receive all tensors that the model requires, in the same order as the model's inputs or outputs.

int main(...) {
    // ...

    // running inference with input and output list
    std::vector<rt::Tensor> Inputs;
    std::vector<rt::Tensor> Outputs;

    // Create empty tensors.
    for (rt::StringRef Name : Model.getInputNames()) {
        const rt::TensorInfo& Info = Model.getTensorInfo(Name);
        Inputs.push_back(rt::tensor(Info.Type, Info.Shape));
    }

    for (rt::StringRef Name : Model.getOutputNames()) {
        const rt::TensorInfo& Info = Model.getTensorInfo(Name);
        Outputs.push_back(rt::tensor(Info.Type, Info.Shape));
    }

    // rt::make_array is a helper function that creates an rt::ArrayRef.
    Request.infer(rt::make_array(Inputs), rt::make_array(Outputs));
    Request.wait();

    // ...
}

The variant that accepts std::map must receive all tensors that the model requires, keyed by the model's input or output tensor names.

int main(...) {
    // ...

    // running inference with input and output map
    std::map<std::string, rt::Tensor> Inputs;
    std::map<std::string, rt::Tensor> Outputs;

    // Create empty tensors
    for (rt::StringRef Name : Model.getInputNames()) {
        const rt::TensorInfo& Info = Model.getTensorInfo(Name);
        Inputs[Name] = rt::tensor(Info.Type, Info.Shape);
    }

    for (rt::StringRef Name : Model.getOutputNames()) {
        const rt::TensorInfo& Info = Model.getTensorInfo(Name);
        Outputs[Name] = rt::tensor(Info.Type, Info.Shape);
    }

    Request.infer(Inputs, Outputs);
    Request.wait();

    // ...
}

Note that you do not need to create new tensors for every inference. Tensors can be reused between requests and/or models as long as they are not used simultaneously. (e.g. you cannot use request A's output tensor X as request B's input while request A is running; it is fine to use tensor X after request A has finished.)

The Request.wait() function has an optional argument, timeout, the maximum duration to wait. If the method returns false, the inference finished before the timeout was reached; if it returns true, the timeout elapsed before the inference finished. If called with no argument (or zero), it waits indefinitely.

Note that Request.wait() should always be called to check for errors that occurred during inference. Starting a new inference while the request is in a fault state may cause undefined behavior.

To check the state of the request, use the Request.getStatus() function. The possible states are InferStatus::Ready, InferStatus::Running, and InferStatus::Fault.

int main(...) {
    // ...

    // waits until inference is finished.
    Request.infer(...);
    Request.wait();

    // waits 500 milliseconds to finish inference
    using namespace std::chrono_literals;

    Request.infer(...);
    if (Request.wait(500ms))
        std::cout << "inference not finished after 500ms"
                  << std::endl;
    else
        std::cout << "inference was finished within 500ms"
                  << std::endl;

    // check the state of the request
    std::cout << "status of request: "
              << Request.getStatus()
              << std::endl;
}

Callbacks

You can register a callback to be called when the inference finishes:

int main(...) {
    // ...

    Request.addCallback([](rt::InferStatus Status, std::exception_ptr Err) {
        if (Status == rt::InferStatus::Fault) {
            try { std::rethrow_exception(Err); }
            catch (const std::exception &E) {
                std::cerr << "Inference error: " << E.what() << std::endl;
            }
        }
    });

    Request.infer(...);
    Request.wait();
}

Cancellation

You can cancel an in-progress inference:

Request.infer(...);
// ... later:
Request.cancel();

10. Profiling

InferRequest supports profiling mode for performance measurement:

int main(...) {
    // ...

    rt::ProfileOptions Options;
    Options.Repeat = 100;          // number of repetitions
    Options.WarmUp = 10;           // warm-up count
    Options.WarmUpTime = std::chrono::microseconds(1000000); // warm-up duration
    Options.StopThreshold = std::chrono::microseconds(0);    // stop threshold
    Options.CheckPeriod = 0;       // check period
    Options.EventBufferSize = rt::kDefaultEventBufferSize;   // event buffer size (default: 4MB)

    // profile() is blocking (unlike infer() which is async)
    Request.profile(Inputs, Outputs, Options);

    // access profiling events
    for (const auto &Event : Request.getProfileEvents()) {
        std::cout << "Event: " << rt::toString(Event.Kind)
                  << " at " << Event.TimeStamp.time_since_epoch().count()
                  << std::endl;
    }

    // access model metadata from the request
    std::cout << "Model: " << Request.getModelName() << std::endl;
    for (const auto &Op : Request.getModelOperations())
        std::cout << "  Op: " << Op.Name << std::endl;
}

You can also use ProfileEventRecorder for custom lock-free event recording:

auto Recorder = std::make_shared<rt::ProfileEventRecorder>(1024 * 1024);
Request.setRecorder(Recorder);
Request.profile(Inputs, Outputs, Options);
// Recorder now contains events with lock-free access

Profile event kinds include: ModelExecuteBegin/End, LayerExecuteBegin/End, LaunchBegin/End, DeviceExecuteBegin/End, CopyBegin/End, QueueBegin/End, WaitBegin/End.
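A hedged sketch that pairs ModelExecuteBegin/End events into per-iteration latencies; it assumes the event kinds live in an rt::ProfileEventKind enum (mirroring the Python API's naming) and that TimeStamp is a std::chrono time point, as the example above suggests:

// compute one latency value per profiled iteration
// (enum name rt::ProfileEventKind is assumed from the Python API)
auto Events = Request.getProfileEvents();
std::vector<double> LatenciesUs;

size_t BeginIdx = 0;
for (size_t I = 0; I < Events.size(); ++I) {
    if (Events[I].Kind == rt::ProfileEventKind::ModelExecuteBegin)
        BeginIdx = I;
    else if (Events[I].Kind == rt::ProfileEventKind::ModelExecuteEnd)
        LatenciesUs.push_back(std::chrono::duration<double, std::micro>(
            Events[I].TimeStamp - Events[BeginIdx].TimeStamp).count());
}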

Intermediate Tensor Access

When EnableMemoryOptimization is disabled in ModelLoadOptions, or IntermediateSavePath is set, you can access intermediate tensors for debugging:

rt::ModelLoadOptions Options;
Options.EnableMemoryOptimization = false;
rt::Model Model = rt::loadModel("path/to/model", Options);

auto Request = Model.createRequest();
Request.infer(Inputs, Outputs);
Request.wait();

// access intermediate tensor
rt::Tensor IntermediateTensor = Request.getTensor("intermediate_tensor_name");

11. Error handling

Optimium Runtime uses exception-based error handling. All exceptions inherit from std::runtime_error.

The following exception types are available:

  • InvalidArgumentError - Invalid argument passed
  • InvalidStateError - Unexpected internal state
  • InvalidOperationError - Operation not allowed in current state
  • TypeError - Type mismatch
  • ShapeError - Shape mismatch or incompatible shapes
  • ExtensionError - Extension loading or initialization failure
  • DeviceError - Device not found or operation failure
  • ModelError - Model loading or compilation failure
  • RequestError - Request operation error
  • InferError - Inference execution failure
  • OutOfResourceError - Resource allocation failure
  • ContainerError - Model container is invalid or corrupted
  • RemoteError - Remote communication error
  • IOError - I/O operation error
  • NetworkError - Network communication error
  • OSError - Operating system error
  • NotImplementedError - Feature not yet implemented

Additionally, the Result<T> template is used internally for monadic error propagation.
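A minimal sketch of catching these exceptions, assuming they live in the rt namespace like the rest of the C++ API:

try {
    // namespace rt:: for exception types is assumed here
    rt::Model Model = rt::loadModel("path/to/model");
} catch (const rt::DeviceError &E) {
    std::cerr << "device error: " << E.what() << std::endl;
} catch (const std::runtime_error &E) {
    // all Optimium Runtime exceptions derive from std::runtime_error
    std::cerr << "error: " << E.what() << std::endl;
}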

12. Extensions

You can load hardware backend extensions dynamically:

int main(...) {
    // ...
    rt::loadExtension("path/to/extension.so");
    // ...
}

Built-in extensions include XNNPack, Vulkan, OpenCL, SNPE, QNN, and CUDA.

Python

Do not put any Optimium Runtime objects in the global scope or create circular references to them.

Doing so can lead to memory leaks or undefined behavior due to differences between the C++ and Python memory management models.

1. Install Optimium Runtime. Please click here to install the runtime.

2. Import Optimium Runtime

To use Optimium Runtime, import the optimium.runtime package. In Python, unlike C++, initialization is done at import time.

You can also modify the logger's verbosity level or output path.

import optimium.runtime as rt

def main():
    # change verbosity to debug level
    rt.logging.set_loglevel(rt.LogLevel.DEBUG)

    # enable logger that writes logs on console
    rt.logging.enable_console_log()

    # enable logger that writes logs on file
    rt.logging.enable_file_log("output.log")

If you want to defer initialization of Optimium Runtime, set the OPTIMIUM_RT_DEFER_INIT environment variable before importing the runtime, then call rt.initialize() before using the runtime.

import os
os.environ["OPTIMIUM_RT_DEFER_INIT"] = "TRUE"

import optimium.runtime as rt
rt.initialize()  # must be called before using any runtime components.

Additional environment variables

  • OPTIMIUM_RT_DEFER_INIT - Defer automatic initialization
  • OPTIMIUM_RT_DEBUG - Enable debug logging
  • OPTIMIUM_RT_ENABLE_LOG - Enable logging
  • OPTIMIUM_RT_LOGFILE - Log file path (if set, logs to file; otherwise console)

Version information

version = rt.get_version()
print(f"Optimium Runtime v{version.major}.{version.minor}.{version.patch}")
print(f"Build: {version.build_info}")

Check available devices

It is recommended to check available devices before loading a model.

DeviceNotFoundError commonly occurs when you load a model without first checking that the required device is present.

You can get the list of available devices from the rt.get_local_info() function.

def main():
    # ...

    # Iterate over all devices to check that the required device exists.
    local = rt.get_local_info()

    found = False
    for dev in local.devices:
        if dev.platform == rt.PlatformKind.NATIVE:
            found = True
            break

    if not found:
        print("cannot find needed device")

HostInfo has the following properties:

  • id - Host identifier (int)
  • name - Host name (str)
  • architecture - Host CPU architecture (DeviceKind)
  • os - Operating system (OSKind: LINUX, ANDROID, WINDOWS, MACOS, IOS)
  • devices - Available devices (Sequence[DeviceID])

DeviceID has the following properties:

  • platform - Platform kind (PlatformKind: NATIVE, XNNPACK, CUDA, VULKAN, OPENCL, SNPE, QNN)
  • device_kind - Device kind (DeviceKind: X86, X64, ARM, ARM64, RISCV32, RISCV64, NVIDIA, MALI, ADRENO, HEXAGON, etc.)
  • index - Device index among the same kind on the host
  • host_id - Host ID
  • host_info - HostInfo for this device
  • capability - Hardware capabilities (X86Capability, ARMCapability, SPIRVCapability, CUDACapability, RISCVCapability, HexagonCapability)
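For example, a small helper built only from the properties above; it returns the first native device, which you can pass to rt.load_model via the devices keyword shown in the next step:

def pick_native_device():
    # scan the local host's devices for a native-platform device
    local = rt.get_local_info()
    for dev in local.devices:
        if dev.platform == rt.PlatformKind.NATIVE:
            return dev
    return None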

3. Load a model

Model represents an ML model. You can load a model via the rt.load_model() function.

Optimium Runtime automatically finds and uses the devices described in the model. To run the model on a particular device, specify it manually.

You can configure various options by passing them as keyword arguments to the rt.load_model() function.

Unlike other AI inference engines such as TensorFlow Lite, Optimium uses a folder as its model format. Since the folder itself is the model, pass the path to that folder when loading it with Optimium Runtime. Always copy the model together with its entire folder.

def main():
    # ...

    # load a model with auto-detected devices.
    model = rt.load_model("path/to/model")

    # load a model with manual configurations.
    model = rt.load_model("path/to/model",
                          devices=[...],
                          # if true, fail when exact device is unavailable (no fallback).
                          strict=False,
                          # enable memory optimization (share buffers between non-overlapping tensors).
                          memory_optimization=True,
                          # treat denormal floats as zero.
                          disable_denormals=True,
                          # path to save intermediate tensors for debugging.
                          intermediate_save_path=None,
                          # passphrase for encrypted models.
                          passphrase=None,
                          # number of threads for inference.
                          threads=4,
                          # which cores the threads will be assigned to.
                          cores=[0, 1, 2, 3],
                          # scheduler type: SIMPLE (default), EXCLUSIVE, or PIPELINE.
                          scheduler_type=rt.SchedulerType.SIMPLE)

Scheduler Types

  • rt.SchedulerType.SIMPLE (default) - Sequential execution with exclusive per-request resources. Low latency.
  • rt.SchedulerType.EXCLUSIVE - First-come first-served with resource queuing.
  • rt.SchedulerType.PIPELINE - Pipelined multi-stage execution across devices.

4. Listing model information

You can inspect the model using a few informational methods.

To get information about a tensor, use the model.get_tensor() method with its name. You can also access tensors by index using model.get_input_tensor(index) and model.get_output_tensor(index).

You can get the lists of input and output tensor names via the model.input_names and model.output_names properties.

Additional model properties: model.name returns the model name, model.tensors returns all tensor infos, model.operations returns a list of OpInfo (with name and device properties), and model.is_dynamic indicates whether the model has dynamic shapes.

def main():
    # ...

    # print model name
    print(f"model name: {model.name}")

    # print list of input tensor info
    print("input tensors")
    for name in model.input_names:
        print(model.get_tensor(name))

    # print list of output tensor info
    print("output tensors")
    for name in model.output_names:
        print(model.get_tensor(name))

    # access by index
    first_input = model.get_input_tensor(0)
    first_output = model.get_output_tensor(0)

    # list operations
    for op in model.operations:
        print(f"op: {op.name} on device: {op.device}")

TensorInfo, the return value of the properties and methods above, represents information about each tensor. It contains the tensor's name, type, shape, alignment, padding, and quantization scheme (if quantized). It also has opt_out and constant flags.

You can access properties to get details of the tensor.

def main():
    # ...

    info = model.get_tensor("input_0")

    print(f"name of tensor: {info.name}")
    print(f"shape of tensor: {info.shape}")
    print(f"alignment of tensor: {info.alignment}")
    print(f"type of tensor: {info.type}")
    print(f"padding of tensor: {info.padding}")
    print(f"tensor size in bytes: {info.size}")
    print(f"opt out: {info.opt_out}")
    print(f"constant: {info.constant}")

    if info.scheme:
        print(f"quantization scheme of tensor: {info.scheme}")
        if info.scheme.per_channel:
            print(f"per-channel on axis: {info.scheme.axis}")
            for i in range(len(info.scheme)):
                param = info.scheme[i]
                print(f"  channel {i}: scale={param.scale}, zero_point={param.zero_point}")
        else:
            param = info.scheme.get_param()
            print(f"per-tensor: scale={param.scale}, zero_point={param.zero_point}")

Dynamic Shape Models

Models with dynamic (symbolic) shapes support shape inference. You can compute the output shape from input shapes using model.infer_shape():

def main():
    # ...

    # infer output shape from input shapes (by-name dict)
    output_shape = model.infer_shape("output_0", {
        "input_0": rt.TensorShape(1, 3, 224, 224)
    })
    print(f"inferred output shape: {output_shape}")

    # infer output shape from input shapes (by-index list)
    output_shape = model.infer_shape("output_0", [
        rt.TensorShape(1, 3, 224, 224)
    ])

TensorShape provides the following:

  • rank - Number of dimensions
  • dynamic - Whether the shape has symbolic dimensions
  • size - Total element count (raises on dynamic shapes)
  • strides - Row-major element strides
  • is_compatible(other) - Check if shapes are compatible
  • Indexing with shape[i] (supports negative indices)
  • len(shape) returns the rank

Expr represents a symbolic dimension:

  • Expr(42) - Constant dimension
  • Expr("batch") - Symbolic/named dimension
  • is_const(), is_symbol(), value, symbol properties
  • Arithmetic: +, -, *, /, %, min(), max()
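A brief sketch exercising the properties above, assuming TensorShape and Expr are both exposed on the rt module:

shape = rt.TensorShape(1, 3, 224, 224)
print(shape.rank)        # 4
print(shape.dynamic)     # False
print(shape.size)        # 150528 (total element count)
print(shape[-1])         # 224 (negative indices supported)

batch = rt.Expr("batch") # symbolic dimension (rt.Expr namespace is assumed)
print(batch.is_symbol()) # True
doubled = batch * 2      # arithmetic on symbolic dimensions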

5. Creating a request

InferRequest represents a single inference that the model runs. Users can create multiple InferRequests and execute the same model without interfering with other requests.

Additionally, users can achieve target throughput by queueing multiple requests efficiently.

Create a request by calling the model.create_request() method.

def main():
    # ...

    request = model.create_request()

6. Prepare inputs and outputs

Before running the inference, you should prepare input and output tensors for the request.

To create a tensor, use the rt.tensor() function.

# ...
import numpy as np

def main():
    # ...

    # create float32 tensor shaped 32x32 (uninitialized)
    f32_tensor = rt.tensor(shape=(32, 32), dtype=rt.ElementType.F32)

    # you can use numpy-style alias for dtype
    f32_tensor = rt.tensor(shape=(32, 32), dtype=rt.float32)

    # create int16 tensor filled with 123
    i16_tensor = rt.tensor(123, shape=(8,), dtype=rt.int16)

    # create tensor from list
    tensor_from_list = rt.tensor([[1,2,3], [4,5,6], [7,8,9]])  # default dtype is rt.float32

    # create tensor from numpy (with copy by default)
    arr = np.random.random((32, 32)).astype(np.float16)
    tensor_from_np = rt.tensor(arr)

    # create tensor from numpy without copy (zero-copy, shares memory)
    tensor_zero_copy = rt.tensor(arr, copy=False)

The rt.tensor() function supports these creation modes:

  • rt.tensor(shape=(3, 4), dtype=rt.float32) - Uninitialized tensor
  • rt.tensor(0.0, shape=(3, 4), dtype=rt.float32) - Fill with scalar value
  • rt.tensor([[1, 2], [3, 4]]) - From nested list (default dtype: float32)
  • rt.tensor(numpy_array) - From numpy array (copies by default)
  • rt.tensor(numpy_array, copy=False) - Zero-copy from numpy (shares memory)

Available dtype aliases:

Alias          ElementType
rt.int8        ElementType.I8
rt.uint8       ElementType.U8
rt.int16       ElementType.I16
rt.uint16      ElementType.U16
rt.int32       ElementType.I32
rt.uint32      ElementType.U32
rt.int64       ElementType.I64
rt.uint64      ElementType.U64
rt.float16     ElementType.F16
rt.float32     ElementType.F32
rt.float64     ElementType.F64
rt.bfloat16    ElementType.BF16
rt.tfloat32    ElementType.TF32
rt.bool_       ElementType.BOOL
rt.str_        ElementType.STRING
rt.qint8       ElementType.QS8
rt.quint8      ElementType.QU8
rt.qint16      ElementType.QS16
rt.quint16     ElementType.QU16
rt.qint32      ElementType.QS32

You can also convert between ElementType and numpy dtypes:

# ElementType -> numpy dtype
np_dtype = rt.ElementType.F32.to_dtype()

# numpy dtype -> ElementType
elem_type = rt.ElementType.from_dtype(np.float32)

You can convert an rt.Tensor to a numpy.ndarray with the tensor.to_numpy() method.

Note that rt.Tensor does not provide a way to access tensor data directly; convert it to a numpy.ndarray to access the data.

def main():
    # ...

    tensor = rt.tensor(...)
    # not supported
    # val = tensor[...]

    arr = tensor.to_numpy()
    val = arr[...]

rt.Tensor provides the following properties:

  • shape - TensorShape of the tensor
  • type - ElementType of the tensor

And the following methods:

  • to_numpy() - Convert to numpy array (zero-copy, shares memory)
  • fill(value) - Fill the tensor with a scalar value
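A brief sketch of these two methods (the shape and values are illustrative):

t = rt.tensor(shape=(2, 3), dtype=rt.float32)
t.fill(1.0)                # every element becomes 1.0

arr = t.to_numpy()         # zero-copy: shares memory with the tensor
arr[0, 0] = 5.0            # writes through to the tensor's buffer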

Tensor print configuration

You can configure how tensors are printed:

# set the maximum number of elements displayed per dimension (default: 8, -1 = unlimited)
rt.config.set_print_threshold(10)
# set the decimal precision for floating point output (default: 6, range 0-15)
rt.config.set_print_precision(4)

7. Running an inference

Running an inference is done by calling two functions: infer and wait. The request.infer() method asks the runtime to start the inference and returns immediately; the request.wait() method waits for the previously requested inference to finish (regardless of success or failure).

The request.infer() method requires two arguments, inputs and outputs, and has two variants: one accepts sequences (e.g. list, tuple) and the other accepts dicts.

The variant that accepts sequences must receive all tensors that the model requires, in the same order as the model's inputs or outputs.

def main():
    # ...

    # running inference with input and output list
    inputs = []
    outputs = []

    # create empty tensors
    for name in model.input_names:
        info = model.get_tensor(name)
        inputs.append(rt.tensor(shape=info.shape, dtype=info.type))

    for name in model.output_names:
        info = model.get_tensor(name)
        outputs.append(rt.tensor(shape=info.shape, dtype=info.type))

    request.infer(inputs, outputs)
    request.wait()

The variant that accepts dicts must receive all tensors that the model requires, keyed by the model's input or output tensor names.

def main():
    # ...

    # running inference with input and output dict
    inputs = {}
    outputs = {}

    # create empty tensors
    for name in model.input_names:
        info = model.get_tensor(name)
        inputs[name] = rt.tensor(shape=info.shape, dtype=info.type)

    for name in model.output_names:
        info = model.get_tensor(name)
        outputs[name] = rt.tensor(shape=info.shape, dtype=info.type)

    request.infer(inputs, outputs)
    request.wait()

The request.infer() method also accepts numpy.ndarray objects as tensors; you can pass them without converting to rt.Tensor.

def main():
    # ...

    # running inference with numpy arrays
    # assume the model accepts two float32 tensors and
    # returns one float32 tensor.
    inputs = [
        np.random.random((32, 32)).astype(np.float32),
        np.random.random((32, 32)).astype(np.float32)
    ]
    outputs = [
        np.zeros((32, 32), dtype=np.float32)
    ]

    request.infer(inputs, outputs)
    request.wait()

Note that you do not need to create new tensors for every inference. Tensors can be reused between requests and/or models as long as they are not used simultaneously. (e.g. you cannot use request A's output tensor X as request B's input while request A is running; it is fine to use tensor X after request A has finished.)

The request.wait() method has an optional argument, timeout, the maximum number of microseconds to wait. If the method returns False, the inference finished before the timeout was reached; if it returns True, the timeout elapsed before the inference finished. If called with no argument (or zero), it waits indefinitely.

Note that request.wait() should always be called to check for errors that occurred during inference. Starting a new inference while the request is in a fault state may cause undefined behavior.

To check the state of the request, use the request.status property. The possible states are InferStatus.READY, InferStatus.RUNNING, and InferStatus.FAULT.

def main():
    # ...

    # waits until inference is finished.
    request.infer(...)
    request.wait()

    # waits 500000 microseconds (500ms) to finish inference
    request.infer(...)

    if request.wait(500000):
        print("inference not finished after 500ms")
    else:
        print("inference was finished within 500ms")

    # check status of the inference.
    print(f"current state of request: {request.status}")

Callbacks

You can register a callback to be called when the inference finishes:

def on_complete(status, error):
    if status == rt.InferStatus.FAULT:
        print(f"Inference error: {error}")

request.set_callback(on_complete)
request.infer(inputs, outputs)
request.wait()

Cancellation

You can cancel an in-progress inference:

request.infer(inputs, outputs)
# ... later:
request.cancel()

8. Profiling

InferRequest supports profiling mode for performance measurement:

def main():
    # ...

    # profile() is blocking (unlike infer() which is async)
    request.profile(inputs, outputs,
                    repeat=100,
                    warmup=10,
                    warmup_time=1000000,    # microseconds
                    stop_threshold=0,       # microseconds
                    check_period=0,
                    event_buffer_size=0)    # 0 means use default (4MB)

    # access raw profiling events
    for event in request.get_profile_events():
        print(f"Event: {event.kind} at {event.timestamp}")

    # access computed durations (nanoseconds, grouped by model/layer/copy)
    durations = request.get_profile_durations()
    print(f"Model: {durations.model_name}")
    print(f"Model durations (ns): {durations.model_durations}")
    for name, layer_dur in durations.layer_durations.items():
        print(f"  Layer '{name}': {layer_dur}")
    for name, copy_dur in durations.copy_durations.items():
        print(f"  Copy '{name}': {copy_dur}")

ProfileEvent

Profile event kinds (ProfileEventKind enum): MODEL_EXECUTE_BEGIN/END, LAYER_EXECUTE_BEGIN/END, LAUNCH_BEGIN/END, DEVICE_EXECUTE_BEGIN/END, COPY_BEGIN/END, QUEUE_BEGIN/END, WAIT_BEGIN/END.

ProfileEvent properties:

  • kind - ProfileEventKind
  • timestamp - Timestamp in microseconds (int)
  • operation - Operation index (for LAYER_EXECUTE and WAIT events)
  • thread_id - Thread ID (for LAYER_EXECUTE events)
  • tensor_id - Tensor ID (for COPY events)
  • source_device - Source DeviceID (for COPY events)
  • dest_device - Destination DeviceID (for COPY events)
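A short sketch that filters copy events using the properties above (the enum is assumed to be exposed as rt.ProfileEventKind):

for event in request.get_profile_events():
    # rt.ProfileEventKind namespace is assumed
    if event.kind == rt.ProfileEventKind.COPY_BEGIN:
        print(f"copy of tensor {event.tensor_id}: "
              f"{event.source_device} -> {event.dest_device}")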

ProfileDurations

request.get_profile_durations() returns a ProfileDurations object that automatically pairs begin/end events and computes durations:

  • model_name - Name of the model (str)
  • model_durations - List of model-level durations in nanoseconds (List[int])
  • layer_durations - Dict mapping operation names to duration lists in nanoseconds (Dict[str, List[int]])
  • copy_durations - Dict mapping tensor names to duration lists in nanoseconds (Dict[str, List[int]])

BasicProfiler Utility

For convenient profiling, use BasicProfiler:

from optimium.runtime.profile import BasicProfiler

profiler = BasicProfiler(
    "path/to/model",
    memory_optimization=True,
    disable_denormals=False,
    threads=1,
    cores=None,
    intermediate_save_path=""
)

profiler.profile(
    repeat=100,
    warmup=10,
    warmup_time=1000000,    # microseconds
)

# access statistics
print(f"Model: min={profiler.model_stat.min}, "
      f"max={profiler.model_stat.max}, "
      f"mean={profiler.model_stat.mean:.2f}")

for stat in profiler.layer_stats:
    print(f"  Layer {stat.name}: mean={stat.mean:.2f}")

# save results to JSON
profiler.dump_json("profile_results.json")

BasicProfiler can also connect to a remote server:

profiler = BasicProfiler(
    "path/to/model",
    remote_address="192.168.1.100",
    remote_port=32264
)

For multiple input/output batches, use BatchProfiler:

from optimium.runtime.profile import BatchProfiler

profiler = BatchProfiler("path/to/model")
profiler.profile(
    repeat=100,
    inputs=[batch1_inputs, batch2_inputs, ...],
    outputs=[batch1_outputs, batch2_outputs, ...]
)

# access per-batch and total statistics
for i, stat in enumerate(profiler.batch_stat):
    print(f"Batch {i}: mean={stat.model_stat.mean:.2f}")
print(f"Total: mean={profiler.total_stat.model_stat.mean:.2f}")

Intermediate Tensor Access

When memory_optimization=False or intermediate_save_path is set, you can access intermediate tensors:

model = rt.load_model("path/to/model", memory_optimization=False)
request = model.create_request()
request.infer(inputs, outputs)
request.wait()

# access intermediate tensor
intermediate = request.get_tensor("intermediate_tensor_name")
print(intermediate.to_numpy())

9. Remote Inference

Optimium Runtime supports remote inference via network connection:

import optimium.runtime as rt
from optimium.runtime import remote

# connect to remote server
session = remote.connect(
    "192.168.1.100",
    port=32264,
    enable_secure_connection=True,
    enable_compression=True
)

# check remote host info
print(f"Remote host: {session.host_info}")

# load model on remote
model = session.load_model(
    "path/to/model",
    threads=4,
    cores=[0, 1, 2, 3],
    scheduler_type=rt.SchedulerType.SIMPLE
)

# model info
print(f"Model: {model.name}")
for info in model.input_tensor_infos:
    print(f"  Input: {info.name} {info.shape} {info.type}")
for info in model.output_tensor_infos:
    print(f"  Output: {info.name} {info.shape} {info.type}")

# run profiling (asynchronous - call wait() after)
inputs = {"input_0": rt.tensor(shape=(1, 3, 224, 224), dtype=rt.float32)}
outputs = {"output_0": rt.tensor(shape=(1, 1000), dtype=rt.float32)}

model.profile(inputs, outputs, repeat=100, warmup=10)
model.wait()

# get profile events and durations
events = model.get_profile_events()
durations = model.get_profile_durations()

# get specific tensor result
result = model.get_tensor("output_0")

Remote model properties:

  • name - Model name
  • input_tensor_infos - Input tensor info list
  • output_tensor_infos - Output tensor info list
  • tensor_infos - All tensor infos
  • operations - Operation name list
  • is_dynamic - Whether the model has dynamic shapes

Remote model methods:

  • profile(inputs, outputs, repeat=1, *, warmup=0, warmup_time=0, stop_threshold=0, check_period=0, event_buffer_size=0) - Asynchronous profiling (accepts dict or sequence, with Tensor or numpy)
  • wait(timeout=0) - Wait for profiling to complete (timeout in microseconds)
  • cancel() - Cancel in-progress profiling
  • get_tensor(name) - Get tensor result after completion
  • get_profile_events() - Get raw profiling events
  • get_profile_durations() - Get computed ProfileDurations (nanosecond durations grouped by model/layer/copy)
  • is_saved_tensor(name) - Check if tensor is saved
  • infer_shape(output_name, input_shapes) - Infer output shape from input shapes (dict)

Remote session's load_model accepts these additional keyword arguments: devices, strict, memory_optimization, disable_denormals, intermediate_save_path, passphrase, threads, cores, scheduler_type.

10. Error handling

Optimium Runtime maps C++ exceptions to Python exceptions:

  • rt.InvalidArgumentError (inherits from ValueError)
  • rt.InvalidStateError
  • rt.InvalidOperationError
  • rt.TypeError (inherits from TypeError)
  • rt.ShapeError
  • rt.ExtensionError
  • rt.DeviceError
  • rt.ModelError
  • rt.RequestError
  • rt.InferError
  • rt.OutOfResourceError
  • rt.ContainerError
  • rt.RemoteError
  • rt.IOError (inherits from IOError)
  • rt.NetworkError (inherits from IOError)
  • rt.OSError (inherits from OSError)
  • rt.NotImplementedError (inherits from NotImplementedError)

All inherit from rt.RuntimeException (which inherits from RuntimeError).

try:
    model = rt.load_model("nonexistent/path")
except rt.ContainerError as e:
    print(f"Container error: {e}")
except rt.ModelError as e:
    print(f"Model error: {e}")

11. Loading extensions

You can load hardware backend extensions dynamically:

rt.load_extension("path/to/extension.so")
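A short sketch combining extension loading with the error handling described above (the extension path is illustrative):

try:
    rt.load_extension("path/to/extension.so")  # illustrative path
except rt.ExtensionError as e:
    print(f"failed to load extension: {e}")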