Releases: microsoft/Olive
Olive-ai 0.10.1
Improvements and Bug Fixes
Olive-ai 0.10.0
New Features
- Quark Quantization for ONNX Models (#2236) — New `QuarkQuantization` pass via `olive run` with support for int8/uint8/int16/uint16/int32/uint32/bf16/bfp16 and CLE/SmoothQuant/AdaRound/AdaQuant.
- Embedding Quantization & RTN Improvements (#2238) — Added `QuantEmbedding`, a composable `Rtn` pass, and a unified checkpoint format aligned with `MatMulNBits`/`GatherBlockQuantized` (block/shape constraints enforced; AutoGPTQ/AutoAWQ export updated to 2D params).
- Word Embedding Tying Surgery (#2240) — `TieWordEmbeddings` ties input embeddings and `lm_head` for both unquantized (`Gemm`) and quantized (`MatMulNBits` + `GatherBlockQuantized`) graphs.
- Custom ONNX Model Naming (#2235) — Allows specifying a custom ONNX model name in the output directory.
- Intel OpenVINO Weight Compression Pass (#2180) — Adds NNCF-based weight compression for HF/ONNX models to OpenVINO or compressed ONNX.
Improvements
- AIMET Enhancements (#2158, #2187, #2215) — Adds Sequential MSE, enables AIMET in the `quantize` CLI, and supports manual precision overrides.
- GPTQ Updates (#2202, #2203) — Supports user-provided module overrides and `transformers >= 4.53`.
- Quantization Export Compatibility (#2218) — Updates checks for `ort-genai > 0.9.0` and fixes minor `OnnxDAG` name clashes.
- Torch Dynamo Export Alignment (#2185) — `extract_adapter` recovers folded LoRA and decomposes DoRA-fused `Gemm` into `MatMul` for quantization.
- Post-Surgery Deduplication (#2228) — Runs `DeduplicateHashedInitializersPass` after surgeries to remove duplicate initializers.
- QNN Execution Provider: GPU Enablement (#2220) — Enables QNN-EP GPU, updates `StaticLLM` and `ContextBinaryGeneration`, and keeps NPU as the default.
- Run API Ergonomics (#2199) — `olive.run()` now accepts a dict `run_config`; see the sketch after this list.
- OpenVINO Config Overrides (#2191) — Allows overriding `genai_config.json` properties in OV encapsulation.
- ReplaceAttentionMaskValue Robustness (#2213) — Adds `Shape` to `ALLOWED_CONSUMER_OPS` for text-encoder graphs.
- Implicit Olive Version Tagging (#2183) — Automatically embeds the Olive version in saved ONNX model protos.
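As a minimal sketch of the dict-based entry point from #2199: the workflow keys follow the usual Olive config layout, but the model path, pass choice, and output directory below are illustrative assumptions, not a prescribed recipe.

```python
# Minimal sketch of passing a plain dict to olive.run() (#2199) instead of a
# JSON/YAML config file path. Model path and pass selection are assumptions.
from olive import run

run_config = {
    "input_model": {
        "type": "HfModel",
        "model_path": "microsoft/Phi-3-mini-4k-instruct",  # assumed example model
    },
    "passes": {
        "conversion": {"type": "OnnxConversion"},  # convert the HF model to ONNX
    },
    "output_dir": "models/phi3-onnx",
}

workflow_output = run(run_config)  # previously this argument was a config file path
```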
Olive-ai 0.9.3
New Features:
- Compatibility with Windows ML for ONNX model inference and evaluation (#2052, #2056, #2059, #2084).
- `Gptq` quantization supports `lm_head` quantization and more generic weight packing (#2137).
Improvements
- `optimize` CLI supports the `WebGPU` execution provider (#2076) and the `NVTensorRtRTX` execution provider (#2078).
- `quantize` CLI supports the Gptq pass as an implementation (#2115).
- ONNX static quantization supports strided calibration data for lower memory usage (#2086).
- Extra options can be provided directly to the `ModelBuilder` pass (#2107).
- `LMEvaluator` has a new ORT backend with `IOBinding`, leading to a large runtime speedup (#2133).
- `OnnxFloatToFloat16` allows more granular control through `op_include_list` and `node_include_list` (#2134); a hedged sketch follows this list.
- `AIMET` quantization pass: support for excluded op types (#2055), pre-quantized models (#2111), LLM-augmented dataloaders (#2108), LPBQ (#2119), and AdaRound (#2140).
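To illustrate the granular control added in #2134, here is a hedged sketch of an `OnnxFloatToFloat16` pass entry in a workflow config; the op types and node name are illustrative assumptions, since real node names depend on the graph being converted.

```python
# Hedged sketch: an OnnxFloatToFloat16 pass entry using the include lists
# from #2134, written as a Python config dict. Values are illustrative.
fp16_pass = {
    "type": "OnnxFloatToFloat16",
    "op_include_list": ["MatMul", "Gemm"],                  # only convert these op types
    "node_include_list": ["/encoder/layer.0/attn/MatMul"],  # hypothetical node name
}
```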
Deprecation
As per the deprecation warning in the previous release, the following Azure ML related features have been removed:
- Azure ML system
- Azure ML resource types: model, datastore, job outputs.
- Remote workflow
- Azure ML artifact packaging
Other removed features include:
- `IsolatedORT System` (#2070)
- `Quantization Aware Training` (#2089)
- `AppendPrePostProcessingOps` pass (#2090)
- `SNPE` passes (#2098)
Recipes Migration
All recipes have been migrated to olive-recipes repository.
Olive-ai 0.9.2
New Features:
- Selective Mixed Precision. (#1898)
- Native GPTQ Implementation with support for Selective Mixed Precision. (#1949)
- Blockwise RTN Quantization for ONNX models. (#1899)
- Ability to add custom metadata in ONNX model. (#1900)
- New simplified `olive optimize` CLI command and the `olive.quantize()` Python API for effortless model optimization with minimal developer input; a hedged sketch follows this list. See the CLI usage and Python API docs for more details. (#1996)
- New `olive run-pass` command gives advanced users the ability to run individual passes. (#1904)
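A hedged sketch of the new `olive.quantize()` Python API from #1996 follows; the parameter names are assumptions for illustration only, so consult the Python API docs for the exact signature.

```python
# Hedged sketch of the olive.quantize() Python API (#1996). All parameter
# names below are assumptions; see the Python API docs for the real signature.
from olive import quantize

quantize(
    model_name_or_path="microsoft/Phi-3-mini-4k-instruct",  # assumed parameter name
    algorithm="rtn",                                        # assumed: quantization algorithm choice
    output_path="models/phi3-quantized",                    # assumed parameter name
)
```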
New Integrations
- GPTQModel. (#1999)
- AIMET (#2028). This is a work in progress.
- ONNX model support while targeting OpenVINO. (#2019)
- `QuarkQuantization`: AMD Quark quantization for LLMs. (#2010)
- `VitisGenerateModelLLM` for optimized LLM model generation for the Vitis AI Execution Provider. (#2010)
Improvements
- New graph surgeries including dla transformers, `DecomposeRotaryEmbedding`, and `DecomposeQuickGelu`. (#2018, #1972, #2000)
- Exposed `WorkflowOutput` in the Python API and added unified APIs for CLI commands. (#1907)
- Refactored Docker system for simplified setup and execution. (#1990)
- `ExtractAdapters`:
  - Added support for DoRA and LoHA adapters. (#1611)
- NVMO quantization:
- `OnnxPeepholeOptimizer`:
  - Removed `fuse_transpose_qat` and `patch_unsupported_argmax_operator`. (#1976)
Deprecation
Azure ML will be deprecated in the next release, including:
- Azure ML system
- Azure ML workspace model
- Remote workflow
Recipes Migration
All recipes are being migrated to the olive-recipes repository. New recipes will be added and maintained there going forward.
Olive-ai 0.9.1
Minor release to fix the following issues:
- OpenVINO Encapsulation pad_token_id fix (#1847)
- Add support for Nvidia TensorRT RTX execution provider in Olive (#1852)
- Basic support for ONNX auto EP selection introduced in onnxruntime v1.22.0 (#1854, #1863)
- Add Nvidia TensorRT-RTX Olive recipe for vit, clip and bert examples (#1858)
- Gate `optimum[openvino]` version to `<=1.24` (#1864)
Olive-ai 0.9.0
Feature Updates
- Implement lm-eval-harness based LLM quality evaluator for ONNX GenAI models #1720
- Update minimum supported target opset for ONNX to 17. #1741
- QDQ support for ModelBuilder pass #1736
- Refactor OnnxOpVersionConversion to conditionally use onnxscript version converter #1784
- HQQ Quantizer Pass #1799, #1835
- Introducing global definitions for Precision & PrecisionBits #1808
- Improvements in `OnnxPeepholeOptimizer` #1697, #1698
New Passes
- OnnxScriptFusion: ONNX script fusion
- OpenVINOEncapsulation, OpenVINOReshape, OpenVINOIoUpdate: OpenVINO encapsulation #1754
- TrtMatMulToConvTransform: Convert non-4D MatMul to Transpose-Conv-Transpose sequence
- OpenVINOOptimumConversion: Add optimum Intel® pass for converting a Huggingface Model to an OpenVINO Model
- Graph Surgeries
- MatMulAddGemm: Graph surgery to fuse a MatMul op followed by an Add op into a Gemm op
- PowReduceSumPowDiv2LpNorm: Graph surgery to merge the Pow → ReduceSum → Pow → Div pattern into L2Norm
- OnnxHqqQuantization: Implements 4-bit HQQ quantization
- VitisAIAddMetaData: Adds metadata to an ONNX model based on specified model attributes.
New/Updated Examples
- Alibaba-NLP/gte #1695
- DeepSeek
  - OpenVINO #1786
- Google BERT
- Google VIT
- Intel BERT
- Laion Clip
- Llama3
  - OpenVINO #1786
- Meta Llama3
  - QDQ #1707
- OpenAI Clip (16 and 32)
- Phi3.5
- Phi4
  - OpenVINO #1828
- Qwen
- Resnet50
- Sentence Transformers CLIP
- Stable Diffusion
  - QDQ #1730
Deprecated Examples
Deprecated Passes
- InsertBeamSearchOp #1805
Olive-ai 0.8.0
New Features (Passes)
- `QuaRot` performs offline weight rotation.
- `SpinQuant` performs offline weight rotation.
- `StaticLLM` converts a dynamic-shaped LLM into a static-shaped LLM for NPUs.
- `GraphSurgeries` applies surgeries to an ONNX model. Surgeries are modular and individually configurable; see the sketch after this list.
- `LoHa`, `LoKr`, and `DoRA` finetuning.
- `OnnxQuantizationPreprocess` applies quantization preprocessing.
- `EPContextBinaryGenerator` creates EP-specific context binary ONNX models.
- `ComposeOnnxModels` composes split ONNX models.
- `OnnxIOFloat16ToFloat32` replaced with the more generic `OnnxIODataTypeConverter`.
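For illustration, here is a hedged sketch of how a `GraphSurgeries` pass entry might look in a workflow config, with each surgery as its own modular, individually configured entry; the surgeon name and its options are illustrative assumptions rather than a verified schema.

```python
# Hedged sketch of a GraphSurgeries pass entry in an Olive workflow config
# dict. Each element of "surgeries" selects one modular surgeon plus its own
# options; the surgeon and option names below are assumptions.
workflow_config = {
    "input_model": {"type": "ONNXModel", "model_path": "model.onnx"},
    "passes": {
        "surgery": {
            "type": "GraphSurgeries",
            "surgeries": [
                {"surgeon": "RenameInputs", "old_names": ["x"], "new_names": ["input_ids"]},  # assumed options
            ],
        }
    },
}
```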
Command Line Interface
New command line tools have been added and existing tools have been improved.
- `generate_config_file` option to save the workflow config file.
- `extract-adapters` command to extract multiple adapters from a PyTorch model.
- Simplified `quantize` command.
Improvements
- Better output model structure for workflow and CLI runs.
- New `no_artifacts` option in the workflow config to disable saving run artifacts such as footprints.
- Hf data preprocessing:
  - Dataset is truncated if `max_samples` is set.
  - Empty texts are filtered out.
  - `padding_side` is configurable and defaults to `"right"`.
- `SplitModel` pass keeps QDQ nodes together in the same split.
- `OnnxPeepholeOptimizer`: constant folding + onnxoptimizer added.
- `CaptureSplitInfo`: separate split for memory-intensive modules.
- `OnnxConversion`:
  - Dynamic shapes for dynamo export.
  - `optimize` option to perform constant folding and redundancy elimination on dynamo-exported models.
- `GPTQ`: default wikitext calibration dataset. Patch to support newer versions of `transformers`.
- `MatMulNBitsToQDQ`: `nodes_to_exclude` option.
- `SplitModel`: `split_assignments` option to provide custom split assignments.
- `CaptureSplitInfo`: `block_to_split` can be a single block (str) or multiple blocks (list).
- `OnnxMatMul4Quantizer`: support onnxruntime 1.18+.
- `OnnxQuantization`:
  - Support onnxruntime 1.18+.
  - `op_types_to_exclude` option.
- `LLMAugmentedDataLoader` augments the calibration data for LLMs with kv cache and other missing inputs.
- New document theme and organization.
- Reimplement search logic to include passes in search space.
Examples:
- New QNN EP examples:
  - SLMs:
    - Phi-3.5
    - Deepseek R1 Distill
    - Llama 3.2
  - MobileNet
  - ResNet
  - CLIP VIT
  - BAAI/bge-small-en-v1.5
  - Table Transformer Detection
  - adetailer
- SLMs:
  - Deepseek R1 Distill Finetuning
- `timm` MobileNet
Olive-ai 0.7.1.1
Same as 0.7.1 with updated dependencies for nvmo extra and NVIDIA TensorRT Model Optimizer example doc.
Refer to the 0.7.1 release notes for other details.
Olive-ai 0.7.1
Command Line Interface
New command line tools have been added and existing tools have been improved.
- `olive --help` works as expected.
- `auto-opt`:
  - The command chooses a set of passes compatible with the provided model type, precision, and accelerator information.
  - New options to split a model, using either `--num-splits` or `--cost-model`.
Improvements
- `ExtractAdapters`:
  - Support LoRA adapter nodes in Stable Diffusion unet or text-embedding models.
  - Default initializers for quantized adapters to run the model without adapter inputs.
- `GPTQ`:
  - Avoid saving unused bias weights (all zeros).
  - Set `use_exllama` to `False` by default to allow exporting and fine-tuning external GPTQ checkpoints.
- `AWQ`: Patch autoawq to run quantization on newer transformers versions.
- Atomic `SharedCache` operations.
- New `CaptureSplitInfo` and `Split` passes to split models into components. The number of splits can be user-provided or inferred from a cost model.
- `disable_search` is deprecated from pass configuration in an Olive workflow config.
- `OrtSessionParamsTuning` redone to use Olive search features.
- `OrtModelOptimizer` renamed to `OrtPeepholeOptimizer`, plus some bug fixes.
Examples:
- Stable Diffusion: New MultiLora Example
- Phi3: New int quantization example using `nvidia-modelopt`
Olive-ai 0.7.0
Command Line Interface (CLI)
Introducing a new command line interface for Olive that executes well-defined, concrete workflows without the user ever having to create or edit a config manually. CLI workflow commands can be chained, i.e., the output of one execution can be fed as input to the next, to streamline the entire pipeline (see the sketch after the list below). Below is a list of a few CLI workflow commands:
- finetune: Fine-tune a model on a dataset using peft and optimize the model for ONNX Runtime.
- capture-onnx-graph: Capture ONNX graph for a Huggingface model.
- auto-opt: Automatically optimize a model for performance.
- quantize: Quantize model using given algorithm for desired precision and target.
- tune-session-params: Automatically tune the session parameters for an ONNX model.
- generate-adapter: Generate ONNX model with adapters as inputs.
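A sketch of the chaining described above, where the output of one command feeds the next; the flags, paths, and command pairing are assumptions for illustration, so run `olive <command> --help` for the actual options.

```python
# Illustrative chaining of CLI workflow commands via subprocess: the output
# directory of the first command is fed as the input model of the second.
# The -m/-o/-d flags and values are assumptions, not verified options.
import subprocess

# 1. fine-tune a model on a dataset and export it for ONNX Runtime
subprocess.run(
    ["olive", "finetune", "-m", "microsoft/phi-2", "-d", "my-dataset", "-o", "finetuned"],
    check=True,
)
# 2. feed the fine-tuned output into generate-adapter to expose adapters as inputs
subprocess.run(
    ["olive", "generate-adapter", "-m", "finetuned", "-o", "with-adapter"],
    check=True,
)
```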
Improvements
- Added support for yaml based workflow config
- Streamlined DataConfig management
- Simplified workflow configuration
- Added shared cache support for intermediate models and supporting data files
- Added QuaRoT quantization pass for PyTorch models
- Added support to evaluate generative PyTorch models
- Streamlined support for user-defined evaluators
- Enabled use of lm-evaluation-harness for generative model evaluations
Examples
- Llama
- Updated multi-lora example to use the ORT generate() API
- Updated to demonstrate use of shared cache
- Phi3
- Updated to demonstrate evaluation using lm-eval harness
- Updated to showcase search across three different QLoRA ranks
- Added Vision tutorial