
[RFC][Quantization] Support quantized models from TensorflowLite #2351

@FrozenGene

Description


Let me first reference @ajtulloch 's comment about the quantization workflow:

  1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has highest throughput on most commonly-used training hardware.

  2. (optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time.

  3. Train the model as usual

  4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc) that rewrites the graph to target the int8 operators directly — i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to be a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by

  • calibration to an example set of activations, via e.g. l-p norm or kl minimization (c2/tf/mxnet/tensorrt)
  • using activation ranges learned during training (c2/tf).
  5. Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.

  6. Deploy the quantized graph.
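
Steps 2 and 4 above both come down to the same affine quantization arithmetic (q = round(x / scale) + zero_point). A minimal NumPy sketch of those numerics follows; the scale / zero-point values are illustrative only, not taken from any real model:

```python
import numpy as np

def quantize(x, scale, zero_point, dtype=np.uint8):
    """Map float values onto the 8-bit grid: q = round(x / scale) + zero_point."""
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point):
    """Recover approximate float values: x ~= scale * (q - zero_point)."""
    return scale * (q.astype(np.int32) - zero_point)

# Step 2 (simulated / fake quantization): round-trip a float activation through
# the 8-bit grid so the quantization error is visible at training time.
x = np.random.randn(8).astype(np.float32)
scale, zero_point = 0.05, 128          # illustrative; QAT / calibration picks the real values
x_sim = dequantize(quantize(x, scale, zero_point), scale, zero_point)

# Step 4 (requantization boundary): the int32 accumulator of an int8 conv is
# rescaled onto the output grid: q_out = round(acc * (s_in * s_w / s_out)) + z_out.
acc = np.array([1234, -567], dtype=np.int32)
s_in, s_w, s_out, z_out = 0.02, 0.01, 0.1, 128
q_out = np.clip(np.round(acc * (s_in * s_w / s_out)) + z_out, 0, 255).astype(np.uint8)
```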

However, some frameworks, such as Tensorflow, can already do step 1 -> step 5 well. For example, Tensorflow has quantization-aware training, which performs step 2 and ends up with good accuracy.
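
As a rough illustration of what step 2 looks like on the Tensorflow side, a fake-quantization node can be inserted at a quantization boundary. This is only a minimal TF 1.x-era sketch, not the exact graph that Tensorflow's QAT rewriting produces; the shapes and the [0, 6] range are made-up examples:

```python
import tensorflow as tf  # TF 1.x API, matching the era of this RFC

# A toy float Conv + Relu block (shapes are illustrative).
x = tf.placeholder(tf.float32, [1, 224, 224, 3], name="input")
w = tf.get_variable("w", shape=[3, 3, 3, 32])
y = tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME"))

# Simulated (fake) quantization after the fused Conv + Relu boundary: values are
# rounded onto an 8-bit grid and mapped back to float, exposing quantization
# error during training. Real QAT learns/observes the range; [0, 6] is assumed here.
y_fq = tf.quantization.fake_quant_with_min_max_args(y, min=0.0, max=6.0, num_bits=8)
```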

In industry, one common scenario is that a company splits algorithm work and engine / framework work across two different teams. The algorithm team simply hands a model to the engine team to boost its performance. So if the algorithm team uses Tensorflow's quantization-aware training, they already know the accuracy before delivering the model to the engine team, and the engine team is responsible only for boosting the performance.

For the reasons above, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement of #2116; it is a supplement to TVM's quantization.
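
To make the intended usage concrete, once these PRs land, consuming such a pre-quantized flatbuffer through the Relay TFLite frontend might look roughly like the sketch below. The file name, input tensor name, and target string are assumptions for illustration, and the flatbuffer parsing follows the generated tflite Python package:

```python
import tvm
from tvm import relay

# Load a pre-quantized TFLite flatbuffer produced by the algorithm team
# (file name is illustrative).
buf = open("mobilenet_v1_1.0_224_quant.tflite", "rb").read()

# Parse the flatbuffer with the tflite Python package.
import tflite.Model
model = tflite.Model.Model.GetRootAsModel(buf, 0)

# Import the quantized model into Relay; input name/shape/dtype are assumed
# to match the quantized MobileNet V1 release.
mod, params = relay.frontend.from_tflite(
    model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "uint8"},
)

# Compile for an ARM CPU target to compare INT8 against FP32.
target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target, params=params)
```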

After some initial investigation and effort, INT8 gives roughly a 30% speedup over FP32 for the MobileNet V1 model on ARM CPU.

Any feedback is welcome.
