
[RFC][Quantization] Support quantized models from TensorflowLite #2351

@FrozenGene

Description


Let me first reference @ajtulloch 's comment about the quantization workflow:

  1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has highest throughput on most commonly-used training hardware.

  2. (optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time.

  3. Train the model as usual

  4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc) that rewrites the graph to target the int8 operators directly — i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to be a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by

  • calibration to an example set of activations, via e.g. l-p norm or kl minimization (c2/tf/mxnet/tensorrt)
  • using activation ranges learned during training (c2/tf).
  5. Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.

  6. Deploy the quantized graph.
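
Steps 2 and 4 above both come down to the same affine quantization arithmetic (q = round(x / scale) + zero_point). A minimal NumPy sketch of those numerics follows; the scale / zero-point values are illustrative only, not taken from any real model:

```python
import numpy as np

def quantize(x, scale, zero_point, dtype=np.uint8):
    """Map float values onto the 8-bit grid: q = round(x / scale) + zero_point."""
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point):
    """Recover approximate float values: x ~= scale * (q - zero_point)."""
    return scale * (q.astype(np.int32) - zero_point)

# Step 2 (simulated / fake quantization): round-trip a float activation through
# the 8-bit grid so the quantization error is visible at training time.
x = np.random.randn(8).astype(np.float32)
scale, zero_point = 0.05, 128          # illustrative; QAT / calibration picks the real values
x_sim = dequantize(quantize(x, scale, zero_point), scale, zero_point)

# Step 4 (requantization boundary): the int32 accumulator of an int8 conv is
# rescaled onto the output grid: q_out = round(acc * (s_in * s_w / s_out)) + z_out.
acc = np.array([1234, -567], dtype=np.int32)
s_in, s_w, s_out, z_out = 0.02, 0.01, 0.1, 128
q_out = np.clip(np.round(acc * (s_in * s_w / s_out)) + z_out, 0, 255).astype(np.uint8)
```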

However, some frameworks, such as Tensorflow, can already do step 1 -> step 5 well. For example, Tensorflow has quantization-aware training, which performs step 2 and ends up with good accuracy.
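
As a rough illustration of what step 2 looks like on the Tensorflow side, a fake-quantization node can be inserted at a quantization boundary. This is only a minimal TF 1.x-era sketch, not the exact graph that Tensorflow's QAT rewriting produces; the shapes and the [0, 6] range are made-up examples:

```python
import tensorflow as tf  # TF 1.x API, matching the era of this RFC

# A toy float Conv + Relu block (shapes are illustrative).
x = tf.placeholder(tf.float32, [1, 224, 224, 3], name="input")
w = tf.get_variable("w", shape=[3, 3, 3, 32])
y = tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME"))

# Simulated (fake) quantization after the fused Conv + Relu boundary: values are
# rounded onto an 8-bit grid and mapped back to float, exposing quantization
# error during training. Real QAT learns/observes the range; [0, 6] is assumed here.
y_fq = tf.quantization.fake_quant_with_min_max_args(y, min=0.0, max=6.0, num_bits=8)
```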

In industry, one common scenario is that a company splits algorithm work and engine / framework work across two different teams. The algorithm team simply hands a model to the engine team to boost its performance. So if the algorithm team uses Tensorflow's quantization-aware training, they already know the accuracy before delivering the model to the engine team, and the engine team is responsible only for boosting the performance.

For the reasons above, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement of #2116; it is a supplement to TVM's quantization.
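
To make the intended usage concrete, once these PRs land, consuming such a pre-quantized flatbuffer through the Relay TFLite frontend might look roughly like the sketch below. The file name, input tensor name, and target string are assumptions for illustration, and the flatbuffer parsing follows the generated tflite Python package:

```python
import tvm
from tvm import relay

# Load a pre-quantized TFLite flatbuffer produced by the algorithm team
# (file name is illustrative).
buf = open("mobilenet_v1_1.0_224_quant.tflite", "rb").read()

# Parse the flatbuffer with the tflite Python package.
import tflite.Model
model = tflite.Model.Model.GetRootAsModel(buf, 0)

# Import the quantized model into Relay; input name/shape/dtype are assumed
# to match the quantized MobileNet V1 release.
mod, params = relay.frontend.from_tflite(
    model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "uint8"},
)

# Compile for an ARM CPU target to compare INT8 against FP32.
target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target, params=params)
```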

After some initial investigation and effort, INT8 gives roughly a 30% speedup over FP32 for the MobileNet V1 model on ARM CPU.

Any feedback is welcome.
