The new cases added for quantized matmuls are:
1. vec-vec
2. vec-mat
3. mat-vec
each of which are now lowered to expand(s), quantized_matmul, and
collapse.
1. onnx.MatMulInteger now converts to aten.matmul instead of aten.mm
2. aten.matmul, for ranks >=2, now allows quantized inputs and will
lower to linalg::quantized_matmul or linalg::quantized_batch_matmul.
3. added AtenMatmulOp to the FuseQuantizeOps rewrite patters
QuantizeOperands, QuantizeTransposedOperands, and QuantizeAccumulator
4. added several tests, including some to test AtenMmOp with varying
quantization signed-ness.
5. a quantized matmul mat-vec test is added to verify the failure to
lower to linalg; cleaned of out-of-date code related to common
torch-mlir lowering xfails.
6. in debugging a real model with quantized matmuls, I found a bug on
the scalarize-shapes pass which resulted from the aten.full op folder
returning an incompatible result type. This is fixed by the small change
here to
[lib/Dialect/Torch/IR/TorchOps.cpp](https://github.com/llvm/torch-mlir/compare/main...zjgarvey:torch-mlir:MatMulIntegerFix?expand=1#diff-dc8ed165c207918e606490eee3984b1ad51d7034e6aac36fc046bf47f6f03f4f).
1. Changes the linalg lowering for dequantization ops to always sign
cast to float to prevent misrepresenting uint32 overflow on subtraction
with zero point.
2. Adds a basic quantized model test which only quantizes and
dequantizes and now passes with these changes in linalg and onnx
configs.
3. Changes the aten.mm lowering to allow mismatched quantized types.
4. If a quantized matmul arg is uint8, we shift by 128 to faithfully
represent the quantization as a signed i8 quantization. This worked fine
in the AtenMmOp lowering, but I'd be happy to move it to a rewrite in
FuseQuantizedOps.cpp instead if that seems more appropriate.
With the changes 3 and 4, the QuantizedMLP_basic and
QuantizedSingleLayer_basic e2e tests now passes with the onnx config.
The `convertTensorToElementType` function expects it's argument to have
a valid tensor type that is not `Torch::NoneType`. This PR checks that
the bias tensor is not of type `Torch::NoneType` before calling
`convertTensorToElementType` on the bias tensor argument in the
`matchAndRewrite` member function of the `ConvertAtenConvolutionOp`
class.
See the related issues here:
[SHARK-Turbine#556](https://github.com/nod-ai/SHARK-Turbine/issues/556)
1. Adds uint8 casting to onnx.Cast op
2. Fixes an issue with onnx.DequantizeLinear when the scale comes with
shape [1].
3. Adds support for unsigned types in an AtenItemOp folder
4. Adds a simpler quantized model for easier debugging
5. Adds a fusion pass to convert [quant -> dequant -> transpose -> mm]
patterns to [transpose -> quant -> mm].
6. Moved some xfails that are still not passing, but for different
reasons than onnx.cast failures.
Leaning on the QDQ functionality in torch we can support the QLinearConv
operation by piggybacking through `torch.Convolution`. This includes
some changes such as allowing the `onnx` rewriter to run recursively.
Doing so allows `QLinearConv` to decopmose to `onnx.Convolution` which
is then lowered to `torch`.
Linalg has quantized specific operations. We can lower to these
operations when there is a known zeropoint and scale operations. This
allows the `convolution` to occur with lower bitwidth's, improving the
overall performance.
After noticing a number of commits with unrelated formatting changes,
I think something was changed with clang-format at one point and we're
seeing a number of unrelated changes. Doing a refresh can help avoid
this.
The changes made here came from
```
find lib -iname *.h -o -iname *.cpp | xargs clang-format -i --style=llvm
find include -iname *.h -o -iname *.cpp | xargs clang-format -i --style=llvm
find projects -iname *.h -o -iname *.cpp | xargs clang-format -i --style=llvm
```
This preserves sparsity at the most obvious places of lowering TORCH
tensors to MLIR RankedTensorType tensors. Other places are marked for
audit. With some initial lowering tests.
This includes custom op matching for decomposed operations and fusing
dequantization into dense operations. As a validation we compare
to the dequant+mm torch implementation.
convolution with [time,batch,channel] ordering, as opposed to the
default [batch, channel, time]. Currently implementing by transposing
the input and output, but may need to get its own implementation in the
future because this is supposed to be an op that gives a speedup. This
is used by fairseq
(https://github.com/facebookresearch/fairseq/issues/172).
(in case you were wondering like me, this is different from transposed
convolution. Transposed convolution has fractional strides).
---------
Co-authored-by: Xida Ren <xida.ren.dev@gmail.com>
Co-authored-by: Frederik Harwath <frederik.harwath@amd.com>
The linalg Op `linalg.conv_2d_ngchw_fgchw` had a bug where
1. Weights were accessed as G,F,C,H,W instead of as F,G,C,H,W
2. Output was accessed as N,F,G,H,W instead of as N,G,F,H,W
Now this has been fixed in
https://github.com/llvm/llvm-project/pull/73855 which broke the
torch-mlir lowering to that Op.
This patch switches lowering in torch-mlir to the newly introduced
`linalg.conv_2d_ngchw_gfchw` op which accesses weights in an order that
is compatible with PyTorch's memory layout.
Fix https://github.com/llvm/torch-mlir/issues/2622
Despite aten.mm requiring the input and output types match, we still opt
to maintain signedness semantics in case later passes try to do any sort
of integer type narrowing.
When importing dynamic shaped programs from Dynamo, via torch.compile or
torch.export, we can assume that strict symbolic shape checks have been
done prior to generating torch IR. Among other shape checking, this
eliminates the case where an unknown dimension can be dynamically '1' in
a way that signals a broadcast.
Adds a `isAssumingStrictSymbolicShapes` utility which consults a
`torch.assume_strict_symbolic_shapes` attribute on an enclosing scope
and returns true if present.
In the linalg pipeline, many runtime checks are elided when this returns
true.
This commit also adds the support for non-unit output padding in the
case of transposed convolution.
Signed-Off By: Vivek Khandelwal<vivek@nod-labs.com>
- Support for non-prefixed accessors has been removed. See:
https://reviews.llvm.org/D136727
- Rename `operands` to `methodOperands` in `prim.CallMethod` since the
name `operands` overlaps with a builtin method name. See:
https://reviews.llvm.org/D136727
- Add passes in refbackend to lower memref.subview. See:
https://reviews.llvm.org/D136377
- Replace `CopyToValueTensorOps` first in `RewriteViewLikeSubgraph` in
maximize-value-semantics.
The current implementation of the `RewriteViewLikeSubgraph` pass in
maximize-value-semantics creates temporarily invalid IR. In
particular, given a forward slice starting from a
`CopyToNonValueTensorOp` and ending in `CopyToValueTensorOp`s, the
pass first replaces all uses of the `CopyToNonValueTensorOp` with
its operand, which results in all the `CopyToValueTensorOp` users
having their operand have type `!torch.vtensor`, which is invalid.
The correct way to do things is to first replace all the
`CopyToValueTensorOp`s with their operand, and then replace all uses
of the `CopyToNonValueTensorOp` with its operand.
This only started failing now because the generated accessor
`getOperand` for the `CopyToValueTensorOp` now returns a
`TypedValue<NonValueTensorType>`, which has an assert checking that
the value returned is of the expected type.
Summary of changes:
- Replace call to `MemoryEffectOpInterface::hasNoEffect`
with `isMemoryEffectFree`.
- Make fix for the dynamic dims, since
`kDynamicSize` value changed to
`std::numeric_limits<int64_t>::min()` from `-1` in llvm
- `makeShapeLLVMCompatible` and `makeShapeTorchCompatible`
utilities convert shapes in order to remain consistent
with the Torch and MLIR semantics.
- Update tags
llvm: 147fe9de29dc13c14835127b35280c4d95c8e8ba
mhlo: 1944b5fa6062ec4c065d726c9c5d64f1487ee8c5
Signed-Off By: Vivek Khandelwal<vivek@nod-labs.com>
This commit renames the patterns used to match on lists of constant
values to `m_TorchListOfConstant{valueType}s`. This is needed to avoid
ambiguity for when `valueType` has `Optional` in it. In particular, it
makes it clear whether the values in the list are optional or the list
itself is optional.
This commit makes the following changes needed to update bump LLVM:
- Replace `linalg.init_tensor` with `tensor.empty` (see:
https://reviews.llvm.org/D135129)
- Replace `NoSideEffect` with `Pure` (see
https://reviews.llvm.org/D135505)
- Replace `body` region accessor for `ReduceOp` and `ReduceWindowOp`
with `getBody`
- Fix incorrect use of `tosa::ReduceSumOp` in `AtenNativeLayerNormOp`
conversion pattern. The result type of `tosa::ReduceSumOp` must have
the same rank as the input type. (see:
https://www.mlplatform.org/tosa/tosa_spec.html#_reduce_sum)
Co-authored-by: Ashay Rane <ashay@users.noreply.github.com>
Co-authored-by: Ashay Rane <ashay@users.noreply.github.com>
This commit removes the `weight` tensor from the inputs of one of the
`linalg.generic` ops generated by the `aten.convolution` linalg
lowering, since the indexed values are not actually used by the body
of the `linalg.generic`. Moreover, in general the `weight` tensor does
not have the same shape as the output tensor of the `linalg.generic`,
so both tensors being indexed by the same indexing maps is wrong.
Summary of changes:
- Updated references to the Arith dialect
(https://reviews.llvm.org/D134762)
- Switched to prefixed accessors for MemRef dialect
(https://reviews.llvm.org/D134995)
- Fixed warnings about signed/unsigned comparisons, ignored return
values, and unused variables
This commit adds decomposition of `aten.linear` op. Due to limited
support at tosa backend in case of dynamic dimensions, this
decomposition is currently disabled for tosa backend.
Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>
This commit lowers `aten.matmul` to `linalg.BatchMatmul` under the
following conditions:
1. The result of matrix multiplication must have batch dimensions,
i.e., rank greater than 2.
2. The resultant matrix must have at most 1 dynamic batch dimension.
It also handles broadcasting of batch dimensions when batch dimensions
of the matrices are broadcastable.
Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>
This helps keep things organized and also exposes more parallelism to
the build system. It seems though that most of the compile time is
actually spent in the headers though, so the wall time doesn't decrease
as much as I had hoped (and now that the headers are being included
multiple times, the cpu time actually increases a lot, sadly -- will try
to dig into this).