This is a large change because prior to this point, Python files in the
project were not consistently formatted. This reformats them all with
black defaults.
Based on experience with prior projects, if you have a dev/long-term
branch with Python patches, you can minimize merge conflicts prior to
rebasing to include this commit by running `black` on your modified
Python files, squashing, and then rebasing/merging.
A choice was made to quantize the return type of Relu with a scale and
zero point copied from the input's quantization scheme. With this
choice, the torch-to-linalg conversion of quantized Relu essentially
computes max(input, zeroPoint) in the elementwise payload.
* promote input to output element-type when lowering to stablehlo, so
that it could satisfy stablehlo's type constraints.
* split promote-to-fp unary ops from fp-only unary ops.
This commit also cleans up the OnnxToTorch lowering for the Squeeze and
Unsqueeze op and adds the support for handling edge cases.
Signed-Off By: Vivek Khandelwal <vivekkhandelwal1424@gmail.com>
Version number was set too high. Lowered to support more cases allows
more tests to pass.
Co-authored-by: Robert Suderman <rsuderman@Roberts-MacBook-Pro.local>
Previous implementation erroneously mixed up num_outputs with
slice_size. New version correctly computs the slice size and directly
performs slicing rather than leveraging `aten.split.tensor`. This is due
to `onnx` supporting a fixed number of splits making the size
computation more easily computeable when lowering to `aten` rather than
deferring to `aten.split.tensor`.
---------
Co-authored-by: Robert Suderman <rsuderman@Roberts-MacBook-Pro.local>
We can map to `tensor.reshape` for handling multiple output dynamic
shapes. Later we can perform a more complex analysis for indentifying
expand/collapse cases from the tensor.reshape.
Initially we planned to handle this identification at the `torch` level
however it will be easier to handle once converted to core
mlir-dialects.
Decomposition RepeatInterleaveSelfInt with following ops:
```python
def my_repeat_interleave(input, repeats, dim=None):
if dim is None:
# Flatten the input and then repeat
return input.flatten().unsqueeze(-1).tile((1, repeats)).flatten()
else:
# Calculate the shape after repeat
expanded_shape = list(input.shape)
expanded_shape[dim] *= repeats
# Repeat the tensor along the specified dimension
repeat_shape = [1] * (input.dim() + 1)
repeat_shape[dim + 1] = repeats
input = input.unsqueeze(-1)
# Tile and then reshape
tiled = torch.tile(input, repeat_shape)
# Rearrange and reshape
repeated = tiled.reshape(*expanded_shape)
return repeated
```
I passed the tests of stablehlo and linalg. When testing onnx, strange
things happened.
In torch-mlir's CI **torch_nightly** and my own
environment(torch==2.4.0.dev20240318+cpu), it can **pass the pass**.
In torch-mlir's CI **torch_stable**, it **failed**.
The test case is `RepeatInterleaveSelfIntNoDimModule_basic`, the result
shape should be [120].
```python
class RepeatInterleaveSelfIntNoDimModule(torch.nn.Module):
def __init__(self):
super().__init__()
@export
@annotate_args([
None,
([3, 4, 5], torch.float32, True),
])
def forward(self, x):
return x.repeat_interleave(2)
@register_test_case(module_factory=lambda: RepeatInterleaveSelfIntNoDimModule())
def RepeatInterleaveSelfIntNoDimModule_basic(module, tu: TestUtils):
module.forward(tu.rand(3, 4, 5))
```
The error log is as follows:
```
Unexpected outcome summary: (onnx)
****** Failed tests - 1 tests
FAIL - "RepeatInterleaveSelfIntNoDimModule_basic"
@ trace item #0 - call to "forward"
@ output of call to "forward"
ERROR: shape (torch.Size([6, 4, 5])) is not equal to golden shape (torch.Size([120]))
```
@rsuderman
Would you please help me check what's wrong with my PR? Thanks a lot.
The new cases added for quantized matmuls are:
1. vec-vec
2. vec-mat
3. mat-vec
each of which are now lowered to expand(s), quantized_matmul, and
collapse.
1. onnx.MatMulInteger now converts to aten.matmul instead of aten.mm
2. aten.matmul, for ranks >=2, now allows quantized inputs and will
lower to linalg::quantized_matmul or linalg::quantized_batch_matmul.
3. added AtenMatmulOp to the FuseQuantizeOps rewrite patters
QuantizeOperands, QuantizeTransposedOperands, and QuantizeAccumulator
4. added several tests, including some to test AtenMmOp with varying
quantization signed-ness.
5. a quantized matmul mat-vec test is added to verify the failure to
lower to linalg; cleaned of out-of-date code related to common
torch-mlir lowering xfails.
6. in debugging a real model with quantized matmuls, I found a bug on
the scalarize-shapes pass which resulted from the aten.full op folder
returning an incompatible result type. This is fixed by the small change
here to
[lib/Dialect/Torch/IR/TorchOps.cpp](https://github.com/llvm/torch-mlir/compare/main...zjgarvey:torch-mlir:MatMulIntegerFix?expand=1#diff-dc8ed165c207918e606490eee3984b1ad51d7034e6aac36fc046bf47f6f03f4f).
- Added linalg lowering for `AtenFloorDivideScalarOp`
- Needed `AtenDivScalarModeOp` for the decomp.
- Added linalg lowering for `AtenDivScalarModeOp`
- Moved linalg payload logic to `createDivModePayload()` since the logic
was nearly identical for both `AtenDivScalarModeOp` and
`AtenDivTensorModeOp`. Just a template function
- Added `AtenDivScalarModeOp` lowering for stablehlo
Pytorch's
[`torch.floor_divide()`](https://pytorch.org/docs/stable/generated/torch.floor_divide.html)
in a previous version (for a reason unknown to me) preformed a
truncation instead of "floor". The already implemented op
`AtenFloorDivideTensorOp` was done before this change. However, this
wasn't caught because our testcases only tested positive floor division.
I changed this to floor as well as adding a few test cases.
If there is only a single value scattered there can be an implicit batch
dimension. This includes a check for the implicit batch dimension when
reshaping the update tensor. It includes an e2e test to verify
correctness.
Fix the case PrimListUnpackOp's result num is not equal to PrimList
length.
See the following example:
```python
def forward(self, x):
if len(x.shape) == 5:
b0, t, c0, h0, w0 = x.shape
b, c, h, w = torch.mul(b0, t), c0, h0, w0
else:
b1, c1, h1, w1 = x.shape
b, c, h, w = b1, c1, h1, w1
res = torch.reshape(x, [b, c, h, w])
return res
```
Without this fix, the following error message will occur:
```
/root/torch-mlir/externals/llvm-project/mlir/lib/IR/PatternMatch.cpp:118: virtual void mlir::RewriterBase::replaceOp(mlir::Operation *, mlir::ValueRange): Assertion `op->getNumResults() == newValues.size() && "incorrect # of replacement values"' failed.
```
Previously, it could only handle the situations where outputsize == (1,
1) or outputsize == (input_H, input_W). Now it supports all situations
where input_H % output_H== 0 && input_W % output_W == 0
1. Changes the linalg lowering for dequantization ops to always sign
cast to float to prevent misrepresenting uint32 overflow on subtraction
with zero point.
2. Adds a basic quantized model test which only quantizes and
dequantizes and now passes with these changes in linalg and onnx
configs.
3. Changes the aten.mm lowering to allow mismatched quantized types.
4. If a quantized matmul arg is uint8, we shift by 128 to faithfully
represent the quantization as a signed i8 quantization. This worked fine
in the AtenMmOp lowering, but I'd be happy to move it to a rewrite in
FuseQuantizedOps.cpp instead if that seems more appropriate.
With the changes 3 and 4, the QuantizedMLP_basic and
QuantizedSingleLayer_basic e2e tests now passes with the onnx config.
This PR only performs a lit test. In lieu of an e2e test, https://github.com/nod-ai/SHARK-TestSuite/pull/142 makede sure that the lowering works & the numbers check out.
Co-authored-by: Xida Ren <xida.ren.dev@gmail.com>
Shapes can be processed as tensors to represent the set of dimensions.
As reshapes take a list of scalars this can result in a single dynamic
dimension blocking the adjacent static dimensions.
This pass attempts to de-couple tensor computations related to shapes
and propagate values to better support lowering scalar tensor
computations.
There is an issue with stablehlo's linalg compilation. Canonicalization
appears to cleanup the issues until we can determine what in
mlir/stablehlo is the source of the issue.
See the related issues here:
[SHARK-Turbine#556](https://github.com/nod-ai/SHARK-Turbine/issues/556)
1. Adds uint8 casting to onnx.Cast op
2. Fixes an issue with onnx.DequantizeLinear when the scale comes with
shape [1].
3. Adds support for unsigned types in an AtenItemOp folder
4. Adds a simpler quantized model for easier debugging
5. Adds a fusion pass to convert [quant -> dequant -> transpose -> mm]
patterns to [transpose -> quant -> mm].
6. Moved some xfails that are still not passing, but for different
reasons than onnx.cast failures.
Two e2e tests (AdaptiveAveragePool1/2dUnitOutputSizeDynamic) were
failing due to numerics. This was as a result of passing -1 as the
kernel size in the lowering for the corresponding onnx op
GlobalAveragePool.
Reshaping tensors depend on directly matching individual dimensions to
their corresponding dim in the `torch.view` reshape dimensions. This
involves decoupling dynamic dimensions from their static counterparts
and support cleanup / canonicalization.
This commit adds the OnnxToTorch lowering for the Mish, Softplus,
HardSwish, Trilu, ThresholdedRelu op
Signed-Off By: Vivek Khandelwal <vivekkhandelwal1424@gmail.com>
The previous conversions for AtenAdaptiveAvgPool1dOp and
AtenAdaptiveMaxPool2dOp are refactored into a general templated
conversion that works for all of the AtenAdaptive...PoolNdOp's.
New support is added for the following ops:
1. AtenAdaptiveMaxPool1d
2. AtenAdaptiveMaxPool3d
3. AtenAdaptiveAvgPool3d
Support is also provided for passing inputs without batch dimensions.
For example, applying adaptive_avg_pool2d to an input tensor of rank 3.
After [pytorch #118162](https://github.com/pytorch/pytorch/pull/118162)
gets down to torch-mlir, I'll add a test for AdaptiveMaxPool1d with
return_indices (which will pass with that upstream fix).
---------
Co-authored-by: James Newling <james.newling@gmail.com>
This adds support for converting DynamicQuantizeLinear from torch-onnx
to torch.
I could not get an e2e test to pass, since there seems to be some issues
with uint8 casting somewhere lower in the pipeline. For example
compiling with IREE for llvm-cpu, I would get either the correct zero
point (if zp < 128) or the correct zero-point minus 256 (if zp >= 128).
The output tensor seems to always return a tensor of zeros, which also
occurs when running uint8 examples through QuantizeLinear.
Edit: the first problem can be resolved by casting the output back to
uint8 on output, the second problem is resolved with PR #3018
Added support for dynamic shapes in `flattenusingints` op in tosa
dialect. Due to this some Argmax tests pass
This PR fixes this issue https://github.com/llvm/torch-mlir/issues/3004
The following tests pass after this PR
```
1. "ArgmaxIntModule_basic"
2. "ArgmaxIntModule_multiple_maxs"
3. "ArgmaxModule_basic"
```
Reduce mean lowerings did not succesfully lower to `linalg` via torched.
There were two separate paths that could be consolidated to a single
simpler pass. This resulted in a significant improvement in test
coverage.
If the broadcast shape is length-1 at a dim while `?` in the input dim
then we need to broadcast to the dynamic dim. This is equivalent to
taking a max of two dimensions.
This folds small version of the tensor-scalar comparison operators as
they are commonly used for shape computations. This includes le, lt, ge,
gt, eq, and ne.
The current padding operation was not functional for dynamic shapes.
Updated and enabled tests so that onnx.pad tests pass.
Work TBD for reflection padding.
Set PyTorch and TorchVision version to nightly release 2024-03-07.
This commit also removes the deprecated constraints API:
342e7929b8
Signed-Off By: Vivek Khandelwal <vivekkhandelwal1424@gmail.com>
This mostly copy-pastes the reduce minimum implementation to reduce max
to improve test coverage. We also improve the aten lowering for min/max
dim for unsigned types.
The addition of an e2e test is actually provided in the Shark-Testsuite.
This adds 2 test cases for the gridsampler e2e test.
Also as intended there were some items found which needed correction, so
the Gridsampler op has also a change.
Current implementation depends on using `aten.view` which has issues
inferring tensor collapse/expand operations during the lowering to
`linalg`. Using flatten and unsqueeze better infers what the later
reshape behavior.
Add e2d support for `aten.linalg_norm` by decompose it to
`aten.linalg_vector_norm`.
Lowering to `aten.linalg_matrix_norm` is still unsupported.
To Test:
`python -m e2e_testing.main -v`
---------
Co-authored-by: Ze Zhang <ze.zhang@getcruise.com>
A bunch of small fixes are interlinked and trigger crashes if not
addressed as a group. This includes:
- aten view when expand from a rank-0 tensor
- slice folder with negative indices
- `aten._shape_as_tensor` folder on a rank-0 tensor
- `aten.cat` of a tensor with a length-0 tensor
The corrective transpose at the end is computed incorrectly. Is it
actually computin the inverse transpose. Inverting the permutations
fixes the issue.
Torch lowering only supported the most recent version. Refactored the
lowering so more easily handle default values and optional operands /
attributes.
Added Support for float dtype in in torch.arange in TOSA Dialect
This resolves the following issue :-
https://github.com/llvm/torch-mlir/issues/2762
The following test cases are passing after this change
1. ArangeDtypeIntModule_basic
2. ArangeFloatModule_basic
3. ArangeNegativeStartFloatModule_basic
4. ArangeStartFloatModule_basic
5. ArangeStartNegativeStepFloatModule_basic
6. ArangeStartOutDtypeModule_basic
7. ArangeStartStepFloatModule_basic
---------
Co-authored-by: James Newling <james.newling@gmail.com>
There is no reason to treat `ConstantOfShape` as a specialized import
any as there exists a onnx-to-torch equivalent. Dropping the import
coding and adding support for resource conversion substantially
increases test coverage for dynamically shaped tests.
Strided slicing can occur with a negative stride. In these cases we need
to bound end differently. This included removing a function that was
generating bad limits.
Onnx slice lowering used arange needlessly instead of directly
constructing the constant dimension values. This makes lowerings to
linalg struggle as multiple folders are required to get what is a
constant index value.
We can route the torch tests via `onnx` using the `torch.onnx.export`
tooling. We can then reimport, lower to torch, and compile to linalg to
validate the onnx path is working correctly.
The current implementation exposes some failures in the `onnx` path so
we cannot enable the onnx test suite yet due to segmentation faults.
Some operations include a backend matcher for specialized operations. We
map these back to generics so they appropriately match to the high
performance versions. This is done for the attention operation.
There is no lowering support for math::AbsIOp, so if the operand is an
integer type, it will fail to lower to math::AbsFOp since the op operand
#0 must be floating-point-like.
This adds a few passes that will ensure linalg with sparse tensors are
properly lowered to loops and can run using the ExecutionEngine for
testing (a few details on parameter passing from PyTorch still TBD)
Test results:
$ ./tools/e2e_test.sh --config linalg
Summary:
Passed: 1144
Expectedly Failed: 8
$ python -m e2e_testing.main --config=torchdynamo -v
Summary:
Passed: 960
Expectedly Failed: 163
Filed issue:
https://github.com/pytorch/pytorch/issues/119407
So that the CumSum Op in OPT can get the constant that it requires to be lowered to TMTensor
---------
Co-authored-by: Rob Suderman <rob.suderman@gmail.com>
Co-authored-by: Xida Ren <xida.ren.dev@gmail.com>
Linalg has quantized specific operations. We can lower to these
operations when there is a known zeropoint and scale operations. This
allows the `convolution` to occur with lower bitwidth's, improving the
overall performance.
This includes custom op matching for decomposed operations and fusing
dequantization into dense operations. As a validation we compare
to the dequant+mm torch implementation.
convolution with [time,batch,channel] ordering, as opposed to the
default [batch, channel, time]. Currently implementing by transposing
the input and output, but may need to get its own implementation in the
future because this is supposed to be an op that gives a speedup. This
is used by fairseq
(https://github.com/facebookresearch/fairseq/issues/172).
(in case you were wondering like me, this is different from transposed
convolution. Transposed convolution has fractional strides).
---------
Co-authored-by: Xida Ren <xida.ren.dev@gmail.com>
Co-authored-by: Frederik Harwath <frederik.harwath@amd.com>
Handle both `torch.dequantize` and `torch.quantize_per_tensor` including
the op based quantization parameter tracking. This includes adding
`qint32` to torch types as it was missing during the initial type
inclusion.
For testing we only have `torch.int8` and `torch.float` types on
function boundaries as the `qint8` types require passing the scale
and zero point quantization information which is not supported yet.
This PR updates the torch-to-tosa conversion with following changes:
- Support torch.none as min/max input argument for tosa.clamp op
- Support negative value as start index for tosa.slice op
- Add tosa.logical_or lowering support
e2e test:
python -m e2e_testing.main --config=tosa
LIT tests:
cmake --build build --target tools/torch-mlir/all
---------
Co-authored-by: Ze Zhang <ze.zhang@getcruise.com>
Adaptive pooling ops can only be decomposed into their non-adaptive
counterparts in trivial cases.
For example, the current decomposition for AtenAdaptiveAvgPool1dOp in
DecomposeComplexOps.cpp supports outSize = inSize (i.e., do literally
nothing), and outSize = 1 (i.e., do a batched average).
The reason adaptive pooling ops are difficult to lower to linalg is that
they are not constantly strided. They are computed by taking an input
tensor of shape (N, C, Hin), and an output size Hout, and computing the
output tensor at position (n,c, h) in the following way:
1. compute st(h) = (h*Hin)//Hout
2. compute en(h) = 1 + ((h+1)*Hin -1)//Hout
3. apply a computation (max or avg) to the slice: INPUT[n, c,
st(h):en(h)]
The provided sample implementation (for ConvertAtenAdaptiveAvgPool1dOp)
uses tensor.extract to access the input tensor inside the payload of a
linalg generic op. This is likely an unattractive use of linalg generic
ops, which is why I am asking for some more targeted feedback on the
validity of this approach before attempting to support the many other
adaptive pooling ops.
Specifically:
- Is the performance of this implementation bad enough to warrant
targeting different dialects entirely? e.g. TMtensor/linalg ext/ etc.
- If the provided implementation is of acceptable performance to the
community, then is it permissable to remove the Adaptive pooling
decompositions from DecomposeComplexOps.cpp? Based on the current
structure of the -torch-decompose-complex-ops pass, it does not seem
possible to only decompose the adaptive ops in special cases (it seems
to get stuck in an infinite loop on a match failure). I would be happy
to instead incorporate the case logic into the conversion directly, and
remove the decompositions once they are rendered completely obsolete.
As long as this approach is acceptable, I can clean up the
implementation with some helper functions, and quickly add support for
each of the remaining Adaptive pooling ops.