This includes custom op matching for decomposed operations and fusing
dequantization into dense operations. As a validation we compare
to the dequant+mm torch implementation.
The logic here is very similar to the conversion for AdaptiveAvgPool1d
#2661 with a few modifications:
1. buffVal = -inf instead of 0
2. the main linalg generic op accumulates a max, instead of a sum, to
the first output tensor
3. avg pooling requires dividing the sum pool by the kernel width, which
we stored as an auxilliary tensor (kSizeTensor). Here, the auxiliary
tensor will be recording the indices. Strangely enough, the only
signature available for this function is to return indices, and it
appears that they must be computed whether the user desires them or not.
See
[pytorch/torch/nn/functional.py](https://github.com/pytorch/pytorch/blob/main/torch/nn/functional.py#L1174).
Before writing other adaptive pooling conversions, the logic of this
decomposition should be rolled into a helper function that will work for
both max and avg pooling ops. Even the auxiliary tensor should likely be
automated. This code was written in a slightly more tedious way than
strictly necessary (often using loops to fill SmallVectors up to rank-2,
which is only two in this case), in order to more easily facilitate the
transition to a helper function.
convolution with [time,batch,channel] ordering, as opposed to the
default [batch, channel, time]. Currently implementing by transposing
the input and output, but may need to get its own implementation in the
future because this is supposed to be an op that gives a speedup. This
is used by fairseq
(https://github.com/facebookresearch/fairseq/issues/172).
(in case you were wondering like me, this is different from transposed
convolution. Transposed convolution has fractional strides).
---------
Co-authored-by: Xida Ren <xida.ren.dev@gmail.com>
Co-authored-by: Frederik Harwath <frederik.harwath@amd.com>
Handle both `torch.dequantize` and `torch.quantize_per_tensor` including
the op based quantization parameter tracking. This includes adding
`qint32` to torch types as it was missing during the initial type
inclusion.
For testing we only have `torch.int8` and `torch.float` types on
function boundaries as the `qint8` types require passing the scale
and zero point quantization information which is not supported yet.
This PR updates the torch-to-tosa conversion with following changes:
- Support torch.none as min/max input argument for tosa.clamp op
- Support negative value as start index for tosa.slice op
- Add tosa.logical_or lowering support
e2e test:
python -m e2e_testing.main --config=tosa
LIT tests:
cmake --build build --target tools/torch-mlir/all
---------
Co-authored-by: Ze Zhang <ze.zhang@getcruise.com>
Adaptive pooling ops can only be decomposed into their non-adaptive
counterparts in trivial cases.
For example, the current decomposition for AtenAdaptiveAvgPool1dOp in
DecomposeComplexOps.cpp supports outSize = inSize (i.e., do literally
nothing), and outSize = 1 (i.e., do a batched average).
The reason adaptive pooling ops are difficult to lower to linalg is that
they are not constantly strided. They are computed by taking an input
tensor of shape (N, C, Hin), and an output size Hout, and computing the
output tensor at position (n,c, h) in the following way:
1. compute st(h) = (h*Hin)//Hout
2. compute en(h) = 1 + ((h+1)*Hin -1)//Hout
3. apply a computation (max or avg) to the slice: INPUT[n, c,
st(h):en(h)]
The provided sample implementation (for ConvertAtenAdaptiveAvgPool1dOp)
uses tensor.extract to access the input tensor inside the payload of a
linalg generic op. This is likely an unattractive use of linalg generic
ops, which is why I am asking for some more targeted feedback on the
validity of this approach before attempting to support the many other
adaptive pooling ops.
Specifically:
- Is the performance of this implementation bad enough to warrant
targeting different dialects entirely? e.g. TMtensor/linalg ext/ etc.
- If the provided implementation is of acceptable performance to the
community, then is it permissable to remove the Adaptive pooling
decompositions from DecomposeComplexOps.cpp? Based on the current
structure of the -torch-decompose-complex-ops pass, it does not seem
possible to only decompose the adaptive ops in special cases (it seems
to get stuck in an infinite loop on a match failure). I would be happy
to instead incorporate the case logic into the conversion directly, and
remove the decompositions once they are rendered completely obsolete.
As long as this approach is acceptable, I can clean up the
implementation with some helper functions, and quickly add support for
each of the remaining Adaptive pooling ops.
Adds a lowering to Linalg for reflection_pad1d. Based on ideas/code from draft PR
https://github.com/llvm/torch-mlir/pull/2693.
---------
Co-authored-by: Kumar Deepak <kumar@xilinx.com>
As noted in the plan when this work started, we need to produce an ORT
EP plugin for a downstream project, and this will necessitate a C-based
ONNX importer (as opposed to the existing Python one). Because this
comes with dependencies that we do not want to impart on various
projects, this is optional in torch-mlir. It is also factored so that it
can be used as standalone sources in downstreams that need it. Since it
only depends on public C APIs on the MLIR side, this will make build
coupling a lot better (since a C++ dep is not needed on the compiler and
it is trivial to dynamically load).
Our original plan was just to maintain this fork off to the side in our
ORT plugin, but once work started, it seemed better to write it clean
and contribute it upstream for anyone to use. We expect that for non-ORT
use, the Python importer will have better ergonomics for most folks.
I will follow-up with a test suite refactor so that we can drive the
Python or C importer.
This is a relatively mechanical port from Python to C, borrowing some
scaffolding from the old JitIR importer. It does attempt to lay some
groundwork for external data, which will need to be implemented on the
Python side as well.
This PR adds the `enable_ir_printing` option to `torch_mlir.compile`,
which can be used to print the IR for all intermediate passes.
When running the added test file via:
```shell
$ python test/python/compile.py 2> tiny.stderr
```
the file `tiny.stderr` is about 700 KB.
This replaces the lowering of aten.cat with tensor.concat, allowing more
efficient handling of concatenations in downstream flows. The refbackend
populates concat decomposition patterns that can be used to recover the
previous lowering.
This is part 1 of 2, which will also include upstreaming the FX
importer. I started with ONNX because it forces some project layout
updates and is more self contained/easier as a first step.
Deviating somewhat from the RFCs on project layout, I made the following
decisions:
* Locating the `onnx_importer.py` into `torch_mlir.extras` as Maks
already has opened up that namespace and it seemed to fit. Better to
have fewer things at that level.
* Setup the build so that the root project only contains MLIR Python and
pure Python deps (like the importers), but this can be augmented with
the `projects/` adding more depending on which features are enabled.
* The default build continues to build everything whereas in
`TORCH_MLIR_ENABLE_ONLY_MLIR_PYTHON_BINDINGS=1` mode, it builds a
`torch-mlir-core` wheel with the pure contents only.
`onnx_importer.py` and `importer_smoke_test.py` are almost verbatim
copies from SHARK-Turbine. I made some minor local alterations to adapt
to paths and generalize the way they interact with the outer project. I
expect I can copy these back to Turbine verbatim from here. I also
updated the license boilerplate (they have the same license but slightly
different project norms for the headers) but retained the correct
copyright.
Other updates:
* Added the ONNX importer unit test (which also can generate test data)
in lit, conditioned on the availability of the Python `onnx` package. In
a followup once I know everything is stable, I'll add another env var
that the CI can set to always enable this so we know conclusively if
tests pass.
* Moved the ONNX conversion readme to `docs/`.
* Renamed CMake option `TORCH_MLIR_ENABLE_ONLY_MLIR_PYTHON_BINDINGS` ->
`TORCH_MLIR_ENABLE_PYTORCH_EXTENSIONS` and inverted the sense. Made the
JitIR importer and LTC options `cmake_dependent_options` for robustness.
For easier tracking of issues, sort the TOSA passing list. It is still
significantly smaller then the XFAIL list would be.
Resolves#2620, at least until the xfail list gets smaller than the
passing list.
Signed-off-by: Eric Kunze <eric.kunze@arm.com>
`AtenStackOp` defines this folder for list operand containing single
element:
```
OpFoldResult AtenStackOp::fold(FoldAdaptor adaptor) {
auto list = getOperand(0).getDefiningOp<PrimListConstructOp>();
if (!list || !list->hasOneUse() || list.getElements().size() != 1)
return nullptr;
return list.getElements()[0];
}
```
However, unlike `AtenCatOp`, `AtenStackOp` cannot be folded away for
single element list operand because the result from a stack operation
contains an additional dimension (of size 1, like expand_shape).
This PR removes the `AtenStackOp::fold` method, and adds an e2e test for
single element list input case, which fails on current `main` as
follows:
```
Unexpected outcome summary: (linalg)
****** Failed tests - 1 tests
FAIL - "TensorsStackSingleElementListModule_basic"
@ trace item #0 - call to "forward"
@ output of call to "forward"
ERROR: shape (torch.Size([10, 32])) is not equal to golden shape (torch.Size([10, 1, 32]))
```
Thanks Chris Lalau Keraly for the bug report.
This commit adds the OnnxToTorch support for BitwiseXor, BitwiseOr, Div, Equal, Cast,
Ceil, Floor, Cos, and Clip op.
This commit also adds the TorchToLinalg support for aten.clamp.Tensor and aten.clamp_min.Tensor op.
Signed-Off By: vivekkhandelwal1424@gmail.com
The linalg Op `linalg.conv_2d_ngchw_fgchw` had a bug where
1. Weights were accessed as G,F,C,H,W instead of as F,G,C,H,W
2. Output was accessed as N,F,G,H,W instead of as N,G,F,H,W
Now this has been fixed in
https://github.com/llvm/llvm-project/pull/73855 which broke the
torch-mlir lowering to that Op.
This patch switches lowering in torch-mlir to the newly introduced
`linalg.conv_2d_ngchw_gfchw` op which accesses weights in an order that
is compatible with PyTorch's memory layout.
Fix https://github.com/llvm/torch-mlir/issues/2622
llvm-project: bbd2b08b95fe76bea138c1b03c1cd42ed3ee04df
stablehlo: ab709fe48de88c67717abfbd7ef17425eb95ddaf
These commits were chosen in order to account for an MLIR API break from
3dbac2c007
which required a patch to stablehlo. We integrate a bit beyond that
commit to deal with some revert/reapply cycles in the intervening range
which were discovered in another downstream.
Further, it requires adaptation to the stablehlo API breaks introduced
from https://github.com/openxla/stablehlo/pull/1872 which are along for
the ride.
Since some stablehlo builders were changed to directly take int64_t
array refs, also traced that up some call stacks to eliminate some
signed/unsigned mismatches that result.
Also adds a few TOSA tests to the passing set that seem to work now.
Despite aten.mm requiring the input and output types match, we still opt
to maintain signedness semantics in case later passes try to do any sort
of integer type narrowing.
Adds a lowering for the torch.aten.argmin operator to linalg via decomposition into torch.aten.min.dim.
---------
Co-authored-by: Franz Haniel <franz.haniel@amd.com>
The function `getTypeForScalarType` currently takes an argument to
specify the signedness of integer types. This is leakage of backend
specific requirements into the torch dialect world. Because
`getTypeForScalarType` is a utility function for the torch dialect, it
should only produce types that match the sign conventions used by
PyTorch (regular integers are signed and unsigned integers are
unsigned).
This commit removes the signedness argument from
`getTypeForScalarType`, and moves the backend specific handling of
integer types to the backend code.
This commit adds the OnnxToTorch support for Atan, Bitshift, BitwiseAnd,
and BitwiseNot op.
This commit also adds the TorchToLinalg support for AtenBitwiseLeftShiftTensorOp.
Signed-Off By: vivekkhandelwal@nod-labs.com
This was unfortunately being initialized in a directory below its first use. This was causing the first configure to mis-detect the ABI flags, which was causing type conversion failures at runtime.
Fixes#2298 and hardens some additional messages and checks to better make it clear when something goes awry.
The aten.reshape ops in the decomposition are replaced with prims.collapse
and prims.split_dim ops, which means that the cases where the lowering of
reshape from torch to linalg which are not supported, are avoided.
Essentially, by using the collapse and split_dim ops instead of the
reshape ops, we are not "losing" the information that the reshapes do not
arbitrarily mix dimensions. Which makes lowering easy.
3 additional tests added:
- fully dynamic,
- dynamic only the spatial dimensions,
- dynamic only in the non-spatial dimensions.
Adds support for lowering to prims split_op.
Similar design to collapse op lowering in
https://github.com/llvm/torch-mlir/pull/2572, with some
small differences, because the split_dim op (in pytorch) is
view-changing whereas the collapse is not. The difference
means that
1) it must be registered in the function Torch::isViewLikeOp
2) it must be be added to the "expected fail" set for the torch dynamo backend.