In torch.index_put-like ops, `values` is only required to be
broadcastable to `input[indices]`, rather than an exact dimension match.
This patch fixes the problem by adding an additional
stablehlo.dynamic_broadcast_in_dim before creating the stablehlo.scatter op.
BTW, this patch also enhances the `getBroadcastResultShape` utility in the
hlo namespace.
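A minimal PyTorch-level illustration of the broadcasting case this patch handles (shapes are hypothetical):
```python
import torch

x = torch.zeros(4, 3)
idx = (torch.tensor([0, 2]),)   # x[idx] has shape (2, 3)
vals = torch.tensor([1.0])      # shape (1,) is broadcastable to (2, 3)

# `values` need not match x[idx] exactly; the StableHLO lowering now inserts a
# stablehlo.dynamic_broadcast_in_dim before building stablehlo.scatter.
torch.index_put_(x, idx, vals)
print(x)
```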
# Tracking
[Issue](https://github.com/nod-ai/SHARK-ModelDev/issues/848)
[TorchToLinalg Op
Support](https://github.com/nod-ai/SHARK-ModelDev/issues/347)
# Description
Aten_TrilinearOp is an implementation of a "trilinear einstein sum";
essentially, it is just an einsum across three tensors.
There are a few inputs:
## Tensor Inputs
- i1, i2, i3 - The three input tensors for the _trilinear op.
## Expands
These inputs allow you to unsqueeze an input tensor at the specified
dims as a pre-processing step to make the shapes compatible for the rest
of the op:
- expand1: List[int], expand2: List[int], expand3: List[int]
## sumdim
- sumdim: List[int] - After applying element-wise multiplication, the
values in sumdim denote which dimensions to collapse by summing over them.
## unroll_dim
- unroll_dim: int - In the PyTorch implementation, this specifies a
dimension where you could slice the input tensors, multiply and sum
them, then concatenate the results in an output tensor. This complicates
the implementation significantly, but doesn't change the result, so I
opted against it. Along with that, a previously accepted path for
solving this involved reusing the AtenEinsumOp, which would also
ignore this input.
# Solution
After trying a bunch of more complicated approaches for it, this op
actually ended up being quite simple: [See
_trilinear](https://dev-discuss.pytorch.org/t/defining-the-core-aten-opset/1464)
```
_trilinear = (i1.unsqueeze(expand1) * i2.unsqueeze(expand2) * i3.unsqueeze(expand3)).sum(sumdim)
```
Wish I saw this earlier, but watcha gonna do: 🙃
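For reference, a rough Python sketch of that decomposition (the handling of multiple expand indices here is my own reading, not lifted verbatim from the PR):
```python
import torch

# Hypothetical reference helper: unsqueeze each input at its expand dims,
# multiply elementwise, then sum over sumdim.
def trilinear_ref(i1, i2, i3, expand1, expand2, expand3, sumdim):
    for d in sorted(expand1):
        i1 = i1.unsqueeze(d)
    for d in sorted(expand2):
        i2 = i2.unsqueeze(d)
    for d in sorted(expand3):
        i3 = i3.unsqueeze(d)
    return (i1 * i2 * i3).sum(sumdim)

# e.g. expand each (3, 4) input at dim 2, multiply, then sum over dim 1.
a, b, c = (torch.rand(3, 4) for _ in range(3))
print(trilinear_ref(a, b, c, [2], [2], [2], sumdim=[1]).shape)  # torch.Size([3, 1])
```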
## Not Reusing AtenEinsumOp
Frankly, I found multiple cases where valid inputs would have numerical
mismatches for EinsumOp, even when running tests against EinsumOp
directly. I think it has something to do with the singleton dimensions.
Will need to look into this further, but once I realized the simplified
approach, it appeared to be more reliable and much simpler.
Either way (credit to @zjgarvey), there are improvements to the einsum
op here. When I was originally trying to use the op, intermediate
tensors were being flattened properly, but their 0th dimension was then
being cast from a static dim to a dynamic dim due to integers not
folding correctly in the MLIR. Figured it's worth keeping these
improvements for future reusers of EinsumOp.
# The zeroed-out dim "bug"
For some reason, if you specify a dimension in all `expands`, i.e.
```
[expand1=[0], expand2=[0], expand3=[0]],
[expand1=[1], expand2=[1], expand3=[1]]
```
the _trilinear op would specify `0` for that dimension in the output
shape, unless it was also included in `sumdim`. This goes against the
implementation of torch.einsum:
```
>>> a, b, c = [torch.rand(1, 3, 3, 3) for i in range(3)] # Simulate expand at dim=0 for all input tensors
>>> torch.einsum('abcd,abcd,abcd->abcd', a, b, c).shape
torch.Size([1, 3, 3, 3])
```
It is also just straight up incorrect mathematically. I considered
"replacing" singleton dims with zeroed-out dims, but that seemed like
carrying over a bug. Instead, I included a test for the case, verified
that the singleton dimensions were handled the way torch.einsum
handles them (instead of the way torch._trilinear does), and xfailed it
with a note as to why.
This commit adds the torch-onnx-to-torch-backend pipeline which
converts the Torch Onnx IR to Torch Backend IR.
This commit also moves the `ScalarizeShapes` pass from the
`torch-backend-to-linalg-on-tensors-backend-pipeline` to the
`torch-onnx-to-torch-backend` pipeline since the primary goal of
this pass is to scalarize the shapes in the IR coming from the
Onnx models.
Set PyTorch and TorchVision version to nightly release 2024-10-15.
Tracker issue for the failing tests added to xfail_set in this PR.
Issue: https://github.com/llvm/torch-mlir/issues/3796
This commit disables the failing sparse tensor tests since they are not
maintained on a day-to-day basis and block the PyTorch update roll for now.
Signed-Off By: Vivek Khandelwal <vivekkhandelwal1424@gmail.com>
This was preventing dynamic dims in an ONNX model from being reified (causing the generation of `tensor.cast`s and preventing fusion in iree):
```mlir
%2 = torch.vtensor.literal(dense<[4, 256]> : tensor<2xsi64>) : !torch.vtensor<[2],si64>
%7 = torch.prim.ListConstruct %int2 : (!torch.int) -> !torch.list<int>
%8 = torch.aten.reshape %2, %7 : !torch.vtensor<[2],si64>, !torch.list<int> -> !torch.vtensor<[2],si64>
//... chain of foldable ops linking %2 to the `shape` operand of a `torch.aten.broadcast_to ... -> !torch.vtensor<[?,?],si64>`
```
# Description
Implementation of the op for `torch.aten.unfold`: [TorchToLinalg Op
Support #347](https://github.com/nod-ai/SHARK-ModelDev/issues/849)
Documentation of op can be found here: [PyTorch
Docs](https://pytorch.org/docs/stable/generated/torch.Tensor.unfold.html)
For this op, we apply a sliding window of some `size` along a single
`dimension`, with `step` in between iterations.
`Declaration: aten::unfold(Tensor(a) self, int dimension, int size, int
step) -> Tensor(a)`
The resulting `unfolded` tensor modifies the shape of `dimension` to be
equal to the number of blocks that the sliding window extracts/inserts,
with an additional dimension of `size` appended (the number of cols of
the output tensor directly translates from the size of the sliding
window).
So if we had a tensor of rank 3 (A x B x C), with dimension = 1, size =
2 and step = 2:
(A x B x C) |=> (A x ((B - size) // step + 1) x C x size)
After extracting the window from the input tensor, we insert the (1 x
size) slice into the output tensor. We can make this simpler by mapping
the output indices from the input indices, like they do in the official
implementation:
[PyTorch
Code](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/lowering.py#L1694)
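As a concrete (hypothetical) example of the shape behavior at the PyTorch level:
```python
import torch

# Rank-3 tensor (A x B x C) with A=2, B=6, C=3.
x = torch.arange(2 * 6 * 3, dtype=torch.float32).reshape(2, 6, 3)

# Slide a window of size 2 with step 2 along dimension 1.
y = x.unfold(1, 2, 2)  # dimension=1, size=2, step=2

# dimension 1 becomes (B - size) // step + 1 = (6 - 2) // 2 + 1 = 3,
# and a trailing dimension of length `size` is appended.
print(y.shape)  # torch.Size([2, 3, 3, 2])
```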
- Support Bidirectional LSTM (utilising the forward LSTM layer with
flipped Inputs and Outputs)
- Support layout 1
- Support default cases for attr `clip` and `input_forget`
- Support returning partial outputs (1-3)
- fixes for alt_e2e_tests lstm tests (1,2,3)
This commit adds support for negative step values in the
aten.slice.Tensor op. Although PyTorch does not allow a negative step
value for the slice op, the Onnx.Slice op supports negative step values
and eventually lowers to the torch.aten.slice.Tensor op. Hence, support
is added for handling such values during the Torch->Linalg lowering of
the aten.slice.Tensor op.
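As a quick illustration of the semantics (Python list slicing shown for reference, since PyTorch tensors reject negative slice steps directly):
```python
import torch

data = list(range(6))
print(data[5:1:-1])  # [5, 4, 3, 2] -- indices traversed in decreasing order

x = torch.arange(6)
# x[5:1:-1] is rejected by PyTorch (negative slice steps are not allowed),
# so a negative-step slice only reaches aten.slice.Tensor via Onnx.Slice and
# must be handled in the Torch->Linalg lowering.
print(x.flip(0)[0:4])  # tensor([5, 4, 3, 2]) -- the equivalent result
```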
Signed-Off By: Vivek Khandelwal <vivekkhandelwal1424@gmail.com>
Enabled mask and is_causal parameters for
torch.aten.scaled_dot_product_attention, plus relevant comments and tests.
The tests added highlight the new capabilities introduced in this PR,
including:
- Attention with F16 mask
- Attention with Boolean mask
- Causal attention with same Q K V shapes
- Causal attention without same Q K V shapes
Made sure that one cannot input both mask and is_causal.
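A minimal PyTorch sketch of a couple of the newly supported modes (shapes are hypothetical):
```python
import torch
import torch.nn.functional as F

q = torch.rand(2, 4, 8, 16)   # (batch, heads, seq_len, head_dim)
k = torch.rand(2, 4, 8, 16)
v = torch.rand(2, 4, 8, 16)

# Boolean mask: True means "attend", False means "mask out".
bool_mask = torch.ones(8, 8, dtype=torch.bool).tril()
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)

# Causal attention; passing both attn_mask and is_causal=True is rejected,
# matching the check added in this PR.
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```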
Addresses an issue in <https://github.com/llvm/torch-mlir/issues/3651>
where some unflatten ops generated from onnx models weren't propagating
static shape information. It may be necessary to add further
optimizations for the more general case when some static information is
present in the unflatten (or possibly reshape/view) op's `sizes` list,
but not reflected in the output shape. These ops will only successfully
infer shapes if the `sizes` list is obtained from a list of constant ints
(with possibly one -1). A common example where this fails is when some
of the `sizes` are determined from `aten.size.int` ops on dynamic
tensors, and other `sizes` are known statically.
This PR includes:
- a canonicalizer for `aten.unflatten.int` which converts to
`aten.unsqueeze` when it is expanding one dim to two and one of the new
dims is statically 1 (see the sketch after this list).
- an improvement to the folder for `aten.__or__.bool` which does not
rely on *both* operands being static.
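A small PyTorch-level illustration of the equivalence the canonicalizer relies on (hypothetical shapes):
```python
import torch

x = torch.rand(4, 6)

# Unflattening one dim into two where one new dim is statically 1
# is the same as an unsqueeze at the corresponding position.
assert torch.equal(x.unflatten(1, (1, 6)), x.unsqueeze(1))  # shape (4, 1, 6)
assert torch.equal(x.unflatten(1, (6, 1)), x.unsqueeze(2))  # shape (4, 6, 1)
```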
This PR adds `floordiv` to `PY_BUILTIN_TO_TORCH_OP`. For the
`aten.mul.int` and `aten.floordiv.int` ops, we add new canonicalization
patterns as follows:
```
%1 = torch.aten.mul.int %input, %const-5
%2 = torch.aten.mul.int %1, %const-6
```
Will be replaced by
`torch.aten.mul.int %input, %const-30`
And
```
%1 = torch.aten.mul.int %input, %const-5
%2 = torch.aten.floordiv.int %1, %const-5
```
Will directly return `%input`
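At the Python level, these are the kinds of integer expressions that now fold (a hypothetical shape computation; the mapping through `PY_BUILTIN_TO_TORCH_OP` is what produces the ops above):
```python
def padded_length(n: int) -> int:
    m = n * 5        # imports as torch.aten.mul.int %n, %const-5
    return m // 5    # imports as torch.aten.floordiv.int %m, %const-5,
                     # which the new pattern folds back to just %n
```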
This PR also relaxes the `float` type constraint in TorchToTosa for the
`AtenRsubScalarOp` conversion.
To test:
`cmake --build build --target check-torch-mlir-all`
Set PyTorch and TorchVision version to nightly release 2024-08-18.
This commit also updates the `scaled_dot_product_attention` op.
A new attribute `enable_gqa` has been added. As of now, only its
default value is supported.
Signed-Off By: Vivek Khandelwal <vivekkhandelwal1424@gmail.com>
The pattern `m_OnnxListOfConstantInts` previously only checked if the
attr inside an `onnx.Constant` op is a `DenseResourceElementsAttr`, but
didn't handle `ElementsAttr`s. This patch adds support for
`ElementsAttr` and provides an example of its use via a lit test for
`onnx.Unsqueeze`.
The static uneven divisible AdaptiveAvgPool2d case means that although the
input size is not an integer multiple of the output size, the kernel and
stride sizes can still be fixed (not dynamic). The derivation logic of the
kernel and stride sizes is consistent with
torch/_decomp/decompositions.py:adaptive_avg_pool2d, as described in the
following:
1. Stride Size
Firstly, derive the start index of each reduce operation according to
the output size (`n`): `start_index = ([0, 1, ..., n - 1] * input_size)
// output_size`. For each index `k`, if `k * (input_size % output_size)
< output_size`, then the current and previous strides stay the same,
`input_size // output_size`. So if `(n - 1) * (input_size %
output_size) < output_size`, the stride throughout the whole
AdaptiveAvgPool2d process stays static at `input_size // output_size`.
2. Kernel Size
torch/_decomp/decompositions.py:adaptive_avg_pool2d calculates a static
kernel size when the input/output sizes satisfy either of the two
conditions, `input_size % output_size == 0` or `output_size %
(input_size % output_size) == 0`. Here, if `input_size % output_size ==
0`, then the kernel size equals `input_size // output_size`; otherwise it
is `input_size // output_size + 1`.
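A small Python sketch of this derivation (a hypothetical helper, shown in 1-D for brevity; it mirrors the conditions described above rather than the exact decomposition code):
```python
def static_adaptive_pool_params(input_size: int, output_size: int):
    """Return a (kernel, stride) pair when both can be static, per the rules above."""
    rem = input_size % output_size
    stride_is_static = (output_size - 1) * rem < output_size
    kernel_is_static = rem == 0 or output_size % rem == 0
    assert stride_is_static and kernel_is_static, "sizes require dynamic pooling params"
    stride = input_size // output_size
    kernel = stride if rem == 0 else stride + 1
    return kernel, stride

# Example: input 10, output 3 -> rem = 1, both conditions hold,
# so kernel = 4 and stride = 3 (windows [0:4], [3:7], [6:10]).
print(static_adaptive_pool_params(10, 3))  # (4, 3)
```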