We already had the `promoteTrailingOutTensor` flag, but weren't using
it. A inplaceVariantKernelName flag needed to be added.
This change is a little dissatisfying, as the conversions done by the
RecognizeKernelsPass are currently non-orthogonal. In particular,
`kDropResultAndAliasArg0` probably won't work as intended if mixed with
these (we probably need to promote kDropResultAndAliasArg0 to not be an
arg-level thing anyway, as we have done with promoteTrailingOutTensor).
This involved adding a new op `numpy.overwrite_array`.
```
numpy.overwrite_array %arg2 overwrites %arg0 : tensor<2x3xf32>, !numpy.ndarray<[2,3]:f32>
```
This models the destructive update behavior. Note that in the above op,
we cannot simply RAUW %arg0 with a suitably conveted %arg2 (for example,
%arg0 might have uses that are not dominated by %arg2, or might have an
alias relation with some other array in the program). In general, we
need a pass analogous to "SSA-formation" which knows how to see through
these to uncover an underlying tensor program.
Also, add tanh_out_e2e.py/div_inplace_e2e.py and fix some bitrot in
refjit.py which is my running example I'm trying to get working.
Note that unlike aten.matmul which has dynamic behavior
depending on the argument ranks (can do matrix-matrix, matrix-vector,
batch matmul, etc.), aten.mm is just a vanilla matrix
multiply, which can be lowered precisely to tcf.matmul.
The "test" is really just an example that I stared at while getting my
feet wet with this. We probably want something that actually tests this
as part of `ninja check-npcomp`.
* A bit gross because I took the chance to upgrade all of the backend bits to the new MLIR Python bindings and we still co-mingle the old and new for now.
* Since the Python created PassManagers are configured for explicit nesting, I had to upgrade some of the pass pipelines to be explicit.
* The demo in mul_maximum_e2e.py now compiles, runs through PyTorch and through the JIT, prints and asserts the same results.
* I am not claiming that this is the prettiest API in this patch: consider that this is just directly using low-level APIs and there should be an intervening high level API.
* Conversions are very simple, suporting mul, maximum and add (alpha=1 only).
* Example added with pass pipeline needed to run.
* Much missing off of the golden path but sufficient for such simple cases.