This is a really major and invasive restructuring of the way we get
torch operators (`torch::jit::Operator` / `c10::OperatorHandle`) into
MLIR. Please forgive the challenging review, but due to the sheer
invasiveness, it wasn't really practical do do it in sane smaller
pieces.
This fully replaces everything that was already working on the
TorchScript path (actually, more -- we added tanh support to
TorchToLinalg in order to delete the older code paths). Additionally,
I've kept the lights on for the acap path too, including what little e2e
stuff was working before (for expediency I made a few tiny compromises
along the way that will be easy to undo when we give that path proper
attention).
Overview of the new design:
- The torch operator `somens::someunqualname.someoverloadname` is
imported as `torch.somens.someunqualname.someoverloadname` (skip the
last dotted part if the overload name is empty), OR, if we don't have
such an op registered, it is imported as
`torch.operator "somens.someunqualname.someoverloadname" (...) : ...`.
- The addition of the "overload name" is a critical element here, as
the `(ns,unqual,overload)` triple is unique, which solves a lot of
problems we were having.
- This involves having separate MLIR ops for the `trailing_` and
`.out` variants and all the different overloads. This seemed
necessary, because the set of overloads is so wild and varied and
unstructured. The previous design was leaning into some underlying
structure that just isn't there -- the default situation is
the "random overload that we want to manage on the MLIR side",
rather than that being an exception. E.g. `aten::ne` (not-equal)
has 21 overloads, only 4 of which are c10 dispatcher ops see
[gist](https://gist.github.com/silvasean/190ba918c550c956260e21254e1b8aa1),
and the "out" variant is really called `.Tensor_out` instead of
`.out` as it frequently is for other ops.
- Rationale for all being in `torch` namespace: the set of operators
are so varied and unstructured that "dialect per namespace"
doesn't result in anything resembling the typical MLIR dialect
boundary expectations. We could maybe draw the boundary at
dispatcher ops vs non-dispatcher ops, but that doesn't seem to
really result in very much useful structure at this point in time.
- Note: within the torch operator registry, we effectively have a
mini-basicpy subdialect (already type-resolved), which is reasonably
structured.
- The existing Torch op interfaces are also removed -- now that we
track the overload name, we can losslessly find the original
operator.
- Instead of `ATenRecognizeKernelsPass`, we now have a
`ReduceOpVariantsPass` that keys off certain traits (and perhaps
eventually interfaces) to reduce variants of ops to a smaller set,
ideally operating on immutable tensors and using surrounding ops to
model the mutability/aliasing aspects.
- Note: `torch.ns.unqual.overload` ops allow both immutable and
mutable tensors (unlike the previous hard distinction in the common
case). This is a premonition for a future change that will introduce a
bona fide `!torch.tensor` type that will clean up a bunch of stuff.
- `TorchToLinalg` / `TorchToStd` supercede the existing
"ATen->TCF->TCP->Linalg" path.
- The new `torch_ods_gen.py` supercedes `torch_signature_ods_gen.py`.
It should look somewhat familiar, but the benefit of hindsight has
allowed a lot of simplifications.
The overall trend seems to be to make the `torch` dialect a nice layer
independent of anything else. It feels like as a natural result of
various future changes we will be removing the reliance on basicpy+numpy
dialects and have a nice self-contained type system too that properly
models the TorchScript type system (including proper subtyping,
mutable/immutable tensors, optional dtype, etc.).
Recommended review order:
- Start at some of the new import IR, e.g. in
`frontends/pytorch/test/node_import/prim.py`,
`frontends/pytorch/test/acap_export/test_export_add3.py`, and other
tests.
- `frontends/pytorch/python/torch_mlir_utils/codegen/torch_ods_gen.py`
and associated generated files:
- `include/npcomp/Dialect/Torch/IR/GeneratedAtenOps.td`
- `include/npcomp/Dialect/Torch/IR/GeneratedPrimOps.td`
- Inspect `ReduceOpVariants.cpp` / `reduce-op-variants.mlir` and the new
traits in `include/npcomp/Dialect/Torch/IR/TorchTraits.h`
- Various code changes in the import path in
`frontends/pytorch/csrc/builder`. Probably most interesting is the new
code in `torch_to_mlir_utils.cpp` that has the logic to create the
`torch.operator` ops or `torch.ns.unqual.overload` ops.
This is the [new ResNet IR](https://gist.github.com/silvasean/5407aafb710d07612b7b5b92eabecebe),
just to be able to look at a substantial sample of IR in the new style.
- aten::relu_, aten::max_pool2d, aten::adaptive_avg_pool2d, aten::batch_norm, aten::conv2d
No aten-to-linalg conversion for the latter ones, as they are fairly
substantial. At this point, I'm trying to get shape inference and stuff
working for them and the IR cleaned up.
This trait lets us model the semantics of various aten/torch/numpy ops
that are insensitive to type refinements. This replaces
hardcoded/inconsistent checks for this property.
To show usage of this new trait, we fix up some old uses, and improve
RefineTypes to be smarter about rewriting with this trait.
Interestingly, TorchScript has its own op (`torch::jit::Operator`)
registry separate from the dispatcher (it is a superset of the
dispatcher).
This is where the "prim" ops and some "aten" ops (that should probably
be renamed to "prim") live. In particular, `aten::__is__` is in that
latter category of "aten but really prim". This registry is also the
source of truth for what the TorchScript interpreter calls into when it
executes.
The bulk of the "not part of the dispatcher" ops live in
09feb5f579/torch/csrc/jit/runtime/register_prim_ops.cpp (L82)
And the registry itself lives in:
09feb5f579/torch/csrc/jit/runtime/operator.cpp (L196)
This fold further reduces the IR of ResNet by folding away some
more not-taken branches. These not-taken branches in ResNet require
first-class handling of the list type which we don't yet have on any
backend.
This is the start of a push to getting ResNet running.
This involves throwing in the towel on an O0 pipelinie for now. See note
in the code. We keep an options struct with `optimize` flag, but it
default to true for now.
This removes the need for defining all of the custom propagation logic,
and also adds support for propagating value knowledge across branches,
through regions, and across calls.
These tests pass on the reference backend.
- Add aten.linear op + shape xfer function + ATen->Linalg lowering.
- Note: this needs to be more automated, and needs to cover more cases.
- Current not implemented caveats:
- size-1 broadcasting for bias vector (either static-size-1 or ? case)
- higher-rank aten.linear ops (not produced by torch.nn.Linear though)
- type promotion (still don't even know the exact rules here)
- Add folder for torch.derefine op. Now the inliner can clean it up as
it inlines. (call boundaries are a main place we need to insert
torch.derefine) This is brittle -- the other important case is control
flow which will need to be handled via an extension to
RefineTypes.cpp (as will more robust call handling). River has an
in-flight patch to update it to the new dataflow framework so I didn't
want to do anything intrusive here.
- Also adjust torch.derefine syntax to use the keyword `to` instead of
`->`, as most type-only, cast-like ops do.
This inlines global slots if possible. This allows them to participate
in folding, canonicalization, shape inference, etc.
Example use cases:
- inlining weights and biases that are readonly during inference
- inlining the "training" bool to allow stuff to fold away
For training use cases (especially internal training loop), we will need
something smarter to get good performance. That would look like an "SSA
formation" which promotes the global slots to tensors in the program,
flushing them back to the slots at the minimal number of necessary
places. We might want to let backends do that transformation though.
This also interacts with shape inference (type bounds on the slots to
even lower them to backends in the first place).
- Move frontend lowering pipelines to c++ (this helps with reproducing
failures in npcomp-opt)
- Add debugging printouts when compilation fails on RefBackendTestConfig
The experience now when a test fails during MLIR lowering is now like this:
```
NPCOMP TorchScript Object Graph IR -> NPCOMP Backend IR lowering failed with the following diagnostics:
failed to legalize operation 'torch.global_slot'
Module does not conform to npcomp's backend contract. See dialect conversion legality information above.
Error can be reproduced with:
$ npcomp-opt -torchscript-to-npcomp-backend-pipeline /tmp/ResNet18Module.mlir
```
And when TorchScript->MLIR import fails it looks like this:
```
PyTorch TorchScript module -> NPCOMP Object Graph IR import failed with the following diagnostics:
unhandled prim operation: %18 : int = prim::min(%17) # /usr/local/google/home/silvasean/.local/lib/python3.9/site-packages/torch/nn/functional.py:4532:4
```
Also,
- Add `--filter=<regex>` to e2e test harness to filter tests.
- Add a few prim ops that were needed to import ResNet18
- Fix torch.prim.Loop.condition assemblyFormat (it previously would not
round-trip in the case of no loop-carried variables)
As described in the code comment:
```
When we import TorchScript IR, we import their entire "compilation unit",
which can contain numerous functions unrelated to the current program,
which breaks torch-globalization-pipeline; for example, there can be
random functions referencing types that haven't been imported
as part of the root `torch.nn.Module` we imported. Those will
be unreferenced private functions which symbol-dce will clean up nicely.
```
This situation is really easy to hit in jupyter notebooks, where the
same cell is evaluated multiple times. That results in the same
class name (at the Python level, e.g. class `Foo` in the top-level
main module). Internally to PyTorch, it handles this situation by
mangling in a unique number to the names of ClassType's and such. When
we import the new ClassType's, we see not just the new
torch::jit::Function's in the CompilationUnit, but, also all the old
ones, which reference ClassType's that are not reachable from the
`torch.nn.Module` that we imported.
Note: there is no way to avoid importing the whole CompilationUnit
(including these old remnants) without doing a fairly complicated call
graph reachability analysis of which functions are reachable from the
methods of the ClassType's we imported. It turns out that once we are
inside MLIR, we model visibility correctly so that `symbol-dce`
"Just Works" for this use case. That is to say, this is not a quick
hack, but rather seems like a totally palatable long-term solution.
This is our first op with error semantics, and stresses the system.
There are a few design notes of special interest:
- RefineTypes.cpp's note about shape inference in the presence of code
that dynamically produces and error, and it is provable statically.
- ATenToLinalg.cpp's notes about future automation of the ATen->linalg
path.
- The notes in Passes.td about using low-tech `std.assert` ops instead
of `shape.assuming`.
Note: Doesn't work on IREE yet due to the `std.assert` op (needs to be
lowered to `vm.fail` on the IREE side).
Currently implemented as a simple intraprocedural dataflow analysis over
a standard ShapedType lattice (hasRank, sizes, and elementType).
It currently hardcodes a few key pieces of information:
- shape transfer functions
- whether it is legal to update the operand type of an op
This needs to be made pluggable obviously and the core propagation logic
moved somewhere agnostic.
This pass incorporates torch.type_bound info and also removes NoneType
returns (eventually it will rewrite tuple types too, but can't yet
because !basicpy.TupleType doesn't track element types).
Recommend looking at adjust-calling-conventions.mlir first to see what
it is doing, and holding your nose for the implementation of the pass.
I decided to implement this with the conversion framework, because it
gives us *some* goodies for type conversion -- mainly avoiding large
amounts of tricky RAUW dances. Unfortunately, the conversion framework
isn't a perfect fit for a couple reasons:
- the incorporation of torch.type_bound is a context-sensitive rewrite
(requires looking at the arg attr, not just the type).
- NoneType conversion is 1->0, which requires some special handling
- (not implemented yet) 1->N tuple type conversions require special
handling.
It's a little bit scary, but on balance doing it the other way would
have its own downsides.
- renames of OwningRewritePatternList -> RewritePatternSet
- also `insert` to `add`
- RewritePatternSet holds a context now
- memref dialect split from std
This happens in practice with e.g. ResNet from torchvision (multiple
instances of the same BatchNorm class).
The key observation is that for this program, and the expected set of
programs, we can convert the program to the same globalized form with a
bit more static analysis and effort to suitably monomorphize the
program. Though what we are doing here is fairly annoying to implement,
it saves any nontrivial later pass from having to do similar analyses
(or worse). E.g. shape inference would need to be object-graph aware,
mutation/lifetime analyses would have to be aware, etc. Additionally, it
would make us front-load what it means to have a !torch.nn.Module type
on an ABI boundary, which we are just not ready to handle.
I'm really, really hoping that in practice we can get away with
this, otherwise it's going to be really rough designing a representation
(and implementing everything to back it) that is convenient to transform
and gracefully scales from full object graph (in the most dynamic case)
down to a fixed set of global slots like we have here (in the most
static case, which we presume a lot of practical programs fall into).
This also involved introducing a
`torch-prepare-for-globalize-object-graph` pass that does a minimal set of
lowerings to simplify the IR into a more orthogonal and analyzable form,
and a `torch-globalize-pipeline` helper.
Recommended review order:
- updated documentation in Passes.td
- new tests in `globalize-object-graph-multiple-instances*.mlir`
- implementation of GlobalizeObjectGraph.cpp
- PrepareForGlobalizeObjectGraph.cpp + prepare-for-globalize-object-graph.mlir
- misc stuff like torch-globalize-pipeline pipeline definition.
With this, we can import, globalize, and inline resnet18 from
torchvision:
https://gist.github.com/silvasean/821586afc19b67d9fb72030b2e0adeb8
This primarily unlocks proper handling of free functions (that is,
functions that are not methods of any torch.nn.Module).
Recommended review order:
- `ivalue_importer.cpp` + `ivalue_import/functions*.py`
- `GlobalizeObjectGraph.cpp` + test case
- misc other stuff
The `torch::jit::CompilationUnit` is basically a backing store or
"context" holding all the possible functions in the program. The
previous code was not explicitly accessing this data structure, since it
just imported the `torch::jit::Function`'s that it saw attached to
methods.
Subtly, any time a TorchScript module called into a free function, the
free function gets incorporated into the torch::jit::CompilationUnit,
but doesn't show up anywhere when dumping the module, except in the
curious pattern:
```
%5 : Function = prim::Constant[name="adaptive_avg_pool2d"]()
%6 : Tensor = prim::CallFunction(%5, %input.1, %4)
```
That is, calls are indirect calls, and are accessed via `prim::Constant`
materializing a function object. Even stranger, the `name` attribute here
doesn't really even tell the full story -- it doesn't correspond to
anything. It turns out that the c10::FunctionType itself actually holds
a pointer to the `torch::jit::Function` in the compilation unit
directly (so there is actually no indirection in prim::CallMethod,
because any two values of the same FunctionType call the same
function!). E.g. when converting the IR to bytecode, the "name" is
ignored [code link](1d6bd15790/torch/csrc/jit/runtime/interpreter.cpp (L937)).
We do import `prim::CallFunction` as a `std.call_indirect` though
because it's more braindead to do it that way (it gets canonicalized to
a direct call easily).
This is a much simpler representation than the ad-hoc initializer
function we had before. It is also less general, but given the rationale
in Passes.td it seems like the right tradeoff right now.
We can probably carry this representation for quite a while, and when we
can't, it likely means that TorchScript has fixed their object identity
bug and we probably need to just upgrade to a more general object graph
modeling (more general than GlobalizeObjectGraph).
In particular, we don't want to deal with defining and carrying around
this initializer function concept until we need it. For example, if we
want to constant-fold the global slots into uses, this is a much better
representation, and it plays better with symbol-dce (the initializer
function counts as a "use" of the symbol).
(the alternative would have been to write a pass that converts the
initializer function to this form when possible, but I realized that
lots of information had been lost which made that fairly annoying -- it
was all self-inflicted anyway, so best to just go to the source
(GlobalizeObjectGraph) before the information is lost)
Now symbol-dce works nicely (no more "training" bools)
```
pt_util ~/tmp/classifier.pt --import --exported-name forward \
| npcomp-opt -torch-globalize-object-graph -inline -symbol-dce
```
IR: https://gist.github.com/silvasean/8abe63d70d24e29d6db9170ccc8d512b
The first use case is to annotate certain program constructs as either
exported or private. In this commit we plumb it down to
GlobalizeObjectGraph which makes use of this information.
Recommended review order:
1. class_annotator.h/.cpp + `test/module_import/annotations/*`
- New abstractions to communicate with Python code and annotate.
2. IR changes in TorchOps.td
- Adding "private" attribute to various things.
3. ivalue_import.cpp changes
- Module + ClassAnnotator = annotated IR
4. GlobalizeObjectGraph.cpp + tests
- use new "private" attributes to create "private" IR.
- also, tweak some of the op deleting mechanics, which was triggering
some memory errors / assertions
With this, we can run the classifier through and inline it as follows:
```
frontends/pytorch/utils/pt_util.py --import --exported-name forward ~/tmp/classifier.pt \
| npcomp-opt -torch-globalize-object-graph -inline
```
IR: https://gist.github.com/silvasean/32dcad9f6270557f412094a77cecdd69
This required restructuring of how we model TorchScript on import. The
main difference is that now we split out a `torch.class_type` that holds
methods and declarations of the types of each slot. This is more
consistent with TorchScript (our previous representation was
"denormalized").
Recommended reading order:
1. check out the description of `torch.class_type` in `TorchOps.td` and
look at `test/Dialect/Torch/ops.mlir` and
`frontends/pytorch/test/module_import/` to familiarize with the new
representation.
- Just look at the new IR. The diff between the old names and new
names is confusing.
2. check out `test/Dialect/Torch/globalize-object-graph*.mlir`
and read along with the pass description in
`include/npcomp/Dialect/Torch/Transforms/Passes.td`
3. Read the code in `GlobalizeObjectGraph.cpp` and miscellaneous changes
in `ivalue_importer.cpp`, `TorchOps.cpp`, etc.