- aten::relu_, aten::max_pool2d, aten::adaptive_avg_pool2d, aten::batch_norm, aten::conv2d
No aten-to-linalg conversion for the latter ones, as they are fairly
substantial. At this point, I'm trying to get shape inference and stuff
working for them and the IR cleaned up.
This trait lets us model the semantics of various aten/torch/numpy ops
that are insensitive to type refinements. This replaces
hardcoded/inconsistent checks for this property.
To show usage of this new trait, we fix up some old uses, and improve
RefineTypes to be smarter about rewriting with this trait.
Interestingly, TorchScript has its own op (`torch::jit::Operator`)
registry separate from the dispatcher (it is a superset of the
dispatcher).
This is where the "prim" ops and some "aten" ops (that should probably
be renamed to "prim") live. In particular, `aten::__is__` is in that
latter category of "aten but really prim". This registry is also the
source of truth for what the TorchScript interpreter calls into when it
executes.
The bulk of the "not part of the dispatcher" ops live in
09feb5f579/torch/csrc/jit/runtime/register_prim_ops.cpp (L82)
And the registry itself lives in:
09feb5f579/torch/csrc/jit/runtime/operator.cpp (L196)
This fold further reduces the IR of ResNet by folding away some
more not-taken branches. These not-taken branches in ResNet require
first-class handling of the list type which we don't yet have on any
backend.
This is the start of a push to getting ResNet running.
This involves throwing in the towel on an O0 pipeline for now. See note
in the code. We keep an options struct with an `optimize` flag, but it
defaults to true for now.
This removes the need for defining all of the custom propagation logic,
and also adds support for propagating value knowledge across branches,
through regions, and across calls.
These tests pass on the reference backend.
- Add aten.linear op + shape xfer function + ATen->Linalg lowering.
  - Note: this needs to be more automated, and needs to cover more cases.
  - Caveats (not currently implemented):
    - size-1 broadcasting for bias vector (either static-size-1 or ? case)
    - higher-rank aten.linear ops (not produced by torch.nn.Linear though)
    - type promotion (still don't even know the exact rules here)
- Add folder for torch.derefine op. Now the inliner can clean it up as
it inlines. (call boundaries are a main place we need to insert
torch.derefine) This is brittle -- the other important case is control
flow which will need to be handled via an extension to
RefineTypes.cpp (as will more robust call handling). River has an
in-flight patch to update it to the new dataflow framework so I didn't
want to do anything intrusive here.
- Also adjust torch.derefine syntax to use the keyword `to` instead of
`->`, as most type-only, cast-like ops do.
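To make the aten.linear shape xfer function above a bit more concrete, here is a rough Python sketch of the rank-2 case. This is not the code in the patch: `linear_shape` is a hypothetical name, and using `None` for an unknown (`?`) dimension is an assumption of the example. It deliberately rejects the unimplemented cases listed above (size-1 bias broadcasting, higher ranks).

```python
from typing import List, Optional

Dim = Optional[int]  # None stands in for an unknown ("?") dimension


def linear_shape(input_shape: List[Dim], weight_shape: List[Dim],
                 bias_shape: List[Dim]) -> List[Dim]:
    """Result shape of linear(input, weight, bias), rank-2 case only."""
    if len(input_shape) != 2 or len(weight_shape) != 2 or len(bias_shape) != 1:
        raise ValueError("only rank-2 input/weight and rank-1 bias are handled")
    n, k_in = input_shape
    m, k_w = weight_shape
    (b,) = bias_shape
    # Statically known sizes must agree; unknown sizes are left for runtime checks.
    if k_in is not None and k_w is not None and k_in != k_w:
        raise ValueError("contracting dimensions do not match")
    if m is not None and b is not None and m != b:
        raise ValueError("bias size does not match output features")
    # Take the output feature count from whichever of weight/bias knows it.
    out_features = m if m is not None else b
    return [n, out_features]
```

For example, `linear_shape([None, 3], [4, 3], [4])` gives `[None, 4]`: the batch dimension stays unknown and the feature dimension comes from the weight.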
This inlines global slots if possible. This allows them to participate
in folding, canonicalization, shape inference, etc.
Example use cases:
- inlining weights and biases that are readonly during inference
- inlining the "training" bool to allow stuff to fold away
For training use cases (especially internal training loop), we will need
something smarter to get good performance. That would look like an "SSA
formation" which promotes the global slots to tensors in the program,
flushing them back to the slots at the minimal number of necessary
places. We might want to let backends do that transformation though.
This also interacts with shape inference (type bounds on the slots to
even lower them to backends in the first place).
- Move frontend lowering pipelines to c++ (this helps with reproducing
failures in npcomp-opt)
- Add debugging printouts when compilation fails on RefBackendTestConfig
The experience when a test fails during MLIR lowering is now like this:
```
NPCOMP TorchScript Object Graph IR -> NPCOMP Backend IR lowering failed with the following diagnostics:
failed to legalize operation 'torch.global_slot'
Module does not conform to npcomp's backend contract. See dialect conversion legality information above.
Error can be reproduced with:
$ npcomp-opt -torchscript-to-npcomp-backend-pipeline /tmp/ResNet18Module.mlir
```
And when TorchScript->MLIR import fails it looks like this:
```
PyTorch TorchScript module -> NPCOMP Object Graph IR import failed with the following diagnostics:
unhandled prim operation: %18 : int = prim::min(%17) # /usr/local/google/home/silvasean/.local/lib/python3.9/site-packages/torch/nn/functional.py:4532:4
```
Also,
- Add `--filter=<regex>` to e2e test harness to filter tests.
- Add a few prim ops that were needed to import ResNet18
- Fix torch.prim.Loop.condition assemblyFormat (it previously would not
round-trip in the case of no loop-carried variables)
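As a rough illustration of what the `--filter=<regex>` option does, here is a minimal Python sketch. The name `filter_tests`, the test names, and the choice of `re.match` (anchored at the start of the name) are all assumptions for the example; the actual harness may match differently.

```python
import re
from typing import List


def filter_tests(test_names: List[str], pattern: str) -> List[str]:
    """Keep only the tests whose name matches the regex `pattern`."""
    regex = re.compile(pattern)
    # re.match anchors at the start of the name (an assumption of this sketch).
    return [name for name in test_names if regex.match(name)]


# Hypothetical test names, for illustration only.
all_tests = ["ResNet18Module_basic", "MmModule_basic", "MmModule_chained"]
selected = filter_tests(all_tests, "Mm.*")
```

Here `selected` ends up holding the two `MmModule_*` names, in their original order.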
The E2E tests can be run with
```
npcpy frontends/pytorch/e2e_testing/torchscript/main.py
```
This commit adds a couple items supporting that end, including new sugar
for annotations (no more raw use of ClassAnnotator!).
Recommended review order:
1. `frontends/pytorch/e2e_testing/torchscript/main.py` for
the harness + `basic.py` in that directory for examples of tests.
2. Annotation sugar in `frontends/pytorch/python/torch_mlir/torchscript/annotations.py`
and unittest in `frontends/pytorch/test/ivalue_import/annotations/sugar.py`
3. Global test registry / sugar in
`frontends/pytorch/python/torch_mlir/torchscript/e2e_test/registry.py`
4. `frontends/pytorch/python/torch_mlir/torchscript/e2e_test/framework.py`
for the meat of the testing framework (start at `run_tests`), and
looking at the backend configs in
`frontends/pytorch/python/torch_mlir/torchscript/e2e_test/configs`
for examples of backends. This is likely the bulk of review time.
5. Unit tests of the framework logic in `frontends/pytorch/test/torchscript_e2e_test`
There are TODOs scattered throughout, but this seems functional enough to
start pulling stuff into and kicking the tires. A few missing pieces:
1. Marking test expected pass/fail per backend.
2. Figuring out how best to fit this into dev workflows.
3. IREE TestConfig.
Also, forgive this Python newbie... Any advice on Python code structure
/ library design would be much appreciated.
As described in the code comment:
```
When we import TorchScript IR, we import their entire "compilation unit",
which can contain numerous functions unrelated to the current program,
which breaks torch-globalization-pipeline; for example, there can be
random functions referencing types that haven't been imported
as part of the root `torch.nn.Module` we imported. Those will
be unreferenced private functions which symbol-dce will clean up nicely.
```
This situation is really easy to hit in jupyter notebooks, where the
same cell is evaluated multiple times. That results in the same
class name being defined repeatedly (at the Python level, e.g. class
`Foo` in the top-level main module). PyTorch handles this situation
internally by mangling a unique number into the names of ClassType's and such. When
we import the new ClassType's, we see not just the new
torch::jit::Function's in the CompilationUnit, but, also all the old
ones, which reference ClassType's that are not reachable from the
`torch.nn.Module` that we imported.
Note: there is no way to avoid importing the whole CompilationUnit
(including these old remnants) without doing a fairly complicated call
graph reachability analysis of which functions are reachable from the
methods of the ClassType's we imported. It turns out that once we are
inside MLIR, we model visibility correctly so that `symbol-dce`
"Just Works" for this use case. That is to say, this is not a quick
hack, but rather seems like a totally palatable long-term solution.
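The reason `symbol-dce` works here can be shown with a toy model: anything not reachable from a public symbol is dropped. The sketch below is a simplification written for this note, not MLIR's implementation; `symbol_dce` and the symbol names are hypothetical.

```python
from typing import Dict, List, Set


def symbol_dce(references: Dict[str, List[str]], public: Set[str]) -> Set[str]:
    """Return the symbols that survive: those reachable from a public symbol."""
    live: Set[str] = set()
    worklist = list(public)
    while worklist:
        sym = worklist.pop()
        if sym in live:
            continue
        live.add(sym)
        # Everything referenced by a live symbol is live too.
        worklist.extend(references.get(sym, []))
    return live


# "stale_*" model leftovers from an earlier evaluation of a notebook cell.
refs = {
    "forward": ["helper"],
    "helper": [],
    "stale_forward": ["stale_helper"],
    "stale_helper": [],
}
survivors = symbol_dce(refs, public={"forward"})
```

Only `forward` and `helper` survive; the stale private functions are never visited, so they are removed without any special-case analysis on the import side.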
This pass verifies that a given module satisfies the contract that we
have for backends. This is phrased as an "allowlist", because we want to
keep this interface tight. Also, this gives much better diagnostics than
a backend randomly crashing or failing to compile would (though they
could still be improved).
This was especially painful because if we had
`tensor<?x!numpy.any_dtype>` slip through, at some point RefBackend
would convert it to a memref type and trip the "verify type invariants"
assertion, which gives no location or anything, and crash the process,
which was very unpleasant.
We implement this with the dialect conversion framework, which works
reasonably well and was quick to put together and familiar, but is still
very "op oriented". We probably want to make this hand-rolled
eventually, especially the error reporting (the most useful kind of
error for a dialect conversion user is not necessarily the best for this
use case). Also, in production, these errors will go to users, and need
to be surfaced carefully such as "the compiler needs a type annotation
on this function parameter" which in general requires some special
analysis, wordsmithing, and overall awareness of the e2e use case (such
as how much we can lean into certain source locations) to provide a
meaningful user-level diagnostic.
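The "allowlist" idea itself is simple, and a hand-rolled version could look roughly like the Python sketch below. This is only a model of the check, not the pass: `verify_backend_contract`, the op list, and the allowlist contents are made up for the example, and the real pass works on MLIR ops and types rather than strings.

```python
from typing import List, Set, Tuple


def verify_backend_contract(ops: List[Tuple[str, str]],
                            allowed: Set[str]) -> List[str]:
    """`ops` is a list of (op_name, location) pairs.

    Returns one diagnostic per op that is not on the allowlist.
    """
    return [
        f"{loc}: failed to legalize operation '{name}'"
        for name, loc in ops
        if name not in allowed
    ]


# Hypothetical allowlist and module contents, for illustration only.
allowed_ops = {"linalg.generic", "std.return"}
module_ops = [
    ("linalg.generic", "model.py:10"),
    ("torch.global_slot", "model.py:3"),
    ("std.return", "model.py:12"),
]
diagnostics = verify_backend_contract(module_ops, allowed_ops)
```

Because everything not explicitly allowed is rejected, a new op that slips through the frontend shows up as a located diagnostic instead of a crash further down the stack.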
Also, add `inline` to the current frontend lowering pass pipeline to
allow slightly more complicated programs that otherwise would fail on
shape inference.
This is our first op with error semantics, and stresses the system.
There are a few design notes of special interest:
- RefineTypes.cpp's note about shape inference in the presence of code
that dynamically produces an error, where that is provable statically.
- ATenToLinalg.cpp's notes about future automation of the ATen->linalg
path.
- The notes in Passes.td about using low-tech `std.assert` ops instead
of `shape.assuming`.
Note: Doesn't work on IREE yet due to the `std.assert` op (needs to be
lowered to `vm.fail` on the IREE side).
Recommended review order:
- Changes in frontends/pytorch/examples/
- Changes in python/npcomp/compiler/pytorch/backend/
- Boilerplate for the `npcomp-iree-backend-lower-linkage` pass.
This change separates out a
`npcomp.compiler.pytorch.backend.frontend_lowering` module that does the
common lowering for all backends. The individual compiler backends
`npcomp.compiler.pytorch.backend.{refjit,iree}` now accept a loosely
defined "TCP + scalar code" IR mix that will be formalized in the
future as the interface to codegen backends.
This also required adding a small pass
`npcomp-iree-backend-lower-linkage` which adds `iree.module.export` onto
functions, and layering that into the frontend flow. The pass doesn't
require a C++-level dependency on IREE, which is nice for now. TBD how
we are going to handle lists (we hope we can get away with sneakernetting
some td files and relying on loose IR compatibility).
Running through IREE requires the ability to import `iree.compiler` and
`iree.runtime`, which can be obtained as follows:
```
python3 -m pip install iree-compiler-snapshot iree-runtime-snapshot -f https://github.com/google/iree/releases/tag/snapshot-20210406.200
PYTHONPATH="${PYTHONPATH}:${MY_IREE_BUILD}/bindings/python/"
```
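A quick way to check whether the IREE Python packages are importable before trying the IREE backend is sketched below. `has_module` is a hypothetical helper written for this note; it only uses the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `iree`) is itself missing.
        return False


iree_available = has_module("iree.compiler") and has_module("iree.runtime")
```

If `iree_available` is False, the install or `PYTHONPATH` step above has not taken effect in the current environment.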
This patch makes it painfully clear that we don't have any e2e testing
harness to really plug into, and also don't have a usable Python API to
our compiler stack (something usable in a jupyter notebook).
That will be addressed in subsequent commits. We've been flying by the
seat of our pants with this `examples` directory that isn't subject to
any kind of testing or real usability concerns.
This revamps the TORCH_TO_TCF_PASSES to reflect the new layering that we
are doing in the compiler. See comments there for the layering.
Also adds `frontends/pytorch/examples/torchscript_tanh_e2e.py` as an
"example". E2E testing story TBD (want to get IREE working first).
This pass allows shape information to be propagated to return types,
which is nontrivial and cannot be cleanly put anywhere else as it
changes the public ABI, which is a concern that we want to keep
concentrated in one place.
Currently implemented as a simple intraprocedural dataflow analysis over
a standard ShapedType lattice (hasRank, sizes, and elementType).
It currently hardcodes a few key pieces of information:
- shape transfer functions
- whether it is legal to update the operand type of an op
This needs to be made pluggable obviously and the core propagation logic
moved somewhere agnostic.
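As a sketch of what the lattice looks like, the Python below models the three tracked facts and how two pieces of knowledge are combined where control flow merges. It is an illustration under assumptions, not the pass itself: `ShapeKnowledge` and `join` are hypothetical names, `None` stands for "unknown", and calling the combine operation `join` is just one naming convention.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ShapeKnowledge:
    has_rank: bool = False
    sizes: Tuple[Optional[int], ...] = ()  # None = unknown extent
    element_type: Optional[str] = None     # None = unknown element type


def join(a: ShapeKnowledge, b: ShapeKnowledge) -> ShapeKnowledge:
    """Combine two facts about the same value, keeping only what both agree on."""
    if a.has_rank and b.has_rank and len(a.sizes) == len(b.sizes):
        # Same rank: keep each extent only where the two sides match.
        sizes = tuple(x if x == y else None for x, y in zip(a.sizes, b.sizes))
        has_rank = True
    else:
        # Rank unknown or conflicting: nothing can be said about the shape.
        sizes, has_rank = (), False
    element_type = a.element_type if a.element_type == b.element_type else None
    return ShapeKnowledge(has_rank, sizes, element_type)


merged = join(ShapeKnowledge(True, (2, 3), "f32"),
              ShapeKnowledge(True, (2, 4), "f32"))
```

Here `merged` keeps the rank, the first extent, and the element type, but the second extent becomes unknown because the two inputs disagree on it.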
The current implementation is just sufficient to do a unary aten.tanh
from the e2e spike, and just applies some local rewrite patterns. I've
sketched out a fuller explanation of where this pass eventually
needs to go in the pass docs.
Adding this required adding `numpy.tensor_static_info_cast`, which is
the tensor analog of `numpy.static_info_cast`. This op encapsulates the
same numpy-specific "no runtime code" casting semantics, in particular
the interpretation of `!numpy.any_dtype`. The
`numpy.tensor_static_info_cast` I see in practice now are "information
erasing" and will be removed by a later pass that exploits the fact that
aten ops are agnostic to the static info in the operand types (so
substituting a type with more static info is fine).
Side note: we *need* to do dtype and rank inference before aten->tcf
(which will eventually mostly be aten->linalg+guards), because each aten
op is idiosyncratically overloaded based on dtype and rank. Without
copying that idiosyncratic overloading into lower layers (layering
violation), we cannot really lower it to anything until we do that.
This pass incorporates torch.type_bound info and also removes NoneType
returns (eventually it will rewrite tuple types too, but can't yet
because !basicpy.TupleType doesn't track element types).
Recommend looking at adjust-calling-conventions.mlir first to see what
it is doing, and holding your nose for the implementation of the pass.
I decided to implement this with the conversion framework, because it
gives us *some* goodies for type conversion -- mainly avoiding large
amounts of tricky RAUW dances. Unfortunately, the conversion framework
isn't a perfect fit for a couple reasons:
- the incorporation of torch.type_bound is a context-sensitive rewrite
(requires looking at the arg attr, not just the type).
- NoneType conversion is 1->0, which requires some special handling
- (not implemented yet) 1->N tuple type conversions require special
handling.
It's a little bit scary, but on balance doing it the other way would
have its own downsides.
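Setting the conversion framework aside, the signature rewrite the pass performs can be modeled in a few lines of Python. This is a toy written for this note: `adjust_signature` and the type strings are placeholders, and the real pass also has to rewrite call sites and function bodies, which is where the difficulty lies.

```python
from typing import List, Optional, Tuple


def adjust_signature(arg_types: List[str],
                     type_bounds: List[Optional[str]],
                     result_types: List[str]) -> Tuple[List[str], List[str]]:
    """Apply per-argument type bounds and drop NoneType results."""
    # Context-sensitive step: the new type depends on the argument's
    # attribute (its bound), not only on its current type.
    new_args = [bound if bound is not None else ty
                for ty, bound in zip(arg_types, type_bounds)]
    # 1->0 conversion: a NoneType result simply disappears.
    new_results = [ty for ty in result_types if ty != "NoneType"]
    return new_args, new_results


# Placeholder type names, for illustration only.
new_args, new_results = adjust_signature(
    arg_types=["unranked_tensor", "float"],
    type_bounds=["tensor<2x3xf32>", None],
    result_types=["unranked_tensor", "NoneType"],
)
```

The first argument picks up its bound, the second is left alone, and the `NoneType` result is removed from the result list.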
These allow users to annotate a known "type bound" on the argument,
which can seed shape/dtype inference. We don't rewrite the function
types as part of the import process (it will happen in a
yet-to-be-written pass) because:
1. We would need to interprocedurally rewrite all calls to keep the IR
consistent. Currently, we have a place after GlobalizeObjectGraph but
before we convert to tensors where this is convenient to do. Ideally,
we would do this on the object graph representation.
2. We don't necessarily know that adjusting the function type is a legal
calling convention change. The pass will have blessed knowledge (by
the pass pipeline author) that adjusting the argument type based on
the type bound is safe (which it frequently is).
3. Note that in principle, a type bound could be a fairly general thing
(such as maximum sizes of dimensions, unions of multiple concrete
types, etc.). The pass will in principle have logic to interpret the
type bounds and to determine a suitable "best" (and legal) argument
type.
- renames of OwningRewritePatternList -> RewritePatternSet
- also `insert` to `add`
- RewritePatternSet holds a context now
- memref dialect split from std
* Adds f32 scalar argument support across the ABI boundary.
* Adds support for passing input type / shape information
across the ABI boundary
* Adds support for parsing / creating input FloatAttr's in
`npcomp-run-mlir`
We already had the `promoteTrailingOutTensor` flag, but weren't using
it. An `inplaceVariantKernelName` flag needed to be added.
This change is a little dissatisfying, as the conversions done by the
RecognizeKernelsPass are currently non-orthogonal. In particular,
`kDropResultAndAliasArg0` probably won't work as intended if mixed with
these (we probably need to promote kDropResultAndAliasArg0 to not be an
arg-level thing anyway, as we have done with promoteTrailingOutTensor).
This involved adding a new op `numpy.overwrite_array`.
```
numpy.overwrite_array %arg2 overwrites %arg0 : tensor<2x3xf32>, !numpy.ndarray<[2,3]:f32>
```
This models the destructive update behavior. Note that in the above op,
we cannot simply RAUW %arg0 with a suitably converted %arg2 (for example,
%arg0 might have uses that are not dominated by %arg2, or might have an
alias relation with some other array in the program). In general, we
need a pass analogous to "SSA-formation" which knows how to see through
these to uncover an underlying tensor program.
Also, add tanh_out_e2e.py/div_inplace_e2e.py and fix some bitrot in
refjit.py which is my running example I'm trying to get working.
We should generally be using torch_signature_ods_gen.py for generating
these. Somehow this one slipped through manually.
There is no `aten::conv2d_overridable` in the op registry AFAICT so I
removed that alias.
torchvision nightly has not bumped to 0.10.0 alpha, so pip installs
torchvision==0.9.0 even with the --pre flag.
Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>
* Import ATen conv2d conversion and test
This is a first attempt at expanding ATen-to-TCF conversion for the
conv2d operator. Eventually, this will be useful when lowering a
high-level conv-based model.
This happens in practice with e.g. ResNet from torchvision (multiple
instances of the same BatchNorm class).
The key observation is that for this program, and the expected set of
programs, we can convert the program to the same globalized form with a
bit more static analysis and effort to suitably monomorphize the
program. Though what we are doing here is fairly annoying to implement,
it saves any nontrivial later pass from having to do similar analyses
(or worse). E.g. shape inference would need to be object-graph aware,
mutation/lifetime analyses would have to be aware, etc. Additionally, it
would make us front-load what it means to have a !torch.nn.Module type
on an ABI boundary, which we are just not ready to handle.
I'm really, really hoping that in practice we can get away with
this, otherwise it's going to be really rough designing a representation
(and implementing everything to back it) that is convenient to transform
and gracefully scales from full object graph (in the most dynamic case)
down to a fixed set of global slots like we have here (in the most
static case, which we presume a lot of practical programs fall into).
This also involved introducing a
`torch-prepare-for-globalize-object-graph` pass that does a minimal set of
lowerings to simplify the IR into a more orthogonal and analyzable form,
and a `torch-globalize-pipeline` helper.
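To give a feel for the "globalized form", the Python sketch below flattens a nested object graph into a flat set of global slots keyed by attribute path. It is a toy model with hypothetical names (`globalize`, the attribute names) and it only covers the data side; monomorphizing the methods per instance, which is the annoying part, is not shown.

```python
from typing import Any, Dict


def globalize(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested object graph into {dotted.path: value} global slots."""
    slots: Dict[str, Any] = {}
    for name, value in obj.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # A submodule: recurse, extending the path.
            slots.update(globalize(value, path))
        else:
            slots[path] = value
    return slots


# Two instances of the same (hypothetical) BatchNorm-like class.
graph = {
    "bn1": {"weight": 1.0},
    "bn2": {"weight": 2.0},
    "training": False,
}
global_slots = globalize(graph)
```

Each instance gets its own slots (`bn1.weight`, `bn2.weight`), which is why code shared by the two instances has to be specialized per instance before it can refer to a fixed slot.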
Recommended review order:
- updated documentation in Passes.td
- new tests in `globalize-object-graph-multiple-instances*.mlir`
- implementation of GlobalizeObjectGraph.cpp
- PrepareForGlobalizeObjectGraph.cpp + prepare-for-globalize-object-graph.mlir
- misc stuff like torch-globalize-pipeline pipeline definition.
With this, we can import, globalize, and inline resnet18 from
torchvision:
https://gist.github.com/silvasean/821586afc19b67d9fb72030b2e0adeb8