Recommended review order:
- Changes in frontends/pytorch/examples/
- Changes in python/npcomp/compiler/pytorch/backend/
- Boilerplate for the `npcomp-iree-backend-lower-linkage` pass.
This change separates out a
`npcomp.compiler.pytorch.backend.frontend_lowering` module that does the
common lowering for all backends. The individual compiler backends
`npcomp.compiler.pytorch.backend.{refjit,iree}` now accept a loosely
defined "TCP + scalar code" IR mix that will be formalized in the
future as the interface to codegen backends.
This also required adding a small pass
`npcomp-iree-backend-lower-linkage` which adds `iree.module.export` onto
functions, and layering that into the frontend flow. The pass doesn't
require a C++-level dependency on IREE, which is nice for now. TBD how
we are going to handle lists (we hope we can get away with sneakerneting
some td files and relying on loose IR compatibility).
Running through IREE requires the ability to import `iree.compiler` and
`iree.runtime`, which can be obtained as follows:
```
python3 -m pip install iree-compiler-snapshot iree-runtime-snapshot -f https://github.com/google/iree/releases/tag/snapshot-20210406.200
PYTHONPATH="${PYTHONPATH}:${MY_IREE_BUILD}/bindings/python/"
```
This patch makes it painfully clear that we don't have any e2e testing
harness to really plug into, and also don't have a usable Python API to
our compiler stack (something usable in a jupyter notebook).
That will be addressed in subsequent commits. We've been flying by the
seat of our pants with this `examples` directory that isn't subject to
any kind of testing or real usability concerns.
This pass allows shape information to be propagated to return types,
which is nontrivial and cannot be cleanly put anywhere else as it
changes the public ABI, which is a concern that we want to keep
concentrated in one place.
Currently implemented as a simple intraprocedural dataflow analysis over
a standard ShapedType lattice (hasRank, sizes, and elementType).
It currently hardcodes a few key pieces of information:
- shape transfer functions
- whether it is legal to update the operand type of an op
This needs to be made pluggable obviously and the core propagation logic
moved somewhere agnostic.
The current implementation is just sufficient to do a unary aten.tanh
from the e2e spike, and just applies some local rewrite patterns. I've
sketched out the more full explanation of where this pass eventually
need to go in the pass docs.
Adding this required adding `numpy.tensor_static_info_cast`, which is
the tensor analog of `numpy.static_info_cast`. This op encapsulates the
same numpy-specific "no runtime code" casting semantics, in particular
the interpretation of `!numpy.any_dtype`. The
`numpy.tensor_static_info_cast` I see in practice now are "information
erasing" and will be removed by a later pass that exploits the fact that
aten ops are agnostic to the static info in the operand types (so
substituting a type with more static info is fine).
Side note: we *need* to do dtype and rank inference before aten->tcf
(which will eventually mostly be aten->linalg+guards), because each aten
op is idiosyncratically overloaded based on dtype and rank. Without
copying that idiosyncratic overloading into lower layers (layering
violation), we cannot really lower it to anything until we do that.
This pass incorporates torch.type_bound info and also removes NoneType
returns (eventually it will rewrite tuple types too, but can't yet
because !basicpy.TupleType doesn't track element types).
Recommend looking at adjust-calling-conventions.mlir first to see what
it is doing, and holding your nose for the implementation of the pass.
I decided to implement this with the conversion framework, because it
gives us *some* goodies for type conversion -- mainly avoiding large
amounts of tricky RAUW dances. Unfortunately, the conversion framework
isn't a perfect fit for a couple reasons:
- the incorporation of torch.type_bound is a context-sensitive rewrite
(requires looking at the arg attr, not just the type).
- NoneType conversion is 1->0, which requires some special handling
- (not implemented yet) 1->N tuple type conversions require special
handling.
It's a little bit scary, but on balance doing it the other way would
have its own downsides.
These allow users to annotate a known "type bound" on the argument,
which can seed shape/dtype inference. We don't rewrite the function
types as part of the import process (it will happen in a
yet-to-be-written pass) because:
1. We would need to interprocedurally rewrite all calls to keep the IR
consistent. Currently, we have a place after GlobalizeObjectGraph but
before we convert to tensors where this is convenient to do. Ideally,
we would do this on the object graph representation.
1. We don't necessarily know that adjusting the function type is a legal
calling convention change. The pass will have blessed knowledge (by
the pass pipeline author) that adjusting the argument type based on
the type bound is safe (which it frequently is).
2. Note that in principle, a type bound could be a fairly general thing
(such as maximum sizes of dimensions, unions of multiple concrete
types, etc.). The pass will in principle have logic to interpret the
type bounds and to determine a suitable "best" (and legal) argument
type.
- renames of OwningRewritePatternList -> RewritePatternSet
- also `insert` to `add`
- RewritePatternSet holds a context now
- memref dialect split from std
* Adds f32 scalar argument support across the ABI boundary.
* Adds support for passing input type / shape information
across the ABI boundary
* Adds support for parsing / creating input FloatAttr's in
`npcomp-run-mlir`
We already had the `promoteTrailingOutTensor` flag, but weren't using
it. A inplaceVariantKernelName flag needed to be added.
This change is a little dissatisfying, as the conversions done by the
RecognizeKernelsPass are currently non-orthogonal. In particular,
`kDropResultAndAliasArg0` probably won't work as intended if mixed with
these (we probably need to promote kDropResultAndAliasArg0 to not be an
arg-level thing anyway, as we have done with promoteTrailingOutTensor).
This involved adding a new op `numpy.overwrite_array`.
```
numpy.overwrite_array %arg2 overwrites %arg0 : tensor<2x3xf32>, !numpy.ndarray<[2,3]:f32>
```
This models the destructive update behavior. Note that in the above op,
we cannot simply RAUW %arg0 with a suitably conveted %arg2 (for example,
%arg0 might have uses that are not dominated by %arg2, or might have an
alias relation with some other array in the program). In general, we
need a pass analogous to "SSA-formation" which knows how to see through
these to uncover an underlying tensor program.
Also, add tanh_out_e2e.py/div_inplace_e2e.py and fix some bitrot in
refjit.py which is my running example I'm trying to get working.
* Import ATen conv2d conversion and test
This is a first attempt at expanding ATen-to-TCF conversion for the
conv2d operator. Eventually, this will come in use when lowering a
high-level conv-based model.
This happens in practice with e.g. ResNet from torchvision (multiple
instances of the same BatchNorm class).
The key observation is that for this program, and the expected set of
programs, we can convert the program to the same globalized form with a
bit more static analysis and effort to suitably monomorphize the
program. Though what we are doing here is fairly annoying to implement,
it saves any nontrivial later pass from having to do similar analyses
(or worse). E.g. shape inference would need to be object-graph aware,
mutation/lifetime analyses would have to be aware, etc. Additionally, it
would make us front-load what it means to have a !torch.nn.Module type
on an ABI boundary, which we are just not ready to handle.
I'm really, really hoping that in practice we can get away with
this, otherwise it's going to be really rough designing a representation
(and implementing everything to back it) that is convenient to transform
and gracefully scales from full object graph (in the most dynamic case)
down to a fixed set of global slots like we have here (in the most
static case, which we presume a lot of practical programs fall into).
This also involved introducing a
`torch-prepare-for-globalize-object-graph` pass that does a minimal set of
lowerings to simplify the IR into a more orthogonal and analyzable form,
and a `torch-globalize-pipeline` helper.
Recommended review order:
- updated documentation in Passes.td
- new tests in `globalize-object-graph-multiple-instances*.mlir`
- implementation of GlobalizeObjectGraph.cpp
- PrepareForGlobalizeObjectGraph.cpp + prepare-for-globalize-object-graph.mlir
- misc stuff like torch-globalize-pipeline pipeline definition.
With this, we can import, globalize, and inline resnet18 from
torchvision:
https://gist.github.com/silvasean/821586afc19b67d9fb72030b2e0adeb8
In terms of IR structure, TorchScript allows types to vary in many
circumstances where MLIR requires pointer-identical types. In particular,
it is valid to pass any subtype in place of a type. For example, if an
`Optional[int]` is required somewhere in the IR, it is legal to pass a
value of just `int` (but not the other way around; see
`torch.prim.unchecked_cast`). In effect, every *use* can have a different
type.
We introduce a new op `torch.derefine` that models that impedance
mismatch. This op allows casting a value from one type to a type that it
is a subtype of to model this behavior.
Recommended review order:
- TorchOps.td for new torch.derefine (and updated docs for
`torch.prim.unchecked_cast`)
- new test code in if.py, loop.py, function-derefine.py
- new code in node_importer.cpp for handling derefinement insertion
- function_importer.cpp and utils changes in torch_to_mlir_utils.cpp
Properly handling derefinement on function boundaries required
relayering the code so that graph_importer.cpp/.h is now
function_importer.cpp/.h because only the `torch::jit::Function`
(actually the `c10::FunctionSchema` it holds) knows the derefined types that are
actually needed at the boundary (see `function-derefine.py` for a test).
Annoyingly, this churns all the functions which are now prefixed with
`__torch__.` but that is more correct anyway (that is their linkage name
in the `torch::jit::CompilationUnit`; the previous `mb.import_function`
was actually buggy in the case of functions calling each other as it
would reference their unqualified name).
With this change, we can import `resnet18` from `torchvision` :)
IR: https://gist.github.com/silvasean/6426a5272d8a6c7caae533fce05ab704
This primarily unlocks proper handling of free functions (that is,
functions that are not methods of any torch.nn.Module).
Recommended review order:
- `ivalue_importer.cpp` + `ivalue_import/functions*.py`
- `GlobalizeObjectGraph.cpp` + test case
- misc other stuff
The `torch::jit::CompilationUnit` is basically a backing store or
"context" holding all the possible functions in the program. The
previous code was not explicitly accessing this data structure, since it
just imported the `torch::jit::Function`'s that it saw attached to
methods.
Subtly, any time a TorchScript module called into a free function, the
free function gets incorporated into the torch::jit::CompilationUnit,
but doesn't show up anywhere when dumping the module, except in the
curious pattern:
```
%5 : Function = prim::Constant[name="adaptive_avg_pool2d"]()
%6 : Tensor = prim::CallFunction(%5, %input.1, %4)
```
That is, calls are indirect calls, and are accessed via `prim::Constant`
materializing a function object. Even stranger, the `name` attribute here
doesn't really even tell the full story -- it doesn't correspond to
anything. It turns out that the c10::FunctionType itself actually holds
a pointer to the `torch::jit::Function` in the compilation unit
directly (so there is actually no indirection in prim::CallMethod,
because any two values of the same FunctionType call the same
function!). E.g. when converting the IR to bytecode, the "name" is
ignored [code link](1d6bd15790/torch/csrc/jit/runtime/interpreter.cpp (L937)).
We do import `prim::CallFunction` as a `std.call_indirect` though
because it's more braindead to do it that way (it gets canonicalized to
a direct call easily).
This is a much simpler representation than the ad-hoc initializer
function we had before. It is also less general, but given the rationale
in Passes.td it seems like the right tradeoff right now.
We can probably carry this representation for quite a while, and when we
can't, it likely means that TorchScript has fixed their object identity
bug and we probably need to just upgrade to a more general object graph
modeling (more general than GlobalizeObjectGraph).
In particular, we don't want to deal with defining and carrying around
this initializer function concept until we need it. For example, if we
want to constant-fold the global slots into uses, this is a much better
representation, and it plays better with symbol-dce (the initializer
function counts as a "use" of the symbol).
(the alternative would have been to write a pass that converts the
initializer function to this form when possible, but I realized that
lots of information had been lost which made that fairly annoying -- it
was all self-inflicted anyway, so best to just go to the source
(GlobalizeObjectGraph) before the information is lost)
Now symbol-dce works nicely (no more "training" bools)
```
pt_util ~/tmp/classifier.pt --import --exported-name forward \
| npcomp-opt -torch-globalize-object-graph -inline -symbol-dce
```
IR: https://gist.github.com/silvasean/8abe63d70d24e29d6db9170ccc8d512b
The first use case is to annotate certain program constructs as either
exported or private. In this commit we plumb it down to
GlobalizeObjectGraph which makes use of this information.
Recommended review order:
1. class_annotator.h/.cpp + `test/module_import/annotations/*`
- New abstractions to communicate with Python code and annotate.
2. IR changes in TorchOps.td
- Adding "private" attribute to various things.
3. ivalue_import.cpp changes
- Module + ClassAnnotator = annotated IR
4. GlobalizeObjectGraph.cpp + tests
- use new "private" attributes to create "private" IR.
- also, tweak some of the op deleting mechanics, which was triggering
some memory errors / assertions
With this, we can run the classifier through and inline it as follows:
```
frontends/pytorch/utils/pt_util.py --import --exported-name forward ~/tmp/classifier.pt \
| npcomp-opt -torch-globalize-object-graph -inline
```
IR: https://gist.github.com/silvasean/32dcad9f6270557f412094a77cecdd69
With this + manually setting private visibility on everything, a simple
classifier can be reduced to this IR, which is looking pretty lean and
mean:
https://gist.github.com/silvasean/19e7e2e21a61ff197aeac0dd864d188f
Also, include a utility script for importing `.pt` models.
```
pt_util.py --import classifier.pt | npcomp-opt -torch-globalize-object-graph
```
This required restructuring of how we model TorchScript on import. The
main difference is that now we split out a `torch.class_type` that holds
methods and declarations of the types of each slot. This is more
consistent with TorchScript (our previous representation was
"denormalized").
Recommended reading order:
1. check out the description of `torch.class_type` in `TorchOps.td` and
look at `test/Dialect/Torch/ops.mlir` and
`frontends/pytorch/test/module_import/` to familiarize with the new
representation.
- Just look at the new IR. The diff between the old names and new
names is confusing.
2. check out `test/Dialect/Torch/globalize-object-graph*.mlir`
and read along with the pass description in
`include/npcomp/Dialect/Torch/Transforms/Passes.td`
3. Read the code in `GlobalizeObjectGraph.cpp` and miscellaneous changes
in `ivalue_importer.cpp`, `TorchOps.cpp`, etc.
It turns out that this was easiest to structure as a general IValue
importer, since torch module are just one of the possible IValue's.
We import the IValue object graph in a braindead fashion into basicpy
ops and a new `torch.nn_module` op that is used to model the
attributes/methods of a torch::jit::Module IValue. See `Torch/ops.mlir`
for an example, and also check out the .py import tests in
`frontends/pytorch/test/module_import`.
As part of this change, a few housekeeping tasks:
- extract some helpers from graph_importer.cpp
- more helpers around the C API
- misc touchups
- TensorFromElementsOp -> tensor::FromElementsOp
- `cmpi "eq", ...` -> `cmpi eq, ...`. Same for `cmpf`
- syntax change for private func ops
- some changes to the python bindings
* Most updates are mechanical except:
* python/npcomp/__init__.py and python/NpcompModule.cpp: New init/registration bits to replace some automatic things being done in the old bindings. Also an annoying linkage hack that I'll need to triage next.
* NpcompModule.cpp: New python helpers for custom types and other hard to reach items (for the new bindings).
* PybindUtils.h: Extended type casting so that the local extension can directly exchange Mlir* C types.
* python/npcomp/dialects/*: Build support and ODS bindings for local dialects.
* mlir_utils.py: Defines an ImportContext to replace the old/bad "Helper" class that tracked locations, and insertion points. This has a number of methods on it that would be good candidates to think about better ways to do them upstream.
* Also hoisted a few stand-alone samples to dedicated unit tests as they covered important things.
* More cleanup can be done, but keeping this patch as mechanical as possible to stay in NFC land (this is big enough).
Changes:
- linalg init tensor change (outs+init -> just outs)
- IntegerType::get and other builtin types now take the context as the
first arg
- LLVMType::* is gone. Now LLVM Types are just regular Type's.
--version_script doesn't work on OSX.
Shared libs are .dylibs on OSX.
TEST=Build on iMac Pro. M1 has other issues will be fixed later
Change-Id: I2bda46349a878b8265e273c05d8db6b46c0df633
Date: Mon Nov 30 15:20:30 2020 -0800
Changes:
- finalizing-bufferize is stricter now, and we need to pull in a DimOp
bufferization which was previously working by happenstance. The
offending DimOp's are actually created by the linalg bufferization
(which creates dim ops on the original tensor values, not the
converted memrefs), so the fix is moving std-bufferize after
linalg-bufferize.
Best as I can tell (e.g. from LeakSanitizer), this fixes all the leaks
except for those due to buffers created internally to the codegenned
code itself (up next I'll add the buffer deallocation pass to fix
those).
The main change is that instead of attempting to pass `refbackrt::Tensor`
to the codegenned function directly, we make all the ABI types be
UnrankedMemRef which gets passed awkwardly (but workably) as a
`{size_t rank, void *ptrToDescriptor}` on the ABI. The reason why
refbackrt::Tensor wasn't workable is that is that MLIR doesn't really
have a way to deal with the lifetime of unranked memref descriptors that
happen inside the function, which is inevitably what would happen in the
old code that would emit runtime calls to
`refbackrt.to_memref/refbackrt.from_memref` to convert back and forth to
`refbackrt::Tensor` inside the codegenned code.
So, instead of the `refbackrt.to_memref/refbackrt.from_memref` with no
real sound basis for valid lifetime management, we now have a lovely
piece of code in `refbackrt::invoke` in `Runtime.cpp` that just barely
seems to be sound. We rely on the codegenned code having these
properties, which it seems to have:
- it won't free memref descriptors or their backing buffer for arguments
of UnrankedMemRef type.
- it will allocate a separate memref descriptor for each result
UnrankedMemRef (which is ensured by having a separate memref_cast for
each)
- we can sniff the `allocatedPtr`'s (i.e. the backing buffer pointers)
to avoid double-freeing in the case of aliasing of the backing buffer
(including backing buffers for arguments feeding into results)
- to catch the case of statically allocated data (which we need to avoid
passing to `free`) , check if the `allocatedPtr` is (no joke) equal to
`0xDEADBEEF`, because there is otherwise no way to distinguish
statically allocated from malloc'ed data... (std.global_memref lowering
to LLVM by happenstance sets the allocatedPtr equal to `0xDEADBEEF`,
presumably mainly as a debugging thing)
Even with all this, we *still* need to (internally to refbackrt::invoke)
make copies of all inputs/outputs! And the details of how the LLVM-level
ABI gets laid out for e.g. function arguments/returns is still super
tricky.
This really highlights how deficient memref is as the general runtime
type for our use case. It's stewing in my mind how best to improve the
situation. My general gut feeling is that IREE's abstractions for this
are "right", but I need to think more how to distill those aspects of
IREE's design in a "reference" way for RefBackend.
Some implementation notes:
- In terms of how this is implemented, this did catch a bug in our ABI
wrapper functions in LowerToLLVM.cpp, which I had to fix (it happened to
work before through some combination of npcomprt::Tensor being passed as
a single pointer + probably me infinite-monkey-ing it until it worked)
- This actually removes 2 out of the 3 compiler runtime functions (the
only one left is "abort_if". (most of the memref descriptor code moved
from CopmilerRuntime.cpp to Runtime.cpp)
- this also means deleting `refbackrt.from_memref` and
`refbackrt.to_memref`
* Going through TODOs on the PyTorch side, this is a big cause of them (not being able to have constants for signed/unsigned).
* Added complex while in here since we're at the phase where it is better to just have things complete than partially done.
* Organizes the BasicPyOps.td file by function.
* Renamed `to_boolean` -> `as_predicate_value` (trying to consistently use "predicate" to refer to i1/low-level types and Bool/Boolean to refer to Python bool types).