This allows building NPCOMP as an external project of LLVM, similar to
how CIRCT can be built: https://github.com/llvm/circt/pull/227.
The CMake options to use this build style look like this:
```
-DLLVM_EXTERNAL_PROJECTS=npcomp \
-DLLVM_EXTERNAL_NPCOMP_SOURCE_DIR=/path/to/mlir-npcomp \
```
- TensorFromElementsOp -> tensor::FromElementsOp
- `cmpi "eq", ...` -> `cmpi eq, ...`. Same for `cmpf`
- syntax change for private func ops
- some changes to the python bindings
* Most updates are mechanical except:
* python/npcomp/__init__.py and python/NpcompModule.cpp: New init/registration bits to replace some automatic things being done in the old bindings. Also an annoying linkage hack that I'll need to triage next.
* NpcompModule.cpp: New python helpers for custom types and other hard to reach items (for the new bindings).
* PybindUtils.h: Extended type casting so that the local extension can directly exchange Mlir* C types.
* python/npcomp/dialects/*: Build support and ODS bindings for local dialects.
* mlir_utils.py: Defines an ImportContext to replace the old/bad "Helper" class that tracked locations, and insertion points. This has a number of methods on it that would be good candidates to think about better ways to do them upstream.
* Also hoisted a few stand-alone samples to dedicated unit tests as they covered important things.
* More cleanup can be done, but keeping this patch as mechanical as possible to stay in NFC land (this is big enough).
Changes:
- linalg init tensor change (outs+init -> just outs)
- IntegerType::get and other builtin types now take the context as the
first arg
- LLVMType::* is gone. Now LLVM Types are just regular Type's.
Best as I can tell (e.g. from LeakSanitizer), this fixes all the leaks
except for those due to buffers created internally to the codegenned
code itself (up next I'll add the buffer deallocation pass to fix
those).
The main change is that instead of attempting to pass `refbackrt::Tensor`
to the codegenned function directly, we make all the ABI types be
UnrankedMemRef which gets passed awkwardly (but workably) as a
`{size_t rank, void *ptrToDescriptor}` on the ABI. The reason why
refbackrt::Tensor wasn't workable is that is that MLIR doesn't really
have a way to deal with the lifetime of unranked memref descriptors that
happen inside the function, which is inevitably what would happen in the
old code that would emit runtime calls to
`refbackrt.to_memref/refbackrt.from_memref` to convert back and forth to
`refbackrt::Tensor` inside the codegenned code.
So, instead of the `refbackrt.to_memref/refbackrt.from_memref` with no
real sound basis for valid lifetime management, we now have a lovely
piece of code in `refbackrt::invoke` in `Runtime.cpp` that just barely
seems to be sound. We rely on the codegenned code having these
properties, which it seems to have:
- it won't free memref descriptors or their backing buffer for arguments
of UnrankedMemRef type.
- it will allocate a separate memref descriptor for each result
UnrankedMemRef (which is ensured by having a separate memref_cast for
each)
- we can sniff the `allocatedPtr`'s (i.e. the backing buffer pointers)
to avoid double-freeing in the case of aliasing of the backing buffer
(including backing buffers for arguments feeding into results)
- to catch the case of statically allocated data (which we need to avoid
passing to `free`) , check if the `allocatedPtr` is (no joke) equal to
`0xDEADBEEF`, because there is otherwise no way to distinguish
statically allocated from malloc'ed data... (std.global_memref lowering
to LLVM by happenstance sets the allocatedPtr equal to `0xDEADBEEF`,
presumably mainly as a debugging thing)
Even with all this, we *still* need to (internally to refbackrt::invoke)
make copies of all inputs/outputs! And the details of how the LLVM-level
ABI gets laid out for e.g. function arguments/returns is still super
tricky.
This really highlights how deficient memref is as the general runtime
type for our use case. It's stewing in my mind how best to improve the
situation. My general gut feeling is that IREE's abstractions for this
are "right", but I need to think more how to distill those aspects of
IREE's design in a "reference" way for RefBackend.
Some implementation notes:
- In terms of how this is implemented, this did catch a bug in our ABI
wrapper functions in LowerToLLVM.cpp, which I had to fix (it happened to
work before through some combination of npcomprt::Tensor being passed as
a single pointer + probably me infinite-monkey-ing it until it worked)
- This actually removes 2 out of the 3 compiler runtime functions (the
only one left is "abort_if". (most of the memref descriptor code moved
from CopmilerRuntime.cpp to Runtime.cpp)
- this also means deleting `refbackrt.from_memref` and
`refbackrt.to_memref`
* Going through TODOs on the PyTorch side, this is a big cause of them (not being able to have constants for signed/unsigned).
* Added complex while in here since we're at the phase where it is better to just have things complete than partially done.
* Organizes the BasicPyOps.td file by function.
* Renamed `to_boolean` -> `as_predicate_value` (trying to consistently use "predicate" to refer to i1/low-level types and Bool/Boolean to refer to Python bool types).
It was annoying that we were creating shape.get_extent in the middle of
the bufferization pipeline, as it required running convert-shape-to-std
at an awkward place. To make that cleaner, just open-code the
extract_element ops that shape.get_extent expands into.
This is a little gross, but it helps with the macroscopic pipeline
ordering issues. Anyway, the train is long-gone of trying to treat
shapes as some special data type that should only be operated on with
shape ops.
Also,
- reorder tensor constant bufferize (which is a module pass) to bracket
all the bufferization function passes, to make the parallelism
opportunities there clearer. Now we have a very clean little
bufferization segment of our pipeline construction.
* IREE doesn't have proper install support, so there is some temporary hoaky hacking in our CMakeLists.txt to shuttle some symlinks around.
* Reworked the original numpy e2e with IREE test to pipe through iree-translate.
* Removed all of the C++-level dependencies.
* Will generalize and apply to the PyTorch backend in a followup.
This vastly simplifies our code, allowing deleting multiple ops,
simplifying multiple passes, and removing a whole pass.
Now `refback` dialect is down to one op (refback.alloc_memref, which
simplifies allocations to just take a shape instead of individual
extents).
This involved adding a `tcp.splatted` op to splat a dynamically sized
init tensor. See rationale in TCPOps.td docs.
One interesting observation is that when lowering tcf.matmul to
linalg.matmul, we need to both 1) create the error checks and 2)
calculate a shape transfer function to create the init tensors.
Previously, 2) was deferred to bufferizing tcp.matmul later. I'm not
sure if this is a conflation of concerns or not. For now, it's not a big
burden.
* Conversions are very simple, suporting mul, maximum and add (alpha=1 only).
* Example added with pass pipeline needed to run.
* Much missing off of the golden path but sufficient for such simple cases.
* convolution, convolution_backward, _log_softmax, _log_softmax_backward_data, nll_loss_forward, nll_loss_backward, nll_loss2d_forward, nll_loss2d_backward, copy_
* Extends the recognition logic and metadata for handling inplace transformations, optional tensors, ints, lists and dropped args.
* The kernel_calls generated by test_conv_nllloss_grads.py now convert to ATen.
* The result *almost* comes out as a pure tensor program with the exception of the copy_ op, which I will do some followup work to deal with.
* More progress on #97
Now, the only bufferization we have left is lowering tensor constants to
memref, which will hopefully proceed soon after Rahul's new
std.global_memref lands + the lowering to LLVM IR. Then I'll port
LowerConstantTensorsToMemref to upstream and we'll be 100% upstream
bufferization, except for our local TCP dialect (which will probably go
away and be replaced by std elementwise + linalg named ops on tensors :)
).
* Deletes prior code generator from previous attempt (moved some of it into this one).
* Renames old generated tablegen source to "Legacy".
* Generates ODS and import rules for most binary and unary arithmetic ops.
* Removes old generated ops and integration tests that were testing details of the prior setup.
Register the following for the multiply op:
- tcf.mul
- tcp.mul
- TCP->TCP lowering
- Shape transfer, broadcasted multiplicands
- Lower to standard `MulFOp` op
* Two op interfaces, one for querying instance metadata and one for getting static data needed to construct an op from a generic form.
* For torch.generic_kernel ops, metadata is splatted in during capture from Torch (it comes from the op registry, which will work for either device capture or graph import).
* Moved the 'add' out of the generated set so I can experiment on it. It implements the TorchBuildableKernelOpInterface interface which provides its metadata.
* The ATenRecognizeKernelsPass pass generically lowers from a torch.generic_kernel to recognized ops that implement the TorchBuildableKernelOpInterface, handling the various types of transformations that we allow at this stage.
* Adds Basicpy List, Tuple, Dict types and plumbs through C API.
* Started debugging the issues around aten::conv2d capture, but a PyTorch bug is suspected.
* Was able to manually verify that the basic conv2d forward test captures correctly with a workaround.
* Need to resolve some printing issues upstream and move these tests to an integration test target (they take ~seconds to run).
The time has come for BypassShapes/LowerShapedResultsToMemref to go away :(
For the reference backend, being consistent with upstream conventions is
the name of the game now.
This is a step down in a number of ways, e.g. test clarity and
separation of concerns. But it is fewer files and fewer tests, and
*does* address the "TODO: This is really fragile". It also eliminates two
more ops from the refback dialect (sadly, they are the
shaped_results/yield that we were getting kind of fond of, but alas).
Now that it has grown source/target materialization capabilities
(spelled with ops tensor_load/tensor_to_memref), we can use it. We can
also now delete refback.memref_to_tensor/refback.tensor_to_memref.
This is also a first step to reducing the downstream functionality
needed in the refback dialect.
* Also adds two lit tests to verify that all of our extensions load without fireworks, which is a good indication that the shared library deps are sane.
* Bumps llvm-project version to use D89167.
Now the reference backend is cleanly accepts "TCP"+scalar ops.
We introduce tcf-refback-lowering-pipeline which also does TCF->TCP
conversion for convenience until we have a "target interface".
* Need to have a dag of shared library deps in order to interop across python extensions (as presented in ODM).
* Introduced add_npcomp_library and friends to mirror the MLIR setup.
* Adds a libNPCOMP.so shared library.
* Redirects tools and extensions to link against libNPCOMP.so (instead of static libs).
* Moves all libraries to lib/, all binaries to bin/ and all python extensions to python/. The invariant is that the rpaths are setup to have a one level directory structure.
* Reworks the _torch_mlir extension to build like the others (still need to come up with a consolidated rule to do this instead of open coded).
* Includes an upstream version bump to pick up needed changes.
Sizes with dynamic linking (stripped, release, asserts enabled):
libNPCOMP.so: 43M (includes much of the underlying LLVM codegen deps)
libMLIR.so: 31M
_npcomp.so: 1.6M (python extension)
_torch_mlir.so: 670K (python extension)
npcomp-capi-ir-test: 6.3K
npcomp-opt: 351K
npcomp-run-mlir: 461K
mnist-playground: 530K
Still more can be done to normalize and optimize but this gets us structurally to the starting point.
I now realize that VerboseCamelCase is not the best choice for dialect
directory/file names and C++ identifiers (take e.g. "Linalg", "Basicpy",
etc. as prior art here; not LinearAlgebra or BasicPython). If I had to
name the convention it seems to be "Shortword" (or of course just
acronym dialects like LLVM, SCF, etc.).
This rename also has the side benefit of differentiating RefBackend
directories, which now refer to the actual backend itself, from
Refback/Refbackrt, which are the dialects which happen to be used by
that backend.
This is the first in a patch series that is refactoring the
constellation of things variously called or associated with "E2E",
"RefE2E", "npcomprt", and "TCP" into a more cleanly layered result.
Concretely, this first patch fixes the fact that TCP was basically
acting like a dumping ground needed by the reference backend. This
splits it out, which is fairly mechanical, but touches a lot of lines of
code (basically replacing `tcp` with `refback` and `TCP` with
`RefBackend).
Now, the RefBackend dialect is that dumping ground, which
is slighly better, as it starts allowing TCP to become a nice clean
middle layer that is not related per se to the reference backend.
The previous name RefE2E or "reference e2e flow" was super confusing.
Now that we are seeing more clearly where the "backend" distinction
lies, the [RefBackend] commit tag is born :)
I was seeing some miscompiles due to the uninitialized data read here
before. Interestingly, this was masked in some of our previous test
cases, since the uninitialized data "always" was so small that it would
present as a rounding error for the 1.0-10.0 sized values that the
matmul was computing on.
* Adds at::Tensor -> MlirValue tracking.
* Adds conversions for tensor and scalar types to MLIR types.
* Adds npcomp C APIs for constructing custom types.
* Reworks pybind include so as to get Torch pybind helpers (needed to pass at::Tensor type from Python->C++).
Now that we upstreamed our pass, we can remove it.
The final pass that landed upstream doesn't do the shape.assuming
canonicalization to legalize that op away, so added a
restricted-canonicalizer pass that allowed to run just shape dialect
canonicalizations, which deletes the shape.assuming.
The pass ended up kind of ugly. See the TODO's on it for some potential
cleaner directions.
Date: Fri Sep 18 13:55:52 2020 -0700
- Update to linalg syntax
- New generated builders are better. Custom builder for
tcp.shaped_results is now redundant.
This cleans up the lowering pipeline to easily allow extending to
multiple binary ops. It looks fairly repetitive at multiple levels, but
I don't want to prematurely generalize. I think that in principle we
could derive a large swatch of TCF + TCP from a single linalg-style
specification. Another direction is to use an OpInterface (something
like "buildLinalgGenericBody"). I'm keeping my eye on it.
In a subsequent commit, I'll mechanically add a set of binary ops
modeled off of the std arithmetic ops.
It was previously going through this awkward route that prematurely
created linalg.generic ops, which was an annoying layering problem since
we can't compute a shape transfer function for linalg.generic in the
general case. Now we pass it through the same path as tcp.matmul, with
the shape transfer function being defined for tcp.add.
This also removed the need for TCPToLinalg (now deleted). The equivalent
of that is happening in lower-shaped-results-to-memref. One interesting
outcome of this: we're basically using linalg as a "Buffer TCP". We
might want to look into using named structured ops for more of TCP, but
that would be a big velocity hit since then any change to the ODS /
verification for those ops would be a change to the upstream structured
op ODS generator. After we have more experience defining this manually,
we should re-evaluate rebasing TCP on generated named linalg ops.
I'm pretty happy with how this turned out. It looks pretty much like it
should -- one change at each layer. This particular op bottoms out on
linalg which takes care of the rest.
- Add tcf.matmul
- Add tcp.matmul
- Add TCF->TCP lowering
- Add tcp.matmul shape transfer function (BypassShapes.cpp)
- Add tcp.matmul -> linalg.matmul lowering (LowerShapedResultsToMemref.cpp)
- Add support to LowerShapeConstraints for lowering the new
shape.cstr_require
This matmul op is pretty limited in its capabilities. There is no
batching and no multidimensional contraction. Certainly more design work
will be needed to find the right abstractions that aren't too general
but also help to canonicalize many cases from frontends. This is mainly
to show that adding a new op needn't be very "scary" once we have the
e2e infra in place.
Also,
- this clears out some exploratory cruft from the TCF dialect now that
this is starting to become real.