Best as I can tell (e.g. from LeakSanitizer), this fixes all the leaks
except for those due to buffers created internally to the codegenned
code itself (up next I'll add the buffer deallocation pass to fix
those).
The main change is that instead of attempting to pass `refbackrt::Tensor`
to the codegenned function directly, we make all the ABI types be
UnrankedMemRef which gets passed awkwardly (but workably) as a
`{size_t rank, void *ptrToDescriptor}` on the ABI. The reason why
refbackrt::Tensor wasn't workable is that is that MLIR doesn't really
have a way to deal with the lifetime of unranked memref descriptors that
happen inside the function, which is inevitably what would happen in the
old code that would emit runtime calls to
`refbackrt.to_memref/refbackrt.from_memref` to convert back and forth to
`refbackrt::Tensor` inside the codegenned code.
So, instead of the `refbackrt.to_memref/refbackrt.from_memref` with no
real sound basis for valid lifetime management, we now have a lovely
piece of code in `refbackrt::invoke` in `Runtime.cpp` that just barely
seems to be sound. We rely on the codegenned code having these
properties, which it seems to have:
- it won't free memref descriptors or their backing buffer for arguments
of UnrankedMemRef type.
- it will allocate a separate memref descriptor for each result
UnrankedMemRef (which is ensured by having a separate memref_cast for
each)
- we can sniff the `allocatedPtr`'s (i.e. the backing buffer pointers)
to avoid double-freeing in the case of aliasing of the backing buffer
(including backing buffers for arguments feeding into results)
- to catch the case of statically allocated data (which we need to avoid
passing to `free`) , check if the `allocatedPtr` is (no joke) equal to
`0xDEADBEEF`, because there is otherwise no way to distinguish
statically allocated from malloc'ed data... (std.global_memref lowering
to LLVM by happenstance sets the allocatedPtr equal to `0xDEADBEEF`,
presumably mainly as a debugging thing)
Even with all this, we *still* need to (internally to refbackrt::invoke)
make copies of all inputs/outputs! And the details of how the LLVM-level
ABI gets laid out for e.g. function arguments/returns is still super
tricky.
This really highlights how deficient memref is as the general runtime
type for our use case. It's stewing in my mind how best to improve the
situation. My general gut feeling is that IREE's abstractions for this
are "right", but I need to think more how to distill those aspects of
IREE's design in a "reference" way for RefBackend.
Some implementation notes:
- In terms of how this is implemented, this did catch a bug in our ABI
wrapper functions in LowerToLLVM.cpp, which I had to fix (it happened to
work before through some combination of npcomprt::Tensor being passed as
a single pointer + probably me infinite-monkey-ing it until it worked)
- This actually removes 2 out of the 3 compiler runtime functions (the
only one left is "abort_if". (most of the memref descriptor code moved
from CopmilerRuntime.cpp to Runtime.cpp)
- this also means deleting `refbackrt.from_memref` and
`refbackrt.to_memref`
* Going through TODOs on the PyTorch side, this is a big cause of them (not being able to have constants for signed/unsigned).
* Added complex while in here since we're at the phase where it is better to just have things complete than partially done.
* Organizes the BasicPyOps.td file by function.
* Renamed `to_boolean` -> `as_predicate_value` (trying to consistently use "predicate" to refer to i1/low-level types and Bool/Boolean to refer to Python bool types).
Although `refCount` is initialized as `std::atomic<int> refCount{0};` in
the definition of Tensor, our tail-allocating malloc would ignore it,
resulting in bogus values that led to leaks.
Caught with LeakSanitizer, but I added an assertion that the refcount is
non-negative to begin with, which should catch this bug in the future
fairly consistently (assuming the garbage refcount is negative half the
time).
* Does not handle all features yet but should conservatively fail on unsupported things.
* Location tracking is still somewhat mismatched between what TorchScript and MLIR do. Likely need a better heuristic for tracking locations from defs for nodes that do not carry location.
* Sets the ground-work for a specialized/generic split but only implements the generic side.
* Had some evidence that this requires a recent bump of PT nightly (within the last month) to pick up pybind11 2.6, which includes some cross-module symbol fixes (vs the previously sync'd version). No source changes, but older versions fail to cast function types at runtime.
* Incorporates source fixes.
* Uses upstream pybind11 detection logic.
* Patches CI.
* This may break the CI, which will need to be fixed manually in a followup.
Note that unlike aten.matmul which has dynamic behavior
depending on the argument ranks (can do matrix-matrix, matrix-vector,
batch matmul, etc.), aten.mm is just a vanilla matrix
multiply, which can be lowered precisely to tcf.matmul.
The "test" is really just an example that I stared at while getting my
feet wet with this. We probably want something that actually tests this
as part of `ninja check-npcomp`.
After the recent change of cmake variables
from PYTHON_INCLUDE_DIRS to Python3_INCLUDE_DIRS
and PYTHON_LIBRARIES to Python3_LIBRARIES, there were
a few files that still had references to the old
variables. This patch fixes that.
It was annoying that we were creating shape.get_extent in the middle of
the bufferization pipeline, as it required running convert-shape-to-std
at an awkward place. To make that cleaner, just open-code the
extract_element ops that shape.get_extent expands into.
This is a little gross, but it helps with the macroscopic pipeline
ordering issues. Anyway, the train is long-gone of trying to treat
shapes as some special data type that should only be operated on with
shape ops.
Also,
- reorder tensor constant bufferize (which is a module pass) to bracket
all the bufferization function passes, to make the parallelism
opportunities there clearer. Now we have a very clean little
bufferization segment of our pipeline construction.
* IREE doesn't have proper install support, so there is some temporary hoaky hacking in our CMakeLists.txt to shuttle some symlinks around.
* Reworked the original numpy e2e with IREE test to pipe through iree-translate.
* Removed all of the C++-level dependencies.
* Will generalize and apply to the PyTorch backend in a followup.
This vastly simplifies our code, allowing deleting multiple ops,
simplifying multiple passes, and removing a whole pass.
Now `refback` dialect is down to one op (refback.alloc_memref, which
simplifies allocations to just take a shape instead of individual
extents).
* In most situations, this eliminates the need to explicitly set a path to the Torch cmake files.
* Also upgrades to new Python3 find package. (should eliminate 2.x mismatches)
* Since PyTorch is located by asking Python where it is, this eliminates a lot of causes of mismatch. (one source of truth)
* Fixes#107
* I wouldn't say I love what had to be done here. Worth a conversation with the PT devs (probably as part of a rollup of a bunch of this stuff).
* A bit gross because I took the chance to upgrade all of the backend bits to the new MLIR Python bindings and we still co-mingle the old and new for now.
* Since the Python created PassManagers are configured for explicit nesting, I had to upgrade some of the pass pipelines to be explicit.
* The demo in mul_maximum_e2e.py now compiles, runs through PyTorch and through the JIT, prints and asserts the same results.
* I am not claiming that this is the prettiest API in this patch: consider that this is just directly using low-level APIs and there should be an intervening high level API.
This involved adding a `tcp.splatted` op to splat a dynamically sized
init tensor. See rationale in TCPOps.td docs.
One interesting observation is that when lowering tcf.matmul to
linalg.matmul, we need to both 1) create the error checks and 2)
calculate a shape transfer function to create the init tensors.
Previously, 2) was deferred to bufferizing tcp.matmul later. I'm not
sure if this is a conflation of concerns or not. For now, it's not a big
burden.
* We're building libLLVM.so anyway. Saves a lot of time/space to link tools against it.
* MLIR tools do not yet respect this (but it doesn't seem to hurt).
* Incorporates a dep on the new MLIRPublicAPI shared library.
* More work is needed to further separate npcomp between public API and impl libraries, but amalgamating them will hold until then.
* Conversions are very simple, suporting mul, maximum and add (alpha=1 only).
* Example added with pass pipeline needed to run.
* Much missing off of the golden path but sufficient for such simple cases.
* convolution, convolution_backward, _log_softmax, _log_softmax_backward_data, nll_loss_forward, nll_loss_backward, nll_loss2d_forward, nll_loss2d_backward, copy_
* Extends the recognition logic and metadata for handling inplace transformations, optional tensors, ints, lists and dropped args.
* The kernel_calls generated by test_conv_nllloss_grads.py now convert to ATen.
* The result *almost* comes out as a pure tensor program with the exception of the copy_ op, which I will do some followup work to deal with.
* More progress on #97
Two changes:
- no more "verifyPasses" constructor arg for PassManager
- OpPassManager defaults to requiring explicit "nest" calls when created
via the C++ API. The behavior upstream for mlir-opt still obeys the
"implicit" mode, so I just slapped that onto all our pass managers.
I pinged https://reviews.llvm.org/D90671 to get a signal for whether we
are expected to migrate to explicit mode. If so, I'll do that too later.
* Enables -gsplit-dwarf for both LLVM and NPCOMP, reducing the occurrence of the ~GB scale binaries.
* CMake shared linking seems incompatible with this, so shared objects are still "too big" but there are few of them.
* Reduces disk thrash on clean/install of everything.
Now, the only bufferization we have left is lowering tensor constants to
memref, which will hopefully proceed soon after Rahul's new
std.global_memref lands + the lowering to LLVM IR. Then I'll port
LowerConstantTensorsToMemref to upstream and we'll be 100% upstream
bufferization, except for our local TCP dialect (which will probably go
away and be replaced by std elementwise + linalg named ops on tensors :)
).
The current code was inserting all build_list ops
after the last constant op since it was assuming that all
elements being passed in were constants.
This patch replaces that patch with a new function that
inserts the build_list ops before the terminator.
Also modifies test_export_conv2d_fwd.py since its output
no longer matches.
TEST: Added test_export_cat.py which is the code in #102
* This is sufficient to capture the forward and backward pass and gradients of a convolutional model with an nllloss.
* As with the forward conv, the backward conv is a special case wrapped in an enigma on the PyTorch side. There aren't many like it, so special casing is just what we do.
* When I traced this, I found that the copy_ op is not yet boxing compatible so I had to map it manually. If there are many more like this, I'll probably do something a bit more clever to reduce duplication.
* This exposes new signature patterns that will need to be handled by the ATen lowering. Will take care of that next: It will be nice to have an e2e of a non-trivial case with full gradients.
* Fixes#97.