torch-mlir


Author	SHA1	Message	Date
Sean Silva	46aa6d0a24	[RefBackend] Fix leaks related to ABI boundaries. Best as I can tell (e.g. from LeakSanitizer), this fixes all the leaks except for those due to buffers created internally to the codegenned code itself (up next I'll add the buffer deallocation pass to fix those). The main change is that instead of attempting to pass `refbackrt::Tensor` to the codegenned function directly, we make all the ABI types be UnrankedMemRef which gets passed awkwardly (but workably) as a `{size_t rank, void *ptrToDescriptor}` on the ABI. The reason why refbackrt::Tensor wasn't workable is that MLIR doesn't really have a way to deal with the lifetime of unranked memref descriptors that happen inside the function, which is inevitably what would happen in the old code that would emit runtime calls to `refbackrt.to_memref/refbackrt.from_memref` to convert back and forth to `refbackrt::Tensor` inside the codegenned code. So, instead of the `refbackrt.to_memref/refbackrt.from_memref` with no real sound basis for valid lifetime management, we now have a lovely piece of code in `refbackrt::invoke` in `Runtime.cpp` that just barely seems to be sound. We rely on the codegenned code having these properties, which it seems to have: - it won't free memref descriptors or their backing buffer for arguments of UnrankedMemRef type. - it will allocate a separate memref descriptor for each result UnrankedMemRef (which is ensured by having a separate memref_cast for each) - we can sniff the `allocatedPtr`'s (i.e. the backing buffer pointers) to avoid double-freeing in the case of aliasing of the backing buffer (including backing buffers for arguments feeding into results) - to catch the case of statically allocated data (which we need to avoid passing to `free`), check if the `allocatedPtr` is (no joke) equal to `0xDEADBEEF`, because there is otherwise no way to distinguish statically allocated from malloc'ed data... (std.global_memref lowering to LLVM by happenstance sets the allocatedPtr equal to `0xDEADBEEF`, presumably mainly as a debugging thing) Even with all this, we *still* need to (internally to refbackrt::invoke) make copies of all inputs/outputs! And the details of how the LLVM-level ABI gets laid out for e.g. function arguments/returns is still super tricky. This really highlights how deficient memref is as the general runtime type for our use case. It's stewing in my mind how best to improve the situation. My general gut feeling is that IREE's abstractions for this are "right", but I need to think more how to distill those aspects of IREE's design in a "reference" way for RefBackend. Some implementation notes: - In terms of how this is implemented, this did catch a bug in our ABI wrapper functions in LowerToLLVM.cpp, which I had to fix (it happened to work before through some combination of npcomprt::Tensor being passed as a single pointer + probably me infinite-monkey-ing it until it worked) - This actually removes 2 out of the 3 compiler runtime functions (the only one left is "abort_if"). (most of the memref descriptor code moved from CompilerRuntime.cpp to Runtime.cpp) - this also means deleting `refbackrt.from_memref` and `refbackrt.to_memref`	2020-11-25 13:09:58 -08:00
Stella Laurenzo	3937dd14cb	Add basicpy.numeric_constant op. * Going through TODOs on the PyTorch side, this is a big cause of them (not being able to have constants for signed/unsigned). * Added complex while in here since we're at the phase where it is better to just have things complete than partially done.	2020-11-24 16:44:40 -08:00
Stella Laurenzo	bea0af419d	NFC: Prefactor some basicpy ops in advance of more type work. * Organizes the BasicPyOps.td file by function. * Renamed `to_boolean` -> `as_predicate_value` (trying to consistently use "predicate" to refer to i1/low-level types and Bool/Boolean to refer to Python bool types).	2020-11-24 15:49:37 -08:00
Sean Silva	0b7c443256	[RefBackend] Properly initialize refbackrt::Tensor refcount. Although `refCount` is initialized as `std::atomic<int> refCount{0};` in the definition of Tensor, our tail-allocating malloc would ignore it, resulting in bogus values that led to leaks. Caught with LeakSanitizer, but I added an assertion that the refcount is non-negative to begin with, which should catch this bug in the future fairly consistently (assuming the garbage refcount is negative half the time).	2020-11-24 12:01:35 -08:00
Sean Silva	1dfcfa9cd1	Add aten.mm op and "test" it e2e. Note that unlike aten.matmul which has dynamic behavior depending on the argument ranks (can do matrix-matrix, matrix-vector, batch matmul, etc.), aten.mm is just a vanilla matrix multiply, which can be lowered precisely to tcf.matmul. The "test" is really just an example that I stared at while getting my feet wet with this. We probably want something that actually tests this as part of `ninja check-npcomp`.	2020-11-20 17:21:24 -08:00
Sean Silva	64a7e83184	[RefBackend] Add refback-tcf-to-tcp-pipeline This allows invoking TCF to TCP-level conversion more easily, and starts us towards a path of factoring it out of the RefBackend.	2020-11-17 12:33:37 -08:00
Sean Silva	358159a6eb	[RefBackend] Open-code shape.get_extent as extract_element It was annoying that we were creating shape.get_extent in the middle of the bufferization pipeline, as it required running convert-shape-to-std at an awkward place. To make that cleaner, just open-code the extract_element ops that shape.get_extent expands into. This is a little gross, but it helps with the macroscopic pipeline ordering issues. Anyway, the train is long-gone of trying to treat shapes as some special data type that should only be operated on with shape ops. Also, - reorder tensor constant bufferize (which is a module pass) to bracket all the bufferization function passes, to make the parallelism opportunities there clearer. Now we have a very clean little bufferization segment of our pipeline construction.	2020-11-17 11:00:38 -08:00
Stella Laurenzo	a7ff87a922	Sever C++ level depend on IREE and rebase on exe and python interface. * IREE doesn't have proper install support, so there is some temporary hokey hacking in our CMakeLists.txt to shuttle some symlinks around. * Reworked the original numpy e2e with IREE test to pipe through iree-translate. * Removed all of the C++-level dependencies. * Will generalize and apply to the PyTorch backend in a followup.	2020-11-16 21:32:56 -08:00
Sean Silva	5227d52c26	[RefBackend] Use std.global_memref instead of homegrown thing This vastly simplifies our code, allowing deleting multiple ops, simplifying multiple passes, and removing a whole pass. Now `refback` dialect is down to one op (refback.alloc_memref, which simplifies allocations to just take a shape instead of individual extents).	2020-11-13 18:43:50 -08:00
Sean Silva	32388d938b	Make some passes run on FuncOp so they can run in parallel.	2020-11-13 16:12:18 -08:00
Stella Laurenzo	b4c7ae1e0c	Repurpose numpy-compiler compiler/runtime flow for PyTorch. * A bit gross because I took the chance to upgrade all of the backend bits to the new MLIR Python bindings and we still co-mingle the old and new for now. * Since the Python created PassManagers are configured for explicit nesting, I had to upgrade some of the pass pipelines to be explicit. * The demo in mul_maximum_e2e.py now compiles, runs through PyTorch and through the JIT, prints and asserts the same results. * I am not claiming that this is the prettiest API in this patch: consider that this is just directly using low-level APIs and there should be an intervening high level API.	2020-11-11 10:38:13 -08:00
Sean Silva	1c7c362e29	[TCP] Replace tcp.matmul with linalg.matmul. This involved adding a `tcp.splatted` op to splat a dynamically sized init tensor. See rationale in TCPOps.td docs. One interesting observation is that when lowering tcf.matmul to linalg.matmul, we need to both 1) create the error checks and 2) calculate a shape transfer function to create the init tensors. Previously, 2) was deferred to bufferizing tcp.matmul later. I'm not sure if this is a conflation of concerns or not. For now, it's not a big burden.	2020-11-10 18:58:28 -08:00
Sean Silva	0427aacb0b	[TCP] Replace elementwise ops with std elementwise ops.	2020-11-10 18:58:28 -08:00
Stella Laurenzo	e60dc2470e	Add aten.maximum op and conversions from aten->tcf. * Conversions are very simple, supporting mul, maximum and add (alpha=1 only). * Example added with pass pipeline needed to run. * Much missing off of the golden path but sufficient for such simple cases.	2020-11-04 17:20:54 -08:00
Stella Laurenzo	6c702b149f	Add a number of kernels and new patterns. * convolution, convolution_backward, _log_softmax, _log_softmax_backward_data, nll_loss_forward, nll_loss_backward, nll_loss2d_forward, nll_loss2d_backward, copy_ * Extends the recognition logic and metadata for handling inplace transformations, optional tensors, ints, lists and dropped args. * The kernel_calls generated by test_conv_nllloss_grads.py now convert to ATen. * The result almost comes out as a pure tensor program with the exception of the copy_ op, which I will do some followup work to deal with. * More progress on #97	2020-11-04 14:36:59 -08:00
Sean Silva	3dab9056f0	Bump llvm-project to eb8d386d513bf4243d0adb814d862af25b8c4e2f Two changes: - no more "verifyPasses" constructor arg for PassManager - OpPassManager defaults to requiring explicit "nest" calls when created via the C++ API. The behavior upstream for mlir-opt still obeys the "implicit" mode, so I just slapped that onto all our pass managers. I pinged https://reviews.llvm.org/D90671 to get a signal for whether we are expected to migrate to explicit mode. If so, I'll do that too later.	2020-11-04 14:14:46 -08:00
Sean Silva	57e58b9272	[RefBackend] Use upstream func-bufferize pass. Now, the only bufferization we have left is lowering tensor constants to memref, which will hopefully proceed soon after Rahul's new std.global_memref lands + the lowering to LLVM IR. Then I'll port LowerConstantTensorsToMemref to upstream and we'll be 100% upstream bufferization, except for our local TCP dialect (which will probably go away and be replaced by std elementwise + linalg named ops on tensors :) ).	2020-11-02 17:38:33 -08:00
Sean Silva	0761df9f58	Bump llvm-project to 72ddd559b8aafef402091f8e192e025022e4ebef - Fixup to OpBuilderDAG - Update for affine map naming	2020-10-30 18:12:41 -07:00
Marius Brehler	30adf9e6b0	Fix TCP_MulOp tablegen definition	2020-10-28 19:28:15 +01:00
Stella Laurenzo	c08935a418	Rewrite ATen ODS code generator to be based on new op registry and new signature recognition system. * Deletes prior code generator from previous attempt (moved some of it into this one). * Renames old generated tablegen source to "Legacy". * Generates ODS and import rules for most binary and unary arithmetic ops. * Removes old generated ops and integration tests that were testing details of the prior setup.	2020-10-28 10:37:37 -07:00
Aaron J Arthurs	94ea6f7c92	[RefBackend] Support element-wise multiply op Register the following for the multiply op: - tcf.mul - tcp.mul - TCP->TCP lowering - Shape transfer, broadcasted multiplicands - Lower to standard `MulFOp` op	2020-10-27 19:41:23 -07:00
Stella Laurenzo	510f226df2	Expose signature metadata to ops and implement ATenRecognizeKernelsPass pass. * Two op interfaces, one for querying instance metadata and one for getting static data needed to construct an op from a generic form. * For torch.generic_kernel ops, metadata is splatted in during capture from Torch (it comes from the op registry, which will work for either device capture or graph import). * Moved the 'add' out of the generated set so I can experiment on it. It implements the TorchBuildableKernelOpInterface interface which provides its metadata. * The ATenRecognizeKernelsPass pass generically lowers from a torch.generic_kernel to recognized ops that implement the TorchBuildableKernelOpInterface, handling the various types of transformations that we allow at this stage.	2020-10-26 20:31:45 -07:00
Stella Laurenzo	91fc83d2e7	NFC: Transition ATen passes to tablegen registration.	2020-10-22 17:12:44 -07:00
Stella Laurenzo	9618c2dbf7	NFC: Re-organize ATen directory structure and fix warnings. * Still some more work to do on the Transforms tree to bring it in line with the others (will do that as I add things).	2020-10-22 14:13:26 -07:00
Sean Silva	14470f9ff6	[RefBackend] Use upstream std bufferization. It now subsumes the one we had.	2020-10-21 16:46:56 -07:00
Stella Laurenzo	58adb6bd8e	Work around various PyTorch issues in support of convolution. * Enables the conv2d fwd test and ResA (which are both small). * Deletes resnet18 and vgg, which both run but generate output that crashes FileCheck and lit (or at least makes them take an eternity).	2020-10-21 12:44:31 -07:00
Stella Laurenzo	029815152e	Add remaining pieces to capture full example models. * Adds Basicpy List, Tuple, Dict types and plumbs through C API. * Started debugging the issues around aten::conv2d capture, but a PyTorch bug is suspected. * Was able to manually verify that the basic conv2d forward test captures correctly with a workaround. * Need to resolve some printing issues upstream and move these tests to an integration test target (they take ~seconds to run).	2020-10-19 22:16:59 -07:00
Stella Laurenzo	9e52f6235b	More progress on PyTorch acap device capture. * Now gets far enough to capture batch_norm. * Has some issues still with in-place ops. * Can materialize constants. * Includes an upgrade to PyTorch nightly, which has important bug fixes for fallback and boxed kernel dispatch. * Fixes #78, #79, #80. * Will do more testing in a follow-up once further bugs are fixed that facilitate getting at the other features.	2020-10-15 21:43:21 -07:00
Sean Silva	06a8ba6900	[RefBackend] Use more idiomatic bufferize pattern for TCP. The time has come for BypassShapes/LowerShapedResultsToMemref to go away :( For the reference backend, being consistent with upstream conventions is the name of the game now. This is a step down in a number of ways, e.g. test clarity and separation of concerns. But it is fewer files and fewer tests, and does address the "TODO: This is really fragile". It also eliminates two more ops from the refback dialect (sadly, they are the shaped_results/yield that we were getting kind of fond of, but alas).	2020-10-15 20:15:53 -07:00
Sean Silva	b6bdc8cc4f	[RefBackend] Use upstream BufferizeTypeConverter Now that it has grown source/target materialization capabilities (spelled with ops tensor_load/tensor_to_memref), we can use it. We can also now delete refback.memref_to_tensor/refback.tensor_to_memref. This is also a first step to reducing the downstream functionality needed in the refback dialect.	2020-10-15 15:58:51 -07:00
Sean Silva	93fc21dad0	[RefBackend] Split out TCF->TCP conversion. Now the reference backend cleanly accepts "TCP"+scalar ops. We introduce tcf-refback-lowering-pipeline which also does TCF->TCP conversion for convenience until we have a "target interface".	2020-10-12 11:56:39 -07:00
Sean Silva	631c8070df	[RefBackend] Put JITModule in refback namespace.	2020-10-08 09:07:00 -07:00
Sean Silva	7edb5f3641	[RefBackend] Rename RefBackend dialect to Refback I now realize that VerboseCamelCase is not the best choice for dialect directory/file names and C++ identifiers (take e.g. "Linalg", "Basicpy", etc. as prior art here; not LinearAlgebra or BasicPython). If I had to name the convention it seems to be "Shortword" (or of course just acronym dialects like LLVM, SCF, etc.). This rename also has the side benefit of differentiating RefBackend directories, which now refer to the actual backend itself, from Refback/Refbackrt, which are the dialects which happen to be used by that backend.	2020-10-08 09:07:00 -07:00
Sean Silva	bf99a82832	[RefBackend] Rename Npcomprt dialect to Refbackrt.	2020-10-08 09:07:00 -07:00
Sean Silva	83ad70ef54	[RefBackend] Move runtime related code under npcomp/RefBackend/ Other than the dialect definitions (which will live in standard Dialect/ subdirectory), the goal here is to keep RefBackend-related code nested in {include/npcomp,lib,test}/RefBackend.	2020-10-08 09:07:00 -07:00
Sean Silva	21255d5f8e	[RefBackend] Rename "E2E" to RefBackend.	2020-10-07 10:29:48 -07:00
Sean Silva	5017430dc7	[RefBackend] Split out RefBackend (refback) dialect from TCP. This is the first in a patch series that is refactoring the constellation of things variously called or associated with "E2E", "RefE2E", "npcomprt", and "TCP" into a more cleanly layered result. Concretely, this first patch fixes the fact that TCP was basically acting like a dumping ground needed by the reference backend. This splits it out, which is fairly mechanical, but touches a lot of lines of code (basically replacing `tcp` with `refback` and `TCP` with `RefBackend`). Now, the RefBackend dialect is that dumping ground, which is slightly better, as it starts allowing TCP to become a nice clean middle layer that is not related per se to the reference backend. The previous name RefE2E or "reference e2e flow" was super confusing. Now that we are seeing more clearly where the "backend" distinction lies, the [RefBackend] commit tag is born :)	2020-10-07 10:29:48 -07:00
Stella Laurenzo	ad3ddb9edb	Implement torch.kernel_call capture. * Had to stop short of modifying the function return signature because of a missing C-API upstream. * Committing here is good enough for a test and will resolve the various TODOs about upstream APIs next.	2020-10-06 21:54:28 -07:00
Stella Laurenzo	3d74337be0	Add a torch.kernel_call op and associated predicates.	2020-09-29 15:10:38 -07:00
Stella Laurenzo	2c9ca79c89	Add boilerplate for Torch dialect.	2020-09-28 15:26:17 -07:00
Stella Laurenzo	b5f010284f	Add boilerplate to do device capture (pytorch 1.6). * Uses the new dispatcher API. * Just prints to the console for the moment when an op is captured. * Executes the op through the existing implementation.	2020-09-28 10:30:54 -07:00
Sean Silva	16c26ef57e	[RefE2E] Use upstream shape constraint conversion pass. Now that we upstreamed our pass, we can remove it. The final pass that landed upstream doesn't do the shape.assuming canonicalization to legalize that op away, so added a restricted-canonicalizer pass that allowed to run just shape dialect canonicalizations, which deletes the shape.assuming. The pass ended up kind of ugly. See the TODO's on it for some potential cleaner directions.	2020-09-28 09:34:44 -07:00
Sean Silva	6ea37cfed6	Bump llvm-project to 9ed1e5873c19eb817fb9e36d0262c7effee5d35e Date: Fri Sep 18 13:55:52 2020 -0700 - Update to linalg syntax - New generated builders are better. Custom builder for tcp.shaped_results is now redundant.	2020-09-28 09:34:44 -07:00
Sean Silva	f9b37c55b7	[RefE2E] Add support for unary ops exp and tanh This is fairly mechanical.	2020-09-24 18:41:30 -07:00
Sean Silva	c69e9fabc5	[RefE2E] Add support for "max". This cleans up the lowering pipeline to easily allow extending to multiple binary ops. It looks fairly repetitive at multiple levels, but I don't want to prematurely generalize. I think that in principle we could derive a large swath of TCF + TCP from a single linalg-style specification. Another direction is to use an OpInterface (something like "buildLinalgGenericBody"). I'm keeping my eye on it. In a subsequent commit, I'll mechanically add a set of binary ops modeled off of the std arithmetic ops.	2020-09-22 18:38:32 -07:00
Stella Laurenzo	bc7c852379	Add more ops from the original integration. * Still need to add a systematic mechanism for discovering gradient ops. * Work needed on the various _ suffixed inplace ops. * Other randoms still not mapped. * Outside of this commit, I do have enough commented/reworked to roughly build but that will take another handful of commits to get going.	2020-09-18 19:11:18 -07:00
Sean Silva	276f5b80ea	[RefE2E] Add assemblyFormat for TCF and TCP ops and tidy up.	2020-09-18 15:03:53 -07:00
Sean Silva	dc8afc9271	[RefE2E] Refactor how tcf.add is lowered. It was previously going through this awkward route that prematurely created linalg.generic ops, which was an annoying layering problem since we can't compute a shape transfer function for linalg.generic in the general case. Now we pass it through the same path as tcp.matmul, with the shape transfer function being defined for tcp.add. This also removed the need for TCPToLinalg (now deleted). The equivalent of that is happening in lower-shaped-results-to-memref. One interesting outcome of this: we're basically using linalg as a "Buffer TCP". We might want to look into using named structured ops for more of TCP, but that would be a big velocity hit since then any change to the ODS / verification for those ops would be a change to the upstream structured op ODS generator. After we have more experience defining this manually, we should re-evaluate rebasing TCP on generated named linalg ops.	2020-09-18 15:03:53 -07:00
Sean Silva	d8675f8ad2	[RefE2E] Add support for matmul. I'm pretty happy with how this turned out. It looks pretty much like it should -- one change at each layer. This particular op bottoms out on linalg which takes care of the rest. - Add tcf.matmul - Add tcp.matmul - Add TCF->TCP lowering - Add tcp.matmul shape transfer function (BypassShapes.cpp) - Add tcp.matmul -> linalg.matmul lowering (LowerShapedResultsToMemref.cpp) - Add support to LowerShapeConstraints for lowering the new shape.cstr_require This matmul op is pretty limited in its capabilities. There is no batching and no multidimensional contraction. Certainly more design work will be needed to find the right abstractions that aren't too general but also help to canonicalize many cases from frontends. This is mainly to show that adding a new op needn't be very "scary" once we have the e2e infra in place. Also, - this clears out some exploratory cruft from the TCF dialect now that this is starting to become real.	2020-09-18 11:31:01 -07:00
Sean Silva	75f57b461e	Totally rework RefE2E tensor to memref flow. (#42) This now gets the overall "RefE2E" compilation stack to a point that I'm fairly happy with. We simplify it by mostly embracing the "descriptor" view of the world. The overall flow is best understood by reading through the createE2ELoweringPipeline function in lib/E2E/E2E.cpp That function creates a pass pipeline that lowers from "TCF" (which is ~numpy level of abstraction) down to LLVM IR. A brief high-level summary of what happens there: 1. TCF to TCP conversion. This involves reifying error handling in the form of shape constraints. See test/Conversion/TCFToTCP/basic.mlir 2. Lowering shape constraints. This converts shape constraints into eager error-handling code. See test/E2E/lower-shape-constraints.mlir This pass will soon go upstream. Because this lowers to std.assert, some later passes like LowerToNpcomprtABI and LowerToLLVM are updated to properly plumb this through e2e. See test/npcomp-run-mlir/invalid-broadcast.mlir for an execution test that properly aborts in case of an error. 3. Lowering tensors to memrefs. This is done via a series of passes rather than a single mega conversion. Unlike the previous code that mixed in the npcomprt ABI stuff here, it's now a very clean "pure memref" conversion. See test/E2E/lower-*-to-memref.mlir and lib/E2E/TensorToMemref/ Most of the changes are concentrated here. 4. As part of the above, we use the upstream ConvertShapeToStandard for lowering shapes. 5. We lower linalg to loops and lower loops to CFG using upstream passes. 6. Rewrite the "ABI" boundaries of the program to npcomprt data structures (LowerToNpcomprtABI). This mainly affects ABI boundaries and how global tensor constants are represented. One of the major improvements in this commit is that now it's a very clean rewrite that just replaces memrefs on ABI boundaries with !npcomprt.tensor (before there was a get_extent function that is not needed). See test/E2E/lower-to-npcomprt-abi.mlir 7. Lower to LLVM with upstream mlir patterns + some patterns for the npcomprt lowerings. One aspect here that is still a remnant of a non-descriptor-based tensor to memref flow is the BypassShapes + LowerShapedResultsToMemref. BypassShapes wraps the "tensor compute" ops in a tcp.shaped_results (basically a "tie_shape" kind of op), and then LowerShapedResultsToMemref uses those annotations to allocate output buffers while lowering the "tensor compute ops". Note that there are very few "tensor compute" ops currently supported (tcp.add + tcp.broadcast_to), so we just hardcode them in both passes. Realistically, I expect this to go away as we fully embrace the descriptor-based approach for simplicity, so don't look too deep into it.	2020-09-16 17:31:40 -07:00


141 Commits (251aa6e435de7fca2248ad0f56be7c9edde8c03a)