torch-mlir

Commit Graph

Author	SHA1	Message	Date
Sean Silva	2efda323ff	Significantly restructure torch/aten import design. This is a really major and invasive restructuring of the way we get torch operators (`torch::jit::Operator` / `c10::OperatorHandle`) into MLIR. Please forgive the challenging review, but due to the sheer invasiveness, it wasn't really practical to do it in sane smaller pieces. This fully replaces everything that was already working on the TorchScript path (actually, more -- we added tanh support to TorchToLinalg in order to delete the older code paths). Additionally, I've kept the lights on for the acap path too, including what little e2e stuff was working before (for expediency I made a few tiny compromises along the way that will be easy to undo when we give that path proper attention). Overview of the new design: - The torch operator `somens::someunqualname.someoverloadname` is imported as `torch.somens.someunqualname.someoverloadname` (skip the last dotted part if the overload name is empty), OR, if we don't have such an op registered, it is imported as `torch.operator "somens.someunqualname.someoverloadname" (...) : ...`. - The addition of the "overload name" is a critical element here, as the `(ns,unqual,overload)` triple is unique, which solves a lot of problems we were having. - This involves having separate MLIR ops for the `trailing_` and `.out` variants and all the different overloads. This seemed necessary, because the set of overloads is so wild and varied and unstructured. The previous design was leaning into some underlying structure that just isn't there -- the default situation is the "random overload that we want to manage on the MLIR side", rather than that being an exception. E.g. `aten::ne` (not-equal) has 21 overloads, only 4 of which are c10 dispatcher ops (see [gist](https://gist.github.com/silvasean/190ba918c550c956260e21254e1b8aa1)), and the "out" variant is really called `.Tensor_out` instead of `.out` as it frequently is for other ops. 
- Rationale for all being in `torch` namespace: the set of operators is so varied and unstructured that "dialect per namespace" doesn't result in anything resembling the typical MLIR dialect boundary expectations. We could maybe draw the boundary at dispatcher ops vs non-dispatcher ops, but that doesn't seem to really result in very much useful structure at this point in time. - Note: within the torch operator registry, we effectively have a mini-basicpy subdialect (already type-resolved), which is reasonably structured. - The existing Torch op interfaces are also removed -- now that we track the overload name, we can losslessly find the original operator. - Instead of `ATenRecognizeKernelsPass`, we now have a `ReduceOpVariantsPass` that keys off certain traits (and perhaps eventually interfaces) to reduce variants of ops to a smaller set, ideally operating on immutable tensors and using surrounding ops to model the mutability/aliasing aspects. - Note: `torch.ns.unqual.overload` ops allow both immutable and mutable tensors (unlike the previous hard distinction in the common case). This is a precursor to a future change that will introduce a bona fide `!torch.tensor` type that will clean up a bunch of stuff. - `TorchToLinalg` / `TorchToStd` supersede the existing "ATen->TCF->TCP->Linalg" path. - The new `torch_ods_gen.py` supersedes `torch_signature_ods_gen.py`. It should look somewhat familiar, but the benefit of hindsight has allowed a lot of simplifications. The overall trend seems to be to make the `torch` dialect a nice layer independent of anything else. It feels like, as a natural result of various future changes, we will be removing the reliance on basicpy+numpy dialects and have a nice self-contained type system too that properly models the TorchScript type system (including proper subtyping, mutable/immutable tensors, optional dtype, etc.). Recommended review order: - Start at some of the new import IR, e.g. 
in `frontends/pytorch/test/node_import/prim.py`, `frontends/pytorch/test/acap_export/test_export_add3.py`, and other tests. - `frontends/pytorch/python/torch_mlir_utils/codegen/torch_ods_gen.py` and associated generated files: - `include/npcomp/Dialect/Torch/IR/GeneratedAtenOps.td` - `include/npcomp/Dialect/Torch/IR/GeneratedPrimOps.td` - Inspect `ReduceOpVariants.cpp` / `reduce-op-variants.mlir` and the new traits in `include/npcomp/Dialect/Torch/IR/TorchTraits.h` - Various code changes in the import path in `frontends/pytorch/csrc/builder`. Probably most interesting is the new code in `torch_to_mlir_utils.cpp` that has the logic to create the `torch.operator` ops or `torch.ns.unqual.overload` ops. This is the [new ResNet IR](https://gist.github.com/silvasean/5407aafb710d07612b7b5b92eabecebe), just to be able to look at a substantial sample of IR in the new style.	2021-05-19 13:37:39 -07:00
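The naming rule in commit 2efda323ff (a registered op `torch.ns.unqual.overload`, otherwise a generic `torch.operator "ns.unqual.overload"`) can be sketched as a small Python function. This is a minimal illustration, not the code in `torch_to_mlir_utils.cpp`; the function name `mlir_op_for` and the `registered_ops` set (a stand-in for the ops the `torch` dialect defines in ODS) are hypothetical.

```python
from typing import Set


def mlir_op_for(qualified_name: str, overload: str, registered_ops: Set[str]) -> str:
    """Map a torch operator (e.g. "aten::ne" + overload "Tensor_out") to an MLIR op spelling.

    `registered_ops` is a hypothetical stand-in for the set of ops that have
    dedicated ODS definitions in the `torch` dialect.
    """
    ns, unqual = qualified_name.split("::")
    dotted = f"{ns}.{unqual}"
    if overload:  # skip the last dotted part when the overload name is empty
        dotted += f".{overload}"
    op_name = f"torch.{dotted}"
    if op_name in registered_ops:
        return op_name
    # Fallback: a generic op that still carries the unique (ns, unqual, overload) triple.
    return f'torch.operator "{dotted}"'


registered = {"torch.aten.tanh", "torch.aten.ne.Scalar"}
print(mlir_op_for("aten::tanh", "", registered))         # torch.aten.tanh
print(mlir_op_for("aten::ne", "Tensor_out", registered))  # torch.operator "aten.ne.Tensor_out"
```

Because the `(ns, unqual, overload)` triple is unique, the fallback spelling loses no information; an op that later gains an ODS definition simply switches to the dedicated name.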
Bryce Arden	4591884d06	[refbackrt] Scalar arg support * Adds f32 scalar argument support across the ABI boundary. * Adds support for passing input type / shape information across the ABI boundary * Adds support for parsing / creating input FloatAttr's in `npcomp-run-mlir`	2021-03-23 13:16:44 -07:00
Aaron Arthurs	4fd9b4afb5	Import ATen conv2d conversion and test (#180) * Import ATen conv2d conversion and test This is a first attempt at expanding ATen-to-TCF conversion for the conv2d operator. Eventually, this will come into use when lowering a high-level conv-based model.	2021-03-12 17:21:16 -08:00
Sean Silva	c424c24ed8	Bump llvm-project to c68d2895a1f4019b387c69d1e5eec31b0eb5e7b0 - dialect registration - StringAttr::get: order of context arg - math dialect - LogicalResult nodiscard - error message for invalid broadcast	2021-02-22 12:23:24 -08:00
Aaron J Arthurs	63ee4f268a	Import basic TCP pad test	2021-01-28 12:01:35 -08:00
Aaron Arthurs	85898aaf10	Add TCF convolutional op with bias addition (#137)	2020-12-15 12:53:12 -08:00
Sean Silva	b2077738ca	Bump llvm-project to 444822d77a7fea28aa49edf24533c987efa1b2ee Fixes: - renames StandardTypes -> BuiltinTypes - std.extract_element -> tensor.extract	2020-12-11 14:43:38 -08:00
Sean Silva	46aa6d0a24	[RefBackend] Fix leaks related to ABI boundaries. As best I can tell (e.g. from LeakSanitizer), this fixes all the leaks except for those due to buffers created internally to the codegenned code itself (up next I'll add the buffer deallocation pass to fix those). The main change is that instead of attempting to pass `refbackrt::Tensor` to the codegenned function directly, we make all the ABI types be UnrankedMemRef which gets passed awkwardly (but workably) as a `{size_t rank, void *ptrToDescriptor}` on the ABI. The reason why refbackrt::Tensor wasn't workable is that MLIR doesn't really have a way to deal with the lifetime of unranked memref descriptors that happen inside the function, which is inevitably what would happen in the old code that would emit runtime calls to `refbackrt.to_memref/refbackrt.from_memref` to convert back and forth to `refbackrt::Tensor` inside the codegenned code. So, instead of the `refbackrt.to_memref/refbackrt.from_memref` with no real sound basis for valid lifetime management, we now have a lovely piece of code in `refbackrt::invoke` in `Runtime.cpp` that just barely seems to be sound. We rely on the codegenned code having these properties, which it seems to have: - it won't free memref descriptors or their backing buffer for arguments of UnrankedMemRef type. - it will allocate a separate memref descriptor for each result UnrankedMemRef (which is ensured by having a separate memref_cast for each) - we can sniff the `allocatedPtr`'s (i.e. the backing buffer pointers) to avoid double-freeing in the case of aliasing of the backing buffer (including backing buffers for arguments feeding into results) - to catch the case of statically allocated data (which we need to avoid passing to `free`), we check whether the `allocatedPtr` is (no joke) equal to `0xDEADBEEF`, because there is otherwise no way to distinguish statically allocated from malloc'ed data... 
(std.global_memref lowering to LLVM by happenstance sets the allocatedPtr equal to `0xDEADBEEF`, presumably mainly as a debugging thing.) Even with all this, we still need to (internally to refbackrt::invoke) make copies of all inputs/outputs! And the details of how the LLVM-level ABI gets laid out for e.g. function arguments/returns are still super tricky. This really highlights how deficient memref is as the general runtime type for our use case. It's stewing in my mind how best to improve the situation. My general gut feeling is that IREE's abstractions for this are "right", but I need to think more about how to distill those aspects of IREE's design in a "reference" way for RefBackend. Some implementation notes: - In terms of how this is implemented, this did catch a bug in our ABI wrapper functions in LowerToLLVM.cpp, which I had to fix (it happened to work before through some combination of npcomprt::Tensor being passed as a single pointer + probably me infinite-monkey-ing it until it worked) - This actually removes 2 out of the 3 compiler runtime functions (the only one left is "abort_if"). (most of the memref descriptor code moved from CompilerRuntime.cpp to Runtime.cpp) - this also means deleting `refbackrt.from_memref` and `refbackrt.to_memref`	2020-11-25 13:09:58 -08:00
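The deallocation bookkeeping described in commit 46aa6d0a24 (dedupe backing buffers by `allocatedPtr`, and never free the `0xDEADBEEF` sentinel that marks statically allocated data) can be sketched in Python, modeling pointers as plain integers. This is a hedged illustration, not the C++ in `refbackrt::invoke`; the function name `pointers_to_free` is hypothetical, and it assumes the caller passes the `allocatedPtr` values of every buffer it owns (its internal argument copies plus the results).

```python
from typing import Iterable, List

# Value that std.global_memref's LLVM lowering happens to store as allocatedPtr.
STATIC_ALLOCATION_SENTINEL = 0xDEADBEEF


def pointers_to_free(allocated_ptrs: Iterable[int]) -> List[int]:
    """Return each distinct backing-buffer pointer that is safe to free exactly once.

    Hypothetical helper: pointers are modeled as ints, in first-seen order.
    """
    seen = set()
    to_free = []
    for ptr in allocated_ptrs:
        if ptr == STATIC_ALLOCATION_SENTINEL:
            continue  # statically allocated global data: never pass to free()
        if ptr in seen:
            continue  # aliased buffer (e.g. an argument feeding a result): already scheduled
        seen.add(ptr)
        to_free.append(ptr)
    return to_free


# Argument copy 0x1000 is also returned as a result; one result is a global constant.
print([hex(p) for p in pointers_to_free([0x1000, 0x2000, 0x1000, 0xDEADBEEF])])
# ['0x1000', '0x2000']
```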
Aaron J Arthurs	94ea6f7c92	[RefBackend] Support element-wise multiply op Register the following for the multiply op: - tcf.mul - tcp.mul - TCF->TCP lowering - Shape transfer, broadcasted multiplicands - Lower to standard `MulFOp` op	2020-10-27 19:41:23 -07:00
Sean Silva	8022dfaf1a	[RefE2E] Initialize the linalg matmul accumulator buffer. I was seeing some miscompiles due to the uninitialized data read here before. Interestingly, this was masked in some of our previous test cases, since the uninitialized data "always" was so small that it would present as a rounding error for the 1.0-10.0 sized values that the matmul was computing on.	2020-10-02 16:24:52 -07:00
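The miscompile fixed in commit 8022dfaf1a follows from the semantics of `linalg.matmul`, which accumulates into its output buffer (C += A * B) instead of overwriting it. The pure-Python sketch below illustrates why the accumulator must be zero-filled first. The helper names are hypothetical and this is not the lowering code itself.

```python
from typing import List

Matrix = List[List[float]]


def matmul_accumulate(a: Matrix, b: Matrix, c: Matrix) -> Matrix:
    """In place: c[i][j] += sum_k a[i][k] * b[k][j] (accumulating, like linalg.matmul)."""
    m, k_dim, n = len(a), len(b), len(b[0])
    for i in range(m):
        for j in range(n):
            for k in range(k_dim):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Zero-initialize the accumulator; otherwise whatever was in the buffer
    # is silently added to every output element.
    c = [[0.0] * len(b[0]) for _ in range(len(a))]
    return matmul_accumulate(a, b, c)


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(a, b))                                       # [[19.0, 22.0], [43.0, 50.0]]
print(matmul_accumulate(a, b, [[1.0, 1.0], [1.0, 1.0]]))  # stale data: [[20.0, 23.0], [44.0, 51.0]]
```

If the stale values are tiny compared with the products, the error looks like a rounding error, which matches the masking described in the commit message.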
Sean Silva	f9b37c55b7	[RefE2E] Add support for unary ops exp and tanh This is fairly mechanical.	2020-09-24 18:41:30 -07:00
Sean Silva	c69e9fabc5	[RefE2E] Add support for "max". This cleans up the lowering pipeline to easily allow extending to multiple binary ops. It looks fairly repetitive at multiple levels, but I don't want to prematurely generalize. I think that in principle we could derive a large swath of TCF + TCP from a single linalg-style specification. Another direction is to use an OpInterface (something like "buildLinalgGenericBody"). I'm keeping my eye on it. In a subsequent commit, I'll mechanically add a set of binary ops modeled off of the std arithmetic ops.	2020-09-22 18:38:32 -07:00
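The "single linalg-style specification" idea floated in commit c69e9fabc5 can be illustrated with a table that maps each binary op name to its scalar body, which one generic elementwise routine then consumes. This is only a Python analogy for the design direction. The commit deliberately does not generalize yet, and `BINARY_OP_BODIES` and `elementwise` are hypothetical names.

```python
import operator
from typing import Callable, Dict, List

# Hypothetical single specification: op name -> scalar body
# (roughly what a "buildLinalgGenericBody"-style hook would emit per op).
BINARY_OP_BODIES: Dict[str, Callable[[float, float], float]] = {
    "add": operator.add,
    "max": max,
    "mul": operator.mul,
}


def elementwise(op_name: str, lhs: List[float], rhs: List[float]) -> List[float]:
    """Apply the registered scalar body element by element (same-shape operands only)."""
    body = BINARY_OP_BODIES[op_name]
    if len(lhs) != len(rhs):
        raise ValueError("operands must have the same length (no broadcasting in this sketch)")
    return [body(x, y) for x, y in zip(lhs, rhs)]


print(elementwise("max", [1.0, 5.0], [3.0, 2.0]))  # [3.0, 5.0]
```

Adding another binary op then means adding one table entry rather than touching every layer, which is the repetition the commit message wants to avoid eventually.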
Sean Silva	7b7f35744b	[RefE2E] Add interesting control flow example. This also required adding a lowering for ForOp in our tensor->memref conversion.	2020-09-21 12:25:24 -07:00
Sean Silva	276f5b80ea	[RefE2E] Add assemblyFormat for TCF and TCP ops and tidy up.	2020-09-18 15:03:53 -07:00
Sean Silva	d8675f8ad2	[RefE2E] Add support for matmul. I'm pretty happy with how this turned out. It looks pretty much like it should -- one change at each layer. This particular op bottoms out on linalg which takes care of the rest. - Add tcf.matmul - Add tcp.matmul - Add TCF->TCP lowering - Add tcp.matmul shape transfer function (BypassShapes.cpp) - Add tcp.matmul -> linalg.matmul lowering (LowerShapedResultsToMemref.cpp) - Add support to LowerShapeConstraints for lowering the new shape.cstr_require This matmul op is pretty limited in its capabilities. There is no batching and no multidimensional contraction. Certainly more design work will be needed to find the right abstractions that aren't too general but also help to canonicalize many cases from frontends. This is mainly to show that adding a new op needn't be very "scary" once we have the e2e infra in place. Also, - this clears out some exploratory cruft from the TCF dialect now that this is starting to become real.	2020-09-18 11:31:01 -07:00
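The shape transfer function for `tcp.matmul` in commit d8675f8ad2, together with the constraint that the contraction dimensions agree, can be written out in Python for the rank-2, no-batching case the commit supports. This is a hedged sketch of the rule, not `BypassShapes.cpp`. `matmul_result_shape` and `ShapeError` are hypothetical, and raising an exception stands in for the abort that a lowered `shape.cstr_require` produces at runtime.

```python
from typing import Tuple


class ShapeError(Exception):
    """Stands in for the runtime abort produced by a failed shape constraint."""


def matmul_result_shape(lhs: Tuple[int, ...], rhs: Tuple[int, ...]) -> Tuple[int, int]:
    """[M, K] x [K, N] -> [M, N]; rank-2 only (no batching, no multi-dim contraction)."""
    if len(lhs) != 2 or len(rhs) != 2:
        raise ShapeError("only rank-2 operands are supported")
    (m, k_lhs), (k_rhs, n) = lhs, rhs
    # Analogue of a shape.cstr_require on the contraction dimension.
    if k_lhs != k_rhs:
        raise ShapeError(f"contraction dimensions differ: {k_lhs} vs {k_rhs}")
    return (m, n)


print(matmul_result_shape((2, 3), (3, 4)))  # (2, 4)
```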
Sean Silva	75f57b461e	Totally rework RefE2E tensor to memref flow. (#42) This now gets the overall "RefE2E" compilation stack to a point that I'm fairly happy with. We simplify it by mostly embracing the "descriptor" view of the world. The overall flow is best understood by reading through the createE2ELoweringPipeline function in lib/E2E/E2E.cpp. That function creates a pass pipeline that lowers from "TCF" (which is ~numpy level of abstraction) down to LLVM IR. A brief high-level summary of what happens there: 1. TCF to TCP conversion. This involves reifying error handling in the form of shape constraints. See test/Conversion/TCFToTCP/basic.mlir 2. Lowering shape constraints. This converts shape constraints into eager error-handling code. See test/E2E/lower-shape-constraints.mlir This pass will soon go upstream. Because this lowers to std.assert, some later passes like LowerToNpcomprtABI and LowerToLLVM are updated to properly plumb this through e2e. See test/npcomp-run-mlir/invalid-broadcast.mlir for an execution test that properly aborts in case of an error. 3. Lowering tensors to memrefs. This is done via a series of passes rather than a single mega conversion. Unlike the previous code that mixed in the npcomprt ABI stuff here, it's now a very clean "pure memref" conversion. See test/E2E/lower-*-to-memref.mlir and lib/E2E/TensorToMemref/. Most of the changes are concentrated here. 4. As part of the above, we use the upstream ConvertShapeToStandard for lowering shapes. 5. We lower linalg to loops and lower loops to CFG using upstream passes. 6. Rewrite the "ABI" boundaries of the program to npcomprt data structures (LowerToNpcomprtABI). This mainly affects ABI boundaries and how global tensor constants are represented. One of the major improvements in this commit is that now it's a very clean rewrite that just replaces memrefs on ABI boundaries with !npcomprt.tensor (before there was a get_extent function that is not needed). 
See test/E2E/lower-to-npcomprt-abi.mlir 7. Lower to LLVM with upstream mlir patterns + some patterns for the npcomprt lowerings. One aspect here that is still a remnant of a non-descriptor-based tensor to memref flow is the BypassShapes + LowerShapedResultsToMemref. BypassShapes wraps the "tensor compute" ops in a tcp.shaped_results (basically a "tie_shape" kind of op), and then LowerShapedResultsToMemref uses those annotations to allocate output buffers while lowering the "tensor compute ops". Note that there are very few "tensor compute" ops currently supported (tcp.add + tcp.broadcast_to), so we just hardcode them in both passes. Realistically, I expect this to go away as we fully embrace the descriptor-based approach for simplicity, so don't look too deep into it.	2020-09-16 17:31:40 -07:00
Stella Laurenzo	fc484d1bd8	Rework reference shape lowering based on upstream shape dialect changes. * Primarily, the upstream shape dialect now uses tensor<?xindex> for non-erroring, immediate shape calculations (and will return this for shape_of of a tensor or memref). * In addition, upstream passes do not yet exist for fully lowering to standard ops, so the passes here need to be extended to handle this new convention. * This should be seen as an intermediate state, necessary to integrate a new LLVM version and needs more work and cleanup for generality. * There is a good deal of awkwardness in these conversions. The hope is that additional upstream work will yield better defined conversion paths once out of this intermediate state.	2020-08-03 13:43:49 -07:00
Sean Silva	3f3dcad871	Make input file to npcomp-run-mlir be positional. This makes command lines more succinct.	2020-07-13 16:02:19 -07:00
Sean Silva	e228aa4b11	npcomprt: add support for constants - create tcp.global + tcp.get_global_memref - create npcomprt.global + npcomprt.get_global - LLVM lowering for new npcomprt ops - Runtime: - GlobalDescriptor struct emitted by LLVM lowering - implement __npcomp_compiler_rt_get_global Also, - cleanly isolate all runtime data structure definitions shared by the compiler and runtime into lib/runtime/CompilerDataStructures.h	2020-07-10 17:31:24 -07:00
Sean Silva	f18014f60c	LowerRankedShapes: support shape.const_shape op. Also, the previous code had a special case for deleting this op when it had no uses. This is subsumed by the change in this commit since now shape.const_shape is properly lowered. With this change, the included test case with multiple serially dependent ops works! This specific issue was related to the scalar argument to that function. We needed to compute a broadcast of a scalar shape (which is a shape.const_shape) with another shape.	2020-07-08 20:12:40 -07:00
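The case fixed in commit f18014f60c, broadcasting a scalar's shape (the empty `shape.const_shape`) against another shape, follows numpy-style rules. Trailing dimensions are aligned, missing leading dimensions act as size 1, and mismatched sizes other than 1 are an error. The Python sketch below illustrates only the rule. It is not the `LowerRankedShapes` implementation, and `broadcast_shapes` is a hypothetical name.

```python
from itertools import zip_longest
from typing import Sequence, Tuple


def broadcast_shapes(lhs: Sequence[int], rhs: Sequence[int]) -> Tuple[int, ...]:
    """Numpy-style broadcast of two shapes; a scalar has the empty shape ()."""
    result = []
    # Walk from the trailing dimension; a missing leading dimension behaves like 1.
    for a, b in zip_longest(reversed(lhs), reversed(rhs), fillvalue=1):
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            raise ValueError(f"cannot broadcast {tuple(lhs)} with {tuple(rhs)}")
    return tuple(reversed(result))


print(broadcast_shapes([], [2, 3]))  # (2, 3): the scalar shape broadcasts to anything
print(broadcast_shapes([3, 1], [4]))  # (3, 4)
```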
Sean Silva	b4f0cea8fa	Rework e2e flow to use new "npcomprt" This ~totally reworks the existing "runtime" stuff to be more principled and usable, such as from Python. It's still not fully production-quality, mainly in the department of memory management (e.g. it currently leaks memory; we need to figure out "who frees memrefs" + the analysis and transformation needed to do that (maybe use upstream buffer allocation pass?)). The user API is in include/npcomp/runtime/UserAPI.h, though include/npcomp/JITRuntime/JITModule.h is a friendlier wrapper. The stuff under {include,lib}/runtime is totally firewalled from the compiler and tiny (<6kB, though no attention has gone into optimizing that size). For example, we don't link in libSupport into the runtime, instead having our own bare bones replacements for basics like ArrayRef (the JITRuntime helps with bridging that gap, since it can depend on all common LLVM utilities). The overall design of npcomprt is that it exposes a module with multiple function entry points. Each function has arguments and results that are tensor-valued, and npcomprt::Tensor is the runtime type that is used to interact with that (and a npcomprt::Ref<T> reference-counting wrapper is provided to wrap npcomprt::Tensor in the common case). From an implementation perspective, an npcomprt module at the LLVM/object/binary level exposes a single module descriptor struct that has pointers to other metadata (currently just a list of function metadata descriptors). All interactions with the npcomp runtime are keyed off of that module descriptor, including function lookups and dispatching. This is done to dodge platform ABI issues and also allow enough reflection to e.g. verify provided arguments. Most of the compiler-side work here was in LowerToNpcomprtABI and LowerToLLVM. Also, - Rename npcomp_rt/NpcompRt to npcomprt/Npcomprt; it was getting annoying to type the underscores/caps. - misc improvements to bash_helpers.sh	2020-07-08 19:36:19 -07:00
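The module-descriptor scheme in commit b4f0cea8fa uses a single root struct that lists function metadata descriptors, and every lookup and dispatch goes through that struct. The Python sketch below models the idea, including the reflection that lets the runtime verify argument counts before calling in. It is an analogy under stated assumptions. The real descriptors are C structs emitted by the LLVM lowering, and every name here (`FuncDescriptor`, `ModuleDescriptor`, `invoke`, the `num_inputs`/`num_outputs` fields) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class FuncDescriptor:
    """Hypothetical per-function metadata: name, arity, and the entry point."""
    name: str
    num_inputs: int
    num_outputs: int
    entry: Callable[..., List[Any]]


@dataclass
class ModuleDescriptor:
    """Hypothetical root descriptor: currently just a list of function descriptors."""
    functions: List[FuncDescriptor] = field(default_factory=list)

    def lookup(self, name: str) -> FuncDescriptor:
        for func in self.functions:
            if func.name == name:
                return func
        raise KeyError(f"no function named {name!r}")


def invoke(module: ModuleDescriptor, name: str, inputs: List[Any]) -> List[Any]:
    """Dispatch through the module descriptor, verifying arity before and after the call."""
    desc = module.lookup(name)
    if len(inputs) != desc.num_inputs:
        raise ValueError(f"{name} expects {desc.num_inputs} inputs, got {len(inputs)}")
    outputs = desc.entry(*inputs)
    if len(outputs) != desc.num_outputs:
        raise ValueError(f"{name} produced {len(outputs)} outputs, expected {desc.num_outputs}")
    return outputs


module = ModuleDescriptor([FuncDescriptor("add1", 1, 1, lambda x: [x + 1])])
print(invoke(module, "add1", [41]))  # [42]
```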
Stella Laurenzo	308a54c3d0	Bump llvm-project to 52cae05e087b3d4fd02849fc37c387c720055ffb (2020/6/10). * Fixes compile errors from upstream. * XFAIL several tests that are now failing to legalize (will hand off to Sean).	2020-06-11 16:10:05 -07:00
Sean Silva	92e45703ad	Remove XFAIL. This test seems to be passing, after a clean rebuild of everything (including MLIR).	2020-06-03 20:52:16 -07:00
Sean Silva	cd7258dbd4	Enable warnings by default. The secret here is LLVM_ENABLE_WARNINGS=ON. I also fixed a couple warnings, which gets us to be warning-clean. I noticed also that npcomp-run-mlir/basic.mlir seems to be failing. Maybe something since the latest integrate. My next commit (introduce npcomp mini runtime) will largely rewrite it though, so it'll get fixed then.	2020-06-03 20:39:34 -07:00
Sean Silva	ea822968fa	Add bare-bones npcomp-run-mlir. The code isn't super clean, but is a useful incremental step establishing most of the boilerplate for future enhancements. We can't print or return tensors yet so correctness TBD, but I've stepped into the running code in the debugger so I know it definitely is running. This is the first step to building out an npcomp mini-runtime. The mini-runtime doesn't have to be fancy or complex, but it should at least be layered nicely (which this code and the current compiler interaction with the "runtime" code is not). Now that we have boilerplate for e2e execution in some form, we can build that out.	2020-05-28 18:37:11 -07:00

25 Commits (89d4931324589bf75ed088e98888ff21fe7cd41e)