[RefBackend] Add elementwise fusion and buffer deallocation
This gives some decent improvements to memory consumption and latency of testing. I would have expected buffer-deallocation to make a big difference to the final process RSS, but it doesn't appear to. Also, running buffer-deallocation later in the pipeline results in miscompiles; I didn't have the time or interest to dig in deeper, but something is off. (Numbers below are taken from a single run, but I did a few runs to make sure the variance wasn't that great.)

- Linalg-on-Tensors shows memory consumption improvements and some slight speedups.

```
./tools/e2e_test.sh -s -v -c refbackend

fuse=0 dealloc=0
RSS: 3071.33 MB
real    3m58.204s
user    6m22.299s
sys     0m51.235s

fuse=1 dealloc=0
RSS: 2515.89 MB
real    3m34.797s
user    5m56.902s
sys     0m44.933s

fuse=1 dealloc=post-bufferize
RSS: 2290.25 MB
real    3m42.242s
user    6m0.560s
sys     0m46.335s
```

- TOSA ResNet18 gets significantly faster and uses significantly less memory.

```
time ./tools/e2e_test.sh -s -v -c tosa -f ResNet18

fuse=0 dealloc=0
RSS: 1328.56 MB
real    0m50.303s
user    0m55.355s
sys     0m12.260s

fuse=1 dealloc=0
RSS: 859 MB
real    0m30.454s
user    0m35.551s
sys     0m11.879s

fuse=1 dealloc=post-bufferize
RSS: 851 MB
real    0m30.313s
user    0m39.889s
sys     0m11.941s
```

Big thanks to Ramiro for the methodology here for measuring the RSS with `psutil`: https://gist.github.com/ramiro050/5b5c2501f7389c008d9029210772c3a8
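The linked gist is the source of truth for the measurement. As a rough illustration of that kind of methodology, peak RSS of a child process tree can be sampled with `psutil` along these lines; the 100 ms polling interval and the summation over child processes are choices of this sketch, not necessarily what the gist does:

```python
# Rough sketch of peak-RSS measurement with psutil; see the linked gist
# for the actual methodology behind the numbers above. The polling
# interval and summing over children are assumptions of this sketch.
import subprocess
import time

import psutil


def peak_rss_mb(cmd):
    """Run `cmd` and return the peak RSS (in MB) observed across the
    process and its children, sampled every 100 ms."""
    proc = subprocess.Popen(cmd)
    ps = psutil.Process(proc.pid)
    peak = 0
    while proc.poll() is None:
        try:
            rss = ps.memory_info().rss
            for child in ps.children(recursive=True):
                rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            # The process (or a child) exited between samples.
            break
        peak = max(peak, rss)
        time.sleep(0.1)
    return peak / 1024 ** 2


if __name__ == "__main__":
    mb = peak_rss_mb(["./tools/e2e_test.sh", "-s", "-v", "-c", "refbackend"])
    print(f"RSS: {mb:.2f} MB")
```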
parent 29c8823464
commit 69171c246a
```diff
@@ -116,6 +116,14 @@ class RefBackendInvoker:
 
 LOWERING_PIPELINE = "builtin.module(" + ",".join([
     "func.func(refback-generalize-tensor-pad)",
+    # Apply some optimizations. It would be great if MLIR had more useful
+    # optimizations that worked out of the box here.
+    # Note: When measured, this doesn't seem to actually help that much
+    # for the linalg-on-tensors backend.
+    # This is likely because if things are naturally fusable we usually already
+    # emit things in that form from the high level (e.g. single linalg-generic).
+    # Other backends are likely to benefit more.
+    "func.func(linalg-fuse-elementwise-ops)",
     # Bufferize.
     "func.func(scf-bufferize)",
     "func.func(tm-tensor-bufferize)",
@@ -126,6 +134,7 @@ LOWERING_PIPELINE = "builtin.module(" + ",".join([
     "refback-mlprogram-bufferize",
     "func.func(tensor-bufferize)",
     "func.func(finalizing-bufferize)",
+    "func.func(buffer-deallocation)",
     # Munge to make it ExecutionEngine compatible.
     # Specifically, we rewrite calling convention boundaries to be in terms
     # of unranked memref, and we rewrite the return to actually be a
```
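For context on how a string like `LOWERING_PIPELINE` is consumed: it is a textual MLIR pass pipeline that gets parsed and run through the Python bindings' `PassManager`. Below is a minimal self-contained sketch of the same mechanism, assuming the upstream `mlir` Python package and substituting two stock passes (`canonicalize`, `cse`) for the refbackend-specific passes shown in the diff; torch-mlir ships its own copy of these bindings, so the exact import paths differ there:

```python
# Minimal sketch of parsing and running a textual pass pipeline with the
# MLIR Python bindings. Uses the upstream `mlir` package and stock
# passes; torch-mlir's refbackend applies LOWERING_PIPELINE the same way.
from mlir.ir import Context, Module
from mlir.passmanager import PassManager

ASM = """
func.func @f(%arg0: i32) -> i32 {
  %0 = arith.addi %arg0, %arg0 : i32
  return %0 : i32
}
"""

# Built the same way as LOWERING_PIPELINE: a `builtin.module(...)` anchor
# around a comma-joined list of (possibly nested) pass names.
PIPELINE = "builtin.module(" + ",".join([
    "func.func(canonicalize)",
    "func.func(cse)",
]) + ")"

with Context():
    module = Module.parse(ASM)
    pm = PassManager.parse(PIPELINE)
    # Newer bindings run on the operation; older ones accept the module.
    pm.run(module.operation)
    print(module)
```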