[RefBackend] Add elementwise fusion and buffer deallocation
This gives some decent improvements to memory consumption and latency of testing. I would have expected buffer-deallocation to make a big difference to the final process RSS, but it doesn't appear to. Also, running buffer-deallocation later in the pipeline results in miscompiles; I didn't have the time or interest to dig in deeper, but something is off. (Numbers below are taken from a single run, but I did a few runs to make sure the variance wasn't that great.)

- Linalg-on-Tensors shows memory consumption improvements and some slight speedups.

```
./tools/e2e_test.sh -s -v -c refbackend

fuse=0 dealloc=0
RSS: 3071.33 MB
real    3m58.204s
user    6m22.299s
sys     0m51.235s

fuse=1 dealloc=0
RSS: 2515.89 MB
real    3m34.797s
user    5m56.902s
sys     0m44.933s

fuse=1 dealloc=post-bufferize
RSS: 2290.25 MB
real    3m42.242s
user    6m0.560s
sys     0m46.335s
```

- TOSA ResNet18 gets significantly faster and uses significantly less memory.

```
time ./tools/e2e_test.sh -s -v -c tosa -f ResNet18

fuse=0 dealloc=0
RSS: 1328.56 MB
real    0m50.303s
user    0m55.355s
sys     0m12.260s

fuse=1 dealloc=0
RSS: 859 MB
real    0m30.454s
user    0m35.551s
sys     0m11.879s

fuse=1 dealloc=post-bufferize
RSS: 851 MB
real    0m30.313s
user    0m39.889s
sys     0m11.941s
```

Big thanks to Ramiro for the methodology here for measuring the RSS with `psutil`: https://gist.github.com/ramiro050/5b5c2501f7389c008d9029210772c3a8
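The linked gist is the source of truth for the measurement. As a rough illustration of that kind of methodology, peak RSS of a child process tree can be sampled with `psutil` along these lines; the 100 ms polling interval and the summation over child processes are choices of this sketch, not necessarily what the gist does:

```python
# Rough sketch of peak-RSS measurement with psutil; see the linked gist
# for the actual methodology behind the numbers above. The polling
# interval and summing over children are assumptions of this sketch.
import subprocess
import time

import psutil


def peak_rss_mb(cmd):
    """Run `cmd` and return the peak RSS (in MB) observed across the
    process and its children, sampled every 100 ms."""
    proc = subprocess.Popen(cmd)
    ps = psutil.Process(proc.pid)
    peak = 0
    while proc.poll() is None:
        try:
            rss = ps.memory_info().rss
            for child in ps.children(recursive=True):
                rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            # The process (or a child) exited between samples.
            break
        peak = max(peak, rss)
        time.sleep(0.1)
    return peak / 1024 ** 2


if __name__ == "__main__":
    mb = peak_rss_mb(["./tools/e2e_test.sh", "-s", "-v", "-c", "refbackend"])
    print(f"RSS: {mb:.2f} MB")
```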
parent 29c8823464
commit 69171c246a
```diff
@@ -116,6 +116,14 @@ class RefBackendInvoker:
 
 LOWERING_PIPELINE = "builtin.module(" + ",".join([
     "func.func(refback-generalize-tensor-pad)",
+    # Apply some optimizations. It would be great if MLIR had more useful
+    # optimizations that worked out of the box here.
+    # Note: When measured, this doesn't seem to actually help that much
+    # for the linalg-on-tensors backend.
+    # This is likely because if things are naturally fusable we usually already
+    # emit things in that form from the high level (e.g. single linalg-generic).
+    # Other backends are likely to benefit more.
+    "func.func(linalg-fuse-elementwise-ops)",
     # Bufferize.
     "func.func(scf-bufferize)",
     "func.func(tm-tensor-bufferize)",
@@ -126,6 +134,7 @@ LOWERING_PIPELINE = "builtin.module(" + ",".join([
     "refback-mlprogram-bufferize",
     "func.func(tensor-bufferize)",
     "func.func(finalizing-bufferize)",
+    "func.func(buffer-deallocation)",
     # Munge to make it ExecutionEngine compatible.
     # Specifically, we rewrite calling convention boundaries to be in terms
     # of unranked memref, and we rewrite the return to actually be a
```
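For context on how a string like `LOWERING_PIPELINE` is consumed: it is a textual MLIR pass pipeline that gets parsed and run through the Python bindings' `PassManager`. Below is a minimal self-contained sketch of the same mechanism, assuming the upstream `mlir` Python package and substituting two stock passes (`canonicalize`, `cse`) for the refbackend-specific passes shown in the diff; torch-mlir ships its own copy of these bindings, so the exact import paths differ there:

```python
# Minimal sketch of parsing and running a textual pass pipeline with the
# MLIR Python bindings. Uses the upstream `mlir` package and stock
# passes; torch-mlir's refbackend applies LOWERING_PIPELINE the same way.
from mlir.ir import Context, Module
from mlir.passmanager import PassManager

ASM = """
func.func @f(%arg0: i32) -> i32 {
  %0 = arith.addi %arg0, %arg0 : i32
  return %0 : i32
}
"""

# Built the same way as LOWERING_PIPELINE: a `builtin.module(...)` anchor
# around a comma-joined list of (possibly nested) pass names.
PIPELINE = "builtin.module(" + ",".join([
    "func.func(canonicalize)",
    "func.func(cse)",
]) + ")"

with Context():
    module = Module.parse(ASM)
    pm = PassManager.parse(PIPELINE)
    # Newer bindings run on the operation; older ones accept the module.
    pm.run(module.operation)
    print(module)
```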