Description
Could you provide some additional information about the memory consumption of your Graph Transformer?
You state that sparse attention benefits both computation time and memory consumption, but the evaluation does not provide actual measurements of the latter, nor does it state clearly whether and how your implementation is able to take advantage of it.
Some peak memory measurements of your experiments, as an addendum to your evaluation of the computation times (e.g. Table 1), could be beneficial to others, too. In my case, the quadratic growth of memory consumption w.r.t. the sequence length prevents an efficient use of Transformers for some tasks where connectivity information is given and could otherwise simply be modeled by masking out (-Inf) the attention scores in the attention matrix.
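To make the comparison point concrete, here is a minimal sketch of what I mean by the masking approach, written in plain PyTorch (function and argument names are mine, not taken from your code). The full (n, n) score matrix is materialized before masking, which is where the quadratic memory growth comes from even when the graph itself is sparse:

```python
import math
import torch

def dense_masked_attention(q, k, v, adj):
    """Dense attention over n nodes; non-edges are masked with -inf.

    q, k, v: (n, d) tensors; adj: (n, n) boolean adjacency (with self-loops).
    The full (n, n) score matrix is materialized, so memory grows
    quadratically in n regardless of how sparse the graph is.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (n, n)
    scores = scores.masked_fill(~adj, float("-inf"))          # -inf at non-edges
    return torch.softmax(scores, dim=-1) @ v                  # (n, d)
```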
Results on some exemplary or artificial data would also be interesting, e.g. (mean) number of nodes n = {128, 1024, 2048, 4096} and (mean) number of edges per node e = {4, 8, 16, 64, 128}, to get an impression of the resource consumption of your Graph Transformer on a sparse graph vs. an NLP-style Transformer (full graph with masking).
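A rough sketch of the artificial benchmark I have in mind is below, reusing `dense_masked_attention` from the sketch above and assuming a CUDA device is available; the sparse side of the comparison would of course be your Graph Transformer layer rather than anything I can write here:

```python
import torch

def peak_memory_mb(fn, *args):
    """Run fn once on the GPU and report the peak allocated memory in MB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

def random_graph(n, e, d=64, device="cuda"):
    """Random node features and a random adjacency with ~e edges per node."""
    x = torch.randn(n, d, device=device)
    src = torch.arange(n, device=device).repeat_interleave(e)
    dst = torch.randint(0, n, (n * e,), device=device)
    adj = torch.zeros(n, n, dtype=torch.bool, device=device)
    adj[src, dst] = True
    adj.fill_diagonal_(True)  # self-loops
    return x, adj

for n in (128, 1024, 2048, 4096):
    for e in (4, 8, 16, 64, 128):
        x, adj = random_graph(n, e)
        mb = peak_memory_mb(dense_masked_attention, x, x, x, adj)
        print(f"n={n:5d} e={e:3d} dense masked attention: {mb:8.1f} MB")
        # The sparse variant (your Graph Transformer layer operating only on
        # the given edges) would be measured with the same helper for a
        # direct comparison.
```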
(I would hope to run the experiments myself, but I suppose your evaluation pipeline is already set up, and data provided by the original authors should be more precise and trustworthy to other researchers, too.)