129

11

defrost

BlackFly

I don't think this algebra is a full description of flame graphs but it feels like it captures part of what you want for the limited purpose of global performance analysis.

I feel like a flame graph cannot be an element of a vector space since vectors are commutative and the stacks of a flame graph are ordered. From a performance perspective, ignoring the ordering can simplify the algebra (apparently by turning it into a vector space) and will still correspond with your overall impression of "time spent in a frame".

Specifically,

```
|A|B |A|
|C|C |C|
```

is (1, C;A), (2, C;B), (1, A;C) but as (1, C;A) + (2, C;B) + (1, A;C) with vector addition we can rearrange this to (2, C;A) + (2, C;B)or

```
|A |B |
|C |C |
```

which of course still captures the time spent in C;A and C;B, but it has lost some information. We think of it as a different flame graph. Of course, ignoring that information can be meaningful to a performance analysis.Indeed the ordering can only be important to application semantics. This is clearly accentuated by the norm, which is even destroying the information of the basis elements.

nirinor

Very nice formalization.

One area for refinement: it considers two stacks either identical or unrelated. Consider that stack A;B is actually very close to A;B;C, the difference might be due to a sample time occurring just before or just after the call to C. OP considers them just as different as A;B and Z;W, therefore amplifying a measurement noise.

This suggests using a refined metric between stacks (e.g., an edit distance counting pushes and pops), and then we can use it in defining the metric between flamegraphs (e.g., an optimal transport metric [1], instead of the proposed L1).

Avoiding that noise amplification reduces the background noise level, therefore the cost of effective measurements. From another perspective, the current OP scheme creates an avoidable curse of dimensionality in the form of the Hotelling test's requirement that each sample has more measurements have more samples than distinct stack frames. So the same code split into more functions is harder to measure, and too-small samples are useless. I think neither of those is necessary if we take stack similarity into account.

m1el

As a developer who doesn't use mathematical notation very often, I've had some difficulty reading it.
From my understanding, the paper goes as follows:

2.1 Define FlameGraph

```
Frame = something that identifies function/location in the code
Stack = Vec<Frame>
FlameGraph = Map<Stack, Real>
FlameGraphPositive = Map<Stack, RealPositive>
```

2.2 Define FlameChart```
FlameChart = Vec<{ start: Real, flame_graph: FlameGraphPositive }>
# such that the start values are strictly ordered
fn to_flamegraph(flame_chart: FlameChart) -> FlameGraphPositive {
flame_chart
.map(|{ start, flame_graph }| flame_graph)
.sum()
}
```

Personally, I think a better definition of FlameChart is:```
FlameChart = Vec<{ end_time: RealPositive, stack: Stack }>
# such that sampling starts at 0 and Vec is strictly ordered by end_time
fn to_flamegraph(flame_chart: FlameChart) -> FlameGraphPositive {
let mut start = 0.0;
flame_chart.map(|{ end_time, stack }| {
let elapsed = end_time - start;
start = end_time;
FlameGraphPositive::from_stack(stack, elapsed)
}).sum()
}
```

3.1 Define diff```
# a,b: FlameGraphPositive
fn diff(a, b) -> FlameGraph;
# such that a + diff = b
# diff: FlameGraph
# add, sub: FlameGraphPositive
fn to_positive(diff) -> { add, sub };
# such that diff = add - sub
```

3.2 Define a naive diff metric```
# a,b: FlameGraphPositive
fn diff_metric(a, b) -> RealPositive {
diff(a, b).size() / (a.size() + b.size())
}
# diff_metric(a, b) \in [0, 1]^Real
```

3.3 Define a more sophisticated diff metricUsing Hotelling T^2 Test, the author defines a metric, which takes sampling variance into account, and allows to detect performance regression, even when the sampling profile is noisy. (you'd have to read the paper for the exact method)

The code for this test is here: https://github.com/P403n1x87/flamegraph-experiment/blob/04db... https://github.com/P403n1x87/flamegraph-experiment/blob/04db...

Vervious

I'm confused, what are the takeaways, and why are there 83 upvotes with no discussion?

KyleBerezin

A link to a paper on the usefulness of a certain type of graph... Never change HN

sr.ht

For anyone else who has not recognised the name, I found these origin links which may interest some:

[1] https://www.brendangregg.com/flamegraphs.html[2] https://www.usenix.org/conference/atc17/program/presentation...

(hour long usenix talk of [2]) https://www.youtube.com/watch?v=D53T1Ejig1Q

This paper appears to be formalising the nature of a flame graph, recognising that they rest on data from

sampling profilersand as such suffer a "never the same twice" variation when repeatedly run, and thus seeking to quantify "distance" of one graph from another as well as "average" of multiple graphs (for the "one true mean reference graph" I guess).The motivation is then to be able to

reliablymeasure "distance" from one group of profiling runs to another, before and after changes of interest.I dare say much more is possible, but I fear I've barely skimmed the material before bedtime :/