P4 lacks a formal concurrency model. I can see at least two scenarios that demand such


[design] Concurrency model for P4,about p4lang/p4-spec

Comments (23)

gbrebner commented on September 27, 2024

Pasting in what the draft spec currently says: it's summarizing some prior email discussion that wasn't captured as a github issue.

18.3.1 Concurrency model
[TODO: is this concurrency model suitable?]
In practice a network device may be processing multiple packets simultaneously:
• Packets may be received concurrently on different network interfaces
• Packet processing may be pipelined, with a new packet starting before the completion of the previous one
As long as the packet processing involves stateless elements and read-only state elements there should be no difference in the results obtained from concurrent or purely sequential execution.
Since tables are read-only from the data-plane point of view, we can provide a very simple semantics for P4 programs written solely in the P4 core language: they should behave identically irrespective of the concurrent execution.
However, as soon as one is using any stateful extern constructs, the question arises with respect to the semantics of the program under concurrent execution. For example, given a set of counters that can be accessed by multiple actions, what is the interleaving of the execution of the counter methods when processing multiple packets? What is the interleaving of method invocations if the counter is accessed from different blocks (e.g., ingress and egress pipelines)?
The answer to this question is left partially to the discretion of the target architecture. An architecture could:
• Prescribe specific order
• Forbid resources that are shared between multiple blocks (e.g., each counter must be allocated in one pipeline exclusively, and it must be used only from actions that can appear within one single table)
• Prescribe an implementation-specific order
We suggest the following minimum constraints on any P4 implementation:
• The invocation of a table is atomic
• The execution of a parser is atomic
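
To make the counter question above concrete, here is a small Python model (not P4, and not part of the draft spec) that enumerates every interleaving of two packets, each incrementing a shared counter as a separate read followed by a write. The helper names `interleavings` and `run` are made up for illustration, and the two-step increment is an assumption about how a naive, non-atomic counter access might decompose.

```python
from itertools import combinations

def interleavings(n_a, n_b):
    """Yield every order in which two packets with n_a and n_b steps can interleave."""
    total = n_a + n_b
    for slots in combinations(range(total), n_a):
        chosen = set(slots)
        yield ["A" if i in chosen else "B" for i in range(total)]

def run(schedule):
    """Each packet does: step 0 reads the counter, step 1 writes back (value read + 1)."""
    counter = 0
    local = {"A": None, "B": None}  # per-packet (thread-local) copy of the value read
    pc = {"A": 0, "B": 0}           # next step to execute for each packet
    for who in schedule:
        if pc[who] == 0:
            local[who] = counter
        else:
            counter = local[who] + 1
        pc[who] += 1
    return counter

# Set of final counter values reachable over all six interleavings
outcomes = sorted({run(s) for s in interleavings(2, 2)})
print(outcomes)  # [1, 2]
```

Only the two serial schedules end with the counter at 2; every schedule in which both reads happen before both writes loses an update. That is the kind of result an architecture has to either rule out or explicitly allow.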

from p4-spec.

chkim4142 commented on September 27, 2024

Adding this to Chang's bucket.


anirudhSK commented on September 27, 2024

draft.pdf
This is a preliminary draft of a proposal for atomics in P4. The Latex source is here: https://github.com/anirudhSK/p4-concurrency

This draft contains motivation, examples, and the concurrency model. It's too verbose to go directly into the spec, but should hopefully explain what we have in mind.


chkim4142 commented on September 27, 2024

This seems to deserve a discussion. Although I've just assigned the "P4_16" milestone to this issue, it could be considered "post P4_16" as well.


gbrebner commented on September 27, 2024

Anirudh - thanks for the thoughtful document - this is an important topic.

We've been encountering issues with unthinking concurrency in various real P4 examples, basically where people have written their "natural" algorithm, but unconsciously imagining that each packet is handled completely before the next one. This is certainly not the case in our FPGA implementation, where there are actually three levels of pipelining with multiple packets in flight at different pipeline stages.

I think there is a common issue behind your two suggested proposals: that the target architecture has to supply some atomic operation combinations. For the first (use registers) proposal, these are what the smartened compiler has to map to; for the second (more complex extern types), these are the said types. Really the only difference is whether or not these combo operations are exposed to the P4 user or not. The question is how the target can supply some (small) set of widely useful operation combinations. One case that we have found recurring is a general read-update-write register operation, and this is seen in your examples too. Maybe there is some natural set that might emerge with more examples.

Another thing you identify is the extent of atomic blocks. This is related to the previous point, of course, since it's reflected in how generous the target can be in terms of atomicity. One case we have found is where a register is essentially being used as a working variable, accessed from relatively distant parts of the P4 program. A general solution for this has been to rewrite this as metadata travelling with the packet. Another issue for heavily pipelined implementations is that too-generous atomicity extents can limit the pipelining effectiveness.

At this stage, I don't have a comprehensive solution in mind, since there haven't been enough use cases yet. Being explicit with @atomic annotations might help people to think about what they're writing in their programs, which would mean that there was no longer the case of them unconsciously placing such an annotation round a whole control block, for example.
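
As a rough illustration of the two ideas above (a single read-update-write combination supplied by the target, and carrying the value onward as per-packet metadata instead of re-reading the register), here is a Python sketch. The `Register` class, its `read_update_write` method, and the `seq` metadata field are all hypothetical names, not an existing extern API; a lock stands in for whatever mechanism a real target would use.

```python
import threading

class Register:
    """Hypothetical model of a register extern exposing one atomic combo operation."""
    def __init__(self, size):
        self._cells = [0] * size
        self._lock = threading.Lock()

    def read_update_write(self, index, update):
        """Atomically apply update(old) -> new to one cell and return the old value."""
        with self._lock:
            old = self._cells[index]
            self._cells[index] = update(old)
            return old

    def read(self, index):
        with self._lock:
            return self._cells[index]

def process_packet(reg, meta):
    # One atomic access; the old value then travels with the packet as metadata
    # rather than being read again from the register later in the pipeline.
    meta["seq"] = reg.read_update_write(0, lambda v: v + 1)
    return meta

reg = Register(size=1)
results = []

def worker():
    for _ in range(1000):
        results.append(process_packet(reg, {})["seq"])

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(reg.read(0), len(set(results)))  # 4000 4000
```

Every packet observes a distinct old value and no increment is lost, because the read and the write are never separated. Whether such a combination is exposed to the P4 user or only targeted by the compiler is the open question raised above.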


anirudhSK commented on September 27, 2024

Gordon,

Thanks for these comments; it helps illustrate the problem in at least one more context: an FPGA substrate.

Like you said, I think we'll need more examples to be certain. For what it's worth, my sense based on the examples we used in the Domino work (http://dl.acm.org/citation.cfm?doid=2934872.2934900) is that @atomic should suffice for all of them because it corresponds directly to the packet transactions abstraction used there. I also think we can implement a compiler pass (Section 4.2 of http://dl.acm.org/citation.cfm?doid=2934872.2934900 has details) that can decompose a user-supplied @atomic into minimum-size @atomic blocks. This makes use of (among other things) the trick you just mentioned of reading a register into metadata for subsequent use.
We also need to solve the hardware-centric problem of specifying what the atomic instructions even are. This is substrate-specific and different hardware atomic instructions might have different performance characteristics (as measured in packet processing rate). I think part 2 of the compiler, which would reside in an FPGA or ASIC backend, would take the minimal-size atomic blocks from part 1 and generate atomic instructions for them if possible. On an FPGA target, a more generous block will run slower; on an ASIC, it may not run at all. Either way, the code generator should catch this.

I am happy to take a stab at implementing part 1 in the P4-16 compiler, while part 2 would reside in a vendor's backend. This has the added benefit of allowing the vendor to keep their atomic instructions closed and hidden within the backend.

Anirudh


gbrebner commented on September 27, 2024

I think that the stateful extern that should be discussed and evolved first is register. There's been some discussion on #73 about whether P4 is drifting in a general-purpose direction - while I'm comfortable with the issues under discussion there, I think that register is actually the biggest danger, since it appears as a very generic stateful artifact, but this has conflicts with the overall P4 model, and especially its concurrency. So developing it further, whether through constraints or through defining more atomic operations, as an initial focus for atomic concurrency would be wise.


anirudhSK commented on September 27, 2024

Yes, #73 is very pertinent here. I think it's useful for programmers to have a cost model of the hardware. At the same time, I agree that this is an implementation/target concern---not something to be mandated by the language.

For instance, I can imagine putting in a few conservative checks in a target's compiler that limits the extent of an atomic block. You could measure "extent" either by counting the number of statements within a @atomic or by turning the @atomic into a DAG of primitive instructions and measuring its depth.

Such conservative checks may be useful for many of the problems @chkim4142 points out in #73 like arbitrary complicated action-body expressions, action-body statements, and control-block statements.

@mbudiu-vmw, @ChrisDodd : How difficult is it to implement such checks in the P4-16 compiler?
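
For concreteness, the two "extent" measures mentioned above could look roughly like this in Python. This is only a sketch: the dependency-map representation, the operation names, and the `MAX_DEPTH` limit are invented for the example, and the depth computation assumes the dependency graph is acyclic.

```python
from functools import lru_cache

def statement_count(block):
    """Extent measured as the number of primitive operations in the block."""
    return len(block)

def dag_depth(block):
    """Extent measured as the longest dependency chain.

    `block` maps each operation name to the tuple of operations it depends on.
    """
    @lru_cache(maxsize=None)
    def depth(op):
        return 1 + max((depth(d) for d in block[op]), default=0)
    return max((depth(op) for op in block), default=0)

# Hypothetical @atomic body: read a register, derive two values, select, write back
atomic_block = {
    "read_r": (),
    "cmp": ("read_r",),
    "new_id": ("read_r",),
    "select": ("cmp", "new_id"),
    "write_r": ("select",),
}

MAX_DEPTH = 3  # made-up target limit

def check(block, max_depth=MAX_DEPTH):
    """Return a diagnostic string if the block is too deep for the target."""
    d = dag_depth(block)
    if d > max_depth:
        return f"rejected: atomic block depth {d} exceeds target limit {max_depth}"
    return "accepted"

print(statement_count(atomic_block))  # 5
print(dag_depth(atomic_block))        # 4
print(check(atomic_block))
```

A check like this is deliberately conservative: it treats every operation as unit cost and knows nothing about what the target can actually fuse into one stage.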


mihaibudiu commented on September 27, 2024

Most of these questions belong to the implementation, not to the spec.
@anirudhSK's own work has shown that just counting statements or depth is not enough, e.g., CoDel and sqrt, or using special hardware widgets (think multiply-add).


anirudhSK commented on September 27, 2024

I can imagine a stateless extern capturing the more exotic hardware ops (sqrt, multiply-add). These then show up as method calls within an atomic block, as opposed to expressions. We might have to "count" such method calls differently from other primitive expressions in the @atomic block, and this might only be doable with an intimate knowledge of the target.

Even if it's target specific, I think it's useful to think through a compiler implementation pathway for @atomic. That includes both compiling @atomics and reporting sane diagnostics when rejecting them, which the programmer can then use to modify their code.


anirudhSK commented on September 27, 2024

Here's a first cut at a specification of atomics written into the concurrency model of the P4-16 draft: #80. It provides the language construct, some suggested compiler implementations, and notes on supporting reasonable diagnostics. Grateful for any feedback.


mihaibudiu commented on September 27, 2024

I have simplified @anirudhSK's text. Here is the text I am proposing to use to replace Sections 18.3 and 18.4. If you like this text I will do the replacement in the spec.

1.1 Dynamic evaluation
The dynamic evaluation of a P4 program is orchestrated by the target model. Each target model needs to specify the order and the conditions under which the various P4 component programs are dynamically executed. For example, in the Simple Switch example the execution flow goes Parser->Pipe->Deparser.
Once a P4 execution block is invoked its execution proceeds until termination according to the semantics defined in this document (the various abstract machines).
1.1.1 Concurrency model
A typical packet processing system needs to execute multiple simultaneous logical “threads:” at the very least there is a thread executing the control plane, which can modify the contents of the tables. The data plane can exchange information with the control plane through extern method calls. Moreover, high throughput packet processing systems may be processing multiple packets simultaneously, e.g., in a pipelined fashion, or concurrently parsing a first packet while performing match-action operations on a second packet. This section specifies the semantics of P4 programs with respect to such concurrent executions.
Each top-level parser or control block is executed as a separate thread when invoked by the target architecture. All the parameters of the block and all local variables are thread-local: i.e., each thread has a private copy of these resources. This applies to the packet_in and packet_out parameters of parsers and deparsers.
As long as a P4 block uses only thread-local storage (e.g., metadata, packet headers, local variables), its behavior in the presence of concurrency is identical with the behavior in isolation, since any interleaving of statements from different threads must produce the same output.
In contrast, extern blocks instantiated by a P4 program are global, shared across all threads. If extern blocks mediate access to state (e.g., counters, registers) – i.e., the methods of the extern block read and write state, these stateful operations are subject to data races. P4 mandates the following behaviors:
• Execution of an action is atomic, i.e., the other threads can “see” the state as it is either before the start of the action or after the completion of the action.
• Execution of a method call on an extern instance is atomic.
To allow users to express atomic execution of larger code blocks, P4 provides an @atomic annotation, which can be applied to block statements, parser states, control blocks or whole parsers.
Consider the following example:

extern Register { ... }
control ingress() {
  Register() r;  
  table flowlet() { /* read state of r in an action */ }
  table new_flowlet() { /* write state of r in an action */ }
  apply {
    @atomic {
       flowlet.apply();
       if (ingress_metadata.flow_ipg > FLOWLET_INACTIVE_TIMEOUT) 
          new_flowlet.apply();
    }
  }
}

This program accesses an extern object r of type Register in actions invoked from tables flowlet (reading) and new_flowlet (writing). Without the @atomic annotation these two operations would not execute atomically: a second packet may read the state of r before the first packet has had a chance to update it.
A compiler backend must reject a program containing @atomic blocks if it cannot implement the atomic execution of the instruction sequence. In such cases, the compiler should provide reasonable diagnostics.
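
The effect of the @atomic block in the flowlet example can be modelled outside P4. The Python sketch below is an illustration only: it assumes r stores the time of the last flowlet start, treats each table application as one indivisible step (in line with "execution of an action is atomic"), and uses made-up arrival times and a made-up `TIMEOUT` in place of FLOWLET_INACTIVE_TIMEOUT. It counts how many new flowlets two back-to-back packets create under every interleaving.

```python
from itertools import combinations

TIMEOUT = 50  # stands in for FLOWLET_INACTIVE_TIMEOUT

def packet_steps(state, now, atomic):
    """Return the list of indivisible steps that one packet performs."""
    meta = {}  # thread-local metadata for this packet

    def read():                 # flowlet.apply(): read r into metadata
        meta["flow_ipg"] = now - state["r"]

    def maybe_write():          # new_flowlet.apply(): write r
        if meta["flow_ipg"] > TIMEOUT:
            state["r"] = now
            state["new_flowlets"] += 1

    def both():                 # the whole @atomic block as a single step
        read()
        maybe_write()

    return [both] if atomic else [read, maybe_write]

def outcomes(atomic):
    """Run two packets (arriving at t=100 and t=101) under every interleaving."""
    seen = set()
    n = 1 if atomic else 2      # steps per packet
    for slots in combinations(range(2 * n), n):
        state = {"r": 0, "new_flowlets": 0}
        queues = {"A": packet_steps(state, 100, atomic),
                  "B": packet_steps(state, 101, atomic)}
        for i in range(2 * n):
            who = "A" if i in slots else "B"
            queues[who].pop(0)()
        seen.add(state["new_flowlets"])
    return sorted(seen)

print(outcomes(atomic=False))  # [1, 2]
print(outcomes(atomic=True))   # [1]
```

Without the atomic block, some schedules let both packets read the stale value of r and each start a new flowlet; with it, only the serial result is reachable.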


gbrebner commented on September 27, 2024

This looks reasonable to me, and clarifies an important aspect of P4 execution. As a small detail, it looks like this would just replace 18.4 ("Dynamic evaluation"), and not both 18.3 and 18.4 as you say.

In practical terms for most targets, as Anirudh identified in his original proposal, it will have to be the case that @atomic blocks are relatively short and local before concurrency benefits start getting lost. Extreme uses like putting @atomic round large components will condemn systems to largely handle each packet to completion before taking a next packet. (Unless compilers are smart enough to discover that the user's broad-range @atomic block is in fact unnecessary and actual concurrency dangers are much more local or maybe non-existent.)


mihaibudiu commented on September 27, 2024

The reason it makes sense to label a whole control is that you could write control modules in a library which have to behave atomically. After inlining these turn into blocks. There is no other way to do a multi-state atomic parser code fragment.


anirudhSK commented on September 27, 2024

@mbudiu-vmw, thanks for writing this up. I think it's reasonable overall, but here are a few comments, which might clarify some aspects that confused me.

1. "very least there is a thread executing the control plane, which can modify the contents of the tables." Personally, I would not bring up the control plane here because it is not written in P4, making it hard to specify its behavior in any way. But if we do bring it up, maybe we should specify some expected behavior, like guaranteeing that the match-action table has either the old rules or the new ones but not a strange mix.
2. "The data plane can exchange information with the control plane through extern method calls." While this is true (I think you are referring to learning filters), this isn't the primary use case for externs as I understand it. For instance, registers, counters, and meters are externs that maintain data plane state and have nothing to do with the controller except for one-time configuration. I think something like "The data plane can store and manipulate state on a per-packet basis through extern method calls, e.g., registers and counters" would be a better way to introduce externs in this section.
3. "concurrently parsing a first packet while performing match-action operations on a second packet." This is true, but not the focus of the concurrency model in this section. This section is discussing concurrency within a block (parser or control), not across different blocks as mandated by the target model, which is beyond P4.
4. "In contrast, extern blocks instantiated by a P4 program are global". If #81 is adopted, then we should say "extern blocks are global by default" and include local externs as examples of thread-local storage.
5. "is executed as a separate thread when invoked by the target architecture.". I would try and give examples of when it is invoked, e.g., packet arrival from the wire or a parsed packet arriving from another P4 program within the target model.


anirudhSK commented on September 27, 2024

@gbrebner:

Generous atomic extents are a problem with a simplistic compiler, and rejecting really large atomic blocks is the right place to start. That said, your final parenthetical remark "Unless compilers are smart enough ..." is the direction I hope P4 compilers will go towards in the future :)


mihaibudiu commented on September 27, 2024

Answers to @anirudhSK

1. We have to mention the control plane in some way. I don't think we can promise atomic control-plane operations: we don't know in this spec what these operations are, and on many targets it may be impossible to respect this promise.
2. Actually counters are exactly there to be read and probably reset by the control-plane. In this document we are not making any assumptions about what the various externs do, and how they interact with the control-plane, so we have to assume the worst-case behavior.
3. Actually we have to discuss concurrency between different P4 blocks too; the spec allows you to instantiate an extern block at top-level and pass references to it to multiple architectural blocks, e.g., both parsers and controls. Also, different extern blocks may communicate with each other through various hidden channels, e.g., the control-plane (consider a learning provider that communicates with a packet generator). @atomic must work even between different P4 architectural blocks.
4. Given the current spec the user cannot construct thread-local externs, so I could not refer to them yet (only packet_in and packet_out are thread-local). If we adopt the proposal in issue #81 then this section should be amended as you describe. I have not attempted to address issue #81 yet, we should first discuss it, but I think the discussion first requires us to solve this issue.
5. There is a short (and rather vague) example describing how possible invocations may occur in the previous sub-section, based on the very simple switch (VSS) model. We can't be too specific here, because we don't know anything about the architecture. Take your two examples "packet arrival from the wire or a parsed packet arriving from another P4 program within the target model:" these don't even hold in the P4-14 spec model: after arrival on the wire some hidden architectural block checks and removes the packet ethernet trailer checksum and may also drop the packet. Also, between parser and control there is some queueing which could use the priorities computed by the parser. So P4 blocks in general are not invoked immediately one after another. The most precise description on how invocations occur is in the VSS architectural model description.


anirudhSK commented on September 27, 2024

@mbudiu-vmw: Ok with 2, 4, and 5. Comments on 1 and 3 below.

I am ok with mentioning the control plane. Can the spec at least say that "the target architecture should provide some formal semantics about how the control and data planes interact"?
I agree and understand your point now. I have one clarification though. We could potentially manipulate the same extern instance from different P4 blocks. In this case, we want the method calls on that instance to appear atomic. But, do we really need an @atomic spanning different P4 blocks, e.g., parser and control?

I am concerned with externs that have "hidden channels". A simpler view would be to say that each extern instance is an independent entity with no hidden state that is shared between externs. I think this is equivalent to saying that operations on different extern instances commute. Would this be too strong?
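
To illustrate what "commute" would mean here, the Python sketch below compares two toy cases: two counters with purely private state, and a learner that feeds a packet generator through a hidden channel (loosely modelled on the learning provider / packet generator example mentioned earlier in the thread). All class and function names are hypothetical and exist only for this example.

```python
class Counter:
    """Hypothetical extern whose state is purely private."""
    def __init__(self):
        self.value = 0

    def count(self):
        self.value += 1

class PacketGenerator:
    """Hypothetical extern that emits a packet built from whatever has been learned."""
    def __init__(self):
        self.pending = []
        self.emitted = []

    def emit(self):
        self.emitted.append(tuple(self.pending))

class Learner:
    """Hypothetical extern that writes into the generator through a hidden channel."""
    def __init__(self, generator):
        self.generator = generator

    def learn(self, key):
        self.generator.pending.append(key)

def commutes(make_world, op_a, op_b, observe):
    """True if op_a;op_b and op_b;op_a leave the same observable state."""
    w1 = make_world()
    op_a(w1)
    op_b(w1)
    w2 = make_world()
    op_b(w2)
    op_a(w2)
    return observe(w1) == observe(w2)

def independent_world():
    return {"c1": Counter(), "c2": Counter()}

def hidden_channel_world():
    gen = PacketGenerator()
    return {"gen": gen, "learner": Learner(gen)}

independent = commutes(independent_world,
                       lambda w: w["c1"].count(),
                       lambda w: w["c2"].count(),
                       lambda w: (w["c1"].value, w["c2"].value))

coupled = commutes(hidden_channel_world,
                   lambda w: w["learner"].learn("k"),
                   lambda w: w["gen"].emit(),
                   lambda w: w["gen"].emitted)

print(independent, coupled)  # True False
```

Operations on the independent counters can be reordered freely, while the learner and generator cannot, which is the distinction the "no hidden state" assumption would encode.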


jnfoster commented on September 27, 2024

Minor snark: Formal semantics is the reason that networking people find my papers unreadable :-) The "formal" in that phrase means having to do with "forms" or syntax, and few things about P4 have ever been fully specified via a formal semantics. I might just say "targets should specify [or perhaps just 'describe'] how the control and data planes interact."


anirudhSK commented on September 27, 2024

Fair enough :-) I like your wording much better.


mihaibudiu commented on September 27, 2024

I have added a line about 1.

For 2 - the atomic block does not span multiple P4 blocks - what happens is that the @atomic execution is visible as atomic everywhere (it is not only atomic in the control, it is also atomic for parsers, and all other blocks in the P4 programs - they can only see state before or after the atomic block).

However, with 2 I think that we cannot do anything about externs that interact. All externs are visible to the control plane, and the control-plane may have APIs to read and write state from an extern. So in principle all externs can communicate with each other through the control-plane. The compiler front-end has to assume this.

We can perhaps add a series of annotations to give additional information to the front-end (and users). For instance, a @PrivateState annotation could indicate that an extern does not share state with other externs. This would imply that method calls between this extern and different ones can be reordered. But we may also do this in a later language revision.


anirudhSK commented on September 27, 2024

The @PrivateState annotation seems useful; it could even be on by default. That said, I agree we don't have to address that here and can consider it in a later language revision.


chkim4142 commented on September 27, 2024

We agree with what's proposed. We might need to look into the BMv2 architecture and the compiler backend for BMv2 to see what's needed to realize this.


