An overview of Openvswitch implementation
Author: Cong Wang <xiyou.wangcong@gmail.com>
This is NOT a tutorial on how to use openvswitch. It is for developers who want to know the implementation details of the openvswitch project, so I assume you at least know the basic concepts of openvswitch and know how to use it. If not, see www.openvswitch.org for documents and slides.
Let’s start from the user-space part. Openvswitch user-space contains several components: a few daemons which actually implement the switch and the flow table (the core of openvswitch), and several utilities to manage the switch and the database, and even to talk to the kernel directly.
There are three daemons started by the openvswitch service: ovs-vswitchd, which is the core implementation of the switch; ovsdb-server, which manages the database holding the vswitch configuration and flows; and ovs-brcompatd, which keeps compatibility with traditional bridges (the ones you create with the ‘brctl’ command).
Their relationship is shown in the picture below:
Obviously, the most important one is ovs-vswitchd: it implements the switch and talks directly with the kernel via the netlink protocol (we will see the details later). Ovs-vsctl is the utility used primarily to manage the switch. It does not talk to ovs-vswitchd; instead it talks to ovsdb-server, which directly manages the database holding the switch configuration. Ovs-vswitchd in turn talks to ovsdb-server via a Unix domain socket in order to retrieve or save the configuration information. This is also why your openvswitch configuration survives a reboot even if you are not using ifcfg* files.
Ovs-brcompatd is omitted here because we are not very interested in it now.
Ovs-vsctl can do lots of things for you, so most of the time you will use ovs-vsctl to manage openvswitch. Ovs-appctl is also available to manage ovs-vswitchd itself: it sends internal commands to the ovs-vswitchd daemon to change some of its configuration. (It can talk to ovsdb-server and other daemons too, but it talks to ovs-vswitchd by default.)
However, sometimes you may need to manage the datapath in the kernel directly by yourself. In this case, assuming ovs-vswitchd is not running, you can invoke ovs-dpctl to manage the datapath in kernel space directly, without the database.
And when you need to talk with ovsdb-server directly to do some database operations, you can run ovsdb-client. If you want to manipulate the database directly without ovsdb-server, ovsdb-tool is handy too.
What’s more, openvswitch can also be administered and monitored by a remote controller. This is why we can define the network by software! sFlow is a protocol for packet sampling and monitoring, while OpenFlow is a protocol to manage the flow table of a switch, bridge, or device. Openvswitch supports both OpenFlow and sFlow. With ovs-ofctl, you can use OpenFlow to connect to the switch and monitor and administer it remotely. sFlowTrend, which is not part of the openvswitch package, is a tool that can handle the sFlow side.
Now, let’s take a look at the kernel part.
As mentioned previously, user-space communicates with kernel-space via the netlink protocol; generic netlink (genl) is used in this case. There are several groups of genl commands defined by the kernel. They are used to get/set/add/delete a datapath/flow/vport and to execute some actions on a specific packet.
The ones used to control the datapath are:
enum ovs_datapath_cmd {
	OVS_DP_CMD_UNSPEC,
	OVS_DP_CMD_NEW,
	OVS_DP_CMD_DEL,
	OVS_DP_CMD_GET,
	OVS_DP_CMD_SET
};
and there are corresponding kernel functions which do the job:
ovs_dp_cmd_new()
ovs_dp_cmd_del()
ovs_dp_cmd_get()
ovs_dp_cmd_set()
Similar functions are defined for vport and flow commands too.
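To make the command-to-function mapping above concrete, here is a minimal user-space sketch of a dispatch table. It only reuses the enum shown above; the toy_* handlers, struct toy_state and toy_dispatch() are hypothetical names I made up for illustration, and the real kernel registers its handlers through the generic netlink infrastructure rather than through a plain array like this.

```c
#include <stddef.h>

/* Same command numbering as the enum shown in the article. */
enum ovs_datapath_cmd {
	OVS_DP_CMD_UNSPEC,
	OVS_DP_CMD_NEW,
	OVS_DP_CMD_DEL,
	OVS_DP_CMD_GET,
	OVS_DP_CMD_SET
};

/* Hypothetical stand-in for the state a real handler would modify. */
struct toy_state {
	int dp_exists;
};

typedef int (*toy_handler)(struct toy_state *);

/* Toy handlers: return 0 on success, -1 on error. */
static int toy_dp_new(struct toy_state *s)
{
	if (s->dp_exists)
		return -1;	/* datapath already created */
	s->dp_exists = 1;
	return 0;
}

static int toy_dp_del(struct toy_state *s)
{
	if (!s->dp_exists)
		return -1;	/* nothing to delete */
	s->dp_exists = 0;
	return 0;
}

static int toy_dp_get(struct toy_state *s)
{
	return s->dp_exists ? 0 : -1;
}

static int toy_dp_set(struct toy_state *s)
{
	return s->dp_exists ? 0 : -1;
}

/* One handler per command, indexed by the command number. */
static const toy_handler toy_dp_ops[] = {
	[OVS_DP_CMD_NEW] = toy_dp_new,
	[OVS_DP_CMD_DEL] = toy_dp_del,
	[OVS_DP_CMD_GET] = toy_dp_get,
	[OVS_DP_CMD_SET] = toy_dp_set,
};

/* Look up the handler for a command and run it. */
int toy_dispatch(struct toy_state *s, int cmd)
{
	if (cmd <= OVS_DP_CMD_UNSPEC || cmd > OVS_DP_CMD_SET)
		return -1;	/* unknown or unspecified command */
	return toy_dp_ops[cmd](s);
}
```

With a zeroed struct toy_state, dispatching OVS_DP_CMD_GET fails until OVS_DP_CMD_NEW has been dispatched once, which is roughly the kind of ordering the real commands impose on a datapath.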
Before we talk about the details of the data structures, let’s see how a packet is sent out from or received on a port of an ovs bridge. The ovs bridge is an internal device defined by the openvswitch module, which the kernel views as a struct vport too. When a packet is sent out from the ovs bridge, it is finally passed to internal_dev_xmit(), which in turn “receives” the packet. Now the kernel needs to look at the flow table to see if there is any “cache” entry for how to forward this packet. This is done by the function ovs_flow_tbl_lookup(), which needs a key. The key is extracted by ovs_flow_extract(), which briefly collects the details of the packet (L2~L4) and then constructs a unique key for this flow. Assume this is the first packet going out after we create the ovs bridge, so there is no “cache” in the kernel and the kernel doesn’t know how to handle this packet! It will then pass the packet to user-space with an “upcall”, which uses genl too. The user-space daemon, ovs-vswitchd, will check the database to see which port is the destination for this packet, and will respond to the kernel with OVS_ACTION_ATTR_OUTPUT to tell the kernel which port it should forward to; in this case let’s assume it is eth0. Finally, an OVS_PACKET_CMD_EXECUTE command lets the kernel execute the action we just set. That is, the kernel will execute this genl command in the function do_execute_actions() and finally forward the packet to the port “eth0” with do_output(). Then it goes outside!
The receiving side is similar. The openvswitch module registers an rx_handler, netdev_frame_hook(), for the underlying (non-internal) devices. Once an underlying device receives a packet on the wire and no cached flow matches it, openvswitch will forward it to user-space to check where it should go and what actions need to be executed on it. For example, if this is a VLAN packet, the VLAN tag should be removed from the packet first, and then the packet is forwarded to the right port. User-space can learn which port is the right one for a given packet.
The internal devices are special: when a packet is sent to an internal device, it is immediately sent up to openvswitch to decide where it should go, instead of really being sent out. There is actually no way out of an internal device directly.
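The lookup, miss, upcall sequence for the sending side can be sketched as a small user-space simulation. This is only a toy under my own assumptions: all toy_* names, the fixed-size table and the "always forward to port 1" policy are hypothetical, and installing the flow directly inside the upcall is a simplification of what the kernel and ovs-vswitchd really exchange over genl.

```c
#include <stdint.h>
#include <string.h>

#define TOY_MAX_FLOWS 16
#define TOY_PORT_ETH0 1u	/* hypothetical port number standing in for eth0 */

/* Simplified packet and key: only the input port and the L2 addresses. */
struct toy_packet {
	uint32_t in_port;
	uint8_t eth_dst[6];
	uint8_t eth_src[6];
};

struct toy_key {
	uint32_t in_port;
	uint8_t eth_dst[6];
	uint8_t eth_src[6];
};

struct toy_flow {
	struct toy_key key;
	uint32_t out_port;	/* the single cached "output" action */
};

struct toy_table {
	struct toy_flow flows[TOY_MAX_FLOWS];
	int n_flows;
	int n_upcalls;		/* how often the slow path was taken */
};

/* Plays the role of ovs_flow_extract(): build a key from the packet. */
void toy_flow_extract(const struct toy_packet *pkt, struct toy_key *key)
{
	key->in_port = pkt->in_port;
	memcpy(key->eth_dst, pkt->eth_dst, sizeof(key->eth_dst));
	memcpy(key->eth_src, pkt->eth_src, sizeof(key->eth_src));
}

static int toy_key_equal(const struct toy_key *a, const struct toy_key *b)
{
	return a->in_port == b->in_port &&
	       memcmp(a->eth_dst, b->eth_dst, sizeof(a->eth_dst)) == 0 &&
	       memcmp(a->eth_src, b->eth_src, sizeof(a->eth_src)) == 0;
}

/* Plays the role of ovs_flow_tbl_lookup(): linear search, NULL on a miss. */
const struct toy_flow *toy_flow_lookup(const struct toy_table *t,
				       const struct toy_key *key)
{
	int i;

	for (i = 0; i < t->n_flows; i++)
		if (toy_key_equal(&t->flows[i].key, key))
			return &t->flows[i];
	return NULL;
}

/* Slow path: stand-in for the upcall to user-space and its answer. */
uint32_t toy_upcall(struct toy_table *t, const struct toy_key *key)
{
	uint32_t out_port = TOY_PORT_ETH0;	/* hypothetical policy */

	t->n_upcalls++;
	if (t->n_flows < TOY_MAX_FLOWS) {
		/* cache the decision so later packets stay on the fast path */
		t->flows[t->n_flows].key = *key;
		t->flows[t->n_flows].out_port = out_port;
		t->n_flows++;
	}
	return out_port;
}

/* Returns the port the packet is forwarded to. */
uint32_t toy_process_packet(struct toy_table *t, const struct toy_packet *pkt)
{
	struct toy_key key;
	const struct toy_flow *flow;

	toy_flow_extract(pkt, &key);
	flow = toy_flow_lookup(t, &key);
	if (flow)
		return flow->out_port;	/* fast path: cached flow */
	return toy_upcall(t, &key);	/* first packet of the flow */
}
```

Feeding the same packet twice into toy_process_packet() increments n_upcalls only once; the second packet hits the cached flow, which is the whole point of the kernel flow table.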
Besides OVS_ACTION_ATTR_OUTPUT, the kernel also defines some other actions:
OVS_ACTION_ATTR_USERSPACE: Pass the packet to user-space
OVS_ACTION_ATTR_SET: Modify the header of the packet
OVS_ACTION_ATTR_PUSH_VLAN: Insert a vlan tag into the packet
OVS_ACTION_ATTR_POP_VLAN: Remove the vlan tag from the packet
OVS_ACTION_ATTR_SAMPLE: Do sampling
With these actions combined, user-space can implement different policies, like a bridge, a bond, or a VLAN device. The GRE tunnel is not currently in upstream, so we don’t care about it now.
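How a list of such actions combines into a policy can be sketched in a few lines. This is a toy model, not the kernel code: the TOY_ACT_* values only play the roles of OVS_ACTION_ATTR_OUTPUT, OVS_ACTION_ATTR_PUSH_VLAN and OVS_ACTION_ATTR_POP_VLAN, the packet is reduced to "tagged or not plus a VLAN id", and the rule that a packet carries at most one tag is my own simplification.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_OUT 8

enum toy_action_type {
	TOY_ACT_OUTPUT,		/* role of OVS_ACTION_ATTR_OUTPUT */
	TOY_ACT_PUSH_VLAN,	/* role of OVS_ACTION_ATTR_PUSH_VLAN */
	TOY_ACT_POP_VLAN	/* role of OVS_ACTION_ATTR_POP_VLAN */
};

struct toy_action {
	enum toy_action_type type;
	uint32_t arg;		/* port number or VLAN id, unused for pop */
};

/* Hypothetical packet model: just the VLAN state. */
struct toy_pkt {
	int tagged;
	uint16_t vid;
};

/* What left on which port, recorded instead of really transmitting. */
struct toy_out {
	uint32_t port;
	int tagged;
	uint16_t vid;
};

struct toy_result {
	struct toy_out out[TOY_MAX_OUT];
	size_t n_out;
};

/* Run the actions in order. Returns 0 on success, -1 on an invalid list. */
int toy_execute_actions(struct toy_pkt *pkt, const struct toy_action *acts,
			size_t n_acts, struct toy_result *res)
{
	size_t i;

	for (i = 0; i < n_acts; i++) {
		switch (acts[i].type) {
		case TOY_ACT_OUTPUT:
			if (res->n_out == TOY_MAX_OUT)
				return -1;
			/* the copy sent out reflects the packet as it is now */
			res->out[res->n_out].port = acts[i].arg;
			res->out[res->n_out].tagged = pkt->tagged;
			res->out[res->n_out].vid = pkt->vid;
			res->n_out++;
			break;
		case TOY_ACT_PUSH_VLAN:
			if (pkt->tagged)
				return -1;	/* toy model: one tag only */
			pkt->tagged = 1;
			pkt->vid = (uint16_t)acts[i].arg;
			break;
		case TOY_ACT_POP_VLAN:
			if (!pkt->tagged)
				return -1;	/* no tag to remove */
			pkt->tagged = 0;
			pkt->vid = 0;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

For example, the list "output to port 1, push VLAN 10, output to port 2" sends an untagged copy to port 1 and a copy tagged with VLAN 10 to port 2, because each output sees the packet as modified by the actions before it.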
So far, you know how packets are handled by the openvswitch module. There are many more details, especially about the flow and datapath mentioned previously. A flow in the kernel is represented as struct sw_flow, a datapath is defined as struct datapath, the actions on a flow are defined as struct sw_flow_actions, plus the one we already mentioned, struct vport. These structures are the most important ones for the openvswitch kernel module; their relationship is demonstrated in this picture:
The most important thing to mention is that each struct sk_buff is associated with a struct sw_flow via a pointer in the ovs control block, and that the actions above are associated with each flow. Every time a packet is passed to the openvswitch module, the module first needs to look up the flow table with the key we mentioned. The flow table is contained in a datapath, which in turn is contained either in a struct vport or in a global linked list. If a flow is found, the corresponding actions will be executed. Remember that datapath, flow, and vport can all be changed by user-space with specific genl commands.
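The relationships described above can be written down as a handful of heavily simplified structures. Everything here is a sketch with hypothetical toy_* names and made-up fields (the real struct sw_flow, struct datapath, struct vport and the ovs control block carry far more than this); it only shows who points to whom.

```c
#include <stddef.h>

#define TOY_TABLE_SIZE 4

/* Role of struct sw_flow_actions: what to do with a matching packet. */
struct toy_sw_flow_actions {
	size_t n_actions;
	const char *desc;	/* human-readable placeholder for the actions */
};

/* Role of struct sw_flow: a key plus the actions attached to it. */
struct toy_sw_flow {
	unsigned int key;	/* stands in for the extracted flow key */
	struct toy_sw_flow_actions *sf_acts;
};

/* Role of struct datapath: owns the flow table. */
struct toy_datapath {
	struct toy_sw_flow *table[TOY_TABLE_SIZE];
	size_t n_flows;
};

/* Role of struct vport: a port that leads to its datapath. */
struct toy_vport {
	const char *name;
	struct toy_datapath *dp;
};

/* Role of the ovs control block: remembers the flow the packet matched. */
struct toy_cb {
	struct toy_sw_flow *flow;
};

/* Role of struct sk_buff: the packet, with its input port and cb. */
struct toy_skb {
	struct toy_vport *in;
	struct toy_cb cb;
};

/*
 * Walk packet -> vport -> datapath -> flow table, record the matching
 * flow in the control block and return its actions (NULL on a miss).
 */
const struct toy_sw_flow_actions *toy_classify(struct toy_skb *skb,
					       unsigned int key)
{
	struct toy_datapath *dp = skb->in->dp;
	size_t i;

	for (i = 0; i < dp->n_flows; i++) {
		if (dp->table[i]->key == key) {
			skb->cb.flow = dp->table[i];
			return dp->table[i]->sf_acts;
		}
	}
	skb->cb.flow = NULL;
	return NULL;
}
```

A miss leaves the control block's flow pointer NULL, which in this toy corresponds to the case where the packet has to go up to user-space.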
As you can see, the kernel part only implements a mechanism and the fast path (except for the first packet), while user-space implements different policies on top of the mechanism provided by the kernel: the slow path. User-space is much more complicated, so I will not cover its details here.
Update: Ben, one of the most active developers of openvswitch, pointed out some mistakes in the previous version, and I updated this article as he suggested. Thanks to Ben!
November 4th, 2012 at 1:29 AM #Ben Pfaff
I see a few mistakes in the diagram. The most important is that ovs-vsctl talks to ovsdb-server, not to ovs-vswitchd. A minor point is that ovs-appctl can talk to both ovs-vswitchd and ovsdb-server (and other less important daemons), although it talks to ovs-vswitchd by default.
王 聪 reply on November 5, 2012 12:00 AM:
Thanks, Ben! I will fix it.
Rahul reply on August 23, 2013 10:21 PM:
Hi Cong,
It seems that you have incorporated the above comment in the diagram, but the theory given below the diagram still says that ovs-vsctl talks with ovs-vswitchd. Can you please correct it if it's a mistake, so that the diagram and the theory provided are not confusing? Otherwise, it's a very good article to learn about openvswitch. Thanks for providing such a simple and detailed view.
November 16th, 2012 at 6:18 PM #Sudhakar
Hi (Wang??)
Thanks for this post. This is helping me understand the OVS implementation better and quicker.
Thank you so much. I will be waiting for your article on user-space side of the implementation too :)
Regards,
Sudhakar.
Jayce reply on November 21, 2012 2:41 PM:
I’m looking forward too.
November 20th, 2012 at 11:42 AM #lxs
I am researching OVS now, and I have a question about user space and the kernel: when a packet misses in the flow table, it is sent to user space. Do you know which function in the code receives the packet?
January 10th, 2013 at 7:54 PM #Talha
I think the part of the article that explains packet being sent out from ovs is not correct. It is not the internal device all the time. The output mechanism depends on the type of vport it is being sent on. There are 5 types of vports i.e GRE, linux native netdev, internal, CAPWAP, patch type.
Packet output mechanism depends on the type of the port on which it is being sent. What you explained in article about internal_dev_xmit only holds true in case of internal device. For other devices the sending path is different for each. Similar is the case on reception.
Above this vport layer is the common layer in OVS that handles the packets in the traffic
王 聪 reply on January 13, 2013 6:59 PM:
I am talking about the upstream (kernel.org) openvswitch, not the one of openvswitch.org. Upstream kernel doesn’t have gre tunnel etc. in openvswitch.
March 30th, 2013 at 12:15 PM #David Zhang
Hi,
Firstly, thank you very much for the above great post.
I am not sure whether it is OK to ask my question here.
My question is about the flow table when we are manually adding the flow entry.
(1) How is the flow table built initially?
(2) According to my understanding, the flow entry will be added when the first packet of the flow comes in; is that correct?
(3) What is the flow table entry update and timeout mechanism? How do I know an individual flow entry's timeout?
I can't find any doc that answers my questions.
Your help is highly appreciated.
David Zhang reply on March 30, 2013 12:16 PM:
Sorry my typo.
My question is about the flow table when we are manually adding the flow entry.
should be:
My question is about the flow table when we are not manually adding any flow entry to the flow table.
April 12th, 2013 at 3:25 PM #braver
It is really so helpful to me! Thank you!
April 16th, 2013 at 8:28 PM #braver
It is my pleasure to discuss openvswitch with you. I am an undergraduate student from China.
Recently, I have been reading the openvswitch source code, and I have an idea which I would really like your help with:
My idea: as far as I know, openvswitch currently only supports matching some specific protocols, such as IP, ARP and so on.
But what if I want to match some new protocol field?
For example:
If my new dl_type is 0x0909, then the next element is not nw_src or nw_dst. Suppose I define a new element ‘S’ (4 bytes).
Now I want to switch the packets by matching on ‘S’.
Can you give me some help to support this? I have some difficulty reading the source code.
Really thank you for your attention !
Best regard !
September 12th, 2013 at 1:28 AM #Radu
Hello guys,
I need some help!
I need to change the OVS source code (I have version 1.9.3) to implement the ARP-Path protocol. Here’s a short video explaining how this new protocol works: http://www.youtube.com/watch?v=IhwCYAu_E7E
Basically it has to create paths by learning from the ARP messages, so I need to know how to modify the tables and where messages are handled when they arrive. I did not quite understand whether this happens in the kernel or in user space, and I cannot find the part of the source code that needs to be modified to achieve this (where the ARP packets are processed).
Can you help me with some ideas?
I thank you in advance.
October 25th, 2013 at 10:36 AM #dsyjc
Thanks for your introduction.
Recently, I have a question.
I need a function similar to “split-horizon”. That is, when one port receives broadcast packets, it does not forward these packets to other ports.
Can ovs implement this?
Thank you!