Microservices Study Notes - Network Traffic in Kubernetes

Published 2021-05-12, updated 2023-01-05, category: k8s

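Overview

This example uses a KubeSphere v3.1.0 environment (Kubernetes v1.20.4 + Calico v3.16.3), with Kourier as the gateway.

Once a workload is deployed on Kubernetes and exposed as a service, incoming traffic passes through the gateway, then the load balancer, and finally reaches the backend.

Along this path, Kourier acts as the gateway: it inspects the destination of each request and dispatches it by domain name to the load balancer, that is, the Service.

Kube-proxy (in ipvs mode in this example) handles the Service and Endpoint part of the traffic.

Calico (in IPIP mode in this example) carries the traffic between the backends, that is, between pods.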

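Gateway

Ingress is the Kubernetes abstraction of a reverse proxy for layer-7 network traffic.

Kourier implements Ingress on top of Envoy. It watches for resource changes in the cluster and updates Envoy's configuration so that frontend traffic is routed correctly to the backend Service.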
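
As a rough illustration of the host-based dispatch described above, the following Go sketch looks up a backend by the Host header of a request. It is not how Kourier or Envoy is implemented; routeTable, backendFor, and the host and backend names are hypothetical.

```go
package main

import (
    "fmt"
    "net"
    "strings"
)

// routeTable maps a request host to a backend Service address.
// Both the host and the backend are made-up values for illustration.
var routeTable = map[string]string{
    "sample.default.example.com": "sample-svc.default.svc.cluster.local:80",
}

// backendFor returns the backend registered for a Host header.
// It lowercases the host and strips an optional port before the lookup.
func backendFor(hostHeader string) (string, bool) {
    host := strings.ToLower(hostHeader)
    if h, _, err := net.SplitHostPort(host); err == nil {
        host = h
    }
    backend, ok := routeTable[host]
    return backend, ok
}

func main() {
    hosts := []string{
        "sample.default.example.com",
        "Sample.default.example.com:80",
        "unknown.example.com",
    }
    for _, h := range hosts {
        backend, ok := backendFor(h)
        fmt.Printf("%s -> %q (found=%v)\n", h, backend, ok)
    }
}
```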

Kube-proxy

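kube-proxy's configuration file is stored at /var/lib/kube-proxy/config.conf inside the kube-proxy container.

It also accepts the command-line flags shared by the Kubernetes components, such as --feature-gates.

How kube-proxy watches resources

When serviceConfig and endpointsConfig are initialized, they register AddFunc, UpdateFunc and DeleteFunc on the corresponding Informer: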

func NewServiceConfig(serviceInformer coreinformers.ServiceInformer, resyncPeriod time.Duration) *ServiceConfig {
    result := &ServiceConfig{
        listerSynced: serviceInformer.Informer().HasSynced,
    }

    serviceInformer.Informer().AddEventHandlerWithResyncPeriod(
        cache.ResourceEventHandlerFuncs{
            AddFunc:    result.handleAddService,
            UpdateFunc: result.handleUpdateService,
            DeleteFunc: result.handleDeleteService,
        },
        resyncPeriod,
    )

    return result
}

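After that, each of the two objects registers an event handler, namely the proxier, through its RegisterEventHandler method:

serviceConfig.RegisterEventHandler(s.Proxier)

Taking Service as an example, when serviceConfig starts, it calls the proxier's OnServiceSynced method to handle the event: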

func (c *ServiceConfig) Run(stopCh <-chan struct{}) {
    for i := range c.eventHandlers {
        c.eventHandlers[i].OnServiceSynced()
    }
}
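
The registration and callback flow in the snippets above can be reduced to the following self-contained sketch. It uses simplified, assumed types (a string instead of *v1.Service, no informer and no cache sync), so it only illustrates the pattern rather than the actual kube-proxy code; recordingProxier is a made-up handler.

```go
package main

import "fmt"

// ServiceHandler is a trimmed-down stand-in for the callbacks a proxier exposes.
type ServiceHandler interface {
    OnServiceAdd(name string)
    OnServiceSynced()
}

// ServiceConfig stores the registered handlers and forwards events to them.
type ServiceConfig struct {
    eventHandlers []ServiceHandler
}

// RegisterEventHandler adds a handler that will receive service events.
func (c *ServiceConfig) RegisterEventHandler(h ServiceHandler) {
    c.eventHandlers = append(c.eventHandlers, h)
}

// handleAddService plays the role of the informer's AddFunc callback.
func (c *ServiceConfig) handleAddService(name string) {
    for _, h := range c.eventHandlers {
        h.OnServiceAdd(name)
    }
}

// Run tells every handler that the initial state has been delivered.
func (c *ServiceConfig) Run() {
    for _, h := range c.eventHandlers {
        h.OnServiceSynced()
    }
}

// recordingProxier is a hypothetical handler that records what it was told.
type recordingProxier struct {
    events []string
}

func (p *recordingProxier) OnServiceAdd(name string) {
    p.events = append(p.events, "add "+name)
}

func (p *recordingProxier) OnServiceSynced() {
    p.events = append(p.events, "synced")
}

func main() {
    proxier := &recordingProxier{}
    config := &ServiceConfig{}
    config.RegisterEventHandler(proxier)
    config.handleAddService("default/sample-svc")
    config.Run()
    fmt.Println(proxier.events)
}
```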

The proxier in this example is ipvs, so see kubernetes/pkg/proxy/ipvs/proxier.go - OnServiceSynced():

func (proxier *Proxier) OnServiceSynced() {
    proxier.mu.Lock()
    proxier.servicesSynced = true
    if utilfeature.DefaultFeatureGate.Enabled(features.EndpointSliceProxying) {
        ...
    // EndpointSliceProxying is not enabled in this example, so go straight to the else branch
    } else {
        // Atomically write to &proxier.initialized
        proxier.setInitialized(proxier.endpointsSynced)
    }
    proxier.mu.Unlock()

    // Sync unconditionally - this is called once per lifetime.
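    // Call syncProxyRules to process the rules.
    // The function is long; its main tasks are:
    // 1. Create the kube-ipvs0 device, to which the Cluster IPs are bound
    // 2. Create the various ipsets and combine them with iptables rules, which greatly shrinks the iptables tables
    // 3. Create the ipvs entries that load-balance from Service to Endpoint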
    proxier.syncProxyRules()
}
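
To make the third task in the comments above more concrete, here is a minimal sketch of how Service and Endpoint data could be turned into ipvs-style virtual servers. The service and virtualServer types and the buildVirtualServers function are hypothetical and heavily simplified; the real syncProxyRules also manages the kube-ipvs0 device, ipsets and iptables rules.

```go
package main

import (
    "fmt"
    "sort"
)

// service is a trimmed-down, assumed stand-in for a Kubernetes Service.
type service struct {
    name      string
    clusterIP string
    port      int
    protocol  string
}

// virtualServer is the ipvs-style view: one virtual address, many real servers.
type virtualServer struct {
    address     string   // clusterIP:port
    protocol    string   // e.g. "TCP"
    realServers []string // endpoint addresses
}

// buildVirtualServers pairs every service with the endpoints registered under its name.
func buildVirtualServers(svcs []service, endpoints map[string][]string) []virtualServer {
    out := make([]virtualServer, 0, len(svcs))
    for _, s := range svcs {
        // Copy and sort so the result is stable and the input is not modified.
        rs := append([]string(nil), endpoints[s.name]...)
        sort.Strings(rs)
        out = append(out, virtualServer{
            address:     fmt.Sprintf("%s:%d", s.clusterIP, s.port),
            protocol:    s.protocol,
            realServers: rs,
        })
    }
    return out
}

func main() {
    svcs := []service{{name: "sample-svc", clusterIP: "10.233.49.34", port: 80, protocol: "TCP"}}
    endpoints := map[string][]string{"sample-svc": {"10.233.96.13:80"}}
    for _, vs := range buildVirtualServers(svcs, endpoints) {
        fmt.Printf("%s %s -> %v\n", vs.protocol, vs.address, vs.realServers)
    }
}
```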

Calico network rules

Calico's Felix component is responsible for handling the iptables rules; see Felix.

Felix's entry point is the Run() function in felix/daemon/daemon.go, which starts all syncers:

// Start the background processing threads.
if syncer != nil {
    syncer.Start()
}

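InternalDataplane is one of these syncers; it implements Felix's data plane on top of iptables and ipsets.

Its work has two main steps: first it applies the initial configuration, then it starts watching for changes and updating the configuration: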

func (d *InternalDataplane) Start() {
    // Do our start-of-day configuration.
    d.doStaticDataplaneConfig()

    // Then, start the worker threads.
    go d.loopUpdatingDataplane()
    go d.loopReportingStatus()
    go d.ifaceMonitor.MonitorInterfaces()
    go d.monitorHostMTU()
}

Initial configuration

felix/dataplane/linux/int_dataplane.go - InternalDataplane.doStaticDataplaneConfig()

func (d *InternalDataplane) doStaticDataplaneConfig() {
    // 1. Load nf_conntrack_proto_sctp
    // 2. Enable IPv4 forwarding
    // 3. Other settings depend on the actual environment
    d.configureKernel()

    if d.config.BPFEnabled {
        d.setUpIptablesBPF()
    } else {
        // This example takes this branch; this code initializes the iptables rules
        d.setUpIptablesNormal()
    }

    if d.config.RulesConfig.IPIPEnabled {
        log.Info("IPIP enabled, starting thread to keep tunnel configuration in sync.")
        go d.ipipManager.KeepIPIPDeviceInSync(
            d.config.IPIPMTU,
            d.config.RulesConfig.IPIPTunnelAddress,
        )
    } else {
        log.Info("IPIP disabled. Not starting tunnel update thread.")
    }
}

Rule flow

Inbound traffic
1. Traffic destined for the local host

  PREROUTING(raw -> mangle -> nat) --> routing table --> INPUT(mangle -> filter) --> routing table --> OUTPUT(raw -> mangle -> nat -> filter) --> POSTROUTING(mangle -> nat)
2. Traffic not destined for the local host

  PREROUTING(raw -> mangle -> nat) --> routing table --> FORWARD(mangle -> filter) --> POSTROUTING(mangle -> nat)
Outbound traffic

routing table --> OUTPUT(raw -> mangle -> nat -> filter) --> POSTROUTING(mangle -> nat)
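
The three paths above can be written down as data, which is handy when tracing a packet through a rule dump by hand. This is only a sketch of the ordering listed in this article; the hook and direction types and the traversal and describe functions are hypothetical names, and routing decisions are left out.

```go
package main

import (
    "fmt"
    "strings"
)

// hook is one iptables chain together with the tables consulted in it, in order.
type hook struct {
    chain  string
    tables []string
}

type direction int

const (
    inboundLocal   direction = iota // destined for the local host
    inboundForward                  // destined for another host or a pod
    outbound                        // generated by the local host
)

// traversal returns the chains a packet crosses, following the lists above.
func traversal(d direction) []hook {
    prerouting := hook{"PREROUTING", []string{"raw", "mangle", "nat"}}
    input := hook{"INPUT", []string{"mangle", "filter"}}
    forward := hook{"FORWARD", []string{"mangle", "filter"}}
    output := hook{"OUTPUT", []string{"raw", "mangle", "nat", "filter"}}
    postrouting := hook{"POSTROUTING", []string{"mangle", "nat"}}

    switch d {
    case inboundLocal:
        // As listed above, this includes the path of the reply sent by the local process.
        return []hook{prerouting, input, output, postrouting}
    case inboundForward:
        return []hook{prerouting, forward, postrouting}
    case outbound:
        return []hook{output, postrouting}
    }
    return nil
}

// describe renders a traversal in the notation used in this article.
func describe(hooks []hook) string {
    parts := make([]string, 0, len(hooks))
    for _, h := range hooks {
        parts = append(parts, h.chain+"("+strings.Join(h.tables, " -> ")+")")
    }
    return strings.Join(parts, " --> ")
}

func main() {
    fmt.Println(describe(traversal(inboundLocal)))
    fmt.Println(describe(traversal(inboundForward)))
    fmt.Println(describe(traversal(outbound)))
}
```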

You can analyze Calico's concrete rules along these lines, or refer to Calico iptables详解 (a detailed walkthrough of Calico's iptables rules).

Case study

Resolving the request address

This example uses nip.io for Knative's DNS resolution. The example URL is:

http://sample-svc-ksvc.default.<public-ip>.nip.io

A request to it is equivalent to:

curl -H "Host: sample-svc-serving-ksvc.default.<public-ip>.nip.io" http://<public-ip>
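
The reason the two forms are equivalent is that nip.io resolves a name with an embedded IPv4 address to that address. The sketch below extracts the embedded address from such a host name. It handles only the dotted form used in this article; embeddedIP is a hypothetical helper, and 203.0.113.10 is a documentation address standing in for <public-ip>.

```go
package main

import (
    "fmt"
    "net/netip"
    "strings"
)

// embeddedIP returns the IPv4 address embedded in a name of the form
// <anything>.<a>.<b>.<c>.<d>.nip.io, or false if the name has no such address.
func embeddedIP(host string) (netip.Addr, bool) {
    const suffix = ".nip.io"
    if !strings.HasSuffix(host, suffix) {
        return netip.Addr{}, false
    }
    labels := strings.Split(strings.TrimSuffix(host, suffix), ".")
    if len(labels) < 4 {
        return netip.Addr{}, false
    }
    // The last four labels before the suffix form the dotted IPv4 address.
    addr, err := netip.ParseAddr(strings.Join(labels[len(labels)-4:], "."))
    if err != nil {
        return netip.Addr{}, false
    }
    return addr, true
}

func main() {
    host := "sample-svc-serving-ksvc.default.203.0.113.10.nip.io"
    if addr, ok := embeddedIP(host); ok {
        fmt.Printf("%s resolves to %s\n", host, addr)
    }
}
```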

The details of this ingress can be viewed in the ingresses.networking.internal.knative.dev CRD. It shows that requests sent to the URL above are routed to the sample-svc-serving-ksvc-v100 service:

~# kubectl get ingresses.networking.internal.knative.dev sample-svc-serving-ksvc -oyaml

  rules:
  - hosts:
    - sample-svc-serving-ksvc.default.<public-ip>.nip.io
    http:
      paths:
      - splits:
        - appendHeaders:
            Knative-Serving-Namespace: default
            Knative-Serving-Revision: sample-svc-serving-ksvc-v100
          percent: 100
          serviceName: sample-svc-serving-v100
          serviceNamespace: default
          servicePort: 80
    visibility: ExternalIP

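The backend of the sample-svc-serving-ksvc-v100 service is Knative's activator component. When no backend pod instance exists, the activator intercepts the traffic and notifies the autoscaler to start one, then forwards the traffic to the pod once it is up.

The serverlessservices.networking.internal.knative.dev CRD shows that the sample-svc-serving-ksvc-v100 service is a traffic proxy for the sample-svc-serving-ksvc-v100-private service. In other words, sample-svc-serving-ksvc-v100-private is the service that actually load-balances across the backend pods: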

~# kubectl get serverlessservices.networking.internal.knative.dev sample-svc-serving-ksvc-v100 -oyaml

status:
  privateServiceName: sample-svc-serving-ksvc-v100-private
  serviceName: sample-svc-serving-ksvc-v100

Inspecting the Kubernetes Service

Below, sample-svc-serving-ksvc-v100-private is abbreviated to sample-svc.

~# kubectl get svc

NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP         PORT(S)       AGE
sample-svc           ClusterIP      10.233.49.34    <none>              80/TCP        3d5h

kubectl describe shows the details of sample-svc: its VIP (the clusterIP) is 10.233.49.34, the frontend port is 80, the Endpoint is 10.233.96.13:80, and the protocol is TCP.

Type:              ClusterIP
IP:                10.233.49.34
Port:              http  80/TCP
TargetPort:        80/TCP
Endpoints:         10.233.96.13:80

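Querying ipvs

Run the following command to query ipvs. The -l option lists the virtual server table, and -t selects the TCP service address to show, here 10.233.49.34:80 from above.

The output below shows that the backend address ipvs forwards to matches the endpoint above.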

~# ipvsadm -l -t 10.233.49.34:80

Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  node1:http rr
  -> 10.233.96.13:http            Masq    1      0          0
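
The rr in the output above is the round-robin scheduler. With a single real server every connection goes to 10.233.96.13, but the idea is easier to see with more than one. The following is a minimal sketch of round-robin selection, not the kernel implementation: it ignores weights and connection counts, and the second backend address is invented for illustration.

```go
package main

import "fmt"

// roundRobin hands out backends in turn.
type roundRobin struct {
    backends []string
    next     int
}

// pick returns the next backend, or false if there are none.
func (r *roundRobin) pick() (string, bool) {
    if len(r.backends) == 0 {
        return "", false
    }
    b := r.backends[r.next]
    r.next = (r.next + 1) % len(r.backends)
    return b, true
}

func main() {
    // 10.233.96.14:80 is a hypothetical second endpoint.
    rr := &roundRobin{backends: []string{"10.233.96.13:80", "10.233.96.14:80"}}
    for i := 0; i < 4; i++ {
        b, _ := rr.pick()
        fmt.Println(b)
    }
}
```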

Outbound traffic (192.168.0.2)

According to the iptables rules, outbound traffic passes through:

Routing table
OUTPUT, table order: raw -> mangle -> nat -> filter
POSTROUTING, table order: mangle -> nat

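According to the routing table, the packet is sent to the tunl0 device. tunl0 is the tunnel device used for IPIP transport (an IP packet encapsulated in another IP packet), and its job is to deliver the packet to the tunl0 device on 192.168.0.3. The iptables part can be traced by following the chain order above and is skipped here.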

~# route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.233.96.0     192.168.0.3     255.255.255.0   UG    0      0        0 tunl0

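Inbound traffic (192.168.0.3)

According to the iptables rules, inbound traffic passes through:

PREROUTING, table order: raw -> mangle -> nat
Routing table
INPUT, table order: mangle -> filter

Inbound traffic whose destination is not the 192.168.0.3 node itself passes through:

PREROUTING, table order: raw -> mangle -> nat
Routing table
FORWARD, table order: mangle -> filter
POSTROUTING, table order: mangle -> nat

The iptables part can be traced by following the chain order above and is skipped here.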

~# route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.233.96.13    0.0.0.0         255.255.255.255 UH    0      0        0 calie1e659a68ff

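Reaching the destination pod

The network device calie1e659a68ff and the eth0 device inside the pod form a veth pair, so traffic matched by the routing rule reaches the pod's eth0 interface.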

pod:~# ip addr

4: eth0@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
    link/ether a6:1d:84:b5:a8:b8 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.233.96.13/32 brd 10.233.96.13 scope global eth0
       valid_lft forever preferred_lft forever

pod:~# ethtool -S eth0

NIC statistics:
     peer_ifindex: 21
     
host:~# ip link

21: calie1e659a68ff@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP mode DEFAULT group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 12
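
The pairing can be read off the interface indexes: eth0 is index 4 with peer 21, and calie1e659a68ff is index 21 with peer 4. As a small sketch of that check, the code below parses the first line of each entry and compares the indexes. parseLink and isVethPair are hypothetical helpers that only understand the name@ifN form shown in the output above.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// link holds the fields of an interface header line such as "4: eth0@if21: <...>".
type link struct {
    index     int
    name      string
    peerIndex int
}

// parseLink extracts the index, name and peer index from one header line.
func parseLink(line string) (link, error) {
    fields := strings.SplitN(line, ":", 3)
    if len(fields) < 3 {
        return link{}, fmt.Errorf("unexpected line %q", line)
    }
    index, err := strconv.Atoi(strings.TrimSpace(fields[0]))
    if err != nil {
        return link{}, fmt.Errorf("bad index in %q: %w", line, err)
    }
    name, peer, ok := strings.Cut(strings.TrimSpace(fields[1]), "@if")
    if !ok {
        return link{}, fmt.Errorf("no peer index in %q", line)
    }
    peerIndex, err := strconv.Atoi(peer)
    if err != nil {
        return link{}, fmt.Errorf("bad peer index in %q: %w", line, err)
    }
    return link{index: index, name: name, peerIndex: peerIndex}, nil
}

// isVethPair reports whether each link names the other as its peer.
func isVethPair(a, b link) bool {
    return a.peerIndex == b.index && b.peerIndex == a.index
}

func main() {
    pod, err := parseLink("4: eth0@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440")
    if err != nil {
        panic(err)
    }
    host, err := parseLink("21: calie1e659a68ff@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s and %s form a veth pair: %v\n", pod.name, host.name, isVethPair(pod, host))
}
```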