sysctl.conf 里面设置的参数无法加载

客户遇到一个很奇怪的问题,明明在sysctl.conf里面配置了net.netfilter.nf_conntrack_max参数,但是重启以后,这个参数还是没有生效,这个问题是什么原因呢?

答案在红帽的知识库里面

  • https://access.redhat.com/solutions/548813

我们知道,sysctl的配置说是在系统启动的时候,由systemd-sysctl.service加载的。而客户环境里面,又有docker,docker因为会调用iptables的nat功能,从而隐形的加载内核模块nf_conntrack。我们猜测,应该是docker服务和systemd-sysctl服务,在系统启动的时候,相互配合有问题,才造成了我们内核模块的参数加载失败的问题。

接下来,我们做个实验来看看。

实验环境准备

我们装一台centos7,关闭firewalld,再安装社区版本的docker-ce。注意,我们在这里,并没有激活docker服务的自动启动。

# disable firewalld
systemctl disable --now firewalld

# install docker ce
yum install -y yum-utils
yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo
yum install -y docker-ce docker-ce-cli containerd.io

# reboot is important
reboot

环境调查

系统重启以后,我们看看是否加载了内核模块nf_conntrack,并且看看有没有对应的参数。

# check nf_conntrack module status
lsmod | grep nf_conntrack
# nothing

sysctl net.netfilter.nf_conntrack_max
# sysctl: cannot stat /proc/sys/net/netfilter/nf_conntrack_max: No such file or directory

可以看到,没有加载nf_conntrack模块,而且没有对应的参数。那么我们手动启动docker服务,然后再看看。

# enable docker service and check nf_conntrack again
systemctl start docker

lsmod | grep nf_conntrack
# nf_conntrack_netlink    36396  0
# nf_conntrack_ipv4      15053  2
# nf_defrag_ipv4         12729  1 nf_conntrack_ipv4
# nf_conntrack          139264  6 nf_nat,nf_nat_ipv4,xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_netlink,nf_conntrack_ipv4
# libcrc32c              12644  2 nf_nat,nf_conntrack

sysctl net.netfilter.nf_conntrack_max
# net.netfilter.nf_conntrack_max = 65536

# check we didn't set the parameter for net.netfilter.nf_conntrack_max
find /etc -type f -exec grep -H nf_conntrack_max {} \;
# nothing

可以看到,加载了docker服务以后,内核模块nf_conntrack已经加载了,而且参数net.netfilter.nf_conntrack_max已经设置了。我们查找了一下/etc目录下面,发现没有配置net.netfilter.nf_conntrack_max的参数。

那么我们可以得出结论,不做任何配置,nf_conntrack内核模块,只有在docker服务启动的时候,才会加载到内核。如果systemd-sysctl这个服务,在系统启动的时候,早于docker服务,那么他就无法设置参数net.netfilter.nf_conntrack_max,就会造成我们所看到的故障。

接下来,我们确认一下我们的猜测,我们罗列一下,有哪些服务是在docker服务启动之前,必须要启动的。或者说,docker服务依赖哪些别的服务。

# check systemd init sequence
systemctl enable --now docker

# we can see, docker start after systemd-sysctl
systemctl list-dependencies docker
# docker.service
# ● ├─containerd.service
# ● ├─docker.socket
# ● ├─system.slice
# ● ├─basic.target
# ● │ ├─microcode.service
# ● │ ├─rhel-dmesg.service
# ● │ ├─selinux-policy-migrate-local-changes@targeted.service
# ● │ ├─paths.target
# ● │ ├─slices.target
# ● │ │ ├─-.slice
# ● │ │ └─system.slice
# ● │ ├─sockets.target
# ● │ │ ├─dbus.socket
# ● │ │ ├─systemd-initctl.socket
# ● │ │ ├─systemd-journald.socket
# ● │ │ ├─systemd-shutdownd.socket
# ● │ │ ├─systemd-udevd-control.socket
# ● │ │ └─systemd-udevd-kernel.socket
# ● │ ├─sysinit.target
# ● │ │ ├─dev-hugepages.mount
# ● │ │ ├─dev-mqueue.mount
# ● │ │ ├─kmod-static-nodes.service
# ● │ │ ├─plymouth-read-write.service
# ● │ │ ├─plymouth-start.service
# ● │ │ ├─proc-sys-fs-binfmt_misc.automount
# ● │ │ ├─rhel-autorelabel-mark.service
# ● │ │ ├─rhel-autorelabel.service
# ● │ │ ├─rhel-domainname.service
# ● │ │ ├─rhel-import-state.service
# ● │ │ ├─rhel-loadmodules.service
# ● │ │ ├─sys-fs-fuse-connections.mount
# ● │ │ ├─sys-kernel-config.mount
# ● │ │ ├─sys-kernel-debug.mount
# ● │ │ ├─systemd-ask-password-console.path
# ● │ │ ├─systemd-binfmt.service
# ● │ │ ├─systemd-firstboot.service
# ● │ │ ├─systemd-hwdb-update.service
# ● │ │ ├─systemd-journal-catalog-update.service
# ● │ │ ├─systemd-journal-flush.service
# ● │ │ ├─systemd-journald.service
# ● │ │ ├─systemd-machine-id-commit.service
# ● │ │ ├─systemd-modules-load.service
# ● │ │ ├─systemd-random-seed.service
# ● │ │ ├─systemd-sysctl.service
# ● │ │ ├─systemd-tmpfiles-setup-dev.service
# ● │ │ ├─systemd-tmpfiles-setup.service
# ● │ │ ├─systemd-udev-trigger.service
# ● │ │ ├─systemd-udevd.service
# ● │ │ ├─systemd-update-done.service
# ● │ │ ├─systemd-update-utmp.service
# ● │ │ ├─systemd-vconsole-setup.service
# ● │ │ ├─cryptsetup.target
# ● │ │ ├─local-fs.target
# ● │ │ │ ├─-.mount
# ● │ │ │ ├─rhel-readonly.service
# ● │ │ │ ├─systemd-fsck-root.service
# ● │ │ │ └─systemd-remount-fs.service
# ● │ │ └─swap.target
# ● │ └─timers.target
# ● │   └─systemd-tmpfiles-clean.timer
# ● └─network-online.target

我们可以很清晰的看到,systemd-sysctl.service服务是在docker服务启动之前,必须要启动的。

解决问题

红帽知识库给出了解决办法,就是使用系统自带的rhel-loadmodules.service服务,这个服务,在systemd-sysctl.service启动之前启动,我们来看看他的内容。

systemctl cat rhel-loadmodules.service
# # /usr/lib/systemd/system/rhel-loadmodules.service
# [Unit]
# Description=Load legacy module configuration
# DefaultDependencies=no
# Conflicts=shutdown.target
# After=systemd-readahead-collect.service systemd-readahead-replay.service
# Before=sysinit.target shutdown.target
# ConditionPathExists=|/etc/rc.modules
# ConditionDirectoryNotEmpty=|/etc/sysconfig/modules/

# [Service]
# ExecStart=/usr/lib/systemd/rhel-loadmodules
# Type=oneshot
# TimeoutSec=0
# RemainAfterExit=yes

# [Install]
# WantedBy=sysinit.target

我们可以看到,rhel-loadmodules.service服务,检查/etc/sysconfig/modules/目录是否存在,如果存在,就运行程序/usr/lib/systemd/rhel-loadmodules。那我们看看/usr/lib/systemd/rhel-loadmodules的内容。

cat /usr/lib/systemd/rhel-loadmodules
# #!/bin/bash

# # Load other user-defined modules
# for file in /etc/sysconfig/modules/*.modules ; do
#   [ -x $file ] && $file
# done

# # Load modules (for backward compatibility with VARs)
# if [ -f /etc/rc.modules ]; then
#         /etc/rc.modules
# fi

/usr/lib/systemd/rhel-loadmodules的内容很简单,就是遍历/etc/sysconfig/modules/目录下的所有*.modules文件,如果文件存在,就运行它。

那么我们就有了解决办法,创建对应的module文件,让这些内核模块,在系统启动的早期就加载了,这样之后的systemd-sysctl.service服务就可以正常的去设置参数了。

echo "modprobe nf_conntrack" >> /etc/sysconfig/modules/nf_conntrack.modules && chmod 775 /etc/sysconfig/modules/nf_conntrack.modules

echo "net.netfilter.nf_conntrack_max = 2097152" >> /etc/sysctl.d/99-nf_conntrack.conf

reboot

重启以后,我们坚持一下系统状态,如我们所预期的,一切正常。

systemctl status rhel-loadmodules.service
# ● rhel-loadmodules.service - Load legacy module configuration
#    Loaded: loaded (/usr/lib/systemd/system/rhel-loadmodules.service; enabled; vendor preset: enabled)
#    Active: active (exited) since Sun 2022-01-02 08:08:22 UTC; 40s ago
#   Process: 350 ExecStart=/usr/lib/systemd/rhel-loadmodules (code=exited, status=0/SUCCESS)
#  Main PID: 350 (code=exited, status=0/SUCCESS)
#     Tasks: 0
#    Memory: 0B
#    CGroup: /system.slice/rhel-loadmodules.service

# Jan 02 08:08:22 vultr.guest systemd[1]: Started Load legacy module configuration.

sysctl net.netfilter.nf_conntrack_max
# net.netfilter.nf_conntrack_max = 2097152

错误的方法 rc-local.service

面对sysctl参数无法加载的情况,我们第一时间想到的,可能就是rc-local.service服务,这个服务读取/etc/rc.local文件,并执行这个文件中的内容。不过,我们发现在这里面的sysctl -w 命令,依然不起作用,我们猜测,这是由于rc-local.service没有在docker.service之前执行导致的。

那么我们就来确认一下。

systemctl cat rc-local
# # /usr/lib/systemd/system/rc-local.service
# #  This file is part of systemd.
# #
# #  systemd is free software; you can redistribute it and/or modify it
# #  under the terms of the GNU Lesser General Public License as published by
# #  the Free Software Foundation; either version 2.1 of the License, or
# #  (at your option) any later version.

# # This unit gets pulled automatically into multi-user.target by
# # systemd-rc-local-generator if /etc/rc.d/rc.local is executable.
# [Unit]
# Description=/etc/rc.d/rc.local Compatibility
# ConditionFileIsExecutable=/etc/rc.d/rc.local
# After=network.target

# [Service]
# Type=forking
# ExecStart=/etc/rc.d/rc.local start
# TimeoutSec=0
# RemainAfterExit=yes

systemctl list-dependencies rc-local
# rc-local.service
# ● ├─system.slice
# ● └─basic.target
# ●   ├─microcode.service
# ●   ├─rhel-dmesg.service
# ●   ├─selinux-policy-migrate-local-changes@targeted.service
# ●   ├─paths.target
# ●   ├─slices.target
# ●   │ ├─-.slice
# ●   │ └─system.slice
# ●   ├─sockets.target
# ●   │ ├─dbus.socket
# ●   │ ├─systemd-initctl.socket
# ●   │ ├─systemd-journald.socket
# ●   │ ├─systemd-shutdownd.socket
# ●   │ ├─systemd-udevd-control.socket
# ●   │ └─systemd-udevd-kernel.socket
# ●   ├─sysinit.target
# ●   │ ├─dev-hugepages.mount
# ●   │ ├─dev-mqueue.mount
# ●   │ ├─kmod-static-nodes.service
# ●   │ ├─plymouth-read-write.service
# ●   │ ├─plymouth-start.service
# ●   │ ├─proc-sys-fs-binfmt_misc.automount
# ●   │ ├─rhel-autorelabel-mark.service
# ●   │ ├─rhel-autorelabel.service
# ●   │ ├─rhel-domainname.service
# ●   │ ├─rhel-import-state.service
# ●   │ ├─rhel-loadmodules.service
# ●   │ ├─sys-fs-fuse-connections.mount
# ●   │ ├─sys-kernel-config.mount
# ●   │ ├─sys-kernel-debug.mount
# ●   │ ├─systemd-ask-password-console.path
# ●   │ ├─systemd-binfmt.service
# ●   │ ├─systemd-firstboot.service
# ●   │ ├─systemd-hwdb-update.service
# ●   │ ├─systemd-journal-catalog-update.service
# ●   │ ├─systemd-journal-flush.service
# ●   │ ├─systemd-journald.service
# ●   │ ├─systemd-machine-id-commit.service
# ●   │ ├─systemd-modules-load.service
# ●   │ ├─systemd-random-seed.service
# ●   │ ├─systemd-sysctl.service
# ●   │ ├─systemd-tmpfiles-setup-dev.service
# ●   │ ├─systemd-tmpfiles-setup.service
# ●   │ ├─systemd-udev-trigger.service
# ●   │ ├─systemd-udevd.service
# ●   │ ├─systemd-update-done.service
# ●   │ ├─systemd-update-utmp.service
# ●   │ ├─systemd-vconsole-setup.service
# ●   │ ├─cryptsetup.target
# ●   │ ├─local-fs.target
# ●   │ │ ├─-.mount
# ●   │ │ ├─rhel-readonly.service
# ●   │ │ ├─systemd-fsck-root.service
# ●   │ │ └─systemd-remount-fs.service
# ●   │ └─swap.target
# ●   └─timers.target
# ●     └─systemd-tmpfiles-clean.timer

我们可以看到,rc-local服务,和docker服务,没有前后关系,这就导致了rc-local无法确保在docker.serivce之后运行,这样就会导致相关内核模块没有加载,进而导致sysctl命令无法加载。

reference

  • https://access.redhat.com/solutions/548813
  • https://www.dazhuanlan.com/bygxb/topics/1709928

others


systemctl list-unit-files | grep docker
# docker.service                                enabled
# docker.socket                                 disabled

# https://stackoverflow.com/questions/29309717/is-there-any-way-to-list-systemd-services-in-linux-in-the-order-of-they-were-l#fromHistory
# systemd-analyze plot > startup_order.svg

yum install -y graphviz

systemd-analyze dot | dot -Tsvg > systemd.svg
#    Color legend: black     = Requires
#                  dark blue = Requisite
#                  dark grey = Wants
#                  red       = Conflicts
#                  green     = After


# CONNTRACK_MAX = 连接跟踪表大小 (HASHSIZE) * Bucket 大小 (bucket size)
modinfo nf_conntrack
# filename:       /lib/modules/3.10.0-1160.49.1.el7.x86_64/kernel/net/netfilter/nf_conntrack.ko.xz
# license:        GPL
# retpoline:      Y
# rhelversion:    7.9
# srcversion:     358A2186187A7E81339334C
# depends:        libcrc32c
# intree:         Y
# vermagic:       3.10.0-1160.49.1.el7.x86_64 SMP mod_unload modversions
# signer:         CentOS Linux kernel signing key
# sig_key:        77:15:99:7F:C4:81:91:84:C7:45:27:B6:08:4B:C7:F9:BB:15:62:7D
# sig_hashalgo:   sha256
# parm:           tstamp:Enable connection tracking flow timestamping. (bool)
# parm:           acct:Enable connection tracking flow accounting. (bool)
# parm:           nf_conntrack_helper:Enable automatic conntrack helper assignment (default 1) (bool)
# parm:           expect_hashsize:uint

cat /proc/sys/net/nf_conntrack_max
# 65536
cat /proc/sys/net/netfilter/nf_conntrack_max
# 65536

cat /proc/sys/net/netfilter/nf_conntrack_buckets
# 16384

# so the bucket size = 4
echo "`cat /proc/sys/net/netfilter/nf_conntrack_max` / `cat /proc/sys/net/netfilter/nf_conntrack_buckets`" | bc
# 4