Limiting resource usage with cgroups on Android

cgroups on Android

cgroups (short for control groups) was created by Google engineers and merged into the mainline Linux kernel in 2008, in version 2.6.24. It is mainly used for resource management in Linux containers, together with Linux kernel namespaces.

To check whether your kernel supports cgroups, look at the output of cat /proc/cgroups:

rpi3:/ # cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset		3		5		1
cpu		2		1		1
cpuacct		1		103		1
blkio		0		1		1
memory		0		1		0
devices		0		1		1
freezer		0		1		1
net_cls		0		1		1

From the above output we can see that the kernel supports cpuset, cpu, cpuacct, blkio, memory, devices, freezer and net_cls. The last column (enabled) shows that the memory controller is disabled by default.

The kernel also logs a message when the memory cgroup is disabled:

[    0.001711] Disabling memory control group subsystem

Enabling memory control adds 8 bytes of accounting memory per 4K page on a 32-bit system, which is why it is disabled by default on the Raspberry Pi kernel. Enable it by appending cgroup_enable=memory to the kernel command line in cmdline.txt.
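
One way to do this from another machine, assuming the boot partition is mounted at /mnt/boot (adjust the path for your setup) and keeping in mind that cmdline.txt must stay on a single line:

$ sed -i 's/$/ cgroup_enable=memory/' /mnt/boot/cmdline.txt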

After a reboot, the memory controller is enabled:

rpi3:/ # cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset		4		5		1
cpu		3		2		1
cpuacct		1		2		1
blkio		0		1		1
memory		2		110		1
devices		0		1		1
freezer		0		1		1
net_cls		0		1		1

On my system, which runs LineageOS 15.1, only cpuset, cpu, cpuacct and memory are used, and they are mounted from init.rc. The cpuacct and memory controllers are mounted in the early-init stage:

on early-init
    # Mount cgroup mount point for cpu accounting
    mount cgroup none /acct cpuacct
    mkdir /acct/uid

    # root memory control cgroup, used by lmkd
    mkdir /dev/memcg 0700 root system
    mount cgroup none /dev/memcg memory
    # app mem cgroups, used by activity manager, lmkd and zygote
    mkdir /dev/memcg/apps/ 0755 system system
    # cgroup for system_server and surfaceflinger
    mkdir /dev/memcg/system 0550 system system

The cpu and cpuset controllers are mounted in the init stage:

on init
    # Create cgroup mount points for process groups
    mkdir /dev/cpuctl
    mount cgroup none /dev/cpuctl cpu
    chown system system /dev/cpuctl
    chown system system /dev/cpuctl/tasks
    chmod 0666 /dev/cpuctl/tasks
    write /dev/cpuctl/cpu.rt_period_us 1000000
    write /dev/cpuctl/cpu.rt_runtime_us 950000

    # sets up initial cpusets for ActivityManager
    mkdir /dev/cpuset
    mount cpuset none /dev/cpuset

    # this ensures that the cpusets are present and usable, but the device's
    # init.rc must actually set the correct cpus
    mkdir /dev/cpuset/foreground
    copy /dev/cpuset/cpus /dev/cpuset/foreground/cpus
    copy /dev/cpuset/mems /dev/cpuset/foreground/mems
    mkdir /dev/cpuset/background
    copy /dev/cpuset/cpus /dev/cpuset/background/cpus
    copy /dev/cpuset/mems /dev/cpuset/background/mems

    # system-background is for system tasks that should only run on
    # little cores, not on bigs
    # to be used only by init, so don't change system-bg permissions
    mkdir /dev/cpuset/system-background
    copy /dev/cpuset/cpus /dev/cpuset/system-background/cpus
    copy /dev/cpuset/mems /dev/cpuset/system-background/mems

    mkdir /dev/cpuset/top-app
    copy /dev/cpuset/cpus /dev/cpuset/top-app/cpus
    copy /dev/cpuset/mems /dev/cpuset/top-app/mems

    # change permissions for all cpusets we'll touch at runtime
    chown system system /dev/cpuset
    chown system system /dev/cpuset/foreground
    chown system system /dev/cpuset/background
    chown system system /dev/cpuset/system-background
    chown system system /dev/cpuset/top-app
    chown system system /dev/cpuset/tasks
    chown system system /dev/cpuset/foreground/tasks
    chown system system /dev/cpuset/background/tasks
    chown system system /dev/cpuset/system-background/tasks
    chown system system /dev/cpuset/top-app/tasks

    # set system-background to 0775 so SurfaceFlinger can touch it
    chmod 0775 /dev/cpuset/system-background

    chmod 0664 /dev/cpuset/foreground/tasks
    chmod 0664 /dev/cpuset/background/tasks
    chmod 0664 /dev/cpuset/system-background/tasks
    chmod 0664 /dev/cpuset/top-app/tasks
    chmod 0664 /dev/cpuset/tasks
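
As the comments in the block above say, the device-specific init.rc is expected to fill in the actual CPU assignments for these cpusets. On a quad-core device that fragment might look roughly like the following (the values are purely illustrative, not taken from any particular device):

on init
    # illustrative cpu assignments; a real device picks these
    # according to its CPU topology
    write /dev/cpuset/foreground/cpus 0-3
    write /dev/cpuset/background/cpus 0
    write /dev/cpuset/system-background/cpus 0-2
    write /dev/cpuset/top-app/cpus 0-3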

After Android has finished booting, all four cgroup controllers are mounted properly:

rpi3:/ # mount | grep cgroup
none on /acct type cgroup (rw,relatime,cpuacct)
none on /dev/memcg type cgroup (rw,relatime,memory)
none on /dev/cpuctl type cgroup (rw,relatime,cpu)
none on /dev/cpuset type cgroup (rw,relatime,cpuset,noprefix,release_agent=/sbin/cpuset_release_agent)

Limiting processes to a cpuset

stress-ng is a tool for stress testing Linux systems, with over 220 stress tests covering CPU, memory and more. It is used for the demonstrations in this post. First grab the source code from GitHub, then cross-compile it as a static binary:

$ git clone --depth=1 https://github.com/ColinIanKing/stress-ng.git
$ export CC=arm-linux-gnueabihf-gcc
$ STATIC=1 make ARCH=arm
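
One way to get the static binary onto the device (the /data/local/tmp path is just a common writable location, not a requirement):

$ adb push stress-ng /data/local/tmp/
$ adb shell chmod 755 /data/local/tmp/stress-ng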

Push stress-ng to the target device, then start the test; run trace-cmd at the same time to record scheduler events:

# stress-ng -m 1 -c 4
# trace-cmd record -e sched

trace-cmd report shows that the stress processes are distributed across all the available CPUs:

   stress-ng-cpu-1478  [003]  3118.999753: sched_load_cfs_rq:    cpu=3 path=/autogroup-2 load=1024 util=983 util_pelt=983 util_walt=0
   stress-ng-cpu-1477  [001]  3118.999753: sched_load_cfs_rq:    cpu=1 path=/autogroup-2 load=1024 util=913 util_pelt=913 util_walt=0
    stress-ng-vm-1481  [000]  3118.999755: sched_load_cfs_rq:    cpu=0 path=/autogroup-2 load=1024 util=819 util_pelt=819 util_walt=0
   stress-ng-cpu-1479  [002]  3118.999755: sched_load_cfs_rq:    cpu=2 path=/autogroup-2 load=2049 util=1215 util_pelt=1215 util_walt=0
   stress-ng-cpu-1477  [001]  3118.999757: sched_load_tg:        cpu=1 path=/autogroup-2 load=5093
    stress-ng-vm-1481  [000]  3118.999759: sched_load_tg:        cpu=0 path=/autogroup-2 load=5093
   stress-ng-cpu-1478  [003]  3118.999759: sched_load_tg:        cpu=3 path=/autogroup-2 load=5093
   stress-ng-cpu-1479  [002]  3118.999760: sched_load_tg:        cpu=2 path=/autogroup-2 load=5093
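
The worker PIDs (1477-1481 above) are needed for the next step; on the device they can be listed with something like the following (toybox ps assumed):

ps -A | grep stress-ng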

Then echo the stress PIDs into /dev/cpuset/restricted/cgroup.procs and restrict the group's cpus to 2 and 3:

echo 2,3 > /dev/cpuset/restricted/cpus
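
The restricted group is not created by the init.rc shown earlier, and the PID moves are not shown above. If the group does not already exist on your device, it has to be created and populated before the command above will work; a minimal sketch, using the worker PIDs from the trace:

mkdir /dev/cpuset/restricted
# a cpuset needs its mems populated before tasks can be attached;
# inherit the memory nodes from the root cpuset
cat /dev/cpuset/mems > /dev/cpuset/restricted/mems
# move the stress workers into the group
for pid in 1477 1478 1479 1481; do
    echo $pid > /dev/cpuset/restricted/cgroup.procs
done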

Now all the stress processes run in the restricted group, and only CPU2 and CPU3 are used:

   stress-ng-cpu-1478  [003]  3974.519769: sched_load_cfs_rq:    cpu=3 path=/autogroup-2 load=2049 util=1024 util_pelt=1024 util_walt=0
   stress-ng-cpu-1479  [002]  3974.519770: sched_load_cfs_rq:    cpu=2 path=/autogroup-2 load=3073 util=1024 util_pelt=1024 util_walt=0
   stress-ng-cpu-1478  [003]  3974.519774: sched_load_tg:        cpu=3 path=/autogroup-2 load=5121
   stress-ng-cpu-1479  [002]  3974.519774: sched_load_tg:        cpu=2 path=/autogroup-2 load=5121
   stress-ng-cpu-1479  [002]  3974.519780: sched_load_se:        cpu=2 path=/autogroup-2 comm=(null) pid=-1 load=614 util=1024 util_pelt=1024 util_walt=0
   stress-ng-cpu-1478  [003]  3974.519780: sched_load_se:        cpu=3 path=/autogroup-2 comm=(null) pid=-1 load=409 util=1024 util_pelt=1024 util_walt=0
   stress-ng-cpu-1479  [002]  3974.519784: sched_load_cfs_rq:    cpu=2 path=/ load=614 util=1024 util_pelt=1024 util_walt=0
   stress-ng-cpu-1478  [003]  3974.519784: sched_load_cfs_rq:    cpu=3 path=/ load=745 util=1540 util_pelt=1540 util_walt=0

Limiting memory usage

To demonstrate limiting memory usage with the cgroup memory controller, run a memory test with:

stress-ng -m 1 --vm-bytes 256M -M

Since there is enough memory, stress-ng runs without issue; the stress processes are not yet subject to any memory limit.

In this test we use the apps group to show how to set a memory limit in a cgroup and what happens when a process hits that limit. Set the limit to 20M and move the test process into the group:

echo 20M > /dev/memcg/apps/memory.limit_in_bytes
echo 1602 > /dev/memcg/apps/tasks
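
The controller's counters can be read back to confirm the limit and watch usage, for example:

cat /dev/memcg/apps/memory.limit_in_bytes   # 20971520 (20M)
cat /dev/memcg/apps/memory.usage_in_bytes   # current usage of the group
cat /dev/memcg/apps/memory.failcnt          # how often the limit was hit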

After moving the stress process into the apps group, the OOM killer was triggered immediately:

[ 1856.169324] stress-ng-vm invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=1000
[ 1856.180617] stress-ng-vm cpuset=/ mems_allowed=0
[ 1856.185372] CPU: 0 PID: 1602 Comm: stress-ng-vm Not tainted 4.14.135-v8+ #8
[ 1856.192440] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
[ 1856.198801] Call trace:
[ 1856.201297] [<ffffff99d548b608>] dump_backtrace+0x0/0x270
[ 1856.206781] [<ffffff99d548b89c>] show_stack+0x24/0x30
[ 1856.211913] [<ffffff99d5d2ddf0>] dump_stack+0xac/0xe4
[ 1856.217045] [<ffffff99d55eb1fc>] dump_header+0x94/0x1e8
[ 1856.222352] [<ffffff99d55ea2e0>] oom_kill_process+0x2c8/0x5d0
[ 1856.228187] [<ffffff99d55eaf24>] out_of_memory+0x104/0x2d0
[ 1856.233759] [<ffffff99d564f260>] mem_cgroup_out_of_memory+0x50/0x70
[ 1856.240124] [<ffffff99d565539c>] mem_cgroup_oom_synchronize+0x35c/0x3b8
[ 1856.246842] [<ffffff99d55eb118>] pagefault_out_of_memory+0x28/0x78
[ 1856.253119] [<ffffff99d5d48de8>] do_page_fault+0x440/0x450
[ 1856.258689] [<ffffff99d5d48e64>] do_translation_fault+0x6c/0x7c
[ 1856.264699] [<ffffff99d54814a0>] do_mem_abort+0x50/0xb0
[ 1856.270004] Exception stack(0xffffff800b64bec0 to 0xffffff800b64c000)
[ 1856.276545] bec0: 0000000000000000 0000000010000000 00000000ff5a00a5 00000000f1a26000
[ 1856.284494] bee0: 00000000000fdafc 00000000e2d96000 00000000e1a26000 0000000000154550
[ 1856.292444] bf00: 00000000ffcf1840 0000000004000000 00000000e1a26000 0000000000000000
[ 1856.300394] bf20: 00000000000000c0 00000000ffcf1760 0000000000068c1f 0000000000000000
[ 1856.308342] bf40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1856.316291] bf60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1856.324241] bf80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1856.332191] bfa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1856.340141] bfc0: 0000000000063946 00000000200f0030 00000000e1a26000 00000000ffffffff
[ 1856.348091] bfe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1856.356041] [<ffffff99d548366c>] el0_da+0x20/0x24
[ 1856.361022] Task in /apps killed as a result of limit of /apps
[ 1856.367061] memory: usage 20480kB, limit 20480kB, failcnt 8911
[ 1856.373016] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 1856.379826] kmem: usage 572kB, limit 9007199254740988kB, failcnt 0
[ 1856.386167] Memory cgroup stats for /apps: cache:0KB rss:19908KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:19884KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 1856.405779] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 1856.414845] [ 1602]     0  1602    69276     5146      18       2        0          1000 stress-ng-vm
[ 1856.424287] Memory cgroup out of memory: Kill process 1602 (stress-ng-vm) score 1955 or sacrifice child
[ 1856.433880] Killed process 1602 (stress-ng-vm) total-vm:277104kB, anon-rss:19892kB, file-rss:684kB, shmem-rss:8kB
[ 1856.462531] oom_reaper: reaped process 1602 (stress-ng-vm), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB

Freezing a process

The freezer controller is not mounted by init.rc; mount it with:

mkdir /dev/freezer
mount -t cgroup none /dev/freezer -o freezer

In this section we use the same test as in the memory section. Before freezing, the stress process uses about 100% of a CPU; after it is moved into the freezer group, it stops consuming CPU time. Freeze it with:

mkdir /dev/freezer/test
echo 1707 > /dev/freezer/test/tasks
echo FROZEN > /dev/freezer/test/freezer.state

To unfreeze the process:

echo THAWED > /dev/freezer/test/freezer.state
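
The state file can be read back to confirm the transition; it may briefly report FREEZING while the tasks are still being frozen:

cat /dev/freezer/test/freezer.state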

References