CSE 523: Container Measurement Project (CMP)
Fall 2019
Overview
The project is about making sense of application performance in
cloud-native environments.
You will be creating, justifying, and applying a set of experiments
to characterize and understand the performance overhead of modern container technologies,
including Docker, rkt, gVisor, Firecracker, Kata Containers, LXC/LXD,
or any containers you are interested in.
You are required to answer the following high-level questions:
-
Are containerized applications slower than native applications running on bare metal?
-
If so, where does the performance get lost, and why?
-
How do different container technologies compare?
-
Given that different container technologies make different design decisions,
what can we learn from the measurement results?
You will study how to design and use
benchmarks to usefully characterize a complex system. You should also gain an intuitive feel for
the relative speeds of different basic operations, which is invaluable in identifying performance
bottlenecks.
This project has two parts.
First, you will implement and perform a series of experiments.
Second, you will write a report documenting the methodology and results of your experiments.
When you finish, you will submit your report as well as the code used to perform your experiments.
Setup
You have complete freedom to choose the OS and the
hardware platform for your measurements. You can use a
laptop that you are comfortable with, the Ubuntu VM assigned by
Engr-IT, a game system, or even a supercomputer if you know how :).
You need to run the benchmarks on bare metal
and at least two container technologies
(you can choose any two based on your interests), and compare
the performance.
Please provide a reasonably detailed description of the test setup(s),
such as the hardware and/or the environment.
The hardware information should be available from the system
(e.g., sysctl on BSD, /proc on Linux, System Profiler
on Mac OS X, the cpuid x86 instruction).
Gathering this information should not require
much work, but in explaining and analyzing your results you will find these
numbers useful.
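As one possible starting point on Linux, the sketch below reads a few fields from /proc/cpuinfo and /proc/meminfo. The function names (`parse_cpuinfo`, `parse_meminfo`, `describe_machine`) are made up for this illustration, and the field names assume the x86 layout of /proc/cpuinfo; other architectures may not report a "model name" line, in which case the model comes back as None.

```python
import os
import platform


def parse_cpuinfo(text):
    """Return (model_name, logical_cpu_count) from /proc/cpuinfo-style text."""
    model = None
    count = 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            count += 1  # one "processor" entry per logical CPU
        elif key == "model name" and model is None:
            model = value.strip()
    return model, count


def parse_meminfo(text):
    """Return total memory in kB from /proc/meminfo-style text, or None."""
    for line in text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    return None


def describe_machine():
    """Collect a few facts for the test-setup section (the /proc parts are Linux only)."""
    info = {"kernel": platform.release(), "arch": platform.machine()}
    if os.path.exists("/proc/cpuinfo"):
        with open("/proc/cpuinfo") as f:
            info["cpu_model"], info["logical_cpus"] = parse_cpuinfo(f.read())
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            info["mem_total_kb"] = parse_meminfo(f.read())
    return info


if __name__ == "__main__":
    for key, value in describe_machine().items():
        print(f"{key}: {value}")
```

Cache sizes, disk model, and NIC details are not covered here; you would need to gather those separately.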
Experiments
Methodology
We suggest that you perform your experiments using the following steps:
-
Run your benchmarks on your chosen machine or experiment environment.
-
Make a guess as to how much overhead your system stack will add to the
bare metal performance. We will not
grade you on your guess; this is for you to test your intuition.
For example, for a system call,
this overhead could come from Seccomp and LSMs.
If you are measuring a system in an
unusual environment, estimate the degree of variability and error that might
be introduced when performing your measurements.
-
Combine the bare metal performance and your estimate of software overhead
into an overall prediction of performance.
-
Run your benchmarks. We suggest that you
run your experiments multiple times, for long enough to obtain
reproducible measurement results. Report the average and
the standard deviation across the measurements.
Note that, when measuring an operation
using many iterations (e.g., system call overhead), consider each run
of iterations as a single trial and compute the standard deviation across
multiple trials (not each individual iteration). If your benchmarks already report
these statistics, find them and include them in your report.
-
If you are building your own benchmarks, use a low-overhead mechanism
for reading timestamps.
All modern processors have a cycle counter that applications can read using a
special instruction (e.g., RDTSC). Searching for "RDTSC"
in Google, for instance, will provide you with a plethora of additional
examples. Note, though, that in the modern age of power-efficient multicore
processors, you will need to take additional steps to reliably
use the cycle counter to measure the passage of time. You will want
to disable dynamically adjusted CPU frequency (the mechanism will depend on
your platform) so that the frequency at which the processor computes
is deterministic and does not vary. Use `nice` to boost your process priority.
Restrict your measurement programs to using a single core.
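The trial structure described in the steps above can be sketched as follows. This is only an illustration of the bookkeeping: `run_trials`, `summarize`, and `pin_to_one_core` are hypothetical helper names, `time.perf_counter_ns` stands in for a cycle counter, and the Python interpreter adds far more overhead per iteration than a C loop would, so treat the absolute numbers with suspicion. Each trial yields one sample, and the standard deviation is taken across trials.

```python
import os
import statistics
import time


def pin_to_one_core(core=0):
    """Restrict this process to one core where the OS supports it (Linux)."""
    if hasattr(os, "sched_setaffinity"):
        allowed = os.sched_getaffinity(0)
        target = core if core in allowed else min(allowed)
        os.sched_setaffinity(0, {target})


def run_trials(operation, trials=10, iterations=100_000):
    """Return the per-operation time in ns for each trial.

    Each trial times `iterations` back-to-back calls and subtracts the cost
    of an empty loop of the same length, so one trial yields one sample.
    """
    samples = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        for _ in range(iterations):
            pass
        loop_ns = time.perf_counter_ns() - start

        start = time.perf_counter_ns()
        for _ in range(iterations):
            operation()
        total_ns = time.perf_counter_ns() - start

        samples.append((total_ns - loop_ns) / iterations)
    return samples


def summarize(samples):
    """Mean and sample standard deviation across trials (needs >= 2 trials)."""
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    pin_to_one_core()
    mean, sd = summarize(run_trials(os.getppid))
    print(f"getppid: {mean:.1f} ns +/- {sd:.1f} ns per call")
```

`os.getppid` is used here only as an example of a cheap operation; substitute whatever you are measuring.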
You need to explain your methods in detail to make your results interpretable.
If you are using different methods to compare different
aspects, please also explain the "what" and the "why" correspondingly.
Measures
You can use existing benchmarks, or build your own benchmarks to measure the
following four aspects of the applications:
-
CPU, Scheduling, and OS Services
-
Memory
-
Network
-
File System
If you choose to use an existing benchmark, you have to fully understand
the benchmark, including
the design rationale, the implementation (read the source code!),
and the runtime behavior.
Running benchmarks without understanding them will not be accepted.
Essentially, the goal of the project is to understand the details
of the performance overhead of cloud-native container technologies,
rather than to blindly run some benchmarks. Please read the papers in the
Reference section below, which set the standard for the level of
understanding we are looking for.
You are highly encouraged to build your own benchmarks (either building from
scratch or revising existing ones).
The following describes a few ideas which you can start with:
-
CPU, Scheduling, and OS Services
-
Measurement overhead: Report the overhead of reading time counters and the overhead of
using a loop to measure many iterations of an operation.
-
Procedure call overhead: Report as a function of number of integer arguments from 0-7.
What is the incremental overhead of an argument?
-
System call overhead: Report the cost of a minimal system call. Note that some operating systems will
cache the results of some system calls (e.g., idempotent system calls like getpid), so only the first
call by a process will actually trap into the OS.
-
Task creation time: Report the time to create and run both a process and a kernel thread (kernel threads
run at user-level, but they are created and managed by the OS; e.g., pthread_create on modern Linux will
create a kernel-managed thread).
-
Context switch time: Report the time to context switch from one process to another, and from one kernel
thread to another. In the past students have found using blocking pipes to be useful
for forcing context switches.
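The blocking-pipe idea for forcing context switches might look like the sketch below. It assumes a POSIX system with `os.fork`; the function name `pipe_round_trip_ns` is invented for this example. One round trip covers a write and a blocking read in each process, so the result includes pipe system call costs in addition to the switches themselves, and you would need to measure and subtract those separately. If the two processes land on different cores, no context switch is forced at all, so consider pinning to one core before forking (the child inherits the parent's CPU affinity).

```python
import os
import time


def pipe_round_trip_ns(iterations=10_000):
    """Average ns for one parent -> child -> parent hand-off over two pipes."""
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: echo each byte back until the parent closes its write end.
        os.close(to_child_w)
        os.close(to_parent_r)
        while True:
            data = os.read(to_child_r, 1)
            if not data:
                break
            os.write(to_parent_w, data)
        os._exit(0)

    # Parent: keep only the pipe ends it uses.
    os.close(to_child_r)
    os.close(to_parent_w)

    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.write(to_child_w, b"x")
        os.read(to_parent_r, 1)  # blocks until the child has run
    elapsed = time.perf_counter_ns() - start

    os.close(to_child_w)  # EOF for the child, which then exits
    os.waitpid(pid, 0)
    os.close(to_parent_r)
    return elapsed / iterations


if __name__ == "__main__":
    print(f"round trip: {pipe_round_trip_ns():.0f} ns")
```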
-
Memory
-
RAM access time: Report latency for individual integer accesses to main memory and the L1 and L2 caches.
Present results as a graph with the x-axis as the log of the size of the memory region accessed, and the
y-axis as the average latency. Note that the
lmbench paper
is a good reference for this experiment. In
terms of the lmbench paper, measure the "back-to-back-load" latency and report your results in a graph
similar to Figure 1 in the paper.
-
RAM bandwidth: Report bandwidth for both reading and writing. Use loop unrolling to get more accurate
results, and keep in mind the effects of cache line prefetching (e.g., see the lmbench paper).
-
Page fault service time: Report the time for faulting an entire page from disk (mmap is one useful
mechanism). Dividing by the size of a page, how does it compare to the latency of accessing a byte from
main memory?
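For the page fault measurement, one way to use mmap is sketched below: map a file and touch the first byte of every page. This is a rough illustration with several assumptions. The helper name `mmap_touch_ns_per_page` is hypothetical; the `posix_fadvise` call is only a hint, so a freshly written file may still be served from the page cache (or live on an in-memory file system), which would give you minor faults rather than faults from disk; and kernel read-ahead or fault-around may map several pages per fault. You will need to confirm what you are actually measuring.

```python
import mmap
import os
import tempfile
import time


def mmap_touch_ns_per_page(num_pages=1024):
    """Average ns to touch the first byte of each page of a mapped file."""
    size = mmap.PAGESIZE * num_pages
    with tempfile.NamedTemporaryFile() as f:
        f.write(bytes(size))
        f.flush()
        os.fsync(f.fileno())
        # Advisory only: ask the kernel to drop cached pages of this file.
        if hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass  # hint not supported here; the file stays cached

        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            checksum = 0
            start = time.perf_counter_ns()
            for offset in range(0, size, mmap.PAGESIZE):
                checksum += m[offset]  # first access may fault the page in
            elapsed = time.perf_counter_ns() - start
    return elapsed / num_pages


if __name__ == "__main__":
    print(f"{mmap_touch_ns_per_page():.0f} ns per page touched")
```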
-
File System
-
File read time: Report for both sequential and random access as a function of file size. Discuss the sense
in which your "sequential" access might not be sequential. Ensure that you are not measuring cached data
(e.g., use the raw device interface). Report as a graph with a log/log plot with the x-axis the size of the
file and y-axis the average per-block time.
-
Contention: Report the average time to read one file system block of data as a function of the number of
processes simultaneously performing the same operation on different files on the same disk (and not in the
file buffer cache).
-
Performance of different file systems. Docker
uses OverlayFS,
while Facebook's container uses Btrfs.
Does the choice of file system make a difference?
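A minimal sketch of the sequential versus random read comparison is shown below. The names `read_blocks_ns` and `sequential_vs_random` and the 4096-byte `BLOCK` constant are assumptions for illustration. The `posix_fadvise` hint does not guarantee that the cache is bypassed, so for real measurements you would still need a stronger mechanism such as the raw device interface mentioned above.

```python
import os
import random
import tempfile
import time

BLOCK = 4096  # assumed block size; check your file system's actual value


def read_blocks_ns(path, offsets):
    """Average ns per BLOCK-sized pread over the given byte offsets."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Advisory only: ask the kernel to drop this file's cached pages.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        start = time.perf_counter_ns()
        for off in offsets:
            os.pread(fd, BLOCK, off)
        elapsed = time.perf_counter_ns() - start
    finally:
        os.close(fd)
    return elapsed / len(offsets)


def sequential_vs_random(path, seed=0):
    """Return (sequential_ns, random_ns) per block for one file."""
    nblocks = os.path.getsize(path) // BLOCK
    offsets = [i * BLOCK for i in range(nblocks)]
    seq = read_blocks_ns(path, offsets)
    random.Random(seed).shuffle(offsets)  # same blocks, shuffled order
    rnd = read_blocks_ns(path, offsets)
    return seq, rnd


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(BLOCK * 4096))  # 16 MiB scratch file
    print(sequential_vs_random(f.name))
    os.unlink(f.name)
```

Running the same script against files on OverlayFS and Btrfs mounts is one way to start on the file system comparison.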
-
Network
-
Round trip time. Compare with the time to perform a ping (ICMP requests are handled at kernel level).
-
Peak bandwidth.
-
Connection overhead: Report setup and tear-down. Evaluate for the TCP protocol. For each quantity, compare both remote and loopback interfaces. Comparing the
remote and loopback results, what can you deduce about baseline network performance and the overhead of container
software? For both round trip time and bandwidth, how close to bare metal performance do you achieve? What
are reasons why the TCP performance does not match bare metal performance? In describing your methodology
for the remote case, either provide a machine description for the second machine (as above), or use two identical machines.
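For the loopback case, round trip time and connection overhead could be sketched with plain TCP sockets as below. The helper names (`start_echo_server`, `tcp_rtt_ns`, `tcp_connect_close_ns`) are made up for this example. Note that the echo server here runs as a thread inside the same Python process as the client, which distorts the timing; for real measurements, run the server as a separate process, inside the container under test, or on the remote machine. The connection figure covers only the client-side connect and close calls.

```python
import socket
import threading
import time


def start_echo_server(host="127.0.0.1"):
    """Start a TCP echo server that serves one connection at a time; return its port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, 0))  # port 0: let the OS pick a free port
    srv.listen()

    def serve():
        while True:
            conn, _ = srv.accept()
            with conn:
                while True:
                    data = conn.recv(64)
                    if not data:
                        break
                    conn.sendall(data)

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1]


def tcp_rtt_ns(host, port, rounds=1000):
    """Average ns for a 1-byte request/response on an open connection."""
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter_ns()
        for _ in range(rounds):
            s.sendall(b"x")
            s.recv(1)
        return (time.perf_counter_ns() - start) / rounds


def tcp_connect_close_ns(host, port, rounds=200):
    """Average ns to set up and tear down a connection (client side only)."""
    start = time.perf_counter_ns()
    for _ in range(rounds):
        socket.create_connection((host, port)).close()
    return (time.perf_counter_ns() - start) / rounds


if __name__ == "__main__":
    port = start_echo_server()
    print(f"RTT: {tcp_rtt_ns('127.0.0.1', port):.0f} ns")
    print(f"connect+close: {tcp_connect_close_ns('127.0.0.1', port):.0f} ns")
```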
Report
In your report:
-
Explain the differences between your chosen container technologies.
-
Clearly explain the methodology of your experiment.
-
Introduce your benchmarks and explain what they are measuring.
-
Present your results:
-
For measurements of single quantities (e.g., system call overhead),
use a table to summarize your results. In the table report the bare metal
performance, your estimate of software overhead, your prediction
of operation time, and your measured operation time.
-
For measurements of operations as a function of some other quantity,
report your results as a graph with operation time on the y-axis and
the varied quantity on the x-axis. Include your estimates of bare metal
performance and overall prediction of operation time as curves
on the graph as well.
-
If you have more than one set of results in your graph, clearly distinguish them
with different patterns or colors, and clearly annotate them.
-
Label your x-axis and y-axis in your graph clearly.
-
The scale of your graph should be carefully chosen, so that results that
are either too big or too small do not dominate or vanish from your graph.
-
Remember to add labels and a title to your graph.
-
Discuss your results:
-
Compare the measured performance with the predicted performance. If they are wildly different, speculate on reasons why. What may be contributing to the overhead?
-
Evaluate the success of your methodology. How accurate do you think your results are?
-
For graphs, explain any interesting features of the curves.
-
At the end of your report, summarize your results in a table for a complete overview.
Reference
During the semester you will have read a number of papers
describing various system measurements. You may find those papers on the reading list
useful as references.
In addition, other papers you may find useful for help with system
measurement are:
- John K. Ousterhout, Why
Aren't Operating Systems Getting Faster as Fast as Hardware?,
Proc. of USENIX Summer Conference, pp. 247-256, June 1990.
- J. Bradley Chen, Yasuhiro Endo, Kee Chan, David Mazières,
Antonio Dias, Margo Seltzer, and Michael D. Smith, The
Measured Performance of Personal Computer Operating Systems,
Proc. of ACM SOSP, pp. 299-313, December 1995.
- Larry McVoy and Carl Staelin, lmbench:
Portable Tools for Performance Analysis, Proc. of USENIX Annual
Technical Conference, January 1996.
-
Aaron B. Brown and Margo I. Seltzer, Operating
system benchmarking in the wake of lmbench: a case study of the
performance of NetBSD on the Intel x86 architecture, Proc. of ACM
SIGMETRICS, pp. 214-224, June 1997.
- John Ousterhout, Always Measure One Level Deeper, Communications of the ACM, Vol. 61, No. 7, July 2018, pp. 74–83.
Finally, you may implement all of your measurements yourself, or you
may download a tool to perform the measurements for you. For each
aspect, you choose whether to measure it with your own implementation
or with an existing benchmark. Either way, you need to explore carefully
the reasons behind the resulting numbers, which should keep you busy.
Finally(+1), structure your paper and organize your materials according to
your needs.
Acknowledgement
CMP is designed based on the Operating System Measurement project of
CSE 221 taught by Professors Geoff Voelker, Yuanyuan Zhou, and Stefan Savage
at the University of California, San Diego.