class: center, middle

# Timekeeping in the Linux Kernel

.fade[
* Stephen Boyd
]

???

Why? I had some commits in the kernel/time area upstream and was asked by
co-workers to describe how it all works so they can debug system crashes when
timekeeping breaks.

---
class: center, middle

# In the beginning ...

---
name: first_counter

# there was a counter

.center[.ticking_counter[0000ec544fef3c81]]

???

The counter increments (or decrements) every time a counter tick happens.
Ticks happen at the rate of the frequency of the counter. A 100 MHz counter
will increment every 1 / 100 MHz = 10 ns.

---

# Calculating the Passage of Time (in ns)

.center[
###`\(\frac{c_{cycles}}{f_{Hz}} =\frac{c_{cycles}}{f (\frac{1}{seconds})} =(\frac{c_{cycles}}{f})_{seconds}\)`

###`\((\frac{c_{cycles}}{f})_{seconds} \cdot 1e9 = T_{ns}\)`
]

--

## Problems

* Division is slow
* Floating point math
* Precision/overflow/underflow problems

???

Given a counter that ticks at a certain frequency, we want to know how much
time has passed given a "cycle delta". This is a simple calculation, but it's
slow due to the division and could potentially overflow if the time delta is
large. Let's see how we speed it up.

---

# Calculating the Passage of Time (in ns)

Better:

```C
static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
{
	return ((u64) cycles * mult) >> shift;
}
```

--

### Where do mult and shift come from?

```C
clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 minsec)
```

???

* Mult is cheaper than div
* Shift does a div by a power of 2, which is cheaper
* Avoids floating point
* Conversion range is limited (minsec controls that)
* from/to is swapped for clocksources and clockevents
* clocksources go from frequency to ns, while clockevents go from ns to cycles

Mult can't be too large or we'll overflow in the cycles * mult part when
multiplying two large 64-bit numbers. A large shift allows finer grained
frequency adjustment via mult (discussed later).
---

# Abstract the Hardware!

```c
struct clocksource {
	cycle_t	(*read)(struct clocksource *cs);
	cycle_t	mask;
	u32	mult;
	u32	shift;
	...
};

clocksource_register_hz(struct clocksource *cs, u32 hz);
clocksource_register_khz(struct clocksource *cs, u32 khz);
```

Time diff:

```c
struct clocksource *cs = &system_clocksource;

cycle_t start = cs->read(cs);
... /* do something for a while */
cycle_t end = cs->read(cs);

clocksource_cyc2ns(end - start, cs->mult, cs->shift);
```

???

Putting it all together, we model a counter as a struct clocksource. It has
fields for mult and shift, as well as a counter read function and a mask
indicating which bits returned by the read function are valid to use in the
delta calculation (recall clocksource_cyc2ns() from the previous slide).

We register a clocksource with the hz/khz functions because they calculate a
good mult/shift pair based on the assumption that we're not going to be idle
for more than 10 minutes (we'll talk about this later).

---

# POSIX Clocks

* CLOCK_BOOTTIME
* CLOCK_MONOTONIC
* CLOCK_MONOTONIC_RAW
* CLOCK_MONOTONIC_COARSE
* CLOCK_REALTIME
* CLOCK_REALTIME_COARSE
* CLOCK_TAI

???

* CLOCK_MONOTONIC - always increasing (doesn't count time spent in suspend)
* CLOCK_MONOTONIC_RAW - same as CLOCK_MONOTONIC but with no NTP adjustment (almost shouldn't be used outside of NTP type work)
* CLOCK_MONOTONIC_COARSE - same as CLOCK_MONOTONIC but not as accurate
* CLOCK_REALTIME - "wall clock", can be changed at runtime
* CLOCK_REALTIME_COARSE - same as CLOCK_REALTIME but not as accurate
* CLOCK_BOOTTIME - monotonic time since boot (includes time spent in suspend)
* CLOCK_TAI - International Atomic Time

Refresher: These are the POSIX clocks that Linux exposes. All of these are
built on top of the clocksource. *_COARSE is less exact, and its resolution
can be queried via clock_getres(). It can be used to get the time faster
because it doesn't have to read the clocksource.
---

# POSIX Clocks Comparison

.line-block[
.line.boottime[]]
.line-block[
.line.mono[]]
.line-block[
.line.real[]]
.line-block[
.line.event-1[] .line.event-2[suspend] .line.event-3[] .line.event-4[settimeofday()] .line.event-5[]]

???

* Suspend makes the monotonic clock freeze during that time
* settimeofday() can cause the realtime clock to go backwards (it could also go forward)

---
class: center, middle

# **R**ead **A**ccumulate **T**rack (RAT)

### *Best acronym ever*

???

* Read the clocksource
* Accumulate into a nanoseconds u64
* Track the last time we accumulated

---

# __**R**__AT in Action (Read)

```C
struct tk_read_base {
*	struct clocksource	*clock;
	u64			mask;
	u64			cycle_last;
	u32			mult;
	u32			shift;
	u64			xtime_nsec;
	ktime_t			base;
};

static inline u64 tk_clock_read(const struct tk_read_base *tkr)
{
	struct clocksource *clock = READ_ONCE(tkr->clock);

*	return clock->read(clock);
}
```

???

Shows reading of the clocksource.

---

# RA__**T**__ in Action (Track)

```C
struct tk_read_base {
	struct clocksource	*clock;
	u64			mask;
*	u64			cycle_last;
	u32			mult;
	u32			shift;
	u64			xtime_nsec;
	ktime_t			base;
};

static inline u64 timekeeping_get_delta(const struct tk_read_base *tkr)
{
	u64 now = tk_clock_read(tkr);
*	u64 last = tkr->cycle_last;

	return clocksource_delta(now, last, tkr->mask);
}

static void timekeeping_forward_now(struct timekeeper *tk)
{
	...
*	tk->tkr_mono.cycle_last = cycle_now;
}
```

---

# RAT in Action

```C
static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
{
	return (now - last) & mask;
}

static inline u64 timekeeping_get_delta(const struct tk_read_base *tkr)
{
	u64 now = tk_clock_read(tkr);
*	u64 last = tkr->cycle_last;

	return clocksource_delta(now, last, tkr->mask);
}

static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, cycle_t delta)
{
	u64 nsec = delta * tkr->mult + tkr->xtime_nsec;

	return nsec >> tkr->shift;
}

static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
{
*	u64 delta = timekeeping_get_delta(tkr);

	return timekeeping_delta_to_ns(tkr, delta);
}
```

---

# R**A**T in Action (Accumulate)

```C
static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
				    u32 shift, unsigned int *clock_set)
{
	u64 interval = tk->cycle_interval << shift;

*	tk->tkr_mono.cycle_last += interval;
*	tk->tkr_mono.xtime_nsec += tk->xtime_interval << shift;
	*clock_set |= accumulate_nsecs_to_secs(tk);
	...
}

static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
{
	u64 nsecps = (u64)NSEC_PER_SEC << tk->tkr_mono.shift;
	unsigned int clock_set = 0;

	while (tk->tkr_mono.xtime_nsec >= nsecps) {
		int leap;

*		tk->tkr_mono.xtime_nsec -= nsecps;
*		tk->xtime_sec++;
		...
}
```

???

Shows how we accumulate time into xtime_nsec and xtime_sec, and also how we
track the last cycle when we did the accumulation. With this we can start
supporting multiple timelines.

---

# Juggling Clocks

```c
struct timekeeper {
	struct tk_read_base	tkr_mono;
	struct tk_read_base	tkr_raw;
	u64			xtime_sec;
	unsigned long		ktime_sec;
	struct timespec64	wall_to_monotonic;
	ktime_t			offs_real;
	ktime_t			offs_boot;
	ktime_t			offs_tai;
	s32			tai_offset;
	struct timespec64	raw_time;
};
```

.center[
Note: Not an entirely accurate diagram
]

???

Offsets maintain different clock timelines for us by letting us add or
subtract time from another timeline.
This way we only need to maintain one timeline, and adjust the other POSIX
clock timelines relative to it with offsets.

Note: realtime is implemented by reading xtime_sec + the clocksource

---

# Handling Clock Drift

.center[
### `\(\frac{1}{19200000} \cdot 1e9 = {52.08\overline{3}}_{ns}\)`
### Vs.
### `\(\frac{1}{19200008} \cdot 1e9 = 52.083311_{ns}\)`
]

???

* Hardware drifts
* Most systems aren't using atomic clocks!

---

# Handling Clock Drift

.center[
### `\(\frac{100000}{19200000} \cdot 1e9 = 5208333_{ns}\)`
### Vs.
### `\(\frac{100000}{19200008} \cdot 1e9 = 5208331_{ns}\)`
]

### After 100k cycles we've lost 2 ns

---

# Mult to the Rescue!

.center[
### `\((100000 \cdot 873813333) \gg 24 = 5208333_{ns}\)`
### Vs.
### `\((100000 \cdot 873813109) \gg 24 = 5208331_{ns}\)`
]

### Approach: Adjust mult to match the actual clock frequency

???

We can adjust for the error by modifying the mult to match the true frequency
of the hardware. We have to be careful though not to overflow and to avoid
making time jump forwards or backwards.

---

# Making Things Fast and Efficient

```C
static struct {
	seqcount_t		seq;
	struct timekeeper	timekeeper;
} tk_core ____cacheline_aligned;

static struct timekeeper shadow_timekeeper;

struct tk_fast {
	seqcount_t		seq;
	struct tk_read_base	base[2];
};

static struct tk_fast tk_fast_mono ____cacheline_aligned;
static struct tk_fast tk_fast_raw  ____cacheline_aligned;
```

???

* NMI safety
* Cache line aligning
* Poor man's RCU design

We have a tk_read_base structure to make reading the time NMI safe as well as
small enough to fit into a 64-byte cache line so things don't always cache
miss. tk_read_base is carefully constructed to fit into 56 bytes so it can be
paired with a seqcount. The shadow_timekeeper and the two-element base array
are a poor man's RCU.

---

# A Note About NMIs and Time

.center[]

???

Fast timekeepers can look like they go backwards if a read that happens while
mult is being adjusted uses stale data, while a subsequent read after mult is
adjusted uses the new data.
---

# Where We Are

.center[]

---

# What if my system doesn't have a counter?

Insert #sadface here

* Can't use NOHZ
* Can't use hrtimers in "high resolution" mode

Relegated to the jiffies clocksource:

```C
static cycle_t jiffies_read(struct clocksource *cs)
{
	return (cycle_t) jiffies;
}

static struct clocksource clocksource_jiffies = {
	.name		= "jiffies",
	.rating		= 1, /* lowest valid rating */
	.read		= jiffies_read,
	...
};
```

???

Lowres means that hrtimers have jiffy resolution, not nanosecond resolution.
We'll talk about this more later.

---

# Let's talk about jiffies

--

## .center[Jiffy = 1 / CONFIG_HZ]

--

## .center[Updated during the "tick"]

???

Jiffies are a good way to implement timeouts or timers that need to run with
a low resolution. Also, jiffies is a global variable that anything in the
kernel can read/reference.

---

# The tick?

.center[]

???

No, not that one

---

# The tick

Periodic event that updates

* jiffies
* process accounting
* global load accounting
* timekeeping
* POSIX timers
* RCU callbacks
* hrtimers
* irq_work

???

The rate of the event is CONFIG_HZ. It's typically implemented as an
interrupt, so if the CPU can't receive interrupts then jiffies isn't going to
increment. Also, if we don't have a clocksource then we really need the
interrupts to be exact or our jiffies clocksource will slow down.

Not everything is done every jiffy. Things depend, like hrtimers as we'll see
later, or irq_work which we emulate here if arches can't do it themselves.

---

# Implementing the tick in hardware

### Timer Value: .ticking_timer[4efa4652]
### Match Value: .timer_match[]

.center.tick_img[]

???

Let's look at an example. This shows a hardware timer that we can configure
to trigger an interrupt when the match value matches the timer value.

---

# Abstract the Hardware!
```C
struct clock_event_device {
	void		(*event_handler)(struct clock_event_device *);
	int		(*set_next_event)(unsigned long evt, struct clock_event_device *);
	int		(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
	ktime_t		next_event;
	u64		max_delta_ns;
	u64		min_delta_ns;
	u32		mult;
	u32		shift;
	unsigned int	features;
#define CLOCK_EVT_FEAT_PERIODIC	0x000001
#define CLOCK_EVT_FEAT_ONESHOT	0x000002
#define CLOCK_EVT_FEAT_KTIME	0x000004
	int		irq;
	...
};

void clockevents_config_and_register(struct clock_event_device *dev,
				     u32 freq, unsigned long min_delta,
				     unsigned long max_delta)
```

???

We model a hardware timer with a clockevent, a structure used to set up an
event to trigger at some point in the future. event_handler() is the function
to call when the event happens. next_event holds the next ktime at which this
clockevent is configured to call event_handler().

Clockevents also use mult/shift, but to calculate the cycle counter value
that corresponds to a number of nanoseconds. min/max delta clamp the time in
nanoseconds that the clockevent can be programmed to raise an event in the
future.

Three types of features:

* ### CLOCK_EVT_FEAT_PERIODIC
  Can raise an interrupt every 1 / CONFIG_HZ seconds
* ### CLOCK_EVT_FEAT_ONESHOT
  Can raise an interrupt in X cycles. Time in nanoseconds is converted to cycles of the clockevent.
* ### CLOCK_EVT_FEAT_KTIME
  Can raise an interrupt based on a ktime structure (think nanoseconds) (only used for broadcast hrtimers)

---

# Three event_handlers

```C
struct clock_event_device {
*	void		(*event_handler)(struct clock_event_device *);
	int		(*set_next_event)(unsigned long evt, struct clock_event_device *);
	int		(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
	ktime_t		next_event;
	u64		max_delta_ns;
	...
}
```

| Handler                | Usage        |
|------------------------|--------------|
| tick_handle_periodic() | default      |
| tick_nohz_handler()    | lowres mode  |
| hrtimer_interrupt()    | highres mode |

???
The function where we run the tick. We'll go over the usages next.

---

# Ticks During Idle

## tick_handle_periodic()

.timeline[.task.task1[t1] .tick.tick1[] .task.task2[t2] .tick.tick2[] .idle_task_periodic.i1[idle] .tick.tick3[] .idle_task_periodic.i2[idle] .tick.tick4[] .task.task3[t3] ]

???

* Default handler
* Runs the tick every 1 / CONFIG_HZ seconds
* Typically only runs for a short time until the other handlers take over
* Can be emulated with the ONESHOT feature, but otherwise relies on the PERIODIC feature to reprogram the tick within the clockevent driver
* Note we always get the tick, even when we're idle for longer than a jiffy

---

# Tick-less Idle (i.e. CONFIG_NO_HZ_IDLE)

## tick_handle_periodic()

.timeline[.task.task1[t1] .tick.tick1[] .task.task2[t2] .tick.tick2[] .idle_task_periodic.i1[idle] .tick.tick3[] .idle_task_periodic.i2[idle] .tick.tick4[] .task.task3[t3]]

## tick_nohz_handler()

.timeline[.task.task1[t1] .tick.tick1[] .task.task2[t2] .tick.tick2[] .idle_task_lowres[idle] .tick.tick4[] .task.task3[t3]]

???

NO_HZ_IDLE is defined as removing the periodic tick during CPU idle. This
runs NOHZ mode in low resolution mode. Ticks can be skipped and we can
program the clockevent to trigger an event X cycles in the future. Needs the
oneshot feature of the clockevent to program it appropriately.

---

# High Resolution Mode

## tick_nohz_handler()

.timeline[.task.task1[t1] .tick.tick1[] .task.task2[t2] .tick.tick2[] .idle_task_lowres[idle] .tick.tick4[] .task.task3[t3]]

## hrtimer_interrupt()

.timeline[.task.task1[t1] .tick.tick1[] .task.task21[t] .hrtimer.hrtimer21[] .task.task22[2] .tick.tick2[] .idle_task_highres[idle] .hrtimer.hrtimer1[] .hrtimer.hrtimer2[] .tick.tick4[] .task.task3[t3]]

???

## Everything is an hrtimer now

Runs NOHZ mode in high resolution mode. Same as low-res, but hrtimers control
when the clockevent should next fire. The tick itself is an hrtimer, just
like the other hrtimers. In lowres mode hrtimers run during the tick, but in
highres mode hrtimers run when they're programmed to.
---

# Tick Devices

```C
enum tick_device_mode {
	TICKDEV_MODE_PERIODIC,
	TICKDEV_MODE_ONESHOT,
};

struct tick_device {
	struct clock_event_device *evtdev;
	enum tick_device_mode mode;
};

DEFINE_PER_CPU(struct tick_device, tick_cpu_device);
```

.center[]

???

There can be more clockevents than CPUs, so we need some local storage of the
tick device for each CPU. It's per-cpu so we can interrupt each CPU without
cross-CPU interrupts (more latency). So we have a wrapper structure to hold
the clockevent dedicated to each CPU plus the operating mode (periodic or
oneshot). The mode is similar to the clockevent features, but slightly
different.

Periodic means that the tick device has configured its clockevent to call
tick_periodic() each jiffy (assuming we aren't using nohz/highres). We can
emulate periodic with CLOCK_EVT_FEAT_ONESHOT as well. This is the default
operating mode when a clockevent is registered, so that we ensure everything
keeps working.

Oneshot means that the tick device has configured its clockevent to call a
nohz or hrtimer handler. We would use oneshot mode if we're trying to go
"tick-less".

---

# Running the Tick

```C
struct tick_sched {
	struct hrtimer sched_timer;
	...
};
```

.center[]

???

When we're in oneshot + nohz mode, we emulate the tick with an hrtimer. In
lowres mode we use the hrtimer as a way to track the next tick, but otherwise
the event handler programs the tick_device directly. In highres mode the tick
is a true hrtimer, and the event handler programs whatever hrtimer is next
(tick or not) into the tick device.

---

# Running the Tick (Per-CPU)

```C
struct tick_sched {
	struct hrtimer sched_timer;
	...
};

*DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
```

.center[]

???

Everything is duplicated per-cpu. Only one CPU does the global jiffies update
though.
---

# Stopping the Tick

* Not always as simple as
  * ```C hrtimer_cancel(&ts->sched_timer)```
* Could be that we need to restart the timer for some point far in the future
  * ```C hrtimer_start(&ts->sched_timer, tick, HRTIMER_MODE_ABS_PINNED)```

Needs to consider

* timers
* hrtimers
* RCU callbacks
* jiffies update responsibility
* the clocksource's max_idle_ns (timekeeping max deferment)

*Details in ```tick_nohz_stop_sched_tick()```*

???

Otherwise known as entering NO-HZ idle. Either we cancel the hrtimer because
we aren't the CPU updating jiffies, or we program the hrtimer to run some
time in the future depending on what events are on this CPU or whether we
need to update jiffies.

---

# Tick Broadcast

* For when your clockevents **FAIL AT LIFE**
* i.e., they don't work during some CPU idle low power modes
* Indicated by the CLOCK_EVT_FEAT_C3_STOP flag

.center[]

???

Emulate the per-cpu tick_device with a global tick_device. Only considers
tick_devices that are non-functional in low power modes and is closely tied
to CPU idle. The mechanism works by forcing an interrupt on another CPU from
the broadcast clockevent's event_handler().

---

# Timers

* Operate with jiffies granularity
* Requirements
  * jiffies increment
  * clockevent
  * softirq

.center[]

---

# HRTimers

* Operate with ktime (nanoseconds) granularity
* Requirements
  * timekeeping increment
  * clockevent
  * tick_device

.center[]

???

Everything comes together. We use timekeeping each time we run the timers to
make sure we only run the timers that need to run. That in turn uses the
clocksource, and the timers only run because we've set up a clockevent to run
hrtimer_interrupt(). In low-res mode hrtimers still work, but we run them
from the tick handler so they have jiffy granularity.
---

# Summary

.left-column[
### What we covered

* clocksources
* timekeeping
* clockevents
* jiffies
* NOHZ
* tick broadcast
* timers
* hrtimers
]

.right-column[
### What's difficult

* Timekeeping has to handle NTP and drift
* The tick uses multiple abstraction layers
* NOHZ gets complicated when starting/stopping the tick
* Tick broadcast turns NOHZ up to 11
]

???

Thanks to John Stultz for slide review! And to remark.js and MathJax for the
slide software.