class: center, middle

# Timekeeping in the Linux Kernel

.fade[
* Stephen Boyd
]

???

Why? I had some commits in the kernel/time area upstream and was asked by
co-workers to describe how it all works so they can debug system crashes when
timekeeping breaks.

---
class: center, middle

# In the beginning ...

---
name: first_counter

# there was a counter

.center[.ticking_counter[0000ec544fef3c81]]

???

The counter increments (or decrements) every time a counter tick happens.
Ticks happen at the rate of the frequency of the counter. A 100 MHz counter
will increment every 1 / 100 MHz = 10 ns.

---

# Calculating the Passage of Time (in ns)

.center[
###`\(\frac{c_{cycles}}{f_{Hz}} =\frac{c_{cycles}}{f (\frac{1}{seconds})} =(\frac{c_{cycles}}{f})_{seconds}\)`

###`\((\frac{c_{cycles}}{f})_{seconds} \cdot 1e9 = T_{ns}\)`
]

--

## Problems

* Division is slow
* Floating point math
* Precision/overflow/underflow problems

???

Given a counter that ticks at a certain frequency, we want to know how much
time has passed given a "cycle delta". This is a simple calculation, but it's
slow due to the division and could potentially overflow if the time delta is
large. Let's see how we speed it up.

---

# Calculating the Passage of Time (in ns)

Better:

```C
static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
{
	return ((u64) cycles * mult) >> shift;
}
```

--

### Where do mult and shift come from?

```C
clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 minsec)
```

???

* Mult is cheaper than div
* Shift does a div by a power of 2, which is cheaper
* Avoids floating point
* Conversion range is limited (minsec controls that)
* from/to is swapped for clocksources and clockevents
* clocksources go from frequency to ns, while clockevents go from ns to cycles

Mult can't be too large or we'll overflow in the cycles * mult part when
multiplying two large 64-bit numbers. A large shift allows finer grained
frequency adjustment via mult (discussed later).
---

# Abstract the Hardware!

```c
struct clocksource {
	cycle_t	(*read)(struct clocksource *cs);
	cycle_t	mask;
	u32	mult;
	u32	shift;
	...
};

clocksource_register_hz(struct clocksource *cs, u32 hz);
clocksource_register_khz(struct clocksource *cs, u32 khz);
```

Time diff:

```c
struct clocksource *cs = &system_clocksource;

cycle_t start = cs->read(cs);
... /* do something for a while */
cycle_t end = cs->read(cs);

clocksource_cyc2ns(end - start, cs->mult, cs->shift);
```

???

Putting it all together, we model a counter as a struct clocksource. It has
fields for mult and shift, as well as a counter read function and a mask
indicating which bits returned by the read function are valid to use in the
delta calculation (recall clocksource_cyc2ns() from the previous slide).

We register a clocksource with the hz/khz functions because they calculate a
good mult/shift pair based on the assumption that we're not going to be idle
for more than 10 minutes (we'll talk about this later).

---

# POSIX Clocks

* CLOCK_BOOTTIME
* CLOCK_MONOTONIC
* CLOCK_MONOTONIC_RAW
* CLOCK_MONOTONIC_COARSE
* CLOCK_REALTIME
* CLOCK_REALTIME_COARSE
* CLOCK_TAI

???

* CLOCK_MONOTONIC - always increasing (doesn't count time spent in suspend)
* CLOCK_MONOTONIC_RAW - same as CLOCK_MONOTONIC but with no NTP adjustment (almost shouldn't be used outside of NTP type work)
* CLOCK_MONOTONIC_COARSE - same as CLOCK_MONOTONIC but not as accurate
* CLOCK_REALTIME - "wall clock", can be changed at runtime
* CLOCK_REALTIME_COARSE - same as CLOCK_REALTIME but not as accurate
* CLOCK_BOOTTIME - monotonic time since boot (includes time spent in suspend)
* CLOCK_TAI - International Atomic Time

Refresher: These are the POSIX clocks that Linux exposes. All of these are
built on top of the clocksource. *_COARSE is less exact, and its resolution
can be queried via clock_getres(). It can be used to get the time faster
because it doesn't have to read the clocksource.
---

# POSIX Clocks Comparison

.line-block[
.line.boottime[]]
.line-block[
.line.mono[]]
.line-block[
.line.real[]]
.line-block[
.line.event-1[] .line.event-2[suspend] .line.event-3[] .line.event-4[settimeofday()] .line.event-5[]]

???

* Suspend makes the monotonic clock freeze during that time
* settimeofday() can cause the realtime clock to go backwards (it could also go forward)

---
class: center, middle

# **R**ead **A**ccumulate **T**rack (RAT)

### *Best acronym ever*

???

* Read the clocksource
* Accumulate into a nanoseconds u64
* Track the last time we accumulated

---

# __**R**__AT in Action (Read)

```C
struct tk_read_base {
*	struct clocksource	*clock;
	u64			mask;
	u64			cycle_last;
	u32			mult;
	u32			shift;
	u64			xtime_nsec;
	ktime_t			base;
};

static inline u64 tk_clock_read(const struct tk_read_base *tkr)
{
	struct clocksource *clock = READ_ONCE(tkr->clock);

*	return clock->read(clock);
}
```

???

Shows reading of the clocksource.

---

# RA__**T**__ in Action (Track)

```C
struct tk_read_base {
	struct clocksource	*clock;
	u64			mask;
*	u64			cycle_last;
	u32			mult;
	u32			shift;
	u64			xtime_nsec;
	ktime_t			base;
};

static inline u64 timekeeping_get_delta(const struct tk_read_base *tkr)
{
	u64 now = tk_clock_read(tkr);
*	u64 last = tkr->cycle_last;

	return clocksource_delta(now, last, tkr->mask);
}

static void timekeeping_forward_now(struct timekeeper *tk)
{
	...
*	tk->tkr_mono.cycle_last = cycle_now;
}
```

---

# RAT in Action

```C
static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
{
	return (now - last) & mask;
}

static inline u64 timekeeping_get_delta(const struct tk_read_base *tkr)
{
	u64 now = tk_clock_read(tkr);
*	u64 last = tkr->cycle_last;

	return clocksource_delta(now, last, tkr->mask);
}

static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, cycle_t delta)
{
	u64 nsec = delta * tkr->mult + tkr->xtime_nsec;

	return nsec >> tkr->shift;
}

static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
{
*	u64 delta = timekeeping_get_delta(tkr);

	return timekeeping_delta_to_ns(tkr, delta);
}
```

---

# R**A**T in Action (Accumulate)

```C
static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
				    u32 shift, unsigned int *clock_set)
{
	u64 interval = tk->cycle_interval << shift;

*	tk->tkr_mono.cycle_last += interval;
*	tk->tkr_mono.xtime_nsec += tk->xtime_interval << shift;
	*clock_set |= accumulate_nsecs_to_secs(tk);
	...
}

static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
{
	u64 nsecps = (u64)NSEC_PER_SEC << tk->tkr_mono.shift;
	unsigned int clock_set = 0;

	while (tk->tkr_mono.xtime_nsec >= nsecps) {
		int leap;

*		tk->tkr_mono.xtime_nsec -= nsecps;
*		tk->xtime_sec++;
		...
}
```

???

Shows how we accumulate time into xtime_nsec and xtime_sec, and also how we
track the last cycle when we did the accumulation. With this we can start
supporting multiple timelines.

---

# Juggling Clocks

```c
struct timekeeper {
	struct tk_read_base	tkr_mono;
	struct tk_read_base	tkr_raw;
	u64			xtime_sec;
	unsigned long		ktime_sec;
	struct timespec64	wall_to_monotonic;
	ktime_t			offs_real;
	ktime_t			offs_boot;
	ktime_t			offs_tai;
	s32			tai_offset;
	struct timespec64	raw_time;
};
```

.center[
Note: Not an entirely accurate diagram
]

???

Offsets maintain different clock timelines for us by letting us add or
subtract time from another timeline.
This way we only need to maintain one timeline, and adjust the other POSIX
clock timelines relative to it with offsets.

Note: realtime is implemented by reading xtime_sec + the clocksource

---

# Handling Clock Drift

.center[
### `\(\frac{1}{19200000} \cdot 1e9 = {52.08\overline{3}}_{ns}\)`
### Vs.
### `\(\frac{1}{19200008} \cdot 1e9 = 52.083311_{ns}\)`
]

???

* Hardware drifts
* Most systems aren't using atomic clocks!

---

# Handling Clock Drift

.center[
### `\(\frac{100000}{19200000} \cdot 1e9 = 5208333_{ns}\)`
### Vs.
### `\(\frac{100000}{19200008} \cdot 1e9 = 5208331_{ns}\)`
]

### After 100k cycles we've lost 2 ns

---

# Mult to the Rescue!

.center[
### `\((100000 \cdot 873813333) \gg 24 = 5208333_{ns}\)`
### Vs.
### `\((100000 \cdot 873813109) \gg 24 = 5208331_{ns}\)`
]

### Approach: Adjust mult to match the actual clock frequency

???

We can adjust for the error by modifying the mult to match the true frequency
of the hardware. We have to be careful though not to overflow and to avoid
making time jump forwards or backwards.

---

# Making Things Fast and Efficient

```C
static struct {
	seqcount_t		seq;
	struct timekeeper	timekeeper;
} tk_core ____cacheline_aligned;

static struct timekeeper shadow_timekeeper;

struct tk_fast {
	seqcount_t		seq;
	struct tk_read_base	base[2];
};

static struct tk_fast tk_fast_mono ____cacheline_aligned;
static struct tk_fast tk_fast_raw  ____cacheline_aligned;
```

???

* NMI safety
* Cache line aligning
* Poor man's RCU design

We have a tk_read_base structure to make reading the time NMI safe as well as
small enough to fit into a 64-byte cache line so things don't always cache
miss. tk_read_base is carefully constructed to fit into 56 bytes so it can be
paired with a seqcount. The shadow_timekeeper and the two-element base array
are a poor man's RCU.

---

# A Note About NMIs and Time

.center[]

???

Fast timekeepers can look like they go backwards if a read that happens while
mult is being adjusted uses stale data, while a subsequent read after mult is
adjusted uses the new data.
---

# Where We Are

.center[]

---

# What if my system doesn't have a counter?

Insert #sadface here

* Can't use NOHZ
* Can't use hrtimers in "high resolution" mode

Relegated to the jiffies clocksource:

```C
static cycle_t jiffies_read(struct clocksource *cs)
{
	return (cycle_t) jiffies;
}

static struct clocksource clocksource_jiffies = {
	.name		= "jiffies",
	.rating		= 1, /* lowest valid rating */
	.read		= jiffies_read,
	...
};
```

???

Lowres means that hrtimers have jiffy resolution, not nanosecond resolution.
We'll talk about this more later.

---

# Let's talk about jiffies

--

## .center[Jiffy = 1 / CONFIG_HZ]

--

## .center[Updated during the "tick"]

???

Jiffies are a good way to implement timeouts or timers that need to run with
a low resolution. Also, jiffies is a global variable that anything in the
kernel can read/reference.

---

# The tick?

.center[]

???

No, not that one

---

# The tick

Periodic event that updates

* jiffies
* process accounting
* global load accounting
* timekeeping
* POSIX timers
* RCU callbacks
* hrtimers
* irq_work

???

The rate of the event is CONFIG_HZ. It's typically implemented as an
interrupt, so if the CPU can't receive interrupts then jiffies isn't going to
increment. Also, if we don't have a clocksource then we really need the
interrupts to be exact or our jiffies clocksource will slow down.

Not everything is done every jiffy. Things depend, like hrtimers as we'll see
later, or irq_work which we emulate here if arches can't do it themselves.

---

# Implementing the tick in hardware

### Timer Value: .ticking_timer[4efa4652]
### Match Value: .timer_match[]

.center.tick_img[]

???

Let's look at an example. This shows a hardware timer that we can configure
to trigger an interrupt when the match value matches the timer value.

---

# Abstract the Hardware!
```C
struct clock_event_device {
	void		(*event_handler)(struct clock_event_device *);
	int		(*set_next_event)(unsigned long evt, struct clock_event_device *);
	int		(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
	ktime_t		next_event;
	u64		max_delta_ns;
	u64		min_delta_ns;
	u32		mult;
	u32		shift;
	unsigned int	features;
#define CLOCK_EVT_FEAT_PERIODIC	0x000001
#define CLOCK_EVT_FEAT_ONESHOT	0x000002
#define CLOCK_EVT_FEAT_KTIME	0x000004
	int		irq;
	...
};

void clockevents_config_and_register(struct clock_event_device *dev,
				     u32 freq, unsigned long min_delta,
				     unsigned long max_delta)
```

???

We model a hardware timer with a clockevent, a structure used to set up an
event to trigger at some point in the future. event_handler() is the function
to call when the event happens. next_event holds the next ktime at which this
clockevent is configured to call event_handler().

Clockevents also use mult/shift, but to calculate the cycle counter value
that corresponds to a number of nanoseconds. min/max delta clamp the time in
nanoseconds that the clockevent can be programmed to raise an event in the
future.

Three types of features:

* ### CLOCK_EVT_FEAT_PERIODIC
  Can raise an interrupt every 1 / CONFIG_HZ seconds
* ### CLOCK_EVT_FEAT_ONESHOT
  Can raise an interrupt in X cycles. Time in nanoseconds is converted to cycles of the clockevent.
* ### CLOCK_EVT_FEAT_KTIME
  Can raise an interrupt based on a ktime structure (think nanoseconds) (only used for broadcast hrtimers)

---

# Three event_handlers

```C
struct clock_event_device {
*	void		(*event_handler)(struct clock_event_device *);
	int		(*set_next_event)(unsigned long evt, struct clock_event_device *);
	int		(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
	ktime_t		next_event;
	u64		max_delta_ns;
	...
}
```

| Handler                | Usage        |
|------------------------|--------------|
| tick_handle_periodic() | default      |
| tick_nohz_handler()    | lowres mode  |
| hrtimer_interrupt()    | highres mode |

???
The function where we run the tick. We'll go over the usages next.

---

# Ticks During Idle

## tick_handle_periodic()

.timeline[.task.task1[t1] .tick.tick1[] .task.task2[t2] .tick.tick2[] .idle_task_periodic.i1[idle] .tick.tick3[] .idle_task_periodic.i2[idle] .tick.tick4[] .task.task3[t3] ]

???

* Default handler
* Runs the tick every 1 / CONFIG_HZ seconds
* Typically only runs for a short time until the other handlers take over
* Can be emulated with the ONESHOT feature, but otherwise relies on the PERIODIC feature to reprogram the tick within the clockevent driver
* Note we always get the tick, even when we're idle for longer than a jiffy

---

# Tick-less Idle (i.e. CONFIG_NO_HZ_IDLE)

## tick_handle_periodic()

.timeline[.task.task1[t1] .tick.tick1[] .task.task2[t2] .tick.tick2[] .idle_task_periodic.i1[idle] .tick.tick3[] .idle_task_periodic.i2[idle] .tick.tick4[] .task.task3[t3]]

## tick_nohz_handler()

.timeline[.task.task1[t1] .tick.tick1[] .task.task2[t2] .tick.tick2[] .idle_task_lowres[idle] .tick.tick4[] .task.task3[t3]]

???

NO_HZ_IDLE is defined as removing the periodic tick during CPU idle. This
runs NOHZ mode in low resolution mode. Ticks can be skipped and we can
program the clockevent to trigger an event X cycles in the future. Needs the
oneshot feature of the clockevent to program it appropriately.

---

# High Resolution Mode

## tick_nohz_handler()

.timeline[.task.task1[t1] .tick.tick1[] .task.task2[t2] .tick.tick2[] .idle_task_lowres[idle] .tick.tick4[] .task.task3[t3]]

## hrtimer_interrupt()

.timeline[.task.task1[t1] .tick.tick1[] .task.task21[t] .hrtimer.hrtimer21[] .task.task22[2] .tick.tick2[] .idle_task_highres[idle] .hrtimer.hrtimer1[] .hrtimer.hrtimer2[] .tick.tick4[] .task.task3[t3]]

???

## Everything is an hrtimer now

Runs NOHZ mode in high resolution mode. Same as low-res, but hrtimers control
when the clockevent should next fire. The tick itself is an hrtimer, just
like the other hrtimers. In lowres mode hrtimers run during the tick, but in
highres mode hrtimers run when they're programmed to.
---

# Tick Devices

```C
enum tick_device_mode {
	TICKDEV_MODE_PERIODIC,
	TICKDEV_MODE_ONESHOT,
};

struct tick_device {
	struct clock_event_device *evtdev;
	enum tick_device_mode mode;
};

DEFINE_PER_CPU(struct tick_device, tick_cpu_device);
```

.center[]

???

There can be more clockevents than CPUs, so we need some local storage of the
tick device for each CPU. It's per-cpu so we can interrupt each CPU without
cross-CPU interrupts (more latency). So we have a wrapper structure to hold
the clockevent dedicated to each CPU plus the operating mode (periodic or
oneshot). The mode is similar to the clockevent features, but slightly
different.

Periodic means that the tick device has configured its clockevent to call
tick_periodic() each jiffy (assuming we aren't using nohz/highres). We can
emulate periodic with CLOCK_EVT_FEAT_ONESHOT as well. This is the default
operating mode when a clockevent is registered, so that we ensure everything
keeps working.

Oneshot means that the tick device has configured its clockevent to call a
nohz or hrtimer handler. We would use oneshot mode if we're trying to go
"tick-less".

---

# Running the Tick

```C
struct tick_sched {
	struct hrtimer sched_timer;
	...
};
```

.center[]

???

When we're in oneshot + nohz mode, we emulate the tick with an hrtimer. In
lowres mode we use the hrtimer as a way to track the next tick, but otherwise
the event handler programs the tick_device directly. In highres mode the tick
is a true hrtimer, and the event handler programs whatever hrtimer is next
(tick or not) into the tick device.

---

# Running the Tick (Per-CPU)

```C
struct tick_sched {
	struct hrtimer sched_timer;
	...
};

*DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
```

.center[]

???

Everything is duplicated per-cpu. Only one CPU does the global jiffies update
though.
---

# Stopping the Tick

* Not always as simple as
  * ```C hrtimer_cancel(&ts->sched_timer)```
* Could be that we need to restart the timer for some point far in the future
  * ```C hrtimer_start(&ts->sched_timer, tick, HRTIMER_MODE_ABS_PINNED)```

Needs to consider

* timers
* hrtimers
* RCU callbacks
* jiffies update responsibility
* the clocksource's max_idle_ns (timekeeping max deferment)

*Details in ```tick_nohz_stop_sched_tick()```*

???

Otherwise known as entering NO-HZ idle. Either we cancel the hrtimer because
we aren't the CPU updating jiffies, or we program the hrtimer to run some
time in the future depending on what events are on this CPU or whether we
need to update jiffies.

---

# Tick Broadcast

* For when your clockevents **FAIL AT LIFE**
* i.e., they don't work during some CPU idle low power modes
* Indicated by the CLOCK_EVT_FEAT_C3_STOP flag

.center[]

???

Emulate the per-cpu tick_device with a global tick_device. Only considers
tick_devices that are non-functional in low power modes and is closely tied
to CPU idle. The mechanism works by forcing an interrupt on another CPU from
the broadcast clockevent's event_handler().

---

# Timers

* Operate with jiffies granularity
* Requirements
  * jiffies increment
  * clockevent
  * softirq

.center[]

---

# HRTimers

* Operate with ktime (nanoseconds) granularity
* Requirements
  * timekeeping increment
  * clockevent
  * tick_device

.center[]

???

Everything comes together. We use timekeeping each time we run the timers to
make sure we only run the timers that need to run. That in turn uses the
clocksource, and the timers only run because we've set up a clockevent to run
hrtimer_interrupt(). In low-res mode hrtimers still work, but we run them
from the tick handler so they have jiffy granularity.
---

# Summary

.left-column[
### What we covered

* clocksources
* timekeeping
* clockevents
* jiffies
* NOHZ
* tick broadcast
* timers
* hrtimers
]

.right-column[
### What's difficult

* Timekeeping has to handle NTP and drift
* The tick uses multiple abstraction layers
* NOHZ gets complicated when starting/stopping the tick
* Tick broadcast turns NOHZ up to 11
]

???

Thanks to John Stultz for slide review! And to remark.js and MathJax for the
slide software.