TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory

Chap 4: Design Principles of TPP

Consider the following three questions:

  • What is an ideal layer to implement TPP functionalities?

  • How to detect page temperature?

  • What abstractions to provide for accessing CXL-Memory?

Implementation layer:

  • In user space, Chameleon-like tools must maintain their own page lists and access-history state. This incurs context-switch overhead between user space and kernel space, processing overhead for history management, and memory overhead for per-page metadata that may not scale to large working sets.

  • User-space tools are also expensive to run continuously across the entire production fleet. TPP is therefore designed as a kernel feature, which is less complex to implement and more performant than user-space mechanisms.

Page Temperature Detection:

  • PEBS does not work across all hardware platforms, and CPUs support only a limited number of performance counters. Moreover, PEBS-based profiling is not robust enough to run continuously under high-pressure workloads.

  • IPT-based approaches

  • NUMA Balancing (or AutoNUMA) generates a minor page fault when a sampled page is accessed, but periodically incurring page faults on the most frequently accessed pages can lead to high overheads.

  • TPP leverages minor page faults as the temperature-detection mechanism only for CXL-Memory. Since CXL-Memory is expected to hold warm and cold pages, the overhead of fault-based detection stays low.

  • For cold-page detection on the local memory node, Linux’s existing LRU-based page-aging mechanism is lightweight and quite efficient.

  • In short, TPP’s page-temperature detection combines LRU-based aging (cold pages on the local node) with NUMA-Balancing-style hint faults (hot pages on CXL-Memory); see the sketch below.
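
A minimal user-space sketch of this combined strategy follows. The page structure, flags, thresholds, and the notion of "rounds" are simplifications invented for illustration; they are not kernel data structures.

```c
/*
 * Minimal user-space sketch (not kernel code) of the combined strategy:
 * LRU-style aging spots cold pages on the local node, while
 * NUMA-Balancing-style hint faults spot hot pages on the CXL node.
 * All fields and thresholds here are illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 4

struct page {
    int  node;        /* 0 = local DRAM, 1 = CXL-Memory                */
    bool accessed;    /* stands in for the hardware accessed bit       */
    bool hint_armed;  /* CXL page "unmapped" so its next access faults */
    int  age;         /* rounds since the page was last seen accessed  */
};

/* Local node: clear accessed bits and age pages; old pages are cold. */
static void age_local_pages(struct page *p, int n)
{
    for (int i = 0; i < n; i++) {
        if (p[i].node != 0)
            continue;
        if (p[i].accessed) {
            p[i].age = 0;
            p[i].accessed = false;
        } else if (++p[i].age == 2) {
            printf("page %d: cold on local node -> demotion candidate\n", i);
        }
    }
}

/* CXL node: if an armed page was touched, the hint fault marks it hot. */
static void sample_cxl_pages(struct page *p, int n)
{
    for (int i = 0; i < n; i++) {
        if (p[i].node != 1)
            continue;
        if (p[i].hint_armed && p[i].accessed)
            printf("page %d: hint fault on CXL node -> promotion candidate\n", i);
        p[i].accessed = false;
        p[i].hint_armed = true;   /* re-arm sampling for the next round */
    }
}

int main(void)
{
    struct page pages[NPAGES] = {
        { .node = 0, .accessed = true  },  /* hot local page  */
        { .node = 0, .accessed = false },  /* cold local page */
        { .node = 1, .accessed = true  },  /* hot CXL page    */
        { .node = 1, .accessed = false },  /* cold CXL page   */
    };

    for (int round = 0; round < 3; round++) {
        age_local_pages(pages, NPAGES);
        sample_cxl_pages(pages, NPAGES);
        /* Simulate the workload touching its hot pages again. */
        pages[0].accessed = true;
        pages[2].accessed = true;
    }
    return 0;
}
```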

Memory Abstraction for CXL-Memory:

  • Need to read [TMO: Transparent memory offloading in datacenters. In ASPLOS, 2022.]

  • TPP does not use CXL-Memory as an in-memory swap device; instead, CXL-Memory is exposed as a (CPU-less) NUMA node.

Chap 5: TPP for CXL-Memory

  • An effective page placement mechanism should:

    • 1) efficiently offload cold pages to slower CXL-Memory

    • 2) promptly identify hot pages trapped in the CXL-node and promote them to the fast memory tier

  • CXL-Memory:

    • CPU-less and independent of the CPU-attached memory

    • Flexible enough to support heterogeneous memory technologies

  • Page allocation on a NUMA node should not frequently stall because reclamation is too slow to free up space

  • An effective policy should be aware of an application’s sensitivity toward different page types

  • TPP: an OS-managed page placement mechanism for tiered-memory systems that keeps hot pages in local memory and moves cold pages to CXL-Memory. Its four main components:

    • a lightweight demotion to CXL-Memory

    • decoupled allocation and reclamation paths

    • hot-page promotion to local nodes

    • page type-aware memory allocation (illustrated below)
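
As an illustration of the last component, the sketch below shows one plausible shape of a page type-aware allocation preference. The specific policy (anonymous pages prefer the fast local node, file-cache pages may start on CXL-Memory) is an assumption made here for illustration, not necessarily TPP’s exact policy; the enum and node IDs are hypothetical.

```c
/*
 * Illustrative sketch only: one way a page type-aware allocation
 * preference could look. The policy, enum, and node IDs below are
 * assumptions for illustration, not TPP's confirmed behavior.
 */
#include <stdio.h>

enum page_type { PAGE_ANON, PAGE_FILE_CACHE };

#define LOCAL_NODE 0
#define CXL_NODE   1

/* Pick a preferred allocation node based on the page type. */
static int preferred_node(enum page_type type)
{
    switch (type) {
    case PAGE_ANON:
        return LOCAL_NODE;   /* latency-sensitive, keep on the fast tier */
    case PAGE_FILE_CACHE:
        return CXL_NODE;     /* often cold, can start on the slower tier */
    }
    return LOCAL_NODE;
}

int main(void)
{
    printf("anon page       -> node %d\n", preferred_node(PAGE_ANON));
    printf("file-cache page -> node %d\n", preferred_node(PAGE_FILE_CACHE));
    return 0;
}
```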

5.1 Migration for Lightweight Reclamation

  • Linux

    • tries to allocate a page on the memory node local to the CPU where the process is running

    • when a CPU’s local memory node fills up, the default reclamation mechanism pages out to the swap device

    • the NUMA system then halts new allocations to the local node and allocates on the CXL-node until enough pages are freed up

    • slower reclamation -> more pages end up being allocated to the CXL-node

    • Invoking paging events on the critical path worsens the average page-access latency

  • To enable lightweight page reclamation on local nodes, after finding reclamation candidates, instead of invoking the swapping mechanism, TPP puts these candidates into a separate demotion list and migrates them asynchronously to the CXL-node (as sketched below).

  • Migration to a NUMA node is orders of magnitude faster than swapping.

  • Use Linux’s default LRU-based mechanism to select demotion candidates

    • unlike with swapping, demoted pages remain available in memory, so TPP scans inactive anon pages along with inactive file pages when selecting reclamation candidates.

    • If migration during demotion fails (e.g., due to low memory on the CXL-node), TPP falls back to the default reclamation mechanism for that page.
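
A simplified user-space sketch of this demotion flow (candidates are migrated to the CXL-node, and only failures fall back to the regular reclaim path) follows. The helper functions are stand-ins for kernel machinery such as migrate_pages() and the usual reclaim/swap path, and the failure pattern is invented purely for demonstration.

```c
/*
 * Simplified user-space sketch of the demotion path (not actual kernel
 * code). Candidate selection, the demotion list, and the migrate/reclaim
 * helpers are stand-ins for kernel machinery.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_INACTIVE 4

struct page {
    int  id;
    bool anon;       /* anonymous or file-backed */
};

/* Stand-in for page migration: pretend the CXL-node is nearly full,
 * so migration fails for every other page (id 1 and 3). */
static bool migrate_to_cxl(const struct page *p)
{
    return (p->id % 2) == 0;
}

/* Stand-in for the default reclamation path (swap out / drop page). */
static void reclaim_page(const struct page *p)
{
    printf("page %d: migration failed, fall back to %s\n",
           p->id, p->anon ? "swap-out" : "page-cache reclaim");
}

int main(void)
{
    /* Inactive LRU of the local node: both anon and file pages are
     * scanned, since demoted pages stay in (CXL) memory. */
    struct page inactive[NR_INACTIVE] = {
        { .id = 0, .anon = true  },
        { .id = 1, .anon = true  },
        { .id = 2, .anon = false },
        { .id = 3, .anon = false },
    };

    /* Demotion list: reclamation candidates are migrated, not swapped. */
    for (int i = 0; i < NR_INACTIVE; i++) {
        if (migrate_to_cxl(&inactive[i]))
            printf("page %d: demoted to CXL-node\n", inactive[i].id);
        else
            reclaim_page(&inactive[i]);
    }
    return 0;
}
```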

  • Since allocation on the CXL-node is not performance-critical, CXL-nodes use the default reclamation mechanism (e.g., paging out to the swap device)

  • With multiple CXL-nodes, the demotion target is chosen based on node distance from the CPU -> this simple static, distance-based mechanism turns out to be effective (see the sketch below)
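
A small sketch of that distance-based choice: pick the CXL node with the smallest NUMA distance from the node being demoted from. The distance table and node layout below are made up for illustration (on a real system, distances are visible under /sys/devices/system/node/node*/distance).

```c
/*
 * Sketch of the distance-based demotion-target choice: among the
 * available CXL nodes, pick the one closest (by NUMA distance) to the
 * node we are demoting from. The table below is illustrative only.
 */
#include <stdio.h>

#define NR_NODES 3   /* node 0: local DRAM; nodes 1 and 2: CXL-Memory */

static const int node_distance[NR_NODES][NR_NODES] = {
    { 10, 20, 30 },
    { 20, 10, 40 },
    { 30, 40, 10 },
};

static const int cxl_nodes[]  = { 1, 2 };
static const int nr_cxl_nodes = 2;

/* Choose the CXL node closest to the demoting node. */
static int pick_demotion_target(int from_node)
{
    int best = -1;
    int best_dist = 1 << 30;

    for (int i = 0; i < nr_cxl_nodes; i++) {
        int d = node_distance[from_node][cxl_nodes[i]];
        if (d < best_dist) {
            best_dist = d;
            best = cxl_nodes[i];
        }
    }
    return best;
}

int main(void)
{
    printf("demotion target for node 0: node %d\n", pick_demotion_target(0));
    return 0;
}
```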
