Memory Type Based Allocation

Introduction

This specification describes the design for a Linux kernel memory
manager that can locate a program’s executable code and data in
different physical memory devices.

Purpose of Feature

Embedded systems can use this feature to locate a program’s text and
data segments in specific memory devices. Shared library text and data
segments can also be targeted to specific memory devices. For instance,
frequently executed code, such as glibc or “ls”, could be located
entirely in a single specified memory device or a set of memory devices.
Glibc text/data could be targeted to a fast static RAM bank for
instance, while other less frequently referenced libraries and programs
could be located in slower DRAM.

Feature Requirements

  1. All of a program’s segments must be locatable in specified memory
    devices: text, initialized data (data), uninitialized data (bss),
    heap (brk), and stack.
  2. The loadable segments of shared libraries (text and initialized
    data) must be locatable in specified memory devices.
  3. The ELF binaries of programs and shared libraries must contain
    memory device information for each of the binaries’ loadable
    segments (text and initialized data). This must be in the form of
    mnemonic strings. For instance: “SRAM”, “SDRAM”, etc.
  4. A tool will be provided to mark the ELF binaries with memory device
    information for each of the loadable segments.
  5. A kernel API must be provided for kernel code (such as device
    drivers) to allocate whole page frames from specified memory
    devices.
  6. A kernel API must be provided for kernel code to allocate memory
    using the slab allocator (kmalloc()) from specified memory devices.
  7. A user-level API must be provided for user programs to create
    mappings, using the mmap() system call, that will allocate page
    frames for the mapping in specified memory devices.
  8. A /proc filesystem interface must be provided that prints the
    kernel’s node configuration.

High Level Design

Memory devices in Memory Type Based Allocation (MTA) are based on
discontiguous memory support. Traditionally, discontiguous memory is
meant for platforms whose system memory is not contiguous in the
physical memory map. Discontiguous memory in Linux in turn is based on
Non-Uniform Memory Access (NUMA) nodes. Each discontiguous memory bank
is represented by a NUMA node. Therefore in MTA memory devices are also
synonymous with NUMA nodes. Note: executing a user program directly out
of ROM, such as flash, requires a totally different approach from the
one described here.

To understand MTA, it’s best to first describe the memory device type
information contained in the ELF binary of programs and shared
libraries. Then we describe how memory nodes are configured in the
kernel. We then follow the path and morphing of the memory type data
from its source (the ELF binary) until it reaches the lowest level: when
it is used to allocate a page frame for the process during a page fault
exception.

Memory Type Information in ELF Binaries and the Elfmemtypes Utility

In MTA, memory device type information is added to the ELF binaries of
programs and shared libraries using the elfmemtypes utility. This
information is then passed down to the mmap() and brk() calls to create
new memory regions for the process. The elfmemtypes tool adds memory
type information by adding a new NOTE section with the name “.memtypes”
to the ELF binary. It does this by forking and running objcopy as
follows:

  objcopy --add-section .memtypes=[temp binary file] [ELF file]

The memory type mnemonic strings specified to the tool are copied to a
temporary file, and that file is passed to objcopy, which copies the
temporary file’s contents to the new .memtypes section. Currently, the
elfmemtypes tool allows specifying memory types for the text segment and
data segment. The text segment includes code and read-only data
sections, and therefore all these sections will be allocated to the
memory types specified for text. Likewise, the data segment includes
initialized data (data) and uninitialized data (bss), so all these
sections will be allocated to the memory types specified for data. Also,
although there are no heap (brk) and stack
sections defined for ELF binaries, heap
and stack regions for the new process currently use the memory types
specified for data. A future enhancement will be to allow data, bss,
brk, and stack regions to have their own memory types. To mark an ELF
binary, the command line arguments to the tool are as follows:

  elfmemtypes [ELF file] [{text|data} [space-separated list of mnemonics]]

An example command line might be:

  elfmemtypes /bin/bash text SRAM SDRAM0 ANY data SDRAM1

In the example, /bin/bash is marked so that its text segment will have
physical memory allocated to it from the memory node named SRAM. If
allocation from SRAM fails, allocate from SDRAM0. If allocation from
SDRAM0 fails, allocate from any available node. Finally, /bin/bash is
marked so that its data segment only allows allocation of physical
memory from the memory node named SDRAM1. A more detailed description of
the algorithm for allocating physical pages using the above memory node
lists is discussed later. Note that the mnemonics ANY, any, text, and
data are reserved names, i.e. they cannot be used for memory type
mnemonic names. If a .memtypes NOTE section already exists in the ELF
file, the memory types specified in the section will be left undisturbed
unless they are overridden on the command line. For example, if the
existing .memtypes NOTE section lists memory types for both text and
data, but the command line specifies only data memory types, the
existing text memtypes will be left unchanged, but the data memtypes
will be modified. The elfmemtypes tool can also be used to display the
current memory type information in an ELF file, or to clear all memory
type information from the file. The command line for such cases is as
follows:

  elfmemtypes [ELF file] [{show|clear}]

(or just elfmemtypes [ELF file] to display the current memory type
information). When clearing an ELF file, elfmemtypes simply removes the
.memtypes NOTE section by forking and running objcopy like so:

  objcopy --remove-section=.memtypes [ELF file]

Note that a non-MTA configured kernel or non-MTA aware ld.so can still
load ELF executables and shared libraries that contain a .memtypes NOTE
section, since this section will just be ignored. Note also that
elfmemtypes does not check whether a memory type name corresponds to any
kernel node names. This is because the tool is meant to be a cross tool
as well as a native tool. As a cross tool, elfmemtypes has no way of
knowing the node names of the target kernel. See the
“load_elf_binary()” section below to see how the kernel handles
unknown memory mnemonics in the .memtypes NOTE section. As a native
tool, it is possible for elfmemtypes to compare memory mnemonic names
with kernel node names by reading /proc/nodeinfo (described later), and
this could be a future enhancement. The structure of the new .memtypes
NOTE section in the ELF file added by the tool is shown below:

  typedef struct elf32_memtypes_note {
      Elf32_Nhdr nh;
      char note_name[16];
      Elf32_Word num_text_strings;
      Elf32_Word text_string_size;
      Elf32_Word num_data_strings;
      Elf32_Word data_string_size;
      char memtype_strings[0];
  } Elf32_MemTypesNote;

The nh member contains the NOTE header, note_name is the name of the
NOTE (“memtypes”), and the rest specify the number and total size of the
text and data mnemonic strings. The member memtype_strings then marks
the start of the null-terminated mnemonic strings, beginning with text.
The data strings immediately follow the text strings, so the data
strings begin at &memtype_strings[text_string_size].
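
As an illustration (this helper is not part of MTA itself), a
user-space reader could walk the packed strings of an already-loaded
note as follows, assuming the Elf32_MemTypesNote definition above:

  #include <elf.h>
  #include <stdio.h>
  #include <string.h>

  /* Illustrative only: print the text and data mnemonics of a note
   * that has already been read into memory and validated. */
  static void print_memtypes(Elf32_MemTypesNote *note)
  {
      char *s = note->memtype_strings;    /* text strings start here */
      Elf32_Word i;

      for (i = 0; i < note->num_text_strings; i++) {
          printf("text: %s\n", s);
          s += strlen(s) + 1;             /* step past the null */
      }
      /* data strings follow the text strings */
      s = note->memtype_strings + note->text_string_size;
      for (i = 0; i < note->num_data_strings; i++) {
          printf("data: %s\n", s);
          s += strlen(s) + 1;
      }
  }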

The MTA Config File and the Mtaconfig Script

The MTA config file (and its associated parsing script mtaconfig) is
used for two purposes: defining nodes for building an MTA-enabled
kernel, and marking ELF binaries with memory types for text/data. The
MTA config file syntax defines two keywords for these purposes.

define_node keyword

To define nodes for configuring a kernel, use the following MTA config
file line:

  define_node [name] [start physaddr] [end physaddr] [0|1]
  • name is the mnemonic name for the node.
  • start physaddr is the starting physical address for the node, in
    hex.
  • end physaddr is the end physical address for the node, in hex.
  • 0|1 is a flag, 1 means allow allocation from this node when no list
    of nodes to allocate from is provided to the kernel page allocator.
    This flag is described in more detail later.

An example line in the config file might be:

  define_node SRAM 20000000 2002E000 0

which defines a node named SRAM located between physical addresses
0x20000000 and 0x2002E000, and disallows default page allocation from
this node.

Node ID numbers are assigned in the order the define_node keywords
appear in the config file. So if the above line was the first
define_node line in the file, SRAM would be assigned node ID 0.

The mtaconfig script will output a C header file that can be used when
compiling the kernel. For this purpose it is called as follows:

  mtaconfig [MTA config file] makehdr

This command is used by the kernel Makefiles when configuring an MTA
kernel. If the makehdr argument is not specified, define_node keywords
in the config file are ignored and no header file is produced.

The content of the C header file produced by mtaconfig is an array of
structures containing the same information as the define_node lines in
the MTA config file. Each entry in the array is of type struct
mta_node, and is defined as follows:

  struct mta_node {
      char * name;
      unsigned long start;
      unsigned long end;
      int allow_def_page_alloc;
  };

A macro in the generated header file called INSTANTIATE_MTA_NODES will
instantiate the mta_nodes[] array. This is done in mm/numa.c in the
kernel source.
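
For illustration, given the define_node line shown above plus a second,
invented SDRAM0 node, the generated header might look roughly like this
(the real generated file's formatting may differ):

  /* Hypothetical mtaconfig output for two nodes; the SDRAM0 entry and
   * its addresses are invented for this example. */
  #define INSTANTIATE_MTA_NODES                                     \
      struct mta_node mta_nodes[] = {                               \
          { "SRAM",   0x20000000, 0x2002E000, 0 }, /* node ID 0 */  \
          { "SDRAM0", 0x10000000, 0x14000000, 1 }, /* node ID 1 */  \
      };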

tag_elf keyword

The second use for the MTA config file is to mark ELF binaries in a
target file system with memory type information. This is simply a
convenience: it allows a file system’s memtypes configuration to be
described in a single location, instead of having to invoke the
elfmemtypes tool many times to configure the file system.

Use the following line to mark an ELF binary with memory type info:

  tag_elf [ELF file path] [{text|data} [comma-separated list of mnemonics]]

Notice that the command line is almost identical to the elfmemtypes tool
command line, except that the memtypes list is comma-separated rather
than space-separated. Also, the text and data lists can be placed on
separate lines. An example config file entry might be:

  tag_elf /target_root/bin/bash
      text SRAM,SDRAM0,any
      data SDRAM1

The command line to the mtaconfig script to process the tag_elf lines
is as follows:

  mtaconfig [MTA config file] tag

The script will call the elfmemtypes tool once for every tag_elf line
found in the config file. Unlike the elfmemtypes tool, mtaconfig can
check if the memory type names correspond to any kernel node names,
because the node names are listed in the MTA config file itself. If any
memory names listed on the tag_elf line have not been defined in a
define_node line up to this point in the config file, mtaconfig prints
an error message and skips tagging the ELF file. Finally, a file
system’s memtype information can be completely cleared out with the
following command line:

  mtaconfig [MTA config file] clear

The script will call the elfmemtypes tool with the clear argument once
for every tag_elf line found in the config file.

load_elf_binary()

The function load_elf_binary() is an implementation of the
load_binary() method of the linux_binfmt object, for ELF binaries. It
is called by do_execve() when loading a new program for execution.

The job of load_elf_binary() is to read the executable file’s program
headers and to pass the segment info to do_mmap() for every loadable
segment program header found; do_mmap() then actually creates the
file-mapped regions. Loadable program headers are of type PT_LOAD.

For MTA, load_elf_binary() also locates and reads the .memtypes NOTE
section containing the memory types list. It then converts the mnemonic
names to node IDs and passes that information to the new functions
do_mmap_nodelist() and do_brk_nodelist(). The node IDs are inserted
into a structure of type struct node_list (described later), and a
pointer to the structure is passed to do_mmap_nodelist() and
do_brk_nodelist().

If any of the mnemonic names listed in the .memtypes NOTE section do not
match any of the kernel’s node names, the node list is disabled for that
segment (text or data). That is, the text/data memory region will not
have node preferences, and will have pages allocated for that region
from any available node.

load_elf_interp()

Load_elf_interp() is called by load_elf_binary() when the latter
function discovers a program header of type PT_INTERP. This header
describes the interpreter program that is to be used to dynamically load
the shared libraries that the program requires.

It’s the job of load_elf_interp() to load the segments of the
interpreter itself, so that when the program begins executing, the
interpreter is actually the first code to execute.

For MTA, load_elf_interp() locates and reads the NOTE section
containing the memory types list from the interpreter binary, converts
the list to node IDs, and passes that information to
do_mmap_nodelist() and do_brk_nodelist(). Just like
load_elf_binary(), the node info is inserted into a structure of type
struct node_list (described later).

The Program Interpreter (ld.so)

Ld.so is actually the first piece of code to execute when a new program
runs. Ld.so runs in user space, and its job is similar to that of
load_elf_interp(): it loads (maps) the text, data, and bss segments of
every shared object listed in the main program.

For MTA, ld.so reads the NOTE section containing the memory types list
of every shared object binary, and passes that information to a new
mmap_memtypes() system call. The memory types list passed to
mmap_memtypes() is a buffer holding the null-terminated memory type
mnemonic strings. The mmap_memtypes() system call is described in more
detail later.

Because ld.so is part of glibc, a new version of glibc is required to
load shared objects in the correct nodes.

memtypes_to_nodelist()

The method that converts memory type mnemonics to a node list is
memtypes_to_nodelist(), which has the following interface:

  void memtypes_to_nodelist(struct node_list * nl, char * names, int size);

The names argument is a pointer to a buffer holding a packed list of
null-terminated mnemonic strings. That is, each null-terminated string
starts immediately after the previous string’s null-termination
character in the buffer. The size argument is the total size of the
buffer in bytes, including the null characters. The buffer must be a
kernel buffer; it cannot be a user-space buffer. If any of the names in
the buffer do not match any of the kernel’s node names, the node list is
disabled by setting nl->depth to zero (see the next section).
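
A minimal sketch of such a conversion is shown below; it is not the
actual kernel code, and node_id_by_name() is a hypothetical helper that
returns the node ID for a mnemonic, or -1 if no configured node matches
(handling of the reserved "ANY" mnemonic is omitted):

  /* Illustrative sketch: build a node list from a packed buffer of
   * null-terminated mnemonic strings. */
  void memtypes_to_nodelist(struct node_list *nl, char *names, int size)
  {
      char *s = names;
      char *end = names + size;

      nl->depth = 0;
      while (s < end && nl->depth < MAX_NR_NODES) {
          int nid = node_id_by_name(s);   /* hypothetical helper */

          if (nid < 0) {
              nl->depth = 0;              /* unknown name: disable list */
              return;
          }
          nl->nid[nl->depth++] = nid;
          s += strlen(s) + 1;             /* step past the terminator */
      }
  }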

The node_list Object

The struct node_list object is defined as follows:

  struct node_list {
      unsigned int nid[MAX_NR_NODES]; /* IDs of nodes to alloc pages from,
                                         in order of preference */
      unsigned int depth;             /* number of entries in above list */
  };

The number of entries in the node list is limited to MAX_NR_NODES,
which is the maximum number of nodes a system could contain, currently
set at 16. Therefore depth must not exceed MAX_NR_NODES. A depth of
zero is valid, meaning the node list is empty or disabled.

In addition, each entry in nid[] must be a valid node ID, i.e. it must
be in the range 0 to numnodes-1, where numnodes is the number of nodes
in the system.

The following method checks these conditions, and returns -EINVAL if any
are false:

  int check_nodelist(struct node_list * nl);

All of the kernel methods that take a node list as input (such as
do_mmap_nodelist() and do_brk_nodelist()) call check_nodelist() to
verify that the node list is valid. The section “Kernel APIs” below
describes how each method behaves when given an invalid node list.
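
A minimal sketch of the check, following the rules above (numnodes is
the kernel's count of configured nodes; the real implementation may
differ):

  /* Illustrative sketch: validate a node list per the rules above. */
  int check_nodelist(struct node_list *nl)
  {
      unsigned int i;

      if (nl->depth > MAX_NR_NODES)   /* too many entries */
          return -EINVAL;
      for (i = 0; i < nl->depth; i++)
          if (nl->nid[i] >= numnodes) /* IDs must be 0..numnodes-1 */
              return -EINVAL;
      return 0;                       /* valid; depth 0 is allowed */
  }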

do_mmap_nodelist() and do_brk_nodelist()

Load_elf_binary(), load_elf_interp(), and ld.so convert the
.memtypes NOTE section from the ELF binary into a node list via
memtypes_to_nodelist(), and pass the resultant struct node_list
object to the new methods do_mmap_nodelist() and do_brk_nodelist().
From this point on in the data flow of memory type information, the
memory types are in the form of node IDs rather than mnemonic strings.

do_mmap_nodelist() and do_brk_nodelist() have the same arguments as
the original do_mmap() and do_brk(), with the addition of the struct
node_list pointer.

The primary job of do_mmap_nodelist() and do_brk_nodelist() is to
instantiate a new memory region descriptor for the requested range of
program addresses. In Linux the memory region descriptor is an object
of type struct vm_area_struct, and is commonly referred to as a “VMA”
(Virtual Memory Area).

In MTA, the node list information is added to the VMA with a struct
vm_node_list vm_nodes member. The struct vm_node_list object
contains a node list as well as information important to the VMA, and is
defined as follows:

  struct vm_node_list {
      struct node_list nl;   /* the node list */
      unsigned long pgstart; /* if this node info belongs to a file mapping,
                                the start page offset in the file */
      unsigned long pgend;   /* and end page offset */
      unsigned long flags;   /* unused */
  };

Two struct node_list objects are also added to a process’ memory map
descriptor (struct mm_struct), one each for the process’ text and data
regions (member names text_nodes and data_nodes in struct mm_struct).

After the new VMA is instantiated, do_mmap_nodelist() and
do_brk_nodelist() copy the passed struct node_list object to the VMA,
but only if the node list is valid as indicated by check_nodelist()
(see “Kernel APIs” below).

If the passed struct node_list pointer is null, or the list is empty
(depth is zero), do_mmap_nodelist() and do_brk_nodelist() check to
see if text_nodes or data_nodes in the calling process’ struct
mm_struct are enabled (depth is non-zero). If so, do_mmap_nodelist()
and do_brk_nodelist() copy to the VMA either text_nodes or
data_nodes depending on whether the region being mapped is text or
data. This ensures that, even if the caller doesn’t pass a node list,
the new region will still use any node preferences listed by the
executable.
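
The selection logic just described might be sketched as follows
(illustrative only; vma_set_nodelist() is a hypothetical helper name,
and the check_nodelist() validation is omitted for brevity):

  /* Illustrative sketch of node-list selection when creating a VMA:
   * prefer the caller's list, otherwise inherit the per-process text
   * or data list from the mm_struct. */
  static void vma_set_nodelist(struct vm_area_struct *vma,
                               struct mm_struct *mm,
                               struct node_list *nl, int is_text)
  {
      if (nl && nl->depth)
          vma->vm_nodes.nl = *nl;            /* caller-supplied list */
      else if (is_text && mm->text_nodes.depth)
          vma->vm_nodes.nl = mm->text_nodes; /* executable's text prefs */
      else if (!is_text && mm->data_nodes.depth)
          vma->vm_nodes.nl = mm->data_nodes; /* executable's data prefs */
      else
          vma->vm_nodes.nl.depth = 0;        /* no node preference */
  }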

With the creation of the VMA, the program is now allowed to reference
addresses within the memory region described by the VMA. However, no
actual page frames for the region are available yet. The job of
allocating page frames for the program’s memory region goes to the Page
Fault Exception Handler. This is part of Linux’s demand paging
mechanism: memory pages are allocated to the program only as they are
needed (referenced) by the program.

The important point here, however, is that the memory regions contain
the node IDs needed by the page fault handler, so that it can allocate
pages from the correct nodes for the region. This is described later.

setup_arg_pages()

Setup_arg_pages() is called by load_elf_binary() to create the
memory region for the program’s stack, which holds the program stack as
well as the argument strings and inherited environment variables of the
program. When setup_arg_pages() instantiates the new VMA for the stack
region, it simply copies the struct node_list data_nodes from the
memory descriptor to the new VMA.

However, there is one small glitch. Before load_elf_binary() was even
called, in do_execve(), pages were already allocated for the argument
and environment strings. These pages were allocated using the default
node round-robin approach (because no node info was known at that time),
so the pages almost certainly were not allocated from the correct node
for the stack region. Therefore setup_arg_pages() needs to allocate a
new page in the correct node for every page already allocated, copy the
page contents from the old to new page, and then release the old page.
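
A rough sketch of that replacement loop follows (illustrative only;
arg_pages and nr_arg_pages are hypothetical names for the pages that
do_execve() already allocated):

  /* Illustrative sketch: migrate already-allocated argument/environment
   * pages into the stack region's preferred nodes. */
  int i;

  for (i = 0; i < nr_arg_pages; i++) {
      struct page *old = arg_pages[i];
      struct page *new;

      if (!old)
          continue;
      new = alloc_pages_nodelist(&mm->data_nodes, GFP_USER, 0);
      if (!new)
          break;  /* keep the old page if no preferred page is free */
      copy_page(page_address(new), page_address(old));
      arg_pages[i] = new;     /* the region now points at the new page */
      __free_page(old);       /* release the page from the wrong node */
  }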

Page Fault Exception Handler

When the program references a valid address within one of the program’s
memory regions, a page fault exception occurs if the address is not yet
listed in any of the process’ page tables. The page fault exception
handler allocates a page for the faulting region and creates the page
table entry that points to the new page.

For MTA, the exception handler will allocate the page from the correct
node as described in the faulting region’s vm_node_list object. This
includes allocating pages in all of the following situations: anonymous
mappings, private and shared file mappings, and copy-on-write pages for
private mappings.

Allocating Pages

At the lowest level of page allocation, the buddy system page allocator
__alloc_pages() is passed a node descriptor pointer of type
pg_data_t. This descriptor contains information related to the NUMA
node, such as the number of “memory zones” contained in the node, a
pointer to the start of the node’s struct page list of pages, the
start physical address of the node memory, and the node ID.

__alloc_pages() is used by both the standard/default page allocator
_alloc_pages() and by the MTA page allocator
alloc_pages_nodelist(). Internally, __alloc_pages() attempts to
allocate pages atomically (without blocking the calling process). If
that fails and the __GFP_WAIT bit is set in gfp_mask, it
“rebalances” the memory zones within the node and attempts the
allocation again. If that fails, it blocks the calling process and
yields to the kswapd daemon. When __alloc_pages() returns from
kswapd, it returns NULL to allow either _alloc_pages() or
alloc_pages_nodelist() to try again with a different node (the default
non-MTA behavior is to attempt allocation from the same node again in
an endless alloc-kswapd loop until it succeeds).

Default Page Allocator

The default page allocator is _alloc_pages(). It attempts to allocate
from any available node in a round-robin manner. This method has been
slightly modified for MTA. Each configured node in MTA includes a flag
specifying whether _alloc_pages() is allowed to allocate pages from
that node. This flag can thus be used to reserve an entire node only for
MTA allocation. An example use might be a node containing a very small
number of physical pages. Reserving the node for MTA allocation
guarantees that it will only be used to allocate pages for process
memory regions that specify node lists, or for any caller of
alloc_pages_nodelist() (described next).

Allocating Pages With a Node List

A wrapper function around __alloc_pages() is provided, called
alloc_pages_node(), which takes as an argument the node ID of the
memory node to allocate the pages from. Its interface is:

  struct page * alloc_pages_node(int nid, unsigned int gfp_mask,
                                 unsigned int order);

Alloc_pages_node() in turn is used by the MTA allocator,
alloc_pages_nodelist(). It is this latter method that the page fault
exception handler uses to allocate pages using the node information
described by the struct vm_node_list object in the faulting VMA. Its
prototype is:

  struct page * alloc_pages_nodelist(struct node_list * nl, int gfp_mask,
                                     unsigned int order);

The function is written so that plenty of opportunity is given for
allocation from the first-choice node (nl->nid[0]) to succeed if the
gfp_mask includes the __GFP_WAIT flag. Note that if
__alloc_pages() returns NULL when the __GFP_WAIT flag is set, it
means kswapd was allowed to run, and therefore pages may have become
free in the first-choice node, so we should try again.

Alloc_pages_nodelist() accomplishes this behavior with an outer and
inner loop (see the flow chart below for an illustration of the
algorithm). The outer loop increments from zero to nl->depth, and the
inner loop increments from zero to the current outer loop index. The
inner loop attempts to allocate a page from nl->nid[j], where j is the
inner loop index. The function returns on the first successful page
allocation. As described above, the underlying buddy system allocator,
__alloc_pages(), will first attempt atomic allocation from the node;
if that fails, it will yield to kswapd to free up pages, and then
return NULL back to alloc_pages_nodelist().

As an example, suppose we have a node ID list containing {3,1}
(nl->depth is 2), and the __GFP_WAIT flag is set in gfp_mask.
Assuming alloc_pages_nodelist() ultimately fails, it will attempt
allocation from the nodes in the following order: 3 3 1 3 1. In other
words:

  1. kswapd runs after allocation from 1st choice node 3 fails.
  2. retry node 3 - fails again (kswapd runs again).
  3. try alloc from node 1 (2nd choice node) - fails (kswapd runs).
  4. retry first choice node 3 - fails again (kswapd runs).
  5. retry node 1 - fails again; give up (return NULL).

It is also possible to attempt allocation from the first choice node
many times by repeating the node in the node list. For example, with a
node ID list containing {3,3,1}, alloc_pages_nodelist() attempts
allocation from the nodes in the following order before finally failing:
3 3 3 3 3 1 3 3 1.

Note that if the __GFP_WAIT flag is not set, the inner loop is
collapsed, and each node in the list is tried in sequence with no
retries. So given the node list {3,3,1} from the example above,
alloc_pages_nodelist() attempts allocation from the nodes in the
following order before finally failing: 3 3 1.
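
Putting this together, the retry algorithm might be sketched as
follows. This is an illustrative reconstruction from the description
and examples above, not the actual source; it also folds in the
invalid-ID handling described under “Kernel APIs” below:

  /* Illustrative sketch of the alloc_pages_nodelist() retry algorithm.
   * "numnodes" is the kernel's count of configured nodes; an empty
   * list is treated as a failure here for simplicity. */
  struct page *alloc_pages_nodelist(struct node_list *nl, int gfp_mask,
                                    unsigned int order)
  {
      unsigned int i, j, limit;
      struct page *page;

      if (!nl || nl->depth == 0 || nl->depth > MAX_NR_NODES)
          return NULL;                   /* fail fast on a bogus list */

      if (!(gfp_mask & __GFP_WAIT)) {
          /* No retries allowed: try each node once, in order. */
          for (j = 0; j < nl->depth; j++) {
              if (nl->nid[j] >= numnodes)
                  continue;              /* skip invalid IDs */
              page = alloc_pages_node(nl->nid[j], gfp_mask, order);
              if (page)
                  return page;
          }
          return NULL;
      }

      /* __GFP_WAIT set: each failure lets kswapd run, so retry the
       * higher-preference nodes before trying the next node in the
       * list. The outer loop runs depth+1 times, giving the attempt
       * orders shown in the examples above. */
      for (i = 0; i <= nl->depth; i++) {
          limit = (i < nl->depth) ? i : nl->depth - 1;
          for (j = 0; j <= limit; j++) {
              if (nl->nid[j] >= numnodes)
                  continue;              /* skip invalid IDs */
              page = alloc_pages_node(nl->nid[j], gfp_mask, order);
              if (page)
                  return page;
          }
      }
      return NULL;
  }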

Kernel APIs

Allocating Whole Pages, alloc_pages_nodelist()

Device drivers or other kernel code that wish to allocate whole memory
pages from a specific node can call alloc_pages_nodelist() directly.
If the caller has a list of mnemonic strings, it must first convert the
strings to a node list with memtypes_to_nodelist() before calling
alloc_pages_nodelist().

For the sake of speed in allocating pages during page faults,
alloc_pages_nodelist() does not call check_nodelist() to check the
validity of the passed node list. Instead, it does the following (refer
to the flow chart above):

  • if depth is greater than MAX_NR_NODES, fail immediately (return
    NULL).
  • in the inner loop, if the current node ID in the list is invalid,
    skip this entry and move on to the next ID in the list.

Note that the passed node list will never be invalid if
alloc_pages_nodelist() was called as a result of a page fault or a
slab allocation, because kmalloc_nodelist(), do_mmap_nodelist(), and
do_brk_nodelist() all check the validity of the list beforehand.

Slab Allocator, kmalloc_nodelist()

Device drivers or other kernel code that wish to allocate memory of
arbitrary size from a specific node can make use of a new interface to
the slab allocator, kmalloc_nodelist(), which takes as an extra
argument a pointer to a struct node_list object. Its prototype is as
follows:

  void * kmalloc_nodelist (struct node_list * nl, size_t size, int flags);

There is also a new slab interface that allows creation of a new cache
that includes a node list:

  kmem_cache_t * kmem_cache_create_nodelist (struct node_list * nl,
      const char *name, size_t size, size_t offset, unsigned long flags,
      void (*ctor)(void*, kmem_cache_t *, unsigned long),
      void (*dtor)(void*, kmem_cache_t *, unsigned long));

The new cache can then be used when allocating objects by passing it to
kmem_cache_alloc(). The new objects will be allocated from the nodes
listed in the cache object’s node list.

Both of these new methods perform the following checks on the passed
node list:

  • if the node list pointer is NULL, or the list is empty, the new slab
    object or cache will not have any node preference.
  • if the node list is invalid as indicated by check_nodelist(), both
    methods fail, returning NULL.
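
A driver might use these interfaces roughly as follows (illustrative
sketch; the node names are taken from the earlier examples):

  /* Illustrative driver usage: allocate memory preferably from SRAM,
   * falling back to SDRAM0. The packed buffer holds the two
   * null-terminated names back to back; sizeof() includes both
   * terminating nulls. */
  static void *alloc_from_sram(size_t size)
  {
      static char memtypes[] = "SRAM\0SDRAM0";
      struct node_list nl;

      memtypes_to_nodelist(&nl, memtypes, sizeof(memtypes));
      return kmalloc_nodelist(&nl, size, GFP_KERNEL);
  }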

do_mmap_nodelist() and do_brk_nodelist()

Kernel code that wishes to create new mappings for a process can call
do_mmap_nodelist() or do_brk_nodelist() directly. The current
prototypes are identical to the original do_mmap() and do_brk(), with
the addition of a node_list pointer as the last argument.

If the passed node_list pointer is non-NULL and enabled (depth is
non-zero), but the list is invalid as indicated by check_nodelist(),
the mapping fails, and both methods return -EINVAL.

User APIs

mmap_memtypes() and brk_memtypes()

These new system calls are implemented to allow creating memory maps
from user space with node information. They essentially provide
user-level access to the kernel methods do_mmap_nodelist() and
do_brk_nodelist(). The prototypes are the same as those of the current
system calls, with two additional arguments:

  void * mmap_memtypes(void *start, size_t length, int prot,
                       int flags, int fd, off_t offset, char * memtypes,
                       int memtypes_len);
  int brk_memtypes(void *end_data_segment, char * memtypes,
                   int memtypes_len);

The memtypes argument is a pointer to a user buffer holding a packed
list of null-terminated strings. The strings represent the memory type
mnemonics, and their order in the buffer is the order of node preference
for the region. The memtypes_len argument is the total size of the user
buffer in bytes.

Note that these new libc functions are not part of the POSIX standard;
applications that use them must be compiled with -D_GNU_SOURCE.
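
As an example, a user program might create an anonymous mapping that
prefers SRAM like this (illustrative; assumes an MTA-aware glibc
exposing these entry points):

  /* Illustrative user-space use of mmap_memtypes(); build with
   * -D_GNU_SOURCE against an MTA-aware glibc. */
  #include <sys/mman.h>
  #include <stdlib.h>

  int main(void)
  {
      /* Prefer SRAM, fall back to any node; sizeof() includes both
       * terminating nulls of the packed list. */
      static char memtypes[] = "SRAM\0ANY";
      void *p = mmap_memtypes(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0,
                              memtypes, sizeof(memtypes));

      if (p == MAP_FAILED)
          return EXIT_FAILURE;
      /* ... use the mapping ... */
      return EXIT_SUCCESS;
  }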

The new syscalls are also used by the dynamic linker (ld.so) in
MTA-aware glibc, to create the maps for a program’s shared libraries.
The following checks are made on the arguments passed to
mmap_memtypes() and brk_memtypes():

  • If the memtypes buffer pointer is NULL, or if memtypes_len is zero,
    the new mapping created will not have any node list preference, i.e.
    it will be as if the regular mmap() and brk() syscalls were used.
  • If the copy of the user buffer to kernel space fails (for instance
    the memtypes pointer is invalid), the mapping fails.
  • There is an upper limit of one page (4096 bytes) on the user buffer
    size. If memtypes_len is greater than PAGE_SIZE, the mapping
    fails.
  • If any of the memory type mnemonic names in the memtypes buffer do
    not match any of the kernel’s node names, the new mapping created
    will not have any node list preference.
  • The usual conditions exist on the remaining arguments (for instance,
    for a file mapping the file descriptor must refer to a valid open
    file).

/proc Interface

There are two new entries in the /proc file system.

/proc/nodeinfo

The first is /proc/nodeinfo, which lists the node configuration of the
kernel, including each configured node’s name, physical address range,
and whether default page allocation is allowed from it.
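
The exact output format is not reproduced in this document, but it
might look something like the following (hypothetical, using the
example nodes from earlier):

  node  name    start       end         default_alloc
  0     SRAM    0x20000000  0x2002E000  no
  1     SDRAM0  0x10000000  0x14000000  yes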

/proc/[pid]/nodemap

The second is an extension of the Memory Accounting tool. If the kernel
config option CONFIG_MEMORY_ACCOUNTING is enabled along with
CONFIG_MEMTYPE_ALLOC, a new proc entry, /proc/[pid]/nodemap, will be
available. The information is similar to the Memory Accounting tool’s
/proc/[pid]/memmap, except that instead of displaying the page usage
counter for every resident page in each region, the node IDs of
resident pages are displayed. Pages of a region that are not yet
resident are shown with a dash character “-”.

In other words, for every line (region) printed by /proc/[pid]/maps,
/proc/[pid]/nodemap also prints a line, showing the node ID of resident
pages for that region.
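
For example, a region whose first three pages are resident in node 0,
with the remaining pages not yet faulted in, might appear as follows
(hypothetical output; the real format may differ):

  00008000-00010000  0 0 0 - - - - -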

Tracing MTA with Linux Trace Toolkit

Important MTA events are captured by the run-time creation of Linux
Trace Toolkit (LTT) custom events for MTA. The following events are
defined in include/linux/vmnode.h, and are called at the appropriate
locations in the kernel where the corresponding events occur:

  • TRACE_MTA_ELF_MEMTYPES - An ELF executable or ld.so was loaded
    containing a .memtypes NOTE section.
  • TRACE_MTA_MMAP_MEMTYPES - Entry to the mmap_memtypes system call
    with a non-empty memtypes buffer.
  • TRACE_MTA_BRK_MEMTYPES - Entry to the brk_memtypes system call with
    a non-empty memtypes buffer.
  • TRACE_MTA_MMAP_NODELIST - do_mmap_nodelist() was called with a
    valid node list.
  • TRACE_MTA_BRK_NODELIST - do_brk_nodelist() was called with a valid
    node list.
  • TRACE_MTA_KMALLOC_NODELIST - kmalloc_nodelist() was called with a
    valid node list.
  • TRACE_MTA_KMEM_CACHE_CREATE_NODELIST - kmem_cache_create_nodelist()
    was called with a valid node list.
  • TRACE_MTA_SLAB_ALLOC - A group of contiguous pages was allocated
    for a slab cache object containing a node list.
  • TRACE_MTA_VMA_ALLOC - A page was allocated for a copy-on-write,
    anonymous, or file mapping containing a node list.
  • TRACE_MTA_PAGE_CACHE_ALLOC - A page was allocated and placed in the
    page cache for a file mapping containing a node list.

With these events, it’s possible to trace MTA-related activity from the
time a program was loaded, to the creation of its memory map, down to
the allocation of memory pages for the program. The events can also
trace the creation of new slab caches containing node lists, down to
allocation of pages for the cache objects.

Additional Information

Porting MTA to other Architectures

At this time, only the ARM OMAP1510 Innovator platform has MTA support.
To port MTA to other architectures:

  • First of all, the architecture must support discontiguous memory.
  • Add the CONFIG_MEMTYPE_ALLOC option to arch/[arch]/config.in if
    CONFIG_DISCONTIGMEM is defined. See arch/arm/config.in for example.

  • Add system call entry points for sys_brk_memtypes() and
    old_mmap_memtypes() and define their syscall numbers. See
    arch/arm/kernel/calls.S and include/asm-arm/unistd.h for example.

  • Implement old_mmap_memtypes() (sys_brk_memtypes() is implemented
    in generic kernel code in mm/mmap.c). See arch/arm/kernel/sys_arm.c
    for example implementation.

  • Configure the system’s memory nodes using the start and end physical
    addresses of each node in the mta_nodes[] array. How discontiguous
    memory nodes are initially configured is very architecture specific.
    See include/asm-arm/arch-omap1510/memory.h,
    arch/arm/mach-omap1510/innovator.c, and arch/arm/mm/init.c for an
    example of how this is done for ARM and the Innovator platform.

Limitations

  • In ELF binaries, the first file page offset of the initialized data
    segment is usually the same file page offset as the last page of
    text (the end of text and start of data share the same page).
    Because of this, the same allocated page frame in the kernel’s page
    cache is shared between the last page of text and the first page of
    initialized data. Therefore, if the program references the last page
    of text after it references the first page of data (which is usually
    the case), the last page of the text region will be located in the
    node of the data region, not in the text’s node.

  • The Innovator’s SRAM is very small, and page allocations from SRAM
    will begin to fail very quickly. The text segment of ld.so happens
    to just barely fit in SRAM. Even then, the kernel will attempt to
    allocate a cluster of pages for a region instead of only one during
    a file mapping page fault, and if that many pages are not free in
    SRAM, the cluster allocation will fail.

Future Enhancements

  • Expand maximum allowable nodes beyond 16.
  • Allow separation of data/bss/brk/stack segments into different
    nodes.
  • For native elfmemtypes tool, check mnemonic names against
    /proc/nodeinfo.

Notes

  • Copyright 2002, 2003, 2004 Sony Corporation
  • Copyright 2002, 2003, 2004 Matsushita Electric Industrial Co., Ltd.
  • Copyright © 2002-2004 by MontaVista Software.

Source Code

linux-mta-041004.tar.bz2 is a kernel source archive including MTA (the
MTA changes still need to be isolated from the full tarball). The MTA
util and mta-glibc-2.2.5.patch are also available.
