Python, Linkers, and Virtual Memory by Brandon Rhodes
Presenter: Brandon Craig Rhodes (http://rhodesmill.org/brandon/) (@brandon_rhodes)
PyCon 2012 presentation page: https://us.pycon.org/2012/schedule/presentation/455/
Slides: http://rhodesmill.org/brandon/talks/2012-03-pycon/linkers-memory/
Video: http://pyvideo.org/video/717/python-linkers-and-virtual-memory
Video running time: 36:22
Imagine a computer with 1,000 bytes of addressable RAM...
(00:19 / 36:22 into the video)
- numbered 0 through 999, each storing a byte
Memory address | Byte stored |
999 | 128 |
998 | 255 |
997 | 254 |
996 | 128 |
... | ... |
3 | 2 |
2 | 0 |
1 | 104 |
0 | 77 |
Pages
We split memory into pages by grouping addresses that share a common prefix
(00:29 / 36:22 into the video)
If we define our
page size = 100 bytes
then our 1,000 bytes of main memory = 10 pages,
and page 3 would include all addresses in the 3xx range.
•
400
┌─399─┐
│ 398 │
│ 397 │
│ • │
│ • │ Page 3
│ • │
│ 302 │
│ 301 │
└─300─┘
299
•
Page tables
(00:50 / 36:22 into the video)
What is a page table?
┌─────┐
│ CPU │
└─────┘
↕
┌────────────┐
│ Page Table │
└────────────┘
↕
┌─────┐
│ RAM │
└─────┘
It sits between your CPU and the RAM on your machine,
and rewrites the leading digits of every processor memory access before the RAM chip sees the request.
Virtual Physical
┌──────────────┐
│ 9-- → 59-- │
│ 8-- → 63-- │
│ 7-- → - │
│ 6-- → 61-- │
│ 5-- → 60-- │
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- │
│ 0-- → 35-- │
└──────────────┘
The addresses that your processor asks for are called virtual.
The actual addresses delivered to RAM are called physical.
- “Read virtual byte 821” => “Read physical byte 6321”
- “Read virtual byte 190” => “Read physical byte 3690”
- “Write virtual byte 522” => “Write physical byte 6022”
- “Read virtual byte 700” => Page fault
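The translation scheme is easy to model in a few lines of Python. A toy sketch of the table above (real translation happens in hardware, of course):

# Toy model: map the leading digit of a 3-digit virtual address
# to the physical page prefix from the table above.
PAGE_TABLE = {9: 59, 8: 63, 6: 61, 5: 60, 1: 36, 0: 35}

def translate(virtual_address):
    page, offset = divmod(virtual_address, 100)
    if page not in PAGE_TABLE:
        raise LookupError('page fault at %d' % virtual_address)
    return PAGE_TABLE[page] * 100 + offset

print(translate(821))  # 6321
print(translate(190))  # 3690
translate(700)         # LookupError: page fault at 700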
Page Faults
(01:46 / 36:22 into the video)
Your program is paused and the OS regains control.
We will see that the OS has several options about how to respond.
Q: What is stored in the bytes of memory pages?
(01:56 / 36:22 into the video)
A: Everything your program needs
- Executable code
- Stack - variables
- Heap - data structures
(02:10) These resources tend to load gradually as your program runs.
(02:16) Imagine running your editor...
Which I will not name to avoid obvious religious wars... :-)
You say: “Run my editor!”
So the OS loads it into memory along with any libraries it requires.
┌──────────────┐
│ 9-- → - │
│ 8-- → - │
│ 7-- → - │
│ 6-- → - │
│ 5-- → - │
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- │ libcurses
│ 0-- → 35-- │ editor “binary” (program)
└──────────────┘
The editor’s main() function starts calling other functions.
(02:36) So the OS automatically starts allocating new stack pages as that call graph grows.
The stack usually starts at a high address and grows downward.
┌──────────────┐
│ 9-- → 59-- │ stack (“bottom” = oldest)
│ 8-- → 63-- │ stack (“top” = newest)
│ 7-- → - │
│ 6-- → - │
│ 5-- → - │
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- │ libcurses
│ 0-- → 35-- │ editor “binary” (program)
└──────────────┘
Notice the fairly random physical addresses, which can come from anywhere in RAM: the OS is free to back any virtual page with any free physical page.
(02:57) Then the editor needs space for data structures like lists and dictionaries.
These go in another growing memory area called the heap,
which, unlike the stack (whose variables go away when your function returns), holds persistent data structures.
┌──────────────┐
│ 9-- → 59-- │ stack (“bottom” = oldest)
│ 8-- → 63-- │ stack (“top” = newest)
│ 7-- → - │
│ 6-- → 61-- │ heap (file being edited)
│ 5-- → 60-- │ heap (settings)
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- │ libcurses
│ 0-- → 35-- │ editor “binary” (program)
└──────────────┘
Page Table benefits
(03:22 / 36:22 into the video)
What are the benefits of this level of indirection?
(one that has to be implemented in hardware and consulted every time your process touches main memory)
Security
- Processes can’t see each other’s pages
- Processes can’t see OS pages
- The OS wipes reallocated pages with 0s.
Stability
- Processes can’t overwrite each other.
- Processes can’t crash the OS.
- You can’t corrupt another process’s memory pages because they’re not even visible; they don’t exist to you if they’re not in the page table.
Protection
(04:22 / 36:22 into the video)
Real page tables include read and write bits.
In general you can write your own stack and heap, but only read executables and libraries.
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → 63-- w │ stack
│ 7-- → - │
│ 6-- → 61-- w │ heap
│ 5-- → 60-- w │ heap
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- r │ libcurses
│ 0-- → 35-- r │ binary
└────────────────┘
- Write bits set on the stack and heap
- Only the read bit set on the editor binary and the library.
Segmentation Faults
(04:54 / 36:22 into the video)
For an illegal read or write...
Instead of responding to your page fault with more memory, the OS will stop you dead with a Segmentation Fault.
Which basically means: a page fault that made the OS angry. :-)
(05:11 / 36:22 into the video)
“Read from 331”
┌────────────────┐
│ 9-- → 59-- w │
│ 8-- → 63-- w │
│ 7-- → - │
│ 6-- → 61-- w │
│ 5-- → 60-- w │
│ 4-- → - │
→ │ 3-- → - │ → SEGMENTATION FAULT
│ 2-- → - │
│ 1-- → 36-- r │
│ 0-- → 35-- r │
└────────────────┘
causes a segmentation fault because there’s no page there.
“Write to 101”
┌────────────────┐
│ 9-- → 59-- w │
│ 8-- → 63-- w │
│ 7-- → - │
│ 6-- → 61-- w │
│ 5-- → 60-- w │
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
→ │ 1-- → 36-- r │ → SEGMENTATION FAULT
│ 0-- → 35-- r │
└────────────────┘
causes a segmentation fault because we’re trying to write to a read-only page.
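The toy translator from earlier extends naturally to model these permission bits; a sketch, using the page table from the diagrams above:

# Toy model with the read/write bits from the diagrams above.
PAGE_TABLE = {
    9: (59, 'w'), 8: (63, 'w'),  # stack
    6: (61, 'w'), 5: (60, 'w'),  # heap
    1: (36, 'r'), 0: (35, 'r'),  # library and binary
}

def access(virtual_address, mode):  # mode is 'r' or 'w'
    page, offset = divmod(virtual_address, 100)
    if page not in PAGE_TABLE:
        raise MemoryError('segmentation fault: no page at %d' % virtual_address)
    physical, bits = PAGE_TABLE[page]
    if mode == 'w' and bits != 'w':
        raise MemoryError('segmentation fault: page %d is read-only' % page)
    return physical * 100 + offset

for addr, mode in [(331, 'r'), (101, 'w')]:
    try:
        access(addr, mode)
    except MemoryError as e:
        print(e)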
Sharing
(05:28 / 36:22 into the video)
Read-only pages give the OS a new superpower: Sharing!
Read-only binaries and libraries can be safely and securely shared among many processes!
The OS can reuse those pages in the page tables of multiple processes.
These two processes use different pages for heap, stack, and binary...
┌────────────────┐ ┌────────────────┐
│ 9-- → 59-- w │ stack │ 9-- → 48-- w │ stack
│ 8-- → 63-- w │ stack │ 8-- → - │
│ 7-- → - │ │ 7-- → 51-- w │ heap
│ 6-- → 61-- w │ heap │ 6-- → 50-- w │ heap
│ 5-- → 60-- w │ heap │ 5-- → 46-- w │ heap
│ 4-- → - │ │ 4-- → - │
│ 3-- → - │ │ 3-- → 39-- r │ libjson
│ 2-- → 39-- r │ libjson │ 2-- → 37-- r │ libX11
│ 1-- → 36-- r │ libc │ 1-- → 36-- r │ libc
│ 0-- → 35-- r │ python │ 0-- → 47-- r │ firefox
└────────────────┘ └────────────────┘
But they safely share a single copy of libc and libjson
┌────────────────┐ ┌────────────────┐
│ │ │ │
│ │ │ │
│ │ │ │
│ │ │ │
│ │ │ │
│ │ │ │
│ │ │ 3-- → 39-- r │ libjson
│ 2-- → 39-- r │ libjson │ │
│ 1-- → 36-- r │ libc │ 1-- → 36-- r │ libc
│ │ │ │
└────────────────┘ └────────────────┘
When we share pages we don’t need to use up RAM storing things twice.
Physical RAM page 36 can be reused for every process needing libc
Similarly for page 39 and libjson: if you can’t write to a page, it looks just like your own personal copy.
Memory consumed doesn’t add up
(06:32 / 36:22 into the video)
This means that the total memory consumed by process A and process B is rarely the sum
(A’s memory use) + (B’s memory use)
Q: So how can you tell how much memory an additional worker thread or process will consume on one of your servers?
A: Stop looking at the quantity – “How much RAM does my Python program use?”
Start looking at the delta
Δ memory
Ask, “How much memory does each additional process / worker / thread cost (and when will that run me out of RAM)?”
How should you measure that?
By actual load and resource tests,
because, as we will see, memory usage is complicated.
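One hedged way to put that advice into practice on Linux: sample your process's resident set size before and after creating the extra workers, and divide. The workload below is a made-up stand-in for whatever one worker actually allocates:

import os

def rss_kb():
    # Resident set size of this process in kB (Linux-only).
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

def make_worker_data():  # stand-in for one worker's allocations
    return list(range(100000))

before = rss_kb()
data = [make_worker_data() for _ in range(10)]
after = rss_kb()
print('each extra worker costs roughly %d kB' % ((after - before) // 10))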
Problem: Python code
(07:37 / 36:22 into the video)
We hit a problem with this idea of sharing binary code.
Q: Where in virtual memory does Python code live?
Python starts running with most RAM pages shareable.
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → - │
│ 6-- → - │
│ 5-- → - │
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- r │ libc
│ 0-- → 35-- r │ python
└────────────────┘
But Python runs read() on foo.py, creating a writable, unshareable page on the heap
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → - │
│ 6-- → - │
│ 5-- → 60-- w │ foo.py ←
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- r │ libc
│ 0-- → 35-- r │ python
└────────────────┘
Then Python compiles foo.py to a code object on another writable, unshareable page
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → - │
│ 6-- → 61-- w │ codeobj ←
│ 5-- → 60-- w │ foo.py
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- r │ libc
│ 0-- → 35-- r │ python
└────────────────┘
The code object occupies another unshareable page.
Only then does foo.py start up and create legitimately unique program data
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → 63-- w │ data ←
│ 6-- → 61-- w │ codeobj
│ 5-- → 60-- w │ foo.py
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- r │ libc
│ 0-- → 35-- r │ python
└────────────────┘
The heap winds up as a mix of actual unique data, together with shared code
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → 63-- w │ data ←
│ 6-- → 61-- w │ codeobj
│ 5-- → 60-- w │ foo.py
│ 4-- → - │
│ 3-- → - │
│ 2-- → - │
│ 1-- → 36-- r │ libc
│ 0-- → 35-- r │ python
└────────────────┘
Now it just so happens that every Python X.Y process loading foo.py will create the exact same code object...
...but the OS does not know this, because each Python process builds its code objects separately.
To the OS, pages that get written in the middle of a process running look like heap, and don’t get shared.
┌────────────────┐ ┌────────────────┐
│ 9-- → 59-- w │ ≠ │ 9-- → 48-- w │ stack
│ 8-- → - │ │ 8-- → - │
│ 7-- → 63-- w │ ≠ │ 7-- → 71-- w │ data
│ 6-- → 61-- w │ ≠ │ 6-- → 69-- w │ codeobj
│ 5-- → 60-- w │ ≠ │ 5-- → 62-- w │ foo.py
│ 4-- → - │ │ 4-- → - │
│ 3-- → - │ │ 3-- → - │
│ 2-- → - │ │ 2-- → - │
│ 1-- → 36-- r │ │ 1-- → 36-- r │ libc
│ 0-- → 35-- r │ │ 0-- → 35-- r │ python
└────────────────┘ └────────────────┘
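You can watch the duplication in miniature without even forking: two compilations of the same source yield code objects that compare equal but are separately allocated, so their pages could never be shared. A small sketch:

source = 'def add(a, b):\n    return a + b\n'
code_a = compile(source, 'foo.py', 'exec')
code_b = compile(source, 'foo.py', 'exec')

print(code_a == code_b)  # True -- CPython compares code objects by value
print(code_a is code_b)  # False -- two separately allocated heap objects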
Dynamic Linking
(10:22 / 36:22 into the video)
How does the Python interpreter find the routines it needs in libc.so?
┌────────────────┐
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ 1-- → 36-- r │ libc.so
│ 0-- → 35-- r │ python
└────────────────┘
See also
- http://en.wikipedia.org/wiki/Dynamic_linking
- Wikipedia article on dynamic linking
- http://en.wikipedia.org/wiki/Shared_libraries#Shared_libraries
- Wikipedia article on shared libraries
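From Python you can poke at the same runtime machinery through ctypes, which asks the dynamic linker to resolve a libc symbol by name. A sketch, assuming a Unix-like system where find_library() can locate the C library:

import ctypes
import ctypes.util

# Locate and load the C library, then resolve its time() routine by name.
libc = ctypes.CDLL(ctypes.util.find_library('c'))
print(libc.time(None))  # seconds since the epoch, straight from libc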
The old days: static linking
“compile” “link”
cmdn.c → cmdn.o ↘
opts.c → opts.o → prog
slen.c → slen.o ↗
Every .o file has a table of names that it defines and names that it needs someone else to define
$ nm p_move.o
U _nc_panelhook
U is_linetouched
00000000 T move_panel
U mvwin
U wtouchln
Linking returns an error if a needed name cannot be found
$ ld -o prog foo.o bar.o baz.o
foo.o: In function `main':
foo.c:(.text+0x7): undefined reference to `haute'
collect2: ld returned 1 exit status
ar archives
(12:32 / 36:22 into the video)
The ar tool bundles together related .o files into .a archives.
$ ar t /usr/lib/libpanel.a
panel.o
p_above.o
p_below.o
p_bottom.o
p_delete.o
p_hide.o
p_hidden.o
p_move.o
p_new.o
p_replace.o
p_show.o
p_top.o
p_update.o
p_user.o
p_win.o
.a files are static libraries.
See also
- http://en.wikipedia.org/wiki/Static_library
- Wikipedia article on static libraries
Demand Paging
(18:00 / 36:22 into the video)
The OS does not load pages from disk until your program reads from or writes to them
Imagine a program that has just started running
Files:
- python — 3 pages
- libc.so — 2 pages
At first, only one page of the python binary is loaded
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → - │
│ 6-- → 61-- w │ heap
│ 5-- → - │
│ 4-- → - │ (libc.so)
│ 3-- → - │ (libc.so)
│ 2-- → - │ (python)
│ 1-- → - │ (python)
│ 0-- → 35-- r │ python ← LOAD FROM DISK
└────────────────┘
Then Python tries to run the instruction at 212 so that page gets loaded too
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → - │
│ 6-- → 61-- w │ heap
│ 5-- → - │
│ 4-- → - │ (libc.so)
│ 3-- → - │ (libc.so)
│ 2-- → 37-- r │ python ← LOAD FROM DISK
│ 1-- → - │ (python)
│ 0-- → 35-- r │ python
└────────────────┘
Then Python calls a function at libc offset 178 thus 300 + 178 = 478
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → - │
│ 6-- → 61-- w │ heap
│ 5-- → - │
│ 4-- → 41-- r │ libc.so ← LOAD FROM DISK
│ 3-- → - │ (libc.so)
│ 2-- → 37-- r │ python
│ 1-- → - │ (python)
│ 0-- → 35-- r │ python
└────────────────┘
If Python keeps calling the same routines, then only those 3 pages ever get loaded
So pages are loaded from disk lazily
Heap allocation works the same way
(18:50 / 36:22 into the video)
Q: How many RAM pages are allocated if you ask the OS for three new memory pages?
A: None!
The OS allocates no pages but simply makes an adjustment to its record-keeping
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → - │ (heap) ←
│ 6-- → - │ (heap) ←
│ 5-- → - │ (heap) ←
│ 4-- → 61-- w │ heap
│ 3-- → - │
│ 2-- → 41-- r │ libc.so
│ 1-- → - │ (python)
│ 0-- → 35-- r │ python
└────────────────┘
The OS will now respond to page faults at 500–799 by allocating a new page, not raising a Segmentation Fault
But no pages are actually allocated until a read or write shows they are needed
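A Linux-only sketch of that record-keeping in action: map three anonymous pages, and watch the resident set size stay put until the first write faults a real page in.

import mmap

def rss_kb():
    # Resident set size of this process in kB (Linux-only).
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

pages = mmap.mmap(-1, 3 * mmap.PAGESIZE)  # only the page table changes
print('after mmap:', rss_kb(), 'kB')
pages.write(b'x')  # first write -> page fault -> a real page is allocated
print('after write:', rss_kb(), 'kB')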
(19:19 / 36:22 into the video)
That is the difference between VIRT and RES when you run top
top
↓ ↓
PID USER VIRT RES SHR S %CPU %MEM COMMAND
1217 root 195m 94m 53m S 2 2.6 Xorg
7304 brandon 183m 45m 19m S 2 1.3 chromium
2027 brandon 531m 162m 37m S 1 4.5 chromium
2319 brandon 179m 58m 23m S 1 1.6 chromium
18841 brandon 2820 1188 864 R 1 0.0 top
9776 brandon 175m 49m 12m S 0 1.4 chromium
↑ ↑
See also
- http://en.wikipedia.org/wiki/Demand_paging
- Wikipedia article on demand paging
VIRT
- Virtual Image
- All memory pages that promise not to return a Segmentation Fault
RES
- Resident Size
- Only memory pages that have actually been allocated
Diagram to illustrate the difference
- VIRT — 8 pages
- RES — 4 pages
┌────────────────┐ VIRT RES
│ 9-- → 59-- w │ stack ✓ ✓
│ 8-- → - │
│ 7-- → - │ (heap) ✓
│ 6-- → - │ (heap) ✓
│ 5-- → - │ (heap) ✓
│ 4-- → 61-- w │ heap ✓ ✓
│ 3-- → - │
│ 2-- → 41-- r │ libc.so ✓ ✓
│ 1-- → - │ (python) ✓
│ 0-- → 35-- r │ python ✓ ✓
└────────────────┘
Only RES can actually run you out of memory
Look at RES when you wonder where all of your RAM is going
↓
PID USER VIRT RES SHR S %CPU %MEM COMMAND
1217 root 195m 94m 53m S 2 2.6 Xorg
7304 brandon 183m 45m 19m S 2 1.3 chromium
2027 brandon 531m 162m 37m S 1 4.5 chromium
2319 brandon 179m 58m 23m S 1 1.6 chromium
18841 brandon 2820 1188 864 R 1 0.0 top
9776 brandon 175m 49m 12m S 0 1.4 chromium
↑
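You can also read both numbers for your own process straight from /proc/self/statm, which reports them in pages; a Linux-only sketch:

import os

with open('/proc/self/statm') as f:
    virt_pages, res_pages = [int(n) for n in f.read().split()[:2]]

page_kb = os.sysconf('SC_PAGE_SIZE') // 1024
print('VIRT: %d kB' % (virt_pages * page_kb))
print('RES:  %d kB' % (res_pages * page_kb))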
/proc/<pid>/smaps
(20:19 / 36:22 into the video)
Linux file; shows whether physical RAM pages have been allocated to a segment
Note
(from Marc) On Linux systems, there is a nice tool called pmap that shows similar information in a nice format.
See also
- pmap(1)
- Man page for the pmap program.
Note
(from Marc) This is on Linux; OS X does not have the /proc filesystem. The OS X equivalent of this is the vmmap command.
See also
- vmmap(1)
- Man page for the vmmap program.
- “Viewing Virtual Memory Usage”
- A section from the Memory Usage Performance Guidelines guide in the Mac OS X Developer Library
Here are some segments of a new python process sitting quietly at its prompt
08048000-08238000 r-xp .../bin/python
Size: 1984 kB
Rss: 896 kB
b73b9000-b752f000 r-xp .../libc-2.13.so
Size: 1496 kB
Rss: 528 kB
bfc48000-bfc69000 rw-p [stack]
Size: 136 kB
Rss: 104 kB
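Totaling those Size and Rss figures across every segment takes only a few lines of Python; a hedged Linux-only sketch of a tiny smaps reader:

def smaps_totals(pid='self'):
    # Sum the Size and Rss lines (reported in kB) across all segments.
    size = rss = 0
    with open('/proc/%s/smaps' % pid) as f:
        for line in f:
            if line.startswith('Size:'):
                size += int(line.split()[1])
            elif line.startswith('Rss:'):
                rss += int(line.split()[1])
    return size, rss

size, rss = smaps_totals()
print('promised: %d kB   resident: %d kB' % (size, rss))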
Demand Paging: Example
Calling a Python function that was not called as Python started up
08048000-08238000 r-xp .../bin/python
Size: 1984 kB
Rss: 896 kB
>>> frozenset()
08048000-08238000 r-xp .../bin/python
Size: 1984 kB
Rss: 916 kB
916 - 896 = 20 new kilobytes
So invoking new sections of a binary or library pulls more pages into RAM
But since binary and library code can always be reloaded from disk, the OS can also discard those pages
$ python -c 'range(500000000)'
This allocates lots of memory. As it runs, your OS will dump information overboard to reclaim every RAM page it can.
What did this memory hog do to the other Python process that was sitting quietly at its prompt?
08048000-08238000 r-xp .../bin/python
Size: 1984 kB
Rss: 916 kB
↑
before the hog ran
after
↓
08048000-08238000 r-xp .../bin/python
Size: 1984 kB
Rss: 464 kB
916 - 464 = 452 kB were thrown overboard!
Those 452 kB will be re-loaded, on-demand, from disk if this Python process tries to access them again
The process will slow down as the discarded pages are re-loaded
L1 cache — 3s grabbing a piece of paper
L2 cache — 14s picking a book from a shelf
System RAM — 4-minute walk down the hall
Hard drive seek —
“like leaving the building to roam the earth for one year and three months.”
— Gustavo Duarte
Forking
(22:37 / 36:22 into the video)
On primitive operating systems the only way to create a new process is to start a whole new program
Linux and OS X support fork()!
fork() creates a child process that continues on from the parent’s current state
Process 1 A
↓
B
↓
C
↓
D — os.fork() ↴ Process 2
↓ ↓
E E
↓ ↓
F F
↓ ↓
The OS could implement fork() naively by copying every writable page in the parent so that the child had its own copy
fork() parent fork() child
┌────────────────┐ ┌─────────────────┐
│ 9-- → 59-- w │ stack │ 9-- → new copy │
│ 8-- → 63-- w │ stack │ 8-- → new copy │
│ 7-- → - │ │ 7-- → - │
│ 6-- → 61-- w │ heap │ 6-- → new copy │
│ 5-- → 60-- w │ heap │ 5-- → new copy │
│ 4-- → - │ │ 4-- → - │
│ 3-- → - │ │ 3-- → - │
│ 2-- → 39-- r │ libjson │ 2-- → (same) │
│ 1-- → 36-- r │ libc │ 1-- → (same) │
│ 0-- → 35-- r │ python │ 0-- → (same) │
└────────────────┘ └─────────────────┘
But immediately copying every page could take a long time for a large process, and during the copy both parent and child would be hung waiting!
Instead, as you might guess, the OS implements fork() lazily, copying pages on-demand when the parent or child performs a write
So the OS starts fork() by copying the page table, marking all pages read-only in both processes, then letting them keep running
fork() parent fork() child
┌────────────────┐ ┌────────────────┐
│ 9-- → 59-- r │ stack │ 9-- → 59-- r │
│ 8-- → 63-- r │ stack │ 8-- → 63-- r │
│ 7-- → - │ │ 7-- → - │
│ 6-- → 61-- r │ heap │ 6-- → 61-- r │
│ 5-- → 60-- r │ heap │ 5-- → 60-- r │
│ 4-- → - │ │ 4-- → - │
│ 3-- → - │ │ 3-- → - │
│ 2-- → 39-- r │ libjson │ 2-- → (same) │
│ 1-- → 36-- r │ libc │ 1-- → (same) │
│ 0-- → 35-- r │ python │ 0-- → (same) │
└────────────────┘ └────────────────┘
As writes cause page faults the OS makes copies. Here page 63 is copied to 77 and marked w.
fork() parent fork() child
┌────────────────┐ ┌────────────────┐
│ 9-- → 59-- r │ stack │ 9-- → 59-- r │
→ │ 8-- → 63-- w │ stack │ 8-- → 77-- w │ ←
│ 7-- → - │ │ 7-- → - │
│ 6-- → 61-- r │ heap │ 6-- → 61-- r │
│ 5-- → 60-- r │ heap │ 5-- → 60-- r │
│ 4-- → - │ │ 4-- → - │
│ 3-- → - │ │ 3-- → - │
│ 2-- → 39-- r │ libjson │ 2-- → (same) │
│ 1-- → 36-- r │ libc │ 1-- → (same) │
│ 0-- → 35-- r │ python │ 0-- → (same) │
└────────────────┘ └────────────────┘
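The same copy-on-write dance is easy to watch from Python on a Unix system; a small sketch:

import os

value = ['parent data']

pid = os.fork()
if pid == 0:                    # child process
    value[0] = 'child data'     # write -> page fault -> private copy
    print('child sees:', value[0])
    os._exit(0)

os.waitpid(pid, 0)
print('parent still sees:', value[0])  # unchanged: 'parent data'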
(24:18 / 36:22 into the video)
Q: How well does this usually work?
A: It works great!
Only the differences that develop between the parent and child memory images incur additional storage
Q: But how does it work for Python?
A: Terribly!
Thanks to reference counts, merely glancing at data with CPython forces the OS to create a separate copy: touching an object updates its reference count, and that update is a write that dirties the page.
CPython vs. PyPy
import time

# Create a list of about 100 MB
biglist = ['foo']
for n in range(1, 8460000):
    biglist.append(n)

# put fork() here

# Iterate across the list
t0 = time.time()
all(n for n in biglist)
t1 = time.time()
CPython
Before CPython iterates
$ cat /proc/22364/smaps | awk '/heap/,/Private_D/'
08d60000-0ef90000 rw-p 00000000 00:00 0 [heap]
Size: 100544 kB
Rss: 100544 kB
Pss: 50294 kB
Shared_Clean: 0 kB
Shared_Dirty: 100500 kB
Private_Clean: 0 kB
Private_Dirty: 44 kB
After CPython iterates
$ cat /proc/22364/smaps | awk '/heap/,/Private_D/'
08d60000-0ef90000 rw-p 00000000 00:00 0 [heap]
Size: 100544 kB
Rss: 100544 kB
Pss: 100274 kB
Shared_Clean: 0 kB
Shared_Dirty: 540 kB
Private_Clean: 0 kB
Private_Dirty: 100004 kB
At finish: 0.5% of heap shared
PyPy 1.8
Before PyPy iterates
$ cat /proc/22385/smaps | awk '/heap/,/Private_D/'
0a5c9000-11fba000 rw-p 00000000 00:00 0 [heap]
Size: 124868 kB
Rss: 124540 kB
Pss: 62274 kB
Shared_Clean: 0 kB
Shared_Dirty: 124532 kB
Private_Clean: 0 kB
Private_Dirty: 8 kB
After PyPy iterates
$ cat /proc/22385/smaps | awk '/heap/,/Private_D/'
0a5c9000-11fba000 rw-p 00000000 00:00 0 [heap]
Size: 124868 kB
Rss: 124868 kB
Pss: 63160 kB
Shared_Clean: 0 kB
Shared_Dirty: 123416 kB
Private_Clean: 0 kB
Private_Dirty: 1452 kB
At finish: 96.1% of heap shared
(26:03 / 36:22 into the video)
Lesson:
Forked worker processes in Unix share memory when the parent process builds read-only data structures
—but not in CPython
See also
- http://en.wikipedia.org/wiki/Fork_(operating_system)
- Wikipedia article on process forking
Explicitly Sharing Memory
(26:14 / 36:22 into the video)
How can two Python threads of control share writable memory to collaborate?
- Threads
- Memory maps
Threads
(26:48 / 36:22 into the video)
Creating a thread is like fork() except that the heap remains shared between the two threads of control
Python supports threads on both Unix and Windows
import threading
t = threading.Thread(target=myfunc)
t.start()
Each thread gets its own stack so the two threads can call different functions, but all other data structures are shared
main thread child thread
┌────────────────┐ ┌────────────────┐
│ 9-- → 59-- r │ stack │ 9-- → 59-- r │
│ 8-- → 63-- w │ stack │ 8-- → 77-- w │
└────────────┬───┴─────────┴──┬─────────────┘
│ 7-- → - │
│ 6-- → 61-- w │ heap
│ 5-- → 60-- w │ heap
│ 4-- → - │
│ 3-- → - │
│ 2-- → 39-- r │ libjson
│ 1-- → 36-- r │ libc
│ 0-- → 35-- r │ python
└────────────────┘
Since threads share every data structure, they have to be very careful; a minimal locking sketch follows the links below.
See also
- http://en.wikipedia.org/wiki/Thread_(computing)
- Wikipedia article on threads
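A minimal sketch of that care: two threads appending to one shared list, serialized by a lock.

import threading

shared = []
lock = threading.Lock()

def worker(name):
    for n in range(3):
        with lock:  # guard every write to the shared structure
            shared.append((name, n))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared)  # entries from both threads, interleaved but intact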
Memory maps
(27:19 / 36:22 into the video)
Using mmap(), a parent process can create shared memory that will be inherited by all the child workers it forks
import mmap, os
map = mmap.mmap(-1, 100)
os.fork()
...
fork() parent fork() child
┌────────────────┐ ┌────────────────┐
│ 9-- → 59-- r │ stack │ 9-- → 59-- r │
│ 8-- → 63-- w │ stack │ 8-- → 77-- w │
└─────────────┬──┴───────────┴─┬──────────────┘
│ 7-- → 88-- - │ mmap segment
┌─────────────┴──┬───────────┬─┴──────────────┐
│ 6-- → 61-- r │ heap │ 6-- → 61-- r │
│ 5-- → 60-- r │ heap │ 5-- → 60-- r │
│ 4-- → - │ │ 4-- → - │
│ 3-- → - │ │ 3-- → - │
│ 2-- → 39-- r │ libjson │ 2-- → (same) │
│ 1-- → 36-- r │ libc │ 1-- → (same) │
│ 0-- → 35-- r │ python │ 0-- → (same) │
└────────────────┘ └────────────────┘
This supports very fast RAM-based communication between processes
without requiring every data structure on the heap to carry locks or other protection
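Filling in the "..." above with a hedged, Unix-only sketch: the parent forks, the child writes into the map, and the parent reads the bytes back through the very same physical pages.

import mmap, os

shared = mmap.mmap(-1, 100)  # 100 anonymous bytes, inherited across fork()

if os.fork() == 0:           # child process
    shared.seek(0)
    shared.write(b'hi from the child')
    os._exit(0)

os.wait()                    # let the child finish writing first
shared.seek(0)
print(shared.read(17))       # b'hi from the child'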
Another mmap() capability
Remember how the OS loads pages from binary programs like python and shared libraries like libc.so on-demand, not all at once?
Well, with mmap() you can do that yourself, with normal files!
mmap(myfile.fileno(), 0)
lets you replace seek(), read(), and write() calls and simply access the file like an array!
┌────────────────┐
│ 9-- → 59-- w │ stack
│ 8-- → - │
│ 7-- → 91-- - │ mmap[1] ↔ myfile
│ 6-- → 90-- - │ mmap[0] ↔ myfile
│ 5-- → - │
│ 4-- → 61-- w │ heap
│ 3-- → - │
│ 2-- → 41-- r │ libc.so
│ 1-- → - │ (python)
│ 0-- → 35-- r │ python
└────────────────┘
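A sketch of such a file-backed map, assuming a file named myfile already exists and is at least a few bytes long:

import mmap

with open('myfile', 'r+b') as myfile:
    data = mmap.mmap(myfile.fileno(), 0)  # map the entire file
    print(data[:10])    # first ten bytes, paged in on demand
    data[0:2] = b'Hi'   # writes flow back to the file through the map
    data.close()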
See also
- http://en.wikipedia.org/wiki/Mmap
- Wikipedia article on mmap()
Summary
(28:42 / 36:22 into the video)
Your processes all access memory through the mediation of a page table
Page tables power all kinds of fun RAM optimizations:
- Demand loading
- Shared libraries
- Memory maps
- fork()
- Threads
And
PyPy lacks reference counts, which lets the OS do its magic and conserve resources
Questions
(29:35 / 36:22 into the video)
Just one question
Q: (Erik Rose) I recently started using mmap() in a non-Python context, and free RAM meters (e.g. free, top) behave strangely. Do you know whether mmapped memory that is actually resident affects the free counts, or does it somehow go behind their backs?
A: That is very dependent on the version of the operating system. Reporting tools can have odd effects. What I would expect is for them to show pages disappearing only when they are actually allocated. When you mmap, you can ask for the memory to be grabbed eagerly.
Marc’s Prologue
Resources
A great free resource is Ulrich Drepper’s DSO How To (a.k.a.: “How To Write Shared Libraries”).
An awesomely detailed blog post on Position-Independent Code (PIC) by Eli Bendersky
For more great info on this kind of stuff, check out “Linkers & Loaders” by John R. Levine.
To delve deep into how the Linux kernel implements virtual memory (and other stuff), you might check out “Linux Kernel Development” by Robert Love