Unix Operating Systems Software

Ask Slashdot: How do Software MMUs Work?

Rob_D_Clark asks: "How does a program (like VMware) implement memory management on top of Linux (or Unix in general)? For example: in VMware, the guest OS is going to expect to have a 32-bit address space, into which the memory you allocate to the guest OS is mapped. Also, the guest OS is going to expect hardware registers for different devices, etc., to be mapped in at certain addresses. How does a program trap reads/writes to these addresses and deal with them appropriately?"
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward

    Funny, my dad told me that's how an 'internal combustion engine' works.

    My dad explained "marginal utility" and "network externalities" the same way. In fact, he used the same explanation for sex.

    Hmmm . . .

  • by Anonymous Coward

    . . . is a parody of the humorless mental dwarves who moderate all the life out of Slashdot.

    Lighten up.

  • by Anonymous Coward
    You don't need a disassembler, just a pattern matcher. Scan through each text page looking for unvirtualizable binary opcodes and overwrite them with some other opcode(s) you know will cause a trap. Make sure to pick a trap that is unique enough that you can determine which opcode you replaced.

    I believe the main problem is that the opcode that asks the CPU what mode (user or supervisor) it is in is not virtualizable, so you won't need to recompute things like offsets into the stack or other complicated stuff; it's just a binary search and replace.
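
    A minimal sketch of that search-and-replace, under loud assumptions: the "sensitive" opcode is taken to be the one-byte POPF (0x9D), the trap is INT 3 (0xCC), and `struct patch`, `patch_opcode` and `original_at` are hypothetical names made up for illustration. A side table records what was replaced, so the trap handler can tell which opcode it is standing in for. The matcher is deliberately blind: it cannot tell code from data or from the middle of a longer instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OP_POPF 0x9D  /* example "sensitive" opcode (assumption) */
#define OP_INT3 0xCC  /* one-byte breakpoint trap */

/* Hypothetical side-table entry: where we patched and what was there. */
struct patch {
    size_t  offset;
    uint8_t original;
};

/* Replace every byte equal to `target` with INT 3 and log it in `table`.
 * Returns the number of patches made (at most `max`). This is blind byte
 * matching: it cannot distinguish opcodes from data or immediates. */
size_t patch_opcode(uint8_t *text, size_t len, uint8_t target,
                    struct patch *table, size_t max)
{
    size_t n = 0;
    assert(text != NULL && table != NULL);
    for (size_t i = 0; i < len && n < max; i++) {
        if (text[i] == target) {
            table[n].offset = i;
            table[n].original = text[i];
            text[i] = OP_INT3;
            n++;
        }
    }
    return n;
}

/* Look up the original byte for a trap at `offset`; -1 if not one of ours. */
int original_at(const struct patch *table, size_t n, size_t offset)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].offset == offset)
            return table[i].original;
    return -1;
}
```

    A trap handler would call `original_at` with the offset of the trapping byte and emulate whatever opcode comes back.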
  • by Anonymous Coward
    But how do you know what to breakpoint? There are only 4 hardware breakpoints on x86, I think.
  • by Anonymous Coward
    In a nutshell, this is how I would do it:

    My program (VMware work-alike) would run in ring 0. ALL other programs would run in ring 3. I'd set up a separate x86 VM for each instance of a hosted operating system. My work-alike program would emulate any instructions which can't be run in ring 3. The work-alike would have to emulate a lot of the functionality of the x86.

    When a program tries to use an instruction that can't be used in ring 3, a GPF will occur, which will switch control to my work-alike. My work-alike will look at the stack to get the address of the offending instruction and emulate its functionality, then return control back to the program. The program would have no idea it was just interrupted and its request emulated, as long as it gets what it wants.
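
    A toy sketch of that trap-and-emulate step, not anyone's real monitor. It assumes the fault handler has already dug the faulting address out of the stack frame and stored it in a hypothetical `struct vcpu`, and it only knows three one-byte instructions that fault in ring 3 with a low IOPL: CLI (0xFA), STI (0xFB) and HLT (0xF4). All names here are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-guest state kept by the monitor. */
struct vcpu {
    uint32_t eip;      /* guest instruction pointer (offset into mem) */
    bool     virt_if;  /* the guest's *virtual* interrupt-enable flag */
    bool     halted;   /* set once the guest executes HLT */
};

/* Called from the (imagined) general-protection-fault handler.
 * Emulates the faulting one-byte instruction at v->eip and steps past it.
 * Returns true if handled, false if the opcode is not one we emulate. */
bool emulate_faulting_insn(struct vcpu *v, const uint8_t *mem, size_t len)
{
    assert(v != NULL && mem != NULL);
    if (v->eip >= len)
        return false;

    switch (mem[v->eip]) {
    case 0xFA: v->virt_if = false; break;  /* CLI: mask virtual interrupts */
    case 0xFB: v->virt_if = true;  break;  /* STI: unmask them */
    case 0xF4: v->halted  = true;  break;  /* HLT: park the guest */
    default:   return false;               /* not ours; leave state alone */
    }
    v->eip += 1;  /* resume after the emulated instruction */
    return true;
}
```

    The guest never touches the real interrupt flag; it only ever changes `virt_if`, which is the "gets what it wants" part.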

    It'll be much more involved than the above and would require a lot of work to write, but that's basically how it could be done.

    Anonymous Coder
  • by Anonymous Coward
    TLB == Translation Lookaside Buffer
  • by Anonymous Coward
    It's flushing the on-chip cache.
  • by Anonymous Coward
    Actually, that is exactly how VMware is doing it, I believe. We've discussed VMware at length, here in the office -- we're all REALLY impressed -- and we have come to the conclusion (so realise that the following is just speculation) that:

    As others have pointed out, the i386 is NOT virtualisable, so you have to play some tricks, unless you want to emulate the processor (hey, it worked for Insignia). But that is too slow to get VMware's level of performance. Even Digital's assembly language translation thingie (they had a vaporware Alpha/i386 hybrid processor a few years ago that did this -- look in back issues of Byte for pointers) is too slow.

    VMware's trick is to scan the object code at load time and translate unvirtualisable instruction sequences to something else -- what, I don't know, but I suspect they jump to an emulator for just that sequence. So it's just like Pure Atria's stuff, and even related to the Melting Ice tech available for rapid development for Eiffel.

    hope this helps.

    Johan
    johan@ccs.neu.edu -- I'll log in when I get the johan uid. gimmie!
  • by Anonymous Coward
    Anyone who has some time to spare and knows a great deal about how to implement this should head over to: http://www.freemware.org/ they could really use your help.

    [The Freemware project started directly after VMWare was announced. It's an effort to create an open source (and possibly portable) VMWare-clone.]
  • I have a dual CPU system, and that second processor isn't doing all that much. Rather than use a bunch of kluges to virtualize the processor, why not give a whole processor to that OS?

    This would require you to "hide" the other processor from the OS as far as SMP support goes (or to use an OS that doesn't do SMP), and to make it use "device drivers" that use inter-processor communication to funnel the actual device I/O through the hosting OS.

    There would be other issues, like arranging for the hosted processor to see a BIOS that doesn't try to access devices directly and guaranteeing that the hosted processor doesn't go playing with hardware directly. Unfortunately, the solutions to those problems might leave you back at the original problem.
  • by Anonymous Coward
    It's been >5 years since I last studied this, but I thought that one could set a flag that allows even debug registers to be "protected", in effect causing a fault upon *any* access attempt by ring 3 ("user") code.

    My current favorite example comes from Linux: The kernel allows user processes to read the current value of the CPU clock counter (using the instruction "rdtsc", or "read time stamp counter"). That instruction can be made to cause a fault by an appropriate flag setting.

    I would expect Intel to be fairly good at VM technology after hearing some of the complaints about the '386. (The obvious one is the lack of ring 0 write-protect page faults.)
  • by Anonymous Coward
    It might be useful to do some research on Rational Purify (formerly of Pure-Atria, formerly Pure Software). Purify lets the programmer/debugger trap on reads of un-initialized memory and provides fast watch-points (100s of times faster than gdb's watch-points). So clearly Purify is doing exactly what you want (while it isn't really clear if VMware is).

    I'd dig around in deja-news if I were you.

  • by Anonymous Coward
    No, it's simpler than that. Read the linux-kernel archives and see how the UltraSparc guys discussed working around the bugs in the UltraSparc CPUs:

    (1) you mark all the pages that you want to trap instructions in as non-executable

    (2) when code attempts to execute in one of those pages, you get a fault

    (3) you trap the fault, and then (and only then) scan the page and modify instructions as necessary

    (4) you then mark the page executable and not writable, and let it run

    (5) if the page is modified, you then clear the executable bit, because you may have to re-scan it.
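
    The five steps above can be sketched as a tiny state machine. This is only bookkeeping under stated assumptions: `struct page`, `page_init`, `on_execute` and `on_write` are hypothetical names, the two handlers stand in for the execute and write faults, and the actual scanning is reduced to a counter. A real implementation would flip the protection bits in the MMU instead of in a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page bookkeeping; a real monitor would drive the MMU. */
struct page {
    bool     executable;
    bool     writable;
    unsigned scans;      /* how many times the page has been scanned */
};

/* Step 1: start out writable but not executable. */
void page_init(struct page *p)
{
    assert(p != NULL);
    p->executable = false;
    p->writable = true;
    p->scans = 0;
}

/* Steps 2-4: an execute attempt on a non-executable page faults; the
 * handler scans and patches it, then makes it executable and read-only. */
void on_execute(struct page *p)
{
    assert(p != NULL);
    if (!p->executable) {
        p->scans++;            /* stand-in for "scan and modify instructions" */
        p->executable = true;
        p->writable = false;
    }
}

/* Step 5: a write to a read-only page faults; allow the write, but clear
 * the executable bit so the next execute attempt forces a re-scan. */
void on_write(struct page *p)
{
    assert(p != NULL);
    if (!p->writable) {
        p->writable = true;
        p->executable = false;
    }
}
```

    The payoff is that a page is scanned only when it is first executed, and again only after it has actually been modified.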
  • by Anonymous Coward on Thursday May 13, 1999 @05:49PM (#1892912)

    Okay, imagine that the memory of your computer is like a vast attic, full of flies. Each of the flies is either asleep or awake, and they change state frequently. They live, work, and play in groups of eight, called "bytes". Now, when the computer gets hungry, it opens up its mouth much like a blue whale and sucks in a great big gulp of air from the attic. It filters the flies out of the air with its giant long strandy teeth and gobbles up the flies -- gobble gulp!

    So.

    The whale has no eyes, and in the whale's tummy there is a man without his greatcoat. That guy is called the "kernel", or "Colonel", and he looks and talks exactly like Colonel Klink on Hogan's Heroes. He has a goofy, bumbling sidekick named Sergeant Schultz, otherwise known as the "Memory Management Unit". What Sgt. Schultz does is, well . . . okay. Let's start over. Colonel Klink is in charge of sorting through these flies and putting them together in the right order before the whale (the computer, remember) digests them. This way, the whale won't get a tummyache and feel funny. Col. Klink has to decide which flies to send when, but he needs to have them organized in the right way so he knows which flies are which. If two batches of flies crash into each other, the computer will get very frowny and sad. Col. Klink doesn't like that, because when that happens the General comes and yells at him in German, and Col. Klink doesn't speak German, he just speaks English with a funny accent. So Sgt. Schultz has the very important job of ensuring that the flies don't get mixed up before Col. Klink gets to look at them.


    In the arrangement that you're talking about above, things are more complex, because Col. Klink and Sgt. Schultz have to coexist with Col. Hogan and Richard Dawson, who are doing the same thing at the same time. (A little imagination will suffice to guess which OS is which). Hilarity ensues! But everything runs smoothly again at the end of the episode.


    Hope this helps.

  • by Anonymous Coward on Thursday May 13, 1999 @06:09PM (#1892913)
    Since everyone else is talking out of their ass on this one, I might as well too.

    The general consensus in comp.arch is that vmware is doing some dynamic recompilation, but is otherwise allowing the hosted operating system to execute natively, and thus use the hardware mmu for the majority of the work.

    As has already been mentioned, the IA32 instruction set architecture (ISA) is not completely self-virtualizable, i.e. you can't trap accesses to all cpu state information. But, you can scan through the text of your process and search for those specific opcodes that are not virtualizable. Substitute a call to your own handler for those opcodes and voila! we are now effectively fully virtualizable and the performance hit is minimal, especially if you can save your changes so that you don't have to scan and recompile each page of text more than once. And once you are fully virtualized, as long as you properly trap the right operations and do the right thing, you can let the hardware do 99% of the work for you.

    Clearly vmware does more than this with its various virtualized devices, but fundamentally this is probably what is going on.

  • The idea is sound, it's only that Intel got stingy with the breakpoint registers.

    s/breakpoint//

  • I don't think it's quite that simple. You can't scan and replace because

    (1) what you replace may be data, not code

    (2) even if (1) weren't a problem, how do you know the sequence you found isn't a coincidence? What if the end of "REPNE SCASB" and the beginning of "DIV ECX" just happens to look exactly like "MOV EAX, DR7"?



    --synaptik

  • I was hoping you wouldn't notice that small problem. ;)

    The idea is sound, it's only that Intel got stingy with the breakpoint registers.

    --synaptik
  • 1. Not necessarily. What about instructions like
    "MOV EAX, 0CCCCCCCCh" ?

    The last four bytes look like 4 "INT 3" instructions.
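
    To make the false-positive problem concrete, here is a small sketch. It assumes INT 3 is the pattern being hunted (opcode 0xCC) and uses a MOV EAX with an immediate made entirely of 0xCC bytes, which encodes as B8 followed by the four immediate bytes. `count_byte` and `mov_eax_imm` are hypothetical names. A naive byte matcher reports four hits inside what is really one harmless instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count bytes equal to `opcode`, the way a naive pattern matcher would. */
size_t count_byte(const uint8_t *buf, size_t len, uint8_t opcode)
{
    size_t hits = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        if (buf[i] == opcode)
            hits++;
    return hits;
}

/* MOV EAX, 0CCCCCCCCh: opcode B8, then the 32-bit immediate. */
static const uint8_t mov_eax_imm[] = {0xB8, 0xCC, 0xCC, 0xCC, 0xCC};
```

    Calling `count_byte(mov_eax_imm, sizeof mov_eax_imm, 0xCC)` counts every byte of the immediate as a match, even though none of them is an instruction.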

    --synaptik
  • Remember, i386 is little endian:

    "fc ff ff ff" is really "0xFFFFFFFC"

    or -4.
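
    A quick sketch of that byte-order point, with `load_le32` as a made-up helper name: it glues four bytes together lowest address first, which is how a little-endian CPU like the i386 reads a 32-bit value from memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble four bytes, lowest address first, into a 32-bit value the way
 * a little-endian CPU such as the i386 reads them from memory. */
uint32_t load_le32(const uint8_t b[4])
{
    assert(b != NULL);
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

    Fed the bytes fc ff ff ff, this gives 0xFFFFFFFC, which reads as -4 when reinterpreted as a signed two's-complement 32-bit integer.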



    --synaptik
  • My experience in MS Windows land is that writes to DR7 are protected, but reads are not.

    In fact, the Intel documents I have specifically state that this register is readable at any privilege level, but nowhere have I seen a statement that you can MAKE it a privileged instruction.



    --synaptik
  • Wow, I'm glad Rob Clark thought to ask this on Ask Slashdot. I was wondering this myself.

    Although, I would like to add a rider to his question:

    With Intel processors, some hardware registers can't be trapped. For example, any privilege level can read DR7 to find out if a debugger is resident. Writes to this can obviously be trapped, but AFAIK there is no way to get the processor to trap on reads.

    I am sure there are other examples like this, as well. This seems to indicate that it is impossible to virtualize every aspect of the machine.

    (Although, I suppose you could put the processor into single-step mode, and look at each instruction before it executes, looking for these types of instructions, but that would slow things WAAAYYYY down.)



    --synaptik
  • This is where the processor comes to the rescue. To be sure, they can't expect every occurrence of a certain sequence to be a particular instruction. It's quite possible to have that sequence be a combination of the end of one instruction and the beginning of another.

    I'm no OS developer, but if I were trying to do this, I'd try scanning for these strings, and then placing a hardware execution breakpoint at the beginning of them. If it's not actually code, the breakpoint won't get hit. If it is code, then when it does get hit the VMware software could just look at the instruction pointer register to ascertain whether they "hit" in the middle of, or the beginning of, an instruction. If the latter, they simulate that "offending" instruction.

    But like he/you said, I'm talking out of my arse.

    :)

    --synaptik
  • Posted by Nick Carraway:

    I think you mean "voila." I liked the Colonel Klink explanation better...
  • Rather than rewriting instructions you detect on a scan of a page of program text, you could possibly set a hardware breakpoint for the instruction. Since there are only 4 hardware breakpoints (at least on the 386), you would only be able to do this on pages containing fewer than 4 instructions that need to be watched.

    I wrote a simple program to scan all the windows dlls and exes for "dangerous" instructions. I found that for most exes and dlls, there were less than 4 instructions per page that would be dangerous. For the remaining ones, you could rewrite the instructions. But then, you have to make the page execute only (not readable or writeable-- is this possible?) and trap any access to it by the processor, to fool it into seeing the original instruction instead of the rewritten one.

    Or, you could simply do single step on that page (which might be a viable option since there would be so few of those pages in the average OS-- unless someone specifically wanted to make your VM perform badly ;-) ).
  • A few comments show that a program may determine that it's in ring 3 rather than 0. It's important to remember that an OS has little legitimate reason to check for that. I wouldn't be surprised if M$ added such checks now that vmware is out, but apparently, they haven't done that in the past.

    In general, it's not necessary to perfectly virtualize ring 0 instructions, it just has to be 'good enough'. In practice, determining what 'good enough' actually is can be a tough problem (which is why there aren't dozens of vmware like products out there), but perfection is not required. Most OSes are not hostile to being virtualized, they just assume that they're not being virtualized.

  • EMM386.EXE is a kludge added to a kludge. It was done because DOS programs expected to run in real mode, and the '286 didn't have virtual 8086 mode.

    Anything that needs to trigger a processor reset to operate is a problem. EMM386 under 'DOS7' probably doesn't work that way anymore.

  • The short answer is: "mmap + SIGILL + SIGSEGV". If you're curious about the details, you might want to check out the Brown Simulator [brown.edu], which provides a full MMU at user-level on top of Solaris.
  • Well, not a complete virtual machine in the sense of the JVM. This approach of completely emulating the target processor has been tried -- the best example is Bochs, which actually works pretty well. But it's seriously slow.

    The point of vmware is to provide the fastest possible emulation of an ia32 machine. So it wants to execute all (or nearly all) the instructions directly on the host processor, rather than having to emulate them. The clever bit is to allow it to do this without clobbering the host OS -- this is what requires lots of memory management tricks.

  • by nrrd ( 4521 ) on Thursday May 13, 1999 @07:25PM (#1892929)
    One of the best descriptions of file paging is here [ohio-state.edu]. It gives a fairly solid, humorous description. Long live the Thing King!
  • that VMware uses a virtual machine for each operating system. This is similar to the JVM. It allows each OS to run natively on the same computer using its resources without conflicting with the others. What *I* want to know is disk partitions/file systems. Do you need a different partition for each OS? Or a different drive entirely?

    Yes, you need a different partition for each OS. But it's that way anyway (for the multi-boot people) without VMware. VMware, however, can be set to use an existing partition with an OS in "RAW" mode, so you don't have to reinstall the OS in a VMware space on the host OS. Of course, you DO NOT want the host OS to be able to access that partition when VMware is running...
  • Quite elegantly put :)

    While we're at it, here's my guess, which is based on badly blurred memories of 80486 documentation.

    Memory reads and writes really aren't the difficult part. In protected mode, every process (or task) gets executed in its own 4GB (max) virtual memory space, and its addresses get translated by the processor into absolute memory space. The OS swaps out these task spaces to disk while they're not being used. One process should never be able to write to another process's space, which was the whole point of protected mode with the i386.

    The real issues involve handling interrupts and executing protected instructions. Take for instance writing directly to hardware through IO ports. The host OS absolutely can't let the hosted OS do what it wants in this area. But the interrupt mechanisms of the x86 architecture come to the rescue here.

    Run the hosted OS in some unprivileged level (not ring 0) and let the processor interrupt whenever there's a privileged instruction executed. The host then examines the situation and recovers by implementing the privileged instruction in an alternative way.

    Registers also won't be a problem in most cases since they are saved and restored at a task state switch. Linux shouldn't care what NT does with the registers as long as they get restored when NT gets preempted.

    - dw
  • Wow, that reminds me of the old Altos 3068 I used on my first job out of college. It had a discrete logic MMU on a separate (big!) board. I never really thought about it, but I wonder if there was some problem with the 68[48]51 chips (the system was a 16MHz 68020).
  • Purify isn't doing what VMware is doing. Purify rewrites the object code at compile time (and sometimes at runtime if you load a non-purify'd file in). That is why it can do this so fast. It doesn't play any hardware games to get this speed (like some debuggers that unmap pages for watchpoints, or try to use the limited debugging registers on some CPUs).

    They have patented the object code rewriting, but as far as I know, no one has challenged the patent. Evidently, there were published papers years before Purify reinvented (and patented) this scheme, so at least some of the Purify patents may be unenforceable because of that.

    Anyway, this has very little to do with VMware, or how VMware is implemented.
  • What I am really trying to figure out is how to trap writes/reads to certain addresses without having to interpret the machine instructions... the best hack I have thought of is to trap SIGSEGV, and have your signal handler try to figure out what was going on.

    For example, you can mark pages of memory as not being readable (the PROT_NONE flag for mmap). This will cause a SIGSEGV if the program tries to read or write that address.
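
    Here is a rough sketch of that hack, assuming Linux (or a similar Unix) and using only standard calls: mmap(2), mprotect(2) and sigaction(2) with SA_SIGINFO. The names `make_trapped_page`, `on_segv`, `fault_count` and `last_fault_addr` are made up for illustration. The handler just notes the faulting address and then opens the page up so the faulting access can be retried; a real VM would decode the access and emulate the device instead. Calling mprotect() from a signal handler works in practice on Linux but is not something POSIX promises.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t fault_count;  /* faults seen so far */
static void * volatile last_fault_addr;    /* address of the latest fault */
static size_t page_size;

/* SIGSEGV handler: record the faulting address, then make the page
 * accessible so the interrupted instruction succeeds when restarted. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    fault_count++;
    last_fault_addr = info->si_addr;
    uintptr_t base = (uintptr_t)info->si_addr & ~(uintptr_t)(page_size - 1);
    if (mprotect((void *)base, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);  /* cannot recover; avoid an endless fault loop */
}

/* Install the handler and map one page with no access rights.
 * Returns the page, or NULL on failure. */
void *make_trapped_page(void)
{
    struct sigaction sa;
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    page_size = (size_t)ps;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return NULL;

    void *p = mmap(NULL, page_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

    The first touch of the returned page lands in `on_segv`; after that the page behaves like ordinary memory until it is protected again.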

    Another idea I just thought of as I was writing this post... you could use a kernel module to create a /proc file, then have your VM mmap that file to use as its memory. Then the module could deal with simulating memory space. (This is assuming you can prevent the mmap'd file from being cached.) This would be even better if you could bind a user-space program to a file. (I vaguely remember reading that GNU Hurd has the ability to do this.)

  • Well, I remember using an old Siemens box. It ran SINIX 1.0, had an 80186 processor and just under 1MB of RAM. It also (wonder of wonders) had an MMU implemented as a piggy-back board on the bus with an 8086 and a software ROM. There's software MMUs for you!
  • OK, well since it's the day of wacky metaphors, I'll try and tackle this one...

    A normal Linux application has to obey POSIX rules to interface with the outside world (memory, disk, printer, etc.). Let's take memory as the normal example. An *application* has to ask nicely for whatever resources it wants. It doesn't know much of anything about the *real* state of the machine. Like a cow in its pen, the process has no idea what the other cows are up to or even how many other cows there are or how big the ranch is. Operating systems attempt to "dial directly" to the hardware, and this is the part that VMware must emulate. No easy job either, considering all the wacky things you can do on a PC. So if I'm an OS (or one of those old boot-me game disks like flight sim 2.0), I don't even worry about allocating memory - I just start eating it by the bucketful. BIOS loads the kernel into memory starting at 0x00, then "jmp 0x00" (goto 0x00). Kernel executes, checks how much memory it has, and starts parcelling it out to other applications - it rules the ranch. It doesn't *ask* for more memory, it just "walks the fence" to figure out how much there is.

    Clearly, to make the ranch-owner behave as a simple cow, while letting him run his own little rat-ranch and never letting him have a clue that he's just a cow in a pen, is a pretty neat trick.

    Most of the "neat trick" is done for you by the CPU; however, as some of the more advanced hackers have pointed out, there are a few weak spots in this virtualized environment. So you still have some crazy stuff to do before you can fully fool the rancher. He's always asking if there's a larger world out there, and you have to keep him in the dark at all costs (or he'll die of surprise and fright). Like Flatland...

    Basically, this is done by brain-washing the rancher into never poking his head through that hole in the wall - which we know leads to the *real* outside world. Or what we know as the outside world - but is it really? Maybe we've already been brainwashed ourselves!!

    Application = cow
    OS = rancher
    computer = ranch
    vmware = sophisticated brainwashing for ranchers which makes them think rats are cows and keeps them from looking over the wall of the stall. Also makes them live on hay instead of beef.

    -=Julian=-

  • It may do the work, and does, at least for vmware. There is a thing called single-step in an 80x86, which lets you single step a program, with an interrupt after every instruction, so you may easily look ahead and see if the next would compromise your emulation... This functionality is used by vmware, while they say this functionality is not available to debuggers running inside a guest OS...
  • Unfortunately unlike MIPS, the x86 TLB is
    implemented entirely in hardware so this won't
    work.
  • Maybe you could get some hints as to
    how VMWare works by watching what MS
    changes in their next OS release in
    order to break it, if they can... :-)

  • by MasterD ( 18638 ) on Thursday May 13, 1999 @05:29PM (#1892942) Journal
    It all has to do with virtual memory. (not the misnomer use as swap) Basically, there is a mapping between _real_ memory addresses and the addresses programs use to access data.

    In a kernel, this is done (usually) using a mix of hardware and software. If a program tries to access a piece of memory, the hardware looks at the Translation Lookaside Buffer (TLB) to translate the address. If the address exists in the buffer, it does the translation and all is good. If it does not exist, a trap is called to the kernel. It is the kernel's responsibility to look at the virtual memory tables, allocate the memory, copy it if it was copy on write, and most importantly update the TLB so next time it does not have to set up the translation.
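
    As a rough picture of that lookup, here is a toy software TLB. Everything about it is an assumption made for illustration: 4 KiB pages, a 16-page address space, a four-slot direct-mapped TLB, and the names `struct mmu` and `translate`. The miss path stands in for the trap to the kernel, which consults the page table and refills the TLB. Real hardware differs, and allocation and copy-on-write are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                        /* 4 KiB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
#define NUM_PAGES  16                        /* toy space: 16 virtual pages */
#define TLB_SLOTS  4

struct tlb_entry {
    bool     valid;
    uint32_t vpn;   /* virtual page number */
    uint32_t pfn;   /* physical frame number */
};

/* Hypothetical state: a flat page table plus a tiny direct-mapped TLB. */
struct mmu {
    uint32_t         page_table[NUM_PAGES];  /* vpn -> pfn */
    struct tlb_entry tlb[TLB_SLOTS];
    unsigned         misses;                 /* times the slow path ran */
};

/* Translate a virtual address. A TLB hit is the fast path; a miss falls
 * back to the page table (the "trap to the kernel") and refills the TLB. */
uint32_t translate(struct mmu *m, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    assert(m != NULL && vpn < NUM_PAGES);

    struct tlb_entry *e = &m->tlb[vpn % TLB_SLOTS];
    if (!e->valid || e->vpn != vpn) {
        m->misses++;
        e->valid = true;
        e->vpn = vpn;
        e->pfn = m->page_table[vpn];
    }
    return (e->pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}
```

    With virtual page 2 mapped to frame 7, address 0x2abc translates to 0x7abc, and a second access to the same page is served from the TLB without another miss.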

    So in the VM case, this is sorta conjecture. The VM can allocate a slew of memory on the host OS. (As far as the client OS is concerned, this is physical RAM.) Then it can make a TLB and all memory accesses will go through it first. This way it can stop Windows from pissing all over OS/2 running on the VM. But Linux will stop the VM from pissing on anything else on the host OS.

    As far as kernel traps, the user level program's data needs to be copied over to kernel space for the kernel to access it.

    I hope this begins to answer your question.
  • you forgot the part about the booze and naked women...

  • If you want to map a memory region to a file, check out the manpage for mmap(2).
  • As a project at University we had to implement a debugger in Minix such that executing

    debug

    would run the program using an execdbg (like execle but slightly modified) call and run the program *through* the debugger. At the heart of it was an extra couple of lines in the interrupt handler that caught an interrupt either after every instruction was executed *OR* (and this is the bit that's important here) caught an interrupt we'd placed.

    What we'd done was implement a feature which, given an address, would copy that instruction out, store it somewhere and then replace it with an interrupt. When that interrupt was trapped the program counter would be wound back one and the interrupt replaced by the correct instruction. Execution would be resumed on some input from the user.
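
    Something like the following, reconstructed as a sketch and not the actual Minix code: it assumes the planted interrupt is the one-byte INT 3 (0xCC), that program text is just a byte array, and that `struct breakpoint`, `bp_set` and `bp_hit` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OP_INT3 0xCC  /* x86 one-byte breakpoint instruction */

/* Hypothetical record of one planted breakpoint. */
struct breakpoint {
    size_t  addr;   /* offset of the patched byte */
    uint8_t saved;  /* original byte that lived there */
};

/* Copy the original byte out and plant the trap in its place. */
struct breakpoint bp_set(uint8_t *text, size_t addr)
{
    struct breakpoint bp;
    assert(text != NULL);
    bp.addr = addr;
    bp.saved = text[addr];
    text[addr] = OP_INT3;
    return bp;
}

/* On the trap the program counter points just past the INT 3, so wind it
 * back one byte and put the original instruction byte back.
 * Returns the program counter at which to resume. */
size_t bp_hit(uint8_t *text, const struct breakpoint *bp, size_t pc)
{
    assert(text != NULL && bp != NULL && pc == bp->addr + 1);
    text[bp->addr] = bp->saved;
    return pc - 1;
}
```

    Winding back by exactly one only works because the trap instruction is a single byte; a longer trap sequence would need a matching adjustment.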

    It wouldn't be very hard for the VM, on loading the program, to scan for 'dodgy' instructions. These could be replaced by an interrupt and pushed onto the end of a linked list. Should this interrupt be trapped, then the program counter would be wound back one and the interrupt replaced by the head of the linked list.


  • To get the x86 into single-stepping mode you OR the psw with a special number (i.e. set a single bit); this means it will generate an interrupt after every single instruction. To turn it off you just AND the psw with an inverted (NOT-ted) version of this special number.

    From the Minix src code (only because I've got it to hand, it'd be the same for Linux)

    src/kernel/const.h


    (line 9) for INTEL chips
    #define TRACEBIT 0x100
    /* OR this with psw in proc[] for tracing */

    (line 109) for M68000
    #define TRACEBIT 0x8000


    and from my src code for a minix debugger

    /* Set trace bit */
    child_proc->p_reg.psw |= TRACEBIT;

    ... and later ...

    /* Clear the tracebit */
    child_proc->p_reg.psw &= ~TRACEBIT;

  • I never really thought it would be all that complicated.

    wouldn't linux be able to give it a fixed amount of RAM as an application and then let VMWare tell linux what to do with it? (er.. i guess i'm asking why is it different from any normal application? don't they all have protected memory?)

    arghh.. my head hurts now..
  • that makes sense. perhaps i'm just a naive beginner at programming... (i am)

    but don't all programs get told this by the OS? I thought protected memory was supposed to segment stuff from each other, to keep them all from knowing who all is out there. And from what i have read in the documentation of VMware, it doesn't emulate the CPU (which would be slow) but lets it talk to the CPU (er??? doesn't it???)


    hehehe... i'm having flashbacks to the matrix... maybe life was designed by VMware. With luck my brain is running under linux, not nt.. =)

    on a side note, what happens if you let VMware boot your linux drive under linux? or what happens if you have linux automatically boot NT under VMware and then have NT boot linux, and so on.. recursive OS! =)
  • Considering that code segments are not by default readable on the x86 (if I remember correctly), this might actually be doable.

    -Mat
  • by rgrimm ( 89215 ) on Thursday May 13, 1999 @08:44PM (#1892951)
    The basic technology behind VMware (I suppose -- all three authors are cofounders of VMware) is described in research papers at http://www-flash.stanford.edu:80/Disco/ [stanford.edu] , if you want to find out the details.

    Robert
