Hardware

What Improvements Will 64-Bit Processors Bring? 69

RyanG asks: "Everyone always looks at numbers (MHz, RAM, HD) when they're considering buying a new computer. Recently, more users have been eyeing bits, as in 64-bit processors, namely the Itanium and to a lesser extent the G5. A lot of people remember the performance increases that were seen when moving from 16 to 32-bit processors, and some people seem to think similar performance increases will be realized when moving from 32 to 64-bit processors. From what I've read this isn't going to be the case, given that 64-bit precision isn't needed except in a few cases, and that moving around that extra data can actually hurt the performance of 64-bit processors when compared to 32-bit processors. Anyone care to comment?"
This discussion has been archived. No new comments can be posted.

  • Don't forget AMD (Score:2, Informative)

    by theantix ( 466036 )
    The ClawHammer and SledgeHammer processors are going to be 64-bit processors as well; don't forget to include the competition!
  • Part of the Playstation 2's bus is 2560 bits wide. Must be one dang fast machine...

    The computer industry really needs some new standardized, best-buy friendly means of identifying real processing speed - not just for the processor but for the entire system (as this is what matters to actual people). And it needs to be something that Sun can't fudge the results of.

    Of course, nobody actually wants this. It's what Scott Adams refers to as a confusopoly. They can all stay in business by making sure nobody understands how fast their products are.
    • If you ever find out how to calculate this one magic number, you'll be quite famous. Don't you think all the benchmarking is done for a reason? There is no single number that measures the "fastness" of a PC. That's why every benchmark worth looking at publishes multiple numbers for each candidate (not that this stops anybody from taking one of those and saying "this one is the fast value").
      • You're right of course.

        But surely we could agree on some standard - a set of benchmarks evaluating the machine's performance at representative tasks.

        And you're right, the information is out there already - but because there's no friendly standards body the information seldom makes it to consumers.

        If suppliers or OEMs wanted to, they could provide more information... And they'd stick with the standard only until they found out they didn't win.

        Case in point: Sun is pulling out of TPC-C (all hands to battle stations!)
        • Well, let's see here. At the moment, to fairly evaluate computer performance we need to look at all these things:
          • Processor
            • Clock Speed
            • Cache Sizes
            • Pipeline length
            • Bus Speed
            • Total FLOPS/MIPS for floating point and integer math
          • Memory
            • Bus Speed
            • Size
            • CAS latency
            • Latency
            • ECC or its equivalent
          • Hard Disks
            • Size
            • Spindle Speed
            • Number of platters
            • MTBF
            • Latency
            • Bus type
            • Raid configuration
            • Transfer rate
            • And I have just scratched the surface; when you throw in graphics and sound cards, etc., things get really muddled. The only way to do this would be to have a constant OS (Linux/*BSD/BeOS) that could run the same benchmark with all possible optimizations made.
          • Exactly. There's way too much information to evaluate.

            The benchmarks would need to be administered on actual for-sale machines, randomly and with large samples, by the standards body. The OEMs would then have incentive to optimize their machines (and make sure they're stable).
        • But surely we could agree on some standard - a set of benchmarks evaluating the machine's performance at representative tasks.

          The problem is defining what a representative task is. The requirements of even simple computations vary so much that no matter what 'tasks' you perform, there will always be some algorithm that has no parallel among the tasks you've benchmarked.

          I've already thrown out every single benchmark I've found for one task I'm doing - the requirements of the algorithm are nothing like sorting a spreadsheet, or searching in a database.

          The problem is that the information on /real/ performance of a device is either difficult or impossible to get hold of easily. I could analyse a number of hardware platforms in great detail (say, 4 or 5 days research each) and get a reasonable idea of their suitability for this task, but it's an awful lot of effort. In some cases, the information is simply not available (or if it is, it's hard to find... anyone used Intel's website recently? I was told by one of their guys to use google to find what I wanted on their site because they didn't know where it was...)

          Ian Woods
          • With all respect, these benchmarks wouldn't be for you - they'd be for average consumers who have even less information than you do.

            As to representative tasks, I suggest the following:

            1. Combined frame rates for the current top 5 games. This would have to be indexed so that over time, scores remained representative of relative performance increases.
            2. Reboot time. Time to load and recalc a large Word or Excel document.
            3. Performance on a tedious mathematical task with a large data base and no good shortcuts.

            There's plenty of good benchmarks out there already. The ones above are stupid. But something is better than nothing.

            These numbers don't need to be perfect. They just need to be standard, and at least an attempt to be representative of tasks that the average consumer cares about.
    • there are *so* many factors that affect performance it's scary... there are no clear answers in this field... it's so complicated that, as part of the coursework for my BS in CS, I had to take a course on performance evaluation!

      Case in point: My Celeron 500 boots Win2K about 30 seconds faster than my Athlon 1700... Of course, after they're booted the Athlon rocks the socks off the Celeron... The answer is: the Celeron has a faster HD than the Athlon, but can you buy a computer based on HD speed? Could a Best Buy salesman even *tell* you the speed of a machine's HD? :)
      • The fact that it's difficult to assess performance (and the fact that computer resellers are generally incompetent) is the strongest argument for standardized benchmarking.

        Consumers need a body that can give them a number for how quickly a computer will perform a certain kind of task. I know it's not going to be terribly precise (or accurate for a specific task), but it's going to be better than the current knowledge a consumer has (which is pretty much "How many MegaHurts does it have?").

        That said, I don't think such a standard will be set, for the reasons in the original post.
    • That's a 2560-bit-wide data *bus*; the integer size of the Emotion Engine is 64-bit, I believe.
  • by green pizza ( 159161 ) on Tuesday December 04, 2001 @07:02PM (#2656668) Homepage
    I've been dabbling in 64-bit computing since my business bought me a Silicon Graphics Indigo2 workstation with a MIPS R10000 CPU. Granted, the machine could only address 768 MB of RAM (a decent amount for '95... but modern workstations, like the Octane, can address 8 GB or more), but it was quite neat to work with 64-bit pointers, larger datasets, and huge files. For me, the jump to 64-bit computing simply allowed me to work with big numbers without having to resort to tricks and other hacks. For others, it was the introduction of 6 GB datasets in RAM, being able to address each and every element with clean code. These days 64-bit computing is an assumed must on large-scale systems (where 64 GB - 1024 GB RAM [on Origin 3000] on a single system [not a cluster] can be considered normal). But on the already-fast, one to two CPU desktop computer, it's not much more than a marketing blurb. And, for most, it's not even really needed, at least until 2+ GB of RAM becomes the norm. 32-bit computing already allows for some pretty darned big pointers and addressing.
  • Back when we had 16 bit computers, this created limits that we had to actively work around, all the time. 64K was a hard limit all over the place.

    On older machines, this was either an absolute hard limit (64K, period) or kludged in some way (Apple //c had a special bit to bankswitch in one or the other 64K memories, but both couldn't be used at the same time.)

    The IBM PC had its segment registers and so could address 1MB, but it was far from transparent. There was no way to declare, say, a 200K array of strings. The programmer's data structures had to be tied quite closely to the peculiarities of the architecture. We spent as much time working around the limitations as we did writing useful code.

    When 32-bit computing came along, bam! What a change! Want to declare a 20MB data structure? Go for it! In terms of artificial restrictions, there just weren't any practical limits to run into, or around, day in and day out.

    The reason 64 bits won't be as revolutionary as 32 bits is that, for the most part, 32 bits is still good enough. Even with the bloated software we have these days, 4 billion is still plenty when it comes down to most things. Take time_t; that's still not going to overflow for another 30 years. 4 billion is a lot.

    A 64-bit CPU working with 32-bit data is slightly inefficient, but don't worry too much about a slowdown from that: 64-bit CPUs will tend to be inherently faster over time, which should more than make up for it.

    So, basically, you heard right. I think.

  • by Anonymous Coward
    About a year ago, my engineering firm replaced almost the entire computing network with Win2K-based PCs and servers, most running Pentium III CPUs @ 933 MHz.

    Why does this matter? Because in the years before the PCs, I had *four* different 64-bit workstations on my desk. Over a period of about six years I had a Sun Ultra 2 (64-bit UltraSPARC II), SGI Indigo2 (64-bit MIPS R10K), and two AlphaStations (64-bit Alpha 21164 and 21264).

    How are things now, with our 32-bit PCs? Faster, for the most part. 64-bit computing is about the size of numbers that can be used. With 64 bits, you can address more RAM, address more disk space, and put more in each address. It really has nothing to do with performance (unless you're doing a task with big numbers that is a slow kluge under a 32-bit CPU).

    We haven't had too many growing pains. The servers were a nightmare for a while, but that was mostly due to software. Thankfully our visualization room is still running off an old Onyx2 (I don't know of any easy way to run 4 full-screen projectors off a single PC and still have good gfx performance). Several of our senior engineers still use their old dual-CPU Octanes, too. We're saving money buying new PCs versus buying new UNIX workstations. Thankfully, most PC components are now capable enough for our tasks and software, but really, everything about PC hardware is *crap*, even the so-called "professional" components. But, given the low price, we're able to buy one of everything and find out what works.

    *shrug*
  • There are other things that 64 bits gets you besides just bigger numbers. You can also scan, process, or copy data in much bigger chunks. Functions such as memcpy, strcmp, and strlen can be significantly sped up with a 64-bit boost.
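
    A rough, purely illustrative sketch of the word-at-a-time idea (hypothetical C; the name strlen64 and the code below are not from any of the posts): a strlen that examines eight bytes per load on a 64-bit machine, using the classic "does this word contain a zero byte?" bit trick. Alignment handling is minimal and strict-aliasing pedantry is ignored.

    #include <stdint.h>
    #include <stddef.h>

    size_t strlen64(const char *s)
    {
        const char *p = s;

        /* Walk byte-by-byte up to an 8-byte boundary. */
        while (((uintptr_t)p & 7) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }

        const uint64_t ones  = 0x0101010101010101ULL;
        const uint64_t highs = 0x8080808080808080ULL;

        for (;;) {
            uint64_t v = *(const uint64_t *)p;   /* eight bytes per load */
            if ((v - ones) & ~v & highs)         /* some byte in v is zero */
                break;
            p += 8;
        }

        while (*p != '\0')                       /* pin down the exact zero byte */
            p++;
        return (size_t)(p - s);
    }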
    • by Lars T. ( 470328 )
      Of course you don't need a 64-bit processor for that; SIMD extensions like AltiVec and SSE2 already do better by working in 128-bit chunks.
    • "Functions such as memcpy, strcmp, and strlen can be significantly sped up with a 64 bit boost."

      Wouldn't memcpy (or memmove, if you want to handle overlaps on my system) be bound by memory bandwidth? If your DIMMs are capable of shuffling N bits around per cycle, a 64Kbit processor won't help. Memory is already far slower than a CPU, which is why we have caching. If you push a significant amount of data around, your cache won't help.

  • by "Zow" ( 6449 ) on Tuesday December 04, 2001 @07:35PM (#2656874) Homepage

    This was question #11 on my Architecture midterm today. . .

    The jump from 32 to 64 bits isn't about speed or precision, it's about the amount of usable address space on a given architecture. For whatever reason (call it functionality, call it bloat, whatever), the amount of address space that programs require is going up by .5 to 1 bit per year. Have you noticed that a lot of people are starting to complain that their PCs are maxed out at 4GB, especially for heavyweight apps like db servers, simulation programs or MSWord? Or that there's been a lot of work on Linux and NT to allow the user to access more of the 4GB on the box? Guess what? The 80386 came out 16 years ago.

    So the jump now is mostly to allow us to continue to grow for another 32 years. Most processor manufacturers tried to get the migration started early - the SPARC, MIPS and Power(PC) chips have all supported 64-bit operation for some time now. The Alpha was originally designed as a 64-bit processor 10 years ago. Intel and AMD are actually rather late to the game.

    It's been said that the only thing that killed the PDP-11 from DEC was its small (16-bit) address space - users were very happy with it, but when they needed more room for their programs, the PDP just couldn't be expanded to handle them. This is probably why DEC started migrating everyone to the Alpha 10 years ago. The original release of the Alpha only used a 34-bit address path (so it could access 16GB of RAM - the rest is reserved). If you want the details, check out chapter 5 of Computer Architecture: A Quantitative Approach by Patterson & Hennessy.

    -"Zow"

    • That's true, but the Alpha was *also* the fastest thing around when it came out. It has more memory bandwidth so you can address many gigs, sure, but 64 bits also enables long instructions and a bigger pipeline. Quad issue (executing 4 instructions simultaneously) wouldn't be possible without the ability to fetch all those instructions in one go. So 64-bit machines CAN be inherently faster than 32-bit machines, through that microparallelism.
      • Quad issue (executing 4 instructions simultaneously) wouldn't be possible without the ability to fetch all those instructions in one go

        True, but this does not require the machine to be a 64-bit processor or to have a 64-bit address space. The size of the bus is independent of both the size of the address space and the length of the instruction word.

        Despite being only a 32-bit architecture, the original Pentium chips implemented a 64-bit bus (requiring two 32-bit SIMMs in parallel for memory), while the 8088 was a 16-bit machine with a 20-bit address space running on an 8-bit bus.
      • The Alpha certainly was the fastest processor on the market when it came out (and if Compaq pushed it the way they should have, it still would be). But this wasn't due to the addressable space - it was having a very clean instruction set, designed for multiple issue (the 21064 was dual issue as I recall: one memory instruction and one other instruction per cycle), that allowed them to crank the clock rate up way past anything else available at the time.

        Neither the memory bandwidth nor the issue rate of a machine is necessarily a function of the addressable space on the machine. I had a 486 mainboard that had an effective 64-bit memory pipeline - it blew every other 486 I ever saw out of the water and remained useable well into the Pentium age. The SGI Onyx-2 I used to work on (a 64-bit MIPS machine) had, if memory serves, a 256-bit memory bus.

        Similarly, the ability to do multiple issue has nothing to do with the addressable memory space. The Transmeta Crusoe is, as far as I know, a 32-bit processor, and at its core it's issuing something like 128-bit VLIWs, comprised of 4 RISC instructions each. Likewise, the latest Intel & AMD offerings are superscalar - I've lost track of how many x86 instructions they're capable of issuing per cycle, but they certainly haven't been limited by the 32-bit address space. Of course, whenever you're issuing multiple instructions per cycle, having a fat pipe to memory helps.

        -"Zow"

    • Good point. You left out the 32bit wonder-machine that brought 2GB of virtual memory to Digital users - the VAX-11/780. Of course our Uni crammed up to 80 users into 128kB of real memory so response time was measured in tens of minutes when assignments were due :(
    • by SecretAsianMan ( 45389 ) on Wednesday December 05, 2001 @03:30AM (#2658597) Homepage
      The transition of Digital machines was not so clear-cut. Yes, the PDP-11, which was made in various models from 1970 to 1990, was a very popular machine and was generally considered to be 16-bit. The first models had a true 16-bit address space, but later models improved upon that. The 11/45 introduced 18-bit addressing and separate address spaces for data and code. The 11/70 introduced 22-bit addressing.

      Also, Digital was already making 36-bit machines, starting with the PDP-6 in 1964. The 36-bit PDP-10, which came in several flavors, was quite popular and spawned quite a culture of its own.

      Lastly, how can you not mention the 32-bit VAX? From 1976 to 1999 (!), this was Digital's 32-bit machine, and it was also very, very successful. By the time the Alpha hit the scene, the VAX had certainly taken over the supermicro/mini/supermini position formerly held by the PDP-11.
      • I was not aware of the 18- or 22-bit addressing on the 45 or 70 - the source that Patterson & Hennessy cited was from 1976, which I think was before both. The point was that expanding the address space of a machine beyond what it was originally designed for is hard and often ends up as a kludge.

        I can't claim to really know anything about the PDP-6, and I thought the 10 was an 18-bit machine. . .

        I didn't mention the VAX because I was trying to remain brief & to the point. It was a very good & successful machine. So good, in fact, that it was widely known that the Soviets cloned them for use in the USSR. In return, the DEC engineers started including little phrases on the die, like "VAX: When you only want to clone the very best" (or something close thereto). I apologize if I left the impression that DEC jumped straight from the PDP-11 to the Alpha. The point was that they had the foresight to see that the VAX architecture wasn't going to remain viable in the future, so they started migrating to the Alpha in more of a preemptive move.

        -"Zow"

        • Hey man, don't take it as a flame; I was merely attempting to give our Slashdot readers a wider education. I appreciate your comment very much.

          The 45 and 70 were introduced in 1972 and 1975. You're right, though; the wider addresses were something of a kludge. Even in the machines with separate code and data address spaces, the most virtual memory a program could address was 128KB. Can I hear an "overlays"? Ugh! :-)

          The PDP-10 and its children were indeed 36-bitters. People often wonder why the bit widths of DEC's pre-PDP-11 machines were always multiples of 6. It's fun to tell them "back in the day, a byte was not 8 bits, and octal made sense" and then watch enlightenment slowly spread over their faces.
          • No offense taken - I always appreciate having my sphere of knowledge expanded.

            Overlays, though. Can't say I appreciate those. I mean, they were a slick solution to the problem at hand back in the day, but I've had to port a heavily overlaid program (originally written in the DOS 2.x days) to a proper virtual memory system (NT). Ever since, whenever I have a problem, I always try to consider what effect any "slick" solution will have on maintenance down the road.

            -"Zow"

    • No problem, just kludge on bank-switching.

      Heh, I remember bank-switching on my C64. Then a few years later I was learning to use LIM EMS memory on x86/MSDOS and that's when I realized nothing ever changes. So now you 386 programmers are doing it again? All I can say is a Nelson-like "ha ha".

  • by Nater ( 15229 ) on Tuesday December 04, 2001 @07:46PM (#2656945) Homepage

    Think about what the average home/office user is doing on the computer and how much processing power it really takes to make that cursor blink. The simple fact is that for a typical office suite and web browser, current technology is overkill. Some people like to play audio, video, or games on their computers and that takes some more processing power, but it's nothing that pushes the limits of modern hardware (you gamers who say you can tell the difference between 100 and 125 FPS are lying... that's 1.5 to 2 times your monitor's refresh rate).

    People are going to get the hot new toys because they're hot new toys and then be really disappointed when everything they've been doing doesn't get any better.

    Somebody somewhere might develop the killer app that makes a 64-bit processor make sense for home and desktop users, and I can think of a few things that have the potential to take off like that, but until then the new hardware will basically be a "my dick is bigger than yours" type of thing. I honestly hope that killer app comes sooner rather than later because whatever it is, it'll be killer.

    • by green pizza ( 159161 ) on Tuesday December 04, 2001 @08:27PM (#2657123) Homepage
      Very well said. However, I did want to make one point.

      (you gamers who say you can tell the difference between 100 and 125 FPS are lying... that's 1.5 to 2 times your monitor's refresh rate)

      I typically can't stand gamers, but I do understand their desire for framerate. The typical PC game has no mechanism for holding a sustained framerate, nor are things like texture preloading handled with any sort of elegance. Most PC games use weak code and brute force to produce any acceptable output. As such, the machine with the highest average framerate is the machine that's the least likely to get into a situation where the framerate will drop below, say, 30 FPS... perhaps in a complex scene or hitting a spot of particularly bad code.

      That said, the nitpicking "this 3D accelerator is better because it's 10.2% faster" blurbs are mostly BS.

      Then there's the other end of the spectrum. The company I work for has an SGI-powered RealityCenter for engineering review and presentations. The 30-foot-wide screen is curved and lit by three Barco projectors. It's normally driven either as 3840x1024 super-wide using all three projectors, each driven by its own graphics pipe, or, for more complex scenes, with the three pipes working in parallel to drive just one projector at 1280x1024. Most of our software is created in house with the help of the SGI IRIS Performer and MultiGen-Paradigm Vega libraries. Aside from a few exceptions, the whole setup runs at a locked 60 Hz (60 FPS gfx and projector).

      For those that like tech specs, the machine behind the curtain is a Silicon Graphics Onyx2 installed in early 1999. It has 24 MIPS R10000 CPUs each with 8 MB of L2 cache and running at 250 MHz. 48 GB RAM and 1.8 TB of disk via four channels of gigabit fibrechannel. The graphics pipes are three InfiniteReality2 subsystems, each with four Raster Managers (64 MB of dedicated texture ram plus 320 MB of generic graphics ram per pipe). There's a DPLEX module on each pipe to allow all three to work in parallel when needed.

      If the bean counters approve, we should have a totally new Onyx3000 system installed by June 2002. After all, our current setup is about 3 years old... ancient in computer terms. Thankfully the projectors, lighting controls, and indeed most of the room (seating, conference table loft, etc) will be reused.
      • Wow! What does your company do with that beast? What kind of reviews & presentations benefit from that kind of horsepower? I'm guessing it's not for displaying PowerPoint with 4-foot high bullets.
        • Wow! What does your company do with that beast? What kind of reviews & presentations benefit from that kind of horsepower? I'm guessing it's not for displaying PowerPoint with 4-foot high bullets.

          Engineering reviews of current projects. Tracing beams, pipes, etc. on 3D models of offshore oil rigs. When the three InfiniteReality2 pipelines are working together in DPLEX mode, we're able to render every pipe, elbow, and setscrew and still fluidly move around the model. It's been a godsend for poring (or arguing) over details. Standing around a small monitor on an underpowered desktop workstation or, worse yet, blueprints, wastes a tremendous amount of time.
      • I typically can't stand gamers, but I do understand their desire for framerate. The typical PC game has no mechanism for holding a sustained framerate, nor are things like texture preloading handled with any sort of elegance. Most PC games use weak code and brute force to produce any acceptable output. As such, the machine with the highest average framerate is the machine that's the least likely to get into a situation where the framerate will drop below, say, 30 FPS... perhaps in a complex scene or hitting a spot of particularly bad code.

        That said, the nitpicking "this 3D accelerator is better because it's 10.2% faster" blurbs are mostly BS.


        That "nitpicking" has a nugget of truth to it, however. As you have said average framerate depends on lowest framerate you hit. The highest a card will hit can be limited by the system itself. Therefore, if a card is 10% faster it may mean its lowest framerate is 20% higher. This lowest framerate is noticeable as it is definitely under the refresh rate of your monitor. Also, some people have monitors that can run relatively high refresh rates at high resolution (my old CTX VL710 ran at 110 Hz at 1024x768). Framerates are only really noticeable when they are slow, and a faster graphics card makes a huge impact there.
    • At work I've been using iMovie on a nice dual-processor G4 heavily for the past few weeks and have been amazed at what kind of difference there is between that box and my G3 iMac... most consumers won't go crazy for video, but it is something that more and more people are doing.
    • Right, but keep in mind that when the latest generation of games comes out (think the new DooM here), it makes all current hardware look like crap. Case in point: Carmack says that the new DooM should run at about 30 fps on a GeForce 3, and a bit faster on the Radeon 7500. That's fast enough to provide the illusion of motion, but it's a far cry from silky-smooth graphics. Games, and maybe video encoding, are the only things that are going to push Joe Bloe's 2 GHz P4 to the limit.
  • The move from 32 to 64 bits isn't so much about performance as it is about ...size, I guess. The ability to hold huge databases and datasets in a flatly addressable space. The ability to do maths with larger and/or more accurate numbers. That kind of thing.

    As I recall, a 286 was slightly faster than a 386 at the same clock speed. The 486 was the first x86 that was actually designed to go fast. The big deal about the 386 was that it did memory management properly and had multitasking abilities, and to do that it needed a large addressable flat memory space (hence 32-bit pointers). The 32-bit registers were that size mainly so they could hold pointers and offsets and things. (Yes, I'm simplifying, I know.)

    64-bit CPUs will be faster at a few things, like copying memory and crunching RC5, but most performance benefits in future CPUs will have nothing to do with word-size (clock speeds, cache sizes, clever pipelines, etc.).

    • 64-bit CPUs will be faster at a few things, like copying memory and crunching RC5,


      Sorry buddy, but you're wrong.

      Taking an example from the x86 world: since the Pentium MMX, it has been possible to copy memory in 64-bit chunks using the MMX instructions. The Pentium 4 adds SSE2 instructions into the mix, able to move data in 128-bit chunks, plus some ``cacheability control'' instructions. These hint to the processor about the nature of the data, so it can handle caching better. (Some PREFETCH instructions, found e.g. in the Athlon, already have some of these capabilities.) In the end, a ``plain'' 64-bit processor can't possibly use memory bandwidth any more efficiently than these 32-bit designs.
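
      (As a hypothetical illustration of the SSE2 point above, using Intel's C intrinsics rather than raw assembly - the function copy_stream is invented for this sketch: it copies 16 bytes per iteration with non-temporal "streaming" stores so the destination doesn't pollute the cache. It assumes 16-byte-aligned pointers and a length that is a multiple of 16; a sketch, not a drop-in memcpy replacement.)

      #include <emmintrin.h>   /* SSE2 intrinsics */
      #include <stddef.h>

      static void copy_stream(void *dst, const void *src, size_t n)
      {
          const __m128i *s = (const __m128i *)src;
          __m128i *d = (__m128i *)dst;

          for (size_t i = 0; i < n / 16; i++) {
              __m128i v = _mm_load_si128(&s[i]);   /* 128-bit aligned load */
              _mm_stream_si128(&d[i], v);          /* store that bypasses the cache */
          }
          _mm_sfence();   /* make the streaming stores globally visible */
      }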

      Now for the issue of RC5. First, I assume you're talking about the distributed.net client. (The RC5 cypher itself is parameterizable, so this discussion might not apply to real-life RC5 implementations.) I do have some knowledge on this area, and I can assure you, 64-bit will only slow down RC5 implementations, when the word size parameters are specified at 32 bits. Which, BTW, is the value set by RSA on all their Secret Key Challenges, this being the stuff distributed.net is cracking.

      This is mostly related to the fact that arithmetic must be done modulo 2**32, and this cannot be done efficiently with 64-bit instructions. With ADD instructions, it's fairly easy -- just mask out the 32 most significant bits, i.e. AND REG, 0x00000000FFFFFFFF (still, you're executing a ``useless'' instruction, as in ``unrelated to the algorithm at hand'', which will slow you down). However, RC5 relies heavily on rotation instructions, and unless the ISA in question allows for working with 32-bit chunks of registers, we're back to the slow {shift left by n, shift right by 32-n, OR} procedure. Again, that's not a flaw in the algorithm -- it allows you to specify many parameters, word size being one of them, and RSA picked 32 bits for their contests.
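
      (To make the rotation cost concrete, here is a hypothetical C sketch - rotl32_on_64 is an invented name, not from the post: rotating a 32-bit value held in a 64-bit register without a 32-bit rotate instruction costs a shift left by n, a shift right by 32-n, an OR, and a mask back down to 32 bits.)

      #include <stdint.h>

      static uint64_t rotl32_on_64(uint64_t x, unsigned n)
      {
          n &= 31;
          uint64_t lo = (x << n) & 0xFFFFFFFFULL;          /* shift left by n, keep low 32 bits */
          uint64_t hi = (x & 0xFFFFFFFFULL) >> (32 - n);   /* bits that wrapped around the top */
          return lo | hi;
      }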
      • I do have some knowledge on this area, and I can assure you, 64-bit will only slow down RC5 implementations, when the word size parameters are specified at 32 bits. Which, BTW, is the value set by RSA on all their Secret Key Challenges, this being the stuff distributed.net is cracking.

        This isn't true if you have sufficient registers and you use a "bitslice approach." With a bitslicing approach, you store each of the 32 bits of a number across 32 different registers, and you do this for several keys in parallel. For instance, you store the data for key 0 in bit 0 across the 32 registers, the data for key 1 in bit 1, and so on. Then you can process N keys in parallel, where N is the bit-width of your machine. If you have a 64-bit machine, this is twice as efficient as a 32-bit machine, for the same number of registers. XORs don't change, ROTLs by 3 just become moves (if anything), and ADDs are two XORs and an AND. The only hard part with RC5 is when you have to ROTL by a varying amount -- that requires some fancy footwork, but it's not too bad (about 5 sets of tricky ANDs and ORs). :-)
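
        (A hypothetical sketch of the bitslice layout being described, in C rather than assembly - the names sliced32, sliced_xor and sliced_add are invented for illustration: bit[i] holds bit i of 64 independent 32-bit values, one value per bit lane, so a single 64-bit operation acts on all 64 lanes at once. The add below is a plain ripple-carry adder built from XOR/AND/OR; a real client would be cleverer, this only shows the idea.)

        #include <stdint.h>

        typedef struct { uint64_t bit[32]; } sliced32;   /* bit[i] = bit i of 64 lanes */

        static void sliced_xor(sliced32 *out, const sliced32 *a, const sliced32 *b)
        {
            for (int i = 0; i < 32; i++)
                out->bit[i] = a->bit[i] ^ b->bit[i];      /* 64 lane-wise XORs at once */
        }

        static void sliced_add(sliced32 *out, const sliced32 *a, const sliced32 *b)
        {
            uint64_t carry = 0;
            for (int i = 0; i < 32; i++) {                /* ripple carry, LSB first */
                uint64_t x = a->bit[i], y = b->bit[i];
                out->bit[i] = x ^ y ^ carry;
                carry = (x & y) | (carry & (x ^ y));
            }
        }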

        --Joe
  • by jquirke ( 473496 ) on Tuesday December 04, 2001 @08:59PM (#2657273)
    Mainly it seems people are talking about the register width, precision, and of course address space.

    Keep in mind the first Itaniums have a 64-bit virtual address space, while the physical address space is limited to 52 bits, I think.

    The Hammer series processors are really just an x86 extension. They offer nowhere near the capabilities of Intel's fresh start with IA64.

    Here are some of the features of the IA64:

    -> Heavy use of ILP (Instruction Level Parallelism) - speaks for itself.

    -> Predication - fewer branches taken and hence less stalling. Conditional handling is done through a controlling predicate, rather than by jumping. Look at this C code:

    if (!eax) ebx=VALUEB; else ebx=VALUEA;

    Now the i386 code:

    testl %eax,%eax
    jz 1f
    movl $VALUEA,%ebx
    jmp 2f
    1: movl $VALUEB,%ebx
    2:

    Now the IA64 code:

    p2,p3 = cmp.ne r5,0 ;;
    (p2)ld8 r4=$VALUEB
    (p3)ld8 r4=$VALUEA
    /* last two statements run in parallel */

    Now, whereas the i386 code jumps all over the place, stalling the CPU, the IA64 code uses the controlling predicate registers (p2, p3) to decide.

    ->Huge register sets

    r0-r127 are the general 64-bit registers, compare this to eax,ebx,ecx,edx,esi,edi,ebp?

    p0-p63 are the predicates

    As well as 128 82-bit floating-point registers, f0-f127

    ->Speculation

    Normally you can't reschedule a load to run before a store because the addresses can overlap

    *ptr=b;
    some_code_that_does_not_touch_b_c_ptr_ptr2();
    c=*ptr2;

    Previously, you couldn't move c=*ptr2 prior to the start of this code because ptr could overlap the same memory as ptr2.

    Now you can - basically the load (using the "advanced load" instruction) is performed anyway, which allocates an entry in an internal table; then the store happens, and _if_ that store overwrites the loaded location, the load is performed again. Hopefully, though, this shouldn't happen often. And it's more flexible and powerful than this; this was just a simple example.

    -> Remappable registers - the registers can be remapped, somewhat the way memory can be paged, so that when calling a new function the stack is not necessarily needed to push and pop various registers.

    ->"Modulo" loop scheduling

    The next iteration of a loop can begin before the last one has finished - the remappable registers "rotate" to give each iteration a fresh set of virtual registers.

    ->An interesting way of handling paging

    Which reduces TLB flushes on task switches by tagging each entry in the page tables with a unique ID specific to a process - I'm not fully sure of the details, since I've never looked at IA64 system programming.

    Sorry I'm sounding like an Intel brochure, but it really is quite amazing if you're coming from an x86 programming background - IA64 is a lot more than doubling data unit sizes. If you're familiar with assembly programming, I suggest reading the IA64 manual at developer.intel.com.
    • by Anonymous Coward
      Heavy use of ILP is nothing new. Every modern processor does it, but most do it implicitly - superscalar, out-of-order and speculative execution. IA-64 (and EPIC in general) needs the compiler to explicitly encode the ILP in the instruction stream. This is similar to VLIW, and just as difficult. Predication is useful, but note that it's not as easy as you might think to do this in the compiler. Naive use of "if-conversion" and beyond is actually as likely to slow down your code as to speed it up; there are people working on compiler algorithms to sort this out. The modulo scheduling support you mention is fine too, and helps the compiler alright, but again it's not as big a win for general-purpose codes as you might imagine. There are people working on it, but I'm not holding my breath for it to be solved.

      This is not a snub on the very smart people working on it; people like Dan Lavery, who did a dissertation on modulo scheduling of general-purpose codes, are smart and working hard... but it's a DAMN hard problem.

      Modern software tends to be irregular, integer oriented and typically not subject to whole-program optimization. This makes the compiler's and the processor's jobs even harder... remember, most codes do not dynamically have that much ILP (although dynamic ILP processors do have limited windows), and finding it in a static compiler is even harder.

      Looking at the SPEC scores underscores my point (let's not argue right now how good a set of benchmarks SPEC CPU2000 is): fairly good FP performance (POWER4 beats it with far less power consumption, a smaller die and transistor count - and POWER4 has 2 processor cores on die!), while the int scores are DREADFUL.

      Right now IA-64 is a hot, complex, slow beast that depends on a "sufficiently smart compiler". I work on compilers, and let me tell you, there is no such thing. :-)

      Sorry for the rambling..I'm tired...

      njd@kamayan.net

    • Sorry I'm sounding like an Intel brochure,


      You are. That's my biggest gripe with your post (that, and your downplaying of AMD's x86-64 architecture). The guy asked about the advantages of 64-bit computing in general. In particular, he didn't ask for features specific to a single architecture. None of the features you talked about couldn't be implemented on a 32-bit processor, or an 8-bit microcontroller for that matter. Based on that, your post is completely off-topic and should be moderated accordingly.

      Mind you, I might stop here, but I do have some other problems with your post.

      Heavy use of ILP (Instruction Level Parallelism) - speaks for itself.


      SPEC results also speak for themselves, and louder.

      Itanium 800, SPEC base=314, SPEC peak=314. Athlon XP 1600, SPEC base=677, SPEC peak=701.

      You don't need to tell me that the Itanium crushes everyone else in floating-point applications in a clock-for-clock comparison. That makes it great for scientific computation, but what about everything else? (Nobody's playing Quake on a multi-thousand-dollar CPU.) Still, the 1 GHz Alpha 21264 is 11% faster than the 800 MHz Itanium on SPECfp, and even the newest P4s beat Itanium in this benchmark - while, of course, completely trashing it in the integer benchmarks. Seeing as how Itanium can't scale either, it won't regain the performance crown soon, if ever.

      Predication - fewer branches taken and hence less stalling. Conditional handling is done through a controlling predicate, rather than by jumping. Look at this C code:


      There are enough examples out there that show the x86 in a bad light in this regard; however, you didn't choose one. A real programmer would rather write:

      movl VALUEA, %ebx
      testl %eax, %eax
      cmovzl VALUEB, %ebx

      Just as fast as your code, cycle-wise. Note: a 2 GHz processor has a 2.5x smaller cycle time than an 800 MHz processor.
      (The above code only runs on 6th+ generation processors, and requires VALUE[A-B] to be stored in registers; not a problem, I'd say.)

      I really wished to debunk the rest of your points, but I've got better stuff to do. Not that I think those were the only flaws in your post.

      Face it, Itanium is doomed. McKinley will be the first IA64 processor to be taken seriously. Particular implementations notwithstanding, it's a flawed architecture. VLIW has proved NOT to be a panacea, while OOO, which has proven time and again to work, was laughed at by the Intel PR machine. Now, with benchmarks available, they're just reaping the bitter results of their strategy.

      Finally, the fact that ``[t]he Hammer series processors are really just an x86 extension'' is, in fact, their most advertised feature, and you didn't seem to have gotten it. Only time will tell whether AMD will succeed, and most factors are outside of AMD's control (and quite a few of those are non-technical). If Intel didn't have as much power as it does, the Itanium would hardly succeed in the marketplace. But then again, Sun hardware has always been a bad performer; technical factors do not play a major role in market economics, apparently.
      • Alright mate I confess - I'm an Intel exec with nothing better to do than sit around trolling Slashdot.

        Nah, but seriously, I think you took the post too seriously. Of course I could have used the CMOV instruction in my i386 code; however, I was merely highlighting what predication was about - but then it wouldn't be i386 code, would it? Hey, you're the real programmer, you tell me.

        And for this supposedly being offtopic - well we all interpret things a little differently - I'm sorry if reading my comment wasted so much of your precious time.

        SPEC results also speak for themselves, and louder - a real programmer would take SPEC with perhaps a grain of truthfulness at most.
        I am in no way downplaying AMD's efforts at a 64-bit processor. I was pointing out they are in different leagues and should not be compared, as it seems many people are doing.

        I really wish to debunk the rest of your points, but I've got better stuff to do

        umm, yeah
      • I'm not an assembly programmer, and you can probably tell that by this message, but one thing that really gets to me is people picking apart each other's ASM code and calling it a "bad example"

        "movl VALUEA, %ebx
        testl %eax, %eax
        cmovzl VALUEB, %ebx"

        or

        "p2,p3 = cmp.ne r5,0 ;;
        (p2)ld8 r4=$VALUEB
        (p3)ld8 r4=$VALUEA "

        I don't think the author of the second piece of code was trying to be the most efficient ASM programmer that he could be. He was showing that it is half as long as the regular i386 code, and 2 of the 3 lines execute at the same time, making it effectively two lines of code that do the same thing, with no jumps.

        Someone else got reamed out for using

        MOV register,0; or something like that instead of
        XOR Af,0;

        to null something.

        *sigh* Nitpicking at its worst. I'm not flaming either of the two posters personally, just commenting on their writing styles.

        Is Assembly such an 3l33t language that programmers feel a need to very loudly correct each other's small mistakes?

        Hmm.
        • There's a huge difference there.

          Although the x86 ISA is serial in nature, processors starting with the Pentium have been superscalar (that is, able to execute more than one instruction per clock cycle), and those since the Pentium Pro/II/III have been capable of out-of-order execution.

          The difference between the original code and my code is simple. Unless the branch can be easily predicted, his code will take (worst-case scenario) something from 12 cycles (on the Athlon) to 22 cycles (on the P4). We're only taking into account the delay caused by pipeline flushing. In fact, his code was also not aligned to a particular boundary (say, addresses divisible by 16). That might incur penalties as well.

          My code, on the other hand, will execute in 2 cycles, since the first two instructions are paired on about any superscalar processor (see? My code runs in parallel as well. That's not an Itanium-only feature; in fact, you might even say that Itanium's hardwired implementation of parallelism is rather poor). But I digress. My point was, I didn't study the IA-64 documentation, and if this is the only situation where I can avoid branches, then I need not drift away from x86; there I can avoid these sorts of branches already.

          As for MOV EAX, 0 vs. XOR EAX, EAX, while both accomplish the same goal at the same speed, the former instruction is encoded into 5 bytes, while the latter is encoded into a single byte. If you're optimizing tight loops, this might really make a difference.
          • As for MOV EAX, 0 vs. XOR EAX, EAX , while both accomplish the same goal at the same speed, the former instruction is encoded into 5 bytes, while the latter is encoded into a single byte. If you're optimizing tight loops, this might really make a difference

            (Disclaimer: I'm not speaking specifically about x86 here, but in the general sense of instruction scheduling.)

            If they both encoded into the same size, I'd go with the MOV, though. The latter looks like it depends on the previous value of EAX, even though in this special case it doesn't. Unless the dynamic scheduler has a special test for this (and on x86, because of the coding-size issue, I'm pretty sure it does), it won't allow the XOR to run in parallel with anything that modifies EAX. OTOH, the MOV can not only parallelize, but also get a different rename register and be reordered very aggressively with respect to other places where EAX is used. (That's what rename registers are for, after all.)

            --Joe
    • Perhaps you should go read the manual for an Alpha CPU. It has had almost all of these improvements for a decade now. And modern Alpha CPUs are kicking the Itanium's ass in performance. These might be new ideas to you, but they aren't exactly new to the industry...

      I guess I just find it really annoying when M$/Intel/Whatever come out with some "innovation" that is not.

  • malloc(-1) != NULL

    (that's assuming, of course, that size_t is still 2^32)
  • The main advance will be throwing off twenty years of cruft.

    As one example, AFAIK the PIV still uses 20 bit addressing which is mapped to 32 bit by the MMU. This really isn't a big deal in terms of performance, but it is an example of the way the x86 architecture has grown into a rigged-up hack on a hack on a hack.

    Another big advantage is (again AFAIK) 128 64-bit general-purpose registers. If used in a smart way, these could save boatloads of expensive (in terms of speed) memory ops.

    I'm really going out on a limb here, but I think it means a larger instruction set without having to resort to multi-word instructions. Love to hear comments on this from someone who actually knows what he's talking about!

    -Peter
    • Most modern architectures don't have multi-word instructions. If I remember correctly, even the 68000 processor series used single-word instructions.

      The P4 doesn't always use 20 bit addressing. It can, but it doesn't need to. There are pure 32 bit memory accessor operations that let you skip the translation.

      Most non-x86 modern 32-bit processors (32-bit PowerPC, MIPS, and SPARC processors, etc.) have more registers than the PC. That nets you great performance increases even without the register-renaming features the latest x86 CPUs have.

      64 bit processors, in their own right, don't net your average user much extra power right now. But we are reaching the point where mid and high end computing needs 64 bits. It's only a matter of time before consumer PCs start to need 64-bits.
      • If I remember correctly, even the 68000 processor series used single-word instructions.

        The 680x0 processor series (used in Sega Genesis, the slightly modified Genesis that was Neo-Geo, pre-PPC Macs, and Palm OS devices) used a 16-bit instruction word followed by 0 to 4 words of address or data depending on the addressing mode. You must be thinking of Thumb (scaled-down version of ARM instruction set used in Game Boy Advance) or KIPS (scaled-down version of MIPS instruction set used in many undergraduate computer engineering projects)

  • by cmowire ( 254489 ) on Wednesday December 05, 2001 @02:20PM (#2660712) Homepage
    Some advantages nobody's touched on yet..

    1) Easier implementation of large filesystems. 2^64*512 bytes of disk space per filesystem should be good enough for quite a few increments of Moore's Law. Ditto for larger than 4 gig files.
    2) 64 bit processors take better to certain implementations of NUMA. SGI's implementation of NUMA gives each processor a range of memory that is local to that processor. If you had a 64 processor NUMA cluster, you'd have 64 megs local to each processor with 32 bit processors. You could have a few gigs per processor with 64-bit addressing.
    3) With 64-bit processors, it's easier to map a file to memory again, without needing to map individual chunks. Over the near term, you could map your entire disk drive to memory space.
    4) There are cases (i.e. bit packing) that don't take too well to vectorized MMX/SSE/etc. processing but do take well to 64 bit registers.
    5) The ability to segment your memory space without creating annoying limitations. As in, you can have the lower 8,388,608 terabytes of RAM reserved for the user and the upper 8,388,608 terabytes of RAM reserved for the kernel. As opposed to Windows 2k, which leaves 2 gigs for the user and 2 gigs for the kernel. With the possibility of 3 gigs for the user, if you are running a higher-end version.
    6) The ability to cache a data structure in the RAM attached to a given machine instead of buying solid state disk drives or other such things.
  • One of the features of 64-bit processing that I've been eyeing for the last 5-6 years is memory-mapped IO. Instead of manually reading files into memory, it's possible to tell the OS to map a huge multi-gigabyte file into an address space and then access that address space as if the file were already in memory. The OS can then cache and do really cool optimizations in the background, and the program doesn't have to worry about reading and writing blocks of data one chunk at a time (though Linux, as I understand it, optimizes this too with copy-on-write). When a value changes in that address space, the OS takes care of writing it back to the file.

    Of course, 32-bit processors can do memory-mapped IO, but not anywhere close to the scale that 64-bit processors can. Practical limits may constrict a 32-bit memory mapping IO implementation to less than 2GB of address space, though PAE might be able to increase that slightly (not totally sure, but it makes sense if it does :).

    Expect possibly more efficient databases that allow the OS to optimize the disk access even more :).
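
    A hypothetical sketch of the idea using POSIX mmap (the helper sum_file_bytes is invented for illustration, and error handling is kept minimal): map a whole file into the address space and walk it like an ordinary array, letting the OS page and cache it behind the scenes. On a 32-bit machine a multi-gigabyte file simply won't fit into the address space; with 64-bit pointers it can be mapped in one piece.

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Sum every byte of a file through a memory mapping; returns -1 on error. */
    long sum_file_bytes(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) {
            close(fd);
            return -1;
        }

        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += data[i];               /* file contents read as ordinary memory */

        munmap(data, st.st_size);
        close(fd);
        return sum;
    }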
