Will We Ever Get Rid Of ASCII? (38 comments)

GeZ asks: "When will Unicode finally replace ASCII? When will 7-bit-encoded text finally disappear? When will 'extended' chars (like 'é' or 'ß', etc) be recognized as 'alphanumerics', letting us use all the characters we want for file names, function names, and DNS names? Most top-level modern apps and standards use Unicode, so it deserves to be integrated at the lowest level, now. I really think old ASCII is too limited and fragmented to be useful. Using metachars in an ASCII file (a la HTML entities) is a boring way to solve the problem. A perfect integration with OSes (and base libraries) will "magically" make nearly all apps Unicode compliant, no? Yes, text chars will be encoded on 16 bits instead of 7 or 8 and would double text file size, but is this really troublesome, given today's storage media?" Do any of you think that Unicode will completely replace ASCII, or are there reasons why it's still in use as the primary way to represent text characters?
This discussion has been archived. No new comments can be posted.

  • Who really needs 65536 different characters?

    • Chinese and Japanese scripts use Han characters (Japanese calls them Kanji). There are an estimated 50,000 characters in Chinese, but only a small fraction of those are commonly used, and an even smaller fraction (about 5,000 or so) in Japanese.
    • Sinhala, Devanagari, Greek, Cyrillic, Tengwar, Arabic, Hebrew, and other scripts need character codes; all are present and accounted for (except for Tengwar) in Unicode 3.0. (Tengwar can be found in a de facto con-script standard somewhere on the Net that specifies characters in the Private Use areas of Unicode.)

    again, a char is supposed to be the same size as a byte

    Nowhere in the C standard does it say that bytes must be 8 bits. Some C/C++ compilers for DSP architectures set char = short = int = long = 32 bits and still comply with the standard. There's also the wchar_t data type.
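
    A quick way to check both points on any given system (a minimal sketch using only standard headers; CHAR_BIT comes from limits.h, wchar_t from stddef.h):

    #include <limits.h>
    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        /* CHAR_BIT must be at least 8, but nothing says it must be exactly 8;
           on the DSP compilers mentioned above it can be 32. */
        printf("bits per char: %d\n", CHAR_BIT);

        /* wchar_t is the standard wide-character type; its width is
           implementation-defined (commonly 16 or 32 bits). */
        printf("bytes per wchar_t: %u\n", (unsigned)sizeof(wchar_t));
        return 0;
    }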

  • Unicode is one of those "semi-proprietary" standards -- the documents aren't available for free (be they the ones from ISO or from the Unicode Consortium), but there are no legal barriers to making an implementation -- just the size of the table makes the job of creating fonts unreasonably huge. OTOH, the tables necessary for determining what the characters are, are available for free.

    The problem, however, is different -- people already use their own charsets, and those charsets were designed to reflect the structure of their language, or just to be most convenient for it, and are sometimes quite different from the part of Unicode that is supposed to cover the same language. If, instead of trying to _convert_ everything to Unicode, people adopted a reasonable way (iso 2022 isn't reasonable) to label which charset and which language appear where in their strings, an implementation would be able to use all known charsets. Programs that aren't concerned with charset-dependent operations could just ignore the whole thing and treat text as a sequence of bytes, until charset-specific procedures are called to process/display/compare/convert/input/... the text, at which point the "real" size and mapping of characters would emerge -- and those procedures could be language-dependent, replaceable and expandable, as long as they implement an easy mechanism for mapping charset/language names to sets of procedures. Unicode could be used as one of the possible charsets, and UTF-8 as one of the possible encodings, in such a system, but it wouldn't be "the" thing that everyone is supposed to support and be aware of. At most, some programs would have to know what the label delimiters look like.

    This could be a very easy solution to the real problem, but it requires agreement on how charsets/languages should be labeled (their "real" names should be used to make the thing expandable; how those labels should be separated from "normal" text remains an open question).
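
    Purely as a hypothetical illustration of such labeling (the delimiter and tag syntax below are invented for the sketch, since the comment deliberately leaves them open), a program that does not care about charsets only has to recognize and skip the labels:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical in-band label: "\x1b{charset-name}" switches the charset for
       the bytes that follow.  A charset-unaware program skips the labels and
       passes every other byte through untouched. */
    static void print_skipping_labels(const char *s)
    {
        while (*s) {
            if (s[0] == '\x1b' && s[1] == '{') {         /* start of a label     */
                const char *end = strchr(s + 2, '}');    /* find the closing '}' */
                if (end) { s = end + 1; continue; }      /* skip the whole label */
            }
            putchar(*s++);                               /* ordinary text byte   */
        }
    }

    int main(void)
    {
        /* "Hello, " in ASCII followed by three koi8-r bytes (a Russian word). */
        print_skipping_labels("\x1b{ascii}Hello, \x1b{koi8-r}" "\xcd\xc9\xd2");
        putchar('\n');
        return 0;
    }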

  • I think people who speak languages that don't use the Roman alphabet may end up simply learning to encode their languages in ASCII. The Germans already do this, with u-umlaut being encoded as "ue" and so forth.

    Transliterating to Roman letters and using ASCII (especially since more people already speak English as a second language than any other in the world) may simply be simpler and faster.
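
    For instance, a minimal sketch of that kind of ASCII fallback for German (assuming iso8859-1 input; the table covers only the umlauts and the sharp s):

    #include <stdio.h>

    /* Map iso8859-1 German letters to their conventional ASCII digraphs. */
    static const char *to_ascii(unsigned char c)
    {
        switch (c) {
        case 0xE4: return "ae";   /* a-umlaut */
        case 0xF6: return "oe";   /* o-umlaut */
        case 0xFC: return "ue";   /* u-umlaut */
        case 0xC4: return "Ae";
        case 0xD6: return "Oe";
        case 0xDC: return "Ue";
        case 0xDF: return "ss";   /* sharp s  */
        default:   return NULL;   /* not a special German letter */
        }
    }

    int main(void)
    {
        const unsigned char word[] = "Gr" "\xfc\xdf" "e";  /* iso8859-1 bytes */
        const unsigned char *p;
        for (p = word; *p; p++) {
            const char *rep = to_ascii(*p);
            if (rep)
                fputs(rep, stdout);
            else
                putchar(*p);
        }
        putchar('\n');                                     /* prints "Gruesse" */
        return 0;
    }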

  • by cr0sh ( 43134 )
    Cool - now I am going to have to find my punch card and look at it (I have one I found in an IBM 740 (or 704?) training manual - I guess someone was using it as a bookmark). The image you gave, though, was clear enough to see what you meant by zone holes.

    Your explanation helps a lot - not that I have any use for such info, but I was curious about it. Between your explanation and the byte conversion array the other guy gave, I should be able to figure it out further.

    Now, what does this say about EBCDIC and ASCII - which came first? It sounds like ASCII came first, but what is the real answer?
  • I wish I could mod you back up - you were most certainly on topic (code is fine by me)...
  • Well, mebbe not. I am waiting to hear more comments from non-English slashdotters on this subject, the comments so far reflect a definite world view -- the English world.

    Non-English slashdotters that at the same time use iso8859-1 most likely see Unicode (or UTF-8) as a good thing, because the first 256 characters of Unicode are the same as iso8859-1, and they don't give a damn about everything else, while non-English slashdotters that use other local encodings/charsets (like me, whose native language is Russian, with koi8-r as the charset used in unixlike systems) see Unicode as a monstrosity, forced on them by a bunch of dumbasses at the Unicode Consortium, ISO and software vendors that benefit from every incompatibility that can force people to upgrade.

    If charset/language labeling were standardized, everyone would be able to use their own charset, and all software that is not directly involved in text editing/displaying would be able to continue working as it did before. However, by a STUPID decision made by "standard bodies", priority is given to sticking "should support Unicode/UTF-8" into every standard, in place of "should pass the data as a stream of bytes regardless of the actual size of characters, their encoding and their possible meaning, except for special characters involved in the protocol", which would actually accomplish something.

  • Actually, a lot of the AS/400 text (internal to drivers/OS) is all EBCDIC - some of it was just converted from ASCII a couple of years ago. Interesting...
  • How many of us use computers where we have 16 bit bytes? Not many, I assume. BTW, I always thought that char = 1/2 the size of an int. As in, for our computers:
    8 bits - char
    16 bits - int
    32 bits - long int
    I've also heard of an 80-bit integer, but I'm sure it's rumor and hearsay.


    When the pack animals stampede, it's time to soak the ground with blood to save the world. We fight, we die, we break our cursed bonds.
  • Ah, but *I* can't read those pages, so they must not exist [/typical American]

    In fact, I'm going to continue my lobby for a separate language - American! Heck, we never use 'bobby', 'lorry' or 'wc'. Not to mention 'football' means something different in American than in English or any other language. We'll stick with our 'coney dogs' and other fun local slang.
    [/stupidity]

    We should at least make English the national language, and make it a crime to conduct schools and businesses in other languages [/John Rocker]

    Guess I can't win... though I'd rather not. ASCII isn't going away anytime soon, but Unicode is a Good Thing TM.
  • by zrpg ( 10539 )
    Does anyone remember ZZT, the DOS game that uses the smiley character as the player, with the beepy PC speaker music? I still play that game, and many others do. As long as there is ZZT, there will be ASCII!

    OT, but check out www.planetzztpp.com, they're working on a Linux version!
  • As another poster noted - this waste of space is most troublesome in embedded systems, but hey - you should have a special compiler for that anyway - problem solved. Besides, we waste more memory with that @$#&@%! animated paperclip...
  • Sure, it's all fine and dandy. We're mostly programmers here.

    But someone, please, tell me the easiest way to type ü (u-umlaut) in Windows? One of the things that I do on my Mac that shocks people is just typing foreign characters as I go (opt-u, u is u-umlaut; opt-u, e is e-umlaut, etc). I think one of the reasons no one wants to move from ASCII to anything else is that it's rather hard to type in anything else.

    Just my .02

    ls: .sig: File not found.
  • Damn. I can never trust what I read on the internet. I still think UTF-8 is cool. :)
  • > The best part is that utf-8 requires no change. All ascii programs can read utf-8 and all utf-8 programs can read ascii.

    I'm confused. ASCII is 7 bit, right? If UTF-8 is eight bit then I don't see how the two are always interchangeable, unless UTF-8 always zeroes the high bit. If that's the case, why not UTF-7?
  • iso-accents-mode is yet another reason why Emacs is The One True Editor(tm).
  • While I agree that texts, filenames, etc should by default support 8-bit characters, there are advantages to keeping ASCII around:

    The limitations in ASCII make searching texts and code a lot easier. I _like_ restrictions for function and variable names.

    Of course something like is_ascii might just be enough for such a backwards compatibility hack.
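
    Such a guard really is tiny (a sketch):

    /* Return 1 if every byte of the string is 7-bit ASCII, 0 otherwise --
       the kind of is_ascii check that could gate ASCII-only identifier rules. */
    static int is_ascii(const char *s)
    {
        for (; *s; s++)
            if ((unsigned char)*s > 0x7f)
                return 0;
        return 1;
    }

    Anything that fails the check could then be rejected (or transliterated) before it reaches ASCII-only code.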

  • by pimaniac ( 98575 ) on Wednesday May 10, 2000 @10:04PM (#1079862)
    Utf-8 is the name of the set of all characters formed by the lower 8 bits of unicode, which are all the ascii characters.
    Since unicode is a variable length encoding, utf-8 can look exactly like ascii to an ascii machine.
    The best part is that utf-8 requires no change. All ascii programs can read utf-8 and all utf-8 programs can read ascii. So therefore all unicode programs can read and write ascii. And all ascii programs can read and write a unicode subset.
    To top it off, if a file does use the extended unicode stuff (>8 bits) then it will just look like line noise to an ascii machine, and a normal document in whatever language to a unicode machine.
    The file size increase won't happen for ascii characters, but an additional 8 bits is needed for extended characters.
    In conclusion, Unicode will completely replace ascii, and almost no one (in english speaking countries at least) will notice. :)

    Example:
    ascii A == 65, or 1000001
    unicode/utf-8 A == 65, or 1000001.
    There won't be any problems here. :)
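
    To make the "no change for ASCII" point concrete, here is a minimal sketch of a UTF-8 encoder for code points up to 0xFFFF (an illustration, not production code):

    #include <assert.h>
    #include <stdio.h>

    /* Encode one code point (up to 0xFFFF) as UTF-8; returns the byte count. */
    static int utf8_encode(unsigned int cp, unsigned char out[3])
    {
        if (cp < 0x80) {                                    /* 0xxxxxxx           */
            out[0] = (unsigned char)cp;
            return 1;
        }
        if (cp < 0x800) {                                   /* 110xxxxx 10xxxxxx  */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));        /* 1110xxxx ...       */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }

    int main(void)
    {
        unsigned char buf[3];
        unsigned int cp;

        /* Every 7-bit ASCII code point encodes to exactly one, unchanged byte. */
        for (cp = 0; cp < 0x80; cp++)
            assert(utf8_encode(cp, buf) == 1 && buf[0] == cp);

        printf("ASCII survives UTF-8 encoding byte-for-byte\n");
        printf("U+00E9 (e-acute) takes %d bytes\n", utf8_encode(0x00E9, buf));
        printf("U+4E2D (a CJK character) takes %d bytes\n", utf8_encode(0x4E2D, buf));
        return 0;
    }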


  • /* ASCII-to-EBCDIC translation table: index by ASCII code (0x00-0xff); the
       value is the corresponding EBCDIC code, with unmappable codes falling
       back to 0x4b (EBCDIC '.'). */
    static const unsigned char
    _atoe_[256] = {
    0x4b, 0x01, 0x02, 0x03, 0x37, 0x2d, 0x2e, 0x2f, /* 00-07 done */
    /*NUL,SOH, STX, ETX, EOT, ENQ, ACK, BEL */
    0x16, 0x05, 0x25, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 08-0f done */
    /*BS, HT, LF, VT, FF, CR, SO, SI */
    0x10, 0x11, 0x12, 0x13, 0x3c, 0x3d, 0x32, 0x26, /* 10-17 done */
    /*DLE,DC1, DC2, DC3, DC4, NAK, SYN, ETB */
    0x18, 0x19, 0x3f, 0x27, 0x4b, 0x4b, 0x4b, 0x4b, /* 18-1f done */
    /*CAN, EM, SUB, ESC, N/A, N/A, N/A, N/A */
    0x40, 0x5a, 0x7f, 0x7b, 0x5b, 0x6c, 0x50, 0x7d, /* 20-27 done */
    /*SP, "!", """, "#", "$", "%", "&", "'" */
    0x4d, 0x5d, 0x5c, 0x4e, 0x6b, 0x60, 0x4b, 0x61, /* 28-2f done */
    /*"(", ")", "*", "+", ",", "-", ".", "/" */
    0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, /* 30-37 done */
    /*"0", "1", "2", "3", "4", "5", "6", "7" */
    0xf8, 0xf9, 0x7a, 0x5e, 0x4c, 0x7e, 0x6e, 0x6f, /* 38-3f done */
    /*"8","9", ":", ";", "<", "=", ">", "?" */
    0x7c, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7, /* 40-47 done */
    /*@", "A", "B", "C", "D", "E", "F", "G" */
    0xc8, 0xc9, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, /* 48-4f done */
    /*"H", "I", "J", "K", "L", "M", "N", "O" */
    0xd7, 0xd8, 0xd9, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6, /* 50-57 done */
    /*"P","Q", "R", "S", "T", "U", "V", "W" */
    0xe7, 0xe8, 0xe9, 0x4b, 0xe0, 0x4b, 0x5f, 0x6d, /* 58-5f done */
    /*"X","Y", "Z", N/A, "\", N/A, "^", "_" */
    0x79, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, /* 60-67 done */
    /*"`","a", "b", "c", "d", "e", "f", "g" */
    0x88, 0x89, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, /* 68-6f done */
    /*"h","i", "j", "k", "l", "m", "n", "o" */
    0x97, 0x98, 0x99, 0xa2, 0xa3, 0xa4, 0xa5, 0xa6, /* 70-77 done */
    /*"p","q", "r", "s", "t", "u", "v", "w" */
    0xa7, 0xa8, 0xa9, 0xc0, 0x6a, 0xd0, 0xa1, 0x07, /* 78-7f done */
    /*"x","y", "z", "{", "|", "}", "~", DEL */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* 80-87 done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* 88-8f done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* 90-97 done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* 98-9f done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* a0-a7 done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* a8-af done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* b0-b7 done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* b8-bf done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* c0-c7 done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* c8-cf done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* d0-d7 done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* d8-df done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* e0-e7 done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* e8-ef done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, /* f0-f7 done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b /* f8-ff done */
    /*N/A,N/A, N/A, N/A, N/A, N/A, N/A, N/A */
    };
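
    And a small sketch of how a table like this might be used (it assumes the _atoe_ array above is in scope):

    #include <stddef.h>

    /* Translate a buffer of ASCII bytes to EBCDIC in place via the table above. */
    static void ascii_to_ebcdic(unsigned char *buf, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            buf[i] = _atoe_[buf[i]];
    }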
  • Old standards never die, they just fester in a closet.
    Look at EBCDIC. Still used for terminals in businesses. Look in a Wards or a Home Depot. They're everywhere.

    Teacher: "So the government wanted IBM to make an encryption standard, and IBM did. It was named..."
    Student: "EBCDIC?"
  • Since ASCII contains all the characters you need to write in English, and 2/3 of all applications are written in the US, Unicode will take a very long time to replace ASCII, if ever.

    Of course, I haven't got the numbers to back this up.

    --Bud

  • Unicode is an abomination. ASCII may not be perfect, but it fits in a byte.
    Cheers,

    Rick Kirkland
  • I doubt that's true. If you said 2/3 of all apps were written in the English-speaking world, you might be correct, but I certainly dispute that 2/3 of them are written in the US. Significantly less than ½ IMHO. Look at the UK, Canada, Australia etc etc. In fact I'm not at all sure about the English-speaking world; just look at what Japan, China, India, Russia produce.

  • Utf-8 is the name of the set of all characters formed by the lower 8 bits of unicode, which are all the ascii characters. Since unicode is a variable length encoding, utf-8 can look exactly like ascii to an ascii machine.

    Not quite right. Unicode is fixed-size (16 bits); UTF-8 is a variable-length encoding of Unicode which, if the text consists entirely of the 7-bit ASCII subset, will look exactly like ASCII. Other characters (in the larger range, around 0x6000 to 0xFFFF [I'm guessing]) will take up to 3 (maybe 4?) bytes to represent.
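
    On the decoding side, the lead byte alone tells you how long each sequence is, and anything with the high bit clear is plain 7-bit ASCII (a sketch covering the patterns for the 16-bit range plus the 4-byte form):

    #include <stdio.h>

    /* Length of a UTF-8 sequence, judged from its lead byte. */
    static int utf8_seq_len(unsigned char lead)
    {
        if (lead < 0x80) return 1;   /* 0xxxxxxx: the 7-bit ASCII range           */
        if (lead < 0xC0) return 0;   /* 10xxxxxx: a continuation byte, not a lead */
        if (lead < 0xE0) return 2;   /* 110xxxxx: two-byte sequence               */
        if (lead < 0xF0) return 3;   /* 1110xxxx: three-byte sequence             */
        return 4;                    /* 11110xxx: four-byte sequence              */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               utf8_seq_len('A'),    /* 1: an ASCII byte, high bit clear */
               utf8_seq_len(0xC3),   /* 2: lead byte of e.g. U+00E9      */
               utf8_seq_len(0xE4));  /* 3: lead byte of e.g. U+4E2D      */
        return 0;
    }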
  • A perfect integration with OSes (and base libraries) will "magically" make nearly all apps Unicode compliant, no?

    No.

    Remember, there's a large amount of plain ol' text lying around. Heck, all of the web (including Slashdot) is essentially just ASCII with SGML entities. Nobody will suggest converting all of this to straight Unicode.

    This is why there's UTF-8, a variable-length version of Unicode that's essentially backwards-compatible.

    But that's not the whole problem. You mention implementing Unicode/UTF-8 in libraries and OS'es to get "magical compliance." No such luck. A lot of code out there (including some of my own) makes the assumption that byte=char. So people use char * and perform pointer additions and so on to parse. This is fine when you have 8-bit text. But what happens when you go to 16-bit text, or, in the case of UTF-8, variable-length chars? Things break (see the sketch after this comment).

    However, getting good solid implementations of UTF-8 in core libraries and OS'es will help a lot. Right now there really isn't one standard API for treating UTF-8 text. The new glibc has a good implementation, but if you want to write portable code, this is a problem--you don't have glibc on all systems (e.g. *BSD, Solaris...).

    But the day will soon come when programs that are not Unicode/UTF-8 compliant are in the minority.

    -Geoff
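
    A sketch of exactly that pitfall, using only the standard C multibyte API (mbrtowc); it assumes the program runs under a UTF-8 locale, which the code itself does not enforce:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        /* UTF-8 bytes for "naive" with an i-diaeresis: 6 bytes, 5 characters. */
        const char *s = "na" "\xc3\xaf" "ve";
        size_t bytes = strlen(s);          /* byte=char assumption says 6 "chars" */
        size_t chars = 0, remaining = bytes, n;
        const char *p = s;
        mbstate_t st;

        setlocale(LC_CTYPE, "");           /* pick up the environment's locale */
        memset(&st, 0, sizeof st);

        while (remaining > 0 && (n = mbrtowc(NULL, p, remaining, &st)) != 0) {
            if (n == (size_t)-1 || n == (size_t)-2)
                break;                     /* invalid or truncated sequence */
            p += n;
            remaining -= n;
            chars++;
        }
        printf("%lu bytes, %lu characters\n",
               (unsigned long)bytes, (unsigned long)chars);
        return 0;
    }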
  • ...for things that require a small memory footprint.

    ASCII uses 7 or 8 bits (usually 8 now). Unicode uses 16--twice as much. For things like embedded systems where memory can be in short supply, there's no need to double the space used for storage.

    Yes, you can do clever tricks like compress the data, and make up your own encoding scheme. And yes, memory is (relatively) cheap these days, but even so...
  • when they stop handing out ASCII keyboards. ;-)
  • Not everyone needs Unicode (I know I don't). Who really needs 65536 different characters? And again, a char is supposed to be the same size as a byte, not twice the size of one.
    Switching to Unicode would be like putting a gun to your head and pulling the trigger.


    When the pack animals stampede, it's time to soak the ground with blood to save the world. We fight, we die, we break our cursed bonds.
  • What was the reasoning behind EBCDIC? From what I have seen, it is nearly totally different from ASCII (or maybe it is the other way around - which came first?). I have never been able to find an EBCDIC to ASCII conversion chart/table/code, nor have I ever seen more than a subset of an EBCDIC chart. On top of this, I have never been able to find a history or anything on how EBCDIC came about or why. Can anyone point me to a resource?
  • Well, mebbe not. I am waiting to hear more comments from non-English slashdotters on this subject, the comments so far reflect a definite world view -- the English world.

    There is an opportunity (or was, some vendors have missed it) when converting an O/S from 32-bit to 64-bit to also build in support for Unicode. Let's face it, current multi-byte encoding schemes (1, 2, or 3 bytes per character, varying) are a pain, but Unicode is a breeze to use. Try writing some internationalized Java if you don't believe me.

  • The facility to create Unicode programs has been there for quite some time. In fact, most programming libraries make it quite easy to invoke. Even the size issue is not an absolute, as you can write versions that restrict the character set to a practical subset based on which language/region is in scope.

    The problem is that doing so involves programming effort, and with the quick development cycles of today, that makes use of Unicode really unlikely unless it is a very international company.

    -L
  • That's encoding for URIs, not just CGI, and your description is wrong. There's a safe character set used for URIs which is more than just the alphanumerics, and the %xx stuff is used to represent arbitrary octet sequences as character sequences in this set. It's not assumed that each octet is a char; that's up to the receiving application. Read RFC 2396, where it actually says this (see the sketch at the end of this comment).

    BTW '+' is an abomination - it's not actually in the spec (go check if you don't believe me). It was in an internet-draft that didn't make it to the RFC stage, but some 'early adopter' foisted it upon us, so most browsers support it.

    There has been some work on DNS standards to include unicode names, which would then be used in URLs, although the proposal there is somewhat different (essentially '-xx' instead of '%xx'). See http://search.ietf.org/internet-drafts/draft-oscarsson-i18ndns-00.txt
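
    As an illustration of the octet-based %xx scheme described in this comment (a sketch; the "safe" set here is the unreserved alphanumerics plus the mark characters of RFC 2396, and no particular library is assumed):

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Percent-encode one octet string for use in a URI.  Unsafe octets --
       including anything >= 0x80, such as UTF-8 sequences -- become %XX. */
    static void uri_escape(const unsigned char *s, FILE *out)
    {
        static const char safe[] = "-_.!~*'()";
        for (; *s; s++) {
            if (isalnum(*s) || strchr(safe, *s))
                fputc(*s, out);
            else
                fprintf(out, "%%%02X", (unsigned)*s);
        }
    }

    int main(void)
    {
        uri_escape((const unsigned char *)"caf" "\xc3\xa9" " menu", stdout);
        fputc('\n', stdout);               /* prints caf%C3%A9%20menu */
        return 0;
    }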
  • Legacy code assumes that bytes are equivalent to chars. Bytes are the smallest addressable units on modern computers, but they weren't always. In the 1950's, CPUs addressed words, and you had to jump through hoops to access individual bytes. So, the solution is obvious.

    Build computers that address memory in 16-bit chunks!

    char == 16 bits
    short == 2 chars == 32 bits
    long == 2 shorts == 64 bits
    int == pointer == 32/64 bits, depending on model of CPU

    This is exactly the way C worked on PDP-11s, etc. All existing code would recompile just fine, but would "magically" start using Unicode instead of ASCII. Yeah, the table that's used by isascii and its friends would suddenly grow to 64KB (remember, those are 16-bit bytes, not 8-bit!), but memory's cheap and getting cheaper. The memory that would be used by such a table is cheaper today than the 256 byte version was in 1970.

  • In addition to the byte=char comment, there are some standards out there that require a char to be one byte. For example, in CGI, all non-alphanumerics are converted into hex, preceded by the % sign (except for the space which is +). This would break every CGI implementation I've seen out there.

    A less critical point that I've seen in my own code is that some of it will only process data if it is below 0x7f (i.e. fits in the first 7 bits). Usually, it processes a subset of these and ignores the rest. While this wouldn't break in Unicode, it would ignore everything but the first 7 bits.
  • ...non-English slashdotters... see Unicode as a monstrosity, forced on them by a bunch of dumbasses at Unicode Consortium...

    While in principle I feel there is a genuine need to use a globally-unified character set, I've heard that Unicode is proprietary. Is this true? If so, how does it affect attempts to support it in, for example, Linux?

  • EBCDIC was a method of translating punched cards into binary. Here [maxmon.com] is a picture of a punched card. (The image comes from here [maxmon.com].) EBCDIC means "Extended BCD Interchange Code", and BCD means "Binary Coded Decimal". In BCD, the digit "0" is encoded with a low-order nybble of "0000" and "9" is "1001". On a punched card, 0-9 were encoded as single punches, and A-Z were encoded as 1-9 with additional "zone" punches. As a result, the EBCDIC encoding for the letters followed the encoding for digits, so when expressed in binary, there are gaps between "I" ("yyyy1001") and "J" ("xxxx0001"), and again between "R" ("xxxx1001") and "S" ("zzzz0010").

    BTW, my pseudo-values for the high-order nybbles follow from the zone punches that were overpunched. The top row was the "Y" zone, then came the "X" zone, and then the "zero" zone.
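
    A small sketch of those gaps, using the EBCDIC letter codes that also appear in the translation table earlier in this discussion (A-I = 0xC1-0xC9, J-R = 0xD1-0xD9, S-Z = 0xE2-0xE9):

    #include <stdio.h>

    int main(void)
    {
        /* EBCDIC letters come in three non-contiguous runs, a leftover of the
           zone punches described above. */
        unsigned int I = 0xC9, J = 0xD1, R = 0xD9, S = 0xE2;

        printf("gap between I and J: %u codes\n", J - I - 1);   /* 7 */
        printf("gap between R and S: %u codes\n", S - R - 1);   /* 8 */

        /* ASCII letters, by contrast, run contiguously from 0x41 ('A') to
           0x5A ('Z'), which is why tricks like (c - 'A') work for ASCII
           but not for EBCDIC. */
        return 0;
    }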


  • One of the few things that Microsoft actually did reasonably well was to build Unicode support into Windows NT. It's possible to write a program where your char is a Unicode char without too much trouble. Unfortunately, such a program will not run on Windows 95/98, which do not offer much in the way of Unicode support.

    So...

    Once Windows 9x dies, Unicode will become vastly more prevalent, at least on M$ platforms.
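
    The standard-C flavor of the same idea is wchar_t with wide string literals (a sketch; that wchar_t is 16 bits on NT-era Windows is an assumption about that platform, not something the C standard requires):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* A wide string: each element is a wchar_t code unit, not a byte.
           The three \x04xx escapes are Cyrillic letters. */
        const wchar_t *greeting = L"Hello, \x043c\x0438\x0440";
        wprintf(L"length in code units: %u\n", (unsigned)wcslen(greeting));  /* 10 */
        return 0;
    }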
  • rrrrrrgh... you try to be helpful, and you end up off-topic...

    Guy asks about conversion between ASCII and EBCDIC (chart/table), so I put one up... off-topic my arse!!!
    [/bitch-and-moan]
  • Not only that, you /can/ use much of high-ascii (8-bit) in filenames, on most OSs I've used. Unix, DOS, even TRS-80 machines with their color basic, and TRS-DOS. It /does/ become a pain to access those files, and tends to mess up directory browsers and the like (tripwire used to have a problem like this).

    I'd never use an OS that wasn't ascii-based at the lowest (character) level.

    ---
    script-fu: hash bang slash bin bash
