Unicode, WWW, Databases and Japanese Web Sites?

Matthew Branton asks: "I have recently started working on a database-driven Japanese-language web site, and frankly I am getting lost in a sea of complicated Unicode madness. I was toying with the idea of using Python 2.1, PostgreSQL and the mxODBC interface. Has anyone else had experience with this particular setup? Are there more appropriate free solutions available, perhaps JSP/Servlet based?" Unicode received a lot of coverage this week on Slashdot, and maybe that's due to the increasing worldwide popularity of the net and the desire of users to read and work in the character sets of their own language rather than trying to have it all fit in Latin-1. Do Python, PostgreSQL or mxODBC have any issues with Unicode?

  • by Anonymous Coward
    The fetish for Anime here notwithstanding, very, very few of the /. readers can read Japanese, or know anything about the culture. Asking an Anglo crowd won't get you much in the way of results.

    Start surfing some Japanese websites, and start emailing their webmasters. Ask people who have done it, not people who have opinions on how to do it. (Note the distinction.)
  • ...and everything works just fine with Shift-JIS.
  • The fetish for Anime here notwithstanding, very, very few of the /. readers can read Japanese, or know anything about the culture. Asking an Anglo crowd won't get you much in the way of results.

    Even if you ignore the fact that many Anime fans visit Japanese sites on a regular basis, and /. Anime fans tend to be technical, and thus know what is used on Japanese sites, I am sure that I am not the only person who has been contracted for an Asian site (in my case, Chinese).

    I *can* tell you that PHP and MySQL work quite nicely with Chinese character sets, but you will quickly run into a few dozen tiny issues involving things like sorting and oddball string functions. In PHP's case (and since I don't know Chinese), I created a testbed which ran an arbitrary string through every function that could possibly be in use. I then had a Chinese tester go through and make sure everything worked (iirc, only a few weird, replaceable functions like ucwords() didn't work). In MySQL's case, the fix was using the latest version, setting a few flags, and writing the queries slightly differently than I normally do (BINARY flag, etc.): something that I contacted the developers about, and for which I got (as of then) undocumented, cutting-edge answers.

    --
    Evan
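
The testbed idea above is not tied to PHP. A minimal sketch in Python 3 follows (an assumption: the original testbed was PHP and its code is not shown in the post). The helper name `run_testbed` and the selection of operations are hypothetical; the point is only to push one sample string through every operation the site might use and to record what comes back, so that a native reader can check the results.

```python
# Hypothetical testbed: run one sample string through a set of string
# operations and record either the result or the exception raised.
SAMPLE = "日本語のテスト abc"

# Arbitrary selection of operations; extend with whatever the site uses.
OPERATIONS = {
    "upper": str.upper,
    "title": str.title,
    "capitalize": str.capitalize,
    "strip": str.strip,
    "reverse": lambda s: s[::-1],
    "length": len,
    "utf8_round_trip": lambda s: s.encode("utf-8").decode("utf-8"),
    "sjis_round_trip": lambda s: s.encode("shift_jis").decode("shift_jis"),
}


def run_testbed(sample, operations):
    """Return {name: (ok, value_or_error)} for every operation."""
    report = {}
    for name, func in operations.items():
        try:
            report[name] = (True, func(sample))
        except Exception as exc:  # record the failure instead of stopping
            report[name] = (False, repr(exc))
    return report


if __name__ == "__main__":
    for name, (ok, value) in run_testbed(SAMPLE, OPERATIONS).items():
        print(f"{name:16} {'ok' if ok else 'FAILED'}  {value!r}")
```

A report like this only shows what each operation returned; as in the post, someone who reads the language still has to judge whether the output is right.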

  • Postgres handles multibyte

    If the locale of your machine is set correctly you should have no problems with JDBC and JSP
  • I've written several dynamic web sites using mySQL, JSP and servlets and this is what I did:
    Stored the database internally in UTF-8 format and accessed it using unicode from the servlets.
    Stored and served the HTML and JSP files in Shift-JIS format. As long as you tell the servlet that a JSP file is in shift-jis, any characters you write to the document as it's being processed will be converted automatically to shift-jis.
    Because HTML forms are stupid and don't support (by default) character sets, you have to assume that the form will be sent back in the same character set the file was sent in. In Java, this meant telling the form handler to interpret the bytes as Shift-JIS, which automatically converted them back into Unicode. That let me handle the form data in a uniform manner (1 character is 1 character, even if its UTF-8 representation is more than one byte) and easily store it in the database in UTF-8.
    As far as I knew, Python supported unicode strings, which would allow this type of handling. I recommend storing your database in some sort of native unicode format, which should make this fairly seamless and should cut down on conversion costs, since UTF-8 to unicode is a MUCH faster conversion than Shift-JIS to unicode. So storing like this cuts down one step, assuming you wish to handle all form data as unicode.
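
The flow described above (decode the form bytes using the page's charset, work in Unicode, store UTF-8, re-encode on output) can be sketched in Python 3. This is an illustration under assumptions, not the poster's Java code: `handle_form_value` and `render_for_page` are hypothetical names, and the database write is only simulated by returning the UTF-8 bytes.

```python
PAGE_CHARSET = "shift_jis"   # charset the HTML pages are served in
STORAGE_CHARSET = "utf-8"    # charset used inside the database


def handle_form_value(raw):
    """Decode a submitted form value and prepare it for storage.

    The browser is assumed to send the form back in the charset the
    page was served in, so the raw bytes are decoded as Shift-JIS.
    Returns the Unicode text and the UTF-8 bytes to store.
    """
    text = raw.decode(PAGE_CHARSET)        # bytes -> Unicode text
    stored = text.encode(STORAGE_CHARSET)  # Unicode text -> UTF-8 bytes
    return text, stored


def render_for_page(stored):
    """Turn stored UTF-8 bytes back into bytes in the page charset."""
    return stored.decode(STORAGE_CHARSET).encode(PAGE_CHARSET)


if __name__ == "__main__":
    submitted = "東京".encode("shift_jis")  # what the browser would send
    text, stored = handle_form_value(submitted)
    print(len(submitted), len(text), len(stored))
    print(render_for_page(stored) == submitted)
```

Inside the handler the value is ordinary text, so one character is one character regardless of how many bytes it takes in either encoding.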
  • We had a few small troubles doing similar things to what you're attempting... not in Japanese, but still in Unicode.

    MSSQL2000 seems to do just about everything in Unicode, and Python (2.02 then) did choke on a few things.

    We basically solved the problem by casting and coercing fields as they came out, starting things out with a

    tempStr = u""

    and then appending database fields to that. It didn't seem to like adding unicode data to a plain string, but initializing the string as unicode helped that.

    We did notice that it wasn't a good idea to try to set Python variable names with Unicode strings, but simple str(field) stuff worked in our case, because the fields coming out of the DB were basically plain English.

    We also noticed that Python doesn't seem to like 'exec()' on unicode strings, so we re-wrote the code to avoid exec() (should have done that anyway).

    Perhaps with more expertise with the various Unicode encodings one could really get everything working transparently, but in the meantime, the above explicit band-aid translation worked fine.

    Good luck!
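
For readers on Python 3 (an assumption; the post above describes Python 2.0-era behaviour), the same explicit approach still applies, except that mixing byte strings and text raises a TypeError instead of attempting an implicit conversion. A minimal sketch, where `DB_ENCODING` and `join_fields` are hypothetical names and the fields are assumed to arrive from the driver as bytes (many drivers already return text):

```python
DB_ENCODING = "utf-8"  # assumption: encoding of byte fields from the driver


def join_fields(fields, sep=" "):
    """Decode any byte fields explicitly, then join everything as text."""
    parts = []
    for field in fields:
        if isinstance(field, bytes):
            field = field.decode(DB_ENCODING)  # explicit bytes -> text
        parts.append(str(field))               # numbers etc. become text
    return sep.join(parts)


if __name__ == "__main__":
    row = ["日本".encode("utf-8"), "abc", 42]
    print(join_fields(row))
```

Decoding once, at the point where data leaves the database layer, keeps the rest of the program working with a single string type.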

  • There are two approaches for this:
    1. use Unicode, as you've been trying
    2. use Shift-JIS (like most Japanese websites)
    The trick to handling Japanese data-driven content is to remove all the 'text' filters you may have in place for the content administration or data types in the system. Unfortunately, when you tell it 'text', it assumes the single-byte limitations (which is fine for ASCII). But Japanese characters take two bytes in Shift-JIS, and as long as you pass the data along as raw data, everything should be okay.

    I'm sorry if that's restating the obvious-- but it's a point many people overlook.

    yoroshiku,

    Dave
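
The byte-versus-character distinction behind this advice can be illustrated in Python 3 (an assumption; `truncate_bytes_naive` and `truncate_chars` are hypothetical helpers, not part of any library). A filter that counts or cuts bytes as if each one were a character can split a two-byte Shift-JIS character in half:

```python
text = "日本語"
raw = text.encode("shift_jis")  # three characters, two bytes each


def truncate_bytes_naive(data, limit):
    """Cut at a byte offset, as a single-byte 'text' filter would."""
    return data[:limit]


def truncate_chars(data, limit, encoding="shift_jis"):
    """Decode first, cut by characters, then encode again."""
    return data.decode(encoding)[:limit].encode(encoding)


if __name__ == "__main__":
    print(len(text), len(raw))
    safe = truncate_chars(raw, 2)
    print(safe.decode("shift_jis"))
    broken = truncate_bytes_naive(raw, 3)  # ends in the middle of a character
    try:
        broken.decode("shift_jis")
    except UnicodeDecodeError as exc:
        print("naive cut is no longer valid Shift-JIS:", exc)
```

Passing the bytes through untouched, as the post recommends, avoids the problem; any step that must cut or measure the text should decode it first.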

  • First of all, there have been a few posts saying to use Shift_JIS. What they don't say is how to get the passed parameters into Shift_JIS. For this, Java (JSP/servlets) is a real blessing.

    When Java processes character codes, it does so in Unicode. However, the client browser may be sending data in Shift_JIS (Windows clients), EUC_JP (most UNIX clients), or JIS (???). In order to process that, you have to first convert the code to some common denominator - and Java uses Unicode for that.

    Because I do this so often, I have a library method that I often refer to for this sort of thing. See Java Utility Library Initiative [slashdot.org]'s (JULI) StringUtil.decode(String string, String encoding) for details. Pass it "JISAutoDetect" and it'll figure out which encoding to use for decoding.

    I'm sure that the other languages (Perl, etc.) have similar functionality. But this is a must for receiving data from a client.

    Once you have your string in Unicode, you say that you're using PostgreSQL? I'd recommend sticking with Unicode for it, but if you want to use a native encoding, install the Japanese patch (/usr/ports/japanese/postgresql7 on FreeBSD), and you're set to use EUC_JP - NOT Shift_JIS. The last I checked, PostgreSQL didn't support Shift_JIS as a native encoding.

    Finally, when serving pages to the Internet, iso-2022-jp (JIS) is still the standard. However, from my understanding, i-mode et al. want Shift_JIS. I don't know if they convert internally or not. (I refuse to be on call 7/24.)
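
For the Python side of the original question, the "convert everything to Unicode first" step can be approximated as below. This is not Java's "JISAutoDetect" and makes no claim to match its behaviour: it swaps in plain trial decoding, in Python 3, using the standard codecs `iso2022_jp`, `euc_jp` and `shift_jis`. The function name `decode_japanese` and the candidate order are assumptions made for this sketch.

```python
# Candidate encodings, tried in order. ISO-2022-JP is 7-bit with escape
# sequences, so it is tried first; EUC-JP and Shift_JIS follow.
CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")


def decode_japanese(raw, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes raw.

    This is a heuristic: some byte sequences are valid in more than one
    encoding, and pure ASCII input always reports the first candidate.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


if __name__ == "__main__":
    for source in CANDIDATES:
        text, guessed = decode_japanese("日本語".encode(source))
        print(source, "->", guessed, text)
```

When the page charset is known, decoding with that charset directly is more reliable than guessing; trial decoding is best kept as a fallback for input whose origin is unclear.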

  • The link to JULI above should be:

    http://sourceforge.net/projects/juli [sourceforge.net]

    I left off the "http://". Gomen.

  • by Kingfox ( 149377 ) on Monday June 11, 2001 @01:45PM (#161240) Homepage Journal
    You could always try asking the same question on Slashdot Japan [slashdot.ne.jp]. You might get more of an answer. I think the /. readers there read a bit more Japanese.
  • works nicely in my experience. You could also use the Shift-JIS encoding if you plan to store only Japanese (and English), but I use UTF-8 to store multilingual data in MySQL.
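
The trade-off between Shift-JIS and UTF-8 for stored data can be checked directly. A small sketch in Python 3 (an assumption; `can_encode` is a hypothetical helper) that tests which sample strings each encoding can represent:

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


SAMPLES = {
    "Japanese": "日本語",
    "English": "hello",
    "Korean": "한국어",
}

if __name__ == "__main__":
    for language, sample in SAMPLES.items():
        print(language,
              "utf-8:", can_encode(sample, "utf-8"),
              "shift_jis:", can_encode(sample, "shift_jis"))
```

Shift-JIS covers Japanese and ASCII text but has no Korean hangul, which is why UTF-8 is the safer storage choice once a site goes beyond Japanese and English.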
