The Internet

On Building High Volume Dynamic Web Sites

kolestrol asks: "A while back I built a Web site using MySQL and Java servlets to track Kosova refugees. That experience taught me a lot; I had severely underestimated the job. I was wondering if anyone has had similar experiences, i.e. maintaining highly data-driven, interactive Web sites with high volume, and how they have managed to handle the load. Furthermore, how have they managed to handle content (site redesigns, etc.)? The reason I ask is that ever since the above-mentioned project, I have been doing a lot more research, trying to find a free Linux solution. The only thing I found was the Linux Virtual Server Project." I don't know how the larger Web sites do it, but I assume they evolved in stages to add their current features. What kind of design decisions are made when designing such sites?

"Apart from this I have been talking to commercial vendors like BEA (I was very impressed) who provided application servers with load-balancing, replication, etc., starting at $20,000 (Australian) -- they run sites like Amazon.com, Qwest, Wells-Fargo etc.

There is an issue here (is there? I don't have enough experience to really know, hence I am asking you)... I can build a custom solution with load balancing written at the application level. But how does this affect maintainability (for example, Amazon.com moving from just books to all sorts of other stuff... how long did it take to redesign the site, etc.)?

The site I first built could potentially hold information about a million refugees, and allowed searching on most fields regarding information about a person (wildcard queries). Unfortunately, on doing some stress testing (with around 700,000 records) I found that at most 15 hits could be handled every ten seconds. I optimized the code, switched JDBC drivers to a faster driver, wrote a simple load balancer (and I mean very simple), and limited searching to a few fields as well as preventing bad wildcard queries (e.g., a wildcard at the start would make little if any use of the index). Consequently, I managed to get the system to handle slightly more load (200 hits per 5 seconds). (Hardware was a dual Pentium II 450MHz, I think, with 512MB RAM and 2x8GB Ultra-Wide SCSI hard drives, running Linux of course.) BTW, the Kosova refugees article has a lot of misinformation in it (e.g., encrypted databases), and the time to actually build the site was one week (plus two weeks of overcoming red tape, etc.)."
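
For example, on an indexed name column the two kinds of wildcard behave very differently (the table and column names below are only illustrative):

    -- can use the index: the fixed prefix anchors the lookup
    select id from refugee where surname like 'Berisha%';

    -- cannot use the index: every row must be examined
    select id from refugee where surname like '%isha';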

  • by Anonymous Coward
    That's what philg [photo.net] says anyway. And he should know.
  • by Anonymous Coward
    WebObjects is the coolest thing around for this. Yes, it's quite expensive, but for the type of thing it was made to do, the cost is usually much less than what one spends on hardware and support. Of course, some of us little folks would like the advantages of WO for smaller projects, so of course there is the GNUstep Web Project. It's quite far along, does a lot, but still has some work to do. It's the more difficult solution as there's little documentation and you probably have to be familiar with WO to do anything with it, but it's free and open.

    So, why GSWeb or WebObjects?

    - It's a framework, not an application server. In other words, you don't build applications, you use the WO framework to build application servers. Each app handles its own memory, threading, resources, etc. for incredible performance and scalability. This also makes it extremely flexible, as it's a strong OO environment and almost all of the functionality of the app server can be overridden in your application.

    - EOF (Enterprise Objects Framework). If you've never used EOF (or GNUstep DB) you don't know what database development could be like.

    - It's a standard environment. By this I mean the programming is more like a standard GUI app. You don't have a form that posts data to the same page or another page. Instead, you have a form with an action property that is bound to a method so that when an action is performed (click submit) the method is called. You as a developer don't have to worry about form names and how to do validation, your method can do whatever it needs and return an arbitrary page without messiness like forwarding or including other pages.

    There's lots more. Basically, when logic very much outweighs content, WO is hard to beat. On the downside, it is a completely different way to think about web development and there is a learning curve. For OpenStep developers, this will be small, but for programming novices, it could be quite steep. But how productive you can be with a tool is definitely a factor that needs to be considered.

    Check out GNUstep Web at www.gnustepweb.org [gnustepweb.org]. I have the source for a very simple GSWeb application up at http://zeus1.tzo.com/GuestBook/ [tzo.com]; this is basically a GSWeb conversion of Apple's first WebObjects tutorial app (yes, it's entirely useless, and it would be stupid to use WO for a guestbook :)

  • by Anonymous Coward
    I would like to caution people against Philip Greenspun's system, especially if you are optimizing for open source. I have taken his class at MIT and used his system. I cannot say that I have used that many other systems (especially not stress testing them under 10-20 hps), but if Apache and PHP or mod_perl can do the job for you, I wouldn't touch AOLserver, ACS and Tcl.

    The reason is primarily that the system has not been embraced by the community. There aren't people outside of ArsDigita hacking on it, and the people inside are only hacking on it in ways that are good for ArsDigita (as of now, there is no development team for the toolkit, site development teams modify the toolkit, and sometimes their mods get folded into the distribution).

    Philip will try to sell you on the "proven" AOLserver (uh, almost no one uses it in the real world), which was a very sketchy open source project the last time I looked at it. His system also backends to Oracle, which is not going free anytime soon. Finally, he will try to sell you the ACS as the greatest thing since sliced bread. It was, in 1995.

    So read his book, which is pretty good (you could even buy it; it's the only computer book I would put on my coffee table, lots of pretty pictures). Just ignore all the technical parts.

    PS: It will also force you to do development in TCL, which sucks ass.

  • by Anonymous Coward
    The continued popularity of MySQL amongst the open-source community continues to amaze me. MySQL is not open source. Go read their license [mysql.org]. It's not even free as in beer. Almost incidentally, it's not a database either. There is no transaction support, which means that if your box goes down the database is likely to be in an inconsistent state, and there is no easy way to fix it. The stunning thing is that there is a real database that is open source as well: PostgreSQL [postgresql.org].
  • by Anonymous Coward
    Take a look at
    http://www.eddieware.org/
  • I have developed websites for the last few years. During that course of time, I used Php, mod_perl (embperl and mason), zope, ACS with AOLServer. Here are my impressions of them:
    1. Php [php.net] Pros:
      1. Easy to learn. Once Apache is configured with Php, it is seductively easy to write code.
      2. The Db connection pooling comes in handy.
      Cons:
      1. It is yet another language (I mean, I already know perl, right? Why do I have to learn another language?).
      2. Also, you tend to design page by page. It does not have a great library system to use; it has lots of code snippets you can copy.
      Notable sites: Sourceforge.net, and persiankitty.com (reputed to be the Yahoo of porn)
    2. mod_perl [modperl.com]: This beast can be used in several ways. If you use it with Apache::Registry, you are almost doing a disservice to yourself. The three ways of using it are:
      1. embperl [apache.org]: Looks like the Php style of embedding. Comes preconfigured with DBI and friends. But too cumbersome to program in. It also does not encourage component programming. It provides the substrate for you to build other features you might like.
      2. mason [masonhq.com]: Has cool component features. Has neat features such as caching with an interesting way of managing it, and autohandlers. Looks ideal for the publishing world, where it evolved. On the cons side, not too many components.
      3. Apache::ASP: Not used it much.
      Notable sites: dejanews uses embperl. techweb, stamps.com use mason.
    3. Zope [zope.org]: Too different from the others. Its strongest suit is the built-in authorization system and web-based management system. You can program at the embedded-HTML level, called DTML, or you can write Python classes. There are several prebuilt modules that run out of the box. As long as the prebuilt modules serve you well, you can go ahead and use Zope. Otherwise, be prepared for a long learning curve. It almost looks like black magic to me.

      As an aside, python has the weirdest variable scoping and declaration rules. I ought to know, I have a PhD in programming languages.

    4. ACS [arsdigita.com]: ACS is entirely different from the others. It is not a language like Php, it is not a webserver like Apache, it is not an app server like Zope. What it is, is one man's vision and experience about community web sites distilled into a toolkit. It is a data model, a set of practices, and a set of programs. It happens to use Tcl, Oracle, and Unix. But that is almost beside the point.

      The biggest selling points of ACS for me are:

      1. The documentation. Quite possibly the best documented system. I can take an average joe off the street and train him to use ACS in a systematic way in no time.
      2. The data model. I would expect to pay hundreds of thousands of dollars for such a data model. It constantly amazes me to find the little details that I needed already in that data model.
      3. Best practices to run a website. How to harden a Unix system and set up the services so that you can sleep peacefully.
      The negative points are:
      1. Alas, it requires Oracle. I just learned of the existence of ACS/pg and I am rejoicing!
      2. Not too widespread in usage. I expect this situation will change. Look at it this way: I tried building a community web site. It took me untold hours to gather all the snippets of information for a Php-based site. It took me no time at all to build it using ACS. I would use ACS for a community-based website over any other toolkit any day.
      3. No good enough template mechanism. Fortunately, ADP in AOLserver is changing that.
      So, in conclusion, Php and mod_perl and Zope vs. ACS is not the right question. You can implement ACS in Php, and I frankly hope somebody does, to save a lot of human misery and suffering.

      Rama Kanneganti

  • Oops! Typo here: What it is one man' use Php,s vision and experience about community web sites distilled into a toolkit.



    Somehow my emacs inserted that "use Php" in the previous sentence.


    Thanks.

  • Yeah, but templates aren't part of JServ.

    Recommending JServ as an alternative solution to Dynamo is like (strained analogy) giving sheet metal and welding tools to kids vs. giving them a climbing frame.

    Caching data in a hashmap is a good first pass solution, but then you have to deal with threading issues, which brings up the problem of State. Writing an LRU thread-safe cache is perfectly possible, but a total waste of time if you're trying to write a web application.

    If you must go for a free application server, choose Zope or The Locomotive. I haven't tried them out, but they seem to have the (minimal) functionality required...
  • Since I'm an ATG employee, I'm going to restrict my comments to the strictly personal non-Dynamo related side.

    I've written server-side Java using Apache Jserv before I ever used Dynamo. Server side Java alone will screw you up horribly, because you can't easily generate dynamic HTML pages, and unless you write your own templating feature you're stuck mixing HTML code with Java code.

    This will stymie the designers completely, because the only way they can make changes to the page is to ask the programmers to do it for them. The programmers are stuck doing both design and programming. Forget about making easy progressive changes to the code -- if you have one page where you want to change every third element, you'll have to trawl through mounds of printlns.

    Not to mention that there's no personalization. There's no database cache so you go to the database every time you need data (which can be horribly expensive). There's no connection pooling. If you want to grab stuff from the database, you have to write dynamic SQL to grab elements... I can't describe how tedious this can get.

    If you want to write everything yourself or don't have the money, use Jserv. It's a good Java server, and it does everything it's supposed to. Whether it will do everything you need is another matter.

  • Your post is very informative, but there is one thing missing. You said you're using multiple Apache boxes which talk to the MySQL box. But how do you distribute the load among the Apache boxes? The incoming traffic needs to be somehow directed towards one of the Apache servers. Also, what if one of them dies? Will it kill the whole cluster or will the other boxes take over?

  • Before I forget.

    Another important feature is that these techniques are server side, so they are independent of the browser used. You can also use PHP with Oracle 8i, but most of the really cool stuff is centered around MySQL. Plus, why spend money on software if you don't have to? Spend the money saved on paying for support, or donate it to the open source projects (PHP and MySQL). Yet another advantage is that the application can be developed and tested on Windows (read Configuring Windows98 For Local Dev [phpbuilder.com]) and then uploaded to the Apache server. A good free program with PHP syntax highlighting and tons of other goodies for developing is HTML-Kit [chami.com].

  • PHP is a recursive acronym. It stands for PHP: Hypertext Preprocessor.


  • Sorry, but this is nonsense.

    Last summer I worked for a while at Planet Online in the UK. (They're big Linux supporters and host linux.org.uk). The focus of their activities is Freeserve, the largest ISP in Europe with two million users. It's specified for ten million.

    The entire system runs on Linux, with NetApp filers doing the actual disk storage.

    The user database runs in MySQL. Every mail sent or received, every user homepage hit, every DNS hit, involves a query against the back-end database. That's for two million users - almost a hundred thousand of whom are logged in at any one time.

    The amazing fact is that it isn't slow, it isn't unreliable, and it isn't reaching its limits. Why? Engineering.

    The people at Planet simply have an in-depth understanding of the hardware and software that makes up their platform. They've put in fast kit and lots of memory, and they've tuned the code (long live open source) to make use of it. It flies.

  • 20% as fast? No way. 20% slower maybe. But either way all you're saying is that your proxy server isn't as good at talking HTTP as your FastCGI server is at talking its own protocol. Naked HTTP communication doesn't add much to the content except a few headers, so this should be fixable just by improving the HTTP proxy code to be as efficient as the FastCGI networking code.
  • FastCGI is a good technology, but the problems you describe with mod_perl have a pretty simple solution. You just use mod_perl as an application server and put a reverse proxy (like Squid or Apache + mod_proxy) in front of it to serve static requests.
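
    The usual shape of that setup (just a sketch -- the path and back-end port here are invented, adjust to your own layout) is a lightweight front-end Apache that serves images and static HTML itself and only hands the dynamic URLs to the fat mod_perl server behind it:

    # front-end httpd.conf
    ProxyPass        /perl/ http://127.0.0.1:8080/perl/
    ProxyPassReverse /perl/ http://127.0.0.1:8080/perl/
    # everything outside /perl/ (images, static pages) is served directly,
    # so the big mod_perl children never sit around feeding bytes to slow clients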
  • Don't fall into the application server trap. Think about what they're telling you. If you load-balance your Java servlet runners, you will still have to come up with a separate solution for load-balancing your web servers in front of them.

    Ultimately the proof is in the pudding. None of the big sites use commercial application servers for anything they really care about. I know people who work at Yahoo, Amazon, and others, and they use home grown stuff, often with Apache, and often in C or Perl. You can make a fast system with any number of open source tools like PHP, mod_perl, Resin, AOLServer, etc.

    Your search speed issues can be fixed if you put MySQL on a box with enough RAM to keep most of your data in memory and use proper indexing. For full text searching, you want to build an inverted word index of your documents. There are plenty of examples in Perl that you can steal from if you need help. Try looking up the module Search::InvertedIndex on CPAN, or search the Dr. Dobb's Journal archives for an article about this subject.

    Now, you need to do some load-balancing. For a good overview and some possible options, read this:
    http://www.engelschall.com/pw/wt/loadbalance/

    Then check out some of the links to IP-level load-balancers at http://perl.apache.org/guide/download.html#High_Availability_Linux_Project.

    There are probably some good FreeBSD-based projects as well, since products like big/ip and Coyote Point's Equalizer are based on it.
  • If separating content from presentation is what you're after, take a look at the Perl module "Template Toolkit" on CPAN, or look at http://www.freemarker.org/ or http://www.webmacro.org/ for Java.
  • I'm not the previous Anonymous Coward.

    However, I have to look at it the same way -- just like I wouldn't judge MySQL by one site, I won't judge the ACS by one site. Nor will I judge AOLserver by one site.

    However, when AOL is pumping 28 thousand hits per second out to the net through AOLserver, I _do_ have to sit up and take notice.

    Oh, I've run AOLserver for nearly three years on my Linux box -- and it runs smooth as silk. Very easy to develop dynamic pages in, very scalable, and wicked fast.

    --
    Lamar Owen
    WGCR Internet Radio
    1 Peter 4:11
  • On the other hand, the pages served by /. are quite large by common web standards... lots of query results/text per page, especially for Nested or Flat mode on a story with 400 comments. Definitely not your average web "hit".

  • This sounds bogus to me. Slashdot runs on MySQL... Slashdot get a LOT of traffic, and those hits involve a LOT of database access.

  • 140,000 pages per day is a pretty small load. A very small load for a custom C webserver. You could handle 140,000 pages a day easily with ASP and VBScript, for that matter.
  • I read an article recently stating that VA Linux will be working with TCX to add "real" (non-"hack") replication to MySQL.

  • No need to write your own template engine, there are several out there to choose from.

    Regarding non-DB caching, that is an excellent point. However, it's not really that hard of a problem to solve, either. It's nice to have it solved for you (Dynamo, etc.), but as long as you have everything running in the same JVM, you can cache data in a HashMap (keyed by user ID or whatever); just make sure to watch threading issues.

  • I've used Dynamo extensively for the last 1.5 years, and I can recommend against it with absolute assurance. It forces you into a slow and unproductive development cycle, it's byzantinely complex, and it's not very reliable. As many of the posts have suggested, you're really going to need to do the basics:

    - Improve your db
    - Cache the living heck out of everything
    - Buy more memory
    - Buy more machines
    - Etc.

    And guess what, Dynamo doesn't help you do any of those.

    Marketing != Truth

    If you want to write server-side Java (not a bad choice), go with Apache's JServ for absolute sure. It is delightful.

  • OH yeah, make sure you pool your database connections. A lot of time is spent creating and tearing 'em down.
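
    Under mod_perl, one way to get this (a sketch -- the DSN and credentials below are invented) is Apache::DBI, which caches the connection per child process so repeated DBI->connect calls hand back the same live handle:

    # in startup.pl, pulled in via PerlRequire in httpd.conf
    use Apache::DBI;    # must be loaded before anything else calls DBI->connect
    use DBI;

    # later, inside your handlers -- this now reuses a cached connection
    my $dbh = DBI->connect("DBI:mysql:database=refugees;host=localhost",
                           "webuser", "secret", { RaiseError => 1 });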

    --hunter
  • Check out Enhydra - it's an open source application server that has been in development since about '95 (http://www.enhydra.org) and it runs on Linux. It's also tested and used by FedEx, Huffy Sports and others, so you know it must be pretty good.
  • There's an application server that was developed in-house by what used to be Small World in New York, now iXL's New York office.

    It was originally just called "the Filter," but on its 3rd rewrite in 1998 was rechristened "Velma" (yes, from Scooby Doo). Info on it is available at http://velma.ixl.com/ [ixl.com].

    The first version was Perl-with-CGI, the second C/C++-with-CGI. The third, Velma, is C/C++ with a persistent connection, like most professional app servers.

    Though it was originally written for Solaris, I know of at least one port to Linux.

    The idea when iXL decided not to support it was to release it Open Source. I don't know if that happened officially, but certainly unofficially it's available. The developers who made it have all left the company, so there's probably no one officially "in charge" of it or of dealing with other open-source contributors, but if someone offered to take it off their hands, I'm sure they'd listen.

    It really is good technology, I wrote custom dynamic web sites in it for almost a year.

    Please email me (take out the S_P_A_M, sorry) for more info, I think I could hook you up with the right people.
  • ColdFusion Server, and Sybase...
  • It's worth having a look at what your Java code is doing - and select your implementation of it with care. http://www.volano.com/report.html will give you a good comparison of a few of the brands out there. I was working on a project using a Java front end to an Informix DB at my last job, and one problem that I vaguely recollect was something like all reads on hashtables being synchronized, i.e. only one thread could read from one at a time. Those sorts of things can produce severe bottlenecks. I think IBM provides some fairly good runtime analysis tools with their Jikes compiler for free; have a look around their site: http://www10.software.ibm.com/developer/opensource/jikes/project/index.html
  • I managed a reasonably big site (16 CGI hits/sec) using mod_perl, PHP, MySQL, and LOTS OF RAM. Your single best optimization, as listed above, is LOTS OF RAM to cache with. mod_perl has some other great tricks - if you're using templates, put a BEGIN{} block at the top of a module that is a PerlRequire in httpd.conf, and assign templates or other file-IO portions of scripts to globals. They stay in RAM then (but you bought LOTS OF RAM, right?).
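
    Something along these lines (a rough sketch; the package name, paths and template names are invented):

    # My/Templates.pm -- loaded at server startup, e.g. via
    #   PerlRequire /web/modules/My/Templates.pm
    # in httpd.conf
    package My::Templates;
    use strict;
    use vars qw(%TEMPLATE);

    BEGIN {
        # slurp each template exactly once, in the parent httpd, so every
        # child process shares the copy already sitting in RAM
        for my $name (qw(header footer story)) {
            open(TMPL, "</web/templates/$name.tmpl")
                or die "can't read $name.tmpl: $!";
            local $/;                   # slurp mode
            $TEMPLATE{$name} = <TMPL>;
            close(TMPL);
        }
    }

    1;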

    Put MySQL's data on:
    a) its own partition, or better yet its own DISK
    b) its own SCSI controller

    If MySQL is your bottleneck, run Oracle. Make lots of index tables. Run benchmarks on your queries. With MySQL, avoid table joins if you can; it's much faster without them. Optimize the Sh#t out of your tables for query speed.

    don't run heavy cpu junk like log analysis on the box that needs to serve dynamic content

    use squid cache or another cache, or even plain old ramdisks, to hold your static stuff; remember that I/O is a huge bottleneck. Try to put everything in RAM.

    don't run 4 quake3 servers and one unreal tournament server on the box when you are anticipating heavy load :)

    Cache anything you can (did I mention that?). Take Slashdot pages for example - every time someone posts a comment, you should take a dump of the dynamic page to flat HTML. When the next person requests the page, give 'em the dump if there are no new comments (saves hitting MySQL every time!). Of course, you cached that HTML page in your LOTS OF RAM, right?
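
    In outline (the names here are made up, and render_story() stands in for whatever expensive code builds your page):

    # called only when something actually changes, e.g. a new comment arrives
    sub rebuild_cache {
        my ($story_id) = @_;
        my $html = render_story($story_id);    # the slow, DB-hitting part
        open(CACHE, ">/web/cache/story-$story_id.html")
            or die "can't write cache file: $!";
        print CACHE $html;
        close(CACHE);
    }

    # An ordinary GET with no new comments just gets handed
    # /web/cache/story-$story_id.html straight off disk (or your ramdisk)
    # and never touches MySQL at all.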

    Imagine you need to serve a lot of file download requests. Apache has a built-in MaxClients limit of 250, but you can modify that in the source. A dual P3/600 + 1GB RAM can easily saturate a T3 with static content..

    More stuff.. don't open lots of filehandles if you don't have to. Optimize out any calls that open a new shell (don't use $var = `pwd`; in Perl, for example - use built-in functions, such as cwd() from the Cwd module, that don't require a new shell). Modify your Linux kernel to allow more open file descriptors and max user processes. Nuke any unwanted ulimit directives in your start-up scripts.

    Remove daemons you don't need on the box. Don't run anything you don't absolutely need running.

    Run more than one instance of Apache- one compiled with mod_perl or mod_php, another just flat. This saves some of your LOTS OF RAM by using the cache only in the daemons that need it. You can even combine multiple daemons per ip/domain, but use squid to make it look like one.

    Did I mention to get LOTS OF RAM?
  • No, certificates are name-based, not IP-based, so no problem. (You may need wildcard certificates if you plan on doing www1 ... www99 'manual/explicit' balancing.) [see, someone answered ;-)]
  • It's been said before, and I'll say it again. Go to Philip Greenspun's web site [greenspun.com] and read the book, check out the code, download the freeware. This guy and his crew understand high volume db backed sites like nobody else.
  • Amazon has many disparate systems serving many needs for its business. Do not assume that transaction middleware was found necessary for running the frontend "retail" website (all the /exec/obidos links, which cover every new product line Amazon has ever entered). There is much business to be done behind the scenes to move all those books, there are other ("non-retail") sections of the site, and there's more than one way to skin a cat.
  • This is one of the few good comments on this article...
    First design, second design, third design, and then you can start your implementation...
    A bad design (or no design at all) is the best way to generate real headaches for you...

    I am quite stunned that all the comments mention really specialized stuff (PCGI, mod_perl) but do not seem to care about the design of the thing:

    You can quite obviously tell this person to move to a three-tier architecture, for example...
    Even before saying the least thing about the technology used... My .02 swiss francs ;-)
  • I'm in the process of comparing Apple's WebObjects to Microsoft's ASP for a somewhat similar project. I'd be very interested to hear from anyone who has info on the WO/ASP comparison, but here's what I have found so far.

    ASP has the better market share, so you'll be able to find more resources (consultants, books, courses). WO has the better tools. It's a Betamax-VHS problem.

    ASP is kind of a markup language that you mix with HTML, then pump through a special Web server (typically MS's IIS, but you can buy add-ons to make other Web servers work with it). The special Web server pre-processes the ASP directives, fetching data from databases if necessary and calling external components to do your work. Then the result gets rendered in plain HTML and sent out to the client (Web browser). You need additional products to handle load balancing and failover.

    WO is an application server, meaning you write and compile an application, which the users interact with via their Web browsers. It automatically handles load balancing, fail over, and all kinds of good stuff. The developer tools are very complete, and they include a simple Web page editor, a data modeling package, and lots of nice features. WO is pretty expensive, and they charge per CPU that your application is deployed on. You can use Mac OS X Server (BSD Unix based) or Windows NT for deployment.

    Both products connect to a range of data sources via ODBC. ASP requires you to embed the SQL in your .asp pages. WO has an object modeler. You set up objects that know about your database's structure, and the SQL is generated automatically when you access those objects.

    Your choice (ASP, WO, or other) depends on your exact application. I'm not recommending either product, just sharing what (I think) I know.
  • >How much research did you do exactly? :)

    Plenty. :)

    >There are tons of big name sites that use WebObjects.

    True, but there are many more that use ASP. At the bookstore, there are dozens of ASP books, and exactly zero WebObjects books. These are the perceptions that my PHB is using to resist WO and bring in ASP.

    If anyone has suggestions as to how I can change his mind, I'd love to hear them.

  • >> This is only possible if the tables are incredibly denormalized.

    Well, they are quite denormalised, yes!

    >> Are you perhaps saying that you only expose some specially created denormalized tables for reporting purposes?

    No, the whole app architecture really is built around SQL queries with no joins. Doing this makes our file-caching of db queries more efficient - you end up caching a smaller number of files, as opposed to the many more files which would essentially contain the same data as one another if you were caching the results of queries involving joins. (That doesn't read very well - I hope you understand what I'm trying to say!)

    At the app level, we do do things that are effectively functionally identical to an SQL join. We do all of our db access through a layer of code that makes the caching invisible, and this layer also deals with the 'joins'.

    HTH,
    Jim

  • Good comments. We also have a highly (98%) dynamic site running Apache+PHP+MySQL (plus some big Sun boxes in there... somewhere... :-) Anyhow, our site does millions of hits a day (more than a couple) and it takes machines, and architecture, and money. We spend a lot of money on hosting (Exodus, dual gigabit uplinks) and have approx. 100 servers. We've spent hundreds of hours designing our systems to scale big, and be distributed, and run on RICs (ridiculously inexpensive computers). And it does, well! But this is a serious proposition. Get yourself some talent--and don't always believe the guys spouting off "N-tier", "app server", "Java" as the only solution... there are many solutions, go discover what's right for YOUR site--maybe it's a 3-tiered Java app server, maybe not. Hire good developers and a DBA, and remember: always build for a 10x load.
  • The text of the book, up to chapter nine, is available at http://patrick.net/wpt/ [patrick.net].
  • Hmm. I must say, having written dynamic sites in C and scripting languages, that I would rather have a bigger array of servers running scripted stuff than a smaller array of systems running compiled code. C just doesn't cut it (for me) in terms of maintainability and stability - a segmentation fault that can bring down your entire apache (sub)process vs the slight speed decrease by using a scripting language? Nah.
    Plus, I dispute your premise; do you have some stats on how many hits/day a mod_perl-driven site can take? On what hardware? How complex a script? Where are the bottlenecks?
    It seems to me that it's very likely the bottlenecks are disk I/O or database access, in which case the language used doesn't matter a damn.
  • With that design, there will be a period of time where a record has been "written" but is not readable. This will only be viable when new records are being added. In any situation where a record is being updated or deleted, something hitting one of the "read" DB servers will get the old record.

    This could get really nasty for an application that reads the DB, and then writes something based on what it just read.

  • Yes, but can you have Zope AND high volume? Zope is like a three legged dog stuck in the middle of a horse race. But it does some really cool tricks.
  • But servlets are already persistent. I do wonder if the potential speedup in going from interpreted to compiled is worth it.
  • The latest version of PHP - Version 4 [php.net] - uses the Zend [zend.com] engine. Zend has an optimizer and a caching option. I've used PHP3 and I am pretty happy with it. PHP4 is still in beta, but it's almost done.

    I hope this helps.



    john
  • ASP has the better market share, so you'll be able to find more resources (consultants, books, courses). WO has the better tools. It's a Betamax-VHS problem.

    How much research did you do exactly? :) There are tons of big name sites that use WebObjects. Some are Disney, Adobe, Toyota, BBC, MCI, Toshiba. There are many others. As for finding consultants, WebObjects allows you to write apps in Java, Objective-C (same as Mac OS X's Cocoa!), or WebScript. I'm sure you can find one or two Java developers out there. :)

    You can use Mac OS X Server (BSD Unix based) or Windows NT for deployment.

    Or Solaris or HP-UX. I believe Solaris is the most common deployment platform for the big guys.

    - Scott

    ------
    Scott Stevenson
  • Servlets do, in general, have scalability problems all their own. ;-) That being said, here's some advice:

    1) Use Weblogic or even better iPlanet. These products have much faster servlet engines than you can get elsewhere.

    2) Use squid as a proxy IN FRONT of the Weblogic back end. This will significantly improve your performance, at the cost of somewhat reducing the dynamic nature of your site.

    3) Switch from using servlets. :-) Seriously, they aren't very quick. Actually, using JSP (even though it's a servlet itself) can be quite helpful as it will automatically do a lot of caching for you.
  • This comment is actually pretty valuable. These days many sites have dynamic content when they could easily get away with a batch job run once an hour or even once a day.

    The acceleration gain is greater than any other solution mentioned.

    Also, people sometimes think 'batch' == 'old', probably because batch makes people think of MS-DOS batch files. You can actually get quite sophisticated with batch jobs - I've designed batch daemons that watch the content; when enough of it is new, they kick off the batch job, so if a major update to the underlying db occurs, the content will reflect it quickly.

    It's an option that you should consider fully. Don't just write it off as old or clumsy or not cool.
  • I am the head programmer for an e-commerce site, and I recently decided to rewrite everything, from the ground up. Using these simple guidelines, I was able to achieve speed increases on the order of 100-500 times.

    Avoid monolithic components at all costs. I mean by this: large tables in your database, large columns in your tables, and large, multi-function CGI programs. This also means that everything should do one thing, and one thing only.

    Your database design needs to be normalized, and your selects optimized. Understand indexes. If you don't understand the concepts of normalization and database optimization, buy a book. Read it. Grok it. Also, examine the algorithms that are employed by the CGIs. If the algorithms are less than optimal, rewrite them from scratch. Rule of thumb: the source of software speed is ~90% design, ~10% implementation. If you use a fundamentally crappy algorithm or database design, a better implementation won't help you.

    Another desirable aspect of a high volume web site is many servers. For example, your server's LAN should probably include: a web server for dynamic pages, a web server for static content (this can include static content masquerading as dynamic content, i.e. a caching server, or images), a secure server (if you need it), a database server, and a log server. That last one is a bit unique, and underrated. The best thing to do is log as much activity as you can; this allows you to identify problem areas. A log server doesn't have to be high powered, but it allows you to correlate activity much better. Customize your hardware to its task. RAID is your friend. Swapping is the devil. The processor isn't always the bottleneck. Keep this kind of stuff in mind, and above all, keep the UNIX philosophy in mind.

    Where I work, we have just a basic shopping cart system, but since everything is a small component, it is easy to upgrade and extend. We have something like 25 tables in our database, and none with more than 9 columns.

    Are you using MySQL? If so, a quick hint: use INSERT DELAYED instead of INSERT, and EXPLAIN SELECTs. If you have to manually join tables, that's a good thing.
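
    For example (table and column names invented):

    -- queue a non-critical write; the client returns immediately and MySQL
    -- batches the insert when the table is free
    INSERT DELAYED INTO access_log (page, hit_time) VALUES ('/cart', NOW());

    -- ask MySQL how it plans to run a query; if the output shows a full
    -- table scan instead of a key, you are missing an index
    EXPLAIN SELECT id, price FROM item WHERE sku = 'AB-1234';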

  • Freshmeat [freshmeat.net] uses PHP, MySQL and Apache. I'm sure they get a ton of hits every day, and the entire website amounts to a huge indexed database.
  • Thank you very much for the multiple MySQL server trick. I'd never thought of this, and that's a really useful hack. I'm probably going to use this soon -- in a simplified way: my current MySQL server is busy just computing stats and reports, thus slowing the whole process, while that work could easily be offloaded to another server. Coool! Thanks A LOT! You've saved me hours of coding on this one. Question: how exactly do you connect the update log to the other database? I can think of several ways.
  • A few ideas come straight to mind.. Persistent CGI (such as Apple's WebObjects) is a great idea. mod_perl on Apache can severely speed things up. Perhaps move to Oracle, as you may benefit from some large performance increases. Linux 2.3.x has khttpd for kernel-level serving of static web pages (a la AIX).

    EraseMe
  • The advice against MySQL will be an arguing point for some here, but I use Oracle every day. It can be a pain in the ass at times, but there is no substitute for the reliability. Just spend the time to set it up right the first time and you have it made.

    Spending wisely is another bit of good advice but one big server backed by redundant smaller servers can be a very good option. It has worked for me in the past. The fellow I am replying to is saying don't blow the whole wad on a couple of huge servers. This can be misunderstood. A big server can give a lot of stability if configured correctly.

  • Unfortunately, it doesn't help for the columns themselves to be indexed. Take this:

    select id from inventory where description like '%foo%';

    Indexing doesn't help with this at all since it has to search through the entire description field - 'foo' could be in the middle of the field as much as the beginning.

    The solution is to create another table consisting of all the words in the applicable fields and a pointer to the record's location. Like this:

    create table word (
    word varchar(20),
    id int);

    where id is the record ID of the record with the word in it. So a description field like this:

    HEWLETT PACKARD LASERJET 4

    would become four records, with word being 'hewlett', 'packard', 'laserjet' and '4'. Then you can build queries that search for occurrences of those words in the word table, which is - of course - indexed.
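
    For instance, a one-keyword lookup against the schema above becomes an indexed join instead of a LIKE scan:

    select distinct i.id, i.description
    from inventory i, word w
    where w.word = 'laserjet'
      and w.id = i.id;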

    I was able to decrease query time from .40 second to .01 second by making this change, and as a bonus the search results are more relevant, too, since partial-word matches are no longer considered.

    I'm sure that was what the original poster was referring to. If you do it, remember to exclude common words like 'and' or 'for'.

    Hope that helps.

    D

    ----
  • Okay, I have done a couple of sites that are serving up high volume... Here's a few tips:

    www.suckage.com -> Java servlets on Linux with MySQL. It works pretty well. On Linux, use the IBM JDK 1.1.8. It's the only JDK worth anything in terms of performance... Also, it uses native threads so SMP will help 'ya. Honestly, Java on Linux is still not as fast as on Solaris or Windows NT...

    www.fatfreeradio.com -> PHP and mySQL. This site is also running pretty well. Cool thing about PHP is it's very quick to develop. Bad thing is it's hard to stick to the MVC paradigm...

    --hunter
  • I'm one of the developers working on www.blink.com [blink.com] (we do a web-based bookmarking system), and the system's gotten large enough (thousands of concurrent sessions at any particular time, and many gigs of stuff in the database--sorry I can't be more specific) that I think I can comment on the performance concerns.

    First of all, I think java servlets are easily ready for prime time for large, complex sites, much more so than other technologies I've seen. Blink has a huge number of features (even I don't know them all :-)) and there's no way something like this would have worked in perl or php3. We're not at 100K lines of code yet, but we're certainly headed in that direction, and a less rigid language would probably have left us with an unmaintainable system at this point.

    Speed hasn't been a problem, at least on the web server and java side. Apache runs on its own machine, and that's hardly ticking over. We have to make sure we've got enough java servers behind it, but that's not a big problem. We use, for the most part, what I regard as fairly low-end PC-type boxes running Linux, not 8-CPU Xeons or E4500s or anything like that.

    The database is what kills you, especially when you're highly transactional, as we are. To be able to do things like tell you if there's new (to you) content on a page, we need to know when you last clicked on it, which means we need to update the time last visited every time you click on one of your links. So our write load makes it very difficult to use multiple servers.

    I strongly recommend getting hold of a guy who knows SQL very well when you put together your site. Doing schema changes later rather than earlier is more painful (as I know from experience) and the right schema can make you many, many times faster than the wrong one (as I also know from experience :-)). And at this point, if you're doing fairly sophisticated stuff, I'd recommend a good transactional server over something like MySQL. Surprisingly, MS SQL server is a great product, and fairly cheap compared to Oracle (though it comes along with the usual miseries of anything running on Windows).

    Delivering real applications over the web is pretty difficult. It's really, really important to think about where you keep state, and where you can avoid keeping state. This can make a noticeable difference in efficiency, too.

    In the end, there's no substitute for experience; there's a lot to learn in this field, and it's quite different from traditional applications programming. So get it wherever you can, whether it be by writing a prototype or finding someone who's done it before and describing your project over a couple of beers.

    cjs

  • The first thing to realize is that this is a lot harder in practice than it seems like it ought to be. So, don't be too frustrated: everybody is struggling with this, and if you managed to build a high-volume dynamic site by yourself, that's already a big achievement.

    The best way of dealing with complexity is to simplify and limit features to the essentials. JavaScript, graphics, fancy layout, etc. can all present a bottomless hole for time and bandwidth.

    Another thing to realize is that the relational database you are using is probably one of your biggest bottlenecks. Relational databases are slow. MySQL is actually one of the faster ones, but that's because it doesn't make a lot of the guarantees or provide a lot of the features that a "real" relational database provides. You can get a lot of performance from your relational database by tuning it, but ultimately, the architecture and functionality itself present a limit. If speed is of the essence, consider using dbm, plain files, or memory mapped files (Apache has several "databases" internally, and that's its approach.)
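
    As a toy illustration of the dbm route (the file name is invented; DB_File ships with most Perl installs):

    use DB_File;

    # a hash tied to an on-disk Berkeley DB file -- a read is a plain
    # key lookup, with none of the parsing and locking overhead of SQL
    tie my %email_for, 'DB_File', '/web/data/users.db'
        or die "can't open dbm file: $!";

    $email_for{'kolestrol'} = 'kolestrol@example.org';   # write
    print $email_for{'kolestrol'}, "\n";                 # read it back

    untie %email_for;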

    For a single person project, Java is probably not the best implementation language. It really shines for multiprogrammer projects. You may find that Python, PHP, or Ruby are better choices. Perl and Tcl are also widely used, but they are also the oldest of the scripting languages and have a lot of rough edges and clunkiness. The performance of all of them is excellent when used as server plug-ins. Perl and PHP have by far the largest libraries and toolsets.

    Some of the packages I haven't used but that look interesting are the following. Enhydra [enhydra.org] takes a much better approach to dynamic HTML than most other packages, I think. Erlang [erlang.org] / Eddie [eddieware.org] address scalability, reliability, and clustering in a really clean way (but you have to learn a new language). Zope [zope.org] I think has a lot of good ideas, but I'm not sure I'd use it for a large server.

    Most load balancing solutions use some kind of network routing hacks. That's efficient, but can be a pain to configure. However, there is now a load balancing module for Apache that works as a proxy; that's probably also worth looking into.

    The Coda [cmu.edu] file system is a free, next-generation AFS; while I'm not sure Coda is mature enough yet to be used in heavy-duty applications, systems like it help with scalability by replicating files automatically and have been used on some really large web sites.

    I would stay away from commercial "application server platforms". They often are mostly repackaged open source software, and they are expensive and complex, and while they claim to be general, they are also usually created with fairly specific commercial applications in mind ("shopping cart", etc.). I think those kinds of packages are worthwhile for corporate developers that work with large teams of programmers and need consistency, documentation, training, and support.

    So, to recap, this kind of stuff is still a lot of work, and it seems like you are actually ahead. But there are a lot of proven open source tools available that have been used on really big projects. There are also a lot of open source developments coming around that may make this kind of project easier.


  • You're saying WO is crap, but the only reason you give is the lack of developers. I'm confused. There aren't as many developers out there; almost all of the WO developers I know work on in-house projects. But your average WO consultant is going to be a lot better than your average ASP consultant. I could probably walk down the street and find an "ASP Consultant", but that doesn't make me feel good about the technology...

    Here's a shocker: WO isn't easy! If that scares you, go run to MS with their "Hey, anyone can do it!" products and increase the number of worthless MCSEs and ASP developers out there (It's so easy, anyone can do it poorly!). If you've never written WO apps before, you're damn right it's going to take work, and you actually have to learn. Imagine that. If you sit down two people with no ASP/WO experience and ask them to write an app, the ASP person will certainly succeed faster. If you sit down an experienced ASP developer and an experienced WO developer and ask them to write an app, the time difference would be minimal, and the WO developer would also have a completely separated interface to ship off to the HTML designers and a database model that can be reused in your other applications (not to mention better performance in a much easier to maintain and modify package).
  • In my experience, locking contention is usually due to inappropriate indexing and bad SQL coding. I'm not familiar with MySQL, but if you are having to do funky schema changes like splitting the tables it sounds like MySQL isn't ready for prime time yet.

    FWIW and FYI, MySQL optimizes for speed by keeping things simple. They don't support a lot of SQL features for transactions, by design. This lets them get some tremendously good performance for some things. For example, Slashdot, which is mostly DB reads with occasional DB writes. But I wouldn't run an e-commerce site with it.
  • shameless plug:
    If you are looking to build a high volume dynamic site, you need to look at ATG [atg.com]. The Dynamo app server is widely regarded as great! Sony's The Station [sony.com] is one of the very high volume dynamic sites [atg.com] that use Dynamo.
  • Interbase v6 isn't out yet (mid-year target date) - and Interbase 5.6 isn't open source.

    You can download & use Interbase v4 for Linux for free.

    It's a great database - ANSI 92 SQL, transactions, row locking - everything you'd want.

    It supports big (multi-terabyte) databases in multiple DB files, and yet has a small (10MB or so) footprint.

    It runs on lots of platforms (Linux, Windows, assorted Unices, Novell, etc.).

    The licence will be MPL, and everything but the replication engine and ODBC driver (both from third parties) will be open sourced.

  • A lot of people are putting forward ideas centered around hardware - there are a number of things you can do in your software to design for scalability:
    • Application Cache: Figure out what objects are constant across your entire application, and build a simple object cache that loads them up, holds them in memory, and has the ability to refresh the cache on a background thread (see the sketch after this list).
    • NO SESSION STATE: This one is counter-intuitive. You would think that it would be a good idea to cache some objects on a session-by-session basis, but it's not! Think about it this way: an average user clicks about 1 time per minute. By NOT caching objects for that user on a high volume server, you can use the resources to serve a couple thousand more requests. Also, most servers use cookies to hold a session ID, which must be read on each request - turning off sessions globally on your server will dramatically increase your maximum throughput.
    • Scalability, NOT Performance: This is a trap that many developers fall into - they think that by optimizing how fast pages come up, they are creating a scalable site. What this translates into is optimizing for a single user's performance. The actual goal is a linear ramp-up of response time based on the number of users. I've seen sites that returned pages in record time choke under a load of 20-30 simultaneous users because of this.
    • Session State: If you truly need session state, create a separate database or LDAP server, optimized for reads, to hold your session state. Don't use cookies to hold your session ID - use a GUID or something similar as a URL parameter. This lets you have session state, but allows users to transparently switch between Web servers on the front end, and their session follows them. Also, you are only loading up session state on the pages that need it.
    • Load Test: After every major change, load test your application, and watch for that linear ramp up. Capture statistics internally in your application about how effective your caching is, how much time is being spent on a particular operation, etc. Use these stats to figure out where to tune next.
    I think that's about it. If you (or anyone else) has any more questions, feel free to email me @ scotty@auragen.com. I've been doing a lot of this stuff lately, and could probably give you some more pointers.
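
    To make the application-cache point concrete, here is a bare-bones sketch (names are invented, and it refreshes lazily on a timer rather than on a background thread, which is the simpler variant):

    package App::Cache;
    use strict;

    my %slot;          # key => { value => ..., loaded => epoch seconds }
    my $TTL = 300;     # treat cached objects as fresh for five minutes

    # e.g. App::Cache::get("categories", \&load_categories_from_db)
    sub get {
        my ($key, $loader) = @_;
        my $entry = $slot{$key};
        if (!$entry or time() - $entry->{loaded} > $TTL) {
            # (re)load from the database and remember when we did it
            $entry = { value => $loader->(), loaded => time() };
            $slot{$key} = $entry;
        }
        return $entry->{value};
    }

    1;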




    Scott Severtson
    Applications Developer
  • Can't say too much, coz we're only days from going live with a major e-commerce site & we're under NDA - but here's what I (think) I can say that may be helpful:-

    The system we are deploying on is capable of sustaining delivery of more than 230 dynamic pages per second (which we have benchmarked using RSW's test suite) - around 20 million dynamic pages in a 24-hour period.

    We are using Compaq PCs and each is fitted with twin PIII @ 500MHz, 1GB RAM, 9Gb SCSI local drive (mirrored) and they are running WinNT 4. All networking is 100Mb, doubled up cards and cables (for fault tolerance). Our hosting ISP (big and in the UK) is responsible for maintenance, support, (main/primary) backup and hardware configuration.

    The system is isolated from anything but port 80 requests by a pair of Cisco firewalls, and all the kit plugs into a Cisco Catalyst 5505 switch. Behind the wall sits a BigIP box (for fault tolerance and load balancing - we've found BigIP kit to be very good, BTW) which passes HTTP requests to a pair of identical web/application servers (with the above config) that are both running IIS 4.0.

    The backend is connected to the front via a couple of hubs, and consists of a grand total of four boxes (again with the above config) running M$-SQL 7.0. Using Compaqs own clustering solution, these are linked up as two buddy pairs for 24x7 fail-over. Each pair of db servers appears as one virtual SQL server and each pair of db servers shares a 186Gb RAID5 array.

    We are using a proprietary scripting engine, which interfaces as an ISAPI filter and currently only runs under NT (hence the choice of M$ technology throughout). This is similar in architecture/functionality to PHP or ASP - although it outperforms both.

    Our application architecture allows us to split our db across two db servers (careful db schema design allowing load distribution). Furthermore all of our SQL queries are very atomic (no joins!) and the query results are then turned into script files and saved into a (db query) file cache on the RAID.

    Before we implemented caching of db queries as files, the db access was the bottleneck with throughput. Since implementing the query cache we can run at full tilt with our db servers ticking over at well under 10% load. We are using pooled persistent connections to our dbs also.

    Our script (embedded in HTML) is compiled into byte code on first access and written out as a file (similar to the stuff Zend are gonna do / are doing). This improves execution speed greatly after the first access.

    All files that are accessed by the script engine (which doesn't include the images/etc.) are cached into memory. All the application code is stored on the RAID and is accessible to all web servers in the cluster - the performance hit due to accessing a file over the network is only incurred on the first read (due to the local memory caches in the web servers' script engines).

    When a db row is changed, the script file in the cache is rewritten and all web servers are told to refresh the file cached in memory. (Mostly, our app reads from the db - writing to the db is quite infrequent compared to reading, hence this works very well for us.)

    Searching the site is all done through a table of keywords (again this table is file cached and then memory cached) - so we do not incur a performance hit from free-text searching of our pages.

    The system is fully scalable simply through the addition of extra web/application servers should the need arise (which I'm sure it will, one day!). As previously stated, the actual real-world load on our db servers is negligible, so we should be OK there for a while. If necessary, we can also move from 100Mb Cat5 up to optical fibre if networking becomes the bottleneck.

    Just my 2 pence worth - HTH.
    Jim

  • Probably not a popular view here, but if you are going to use something like Perl to be the front end glue for all your dynamic content, don't expect to be a speed demon when you've got a few hundred simultaneous users. A compiled language, like C, is optimum for thrifty thinking web designers. Small memory footprint. Small loading time. Small CPU expenditure.

    Harder to program in, but worth the trouble if you're building a scalable site. Just a personal opinion... I would avoid a real SQL database for simple jobs. While plugging in something like MySQL may make programming a whole lot easier, you pay for it in performance.

  • 1) Free text search on your database is going to slow down your application. If you can, try searching on a summary or title, or try to use a third-party search engine, e.g. Verity.

    2) Have you indexed the tables?

    3) RAM use is very intensive on dynamic sites; try for 1GB to start with.

    4) 200 requests a second is pretty respectable, but page requests, and more importantly dynamic page requests, are what you want to measure.

    5) Have a look at

    http://hotwired.lycos.com/webmonkey/design/site_building/tutorials/tutorial1.html

    for some tips about designing and redesigning sites.

    6) Don't make everything dynamic just because you can.

    adios
  • With the important thing being the balancing... separate DB and web servers, ads off of a different box... more of that info is included in the Slash source d/l, so take a look..
    140,000 pages per day is a pretty small load.

    Obviously, but the main part of the point was that it is 99% idle during the peak loads. I doubt ASP/VBScript could implement starship traders [starshiptraders.com], run 17 separate 64,000 sector games with hundreds of players each, and still be 99% idle on a Celeron 366. ;)
  • One thing that sometimes gets forgotten is that you can reduce calls to the server by using client-side strategies. For example, putting the raw results of a database query in a Javascript array and then manipulating it (sorting, etc.) in the browser.
    Ugh! May I puke on your shoes? Because this idea makes me ill.

    What happens if I'm running on an old 486? (I know people who are.) What if I have Javascript turned off, or filtered by a firewall? What if I have a browser with a buggy Javascript implementation?

    Unless you are in a highly controlled environment where you know the capabilities and configuration of every client, you should never, never, never do anything critical to your site on the client side.

  • At ArsDigita, instead of trying to keep up with the latest in fashionable languages, we spend our time worrying about "What data models and workflow structures would we need to run the entire MIT Sloan School from a Web-based system?"
    Sage advice. Get a lock on what you're keeping track of and who's changing it how, and the rest is just a SMOP.

    I've been amusing myself lately by creating a discussion system at my unreasonable.org [unreasonable.org] website. The interesting part wasn't so much coding it up (in PHP with PostgreSQL and a little Perl), but in figuring out the process - who's doing what to whom, and who might be doing other whats to someone else as I expand the system. I'll probably hack on it for a few months and then do a rewrite to clean up - but the data model shouldn't have to change.

  • ... the problems you describe with mod_perl have a pretty simple solution. You just use mod_perl as an application server and put a reverse proxy (like Squid or Apache + mod_proxy) in front of it to serve static requests.

    Sure, that's the standard solution to the problem of oversized mod_perl-ified processes eating up memory, but it comes at a high price in mod_perl's performance. I've found that mod_perl with a proxy in front of it can be about 20% as fast (in terms of requests/second) as direct access to mod_perl.

    "Naked" mod_perl, i.e. mod_perl without a proxy in front of it, is one of the fastest solutions around; but proxied mod_perl is so much slower that you might as well go to something like FastCGI and save all the administration effort that mod_perl requires.
  • The problem with giving students bad grades in courses is that (1) not only were they confused to begin with and didn't learn the material, but (2) they will be pissed off with you later! :-)

    Anyway, ACS has nothing to do with AOLserver or Tcl, really. SAP was written in COBOL initially. I'm sure a lot of language snobs threw rocks at them, but at the end of the day a good engineer realizes that the value is in the data model and business process workflow. Now that ArsDigita has $20 million in revenue, we're releasing pure Java and Apache versions of ACS. So the anonymous coward (and former bad student) could have learned that his statements about being "forced" to develop in Tcl are simply wrong. All he would have had to do is visit arsdigita.com.

    Anyway, the sad thing is that most people aren't good enough engineers to compare apples to apples. So they compare our toolkit, which is all about data models, to something like PHP, which is probably a fine programming language but does not even try to do the things that ACS does.

    On the scaling issue originally posted: people with sane tech architectures are usually able to rely on serving 10 dynamic requests/second/processor (these would involve a db query and a template merge). Could you do more than this? Sure! The real-time gamer guys build amazing stuff using UDP, custom C code, and in-memory databases of their own design. But what's the point for the average online community or ecommerce publisher? Most people with high-traffic sites can afford a rack of processors and a load-balancing switch.

    Anyway, I don't want to get into a flame war about AOLserver or Tcl. If you want to use our toolkit with pure Java running inside Oracle 8.1.7, go for it! You won't even need a separate Web server. If you want to use Apache instead of AOLserver, go for it! You'll probably need a slightly bigger machine but it won't really affect the end result. At ArsDigita, instead of trying to keep up with the latest in fashionable languages, we spend our time worrying about "What data models and workflow structures would we need to run the entire MIT Sloan School from a Web-based system?"
  • Not a God? Now I'm sad.

    Not as sad as I was on that Boston Marathon day (when we got 30X the traffic that the customer told us to expect). But is that a failure for our toolkit? No. We had the equivalent of 8 400 MHz CPUs working on the Marathon server (two Sun E450s). If you can serve what seemed like half the US population with two Sun E450s using PHP, that's great. But so what?

    There are people using ACS with a fat-ish db server plus a rack of small Web servers plus a load balancer. They don't seem to have any trouble scaling to arbitrary size.

    "On really expensive boxes"? Photo.net handles 1 million hits/day, goes to the database on every page load, and runs on a computer with 4 180 MHz processors! There are 10-year-olds whose Quake machines would crush my server like a bug.

    I am so sick of being the poster child for AOLserver (5 MB of code, smaller than the Oracle client library) and Tcl (which takes about two hours to learn). They are both open source, they don't get in the way, but if you want to use our toolkit (which is also free and open-source), you can now do it with Apache or with all-Java inside Oracle. Ben Adida ported it to PostgreSQL so you can be 100% open-source.

    Anyway, if you want to start from scratch in MySQL and PHP and try to replicate the effort of ArsDigita's 80 full-time developers, be my guest. But don't confuse yourself into thinking that these are "solutions" (as you put it). They are tools.
  • When your site uses SSL, the certificate is tied to one IP, yet mod_backhand and round-robin DNS respond with different IP addresses. Reverse proxying doesn't scale beyond 4 servers or so. How does Amazon or whoever handle load balancing among SSL web servers?

    Of course, I posted this late, so no one will probably read this question...
  • Pertinent links:

    Here [server51.net], at Server51.net, is some info on SLASH.

    -and-

    This [slashcode.com] is the place you really want to go. Slashcode.com. Here, you'll find bug reports, feature requests, and the latest source (and a few trolls...just like home :-).

    Also be sure to look through the mailing list archives... (at Server51, I believe)...

    Here's my [redrival.com] copy of DeCSS. Where's yours?
  • ...Search slashdot for older articles on this subject. In particular, you may find Ask Slashdot: Optimizing Apache/MySQL for a Production Environment [slashdot.org] useful. I'm sure there are plenty of others.

    Of course, don't search just slashdot for articles like this. Also search the archives of papers that have been presented at various USENIX [usenix.org]/SAGE [usenix.org] conferences (in particular, LISA [usenix.org]), and other USENIX publications, starting at http://www.usenix.org/publications/publications.html [usenix.org].

    You will also want to use index sites such as Yahoo! [yahoo.com] and Excite [excite.com], as well as search engines like Google [google.com], Altavista [altavista.com], and Hotbot [hotbot.com], not to mention community directory projects such as dmoz Open Directory [dmoz.org].

    That's just a sampling of the sorts of research that you should START with. Of course, to do this right, you'll need to do much, much more.
    --
    Brad Knowles

  • You'll probably get lots of replies about the funkiest hardware available, but for the extensibility of your service I would simply suggest making your data model as generic as possible. The reason Amazon could easily switch to selling other things besides books is that their database tables don't contain any book-specific information. Crosstable until you drop :)
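
    For what it's worth, a hypothetical sketch of what "generic" can mean in practice: an item table that carries only the common fields, plus a name/value cross table for anything product-specific, so new product lines need no schema change. All names and the JDBC URL are invented.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    // One-time setup of a deliberately generic catalogue schema.
    public class GenericCatalogSchema {
        public static void main(String[] args) throws SQLException {
            try (Connection c = DriverManager.getConnection("jdbc:mysql://dbhost/shop", "user", "pass");
                 Statement s = c.createStatement()) {
                s.executeUpdate(
                    "CREATE TABLE item (" +
                    "  item_id   INT PRIMARY KEY," +
                    "  category  VARCHAR(40)," +      // 'book', 'cd', 'toaster', ...
                    "  title     VARCHAR(200)," +
                    "  price     DECIMAL(10,2))");
                s.executeUpdate(
                    "CREATE TABLE item_attribute (" +
                    "  item_id    INT," +
                    "  attr_name  VARCHAR(40)," +     // 'author', 'isbn', 'wattage', ...
                    "  attr_value VARCHAR(200)," +
                    "  PRIMARY KEY (item_id, attr_name))");
            }
        }
    }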

    Also, a lot of functionality can be implemented at a lot of different levels. If you're using a powerful and fast database like Oracle, then use its features to do as much as possible.

    And finally: optimize your queries. It matters a great deal whether a single page in your site is constructed from 10 or 20 different queries.

    Just my 2 cents...

  • First, design the system: all the tiers (layers) in the n-tier system. Second, define which functionality is needed in which tier to make the system work as defined. Third, for each tier choose the right technology to make it perform at its best: fast AND reliable. THEN you start implementing.

    If you let your vision get blurred by focussing on one or 2 technologies before you've defined and designed the system, you'll never get to your goal.

    One global constraint is always the budget and what you can buy for that budget. Another is the lack of skills in certain fields of technology: some people on your team may be great Perl programmers but lousy at building COM components for middle-tier software.

    I know this sounds vague, but if you do steps 1 and 2 first, and do them well, you can then ask which technology is best for each tier/functionality group, given your team's constraints of money and knowledge/skills.

    Any platform and tool mentioned in this thread will, in a certain way, be able to give you what you asked for. Even Windows 2000 with built-in middleware (COM+), ASP, and an additional SQL Server or the free MSDE (Microsoft Database Engine, an SQL Server lookalike without stored procs). It's not possible, however, to point NOW to the right SET of tools, technologies and languages you'll need to create the high-speed web application setup you require.

    Please keep that in mind. :)

    --
  • I have much experience in this area. I run a hosting company that has built High Availability into the design of the network. Using Linux Virtual Server [linuxvirtualserver.org] as a load balancer allows us to scale very easily and provides maximum uptime and speed. For scalability, to add more resources we simply add new servers behind the LVS. The LVS is a patch to the Linux kernel, so it performs very fast and provides a layer of security, since you can forward only the ports that you want to. LVS has several different algorithms for scheduling which server gets each request.

    Also, using the Roxen WebServer [roxen.com] provides great performance for dynamic content. Roxen has built-in database features that perform very well; for instance, Roxen will keep the database connection open for fast access. Roxen will also allow you to write scripts that run very quickly because they get read into the Roxen process memory the first time they execute and never have to be loaded again (no forking). Or you can get away from CGI altogether and write your own Roxen modules that run internally in the server and add functionality. Roxen also has its own built-in markup language called RXML that includes tags for doing database queries, creating images on the fly, and many more. Bottom line, Roxen gives you the ability to do anything you want with your web server, and still gives you the best performance possible.

    For example, we have developed smtp and pop servers that run internally in the web servers. Our design allows us to load balance all of our services. The mail server also uses mysql to manage all the messages, email accounts, and aliases. None of the email accounts actually have system user accounts on the servers (higher security and easier to manage). Also, we get better performance out of the mail server, because instead of writing to files for mailboxes, it writes to a database. This is much easier to manage in an environment such as the hosting industry where we must maintain many domains and users.

    I was an Apache user for many years, until I discovered Roxen in mid 1998. And I would never go back. I have experience running thousands of web sites, using this concept, successfully.

    I also know that Real Networks [real.com] uses Roxen as their web server and development environment. Maybe somebody from there can post a comment about their experiences as well.
  • My experience at a "very high volume" web site tells me quite clearly that if you connect code to the live site, it must be compiled. Perl, Java, etc. are all unable to cope with sites getting more than 50 million hits/day.

    You may say that 50 million hits is an unrealistically high number to be discussing, but considering the rate of growth of web traffic, it won't just be the top ten sites that get this much traffic - many sites can expect it.

  • Ignore your application server vendor...I'm willing to bet that its the most unreliable part of Amazon.com.

    Agreed. Almost none of these tools are sufficiently mature at this point.

    Use well known, well respected, and evolved tools.

    To this point, I would caution against using MySQL - while excellent for smaller jobs, I have found it unable to hold up to high-volume concurrent use as well as Oracle. When it comes to an RDBMS, spend the money and get it over with. Free solutions have not closed the gap quite yet.

    Cache! Cache whatever you can.

    Make use of Akamai or another caching network where useful.

    Be ready to spend. Running a fast, large hits web site is expensive.

    And spend wisely. Don't fill up your rackspace with full-height servers when 1u servers will do. Don't tie too much of your site to one piece of hardware - you get better tolerance and granularity from many small servers as opposed to a few big servers. The colocations I visit tend to be a textbook case study in how to waste rack space.

  • mod_perl should be able to handle somewhere close to this on a single machine

    100 million hits???? No. I have mistakenly put Perl on one of our most highly visited pages and watched it get crushed like a grape... and this was only a random redirect generator.

  • I'd happily eat my words if someone could demonstrate PHP handling 100 million hits a day. I'd be pleasantly surprised..as my experience at a site that gets this type of traffic (so yes, I actually have some vague idea of what I'm talking about) has indicated otherwise. Our experiences with PHP (yes, we tried it on part of the site) on even moderately traffic'd parts of our site were not good.
  • Or OO HiP WAD. ;-)

    Really, I do this stuff for a living (primarily in server-side Java and these suggestions may show that), and there is no easy way I know to get great maintainability and high performance. Here are a few things to consider, though (there may be some information that was in previous posts):

    • The first thing to consider is the scale of your system's complexity. For the simplest sites, try to stick with static HTML as much as possible. Use a server-side process to update those files once or twice a day. For moderately complex systems, or systems that just need to be updated more quickly, use something like PHP or Cold Fusion with a simple database. For highly-dynamic sites with complex business rules, you'll need a well-designed, application-server-based system. The rest of these comments are about how to build these types of systems.
    • To make your system as adaptable as it can be, concentrate on the "problem domain" or "business rules". Design an object-oriented problem domain. Keep that problem domain code as separate as possible from the "presentation" and the "data management" code. Generally, we have separate Java packages for problem domain, presentation logic, and data management.
    • Find a way to cache instantiated problem domain objects in memory. Object pools, Enterprise Java Beans, and object caches are things to look into.
    • Cache semi-static results. Even in a highly-dynamic system, it is likely that less than 25% of the site is really that dynamic. In an e-commerce site, the inventory levels might change, but the product information (descriptions, images, etc.) will not. Design your presentation logic such that you can cache as much of these results as possible in memory. (A minimal sketch of this appears at the end of this comment.)
    • As you can see from the previous two, go heavy on the RAM on your servers.
    • To handle heavier loads, never have the same web server handling both application requests and static requests. Off-load static HTML and images to another server (or group of servers).
    • Use an application server that supports load balancing. EJB is a good (not great, but it's pretty widely available) method for distributing the problem domain load.
    • To aid in the maintainability of the presentation, use a template engine or use JSP, but treat it as a template engine. Examples of template engines include WebMacro [webmacro.org] and FreeMarker [sourceforge.net]. Don't put code in the HTML or HTML in the code if you can at all help it.

    Well, I'll stop there. There's plenty more to cover, but there are also plenty of books out there. Basically, in my experience, the first rule of maintainability is keeping the problem domain separate, and the first rule in performance is caching.
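
    Since caching comes up as the first rule of performance, here is a minimal, framework-free Java sketch of the "cache semi-static results" bullet above; the class names and the 10-minute TTL are illustrative, not from any particular framework.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Keeps rendered page fragments (product descriptions, navigation bars, ...)
    // in memory for a fixed time-to-live; truly dynamic values such as inventory
    // levels should still be fetched on every request.  A real cache would also
    // bound its size and evict old entries.
    public class FragmentCache {
        private static final long TTL_MILLIS = 10 * 60 * 1000;

        private static final class Entry {
            final String html;
            final long createdAt = System.currentTimeMillis();
            Entry(String html) { this.html = html; }
            boolean expired() { return System.currentTimeMillis() - createdAt > TTL_MILLIS; }
        }

        private final Map<String, Entry> cache = new ConcurrentHashMap<>();

        /** Returns the cached fragment, rebuilding it via the renderer when stale. */
        public String get(String key, Renderer renderer) {
            Entry e = cache.get(key);
            if (e == null || e.expired()) {
                e = new Entry(renderer.render());   // e.g. query the db and merge a template
                cache.put(key, e);
            }
            return e.html;
        }

        public interface Renderer { String render(); }
    }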

  • Hmm... it makes me wonder whether you were misinformed, or just unaware of PHP.

    PHP, freely available from php.net [php.net], is a hypertext preprocessor (hence the name). It is similar to ASP in that you splice snippets of code into HTML, but that's where the similarities stop. I used to do full-time ASP backend development for the web design company I work for (no, not digital11, for those of you who like to criticize my page). I was looking for something that was compatible with pretty much every server, because our clients' hosting methods vary between OSes. Once I found PHP, I never looked back. It's faster, it's easier to learn, it's easier to use, it's faster, it has many more features, it's open source, it's faster. Now I haven't heard about WebObjects (oh, did I mention that PHP is faster?), but after using PHP, I'm sure it would have to do everything but fsck your wife for you to be better than PHP.

    d11

  • One of the most common ways to scale your site is to add hardware load balancing and multiple cheap servers running exact copies of your web site. This is how I have built my company's site.

    We use a Cisco Local Director to load balance as many Linux servers as we like. It is very easy to insert and remove servers in a configuration like ours. Most large companies (such as Amazon or Yahoo!) use multiple systems like this.

    As for adding on new features, I would say that an iterative process is best. Once again, most large sites will have hundreds of developers working on different features which they will roll out one by one.
  • by bert ( 4321 ) on Wednesday March 08, 2000 @03:37AM (#1218705) Homepage
    Slashdot would be the obvious example, right? So ask CmdrTaco and his crew, and take care to download and fiddle with Slash first!
  • by JohnZed ( 20191 ) on Wednesday March 08, 2000 @08:01AM (#1218706)
    Ok, I promise I don't work for these guys, but I'd have to highly recommend Resin from www.caucho.com. It's open source and amazingly fast. We can serve dozens of requests per second of a reasonably complicated site on a crappy $400 Linux PC.
    Also, what JVM are you using? Definitely try the newest Sun (with Inprise JIT, must be downloaded separately) for a single-processor system and IBM's jdk for an SMP box.
    For app servers, you should check out WebSphere from IBM (not too expensive, relatively speaking). Also, TowerJ can DRAMATICALLY speed up Linux server-side Java and improve scalability (towerj.com, I believe), but it ain't cheap.
    Good luck!
    --JRZ
  • by elliot ( 23381 ) on Wednesday March 08, 2000 @03:46AM (#1218707) Homepage
    Be sure to separate dynamic and static content. Make sure your servers that are hitting the database and generating content are not wasting CPU and memory serving images.
    Make sure that you use HTTP headers to your advantage. By setting these correctly, you can benefit from caching on the client end, as well as on a front-end server (Squid, mod_proxy, etc.); a small servlet sketch follows below.
    Find a tool that lets you separate code and HTML. Perl modules work great in conjunction with mod_perl and one of the embedded Perl modules (HTML::Mason, HTML::Embperl, ePerl, etc.). However, many tools allow you to do similar things, e.g. JavaServer Pages. The key is to be consistent and keep the logic separated from the layout and display.
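
    To make the HTTP-header advice above concrete, a small servlet sketch (the one-hour lifetime and the class name are arbitrary) that marks a mostly-static page as cacheable by browsers and by a front-end proxy such as Squid or mod_proxy:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Tells clients and intermediate caches they may reuse this response for an
    // hour instead of hitting the application server again.
    public class CacheableServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            long oneHourMillis = 60L * 60 * 1000;
            resp.setHeader("Cache-Control", "public, max-age=3600");
            resp.setDateHeader("Expires", System.currentTimeMillis() + oneHourMillis);
            resp.setContentType("text/html");
            resp.getWriter().println("<html><body>Mostly static content here</body></html>");
        }
    }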

    HTH, Aaron
  • by Sun Tzu ( 41522 ) on Wednesday March 08, 2000 @03:59AM (#1218708) Homepage Journal
    In Starshiptraders.com (a game), I wrote the entire thing in C -- there isn't even a copy of Apache involved. That system serves about 140,000 pages per day, and at peak times the single-CPU machine (Celeron 366/320MB) is about 99% idle. All the files will fit in memory -- I only have about 120MB of data, but it is frequently and intensively used. The data is all stored in flat files with pointers to related records in the same and other files. All data is, therefore, directly addressable and there is not much in the way of wasteful I/O. I can't really recommend this approach, though, for the obvious development and maintainability problems that it entails. ;)
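
    The flat-file layout above is in C, but the idea translates directly: with fixed-length records, record N lives at byte offset N * RECORD_SIZE, so a "pointer" to a related record is just that record's number. A rough Java equivalent (record size and class name invented):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    // Fixed-length records addressed directly by record number -- no scanning,
    // no parsing, and the whole file tends to stay in the OS page cache.
    public class RecordFile {
        private static final int RECORD_SIZE = 64;     // illustrative size in bytes
        private final RandomAccessFile file;

        public RecordFile(String path) throws IOException {
            this.file = new RandomAccessFile(path, "rw");
        }

        public byte[] read(long recordNumber) throws IOException {
            byte[] buf = new byte[RECORD_SIZE];
            file.seek(recordNumber * RECORD_SIZE);     // direct addressing
            file.readFully(buf);                       // throws if the record doesn't exist yet
            return buf;
        }

        public void write(long recordNumber, String data) throws IOException {
            byte[] buf = new byte[RECORD_SIZE];        // pad/truncate to the fixed size
            byte[] src = data.getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(src, 0, buf, 0, Math.min(src.length, RECORD_SIZE));
            file.seek(recordNumber * RECORD_SIZE);
            file.write(buf);
        }
    }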

    My other dynamic site project, SiteReview.org (user-posted website reviews), is written in PHP, serves pages with Apache, and has a MySQL back end data store. The /. folks are proof that MySQL and Apache are up to the task for some serious work. I have gone to some trouble to minimize the number of database calls and will work a bit more to minimize the size of returned pages. Each of my tables is indexed on the (very few) columns that are used to access it, so I get no full table scans. PHP can be compiled as a module for Apache, eliminating the startup overhead and resulting in quite efficient processing. PHP is also very easy to work with.

    Currently that site is a work in progress with very small volume, and I therefore have no evidence yet that I did anything right ;). (Anyway, it's at a hosting provider that is not optimized for PHP -- they call a PHP executable for each of my pages. If the need should arise, I will move it.) However, I think that this approach is a good balance between maintainability, efficiency and scalability. You can start with a single system and, when the load exceeds the capabilities of the box, you can easily offload the database onto a dedicated database server and put up multiple webservers on the front end with DNS round-robin or somesuch.
  • by noeld ( 43600 ) on Wednesday March 08, 2000 @04:20AM (#1218709) Homepage
    My site RootPrompt.org -- Nothing but Unix [rootprompt.org] is written in php3 with a MySQL database backend.

    I have worked to minimize the database calls and keep the pages as small as I can. By minimizing the database calls I give myself more room to grow before I start needing more hardware, and by keeping the pages small I make the site friendlier to slow connections and make better use of my bandwidth. I think that if you are waiting for something to download it should be what you want (content), not fluff.

    I have added features slowly as I have gotten them working: comments, user logins, syndication pages, etc. I think that if you get a good idea, get it online and then work to make it better.

    I think you should always keep in mind that anything cool may soon be much bigger, so write a site that is cool when ten people use it and is still cool (and fast) when ten thousand people (or more) are using it.

    I would also recommend setting things up so that your content can be syndicated and shared on other sites.

    RootPrompt.org's headlines, for example, can be had in Netscape's RSS format at:
    http://rootprompt.org/rss/ [rootprompt.org]

    and in text format at:
    http://rootprompt.org/rss/text.php3 [rootprompt.org]

    Doing this will allow you to share the content that you create with the world without requiring a lot of machine on your end.
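
    For anyone rolling their own feed, a rough Java sketch of emitting that kind of RSS 0.91 document (the Headline class is invented, and real code should XML-escape the titles and links):

    import java.util.List;

    // Builds a minimal RSS 0.91 feed of headlines for other sites to pull.
    public class RssFeed {
        public static class Headline {
            final String title, link;
            public Headline(String title, String link) { this.title = title; this.link = link; }
        }

        public static String render(List<Headline> headlines) {
            StringBuilder xml = new StringBuilder();
            xml.append("<?xml version=\"1.0\"?>\n");
            xml.append("<rss version=\"0.91\">\n<channel>\n");
            xml.append("  <title>RootPrompt.org</title>\n");
            xml.append("  <link>http://rootprompt.org/</link>\n");
            xml.append("  <description>Nothing but Unix</description>\n");
            for (Headline h : headlines) {
                xml.append("  <item>\n");
                xml.append("    <title>").append(h.title).append("</title>\n");
                xml.append("    <link>").append(h.link).append("</link>\n");
                xml.append("  </item>\n");
            }
            xml.append("</channel>\n</rss>\n");
            return xml.toString();
        }
    }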

    Noel

    RootPrompt.org -- Nothing but Unix [rootprompt.org]

  • by Hairy Fop ( 48404 ) on Wednesday March 08, 2000 @03:30AM (#1218710) Homepage
    If you're using JServ and Apache, you can use multiple front-end servers (e.g. via round-robin DNS) and multiple backend JServ engines.

    If you want to cluster multiple SQL servers, you can have multiple read-only MySQL servers and one write MySQL server which updates the others from its update log.
    Your DB code would have to be aware of the read and write DB servers.
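
    A bare-bones Java sketch of that read/write split; host names and credentials are placeholders, and a real setup would pool connections rather than opening one per call.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.concurrent.atomic.AtomicInteger;

    // INSERT/UPDATE traffic goes to the single write master; SELECTs are spread
    // round-robin across the read-only replicas that the master feeds from its
    // update log.
    public class SplitConnections {
        private static final String WRITE_URL = "jdbc:mysql://db-master/app";
        private static final String[] READ_URLS = {
            "jdbc:mysql://db-read1/app",
            "jdbc:mysql://db-read2/app"
        };
        private final AtomicInteger next = new AtomicInteger();

        public Connection forWrites() throws SQLException {
            return DriverManager.getConnection(WRITE_URL, "app", "secret");
        }

        public Connection forReads() throws SQLException {
            int i = Math.floorMod(next.getAndIncrement(), READ_URLS.length);
            return DriverManager.getConnection(READ_URLS[i], "app", "secret");
        }
    }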

    As for coping with changes: if they are interface changes, then you need to create your architecture separating application logic from interface logic,
    e.g.
    servlet (Java) code handling DB interaction and application-type logic, and an intelligent templating language (XSL) handling the interface.
  • by slazlo ( 87565 ) on Wednesday March 08, 2000 @03:45AM (#1218711) Homepage
    About 6 months ago I began development on a site that I wanted to be scalable to the extreme. After research into which tools best fit my job, I decided on AOLserver for many reasons, including: multithreaded vs forking architecture, persistent db connections, shared memory space, a proven track record, and a simple implementation including embedded Tcl in pseudo-ASP-like pages. Fortunately, since then AOLserver has become open source under the GPL, allowing my complete architecture to rely only on open-source tools: Linux, Postgres, AOLserver, Postfix, and more. I believe AOLserver is one of the best-kept secrets among Open Source tools out there. A company named ArsDigita has an open-source toolkit designed for building online communities, and online forums for any problems in case you get stuck. I would write more but it's 8AM and I haven't been to bed yet... maybe when I get up after noon ;)
  • by slazlo ( 87565 ) on Wednesday March 08, 2000 @04:34AM (#1218712) Homepage
    Before I sleep here are some useful links I found when I was investigating which tools to use:

    http://www.acme.com/software/thttpd/benchmarks.html

    http://www.cs.wustl.edu/~jxh/research/

    http://photo.net/wtr/thebook/server.html

    http://aolserver.com/features/

    http://www.aolserver.com/tcl2k/html/index.htm

    http://www.linux-ha.org/

    http://www.linuxvirtualserver.org/

    http://www.citi.umich.edu/projects/citi-netscape/reports/web-opt.html

    http://linuxperf.nl.linux.org/

    http://www-4.ibm.com/software/developer/library/java2/index.html

    http://www.squid-cache.org/

    Hopefully this may be of help to you also... After all this research I was very pleased to go with AOLserver; even though it was not in the web server comparison at the thttpd site, that model was represented by Zeus and thttpd, and AOLserver has many additional features that really sold me.

  • by Tassach ( 137772 ) on Wednesday March 08, 2000 @08:24AM (#1218713)
    I can't even count the number of times we stared at the mysqladmin processlist and saw one of our tables constantly locked (stopping all reads and writes) before we came up with this solution.


    In my experience, locking contention is usually due to inappropriate indexing and bad SQL coding. I'm not familiar with MySQL, but if you are having to do funky schema changes like splitting the tables, it sounds like MySQL isn't ready for prime time yet. Dodgy workarounds are no substitute for a quality DB server. My personal preference, Sybase ASE 11.0.3.3 for Linux [sybase.com], is available with a zero-cost license for both production and development deployments. Sybase ASE 11.9.2 has some significant improvements over 11.0 and is zero-cost for development only. (Unless you REALLY need row-level locking, 11.0 will probably meet your needs.)

    Question - are you splitting between rows or between columns? If you are having to split between rows, the problem is most likely resulting from an inappropriate clustered index. In an insert-intensive database, a bad clustered index will result in a hot spot in the last data and/or index page of the database. The best solution here is to cluster on a surrogate key. Your surrogate key generation algorithm needs to be carefully designed to distribute inserts evenly in the table. If you are doing primarily single-row updates and inserts, you should only be seeing page-level locks. If you are updating records frequently, try and use only fixed-length datatypes (or at least only update the fixed-length fields); this allows in-place updates. You should avoid indexing frequently updated fields, if possible.

    Database design is an art. Ditto for performance tuning. An expert DBA is worth his/her weight in gold. There's a good reason top DBA's command top rates :-)

    "The axiom 'An honest man has nothing to fear from the police'
  • by Anonymous Coward on Wednesday March 08, 2000 @04:47AM (#1218714)
    We use Apache/PHP/MySQL to serve out about a million dynamic pages per day. Here are some general tips that will hopefully be of use, although a fuller understanding of how your data is structured would help.

    Cache your dynamic data
    I cannot emphasize enough how important this is. If your site pulls up a page of 50 records, each of which pulls other database info, create a cache for the whole page. Then create a cache for each record as well. Cache elements as well as entire pages of dynamic data into separate cache tables in the database wherever possible.

    Split your servers
    Separate your database and Apache servers. They should communicate with each other through a 100Mb network switch at the least (make sure you use full duplexing as well).

    Split your tables
    I know this sounds funny, but the current version of MySQL has a tendency to lock tables when doing writes to the table. This means that one update on a table can halt all other reads on the table. MySQL is very fast, so normally this isn't much of a problem, but when you start getting into a high volume of requests, you're going to start to get bogged down. The solution to this is to split your table into multiple tables, i.e. the table mydata becomes mydata_1, mydata_2, mydata_3, etc, where for example mydata_1 might hold records starting with a-e, mydata_2 is for f-h, etc. This might sound tricky but it will save you a lot of trouble. I can't even count the number of times we stared at the mysqladmin processlist and saw one of our tables constantly locked (stopping all reads and writes) before we came up with this solution.
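
    A small sketch of how the application side of that split might look (the three-way a-e / f-m / n-z split here is only an example): the table name is chosen in code, so a write lock on one slice never blocks reads on the others.

    // Maps a record's key to one of the physical slices mydata_1..mydata_3.
    public class TableSplitter {
        public static String tableFor(String key) {
            char c = Character.toLowerCase(key.charAt(0));   // assumes a non-empty key
            if (c <= 'e') return "mydata_1";
            if (c <= 'm') return "mydata_2";
            return "mydata_3";
        }

        public static String selectSql(String key) {
            // The table name cannot be a bind parameter, so it is spliced into the
            // SQL here; the key itself should still go through a prepared statement.
            return "SELECT * FROM " + tableFor(key) + " WHERE name = ?";
        }
    }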

    Your MySQL server
    Should have a ton of memory plus a fast disk. Preferably a gig or more for memory, use RAID or a fast (10,000 RPM+, 6ms or less seek time) SCSI disk for the data, also keep it separate from your OS disk (i.e. don't have your OS running on the same disk as your MySQL tables). Save the high MHz processors for...

    Your Apache servers
    We've found that the really processor intensive stuff happens on the Apache servers. So you should keep an eye on the load average, etc. on these servers. If they start to get bogged down, you can just pop in another Apache server and split the load.

    The nice thing about this setup is that you can keep adding Apache servers as your processing needs go up. From our own experience MySQL is not very processor intensive but very dependent on memory and disk speed. When using MySQL with Linux, you also have to be careful about file system limitations, handlers etc. Tweaking certain variables should help. Good luck!

  • Slashdot is the most unreliable site I visit on a regular basis. Throughout the day, page loads can take several seconds; sometimes not, sometimes longer, sometimes the site is not accessible at all. This has been the case on both the East Coast and the West Coast (I just moved from one to the other). Note that on the East Coast I had an extremely fast cable modem that was faster than most T1 connections I've had at actual companies, and on the West Coast it's been through actual T1s. Also, I compare to other sites at the same time that Slashdot is slow.

    No, I'm not flaming Slashdot; I love everything else about the site. But its accessibility unfortunately didn't improve with the Andover.net takeover, nor through any of the other changes that have been happening in the last two years.

    I'm sure other people's mileage will vary; I'm interested in hearing about their experiences.
    ----------

  • by Get Behind the Mule ( 61986 ) on Wednesday March 08, 2000 @06:00AM (#1218716)
    The fashionable technologies for persistent server-side programming for the Apache server these days are mod_perl, PHP and Java servlets, but I've had very good experience with another one that doesn't seem to be so "hip": FastCGI [idle.com]. FastCGI is not the fastest of them all (for that, you need to program your own Apache module), but its robustness and maintainability, in addition to very good speed, make it one of the best choices of the lot.

    Here are some of the advantages I've seen:

    • Low impact on the server. FastCGI processes run independently of the server; they communicate with the server via a protocol, although if you use one of the programmer's interfaces you never really have to worry about that. To the programmer, a FastCGI program looks very similar to a CGI program, except that it has all the advantages of persistence.

      Since FastCGI processes are independent of the server, they are less likely to weigh the server down with a heavy processing load, and buggy FastCGI's are less likely to slow down or crash the server. If a FastCGI is going haywire, the problem can be diagnosed with the usual tools for analyzing the process behavior (like ps, top, Sun's proctool, etc). And FastCGI can be configured to adjust the number of running processes to fit the load.

      In contrast, technologies like mod_perl or PHP, which are embedded in the server, place an extra load on the server itself. They increase its memory footprint (especially in the case of mod_perl), which can be very problematic when Apache forks extra servers to handle request spikes -- you run the risk of running out of memory. They can make the web servers start up more slowly, and if one of your programs has gone on the blink, it can adversely impact the servers themselves. And embedded programs are not as easy to debug as independent processes.

      In the case of servlets, since they all run as threads within a JVM, then if one of them is buggy or slow, it's not easy to find out which one is causing the problem. Usually you just notice that your JVM has slowed down, deteriorating everything else; then you have to go about finding out which thread is responsible.

    • No commitment to a particular programming language. This takes away one of the most contentious debating points concerning mod_perl, PHP, Java servlets, and other technologies like mod_pyapache and the Apache API. Each of them requires a specific language, and hence a discussion of the various approaches often deteriorates into a language flamewar. More to the point, many programmers simply cannot use one paradigm or the other because they don't know its "proprietary" language very well. And once you've started, you're locked in; you may have to think twice about hiring a perfectly good new programmer, because your candidate doesn't happen to be fluent in the language you've chosen.

      None of these problems come up with FastCGI. You can write FastCGI processes in whatever language you like, as long as you honor the protocol. And there are re-usable FCGI interfaces for C, C++, Java, Perl, Python and TCL.

    • With Perl's CGI::Fast, the FCGI program can also run as conventional CGI or from the command line. Having just praised FCGI's language independence, I do have to mention this advantage of the Perl interface. It makes FastCGI processes much easier to debug than, say, a mod_perl handler.


    I personally happen to like Perl a lot, and I very much like the idea of mod_perl. Programming to the Apache API with Perl is way cool, and so many Perl programmers fall all over themselves praising it as a panacea. But because of the memory impact on the server, I have found it very difficult to deploy mod_perl so that the server is stable and doesn't eat up all my RAM. It can be done, with a lot of effort on the part of the web server administrators, but it's certainly a lot harder than it is with FastCGI.

    And for the record, I do recognize the strengths of the various other techniques that I've mentioned as well. They all deserve their status as highly respected technologies for server-side programming, but FastCGI ranks up there with them in quality and deserves more attention than it's been getting.
  • by Steven Pulito ( 80879 ) on Wednesday March 08, 2000 @04:16AM (#1218717) Homepage
    Some links you might find useful are:
  • by rambone ( 135825 ) on Wednesday March 08, 2000 @05:21AM (#1218718)
    This book is really quite good considering how early it came out relative to the maturity of most high volume sites.

    Most points mentioned here are covered in detail in this book.

  • by Anonymous Coward on Wednesday March 08, 2000 @03:37AM (#1218719)
    Philip & Alex's Guide to Web Publishing and the Web Tools Review are some good sources of information on this topic. Both can be easily found at http://www.photo.net/. Philip Greenspun, who is the creator of photo.net and wrote the Guide to Web Publishing, also is the founder of ArsDigita. ArsDigita does web dev consulting and offers a free, open source toolkit for building robust, high-utilization sites. The previous poster directed you to a good info source, I'm not sure why they were rated down to 0...
  • by Matts ( 1628 ) on Wednesday March 08, 2000 @03:44AM (#1218720) Homepage
    I'm going to give away the big secret of this:
    There are no shortcuts
    Wow - amazing huh? There are some things you can do, like not using spawning CGI scripts (which you're not) and using persistent database connections (which you are), but short of that there's no shortcut. That's not to say there's nothing you can do though:
    • Ignore your application server vendor. They have to pass on some of the cost to Oracle, and they don't really manage Amazon.com with their product - but they probably do some small part of it so they can say that legally. I'm willing to bet that it's the most unreliable part of Amazon.com.
    • Use well known, well respected, and evolved tools. These include things like mod_perl, Apache, and Oracle. Java servlets are getting there (but you saw that they don't scale fantastically, and their JDBC drivers are much slower than Perl's equivalent); they just aren't that fast yet on large projects. AOLserver also looks like a fairly nippy option, but you need to use Tcl to program it, AFAIK.
    • Tune your database. This can't be stressed enough. It may take the rest of your life, but do it anyway. And if you can't do it, then hire a professional. These guys are expensive though - but you get what you pay for in this respect.
    • Split up your hardware. A separate DB and Web server can increase your application's speed no end due to removing contention for resources.
    • Cache! Cache whatever you can. If using something like mod_perl then stick the "Oops" proxy server in front of it to cache page accesses (there are good reasons why this speeds things up). Cache stuff in your server's ram. Cache stuff in shared memory.
    • Be ready to spend. Running a fast, large-hits web site is expensive. There are no ifs or buts about this unless you don't mind downtime. PhilG of "Philip and Alex's" fame estimates something like $100,000+++ a year to run a web site like this, taking into account Oracle costs, support, DBA costs (yes, you do need one), hardware and network costs.
    And read "Philip and Alex..." - even if you only get the web version - somewhere off http://photo.net. He debunks the myths of application servers and reducing the costs and time of development of this sort of thing. And read "The Mythical Man Month" - that also debunks the idea of reducing the time to develop complex things.

    Good Luck!

