On Building High Volume Dynamic Web Sites
"Apart from this I have been talking to commercial vendors like BEA (I was very impressed) who provided application servers with load-balancing, replication, etc., starting at $20,000 (Australian) -- they run sites like Amazon.com, Qwest, Wells-Fargo etc.
There is an issue here (is there? I don't have any experience to really know hence am asking you) ... I can build a custom solution with load balancing written at the application level. But how does this affect my maintainability (for example Amazon.com moving from just books to all sorts of other stuff .. how long did it take to redesign the site etc.)?
The site I first built could potentially hold information about a million refugees, and allowed searching on most fields of a person's record (wildcard queries). Unfortunately, on doing some stress testing (with around 700,000 records) I found that at most 15 hits could be handled every ten seconds. I optimized the code, switched to a faster JDBC driver, wrote a simple load balancer (and I mean very simple), limited searching to a few fields, and prevented bad wildcard queries (e.g., a wildcard at the start makes little if any use of the index). Consequently, I managed to get the system to handle slightly more load (200 hits per 5 seconds). (Hardware was a dual Pentium II 450MHz I think, 512MB RAM, 2x8GB Ultra-wide SCSI hard drives, and running Linux of course.) BTW, the Kosova refugees article has a lot of misinformation, e.g. encrypted databases, and the time to actually build it was one week (plus two weeks of overcoming red tape, etc.)."
there is no scalability problem (Score:1)
WebObjects (Score:1)
So, why GSWeb or WebObjects?
- It's a framework, not an application server. In other words, you don't build applications, you use the WO framework to build application servers. Each app handles its own memory, threading, resources, etc., for incredible performance and scalability. This also makes it extremely flexible, as it's a strong OO environment and almost all of the functionality of the app server can be overridden in your application.
- EOF (Enterprise Objects Framework). If you've never used EOF (or GNUstep DB) you don't know what database development could be like.
- It's a standard environment. By this I mean the programming is more like a standard GUI app. You don't have a form that posts data to the same page or another page. Instead, you have a form with an action property that is bound to a method, so that when an action is performed (click submit) the method is called. You as a developer don't have to worry about form names and how to do validation; your method can do whatever it needs and return an arbitrary page without messiness like forwarding or including other pages.
There's lots more. Basically, when logic very much outweighs content, WO is hard to beat. On the downside, it is a completely different way to think about web development and there is a learning curve. For OpenStep developers, this will be small, but for programming novices, it could be quite steep. But how productive you can be with a tool is definitely a factor that needs to be considered.
Check out GNUstep Web at www.gnustepweb.org [gnustepweb.org]. I have the source for a very simple GSWeb application up at http://zeus1.tzo.com/GuestBook/ [tzo.com]; this is basically a GSWeb conversion of Apple's first WebObjects tutorial app (yes, it's entirely useless and it would be stupid to use WO for a GuestBook :)
Re:photo.net & ArsDigita (Score:1)
The reason is primarily that the system has not been embraced by the community. There aren't people outside of ArsDigita hacking on it, and the people inside are only hacking on it in ways that are good for ArsDigita (as of now, there is no development team for the toolkit, site development teams modify the toolkit, and sometimes their mods get folded into the distribution).
Philip will try to sell you on the "proven" AOLServer (uh, almost no one uses it in the real world), which was a very sketchy open source project last time I looked at it. His system also backends to Oracle, which is not going free anytime soon. Finally, he will try to sell you the ACS as the greatest thing since sliced bread. It was, in 1995.
So read his book, which is pretty good (you could even buy it; it's the only computer book I would put on my coffee table, lots of pretty pictures). Just ignore all the technical parts.
PS: It will also force you to do development in TCL, which sucks ass.
Re:Comments on this informative post... (Score:1)
A free loadbalancer (Score:1)
http://www.eddieware.org/
Re:photo.net & ArsDigita (Score:1)
As an aside, python has the weirdest variable scoping and declaration rules. I ought to know, I have a PhD in programming languages.
The biggest selling points of ACS for me are:
Rama Kanneganti
Re:photo.net & ArsDigita (Score:1)
Somehow my emacs inserted that "use Php" in the previous sentence.
Thanks.
Re:Actually, Run Screaming From Dynamo (Score:1)
Recommending JServ as an alternative solution to Dynamo is like (strained analogy) giving sheet metal and welding tools to kids vs. giving them a climbing frame.
Caching data in a hashmap is a good first pass solution, but then you have to deal with threading issues, which brings up the problem of state. Writing a thread-safe LRU cache is perfectly possible, but a total waste of time if you're trying to write a web application.
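For what it's worth, a thread-safe LRU cache of the sort described above is only a few lines in modern Java: a `LinkedHashMap` in access order handles the eviction, and a synchronized wrapper handles the thread safety. A rough sketch (class name and capacity are made up for illustration):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// A small thread-safe LRU cache: a LinkedHashMap in access order evicts
// the least-recently-used entry once capacity is exceeded, and the
// synchronized wrapper deals with the threading issues mentioned above.
public class LruCache {
    public static <K, V> Map<K, V> create(final int capacity) {
        Map<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
        return Collections.synchronizedMap(lru);
    }

    public static void main(String[] args) {
        Map<String, String> cache = create(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so "b" becomes the eldest entry
        cache.put("c", "3"); // evicts "b"
        System.out.println(cache.keySet()); // [a, c]
    }
}
```

Whether writing it is still a waste of time is your call; the point is only that the mechanics are small.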
If you must go for a free application server, choose Zope or The Locomotive. I haven't tried them out, but they seem to have the (minimal) functionality required...
Re:Actually, Run Screaming From Dynamo (Score:1)
I've written server-side Java using Apache Jserv before I ever used Dynamo. Server side Java alone will screw you up horribly, because you can't easily generate dynamic HTML pages, and unless you write your own templating feature you're stuck mixing HTML code with Java code.
This will stymie the designers completely, because the only way they can make changes to the page is to ask the programmers to do it for them. The programmers are stuck doing both design and programming. Forget about making easy progressive changes to the code -- if you have one page where you want to change every third element, you'll have to trawl through mounds of printlns.
Not to mention that there's no personalization. There's no database cache so you go to the database every time you need data (which can be horribly expensive). There's no connection pooling. If you want to grab stuff from the database, you have to write dynamic SQL to grab elements... I can't describe how tedious this can get.
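For the connection-pooling gap, a toy fixed-size pool is not hard to write by hand. This sketch uses a generic factory instead of real `java.sql.Connection` objects (all names invented; a production pool would also validate and reopen broken connections):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// A toy fixed-size resource pool of the sort a servlet app would keep
// for database connections: create N resources up front, borrow() blocks
// until one is free, release() returns it for reuse.
public class SimplePool<T> {
    private final BlockingQueue<T> idle;

    public SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    public T borrow() throws InterruptedException { return idle.take(); }

    // Caller must only release objects borrowed from this pool.
    public void release(T resource) { idle.add(resource); }

    public static void main(String[] args) throws InterruptedException {
        // StringBuilder is a stand-in; a real pool's factory would call
        // DriverManager.getConnection(...) instead.
        SimplePool<StringBuilder> pool =
            new SimplePool<>(2, StringBuilder::new);
        StringBuilder conn = pool.borrow();
        conn.append("query");
        pool.release(conn);
    }
}
```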
If you want to write everything yourself or don't have the money, use Jserv. It's a good Java server, and it does everything it's supposed to. Whether it will do everything you need is another matter.
how do you get Apache to load balance? (Score:1)
___
More Freshmeat Info: (Score:1)
Before I forget.
Another important feature is that these techniques are server side, so they are independent of the browser used. You can also use PHP with Oracle 8i, but most of the really cool stuff is centered around MySQL. Plus, why spend money on software if you don't have to? Spend the money you save on paid support, or donate it to the open source projects (PHP and MySQL). Yet another advantage is that the application can be developed and tested on Windows (read Configuring Windows98 For Local Dev [phpbuilder.com]) then uploaded to the Apache server. A good free program with PHP syntax highlighting and tons of other goodies for development is HTML-Kit [chami.com].
Re:Persistent CGI (Score:1)
--
Re:Persistent CGI (Score:1)
Last summer I worked for a while at Planet Online in the UK. (They're big Linux supporters and host linux.org.uk). The focus of their activities is Freeserve, the largest ISP in Europe with two million users. It's specified for ten million.
The entire system runs on Linux, with NetApp filers doing the actual disk storage.
The user database runs in mySQL. Every mail sent or received, every user homepage hit, every DNS hit, involves a query against the back-end database. That's for two-million users - almost a hundred thousand of whom are logged in at any one time.
The amazing fact is that it isn't slow, it isn't unreliable, and it isn't reaching its limits. Why? Engineering.
The people at Planet simply have an in-depth understanding of the hardware and software that makes up their platform. They've put in fast kit and lots of memory, and they've tuned the code (long live open source) to make use of it. It flies.
Re:FastCGI is the unsung hero (Score:1)
Re:FastCGI is the unsung hero (Score:1)
load balancing, fast searching, and app servers (Score:1)
Ultimately the proof is in the pudding. None of the big sites use commercial application servers for anything they really care about. I know people who work at Yahoo, Amazon, and others, and they use home grown stuff, often with Apache, and often in C or Perl. You can make a fast system with any number of open source tools like PHP, mod_perl, Resin, AOLServer, etc.
Your search speed issues can be fixed if you put MySQL on a box with enough RAM to keep most of your data in memory and use proper indexing. For full text searching, you want to build an inverted word index of your documents. There are plenty of examples in Perl that you can steal from if you need help. Try looking up the module Search::InvertedIndex on CPAN, or search the Dr. Dobb's Journal archives for an article about this subject.
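A sketch of such an inverted word index in Java (class and method names invented; Search::InvertedIndex implements the same idea for Perl):

```java
import java.util.*;

// Sketch of an inverted word index: map each word to the set of record
// IDs containing it, so word lookups hit a hash map instead of scanning
// every description with LIKE '%foo%'.
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();
    // Common words to exclude, as suggested elsewhere in this thread.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("and", "for", "the", "of"));

    public void add(int recordId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            index.computeIfAbsent(word, w -> new TreeSet<>()).add(recordId);
        }
    }

    public Set<Integer> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(),
                                  Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "HEWLETT PACKARD LASERJET 4");
        idx.add(2, "HEWLETT PACKARD DESKJET");
        System.out.println(idx.lookup("laserjet")); // [1]
        System.out.println(idx.lookup("hewlett"));  // [1, 2]
    }
}
```

The same structure can live in a database table instead of memory, as described further down the thread.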
Now, you need to do some load-balancing. For a good overview and some possible options, read this:
http://www.engelschall.com/pw/wt/loadbalance/
Then check out some of the links to IP-level load-balancers at http://perl.apache.org/guide/download.html#High_A
There are probably some good FreeBSD-based projects as well, since products like big/ip and Coyote Point's Equalizer are based on it.
separating content (Score:1)
Re:Whoa now! Greenspun is NOT a GOD (Score:1)
However, I have to look at it the same way -- just like I wouldn't judge MySQL by one site, I won't judge the ACS by one site. Nor will I judge AOLserver by one site.
However, when AOL is pumping 28 thousand hits per second out to the net through AOLserver, I _do_ have to sit up and take notice.
Oh, I've run AOLserver for nearly three years on my Linux box -- and it runs smooth as silk. Very easy to develop dynamic pages in, very scalable, and wicked fast.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Re:MySQL (Score:1)
MySQL (Score:1)
Re:My two approaches... One good, one bad (?) (Score:1)
Re:Multiple MySQL server -- thanks (Score:1)
Re:Actually, Run Screaming From Dynamo (Score:1)
Regarding non-DB caching, that is an excellent point. However, it's not really that hard a problem to solve, either. It's nice to have it solved for you (Dynamo, etc.), but as long as you have everything running in the same JVM, you can cache data in a HashMap (keyed by user ID or whatever); just make sure to watch for threading issues.
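A minimal version of that same-JVM cache, keyed by user ID (names invented; a concurrent map handles the threading issue mentioned above, though unlike an LRU cache nothing here ever evicts stale entries):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-process cache keyed by user ID. ConcurrentHashMap sidesteps most of
// the threading issues of a plain HashMap shared between request threads.
public class UserDataCache {
    private final Map<Integer, String> byUserId = new ConcurrentHashMap<>();

    public String profileFor(int userId) {
        // computeIfAbsent only hits the "database" on a cache miss
        return byUserId.computeIfAbsent(userId, id -> loadFromDb(id));
    }

    // Stand-in for a real (expensive) database query.
    private String loadFromDb(int userId) {
        return "profile-" + userId;
    }

    public static void main(String[] args) {
        UserDataCache cache = new UserDataCache();
        System.out.println(cache.profileFor(42)); // loaded once, then cached
    }
}
```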
Actually, Run Screaming From Dynamo (Score:1)
- Improve your db
- Cache the living heck out of everything
- Buy more memory
- Buy more machines
- Etc.
And guess what, Dynamo doesn't help you do any of those.
Marketing != Truth
If you want to write server-side Java (not a bad choice), go with Apache's JServ for absolute sure. It is delightful.
Re:Some Tips (Score:1)
--hunter
Open Source Application Server (Score:1)
Velma (Score:1)
It was originally just called "the Filter," but on its 3rd rewrite in 1998 was rechristened "Velma" (yes, from Scooby Doo). Info on it is available at http://velma.ixl.com/ [ixl.com].
The first version was Perl-with-CGI, the second C/C++-with-CGI. The third, Velma, is C/C++ with a persistent connection, like most professional app servers.
Though it was originally written for Solaris, I know of at least one port to Linux.
The idea when iXL decided not to support it was to release it Open Source. I don't know if that happened officially, but certainly unofficially it's available. The developers who made it have all left the company, so there's probably no one officially "in charge" of it or of dealing with other open-source contributors, but if someone offered to take it off their hands, I'm sure they'd listen.
It really is good technology, I wrote custom dynamic web sites in it for almost a year.
Please email me (take out the S_P_A_M, sorry) for more info, I think I could hook you up with the right people.
What we use is.... (Score:1)
Java can trip you up (Score:1)
Reasonably big site (Score:1)
Put MySQL's data:
a) in its own partition, or better yet on its own DISK
b) on its own SCSI controller
If MySQL is your bottleneck, run Oracle. Make lots of index tables. Run benchmarks on your queries. With MySQL, avoid table joins if you can; it's much faster without them. Optimize the sh#t out of your tables for query speed.
don't run heavy cpu junk like log analysis on the box that needs to serve dynamic content
Use squid cache or another cache, or even plain old ramdisks to hold your static stuff; remember that I/O is a huge bottleneck. Try to put everything in RAM.
don't run 4 quake3 servers and one unreal tournament server on the box when you are anticipating heavy load
Cache anything you can (did I mention that?). Take Slashdot pages for example -- every time someone posts a comment, you should take a dump of the dynamic page to flat HTML. When the next person requests the page, give 'em the dump if there are no new comments (saves hitting MySQL every time!). Of course, you cached that HTML page in your LOTS OF RAM, right?
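That comment-page trick can be sketched as follows (all names invented; this is not how Slashdot actually does it, it just shows the dump-and-serve idea):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the "dump to flat HTML" trick: after each new comment the
// page is re-rendered once into a flat file, and every later request is
// served from that file without touching the database.
public class PageCache {
    private final Path cacheFile;
    private long lastCommentId = -1;

    public PageCache(Path cacheFile) { this.cacheFile = cacheFile; }

    public String page(long newestCommentId) throws IOException {
        if (newestCommentId != lastCommentId) {
            // a comment was posted since the last dump: re-render once
            Files.write(cacheFile, render(newestCommentId).getBytes());
            lastCommentId = newestCommentId;
        }
        return new String(Files.readAllBytes(cacheFile));
    }

    // Stand-in for the expensive dynamic render that hits the database.
    private String render(long newestCommentId) {
        return "<html>comments up to " + newestCommentId + "</html>";
    }

    public static void main(String[] args) throws IOException {
        PageCache cache =
            new PageCache(Files.createTempFile("page", ".html"));
        cache.page(1);                     // renders and dumps once
        System.out.println(cache.page(1)); // served from the flat file
    }
}
```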
Imagine you need to serve a lot of file download requests. Apache has a built-in MaxClients cap of 256, but you can modify that in the source. A dual P3/600 + 1GB RAM can easily saturate a T3 with static content.
More stuff: don't open lots of filehandles if you don't have to. Optimize out any calls that spawn a new shell (don't use $var = `pwd`; in Perl, for example; use built-in functions that don't require a new shell). Modify your Linux kernel to allow more open file descriptors and max user processes. Nuke any unwanted ulimit directives in your start-up scripts.
Remove daemons you don't need on the box. Don't run anything you don't absolutely need running.
Run more than one instance of Apache- one compiled with mod_perl or mod_php, another just flat. This saves some of your LOTS OF RAM by using the cache only in the daemons that need it. You can even combine multiple daemons per ip/domain, but use squid to make it look like one.
Did I mention to get LOTS OF RAM?
Re:SSL Monkeywrench (Score:1)
one name - GREENSPUN (Score:1)
don't assume Amazon uses BEA for the retail site (Score:1)
transaction middleware was found necessary for running the frontend "retail" website (all the
Design ... (Score:1)
First Design, second design, third design and then you can start your implementation...
A bad design (or no design at all) is the best way to generate real headaches for you...
I am quite stunned that all the comments mention really specialized stuff (PCGI, mod_perl) but do not seem to care about the design of the thing:
You can really obviously tell this person to move to a three tier architecture, for example...
Even before saying the least thing about the technology used... My
Re:Persistent CGI (Score:1)
ASP has the better market share, so you'll be able to find more resources (consultants, books, courses). WO has the better tools. It's a Betamax-VHS problem.
ASP is kind of a markup language that you mix with HTML, then pump through a special Web server (typically MS's IIS, but you can buy add-ons to make other Web servers work with it). The special Web server pre-processes the ASP directives, fetching data from databases if necessary and calling external components to do your work. Then the result gets rendered in plain HTML and sent out to the client (Web browser). You need additional products to handle load balancing and failover.
WO is an application server, meaning you write and compile an application, which the users interact with via their Web browsers. It automatically handles load balancing, fail over, and all kinds of good stuff. The developer tools are very complete, and they include a simple Web page editor, a data modeling package, and lots of nice features. WO is pretty expensive, and they charge per CPU that your application is deployed on. You can use Mac OS X Server (BSD Unix based) or Windows NT for deployment.
Both products connect to a range of data sources via ODBC. ASP requires you to embed the SQL in your
Your choice (ASP, WO, or other) depends on your exact application. I'm not recommending either product, just sharing what (I think) I know.
Re:Misinformation on WebObjects (Score:1)
Plenty.
>There are tons of big name sites that use WebObjects.
True, but there are many more that use ASP. At the bookstore, there are dozens of ASP books, and exactly zero WebObjects books. These are the perceptions that my PHB is using to resist WO and bring in ASP.
If anyone has suggestions as to how I can change his mind, I'd love to hear them.
Re:Here's what we're doing.... (Score:1)
Well, they are quite denormalised, yes!
>> Are you perhaps saying that you only expose some specially created denormalized tables for reporting purposes?
No, the whole app architecture really is built around SQL queries with no joins. This makes our file-caching of db queries more efficient -- you end up caching a smaller number of files, as opposed to the many more files which would essentially contain the same data as one another if you were caching the results of queries involving joins. (That doesn't read very well - I hope you understand what I'm trying to say!)
At the app level, we do do things that are functionally identical to an SQL join. We do all of our db access through a layer of code that makes the caching invisible, and this layer also handles the 'joins'.
HTH,
Jim
Re:Big Secrets Given Away... (Score:1)
Re:Read O'Reilly's "Web Performance Tuning" (Score:1)
Re:For "very high volume", no interpreted code liv (Score:1)
Plus, I dispute your premise; do you have some stats on how many hits/day a mod_perl-driven site can take? On what hardware? How complex a script? Where are the bottlenecks?
It seems to me that it's very likely the bottlenecks are disk I/O or database access, in which case the language used doesn't matter a damn.
Re:JServ and Apache (Score:1)
This could get really nasty for an application that reads the DB, and then writes something based on what it just read.
Re:Content management (Score:1)
Re:Persistent CGI (Score:1)
have you checked out php4 and zend? (Score:1)
I hope this helps.
john
Misinformation on WebObjects (Score:1)
How much research did you do exactly?
You can use Mac OS X Server (BSD Unix based) or Windows NT for deployment.
Or Solaris or HP-UX. I believe Solaris is the most common deployment platform for the big guys.
- Scott
------
Scott Stevenson
Servlet problems specifically (Score:2)
1) Use Weblogic or even better iPlanet. These products have much faster servlet engines than you can get elsewhere.
2) Use squid as a proxy IN FRONT of the Weblogic back end. This will significantly improve your performance, at the cost of somewhat reducing the dynamic nature of your site.
3) Switch from using servlets.
Re:just make it static (Score:2)
The acceleration gain is greater than any other solution mentioned.
Also, people sometimes think 'batch' == 'old', probably because batch makes people think of MS-DOS batch files. You can actually get quite sophisticated with batch jobs - I've designed batch daemons that watch the content: when enough of it is new, they kick off the batch job, so if a major update to the underlying db occurs, the content will reflect it quickly.
It's an option that you should consider fully. Don't just write it off as old or clumsy or not cool.
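The content-watching daemon described above might look roughly like this (threshold and all names invented; a real daemon would run pollOnce in a sleep loop and count new rows in the db):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Sketch of a batch daemon: poll a counter of new content, and when
// enough has accumulated, kick off the batch regeneration job.
public class BatchDaemon {
    private final IntSupplier pendingCount; // e.g. "new rows since last dump"
    private final Runnable batchJob;        // regenerates the static pages
    private final int threshold;

    public BatchDaemon(IntSupplier pendingCount, Runnable batchJob,
                       int threshold) {
        this.pendingCount = pendingCount;
        this.batchJob = batchJob;
        this.threshold = threshold;
    }

    // One poll: run the batch job only when enough content is new.
    public boolean pollOnce() {
        if (pendingCount.getAsInt() >= threshold) {
            batchJob.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger pending = new AtomicInteger(0);
        BatchDaemon daemon = new BatchDaemon(
            pending::get,
            () -> pending.set(0), // the "batch job" just resets the counter
            10);
        pending.set(12);
        daemon.pollOnce(); // threshold crossed: job runs, counter resets
    }
}
```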
avoid monolithic components (Score:2)
Avoid monolithic components at all costs. By this I mean large tables in your database, large columns in your tables, and large, multi-function CGI programs. This also means that everything should do one thing, and one thing only.
Your database design needs to be normalized, and your selects optimized. Understand indexes. If you don't understand the concepts of normalization and database optimization, buy a book. Read it. Grok it. Also, examine the algorithms employed by the CGIs. If the algorithms are less than optimal, rewrite them from scratch. Rule of thumb: the source of software speed is ~90% design, ~10% implementation. If you use a fundamentally crappy algorithm or database design, a better implementation won't help you.
Another desirable aspect of a high volume web site is many servers. For example, your server's LAN should probably include: a web server for dynamic pages, a web server for static content (this can include static content masquerading as dynamic content, i.e. a caching server, or images), a secure server (if you need it), a database server, and a log server. That last one is a bit unusual, and underrated. The best thing to do is log as much activity as you can; this allows you to identify problem areas. A log server doesn't have to be high powered, but it allows you to correlate activity much better. Customize your hardware to its task. RAID is your friend. Swapping is the devil. The processor isn't always the bottleneck. Keep this kind of stuff in mind, and above all, keep the UNIX philosophy in mind.
Where I work, we have just a basic shopping cart system, but since everything is a small component, it is easy to upgrade and extend. We have something like 25 tables in our database, and none with more than 9 columns.
Are you using MySQL? If so, a quick hint: use INSERT DELAYED instead of INSERT, and EXPLAIN SELECTs. If you have to manually join tables, that's a good thing.
Look at FreshMeat.net (Score:2)
Multiple MySQL server -- thanks (Score:2)
Persistent CGI (Score:2)
Re:Comments on the comment of the informative post (Score:2)
Spending wisely is another bit of good advice, but one big server backed by redundant smaller servers can be a very good option. It has worked for me in the past. The fellow I am replying to is saying don't blow the whole wad on a couple of huge servers; that advice can be misunderstood. A big server can give a lot of stability if configured correctly.
Indexed substring searches (Score:2)
select id from inventory where description like '%foo%';
Indexing doesn't help with this at all since it has to search through the entire description field - 'foo' could be in the middle of the field as much as the beginning.
The solution is to create another table consisting of all the words in the applicable fields and a pointer to the record's location. Like this:
create table word (
    word varchar(20),
    id int
);
create index word_idx on word (word); -- index name is arbitrary
where id is the record ID of the record with the word in it. So a description field like this:
HEWLETT PACKARD LASERJET 4
would become four records, with word being 'hewlett', 'packard', 'laserjet' and '4'. Then you can build queries that search for occurrences of those words in the word table, which is - of course - indexed.
I was able to decrease query time from
I'm sure that was what the original poster was referring to. If you do it, remember to exclude common words like 'and' or 'for'.
Hope that helps.
D
----
Some Tips (Score:2)
www.suckage.com -> Java servlets on Linux with mySQL. It works pretty well. On Linux, use the IBM JDK 1.1.8. It's the only JDK worth anything in terms of performance... Also, it uses native threads, so SMP will help ya. Honestly, Java on Linux is still not as fast as on Solaris or Windows NT...
www.fatfreeradio.com -> PHP and mySQL. This site is also running pretty well. Cool thing about PHP is it's very quick to develop. Bad thing is it's hard to stick to the MVC paradigm...
--hunter
Observations from www.blink.com. (Score:2)
I'm one of the developers working on www.blink.com [blink.com] (we do a web-based bookmarking system), and the system's gotten large enough (thousands of concurrent sessions at any particular time, and many gigs of stuff in the database--sorry I can't be more specific) that I think I can comment on the performance concerns.
First of all, I think java servlets are easily ready for prime time for large, complex sites, much more so than other technologies I've seen. Blink has a huge number of features (even I don't know them all :-)) and there's no way something like this would have worked in perl or php3. We're not at 100K lines of code yet, but we're certainly headed in that direction, and a less rigid language would probably have left us with an unmaintainable system at this point.
Speed hasn't been a problem, at least on the web server and java side. Apache runs on its own machine, and that's hardly ticking over. We have to make sure we've got enough java servers behind it, but that's not a big problem. We use, for the most part, what I regard as fairly low-end PC-type boxes running Linux, not 8-CPU Xeons or E4500s or anything like that.
The database is what kills you, especially when you're highly transactional, as we are. To be able to do things like tell you if there's new (to you) content on a page, we need to know when you last clicked on it, which means we need to update the time last visited every time you click on one of your links. So our write load makes it very difficult to use multiple servers.
I strongly recommend getting hold of a guy who knows SQL very well when you put together your site. Doing schema changes later rather than earlier is more painful (as I know from experience) and the right schema can make you many, many times faster than the wrong one (as I also know from experience :-)). And at this point, if you're doing fairly sophisticated stuff, I'd recommend a good transactional server over something like MySQL. Surprisingly, MS SQL server is a great product, and fairly cheap compared to Oracle (though it comes along with the usual miseries of anything running on Windows).
Delivering real applications over the web is pretty difficult. It's really, really important to think about where you keep state, and where you can avoid keeping state. This can make a noticeable difference in efficiency, too.
In the end, there's no substitute for experience; there's a lot to learn in this field, and it's quite different from traditional applications programming. So get it wherever you can, whether it be by writing a prototype or finding someone who's done it before and describing your project over a couple of beers.
cjs
suggestions and links (Score:2)
The best way of dealing with complexity is to simplify and limit features to the essentials. JavaScript, graphics, fancy layout, etc. can all present a bottomless hole for time and bandwidth.
Another thing to realize is that the relational database you are using is probably one of your biggest bottlenecks. Relational databases are slow. MySQL is actually one of the faster ones, but that's because it doesn't make a lot of the guarantees or provide a lot of the features that a "real" relational database provides. You can get a lot of performance from your relational database by tuning it, but ultimately, the architecture and functionality itself present a limit. If speed is of the essence, consider using dbm, plain files, or memory mapped files (Apache has several "databases" internally, and that's its approach.)
For a single person project, Java is probably not the best implementation language. It really shines for multiprogrammer projects. You may find that Python, PHP, or Ruby are better choices. Perl and Tcl are also widely used, but they are also the oldest of the scripting languages and have a lot of rough edges and clunkiness. The performance of all of them is excellent when used as server plug-ins. Perl and PHP have by far the largest libraries and toolsets.
Some of the packages I haven't used but that look interesting are the following. Enhydra [enhydra.org] takes a much better approach to dynamic HTML than most other packages, I think. Erlang [erlang.org] / Eddie [eddieware.org] address scalability, reliability, and clustering in a really clean way (but you have to learn a new language). Zope [zope.org] I think has a lot of good ideas, but I'm not sure I'd use it for a large server.
Most load balancing solutions use some kind of network routing hacks. That's efficient, but can be a pain to configure. However, there is now a load balancing module for Apache that works as a proxy; that's probably also worth looking into.
The Coda [cmu.edu] file system is a free, next-generation AFS; while I'm not sure Coda is mature enough yet to be used in heavy-duty applications, systems like it help with scalability by replicating files automatically and have been used on some really large web sites.
I would stay away from commercial "application server platforms". They often are mostly repackaged open source software, and they are expensive and complex, and while they claim to be general, they are also usually created with fairly specific commercial applications in mind ("shopping cart", etc.). I think those kinds of packages are worthwhile for corporate developers that work with large teams of programmers and need consistency, documentation, training, and support.
So, to recap, this kind of stuff is still a lot of work, and it seems like you are actually ahead. But there are a lot of proven open source tools available that have been used on really big projects. There are also a lot of open source developments coming along that may make this kind of project easier.
Huh? (Score:2)
You're saying WO is crap, but the only reason you give is the lack of developers. I'm confused. There aren't as many developers out there; most of the WO developers I know work on in-house projects. But your average WO consultant is going to be a lot better than your average ASP consultant. I could probably walk down the street and find an "ASP Consultant", but that doesn't make me feel good about the technology...
Here's a shocker: WO isn't easy! If that scares you, go run to MS with their "Hey, anyone can do it!" products and increase the number of worthless MCSEs and ASP developers out there (it's so easy, anyone can do it poorly!). If you've never written WO apps before, you're damn right it's going to take work, and you actually have to learn. Imagine that. If you sit down two people with no ASP/WO experience and ask them to write an app, the ASP person will certainly succeed faster. If you sit down an experienced ASP developer and an experienced WO developer and ask them to write an app, the time difference would be minimal, and the WO developer would also have a completely separated interface to ship off to the HTML designers and a database model that can be reused in other applications (not to mention better performance in a much easier to maintain and modify package).
FYI & FWIW: MySQL is optimizing for speed (Score:2)
FWIW and FYI, MySQL optimizes for speed by keeping things simple. They don't support a lot of SQL features like transactions, by design. This lets them get tremendously good performance for some things -- for example, Slashdot, which is mostly DB reads with occasional DB writes. But I wouldn't run an e-commerce site with it.
check out Dynamo from Art Technology Group (Score:2)
If you are looking to build a high volume dynamic site you need to look at ATG [atg.com]. The Dynamo app server is widely regarded as great. Sony's The Station [sony.com] is one of the very high volume dynamic sites [atg.com] that use Dynamo.
Re:Interbase RDBMS (Score:2)
Interbase v6 isn't out yet (mid-year target date) - and Interbase 5.6 isn't open source.
You can download & use Interbase v4 for Linux for free.
It's a great database - ANSI 92 SQL, transactions, row locking - everything you'd want.
It supports big (multi-terabyte) databases split across multiple DB files, and yet has a small (10MB or so) footprint.
It runs on lots of platforms (Linux, Windows, assorted Unices, Novell, etc.).
The licence will be the MPL, and everything but the replication engine and ODBC driver (both from third parties) will be open sourced.
Applications Design... (Score:2)
Scott Severtson
Applications Developer
Here's what we're doing.... (Score:2)
The system we are deploying on is capable of sustaining a delivery of more than 230 dynamic pages per second (which we have benchmarked using RSW's test suite) - around 20 million dynamic pages in a 24 hour period.
We are using Compaq PCs and each is fitted with twin PIII @ 500MHz, 1GB RAM, 9Gb SCSI local drive (mirrored) and they are running WinNT 4. All networking is 100Mb, doubled up cards and cables (for fault tolerance). Our hosting ISP (big and in the UK) is responsible for maintenance, support, (main/primary) backup and hardware configuration.
The system is isolated from anything but port 80 requests by a pair of Cisco firewalls, and all the kit plugs into a Cisco Catalyst 5505 switch. Behind the wall sits a BigIP box (for fault tolerance and load balancing - we've found BigIP kit to be very good, BTW) which passes HTTP requests to a pair of identical web/application servers (with the above config) that are both running IIS 4.0.
The backend is connected to the front via a couple of hubs, and consists of a grand total of four boxes (again with the above config) running M$-SQL 7.0. Using Compaq's own clustering solution, these are linked up as two buddy pairs for 24x7 fail-over. Each pair of db servers appears as one virtual SQL server and each pair of db servers shares a 186Gb RAID5 array.
We are using a proprietary scripting engine, which interfaces as an ISAPI filter and currently only runs under NT (hence the choice of M$ technology throughout). This is similar in architecture/functionality to PHP or ASP - although it outperforms both.
Our application architecture allows us to split our db across two db servers (careful db schema design allowing load distribution). Furthermore all of our SQL queries are very atomic (no joins!) and the query results are then turned into script files and saved into a (db query) file cache on the RAID.
Before we implemented caching of db queries as files, the db access was the bottleneck with throughput. Since implementing the query cache we can run at full tilt with our db servers ticking over at well under 10% load. We are using pooled persistent connections to our dbs also.
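The query-cache pattern described above can be sketched roughly as follows. The function names, the MD5-based file naming, and the plain-text result format are my own illustration, not the poster's actual engine:

```python
import hashlib
import os
import tempfile

# Stand-in for the shared RAID cache directory the post describes.
CACHE_DIR = tempfile.mkdtemp()

def cache_path(sql):
    # One cache file per distinct query text.
    digest = hashlib.md5(sql.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, digest + ".cache")

def cached_query(sql, run_query):
    # run_query is the real (expensive) database call; it only runs
    # on a cache miss, which is why the db servers can tick over at
    # under 10% load once the cache is warm.
    path = cache_path(sql)
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    result = run_query(sql)
    with open(path, "w", encoding="utf-8") as f:
        f.write(result)
    return result

def invalidate(sql):
    # Called when a db row changes, so the next read regenerates
    # the file (the "rewritten on change" step described below).
    path = cache_path(sql)
    if os.path.exists(path):
        os.remove(path)
```

The interesting property is that write load on the cache scales with the db change rate, not the page view rate - which suits a read-mostly site.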
Our script (embedded in HTML) is compiled into byte code on first access and written out as a file (similar to the stuff Zend are gonna do / are doing). This improves execution speed greatly after the first access.
All files that are accessed by the script engine (which doesn't include the images/etc) are cached into memory. All the application code is stored on the RAID and is accessible to all web servers in the cluster - the performance hit due to accessing a file over the network is only incurred on first read (due to the local memory caches in the web servers' script engines).
When a db row is changed, the script file in the cache is rewritten and all web servers are told to refresh the file cached in memory. (Mostly, our app reads from the db - writing to the db is quite infrequent compared to reading, hence this works very well for us).
Searching the site is all done through a table of keywords (again this table is file cached and then memory cached) - so we do not incur a performance hit from free-text searching of our pages.
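A keyword-table search like the one described can be sketched as a precomputed inverted index. The in-memory structure below is a generic illustration (the post keeps its table file-cached and memory-cached; the data here is made up):

```python
from collections import defaultdict

def build_keyword_index(pages):
    # pages: {page_id: text}. Precompute keyword -> page ids once,
    # so a search is a table lookup rather than a free-text scan
    # of every page (the "table of keywords" idea above).
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index
```

The index is rebuilt (or incrementally updated) when content changes, so query-time cost is independent of page size.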
The system is fully scalable simply through the addition of extra web/application servers should the need arise (which I'm sure it will, one day!). As previously stated, the actual real world load on our db servers is negligible so we should be ok there for a while. If necessary, we can also move from 100Mb Cat5 up to optic-fibre if networking becomes the bottleneck.
Just my 2 pence worth - HTH.
Jim
A good language is a start... (Score:2)
Harder to program in, but worth the trouble if you're building a scalable site. Just a personal opinion... I would avoid a real SQL database for simple jobs. While plugging in something like MySQL may make programming a whole lot easier, you pay for it in performance.
some tips (Score:2)
1) Free text search on your database is going to slow down your application. If you can, try searching on a summary or title, or use a third-party search engine like Verity
2) Have you indexed the tables?
3) RAM use is very intensive on dynamic sites - try for 1GB to start with
4) 200 req a second is pretty respectable, but page requests, and more importantly dynamic page requests, are what you want to measure.
5) have a look at http://hotwired.lycos.com/webmonkey/design/site.html for some tips about designing and redesigning sites
6) don't make everything dynamic just because you can
adios
Re:Dynamic High Traffic Site? Slashdot! (Score:2)
Re:My two approaches... One good, one bad (?) (Score:2)
Obviously, but the main part of the point was that it is 99% idle during the peak loads. I doubt ASP/VBScript could implement starship traders [starshiptraders.com], run 17 separate 64,000 sector games with hundreds of players each, and still be 99% idle on a Celeron 366.
Re:Tips (Score:2)
What happens if I'm running on an old 486? (I know people who are.) What if I have Javascript turned off, or filtered by a firewall? What if I have a browser with a buggy Javascript implementation?
Unless you are in a highly controlled environment where you know the capabilities and configuration of every client, you should never, never, never do anything critical to your site on the client side.
Re:this is the problem with giving out F's.... (Score:2)
I've been amusing myself lately by creating a discussion system at my unreasonable.org [unreasonable.org] website. The interesting part wasn't so much coding it up (in PHP with PostgreSQL and a little Perl), but in figuring out the process - who's doing what to whom, and who might be doing other whats to someone else as I expand the system. I'll probably hack on it for a few months and then do a rewrite to clean up - but the data model shouldn't have to change.
Re:FastCGI is the unsung hero (Score:2)
Sure, that's the standard solution to the problem of oversized mod_perl-ified processes eating up memory, but it comes at a high price in mod_perl's performance. I've found that mod_perl with a proxy in front of it can be about 20% as fast (in terms of requests/second) as direct access to mod_perl.
"Naked" mod_perl, i.e. mod_perl without a proxy in front of it, is one of the fastest solutions around; but proxied mod_perl is so much slower that you might as well go to something like FastCGI and save all the administration effort that mod_perl requires.
this is the problem with giving out F's.... (Score:2)
now I'm sad... (Score:2)
Not as sad as I was on that Boston Marathon day (when we got 30X the traffic that the customer told us to expect). But is that a failure for our toolkit? No. We had the equivalent of eight 400MHz CPUs working on the Marathon server (two Sun E450s). If you can serve what seemed like half the US population with two Sun E450s using PHP, that's great. But so what?
There are people using ACS with a fat-ish db server plus a rack of small Web servers plus a load balancer. They don't seem to have any trouble scaling to arbitrary size.
"On really expensive boxes"? Photo.net handles 1 million hits/day, goes to the database on every page load, and runs on a computer with 4 180 MHz processors! There are 10-year-olds whose Quake machines would crush my server like a bug.
I am so sick of being the poster child for AOLserver (5 MB of code, smaller than the Oracle client library) and Tcl (which takes about two hours to learn). They are both open source, they don't get in the way, but if you want to use our toolkit (which is also free and open-source), you can now do it with Apache or with all-Java inside Oracle. Ben Adida ported it to PostgreSQL so you can be 100% open-source.
Anyway, if you want to start from scratch in MySQL and PHP and try to replicate the effort of ArsDigita's 80 full-time developers, be my guest. But don't confuse yourself into thinking that these are "solutions" (as you put it). They are tools.
SSL Monkeywrench (Score:2)
Of course, I posted this late, so no one will probably read this question...
Re:Dynamic High Traffic Site? Slashdot! (Score:2)
Here [server51.net], at Server51.net, is some info on SLASH.
-and-
This [slashcode.com] is the place you really want to go: Slashcode.com. Here you'll find bug reports, feature requests, and the latest source (and a few trolls... just like home).
Also be sure to look through the mailing list archives... (at Server51, I believe)...
Do your homework... (Score:2)
Of course, don't search just slashdot for articles like this. Also search the archives of papers that have been presented at various USENIX [usenix.org]/SAGE [usenix.org] conferences (in particular, LISA [usenix.org]), and other USENIX publications, starting at http://www.usenix.org/publications/publications.html [usenix.org].
You will also want to use index sites such as Yahoo! [yahoo.com] and Excite [excite.com], as well as search engines like Google [google.com], Altavista [altavista.com], and Hotbot [hotbot.com], not to mention community directory projects such as dmoz Open Directory [dmoz.org].
That's just a sampling of the sorts of research that you should START with. Of course, to do this right, you'll need to do much, much more.
--
Brad Knowles
Datamodel (Score:2)
Also, a lot of functionality can be implemented at a lot of different levels. If you're using a powerful and fast database like Oracle, then use its features to do as much as possible.
And finally: optimize your queries. It matters a great deal whether a single page in your site is constructed from 10 or 20 different queries.
Just my 2 cents...
Design first, then choose the technology (Score:2)
If you let your vision get blurred by focussing on one or 2 technologies before you've defined and designed the system, you'll never get to your goal.
One global constraint is always the budget and what you can buy for it. Another is the lack of skills in certain fields of technology: some people on your team may be great Perl programmers but lousy at COM component building for middle-tier software.
I know this sounds vague, but if you first do steps 1 and 2 well, you can then ask which technology is the best choice for each tier/functionality group, given your team's constraints of money and knowledge/skills.
Any platform and tool mentioned in this thread will be, in a certain way, able to give what you asked for. Even Windows2000 with built-in middleware (COM+), ASP, and an additional SQLServer or the free MSDE (Microsoft Database Engine, an SQLServer lookalike without stored procs). It's not possible, however, to point NOW to the right SET of tools, technologies and languages you'll need to create the high speed web application setup you'll need.
Please keep that in mind.
--
One suggestion, use Roxen and LinuxVirtualServer (Score:2)
Also, using the Roxen WebServer [roxen.com] provides great performance for dynamic content. Roxen has built-in database features that perform very well; for instance, Roxen will keep the database connection open for fast access. Roxen will also allow you to write scripts that run very quickly because they get read into the Roxen process memory the first time they execute, and they never have to be loaded again (no forking). Or, you can get away from CGI altogether and write your own Roxen modules that run internally in the server and add functionality. Roxen also has its own built-in markup language called RXML that includes tags for doing database queries, creating images on the fly, and many more. Bottom line, Roxen gives you the ability to do anything you want with your web server, and still gives you the best performance possible.
For example, we have developed smtp and pop servers that run internally in the web servers. Our design allows us to load balance all of our services. The mail server also uses mysql to manage all the messages, email accounts, and aliases. None of the email accounts actually have system user accounts on the servers (higher security and easier to manage). Also, we get better performance out of the mail server, because instead of writing to files for mailboxes, it writes to a database. This is much easier to manage in an environment such as the hosting industry where we must maintain many domains and users.
I was an Apache user for many years, until I discovered Roxen in mid 1998. And I would never go back. I have experience running thousands of web sites, using this concept, successfully.
I also know that Real Networks [real.com] uses Roxen as their web server and development environment. Maybe somebody from there can post a comment about their experiences as well.
For "very high volume", no interpreted code live (Score:2)
You may say that 50 million hits is an unrealistically high number to be discussing, but considering the rate of growth of web traffic, it won't just be the top ten sites that get this much traffic - many sites can expect it.
Comments on this informative post... (Score:2)
Agreed. Almost none of these tools are sufficiently mature at this point.
Use well known, well respected, and evolved tools.
On this point, I would caution against using MySQL - while excellent for smaller jobs, I have found it unable to hold up to high volume concurrent use as well as Oracle. When it comes to an RDBMS, spend the money and get it over with. Free solutions have not closed the gap quite yet.
Cache! Cache whatever you can.
Make use of Akamai or another caching network where useful.
Be ready to spend. Running a fast, large hits web site is expensive.
And spend wisely. Don't fill up your rackspace with full-height servers when 1u servers will do. Don't tie too much of your site to one piece of hardware - you get better tolerance and granularity from many small servers as opposed to a few big servers. The colocations I visit tend to be a textbook case study in how to waste rack space.
Re:For "very high volume", no interpreted code liv (Score:2)
100 million hits???? No. I mistakenly put Perl on one of our most highly visited pages and watched it get crushed like a grape... and this was only a random redirect generator.
Re:For "very high volume", no interpreted code liv (Score:2)
Object-Oriented, High Performance Web App. Design (Score:2)
Really, I do this stuff for a living (primarily in server-side Java and these suggestions may show that), and there is no easy way I know to get great maintainability and high performance. Here are a few things to consider, though (there may be some information that was in previous posts):
Well, I'll stop there. There's plenty more to cover, but there are also plenty of books out there. Basically, in my experience, the first rule of maintainability is keeping the problem domain separate, and the first rule in performance is caching.
Re:Persistant CGI (Score:2)
PHP, freely available from php.net [php.net], is a hypertext preprocessor (hence the name). It is similar to ASP in that you splice snippets of code into HTML, but that's where the similarities stop. I used to do full-time ASP backend development for the web design company I work for (no, not digital11, for those of you who like to criticize my page). I was looking for something that was server compatible for pretty much everything, because our clients' hosting methods vary between OSes. Once I found PHP, I never looked back. It's faster, it's easier to learn, it's easier to use, it's faster, it has many more features, it's open source, it's faster. Now I haven't heard about WebObjects (oh, did I mention that PHP is faster?), but after using PHP, I'm sure it would have to do everything but fsck your wife for you to be better than PHP.
d11
Scalability (Score:2)
We use a Cisco Local Director to load balance as many Linux servers as we like. It is very easy to insert and remove servers in a configuration like ours. Most large companies (such as Amazon or Yahoo!) will use multiple systems like this.
As for adding on new features, I would say that an iterative process is best. Once again, most large sites will have hundreds of developers working on different features which they will roll out one by one.
Dynamic High Traffic Site? Slashdot! (Score:3)
Resin (Score:3)
Also, what JVM are you using? Definitely try the newest Sun (with Inprise JIT, must be downloaded separately) for a single-processor system and IBM's jdk for an SMP box.
For App Servers, you should check out Web Sphere from IBM (not too expensive, relatively speaking). Also, TowerJ can DRAMATICALLY speed up Linux server-side java and improve scalability (towerj.com, I believe), but it ain't cheap.
Good luck!
--JRZ
Headers and Tiers (Score:3)
serving images.
Make sure that you use HTTP headers to your advantage. By setting these correctly, you can benefit from caching on the client end, as well as on a front end server (Squid, mod_proxy, etc.).
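As a small illustration of the header advice (the max-age value and helper name are examples, not prescriptions), a handler might emit caching headers like this:

```python
import time
from email.utils import formatdate

def cache_headers(max_age_seconds):
    # Headers that let browsers and front-end caches (Squid,
    # mod_proxy, etc.) reuse a response instead of re-requesting it.
    return {
        "Cache-Control": "public, max-age=%d" % max_age_seconds,
        "Expires": formatdate(time.time() + max_age_seconds, usegmt=True),
    }
```

Static and rarely-changing dynamic content gets long lifetimes; truly per-request pages get none.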
Find a tool that lets you separate code and HTML. Perl modules work great in conjunction with mod_perl and one of the embedded Perl modules (HTML::Mason, HTML::Embperl, ePerl, etc.). However, many tools allow you to do similar things, e.g. Java Server Pages. The key is to be consistent and keep the logic separated from the layout and display.
HTH, Aaron
My two approaches... One good, one bad (?) (Score:3)
My other dynamic site project, SiteReview.org (user-posted website reviews), is written in PHP, serves pages with Apache, and has a MySQL back end data store.
Currently that site is a work in progress with very small volume, and I therefore have no evidence yet that I did anything right.
My site (Score:3)
I have worked to minimize the database calls and keep the pages as small as I can. By minimizing the database calls I give myself more room to grow before I start needing more hardware, and by keeping the pages small I make the site more slow-connection friendly and make better use of my bandwidth. I think that if you are waiting for something to download it should be what you want (content), not fluff.
I have added features slowly as I have gotten them working. Comments, user logins, syndication pages, etc. I think that if you get a good idea get it online and then work to make it better.
I think you should always keep in mind that anything cool may soon be much bigger, so write a site that is cool when ten people use it and is still cool (and fast) when ten thousand people (or more) are using it.
I would also recommend setting things up so that your content can be syndicated and shared on other sites.
RootPrompt.org's headlines for example can be had in netscape's rss format at:
http://rootprompt.org/rss/ [rootprompt.org]
and in text format at:
http://rootprompt.org/rss/text.php3 [rootprompt.org]
Doing this will allow you to share the content that you create with the world without requiring a lot of machine on your end.
Noel
RootPrompt.org -- Nothing but Unix [rootprompt.org]
JServ and Apache (Score:3)
If you want to cluster multiple SQL servers, then you can have multiple read only mysql servers and one write mysql server which updates the other mysql servers from the update log.
Your DB code would have to be aware of the read and write DB servers.
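A sketch of what that read/write-aware DB code might look like; dispatching on the SQL verb and the server names here are my own illustration, not a prescribed design:

```python
import random

class SplitDB:
    # Route writes to the master and spread reads across the
    # read-only replicas, as suggested above. Real code might tag
    # queries explicitly rather than parsing the SQL verb.
    def __init__(self, write_server, read_servers):
        self.write_server = write_server
        self.read_servers = read_servers

    def server_for(self, sql):
        verb = sql.strip().split(None, 1)[0].upper()
        if verb in ("INSERT", "UPDATE", "DELETE", "REPLACE"):
            return self.write_server
        return random.choice(self.read_servers)
```

One caveat this model inherits from update-log replication: a read issued right after a write may hit a replica that hasn't applied the update yet.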
As for coping with changes: if it's interface changes, then you need to create your architecture separating application logic from interface logic.
e.g.
Servlet (Java) code handling DB interaction and application-type logic, and an intelligent templating language handling the interface (XSL).
Optimization, Scalability, and AOLServer (Score:3)
Some links on scalability, optimization (Score:3)
http://www.acme.com/software/thttpd/benchmarks.html
http://www.cs.wustl.edu/~jxh/research/
http://photo.net/wtr/thebook/server.html
http://aolserver.com/features/
http://www.aolserver.com/tcl2k/html/index.htm
http://www.linux-ha.org/
http://www.linuxvirtualserver.org/
http://www.citi.umich.edu/projects/citi-netscape/reports/web-opt.html
http://linuxperf.nl.linux.org/
http://www-4.ibm.com/software/developer/library/java2/index.html
http://www.squid-cache.org/
Hopefully this may be of help to you also... after all this research I was very pleased to go with AOLServer. Even though it was not in the web server comparison at the thttpd site (the model was represented there by Zeus and thttpd), AOLServer has many additional features that really sold me.
Re:Tips (Score:3)
In my experience, locking contention is usually due to inappropriate indexing and bad SQL coding. I'm not familiar with MySQL, but if you are having to do funky schema changes like splitting the tables, it sounds like MySQL isn't ready for prime time yet. Dodgy workarounds are no substitute for a quality DB server. My personal preference, Sybase ASE 11.0.3.3 for Linux [sybase.com], is available with a zero-cost license for both production and development deployments. Sybase ASE 11.9.2 has some significant improvements over 11.0 and is zero-cost for development only. (Unless you REALLY need row level locking, 11.0 will probably meet your needs.)
Question - are you splitting between rows or between columns? If you are having to split between rows, the problem is most likely resulting from an inappropriate clustered index. In an insert-intensive database, a bad clustered index will result in a hot spot in the last data and/or index page of the database. The best solution here is to cluster on a surrogate key. Your surrogate key generation algorithm needs to be carefully designed to distribute inserts evenly in the table. If you are doing primarily single-row updates and inserts, you should only be seeing page-level locks. If you are updating records frequently, try and use only fixed-length datatypes (or at least only update the fixed-length fields); this allows in-place updates. You should avoid indexing frequently updated fields, if possible.
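One way to sketch the surrogate-key idea; the post doesn't prescribe an algorithm, so the hash-prefix scheme below is just one common way to spread consecutive inserts across index pages:

```python
import hashlib
import itertools

_counter = itertools.count(1)

def surrogate_key():
    # A monotonically increasing id prefixed with a short hash of
    # itself, so consecutive inserts land on different clustered
    # index pages instead of hot-spotting the last page.
    n = next(_counter)
    prefix = hashlib.md5(str(n).encode("utf-8")).hexdigest()[:4]
    return "%s-%08d" % (prefix, n)
```

The numeric suffix keeps keys unique; the hash prefix controls where they cluster.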
Database design is an art. Ditto for performance tuning. An expert DBA is worth his/her weight in gold. There's a good reason top DBAs command top rates.
"The axiom 'An honest man has nothing to fear from the police'
Tips (Score:4)
Cache your dynamic data
I cannot emphasize enough how important this is. If your site pulls up a page of 50 records, which each pull other database info for each record, create a cache for the whole page. Then create a cache for each record as well. Cache elements as well as entire pages of dynamic data into separate cache tables in the database wherever possible.
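A minimal sketch of a cache table for whole rendered pages, in the spirit of the advice above; the table layout and function names are illustrative, with sqlite standing in for the real database:

```python
import sqlite3
import time

# sqlite stands in for the real database holding the cache table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE page_cache (
    cache_key TEXT PRIMARY KEY,
    body      TEXT,
    built_at  REAL)""")

def get_page(cache_key, build):
    # build() is the expensive render that runs the many underlying
    # queries; it only runs when the cache table has no row.
    row = conn.execute(
        "SELECT body FROM page_cache WHERE cache_key = ?",
        (cache_key,)).fetchone()
    if row:
        return row[0]
    body = build()
    conn.execute("INSERT OR REPLACE INTO page_cache VALUES (?, ?, ?)",
                 (cache_key, body, time.time()))
    return body
```

The built_at column lets you expire stale entries on a timer or invalidate them when underlying records change.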
Split your servers
Separate your database and Apache servers. They should communicate with each other through a 100Mb network switch at the least (make sure you use full duplexing as well).
Split your tables
I know this sounds funny, but the current version of MySQL has a tendency to lock tables when doing writes to the table. This means that one update on a table can halt all other reads on the table. MySQL is very fast, so normally this isn't much of a problem, but when you start getting into a high volume of requests, you're going to start to get bogged down. The solution to this is to split your table into multiple tables, i.e. the table mydata becomes mydata_1, mydata_2, mydata_3, etc, where for example mydata_1 might hold records starting with a-e, mydata_2 is for f-h, etc. This might sound tricky but it will save you a lot of trouble. I can't even count the number of times we stared at the mysqladmin processlist and saw one of our tables constantly locked (stopping all reads and writes) before we came up with this solution.
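The a-e / f-h split in the example can be captured in a small routing function like this; the boundaries are copied from the example above, and everything else is illustrative:

```python
def shard_table(record_key):
    # Map a record to mydata_1 (a-e), mydata_2 (f-h) or mydata_3
    # (everything else), mirroring the example split, so a write
    # lock on one shard no longer blocks reads on the others.
    first = record_key[0].lower()
    if "a" <= first <= "e":
        return "mydata_1"
    if "f" <= first <= "h":
        return "mydata_2"
    return "mydata_3"
```

The catch is that queries spanning ranges now need a UNION across shards, so pick split boundaries that match how you actually query.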
Your MySQL server
Should have a ton of memory plus a fast disk. Preferably a gig or more for memory, use RAID or a fast (10,000 RPM+, 6ms or less seek time) SCSI disk for the data, also keep it separate from your OS disk (i.e. don't have your OS running on the same disk as your MySQL tables). Save the high MHz processors for...
Your Apache servers
We've found that the really processor intensive stuff happens on the Apache servers. So you should keep an eye on the load average, etc. on these servers. If they start to get bogged down, you can just pop in another Apache server and split the load.
The nice thing about this setup is that you can keep adding Apache servers as your processing needs go up. From our own experience MySQL is not very processor intensive but very dependent on memory and disk speed. When using MySQL with Linux, you also have to be careful about file system limitations, handlers etc. Tweaking certain variables should help. Good luck!
Slashdot is terrible example (Score:4)
No, I'm not flaming Slashdot; I love everything else about the site. But its accessibility unfortunately didn't improve with the Andover.net takeover, nor through any of the other changes that have been happening in the last two years.
I'm sure other people's mileage will vary, I'm interested in hearing other people's experience.
----------
FastCGI is the unsung hero (Score:4)
Here are some of the advantages I've seen:
Since FastCGI processes are independent of the server, they are less likely to weigh the server down with a heavy processing load, and buggy FastCGI's are less likely to slow down or crash the server. If a FastCGI is going haywire, the problem can be diagnosed with the usual tools for analyzing the process behavior (like ps, top, Sun's proctool, etc). And FastCGI can be configured to adjust the number of running processes to fit the load.
In contrast, technologies like mod_perl or PHP, which are embedded in the server, place an extra load on the server itself. It increases their memory footprint (especially in the case of mod_perl), which can be very problematic when Apache forks extra servers to handle request spikes -- you run the risk of running out of memory. They can make the web servers start up more slowly, and if one of your programs has gone on the blink, it can adversely impact the servers themselves. And embedded programs are not as easy to debug as independent processes.
In the case of servlets, since they all run as threads within a JVM, then if one of them is buggy or slow, it's not easy to find out which one is causing the problem. Usually you just notice that your JVM has slowed down, deteriorating everything else; then you have to go about finding out which thread is responsible.
None of these problems come up with FastCGI. You can write FastCGI processes in whatever language you like, as long as you honor the protocol. And there are re-usable FCGI interfaces for C, C++, Java, Perl, Python and TCL.
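The core lifecycle difference FastCGI buys you - initialize once, then serve many requests from the same long-lived process - can be sketched conceptually like this. This models the process lifecycle only, not the actual FastCGI wire protocol:

```python
class PersistentWorker:
    # Conceptual model of a FastCGI worker: the expensive setup
    # (db handles, compiled code, etc.) runs once per process,
    # then the same process serves many requests - unlike plain
    # CGI, which pays the setup cost on every single hit.
    def __init__(self):
        self.setup_count = 1   # one-time initialization
        self.served = 0

    def handle(self, request):
        # Long-lived request loop body: no fork, no re-setup.
        self.served += 1
        return "response:" + request
```

Because the worker is a separate OS process, it can be monitored, restarted, and scaled independently of the web server, which is exactly the diagnosability argument above.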
I personally happen to like Perl a lot, and I very much like the idea of mod_perl. Programming to the Apache API with Perl is way cool, and so many Perl programmers fall all over themselves praising it as a panacea. But because of the memory impact on the server, I have found it very difficult to implement mod_perl so that the server is stable and doesn't eat up all my RAM. It can be done, with a lot of effort on the part of the web server administrators, but it's certainly a lot harder than it is with FastCGI.
And for the record, I do recognize the strengths of the various other techniques that I've mentioned as well. They all deserve their status as highly respected technologies for server-side programming, but FastCGI ranks up there with them in quality and deserves more attention than it's been getting.
Some links pertinant to this comment (Score:4)
Read O'Reilly's "Web Performance Tuning" (Score:4)
Most points mentioned here are covered in detail in this book.
photo.net & ArsDigita (Score:5)
Big Secrets Given Away... (Score:5)
Good Luck!