Apache Software

Supporting Tens Of Thousands Of Users With Apache?

embo writes: "The company I work for has recently been approached by an academic organization looking for advice on providing web space for 30,000 - 40,000 users. They are limited by budget, so I'd like to recommend something with Linux and Apache. They are thinking of offering around 50 MB of disk space per user (which at maximum utilization would be ~2 TB of data storage), and no database-driven content (though they want to allow CGI through Perl and Python, for example)." This is a huge undertaking. Can anyone think of solutions better than the ones embo outlines below?

"I can only think of a couple of ways of doing this. One is to have an enormous single fileserver and have a cluster of apache web servers that NFS mount the home directories to serve up web pages. Then the users FTP into the main file server to store their web pages. To me, this seems wildly inefficient, and you have no real redundancy if the main fileserver crashes unless you are using a SAN (which is very expensive), or you have a hot backup that is rsync'ed or something. And I'm thinking that rsyncing up to 2TB of data would be an exercise in futility.

My other thought was to have several back-end file servers with a fixed number of users on each, and then send all HTTP requests through an LDAP-driven front end first, which would then redirect to the machine that user's web page resides on. The big problem then is how to make sure users are FTP'ing into the machine that their account is on. They may also use FrontPage extensions with Apache, and this could complicate things even further.

I know there has got to be a better architecture for this. How do enormous sites like Yahoo and Excite tackle this problem? They have hundreds of thousands of users! Better yet, how could they tackle it with Open Source tools? Would, for example, a TurboLinux cluster help this problem any, or would I still have to replicate the data across every node in the cluster (meaning I'd need up to 2TB of storage for each cluster node)? Then what happens if they decide they want to add another 10,000 users? I can't find pointers to information, or ideas on how to do this, *anywhere*. Can you fellow Slashdotters give me any advice?"

  • The "big guys" you mentioned (yahoo, hotmail, etc) use BSD and apache. I would recommend the same.

    Vague rambling:
    Use multiple NFS (or Coda?) servers on the back end. All of the web servers (N separate machines) would mount "/webspace" or something from each NFS server (probably a soft mount, or via Coda). You need to specify all of the server mounts via Apache "UserDir" directives so it will search for the user's home page correctly (a rough sketch follows at the end of this comment).
    If the NFS servers also cross-mount, then they can be FTP servers as well and be able to access any user's home directory. Then you just list all of the NFS servers in a DNS round robin for "ftp.blah" and the web servers in a round robin for "www.blah", and it should work. :-)
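
    A minimal httpd.conf sketch of the UserDir piece, assuming all the NFS exports end up mounted (or symlinked) under a single /webspace tree; the path is just a placeholder, and this is only a sketch, not a drop-in config:

      # Map http://www.blah/~user/ to /webspace/user/public_html
      # (/webspace is a hypothetical mount point collecting the NFS exports).
      UserDir /webspace/*/public_html

      # Keep the served trees reasonably locked down.
      <Directory /webspace/*/public_html>
          Options Indexes IncludesNoExec SymLinksIfOwnerMatch
          AllowOverride FileInfo AuthConfig Limit
      </Directory>

    If I remember right, newer Apache versions also accept several directories on one UserDir line, which would let each web server search more than one mounted export in turn.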

  • by Zurk ( 37028 )
    I'd use Coda or AFS (I prefer AFS) as the base filesystem. At the uni where I work we serve 125K users, not the 50K you want, but we use AFS as the main filestore on an Origin 2000 (192 CPUs) for compute and an AIX cluster (6 x dual/quad-CPU RS/6000s). In general, I'd recommend you get a couple of dual-CPU Alpha boxen running Tru64 or Linux/Alpha (say two or three at $10K each) with a RAID array for each Alpha box, or just get two Sun E450s with 4 CPUs each and one A3500FC array with two FC cards running Solaris. Either way it's going to cost $70K or so. You can use Apache/Linux x86 boxen on the front end, authenticating via Kerberos and Kerberized FTP to the back-end cluster of Alphas or SPARCs. Around 50 of these machines should be enough; you can get cheap dual-CPU Asus P2B-DS mobos with dual 650MHz PIIIs for $1700 each with HDD, NIC, 512MB of RAM, etc. Look at spending $250K for the whole thing; your company should be able to afford it easily. I recommend Cisco for the switches and 3C905Bs for the Linux boxes' NICs. We spend $2-3 million a year on our infrastructure, but we can afford it.
  • by Anonymous Coward
    If you can hold off a little (October-ish), you should check out VA Linux's NAS (network attached storage) solution. I think they call it 'VANA'. From the presentation I saw, a 2TB storage unit should cost about $125k. Not beautiful, but better than the $250k that someone else was quoting. Beyond that, service is simple, and VA Linux is putting a kickass 3-year service contract behind it. As far as front-end computation to put in front of it, for actually serving pages and FTP access off of, I'm not sure what to recommend, but by offloading all the file I/O onto a NAS, you should be able to drop quite a bit of the computation from the application servers. As a random guess, I'd say 4 dual-P3s running Apache/BSD or Apache/Linux that are load balanced for web serving, and then 1 or 2 of the same for FTP access should do.
  • First of all, I love Linux and I'm not a huge FreeBSD fan. However, FreeBSD with Apache is probably a better choice for such a large site - it's known to hold up under these kinds of heavy loads.

    Also, recommend that they not allow CGI scripting - that would be a NIGHTMARE to support with 30,000 users. Not only would there be a huge number of security holes, imagine the amount of server power it would take.

    Of course, if you have a large amount of money to spend, get an S/390 and give each user a virtual machine running Linux :-) Then security isn't a concern, since the most they can fuck up is their own stuff.
  • They say they have 40,000 users and want to provide 50MB per user.

    Ok, but how much are people *really* going to use? The university I am at has about 20,000 users, and provides each with 50MB of disk space (to be used for everything including webspace). In total about 100GB is used, so the average per person is only 5MB. Since this includes much more than just webspace, I'm guessing that you'd find that 200GB would be more than enough for your users.

    Other notes which might be of interest: my university runs Apache on Solaris, with the filesystem on a separate NFS-mounted box. The web server (which is also the FTP and telnet server) is a four-processor Sun box, IIRC.
  • Not entirely true; Hotmail uses Windows 2000/IIS 5 now. Well, that's unless they hacked Apache and FreeBSD to report Windows? http://www.netcraft.com/whats/?host=www.hotmail.com [netcraft.com]. Oh, and if you want to get technical, Yahoo doesn't necessarily use Apache; Netcraft reports their webserver as unknown, though I guess it could be a modified Apache. http://www.netcraft.com/whats/?host=www.yahoo.com [netcraft.com]. But yeah, it's true that Yahoo does run FreeBSD, and Hotmail did until recently.
  • The first thing to do is look at what's already available. The (admittedly smaller) university where I worked has a Novell network, and the networking gods in Academic Technology managed to connect the NetWare server to a Linux box and serve pages in people's home directories out via Apache. Here's [drew.edu] a description of how.

    If there's something there that your users are familiar with, it'll be easier for them and for your support people.

  • You didn't mention anything about how much traffic you expect. If the traffic is small enough to fit on one machine, you could use several 3ware IDE controller cards with 80GB drives attached to them to get a lot of storage cheap (2TB is roughly 25 such drives).
  • The most straightforward way is to avoid the use of a fileserver entirely. The problem you're having is that you're assuming you're going to use URLs of the form http://www.sitename.edu/path-to-userdir/. If you instead give everyone their own subdomain, this is quite easy.

    Set up 10 Apache/Linux-or-BSD servers with 3,000 users each. Set up a single DNS server that manages the subdomain "users.sitename.edu". Then give each user a subdomain of "username.users.sitename.edu". Map each subdomain to the IP of the appropriate server. You can configure Apache manually, or you can use one of the dynamic boot-time configuration schemes (see the sketch at the end of this comment).

    Of course, this could be a problem if your client has a policy about DNS subdomain allocation. And 50MB * 3,000 users per server is still big... you might want to buy a big disk array and plug all the Linux servers into it.

    For that matter, a rack full of Sun Netra T1s and a Fibre Channel disk array should be cheap enough, and supported by Sun to boot. Netras are a dream to manage.
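
    A rough sketch of the dynamic-configuration route using mod_vhost_alias, so you don't need thousands of <VirtualHost> blocks per box; the module is real, but the paths and domain below are just placeholders:

      # Hypothetical httpd.conf fragment for one of the per-subdomain servers.
      LoadModule vhost_alias_module modules/mod_vhost_alias.so

      # Let the Host: header (username.users.sitename.edu) drive the mapping.
      UseCanonicalName Off

      # %1 is the first dotted part of the hostname, i.e. the username, so
      # http://cmatthew.users.sitename.edu/ serves /home/cmatthew/public_html
      VirtualDocumentRoot /home/%1/public_html

      <Directory /home/*/public_html>
          Options Indexes SymLinksIfOwnerMatch
          AllowOverride FileInfo Limit
      </Directory>

    On the DNS side you then point each username's record at whichever of the ten boxes actually holds that user, so adding a user is just a home directory plus a DNS entry.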

  • If it's HTML, graphics, and some animations, then 10 megs is still plenty. But if you keep the quotas down, say to a meg per person, it gets easier to do backups, and it prevents the warez, MP3, and pr0n sites. Or at least limits them.

    If folks want more than that, they can pay extra, or go to a third-party hosting system.
  • Hey -- I was just thinking -- this is a damn clever idea.

    Damn good thinking.

    willis/
  • The recent Slashdot article reported that they had around 5-10% W2K, and the rest assumed to be FreeBSD...

    willis/
  • I think you will find that MS has so far moved only a small number of servers in the Hotmail load-sharing pool to Win2K.

    Last I heard approx. 80% are still BSD. Microsoft will probably, eventually, have to move to win2k, for technical (PR) reasons.

    cheers, G

    ps. Netcraft is now reporting (at least for me) hotmail.com is running Microsoft-IIS/5.0 on Windows 2000

  • Let's see if it is possible to get useful information out of a troll....
    • every free webpage provider out there provides CGI access
    Can you [or anyone else out there] name one?

    I don't know any free space providers who allow you to run your own CGI scripts, rather than just a few guest book / counter scripts that they provide.

    Please, please, prove me wrong. Let's have some links, please.

    cheers,
    G

    • Not only would there be a huge amount of security holes, imagine the amount of server power that would take.
    Just because you give 30,000 people permission to run CGI scripts does not mean that 30,000 people will go out and learn Perl :-)

    I doubt that you should worry about the processing overhead that CGI may add - it will hardly be used. Now the security issues, well that's another matter.

    cheers,
    G

  • Ummm, you don't need to do the DNS thing; you can also use mod_rewrite on a cluster of faster machines to do the redirecting to the back end (see the sketch below).

    And if you don't use virtual hosts on the back end, you could have the rewrite box forward the request to serverid.domain.com (which always rewrites back to www.domain.com), so you can seamlessly reallocate locations.
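
    A minimal sketch of that, using a RewriteMap to look up which back-end box owns each user; the map file path and host names are made up:

      # On the front-end rewrite box(es). usermap.txt is a hypothetical flat
      # file with lines like:  cmatthew  web03.sitename.edu
      RewriteEngine On
      RewriteMap usermap txt:/etc/httpd/usermap.txt

      # Redirect /~user/... to the server holding that user's pages. Swap
      # [R,L] for [P,L] (with mod_proxy) to hide the back-end names instead.
      RewriteRule ^/~([^/]+)(.*)$ http://${usermap:$1|web01.sitename.edu}/~$1$2 [R,L]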
  • Two issues ago in `Linux Journal` they had a "what's that site running" list. Yahoo! claimed to run Apache on PalmOS.

    Hey, wait a minute...

  • What he said.

    mod_rewrite is very fast if you have clean rules (keep it simple, stup^dumbass). As far as pretty URLs go, I'd rather have http://user.blah.edu/cmatthew than cmatthew.user.blah.edu... that's just going too far the wrong way.

  • If you decide to go with more than one web server, take a look at Squid [squid-cache.org]. It can reverse-proxy web requests to make many servers look like one. You should be able to split the user names into alphabetical ranges.

    From the Squid FAQ [squid-cache.org]

    The Squid redirector can make one accelerator act as a single front-end for multiple servers. If you need to move parts of your filesystem from one server to another, or if separately administered HTTP servers should logically appear under a single URL hierarchy, the accelerator makes the right thing happen.

    This doesn't quite solve your FTP problem. I did a quick search and didn't find anything that would direct an FTP session to a different server based on username. It shouldn't be hard to adapt an FTP proxy to do this for you, but I've never tried.

    It wouldn't be hard to write a quick PHP/CGI help page that, given the user name, would give the user the correct server address. Or you could make a few DNS entries like a.ftp.host, b.ftp.host, etc., so that if the user's name was tom they would use t.ftp.host.

    Or you could ask GeoCities for their user management software ;~)
    Leknor
    http://Leknor.com [leknor.com]

  • I'm curious to know what you mean by "the wrong way". I think using mod_rewrite is a good idea too. Managing 50,000 subdomains would certainly be interesting. And with mod_rewrite, depending on how you architect the system, you could get away with far fewer rewrite rules. But these differences in complexity seem minor. Is there any other reason to avoid the subdomain solution?

  • by Anonymous Coward
    I set up a webserver that serves our customers' personal home pages (1-10 megs each, 10k+ accounts and growing) running Apache+FrontPage and NcFTPd on a single Sun Microsystems Ultra 2 with 500MB of RAM and 2 x 300MHz SPARC CPUs, with a single A1000 filled with 4-9 gig drives set up for RAID 0+1.

    The load on the box is very minimal and I fully expect this box to handle well over 50k before needing any type of upgrade.

    The CGIs could easily kill just about any box if you allow 50k people to have access to the cgi-bin; you might consider using a reverse proxy to distribute the cgi-bin across several boxes.

    -luke (lstepnio@majjix.com)
  • and you are a moron. Thanks for sharing.
  • Actually, from what I can tell (by using Netcraft), Hotmail is entirely running Win2K now, at least on the front end. I have been unable to get a FreeBSD box to show up at all now. It is true that they started the migration by slowly putting in Win2K boxes, but it seems that they are now complete and all of the machines are Windows. Maybe it's just me, though.
  • This is definitely the path to real expandability, and much like the path "real" sites take. Also, make as many a.ftp.host names as possible. It is then always possible to direct 10 or 20 of them to the same server by DNS, but also to transparently split apart the ones that are overloaded later.

    If your budget is really tight, then load as much disk as convenient onto each server, make them cheap Linux or BSD servers, and base the node size on what's cheap in hard drives these days. Organize your directory structure so one tree contains all the data from each named server (atree/ and btree/ if a and b are on the same physical server) so you can split easily by node. Keep all the CGI et cetera on the local servers, so a heavy script only slows that node down a bit.

    On another note, a typical thing to do is professors.host (who usually have fewer procs, but therefore get better response times), nongeeks.host, and a.cs.host and b.cs.host, so that the CS users can get angry at each other for bringing down a server, while the nongeeks, who'll be more likely to think it is your fault, won't have problems. On a similar note, if your setup allows easy user migration, you could disallow CGI on most servers, allowing it only by permission (and migrating that user then); a rough config sketch follows.
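
    A minimal httpd.conf sketch of the CGI-by-permission idea; /home/*/public_html and "approveduser" are placeholders, not anything from the poster's actual setup:

      # Default for everyone: no CGI execution in user web space.
      <Directory /home/*/public_html>
          Options Indexes IncludesNoExec SymLinksIfOwnerMatch
          AllowOverride FileInfo AuthConfig Limit
      </Directory>

      # Exception for a user who has asked (and been moved to a CGI node).
      <Directory /home/approveduser/public_html/cgi-bin>
          Options +ExecCGI
          SetHandler cgi-script
      </Directory>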
  • If this is a university, here's my suggestion:

    Split up the users by the school they are in; e.g., at GWU we've got seas.gwu.edu (engineering), esia.gwu.edu (international affairs), and a couple of others. Most university students don't run high-volume stuff off their websites, so you could probably get away with decently powered boxes: one for each school, with tons of SCSI HD space on each box. Or get a big honkin' Sun box and let it do everything (gwis2.circ.gwu.edu).

    Of course - if this isn't a university - just throw my comments out the window ;-}
  • Yes, actually I can. Fleacircus.org [fleacircus.org]. From their site:
    Flea Circus Hosting Features:
    • Full FTP and Shell access to your server
    • Unlimited space for your files*
    • Unlimited bandwidth usage
    • youraddress.com hosting. Simply register your domain and we'll host it.
    • Unlimited E-mail accounts
    • Full CGI-Bin and Server Side Includes (SSI) support
    I myself use them for my open source project.

    Mark Duell
  • Excellent, I'll check this out.

    Many thanks,
    G
  • Well, how cheap is cheap? What type of budget do you have to work with? Are the sites dynamic, static, or what? Do you know what kind of traffic these sites currently have? Are you looking for high availability and/or redundancy? Do you want load balancing of any sort? Besides FrontPage, do your customers want anything else like PHP or Zope (this will add to server load)? Are you providing other services like SQL, SMTP, etc.?
  • In an old article [slashdot.org] on Slashdot, these guys had the ticket for you. This Origin server could have been upgraded to easily handle your load. Concentric [concentric.com] serves over 100k users with theirs. Of course, you want to load it with Apache [apache.org] and stay away from crazy things like JSPs and too much CGI.
  • Of course, if you used DNS, then you'd also have to deal with domain name propagation problems, and I'm sure that's a mess you don't want to get involved with.
  • By "wrong way" I mean too many subdomains instead of directories from a purly cosmetic point of view. Mod_rewrite is a wonderfully good idea. But http://cmatthew.users.barkley.edu isn't as clean an URL as http://users.barkley.edu/cmatthew IMO.

    Slash should use mod_rewrite to map http://slashdot.org/sid/pid -- then get into all that ?blah=blah&purplemonkey=dishwasher (roughly as sketched below).
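
    Purely illustrative (the /sid/pid scheme and the article.pl parameter names come from this comment, not from how Slash actually works), but the rule would look something like:

      RewriteEngine On
      # Map a clean /12345/678 URL onto the query-string form internally.
      RewriteRule ^/([0-9]+)/([0-9]+)$ /article.pl?sid=$1&pid=$2 [L]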

  • From the message this sounds like a web server to support personal web pages at a university. Based on that, you can bet that a large percentage (most?) of the people will not use the space at all. Some small number will use the maximum of their quota (whatever that number is) and a bunch will put up just 1-2MB of stuff.

    My swag would be that *maybe* 1/10th of the 2TB of theoretical maximum would be used. 200GB of storage is doable on a good server, with simple backup strategies, etc.

    As for the web server: I would bet that a single well-tuned machine (of the PC variety) could do a reasonable job, assuming that no user has really nasty CGI scripts running. Yeah, it would slow down at peak times, but hey, it's a freebie for the students. You might saturate the uplink before the server.

    It will take some system admin time on an ongoing basis to monitor performance and slap those students who run nasty CGI scripts that chew up lots of CPU time. That requires a carefully written user policy that basically states you can do stuff, but you can't hog the CPU. It also requires a system admin who can be a BOFH to clue in the lusers that need to be helped out. The need for maintenance is probably a greater cost than the actual hardware.

  • I hate to prove an asshole right, but tripod.com (check out my website and CGI page here [tripod.com]) allows you to roll your own CGI. The servers aren't particularly fast, but they are free.

    The excuse that he didn't want his free space provider to get slashdotted was very amusing.
