Supporting Tens Of Thousands Of Users With Apache?
"I can only think of a couple of ways of doing this. One is to have an enormous single fileserver and have a cluster of apache web servers that NFS mount the home directories to serve up web pages. Then the users FTP into the main file server to store their web pages. To me, this seems wildly inefficient, and you have no real redundancy if the main fileserver crashes unless you are using a SAN (which is very expensive), or you have a hot backup that is rsync'ed or something. And I'm thinking that rsyncing up to 2TB of data would be an exercise in futility.
My other thought was to have several back-end file servers with a fixed number of users on each, and then send all HTTP requests through an LDAP server first, which would then redirect to the machine that user's web page resided on. The big problem then is how to make sure users FTP into the machine their account is on. They may also use FrontPage extensions with Apache, and this could complicate things even further.
I know there has got to be a better architecture for this. How do enormous sites like Yahoo and Excite tackle this problem? They have hundreds of thousands of users! Better yet, how could they tackle it with Open Source tools? Would, for example, a Turbo Linux cluster help this problem any, or would I still have to replicate the data across every node in the cluster (meaning I'd need up to 2TB of storage for each cluster node!)? Then what happens if they decide they want to add another 10,000 users? I can't find pointers to information, or ideas on how to do this *anywhere*. Can you fellow Slashdotters give me any advice?"
BSD & Apache? (Score:1)
Vague rambling: :-)
Use multiple NFS (or CODA?) servers on the backend. All of the web servers (N separate machines) would mount "/webspace" or something from each NFS server (probably a soft mount, or via CODA). You need to specify all of the server mounts via Apache "UserDir" directives so it will search for the user's homepage correctly.
If the NFS servers also cross-mount, then they can be FTP servers as well and be able to access any user's home directory. Then you just list all of the NFS servers in a DNS round robin for "ftp.blah" and the web servers in a round robin for "www.blah" and it should work.
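A minimal sketch of the Apache side, assuming the NFS exports are mounted at made-up paths like /nfs1/webspace; if I remember right, mod_userdir tries each listed directory in turn until it finds one containing the user's pages:

```apache
# Hypothetical NFS mount points -- adjust to your actual exports.
# mod_userdir should try each location in order for ~user requests.
UserDir /nfs1/webspace /nfs2/webspace /nfs3/webspace
```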
hmm.. (Score:1)
NAS is the way (Score:1)
Use FreeBSD without CGI allowed (Score:3)
Also, recommend to them to not allow CGI scripting - that would be a NIGHTMARE to support with 30,000 users. Not only would there be a huge number of security holes, imagine the amount of server power that would take.
Of course, if you have a large amount of money to spend, get an S/390 and give each user a virtual machine running Linux.
--
What is the expected usage? (Score:2)
Ok, but how much are people *really* going to use? The university I am at has about 20,000 users, and provides each with 50MB of disk space (to be used for everything including webspace). In total about 100GB is used, so the average per person is only 5MB. Since this includes much more than just webspace, I'm guessing that you'd find that 200GB would be more than enough for your users.
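To put rough numbers on that (a back-of-the-envelope sketch; the 30,000-user figure and the 5MB observed average are taken from this thread and are only estimates):

```python
# Back-of-the-envelope storage sizing: the worst case assumes every
# user fills their quota; the realistic case uses the observed average.
def expected_usage_gb(users, quota_mb, observed_avg_mb):
    theoretical = users * quota_mb / 1024.0   # everyone at quota
    realistic = users * observed_avg_mb / 1024.0
    return theoretical, realistic

worst, likely = expected_usage_gb(30000, 50, 5)
print(round(worst))   # ~1465 GB if every user filled a 50MB quota
print(round(likely))  # ~146 GB at a 5MB observed average
```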
Other notes which might be of interest: my university runs Apache on Solaris, with the file system on a separate NFS-mounted box. The webserver (which is also the FTP server and telnet server) is a four-processor Sun box, IIRC.
Re:BSD & Apache? (Score:1)
What's the current infrastructure? (Score:1)
If there's something there that your users are familiar with, it'll be easier for them and for your support people.
Cheap storage (Score:1)
Software Virtual Hosts (Score:2)
The most straightforward way is to avoid the use of a fileserver entirely. The problem you're having is that you're assuming you're going to use URLs of the form http://www.sitename.edu/path-to-userdir/. If you instead give everyone their own subdomain, this is quite easy.
Set up 10 Apache/Linux-or-BSD servers with 3000 users each. Set up a single DNS server that manages the subdomain "users.sitename.edu". Then give each user a subdomain of "username.users.sitename.edu". Map each subdomain to the IP of the appropriate server. You can manually configure Apache, or you can use one of the dynamic boot-time configuration schemes.
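One such dynamic scheme could be mod_vhost_alias, which maps each subdomain to a directory without needing a per-user <VirtualHost> block. A sketch, with hypothetical paths, assuming home directories live locally on each server:

```apache
# %1 expands to the first dotted part of the Host header, i.e. the
# username in username.users.sitename.edu. Paths are made up.
UseCanonicalName Off
VirtualDocumentRoot /home/%1/public_html
```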
Of course this could be a problem if your client has a policy about DNS subdomain allocation. And 50MB * 3000 users per server is still big... you might want to buy a big big disk array and plug all the Linux servers into it.
For that matter, a rack full of Sun Netra T1s and a fibre channel disk array should be cheap enough, and supported by Sun to boot. Netras are a dream to manage.
50MB is quite a bit of space (Score:2)
If folks want more than that, they can pay extra, or go to a third-party hosting system.
Re:Software Virtual Hosts damn clever (Score:1)
Damn good thinking--
willis/
Re:BSD & Apache? (Score:1)
willis/
FUD, methinks. (Score:1)
Last I heard approx. 80% are still BSD. Microsoft will probably, eventually, have to move to win2k, for technical (PR) reasons.
cheers, G
ps. Netcraft is now reporting (at least for me) hotmail.com is running Microsoft-IIS/5.0 on Windows 2000
Re:Use FreeBSD without CGI allowed (Score:1)
I don't know of any free space providers who allow you to run your own CGI scripts, rather than just the few guestbook/counter scripts that they provide.
Please, please, prove me wrong. Let's have some links, please.
cheers,
G
Re:Use FreeBSD without CGI allowed (Score:2)
I doubt that you should worry about the processing overhead that CGI may add - it will hardly be used. Now the security issues, well that's another matter.
cheers,
G
Re:Software Virtual Hosts (Score:1)
And if you don't use virtual hosts on the back end, you could have the rewrite box forward the request to serverid.domain.com, which always rewrites back to www.domain.com; that way you can seamlessly reallocate locations.
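A sketch of what the rewrite box might run, assuming a hand-maintained map file (userhost.map, with lines like "cmatthew server3.domain.com"); all server and file names here are hypothetical:

```apache
RewriteEngine On
# Look the user up in a text map; fall back to server1 if absent.
RewriteMap  userhost  txt:/etc/httpd/userhost.map
# Proxy /~user/... to whichever back-end server holds that user.
RewriteRule ^/~([^/]+)(.*)$  http://${userhost:$1|server1.domain.com}/~$1$2  [P]
```

The [P] flag needs mod_proxy compiled in; an [R] redirect would work too, at the cost of exposing the back-end server names to users.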
Re:BSD & Apache? (Score:1)
Hey, wait a minute...
Re:Software Virtual Hosts (Score:1)
mod_rewrite is very fast if you have clean rules (keep it simple, stup^dumbass). As far as a pretty URL goes, I'd like http://user.blah.edu/cmatthew rather than cmatthew.user.blah.edu... that's just going too far the wrong way.
Many servers and Squid (Score:2)
If you decide to go with more than one web server, take a look at Squid [squid-cache.org]. It can reverse proxy web requests to make many servers look like one. You should be able to split the user names on alpha ranges.
From the Squid FAQ [squid-cache.org]
This doesn't quite solve your FTP problem. I did a quick search and didn't find anything that would direct an FTP session to a different server based on the username. It shouldn't be hard to adapt an FTP proxy to do this for you, but I've never tried.
It wouldn't be hard to write a quick PHP/CGI help page that, given the user name, would provide the user with the correct server address. Or you could make a few DNS entries like a.ftp.host, b.ftp.host, etc., and if the user's name was tom they would use t.ftp.host.
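The first-letter convention is trivial to script; a sketch (host names are hypothetical, and non-alphabetic first characters are lumped onto one host by assumption):

```python
# Map a username to its FTP host using the a.ftp.host, b.ftp.host,
# ... convention described above.
def ftp_host_for(username, domain="ftp.host"):
    first = username[0].lower()
    if not first.isalpha():
        first = "a"  # arbitrary catch-all for odd usernames
    return "%s.%s" % (first, domain)

print(ftp_host_for("tom"))  # t.ftp.host
```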
Or you could ask GeoCities for their user management software ;~)
Leknor
http://Leknor.com [leknor.com]
Re:Software Virtual Hosts (Score:1)
I'm curious to know what you mean by "the wrong way". I think using mod_rewrite is a good idea too. Managing 50,000 subdomains would certainly be interesting. And with mod_rewrite, depending on how you architect the system, you could get away with far fewer rewrite rules. But these differences in complexity seem minor. Is there any other reason to avoid the subdomain solution?
doing 10k+ users on a SUN Ultra2+A1000 (Score:1)
The load on the box is very minimal and I fully expect this box to handle well over 50k before needing any type of upgrade.
The CGIs could easily kill just about any box if you allow 50k people to have access to the CGI-BIN; you might consider using a reverse proxy to distribute the CGI-BIN across several boxes.
-luke (lstepnio@majjix.com)
So basically your point is moot (Score:1)
Re:BSD & Apache? (Score:1)
Re:Many servers - definitely the real way (Score:1)
If your budget is really tight, then load as much disk as convenient on each server, make them cheap Linux or BSD servers, and base the node size on what's cheap in HDs these days. Organize your directory structure so one tree contains all data from each named server (atree/ and btree/ if a & b are on the same physical server) so you can split easily by node. Keep all the CGI et cetera on the local servers, so it only slows things down a bit.
On another note, a typical thing to do is professors.host (who usually run fewer processes, and therefore get better response times), nongeeks.host, and a.cs.host, b.cs.host, so that the CS users can get angry at each other for bringing down a server, while the nongeeks, who'll be more likely to think it is your fault, won't have problems. On a similar note, if your setup allows easy user migration, you could disallow CGI on most servers, allowing it only by permission (and migrating that user, then).
split by school (Score:1)
Split up the users by the school they are in, e.g. at GWU we've got seas.gwu.edu (engineering), esia.gwu.edu (international affairs), and a couple of others. Most university students don't run high-volume stuff off their website, so you could probably get away with decently powered boxes - one for each school, with tons of SCSI HD space on each box. Or get a big honkin' Sun box and let it do everything (gwis2.circ.gwu.edu).
Of course - if this isn't a university - just throw my comments out the window
Re:Use FreeBSD without CGI allowed (Score:1)
Flea Circus Hosting Features:
- Full FTP and Shell access to your server
- Unlimited space for your files*
- Unlimited bandwidth usage
- youraddress.com hosting. Simply register your domain and we'll host it.
- Unlimited E-mail accounts
- Full CGI-Bin and Server Side Includes (SSI) support

I myself use them for my open source project.

Mark Duell
Re:Use FreeBSD without CGI allowed (Score:2)
Many thanks,
G
specifics? (Score:1)
Dude, you should have bought this... (Score:1)
Re:Software Virtual Hosts (Score:1)
Re:Software Virtual Hosts (Score:1)
Slash should use mod_rewrite to map http://slashdot.org/sid/pid rather than getting into all that ?blah=blah&purplemonkey=dishwasher
Look at typical usage, not peak usage (Score:1)
My swag would be that *maybe* 1/10th of the 2TB theoretical maximum would actually be used. 200GB of storage is doable on a good server, with simple backup strategies, etc.
As for the Web server - I would bet that a single well-tuned machine (of the PC make) could do a reasonable job, assuming that no user has really nasty CGI scripts running. Yeah - it would slow down at peak times, but hey - it's a freebie for the students. You might saturate the uplink before the server.
It will take some system admin time on an ongoing basis to monitor performance and slap those students that run nasty CGI scripts that chew up lots of CPU time. That requires a carefully written user policy that basically states you can do stuff, but you can't hog the CPU. Then it also requires a system admin who can be a BOFH to clue in the lusers that need to be helped out. The need for maintenance is probably a greater cost than the actual hardware.
Re:Use FreeBSD without CGI allowed (Score:1)
The excuse that he didn't want his free space provider to get slashdotted was very amusing.