Scalable, Fault-Tolerant TCP Connections?
pauljlucas asks:
"My company is developing custom server software for an instant
messaging type server (under Solaris). Every client maintains a
TCP connection to the server when it is 'logged in'; the server
maintains state of who's logged in where. For large-scale
deployment, there are two problems: scalability and fault-tolerance.
A single server can handle at most around 64000 open sockets. To go
beyond this, you need many servers. Another way would be to 'fake' a
TCP stack in user-space (by reading/writing raw TCP packets) thus not
having one real socket per connection. For fault-tolerance, ideally
one would like N servers to maintain the exact state, at least for
the server process, so that if one goes off line, the other(s) can
pick up seamlessly. I'm thinking that both of these issues must have
already been solved without having to write lots of custom software.
Is anybody aware of off-the-shelf software and/or hardware solutions
(either commercial or freeware)?"
Why TCP? (Score:1, Informative)
It would be possible to implement a server-side solution, using TCP, but you'd need to keep track of things like sequence numbers and window sizes to move a TCP connection seamlessly between two computers (meaning you'd probably need to write a TCP stack).
Re:Why TCP? (Score:1)
If you don't do this, the UDP affinity will time out, and then the server can no longer contact the client when there is an incoming message for it.
LocalDirector (Score:3, Interesting)
You can get a TCP load balancer like Cisco LocalDirector [cisco.com] or one of its competing clones. They are expensive, though (around $20,000).
Re:LocalDirector (Score:2)
Local Director is a horrible device. Unlike modern load balancers from F5, Alteon, Foundry, etc., the Local Defector is a layer-2 bridge. It cannot have more than one path to a given target, causes all the problems that bridges introduce into switched networks, and allows for potential security breaches, because it is commonly used to bridge between differing subnets!
All of the above vendors provide a proxy/switch style solution for layer 3 and above. If you can afford F5's BigIP HA+ in fail-over, this is a dream! Host-based on Intel, with a customized *BSD. Unless you are a freak for IOS-type management, Unix admins will love this.
Check out O'Reilly on Bridge-Path vs. Route-Path Server Load Balancing [oreilly.com]
Your actual problems are somewhat different (Score:5, Insightful)
You indicate that your scalability problem kicks in at around 64000 simultaneous clients. Having developed high-performance scalable servers I would recommend taking a look at The C10k Problem [kegel.com], which is rather sobering: handling 10000 simultaneous connections is already a bitch.
Basically if you are using select() or poll() and have 10000 connections, you can expect 30% of your timeslice (on a fast machine) to be taken scanning the connection list for available data. Your performance also goes down the drain after about 250 connections (empirical observation; our server handles async requests which are offloaded to a hardware device for processing, so most server overhead is packet handling).
To get more connections than that you need to look at OS-specific methods: IO completion ports on NT, /dev/poll on Linux & Solaris, and kqueue on FreeBSD. These scale much more linearly out to 10000 connections (depending, of course, on whether your server can handle the total load), but still don't give you the ability to scale endlessly.
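To make the contrast concrete, here is a minimal event-loop sketch in Python (illustrative only; the original discussion is about C servers). Python's selectors module picks the best OS mechanism available, which today means epoll, kqueue, or /dev/poll rather than a linear select() scan, so the kernel hands back only the ready descriptors:

```python
import selectors
import socket

# DefaultSelector chooses the most scalable backend the OS offers:
# epoll on Linux, kqueue on FreeBSD, /dev/poll on Solaris, else poll/select.
sel = selectors.DefaultSelector()

def accept(listener):
    conn, _addr = listener.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)          # trivial echo "service"
    else:
        sel.unregister(conn)
        conn.close()

listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))     # ephemeral port for the sketch
listener.listen(128)
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, accept)

def serve_once(timeout=0.1):
    # The kernel returns only ready descriptors, so cost per wakeup is
    # O(ready connections), not O(total registered connections).
    for key, _mask in sel.select(timeout):
        key.data(key.fileobj)
```

The key architectural difference from select()/poll() is in the comment on serve_once(): work done per wakeup is proportional to activity, not to the size of the descriptor list.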
Finally, I know of no intrinsic reason (although OS implementations may have arbitrary limits) why a server should be restricted to 64k connections. A TCP/IP connection is defined by two endpoints, each being an IP address/port combination. Your server's IP/port is always one of those endpoints, and the client endpoints are all unique. Client connections, on the other hand, are limited by the number of available ports on the client machine, i.e. 64k.
For fault tolerance, your best bet is to look at the architecture of Enterprise Java servers. Effectively you have a load balancer up front which redirects packets to one of several application servers; any persistent information must be saved off to a database server, and if necessary two or more database servers must be configured to synchronise with each other on a continual basis.
I am not aware of any software that can magically do this for an arbitrary application, although you may find network shared memory libraries which can more or less accomplish this. But if you are concerned about performance, you are unlikely to find COTS software to do the job (since the current business model is to throw more computing power at it).
Re:Your actual problems are somewhat different (Score:1)
Basically if you are using select() or poll() and have 10000 connections, you can expect 30% of your timeslice (on a fast machine) to be taken scanning the connection list for available data.
Not exactly. For an instant messaging server, at the least, the more time you spend in select(), the more filedescriptors you are likely to select. You wind up wasting CPU time when there is little activity, but CPU time doesn't matter as long as you are handling requests at least as fast as they are coming in. In actual practice, using select() affects latency (on the order of milliseconds), but amortized the scalability is nearly O(n).
Your performance also goes down the drain after about 250 connections (empirical observation; our server handles async requests which are offloaded to a hardware device for processing, so most server overhead is packet handling).
This is likely a bug in the OS configuration, or the OS kernel. I've seen machines easily handle 10,000 connections without blinking. When testing an instant messaging server I wrote, a pentium III running FreeBSD easily handled 10,000 connections on a single process.
Re:Your actual problems are somewhat different (Score:2)
Hi. No bug in the OS or the kernel, but in assumptions ;) Our server is tailored for performance under load, not for massive amounts of idle connections. In such a scenario the time taken by select() has a serious impact on the overall performance.
In instant messaging, by contrast, I will agree with you that the lighter processing load will be less affected by scanning (mostly idle) descriptors.
Re:Your actual problems are somewhat different (Score:1)
Select() is only expensive with idle connections. If the connection is not idle then select() only adds a trivial constant overhead.
The only assumption I'm making is that all connections being handled by select() have approximately the same throughput rates. If that assumption is made, then select will scale O(n) on the number of connections, but the number of requests handled per call of select() will also scale O(n) on the number of connections. Amortized over the requests handled, select() only adds a constant overhead.
That's the theory. The reality is that I have yet to see a real world case where there is an even distribution of load and yet select() is the performance bottleneck. I'm not denying that you're seeing a bottleneck at 250 connections. There may not even be a bug anywhere, you might just be seeing the maximum throughput for your particular situation.
Don't get me wrong, /dev/poll and the others are an improvement to select()/poll() in most circumstances. The interface is certainly more appropriate. But in terms of real world scalability, they don't help at all. Show me the benchmark showing otherwise, and I'll show you the flaw (the general flaw is sending only one request per select()).
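The amortization argument above can be put into a back-of-the-envelope cost model (a sketch, with an assumed unit scan cost per descriptor, not a benchmark): one select() call scans all n descriptors, but if a fixed fraction of them are ready, the number of requests served per call grows with n too, so the per-request overhead stays flat -- until the connections are mostly idle.

```python
def select_cost_per_request(n_conns, ready_fraction, scan_cost=1.0):
    """Amortized select() scan cost per request handled.

    One select() call scans all n_conns descriptors (cost ~ n_conns)
    and returns roughly ready_fraction * n_conns ready descriptors,
    each of which yields one request to process.
    """
    ready = max(1, int(n_conns * ready_fraction))
    return (scan_cost * n_conns) / ready
```

With a uniform 10% of connections busy, the per-request cost is the same at 1,000 and 10,000 connections; with almost everything idle, the scan cost is paid for a single request and grows linearly with n -- which is exactly the disagreement in this subthread.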
Re:Your actual problems are somewhat different (Score:2)
Hi. Just FYI, our performance testing results. We have three implementations of this server: NT IO-completion ports, select(), and multi-threaded.
Testing was done using n identical clients, all flooding synchronous requests. i.e. each client sits in a "send, wait for reply" loop. The bottleneck is network bandwidth - we get sub-optimum performance except on Gb ethernet.
The nature of our server is such that maximum performance is reached at 8-12 connections. The multi-threaded server has the best performance out to 20-25 connections. IO completion ports scale out to 500 connections, dropping only 3% in performance (compared to 12 connections). select() never tops the scales, but is consistently about 1% behind IO completion ports out to 120 connections, then drops back to 10% behind out to 250 connections, then drops off horribly.
I will readily admit, however, that our implementation using select() could be more efficient. Even so, in architectural benchmarks it doesn't fare well beyond about 400 connections in a server doing IPC/RPC type transactions.
Re:Your actual problems are somewhat different (Score:1)
Testing was done using n identical clients, all flooding synchronous requests. i.e. each client sits in a "send, wait for reply" loop.
This is highly unrealistic for most applications, and is exactly the reason you are seeing performance issues. Most likely the server is processing the requests fine, but the client is sending them too slowly.
If your tests accurately reflect the expected usage, you're probably best off not using select() or I/O completion points or anything at all, and instead simply using nonblocking I/O in a loop.
Select() does have a problem with latency, and your lock-step test is exploiting that problem perfectly. In a real world internet environment the latency of the end-to-end connection is almost always going to cancel that out anyway, but if you really do have such non-random usage patterns you should be using that information to predict which filedescriptors are likely to have data instead of select()ing them at all.
The bottleneck is network bandwidth - we get sub-optimum performance except on Gb ethernet.
If the bottleneck is network bandwidth then how can the bottleneck also be select()? The increase in performance when using Gb ethernet might be the decrease in latency, not the increase in bandwidth. But that's just a pure guess.
Re:Your actual problems are somewhat different (Score:2)
In any case, the fact that many kernels fall on their face for far smaller numbers of connections is a result of simplistic data structures (linear lists, bit arrays, etc.). Why do kernel developers choose simplistic data structures? Beats me. Perhaps it's related to the fact that implementing and reusing good data structure libraries in C is just such a pain, but it's hard to say whether that's the cause of the problem or merely the consequence of the general mindset of kernel developers. In any case, there is no point in whining about the poor abstraction in many operating system kernels--obviously, nobody else wants to do the work. As long as kernel developers fix these problems when they come up in whatever way they like, everybody is happy. And they do fix them. Several UNIX kernels changed over from lists to hash tables in their network-related data structures when they hit performance limits, so whatever is wrong in your favorite kernel can be fixed by your favorite kernel hackers as well.
Re:Your actual problems are somewhat different (Score:2)
Hi. I'm afraid I still fail to understand how this shortcoming is intrinsic to TCP w.r.t server connections.
When a client makes outbound connections, each connection must be from a unique port. A client cannot share a port used by a server, and a port is a 16-bit value, so there are (65536 - server ports - 1 [port 0 is not addressable]) client connections permitted from any given machine.
Now when it comes to the server, the situation is different. A server connection is identified by a single port on the server, irrespective of the client. The server distinguishes the client connection based on the client's IP and source port. That gives an intrinsic limitation of about 64k connections per client IP, and about 4G IPs.
I see no reason why this 'bookkeeping' (when applied to server connections) has to be done with a 16-bit number. That sounds like an arbitrary operating system limitation. It would imply that an incoming packet is examined for IP and source port, that looked up in a table with at most 64k entries, and the result taken to be the connection ID. This makes no sense -- the kernel would look up the ip:port combination and get a unique stream identifier, which is an int (32 bits).
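The lookup described here can be sketched as a hash table keyed on the full address pair (a hypothetical illustration in Python of the demultiplexing step, not any real kernel's data structure): one server port can carry ~64k connections per client IP, because the remote endpoint is part of the key.

```python
# Hypothetical demultiplexing table: incoming segments are matched to a
# connection by the full (local_ip, local_port) + (remote_ip, remote_port)
# pair, and the value is a wide (32-bit-style) stream identifier -- not a
# 16-bit one.
connections = {}
next_id = 0

def register(local, remote):
    """Record a new connection; local/remote are (ip, port) tuples."""
    global next_id
    connections[(local, remote)] = next_id
    next_id += 1
    return connections[(local, remote)]

def demux(local, remote):
    """Look up the stream ID for an incoming segment, or None."""
    return connections.get((local, remote))
```

Two connections to the same server port from different client ports get distinct identifiers, so nothing in the bookkeeping itself caps the server at 64k connections.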
Could you please explain clearly (in theory, or in terms of OS implementation) why this isn't the case? Thanks.
Scalable? (Score:1)
No matter what you do with the software to handle a giant number of connections, you still have the physical limits of the machine, don't you? The NIC and CPU can only do so much; isn't that going to be a bottleneck for you?
It also seems like keeping the state information for 64K+ connections, whatever they are being used for, has to involve some kind of overhead as well. You'd really need some efficient way to organize it all so it's reasonably easy and fast to access.
You can't handle each connection with its own thread, but even if you break it up into several threads or several processes, each handling a couple thousand connections (polling or something similar), that's got to add a lot of latency as well if you expect a large number of these connections to be active most of the time.
I could be wrong; I don't know exactly what you're trying to do. Maybe the connections are mostly idle. But I think that you are probably looking at more than one bottleneck forcing a single computer to do all the work.
In general, trying to customize a single machine, or single program to scale isn't usually a good solution. You'd probably be better to find a way to design the software to work with a number of machines. Most large websites have load balancers that distribute the requests to several machines. Large services typically do this, or rely on the fact that all users probably won't be connected all at once.
Re:Scalable? (Score:1)
I don't think so, because most of the time a connection is idle. In my application, unlike a web server, a connection only gets traffic when there is an incoming message for a client. This is just like the real telephone network.
One of the goals is to reduce the cost of buying and maintaining the racks and racks of servers that are ordinarily needed for something like this. All those machines are needed only because of TCP/socket/port limitations, not CPU or bandwidth, because the connections are idle most of the time. The real telephone network isn't engineered to handle everybody going off-hook simultaneously.
Problem: state of the application (Score:1)
BUT...
...when the connection is being switched to a new server (for load balancing or failure switchover), the server's application does not know the connection, and thus the switched one will be rejected. You will need to program your application to keep and distribute the state of the connection. If you write stateless applications (e.g. web-based), then at least the (next) IP connection will have a clean switchover on the server side.
You want linuxvirtualserver (Score:3, Informative)
More importantly, you should check out http://www.linuxvirtualserver.org/ ; its aim is exactly what you want:
"The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the cluster is transparent to end users. End users only see a single virtual server."
Sounds like a perfect match.
Sumner
Re:You want linuxvirtualserver (Score:2)
"The advantage of the virtual server via NAT is that real servers can run any operating system that supports TCP/IP protocol, real servers can use private Internet addresses, and only an IP address is needed for the load balancer"
So you can keep your application on Solaris with LVS.
Sumner
Forgot: clients talk UDP peer-to-peer (Score:1)
Hence, the server sees very little traffic in comparison.
Re:Forgot: clients talk UDP peer-to-peer (Score:2)
You have an instant messaging application, which is composed of a lot of clients and a central server.
The clients talk to the server only on start-up, but presumably you want to keep the TCP connection open so you would know when a client goes offline.
That is not a good way to do it -- not because you'll have problems with ports (you can have as many connections open as you want; after all, a TCP connection is identified by source_ip:port + dest_ip:port), but because of the overhead you'll incur maintaining so many connections.
It's also not good to use UDP for peer-to-peer connections; why complicate your life with things that TCP already offers?
Presumably, on start-up, a client calls the server, logs in, registers its state (online, busy, D/A, etc.), and asks for a list of names/IDs of the friends that it has.
You want to send it the IPs of those who are online, so it can send messages to them directly.
The way to do it, in my idea, is to handle it so:
Have a database of your clients, which would include username & password, an IP & state.
Another field should be a list of names that should be notified when this client goes online/change status.
When the client calls in, authenticate it, register its state in the DB, and send it the list of IPs & states of the people online that it requested. Notify everyone that requested to be notified and is currently online that the user went online.
Cut the TCP connection. **
When the client closes, have it send a message saying "I'm going offline".
Then change the state of it in the DB & inform the users linked to it. **
That way, the only connections you have are of clients starting, changing status & shutting down.
That should lower your load considerably.
As for scalability & fault tolerance, just put it behind a load balancer, that way, if a server is busy or down, the request goes to another server, all of them are linked to the same DB, so the state is being preserved.
If a server goes down in the middle of a request, then the client should be smart enough to recover, and try again (that is what browsers do, and why load balancers works so well for HTTP).
Be sure to make the client stop after a couple of failed tries, though; you don't want to overload the network in case all your servers go down for some reason.
What about errors, you ask? What if a client is terminated or disconnected without having a chance to inform you?
Well, the way I would do it is let other clients discover that.
If a client can't form a direct connection to another client, it should tell the server about this.
The server would try to reach the client itself, and if it fails, would register that client as offline, store the request for when that user goes online again (or discard it), and tell all the clients that are linked to the failed client about it [***].
I think it's much better than have the server poll at possibly hundreds of thousands of connections. (Or more, if you are lucky.)
After all, it doesn't matter if a client that no one is trying to call is offline while it's marked online. #
[**]
If you care about bandwidth/load, have the clients maintain a list of the people (or give it to them during log-on), and have the *client* do the notifications on start-up, shut-down & status change. You'll have to handle the errors yourself, though. ***
[***]
You can be really nasty and have the client that discovered the error inform the rest of the clients that are linked to the failed client that it's gone. But that would require that client to have the list of people that want the failed client's link, which can be bad from a privacy point of view.
[#]
If it does matter to you, have the clients poll the other clients in their list every hour or so; it would balance out so you wouldn't have too many lost connections marked as alive.
Re:Forgot: clients talk UDP peer-to-peer (Score:1)
Actually, I said "instant messaging type server." In reality, the application is to send real-time video/audio peer-to-peer. (My question would have been 5 pages long if I gave all the exhaustive details.)
No, we want to keep the TCP connection open so a client knows when it has an incoming message instantly. Polling is out of the question because (a) it's not instant and (b) would use enormous amounts of bandwidth we'd have to pay for.
It is for audio and video. Again, I simplified my question because this detail is mostly irrelevant. I only mentioned it to point out that the bulk of the bandwidth is NOT going through the server, just the "call setup."
This does allow people not on your "buddy list" to call you (which every IM client does in fact allow).
Re:Forgot: clients talk UDP peer-to-peer (Score:1)
Why not use UDP for this?
Tom
Re:Forgot: clients talk UDP peer-to-peer (Score:1)
Re:Forgot: clients talk UDP peer-to-peer (Score:2)
Why do it via the *server*, anyway? If you have a message from one client to another, then transfer it directly from one client to another, not through the server. That way, the client is aware instantly, and you aren't wasting bandwidth by transferring the data.
Why maintain them? (Score:1)
Re:Why maintain them? (Score:1)
Because multiply that by the goal of millions of users and that's a lot of bandwidth to pay for.
Did you consider UDP? (Score:2)
If you're thinking of going to the trouble of simulating TCP with raw sockets, UDP seems a simpler alternative to that.
Re:Did you consider UDP? (Score:1)
If you're thinking of going to the trouble of simulating TCP with raw sockets, UDP seems a simpler alternative to that.
This being an instant messaging server, presumably for the public, UDP would have too many problems going through certain firewalls.
He also asked for redundancy (Score:2)
The way to do this is to build a custom TCP stack and integrate it tightly into your app. A lot of work and hard to get right.
I would ask, "Do we REALLY" need this when our application already has to handle things like network failures? You might, though - I don't know what your application is.
Also, don't forget to use redundant routers, redundant firewalls, etc. If you use NAT, that imposes one more problem - transparently moving the connection table between the failed firewall and the working one.
Not quite off-the-shelf (Score:1)
A single server can handle at most around 64000 open sockets.
If my memory serves me correctly one can easily break the 64K limit by using multiple processes. This wasn't on Solaris, though.
Is anybody aware of off-the-shelf software and/or hardware solutions (either commercial or freeware)?
Assuming you have control over the protocol, I've written such a server. I don't have the code, but my non-compete agreement has expired.
Re:Not quite off-the-shelf (Score:1)
AFAIK, ephemeral ports for sockets are a system-wide resource. More than one process can't use the same IP/port pair on a machine.
Re:Not quite off-the-shelf (Score:1)
More than one process can't use the same IP/port pair on a machine.
More than one connection can't use the same 5-tuple of remote address, remote port, protocol, local address, local port. Multiple processes can use the same local address, local port, protocol 3-tuple, but some OSes have problems when multiple processes do an accept() on that same 3-tuple. In any case, these limitations can be easily solved by using multiple local IP addresses and/or multiple local ports. I have seen up to 256 IP addresses on a single ethernet card with no problems. If bandwidth becomes a problem before CPU does, you can add multiple ethernet cards. You can then use round-robin DNS to balance the load between different processes, or (preferably) build the port selection into the client/server protocol. Putting just a small bit of intelligence into the client will save you orders of magnitude in terms of simplicity, reliability, and scalability. You also could, but probably shouldn't, use a load balancer to present a single IP/port combination to the world.
I'm almost tempted to do a quick test to confirm it, but I'm almost certain I've seen the 64K connections/machine broken. This was over a year ago, on FreeBSD. I'd imagine Solaris and Linux are able to do it as well, by this point.
Re:Not quite off-the-shelf (Score:1)
We already did that with the hope that we would solve the bigger problems later. :-)
Re:Not quite off-the-shelf (Score:1)
AFAIK, ephemeral ports for sockets are a system-wide resource.
Aren't ephemeral ports only used for outgoing connections, not incoming ones?
Re:Not quite off-the-shelf (Score:1)
Uhm, I think what I meant is that when you do an accept(2) on the listening socket, you get a new file descriptor that is attached to another socket connection that is independent of the listening socket. There are a finite number of those.
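That point can be demonstrated in a few lines (a sketch in Python, but the underlying accept(2) behavior is the same): the accepted connection gets its own file descriptor, independent of the listening socket, while sharing the same local port.

```python
import socket

# The listening socket has one descriptor and keeps listening; each
# accept() produces a brand-new socket with its own descriptor.
listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))
listener.listen(8)
port = listener.getsockname()[1]

client = socket.create_connection(("127.0.0.1", port))
conn, addr = listener.accept()
# conn.fileno() != listener.fileno(): independent descriptors,
# yet conn.getsockname() shows the same local port as the listener.
```

The finite resource the poster mentions is those per-process (and per-system) descriptors, not the port number itself.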