Unix Operating Systems Software

Scalable, Fault-Tolerant TCP Connections?

pauljlucas asks: "My company is developing custom server software for an instant messaging type server (under Solaris). Every client maintains a TCP connection to the server when it is 'logged in'; the server maintains state of who's logged in where. For large-scale deployment, there are two problems: scalability and fault-tolerance. A single server can handle at most around 64000 open sockets. To go beyond this, you need many servers. Another way would be to 'fake' a TCP stack in user-space (by reading/writing raw TCP packets) thus not having one real socket per connection. For fault-tolerance, ideally one would like N servers to maintain the exact state, at least for the server process, so that if one goes off line, the other(s) can pick up seamlessly. I'm thinking that both of these issues must have already been solved without having to write lots of custom software. Is anybody aware of off-the-shelf software and/or hardware solutions (either commercial or freeware)?"

  • Why TCP? (Score:1, Informative)

    by Anonymous Coward
    The easiest solution would be to use UDP, and have the clients switch servers when they can't get a response. As long as the servers have a common state, the client should be able to switch between them without too many problems. And since UDP is connectionless, it will scale better - you won't have to open a socket for each client.

    It would be possible to implement a server-side solution, using TCP, but you'd need to keep track of things like sequence numbers and window sizes to move a TCP connection seamlessly between two computers (meaning you'd probably need to write a TCP stack).
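
    For what it's worth, the client side of the UDP failover idea is only a few lines. This is a rough, untested sketch of the concept; the server addresses, port and timeout are made up and error handling is omitted:

      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <sys/time.h>
      #include <unistd.h>

      static const char *servers[] = { "192.0.2.10", "192.0.2.11" };   /* placeholder IPs */

      int main(void)
      {
          int s = socket(AF_INET, SOCK_DGRAM, 0);
          int nservers = sizeof servers / sizeof servers[0];
          struct timeval tv = { 2, 0 };                  /* 2-second reply timeout */

          setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

          for (int i = 0; ; i = (i + 1) % nservers) {    /* try servers round-robin */
              struct sockaddr_in srv;
              char buf[512] = "HELLO";                   /* whatever your protocol uses */

              memset(&srv, 0, sizeof srv);
              srv.sin_family = AF_INET;
              srv.sin_port   = htons(5190);              /* made-up service port */
              inet_pton(AF_INET, servers[i], &srv.sin_addr);

              sendto(s, buf, 5, 0, (struct sockaddr *)&srv, sizeof srv);
              if (recvfrom(s, buf, sizeof buf, 0, NULL, NULL) >= 0) {
                  printf("server %s answered\n", servers[i]);
                  break;                                 /* stick with this server */
              }
              /* recvfrom() timed out: loop around and try the next server */
          }
          close(s);
          return 0;
      }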

    • You can't use UDP as easily because you have to frequently send a UDP packet from the client to the server to keep the affinity mapped if the client is behind a NAT, e.g., Linksys DSL modem/router or equivalent.

      If you don't do this, the UDP affinity will time out and then the server can no longer contact the client when there is an incoming message for it.
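
      A minimal sketch of that keepalive (the 20-second interval is an assumption; it just has to be shorter than the NAT box's UDP idle timeout):

        #include <netinet/in.h>
        #include <sys/socket.h>
        #include <unistd.h>

        /* Send a one-byte datagram to the server periodically so the NAT
         * keeps the client's UDP port mapping alive. */
        void keepalive_loop(int s, const struct sockaddr_in *srv)
        {
            for (;;) {
                char b = 0;
                sendto(s, &b, 1, 0, (const struct sockaddr *)srv, sizeof *srv);
                sleep(20);
            }
        }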

  • LocalDirector (Score:3, Interesting)

    by Huusker ( 99397 ) on Wednesday January 09, 2002 @06:53AM (#2808618) Homepage

    You can get a TCP load balancer like Cisco LocalDirector [cisco.com] or one of its competing clones. They are expensive, though ($20,000).

    • NO!!!!!

      Local Director is a horrible device. Unlike modern load balancers from F5, Alteon, Foundry, etc., the Local Defector is a layer-2 bridge. It cannot have more than one path to a given target, causes all the problems that bridges introduce into switched networks, and allows for potential security breaches, because it is commonly used to bridge between differing subnets!

      All of the above vendors provide a proxy/switch style solution for layer 3 and above. If you can afford F5's BigIP HA+ in fail-over, this is a dream! Host-based on Intel, with a customized *BSD. Unless you are a freak for IOS-type management, Unix admins will love this.

      Check out O'Reilly on Bridge-Path vs. Route-Path Server Load Balancing [oreilly.com]

  • by Twylite ( 234238 ) <twylite AT crypt DOT co DOT za> on Wednesday January 09, 2002 @08:06AM (#2808735) Homepage

    You indicate that your scalability problem kicks in at around 64000 simultaneous clients. Having developed high-performance scalable servers I would recommend taking a look at The C10k Problem [kegel.com], which is rather sobering: handling 10000 simultaneous connections is already a bitch.

    Basically if you are using select() or poll() and have 10000 connections, you can expect 30% of your timeslice (on a fast machine) to be taken scanning the connection list for available data. Your performance also goes down the drain after about 250 connections (empirical observation; our server handles async requests which are offloaded to a hardware device for processing, so most server overhead is packet handling).
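
    To see where that time goes, here is a stripped-down select() loop (a sketch only; note that FD_SETSIZE is typically 1024, so you'd have to raise it just to get this far):

      #include <string.h>
      #include <sys/select.h>
      #include <unistd.h>

      void serve(int listen_fd, int *conns, int nconns)
      {
          for (;;) {
              fd_set readfds;
              int maxfd = listen_fd, i;

              FD_ZERO(&readfds);
              FD_SET(listen_fd, &readfds);
              for (i = 0; i < nconns; i++) {            /* O(n) just to build the set */
                  FD_SET(conns[i], &readfds);
                  if (conns[i] > maxfd)
                      maxfd = conns[i];
              }
              select(maxfd + 1, &readfds, NULL, NULL, NULL);

              if (FD_ISSET(listen_fd, &readfds)) {
                  /* accept() the new connection and add it to conns[] */
              }
              for (i = 0; i < nconns; i++) {            /* O(n) again to find the ready ones */
                  if (FD_ISSET(conns[i], &readfds)) {
                      char buf[4096];
                      read(conns[i], buf, sizeof buf);  /* handle the data */
                  }
              }
          }
      }

    Every wakeup walks the whole descriptor table twice, no matter how few connections actually have data.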

    To get more connections than that you need to look at OS-specific methods: I/O completion ports on NT, /dev/poll on Linux & Solaris, and kqueue on FreeBSD. These scale much more linearly out to 10000 connections (depending, of course, on whether your server can handle the total load), but still don't give you the ability to scale endlessly.
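
    For example, Solaris's /dev/poll moves the interest list into the kernel, so the per-wakeup cost is proportional to the number of ready descriptors rather than the total. This is from memory, so treat it as a sketch and check the Solaris documentation:

      #include <fcntl.h>
      #include <poll.h>
      #include <sys/devpoll.h>          /* Solaris; Linux needed a patch for this back then */
      #include <sys/ioctl.h>
      #include <unistd.h>

      /* Open the driver once: int dp = open("/dev/poll", O_RDWR); */

      /* Register a descriptor once, instead of passing the whole list on every call.
       * To stop watching an fd, write the same pollfd with events = POLLREMOVE. */
      void watch(int dp, int fd)
      {
          struct pollfd p;
          p.fd = fd;
          p.events = POLLIN;
          p.revents = 0;
          write(dp, &p, sizeof p);
      }

      /* Wait: the kernel hands back only the descriptors that are ready. */
      int wait_ready(int dp, struct pollfd *ready, int max)
      {
          struct dvpoll dv;
          dv.dp_fds = ready;
          dv.dp_nfds = max;
          dv.dp_timeout = -1;           /* block indefinitely */
          return ioctl(dp, DP_POLL, &dv);
      }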

    Finally, I know of no intrinsic reason (although OS implementations may have arbitrary limits) why a server should be restricted to 64k connections. A TCP/IP connection is defined by two endpoints, each being an IP address/port combination. Your server's IP/port is always one of those endpoints, and the client endpoints are all distinct. Client connections, on the other hand, are limited by the number of available ports on the computer, i.e. 64k.

    For fault tolerance, your best bet is to look at the architecture of Enterprise Java servers. Effectively you have a load balancer up front which redirects packets to one of several application servers; any persistent information must be saved off to a database server, and if necessary two or more database servers must be configured to synchronise with each other on a continual basis.

    I am not aware of any software that can magically do this for an arbitrary application, although you may find network shared memory libraries which can more or less accomplish this. But if you are concerned about performance, you are unlikely to find COTS software to do the job (since the current business model is to throw more computing power at it).

    • Basically if you are using select() or poll() and have 10000 connections, you can expect 30% of your timeslice (on a fast machine) to be taken scanning the connection list for available data.

      Not exactly. For an instant messaging server, at least, the more time you spend in select(), the more file descriptors you are likely to select. You wind up wasting CPU time when there is little activity, but CPU time doesn't matter as long as you are handling requests at least as fast as they are coming in. In actual practice, using select() affects latency (on the order of milliseconds), but amortized, the scalability is nearly O(n).

      Your performance also goes down the drain after about 250 connections (empirical observation; our server handles async requests which are offloaded to a hardware device for processing, so most server overhead is packet handling).

      This is likely a bug in the OS configuration, or the OS kernel. I've seen machines easily handle 10,000 connections without blinking. When testing an instant messaging server I wrote, a Pentium III running FreeBSD easily handled 10,000 connections in a single process.
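
      (For reference, since kqueue came up above: the FreeBSD interface amounts to registering each descriptor once and then being handed back only the ready ones. A rough sketch, not the code from my server:)

        #include <sys/types.h>
        #include <sys/event.h>
        #include <sys/time.h>

        /* Create the queue once: int kq = kqueue(); */

        void watch(int kq, int fd)
        {
            struct kevent ev;
            EV_SET(&ev, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
            kevent(kq, &ev, 1, NULL, 0, NULL);       /* register interest */
        }

        /* Returns only descriptors with pending data; no O(n) scan needed. */
        int wait_ready(int kq, struct kevent *ready, int max)
        {
            return kevent(kq, NULL, 0, ready, max, NULL);
        }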

      • Hi. No bug in the OS or the kernel, but in assumptions ;) Our server is tailored for performance under load, not for massive numbers of idle connections. In such a scenario the time taken by select() has a serious impact on the overall performance.

        In instant messaging, by contrast, I will agree with you that the lighter processing load will be less affected by scanning (mostly idle) descriptors.

        • Select() is only expensive with idle connections. If the connection is not idle then select() only adds a trivial constant overhead.

          The only assumption I'm making is that all connections being handled by select() have approximately the same throughput rates. If that assumption is made, then select will scale O(n) on the number of connections, but the number of requests handled per call of select() will also scale O(n) on the number of connections. Amortized over the requests handled, select() only adds a constant overhead.

          That's the theory. The reality is that I have yet to see a real world case where there is an even distribution of load and yet select() is the performance bottleneck. I'm not denying that you're seeing a bottleneck at 250 connections. There may not even be a bug anywhere, you might just be seeing the maximum throughput for your particular situation.

          Don't get me wrong, /dev/poll and the others are an improvement to select()/poll() in most circumstances. The interface is certainly more appropriate. But in terms of real world scalability, they don't help at all. Show me the benchmark showing otherwise, and I'll show you the flaw (the general flaw is sending only one request per select()).

          • Hi. Just FYI, here are our performance testing results. We have three implementations of this server: NT I/O completion ports, select() and multi-threaded.

            Testing was done using n identical clients, all flooding synchronous requests. i.e. each client sits in a "send, wait for reply" loop. The bottleneck is network bandwidth - we get sub-optimum performance except on Gb ethernet.

            The nature of our server is such that max. performance is reached at 8-12 connections. The multi-threaded server has the best performance out to 20-25 connections. I/O completion ports scale out to 500 connections, dropping only 3% in performance (compared to 12 connections). select() never tops the scales, but is consistently about 1% behind I/O completion ports out to 120 connections, then drops back to 10% behind out to 250 connections, then drops off horribly.

            I will readily admit, however, that our implementation using select() could be more efficient. Even so, in architectural benchmarks it doesn't fare well beyond about 400 connections in a server doing IPC/RPC type transactions.

            • Testing was done using n identical clients, all flooding synchronous requests. i.e. each client sits in a "send, wait for reply" loop.

              This is highly unrealistic for most applications, and is exactly the reason you are seeing performance issues. Most likely the server is processing the requests fine, but the client is sending them too slowly.

              If your tests accurately reflect the expected usage, you're probably best off not using select() or I/O completion ports or anything at all, and instead simply using nonblocking I/O in a loop.

              Select() does have a problem with latency, and your lock-step test is exploiting that problem perfectly. In a real world internet environment the latency of the end-to-end connection is almost always going to cancel that out anyway, but if you really do have such non-random usage patterns you should be using that information to predict which file descriptors are likely to have data instead of select()ing them at all.

              The bottleneck is network bandwidth - we get sub-optimum performance except on Gb ethernet.

              If the bottleneck is network bandwidth then how can the bottleneck also be select()? The increase in performance when using Gb ethernet might be the decrease in latency, not the increase in bandwidth. But that's just a pure guess.

    • You are wrong: the 64k limit is intrinsic to TCP, both on the server and on the client side. When the server accepts a connection, it needs to keep track of which packets belong to which connection. Otherwise, the packets from multiple clients running on the same client machine couldn't be assigned to the correct stream. That bookkeeping is done with a 16-bit number. Take a look at the protocols. If you actually complied with TCP/IP timeouts as originally specified, you couldn't even sustain anywhere near the connection rates that modern web servers achieve, and the shortened timeouts actually are rather bad from the point of view of robustness.

      In any case, the fact that many kernels fall on their face for far smaller numbers of connections is a result of simplistic data structures (linear lists, bit arrays, etc.). Why do kernel developers choose simplistic data structures? Beats me. Perhaps it's related to the fact that implementing and reusing good data structure libraries in C is just such a pain, but it's hard to say whether that's the cause of the problem or merely the consequence of the general mindset of kernel developers. In any case, there is no point in whining about the poor abstraction in many operating system kernels--obviously, nobody else wants to do the work. As long as kernel developers fix these problems when they come up in whatever way they like, everybody is happy. And they do fix them. Several UNIX kernels changed over from lists to hash tables in their network-related data structures when they hit performance limits, so whatever is wrong in your favorite kernel can be fixed by your favorite kernel hackers as well.

      • Hi. I'm afraid I still fail to understand how this shortcoming is intrinsic to TCP with respect to server connections.

        When a client makes outbound connections, each connection must be from a unique port. A client cannot share a port used by a server, and a port is a 16-bit value, so there are (65536 - server ports - 1 [0 is not addressable]) client connections permitted from any given machine.

        Now when it comes to the server, the situation is different. A server connection is identified by a single port on the server, irrespective of the client. The server distinguishes the client connection based on the client's IP and source port. That gives an intrinsic limitation of about 64k connections per client IP, and about 4G IPs.

        I see no reason why this 'bookkeeping' (when applied to server connections) has to be done with a 16-bit number. That sounds like an arbitrary operating system limitation. It would imply that an incoming packet is examined for IP and source port, that this is looked up in a table with at most 64k entries, and the result taken to be the connection id. This makes no sense -- the kernel would look up the ip:port combination and get a unique stream identifier, which is an int (32 bit).

        Could you please explain clearly (in theory, or in terms of OS implementation) why this isn't the case? Thanks.
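
        For illustration only, the kind of lookup I have in mind is a hash on the remote endpoint, roughly like this (a hypothetical sketch, not any particular kernel's code; the local address is omitted since the server has one IP here):

          #include <stddef.h>
          #include <stdint.h>

          struct conn {                        /* one per established connection */
              uint32_t src_ip;
              uint16_t src_port, dst_port;
              struct conn *next;               /* hash-chain link */
              /* ... sequence numbers, window sizes, socket pointer ... */
          };

          #define NBUCKETS 65536
          static struct conn *table[NBUCKETS];

          static struct conn *demux(uint32_t src_ip, uint16_t src_port, uint16_t dst_port)
          {
              unsigned h = (src_ip ^ ((uint32_t)src_port << 16) ^ dst_port) % NBUCKETS;
              struct conn *c;

              for (c = table[h]; c != NULL; c = c->next)
                  if (c->src_ip == src_ip && c->src_port == src_port &&
                      c->dst_port == dst_port)
                      return c;                /* nothing 16-bit limits the number of entries */
              return NULL;
          }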

  • It's really early in the morning right now, but here is what I am thinking :)

    No matter what you do with the software to handle a giant number of connections, you still have the physical limits of the machine, don't you? The NIC and CPU can only do so much, so isn't that going to be a bottleneck for you?

    It also seems like keeping the state information about 64K+ connections for whatever they are being used for has to involve some kind of overhead as well. You'd really need some efficient way to deal with organizing it all so it's reasonably easy and fast to access.

    You can't handle each connection with its own thread, but even if you break it up into several threads or several processes each handling a couple thousand connections (polling or something similar), that's got to have a lot of latency as well if you expect a large number of these connections to be active most of the time.

    I could be wrong, I don't know exactly what you're trying to do. Maybe the connections are mostly idle. But I think that you are probably looking at more than one bottleneck if you force a single computer to do all the work.

    In general, trying to customize a single machine, or a single program, to scale isn't usually a good solution. You'd probably be better off finding a way to design the software to work with a number of machines. Most large websites have load balancers that distribute the requests to several machines. Large services typically do this, or rely on the fact that all users probably won't be connected all at once.

    • The NIC and CPU can only do so much, so isn't that going to be a bottleneck for you?

      I don't think so, because most of the time a connection is idle. In my application, unlike a web server, a connection only gets traffic when there is an incoming message for its client. This is just like the real telephone network.

      Most large websites have load balancers that distribute the requests to several machines.

      One of the goals is to reduce the cost of buying and maintaining the racks and racks of servers that are ordinarily needed for something like this. All those machines are needed only because of TCP/socket/port limitations, not CPU or bandwidth, because the connections are idle most of the time. The real telephone network isn't engineered to be able to handle everybody going off-hook simultaneously.
  • Fault-detecting switches (Alteon, F5 BigIP, Cisco Local Director) do transparent switching - as do software HA solutions (Rainfinity, Linux Virtual Server).

    BUT...

    ...when the connection is being switched to a new server (for load balancing or failure switchover), the server's application does not know about the connection, and thus the switched-over connection will be rejected. You will need to program your application to keep and distribute the state of the connection. If you do stateless applications (e.g. web-based), then at least the (next) IP connection will have a clean switchover on the server side.
  • by pthisis ( 27352 ) on Wednesday January 09, 2002 @08:41AM (#2808795) Homepage Journal
    First off, Kegel's C10K page referenced earlier is definitely worth a read. And if you're under the impression that having 2^16 TCP port numbers limits you to 2^16 connections, that's not accurate. You can have hundreds of thousands of connections to one machine, presuming you manage them properly (as the C10K page points out).

    More importantly, you should check out http://www.linuxvirtualserver.org/ ; its aim is exactly what you want:

    "The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the cluster is transparent to end users. End users only see a single virtual server."

    Sounds like a perfect match.

    Sumner
    • Just to make it clear:

      "The advantage of the virtual server via NAT is that real servers can run any operating system that supports TCP/IP protocol, real servers can use private Internet addresses, and only an IP address is needed for the load balancer"

      So you can keep your application on Solaris with LVS.

      Sumner
  • The TCP connections to the server are used only for one client to be able to locate a user on another client. Once the conversation starts, the clients talk peer-to-peer using UDP.

    Hence, the server sees very little traffic in comparison.

    • Okay, let's see if I understand.

      You've an instant messaging application, which is composed of a lot of clients and a central server.
      The clients talk to the server only on start-up, but presumably you want to keep the TCP connection open so you would know when a client goes offline.

      That is not a good way to do it, not because you'll have problems with ports (you can have as many connections open as you want; after all, a TCP connection is identified by source_ip:port + dest_ip:port), but because of the overhead you'll encounter maintaining so many connections.

      It's also not good to use UDP for peer-to-peer connections; why complicate your life with things that TCP already offers?

      Presumably, on start-up, a client calls the server, logs in, registers its state (online, busy, D/A, etc.), and asks for the list of names/IDs of its friends.
      You want to send it the IPs of those who are online, so it can send messages to them directly.
      The way I would handle it is this:

      Have a database of your clients, which would include username & password, an IP & state.
      Another field should be a list of names that should be notified when this client goes online/changes status.

      When a client calls in, authenticate it, register its state in the DB, and send it the list of IPs & states of the people online that it requested.
      Notify everyone that requested to be notified and is currently online that the user went online.
      Cut the TCP connection. **
      When the client closes, have it send a message saying "I'm going offline".
      Then change its state in the DB & inform the users linked to it. **

      That way, the only connections you have are from clients starting up, changing status & shutting down.
      That should lower your load considerably.

      As for scalability & fault tolerance, just put it behind a load balancer; that way, if a server is busy or down, the request goes to another server. All of them are linked to the same DB, so the state is preserved.
      If a server goes down in the middle of a request, then the client should be smart enough to recover and try again (that is what browsers do, and why load balancers work so well for HTTP).
      Be sure to make the client stop after a couple of failed tries, though; you don't want to overload the network in case all your servers are down for some reason.

      What about errors, you ask? What if a client is terminated or disconnected without having a chance to inform you?

      Well, the way I would do it is let other clients discover that.
      If a client can't form a direct connection to another client, it should tell the server about this.
      The server would try to reach the client itself, and if it fails, would register that client as offline, store the request for when that user goes online again (or discard it), and tell all the clients that are linked to the failed client about it [***].
      I think it's much better than having the server poll possibly hundreds of thousands of connections. (Or more, if you are lucky.)
      After all, it doesn't matter if a client that no one is trying to call is offline while it's marked online. #

      [**]
      If you care about bandwidth/load, have the clients maintain the list of people (or give it to them during log-on), and have the *client* do the notifications on start-up, shut-down & status change. You'll have to handle the errors yourself, though. ***

      [***]
      You can be really nasty and have the client that discovered the error inform the rest of the clients that are linked to the failed client that it's gone. But that would require that client to have the list of people that want the failed client's link, which can be bad from a privacy point of view.

      [#]
      If it does matter to you, have the clients poll the other clients in their list every hour or so; it would balance out so you wouldn't have too many lost connections marked as alive.
      • You've an instant messaging application ...

        Actually, I said "instant messaging type server." In reality, the application is to send real-time video/audio peer-to-peer. (My question would have been 5 pages long if I gave all the exhaustive details.)

        ... presumably you want to keep the TCP connection open so you would know when a client goes offline.

        No, we want to keep the TCP connection open so a client knows when it has an incoming message instantly. Polling is out of the question because (a) it's not instant and (b) would use enormous amounts of bandwidth we'd have to pay for.

        It's also not good to use UDP for peer-to-peer connections ...

        It is for audio and video. Again, I simplified my question because this detail is mostly irrelevant. I only mentioned it to point out that the bulk of the bandwidth is NOT going through the server, just the "call setup."

        Notify everyone that requested to be notified and is currently online that the user went online. Cut the TCP connection.

        This doesn't allow people not on your "buddy list" to call you (which every IM client does in fact allow).

  • Why not just have the clients send 'keep alives' every 20 seconds or so? And I'd definitely suggest using some of the above-mentioned hardware load balancers.....
  • or did you just not mention why you rejected it?

    If you're thinking of going to the trouble of simulating TCP with raw sockets, UDP seems a simpler alternative to that.

    • If you're thinking of going to the trouble of simulating TCP with raw sockets, UDP seems a simpler alternative to that.

      This being an instant messaging server, presumably for the public, UDP would have too many problems going through certain firewalls.

  • A load balancer does not give redundancy. With a load balancer, if a server dies, NEW connections are sent to a different server instead, but the existing connections to the down server are all closed - an external, non-OS-integrated solution like load balancing does not give transparent failover on TCP connections. It works for HTTP because browsers are used to connections suddenly dying and will simply retry. But if the client isn't smart enough to reconnect, it won't work.

    The way to do this is to build a custom TCP stack and integrate it tightly into your app. A lot of work and hard to get right.

    I would ask, "Do we REALLY need this when our application already has to handle things like network failures?" You might, though - I don't know what your application is.

    Also, don't forget to use redundant routers, redundant firewalls, etc. If you use NAT, that imposes one more problem - transparently moving the connection table between the failed firewall and the working one.
  • A single server can handle at most around 64000 open sockets.

    If my memory serves me correctly one can easily break the 64K limit by using multiple processes. This wasn't on Solaris, though.

    Is anybody aware of off-the-shelf software and/or hardware solutions (either commercial or freeware)?

    Assuming you have control over the protocol, I've written such a server. I don't have the code, but my non-compete agreement has expired.

    • ... one can easily break the 64K limit by using multiple processes.

      AFAIK, ephemeral ports for sockets are a system-wide resource. More than one process can't use the same IP/port pair on a machine.

      • More than one process can't use the same IP/port pair on a machine.

        More than one connection can't use the same 5-tuple of remote address, remote port, protocol, local address, local port. Multiple processes can use the same local address, local port, protocol 3-tuple, but some OSes have problems when multiple processes do an accept() on that same 3-tuple. In any case, these limitations can be easily solved by using multiple local IP addresses and/or multiple local ports. I have seen up to 256 IP addresses on a single ethernet card with no problems. If bandwidth becomes a problem before CPU does, you can add multiple ethernet cards. You can then use round-robin DNS to balance the load between different processes, or (preferably) build the port selection into the client/server protocol. Putting just a small bit of intelligence into the client will save you orders of magnitude in terms of simplicity, reliability, and scalability. You also could, but probably shouldn't, use a load balancer to present a single IP/port combination to the world.

        I'm almost tempted to do a quick test to confirm it, but I'm almost certain I've seen the 64K connections/machine limit broken. This was over a year ago, on FreeBSD. I'd imagine Solaris and Linux are able to do it as well by this point.
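
        A sketch of the multiple-local-address approach: one listening socket per aliased IP, each getting its own full space of (remote IP, remote port) pairs. The addresses and port below are placeholders.

          #include <arpa/inet.h>
          #include <netinet/in.h>
          #include <string.h>
          #include <sys/socket.h>

          /* Call once per local alias, e.g. listen_on("192.0.2.1", 5190),
           * listen_on("192.0.2.2", 5190), ... and add each fd to the event loop. */
          int listen_on(const char *local_ip, unsigned short port)
          {
              int s = socket(AF_INET, SOCK_STREAM, 0);
              int on = 1;
              struct sockaddr_in a;

              setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
              memset(&a, 0, sizeof a);
              a.sin_family = AF_INET;
              a.sin_port   = htons(port);
              inet_pton(AF_INET, local_ip, &a.sin_addr);
              bind(s, (struct sockaddr *)&a, sizeof a);
              listen(s, 1024);
              return s;
          }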

      • AFAIK, ephemeral ports for sockets are a system-wide resource.

        Aren't ephemeral ports only used for outgoing connections, not incoming ones?

        • Aren't ephemeral ports only used for outgoing connections, not incoming ones?

          Uhm, I think what I meant is that when you do an accept(2) on the listening socket, you get a new file descriptor that is attached to another socket connection that is independent of the listening socket. There are a finite number of those.
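
          To make that concrete, a bare-bones accept loop (sketch only): the listening socket is one descriptor and every accepted connection gets its own, so the ceiling is the file descriptor limit, not the 16-bit port number.

            #include <netinet/in.h>
            #include <sys/socket.h>

            void accept_loop(int listen_fd)
            {
                for (;;) {
                    struct sockaddr_in peer;
                    socklen_t len = sizeof peer;
                    int conn_fd = accept(listen_fd, (struct sockaddr *)&peer, &len);

                    if (conn_fd < 0)
                        continue;          /* transient error; keep accepting */
                    /* conn_fd is independent of listen_fd; hand it to the event loop */
                }
            }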
