Partychat — migrating from Google App Engine to EC2

Posted on Dec 17, 2011 in python, software | 17 Comments

I’m Vijay Pandurangan, and I’ve been working with some other super talented folks to help maintain the partychat code, and help pay for its services. Because of the latter, I was especially motivated to keep Partychat’s costs under control!

Recently, Google App Engine made some substantial pricing changes. This affected a lot of people, but especially partychat, a service with over 13k one-day active users.
In this blog post (and a couple more to follow), I’ll describe various aspects of the pricing change, our ensuing migration from App Engine to Amazon’s EC2, and the impact on users, including cost structures and calculations.

For those of you who don’t have the time to read everything, here’s the

tl;dr version:

  • Google’s new pricing was totally out of line; we were able to re-create a similar service at about 5% of the cost of running it on App Engine.
  • Google’s policy has resulted in higher costs for them (XMPP messages used to stay entirely within their network, but now have to be sent to and from EC2) and reduced their revenue (we, and others, will likely shun App Engine).
  • App Engine requires a very different design paradigm from “normal” system design.
  • Some App Engine modules lock you in to the platform. We had to make all our users transition from channel@partychapp.appspotchat.com to channel@im.partych.at because Google does not allow us to point domain names elsewhere.
  • You can operate things on EC2 substantially more cheaply than App Engine if you design correctly.
  • I strongly recommend against running anything more complicated than a toy app on App Engine; if Google decides to arbitrarily change its pricing or the services it offers, you’ll be screwed. There is really no easy migration path. Random pricing changes, coupled with the lack of polish in AppScale, mean that any solution differing even slightly from the ordinary is “stuck” on App Engine.
  • Transitioning a running app is pretty hard, especially when it’s an open source project done in spare time.
  • By (partially) transitioning off of App Engine, we’ve actually reduced our cost below the pre-increase level while delivering roughly equivalent capacity:

Before: $2/day.
With price increase: ~$20/day.
On EC2 (no prepay) / App Engine hybrid: $1/day.
On EC2 (annual prepay) / App Engine hybrid: $0.80/day.
At this point, residual App Engine charges still amount to approximately $0.50/day.
On EC2 (annual prepay) / App Engine hybrid using more memcache: $0.60/day.
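As a quick check on the "about 5%" figure, the daily costs listed above can be compared against the post-increase App Engine price. This is only arithmetic on the numbers quoted in this post; the dictionary keys and the helper name are made up for illustration.

```python
# Daily costs quoted in the list above, in US dollars per day.
# The key names are illustrative labels, not anything from the partychat code.
DAILY_COST = {
    "app_engine_old_pricing": 2.00,
    "app_engine_new_pricing": 20.00,
    "hybrid_no_prepay": 1.00,
    "hybrid_annual_prepay": 0.80,
    "hybrid_annual_prepay_more_memcache": 0.60,
}


def percent_of(cost, baseline):
    """Return `cost` expressed as a percentage of `baseline`."""
    return 100.0 * cost / baseline


new_price = DAILY_COST["app_engine_new_pricing"]
for name, cost in DAILY_COST.items():
    share = percent_of(cost, new_price)
    print(f"{name}: ${cost:.2f}/day, {share:.0f}% of the new App Engine price")
```

The no-prepay hybrid works out to 5% of the new App Engine price, and the later configurations come in below that.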

Background:

Partychat is a group chat service. By going to a web site, or sending an IM, users create virtual chat rooms. All messages sent to that chat room alias are then broadcast to all other users in that channel. Here is an example of two channels with two sending users.

This is what happens from the perspective of our app:

Google’s service calls an HTTP POST for every inbound message to a channel, and outbound messages are sent via API calls.
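The broadcast behaviour described above can be sketched in a few lines of Python. This is a toy in-memory model, not partychat's actual code; the `members` table, the function names, and the message formatting are all assumptions made for illustration.

```python
from collections import defaultdict

# Channel name -> set of member addresses.
# An in-memory stand-in for whatever storage the real service uses.
members = defaultdict(set)


def join(channel, user):
    """Add a user to a channel, creating the channel on first use."""
    members[channel].add(user)


def broadcast(channel, sender, body):
    """Return (recipient, text) pairs for everyone in the channel except the sender."""
    text = f"[{sender}] {body}"
    return [(user, text) for user in sorted(members[channel]) if user != sender]


# Example with made-up addresses: two channels, each with its own members.
join("pancakes", "alice@example.com")
join("pancakes", "bob@example.com")
join("waffles", "carol@example.com")
join("waffles", "dave@example.com")
for recipient, text in broadcast("pancakes", "alice@example.com", "hello"):
    print(recipient, text)
```

One inbound message turns into one outbound message per other member, which is why the number of outbound messages, and any per-message charge, grows with room size.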

Cost estimates:

Before Google’s pricing changes, our daily cost to process messages was about $2/day. Before the new pricing went into effect, I used some anonymous logging information to forecast how much the service would cost to operate in the new pricing regime.

As you can see from the graph, even limiting the maximum room size to 200 people (which would be a major disruption to our services) would have cost us well over $10/day, which is really unacceptable.
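A forecast of this kind can be sketched roughly as follows, assuming the charge is proportional to the number of outbound messages. The function name, the input format, and every number in the example are hypothetical; the real forecast used logged traffic and Google's actual per-message price, neither of which is reproduced here.

```python
def forecast_daily_cost(rooms, price_per_message, max_room_size=None):
    """Estimate one day's outbound-message charge.

    rooms: iterable of (room_size, inbound_messages_per_day) pairs.
    price_per_message: charge per outbound message (hypothetical value).
    max_room_size: optional cap on the number of members per room.
    """
    outbound = 0
    for room_size, inbound_messages in rooms:
        if max_room_size is not None:
            room_size = min(room_size, max_room_size)
        # Each inbound message is delivered to everyone except the sender.
        outbound += inbound_messages * max(room_size - 1, 0)
    return outbound * price_per_message


# Made-up traffic: three rooms of different sizes, and a placeholder price.
example_rooms = [(5, 200), (50, 300), (800, 100)]
placeholder_price = 0.0001
print(forecast_daily_cost(example_rooms, placeholder_price))
print(forecast_daily_cost(example_rooms, placeholder_price, max_room_size=200))
```

Capping the room size only reduces the contribution of the largest rooms, so if most of the volume comes from small and mid-sized rooms the cap does not help much.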

Migration:

Since this was an open-source project I decided to take the simplest approach possible to make this migration. I’d create an XMPP server on EC2 that would simply do all the sending and receiving of XMPP messages instead of App Engine.

Note that as a result of their policy, Google makes less money AND has higher costs! (All XMPP traffic now gets routed through EC2, which is taking up bandwidth on their peering links)

Our App Engine app still does almost all of the processing of messages (including deciding who gets sent which messages), but does not do the actual fanout (i.e. creating n identical messages for a broadcast message). That is handled in the proxy.

XMPP Server:

In order to run an XMPP proxy, we need to deploy an XMPP server with the ability to federate to other servers, and code that interfaces with that server and receives and sends messages.

There are a bunch of XMPP servers out there, but the overall consensus is that ejabberd, a server written in Erlang (!), is the best and most stable. It has proven to be extremely stable and efficient. The big issue is that configuration is really difficult and not very well documented. A couple of important points that took forever to debug:

  • ejabberd has a default traffic shaping policy. Traffic shaping restricts inbound and outbound network traffic according to a policy. Traffic that exceeds the limits is buffered until the buffer is full, then dropped. Partychat’s message load can often be substantially higher than the default policy’s limit for sustained periods, resulting in randomly dropped messages.
  • If multiple servers associate with one component (more on this in the next section), ejabberd will round-robin messages between the connections. This means that your servers have to run on roughly equal machines.
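To see why a shaping limit below the sustained load leads to dropped messages, here is a toy simulation of the buffer-then-drop behaviour described in the first point. It is a simplified model, not ejabberd's implementation; the function name, the per-tick units, and the numbers are assumptions.

```python
def shape(arrivals_per_tick, rate, buffer_size):
    """Simulate a simple traffic shaper.

    Each tick, at most `rate` messages are forwarded. Excess messages wait in
    a buffer holding at most `buffer_size` messages; anything beyond that is
    dropped. Returns (sent, dropped, still_queued).
    """
    queued = 0
    sent = 0
    dropped = 0
    for arrivals in arrivals_per_tick:
        backlog = queued + arrivals
        forwarded = min(rate, backlog)
        sent += forwarded
        backlog -= forwarded
        queued = min(backlog, buffer_size)
        dropped += backlog - queued
    return sent, dropped, queued


# A short burst fits in the buffer and is delivered late, with nothing lost.
print(shape([20, 0, 0], rate=10, buffer_size=10))
# Sustained load above the limit fills the buffer and then loses messages.
print(shape([20, 20, 20], rate=10, buffer_size=5))
```

In this model a burst is only delayed, but load that stays above the limit keeps the buffer full and drops the excess on every tick.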

Proxy code:

XMPP supports two connection types, Client and Component. A client connection is what your chat client uses to connect to a server. It requires one connection per user, and the server remembers all the important state (who you are friends with, etc.). This is by far the simplest solution for writing something like partychat, but there are a few problems. It requires the server to keep track of some state that we don’t care about (Are the users online? What are their status messages? etc.), which adds load to the server’s database. This can be solved by increasing database capacity, but that is wasteful since these data are not used.

More importantly, using client connections requires one TCP connection per user (see the image below). This means that for our service, with ~50k users, our server would need to handle 50k TCP connections. This is already a really large number, and will not scale that well.

The alternative (which I selected) was to use a component interface (see above image), which essentially gives you the raw XML stream for any subdomain. Your code is then responsible for maintaining all state, responding to presence and subscription requests.

Initially we used SleekXMPP, a Python library, to manage component connections. The state was stored in RAM and then periodically serialized and written out to disk. Since XMPP has provisions for servers to rebuild their state in case of data loss without human involvement, losing state is not catastrophic, though it results in substantially higher load on the system while redundant subscription messages are dispatched and processed. The state that we store currently contains:

user := string (email)
channel := string (name of channel, UTF-8)
inbound_state := {UNKNOWN, PENDING, REJECTED, OK}
outbound_state := {UNKNOWN, PENDING, REJECTED, OK}
last_outbound_request := timestamp

The last outbound request timestamp is required to prevent subscription loops in case sequence ids are not supported (the spec details why this is important.)
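The per-subscription state listed above maps naturally onto a small record type. The sketch below is one possible Python rendering, not the proxy's actual code; the class names, the `should_send_request` rule, and the 60-second `min_interval` are assumptions chosen to illustrate how a timestamp can break a subscription loop.

```python
from dataclasses import dataclass
from enum import Enum


class SubscriptionState(Enum):
    UNKNOWN = "unknown"
    PENDING = "pending"
    REJECTED = "rejected"
    OK = "ok"


@dataclass
class Subscription:
    user: str      # email-style address of the user
    channel: str   # channel name
    inbound_state: SubscriptionState = SubscriptionState.UNKNOWN
    outbound_state: SubscriptionState = SubscriptionState.UNKNOWN
    last_outbound_request: float = 0.0  # seconds since the epoch

    def should_send_request(self, now, min_interval=60.0):
        """Decide whether to send another outbound subscription request.

        Hypothetical rule: never re-request once accepted, and otherwise wait
        at least `min_interval` seconds between requests, so that two servers
        echoing requests at each other cannot loop forever.
        """
        if self.outbound_state is SubscriptionState.OK:
            return False
        return now - self.last_outbound_request >= min_interval

    def record_request(self, now):
        """Note that an outbound subscription request was just sent."""
        self.outbound_state = SubscriptionState.PENDING
        self.last_outbound_request = now
```

A caller would check `should_send_request(time.time())` before sending a subscription request and call `record_request` afterwards.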

Each inbound message results in an HTTPS request to the App Engine server. The server responds with JSON containing a message that is to be sent, a list of all recipients, and the channel from which it is to be sent.
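Turning such a response into outbound messages might look like the sketch below. The JSON field names (`message`, `recipients`, `channel`) are guesses based on the description above, not the real wire format, and the function name is hypothetical; the `im.partych.at` domain is the one mentioned earlier in this post.

```python
import json


def build_outbound(response_body, domain="im.partych.at"):
    """Expand one backend response into (from, to, body) triples.

    Assumes a JSON object with the hypothetical keys "message",
    "recipients", and "channel".
    """
    reply = json.loads(response_body)
    sender = f"{reply['channel']}@{domain}"
    return [(sender, recipient, reply["message"]) for recipient in reply["recipients"]]


# Example with a made-up response body.
example = json.dumps({
    "channel": "pancakes",
    "message": "[alice] hello",
    "recipients": ["bob@example.com", "carol@example.com"],
})
for triple in build_outbound(example):
    print(triple)
```

This is where the fanout happens: the backend sends one response, and the proxy emits one XMPP message per recipient.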

The Python library was OK, but did not operate well at our scale. Profiling showed that much time was spent copying data around. The server periodically crashed or hung, resulting in dropped messages and instability. The inefficiency required us to use a medium instance ($0.17/hour) to serve traffic, which put us at about $4/day. Still substantially lower than App Engine, but too high!

The server was then re-written in C++, using gloox, libcurl, openssl, and pthreads. Each message is dispatched to a threadpool. Blocking HTTPS calls are made to the App Engine server, and the results are then sent via ejabberd. This server is able to handle our max load (~12-16 inbound messages per second, resulting in around 400-600 outbound messages per second) at under 5% CPU load on a micro instance (at $0.02/hour).
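The dispatch pattern (one pool task per inbound message, a blocking call to the backend, then fanout) can be illustrated in Python with `concurrent.futures`. This is an analogue of the design, not a translation of the C++ server; the function names are hypothetical, and `fetch_reply` and `send` stand in for the HTTPS call and the ejabberd connection.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_inbound(message, fetch_reply, send):
    """Process one inbound message and return the number of messages sent."""
    reply = fetch_reply(message)  # blocking; an HTTPS request in the real system
    for recipient in reply["recipients"]:
        send(recipient, reply["message"])
    return len(reply["recipients"])


def run(messages, fetch_reply, send, workers=8):
    """Dispatch every inbound message to a thread pool and total the fanout."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(handle_inbound, m, fetch_reply, send) for m in messages]
        return sum(f.result() for f in futures)


# Stand-ins for the backend and the XMPP connection.
def fake_fetch(message):
    return {"message": message.upper(), "recipients": ["a@example.com", "b@example.com"]}


outbox = []


def fake_send(recipient, body):
    outbox.append((recipient, body))  # list.append is thread-safe in CPython


print(run(["hello", "world"], fake_fetch, fake_send))
```

Because each task spends most of its time waiting on the backend, a small pool of threads is enough to keep many requests in flight.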

The system is mostly stable (a lingering concurrency bug causes it to crash and automatically restart about once every 12 hours) and should provide us with substantial room to scale in the future.

Issues:

Google Apps for Your Domain has grown in popularity recently. This presents us with a problem: in order for XMPP messages to be delivered to users on a domain, a DNS SRV record needs to exist for that domain. Google does not automatically create this record, but messages between Google Apps domains get delivered correctly (so we never saw this issue with App Engine).
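For reference, the record in question is the XMPP server-to-server SRV record, published under `_xmpp-server._tcp.<domain>` and normally pointing at port 5269. The helper below just formats such a record as a zone-file line; the function name, the default priority, weight, and TTL, and the target host in the example are placeholders, so the real target for a hosted domain should be taken from the provider's documentation.

```python
def xmpp_server_srv_record(domain, target, port=5269, priority=5, weight=0, ttl=3600):
    """Format a zone-file line for a domain's XMPP server-to-server SRV record."""
    return f"_xmpp-server._tcp.{domain}. {ttl} IN SRV {priority} {weight} {port} {target}."


# Placeholder domain and target host.
print(xmpp_server_srv_record("example.com", "xmpp.example.net"))
```

Without a record like this, a federating server such as our ejabberd proxy has no standard way to find out which host accepts XMPP traffic for the domain.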

Another issue that slowed down development a lot was that some clients (generally iTeleport) send invalid XML over the wire, which causes the Python XML code to throw namespace exceptions. This made SleekXMPP even less reliable, and required making code changes in core Python libraries. The C++ libraries handle this gracefully.

Future work:

In the near future, we will be disabling old @partychat.appspot.com aliases. Other future work includes billing for large channels, and reducing the number of write operations on App Engine.

Conclusions

Working on this project has been quite educational. First of all, migrating a running service with many users from Google App Engine is hard, especially if it uses esoteric services, such as XMPP. AppScale is a reasonable platform that could help with the transition, but it is difficult to use, and may not be fully ready for production. Google App Engine’s insistence on a different paradigm for development makes migration extremely difficult, since moving to a new platform requires rearchitecting code.

An even bigger problem is the fact that some aspects of your system (e.g. XMPP domains) are not under your control. We had to migrate our users to a new chat domain, because Google did not allow us to point our domain elsewhere. This was a huge inconvenience for our users. Since our service is free, it was less of a big deal, but for an actual paid service, this would be a serious problem.

Since pricing is subject to rapid, arbitrary changes, and transitioning is difficult, no system that is likely to become productionized at scale should be written on App Engine. EC2’s monitoring and auto-scaling systems (more on this in a subsequent post) are excellent, and don’t require buying into a specific design paradigm. In fact, moving this proxy to a private server or Rackspace would be quite trivial.

Edit: I wanted to add this, just to clarify my point:
It’s more than just a price/stability tradeoff. The problem is, as an App Engine user, one is totally at the mercy of any future price changes on App Engine because it is nearly impossible to seamlessly migrate away. The DNS aliases can only point to Google, and the API does not really work well on any other platform. So, at any time in the future, one’s service could suddenly cost 10x as much, and one won’t really have the option to move quickly. If one intends to scale, it’s better to never get into that state in the first place, and develop on EC2 instead. If EC2 raises its prices (highly unlikely since computing power is increasing and costs are decreasing), one can always move to rackspace or just get a private server.

It’s of course true that writing stuff on App Engine can sometimes require a lot less engineering work. But the difference is not really that substantial when compared to the possibility of being stuck on a platform that all of a sudden makes your company unprofitable. Changing a running service is very hard. Avoiding the problem by not getting stuck on App Engine is not trivial, but it is, in my opinion, the right call.


17 Comments

  1. leg
    Saturday, 17 December 2011

    Can you now support gatewaying partychat rooms into standard XMPP MUCs? That would be awesome.

  2. vijayp
    Saturday, 17 December 2011

    Yeah I think that should be possible. Let me re-read the spec to see exactly how that would work.

  3. Dave Peck
    Sunday, 18 December 2011

    This is quite interesting, thank you. As someone who has worked with and on App Engine since the early days (I gave a talk at Google I|O 2009 about scaling with GAE) I agree for the most part with your final conclusion: “no system that is likely to become productionized at scale should be written on App Engine.”

    This is a sad conclusion to arrive at after all these years, especially when the original promise of App Engine was (essentially) “write your applications against our strange, quirky API, and they’ll scale far more cheaply and reliably than they could otherwise.” I think the pricing changes, coupled with regular performance problems, gives the lie to that promise.

    I’m porting several of my apps off of App Engine, including a big production app (https://www.getcloak.com/) — luckily, the only “odd” service I make use of is the Task Queue. (Specifically, I make use of Task Queue transactionality — so there isn’t a direct analog in, say, Celery.) GetCloak.com is heading to AWS as well.

    Cheers,
    Dave

  4. Simon Morris
    Sunday, 18 December 2011

    Thanks for the article… very informative.

    I have a few toy applications on GAE, and I do like the platform but the pricing changes have really put me off using it for any new ventures. Big shame really :(

  5. Not Sure
    Monday, 19 December 2011

    Unless your time is free, you’re overlooking all of the IT costs associated with running, maintaining, and securing an EC2 instance. And EC2 has its own set of issues with reliability and cost (micro instances don’t really count. If you can run your business on a micro instance, it isn’t much of a business).

    I think you’re only half correct when you say “no system that is likely to become productionized at scale should be written on App Engine”. Companies that make real money via GAE aren’t even blinking at the price increases – the benefits of having it as GA and better support (and let’s be frank, chasing off the hobbyist users) are well worth the additional cost.

  6. Jason Kratz
    Monday, 19 December 2011

    But the issue this completely leaves out is the need to manage all of those EC2 instances. That counts for a hell of a lot and should always be factored in to any decision. The statement “I strongly recommend against running anything more complicated than a toy app on App Engine” is kinda ridiculous.

  7. vijayp
    Monday, 19 December 2011

    The problem isn’t so much the cost of appengine as the change in pricing. If app engine could make an app cost 20x what it would to run on EC2 (agreed, minus server administration, but that’s pretty small for our service), it could just as easily make it 200x or 2000x. Building a business where your tech is locked to a platform that you don’t control, can’t migrate off of, and has pricing changes that appear arbitrary is not, in my view, a good business practise.

  8. Tom Willis
    Tuesday, 20 December 2011

    So when amazon raises their prices substantially on EC2 to easily make it 200X or 2000X of what it would cost to run on app engine , will you be migrating back? :)

  9. vijayp
    Tuesday, 20 December 2011

    No, I will migrate to rackspace, or any number of other virtual server environments. It will take me almost no time to do because I’m not making use of any special API and I’m not locked in to amazon’s platform. Which, of course, is precisely why Amazon cannot raise prices like that.

  10. Links for December 16th through December 19th — Vinny Carpenter's blog
    Tuesday, 20 December 2011

    [...] Partychat — migrating from Google App Engine to EC2 « Vijay Pandurangan’s blog – Google App Engine’s insistence on a different paradigm for development makes migration extremely difficult, since moving to a new platform requires rearchitecting code [...]

  11. Jack
    Wednesday, 21 December 2011

    Woo, google removed the post that links to this article from the AppEngine group.

  12. Another Dev
    Wednesday, 21 December 2011

    “The problem isn’t so much the cost of appengine as the change in pricing.”

    The old pricing model wasn’t realistic. It failed to represent the actual costs of running the service, failed to represent which resources were more limited in supply than others.
    App Engine is now an official Google product. That means it isn’t going to fall off the face of the earth like Wave. In fact, the App Engine team is aggressively growing. It also means App Engine needs to be reasonably profitable.
    None of this (let alone all of it plus an SLA) justifies a new pricing model to you?
    If the fact that App Engine could not continue on the same plan doesn’t justify a new pricing model, what would you suggest they do instead of change it?
    What WOULD, to you, justify a new pricing model?

    Have you made any attempts at lowering your costs of running in the new App Engine pricing model?
    For example, going from single threaded to multi-threaded request processing could impact your costs dramatically. Is that $20 per day quote from a multi-threaded app?
    If you have not made attempts at understanding and working with the new pricing model, do you recognize that your $2 -> $20 examples and derived arguments are quite useless?

  13. vijayp
    Wednesday, 21 December 2011

    The pricing predictions are only related to the per-message charge for XMPP. Other charges were ignored for the price comparison analysis.

    It’s unlikely that the pricing increases were in any way linked to the cost of operating the service. As discussed in the blog, Google now handles exactly the same amount of XMPP traffic as before, except all of it is coming from outside the intranet instead of inside. I would be very surprised if the actual cost of maintaining the service were anywhere close to that number. I’m quite sure the price per XMPP message was picked without regard to service cost; probably picked to some unoptimized estimate of running such a service in the wild times some constant

  14. Dake
    Wednesday, 28 December 2011

    Do you have any additional arguments to back up this statement “I strongly recommend against running anything more complicated than a toy app on App Engine”

    From reading your article.

    1 You want to run a free service based on the XMPP API of google.
    2 Unfortunately for you the costs are now too high.
    3 So you need to implement your own XMPP API.
    4 Google doesn’t support sockets.
    5 So you end up with EC2.

    GAE is not 20x more expensive than EC2; it lacks the functionality to implement your own XMPP.

  15. Michael
    Wednesday, 28 December 2011

    FYI Tom, Amazon hasn’t increased their prices yet. In fact, they’ve dropped their prices on multiple occasions.
    As vijayp said, it is now much easier to migrate to another cloud provider. This lack of vendor lock in will keep Amazon’s prices in check.

  16. Sam Kaufman
    Friday, 6 January 2012

    That tl;dr was tl, dr.

  17. abhishek jain
    Friday, 20 April 2012

    Just a question vijay.
    I am building an app, and I need to know whether Google App Engine now allows you to have room@your.domain.com instead of room@appspot.com for exchanging XMPP messages.
    I can’t get this working and am finding different responses in different posts on the net.

    Pl. help
    thanks
    abhi
