How to Upgrade a Legacy Heroku Database - without downtime (sendhub.com)
30 points by ashrust on Aug 23, 2012 | 15 comments


Woah. Heroku is nice, but doesn't $6400/mo seem insane for a quadruple extra large EC2 instance? That's an impressive margin.


We've noticed the same price differential, but if you include the cost of paying someone to manage the DB and be on call in the event of an issue, it seems to even out.

We've moved some stuff off Heroku, but not because of cost. Most recently we discovered they have a tight cap on Scala apps, and it was cheaper for us to run it on AWS directly rather than code around the limit on Heroku.


That sounds wrong, but I guess you place a premium on someone being "on call" that I am not aware of. Wouldn't it be cheaper/easier to just ensure all your developers understand your infrastructure and can step in where necessary?


You're right: if we manage our own DB, then we need everybody to understand how to troubleshoot both software and hardware issues, rather than just filing a ticket and following instructions/suggestions.

Let's say we have 5 engineers and that kind of training costs us 1 hour per engineer per week. That's 5 hours a week, or roughly 20 hours a month, which is easily more than $1k/month, and the cost grows as the team expands. This is before you account for the losses of having your own engineers on call for issues, setting up early warning systems specific to the database, etc.


Good point.

Hardware aside (that's largely moot on both Heroku and AWS), the 5x cost of the server will very quickly outstrip the cost of a good developer/operations person. It seems like a good idea (I tend to just throw money at a problem initially too!), but at some point it stops making financial sense, which is what I was getting at.

Note that we don't have every engineer on staff capable of bringing up a new server or rebuilding a damaged MySQL replication setup, but we do have engineers who can:

  - ssh (or attempt to ssh onto) a dead instance
  - tail a log
  - check disk space, memory usage and CPU load
  - check if something is running
  - restart something that just randomly stopped

All that stuff is pretty basic, and will get you 80% of the way there, the other 20% being experience. I guess at some point you are paying for the experience of working with a particular datastore, and that is where it makes sense.

As far as early warning systems go, setting them up does take time, but it's not as big a deal as it appears to be.

EDIT: I can't math, and 80 + 10 = 90, not 100. Good thing I'm not a data scientist :)


For getting early warnings and helping your employees debug what went wrong, there are several good monitoring tools. They won't magically solve your problems, but they will make it much easier for non-database experts to run a database server. For monitoring PostgreSQL I have used the excellent plugins that ship with munin.

http://munin-monitoring.org/

EDIT: munin also ships monitoring plugins for MySQL, but since I have never used them I cannot vouch for them. The general health monitoring of the machines that munin provides (disk usage, CPU usage, SMART status, inode usage, ...) is also probably more valuable than the database-specific monitoring.
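
For what it's worth, here's roughly how I'd wire it up on the database host. This is a sketch assuming a Debian-style install of munin-node; paths and the exact plugin list will vary by distro and munin version:

  # Ask munin-node which plugins it can run on this host; the postgres_*
  # plugins should show up if the node can connect to the database.
  munin-node-configure --suggest

  # Print (but don't execute) the symlink commands that would enable the
  # suggested plugins; review the list, then run the ones you want.
  munin-node-configure --shell

  # Restart munin-node so it picks up the newly enabled plugins.
  /etc/init.d/munin-node restart

Note that --shell only prints the commands; nothing changes until you run them yourself.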


Maybe, and that is something we'll be exploring more in the future. The prohibitive element is that it isn't easy to find devs of that caliber, so it would probably still roughly even out (with the understanding that excellent devs aren't cheap).


Yeah, my comment only makes sense if you can actually find the devs to do the work ;)


When you say 'cache utilization', are you referring to the Postgres working set/cache? How are you measuring that?


"Cache utilization" refers to the amount of space the database is using in memory compared to the amount of total cache available on the dedicated Heroku database instance (as seen on their pricing page).

We are measuring it by issuing a `hr pg:info -asendhub` command, which includes a field called "Data Size".

e.g.:

  === HEROKU_POSTGRESQL_GOLD (DATABASE_URL)
  Plan:        Ika
  Status:      available
  Data Size:   2.00 GB
  Tables:      76
  PG Version:  9.1.4
  Fork/Follow: Available
  Created:     2012-08-16 04:09 UTC
  Followers:   HEROKU_POSTGRESQL_TANGERINE
  Maintenance: not required

As for whether this is the "working set/cache": it is not clear where the "Data Size" number comes from. This is an interesting question; I'll ask Heroku for more information and report back.


Heroku has replied:

> Cache utilization talks about the working set size, yes.
>
> The database size typically means just the result of pg_database_size() on your database (you can use the current_database() function as a shortcut). Note that this does include dead tuple bloat that is regularly cleaned up by Postgres autovacuum, so it's more of an estimate than a hard number.
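
If you want to check that number yourself, something like this should match what Heroku describes, assuming you're connecting with psql via the DATABASE_URL config var (`heroku pg:psql` drops you into the same session):

  # Total on-disk size of the current database, including the dead-tuple
  # bloat that autovacuum will eventually reclaim.
  psql "$DATABASE_URL" -c \
    "SELECT pg_size_pretty(pg_database_size(current_database()));"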


Right, but that doesn't give you a good idea of how close your working set is to the max memory cache, unless your entire DB is active.

I'll ping them again and ask if they've found a good way to measure this. It seems like the most important number to look at when scaling a Heroku/Postgres project.
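
In the meantime, one rough proxy (my own guess at a measurement, not something Heroku has suggested) is the buffer cache hit ratio from pg_stat_database:

  # Fraction of block reads served from Postgres' shared buffers since the
  # stats were last reset; a low ratio suggests the working set exceeds cache.
  psql "$DATABASE_URL" -c \
    "SELECT blks_hit::float / nullif(blks_hit + blks_read, 0) AS cache_hit_ratio
     FROM pg_stat_database
     WHERE datname = current_database();"

Note this only sees Postgres' own shared buffers, not the OS page cache, so treat it as a rough signal rather than a hard answer.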


I'm glad you were able to find a good solution.


Very cool. That's all I can really say: kudos for thinking on their feet, coming up with a clever solution, and then sharing it.


We do try. And btw, we're hiring if you know any iOS engineers...

Let us know if you're in the unfortunate position of having to use it. We'd love to know if it works out for you.



