
War Stories: Check Point Meltdown

Firewalls are usually deployed as a cluster, to provide failover capabilities. Protocols such as VRRP are used so that traffic is normally routed via one node, but if that node fails, the other one automatically takes all traffic. Connection synchronisation ensures that the backup firewall is always aware of all active sessions, so that a failover will be almost seamless.
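
Conceptually, the cluster shares a single session table between the nodes. Here’s a minimal sketch in plain Python – purely illustrative, not Check Point’s or VRRP’s actual mechanics, and all the names are made up – of why that synced state lets established connections survive a failover:

```python
class FirewallNode:
    """Toy model of one cluster member with a synced session table."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()   # known flows: (src, dst, dport) tuples
        self.active = False     # is this node currently the master?

    def sync_from(self, peer):
        """Standby copies the active node's session state."""
        self.sessions = set(peer.sessions)

    def allow(self, flow):
        """Established flows pass; new flows are only accepted on the active node."""
        if flow in self.sessions:
            return True
        if self.active:
            self.sessions.add(flow)
            return True
        return False


primary, secondary = FirewallNode("fw1"), FirewallNode("fw2")
primary.active = True

flow = ("10.1.1.10", "10.2.2.20", 443)
primary.allow(flow)           # session established on the active node
secondary.sync_from(primary)  # connection sync keeps the standby up to date

# Failover: primary dies, secondary takes over and still knows about the flow.
primary.active, secondary.active = False, True
assert secondary.allow(flow)
```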

This works pretty well – I spent years looking after a lot of Check Point clusters, predominantly on Nokia IPSO, later on SecurePlatform. Don’t hate me for it. Occasionally we’d have a power supply or disk drive failure, and traffic would seamlessly switch over to the standby node. It also worked well during upgrades – demote the node to secondary, patch it, promote to primary, test, switch back if needed.

It’s not true redundancy

All good. But there’s a problem – you don’t have TRUE redundancy. You’re protected against hardware failure, and you’ve also got some mitigation strategies to deal with problems related to software upgrades. But what happens if you have a software problem? Normally you’re running exactly the same code and configuration, so if a problem affects one node, it might affect both nodes. Check Point configuration is managed on a per-cluster basis too – when making firewall policy changes, you push the policy to both firewalls at once. You don’t do them in sequence.

That’s what happened to me – we had a software bug that took out the whole cluster.

Note: This story is about a Check Point cluster, but it’s not a dig at Check Point specifically – this is a general problem that applies to any system that has shared code/configuration across a cluster.

Meltdown

Check Point used to have a feature called “Smart Defense”. These days it’s known as IPS. It’s an integrated intrusion detection/prevention system within the firewall. It’s had a chequered history, with engineers heard to mutter “If it makes no sense, it must be Smart Defense.” Certainly I’ve seen some odd behaviour in the past, where it would drop specific sessions without bothering to log any indication of why it dropped them, or even admitting that it had dropped the session.

Check Point releases signature updates for SmartDefense/IPS on a frequent basis. Open up the Policy Editor, and it will say “Hey, these new signatures are available. Do you want to download/enable them?” One morning I saw a new signature for detecting specific types of malicious OSPF traffic. That sounds interesting. Maybe I’ll just put it in “Detect” mode, so it will tell me if it sees anything, but it won’t drop it. Can’t hurt, right? It’s just detecting it, not dropping it.

Tick the box, push policy, done. Make note to check logs later to see if any of that traffic has been detected.

A few minutes later…complete meltdown. Everything’s offline. Nothing reachable in the data center. Can’t get from the Enterprise network to the DC, and it seems that all intra-DC traffic is broken. What the hell’s going on? After a bit of scrambling, we realise that the firewalls seem to be blocking everything, across all the virtual firewalls we have on that VSX cluster.

Can’t remotely log in to them, so we wheel out the Crash Cart, and plug a monitor into the firewall. Kernel Panic. Oops. That’s not good. Power-cycle the firewall, and it comes back up, and starts passing traffic for a minute or two…then kernel panic again. Oh no…more testing shows that if we reboot both firewalls, one goes master, takes a bit of traffic, then kernel panics…so the secondary takes over, passes traffic for a minute…and also kernel panics. Redundancy didn’t help here.

What can we do? Check Point firewalls are managed by a remote management server. You can’t change the configuration locally, like with an ASA. You MUST make changes from the management server…but how can I push out an updated policy, if the firewall won’t stay up long enough to install policy?

Oh and did I mention that the company had a major launch about to happen, within an hour? Yeah, you could say that there was a certain amount of panic, and people lighting their hair on fire and running around.

I knew that I needed to get back to the configuration I had installed yesterday. How to do it? I had backups for all systems – the management server & the firewalls. With Check Point systems, the backup from the firewall doesn’t really matter much anyway – you can just re-build the firewall, and push out policy again. So I ended up doing a restore on the management server to yesterday’s configuration, rebuilding the firewall from scratch, and pushing a working policy out to it. Service working again! All done within an hour too, I might add.

Lessons Learned

  1. Have backups, know where they are, and how to restore from them. I’d had enough system failures over the years to know I needed to have that scripted up (there’s a rough sketch of that sort of script after this list). You’d be surprised how many people don’t have backups of network devices.
  2. Have a good team around you, and trust them. I was lucky that I had a good group of people around me, and we’d spent years doing Operations work together. We knew how to handle those sorts of situations. One person ran interference, and cleared out space around the people working on the problem. This got rid of all the noise, and gave us time and space to figure out what had happened, and how we were going to recover. The worst thing you can do is stand over the engineer’s shoulder, and ask “Is it fixed yet?” every 5 minutes.
  3. Redundancy is more than just hardware. Understand what the failure domains are. This is partly why I don’t like stretched L2 designs – you’re providing some redundancy against hardware failure, but you still have a shared component that could take out everything. Ideal scenario would have been to have another data center that we could run all services out of, but at the time we didn’t have the budget.
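
On that first lesson: the backup scripting doesn’t need to be fancy. Here’s a rough sketch of the idea in Python – the hostnames, paths, and remote “backup” command are hypothetical placeholders rather than what we actually ran, and it assumes SSH key auth is already in place:

```python
#!/usr/bin/env python3
"""Rough sketch: pull backups from a list of network devices over SSH.
Hostnames, paths and the remote backup command are hypothetical examples."""
import subprocess
from datetime import date
from pathlib import Path

DEVICES = [
    "fw-mgmt01.example.net",      # management server
    "fw-cluster01a.example.net",  # firewall cluster members
    "fw-cluster01b.example.net",
]
DEST = Path("/var/backups/network") / date.today().isoformat()


def backup(host: str) -> None:
    """Stream a backup archive from the device into a dated local directory."""
    DEST.mkdir(parents=True, exist_ok=True)
    target = DEST / f"{host}.tgz"
    # The remote command differs per platform; "backup --stdout" is a placeholder.
    with target.open("wb") as out:
        subprocess.run(["ssh", host, "backup --stdout"], stdout=out, check=True)
    print(f"saved {target}")


if __name__ == "__main__":
    for host in DEVICES:
        try:
            backup(host)
        except subprocess.CalledProcessError as err:
            print(f"backup FAILED for {host}: {err}")
```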

It was a rather “interesting” morning all around, but we were able to recover, and life goes on. That’s the good thing about networking – once you’ve fixed the problem, stuff starts flowing again. It’s a lot less entertaining dealing with outages that involve lost data. Sometimes that stuff can never come back.


12 Responses to War Stories: Check Point Meltdown

  1. Duane Grant March 7, 2014 at 1:58 am

    I know that you were in the heat of battle, but I’m wondering: if you had pulled all the interfaces except mgmt, would it have stayed up so you could revert the config?

    Just food for thought for the next time this happens, as it seemed like it could have been a traffic-induced panic.

    –Duane

    • Lindsay Hill March 8, 2014 at 10:06 am

      Yeah, I think it was traffic induced. It was a few years ago now, and I can’t recall for certain if we pulled all the cables. I _think_ we did, but still had the same issue, even with just mgmt plugged in.

      • Tony D August 6, 2014 at 3:30 am

        FYI, we had a very similar issue this past December. ALL of our Checkpoint firewalls (running GAIA) downloaded a corrupt Antibot/Anti-Virus signature that put ALL of them in a reboot loop. The restore was a bit easier than your case. Once Checkpoint figured out the root cause, we just disabled the Antibot/Anti-Virus module on all the firewalls via a policy push.

        HUGE blow to Checkpoint’s reputation from my perspective, and even more so after reading your story.

        • Lindsay Hill August 8, 2014 at 2:48 pm

          Ooh, that’s nasty. Very hard to regain trust after issues like that.

  2. Duane Grant March 7, 2014 at 2:05 am

    I also meant to say that it was a great article as well and it’s a hard problem to solve. We have this same exposure at many levels of our network and I hope that we’re as prepared as you were.

  3. danmassa7 March 7, 2014 at 2:46 am

    Excellent post. I had a university professor that used to say systems need to have “high modularity AND low coupling.” This Checkpoint system had too much coupling because it exhibited too much fate-sharing. Stretched L2 designs have the same problem.

    Hey, Checkpoint and Cisco (and others) both have the same setup: the two “redundant” firewalls are coupled together with lots of state coordinated between the two. Must be a good design, right? As network engineers we should be on the lookout for this kind of stuff and push vendors in the right direction.

    • Lindsay Hill March 8, 2014 at 10:05 am

      There’s no easy way out really – we want to be able to fail over seamlessly, but that shared state creates a risk. Only thing you can do is have other clusters that don’t share state.

      I don’t want to go back to that whole “Dual vendors for firewall” crap we used to espouse – in my experience you’d end up with one of the sets of firewalls being poorly managed. Better to have one vendor that you manage well.

  4. Jimbo Johnson March 8, 2014 at 3:11 am

    May I ask what version you were running?

    • Lindsay Hill March 8, 2014 at 9:59 am

      It was a few years ago – I think it was R65 at the time

      • alex March 17, 2015 at 8:21 pm

        lol…. that was fun aye Lindsay :P

  5. ZZTop March 26, 2014 at 9:12 am

    I’d add a fourth lesson (really intended for vendors) which is that a system that doesn’t have some form of out of band management is trouble waiting to happen. I see this all the time on SBCs from a certain vendor. You get a registration storm, say, and the box is overwhelmed to such an extent that you cannot even get in and configure your way out of trouble. Or worse, the troubleshooting tool makes things worse.

    We ditched CheckPoint a long time ago, so it’s surely waaay better now than when I used it, but at the time, having to rely on Windows starting, the FW1 service starting, and the management GUI coming up without incident when everything else was on fire caused us some serious indigestion. FWIW (and this is not a shameless plug, since everyone’s needs are different) we switched to a pair of OpenBSD firewalls running pf in a ramdisk with the config stored read-only on a removable disk. That way even if it all blew up in our faces we could get the config loaded on another box.

  6. Jule August 27, 2014 at 9:08 pm

    It’s a perfect example of “technical changes” that don’t follow a change management process – something I see often.
    Besides that, I would like to add a fifth lesson: test your changes before crippling the production environment.