Ceph Outage with OSD Heartbeat Failures on Hammer (0.94.6)

Symptoms

  • The cluster went down after 24 OSDs were added and marked in simultaneously.
  • This was an erasure-coded (10+5) RGW cluster on Hammer.
  • OSDs started failing one after another until roughly 50% of them were down.
  • Manual attempts to bring them back up failed, and the OSD logs showed heartbeat failures.
  • Each OSD was consuming ~15 GB of RAM, and OSDs were hitting out-of-memory errors.
2018-07-18 08:58:12.794311 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.55 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)

2018-07-18 08:58:12.794315 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.57 since back 2018-07-18 08:45:42.452510 front 2018-07-18 08:45:42.452510 (cutoff 2018-07-18 08:57:12.794247)

2018-07-18 08:58:12.794321 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.82 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
  • OSD maps were out of sync:
2018-07-18 08:56:52.668789 7f4886d7b700  0 -- 10.33.49.153:6816/505502 >> 10.33.213.157:6801/2707 pipe(0x7f4a4f39d000 sd=26 :13251 s=1 pgs=233 cs=2 l=0 c=0x7f4a4f1b8980).connect claims to be 10.33.213.157:6801/1003787 not 10.33.213.157:6801/2707 - wrong node!
  • Each OSD had ~3000 threads, most of them in a sleeping state.
  • Using GDB to get a backtrace of all threads, we found that most of the active threads were just SimpleMessenger pipe readers (see the diagnostic sketch after this list).
  • We suspected a memory leak in the Ceph code.
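
A minimal sketch of that diagnosis; the OSD id and file paths below are examples, not the exact ones from the incident:

# Count the threads of one OSD process (osd.127 is an example id)
$ OSD_PID=$(pgrep -f 'ceph-osd.*-i 127')
$ ps -o nlwp= -p "$OSD_PID"

# Dump a backtrace of every thread with GDB
$ sudo gdb -p "$OSD_PID" -batch -ex 'thread apply all bt' > /tmp/osd.127-threads.txt

# Count how many threads sit in SimpleMessenger pipe reader frames
$ grep -c 'Pipe::reader' /tmp/osd.127-threads.txt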

Band-aid Fixes

  • Set the norebalance, norecover, and nobackfill flags (a command sketch for these steps follows this list).

  • Added swap memory on the OSD nodes.

  • Tuned the OSD heartbeat interval.

  • Tuned OSD map sync and set noout and nodown so the OSDs could catch up on their maps:

$ sudo ceph daemon osd.148 status
{
    "cluster_fsid": "621d76ce-a208-42d6-a15b-154fcb09xcrt",
    "osd_fsid": "09650e4c-723e-45e0-b2ef-5b6d11a6da03",
    "whoami": 148,
    "state": "booting",
    "oldest_map": 156518,
    "newest_map": 221059,
    "num_pgs": 1295
}
  • Tuned the OSD map cache size down to 20.
  • Looked for processes other than Ceph:
    • identified the ones consuming network, CPU, and RAM
    • killed them
  • Started the OSDs one by one – that worked for us 🙂
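
A rough command sketch for the steps above; the heartbeat grace value is illustrative, while the flags and the map cache size are the ones mentioned in the list:

# Stop data movement while firefighting
$ sudo ceph osd set norebalance
$ sudo ceph osd set norecover
$ sudo ceph osd set nobackfill

# Keep OSDs from being marked out/down while they catch up on maps
$ sudo ceph osd set noout
$ sudo ceph osd set nodown

# Relax the heartbeat grace on running OSDs (60s is just an example value)
$ sudo ceph tell osd.* injectargs '--osd_heartbeat_grace 60'

# Shrink the OSD map cache to cut memory usage (ceph.conf, [osd] section)
#   osd_map_cache_size = 20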

RCA

  • The major culprit was a rogue process consuming massive network bandwidth on the OSD nodes (a sketch of how to spot such a process follows this list).
  • With network bandwidth scarce, many messenger threads were simply stuck waiting.
  • SimpleMessenger threads are synchronous and block until their I/O gets through.
  • That is one of the reasons an OSD ended up with ~3000 threads and ~15 GB of memory.
  • With the network saturated, OSD heartbeat messages were also blocked, so OSDs were either hitting their suicide timeout or dying of OOM.
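
A minimal sketch of how such a bandwidth hog can be spotted; nethogs and the interface name here are assumptions, not necessarily the tools we used:

# Per-process bandwidth on the cluster network interface (eth0 is an example)
$ sudo nethogs eth0

# Top CPU / memory consumers other than ceph-osd
$ ps aux --sort=-%cpu | grep -v ceph-osd | head
$ ps aux --sort=-%mem | grep -v ceph-osd | head

# Once the rogue process is identified, stop it (<PID> is a placeholder)
$ sudo kill <PID>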
