Repartitioning the mail-server on the fly

We run a Zimbra mailserver at our company. We’re trying as hard as we can not to run it, because there’s no real value in running your own Zimbra mailserver when most of your users want Outlook anyway, and cloud email services like Office 365 are so affordable. But we do it anyway, for Business Reasons.

Anyway, that’s not really the point of this post, so I won’t dwell on it. The important part is that the server is an Ubuntu 14.04 LTS server that has been upgraded over the years, and probably started out as Ubuntu 10.04 LTS or similar. Virtual machines have a tendency to outlive the hardware they’re on.

We have a recurring issue with the VM. Automatic OS updates are enabled, including installing new kernels. However, automatic updates on Ubuntu do not seem to remove older kernel versions for us, so the /boot partition, where the kernel, among other things, is stored, tends to fill up relatively rapidly. When it does fill up, apt becomes sad and broken, and we have to clean up the mess manually. In order to mitigate this, we’d like to just increase the size of /boot. On this box, /boot is on the primary partition /dev/sda1, at a dismal 228 MB of usable size.
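For reference, the manual cleanup usually boils down to something like the following sketch; note that autoremove only helps once apt itself has been untangled enough to run:

# dpkg -l 'linux-image-*' | grep ^ii
# apt-get -f install
# apt-get autoremove --purge

The first command lists the installed kernel packages, the second tries to fix up any half-installed packages, and the last one purges kernels that are no longer needed.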

As an aside, I know that historically it used to be important to locate your kernel near the start of the drive, because LILO (which historically was the bootloader of choice for Linux) couldn’t read past the first GB or so of the drive. I really doubt that’s still an actual issue with Grub, but history lives on I guess. Bootloaders are still a bit of an arcane magic to me. Anyway, I really want to change the existing system as little as possible, because it’s a production machine.

Ideally, what I want to do is to resize the /boot partition to make it larger. Before even looking, I knew that was going to be a pain, because of whatever partition inevitably sits directly after it. So I looked a little closer:

# df -h
Filesystem               Size  Used Avail Use% Mounted on
udev                     3.9G   12K  3.9G   1% /dev
tmpfs                    799M  596K  798M   1% /run
/dev/mapper/zimbra-root  2.0T  625G  1.3T  34% /
none                     4.0K     0  4.0K   0% /sys/fs/cgroup
none                     5.0M     0  5.0M   0% /run/lock
none                     3.9G     0  3.9G   0% /run/shm
none                     100M     0  100M   0% /run/user
/dev/sda1                228M  101M  116M  47% /boot
/dev/sdb1                9.8G  1.6G  7.7G  17% /opt/zimbra/redolog
/dev/sdc1               1008G  680G  278G  72% /opt/zimbra/backup
# pvs
  PV         VG     Fmt  Attr PSize PFree
  /dev/sda5  zimbra lvm2 a--  1.99t 24.38g
# lvs
  LV     VG     Attr      LSize  Pool Origin Data%  Move Log Copy%  Convert
  root   zimbra -wi-ao---  1.95t
  swap_1 zimbra -wi-ao--- 16.00g
# fdisk -l /dev/sda

Disk /dev/sda: 2190.4 GB, 2190433320960 bytes
255 heads, 63 sectors/track, 266305 cylinders, total 4278190080 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0006250a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux
/dev/sda2          499712  4278190079  2138845184    5  Extended
/dev/sda5          501760  4278190079  2138844160   83  Linux
#

Aha! There is hope after all! The /dev/sda1 partition, which I’d like to make bigger, is actually followed by an extended partition, which contains only one logical partition, /dev/sda5, which is a physical volume for an LVM volume group.

My plan becomes:

  1. Hook a temporary scratch disk large enough to temporarily hold the contents of the LVM vg “zimbra”.
  2. Initialize the scratch disk and add it to the “zimbra” volume group.
  3. Evacuate the /dev/sda5 physical volume onto the scratch disk.
  4. Remove /dev/sda5 from the volume group and remove the /dev/sda5 and /dev/sda2 partitions.
  5. Increase the size of the /dev/sda physical disk in order to accommodate the new expected load of data.
  6. Change the end of the /dev/sda1 partition in order to increase its size to a few gigabytes or so.
  7. Perform an online filesystem resize of the /boot partition.
  8. Move the data back off the scratch disk.
  9. Scrap the scratch disk.
  10. Make a storage clone of the running server, and make sure it boots properly.
  11. Reboot the production server just to make real sure.

Note that there’s no step in which the server becomes unavailable for a lengthy period during a data move. This looks like a good plan on paper!

The final reboot step is just in there for extra insurance: I’d rather have a reboot fail when I’m there to fix it, outside of regular working hours, than have it fail in uncontrolled circumstances.

Let’s see if we can execute the plan.

Originally, my plan involved shrinking the /dev/zimbra/root volume to about 1 TB, down from 2 TB, to speed up any moves and avoid having to resize the virtual disk. However, I was disappointed to find out that online filesystem shrinking is not possible with ext4.

Because this involves a critical production workload, which you really don’t want to have to roll back from backup (lost mail and all that), I wanted to test all the steps outside of production first, so the first step was for me to take a clone of the virtual machine. In my environment, we use Nimble Storage, and we had the Zimbra stuff on one LUN, so it was exceedingly easy to just create a snapshot clone of the VMFS volume the Zimbra VM lives on, and then add the cloned VM to the VMware inventory.

After removing the virtual network card (important, because I really don’t want the lab clone touching any real stuff), I powered up the VM. It took a few minutes longer than usual, with the system and Zimbra itself grumbling about missing networking, but that was to be expected, and eventually my clone booted, ready for me to perform my experiments on.


Step 1: Hook a temporary scratch disk large enough to temporarily hold the contents of the LVM vg “zimbra”.

First order of business was to hook up a scratch disk. Because I don’t want scratch data blowing up a thin provisioned datastore, I created a new LUN on our Nimble Storage, 3 TB in size, for the purposes of holding scratch data. I then created a 2 TB virtual disk that I hooked to the lab VM.

I make a mental note that when I’m doing this for real, I should add the scratch LUN to the Nimble Storage volume collection that’s periodically taking storage snapshots of our mailserver; that way we have consistent snapshots in case something awful happens right in the middle of our move.

Looking at the console, it seems the scratch disk received the device name /dev/sdd.
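Had the disk not shown up by itself, a manual SCSI rescan would have done the trick; note that host2 below is just a placeholder, the right adapter number varies, so check what’s under /sys/class/scsi_host/ first:

# ls /sys/class/scsi_host/
# echo "- - -" > /sys/class/scsi_host/host2/scan
# lsblk

lsblk should then list the new, still empty, disk.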

Step 2: Initialize the scratch disk and add it to the “zimbra” volume group.

This part was easy.

# pvcreate /dev/sdd
# vgextend zimbra /dev/sdd

Done!
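Before moving terabytes around, I like a quick sanity check that the volume group actually grew:

# pvs
# vgs zimbra

pvs should now list /dev/sdd as a second, still completely free, physical volume in the “zimbra” volume group, and vgs should show the VG size having roughly doubled.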

Step 3: Evacuate the /dev/sda5 physical volume onto the scratch disk

This part was easy too, once you got past the almost suspiciously simple syntax.

# pvmove /dev/sda5

This is all I needed to type. The only parameter is the disk I want to evacuate; the data gets moved onto any other drive(s) in the volume group with free space, in my case just /dev/sdd, which perfectly aligns with my use case.

The only hard part was the time it took. This took about 8 hours to run on our setup, so I left it running into the next workday. I figured it’s entirely sensible to kick it off at around 10 PM and let it run until 6 AM, during the “quiet hours” of the mail system.
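pvmove prints a progress percentage at a regular interval while it runs. If I were babysitting an 8-hour move, I’d probably tune that down and keep an eye on it from another terminal; a sketch (the -i value is the reporting interval in seconds):

# pvmove -i 600 /dev/sda5
# lvs -a zimbra

While the move is running, lvs shows how far along it is in the Copy% column.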

Step 4: Remove /dev/sda5 from the volume group and remove the /dev/sda5 and /dev/sda2 partitions.

More easymode!

# vgreduce zimbra /dev/sda5
# fdisk /dev/sda

Command (m for help): d
Partition number (1-5): 5

Command (m for help): d
Partition number (1-5): 2

Command (m for help): w

I did get an error message saying “Re-reading the partition table failed with error 16: Device or resource busy”. The error message suggested running partprobe, so I did. I’m not sure it actually did anything, but it shouldn’t do any harm either.

# partprobe /dev/sda

Step 5: Increase the size of the /dev/sda physical disk in order to accommodate the new expected load of data.

This is where I ran into a small issue. The drive was very oddly sized at 1.9921875 TB. I thought I’d just bump it up to 2.1 TB or so. That should be plenty. That’s when VMware decided to give me this:

Hot-extend was invoked with size (4509715660 sectors) >= 2TB. Hot-extend beyond or equal to 2TB is not supported.
The disk extend operation failed: msg.disklib.INVAL

Fortunately, the message was quite clear. All right, so I can’t extend it to 2.1 TB. How much can I extend it, then; what exactly is 1.9921875 TB? Flipping over to “GB”, vSphere informs me that it’s exactly 2040 GB (1.9921875 × 1024 = 2040), which seems like an arbitrary number, but at least it’s below the 2 TB limit, and there’s just enough space left to do what I want to do. Thank God that whoever worked on this before me had the foresight to leave a few GB of possible expansion. It was likely actually myself, but I can’t recall thinking about it; I probably just felt 2040 was a rounder number than 2047.

Anyway, I expand the drive to 2047 GB in size using vSphere Client and it happily complies.

After that, I issue the magic incantation:

echo 1 > /sys/class/block/sda/device/rescan

This is so that Linux rescans the physical drive and recognises the extra 7 GB of space.
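A quick way to confirm that the kernel actually picked up the new size is to read it straight out of sysfs, which reports it in 512-byte sectors:

# cat /sys/class/block/sda/size

This should print 4292870144, the same total sector count fdisk reports below, i.e. roughly 2198 GB.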

Step 6: Change the end of the /dev/sda1 partition in order to increase its size to a few gigabytes or so.

Because I was only able to grow the disk by 7 GB, this also limits the maximum size of the boot partition to approximately 7 GB + 250 MB (its previous size), if I still want to be able to fit the data back afterwards. I decide not to think too hard about it and just make the boot partition 7 GB in size; that should leave some headroom in case things go wrong.

# fdisk -l /dev/sda

Disk /dev/sda: 2197.9 GB, 2197949513728 bytes
27 heads, 59 sectors/track, 2694833 cylinders, total 4292870144 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0006250a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux

As we can see, the drive has increased from ~2190 GB to ~2197 GB. (Note: These are power-of-10 gigabytes, not power-of-2 ones, which explains the apparent discrepancy from the other numbers.)

So, noting that the partition has to start at 2048, we create the new partition and set it bootable:

# fdisk /dev/sda

Command (m for help): d
Selected partition 1

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-4292870143, default 2048): 2048
Last sector, +sectors or +size{K,M,G} (2048-4292870143, default 4292870143): +7G

Command (m for help): a
Partition number (1-4): 1

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.

Again I decide to throw in a partprobe. Can’t hurt.

# partprobe /dev/sda

Step 7: Perform an online filesystem resize of the /boot partition.

This was painless; I just issued:

# resize2fs /dev/sda1

And it just worked. Or so I thought! (Cue foreboding ominous music!)

Step 8: Move the data back off the scratch disk

OK, this is actually several steps. Create the new data partition, initialize it as an LVM physical volume, add it into the LVM volume group, and pvmove the data off the scratch disk. I decided to keep the layout the same, i.e. an extended partition covering the rest of the drive, and a logical partition inhabiting the full extent of that extended partition.

# fdisk /dev/sda

Command (m for help): n
Partition type:
   p   primary (1 primary, 0 extended, 3 free)
   e   extended
Select (default p): e
Partition number (1-4, default 2): 2
First sector (14682112-4292870143, default 14682112): 
Using default value of 14682112
Last sector, +sectors or +size{K,M,G} (14682112-4292870143, default 4292870143): 
Using default value of 4292870143

Command (m for help): n
Partition type:
   p   primary (1 primary, 1 extended, 2 free)
   l   logical (numbered from 5)
Select (default p): l
Adding logical partition 5
First sector (14684160-4292870143, default 14684160): 
Using default value of 14684160
Last sector, +sectors or +size{K,M,G} (14684160-4292870143, default 4292870143): 
Using default value of 4292870143

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.

With the partition created, it’s just:

# partprobe /dev/sda
# pvcreate /dev/sda5
# vgextend zimbra /dev/sda5
# pvmove /dev/sdd

And then, wait another 8 hours.
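Before scrapping anything the next morning, it’s worth double-checking that the scratch PV really is empty again; something along these lines:

# pvs -o pv_name,vg_name,pv_size,pv_used,pv_free

/dev/sdd should show zero used space, and /dev/sda5 should be carrying all the data again.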

Step 9: Scrap the scratch disk.

First, just remove the PV from the volume group:

# vgreduce zimbra /dev/sdd

Then, I just hot-removed the VMDK from the VM.
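Strictly speaking, one could also wipe the LVM label off the scratch disk before detaching it, so nothing ever mistakes it for a physical volume again. I didn’t bother, since the LUN was going to be destroyed anyway, but it would just be:

# pvremove /dev/sdd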

Step 10: Testing

This time around, I’m doing this on the test box, which has no production data flowing through it, and also no network connectivity. So I took a VMware snapshot (including memory) of the state of the machine at this point. The idea is that I can always revert to the snapshot and end up with the machine running, even if it’s in a state where it won’t boot, so that if something goes wrong, I can go “back in time” to fix it.

Expecting the box to reboot successfully, I just rebooted, and I was greeted with failure:

error: unknown filesystem.
Entering rescue mode...

I tried a lot of things at this point, including going back and setting the bootable flag on /dev/sda1 (which I had forgotten to set previously, but which turned out to be irrelevant), trying different partition sizes, etc. Nothing seemed to work. This is when I stumbled upon Debian bug 766799: On-line resizing of root filesystem with resize2fs makes system unbootable; grub says: unknown filesystem.

Long story short, in my particular scenario, resize2fs produced a perfectly valid ext2 filesystem, but one using a feature that grub is not able to recognize. I decided to fix this problem by reverting the snapshot and simply re-creating the /boot filesystem (copying its contents away to a different folder first), using the following steps:

# cp -a /boot /boot2
# umount /boot
# mkfs.ext2 /dev/sda1
# mount -a
mount: special device UUID=...... does not exist

At this point, I edited /etc/fstab to substitute /dev/sda1 for the UUID. I could also have updated the UUID to the new filesystem’s, but I was working on a console with no convenient copy-paste support.
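Had copy-paste been available, the alternative would have been to look up the new filesystem’s UUID and put that into /etc/fstab instead:

# blkid /dev/sda1

blkid prints the new UUID, which can then replace the stale one on the corresponding UUID=... line in /etc/fstab. (This is what I ended up doing during the real run later.)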

# mount -a
# cp -a /boot2/. /boot
# rm -rf /boot2
# update-grub
# grub-install /dev/sda

I then took another snapshot with memory, and rebooted. And, lo and behold, it rebooted fine!

At this point I was feeling smug and satisfied that I had caught a problem with my procedure before actually performing it, and I patted myself on the back for having the good sense to run this on a clone before breaking my production box!

Step 11: Doing it for real

At this point, I feel comfortable with doing this for real.

First, I go and nuke the VM and the storage snapshot clone that I was using for testing. I won’t be needing those any more, after all. That also gets rid of the scratch VMDK inside the scratch datastore, leaving it empty.

Up until now, the Zimbra volume had been protected as a standalone volume in our Nimble Storage array. In order to maintain our data protection posture, I changed the Zimbra volume to be protected as part of a volume collection instead of as a standalone volume, and then proceeded to add the scratch volume to that volume collection.

Having configured that, I go back, and repeat all the steps from earlier in the post.

When the time comes to do the first pvmove, instead of invoking it immediately, I delay it to run at 22:00, like this:

# at 22:00 today
warning: commands will be executed using /bin/sh
at> pvmove -b /dev/sda5
at> <EOT> (I pressed Ctrl+D)
job 2 at Thu Apr 14 22:00:00 2016
#

I added the -b flag in order for pvmove to run in the background rather than in the foreground, since I’m starting it from an at job.
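If you’re paranoid (I am), you can also confirm the job is actually queued and see exactly what it will run; the 2 is the job number at printed when the job was queued:

# atq
# at -c 2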

Returning to work the next morning, I breeze through the steps up until the online resize of the filesystem. Because I already know that won’t work, I instead copy the contents of /boot away to the root filesystem, blow away the old filesystem, fix up /etc/fstab, copy the stuff back, and then run update-grub and grub-install, as outlined earlier. Although, since I’m now working with a terminal with copy-paste capabilities, instead of replacing the UUID with the device name in /etc/fstab, I simply update the UUID to the correct one.

Once I reach the step to do the second pvmove, again, I use at to delay it to 22:00, as above.

At this point, I decide that it’s a good idea to take a snapshot clone of the Zimbra volume collection and bring up a clone of the VM to ensure that booting works correctly; the location of the actual data within LVM should be immaterial to that. After cloning the two datastores and adding the cloned VMX to inventory, I have to adjust the VM to point the scratch disk at the cloned datastore, as well as remove the network card. After starting the clone, it boots just fine, which confirms that I did everything right.

So, all that remains is copying the data back off the scratch drive, and scrapping the scratch drive, but that’ll have to wait overnight.

Returning the next morning, I see that all the data has been moved. So, as a final step:

# vgreduce zimbra /dev/sdd

Then, I removed the VMDK from the VM.

I decided to wait until the next storage snapshot rolled around on the hour before finally killing the scratch LUN it lived on, in the exceedingly unlikely event I’d need something off it.

All that remains now is rebooting the server to make sure it boots properly for real. I’m reasonably confident it will, given all my precautions, but you never really know until you actually do it. That’ll have to wait for the maintenance window conveniently scheduled tonight.

After an embarrassing typo, the server was ultimately rebooted an hour prior to the actual service window; nevertheless, it booted properly. Yay!

In the end, what really bit me in the ass wasn’t what I tested, it was what I didn’t test. I had not considered that the virtual machine had been configured to only back up certain disks of the mail server, which meant that the backups from the time the scratch disk was in use were not complete or usable. Fortunately, Zimbra keeps its own backups, which mitigated this. It actually turned out to be a bit of a boon, because this way only one of the data moves had to be backed up into an incremental file, cutting the backup load in half, which is important given our very slow off-site backup replication, which will take a few days to catch up because of this.

Another thing that bit me in the ass is that I did not consider that the backup job would be running at the same time as the data move, meaning a VMware snapshot was present on disk during the time of the data move, which in general is a big no-no. In this case, it doesn’t actually seem to have affected anything, though.

I guess the moral of the story is to always consider what the backup system might do during an operation like this. I considered the Nimble Storage snapshots, but not backups, which was an oversight.
