Datrium Experience (pt6)

 Downhill, Fast

Only a matter of months after our purchase, and weeks after we finished our migration, everything was running smoothly. Local SSD cache was humming along at about 50-60% and primary storage capacity was at 60%, right where we had calculated it would be. Our plans did eventually call for an expansion, but we'll get into that a little later.

With everything humming along, updates were released. vCenter, ESXi, host firmware, NIC drivers, and the DVX solution all had updates released within days of each other. I read all the release notes and found no reason not to upgrade everything to the latest available version. You would think there is nothing wrong with updates. Updates are a way of life for us sysadmins.

If you don't update, you get hacked. If you do update, you get hacked. Nothing out of the ordinary here, just firmware and driver updates. Security and performance updates. Management and host updates. Controller and system updates. All with a little note saying this version or higher is supported.

I don't know if I mentioned it before, but we bought Datrium-branded Dell hosts from Datrium. Dell has an OEM program that will rebrand any of their server systems to your company's specifications. The BIOS is branded, and the faceplates and bezels are branded. One caveat is that those hosts are limited to slightly older versions of firmware. This is presumably to give vendors time to keep up with the versions released.

When using the built-in iDRAC update manager, the system is aware that it is rebranded and that it is only allowed to install the older firmware versions that are in line with the rebranding. An interesting conflict arises when you get a security, bug-fix, or performance update notice for a newer version on the same hardware.

I proceeded to update the host firmware to the latest firmware and not the firmware that was available from the iDRAC. This also had a higher-version ESXi driver associated with it. A higher version that is in common use with the non-rebranded gear, I might add. This was the start of a downhill journey that is going to end shortly with us migrating to a new solution.

The combination of the newer firmware and the newer driver was too much for the DVX system. A sign of things to come. Once we had everything up to date, we started having random SSD failures. This would normally not be an issue, since the system is designed to fail over to good SSDs either on the local host or on neighboring hosts. Instead we had an APD (All Paths Down) that resulted in unscheduled production downtime. BAD JUJU.

In order to recover the VMs that were protected by HA, we had to reboot the host. For some reason the lock files were still being accessed and HA couldn't do its job. BAD JUJU. This extended the recovery of the production hosts. Once we were back up and running, we had to get in touch with Datrium support for an RCA.

After a quick peek at the system, the tech couldn't find any good reason for the issues. We exported support bundles and uploaded them to Datrium for engineering review. After a day or so they found that the driver and firmware were too new and mismatched. The firmware was installed from the Dell support site and is valid for the model we have, except for the rebranding caveat. The driver was installed by the vCenter Lifecycle controller.

So how the hell is the combination too new?!?!?!? Needless to say, I backed the versions down and things were stable, until we had another SSD failure with the same symptoms, this time with the "correct" versions of driver and firmware, and another APD that required a host restart to get our VMs back online.

The beginning of the End has been marked.

For a system that is sold as being resilient and redundant, this wasn't looking good. We decided to run with it and work with tech support to ensure that moving forward we only used the documented firmware and driver combinations. We had to work closely with tech support because the documentation didn't state which versions worked with our gear. The documentation we were able to find was for the R640 systems and not the R640 OEMR. The distinction was more than enough to break things. Apparently!
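The kind of check we ended up doing by hand with tech support can be sketched in a few lines. This is only an illustration, assuming you keep your own table of validated combinations per exact model: the `SUPPORTED` table, the `is_supported` and `audit` helpers, and the version strings are all made up here and do not come from Dell or Datrium documentation.

```python
from typing import NamedTuple


class Combo(NamedTuple):
    firmware: str
    driver: str


# Hypothetical support matrix: exact model -> validated (firmware, driver) pairs.
# The version strings are placeholders, not real Dell or Datrium versions.
SUPPORTED = {
    "R640": {Combo("fw-2.0", "drv-1.5"), Combo("fw-2.1", "drv-1.6")},
    "R640 OEMR": {Combo("fw-2.0", "drv-1.5")},
}


def is_supported(model, firmware, driver):
    """True only if this exact combination is listed for this exact model."""
    return Combo(firmware, driver) in SUPPORTED.get(model, set())


def audit(hosts):
    """Return the names of hosts whose combination is not in the matrix."""
    return [
        h["name"]
        for h in hosts
        if not is_supported(h["model"], h["firmware"], h["driver"])
    ]


if __name__ == "__main__":
    hosts = [
        {"name": "esx01", "model": "R640 OEMR", "firmware": "fw-2.0", "driver": "drv-1.5"},
        {"name": "esx02", "model": "R640 OEMR", "firmware": "fw-2.1", "driver": "drv-1.6"},
    ]
    print(audit(hosts))
```

The point of keying on the exact model string is that a combination listed for the plain R640 is not treated as valid for the OEMR variant, and an unknown model is treated as unsupported rather than waved through.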

Later, I read the Datrium DVX release notes, which stated that the DVX version we were updating to was compatible with vSphere 7.0. The only specifics mentioned were vCenter 7.0 u2 and the DVX 5.2.1.0 hyperdriver. Should work, no problem, right? Nope. vCenter 7.0 u2g was not compatible and I had to roll it back.

Based on my experience with vCenter 7.0 u2g, Datrium released a bulletin stating that it doesn't work and not to upgrade yet. It took a few weeks for them to release an update for the DVX that would allow the new vCenter version. It then took me a couple more weeks to get a change control approved. The upgrade eventually worked, and I was totally up to date and everything was working like it was supposed to.

 For a little bit.

At the time of writing I am battling a hyperdriver that is continuously failing on a single host, one that is exactly the same as the rest of my hosts. I have uninstalled and reinstalled, reloaded, and tried everything else short of rebuilding the host. That might be an option sooner rather than later.

The most frustrating bit of this whole process is that the initial testing of this system was spectacular. I created some test scripts that generated test files and loads. The performance in all areas far surpassed every other solution we tested. The excitement for the production deployment was palpable. And the initial deployment went so smoothly that we had no reason to suspect that the bottom would fall out from this solution.
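For anyone curious what that kind of test script looks like, here is a minimal sketch in the same spirit. It is not the script I actually ran; the function names, file counts, and sizes are invented for illustration. It writes files of random data, which matters on storage that deduplicates or compresses because zero-filled files can flatter the numbers, and then times a sequential read pass.

```python
import os
import tempfile
import time


def write_test_files(directory, count, size_bytes):
    """Create `count` files of random data and return their paths."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"testfile_{i:04d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size_bytes))
            f.flush()
            os.fsync(f.fileno())  # push the data to the storage layer
        paths.append(path)
    return paths


def timed_read(paths):
    """Read every file once; return (total bytes read, elapsed seconds).

    Reads may be served from the OS page cache, so treat the number as a
    rough indicator rather than a benchmark result.
    """
    start = time.perf_counter()
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(f.read())
    return total, time.perf_counter() - start


if __name__ == "__main__":
    # Illustrative sizes only: 8 files of 1 MiB each in a throwaway directory.
    with tempfile.TemporaryDirectory() as workdir:
        files = write_test_files(workdir, count=8, size_bytes=1024 * 1024)
        total, elapsed = timed_read(files)
        print(f"read {total} bytes in {elapsed:.4f} s")
```

Point the directory at a datastore-backed volume inside a test VM and scale the counts and sizes up to get a load that is closer to meaningful.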

We have gone to great lengths to keep this solution in good standing. The potential of this solution is amazing, and I personally haven't seen anything as innovative since LeftHand, pre-HP buyout. I'm still holding out hope that VMware will incorporate some of this amazing tech into vSAN. Probably not, 'cause vSAN sucks and I'm not sure VMware will ever make it less complicated.
