TEch taLK
I'm a nerd, these are my ramblings
Datrium Experience (pt6)
Downhill, Fast
Only a matter of months after our purchase, and weeks after we finished our migration, everything was running smoothly. Local SSD cache was humming at about 50-60% and primary storage capacity was at 60%, right where we had calculated it to be. Our plans did eventually call for an expansion, but we'll get into that a little later.
With everything humming along, updates were released. vCenter, ESXi, host firmware, NIC drivers and the DVX solution all had updates released within days of each other. I read all the release notes and found no reason not to upgrade everything to the latest available version. You would think there is nothing wrong with updates. Updates are a way of life for us sysadmins.
If you don't update you get hacked. If you do update you get hacked. Nothing out of the ordinary here, just firmware and driver updates. Security and performance updates. Management and host updates. Controller and system updates. All with a little note saying this version or higher is supported.
I don't know if I mentioned it before, but we bought Datrium branded Dell hosts from Datrium. Dell has an OEM program that will rebrand any of their server systems to your company's specifications. The BIOS is branded, and the face plates and bezels are branded. One caveat to this is that those hosts are limited to slightly older versions of firmware. This is presumably to allow vendors to keep up with the versions released.
When using the built-in iDRAC update manager, the system is aware that it is rebranded and that it is only allowed to install the older versions of firmware that are in line with the rebranding. An interesting conflict comes up when you get a security, bug or performance update notice for a newer version of the same hardware.
I proceeded to update the host firmware to the latest firmware and not the firmware that is available from the iDRAC. This also had a higher version ESXi driver associated with it. A higher version that is in common use with the non-rebranded gear, I might add. This was the start of a downhill journey that is going to end shortly with us migrating to a new solution.
The combination of the newer firmware and the newer driver was too much for the DVX system. A sign of things to come. Once we had everything up to date we started having random SSD failures. This would normally not be an issue, since the system is designed to fail over to good SSDs either on the local host or on neighboring hosts. Instead we had an APD (All Paths Down) that resulted in unscheduled production downtime. BAD JUJU.
In order to recover the VMs that were protected by HA, we had to reboot the host. For some reason the lock files were still being accessed and HA couldn't work. BAD JUJU. This extended the recovery of the production hosts. Once we were back up and running we had to get in touch with Datrium support for RCA.
After a quick peek at the system the tech couldn't find any good reason for the issues. We exported support bundles and uploaded them to Datrium for engineering review. After a day or so they found that the driver and firmware were too new and mismatched. The firmware was installed from the Dell support site and is valid for the model we have, except for the rebranding caveat. The driver was installed by the vCenter Lifecycle controller.
So how the hell is the combination too new?!?!?!? Needless to say I backed the versions down and things were stable, until we had another SSD failure with the same symptoms, this time with the "correct" versions of driver and firmware, and an APD that required a host restart to get our VMs back online.
The beginning of the End has been marked.
For a system that is sold as being resilient and redundant, this wasn't looking good. We decided to run with it and work with tech support to ensure that moving forward we only used the documented firmware and driver combinations. We had to work closely with tech support because the documentation didn't state which version worked with our gear. The documentation we were able to find was for the r640 systems and not the r640 OEMR. The distinction was more than enough to break things. Apparently!
Later I read the Datrium DVX release notes, which state that the DVX version we were updating to was compatible with vSphere 7.0, with the only specifics mentioning vCenter 7.0 u2 and the DVX 5.2.1.0 hyperdriver. Should work, no problem, right? Nope. vCenter 7.0 u2g was not compatible and I had to roll it back.
Based on my experience with vCenter 7.0 u2g, Datrium released a bulletin stating that it doesn't work and not to upgrade yet. It took a few weeks for them to release an update for the DVX that would allow the new vCenter version. It then took me a couple more weeks to get a change control approved. The update eventually worked, and I was totally up to date and everything was working like it was supposed to.
For a little bit.
At the time of writing I am battling with a hyperdriver on a single host, one that is exactly the same as the rest of my hosts, that is continuously failing. I have uninstalled and reinstalled, reloaded and everything else short of rebuilding the host. That might be an option sooner rather than later.
The most frustrating bit of this whole process is that the initial testing of this system was spectacular. I created some test scripts that generated test files and loads. The performance in all areas far surpassed every other solution we tested. The excitement for the production deployment was palpable. And the initial deployment went so smoothly that we had no reason to suspect that the bottom would fall out from this solution.
We have gone to great lengths to keep this solution in good standing. The potential of this solution is amazing and I personally haven't seen anything as innovative since LeftHand, pre-HP buyout. I'm still holding out hope that VMware will incorporate some of this amazing tech into vSAN. Probably not, 'cause vSAN sucks and I'm not sure that VMware will ever make it less complicated.
Datrium Experience (pt5)
Turbulent Waters
After several system updates and upgrades we found that keeping the easy-to-use ship on course took us through some pretty heavy seas. We were promised that the redundancy built into the system would keep our data online all the time. We found that wasn't always the case. And we found that the reason for each incident (there were a few) was either a mismatched driver or a mismatched firmware revision.
From the shock and awe of the deployment and management experience, the performance and recovery functionality, to the SSD "nonfailure" failures, actual controller failures and unsupported supported drivers, the seven Cs of the Datrium Experience have been a very turbulent place.
Cool... The tech is awesome!
Without rewriting the previous posts, the following features are core to the solution. It's a software-defined solution that allows for separation of performance and capacity. The layered approach to processing allows for encryption at rest, compression, deduplication, recovery, and resiliency in a fairly compact package.
Simple installation, deployment and even expansion. Single pane of glass management and incorporated backups. Not to mention the super compact physical dimensions of our completed system.
The storage nodes can be added to the array in a matter of moments. Rack, stack, and power it on, go to the management console and add the node to the array. Backups are included and it's super easy to do a restore.
Calm... It's fairly reliable
The system has some decent internal monitoring and it calls home to support. It did notify tech support of some issues before they became a downtime problem. With the current release (5.3.1.0) they included duplicate IP monitoring among several others.
Confounded... except when it's not
Unexpected issues took weeks or months to become a problem and even longer to diagnose.
The SSDs are supposed to have a few features. Firstly, they are supposed to be sharable between hosts on a failure. However, we found that most of the time an SSD failure resulted in an All Paths Down that killed all our VMs on that host. Recovery is simple and we've had way too much practice.
The duplicate IP monitoring, for instance, has been complaining about all the ESXi hosts having a duplicate, but when the network team investigated, all they found was the MAC address for VLAN definitions that have been shut down.
Contemptible... frustratingly tight tolerances
I have no ill will toward Datrium tech support. Those guys know ESXi, they know the storage system and a lot of the in between. They do have the same problem as every other solution provider: tier one support has to wait on higher tiers for help.
However, when the system had issues and they became problems, the final determination took a while to find and was usually a bad driver or a driver/firmware mismatch. The tolerances on the system are very narrow.
Purchasing in good faith that the solution would be usable for at least 5 years, hopefully 7, we bought re-branded OEM Datrium servers and storage nodes. Their support, like I said, is great. However, the legalities of re-branded gear make it nearly impossible to get any first party support from the original vendor of the gear, in this case Dell.
Other customers we worked with on this same solution bought only the storage and used equivalent Dell hosts, which at the time seemed a little silly. Looking back, I'm sure they will not have as big a budget issue as my team.
We knew that the technology being developed by this startup was too good not to be bought up by a bigger company looking to fill out their hyper-converged catalog. So the fact that the purchase was a gamble was always on our mind. But we knew the tech was tooooo good to get shelved.
Or so we thought. When VMware bought them out we had high hopes. Word came from the industry leader that they might be keeping it. The hope was dashed in a matter of weeks: they killed the DVX solution.
With the death of the solution we are scrambling to fill the void. We are not at liberty to run a primary server and storage environment without primary support. I know of several organizations that can get away with that model, but I think they are dumb! And that cavalier attitude will be the downfall of their solutions.
That's right I said it... DUMB! It doesn't get used enough to describe when people do dumb things! Like killing the best fit solution to local government servers and storage. I think that VMware killing the DVX solution is dumb.
Even with all the ups and downs, the solution is a great fit. I believe with a few more serious years of development it would be the best solution on the market for disaggregated hyper-converged. I've made it about halfway through my story and at this point you might be wondering why the spoilers? Why do a summary in the middle? I'm not really sure, this is just what came out.
There will be more to this story, but I have to warn you... It will not be as happy as it has been.
Script Logging
My logging method has two parts, a script block and an Invoke-Command line. It's not the typical function call. Just a small piece to get started, and to lay the groundwork for recording what your script is going to do.
First we create a path for the log with a date stamp:
$logLocation = "c:\beep\logs\$($MyInvocation.MyCommand.Name)_$(get-date -f "MM_dd").log"
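One thing to watch: Out-File will not create a missing folder for you, so the log directory has to exist before the first write. Here is a minimal sketch of how that could be handled. The temp folder and the $scriptName value are stand-ins I made up so the example runs anywhere; swap in your own path and $MyInvocation.MyCommand.Name.

```powershell
# Hypothetical stand-ins for this sketch; the post uses c:\beep\logs and $MyInvocation.MyCommand.Name
$logDir = Join-Path ([System.IO.Path]::GetTempPath()) "beep-logs-demo"
$scriptName = "demo.ps1"

# Create the folder only if it is missing; Out-Null hides the directory object New-Item returns
if (-not (Test-Path -Path $logDir -PathType Container)) {
    New-Item -Path $logDir -ItemType Directory | Out-Null
}

# Same naming pattern as above: script name, underscore, month_day
$logLocation = Join-Path $logDir "$($scriptName)_$(Get-Date -Format 'MM_dd').log"
```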
The scriptblock has two parameters, $msg and $quiet. $msg is the text you want to record in the log and $quiet is a switch that silences console output. The block loops until it's able to set the $updated variable, effectively retrying until the write completes or the counter reaches 5. Each line is date stamped. If not quiet, the message is also written to the console.
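$logMsg = {
    param([string]$msg, [switch]$quiet)
    $updated = $false
    $cnt = 0
    # Retry until the write succeeds or five attempts have failed
    while (-not $updated -and $cnt -lt 5)
    {
        try
        {
            # -ErrorAction Stop turns a failed write into an error that catch can see
            "$(get-date -f "MM/dd/yyyy HH:mm:ss") - $($msg)" | Out-File $logLocation -Append -ErrorAction Stop
            $updated = $true
        }
        catch { $cnt++ }
    }
    # Echo the message to the console unless -quiet was passed
    if (-not $quiet) { $msg }
}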
#Call log routine
Invoke-Command $logMsg -ArgumentList "Starting script: $($MyInvocation.MyCommand.Name)"
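For completeness, here is a small self-contained sketch of the whole thing in use, including the quiet switch. It is only an illustration under a few assumptions of mine: the log goes to a made-up file in the temp folder, there is a short pause between retries, and the block is run with the call operator (&) rather than Invoke-Command so that -quiet can be passed by name.

```powershell
# Hypothetical log path for this sketch; the post uses c:\beep\logs instead
$logLocation = Join-Path ([System.IO.Path]::GetTempPath()) "logdemo_$(Get-Date -Format 'MM_dd').log"

$logMsg = {
    param([string]$msg, [switch]$quiet)
    $updated = $false
    $cnt = 0
    # Retry until the write succeeds or five attempts have failed
    while (-not $updated -and $cnt -lt 5) {
        try {
            "$(Get-Date -Format 'MM/dd/yyyy HH:mm:ss') - $msg" |
                Out-File $logLocation -Append -ErrorAction Stop
            $updated = $true
        }
        catch {
            $cnt++
            # Assumption: a short pause gives a locked file time to free up
            Start-Sleep -Milliseconds 200
        }
    }
    # Echo the message unless -quiet was passed
    if (-not $quiet) { $msg }
}

# Logged and echoed to the console
& $logMsg "Starting demo"

# Logged only; -quiet suppresses the console echo
& $logMsg "Silent step" -quiet
```

Both messages end up in the log file, but only the first one shows on the console.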