Datrium Experience (fin)

This series has discussed the performance, the ease of use, the resiliency, the flaws, and the finale of the Datrium DVX platform. From here the story needs an ending. A farewell.

This product had been researched and POCed, baselined and tested against the competition in categories we thought mattered, and for the most part they did. However, we missed some of the core business components that make a good technical solution a great one.

When we considered that the tech we were buying was from an up-and-coming company, we knew it would be bought out. It was written in the sky with fireworks and lightning. The possibility that it might be scrapped was in the back of our minds, but we thought, nahh, this stuff is too good to be shut down!

We were wrong. They were bought out by VMware at a point when VMware was still primarily owned by Dell, which has its own storage systems, and the DVX gear didn't really fit into that lineup. Not to mention the hardware component of the solution was very proprietary, and VMware is a software company.

The DVX story, at least for my team and me, has been a perilous one. A great idea brought to life by great engineers, coders, and a great support team, that was seen by the bigger fish as a good investment for all the wrong reasons. At least from my perspective.

My team and I felt this tech was the next evolution of hyper-converged. We knew it was an extensible and positive move toward a software-defined storage system that could scale in whichever direction we needed it to.

Our support contract is ending soon and we may not have the option to renew. We may not have the budget to go another direction, either. We find ourselves between a rock and a hard place.

It has been an incredible ride and I really can't wait to see what's next. I wish the timing were a bit more in line with my budget, but what are you going to do?!?

THE END

Datrium Experience (pt6)

 Downhill, Fast

Only a matter of months after our purchase, and weeks after we finished our migration, everything was running smoothly. The local SSD cache was humming at about 50-60% and primary storage capacity was at 60%, right where we had calculated it would be. Our plans did eventually call for an expansion, but we'll get into that a little later.

With everything humming along, updates were released. vCenter, ESXi, host firmware, NIC drivers, and the DVX solution all had updates released within days of each other. I read all the release notes and found no reason not to upgrade everything to the latest available version. You would think there is nothing wrong with updates. Updates are a way of life for us sysadmins.

If you don't update you get hacked; if you do update you get hacked. Nothing out of the ordinary here, just firmware and driver updates, security and performance updates, management and host updates, controller and system updates. All with a little note saying this version or higher is supported.

I don't know if I mentioned it before, but we bought Datrium-branded Dell hosts from Datrium. Dell has an OEM program that will rebrand any of their server systems to your company's specifications. The BIOS is branded, and the faceplates and bezels are branded. One caveat is that those hosts are limited to slightly older versions of firmware, presumably to give the OEM vendor time to keep up with the versions Dell releases.

When using the built-in iDRAC update manager, the system is aware that it is rebranded and that it is only allowed to install the older firmware versions that are in line with the rebranding. An interesting conflict arises when you get a security, bug, or performance update notice for a newer version of the same hardware.

I proceeded to update the host firmware to the latest release rather than the firmware available from the iDRAC. That release also had a higher-version ESXi driver associated with it, a version in common use on the non-rebranded gear, I might add. This was the start of a downhill journey that is going to end shortly with us migrating to a new solution.

The combination of the newer firmware and the newer driver was too much for the DVX system. A sign of things to come. Once we had everything up to date, we started having random SSD failures. This would normally not be an issue, since the system is designed to fail over to good SSDs either on the local host or on neighboring hosts. Instead we had an All Paths Down (APD) event that resulted in unscheduled production downtime. BAD JUJU.

In order to recover the VMs that were protected by HA, we had to reboot the host. For some reason the lock files were still being held and HA couldn't do its job. BAD JUJU. This extended the recovery of the production hosts. Once we were back up and running, we got in touch with Datrium support for an RCA.

After a quick peek at the system, the tech couldn't find any good reason for the issues. We exported support bundles and uploaded them to Datrium for engineering review. After a day or so they found that the driver and firmware were too new and mismatched. The firmware was installed from the Dell support site and was valid for the model we have, except for the rebranding caveat. The driver was installed by the vCenter lifecycle controller.

So how the hell is the combination too new?!? Needless to say, I backed the versions down and things were stable, until we had another SSD failure with the same symptoms, on the "correct" versions of driver and firmware, and an APD that required a host restart to get our VMs back online.

The beginning of the End has been marked.

For a system that is sold as being resilient and redundant, this wasn't looking good. We decided to run with it and work with tech support to ensure that, moving forward, we only used the documented firmware and driver combinations. We had to work closely with tech support because the documentation didn't state which versions worked with our gear. The documentation we were able to find was for the R640 systems and not the R640 OEMR. That distinction was more than enough to break things. Apparently!
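
As a side note, this is roughly how I pull the installed NIC driver and firmware versions from every host with PowerCLI now, so they can be lined up against whatever combination support says is blessed. It's a sketch from memory, not Datrium tooling; the vCenter name is a placeholder and your NIC properties may vary.

  Import-Module VMware.PowerCLI
  Connect-VIServer -Server "vcenter.example.local" | Out-Null

  foreach($vmhost in Get-VMHost){
    # esxcli view of the host, same data as "esxcli network nic get"
    $esxcli = Get-EsxCli -VMHost $vmhost -V2
    foreach($nic in $esxcli.network.nic.list.Invoke()){
      $info = $esxcli.network.nic.get.Invoke(@{nicname = $nic.Name}).DriverInfo
      [pscustomobject]@{
        Host     = $vmhost.Name
        Nic      = $nic.Name
        Driver   = $info.Driver
        DrvVer   = $info.Version
        Firmware = $info.FirmwareVersion
      }
    }
  }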

Later I read the Datrium DVX release notes, which stated that the DVX version we were updating to was compatible with vSphere 7.0, with the only specifics mentioning vCenter 7.0 U2 and the DVX 5.2.1.0 hyperdriver. Should work, no problem, right? Nope. vCenter 7.0 U2g was not compatible and I had to roll it back.

Based on my experience with vCenter 7.0 U2g, Datrium released a bulletin stating that it doesn't work and not to upgrade yet. It took a few weeks for them to release a DVX update that would allow the new vCenter version, then a couple more weeks for me to get a change control approved. It eventually worked, and I was totally up to date and everything was working like it was supposed to.

 For a little bit.

At the time of writing I am battling a continuously failing hyperdriver on a single host that is exactly the same as the rest of my hosts. I have uninstalled, reinstalled, reloaded, and tried everything else short of rebuilding the host. That might be an option sooner rather than later.

The most frustrating bit of this whole process is that the initial testing of this system was spectacular. I created some test scripts that generated test files and load. The performance in all areas far surpassed every other solution we tested. The excitement for the production deployment was palpable, and the initial deployment went so smoothly that we had no reason to suspect the bottom would fall out from under this solution.
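
The originals are long gone, but the idea was about as simple as this sketch: hammer the DVX-backed storage with a pile of files and time the throughput. The path, sizes, and counts here are made up for illustration; the real scripts also varied the data so compression and dedupe had to earn their numbers.

  # Minimal load generator: write $count files of $fileSize and report MB/s.
  $target   = "C:\ClusterTest\load"   # hypothetical path on DVX-backed storage
  $fileSize = 256MB
  $count    = 40
  New-Item -ItemType Directory -Path $target -Force | Out-Null

  # 1MB buffer of random bytes, reused for every write
  $buffer = [byte[]]::new(1MB)
  (New-Object System.Random).NextBytes($buffer)

  $sw = [System.Diagnostics.Stopwatch]::StartNew()
  foreach($i in 1..$count){
    $stream = [System.IO.File]::OpenWrite((Join-Path $target "test_$i.bin"))
    for($written = 0; $written -lt $fileSize; $written += $buffer.Length){
      $stream.Write($buffer, 0, $buffer.Length)
    }
    $stream.Dispose()
  }
  $sw.Stop()

  $totalMB = ($fileSize / 1MB) * $count
  "{0:N0} MB written in {1:N1}s ({2:N0} MB/s)" -f $totalMB, $sw.Elapsed.TotalSeconds, ($totalMB / $sw.Elapsed.TotalSeconds)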

We have gone to great lengths to keep this solution in good standing. The potential of this solution is amazing, and I personally haven't seen anything as innovative since LeftHand, pre-HP buyout. I'm still holding out hope that VMware will incorporate some of this amazing tech into vSAN. Probably not, because vSAN sucks and I'm not sure VMware will ever make it less complicated.

Datrium Experience (pt5)


Turbulent Waters 

With great power, and feature sets, comes great responsibility. The wizards at Datrium created a very awesome product with some very strict tolerances. They made an easy-to-use product that incorporates primary and backup storage with replication, all in an integrated single pane of glass.

After several system updates and upgrades, we found that keeping the easy-to-use ship on course took us through some pretty heavy seas. We were promised that the redundancy built into the system would keep our data online all the time. We found that wasn't always the case, and we found that the reason for each incident (there were a few) was either a mismatched driver or a mismatched firmware revision.

From the shock and awe of the deployment and management experience, the performance, and the recovery functionality, to the SSD "non-failure" failures, actual controller failures, and unsupported supported drivers, the seven Cs of the Datrium Experience have been a very turbulent place.

Cool... The tech is awesome!
Without rewriting the previous posts, the following features are core to the solution. It's a software-defined solution that allows for the separation of performance and capacity. The layered approach to processing allows for encryption at rest, compression, deduplication, recovery, and resiliency in a fairly compact package.

Collected... It's easy to use
Simple installation, deployment and even expansion. Single pane of glass management and incorporated backups. Not to mention the super compact physical dimensions of our completed system.

The storage nodes can be added to the array in a matter of moments. Rack it, stack it, power it on, then go to the management console and add the node to the array. Backups are included and it's super easy to do a restore.

Calm... It's fairly reliable
The system has some decent internal monitoring and it calls home to support. It did notify tech support of some issues before they became a downtime problem. With the current release (5.3.1.0) they included duplicate IP monitoring, among several other checks.

Confounded... except when it's not
Unexpected issues took weeks or months to become a problem and even longer to diagnose.

The SSDs are supposed to have a few features. Firstly, they are supposed to be sharable between hosts on a failure. However, we found that most of the time an SSD failure resulted in an All Paths Down that killed all our VMs on that host. Recovery is simple, and we've had way too much practice.

The duplicate IP monitoring, for instance, has been complaining that all the ESXi hosts have a duplicate, but when the network team investigated, all they found was the MAC address of VLAN definitions that had been shut down.

Contemptible... frustratingly tight tolerances
I have no ill will toward Datrium tech support; those guys know ESXi, they know the storage system, and a lot of the in-between. They do have the same problem as every other solution provider: tier-one support has to wait on higher tiers for help.

However, when the system had issues and they became problems, the final determination took a while to reach and was usually a bad driver or a driver/firmware mismatch. The tolerances on the system are very narrow.

Constrained... can't reuse the gear
Purchased in good faith that the solution would be usable for at least 5 years, hopefully 7, we bought re-branded OEM Datrium servers and storage nodes. Their support, like I said, is great; however, the legalities of re-branded gear make it nearly impossible to get any first-party support from the original vendor of the gear, in this case Dell.

Working with other customers on this same solution, buying only the storage and equivalent non-rebranded Dell hosts seemed a little silly at the time. Looking back, I'm sure they will not have as big a budget issue as my team.

Canned... VMware stuffed 'em 
We knew that the technology being developed by this startup was too good not to be bought up by a bigger company looking to fill out its hyper-converged catalog. So the fact that the purchase was a gamble was always on our minds. But we knew the tech was tooooo good to get shelved.

Or so we thought. When VMware bought them out we had high hopes. Word came from the industry leader that they might be keeping it. That hope was dashed in a matter of weeks; they killed the DVX solution.

With the death of the solution we are scrambling to fill the void. We are not at liberty to run a primary server and storage environment without primary support. I know of several organizations that can get away with that model, but I think they are dumb! And that cavalier attitude will be the downfall of their solutions.

That's right, I said it... DUMB! The word doesn't get used enough to describe when people do dumb things! Like killing the best-fit solution for local government servers and storage. I think VMware killing the DVX solution is dumb.

Even with all the ups and downs, the solution is a great fit. I believe that with a few more serious years of development it would be the best solution on the market for disaggregated hyper-converged. I've made it about halfway through my story, and at this point you might be wondering: why the spoilers? Why do a summary in the middle? I'm not really sure, this is just what came out.


There will be more to this story, but I have to warn you... It will not be as happy as it has been.


Script Logging

Most of my scripts have a very standard element. Logging!

My logging method has two parts, a script block and an Invoke-Command line. It's not the typical function call, just a small piece to get started and lay the groundwork for recording what your script is going to do.

First we create a path for the log file with a date stamp:
  $logLocation = "c:\beep\logs\$($MyInvocation.MyCommand.Name)_$(get-date -f "MM_dd").log"
 

The scriptblock has two parameters, $msg and $quiet. Msg is the text you want to record in the log, and Quiet is a switch that silences console output. The block loops until it is able to write the line and flip the $updated variable, effectively retrying until the write completes or the counter reaches 5. Each line is date stamped. If not quiet, the message is also written to the console.

  $logMsg = {
    param([string]$msg,[switch]$quiet)
    $updated = $false
    $cnt = 0
    # retry until the write succeeds or we've tried 5 times
    while(!$updated -and $cnt -lt 5)
    { try
      { "$(get-date -f "MM/dd/yyyy HH:mm:ss") - $($msg)" | Out-File $logLocation -Append -ErrorAction Stop
        $updated = $true }
      catch{$cnt++}
    }
    if(!$quiet){$msg}
  }
#Call log routine 
Invoke-Command $logMsg -ArgumentList "Starting script: $($MyInvocation.MyCommand.Name)"


This simple block of code allows multithreaded scripts to log to a single file in a thread-safer way. I say safer and not safe; there are other ways to implement a truly thread-safe method. The idea here is to keep it simple and a little extensible. Adding -AsJob to the Invoke-Command would run the logging as a separate job and make it safer still.
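
For what it's worth, here is a rough sketch of how the same block can be called from parallel work, in this case ForEach-Object -Parallel (PowerShell 7+), which is my assumption and not part of the original snippet. Script blocks can't be passed through $using: directly, so it gets handed over as a string and rebuilt inside each runspace; the loop body is just a stand-in for real work.

  $logText = $logMsg.ToString()

  1..5 | ForEach-Object -Parallel {
    $logLocation = $using:logLocation            # each runspace needs the log path
    $log = [scriptblock]::Create($using:logText) # rebuild the scriptblock locally
    & $log "Worker $_ starting"
    Start-Sleep -Seconds (Get-Random -Maximum 3) # stand-in for real work
    & $log "Worker $_ finished"
  } -ThrottleLimit 5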

Datrium Experience (pt3)

Part 3 - Horse Power and Torque

The way in which Datrium disaggregated the solution is what made it so appealing. The software runs on the host and the storage is completely separate. Horsepower is applied closest to the VMs, where you want it, and the torque is across the wire, where the heavy lifting is happening. Top it off with a little software-defined ignition system that can go from Fast to Insane and you have a solution that was right up our alley.

Hyperdrive...r

The storage controller software is broken up into two parts, a frontend and a backend. The frontend lives on the host in the form of a VIB that acts like a driver. This driver serves several functions.

The first is to connect the host to the backend and present the storage to the host. The storage is presented as NFS, mounted as a local share, presumably to make use of the local flash.
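
From the vSphere side it really does just look like an NFS datastore. A quick way to see that with PowerCLI, sketched here with placeholder names rather than our actual environment, is to list the NFS datastores and where they claim to be mounted from:

  Connect-VIServer -Server "vcenter.example.local" | Out-Null

  Get-Datastore | Where-Object { $_.Type -eq "NFS" } |
    Select-Object Name, Type, CapacityGB, FreeSpaceGB,
      @{ Name = "RemoteHost"; Expression = { $_.ExtensionData.Info.Nas.RemoteHost } }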

The connection to the backend is some proprietary concoction that runs over Ethernet, 25Gb in our case. Another critical function is storage efficiency: before data is written across the wire, it's compressed, deduplicated, optionally encrypted, and written to the flash.

Supposedly another key function is that hosts are able to share their local flash with fellow hosts in the event a host has an issue with its own. This has been a serious point of failure in our environment, with no clear reason or solution in sight.

This is not the point where I slam their support. Those guys know their product and are quick to respond. The team is good. This is the point where I slam the engineering team for making a system that does not fail over properly to neighboring host flash when local flash fails. More on this later.

You have the ability to alter performance when resources get tight via a nifty little switch on the host DVX console. One side says Fast, the other Insane. This switch lets you select how much compute the hyperdriver is allowed to consume. Even on Insane we have not noticed any negative impact on host resource utilization.

The last feature I'm going to mention is replication. When the system is configured for snapshots and replication the host portion of the control software handles the replication of data. And according to the sales guys, all the data transmitted is encrypted and all the storage efficiencies are maintained.

Payload

When we think of torque we are usually thinking in terms of how much can I pull. The DVX storage nodes definitely pull their weight. The second half of the controller software has a bunch of little goodies.

Along with collecting and distributing all the data written from the hosts to all the DVX nodes in the cluster, it maintains the storage efficiencies and then doubles down on them. When a host writes something, it's compressed and deduped on the host, then passed to the DVX node, where it is written alongside the data from all the other hosts in the cluster. There is potentially a lot of the same stuff that can be compressed and deduped all over again to save even more capacity.

The storage recovery process, or SR, runs every 4 hours. That is what I was told. In actuality, on larger-capacity systems it runs until it finishes and then starts again on the next 4-hour rotation.

We have seen this process take over a day and a half to finish on a data set that normally takes less than 18 hours. Not the end of the world if you have enough spare capacity to allow the process to complete. Our systems can write over 1TB an hour of new data that eventually becomes little more than 10-50GB of actual used capacity after this process runs.

More goodies include encryption at rest, 10TB drives, erasure coding 2, dual hot-swap controllers, and redundant power supplies. Encryption at rest was a big selling point for us, so when they said they could do it without an impact on performance, it obviously put these guys higher in the points than some of the competition at the time.

Most of the goodies, features, and design aspects of the solution seem a lot more like magic than technology, but more on that next time.
