Datrium Experience (pt4)

Part 4 - Magic

The Datrium concept was one of simplicity and function. They created a system that is easy to start and just as easy to run. This system seamlessly integrates with vSphere and houses all your data in one place.

The Wizards

Datrium has C-level and engineering staff from both VMware and DataDomain. This is a match made in the bowels of Middle Earth. And the technology they created would stand up against anything the dwarves of Moria could produce.

Top-level storage and deduplication designers from DataDomain, the people who wrote the book on deduplication, are now applying what they know to production storage instead of just backup. Software strategy from the minds of former VMware executives guides an administrative experience so simple an entry-level help desk tech could handle it.

These Wizards created a magical storage system that allows for not just ease of use, but the ability to put a bag of holding inside another bag of holding without destroying the universe.

Bag of Holding

I have previously mentioned some of the features of the system, but the one that had most of my attention throughout my experience with Datrium is snapshots. The DVX system allows for a lot of snapshots. At the storage level they are integrated with the storage efficiencies and are virtually instant.
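
A rough mental model of why that can work, and this is my own illustration rather than anything Datrium has published, is that a snapshot over deduplicated, content-addressed storage is just a frozen copy of a pointer map, not a copy of the data itself:

# Hypothetical sketch: a snapshot as a frozen copy of the block map
# (pointers to content-addressed, deduplicated blocks). Copying pointers
# is O(metadata), not O(data) -- hence "virtually instant".
# My own model, not Datrium's actual implementation.
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = {}        # fingerprint -> block data (stored once)
        self.block_map = {}     # logical offset -> fingerprint
        self.snapshots = {}     # snapshot name -> frozen block map

    def write(self, offset, data):
        fp = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(fp, data)   # dedup: each unique block stored once
        self.block_map[offset] = fp

    def snapshot(self, name):
        self.snapshots[name] = dict(self.block_map)   # copy pointers only

    def restore(self, name):
        self.block_map = dict(self.snapshots[name])

store = DedupStore()
store.write(0, b"shared drive contents")
store.snapshot("before-change")
store.write(0, b"oops, overwritten")
store.restore("before-change")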

They sell it as being able to support millions of snapshots. They don't show a lot of detail in their UI, but I think they are really only taking a volume snapshot and presenting it as if it's an individual snap at the VM level. This bit of trickery allows for the claim of millions of snaps when they are really only taking a few hundred.

Some kind of magic.


The integration with vSphere is simple and effective, for the most part. If you want to recover a VM, simply navigate to it, shut it down, go to the monitoring tab and select the DVX console. Select the point in time you want to recover to and go.

This process is only missing the part that turns the VM back on. Very easy and stupid fast. We once recovered a VM that weighed in at over 90TB in 30 seconds. The system takes a snap just before it brings the old data forward, so you can recover to the moment just before you restored. It's a nice little safety net. I don't think we ever needed it, but it's nice to have all the same.
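
To show that safety-net pattern in miniature, here is a sketch of "snap before restore"; the class and method names are mine, not Datrium's API:

# Minimal sketch of the safety net described above: capture "now" before
# rolling old data forward, so the restore itself can be undone.
# Names are hypothetical, not the DVX interface.
class SnapshotCatalog:
    def __init__(self):
        self.current = {}    # vm_id -> current data version
        self.snaps = {}      # (vm_id, label) -> saved data version

    def snapshot(self, vm_id, label):
        self.snaps[(vm_id, label)] = self.current.get(vm_id)

    def restore(self, vm_id, label):
        self.snapshot(vm_id, "pre-restore")              # safety snap first
        self.current[vm_id] = self.snaps[(vm_id, label)]  # then roll back

catalog = SnapshotCatalog()
catalog.current["file-server"] = "v2"
catalog.snapshot("file-server", "nightly")
catalog.current["file-server"] = "v3-corrupted"
catalog.restore("file-server", "nightly")   # back to "v2"; "v3" kept as pre-restore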

There is also the ability to dive into an individual VMDK and recover a limited amount of data. You can navigate to any point in time and select files and folders that can be exported to an ISO and mounted to other VMs for recovery or distribution. It is limited, though; their method starts to break down at about 10GB of data.

Mirror, mirror on the wall

One more part of the snapshot story is the replication of the snaps to become a backup. There are a few parts to this. First is the replication groups, which must be set up ahead of time. Then there is the schedule, and finally the selection of the replication site... within the schedule.

It's not complicated! But its results are excellent. I mentioned that I have 3 datacenters. My backup cycle with this system is as follows: the primary DC replicates to secondary and tertiary, the secondary DC replicates to primary and tertiary, and the tertiary just sits there and accepts all the incoming data.

Due to bandwidth limitations there are some tricks we play with scheduling, but in the end all of our data is in all 3 places within a few days of each other. And in the cataclysmic event that we need to recover from the tertiary site, what little is missing will be, mostly, irrelevant.
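
To make the topology concrete, here is how I would describe our replication groups as plain data; the field names and schedule values are illustrative, not the DVX configuration format:

# Illustrative description of the replication topology above.
# Field names and schedule/retention values are mine, not DVX config syntax.
replication_groups = [
    {"name": "primary-to-dr",   "source": "primary",   "destinations": ["secondary", "tertiary"],
     "schedule": "daily", "retention_days": 30},
    {"name": "secondary-to-dr", "source": "secondary", "destinations": ["primary", "tertiary"],
     "schedule": "daily", "retention_days": 30},
    # The tertiary site only receives; it has no outbound groups.
]

def inbound(site):
    """Which sites replicate into the given site?"""
    return [g["source"] for g in replication_groups if site in g["destinations"]]

print(inbound("tertiary"))   # ['primary', 'secondary'] -- it just accepts incoming data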

The Mages

There was a point a little while ago where you may have thought I was going to start bashing on the tech support guys; never going to happen. They are some of the best I've worked with.

The Wizards gave them some very powerful magic, and they have learned to use it well. The system has a support mode that opens a tunnel to their system. Backdoor, you say?!? Kind of, but you have to open it for them. Once unlocked, this spell gives the mages the ability to disenchant anything that ails the system, from bad firmware or stuck processes to pinpointing a dying dragon, I mean controller.

They are as professional as clergy and just as mystical. They have a level of knowledge about the DVX and ESX that rivals most of the so-called vSphere experts.

If I have a single complaint about the Datrium technical support team, it is that I have had to call upon their sorcery so much.

Datrium Experience (pt2)

Part 2 - Space and Time

As with any good story about space and time we will be starting in the middle. At this point in our story the heroes are under the impression that their solution is the right one and it will change the way their organization handles storage forever!

Time

After numerous tests and calculations it was determined that the total solution consisted of 4 Dell OEMR R640s with 2 CPUs @ 20 cores each and 768 GB of RAM, two Cisco Nexus 9K switches and 2 Datrium DVX 12x10 shelves, per site. The POC that we did included the initial configuration of the system at our secondary site, so of course the first thing we did was install the new servers and add them to the existing vCenter and DVX cluster. It took more time to rack it than it did to configure it and get the first VMs running.
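
For the record, the per-site math on those hosts works out like this (simple arithmetic from the figures above):

# Per-site compute math from the hardware listed above.
hosts = 4
sockets_per_host = 2
cores_per_socket = 20
ram_gb_per_host = 768

physical_cores = hosts * sockets_per_host * cores_per_socket   # 160
logical_cores = physical_cores * 2                              # 320 with hyper-threading
total_ram_gb = hosts * ram_gb_per_host                          # 3072 GB, roughly 3 TB

print(physical_cores, logical_cores, total_ram_gb)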

After the secondary site was up and running, we moved on to the primary site. We racked the servers, switches and DVX nodes. We wired everything nice and neat. We powered everything on and stepped through the config notes we took during the POC.

The process was simple. After ensuring that the network configuration was sound and that we had our trusty cheat sheet with our IP addresses all mapped out, we started. Console cable into the primary controller, indicated by the blue light on the back of the controller. Log in with the default credentials, and start the initial config. Without checking my notes, I think it was option 1.

The serial session was via PuTTY, of course, with a black background and white font. The prompts came up: primary IP, DNS, vCenter, new admin password, etc. I think it took 5 minutes to enter the values and another 15 for the system to validate. Once we hit enter on the last option we were done.
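
If you are building the same kind of cheat sheet, it amounts to little more than this; every value below is a placeholder, not one of our real addresses or credentials:

# The kind of cheat sheet worth having ready before the serial session.
# All values are placeholders, not our actual environment.
dvx_initial_config = {
    "primary_ip": "10.0.10.20",
    "netmask": "255.255.255.0",
    "gateway": "10.0.10.1",
    "dns_servers": ["10.0.10.5", "10.0.10.6"],
    "vcenter": "vcenter01.example.local",
    "admin_password": "<set a new one here>",
}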

Space

Moving on to vCenter, the plugin installed itself. When we connected to the hosts, the vCenter plugin recognized that we were looking at an unassigned host and gave us the option to mount the datastore. Mount, you say... a datastore, you say?!? Sure, why not? And BAM! 160TB of luscious free space was staring back at us through an NFS-masked proprietary connection to a 32TB-cached lake of 10TB spinning disks.

To say it was magical would be a stretch, but it was incredibly easy. For our 2 datacenters with 8 hosts, 3 switches and, to start, 4 DVX nodes, we spent less than 8 hours in total racking, wiring, powering and configuring before the first VM was powered up on the new system. In my previous experiences, 8 hours was the first day of racking and stacking just the storage. This was huge. It validated everything that was sold as the Datrium DVX solution.


The 4 Dell hosts took up 4 RU, the switches took 2 RU and the DVX nodes took 4 RU. With some space for cables, the whole thing took less than a third of a rack and had as much available capacity as the 2 racks next to it and the 4 Dell M820 blades we were replacing, which took 10 RU because of the chassis.
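
The rack math, for anyone keeping score; I am assuming a standard 42U rack and calling the cable slack about 2U:

# Rack-unit math for the new footprint (42U rack and 2U of cable slack assumed).
hosts_ru, switches_ru, dvx_ru, cable_slack_ru = 4, 2, 4, 2
new_footprint = hosts_ru + switches_ru + dvx_ru + cable_slack_ru   # 12 RU
third_of_rack = 42 / 3                                             # 14 RU
print(new_footprint, new_footprint < third_of_rack)                # 12 True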

As a story of space and time, we were saving both. And we were just getting started.


Their Thoughtfulness

One of the key selling points of the DVX system is that it's a hyper-converged, software-defined storage solution that separates storage management from capacity. They called it Open-Converged; the industry called it disaggregated hyper-converged. Same thing! This spoke to us on a fundamental level. Our organization is pretty steady on compute resources, but we have forever retention on most of our data, so it just grows over time.

The idea of hyper-converged was appealing due to the software-defined nature and the simplified management and use within vCenter. But ALL hyper-converged platforms have the same fatal flaw: compute and storage have to grow together. Some providers tried to package smaller CPUs with higher capacity and give you a storage-heavy node, but it still required vSphere licensing. These approaches meant well but didn't deliver what we needed.

The Open-Converged concept was perfect. It delivered ease of use and versatility. The DVX platform went one step further and delivered on performance like no other, but that is literally another story. The separation of management from capacity was possible due to the hyper-driver that runs on the ESXi host. It makes use of the host's CPU, memory and SSDs.

That's right, they put SSDs local to the host, just like most other HCI solutions, except their capacity is on the other end of a 25Gb TwinAx cable. As part of the system sizing they adjust the cache to an appropriate size based on the active data in your environment. For the most part, our environment stays within the bounds of this cache to this day. Everything from file shares to SQL databases is in active, local SSD cache on the individual host the VM is running on. Performance, BABY!
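
A back-of-the-napkin way to sanity-check that sizing yourself, using the capacity and cache figures from our own requirements; the "hot data" fraction is purely my assumption, not Datrium's sizing methodology:

# Rough sanity check of flash-cache sizing: does the active working set fit?
# My own napkin math, not Datrium's sizing tool.
total_flash_cache_tb = 32      # per-site flash cache from our requirements
provisioned_data_tb = 160      # total capacity per site
active_fraction = 0.15         # assumed: ~15% of data is "hot" at any time

working_set_tb = provisioned_data_tb * active_fraction     # 24 TB
fits_in_cache = working_set_tb <= total_flash_cache_tb      # True
print(working_set_tb, fits_in_cache)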

If you're like me and some of my friends, you're going to scream: wait, how do you ensure data integrity down to the disk? Great question! New data gets written to the local SSD, then to the flash cache on the disk shelf. As soon as the shelf acknowledges the write, the host can go about its business, and the shelf can move that data to disk once it's been encrypted, compressed and deduplicated. Since it's all written to flash first, there is no issue with write integrity. (According to the sales team and documentation.)
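
As I understand that write path (and this is my reading of the pitch, not their source code), the key point is that the write is acknowledged once it is durable on flash at the shelf, and the encrypt/compress/dedup pass to spinning disk happens afterwards:

# Sketch of the write path described above -- my interpretation, not DVX code.
# A write is acknowledged once it lands on flash at the shelf; the
# encrypt/compress/dedup pass to spinning disk happens asynchronously later.
def write(data, host_ssd, shelf_flash, shelf_disk_queue):
    host_ssd.append(data)          # 1. land it in the host's local SSD cache
    shelf_flash.append(data)       # 2. land it in the flash cache on the shelf
    ack = True                     # 3. shelf acknowledges -> host moves on
    shelf_disk_queue.append(data)  # 4. queue it for the later encrypt/compress/dedup pass to disk
    return ack

host_ssd, shelf_flash, disk_queue = [], [], []
assert write(b"new block", host_ssd, shelf_flash, disk_queue)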

What if there is an SSD failure, or a switch, connection or shelf failure? Well, the sales guys have some great stories about the internal redundancies and host SSD sharing capabilities. The DVX cluster uses EC-6, which means that the whole cluster is one big RAID with 2 parity disks for up to 10 shelves. The trade-off here is that the disks aren't hit as hard or as often, since most reads are supposed to happen from the local SSDs on each host.
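
If the stripe really is N data segments plus 2 parity (the exact DVX stripe width is my assumption here, not a published detail), the usable capacity works out roughly like this:

# Rough usable-capacity math for erasure coding with 2 parity segments.
# The stripe widths below are assumptions for illustration only.
def usable_fraction(data_segments, parity_segments=2):
    return data_segments / (data_segments + parity_segments)

print(usable_fraction(6))    # 0.75 -> about 75% of raw capacity usable with 6+2
print(usable_fraction(10))   # ~0.83 with a wider 10+2 stripe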

The system design and presentation is very thoughtful. It fit our requirements, needs, hopes and dreams, or so we thought...

Our Thoughtfulness

The requirements we came up with were very generic: 320 cores (with hyper-threading), 3TB of RAM, 32TB of flash cache, and 160TB of storage. This was for a single datacenter, matched for our secondary datacenter. We have a third site, but it's just for a data dump, so the requirement there was simply storage nodes to dump data to. For the DVX solution this also required a host to run the DVX hyper-driver, 2 for redundancy.

Our primary datacenter has the highest need for redundancy, since nobody wants to get a call at 3 in the morning saying the Fire Department can't get to their shared drives, so we put a little more effort into that design than we do into our secondary site. But only minimally more, I mean 1 more switch. Everything else is the same.

We are not as thoughtful as Datrium.

Datrium Experience (pt1)

Part 1 - An Introduction

Simplicity in the realm of enterprise storage is a relative thing. I have found, over time managing several types of storage from multiple vendors, that certain parts of storage management are easier than others. I found that having an understanding of the layers of storage helps to keep things in perspective. I have also found that for my job I wear several hats, only one of which is storage management.

Today I want to talk a little about why I chose to purchase and use the Datrium DVX solution for my primary storage. I'm not sure right now how many posts this will take, but in this series I hope to explain the good and the bad that is the Datrium experience.


Backstory

Let's start with the previous iterations of the storage architectures I've managed, a little backstory if you will. Does anybody remember LeftHand storage? No?!?!?! This was super simple iSCSI storage that was marketed as hyper redundant. Each shelf was its own controller and was a RAID 5/6 with a network RAID 10 on top of that. The biggest drawbacks at the time (other than HP buying them out) were that it was limited to 1Gb Ethernet, SSDs weren't available, and the RAID controller only had an 8GB cache.

After LeftHand came NetApp FAS. For me this was a huge change in paradigm: a single controller with fiber-attached SAS shelves, and so many layers and options for each layer. It was (and still is) a very extensible system. I jumped headfirst into the deep end of enterprise storage. It was fun; it was also the primary thing I managed for the better part of a year while my other projects fell further and further behind.

While managing the virtual environment, guest VMs, Active Directory, IIS, systems monitoring, patch management, and operations and security projects, with storage management taking as long as it did with NetApp, the decision was made to find something a little easier to manage. The power and versatility of the NetApp platform made for an overwhelming management experience and took too much time away from the multitude of other schtuff on my plate. This led to a total review of what we needed and what we wanted.

While trying to find easier and less time-consuming storage solutions we found the Datrium DVX. This solution is fast, runs on commodity storage, has a super simple initial configuration, and offers easier ESXi host integration. After a pretty intense POC process that included it and several competitors, the DVX solution was up and running in our datacenters.

Notes moving forward

This is intended to be a story about my experience, not a technological presentation.
Please understand that there is a lot more going on here. I don't want to bore you with all the details, although I might, of all the steps, decisions, migrations, and installations; I already have a lot to tell here. So we are going to soldier on with the knowledge that we needed something new.

The rest of this series will describe the features, interpretations and overall experience of using Datrium's DVX system. Also note that this all took place over 2 years ago.




Datrium Experience (fin)

This series has discussed the performance, the ease of use, the resiliency, the flaws and the finale of the Datrium DVX platform. From here ...