When is a good time for a healthcare system to consider cloud infrastructure?  A client with an application portfolio of 800+ applications needed to better standardize and automate system management. The teams managing the applications, servers, and underlying components could not scale to handle the differences between systems, and staffing and budgets continued to be cut.  The client needed a better way to manage systems consistently.

Challenges

Applications in the client portfolio were in various states of complexity and maturity and could not be consolidated immediately without impacting the acute and ambulatory sites.  Applications were hosted in datacenters and MDFs across various regional locations.  Migrating applications to cloud hosting, while technically feasible, had to ensure that vendor support and management did not impact the clinical experience.  Multiple applications could not easily be migrated to remote or cloud hosting due to latency sensitivity and various other technical constraints.  Wherever possible, migration to an alternate management model could not introduce downtime.  Finally, an application catalog existed, but it had last been updated years earlier, lacked relevant application metadata, and was about 50% current at best.

Our Approach & Methodology

First and foremost, refreshing the application catalog was imperative.  Rather than focusing only on standardized app catalog metadata, we included data collection that would help us build a new strategy and qualify and categorize applications based on technical, process, and business criticality and dependencies.  Multiple hosting locations would continue to be a reality for the foreseeable future, so cloud hosting alone was not a viable model.  Instead, we implemented hyperconverged platforms to host applications at each regional location so that physical distance and latency would not be a factor.  Based on a number of criteria, we identified that more than 65% of the applications in the portfolio were ideal candidates for automated management.  To migrate production systems onto the local hyperconverged platforms, we created a snapshot-based model that performed near-instant local P2V or V2V migration, let us validate local DR failover, run UAT with the clinical staff, and then cut over to the converged platform.  We ran each legacy system and its new hyperconverged counterpart in parallel for two or more weeks to confirm there was no impact, and then decommissioned the legacy system.
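
To illustrate how refreshed catalog metadata can drive the qualification step, the sketch below shows one way the categorization rules might be expressed.  The field names (latency_sensitive, vendor_supported, rto_hours, etc.) and the three hosting buckets are assumptions for illustration only, not the client's actual schema or decision logic.

```python
from dataclasses import dataclass


@dataclass
class AppCatalogEntry:
    """One refreshed catalog record; field names are illustrative, not the client's schema."""
    name: str
    criticality: str          # e.g. "clinical-critical", "business", "administrative"
    latency_sensitive: bool   # True if the app cannot tolerate remote/cloud round trips
    vendor_supported: bool    # True if the vendor supports virtualized/cloud hosting
    rto_hours: float          # recovery time objective agreed with the business
    rpo_hours: float          # recovery point objective agreed with the business


def classify_hosting(app: AppCatalogEntry) -> str:
    """Bucket an application into a hosting/management category.

    Mirrors the kind of rules described above: latency-sensitive or
    vendor-unsupported apps stay on the local hyperconverged platform, while
    the rest become candidates for automated management and, later, cloud hosting.
    """
    if app.latency_sensitive or not app.vendor_supported:
        return "local-hyperconverged"
    if app.criticality == "clinical-critical":
        return "automate-local-first"
    return "automate-and-evaluate-cloud"


if __name__ == "__main__":
    sample = AppCatalogEntry(
        name="ED tracking board", criticality="clinical-critical",
        latency_sensitive=True, vendor_supported=True,
        rto_hours=1, rpo_hours=1,
    )
    print(sample.name, "->", classify_hosting(sample))
```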

Success & Value Realized

  • By hosting applications locally on hyperconverged platforms, we realized a significantly reduced physical footprint; the server and storage footprint shrank by 90%+ in some cases.  This returned floor space, which was planned to be converted to usable office space.
  • Reduced environmental costs (cooling, electricity, etc.) for the more condensed and efficient hyperconverged devices.
  • Many hundreds of thousands of dollars in annual hardware maintenance were effectively eliminated.
  • Refreshing the application catalog with valuable metadata allowed us to create automation scripts tightly aligned to business needs; RTO/RPO models closely matched the needs of the clinical departments.
  • Snapshot-based recovery and replication became push-button simple and fully automated. Each application was snapshotted and replicated on a schedule driven by its business need, and recovery from any snapshot was just as simple (a sketch of this scheduling follows the list).
  • An entirely new Disaster Recovery (DR) model was created, with snapshot-based replication to cloud hosting. Virtual networking was also implemented to allow transparent IP- and network-based redirection, so that in any disaster scenario, diverting user activity to the cloud-hosted DR images did not require physically touching each and every endpoint device.
  • Using cloud as a DR platform significantly reduced the initial cost of cloud hosting and allowed the client to begin testing the long-term feasibility of cloud hosting production systems on a case-by-case basis.
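
The business-need-driven snapshot scheduling referenced above can be sketched as follows.  The halving rule (snapshot at half the RPO) and the 15-minute floor are illustrative assumptions; the actual policies were driven by each clinical department's agreed RTO/RPO.

```python
from dataclasses import dataclass


@dataclass
class SnapshotPolicy:
    app_name: str
    interval_minutes: int     # how often to snapshot, derived from the app's RPO
    replicate_to_cloud: bool  # whether the snapshot is also replicated for DR


def policy_from_rpo(app_name: str, rpo_hours: float, dr_in_cloud: bool) -> SnapshotPolicy:
    """Derive a snapshot schedule from the RPO agreed with the business.

    Snapshotting at half the RPO (with a 15-minute floor) leaves headroom so
    that a missed or slow snapshot still keeps data loss inside the RPO.
    Both the halving rule and the floor are illustrative assumptions.
    """
    interval = max(15, int(rpo_hours * 60 / 2))
    return SnapshotPolicy(app_name, interval, dr_in_cloud)


if __name__ == "__main__":
    # A clinical-critical app with a 1-hour RPO gets a 30-minute snapshot cycle.
    print(policy_from_rpo("pharmacy dispensing", rpo_hours=1, dr_in_cloud=True))
```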

Lessons Learned

  • A majority of applications had been updated and maintained very inconsistently. By standardizing automated system management, we brought more than 85% of systems into compliance with patching and updates without any major staffing changes.
  • Automation-based management required a new skill set for staff; those who had Unix backgrounds and/or scripting expertise (SQL, Perl/Python, etc.) quickly adapted and were rejuvenated by their work.
  • Standardizing virtualization and automation reduced manual and physical workforce intervention by more than 90%.
  • Critical applications were placed on a very short snapshot window (1 hour), allowing near-immediate rollback in any impact scenario. In many instances this guaranteed recovery in less than one hour.
  • Once we took a comprehensive inventory of the entire server estate (40K+ servers), we found only 12 common server builds; further scrutiny allowed us to standardize on 3 server builds.  Automating 40K+ unique servers would have been impossible, automating 12 server builds would have been difficult but feasible, and automating 3 server builds became a simple success (see the sketch below).
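
To illustrate why three standard builds made fleet-wide automation tractable, here is a minimal sketch of a declarative build catalog and a drift check against it.  The build names and package sets are hypothetical, not the client's actual builds.

```python
# Hypothetical standard build catalog; names and package sets are illustrative.
STANDARD_BUILDS = {
    "web-tier": {"os": "rhel8", "packages": {"httpd", "openssl"}},
    "app-tier": {"os": "rhel8", "packages": {"java-11-openjdk"}},
    "db-tier":  {"os": "rhel8", "packages": {"postgresql-server"}},
}


def audit_server(build_name: str, installed_packages: set) -> list:
    """Return the packages a server is missing relative to its standard build.

    With only three builds to describe, this kind of declarative check (and the
    remediation that follows it) stays small enough to automate fleet-wide.
    """
    spec = STANDARD_BUILDS[build_name]
    return sorted(spec["packages"] - installed_packages)


if __name__ == "__main__":
    # A db-tier server that has drifted from the standard build.
    missing = audit_server("db-tier", {"openssl"})
    print("missing packages:", missing)
```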