So my name is Val. And we have also Kamal.
Yeah, hello. This is Kamal.
So we are from Hotwire.com, and we have a story to tell today. The story is about how we use SharePlex in our company, and the key point of this story is a migration from a data center to the cloud. Any time I say "data center," I mean our data center in San Jose. And "the cloud" means the AWS US West region, us-west-1.
So there are many companies today migrating to the cloud. We are one of them. I think that there is one key differentiator, the way we did our migration compared to other companies I've seen out there. Before putting the presentation together, I went out and searched how other companies did their migration. Everybody was claiming zero downtime, near zero downtime, minimum downtime.
I dug into their presentations and, every time, I found a little key detail indicating that there was indeed a brief moment when traffic had to be switched from one data center to the other. So I was able to debunk every migration out there, including vendors demonstrating the tools they built for migrating databases. Our key differentiator is that, from the customer's perspective, the migration of the full application stack happened totally transparently.
So that's the story I'm going to tell. Hotwire has been in business for close to 20 years now; our first booking was made in August 2000. We're part of Expedia Group. There are many brands within Expedia, and Hotwire is one of those. We are the only brand within Expedia that sells what we call Hot Rates.
A Hot Rate is a hotel where you do not know the hotel name before you book; we reveal the name after you book. We do that so you can make great savings. Basically, we hide the hotel name so you can book an otherwise more expensive hotel than you would at retail. You might ask why hotels don't just sell at half price themselves.
Simply because if you knew that, say, a Hilton cost $200 instead of $400, you would never buy at retail; you would always buy at $200, and they would not make as much revenue. That's where Hotwire comes in: if there are unsold rooms on a given day, hotels come to Hotwire to sell the otherwise unsold inventory. They give the rooms to Hotwire at really great rates, typically 30% to 40% off.
We don't have only hotels; we also sell cars and vacations. Today, there are many searches and many bookings done on our site. And I think that's why zero downtime is the key differentiator of this migration. Imagine if you had to take downtime to switch your traffic from one data center to another: you incur a business outage, and you incur a loss.
On top of that, it's not only that you migrate forward. Because you're migrating to a new application stack, with a new installation, many new software stacks on the EC2 machines, and new networking, more often than not you run into problems, and there's a chance you will have to roll back. So any time we say zero downtime, it's not only zero downtime rolling forward, but rolling backward as well.
I'll touch on the details of how many times we had to go back and forth before we got the target cloud stack absolutely right. We did it many times. So zero downtime is really the key. My role at the company is senior data architect. Kamal is the manager of the DBA group; they did the heavy lifting of configuring SharePlex and installing the software. And you will see what it involves, as far as making it all happen with zero downtime.
The talking points today are a little system overview. We'll talk about the migration approach. Then we're going to talk-- what does it mean to have an active-active data center. Then we're going to touch on the very core of this approach, which is the bidirectional replication. When we say bidirectional-- you may find in the SharePlex documentation it's peer-to-peer replication or master-master replication.
So we're not talking about master/read-only. This is pure active-active, which means both data centers take reads and writes. There are challenges that come with this, and that's where I'd like to spend most of my time: conflict resolution. For example, what happens if a customer searches for the same hotel room, same destination? What happens if we book the same room in both data center stacks? What happens if a customer updates his first name in both data centers?
So these are real challenges we had to solve along the way; the challenges and conflict resolution are the very core of this migration effort. Active-active, as I said before, mainly means that we can roll the traffic forward and roll the traffic back. The obvious benefit is eliminating downtime, meaning continuous availability. But I would actually put reducing risk first; that, I think, is the number one benefit of zero downtime.
Meaning, when you create a complete new software stack and move 100% of customer traffic to the new data center, you may experience performance issues anywhere, from the application servers down to the database, and you would have to roll the traffic back. Now if you did not do it with zero downtime, you would incur downtime rolling forward; then you run into performance issues or application-level issues, which is the second problem.
The third problem is taking the traffic back again. And if you look closely, if you roll back, you then have to send the data back to the original data center. There are so many things you can run into when switching traffic. So with zero downtime, I would say reducing risk is the number one benefit.
The migration approach itself means that we have different layers. So in the application stack, we obviously did not move the services. We cloned them. So we had a copy of our system in the data center and in the cloud. So it wasn't like we were shutting down one service in San Jose and starting it up in the cloud. We cloned them, so we had the exact copy of the two available.
The database is a similar story. We cloned our database into the cloud. But it wasn't only a clone; we also did a major upgrade. We were running on 11g and upgraded to 12c, so I would call it a major platform upgrade. We also changed file systems, moving from a Linux file system to ASM, Oracle's Automatic Storage Management.
And for the traffic management-- I will touch on this-- we used Akamai. I want to pause here on this picture, which in the later slides will get a little more complicated. So I tried to abstract as much as possible and make this really stand out and get burned into your memory.
So the key, again, was to clone our applications running on-premise to the cloud, and have really good traffic management in front, with levers that would direct traffic either on-premise or into the cloud. The approach we took with this migration was to clone the applications to the cloud 100%, meaning the cloud stack could take 100% of the customer traffic.
We used one of the levers Akamai provides in the product they call Global Traffic Management. There, you can configure things like: of the 100% of traffic coming in, I want to do a random split, and you say what percentage you want. We initially set something like 1% of the traffic to go to the AWS stack. So this is the illustration of the overall approach, where we decided to build strong and reliable routing through our software to each of the data centers.
And by gradually putting 1% of the traffic onto one stack, our software engineers and site ops were monitoring the health of the stack. If we ran into any problems-- dropped conversion rates, dropped searches, or high latency-- we would seamlessly roll the traffic back from 1% to 0, going from 99% back to 100%.
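Conceptually, the weighted split is just a roll of the dice against configured percentages. Here is a minimal Python sketch of the idea; the function and stack names are illustrative and not Akamai's actual API:

```python
import random

def pick_stack(weights, rng=random):
    """Pick a target stack from percentage weights, e.g. {"onprem": 99, "aws": 1}."""
    total = sum(weights.values())
    roll = rng.uniform(0, total)
    for stack, weight in weights.items():
        roll -= weight
        if roll <= 0:
            return stack
    return stack  # guard against floating-point edge cases

# Rolling the traffic back is just setting the cloud weight to 0:
# every request then stays on-premise.
```

With weights of `{"onprem": 99, "aws": 1}`, roughly one request in a hundred lands on the AWS stack; changing a number in the config changes the split in real time, which is the lever the talk describes.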
So this is the key of the approach. Literally, a customer starts searching on our site: say, I want to go to San Francisco, book a great hotel, and you hit Search. Initially the conversation-- where I say conversation, it's the [INAUDIBLE] application-- may start in San Jose. The moment you see the Search Results and go to Details, if Akamai decides to split the traffic at that point, you will continue booking on the cloud.
So it is as active-active as it can get, meaning both stacks can take reads and writes. At the very core of this architecture is SharePlex, a replication software between two databases from Quest Software. The key differentiator between SharePlex and other replication software is that SharePlex allows you to handle conflicts.
So things like: what happens if a customer changes his zip code or his name in both data centers at the same time? Imagine you open Firefox and Chrome and simultaneously update your account. What happens next? What resolves this conflict, as each update travels across the wire? SharePlex has solutions for this. That's one of the key reasons we went with SharePlex.
In fact, we've used SharePlex for a very long time, over 15 years-- not in the active-active configuration. But we have a warehouse. In between warehouse and the production database, we have what we call operational data store, a read-only database. So historically, we always did a one-way replication of OLTP to a read-only, just for offloading read-type of workloads.
A little bit of system overview, which gets a little more complicated before we go to the actual migration. This is our stack before migration. The key point here is to highlight the way we route traffic to our applications. That's one. Number two, we were already in a kind of hybrid state before the migration.
I don't know how many of you run on-premise. But we already have some things built in AWS. And typically, the services built in AWS will talk back to on-premise. So we have a story of being a monolithic application. And we're in the journey of decomposing it into microservices. So we've done that to a certain degree, where we have decoupled many functionalities of the monolith and moved them into the cloud.
So these boxes, these applications, mean that the monolith has been decomposed into some microservices today. That's the hybrid state, with some things on-premise and some in the cloud. The scope of the migration is: how do we move the applications that are on-premise, including our transactional database and cache-- it's Oracle Coherence cache-- to the cloud?
So now, going back to the main point here: how do we route traffic? On the edge we have traffic management from Akamai. That's your edge, where domain resolution happens and where the header modification for the later split is initiated. I'm sure most of you use Akamai or a similar edge. We have API management, which manages the API keys and the rates at which you can call.
And then we have software-based routing, which is built in-house; it's NGINX based. So how does traffic routing look when a customer comes in? Akamai, using Global Traffic Management, modifies the header: it adds a key-value string to the header. When the request goes through API management and routing, this software-based routing takes the header and looks up where this traffic is supposed to go.
Is it supposed to go on-premise or to the cloud? Depending on the value, it directs the traffic to either of the two. So going back to those levers on the previous slide: when we moved one lever on top, it only meant that, at the Akamai level in Global Traffic Management, we changed what they call weighted routing, where you have two targets.
You literally log in and change it from 0 to 1. Submit the change to Global Traffic Management, and it starts applying a header to all incoming traffic in real time. As the traffic flows deeper, the NGINX routing-- it's a reverse proxy-- directs the traffic either here or there. So the scope of the migration was how we move our software stack, this piece here, to the cloud.
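The reverse-proxy decision itself is simple to sketch. In this Python illustration, the header name and backend URLs are made up; the talk does not name the actual key Akamai stamps on the request:

```python
# Hypothetical backend pools; the real internal URLs are not in the talk.
BACKENDS = {
    "onprem": "https://sjc.example.internal",  # San Jose stack
    "aws": "https://usw1.example.internal",    # us-west-1 stack
}

def route(headers, default="onprem"):
    """Read the routing key stamped at the edge and pick a backend pool."""
    target = headers.get("X-Route-Target", default)
    return BACKENDS.get(target, BACKENDS[default])
```

In the real system this lookup lives in NGINX configuration, but the logic is the same: the edge writes the header once, and every deeper hop only reads it.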
And this brings us to the next slide, which is what I would call the gymnastics of the migration: what the actual implementation steps looked like. For simplicity, I abstracted as much as I could and left only the necessary pieces here. So again, a similar picture, a different view: traffic control, then your application.
This here is just to highlight that we don't only have customer-facing applications in our software stack; we have a variety of batch applications-- credit card optimization, reversals, TripAdvisor review loads. Those are batch jobs. Not everything is customer-facing; we have in-house jobs, and I'm sure all of you do too.
We have some utility programs. And then we have the cache, each talking to the Oracle OLTP database. This down here is to highlight that we have an operational data store, the read-only database, as a staging area on the way to the warehouse. We have a warehouse, [INAUDIBLE] from here just for simplicity. And we have Oracle Financials, which reads data from OLTP, done through [INAUDIBLE].
So this is the starting step of the migration. I abstracted, also, the fact that we already have some applications in the cloud, microservices talking to on-prem. I don't think that's relevant at this point, because this is the stack we're talking about migrating to the cloud.
So the first thing we did was provision an EC2 machine in the cloud with the necessary software on it: Oracle, SharePlex, and ASM. These are pure binaries-- no database, no tables, no schema. Just software running on the [INAUDIBLE] machine. You might ask why we did not go with RDS, for example. The answer is that we have many custom third-party tools running against the current Oracle database-- like Monitor This, the [INAUDIBLE] daemon, the Splunk agent-- and RDS does not support those today. You cannot have them in the cloud.
So our solution was to continue running Oracle on Linux in the cloud, on EC2. This step is just installing the binaries. You may want to ask why [INAUDIBLE] when I earlier mentioned that we upgraded to 12c. That's an important question-- you didn't ask, but it will be answered at a later stage.
Key point number 1: without having any database, any table schemas, anything, we started the replication across the wire. At this moment, we have a database taking 100% of the traffic. We established one-way SharePlex replication with no post, meaning all data coming into the on-premise database is replicated across the wire to AWS and queued up in the import queue.
Or is it the import queue or the post queue?
It's the import queue; the data gets stored in the import queue. We stopped the post purposefully, so that the data is not lost-- it's stored at that point in time. And once the upgrade has happened, you can sync your data: you take the SCN from where it needs to resume, and that comes in at a later stage. You'll understand why we did this. But there is no post; everything is sitting in the import queue.
Yeah-- great. Thanks, Kamal. So notice, there is no mention of a system change number, an SCN, yet. It's just that, some nice afternoon, we decided, OK, we have the software installed, let's start replicating and holding the data in queues. No snapshot talk yet. At some moment you simply start streaming data, knowing that at some point you will have to scrub the data and apply the rest. And that's what comes next.
At this point we did an operating-system, storage-level clone, where we can take an image of the database and clone it into the cloud. A key point here is that this clone carries a specific SCN of the data. It is still 11g, because it's a one-to-one clone of the DB-- same file system. Using RMAN, we restored the data into the cloud.
So at this moment, the data in the cloud database does not look the same as on-premise, because the clone is already behind-- it has been aging since we took it. However, the newer data is sitting in the queue, from where we will later continue posting. The little sequence 1-1-- I'll touch on this later; it relates to how we avoid conflicts between the two data centers.
Most likely you're guessing it's about sequencing: we generate a lot of primary keys from Oracle sequences locally. This just illustrates that, at this point, we still carry the same sequence values in both DBs. At a later stage, that will change. And there's no need for the clone anymore, because the data has been restored with RMAN into the [INAUDIBLE] cloud.
So what the DBAs did next was upgrade the DB in place: they took the database down, applied the new binaries, and ran the upgrade process. Now the database is 12c. Still no post. Still the same values for everything [INAUDIBLE] object-wise in the [INAUDIBLE] database.
So at this stage, if you notice, there's a change of sequences. We logged into the database and recreated all sequences to start with a different offset than the original database. This is one of the key moves for avoiding conflicts. For example, if you create a purchase order on-premise, it starts with 1; if we routed traffic to this data center at this point, it would start with 3. Most of the IDs in our databases are generated from Oracle sequence objects.
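The effect of offset sequences can be simulated in a few lines of Python. The interleaved odd/even split below is one common scheme and an assumption here; the talk only says the two databases use different offsets:

```python
def oracle_sequence(start, increment):
    """Simulate an Oracle sequence: an endless NEXTVAL generator."""
    value = start
    while True:
        yield value
        value += increment

# On-prem issues 1, 3, 5, ... while the cloud issues 2, 4, 6, ...
onprem_seq = oracle_sequence(start=1, increment=2)
cloud_seq = oracle_sequence(start=2, increment=2)

onprem_ids = [next(onprem_seq) for _ in range(3)]  # [1, 3, 5]
cloud_ids = [next(cloud_seq) for _ in range(3)]    # [2, 4, 6]
# The two ranges never overlap, so inserts on either side
# cannot collide on the sequence-generated primary key.
```

In Oracle terms this corresponds to recreating each sequence with a different `START WITH` and a shared `INCREMENT BY`, so both sites can hand out IDs independently without ever producing the same value.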
This is the next key moment-- these are all key components here. Reconcile the post queue: we start from a specific SCN, and the clone was aware of which SCN it was taken at. SharePlex has a great feature where you can say, hey, discard every transaction that happened before a specific SCN. You know how Oracle writes data: it attaches an SCN to each transaction, so you can identify where you are, as far as the age of the data.
So SharePlex has a great feature that allows you to reconcile a queue: purge everything in the queue that happened before a given SCN, and start applying everything after it. That's the key to not losing a single write or update done on-premise. After the reconcile, we have a one-way replication from the data center to the cloud-- still one-way.
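Conceptually, the reconcile is just a filter on the queued transactions by SCN. This is a sketch of the idea, not SharePlex's actual implementation; transactions here are plain dicts:

```python
def reconcile(queued_txns, clone_scn):
    """Purge transactions at or below the clone's SCN (those changes are
    already inside the restored copy) and keep the rest for the post process."""
    return [txn for txn in queued_txns if txn["scn"] > clone_scn]

queue = [
    {"scn": 100, "sql": "insert ..."},  # already captured in the clone
    {"scn": 200, "sql": "update ..."},  # happened after the clone: must be posted
]
to_post = reconcile(queue, clone_scn=150)  # keeps only the SCN-200 transaction
```

This is why the talk stresses that no write is lost: anything older than the clone is safely discarded (it is already in the restored database), and anything newer is applied exactly once.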
So the next step was to establish the bidirectional replication. The typical routine is to start the necessary processes. Again, if you look up the SharePlex implementation, it may come under peer-to-peer, master-master, or active-active. At this point, the database is configured such that, if you had traffic coming into this database-- which you don't, by the way-- it would replicate in the other direction.
One thing not illustrated here is conflict resolution. In SharePlex, you can go in and configure what they call conflict resolution. We'll touch later on how it actually works, but before applying traffic to the DB, you have to configure how exceptions are handled. What if my updated data, coming from on-premise, looks different in the cloud because it has been updated since? How do we handle that conflict?
The next thing is creating another replication from the cloud to ODS. This is an interesting picture, which illustrates that we are replicating not only back to on-premise but also to ODS. For those of you who don't know SharePlex: it already accounts for replication topologies like this. Meaning, if a write comes in on-premise, it's replicated to the cloud, and SharePlex is aware that the transaction was applied by SharePlex.
So it does not forward the same transaction on to ODS, meaning a given transaction never arrives in the ODS twice, and you never get into a loop. SharePlex knows when a transaction was applied by SharePlex-- it uses a specific Oracle user-- so the capture process recognizes any transactions posted by SharePlex and does not export them again.
Because the two DBs are identical at this moment, we can switch our Financials to read from the new database. Next step: cloning the applications. As I said, the migration approach is to clone the application. So we stood up our EC2 machines, services, and the monolith in the cloud and connected them to the DB.
At this point, what we have is an application stack in the cloud, not open to the public-- you can see the traffic management is not active yet. We can log into our servers and test the applications internally: do some searches and bookings before exposing anything to the public. We can do functional tests and nice integration tests, run some of our test suites against this data center, and make sure it's not only handling the transactions but also replicating back to on-premise and forwarding to reporting and finance.
So the next crucial step is, when we've gained enough confidence that the new stack is ready to take some public traffic, we go to the Akamai level and say, go, and we apply 1% of traffic to the new data center. That's where we actively start watching our dashboards and observing the health of the system. We have great site operations monitoring the funnel: Home page, Search, Search Results, Booking, Confirmation. So we monitor the health at this point.
And if we like what we see at 1%, we can go apply more-- 30%, 50%. We had a phase of this migration where we ran 50-50 for more than a week, observing the health and performance of the stack. I'll say it now before I forget: we actually had to roll back many times, due to application routing issues and performance issues.
Sometimes analytics raised concerns-- the numbers don't look right. Before even analyzing the problem, we would just routinely roll the traffic back, because we had the leverage of zero downtime, with no customer impact. So this is the key differentiator: at any step, you are capable of rolling your traffic back without affecting a single customer. Now, when it comes to the challenges of active-active, as I said, it's the conflicts you have to deal with.
What does "conflict" mean? It's kind of an abstract definition, but conflicts arise when the time interval between updates to the two DBs is less than the latency of the replication. Say the replication latency is under one second: when I write into one database, the change gets exported and posted to the other DB in under a second. If you have an update to the same record in each of the databases within that latency window, you have a conflict-- meaning the data arriving at one database has been updated locally since the write happened on the other one.
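That definition boils down to a one-line predicate, sketched here in Python with timestamps in seconds:

```python
def is_conflict(write_time_a, write_time_b, replication_latency):
    """Two writes to the same row on different sites conflict when they land
    closer together in time than the replication can propagate them."""
    return abs(write_time_a - write_time_b) < replication_latency
```

Two writes one second apart with half-second replication latency are safe; two writes a tenth of a second apart are not, because neither site has seen the other's change yet.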
So you have to deal with that; that's what we call a conflict. There's a variety of conflicts that can occur. I tried to group them into four types: simple, custom, complex, and self-inflicted conflicts. The simple ones: if you update simple fields-- flags, phone numbers, dates, names-- resolving the conflict is fairly simple. You just have to decide which host has priority, or you can apply SharePlex's generic conflict resolution: the most recent update wins.
So for example, if the San Jose data center gets its update last, a few milliseconds later-- because you're comparing timestamps, you're down to milliseconds-- you can say, hey, this update wins, and you apply that transaction to both DBs, discarding the other one in one data center. That's the key with the simple ones: one update always wins and the other is discarded, whether by update-time priority or host priority.
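The generic timestamp-priority rule can be sketched like this. The field names and host labels are illustrative; SharePlex's built-in rules work on the replicated row images, not on Python dicts:

```python
def resolve_latest_wins(local, incoming, host_priority=("sjc", "aws")):
    """Most recent update wins; an exact timestamp tie falls back to a fixed
    host priority (earlier in the tuple wins). The losing row is discarded."""
    if incoming["updated_at"] > local["updated_at"]:
        return incoming
    if incoming["updated_at"] < local["updated_at"]:
        return local
    return min((local, incoming), key=lambda row: host_priority.index(row["host"]))
```

Note the tie-breaker: with millisecond timestamps, ties are rare but possible, and without a deterministic fallback the two sites could each keep a different winner and diverge.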
Custom conflicts are a little more complex. We have a table, for example, Rooms Available. Hotels load their rates into our databases: for Sunday, for two adults, a given room type with availability 10. If a customer books that hotel, we decrease the rooms available for that night.
So now think about it. If you book the same room in two different data centers, both data centers decrease the availability from 10 to 9. If we applied simple conflict resolution, you would discard one of the updates and end up with rooms available of 9 in both data centers, which is wrong. You cannot ignore and discard one of the bookings, because the correct rooms-available value should be 8.
So, great news: SharePlex has conflict resolution for what they call custom conflicts, where you can handle this situation. The key technique we used is that SharePlex replicates not only the new value of a field but also the old one-- they call it the pre-image and post-image.
So for example, the update is: update inventory set rooms_available = 9 where the ID is this. When SharePlex replicates this update, it appends "where rooms_available = 10", so we know the value before the update was 10. SharePlex enhances every update statement with what they call the pre-image values-- the state of the record before the update.
So when the record comes across the wire, we know it changed from 10 to 9. What custom conflict resolution gives you is a hook: you can route the conflict into your own local PL/SQL and do your custom coding.
What we did was say: we're not going to apply the new value; we're going to calculate the delta. 10 minus 9-- it decreased by 1. So we apply the delta, not the post-image value, to our own rooms available. That's how the database remained consistent. In fact, the goal of resolving conflicts is to achieve consistency in the DB. With OLTP, consistency is sometimes more important than availability-- but that goes into CAP theorem territory.
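The delta technique is easy to show concretely. This is a Python sketch of the logic we put in the conflict-resolution hook (the real hook is PL/SQL):

```python
def resolve_by_delta(local_value, pre_image, post_image):
    """Apply the remote transaction's delta (post - pre) to the local value
    instead of overwriting the local value with the replicated post-image."""
    return local_value + (post_image - pre_image)

# Both sites booked one of 10 rooms, so each site is locally at 9.
# Applying the remote delta of -1 lands both sites at the correct value, 8.
rooms_after_conflict = resolve_by_delta(local_value=9, pre_image=10, post_image=9)
```

Contrast this with latest-wins: overwriting with the post-image would leave both sites at 9 and silently lose a booking, while applying the delta makes both sites converge to 8 no matter which order the updates arrive in.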
As if this wasn't complex enough, the next use case is the most complex you can experience-- I'd say the mother of all conflicts. Forget about rooms available; now think about a customer account. I don't know about your databases, but in ours, we have a Customer table with a primary key generated from an Oracle sequence [INAUDIBLE]. We insert a customer, the database uses the next value from the sequence, and you have a nice customer ID.
Besides this, we also have a unique key on the email. I'm running out of time, but this is the most important piece, so it's worth the time. If you create an account in each data center within the latency of the replication-- say I open Firefox and Chrome and create an account on both data centers-- I get customer ID 123 in San Jose and 345 in the other one.
Now, with a unique key on that table-- email is the unique key-- let's slow down and think through what happens. I'll go back a picture here. One database will replicate an insert statement to the other DB, saying: insert into customer 123, email johnsmith@google.com. And the other one will replicate: insert into customer 345, with the same email, johnsmith@google.com.
The post process takes the insert and tries to insert into Customer. What happens? It fails-- unique key violation. How do you resolve this? Now think deeper: you're not only creating an account, you're creating a booking with the new account. So not only does Create Customer come through the pipe; insert Purchase Order, insert Payment Receipt, insert Payment Method, insert Traveler, insert Reservation-- you have 16 tables lined up to insert records whose master record, the customer, has been violated.
So those are the complex conflicts: a table with both a PK and a UK. This is one of the areas where we did not achieve 100% automation. However, for the duration of the migration, which was a couple of months, we hit only four or five of these scenarios. You can, however, solve for this with more code in your conflict resolution.
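One way that extra code can work is to keep the existing local row when the unique key collides and remember an id mapping so the dependent child inserts can be rewritten. This is a sketch of the shape of the problem, not SharePlex's API, and it assumes the remote account is a true duplicate whose children can be re-pointed at the surviving id:

```python
def post_customer_insert(customers_by_email, incoming, id_map):
    """Post a replicated customer INSERT. On a unique-key collision, keep the
    local row and map the remote id to the local one, so the dependent inserts
    (purchase order, payment, traveler, ...) can be rewritten to that id."""
    existing = customers_by_email.get(incoming["email"])
    if existing is None:
        customers_by_email[incoming["email"]] = incoming
        return incoming["id"]
    id_map[incoming["id"]] = existing["id"]
    return existing["id"]

db = {"johnsmith@google.com": {"id": 123, "email": "johnsmith@google.com"}}
id_map = {}
# The replicated insert for customer 345 collides on email; 345 maps to 123.
surviving_id = post_customer_insert(db, {"id": 345, "email": "johnsmith@google.com"}, id_map)
```

This also shows why full automation is hard: if the two accounts diverged (different passwords, different payment methods), there is no mechanical answer for which row's attributes should survive.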
It gets tricky when a customer not only creates a booking but also sets different payment methods or a different password; there are some scenarios I don't think we can fully automate. Self-inflicted conflicts-- I'll go back to the picture here. What if your application is not going through the proper routing? Say, for example, you're creating a customer account.
You have a microservice call that just updates the password-- we have a microservice that manages just passwords. It's not properly routed, so it makes a call across the data centers. You have an insert into Customer here, but the update of the password comes from there. And because it's a cross-data-center call, it happens, for sure, in less time than the latency of the replication. So my callout here is that the last type of conflict is when your application induces conflicts on its own.
Conflict resolution using SharePlex-- you can stop by after this presentation. If there's interest, we can walk you through exactly how you configure SharePlex for resolving conflicts. You have to wrap triggers, apply generic conflict rules, and avoid conflicts. There's so much that goes into resolving conflicts that this, on its own, could be a three-hour presentation.
Think about latency-- just one last key point here. If you have a job running in one data center that updates customers-- say a statistics API calculates statistics for each customer, you get the file and apply it to the database, updating half a million customers-- you're replicating all of that across the wire. You don't need an outage or SharePlex issues between the two data centers to build up latency.
Just by the nature of having so many updates flowing from one data center to the other, you induce latency, because SharePlex de-queues sequentially on a given queue. So the thing to watch out for here is to avoid conflicts by creating enough queues for replication. There are constraints on why you don't want to end up with 1,000 queues-- that goes into referential integrity.
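Spreading tables across queues while keeping FK-related tables together might look like this. The queue names and table groupings are invented for illustration; the talk only states the constraint, not the assignment scheme:

```python
def assign_queues(fk_groups, n_queues):
    """Round-robin groups of FK-related tables across named post queues.
    Tables in the same group share one queue, so parent and child rows are
    applied in order and referential integrity is preserved."""
    assignment = {}
    for i, group in enumerate(fk_groups):
        queue = "q{}".format(i % n_queues)
        for table in group:
            assignment[table] = queue
    return assignment

groups = [["customer", "purchase_order", "reservation"], ["inventory"], ["reviews"]]
plan = assign_queues(groups, n_queues=2)
# customer/purchase_order/reservation share q0; inventory gets q1; reviews wraps to q0.
```

The trade-off the talk points at is visible here: more queues means more parallelism and lower latency, but every foreign-key relationship pins a whole group of tables to one queue, which caps how far you can split.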
So I'm ready [INAUDIBLE] to answer questions on these issues, and Kamal as well. There are other considerations if you wanted to turn this from a migration into a permanent active-active, master-master, or peer-to-peer setup; here are the things to think about for having that as your application's final state. You basically have to adjust every tier of your stack. Just a few key callouts: for balance-type fields, consider event sourcing-- don't maintain one Rooms Available field.
Rather, have event sourcing, where you're emitting updates and have a [INAUDIBLE] process to apply them. And what if it's not active-active but active-active-active? If you have more regions, you might run into some interesting scenarios where simple conflict resolution won't do. Prefer natural keys-- this goes back to the Customer table. You might not use synthetic IDs all the time; sometimes it's better to go with common sense and maybe make the email the unique key.
You have to make that decision carefully, because a customer cannot change his email if you go that route. There are always some ups and downs when we're [INAUDIBLE]. For replication changes, use more queues. So that's basically it.
I can dive into each of those slides deeper, if there's interest. We have the very same presentation tomorrow, same time, so we'll be here. And if you want to hit us up earlier to focus on some specifics-- sure, that would be welcome.