How hard is it to setup replication? Most information out there makes it simple since they are high-level replication setup. The instructions you find are:
- Installing the software
- Starting up the software
- Run preparation for replication
- Create a replication configuration (set of objects for replication)
- Starting replication
- Instantiation of the replicated object
- Remove all duplication, avoid double posting
- Start the apply process
As you can see that this is quite generic which can apply to most replication solution out there. In this set of BLOG, I will focus on item 4 - Create a replication configuration. From this point on, the information is quite specific to SharePlex and its syntax. This sounds simple but I will point out some of the concerns as we dive deeper.
- Wildcard replication
- The ability to use a wildcard to select objects to replicate
- For example:
- The ability to use a wildcard to select objects to replicate
- In this example, I'm telling SharePlex to find all objects, tables, and sequences, under the user called "TOM" and put it into replication
- SP_OCT_AUTOADD_ENABLE – a parameter to tell SharePlex to add any new object into replication if it qualifies for replication from the configuration file
When do I choose to use wildcard replication with autoadd?
With my replication experience over the years, with replication, you either have a simple or more complex one. All replication start out simple and the complexity comes in when the volume is more than one process can handle. Since you have no control over the volume factor, you will have to find an alternative to resolve this concern.
If you lucky enough to have a simple replication, low volume, wildcard replication and Autoadd is the right choice. You will have a more simplified method of replication with easy control.
Now on the other hand, if your volume is high, what choices do you have?
Parallelism drives throughput. Every product goal is to have the maximum amount of parallelism but sometimes it is not enough. You will need more process to reach the throughput that you desired. With replication, the apply process seems to be the slowest process. For SharePlex, this is Post. To get more post, you will need to use named queue feature through the usage of the SharePlex route.
With the above definition, I can easily create a configuration that allows me to have multiple post queues to increase performance throughput. For example:
This is quite simple so what is the real problem? The true problem is how do I split the data so I can get the best possible throughput but doesn’t violate any table relationship. If you replicate data for reporting and if the report depends on 2 or more tables to generate, you want them to be consistent with the same view. What if you sent these related tables down the different post process, the result is that you might not retain your table consistency. If your table consistency is not being honored, your report will have a potential for error. I'm not sure if I like to be the one to break the news to my CFO about uncertainty in my report. I might have to find another career path.
If this is a big problem, how can I ensure the table consistency?
- Database relationship definition – for Oracle, that would be constraints, especially Foreign Key Constraint.
I guess you can use something like this to find all the referential constraints and its related tables.
select owner,constraint_name,constraint_type,table_name,r_owner,r_constraint_namefrom all_constraintswhere constraint_type='R'and r_constraint_name in (select constraint_name from all_constraintswhere constraint_type in ('P','U'));
Based on the information, I can create a config file to group those tables with relationship to avoid data anomaly. That is definitely a good idea but what about those table relationships that handles within the application? Hmm, this is a hard one for the DBA since most don’t have intimate application knowledge to make this call. Even if they do, what is the confidence level on it? In these cases, we usually reach out to an application DBA. Even the Application DBA might not have all the answer for us. Most application DBA knows about the top tables in the application. They won't know all tables in the application. Sometimes, an application is so big that no one knows enough about it. Take SAP, for instance, it has well over 30,000 tables. Forget about the relationship, I doubt if anyone can name the 30,000 tables. SAP doesn’t have a database referential constraint, can I make a leaping faith that they don’t have any table relationship? Well, we know that is an ignorant answer, so what can we do?
Over the years, I have talked to a few DBAs that adamant about their knowledge of the table relationship and they are 100% confident of how they group the table. My hat goes off to them since I can never be so sure about the relationship, especially if the application has over 100 objects. I have to admit the older I am and the more I see, the more skeptical I am. Aristotle says it best.
Aristotle — 'The more you know, the more you know you don't know.'
I’m not smarter than the average bear so I’m sure I’m not smarter than Aristotle; I will listen to his wise word.
The SIMPLE problem is now getting harder and harder. What if we merge the database and application knowledge. Combine my finding at the database level and enlist an application expertise to help me with the dissecting of the tables, this should build my confidence in front of my CFO. Hmm… It is getting better. But am I brave enough here on the solution here? Maybe with the best mind from the database side and application side, I should be quite confident. Somehow this still doesn't sit well with me. As the application gets bigger and bigger, no one person will know it all and my confidence level will start to diminish.
What if there is a way that we can build a knowledge database using the knowledge of table relationship that we see in replication and it will get better with exposure? Would this be the holy grail to determine how you can get maximum parallelism without breaking you transaction modeling? SharePlex happens to have one of the unique features that studies the data model in replication and the longer it runs the better the knowledge it's built. SharePlex has the ability to scan through the redo_log/archive_log for changes and it has information about the transaction and changes. SharePlex starts to build the replication model by combining the database relationship with application relationship by shifting through the replicated data.
This SharePlex ability in-house in a command called analyze. You can run it for specify amount of time and during those time it will study your workload and transaction table relationship. It is taking the database modeling with foreign key constraint in consideration as well. Once the period end, it will provide you with the grouping of tables that has a relationship with each other. With this information, I can safely split the tables into a separate queue for best performance. The best part is that it upheld the table relationship. Here is an example of how we can run it.
First, we will create a configuration for it to analyze.
Sample config – test
This will start the program and it will start collecting information for the next 7 days. This should reflect all the tables that you want to replicate. It is the same syntax as the SharePlex config. Analyze will spawn a capture process to study your transactions and will group them with table relationship. Once the run is over, SharePlex will suggest a replication model for you to use in your configuration. The longer you run it, the more accurate the model will be.
Here is the output of the sample run.
Having access to this information, you can easily build a configuration file with the best possible throughput without worrying about your table relationship. I guess at this point, I'm pretty confident in front of my CFO about my reports.
As the application changes with patches, this model might change. We will have to go through the same process to understand the new relationship and make the appropriate change. Well, knowledge is power. First, we like to know if our model changes and taking action against the change is up to you. I will leave that topic for another blog at a later time.