Data Subsetting Using a SQL Query with Parameters

A Step by Step Example of Data Subsetting

Let’s use the following table EMP of schema SCOTT as an example. Suppose you have your entire employee data stored in the EMP table of your central DB. The EMP table has the following definition. Each employee has an employ ID (column EMPNO) and belongs to a department (column DEPTNO):

CREATE TABLE “SCOTT”.”EMP”

(    “EMPNO” NUMBER(4,0),

    “ENAME” VARCHAR2(10),

    “JOB” VARCHAR2(9),

    “MGR” NUMBER(4,0),

    “HIREDATE” DATE,

    “SAL” NUMBER(7,2),

    “COMM” NUMBER(7,2),

    “DEPTNO” NUMBER(2,0),

     CONSTRAINT “PK_EMP” PRIMARY KEY (“EMPNO”) ENABLE,

     CONSTRAINT “FK_DEPTNO” FOREIGN KEY (“DEPTNO”)

     REFERENCES “SCOTT”.”DEPT” (“DEPTNO”) ENABLE

);

 

Your company has grown very big and become geographically distributed. Employees started to have problems accessing the EMP table because the central DB is under heavy load and the network is not always fast and stable. So you decide to create a local DB for each department. The EMP table in a local DB only contains data for employees belonging to that department. Here we are subsetting by DEPTNO. In real world applications, people may do subsetting by country, region, branch office name/ID, and clinic name/ID etc.

Following are the steps to take to use Pervasync to synchronize the EMP table in a local DB with the EMP table in the central DB.

  1. Prepare the central DB schema.

You may need to make some adjustments to the structure of your existing table to make it ready to be published as a sync table.

Pervasync requires that all sync tables have a primary key. The EMP table already has a primary key: column EMPNO. So, we are fine.

Assume we want to do the subsetting by DEPTNO. Then, we should have a table that defines the association between DEPTNO and EMPNO. One way to do this is to add a DEPTNO column to EMP table. EMP table already has a DEPTNO column, so we don’t need to do anything. Another way is to create a separate table that has two columns EMPNO and DEPTNO with EMPNO being the primary key column. You know what I mean if you learned about table normalization.

Now we have a decision to make: do we want to keep EMPNO as the sole primary key column or do we want to make DEPTNO and EMPNO together a composite primary key? Both have pros and cons. You make the decision based on your business requirements. If we keep EMPNO as the sole primary key column, we have to make sure that local DB do not assign conflicting EMPNO values when adding new employees locally. In Pervasync, you can create a sync sequence for the EMPNO column and draw globally unique values from the sync sequence locally (For more information and explanation of sync sequence, see section 5.5.2 Sync Sequence). If you make DEPTNO and EMPNO together a composite primary key, you only need to make sure EMPNO is locally unique when you add a new employee. If your company does not require every employee to have a company wide unique ID (value of EMPNO), you can choose the latter way.

In summary, to make a table ready to be published as a sync table, it has to have a primary key defined and an association between the primary key and the subsetting column defined.

  1. Publish the sync table and add it to the sync schema

When you define the sync table via the Pervasync web admin console at Sync Schemas-> Sync Tables-> Add new Sync Table, or via the admin Java API using method addSyncTable of class SyncServerAdmin, you have a chance to supply a data subsetting query. If you already have a sync table that you want to use for subsetting, you can click “Update”, and then edit “Subsetting Query”. Data subsetting query determines which rows go to which local database. There is no restriction on the query as long as it returns the Ids of the rows to be synced to a local DB. If we keep EMPNO as the sole primary key column we would use the following query:

SELECT EMPNO FROM SCOTT.EMP WHERE DEPTNO=${DEPTNO_PARAM}

 

If we make DEPTNO and EMPNO together a composite primary key we would use the following query:

SELECT DEPTNO, EMPNO FROM SCOTT.EMP WHERE DEPTNO=${DEPTNO_PARAM}

 

In the above queries, we have a placeholder, ${DEPTNO_PARAM}. The variables enclosed in ${} are subscription parameters. Queries for other tables could share the same parameter names. The collection of unique table level parameters becomes parameters for the published sync schema.

The queries in our example are among the simplest as they have only one parameter and reference only one table. One could have multiple parameters and use table joins in the query. Pervasync employs different algorithms for simple and complex queries to achieve maximum efficiency for both cases. You need to specify a subsetting mode when doing the subsetting: SIMPLE or COMPLEX. Subsetting mode SIMPLE is different from COMPLEX in that a SIMPLE query can select from only one table, which could be the sync table itself or a different table. For more information and explanation of SIMPLE or COMPLEX subsetting modes, see section 5.4.2: “Subsetting Modes: SIMPLE and COMPLEX”.

  1. Subscribe a sync user to a sync schema

For each local DB, you create a sync client – a combination of user and device. When you subscribe a sync client to a sync schema via the Pervasync web admin console at Sync Schemas-> Schema Subscriptions-> Add new Schema Subscription, or via the admin Java API method addSchemaSubscription of class SyncServerAdmin, you have a chance to supply a value for each of the subscription parameters. If you already have a Schema Subscription that you want to use for subsetting, you can click “Update Subscription”, and then enter the value of the parameter, which you specify as a placeholder in subsetting query at step 2. In our case, we will see DEPTNO_PARAM listed as a parameter name and we can assign a value. For example, if we want a sync client to sync data for DEPTNO=1, we put 1 as the value for DEPTNO_PARAM.

  1. Sync

During synchronization for the sync client we just subscribed, effectively the following query is executed:

SELECT EMPNO FROM SCOTT.EMP WHERE DEPTNO=1

 

So that only the rows with DEPTNO=1 are synced to this client.

 

Subsetting Modes: SIMPLE and COMPLEX

The parameterized SQL query is at the heart of data subsetting. There is virtually no restriction to the query as long as it returns the primary key values of the desired data sub-sets. This gives users great flexibilities but at the same time, it poses great challenges to the sync engine, which has to figure out physical and logical changes to the data sub-sets. Algorithms that apply to generic queries may not be scalable. On the other hand, algorithms that perform well for simple queries may not apply for complex queries.

Pervasync asks users to indicate the complexities of their queries via subsetting modes so that different algorithms can be applied to achieve best possible performance and scalability.

Criteria for SIMPLE query

There are two subsetting modes, SIMPLE and COMPLEX. Criteria for SIMPLE query are as follows.

  1. The query can reference only one table. This table can be the table you are subsetting or a different table.

    The query predicates that involve the table columns can use not only “=”, but also other conditional operators like “>”, “<“, and “like” etc. For example, the following queries qualify as SIMPLE queries,

    SELECT EMPNO FROM SCOTT.EMP WHERE DEPTNO=${DEPTNO_PARAM} AND MGR = ${MGR_PARAM}

 

SELECT EMPNO FROM SCOTT.EMP WHERE DEPTNO=${DEPTNO_PARAM} AND SAL < ${SAL_PARAM}

 

SELECT EMPNO FROM SCOTT.EMP WHERE SAL > ${SAL_MIN} AND SAL < ${SAL_MAX}

 

SELECT EMPNO FROM SCOTT.EMP WHERE JOB like ‘%Engineer’

 

  1. The query cannot use functions, expressions, joins, set operations like “union”, “intersect” etc.
  2. Query referenced columns (e.g. DEPTNO, MGR, SAL and JOB in the above examples) cannot have NULL values and the column data type cannot be too long. The reason is that these columns will become part of a primary key for a Pervasync internal table so they have to satisfy primary key column criteria of the database. If you cannot make these columns “NOT NULL” and have shorter length, mark the subsetting query COMPLEX.

Scalability implications

In general, SIMPLE queries are much more scalable than COMPLEX queries in terms of database space and sync engine processing time. For SIMPLE queries, sync engine performance is virtually independent of the number of sync clients you have.

COMPLEX queries can be scalable if a large number of clients share a same set of parameter values. For example, suppose you have 1000 clients and there are only 10 unique values for the subscription parameter DEPTNO_PARAM, the resources needed by sync engine are about the same as 10 clients each having a unique value for the parameter. Still, if at all possible, we recommend you use SIMPLE queries as opposed to COMPLEX queries.

Converting a COMPLEX Query to a SIMPLE Query

If you mark a SIMPLE query as COMPLEX, sync will still work but just not as efficiently. However, if you simply mark a COMPLEX query as SIMPLE, the sync may not work – you may find that client is not getting the changes you expect.

Fortunately, there are ways to convert a COMPLEX query to a SIMPLE query. For example, let’s say you want to sync users’ emails to their mobile devices and you use email recipient as subsetting parameter. The query qualifies as a SIMPLE query:

SELECT EMAIL_ID FROM USER_EMAILS WHERE RECEIPIENT = ${EMAIL_ADDRESS_1} OR RECEIPIENT = ${EMAIL_ADDRESS_2}

 

Now, let’s say you only want to sync last 30 days’ emails to save space on device. The query becomes COMPLEX:

SELECT EMAIL_ID FROM USER_EMAILS WHERE (RECEIPIENT = ${EMAIL_ADDRESS_1} OR RECEIPIENT = ${EMAIL_ADDRESS_2}) AND RECEIVE_DATE > (SYSDATE – INTERVAL ’30’ DAY(5))

 

To convert this query to a SIMPLE query, you can add a column INCLUDE_IN_SYNC, which takes values of “Y” for yes and “N” for no and defaults to “N”. You create a DB job to update the INCLUDE_IN_SYNC values every night:

UPDATE USER_EMAILS SET INCLUDE_IN_SYNC=”Y” WHERE INCLUDE_IN_SYNC=”N” AND RECEIVE_DATE > (SYSDATE – INTERVAL ’30’ DAY(5))

 

With this in place, you would be able to use the following SIMPLE query to achieve the same result as the original COMPLEX query.

SELECT EMAIL_ID FROM USER_EMAILS WHERE (RECEIPIENT = ${EMAIL_ADDRESS_1} OR RECEIPIENT = ${EMAIL_ADDRESS_2}) AND INCLUDE_IN_SYNC=”Y”

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s