FACT v3 Available
DB Extracts Up to 3X Faster

IRI is pleased to announce the third generation of Fast Extract (FACT) software for very large database (VLDB) unloads. FACT is a key component in data warehouse ETL (extract-transform-load) and ELT operations, offline reorgs, archive projects, and database migrations.

FACT v3 extracts big data from DB tables into flat files far faster than prior versions and other unload methods. On a test workstation, FACT v3 unloaded a 260GB Oracle test table in 54 minutes, vs. 161 using v2.5. In addition to parallel hints and max row specifications, FACT v3 supports three proprietary query splitting methods for multi-threaded extraction.

FACT v3 on Unix and Linux servers currently extracts from Oracle 8.1 (and above) and DB2 UDB 8.2 (and above), and will eventually also support:

  • Sybase (ASE and IQ) – 12.5 and above
  • Altibase – 4.2 and above
  • Tibero – 4.0 and above
  • MySQL – 5.0 and above
  • MS SQL – 2003 and above

along with: 

  • Oracle – 8.2.0 and above
  • DB2 – 8.2.0 and above
  • MS SQL – 2003 and above

on Windows.

Please contact fact@iri.com if you are interested in testing FACT v3. For more information, see also:



CoSort Speeds Top BI Tools
BOBJ, Cognos, Microstrategy 

A senior BI/DW architect in Singapore has recently proven the performance benefit of CoSort for centralized pre-transformation (data franchising) on three of the top data visualization applications (BI tools) in the market. Time-to-delivery tests conducted with and without CoSort in SAP's Business Objects (BOBJ), IBM's Cognos, and Microstrategy reporting environments showed a dramatic performance difference with even small amounts of input data.

Given the need to sort, join, and aggregate flat file sources prior to producing reports in each BI tool, comparison tests were conducted whereby these transforms ran first inside the tools (i.e. within the BI layer), and then externally -- via CoSort's Sort Control Language (SortCL) program (i.e. outside the BI layer). In every case, transforming the data with CoSort first made sense. It was not only at least 2-3 times faster, but eliminated the need to repeat the transformations and maintain the data and metadata in every reporting project.

For more information on the benchmarks, refer to the individual news announcements at:
http://www.iri.com/news/category/press_releases/
and to see benchmark details and screen shots, see the blog articles on each under:
http://www.iri.com/blog/category/business-intelligence/

 


Virtual IRI Workbench Demos
VMWare and IBM Virtual Appliances

Those who have installed and configured the IRI Workbench to run on their own systems know there is some work to do preparing the GUI environment. Demo installations have required downloading the GUI and CLI packages, licensing CoSort and FACT, connecting to Oracle over different protocols, and importing the demo archive. IRI has now made that entire process much easier - and can deliver a demo-to-market package far faster - through a VMware image. Those with VMplayer can play the image and find everything ready to go.

IRI is also in the process of preparing a similar deliverable based on Linux for the IBM Kernel-based Virtual Machine (KVM) and PowerVM
® environments via IBM's Virtual Appliance Factory (VAF). This will also enable IRI to certify and market CoSort, FACT, FieldShield, RowGen, and NextForm within the IRI Workbench "Ready for PureSystems."

Also under consideration are on-line versions of these images that will preclude the need for downloads and installation. Cloud users can eventually log in to transform and protect their data under a Software as a Service (SaaS) business model with IRI.
 


Tech Tip: Secure National IDs
FieldShield Data Masking Examples

Among the different field-level data protection functions offered in FieldShield and CoSort (SortCL) are character masks for all or part of fields containing personally identifying information (PII). Users can define the replacement character and specific byte locations within each datum to permanently obscure certain parts of the field value.

Standard masks for US Social Security (SSN) and credit card number (CCN) values are provided in the FieldShield data masking dialog in the IRI Workbench GUI, built on Eclipse. The dialog also allows custom character replacements for any string, and then creates target /FIELD specifications in FieldShield (or SortCL) job scripts to apply the 'replace_chars' or 'mask' function at runtime.

Below are examples of /FIELD statements for commonly used data masks of other national ID numbers that FieldShield or CoSort (SortCL) users can specify in the /OUTFILE sections of their job scripts:


# Canada SIN, e.g. 459-238-962 => 459-xxx-xxx
/FIELD=(ID=mask(id,"x",5,3,"x",9,3) ...

# Chile RUT, e.g. 320505-13-3565 => xx05xx-xxx5
/FIELD=(ID=mask(id,"x",1,4),POS=2,SEP=",")

# Hong Kong NID , e.g. 27224729 => 2722xxxx
/FIELD=(ID=mask(id,"x"5,4),POS=2,SEP=",")

# Korea SSN, e.g. => 640907-1031419 => xxxxxx-1031419
/FIELD=(ID=mask(id,"x",1,6),POSITION=2,SEP=",")

# Malaysia NRIC, e.g. => xx05xx-xxx5
/FIELD=(ID=mask(id,"x",1,2,"x",5,2,"x",8,2,"x")...

# Nordic countries PIN/CPR, e.g. => 211099-xxxx
/FIELD=(ID=mask(id,"x",8,4),POSITION=2,SEP=",")

# Spain DNI, e.g. 29.572.047C => xx.xxx.xx7C
/FIELD=(ID=mask(id,"x",1,2,x,4,3,x,8,2) ...

# Taiwan NIC, e.g. F250286893 => F25xxxx893
/FIELD=(ID=mask(id,"x",8,4),POSITION=2,SEP=",")

# UK NINO, e.g. SP-123456-D => SP-xxxxxx-D
/FIELD=(ID=mask(id,"x",4,6),POS=5,SEP="|")

If you have any questions or need help implementing data masking or other content-aware data loss prevention functions (like encryption, pseudonymization, randomization, de-identification, hashing, etc.), please email fieldshield@iri.com.

Data Franchising  
= BI Optimization
 
The data franchising business is not about business franchise data! Rather, data franchising is a term of art in data integration that refers to packaging big data for the benefit (and to optimize the performance) of business intelligence (BI) tools.

Richard Sherman coined the term 10 years ago as a component of the "Data Integration Framework" (DIF), with the full article (and a synopsis of his vast BI/DW implementation background) here: http://www.information-management.com/issues/20050301/1021516-1.html

While the term data franchising is not in wide use, BI optimization is. The latter tends to encompass architectural considerations like the integration, sharing, and quality control of data for those who will use that data for analysis. BI consulting companies like IT Renaissance discuss BI optimization in such terms; data franchising lives in the context of BI optimization and data integration.

Both data franchising and BI optimization require that the transaction tables and files (the information sources for analytics) be integrated or processed outside of, and prior to the BI layer. Data franchising tasks are specific data transformations that take place after data integration has pulled data from source systems into the data warehouse (DW) or operational data store (ODS).

According to Sherman, the process of data franchising "overlaps with data preparation with regard to extracting the data from a data source and reformulating it into a data mart. However, the data franchising component does not need to perform many of the more sophisticated data integration functions of data preparation processes." 

The specific data preparation activities in data franchising are typical transformations like filtering, sorting, joining, aggregation, and reformatting (to store the data in formats that BI tools can digest, like CSV or XML):



So, data franchising is a subset of data preparation and data integration, in that it makes the following assumptions:

1) the data sources in play come from the DW or ODS; i.e. after the initial data sources were acquired;

2) data cleansing has already occurred between acquisition and the DW so the franchising tool is operating with clean data;

3) slowly changing dimensions and normalization issues have also already been addressed; and,

4) the DW/ODS data is in a relational format with a documented data model and data definitions readily available.


These conditions are also perfect for using CoSort's Sort Control Language (SortCL) program to perform the heavy lifting of data franchising. Some of the more complex work of data integration may or may not be in the purview of SortCL capabilities (like esoteric data acquisition or advanced data cleansing).

In fact, Sherman points out that "although you may be more productive if you use a single ETL tool for both data preparation and data franchising, it may be more economical and faster to use a less robust ETL tool." This also means that you do not necessarily even need a full-fledged ETL tool to optimize the performance of business intelligence platforms, particularly if you do not have a formal DW or ODS to begin with.

However, it is worth mentioning with respect to the data integration issues above that SortCL can handle most of them anyway (including slowly changing dimensions, pivoting, and some data scrubbing). 

There are several additional benefits to franchising data (and removing the overhead of data transformation from the BI layer) that Sherman's article does not mention, like:

1) Speed - specialized tools like CoSort are optimized for high volume data transformation, and thus perform faster than BI, ETL, SQL, 3GL and other engines

2) Simplicity - coding and combining all the transformations in a single, self-documented 4GL job script is the easiest way to learn and modify transformation tasks

3) Singularity - running transformations centrally, before the BI layer, allows multiple BI applications to make use of the results

4) Reuse - centralizing the data definition and manipulation metadata in text repositories facilitates identification and repeat use


Application-specific definition or execution of these transformations, on the other hand, requires multiple, slower transformations, and limits access to the transformation metadata and the transformation results.

Successful examples of data franchising can be seen in CoSort's pre-processing of data for BI tools like Business Objects from SAP, Cognos from IBM, and Microstrategy. See the mention of the benchmarks in this newsletter, the IRI blog, and press releases posted under http://www.iri.com/news/category/press_releases/.