RowGen v3
Big Test Data - Automatically!

IRI's uniquely powerful test data synthesizer, RowGen, is being updated to v3.1. This major new release allows DBAs and data architects to rapidly create and load the quantity and quality of test data necessary to populate a database or enterprise data warehouse (EDW). It also uniquely integrates the rich data transformation and formatting capabilities of CoSort's SortCL program to ensure that the most intelligent, production-like data values and formats are created.

RowGen v3 will run from the IRI Workbench GUI built on Eclipse, on the command line, or from batch programs. RowGen Control Language (.rcl) files specify the generation of test data in database table, flat-file, and custom report formats. For test database creation, RowGen's 'DB Test Data' wizard in the IRI Workbench automates these steps:


Parse - select the schema and tables you want to populate; RowGen v3 then translates the DB table descriptions and integrity constraints into .rcl scripts that specify the source structures, dependent sets, and data creation, in the order necessary to populate the tables in the right format, and with all primary and foreign key relationships intact (regardless of complexity).

Generate - build and run the .rcl scripts to create one test file per table that can be bulk loaded and/or saved for future use.

Populate - bulk load the target tables in the right order with the pre-sorted test data ... test data that is not only structurally and referentially correct, but also safe for outsourcing (compliant with data privacy laws) and truly intelligent; i.e. realistic, business-rule-conforming, and robust enough to stress-test applications with the right data value ranges and row volumes.

Another RowGen v3 wizard creates ad hoc test data targets in custom flat-file, report, and table formats. RowGen job scripts are based on, and compatible with, the CoSort SortCL program, and as such can leverage many of the same data manipulation and formatting functions for customizing the look and feel of the test targets. Support is also provided for all-pairs and valid (joined) pairs, weighted distributions, and the inclusion of randomized real data to enhance the realism of test data without compromising security.
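To make this more concrete, below is a minimal sketch of what a generation script for a single flat-file target might look like. This is a simplified, hypothetical example -- the directives follow the SortCL /FIELD convention, but the exact layout, set-file reference, and row-count controls in a real .rcl file should be taken from the scripts the wizard generates, or from the RowGen manual:

   /* illustrative .rcl sketch for one bulk-load-ready target (hypothetical) */
   /OUTFILE=CUSTOMERS.dat
   /PROCESS=RECORD
   /FIELD=(CUST_ID, TYPE=NUMERIC, POSITION=1, SEPARATOR="|")                  /* unique key values          */
   /FIELD=(LAST_NAME, TYPE=ASCII, POSITION=2, SEPARATOR="|", SET=names.set)   /* randomized real data       */
   /FIELD=(BALANCE, TYPE=NUMERIC, POSITION=3, SEPARATOR="|")                  /* value in a rule-based range */

A script like this can then be run from the IRI Workbench, the command line, or a batch program to produce the test file for loading.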



IRI Software Reach Growing in India, Indonesia
Serving Domestic Users and MNC BPOs

Though based in the United States, where most of its customers are, IRI is strategically focused on growing its business abroad through partners who serve big data users concerned about price-performance, functional versatility, and ease of use. IRI is proud to announce and welcome these new distributors:

Unisoft Infotech Pvt Ltd. is a Bangalore-based solutions provider best known for its SAP expertise and clientele around the world. Unisoft is bringing IRI software to users in India's sizable domestic markets for big data processing and data-centric protection solutions, by leveraging its offices throughout the country and its partnerships with business process outsourcers (BPOs) that develop, implement, and support applications for use in India and by multinational corporations (MNCs). Satellite offices in Chennai and Delhi can also provide licensing and support for CoSort, FACT, FieldShield, and RowGen.

PT Cybertrend Intrabuana and PT Sinar Surya Teknologi (SST) are two more authorized IRI resellers in Jakarta that have a base of customers and prospects in telco, government, and financial services institutions. Cybertrend has particular expertise in open source BI technologies like Pentaho, where IRI tools like CoSort add value as a data franchising (preparation) engine. SST has a more Oracle-centric focus, indicating a need for Fast Extract (FACT) for Oracle and CoSort in ETL situations. SST also has a competency in DB security and DLP where FieldShield and RowGen are good fits.

These and other partnerships are forming in South and Southeast Asia, as well as other parts of the world. Watch this space for ongoing appointments and contact info@iri.com if you need to work with, or are interested in becoming, an IRI software representative.



Using FieldShield in DB Apps
... and Protecting Every Row 

IRI's FieldShield package is uniquely capable of protecting personally identifiable information (PII) at the field level, across multiple databases and flat files, with functions in ten distinct categories determined by business rules and specified data conditions:

  • Encryption
  • Masking
  • Pseudonymization
  • Expressions
  • De-Identification
  • Encoding
  • String Shifting
  • Replacement
  • Randomization
  • Omission

Database applications that update and query tables may need to secure data going into, or being retrieved from, those tables. The data can be protected on its way into the table, on its way out, or at rest inside the database. In each case, the goal is to prevent unauthorized access to sensitive information.
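As an illustration, the sketch below encrypts one column of unloaded row data before it is reloaded or handed to an application, using FieldShield's SortCL-compatible field syntax. The file names and layout are hypothetical, and the function name shown (enc_fp_aes256_alphanum, a format-preserving AES-256 routine) is one plausible choice from the encryption category above -- consult the FieldShield manual for the exact function names in your release:

   /INFILE=customers.dat              /* rows unloaded from, or bound for, a table */
   /PROCESS=RECORD
   /FIELD=(CUST_ID, TYPE=NUMERIC, POSITION=1, SEPARATOR="|")
   /FIELD=(CC_NUM, TYPE=ASCII, POSITION=2, SEPARATOR="|")
   /OUTFILE=customers_protected.dat
   /FIELD=(CUST_ID, TYPE=NUMERIC, POSITION=1, SEPARATOR="|")
   /FIELD=(CC_PROT=enc_fp_aes256_alphanum(CC_NUM), TYPE=ASCII, POSITION=2, SEPARATOR="|")   /* format-preserving encryption */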

This blog article lists options for using FieldShield in your programs:
http://www.iri.com/blog/data-protection/using-fieldshield-in-database-applications/

Consider also improving the performance of your applications by protecting only the data that actually needs it:
http://www.iri.com/blog/data-protection/selecting-pii-for-secure-queries-fieldshield-filters/

Contact fieldshield@iri.com if you are interested in incorporating one or more FieldShield protections into your application environment.



Tech Tip - CoSort Speed Tuning
Modify System and 'cosortrc' Parameters

The availability of resources and how they are allocated to CoSort jobs can have a profound impact on the efficiency and throughput of big data transformations like sorting, joining, and aggregation.

First, consider the number of system threads you want to use for parallelism. The thread_max value in the cosortrc file determines the maximum number of computational threads that can be assigned to each job. Even if your system can apportion more threads than it has physical CPU cores, IRI recommends licensing only as many threads as there are cores, so that each thread's work is tied to a physical resource.

Next is memory. Increasing MEMORY_MAX will improve performance when the data being processed can fit in RAM.  When processing large data sets, temporary work files are used and the performance improvement from a large MEMORY_MAX setting is less significant. 

Proper HDD (work_area) specification is important because jobs too large to fit in memory will use temporary files on those drives. During transformation, the overflow data is written to the work areas you specify, then read from there and merged onto the target drive. Because of this flow, the source and target drives are never accessed at the same time, so it is fine if these are on the same physical drive. But because the work areas are accessed at the same time as the source and target drives, they should be on separate physical drives (and I/O controllers) from the source and target drives -- and even from each other if you use multiple threads and have other available drives (since each thread can write to a separate temp drive).


Most of the time consumed by a large sort job is file I/O; even in the most complex SortCL scripts, the majority of processing time is spent reading and writing data to and from disk. Thus, anything you can do to increase memory and accelerate I/O on your existing systems will speed up big data transformation throughput. One of the biggest performance improvements can come from a dedicated, high-speed I/O channel for overflow; two 6Gb/s SATA drives in a striped configuration (RAID 0), on a dedicated controller, work well. If you must use the same drive for source, work, and target files, however, increasing the BLOCKSIZE parameter can help.
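Putting these recommendations together, the resource-related entries in a cosortrc file on a four-core machine with two dedicated temp drives might look like the sketch below. The parameter names are the ones cited in this tip; the values, and the exact entry syntax and casing, are illustrative, so verify them against the cosortrc documentation for your CoSort version:

   /* illustrative cosortrc resource settings (values are examples only)   */
   THREAD_MAX=4         /* one computational thread per physical core      */
   MEMORY_MAX=2GB       /* raise when the data set can fit in RAM          */
   WORK_AREA=/temp1     /* overflow area on its own drive and controller   */
   WORK_AREA=/temp2     /* second temp drive for multi-threaded overflow   */
   BLOCKSIZE=512K       /* larger blocks help if drives must be shared     */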

Offloading those transformations to in-memory databases or appliances is not usually a necessary, much less cost-effective, alternative; and staging data in the DB or BI layer, a la ELT, is not an efficient one.

Finally, note that there are a number of other, more esoteric memory- and behavior-related parameters in the cosortrc file you can adjust, and that you can apply these settings at the global, user, and job levels as warranted. For more information, see Section D of the Appendix chapter in your CoSort manual, and contact support@iri.com with any questions.

DW Schema Introduction
 
Data warehouse users deploy a number of special schema models that represent transformations of normalized relational database models.

Star schema is the simplest and most common database modeling structure used in data warehousing. The star schema is so named because its structure resembles a star: one or more central fact tables reference any number of surrounding dimension tables, which radiate from the facts like the points of a star. Star schemas are designed to improve ease of understanding, and retrieval performance, by minimizing the number of tables to join when materializing a transaction.

To understand why star schemas are so popular, consider an example database from a store chain, where sales-related data in a central fact table is linked by primary key “Id” columns in date, store, and product dimension tables:


http://upload.wikimedia.org/wikipedia/en/thumb/f/fe/Star-schema-example.png/300px-Star-schema-example.png

The Fact_Sales table has a three-column (compound) primary key (Date_Id, Store_Id, Product_Id) and a non-key Units_Sold attribute that can be used in calculations. This structure allows queries on the Fact_Sales table (joined to the dimension tables) to retrieve specific information, such as the number of TVs sold by brand and country in a given year.

The denormalized nature of the star schema model is effective for query performance, OLAP, and BI, since there are only a few tables involved, but larger dimension tables can take a long time to load because of de-normalization and redundancy. In addition, the denormalized or redundant data must be carefully controlled during extract, transform, load (ETL) operations, such that users should not be able to use the data until it is in a consistent state.  

Snowflake schema is similar to star schema but more complex and not subject to the above restrictions in ETL, because the dimension tables are normalized (and thus there are more tables to manage):

http://www.datawarehouse4u.info/images/snowflake_schema.jpg

More powerful computers and databases can handle this complexity, but the schema remains less popular in big data warehousing environments because look-ups are still faster across fewer tables.

Fact constellation schema is the most complex of the popular DW schemas: 

http://www.datawarehouse4u.info/images/fact_constellation_schema.jpg

For each star schema, it is possible to construct a fact constellation schema (for example, by splitting the original star schema into multiple star schemas, each of which describes facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.

The main shortcoming of the fact constellation schema is a more complicated design because many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension tables are still large.

So what do these schemas have to do with IRI Software?

  • CoSort’s Sort Control Language (SortCL) program can create and populate output file and table targets that match the layout and structure of the schema. Populating the tables can be achieved through ODBC inserts, or via pre-sorted flat files fed into a database bulk load utility; e.g. a direct path load into Oracle (see the sketch after this list).
     
  • Fast Extract (FACT) can rapidly unload very large fact (transaction) and dimension tables into flat files for archiving, migration, replication, transformation, and/or reporting purposes. SortCL can then perform SQL-equivalent queries (faster, outside the database) using sort, join, aggregation, selection, and custom layout design commands to produce business intelligence in detail and summary report formats.
     
  • FieldShield can apply common encryption functions across linked fields in one or more tables in the schema … preserving data formats, recoverability, and referential integrity at the same time.
     
  • Based on the layout information from this database, RowGen can produce and load referentially correct test data into both fact and dimension tables, pre-sorted on the primary key.
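As a concrete illustration of the first point above, a short SortCL script could pre-sort a flat-file extract of the example Fact_Sales table on its compound primary key before a bulk loader appends it in key order. This is a minimal sketch; the file names and field layout are hypothetical, keyed to the star schema example earlier in this article:

   /INFILE=fact_sales_raw.dat
   /PROCESS=RECORD
   /FIELD=(DATE_ID, TYPE=NUMERIC, POSITION=1, SEPARATOR=",")
   /FIELD=(STORE_ID, TYPE=NUMERIC, POSITION=2, SEPARATOR=",")
   /FIELD=(PRODUCT_ID, TYPE=NUMERIC, POSITION=3, SEPARATOR=",")
   /FIELD=(UNITS_SOLD, TYPE=NUMERIC, POSITION=4, SEPARATOR=",")
   /SORT
    /KEY=DATE_ID
    /KEY=STORE_ID
    /KEY=PRODUCT_ID
   /OUTFILE=fact_sales_sorted.dat    /* feed to the bulk loader; e.g. an Oracle direct path load */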

Contact support@iri.com for help.