CoSort Update Coming Soon
CoSort Version 9.5.2


IRI's flagship product for manipulating big data - CoSort - is being updated to v9.5.2. This is a minor new release that adds new features to CoSort's SortCL program since the release of v9.5.1 last year, including:

  • Fuzzy Logic Look-Up. An expansion of the set file search algorithm to find the value of an entry whose key satisfies an other-than-equal relationship to the search argument. Prior releases would find the value only when the search argument was equal to the key.
  • Slowly Changing Dimensions. An application of the fuzzy logic feature that finds the value for a search argument that is closest to a table (set file) entry. As an example, consider rows with non-uniform, ordered dates, and gas prices that have changed over time. With or without an entry for 2012-06-28, the SCD search can find the price of gas before, on, or after 2012-06-28.
  • Literal Values. Constants can now be named and specified in /FIELD statements for derived and other field functions, conditions, cross-calculations, and reports. Manipulate and display static character and numeric strings along with input data; for example, apply a fixed rate to principal values in a source table.
  • String Analysis. Like the SQL 'instr' function in Oracle, the new SortCL function looks for characters in a field and displays the offset number for the specified occurrence of the string in that field.
  • Thai Characters. SortCL can now collate (sort) and generate test data in Thai Unicode and single-byte Thai formats, and convert between these sets.
  • Random Field Data. SortCL can create random data in fields per the data type and min/max size range specified. It can also randomly select data from an existing list of real SET file values. Both approaches allow users to add random data to existing input sources as new column values.
  • More Encryption. FIPS-compliant OpenSSL, 3DES, and AES-128 libraries are now built-in, adding more protections for fields containing PII.
  • Other Protections. In addition to new randomization and encryption functions, the new CoSort (and next FieldShield) release will feature 256-bit hashing, custom data masking, and field shifting.

If you would like to test CoSort v9.5.2, click here to send your request with the details needed. 



IRI Expands in S.E. Asia
New Offices in Thailand and Indonesia

New resale and support partners will expand the reach of IRI's growing software product line in Bangkok and Jakarta:

Bara Advanced InfoTech Co., Ltd., (BARA), part of Thai conglomerate "Bara Windsor & Company", was established in 1994 to support hardware and applications software for rapidly growing companies. BARA is a top-ranked supplier for  HP (COMPAQ PC) and IBM eServer/pSeries (RS/6000) systems in Thailand and has a strong Oracle practice. BARA personnel recently received initial training on CoSort, Fast Extract (FACT), and FieldShield in the IRI Workbench environment, and can assist in proofs of concept, evaluation and licensing, and customer technical support. If you are interested in IRI Software in Thailand, click here to reach BARA.

PT. Emerio Indonesia, a subsidiary of Emerio Corp. in Singapore (an NTT Communications Company), creates and optimizes data warehouse ETL and BI solutions for companies in the financial services and telco industries. Emerio is becoming versed in the language of SortCL to provide faster, and more affordable solutions for big data staging, capturing and reporting on changed and slowly changing data, and optimizing database performance without the need for costly hardware. Click
here for Emerio's contact details in Indonesia and afield.



Hadoop Development Underway
CoSort to Run on Private Grids

Long before Hadoop became synonymous with big data, CoSort was. As one of the original 'big data' companies, IRI has been keen to distribute CoSort operations across secure, mission-critical customer grids that can handle petabytes of data. Over the years, IRI looked at Infiniband, Oracle Parallel Server, Platform Computing, Beowolf Clusters, and others. However, it was not until Hadoop emerged as a widely adoptable open source standard that IRI began to make related R&D investments.

The object of IRI's efforts in Hadoop is to implement SortCL executions in MapReduce code, breaking up huge data transformation jobs across available nodes and merging the results onto the specified target platform. CoSort is already running in a major telco grid in Asia using Hadoop, but a general-purpose release awaits additional development in both MapReduce and CoSort code, along with feedback from IRI's big data users interested in this development.

Therefore, if you have specific requirements for, or interest in, exploiting CoSort in the Hadoop environment, email
support@iri.com

 

Tech Tip - Pivot Transform
Transpose Rows to Columns in SortCL

One of the many, more advanced ways in which CoSort performs SQL and DW functions on sequential data is shown in pivoting transformations, where rows can be rotated into columns and vice versa. The following example might be used to consolidate numeric sales data.

The input file below, transactions.dat, contains three-column rows: a store code, department code, and transaction amount for that department. The object of this transformation is to create one record for each store containing a column for each store, and then total transaction amounts for each department.

11,2,41.45
11,3,100.00
11,3,54.99
11,1,12.78
54,3,201.59
11,2,124.78
11,3,69.67
41,2,150.39
41,1,2.95
41,2,146.98
54,3,53.65
41,3,24.78
41,3,39.99

In the CoSort Sort Control Language (SortCL) script below:

  • The INPUT section defines the layouts of the source file, transactions.dat
  • The INREC section reformats the input records so that the transaction amount is in the correct column for the department, and puts a zero where there is no transaction amount for a given department
  • The SORT order on StoreCode allows us to group the transaction amounts by store code
  • The OUTPUT section summarizes the transaction amounts in each department, grouped by store code, and defines the target layout.
/INFILE=transactions.dat
  /FIELD=(StoreCode,POSITION=1,SEPARATOR=',')
  /FIELD=(DeptCode,POSITION=2,SEPARATOR=',')
  /FIELD=(TransAmount,POSITION=3,SEPARATOR=',',NUMERIC)
/INREC  
  # Change the input record format so that TransAmount
  # goes into the correct column for the DeptCode, and
  # place zeros in the other columns
  /FIELD=(StoreCode,POSITION=1,SEPARATOR=',')
  /FIELD=(DeptCode1,POSITION=2,SEPARATOR=',',NUMERIC,\
     IF DeptCode == "1" THEN TransAmount ELSE 0)
  /FIELD=(DeptCode2,POSITION=3,SEPARATOR=',',NUMERIC,\
     IF DeptCode == "2" THEN TransAmount ELSE 0)
  /FIELD=(DeptCode3,POSITION=4,SEPARATOR=',',NUMERIC,\
     IF DeptCode == "3" THEN TransAmount ELSE 0)
/SORT
  /KEY=StoreCode
/OUTFILE=UniqueSums.out
  # Collapse the columns with SUM statements to get a
  # Unique StoreCode and TransAmount for each DeptCode
  /FIELD=(StoreCode,POSITION=1,SEPARATOR=',')
  /FIELD=(DC1_sum,POSITION=2,SEPARATOR=',',NUMERIC)
  /FIELD=(DC2_sum,POSITION=3,SEPARATOR=',',NUMERIC)
  /FIELD=(DC3_sum,POSITION=4,SEPARATOR=',',NUMERIC)
  /SUM DC1_sum FROM DeptCode1 BREAK StoreCode
  /SUM DC2_sum FROM DeptCode2 BREAK StoreCode
  /SUM DC3_sum FROM DeptCode3 BREAK StoreCode

The target, StoreSums.out, contains:

11,12.78,166.23,224.66
41,2.95,297.37,64.77
54,0.00,0.00,255.24

The original input format was flattened to rows based on the StoreCode field, which was also used as the 'unary change' break key when summing. For those familiar with the Microsoft SSIS platform, DeptCode is the Pivot Key (top row value), StoreCode is the Set Key (left column value), and TransAmount is the Pivot Value (table values).

Pivoting Transformations
 
Changing between column and row formats is a basic online analytical processing (OLAP) activity that optimizes data structures for queries and reports. 

A pivot transformation turns several rows of data into one, denormalizing a data set into a more compact version by rotating the input data on a column value.

For example, a normalized sales report that includes store name, date and sales amount has several rows for each store. In this case, each row for that store may have a sales amount for each month. Pivoting the data on the month column produces one row per store, with month names shown as column names, and sales amount values in the month column.

An unpivot transformation on the other hand, converts columns into rows, normalizing the dataset by expanding values in multiple columns in a single record into multiple records with the same values in a single column.

There are several ways these data transformations can be accomplished, depending on the platform or tool you use. One way is a pivot table.
 
According to Wikipedia, "a pivot table is a data summarization tool found in data visualization programs such as spreadsheets or business intelligence software. A pivot-table can sort, count, total or give the average of the data stored in one table or spreadsheet. It displays the results in a second table (called a "pivot table") showing the summarized data. Pivot tables are also useful for creating unweighted cross tabulations." The user sets up and changes (rotates) the summary's structure by dragging and dropping fields graphically. The concept is named for this pivoting of the summary table.

Another way database administrators pivot their tables is through SQL, and the code is typically complex. See this blog example on data normalization for an example. 

Pivot transformations, like many other functions that can be performed in a database or ETL tool, may also be performed externally in the file system for performance reasons, or need to be because the data sources are already in flat files.

IRI believes that data and applications should be independent, and that databases should be used for secure data storage and retrieval, not big data transformation or single-point solution control. Its customers transform big data externally, in jobs that are easier to introduce, code, and change.

IRI's CoSort product -- and its Sort Control Language (SortCL) Program in particular -- is a data transformation and reporting tool that runs outside the database layer. Pivoting and other transfoms are coded in a 4GL familiar to SQL users, or with field drag/drop operations in an Eclipse GUI supporting SortCL scripts.
 
SortCL reports can combine a wide range of sort, join, sum, filter, string manipulation, field function, PII protection, and formatting options to produce change data capture (delta) reports, display calculations from slowly changing dimensions, create custom file, XML and HTML targets, etc. SortCL jobs are portable and can handle data in files and tables at the same time.

See the tech tip in this newsletter for a pivoting example in SortCL, and watch this space for more OLAP-related solutions for big data from IRI, The CoSort Company.