Open Source Quality

From Innovations, a website published by Ziff-Davis Enterprise from mid-2006 to mid-2009. Reprinted by permission.

The market for data quality software has gone open source, and it’s about time.

Late last month, Talend Open Data Solutions, a maker of data integration and profiling software, made its Talend Data Quality software available under the General Public License. This follows the French company’s move in June to open its Open Profiler product in a similar way. The tools can be used together to assess the quality of the information in customer databases and to correct basic name and address information against a master list maintained by the US Postal Service.

What’s more important, though, is the potential of open sourcing to bring down the costs of the complex and frustrating data cleansing process. As I noted a few weeks ago, data quality is one of the most vexing problems businesses face. Data that’s inconsistent, out of date or incorrectly formatted creates inefficiency, angry customers (have you ever gotten three direct mail pieces at the same time, each addressed a little differently?) and lost opportunity.

Solutions to data quality problems have existed for decades, but they’ve always been sold by small vendors, usually at prices starting in the six figures. Over the last few years, many of these vendors have been snapped up by bigger software firms, which then bundled data quality tools and services into giant software contracts. There aren’t a lot of vendors left that specialize in solving the quality problem specifically.

This fragmentation has frustrated a process that should be a part of every company’s IT governance practice. While each company has its own data quality issues, many are common to a large number of businesses. Talend’s approach doesn’t address the cost of accessing databases of “clean” data, but it promises to make the cleansing process itself cheaper and more automated.

The secret is the magic of open source, which enables users to easily exchange their best work with each other. For just one example of how this works, check out SugarCRM’s SugarExchange. This collection of third-party applications has been built from contributions by SugarCRM’s customers and independent developers. While some modules carry license fees, many don’t. The point is that software authors who build useful extensions to the base CRM system have the wherewithal to share them with, or sell them to, others who have similar needs. That’s difficult or impossible to do with proprietary software.

This so-called “forge” approach to development lends itself particularly well to data quality because so many issues are common to multiple companies. For example, if I come up with a clever way to compare employee records to a database of known felons, I should be able to share it and even charge for it. That isn’t possible when the market is spread across an assortment of small, closely held companies. If Talend can extend its TalendForge library to incorporate a robust collection of data quality components, it can make data cleansing practical and affordable for a much larger universe of companies.

The data quality problem is so pervasive that it demands a collaborative approach to a solution. This is a good start.