Open Source: The Next Frontier for Data Quality Management

Data Quality: A Pervasive and Critical Business Issue

Data is the fundamental building block of every business. Client information, sales information, employee information, and financial information fuel the operation of every business. In today's business environment, which enables data entry from multiple points and through myriad processes, data quality has become an increasing concern for businesses trying to succeed in an ever more competitive atmosphere.

Data quality or data integrity problems, defined here as incomplete, erroneous, or incompatible data, are part of every business's day-to-day operation. Furthermore, as new flexible data entry options become available, the opportunity for data quality issues to be introduced into enterprise data increases. Overall business strategy is also increasing the prevalence of data quality issues as mergers, acquisitions, and department consolidations become part of almost every business's growth initiatives.
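To make the three categories concrete, here is a minimal Python sketch of how a record-level check might flag incomplete, erroneous, and incompatible data. It is an illustration only, not a method described in this article: the sample records, the field names (`name`, `email`, `joined`), the simple email pattern, and the assumed date format are all hypothetical.

```python
import re
from datetime import datetime

# Hypothetical customer records; field names and values are illustrative only.
RECORDS = [
    {"id": 1, "name": "Ada Cole", "email": "ada@example.com", "joined": "2008-03-28"},
    {"id": 2, "name": "", "email": "bob@example.com", "joined": "2008-05-28"},
    {"id": 3, "name": "Cy Dunn", "email": "cy.example.com", "joined": "2008-08-15"},
    {"id": 4, "name": "Di Evans", "email": "di@example.com", "joined": "05/07/09"},
]

REQUIRED_FIELDS = ("name", "email", "joined")
# Deliberately simple pattern (an assumption), not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DATE_FORMAT = "%Y-%m-%d"  # assumed target format for the receiving system


def matches_date_format(value):
    """Return True if value parses with the assumed DATE_FORMAT."""
    try:
        datetime.strptime(value, DATE_FORMAT)
        return True
    except ValueError:
        return False


def check_record(record):
    """Return a list of issue labels found in one record."""
    issues = []
    # Incomplete: a required field is missing or blank.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            issues.append(f"incomplete:{field}")
    # Erroneous: a value is present but cannot be a valid email address.
    email = record.get("email")
    if email and not EMAIL_PATTERN.match(email):
        issues.append("erroneous:email")
    # Incompatible: a date is present but not in the expected format.
    joined = record.get("joined")
    if joined and not matches_date_format(joined):
        issues.append("incompatible:joined")
    return issues


def profile(records):
    """Map record id to its issues, skipping clean records."""
    report = {}
    for record in records:
        issues = check_record(record)
        if issues:
            report[record["id"]] = issues
    return report


print(profile(RECORDS))
```

Because a check like this is an ordinary function, the same rules could in principle be reused at each point where data enters an organization.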

Data quality issues are often latent in an enterprise until a critical business initiative is blocked because the enterprise data can't meet the needs of the business. Companies of every size in every industry are increasingly reporting issues with data quality. The Data Warehousing Institute reported that 50% of its respondents felt that company data quality is worse than the organization thinks. Furthermore, more than half of respondents indicated their organizations had suffered losses due to poor data quality.

Data encompasses all the critical decision-making variables in an organization, including financial data, employee data, client data, prospect data, and inventory data. Relying on data that is erroneous or incomplete can seriously impact the decisions an organization makes and the strategies it employs. Recent research from Aberdeen indicates that the state of a company's data quality directly impacts its growth, profitability, and ability to compete. Poor data quality obscures an organization's view, causing it to miss additional revenue opportunities, risk regulatory issues, and forfeit the intelligence gained from a clear view of business data.

As the prevalence and impact of data quality issues become more apparent, concern over these issues is reaching beyond the IT community to the C-suite. A recent study by the Financial Executives Research Foundation indicates that data quality across the enterprise was respondents' number one concern, surpassing information security and Sarbanes-Oxley. Finance professionals cited information integrity as the key issue impacting overall corporate operations and performance.

Data quality is every organization's sleeping monster. It quietly erodes profitability, impedes growth, and hinders the implementation of mission-critical business initiatives.

The Limitations of Commercial Data Quality Solutions
Once an organization recognizes its data quality issues and their operational impact, it typically evaluates commercially available solutions to address the problem since most companies lack the IT infrastructure and knowledge to address enterprise data issues. However, for most companies seeking a data quality solution, the evaluation process is a sobering one because most commercially available solutions are costly, complex, and require software licenses and term contracts, while only addressing a portion of the overall issue.

Commercially available data solutions are fundamentally flawed in their implementation model. To be most effective, data quality processes should be deployed at multiple touch points throughout an organization. Full implementations are almost impossible because they become cost-prohibitive when licenses are expanded to encompass more users and multiple systems.

Commercial solutions are also prohibitive to many organizations due to their term contract commitments, software licenses, and implementation requirements. Price tags for traditional solutions can often total in the hundreds of thousands of dollars if not over a million dollars, not including the human capital within the organization needed to manage the solution in concert with the provider. Such price tags make commercially available data solutions inaccessible to many small and mid-size enterprises that need data quality solutions.

Another drawback of traditional solutions is that they offer only cookie-cutter product approaches to data quality. Since most companies have data issues that are unique due to their specific organizational history and infrastructure, these solutions often require significant programming and custom code development, all of which requires additional testing, resources, and money and adds significantly to the complexity of implementing and servicing the solution.

Moreover, support for traditional solutions is typically limited to the providing vendor due to the proprietary software and licenses involved in the implementation of the solution. This restriction further increases the price tag of the conventional solution since support, service, and implementation can total as much as 70% of the purchase price of the solution.

Open Source: The Next Frontier for Data Quality Management
While open source has been gaining traction and attention in many business solutions, data quality solutions have remained an area where open source is not widely utilized. Open source, however, is well equipped to address the limitations of traditional software-based solutions or SaaS solutions and create industry-leading data solutions. Open source solutions are inherently better suited to address the needs of comprehensive data quality management with their flexibility, cost efficiency, customization, rapid integration, and turnkey scalability options.

A key benefit of open source data quality solutions is that they can be implemented at multiple data entry points throughout an organization because they require no license purchases. This flexibility creates a more comprehensive and longer-term solution than single-point commercial solutions.

Open source data quality solutions also provide a significant cost advantage over conventional quality solutions because they require no software license purchases or management. Software licenses can account for up to 20% of the cost of a traditional implementation, so eliminating them represents a significant cost savings to organizations. Furthermore, software licenses typically come with lengthy contract commitments attached, impacting the cost structure for an organization for a significant if not perpetual period of time.

Moreover, open source data quality software can be easily customized to address the unique data fingerprint of every organization, eliminating the need to retrofit cookie-cutter traditional solutions with code modifications and custom programming. This customization ability reduces the complexity of the solutions and offers faster implementations, simpler integrations, less testing, and more rapid results than commercial solutions.

Another benefit of open source solutions is that servicing is more flexible and cost-efficient because it isn't tied to proprietary licensing. Service can then be provided by the technology vendor, secondary vendor, or internal resources. Furthermore, the open source community can also provide support and innovation for solutions as they evolve within an enterprise.

Lastly, open source data quality solutions have the added value of being able to use the new processing systems that provide "pay as you go" (utility computing) options for turnkey scalability. This offers a further significant cost advantage over commercial solutions that require licenses tied to hardware. Data solutions are especially prone to scalability issues due to the volume of data undergoing processing. Many traditional solutions become easily stressed by these needs, which increases costs, delays results, and reduces the return on investment.

It's clear that an open source solution for data quality offers many benefits to clients over conventional solutions. Open source provides all businesses access to critical data quality solutions that can positively impact their overall profitability, growth, and competitive position. Furthermore, the existence of the open source community gives a solution's users immediate access to shared knowledge and implementation enhancements, rather than leaving them to wait months or years for another software release. Open source can offer organizations the most customer-centric data quality solution available in the marketplace today with flexibility, customization, and significant cost advantages.

Research Sources:
•  TDWI. "Taking Data Quality to the Enterprise through Data Governance 2005."
•  Aberdeen Report. "Customer Data Quality, The Roadmap to Growth and Profitability 2007."
•  Technology Issues for Financial Executives 2007 Annual Report.

More Stories By Subbu Manchiraju

Subbu Manchiraju is a vice president at Infosolve Technologies, which provides business clients with comprehensive data solutions that leverage the power of their enterprise data to achieve business objectives and create strategic opportunities, without the burdens of cumbersome licensing agreements, complex term contracts and expensive hardware requirements.


Most Recent Comments
rabk 05/07/09 05:04:00 AM EDT

i've seen that Quality management is a must, for example the other day i was looking at hi5.com and they didn't even have quality policies, what's wrong with that?

Tommy 05/06/09 05:57:45 AM EDT

I find mrmo's comment right: if you want an open source data quality tool, you need to go directly to the open source software editors. There are plenty of programs to satisfy your needs.

Just look at Pentaho or Talend to use their data quality products. They are open source and downloadable on the company’s website.

mrm0 08/15/08 05:17:57 PM EDT

Echoing the other commenter, InfoSolve does not provide open source. They provide source code for things they build on top of OSS to people who pay them. There is a distribution of source to the payer, so it's really a source code license. I think the magazine should do a little more homework before providing this type of information as it's misleading and promotes another fauxpen-source vendor.

Vivek 05/28/08 01:42:21 PM EDT

Hi,
I represent Aggregate Profiler from Arrah Technology. It is open source and has more than 6000 downloads. I would appreciate it if you could review it and give your feedback.
The future plan is to add modeling and scheduling so that it can also run in batch mode.
http://www.arrah.in

Kasper Sørensen 03/28/08 09:27:45 AM EDT

I absolutely agree with everything you're saying about the advantages of Open Source data quality but I find it less convincing when faced with the fact that Infosolvetech does not provide an Open Source licensed solution that complies with the Open Source definition! I've tried several times to find the source code for your OpenDQ product, but found that you had to be a paying customer to get it? How open is that? And how do you benefit from a non existing community?

So now my point is obvious... Find another Open Source data quality solution to gain those benefits that you speak of. Try using DataCleaner (which I will gladly admit that I represent), Aggregate Profiler or Open Data Profiler.

Respectively:
http://www.eobjects.dk/datacleaner
http://sourceforge.net/projects/dataquality/
http://sourceforge.net/projects/dataprofiler/