Are you dealing with duplicate data?
Do your records fail to match exactly?
Are the duplicates too inconsistent to be caught by an exact match?
Are you struggling to cleanse different types of duplicate data?
If you answered yes to most or all of these questions, fuzzy matching is the solution to your problem. It lets you handle these issues easily and efficiently.
What is Data Matching?
Data matching is the process of discovering records that refer to the same entity. When records come from multiple data sets and do not share a common key identifier, we can use data matching techniques to link them. The same techniques also detect duplicate records within a single dataset.
We perform the following steps:
Standardize the dataset
Pick unique and standard attributes
Break the dataset into similar-sized blocks
Match records and assign weights to the matches
Add it all up to get a total weight
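The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names ("name", "city"), the weights, the first-letter blocking rule and the 0.85 threshold are all hypothetical, and the per-field similarity comes from the standard library's difflib rather than from any particular matching tool.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical field weights; a real project would tune these.
WEIGHTS = {"name": 0.6, "city": 0.4}

def standardize(record):
    """Lower-case every field and collapse repeated whitespace."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}

def block_key(record):
    """Assumed blocking rule: group records by the first letter of the name."""
    return record["name"][:1]

def total_weight(a, b):
    """Sum each field's similarity (0..1) scaled by that field's weight."""
    return sum(
        weight * SequenceMatcher(None, a[field], b[field]).ratio()
        for field, weight in WEIGHTS.items()
    )

def match(left, right, threshold=0.85):
    """Return (left, right, score) for pairs in the same block scoring >= threshold."""
    blocks = defaultdict(list)
    for r in map(standardize, right):
        blocks[block_key(r)].append(r)
    pairs = []
    for l in map(standardize, left):
        # Only compare records that share a block, to avoid all-pairs comparison.
        for r in blocks.get(block_key(l), []):
            score = total_weight(l, r)
            if score >= threshold:
                pairs.append((l, r, round(score, 3)))
    return pairs

# Made-up sample records, for illustration only.
left = [{"name": "John  Smith", "city": "Boston"}]
right = [
    {"name": "john smith", "city": "boston"},
    {"name": "Jane Doe", "city": "Boston"},
]
print(match(left, right))
```

Blocking keeps the number of comparisons manageable on large datasets, at the cost of missing matches that fall into different blocks, so the blocking rule deserves as much care as the weights.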
What is Fuzzy matching?
Fuzzy matching allows you to identify non-exact matches in your dataset. It is the foundation of many search engine frameworks, and it helps you get relevant search results even if your query contains a typo or a different verb tense.
There are many algorithms that can be used for fuzzy searching on text, but virtually all search engine frameworks (including bleve) primarily use the Levenshtein Distance for fuzzy string matching:
Levenshtein Distance: Also known as Edit Distance, it is the minimum number of single-character transformations (deletions, insertions, or substitutions) required to transform a source string into the target one. For example, if the target term is “book” and the source is “back”, you will need to change “a” to “o” and “c” to “o”, which gives a Levenshtein Distance of 2.
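The definition above translates into a short dynamic-programming routine. The following is a minimal sketch of the textbook algorithm, not the code used by any specific search framework; the function name is our own.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("back", "book"))  # 2
```

Keeping only two rows of the table at a time means memory grows with the length of the target string rather than with the product of both lengths.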
Additionally, some frameworks also support the Damerau-Levenshtein distance:
Damerau-Levenshtein distance: It is an extension to Levenshtein Distance, allowing one extra operation: Transposition of two adjacent characters:
Ex: TSAR to STAR
Damerau-Levenshtein distance = 1 (Switching S and T positions cost only one operation)
Levenshtein distance = 2 (Replace S by T and T by S)
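The transposition rule can be added to the same dynamic-programming table. The sketch below implements the restricted variant, often called optimal string alignment, which assumes no substring is edited more than once. It is one common reading of Damerau-Levenshtein, not necessarily the one a given framework uses, and the function name is our own.

```python
def osa_distance(source: str, target: str) -> int:
    """Edit distance with insert, delete, substitute, and adjacent transposition."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of the source prefix
    for j in range(cols):
        d[0][j] = j  # insert every character of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution, or keep if equal
            )
            # Swapping two adjacent characters counts as a single operation.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("TSAR", "STAR"))  # 1
```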
How to Use Fuzzy Matching in Talend?
Step 1: Create an Excel file, “Sample Data”, with two columns, “Demo Event 1” and “Demo Event 2”.
Demo Event 1: This column contains the records on which we need to apply fuzzy logic.
Demo Event 2: This column contains the records that are compared with Demo Event 1 for a fuzzy match.
Step 2: In Talend, use the above Excel file as input to a tFileInputExcel component, and provide the same file again as input to a second component of the same type, as shown in the diagram.
Step 3: In the tFuzzyMatch component, choose the configuration shown in the diagram below.
Step 4: In the tMap, select the following columns for the output:
Demo_Events_1
MATCHING
VALUE
Step 5: Finally, add a tFileOutputExcel component to write the desired output.
In the final extracted file, the “VALUE” column shows the distance between each record and its match, and the “MATCHING” column shows the duplicate that the record was matched to.
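To see what kind of output to expect, the job can be roughly approximated outside Talend. The Python sketch below is not Talend's implementation: it assumes a Levenshtein comparison, case-insensitive matching, and hypothetical `min_distance` and `max_distance` bounds. It reuses the output column names from Step 4, and the sample event names are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook Levenshtein distance, kept here so the sketch is self-contained."""
    previous = list(range(len(b) + 1))
    for i, a_char in enumerate(a, start=1):
        current = [i]
        for j, b_char in enumerate(b, start=1):
            cost = 0 if a_char == b_char else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]

def fuzzy_match(rows, reference, min_distance=0, max_distance=2):
    """For each row, report the closest reference value within the distance bounds."""
    results = []
    for row in rows:
        best_value, best_distance = None, None
        for candidate in reference:
            # Assumption: comparison ignores letter case.
            distance = levenshtein(row.lower(), candidate.lower())
            in_bounds = min_distance <= distance <= max_distance
            if in_bounds and (best_distance is None or distance < best_distance):
                best_value, best_distance = candidate, distance
        results.append({"Demo_Events_1": row, "MATCHING": best_value, "VALUE": best_distance})
    return results

# Made-up sample values standing in for the two Excel columns.
demo_event_1 = ["Salesforce Summit", "Data Day"]
demo_event_2 = ["Salesforce Sumit", "Cloud Expo"]
for line in fuzzy_match(demo_event_1, demo_event_2):
    print(line)
```

Rows with no reference value inside the bounds come back with empty MATCHING and VALUE fields, which makes unmatched records easy to filter in a later step.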
Conclusion:
In a nutshell, Talend’s fuzzy matching helps ensure the quality of source data against a reference data source by identifying and removing duplication caused by inconsistent data. The technique is also useful for complex data matching and duplicate analysis.
About Girikon
Girikon is a reputed provider of high-quality IT services including but not limited to Salesforce consulting, Salesforce implementation and Salesforce support.
Girikon boasts a strong Data Management Practice when it comes to handling CRM data. With a strong team of Data Architects, Data Specialists, ETL Experts and Data Stewards, we have successfully walked hand-in-hand with our clients in helping them define & implement custom-fitted Data Management Strategies. We aim at CRM deployments that are high performing, scalable and compliant with the security protocols, data privacy rules and third-party standards that matter in your industry, such as the Payment Card Industry (PCI) standard and the Health Insurance Portability and Accountability Act (HIPAA).
We have extensive experience in extracting, transforming and loading (ETL) data from a wide variety of sources including legacy applications, ERP systems, CRMs & other web content, standard relational databases, NoSQL databases (MongoDB), on-premise/cloud-based applications, files (e.g. XML, Excel, CSV, flat files) and web service APIs.
Our Enterprise Data Integration skills extend to cutting-edge ETL tools including Talend, Informatica etc. and support delivery of reliable data integration solutions to our clients across the globe.
Girikon’s Data Services include:
Data Integration Services using Talend/Informatica – Girikon’s team of experts provides scalable data integration and data quality solutions for integrating, cleansing and profiling all kinds of corporate data using Talend & Informatica.
Master Data Management Services – Our MDM services include consolidation of data across various businesses in an enterprise using Talend or otherwise. We help create a single “version of the truth” for our customers.
Application Integration Services – Using Mulesoft, we specialize in providing a common set of application integration tools to build a service-oriented architecture, to connect and manage services in real-time.
Data Preparation Services – These services include manipulation of data into a form suitable for further discovery, visualization, processing and enrichment.
Data Migration from various orgs in Salesforce – We have successfully completed several enterprise-wide business consolidation projects for our customers. Along with developing the Salesforce system to meet the required business needs, we have gained thorough experience in migrating the related data to keep the business synchronized.
Data Migration from different CRMs like Sugar, MS Dynamics to Salesforce – In addition to inter-org migration of data within the Salesforce environment, we are also adept at migrating data from other CRMs like Sugar, MS Dynamics etc. to Salesforce.
Data Stewardship Services – Of late, we have seen a surge in demand for Data Stewardship services requiring resources to be responsible for the maintenance and quality of data required throughout the organization. By scaling up to serve the needs of our existing accounts in these areas, we have developed a dedicated team of Data Stewards who are ready to become custodians of your organization’s data in a way that facilitates your growth.
End-to-End ETL (Extract, Transform and Load) Services – As mentioned above, we can take up individual activities if that is the business need, but where we truly excel is end-to-end ETL processes. Using Talend, Informatica etc., we would love to help you eliminate the silos in your business, bring in data from multiple sources and load it to Salesforce for a consolidated view that supports sound, well-analyzed decision making.
How Girikon, as a Salesforce Consulting Partner Helps
Organizations looking for any kind of Data Services in relation to Salesforce can reach out to Girikon for assistance with consulting, design, execution and training in the above-mentioned areas and rest assured of quality deliverables.
ETL – Extract data from multiple, varied sources, transform it as required to meet the need and then Load data to Salesforce (Tools – Talend, Informatica etc.)
Data preparation (or data preprocessing) – We can help prepare and deliver clean, usable data for use as per business requirements
Data Stewardship – Girikon’s Data Stewards possess the required expertise & experience to be responsible custodians of your business data’s quality & maintenance
CRM Master Data Management & Data Migration – We are experts in making your data much more usable in a very cost-efficient manner. Our in-house de-dup & data merging application makes the process much simpler than it would be otherwise.
Any specific Training & Support – Girikon’s Data consultants have broad exposure to and experience of varied systems, situations & solutions, and they would love to share that knowledge with your teams to bring in various perspectives. Additionally, we can help with training on specific processes and tools.