Detecting near-duplicate text documents with a hybrid approach

Authors:
Varol, C.; Hari, S.


Journal:
Journal of Information Science


Issue Date:
2015


Abstract:

Near-duplicate data not only increase the cost of information processing in big data but also lengthen decision time. Detecting and eliminating nearly identical information is therefore vital to improving overall business decisions. The shingling algorithm has been widely used to identify near-duplicates in large-scale text data. It is based on occurrences of contiguous subsequences of tokens (shingles) shared by two or more sets of information, such as documents. Consequently, even a slight variation among documents reduces the overall performance of the algorithm. To increase the efficiency and accuracy of the shingling algorithm, we propose a hybrid approach that embeds Jaro distance and statistical results of word usage frequency to fix the ill-defined data. On a real text dataset, the proposed hybrid approach improved the shingling algorithm’s accuracy by 27% on average and achieved above 90% common shingles.
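
The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the shingle width `w=3`, the Jaccard-style `resemblance` measure, the `normalize` helper, the `threshold=0.9` cutoff, and the toy frequency table `freq` are all assumptions chosen for illustration. The sketch shows how a single misspelled token lowers the share of common shingles, and how a Jaro-based correction against a word-frequency table might restore it.

```python
from collections import Counter


def shingles(text, w=3):
    """Return the set of contiguous w-token subsequences (shingles) of text."""
    tokens = text.lower().split()
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard overlap of the two shingle sets (an assumed similarity measure)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def normalize(text, freq, threshold=0.9):
    """Hypothetical correction step: replace each unknown token with the
    vocabulary word of highest Jaro similarity (ties broken by frequency),
    provided the similarity reaches the threshold."""
    out = []
    for tok in text.lower().split():
        if tok in freq:
            out.append(tok)
            continue
        best = max(freq, key=lambda word: (jaro(tok, word), freq[word]), default=tok)
        out.append(best if jaro(tok, best) >= threshold else tok)
    return " ".join(out)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brwon fox jumps over the lazy dog"  # one misspelled token
freq = Counter(doc_a.split())  # toy word-usage frequency table (assumption)

print(resemblance(doc_a, doc_b))                   # 0.4
print(resemblance(doc_a, normalize(doc_b, freq)))  # 1.0
```

In this toy case one typo invalidates three of the seven shingles in each document, so the overlap drops to 4 of 10 distinct shingles; after the correction step the two shingle sets coincide.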



Page:
405-414

