Term Paper on "Advanced Data Clustering Methods of Mining Web Documents"

Home > Topics > Computers & Internet My Account

Term Paper 34 pages (9410 words) Sources: 1+

[EXCERPT] . . . .

clustering in algorithms that employ abstract categories for the pattern matching and pattern recognition procedures used in data mining searches of web documents. With the rapid advances in data mining software technology now taking place website managers and search engine designers have begun to struggle to maintain efficiency in "mining" for patterns of information and user behavior. Part of the problem is the enormous amount of data being generated, making the search of web document databases in real time difficult. Real-time searching is critical for real time problem solving, high-level document searches and prevention of database security breaches.

The analysis of this problem will be followed by a detailed description of weaknesses in data mining methods, with suggestions for a reduction of preprocessing to improve performance of search engine algorithms, and recommendation of an optimum algorithm for this task.

The first investigators who gave a serious thought to the problem of algorithm speed were persons conducting researching in the area of database searches. The field is still in its infancy; most of the tools and techniques used for data mining today come from other related fields such as pattern recognition, statistics and complexity theory. Only recently have the researchers of these various fields been interacting to solve mining and timing issues.

Significance of the Study

Data mining is a knowledge discovery process that uses algorithms and advanced statistical models to analyze data in accordance with a set or sets of rules, as determined by the particular current application. Data mining methods may be classi

Continue scrolling to

download full paper ⤓

fied according to the functions they perform, the class of application they are used in, or the domain of the search (e.g., the Internet is now a common domain for data mining searches). On the whole, data mining models fall into three basic categories: classification, associations and sequencing, and data clustering. "Clustering" assembles documents into related groups, or clusters, without relying on predefined categories. It essentially provides a visual map of the documents, with links between related documents, making it easy to browse through a collection, extracting multiple documents, or one particular topic. The principal scholarly contribution of this study will be a refinement of the testing procedures for clustering algorithms, and a conclusion regarding the optimum algorithm for online real-time searches of web document databases.

The data mining process is inherently iterative: the output of one step may be sent as a feedback to a previous step, as well as to the next step in the process. These steps are categorized into data pre-processing and discovered knowledge post-processing groupings. But as we will see, there are various techniques available to perform these tasks: classification, association and clustering, the focus of the present study. Data classification and association are appropriate to many data mining projects where there are pre-defined rules or concept categories, but many abstract problems require creation of new abstract categories in order to provide the substructure for a more focused search. If the rule set is derived directly from the audit data, any slight deviations from the pattern scheme may go undetected; conversely, minor deviations from "normal behavior" can trigger false alarms. Abstract categories can mitigate this problem.

The following chart (Figure 1) presents the basic taxonomy of the alternative data mining techniques from the user viewpoint, and the purposes of the algorithm and performance analysis.

Figure 1 Data Mining Models: Categories, Clustering and Rule Based Analysis

From "Finding needles in the haystack: Mining meets the web" by Zorn, Emanoil, Marshall, and Panek. Copyright 1999. Online. Reprinted with permission.

Project Design

For this project design, the procedure calls for the algorithms to be selected on the basis of the common feature of abstract categories for clustering data, for the purpose of classifying data according to a generic category scheme.

The principal components of the research design are the following:

the operating system (Windows XP) the laboratory computer (Generic PC of 2.0 megahertz processing speed) the data structures: data clustering algorithms identifying abstract categories; data is parsed by the algorithm to yield search results

The experimental procedure: Testing of the algorithms for elapsed time

Evaluation of results of the speed trials: Time series analysis

Conclusions: Analysis of the optimum algorithm designs and recommendations for research applications

Definitions

Class description (Classification) -- summarization of a collection of data (class characterization). Class description includes summary properties such as count, sum and average as well as data dispersion such as variance and quartiles.

Association (Detection of Relations) -- association relationships or correlations among a set of items, expressed in the rule form showing frequently occurring attribute-value conditions within a given data set. Association analysis is researched with efficient algorithms, including level-wise A priori search, mining multiple-level, multi-dimensional associations, mining associations for numerical, categorical and interval data, meta-pattern directed or constraint-based mining and mining correlations.

Clustering -- identifies embedded clusters in the data, where the cluster is a collection of "similar" data objects, as expressed by distance functions. Data mining research has focused on high quality and scalable clustering methods for large databases and multi-dimensional data warehouses.

Time-series analysis -- analyses large sets of time-series data, searching for similar sequences or sub-sequences and mining sequential patterns, periodicities, trends and deviations.

Overview of the Methodology

The methodology employed in this project will be experimental analysis, with the objective of testing the feasibility of abstract category data clustering algorithms for a real world web application. In order to perform this test, a group of five linear time clustering algorithms will be applied to a sample group of online web documents, simulating the activities of a web search engine looking for similar words, phrases or sequences in a large database set of web articles, publications and records. The five techniques compared will be the K-Means, Single Pass, Fractionation, Buckshot and Suffix Tree clustering algorithms.

The procedure will be to measure the execution time of the test algorithms in clustering data sets consisting of whole documents, excerpts and key words of a fixed quantity and size. The tests will be performed on a standard desktop personal computer running the Microsoft Windows XP operating system at a processing speed of at least 1 gigahertz, so simulate the real world activities of a conventional office worker or librarian doing a document search. Times will be recorded for a series of 10 tests repeated for the same group size and numbers of files, in order to determine the optimum clustering algorithm for real time online web document searches.

Organisation of the Study

The proposed design method is to conduct a speed trial analysis of the various designs currently used in search engine algorithms. The initial evaluation will be followed up with an analysis of the weaknesses in current algorithms and some suggestions for improvement. The study will comprise five chapters: Introduction/Statement of Problem, Review of Literature, Alternative Solutions, Feasibility Tests, Evaluation and Implementation. Future testing will be performed through the implementation of the selected algorithm in a regular application of web-based document searches by a librarian's search engine.

Purpose of the Study

The purpose of this study is to conduct research that will analyze and improve the use of data clustering techniques in creating abstract categories in algorithms, allowing data analysts to conduct more efficient execution of large-scale searches. Increasing the efficiency of the search process requires a detailed knowledge of abstract categories, pattern matching techniques, and their relationship to search engine speed.

Data mining involves the use of search engine algorithms looking for hidden predictive information, patterns and correlations within large databases. The technique of data clustering divides datasets into mutually exclusive groups. The distance between groups is measured with respect to all the available variables, versus variables that are specific predictors, to produce "abstract categories" for analysis. Search engine algorithms and user audit trails are complex, leading to time-consuming quests for specific information. It is anticipated that the proposed study will identify the most efficient and effective data clustering algorithms for this purpose.

Conduct of the Project:

Ten tests, repeated for the same numbers of files and group size, are timed and recorded to find an optimal clustering algorithm for real time database document searches.

A standard desktop personal computer running OS Microsoft Windows XP at a processing speed of one gigahertz or more will simulate the real world activities of an office worker doing a document search. Databases for the time tests are in the public domain.

No additional software, equipment or travel costs are likely to arise. At present, no human resources are anticipated to be necessary to accomplish the research goals. If any interviews or surveys are conducted, appropriate human subjects protocols will be followed.

Statement of Deliverables:

The methodology employed is an empirical analysis of speed tests conducted on groups of similar algorithms. The execution time of the algorithms will be measured for clustering sets that consist of whole documents and excerpts and key words of a fixed number.

The distribution of all the activity types in the search will be determined. Common features are identified within the… READ MORE

Download Full Paper

Write a New Paper for Me

Quoted Instructions for "Advanced Data Clustering Methods of Mining Web Documents" Assignment:

“

Email *****@aol.com for the sources.

Hello there I will sent you an email with the requirements and my proposal.

How to Reference "Advanced Data Clustering Methods of Mining Web Documents" Term Paper in a Bibliography

“Advanced Data Clustering Methods of Mining Web Documents.” A1-TermPaper.com, 2005, https://www.a1-termpaper.com/topics/essay/clustering-algorithms-employ/6183188. Accessed 5 Oct 2024.

Advanced Data Clustering Methods of Mining Web Documents (2005). Retrieved from https://www.a1-termpaper.com/topics/essay/clustering-algorithms-employ/6183188

A1-TermPaper.com. (2005). Advanced Data Clustering Methods of Mining Web Documents. [online] Available at: https://www.a1-termpaper.com/topics/essay/clustering-algorithms-employ/6183188 [Accessed 5 Oct, 2024].

”Advanced Data Clustering Methods of Mining Web Documents” 2005. A1-TermPaper.com. https://www.a1-termpaper.com/topics/essay/clustering-algorithms-employ/6183188.

”Advanced Data Clustering Methods of Mining Web Documents” A1-TermPaper.com, Last modified 2024. https://www.a1-termpaper.com/topics/essay/clustering-algorithms-employ/6183188.

[1] ”Advanced Data Clustering Methods of Mining Web Documents”, A1-TermPaper.com, 2005. [Online]. Available: https://www.a1-termpaper.com/topics/essay/clustering-algorithms-employ/6183188. [Accessed: 5-Oct-2024].

1. Advanced Data Clustering Methods of Mining Web Documents [Internet]. A1-TermPaper.com. 2005 [cited 5 October 2024]. Available from: https://www.a1-termpaper.com/topics/essay/clustering-algorithms-employ/6183188

1. Advanced Data Clustering Methods of Mining Web Documents. A1-TermPaper.com. https://www.a1-termpaper.com/topics/essay/clustering-algorithms-employ/6183188. Published 2005. Accessed October 5, 2024.

Related Term Papers: