Article link: https://blog.youkuaiyun.com/pandas7gardener/article/details/149893538

Web Usage Mining Cluster Simulation: K-Means vs Fuzzy C-Means

1. Introduction

In the digital age, the volume of data held in datasets is too large for people to extract the information they need without efficient data mining algorithms. Web usage mining (WUM) analyzes proxy server log repositories to understand users' browsing behavior. It aims to detect website visit patterns, which can help improve website interfaces, predict user requests, and enhance the browsing experience.

The web usage mining process has three main steps: preprocessing, pattern discovery, and pattern analysis. This article focuses on a comparative study of two clustering methods, K-Means and Fuzzy C-Means, using proxy server log datasets.

2. Preprocessing of the Proxy Server Log File

Web usage mining starts with preprocessing the proxy server log file, because the proxy server log dataset is the input to the entire process. The data must be preprocessed so that it is a suitable input for the web mining step and for the fuzzy clustering algorithms that detect behavior patterns.

The preprocessing stage can produce three forms of output records. The frequent pattern identification phase needs only the set of websites visited by a given user; the order of the visits is irrelevant.
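
As a rough illustration of that order-independent output, the sketch below turns parsed log records into a user-by-website count matrix. The record layout (a user id paired with a website) and the sample values are hypothetical; the article does not describe the actual log format.

```python
from collections import Counter

def build_visit_matrix(records):
    """Turn (user, website) records into a user-by-website count matrix.

    The order of the visits is ignored; only how often each user
    requested each website is kept.
    """
    users = sorted({user for user, _ in records})
    sites = sorted({site for _, site in records})
    counts = Counter(records)  # (user, site) -> number of requests
    matrix = [[counts[(user, site)] for site in sites] for user in users]
    return users, sites, matrix

# Hypothetical, already-parsed log records: (user id, requested website).
records = [
    ("u1", "news.example"),
    ("u1", "mail.example"),
    ("u2", "news.example"),
    ("u1", "news.example"),
]
users, sites, matrix = build_visit_matrix(records)
print(users)   # row labels
print(sites)   # column labels
print(matrix)  # one row per user, one column per website
```

Each row of the matrix is the feature vector for one user, which is the shape a clustering algorithm expects.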

2.1 Dataset Details

The authors collected proxy server logs from 01/11/2017 to 31/03/2018. The logs covered around 4139 users, and the most active user sent 102,747 requests. The authors took 5% of that maximum (5137 requests) as a threshold and found 86 users with more than 5137 requests. These users visited 10,486 websites.
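
The threshold arithmetic can be checked with a short sketch. Only the maximum of 102,747 requests comes from the article; the other request counts and the function name `heavy_users` are made up for illustration.

```python
def heavy_users(request_counts, fraction=0.05):
    """Return the cutoff and the users whose request count exceeds it."""
    cutoff = int(max(request_counts.values()) * fraction)
    selected = [user for user, n in request_counts.items() if n > cutoff]
    return cutoff, selected

# Only the maximum (102,747) is from the article; the rest is hypothetical.
request_counts = {"u1": 102_747, "u2": 5_138, "u3": 5_137, "u4": 12}
cutoff, selected = heavy_users(request_counts)
print(cutoff)    # 5% of 102,747, rounded down
print(selected)  # users strictly above the cutoff
```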

2.2 Cluster and Cluster Analysis

Cluster: In clustering, data items with similar attributes are grouped together. In the fuzzy clustering approach, each data point has a degree of membership in each cluster, and a data point that does not belong to a cluster has a membership of 0 in it. Fuzzy clustering is better suited to handling outliers and raw data that is vague, ambiguous, or imprecise.
Cluster Analysis: Cluster analysis is essential for understanding user likes and dislikes, web user behavior, and pattern detection. It provides a fuzzy cluster validity criterion based on measuring the compactness and separation of fuzzy partitions.

2.2.1 K-Means Algorithm

The K-Means cluster model is considered a baseline model. The steps of the K-Means algorithm are as follows:
1. Input:
- K: the required number of clusters.
- D: a dataset of n data points.
2. Output: A set of K clusters.
3. Method:
- Step 1: Arbitrarily select K objects as the initial cluster centers.
- Step 2: Repeat the following steps:
- Step 3: Reassign every object to a cluster.
- i. Repeat the following for each data point:
- ii. Compute the distance from the data point to each centroid, and assign the data point to the cluster whose centroid is closest.
- iii. Recompute the mean of that cluster and use it as the cluster's updated centroid.
- iv. Continue until the last data point has been processed.
- Step 4: Continue until the cluster assignments no longer change.
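
The steps above can be sketched in plain Python. This is a minimal batch variant, not the authors' code: it reassigns all points and then recomputes every centroid once per pass, rather than after each individual assignment. The `init` parameter, the toy data, and the fixed starting centroids are assumptions added so that the run is reproducible.

```python
import math
import random

def k_means(points, k, init=None, max_iter=100, seed=0):
    """Batch K-Means with Euclidean distance; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick K objects as initial centers (random unless init is given).
    centroids = [list(c) for c in (init or rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no object moved, so stop
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = k_means(points, k=2, init=[points[0], points[3]])
print(labels)
print(centroids)
```

With random initialization the result can depend on the seed, which is why K-Means is often run several times with different starting centers.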

The following mermaid flowchart shows the K-Means algorithm:

graph TD;
    A[Start] --> B[Number of Cluster K];
    B --> C[Generate Initial Centroid];
    C --> D[Calculate Distance using Euclidean Distance];
    D --> E[Clustering of the objects];
    E --> F{No object to move};
    F -- Yes --> G[End];
    F -- No --> H[Recalculate Centroid];
    H --> D;

However, the K-Means method has limitations. It works well for crisp data with clear-cut boundaries, but real clusters often have fuzzy boundaries and overlap, and natural data usually contains uncertainty, imprecision, and ambiguity.

2.2.2 Fuzzy Clustering Algorithm

Fuzzy C-Means (FCM) is a fuzzy version of the K-Means method. It allows a data point to belong to more than one cluster, each to a different degree. The FCM algorithm estimates the cluster centers and assigns each data point a degree of membership between 0 and 1 in every cluster.

The steps of the proposed FCM algorithm are as follows:
1. Step 1: Assign
- Number of clusters \(c\), with \(c \geq 2\)
- Number of patterns in the dataset \(n\)
- Fuzziness control \(m\), with \(m > 1\)
- Iteration number \(t = 0\)
- Stopping criterion \(\varepsilon = 0.001\)
- Initial centroids \(v_{i}^{(0)}\)
2. Step 2: Calculation
- For \(1 \leq k \leq n\) and \(1 \leq i \leq c\):
- Compute the distance between the kth pattern \(x_{k}\) and the ith cluster center \(v_{i}\) using one of the following measures, where \(p\) is the number of features and \(x_{k}^{l}\) is the lth feature of \(x_{k}\):
- Squared Euclidean distance: \(d^{2}(x_{k}, v_{i})=\sum_{l = 1}^{p}|x_{k}^{l}-v_{i}^{l}|^{2}\)
- Correlation-based measure: \(d_{cor}(x_{k}, v_{i})=\frac{d_{cov}(x_{k}, v_{i})}{\sqrt{d_{var}(x_{k})\times d_{var}(v_{i})}}\)
- Update the fuzzy membership degree \(\mu_{ik}^{(t)}\) using \(\mu_{ik}=\left[\sum_{j = 1}^{c}\left(\frac{d^{2}(x_{k}, v_{i})}{d^{2}(x_{k}, v_{j})}\right)^{\frac{1}{m - 1}}\right]^{-1}\)
- Compute the cluster center \(v_{i}^{(t + 1)}\) using \(v_{i}=\frac{\sum_{k = 1}^{n}\mu_{ik}^{m}x_{k}}{\sum_{k = 1}^{n}\mu_{ik}^{m}}\), where \(1\leq i\leq c\)
3. Step 3: Stopping Criterion
- If \(\max_{i}\lVert v_{i}^{(t + 1)}-v_{i}^{(t)}\rVert<\varepsilon\), then stop. Otherwise, increment the iteration number \(t=t + 1\) and go to Step 2.
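
A minimal plain-Python sketch of these steps follows, assuming the squared Euclidean distance and a stopping test on the largest change in any center coordinate. The `init` parameter, the handling of a point that coincides with a center, and the toy data are assumptions made for this illustration, not details taken from the article.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fuzzy_c_means(points, c, m=2.0, eps=1e-3, init=None, max_iter=100, seed=0):
    """Return (centers, u), where u[i][k] is the membership of point k in cluster i."""
    rng = random.Random(seed)
    centers = [list(v) for v in (init or rng.sample(points, c))]
    n, dim = len(points), len(points[0])
    for _ in range(max_iter):
        # Step 2a: update memberships from the current centers.
        u = [[0.0] * n for _ in range(c)]
        for k, x in enumerate(points):
            d2 = [sq_dist(x, v) for v in centers]
            hits = [i for i in range(c) if d2[i] == 0.0]
            if hits:  # the point sits exactly on a center
                for i in hits:
                    u[i][k] = 1.0 / len(hits)
                continue
            for i in range(c):
                u[i][k] = 1.0 / sum(
                    (d2[i] / d2[j]) ** (1.0 / (m - 1.0)) for j in range(c)
                )
        # Step 2b: recompute centers as membership-weighted means.
        new_centers = []
        for i in range(c):
            w = [u_ik ** m for u_ik in u[i]]
            new_centers.append(
                [sum(w_k * x[d] for w_k, x in zip(w, points)) / sum(w)
                 for d in range(dim)]
            )
        # Step 3: stop once no center coordinate moved by eps or more.
        shift = max(abs(a - b)
                    for v, nv in zip(centers, new_centers)
                    for a, b in zip(v, nv))
        centers = new_centers
        if shift < eps:
            break
    return centers, u

# Hypothetical 2-D data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, u = fuzzy_c_means(points, c=2, init=[points[0], points[3]])
for k in range(len(points)):
    print(points[k], [round(u[i][k], 3) for i in range(2)])
```

The memberships of each point sum to 1 across clusters, so every row printed above is a soft assignment rather than a single label.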

The following mermaid flowchart shows the Fuzzy C-Means algorithm:

graph TD;
    A[Start] --> B[Enter Web Log Data];
    B --> C[Number of Cluster K];
    C --> D[Cluster Centre is calculated];
    D --> E[Cluster membership value is initialized randomly];
    E --> F[Update Degree of Membership];
    F --> G[Calculate Objective Function];
    G --> H{Object to move};
    H -- Yes --> E;
    H -- No --> I[End];

3. Experiment and Result Analysis

The authors used the VNSGU proxy server dataset for the experiment. The details of the dataset are as follows:
| Source | Considered users | Total records | Considered websites | Number of users | Matrix | Total months | Algorithm | Unique websites | Highest request count (basis of the 5% threshold) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VNSGU web server | 86 | 3,016,859 | 10,486 | 4302 | 86 × 10,486 | 4 months | K-Means, FCM | 37,859 | 102,747 |

3.1 K-Means Results

The K-Means clustering results assign the 86 users as follows:
- 61 users (almost 71%) are assigned to Cluster 5.
- 22 users (about 26%) are assigned to Cluster 1.
- The remaining 3 users (about 3%) are assigned to Cluster 2, Cluster 3, and Cluster 4, one user each.

The following table shows the cluster that K-Means assigns to each user:
| User | K-Means | User | K-Means | User | K-Means | User | K-Means | User | K-Means | User | K-Means |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 5 | 16 | 1 | 31 | 5 | 46 | 5 | 61 | 5 | 76 | 5 |
| 2 | 5 | 17 | 1 | 32 | 5 | 47 | 5 | 62 | 5 | 77 | 5 |
| 3 | 1 | 18 | 1 | 33 | 5 | 48 | 5 | 63 | 2 | 78 | 5 |
| 4 | 1 | 19 | 1 | 34 | 5 | 49 | 5 | 64 | 5 | 79 | 5 |
| 5 | 1 | 20 | 5 | 35 | 1 | 50 | 5 | 65 | 5 | 80 | 5 |
| 6 | 5 | 21 | 1 | 36 | 5 | 51 | 5 | 66 | 5 | 81 | 5 |
| 7 | 5 | 22 | 1 | 37 | 5 | 52 | 5 | 67 | 5 | 82 | 5 |
| 8 | 5 | 23 | 1 | 38 | 5 | 53 | 5 | 68 | 5 | 83 | 5 |
| 9 | 5 | 24 | 1 | 39 | 5 | 54 | 4 | 69 | 5 | 84 | 5 |
| 10 | 1 | 25 | 1 | 40 | 5 | 55 | 5 | 70 | 5 | 85 | 5 |
| 11 | 1 | 26 | 5 | 41 | 5 | 56 | 5 | 71 | 5 | 86 | 5 |
| 12 | 1 | 27 | 1 | 42 | 3 | 57 | 1 | 72 | 5 |  |  |
| 13 | 1 | 28 | 5 | 43 | 5 | 58 | 5 | 73 | 5 |  |  |
| 14 | 1 | 29 | 5 | 44 | 5 | 59 | 1 | 74 | 5 |  |  |
| 15 | 1 | 30 | 5 | 45 | 5 | 60 | 5 | 75 | 5 |  |  |
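
The cluster shares can be re-derived from the table with a short tally. The labels below are transcribed from the table above, in user order; the variable names are only illustrative.

```python
from collections import Counter

# Cluster label per user, users 1..86, transcribed from the table above.
assignments = (
    [5, 5, 1, 1, 1, 5, 5, 5, 5, 1, 1, 1, 1, 1, 1]      # users 1-15
    + [1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 5, 1, 5, 5, 5]    # users 16-30
    + [5, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 3, 5, 5, 5]    # users 31-45
    + [5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 1, 5, 1, 5]    # users 46-60
    + [5, 5, 2] + [5] * 12                              # users 61-75
    + [5] * 11                                          # users 76-86
)

counts = Counter(assignments)
for cluster in sorted(counts):
    share = 100 * counts[cluster] / len(assignments)
    print(f"Cluster {cluster}: {counts[cluster]} users ({share:.1f}%)")
```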

K-Means is fast but provides less information. Each user belongs to exactly one cluster, so the result gives no detail about how a user's interests are spread across clusters.

3.2 Fuzzy C-Means Results

With Fuzzy C-Means, we get each user's degree of membership in every cluster. Out of 86 users:
- 27% of users are most strongly associated with Cluster 4, with an average membership of 0.67.
- 73% of users are more strongly associated with Cluster 1 and Cluster 5 than with Cluster 4. Their average membership in Cluster 1 and Cluster 5 is 0.369, against 0.229 in Cluster 4, an interest gap of 0.14.
- The 27% of users associated with Cluster 4 have an average membership of 0.17 in Cluster 1 and Cluster 5, with an interest gap of 0.47.

The following table shows each user's Fuzzy C-Means membership degree per cluster (partial data shown):
| User | Cluster1 | Cluster2 | Cluster3 | Cluster4 | Cluster5 |
| --- | --- | --- | --- | --- | --- |
| 5 | 0.061 | 0 | 0 | 0.877 | 0.061 |
| 18 | 0.066 | 0 | 0 | 0.868 | 0.066 |
| 13 | 0.091 | 0 | 0 | 0.817 | 0.091 |
| 21 | 0.094 | 0 | 0 | 0.811 | 0.094 |
| 22 | 0.094 | 0 | 0 | 0.811 | 0.094 |
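
To show how such membership rows can be read, the sketch below finds each listed user's dominant cluster and the gap to the next-strongest cluster. The values are the five rows shown above; the helper `dominant_cluster` is a hypothetical name introduced for this illustration.

```python
def dominant_cluster(row):
    """Return (1-based cluster number, gap to the runner-up membership)."""
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best + 1, row[best] - row[runner_up]

# Membership degrees for the five users shown in the table above.
memberships = {
    5: [0.061, 0.0, 0.0, 0.877, 0.061],
    18: [0.066, 0.0, 0.0, 0.868, 0.066],
    13: [0.091, 0.0, 0.0, 0.817, 0.091],
    21: [0.094, 0.0, 0.0, 0.811, 0.094],
    22: [0.094, 0.0, 0.0, 0.811, 0.094],
}

for user, row in memberships.items():
    cluster, gap = dominant_cluster(row)
    print(f"user {user}: dominant Cluster{cluster}, gap {gap:.3f}, row sum {sum(row):.3f}")
```

Taking the dominant cluster collapses the fuzzy result into a crisp label, while the gap keeps a sense of how decisive that label is.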

Fuzzy C-Means provides more detailed information about user interests, but it has limitations: the number of clusters must be specified in advance, and the algorithm is more complex and time-consuming.

In conclusion, both K-Means and Fuzzy C-Means have their own advantages and disadvantages. K-Means is fast and simple but has limitations in handling overlapping clusters and noisy data. Fuzzy C-Means offers better results for overlapping situations in proxy log files and provides a more detailed view of user interests, although it is more complex and slower.

4. Comparison of K-Means and Fuzzy C-Means

K-Means and Fuzzy C-Means differ in several respects:

4.1 Computational Efficiency

K-Means: It is known for its computational speed. The algorithm is relatively straightforward and computationally cheap, making it a good choice for large datasets where quick results are required. For example, in the experiment with the VNSGU proxy server dataset, K-Means was able to assign users to clusters quickly. However, its simplicity also means that it may not provide in-depth information about the data.
Fuzzy C-Means: This algorithm is more complex and time-consuming. Each iteration calculates distances, updates membership degrees for every point in every cluster, and recalculates the cluster centers. The fuzzy nature of the algorithm requires more computational resources, especially with a large number of data points and clusters.

4.2 Handling of Clusters

K-Means: It is suitable for datasets with well-defined, non-overlapping clusters. Each data point is assigned to exactly one cluster, which is a limitation for real-world data where clusters may overlap. For instance, some users are interested in several kinds of websites, but K-Means cannot represent this overlap.
Fuzzy C-Means: It can handle overlapping clusters effectively. Each data point is given a degree of membership in each cluster, allowing it to belong to multiple clusters to different extents. This matches the real-world scenario where users have diverse interests.

4.3 Noise and Outliers

K-Means: It is sensitive to noise and outliers. Because it uses the mean of the data points in a cluster as the cluster center, outliers can significantly shift the center and lead to inaccurate clustering results.
Fuzzy C-Means: It is more robust to noise and outliers. Outliers that receive a low degree of membership have less influence on the cluster centers. This makes it a better choice for datasets that may contain noisy or inaccurate data.
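
The difference in outlier influence can be illustrated with a one-dimensional toy example. The data and the membership values are hypothetical and chosen by hand rather than produced by an FCM run; the point is only to show how a low membership shrinks an outlier's weight in the center formula.

```python
def weighted_center(values, memberships, m=2.0):
    """FCM-style center: mean of the values weighted by membership ** m."""
    weights = [u ** m for u in memberships]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical 1-D data; 100.0 plays the role of an outlier.
values = [1.0, 2.0, 3.0, 100.0]

# K-Means style center: every member counts equally.
crisp_center = sum(values) / len(values)

# FCM style center: the outlier is assumed to have a low membership (0.1).
fuzzy_center = weighted_center(values, [0.9, 0.9, 0.9, 0.1])

print(crisp_center)  # pulled far toward the outlier
print(fuzzy_center)  # stays close to the bulk of the data
```

How low an outlier's membership actually is depends on the number of clusters and on the fuzziness control m, so the size of the effect varies from dataset to dataset.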

5. Practical Applications and Recommendations

The choice between K-Means and Fuzzy C-Means depends on the specific requirements of the application:

5.1 K-Means Applications

Large-scale Data with Well-Defined Clusters: When the dataset is large and the clusters are expected to be well separated and non-overlapping, K-Means can be a good choice. An example is customer segmentation for a large e-commerce platform, where customers can be clearly divided into groups based on their purchase behavior (e.g., high spenders, low spenders, frequent buyers, infrequent buyers).
Quick Initial Analysis: If you need a quick overview of the data and its general clusters, K-Means provides a fast result. This can be useful in the initial stages of a data analysis project.

5.2 Fuzzy C-Means Applications

Datasets with Overlapping Clusters: When the data has overlapping clusters, such as users interested in multiple types of content on a website, Fuzzy C-Means can provide a more accurate representation of the data.
Data with Noise and Outliers: In datasets where noise and outliers are present, Fuzzy C-Means can produce more reliable clustering results. An example is sensor data collection, where there may be measurement errors or interference.

6. Future Directions

The field of web usage mining and clustering techniques is constantly evolving. Here are some potential future directions:

6.1 Hybrid Approaches

Combining the advantages of K-Means and Fuzzy C-Means, or other clustering algorithms, could lead to more effective clustering methods. One option is to use K-Means for an initial rough clustering and then use Fuzzy C-Means to refine the results in areas where more detailed information is needed.
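
One possible shape for such a hybrid is sketched below, under strong simplifying assumptions: K-Means runs to convergence first, and the refinement stage is a single FCM membership update around the resulting centroids rather than a full FCM run. The function names, starting centers, and toy data are all hypothetical.

```python
import math

def nearest(p, centers):
    """Index of the center closest to point p (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))

def k_means_centers(points, centers, max_iter=100):
    """Rough stage: plain K-Means, returning the final centroids."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:  # assignments are stable
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def fcm_memberships(points, centers, m=2.0):
    """Refinement stage: one FCM membership update around fixed centers."""
    c = len(centers)
    rows = []
    for p in points:
        # Floor the squared distance to avoid dividing by zero.
        d2 = [max(math.dist(p, v) ** 2, 1e-12) for v in centers]
        rows.append([
            1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1.0)) for j in range(c))
            for i in range(c)
        ])
    return rows

# Hypothetical data: two groups plus one point lying between them.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1), (4.6, 4.5)]
centers = k_means_centers(points, [points[0], points[3]])
memberships = fcm_memberships(points, centers)
for p, row in zip(points, memberships):
    print(p, [round(u, 3) for u in row])
```

K-Means forces the in-between point into one cluster, while the membership row exposes that it is shared between both, which is the extra detail the refinement stage is meant to recover.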

6.2 Incorporating More Data Sources

Web usage mining can benefit from incorporating more data sources, such as user demographics, social media data, and device information. This additional data can provide a more comprehensive view of user behavior and lead to more accurate clustering.

6.3 Adaptive Clustering

Another direction is developing algorithms that adapt to changes in the data over time. For example, as user interests change, the clustering algorithm should be able to adjust the clusters accordingly.

In summary, both K-Means and Fuzzy C-Means are valuable clustering methods in web usage mining. Understanding their strengths and weaknesses is crucial for choosing the appropriate algorithm for a given dataset and application. By applying these techniques effectively, website publishers can better understand their users' needs and improve the user experience.