The invention discloses an online clustering method for Chinese Web documents based on common substrings. With the sharp increase of information on the Internet, search engines have become an important tool for finding and locating information.
Web document clustering can automatically group the results returned by search engines by topic, helping users narrow the query range and quickly locate the information they need. The
online clustering of Web documents must, on the one hand, handle the non-numerical and unstructured nature of Web documents and, on the other hand, keep clustering time within the limits of online search. To meet these two requirements, the invention provides a Chinese Web document online clustering method based on common substrings, comprising the following steps: (1) preprocessing the first n query results returned by the search engine, deleting or replacing the non-Chinese characters they contain; (2) extracting the common substrings of the Web documents using GSA; (3) defining a weighting formula modeled on TF*IDF over the extracted common substrings and building a document feature vector model; (4) computing the pairwise similarity of the Web documents from the model to obtain a similarity matrix; (5) clustering the Web documents from the matrix with an improved hierarchical clustering algorithm; and (6) generating cluster descriptions and extracting labels. The method shows clear advantages in performance, cluster label generation, and clustering time.
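The six steps above can be illustrated with a minimal Python sketch. It is not the patented implementation, and every function name in it is hypothetical. Several simplifications are assumed: brute-force substring enumeration stands in for GSA, a smoothed TF*IDF weight stands in for the invention's own weighting formula (which the abstract does not give), cosine similarity is assumed as the similarity measure, plain average-link agglomerative clustering with a stopping threshold stands in for the improved hierarchical algorithm, and the label of a cluster is simply taken to be its heaviest (then longest) common substring.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Step 1: drop non-Chinese characters, keeping the Chinese runs between them."""
    return [seg for seg in re.split(r"[^\u4e00-\u9fff]+", text) if seg]


def substring_counts(segments, min_len=2, max_len=6):
    """Count every substring of length min_len..max_len inside each Chinese run."""
    counts = Counter()
    for seg in segments:
        for n in range(min_len, max_len + 1):
            for start in range(len(seg) - n + 1):
                counts[seg[start:start + n]] += 1
    return counts


def feature_vectors(docs, min_docs=2):
    """Steps 2-3: keep substrings shared by >= min_docs documents and weight them.

    Brute force replaces GSA here; the weight is an assumed smoothed TF*IDF.
    """
    per_doc = [substring_counts(preprocess(d)) for d in docs]
    df = Counter()
    for counts in per_doc:
        df.update(counts.keys())  # document frequency of each substring
    n = len(docs)
    return [
        {s: tf * math.log(1 + n / df[s]) for s, tf in counts.items() if df[s] >= min_docs}
        for counts in per_doc
    ]


def cosine(u, v):
    dot = sum(w * v[s] for s, w in u.items() if s in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Step 4: pairwise similarity (cosine is an assumption)."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def cluster(sim, threshold=0.2):
    """Step 5: average-link agglomerative clustering, stopped at a threshold."""
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, (a, b) = max(
            (link(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters


def label(members, vectors):
    """Step 6: pick the heaviest common substring, preferring longer ones on ties."""
    totals = Counter()
    for i in members:
        for s, w in vectors[i].items():
            totals[s] += w
    return max(totals, key=lambda s: (totals[s], len(s))) if totals else ""


# Hypothetical search-result snippets used only for illustration.
docs = ["苹果手机 新品发布", "苹果手机 价格", "苹果树 种植技术", "苹果树 种植方法"]
vectors = feature_vectors(docs)
clusters = cluster(similarity_matrix(vectors))
labels = {label(c, vectors): sorted(c) for c in clusters}
```

Brute-force enumeration is quadratic in the length of each Chinese run, which is tolerable for short result snippets but is exactly the cost that a suffix-array-based extraction is meant to avoid. The `threshold`, `min_len`, `max_len`, and `min_docs` values are illustrative defaults, not values taken from the invention.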