Mining the hierarchical structure of Internet review topics and achieving fine-grained classification of review texts can help alleviate users' information overload. However, existing hierarchical topic classification methods rely primarily on external corpora and human intervention. This study proposes a Modified Continuous Renormalization (MCR) procedure that acts on a keyword co-occurrence network with fractal characteristics to mine the topic hierarchy. First, the fractal characteristics of the keyword co-occurrence network of Internet review text are identified, for the first time, using a box-covering algorithm. Then, the MCR algorithm, based on the edge adjacency entropy and the box distance, is proposed to obtain the topic hierarchy in the keyword co-occurrence network. Validation on Dangdang.com book reviews shows that MCR constructs topic hierarchies with greater coherence and independence than the HLDA and Louvain algorithms. Finally, reliable review text classification is achieved using the MCR-extended bottom-level topic categories. The accuracy rate, recall rate, and F1 value of Internet review text classification obtained from the MCR-based topic hierarchy are significantly improved compared with four target text classification algorithms.
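A common way to test for the fractal characteristics mentioned above is to check whether the number of boxes N_B needed to cover a network scales as a power law in the box size l_B, that is, N_B ~ l_B^(-d_B). The minimal sketch below estimates the box dimension d_B as the negative slope of a least-squares fit in log-log space. It is an illustration only, not the study's implementation; the function name `estimate_box_dimension` and the sample values are hypothetical.

```python
import math


def estimate_box_dimension(box_sizes, box_counts):
    """Estimate d_B from (l_B, N_B) pairs via a log-log least-squares fit.

    Assumes N_B ~ l_B ** (-d_B); returns the negative of the fitted slope.
    """
    if len(box_sizes) != len(box_counts) or len(box_sizes) < 2:
        raise ValueError("need at least two matching (l_B, N_B) pairs")

    xs = [math.log(l) for l in box_sizes]
    ys = [math.log(n) for n in box_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least squares slope of log N_B against log l_B.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope


# Synthetic data following N_B = 64 * l_B ** -2 (made up for illustration).
sizes = [1, 2, 4, 8]
counts = [64, 16, 4, 1]
d_b = estimate_box_dimension(sizes, counts)
```

In practice the (l_B, N_B) pairs would come from running a box-covering algorithm at several box sizes; a network is usually regarded as fractal only if the log-log relationship is close to linear over a reasonable range of l_B.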
The box-covering method plays a fundamental role in the fractal property recognition and renormalization analysis of complex networks. This study proposes the hub-collision avoidance and leaf-node options (HALO) algorithm. In the box sampling process, a forward sampling rule (for avoiding hub collisions) and a reverse sampling rule (for preferentially selecting leaf nodes) are defined for bidirectional network traversal to reduce the randomness of sampling. In the box selection process, the larger necessary boxes are preferentially added to the solution by continuously removing small boxes. HALO is compared experimentally with the compact-box-burning (CBB) algorithm, the maximum-excluded-mass-burning (MEMB) algorithm, the overlapping-box-covering (OBCA) algorithm, and the algorithm combining a small-box-removal strategy and maximum box sampling with a sampling density of 30 (SM30). Results on nine real networks show that HALO achieves the highest performance score and obtains 11.40%, 7.67%, 2.18%, and 8.19% fewer boxes than the compared algorithms, respectively. The determinism of the algorithm is also significantly improved, and the fractal dimensions estimated by covering four standard networks are more accurate. Moreover, unlike MEMB or OBCA, HALO is not affected by the tightness of the hubs and exhibits stable performance across different networks. Finally, the time complexities of HALO and the compared algorithms are all [Formula: see text], which is reasonable and acceptable.
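To make the box-covering idea concrete, the sketch below covers a small unweighted graph with boxes in which every pair of nodes is at shortest-path distance less than l_B. It is a simplified greedy procedure in the spirit of compact-box-burning, not the HALO algorithm: seeds are chosen deterministically (smallest node id) rather than by the sampling rules described above, and all-pairs BFS is used for clarity rather than efficiency. The names `bfs_distances` and `greedy_box_cover` and the example graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path (hop) distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def greedy_box_cover(adj, l_b):
    """Cover all nodes with boxes whose pairwise distances are < l_b.

    adj maps each node to an iterable of its neighbours (undirected graph).
    Returns a list of boxes, each box being a list of nodes.
    """
    dist = {u: bfs_distances(adj, u) for u in adj}
    uncovered = set(adj)
    boxes = []
    while uncovered:
        candidates = set(uncovered)
        box = []
        while candidates:
            # Deterministic seed choice (an assumption made for this sketch).
            p = min(candidates)
            candidates.discard(p)
            box.append(p)
            # Keep only candidates close enough to every node already boxed.
            candidates = {
                q for q in candidates
                if dist[p].get(q, float("inf")) < l_b
            }
        boxes.append(box)
        uncovered.difference_update(box)
    return boxes


# Path graph 0-1-2-3-4-5 (made up for illustration).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
boxes = greedy_box_cover(path, 3)
```

Running the cover for several values of l_B yields the box counts N_B(l_B) from which a fractal dimension can be estimated. A greedy cover of this kind gives only an upper bound on the minimum number of boxes, which is why the quality of the sampling and selection rules matters.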