1 Introduction
With the widespread application of Web 2.0, self-media platforms such as online forums and online communities have gradually become the main form of information exchange. While users enjoy the convenience of these technologies, they also face decision-making difficulties caused by the explosive growth of review data. Topic modeling of review datasets can produce a "short description" of each document, thus making it possible to mine the hidden semantic structure of large-scale datasets. However, in the process of topic recognition and evolution, the dynamic change of the number of topics makes it difficult to quantitatively analyze the relationship between the content relevance of a document and the number of topics
[1]. In addition, current topic recognition models are mostly based on a fixed number of topics, which cannot represent the semantic relevance between topics. At the same time, the recognition results depend only on the probabilities between topics, which makes it difficult to characterize the inherent hierarchical relationship of comment events. Therefore, it is extremely urgent to mine deeper topic relationships in review data.
After years of research, topic detection and tracking
[2] has gradually formed a relatively complete set of algorithms and systems, the goal of which is to classify massive texts according to topics and track their evolution. According to the text representation model used over the corpus, current topic evolution methods can be divided into two categories. The first type is cluster evolution analysis based on vector spaces. This type of method treats high-dimensional corpus text as an unordered set of low-dimensional words, measures the similarity distance between texts, and compares the changes of topics at different times. Lu, et al.
[3] proposed a K-means clustering method (EEAM) based on the multi-vector model. This method constructs topic events by calculating the similarity between sub-topics. The topics at different moments are matched according to the similarity between the event vectors to generate a topic evolution set. Lin, et al.
[4] proposed a news review topic evolution model (WVCA) based on word vectors and clustering algorithms. This model first introduced the word vector model into text stream processing to construct word vectors in time series, and then used K-means clustering to extract topic keywords. Cigarrán, et al.
[5] proposed an unsupervised topic detection algorithm (TDFCA) based on formal concept analysis (FCA). By combining similar content in formal concepts into concept lattices, formal concepts are used as the basic carrier to construct Twitter-based terms. Guesmi, et al.
[6] proposed an event topic selection model (FCACIC) based on FCA. This method uses hierarchical clustering to detect common interest communities (CIC) in social networks, avoiding the introduction of new topics during the topic detection process. However, methods of this type rely solely on the similarity distance between texts to determine the correlation between topics and cannot operate without human involvement.
In order to cope with topic detection for massive documents in complex environments, some scholars have proposed probabilistic topic analysis. This type of method assumes that topics are smooth in the time dimension and uses the topic posterior probability of one time slice as the prior probability of the next time slice; combined with the calculation of the similarity between topics, this reduces the calculation bias caused by part-of-speech differences. For example, the probabilistic latent semantic indexing model
[7] (PLSI) and the latent Dirichlet allocation model (LDA)
[8] map the process of topic identification and evolution by establishing joint probabilities among texts, topics, and words. AlSumait, et al.
[9] added an online processing function for text on the basis of the LDA model, and proposed an online Dirichlet probability model to achieve online tracking of topics. Although the focuses of the above studies differ, a common drawback is that the identification of topics relies heavily on the number of topics used in text clustering or classification, and this number must be specified in advance or obtained iteratively according to a given threshold, which cannot meet the needs of the topic evolution process. To address this need, Heinrich
[10] proposed the infinite latent Dirichlet allocation (ILDA) model, which implements topic classification based on the time-dependent relationship of text. However, this method still suffers from "short-sightedness": While iterating toward the optimal number of topics it produces many meaningless topics, and it does not consider the weights of different topic feature words as the number of topics changes.
The approaches mentioned above have two drawbacks. First, these approaches rely on the number of topics used for text clustering; specifically, topics are recognized in a fixed way without considering the semantic changes of topic feature words under a dynamic number of topics, which fails to avoid false inheritance of topics. Second, the correlation strength of feature words under different topics in the ILDA model is weak, which makes it difficult to mine the inherent hierarchical relationships of events.
The motivation of this paper is to establish a partial order constraint relationship between topics and feature words. To achieve this goal, a model for building a topic feature lattice under a dynamic topic number (TFL_DTN) is proposed, which realizes the perception of dynamic topic changes in time series. Specifically, the TFL_DTN model first obtains the topic-feature word probability matrix and the document-topic probability matrix by modeling documents, topics, and feature words; then, the topic association matrix is established, and the feature strengths under different topics in a document are calculated according to the joint probability among them. Finally, multi-granularity topic networks are identified based on the characteristics of strongly correlated topics.
2 Related Work
2.1 The Theory of ILDA
LDA is an unsupervised probabilistic model based on probabilistic latent semantic analysis (PLSA), which can implement implicit topic mining of documents
[11]. The LDA model is a three-layer Bayesian network, in which a document can be viewed as a discrete mixture of topics and each topic converges with certain probabilities to a finite mixture of topic feature words, as shown in
Figure 1. However, the hyper parameters of this model need to be set in advance, and after many simulations the number of topics, which is set manually, is found to be related to the granularity of text division. In extreme cases, an improper number of topics will either merge too many divided text topics or generate empty topics that provide no valid topic description information, which cannot meet the actual needs of topic division. The ILDA model
[12] instead exploits the time-dependent relationship of the text to realize topic classification under a dynamic number of topics. The model structure is shown in
Figure 2.
Figure 1 LDA model structure
Figure 2 ILDA model structure
There are two main differences between the two models in
Figures 1 and
2. First, on the basis of LDA, ILDA changes the topic number from a fixed value to a dynamic variable that can be selected arbitrarily within a given interval. Second, the document-topic distribution matrix in LDA is determined by a Dirichlet distribution parameterized by its hyper parameter, whereas in ILDA it is determined by the joint Dirichlet allocation process (DAP) and no longer depends on the polynomial distribution of the hyper parameter
[13]. DAP is a prior distribution based on random probability, which can be obtained from a polynomial mixture that obeys the Griffiths-Engen-McCloskey (GEM) random measure distribution. The detailed calculation process can be found in
[14]. The calculation of the base distribution O is shown in Equation (1)
[15]. The advantage of DAP is that its input is not a fixed number of topics but a discrete variable that changes dynamically. ILDA is also a three-layer Bayesian network. By abstracting a document into a polynomial distribution over a variable number of topics, and abstracting a topic into a polynomial distribution over multiple feature words, it implements joint modeling of documents, feature words, and topics. At the same time, the number of topics depends on the random prior distribution of the hybrid model. It no longer requires that the topic priors of the document obey the Dirichlet distribution, thereby reducing the sensitivity of the topic model to the number of topics and improving the ability to model large corpora.
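As a small illustration of the GEM random measure mentioned above, the following sketch draws a truncated set of mixture weights via stick-breaking, so that the effective number of topics is not fixed in advance. The function name, the concentration value, and the truncation rule are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sample_gem_weights(gamma, tol=1e-6, max_atoms=1000):
    """Draw mixture weights from a GEM(gamma) distribution via stick-breaking:
    beta_k ~ Beta(1, gamma), pi_k = beta_k * prod_{j<k}(1 - beta_j)."""
    weights = []
    remaining = 1.0
    for _ in range(max_atoms):
        b = np.random.beta(1.0, gamma)
        weights.append(remaining * b)
        remaining *= (1.0 - b)
        if remaining < tol:  # truncate once the leftover stick is negligible
            break
    return np.array(weights)

# Example: an (effectively unbounded) prior over topic proportions.
pi = sample_gem_weights(gamma=2.0)
print(len(pi), pi[:5], pi.sum())
```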
2.2 The Theory of Formal Concept Analysis
FCA is a formal method that takes the formal context as its domain, which focuses on describing the hierarchical relationship between concepts
[16]. This theory takes the partial order relationship between formal concepts as the core, and realizes the semi-automatic identification of multi-level ordered concept nodes by establishing the mapping relationship between description objects and attributes
[17]. From the perspective of semantic relationship mining, the concept lattice construction process described by FCA theory can be regarded as the process of hierarchical relationship mining between topic nodes. Meanwhile, the association relationship between the topic concepts is obtained to enhance the semantic relationship between the feature words and the topic.
The mathematical foundation of FCA theory is lattice theory and order theory. The modeling process can be described as follows: First, based on the binary membership between objects and attributes, a formal context is established as a triple (objects, attributes, relationships). Afterwards, formal concepts that satisfy the partial order relationship are derived from the formal context. Finally, a formal concept lattice is established according to whether an order relationship exists between the concepts. In the above process, concept nodes at different levels can reflect different generalization and instantiation relationships between objects, which provides new ideas for obtaining the semantic correlation between topics and feature words.
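A minimal sketch of this modeling process on a toy topic-by-feature-word context is shown below. The object and attribute names are invented for illustration, and the brute-force enumeration is only meant to show how formal concepts and their partial order arise from a binary context.

```python
from itertools import combinations

# Toy formal context: objects (topics) x attributes (feature words).
context = {
    "topic_a": {"brake", "sideslip", "warning"},
    "topic_b": {"brake", "stability"},
    "topic_c": {"stability", "warning"},
}
attributes = set().union(*context.values())

def common_attributes(objs):
    """Derivation A -> A': attributes shared by every object in A."""
    objs = list(objs)
    if not objs:
        return set(attributes)
    return set.intersection(*(context[o] for o in objs))

def common_objects(attrs):
    """Derivation B -> B': objects possessing every attribute in B."""
    return {o for o, a in context.items() if attrs <= a}

# Enumerate all formal concepts (A, B) with A' = B and B' = A.
concepts = set()
objects = list(context)
for r in range(len(objects) + 1):
    for subset in combinations(objects, r):
        intent = common_attributes(subset)
        extent = common_objects(intent)
        concepts.add((frozenset(extent), frozenset(intent)))

# Concepts closer to the top have larger extents and more general intents.
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```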
3 Construction of TFL_DTN
Although the ILDA model can realize online topic identification under dynamic topic numbers, it only determines the topic correlation degree through the probability dependence relationship between topics. Besides, it does not take into account the change in the weight of feature words that may be caused by changes in the topic number. At the same time, the model cannot effectively obtain the hidden hierarchical relationships between topics, and lacks the semantic modeling ability of multi-granularity knowledge. Therefore, this paper makes use of the good dynamic topic modeling ability of the ILDA model, by introducing feature word weight parameters into the topic model, and combining the formal concept analysis method to establish a topic recognition model TFL_DTN. The model first utilizes the ILDA model to simulate the dynamic topic generation process. Secondly, the strength of the connection between the topic and its feature words is determined to establish the topic formal context based on the joint probability. Finally, the concept features are used as a guide to construct the topic feature lattice to identify a multi-granular topic network including a document library, a topic array, and a feature word set, so as to realize the conceptual visual modeling of multi-layer network topics.
3.1 Model Construction based on TFL_DTN
The topic modeling of TFL_DTN can be divided into two sub-models: The self-adaptive topic analysis model (STAM) and the topic feature lattice construction model (TFLCM). First, the STAM model assumes that there are probability dependences among documents, topics, and feature words. Each document converges to topics with certain probabilities, and each topic emits feature words with certain probabilities, thereby forming a three-layer generative probability distribution. Here, a document is a polynomial distribution over topics that obeys a Dirichlet process, and a topic is likewise a polynomial distribution over feature words that obeys a Dirichlet distribution, shared by the document set with different mixed topic proportions and feature word weights. For convenience of explanation, the meanings of the variables and parameters in the model are shown in Table 1. The topic analysis process of the STAM model is as follows. First, Gibbs sampling is used to obtain the dynamic optimal number of topics, and the document-topic probability matrix and the topic-feature word probability matrix are established to extract topics and feature words respectively. Then, candidate feature words with the highest word frequencies are selected from the document-feature word matrix, on the basis of which feature words with higher weights are extracted. Finally, the above steps are iterated to obtain the topics and their feature words. The STAM model reconstructs the probabilistic dependency relationship between topics and feature words on the basis of the ILDA model. In essence, the model does not change the generation process of documents, topics, and feature words, and still maps topics and feature words into the same semantic space through the probability selection model. Therefore, STAM can still be regarded as a three-layer Bayes network. The functional dependence of the variables and distribution matrices in the model is shown in Figure 3.
Table 1 Parameter comparison in TFL_DTN model |
Symbol | Implication | | Symbol | Implication |
| Number of corpus documents | | | Topic-feature word probability matrix |
| Number of candidate feature words | | | Document-topic probability matrix |
| Dynamic topic array | | | Feature word weight matrix |
| Hyper parameters of document-topic probability matrix | | | Dirichlet distribution |
| Hyper parameters of topic-feature word probability matrix | | | Topic collection |
| Hyper parameters of random parameter probability distribution | | | Document set in the original corpus |
| Joint Dirichlet-Craykey distribution polynomial | | | Document-feature word matrix |
| Topic variable | | | Feature word set |
| Feature word variables | | | Topic association matrix |
| Weight parameter of feature words | | | Formal context association matrix |
Figure 3 STAM model structure
The TFLCM model assumes that the probability value of a (document, topic) pair has a positive correlation with the correlation strength of the corresponding (topic, feature word) pair: The greater the probability that a document selects a topic and the greater the probability that the topic selects a feature word, the stronger the association between them. By setting a threshold, the strongly related topic features are filtered out and mapped into the association matrix of a formal context, from which the topic feature lattice is finally generated. The generation process of the TFLCM model is as follows. First, the association probability with the highest value is extracted from the document-feature word probability matrix, and the topic association matrix is obtained by calculating the association strength of feature words under different topics in each document. Afterwards, the strongly correlated feature words are selected, and the association matrix of the topic formal context is generated. Finally, the generated topic feature lattice is reduced through formal concept analysis. The transformation relationships among the matrix variables in the model are shown in Figure 4. Based on the above analysis, the relevant definitions are given as follows.
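The thresholding step just described can be sketched as follows. The way the joint score is formed here (maximum document support for a topic times the topic's word probability) and the threshold value are illustrative assumptions, not the paper's exact constraints.

```python
import numpy as np

def build_formal_context(doc_topic, topic_word, threshold=0.05):
    """Sketch of the TFLCM filtering step: score each (topic, word) pair by
    max_d p(topic|doc_d) * p(word|topic), then keep the strongly associated
    pairs as a binary topic-by-word formal context matrix."""
    # doc_topic: (D, K) document-topic probabilities; topic_word: (K, V).
    topic_strength = doc_topic.max(axis=0)        # (K,) strongest document support per topic
    joint = topic_strength[:, None] * topic_word  # (K, V) joint association scores
    return (joint >= threshold).astype(int)       # binary formal context

# Toy usage with random stochastic matrices.
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(4), size=10)   # 10 documents, 4 topics
phi = rng.dirichlet(np.ones(12), size=4)     # 4 topics, 12 feature words
context = build_formal_context(theta, phi)
print(context.shape, int(context.sum()), "strong topic-word associations")
```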
Figure 4 Variable conversion relationships in TFLCM
Definition 1 (Document-Feature Word Matrix) For any document set containing documents and feature words , the frequency vector of the feature word sequence contained in can be computed respectively, then the document-feature word matrix for can be represented as , where .
Definition 2 (Document-Topic Probability Matrix) For any document , , if topic probability vectors about topic is generated, the sampling probability of topic named as can be obtained on the basis of the document-topic probability matrix of .
Definition 3 (Topic-Feature Word Probability Matrix) For any topic , , if the feature word probability vectors about the feature word is generated, the sampling probability of feature word in the topic can be obtained on the basis of the topic-feature word probability matrix .
Definition 4 (Feature Matrix of Feature Words) Let the dependence of the feature word's probability on the topic under the number of topics be , then the weight matrix of the feature word is , where , .
Definition 5 (Topic Association Matrix) Let the association set between topic and feature words satisfy the following constraints, then it is called the topic association matrix under the topic . In particular, if , where , is called a strong association matrix (denoted as ) of , recorded as the feature set of all topic associations that satisfy the constraints in the topic set. Constraint 1. , . Constraint 2. For any , .
Definition 6 (Topic Formal Context) Let the topic formal context be , where represents the topic set and represents the feature word set. represents the mapping relationship between the topic and the feature set on condition that , .
Definition 7 (Topic Feature Lattice) Let the topic formal context be , when and , for any two-tuple satisfying and , can be called a set of formal concepts on condition that when or , there is a partial ordering relationship that makes be true, where * operation is defined as Equations (1) and (2). The partial order relationship set of all formal concepts in topic formal context constitutes the topic feature lattice denoted as .
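Since the paper's Equations (1) and (2) for the * operation are not reproduced in this version of the text, the following gives the generic textbook form of the FCA derivation operators for a context (G, M, I); it is assumed to be what the * operation expresses, and is shown as a reference rather than the authors' exact notation.

```latex
% Generic FCA derivation ("*") operators for a formal context (G, M, I);
% assumed form of the paper's Equations (1)-(2).
A^{*} = \{\, m \in M \mid \forall g \in A : (g, m) \in I \,\}, \qquad A \subseteq G
B^{*} = \{\, g \in G \mid \forall m \in B : (g, m) \in I \,\}, \qquad B \subseteq M
% (A, B) is a formal concept iff A^{*} = B and B^{*} = A, and concepts are ordered by
% (A_1, B_1) \le (A_2, B_2) \iff A_1 \subseteq A_2 .
```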
To sum up, the topic in the TFL_DTN model is a latent variable that depends on a mixture of document-topic polynomials, and the feature words depend on observable variables of the multimodal mixture between (topic, feature word) and (feature word, feature word weight). The core idea of the model is as follows. First, potential semantic associations among variables are established through the probability dependences of documents, topics, feature words, and feature word weights, while the Dirichlet stochastic process is taken as the prior distribution of the Bayes network for Gibbs sampling. Then, the sampling algorithm obtains the dynamic number of topics and establishes the document-topic probability matrix, the topic-feature word probability matrix, and the feature word weight matrix. Finally, the TFL_DTN model calculates the topic association matrix, filters out strongly related topic features, and maps them into the formal context association matrix. On this basis, a binary partial order relationship between topics and feature words is established to generate the topic feature lattice. The overall structure of the TFL_DTN model is shown in Figure 5.
Figure 5 Overall structure of TFL_DTN model
3.2 Model Reasoning and Parameter Iteration
Since the derivation and parameter estimation of the variables and distribution matrices in the TFL_DTN model are mainly handled by the STAM model, while the TFLCM model mainly performs secondary filtering and correlation analysis of topic feature words, this section mainly discusses parameter estimation for the hidden variables and the probability matrices; the matrix relationship transformation of the TFLCM model is given as the algorithm description in Subsection 3.3.
3.2.1 Model Reasoning
The STAM model first introduces hyper parameters for the topic probability distributions that represent mixed documents, together with a parameter representing the probability distribution of feature words for mixed topics. Afterwards, the topic of a word is drawn according to the topic probability distribution, and the feature words of the topic are generated on the basis of the feature word probability distribution. In the above process, since the hyper parameters in the STAM model undergo multiple iterations, their initial values have little effect on the calculation of the model, and the prior can be calculated from the GEM polynomial distribution. Therefore, to solve the joint probability distribution of the model, the posterior conditional probability of the variable w must be obtained first; it is then used as the prior conditional probability of the probability matrix to calculate the topic polynomial distribution. Finally, the Gibbs sampling algorithm is used for approximate estimation, and steady-state distributions of the probability matrices are obtained. For convenience of explanation, the meanings of the variables during parameter iteration are shown in Table 2.
Table 2 Parameter description in STAM model |
Symbol | Implication |
| The number of total documents in the corpus |
| The number of topics in training set |
| The number of words in training set |
| The number of words assigned to topic in document |
| The number of total assigned topics in document |
| The number of words that feature word is assigned to topic |
| The number of total words assigned to topic |
| Mixed hyper parameter distribution |
| Probability estimation of topic in document |
| Probability estimation of feature word under topic |
| Word frequency in document with conditional probability of topic |
| Word frequencies of all assigned topics in document |
| The word frequency of conditional probability of feature word in document |
| Word frequencies of all feature words in topic |
The joint probability of all observable and hidden variables in the model with the hyper parameters is shown in Equation (4).
By integrating out the two distribution matrices in the above formula, the probability dependence between the variables can be further derived, as shown in Equation (5).
The above formula can be further expressed as shown in Equation (6).
where the three factors represent, respectively, the probability that the hyper parameter generates feature words according to the word probabilities under each topic, the probability that the feature word weight matrix depends on the feature word distribution, and the prior distribution of the Bayes network that depends on the Dirichlet random process. Equation (6) can be further expressed as Equation (7).
From this, the posterior probability of the document library can be obtained, as shown in Equation (8).
From the above formula, the Gibbs sampling formula can be further obtained as shown in Equation (9).
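Because Equations (4)-(9) are not reproduced in this version of the text, the following gives the generic collapsed Gibbs sampling update for an LDA-style model using the count notation described in Table 2; the paper's exact Equation (9) may differ, for example through the DAP prior and the feature word weights.

```latex
% Generic collapsed Gibbs update (assumed form; not the paper's exact Equation (9)):
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\;
  \frac{n_{d,k}^{\neg i} + \alpha}{\sum_{k'} n_{d,k'}^{\neg i} + K\alpha}
  \cdot
  \frac{n_{k,w_i}^{\neg i} + \beta}{\sum_{w} n_{k,w}^{\neg i} + V\beta}
```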
3.2.2 Parameter Estimation
The STAM model first assigns random topics to the candidate feature words, and then iteratively calculates the probability distribution of feature words until the probabilities are stable (Equation (9)). After that, topics are extracted from the document-topic matrix (Equation (10)), and feature words are extracted from the topic-feature word matrix with the probabilities given by Equation (11).
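A minimal sketch of this estimation step is given below: once sampling stabilizes, the document-topic and topic-feature word matrices are read off the count tables. This is the standard LDA-style estimator, used here as an illustrative stand-in for Equations (10)-(11) rather than the authors' exact formulas.

```python
import numpy as np

def estimate_matrices(ndk, nkw, alpha, beta):
    """Estimate the document-topic matrix (theta) and the topic-feature-word
    matrix (phi) from Gibbs sampling count tables.
    ndk: (D, K) words in document d assigned to topic k
    nkw: (K, V) times word w is assigned to topic k"""
    K = ndk.shape[1]
    V = nkw.shape[1]
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)
    return theta, phi

# Toy usage with random count tables.
rng = np.random.default_rng(1)
theta, phi = estimate_matrices(rng.integers(0, 20, (5, 3)),
                               rng.integers(0, 50, (3, 8)),
                               alpha=0.1, beta=0.1)
print(theta.sum(axis=1), phi.sum(axis=1))  # each row sums to 1
```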
3.3 Algorithm Description
According to the description mentioned above, the parameter iterative process of the STAM model, as well as the matrix relationship conversion process of the TFLCM model can be described in Algorithm 1.
Table Algorithm 1 Topic feature lattice construction algorithm |
Input: , , , Document set after initial tokenization , Number of initial topics , Initial topic weight , Iteration threshold , Mapping function Output: Topic feature lattice , matrix , , , Number of topics , Topic association matrix , Topic set |
Step 1: For each |
Step 2: For each topic |
Step 3: |
Step 4: , |
Step 5: |
Step 6: end for |
Step 7: For each in |
Step 8: |
Step 9: |
Step 10: End for |
Step 11: For all documents |
Step 12: for all words in |
Step 13: Sample |
Step 14: Get |
Step 15: Create a new topic in |
Step 16: Sample |
Step 17: end for |
Step 18: if |
Step 19: Re-sampling |
Step 20: end if |
Step 21: End for |
Step 22: For each word in |
Step 23: Calculate |
Step 24: Generate |
Step 25: Choose a from where |
Step 26: Choose a from where |
Step 27: Choose a from where |
Step 28: End for |
Step 29: Get , , |
Step 30: For each in |
Step 31: |
Step 32: |
Step 33: . |
Step 34: End for |
Step 35: Get |
The proposed TFL_DTN algorithm can be divided into two sub-models: STAM and TFLCM. STAM starts by initializing the number of topics and generating the probability matrices on the basis of the Dirichlet distribution, as shown in steps 1 to 6. The algorithm then obtains the document-feature word matrix by calculating the frequency vector of each feature word, as shown in steps 7 to 10. The model iteratively samples the feature words and calculates the feature word weight matrix under different topic numbers to obtain the topic-feature word probability matrix and the document-topic probability matrix, as shown in steps 11 to 21. Consequently, topics and weighted feature words are extracted on the basis of the feature words with higher weights, as shown in steps 22 to 29. To build the topic feature lattice, TFLCM calculates the topic association matrix by extracting the association probability with the highest value from the document-feature word probability matrix, and generates the association matrix of the topic formal context, as shown in steps 30 to 35.
4 Results and Discussions
4.1 Preprocessing
We randomly selected 1,583,275 online reviews of 20 automobile brand forums from the two websites Auto Home and Netease Auto, posted from August 1, 2019 to September 20, 2019. First, the initial documents are segmented, and the standard document corpus is obtained by removing stop words, special symbols, and useless tags. Then, the text is converted into a set of review phrases, and a document-word matrix is established. Afterwards, the TF-IDF vectors are calculated to obtain the attribute feature words of the review data.
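A minimal sketch of this preprocessing pipeline is shown below, using scikit-learn's TF-IDF vectorizer as an illustrative stand-in. The documents and stop-word list here are placeholders; the actual corpus is tokenized Chinese review text with its own segmenter and stop-word list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative stand-in documents (already tokenized into space-separated phrases).
docs = [
    "brake sideslip warning stability",
    "fuel consumption price high speed",
    "acceleration torque engine transmission",
]
stop_words = ["the", "and"]  # placeholder; the paper uses its own stop-word list

vectorizer = TfidfVectorizer(stop_words=stop_words)
doc_word = vectorizer.fit_transform(docs)   # document-word TF-IDF matrix
print(doc_word.shape)
print(vectorizer.get_feature_names_out()[:5])
```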
4.2 Results Analysis
4.2.1 Comparison of the Optimal Number of Topics
In order to verify the rationality of the dynamic number of topics in the STAM model, the three hyper parameters are set to 0.1, 0.1, and 0.2, the initial topic weight is set to 0.5, and the iteration threshold is set to 0.2. The number of topics in the Baseline method (the LDA model) is set to 40, a value fixed manually in advance. The STAM model only specifies the number of algorithm iterations (200 and 280 respectively), and determines the number of topics of the documents by week.
It can be seen in
Figure 6 that although the content of the events described in the corpus is relatively fixed, the number of topics in different periods is dynamically changed, which reflects the correlation between the evolution of topics and the number of topics. In addition, the real data is summarized by the method of manual annotation, and the number of topics varies in the interval
[15, 60], which is consistent with the experimental results of the STAM model. At the same time, the number of topics is not positively correlated with the document size, but is instead related to the degree of clustering of the actual topics. For example, during the 200 iterations of the STAM model, the document set of the second week contains 847 texts, while the document set of Week 3 is composed of 561 texts; nevertheless, the number of topics of the former is only 32 while that of the latter is 49.
Figure 6 Dynamic curve for topic number
In addition, in order to test the topic prediction and text representation capabilities of the STAM model, the perplexity of the above models on the corpus documents is calculated. The smaller the perplexity value is, the stronger the topic prediction capability for the documents. The calculation of perplexity is shown in Equation (12), and the experimental results are shown in Figure 7.
Figure 7 Perplexity curve
Figure 7 shows that the perplexity curve of the STAM model is lower than that of the Baseline method overall, and the perplexity on the dataset gradually decreases as the number of topics increases. Moreover, when the number of topics is 70, the degree of change is small, which indicates that the topic distribution under this topic number tends to be stable and the Baseline model achieves its optimal performance, while the STAM model achieves its best performance when the number of topics is 62, indicating that it requires relatively fewer topics. In that case, the ability to capture the correlation between topics under a dynamic number of topics is stronger, which reduces the model's dependence on the number of topics and improves the data representation ability for small-sample datasets.
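Since Equation (12) is not reproduced in this version of the text, the following sketch computes the standard held-out perplexity of a topic model from the document-topic and topic-word matrices; the paper's exact expression may differ slightly.

```python
import numpy as np

def perplexity(doc_word_counts, theta, phi):
    """Standard topic-model perplexity: exp(-log-likelihood / #tokens),
    where p(w|d) = sum_k theta[d,k] * phi[k,w]."""
    p_w_given_d = theta @ phi                              # (D, V) word probabilities per document
    log_lik = np.sum(doc_word_counts * np.log(p_w_given_d + 1e-12))
    n_tokens = doc_word_counts.sum()
    return np.exp(-log_lik / n_tokens)

# Toy usage with random stochastic matrices and counts.
rng = np.random.default_rng(2)
theta = rng.dirichlet(np.ones(3), size=4)   # 4 documents, 3 topics
phi = rng.dirichlet(np.ones(6), size=3)     # 3 topics, 6 words
counts = rng.integers(0, 5, (4, 6))
print(perplexity(counts, theta, phi))
```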
4.2.2 The Construction of Topic Feature Lattice
When the model's iterative probability threshold is set to 0.01, the document-feature word matrix can be acquired from the corpus. At the same time, when the STAM model reaches a relatively stable state, both the document-topic probability matrix and the topic-feature word probability matrix are output. The top 10 feature words with the highest probabilities are extracted, and their feature word weights are calculated separately. Due to the large number of topics, Table 3 lists only six relatively concentrated topics.
Table 3 Results of topic feature words (Partial) |
Topic name | Topic feature words with probabilities (descending order) |
Topic 22 | brake 0.0923 / sideslip 0.0776 / blind zone 0.0681 / resonance 0.0635 / vision 0.0489 / vehicle warning 0.0437 / stability 0.0332 / weight 0.0332 / vehicle stall 0.0274 / loose parts 0.0165 |
Topic 8 | fuel consumption 0.0851 / cost performance 0.0739 / price 0.0722 / comprehensive performance 0.0696 / per hundred kilometers 0.0696 / high speed 0.0584 / working condition 0.0492 / wind resistance 0.0477 / auto parts zero ratio 0.0368 / idle speed 0.0295 |
Topic 34 | acceleration 0.0654 / maximum torque 0.0554 / power 0.0512 / vehicle climbing 0.0477 / idle speed 0.0461 / engine 0.0313 / turbine 0.0313 / transmission 0.0296 / performance 0.0296 / new energy 0.0212 |
Topic 68 | peculiar smell 0.1284 / ride space 0.0937 / suspension system 0.0735 / NVH 0.0667 / Soundproof 0.0545 / vehicle seat 0.0479 / tire noise 0.0448 / assisted driving 0.0379 / seat ventilation 0.0345 / human-computer interaction 0.0307 |
Topic 71 | maintenance 0.0762 / after-sales service 0.0754 / 4S 0.0694 / engine oil 0.0516 / vehicle failure rate 0.0507 / vehicle inspection 0.0472 / vehicle paint 0.0367 / tire 0.0286 / accessories 0.0286 / working hours 0.0104 |
Topic 79 | vehicle stalled 0.1374 / jitter 0.1238 / steering 0.0863 / clutch 0.0794 / automatic 0.0794 / exhaust 0.0634 / vehicle 0.0432 / gear shift 0.0415 / brake 0.0364 |
Based on the identification results of the topic feature words in Table 3, the content of the topic sets is analyzed manually and summarized as the following comment topics: Topic 1 (Topic 22) is security evaluation; Topic 2 (Topic 8) is economy evaluation; Topic 3 (Topic 34) is dynamic performance evaluation; Topic 4 (Topic 68) is comfort evaluation; Topic 5 (Topic 71) is service evaluation; Topic 6 (Topic 79) is manipulative evaluation. The top 10 associated topics of the reviews are listed in Table 4, in which the main security-related topics are No. 13 and No. 22. Their feature words include braking, sideslip, blind area, early warning, and so on. These words are highly associated with vehicle safety, which is strongly aligned with the classification results of manual annotation. In addition, according to Algorithm 1, the document set is mined for strongly correlated topic features, and the relational matrix of the formal context is established to construct the topic feature lattice. The corresponding part of the Hasse structure of the topic feature lattice is shown in Figure 8, which shows that the closer a concept is to the top-level root node, the more generalized its topic feature words are, such as vehicle length, wheelbase, and weight; terms lower in the lattice are usually more specialized, such as the acceleration, torque, vehicle power, and vehicle hill climbing associated with node Topic 34. The results show that the topic feature lattice based on the TFLCM model can intuitively reveal the hierarchical relationships of different topic feature words, with a good modeling ability for obtaining the generalization and semantic relationships of topic words.
Table 4 Strongly related topic features (Partial) |
Topic category | Strongly related topics (in descending order) |
Security evaluation | Topic 22 / Topic 13 / Topic 46 / Topic 7 / Topic 21 / Topic 88 / Topic 74 / Topic 62 / Topic 107 / Topic 95 |
Economy evaluation | Topic 8 / Topic 11 / Topic 33 / Topic 40 / Topic 56 / Topic 75 / Topic 99 / Topic 7 / Topic 61 / Topic 115 |
Dynamic performance evaluation | Topic 34 / Topic 124 / Topic 81 / Topic 73 / Topic 84 / Topic 113 / Topic 18 / Topic 22 / Topic 51 / Topic 92 |
Comfort evaluation | Topic 68 / Topic 16 / Topic 53 / Topic 63 / Topic 77 / Topic 127 / Topic 137 / Topic 12 / Topic 64 / Topic 76 |
Service evaluation | Topic 71 / Topic 5 / Topic 107 / Topic 143 / Topic 19 / Topic 67 / Topic 55 / Topic 112 / Topic 17 / Topic 35 |
Manipulative evaluation | Topic 79 / Topic 24 / Topic 61 / Topic 19 / Topic 107 / Topic 46 / Topic 93 / Topic 40 / Topic 122 / Topic 51 |
Figure 8 Hasse diagram of the topic feature lattice (partial)
4.3 Discussion
In order to verify the rationality of the TFL_DTN model, the accuracy rate, recall rate, F1 value, and mean absolute error (MAE) are selected as the evaluation indicators. Meanwhile, a comparison experiment is performed with the TFIDF algorithm
[18], the TDFCA algorithm
[15], and the ILDA algorithm
[12] on the same data set.
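For reference, the evaluation indicators named above can be computed as in the following sketch. Precision is used here as a stand-in for the paper's "accuracy rate", and MAE is computed over illustrative score vectors; these are the standard definitions, not the authors' evaluation script.

```python
import numpy as np

def eval_metrics(y_true, y_pred, scores_true, scores_pred):
    """Precision/recall/F1 over binary topic-category labels, and MAE over
    predicted association scores (standard definitions, shown for reference)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    precision = tp / max(np.sum(y_pred == 1), 1)
    recall = tp / max(np.sum(y_true == 1), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    mae = np.mean(np.abs(np.asarray(scores_true) - np.asarray(scores_pred)))
    return precision, recall, f1, mae

# Toy usage with made-up labels and scores.
print(eval_metrics([1, 0, 1, 1], [1, 0, 0, 1],
                   [0.9, 0.1, 0.8, 0.7], [0.7, 0.2, 0.5, 0.9]))
```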
Tables 5 and
6 show the comparison results of the evaluation indexes of the above algorithms. The results show that the prediction performance of the TFL_DTN model is significantly better than that of the other methods on the six review topics: The accuracy, recall, and F1 values on the measured data remain around 0.65, and the MAE values stay around 0.85. The reason is that the TFL_DTN model combines the probabilistic relationships and partial order relationships between topic feature words and topics, which not only effectively reduces the dimensionality, but also improves the model's awareness of document topics as the topic feature words change.
Table 5 Comparison of accuracy and recall of different algorithms |
| TFIDF | | TDFCA | | ILDA | | TFL_DTN |
Accuracy | Recall | | Accuracy | Recall | | Accuracy | Recall | | Accuracy | Recall |
Security evaluation | 55.86 | 56.31 | | 56.88 | 59.64 | | 58.34 | 61.47 | | 65.88 | 60.53 |
Economy evaluation | 57.24 | 58.14 | | 59.37 | 61.16 | | 62.76 | 62.83 | | 67.52 | 61.04 |
Dynamic performance evaluation | 58.33 | 58.67 | | 60.55 | 60.44 | | 61.53 | 63.35 | | 68.19 | 62.49 |
Comfort evaluation | 57.75 | 56.07 | | 59.31 | 57.33 | | 62.45 | 62.51 | | 66.67 | 63.39 |
Service evaluation | 54.38 | 55.98 | | 55.76 | 57.28 | | 57.77 | 59.08 | | 62.98 | 61.17 |
Manipulative evaluation | 59.15 | 57.34 | | 61.98 | 58.46 | | 62.49 | 60.46 | | 65.74 | 60.61 |
Table 6 Comparison of F1 values and MAE of different algorithms |
| TFIDF | | TDFCA | | ILDA | | TFL_DTN |
F1 | MAE | | F1 | MAE | | F1 | MAE | | F1 | MAE |
Security evaluation | 56.08 | 1.671 | | 58.23 | 1.434 | | 59.86 | 1.198 | | 63.09 | 0.788 |
Economy evaluation | 57.69 | 1.931 | | 60.25 | 1.552 | | 62.79 | 1.371 | | 64.12 | 0.862 |
Dynamic performance evaluation | 58.50 | 1.656 | | 60.49 | 1.274 | | 62.43 | 1.154 | | 65.22 | 0.796 |
Comfort evaluation | 56.90 | 1.637 | | 58.30 | 1.394 | | 62.48 | 1.144 | | 64.99 | 0.835 |
Service evaluation | 55.17 | 1.671 | | 56.51 | 1.485 | | 58.42 | 1.291 | | 62.06 | 0.884 |
Manipulative evaluation | 58.23 | 1.937 | | 60.17 | 1.576 | | 61.46 | 1.242 | | 63.07 | 0.927 |
5 Conclusions
The proposed TFL_DTN method designs a visualized topic recognition scheme to optimize the topic semantic correlation features generated by the ILDA model. The model iteratively generates the topic-feature word probability distribution matrix and the document-topic probability distribution matrix based on the conditional probabilistic dependency relationships among topics, documents, and feature words. Through the calculation of feature word weights and the strong correlation matrix, a visual concept lattice of topic features is constructed, which realizes the generalization and specialization of semantic relationships between topic features.
Experiments show that the TFL_DTN model has a good ability of topic recognition under a dynamic number of topics. Specifically, the following innovations are made in this paper:
1) A method is proposed for calculating the correlation strength of feature words under different topics using joint probability of topic-feature words.
2) A method is proposed to construct topic feature lattice in formal context association matrix at multi-granularity.
In order to improve the calculation accuracy of the topic prediction model, future research will focus on the semantic analysis of topic sentiment, so as to deeply mine online users' sentiment tendencies and establish text sentiment models for the hidden features of topics.