Exploring Evolution of Public Opinions on Tianya Club Using Dynamic Topic Models

Zhihua YAN, Xijin TANG

Journal of Systems Science and Information ›› 2020, Vol. 8 ›› Issue (4) : 309-324. DOI: 10.21078/JSSI-2020-309-16

Abstract

Online media have brought tremendous changes to civic life, public opinions, and government administration. Compared with traditional media, online media not only allow individuals to browse news and express their views more freely, but also accelerate the transmission of opinions and expand their influence. As public opinions may arouse societal unrest, detecting the primary topics and uncovering the evolution trends of public opinions is valuable for societal administration. Various algorithms have been developed to deal with the huge volume of unstructured online media data. In this study, a dynamic topic model is employed to explore the evolution of topic content and topic prevalence using the original posts published from 2013 to 2017 on the Tianya Zatan Board of Tianya Club, one of the most popular BBS in China. Based on semantic similarities, topics are grouped into three themes: family life, societal affairs, and government administration. The evolution of both topic prevalence and topic content is affected by emergent incidents. Topics on family life become more popular over time, while the themes "societal affairs" and "government administration", which have larger standard deviations, are more likely to be influenced by emergent hot events. Representing content evolution as a monthly pairwise distance matrix makes the change points of topic content easy to find.

Key words

topic modeling / dynamic topic models / text mining / topic evolution

Zhihua YAN, Xijin TANG. Exploring Evolution of Public Opinions on Tianya Club Using Dynamic Topic Models. Journal of Systems Science and Information, 2020, 8(4): 309-324. https://doi.org/10.21078/JSSI-2020-309-16

1 Introduction

Online media platforms, such as blogs, microblogs, and bulletin board systems (BBS), provide open environments for netizens to browse news and share opinions. As anybody can create and propagate news freely, online media have attracted large numbers of netizens and become one of the most influential news sources in China[1]. People not only share their day-to-day activities, but also express their opinions about societal events on the Internet. Public events involving corruption, environmental pollution, economic development, national security, or international relations may easily trigger fierce online discussions and become high-profile incidents. For example, the bankruptcy of P2P lending companies first became public online, and the problem of left-behind children is an enduring topic. Moreover, the propagation of fake news and rumors may steer public opinions and lead to societal unrest and instability. For example, the Guo Meimei incident damaged the reputation of the Red Cross Society of China and created a crisis of trust in charities[2]. Hence, online media provide abundant source materials for public opinion research[3]. A huge volume of unstructured online media data has been created and is used to understand the opinions and collective behaviors of the general public regarding Internet public events[4].
Due to the great influence of public opinions on societal stability, the management of public opinion has become an important task for government. Public opinions, which are often triggered by emergency events, are complex and uncertain. Current research pays more attention to fluctuations in the popularity of public opinion. However, revealing the evolution of public opinions also requires understanding how their contents change over time, because public opinions caused by different events develop quite differently. Although various machine learning techniques have been employed, it remains a persistent challenge to identify emerging societal trends and detect topics from large collections of unstructured content. In recent years, text mining methods such as topic models and sentiment analysis have proven effective for analyzing online public opinions[5].
In this study, we offer a comprehensive view of the evolution of online public opinions through an empirical analysis of Tianya Club, one of the most popular Internet forums in China. Topic prevalence evolution and content evolution are analyzed to gain insight into the emerging societal trends of online media. Original posts published from 2013 to 2017 are collected from the Tianya Zatan Board of Tianya Club, and data mining and statistical approaches are used to analyze the topics generated by topic models. In addition, a distance matrix is generated to detect the change points of public opinions, and a change ratio is employed to measure their rate of change.
The remainder of this paper is organized as follows. Section 2 describes the development of topic modeling and the dynamic topic models. In Section 3, the process of data collection and data preprocessing is introduced. Various measures to quantify topic prevalence and topic similarity are introduced. Section 4 presents our findings on the evolution of topics on Tianya Zatan Board using an inter-topic distance map, a Cox-Stuart trend test and a monthly pairwise distance matrix. Finally, we summarize this study and propose possible directions for future work.

2 Related Works

2.1 Probabilistic Topic Model

Modern statistical topic models, originating from latent semantic analysis[6] and information retrieval[7], provide a way to uncover latent semantic structures in collections of documents using natural language processing. Topic models are generally based on the bag of words hypothesis, which presumes that the order of words in documents does not matter.
Introduced by Blei et al., latent Dirichlet allocation (LDA) is a three-level hierarchical Bayesian model[8]. It is a generalization of the probabilistic latent semantic analysis (pLSA) model, and focuses on discovering latent topics from large collections of documents. Unlike the pLSA model, LDA introduces Dirichlet prior distributions for the document-topic and topic-word distributions. In the LDA model, a topic is represented as a probability distribution over words, and each document is represented as a probability distribution over topics.
To understand the evolution of topics in a corpus, many topic models incorporate temporal information, such as the dynamic topic model (DTM), the continuous time dynamic topic model (cDTM), the multiscale dynamic topic model (MDTM), and the temporal Dirichlet process mixture model (TDPM). DTM is used to track the evolution of topics in a sequential collection of documents[9]. As a variant of DTM, cDTM uses Brownian motion to model latent topics through a corpus in continuous time. The TDPM does not require a predefined number of topics, and the parameters of each topic evolve over time[10]. MDTM considers the difference between long and short timescales, and the current word distributions of topics depend on the previous epoch[11].

2.2 Topic Evolution

Research on topic evolution originates from topic detection and tracking (TDT), which aims to search and organize event-based topics from textual news media materials[12, 13]. A topic is regarded as a set of news stories relating to a real-world event. With the advent of online media, TDT is also widely used for public opinion detection and topic evolution research. Furthermore, topic detection and topic evolution have become a focus of scientometrics and text mining.
Traditional scientometric methods, such as co-word networks and co-citation analysis, have been used for research on topic identification and topic evolution[14, 15]. Many multi-disciplinary findings have been achieved, concerning, for example, the direction of government policy[16], discipline development[17], and pop music trends[18]. Compared with co-word networks and co-citation analysis, the hierarchical Dirichlet process, a generative probabilistic topic model, performs better[19].
Topic models have been applied in various fields, such as scientific literature analysis and public opinion research, to reveal the evolution of topics across large collections of documents. Topic evolution analysis has been successfully applied to explore and predict scientific research trends. Hall et al. applied LDA to the ACL Anthology to discover research topic threads in the field of computational linguistics from 1978 to 2006[20]. Sun and Yin used LDA to uncover fifty key topics from transportation research articles and identified some general research trends[21]. Greene and Cross explored the political agenda of the European Parliament, using a dynamic topic modeling approach to unveil the plenary agenda and detect latent themes[22].
With the wide use of online media, user-generated content is becoming an important data source for topic evolution research. Lau et al. tracked emerging events and trending topics on Twitter using LDA[23]. Barua et al. used LDA to identify the main topics discussed on Stack Overflow, a question and answer website about computer technologies, and the variation in those topics over time[24]. Cao and Tang employed DTM to explore the temporal patterns of changing topics on the Tianya Zatan Board of Tianya Club[25]. Morimoto and Kawasaki used DTM to analyze the chronological evolution of topics in online news and to forecast financial market volatility[26].

3 Data and Methods

3.1 Data Collection

In this study, we use a representative online media platform, Tianya Club, for an empirical analysis of topic evolution. Tianya Club is one of the most popular Internet forums in China, offering online forums and blogs[27]. By the end of 2017, Tianya Club had 130 million active users and more than 13 million visits each day. On Tianya Club, netizens can not only browse and post freely but also track popular posts easily. As a result, numerous public opinion events are first revealed on Tianya Club. More importantly, Tianya Club can speed up the spread of public opinions and expand the influence of negative events.
The Tianya Zatan Board, the second largest board on Tianya Club, covers a variety of topics, such as daily life, education, economics, and historical events. News about almost every emergent incident can be found there, attracting netizens to browse and reply. Generally, significant societal incidents lead to heated discussions. For example, posts about left-behind children easily trigger intense discussions about education and juvenile delinquency[28]. Nevertheless, most of the time, netizens mainly talk about health, employment, and family relationships.
To study topic evolution and identify emerging topic trends in China, a Python crawler is employed to collect original posts from the Tianya Zatan Board. In total, 1,746,307 original posts are collected, as shown in Figure 1. Owing to the impact of newer social media, such as microblogs and WeChat, the number of posts on the Tianya Zatan Board declines by about 70% from 2013 to 2017. Nevertheless, the Tianya Zatan Board remains a good object for understanding the changes of public opinions in China.
Figure 1 Original posts on the Tianya Zatan Board from 2013 to 2017

3.2 Data Preprocessing

As the contents of posts on the Tianya Zatan Board are generated by ordinary netizens, the raw data contain large quantities of noise, such as colloquial language, URLs, and emoticons. Moreover, there are many advertising posts and empty posts. Hence, we preprocess the Tianya Zatan dataset. First, we discard advertising posts and short posts whose content is shorter than 30 words. Next, the posts are segmented into words, and common Chinese stop words are removed. To refine the quality of the corpus, we use customized reserved words based on Baidu hot words[29]. Finally, we remove words whose word frequency or document frequency is less than 50[30]. Table 1 summarizes the final corpus used in our study. The refined corpus has 1,404,634 original posts and a vocabulary of 66,844 words, which occur a total of 211.9 million times in the corpus.
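The filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: posts are assumed to be already segmented into token lists (the segmenter is not named in the text), advertising detection and the reserved-word list are omitted, and the function name `filter_corpus` and its parameters are hypothetical.

```python
from collections import Counter


def filter_corpus(posts, stop_words, min_len=30, min_count=50):
    """Filter a list of token lists (hypothetical helper, not the authors' code).

    min_len   -- posts with fewer tokens are dropped (30 in the paper)
    min_count -- words whose term frequency or document frequency is
                 below this value are dropped (50 in the paper)
    """
    # Step 1: discard short posts.
    kept = [p for p in posts if len(p) >= min_len]
    # Step 2: remove stop words.
    kept = [[w for w in p if w not in stop_words] for p in kept]
    # Step 3: term frequency and document frequency over the remaining posts.
    tf = Counter(w for p in kept for w in p)
    df = Counter(w for p in kept for w in set(p))
    vocab = {w for w in tf if tf[w] >= min_count and df[w] >= min_count}
    # Step 4: keep only in-vocabulary tokens.
    docs = [[w for w in p if w in vocab] for p in kept]
    return docs, vocab
```

With toy thresholds, `filter_corpus([["a", "b", "the", "a"], ["a", "b"], ["a", "c", "the", "b"]], {"the"}, min_len=3, min_count=2)` drops the two-token post, the stop word, and the rare word "c".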
Table 1 The statistics of Tianya Zatan dataset after preprocessing
| Year | Original posts # | Word tokens in corpus # (thousand) | Words in dictionary # |
|---|---|---|---|
| 2013 | 437,806 | 64,845 | 66,761 |
| 2014 | 383,573 | 59,176 | 66,811 |
| 2015 | 258,060 | 39,580 | 66,817 |
| 2016 | 200,742 | 30,156 | 66,806 |
| 2017 | 124,453 | 18,158 | 66,691 |
| Total | 1,404,634 | 211,915 | 66,844 |

3.3 Dynamic Topic Modeling

In DTM, the corpus is divided into discrete sequential epochs, each of which is modeled by a k-component topic model. The topic model at each epoch evolves from that of the previous epoch. Let $\beta_{k,t}$ denote the natural parameters of the word distribution of topic $k$ at epoch $t$, and let $\alpha_t$ denote the mean parameters of the document-topic proportions at epoch $t$. Both evolve with Gaussian noise. Let $\eta$ be the log proportions of $\theta_{d,t}$, the topic distribution of document $d$ at epoch $t$. Let $Z$ be the topic assignment of a word, and let $W_{t,d,n}$ be the $n$-th word in document $d$ at epoch $t$. The topic assignments and the words follow multinomial distributions.
Following Blei and Lafferty[31], the process of generating the DTM is as follows:
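1. Draw topics $\beta_t \mid \beta_{t-1} \sim \mathcal{N}(\beta_{t-1}, \sigma^2 I)$.
2. Draw $\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I)$.
3. For each document:
   (a) Draw $\eta \sim \mathcal{N}(\alpha_t, a^2 I)$.
   (b) For each word:
      i. Draw $Z \sim \mathrm{Mult}(\pi(\eta))$.
      ii. Draw $W_{t,d,n} \sim \mathrm{Mult}(\pi(\beta_{z,t}))$.
Note that $\pi$ maps the multinomial natural parameters to the mean parameters,
$$\pi(\eta) = \frac{\exp(\eta)}{\sum_i \exp(\eta_i)}, \tag{1}$$
$$\pi(\beta_{k,t})_w = \frac{\exp(\beta_{k,t,w})}{\sum_{w'} \exp(\beta_{k,t,w'})}. \tag{2}$$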
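To make the generative process concrete, the following sketch samples toy documents from it using only the Python standard library. It illustrates the steps listed above and is not the variational inference used to fit the model; the sizes `K` and `V`, the noise scales, and the helper names `softmax`, `drift` and `generate_document` are assumptions chosen for illustration.

```python
import math
import random


def softmax(x):
    """The map pi of Eqs. (1)-(2): natural parameters -> probabilities."""
    m = max(x)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def drift(prev, sd, rng):
    """One Gaussian random-walk step, i.e. a draw from N(prev, sd^2 I)."""
    return [rng.gauss(p, sd) for p in prev]


def generate_document(alpha_t, beta_t, a, n_words, rng):
    """Sample word ids for one document at epoch t.

    alpha_t -- list of K mean log proportions
    beta_t  -- list of K natural-parameter vectors, one per topic
    """
    eta = drift(alpha_t, a, rng)               # step 3(a)
    theta = softmax(eta)                       # topic proportions pi(eta)
    topic_word = [softmax(b) for b in beta_t]  # pi(beta_{k,t}) for each k
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]                    # step i
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]    # step ii
        words.append(w)
    return words


rng = random.Random(0)
K, V = 3, 8  # toy numbers of topics and vocabulary words (assumed)
alpha = [0.0] * K
beta = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(K)]
for t in range(3):
    if t > 0:
        alpha = drift(alpha, 0.1, rng)              # step 2
        beta = [drift(b, 0.05, rng) for b in beta]  # step 1
    doc = generate_document(alpha, beta, a=0.5, n_words=10, rng=rng)
```

Because the topic parameters only take small Gaussian steps between epochs, the word distribution of each topic changes gradually, which is the property exploited later when tracking content evolution.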
In this research, the DTM package (available at https://github.com/blei-lab/dtm) is used to fit a DTM to the Tianya Zatan corpus using the Kalman filter variational approximation. The number of topics k is set to 60 and the Dirichlet parameter α of the topic distribution is set to 0.01.
The performance of DTM is greatly affected by the choice of unit epoch. As events on the Tianya Zatan Board usually persist for only a short time, it is difficult to detect emergency incidents with a large unit epoch, such as a year. By contrast, a small unit epoch, such as a day or an hour, leads to long computing times. To find the long-term patterns of topic evolution, we make a tradeoff between computing time and model performance and divide the corpus by month, which yields 60 epochs. The DTM program takes approximately 91 hours with 36 threads, using 136 GB of RAM.
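As a small illustration of the monthly slicing, the sketch below maps each post to one of the 60 epochs and counts documents per epoch, which is the kind of per-slice information a time-sliced topic model needs. The function names and the `(year, month)` input format are assumptions, not part of the original workflow.

```python
from collections import Counter


def epoch_index(year, month, start_year=2013):
    """Map a (year, month) pair to a 0-based monthly epoch index."""
    return (year - start_year) * 12 + (month - 1)


def docs_per_epoch(post_dates, n_epochs=60):
    """Count posts in each monthly epoch; post_dates holds (year, month) pairs."""
    counts = Counter(epoch_index(y, m) for y, m in post_dates)
    return [counts.get(t, 0) for t in range(n_epochs)]
```

January 2013 maps to epoch 0 and December 2017 to epoch 59, so the five years of data give exactly 60 monthly epochs.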

3.4 Metrics and Analysis

1) Topic prevalence. Topic prevalence, which is proportional to the estimated number of tokens generated by a given topic across the entire corpus, reveals the popularity of a topic. In DTM, the prevalence of topic $k$ can be represented by the average posterior probability[32]. Assume that $\theta_{d,k,t}$ is the posterior probability of topic $k$ at epoch $t$ for document $d$, and $M_t$ is the number of documents at epoch $t$. The topic prevalence is defined as:
$$\bar{\theta}_{k,t} = \frac{1}{M_t} \sum_{d} \theta_{d,k,t}. \tag{3}$$
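Eq. (3) is an average over the documents of one epoch, which can be written in a few lines. The sketch assumes the per-document topic distributions are available as nested lists; the name `topic_prevalence` and this data layout are illustrative choices.

```python
def topic_prevalence(theta_by_epoch):
    """Average document-topic probabilities per epoch, as in Eq. (3).

    theta_by_epoch[t] is a non-empty list of per-document topic
    distributions at epoch t; the result[t][k] is the prevalence of
    topic k at epoch t.
    """
    prevalence = []
    for docs in theta_by_epoch:
        m_t = len(docs)          # number of documents at epoch t
        n_topics = len(docs[0])
        prevalence.append(
            [sum(d[k] for d in docs) / m_t for k in range(n_topics)]
        )
    return prevalence
```

For example, two documents with topic distributions `[0.5, 0.5]` and `[1.0, 0.0]` in the same epoch give prevalences `[0.75, 0.25]`.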
By calculating the topic prevalence over time, we can obtain the evolution rules of topics and group the topics with different labels.
2) Pairwise topic similarity. In topic models, topics are represented as multidimensional probability distributions, and measuring the similarity between two probability distributions is a challenge. In text mining, distance measures such as the Euclidean distance, cosine distance, and Kullback-Leibler (KL) divergence are widely used. Let $P$ and $Q$ be two probability distributions over the same set $X$; the KL divergence is defined as[33]:
$$D_{KL}(P\|Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}. \tag{4}$$
The KL divergence is non-negative, and is zero only when $P$ and $Q$ are identical. However, the KL divergence is asymmetric: $D_{KL}(P\|Q)$ and $D_{KL}(Q\|P)$ generally differ. Hence, we use the Jensen-Shannon distance (JSD), which is based on the KL divergence, as a measure to quantify the similarity between $P$ and $Q$[34, 35]:
$$JSD(P\|Q) = \left[\frac{1}{2} D_{KL}(P\|M) + \frac{1}{2} D_{KL}(Q\|M)\right]^{\frac{1}{2}}, \tag{5}$$
where $M = \frac{1}{2}(P+Q)$.
With base-2 logarithms, the JSD is bounded between 0 and 1. The higher the JSD is, the lower the similarity between the probability distributions. Using JSD, we compute the inter-topic distance and pairwise topic similarity over time.
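A direct implementation of the JSD defined above takes only a few lines. The sketch uses base-2 logarithms so that the result lies in [0, 1]; the function names are illustrative, and distributions are assumed to be given as equal-length lists of probabilities.

```python
import math


def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute zero.

    Assumes q_i > 0 wherever p_i > 0, which always holds when q is the
    mixture M used by the Jensen-Shannon distance.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    div = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(div, 0.0))  # guard against tiny negative rounding
```

Identical distributions give a distance of 0, and distributions with disjoint support, such as `[1, 0]` and `[0, 1]`, give the maximum distance of 1.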

4 Results and Analysis

4.1 Discovering Topics

Based on DTM, we generate 60 latent topics from the Tianya Zatan corpus. These topics cover many aspects of society, such as family ties, marriage, livelihoods, environmental pollution, e-business, and swindling. As previously described, topics are represented as probability distributions over words, and topic semantics can be obtained from the top 10 or 20 words. For example, for January 2013, the topic with the feature words "netizens, microblog, internet, media, news, post, event, reply, Tianya and comment" is about social media, and the topic with the feature words "production, food, milk powder, product, criterion, drug, transgene, product safety, sell and detection" relates to product quality safety. As DTM does not provide labels for topics, automated methods have been put forward to label topics[36]. Nevertheless, manual labelling is still widely used in topic mining[37]. In this study, we manually label each topic with a short phrase based on its highest-probability words, as shown in Table 2.
Table 2 Examples of topics of Tianya Zatan corpus in January of 2013
| No. | Label | Top words (Jan. 2013) | Theme (1) | Trend (2) |
|---|---|---|---|---|
| 1 | Institution management | Work, Management, System, Supervision, Organization | 2 | + |
| 5 | Product quality safety | Production, Food, Milk powder, Product, Criterion | 2 | C |
| 9 | Company operations | Work, Management, System, Supervision, Organization | 2 | C |
| 18 | Marriage | Woman, Man, Girl, Marriage, Divorce | 1 | + |
| 27 | Financial regulation | Capital, Estate, Immigration, Investment, Gong Aiai | 3 | + |
| 34 | Job | Job, Poster, Friend, Tianya, Feeling, Graduation | 2 | C |
| 45 | Traffic accident | Driver, Vehicle, Yellow light, Traffic police | 3 | - |
| 52 | Telecom fraud | Phone, Swindler, Contact, Information, ID card | 3 | - |

(1) Theme: 1: Family life; 2: Societal affairs; 3: Government administration.
(2) Trend: +: Up trend; -: Down trend; C: Constant trend.
To gain insight into the semantic relevance between DTM topics, we visualize them in two dimensions using the multidimensional scaling (MDS) algorithm[38, 39]. Figure 2 depicts the inter-topic distances of the 60 latent topics in January 2013 and January 2017, respectively. Here we use JSD as the pairwise distance between topics. The smaller the JSD between two topics is, the more similar they are. The variation of pairwise topic distance helps to uncover changes of semantic relevance between DTM topics. For example, Topic 7 "Child Rearing" and Topic 18 "Marriage", which are close to each other in Figure 2, are both about family life, and their JSD is quite small. Generally, topics that share many words and tend to appear in the same posts have more semantic affinity and lie closer together in the inter-topic distance map.
Figure 2 Inter-topic distance map over time for all topics generated from the Tianya Zatan corpus

The inter-topic distance map helps to uncover the changes of hot topics on the Tianya Zatan Board. The size of each node is proportional to the marginal distribution of the topic across the corpus. The bigger the node, the more prevalent the topic is in the Tianya Zatan corpus. As shown in Figure 2, topic prevalence changes greatly from 2013 to 2017. In January 2013, the hottest topic is Topic 15 "Civil rights", while Topic 59 "Family ties" is the hottest topic in January 2017.
In DTM, each topic is associated with a time sequence of probability distributions. The clustering algorithm and the distance measure are the two key components of time series clustering. Unsupervised clustering algorithms, such as k-means, spectral clustering, and Gaussian mixture models, are widely applied to cluster unlabeled data. K-means clustering is popular for cluster analysis and has many variants, such as k-medoids, k-means++, and fuzzy clustering. However, the performance of k-means clustering is greatly affected by the initial centroids. To address this problem, we employ the spectral clustering algorithm, which is simple to implement and often outperforms the k-means algorithm[40]. Compared with the Euclidean distance, dynamic time warping (DTW) is a more sophisticated metric for measuring the distance between time series[41]. In this study, DTW is used to compute the similarity matrix of topic vector sequences, and the silhouette coefficient is calculated to estimate the number of clusters. The 60 topics are grouped into three overarching themes: family life, societal affairs, and government administration, as shown in Table 3.
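The DTW step can be illustrated with the classic dynamic-programming recurrence. This sketch is an assumption-laden illustration rather than the authors' code: it treats each topic as a one-dimensional series (for instance monthly prevalence) with absolute difference as the local cost, whereas the paper's "topic vector sequences" may be higher-dimensional, in which case a different `cost` function would be passed in. The clustering and silhouette steps are not shown.

```python
def dtw_distance(a, b, cost=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(a[i - 1], b[j - 1])
            d[i][j] = c + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def dtw_matrix(series):
    """Pairwise DTW distances between all series (symmetric, zero diagonal)."""
    k = len(series)
    return [[dtw_distance(series[i], series[j]) for j in range(k)]
            for i in range(k)]
```

Unlike the Euclidean distance, DTW tolerates local stretching of the time axis: `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0 because the repeated value can be aligned with a single point. A matrix produced by `dtw_matrix` would still have to be converted into similarities before being passed to a spectral clustering routine.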
Table 3 Summary of topics generated by DTM
| Theme | Details | Topics # |
|---|---|---|
| Family life | Marriage, Friends, Feelings, Travel, Entertainment, Traditional culture, Diet, Religion, Job, etc. | 15 |
| Societal affairs | Social media, Urban environment, Product safety, Decoration, Livelihood, School, Spiritual civilization, etc. | 27 |
| Government administration | State-owned firm, Building demolition, Official corruption, Civil rights, Criminal offence, Migrant workers, etc. | 18 |
There are great differences between themes. The theme "family life" focuses on the daily life of citizens at home, and is composed of topics about marriage, friends, travel, entertainment and so on. The theme "societal affairs" includes topics about societal activities, such as social media, urban environment, product safety, and house decoration. The theme "government administration" relates to events connected with government, such as state-owned firms, building demolition, official corruption, and criminal offences. About half of the topics are about "societal affairs", whereas about 30 percent of topics are about "government administration". Surprisingly, only 23 percent of topics are about "family life". This is because the Tianya Zatan Board is famous for exposing unfair societal events and for discussions of hot societal events. The topics generated on online media depend strongly on the platform; for instance, Sina Weibo is famous for content about jokes, funny images, and entertainment events[42].

4.2 Topic Prevalence Evolution

To gain a comprehensive understanding of the dynamics of topic prevalence, we use a statistical approach and a graphical representation to reveal the dynamic characteristics of topic prevalence in the Tianya Zatan corpus. As shown in Figure 3, a heat map is used to visualize the top 30 topics by average prevalence from 2013 to 2017, and the darkness of the color, from white to red, indicates the strength of a topic. The most popular topic is Topic 55 "life perception", which consists of thoughts about personal life and work and includes words such as livelihood, happiness, world, being, dream, and life. Furthermore, it fluctuates only slightly, and remains the hottest topic from 2013 to 2017. The top five hottest topics of the Tianya Zatan corpus are: Topic 55 "life perception", Topic 14 "livelihood", Topic 59 "family ties", Topic 34 "job" and Topic 50 "civil servant management". Only Topic 50 does not belong to the theme "family life". Notably, 12 out of the 15 topics in the theme "family life" are among the top 30 topics, which leads to a high mean topic prevalence for this theme, as shown in Figure 4(a).
Figure 3 Heat map displaying the top thirty topics on average prevalence from 2013 to 2017

Figure 4 Box plot of mean and standard deviation of topic prevalence

Figure 3 also provides evidence that the evolution of topic prevalence is affected by emergent incidents. Topics are generally composed of chatter discussions. However, high-profile events attract large numbers of netizens to browse and post on the Tianya Zatan Board, and lead to drastic fluctuations of topic prevalence. The popularity of the mobile Internet speeds up the dissemination of information and increases the fluctuation of topics. For example, Topic 18 "marriage" is an ordinary topic until the divorce case of Wang Baoqiang (a famous actor in China) in August 2016. Fierce debate is provoked on the Tianya Zatan Board and continues for several years. However, most topics triggered by emergent incidents only last a short time. We use the standard deviation of prevalence to measure topic fluctuation and reveal the influence of emergent events. In Figure 4(b), topics belonging to the themes "societal affairs" and "government administration" have larger standard deviations, and are more susceptible to emergent incidents.
Topic temporal patterns are greatly influenced by emergent incidents. Topics on blogs have been divided into "chatter" topics and "spike" topics according to the impact of outside world events. Topics on the Tianya Zatan Board have been summarized into common topics, specific days' topics and topics about societal incidents based on monthly data from Tianya Club[25]. To reach more general conclusions, we use the corpus of the Tianya Zatan Board from 2013 to 2017, and find that topics show different evolutionary characteristics. Based on patterns of prevalence variation, we group them into chatter topics, bursty topics and periodic topics. Chatter topics are mainly daily discussions without trending issues on the Tianya Zatan Board. Bursty topics are affected by public emergent incidents, and fluctuate greatly. Periodic topics fluctuate periodically and relate to cyclical events or days, for example, the National Day holiday.
Topical trends reflect the focus of online discussions and help to predict the future direction of public opinions. The Cox-Stuart trend test is applied to check whether each topic has a statistically significant upward or downward trend at the standard 95% confidence level. As shown in Table 4, about one third of the topics have constant trends, 18 topics have up trends, and 21 topics have down trends. There are great differences between themes in topical trends. The theme "family life" includes only one down-trending topic and nine up-trending topics. On the contrary, the theme "government administration" includes four up-trending topics and thirteen down-trending topics. The theme "societal affairs" has more constant-trend topics, and its numbers of up-trending and down-trending topics are approximately the same. For further analysis, we sum the proportions of topics by theme, as shown in Figure 5. The proportion of the theme "family life" keeps increasing, while that of the theme "government administration" gradually decreases. These phenomena reflect that the focus of netizens on the Tianya Zatan Board changes gradually, and netizens are more inclined to discuss daily life events online.
Table 4 Topic trends using the Cox-Stuart trend test with α=0.05
| Type | Family life | Societal affairs | Government administration | Total |
|---|---|---|---|---|
| Up trend | 9 | 5 | 4 | 18 |
| Down trend | 1 | 7 | 13 | 21 |
| Constant | 4 | 16 | 1 | 21 |
| Total | 15 | 27 | 18 | 60 |
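The Cox-Stuart trend test behind Table 4 is a sign test on pairs formed from the first and second halves of a series. The sketch below is one standard two-sided formulation under stated assumptions: ties are discarded, the middle observation is dropped when the length is odd, and the labels "up", "down" and "constant" mirror the table. The function name and return format are illustrative.

```python
from math import comb


def cox_stuart(series, alpha=0.05):
    """Two-sided Cox-Stuart trend test; returns (label, p_value)."""
    n = len(series)
    half = n // 2
    first = series[:half]
    second = series[n - half:]  # skips the middle point when n is odd
    # Signed differences between paired observations, ignoring ties.
    diffs = [b - a for a, b in zip(first, second) if b != a]
    m = len(diffs)
    if m == 0:
        return "constant", 1.0
    pos = sum(d > 0 for d in diffs)
    k = min(pos, m - pos)
    # Under the null hypothesis the number of positive signs is Binomial(m, 0.5).
    tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m
    p_value = min(1.0, 2 * tail)
    if p_value >= alpha:
        return "constant", p_value
    return ("up" if pos > m - pos else "down"), p_value
```

Applied to a monthly prevalence series of length 60, the test compares month i with month i + 30; a steadily increasing series is labeled "up", while a series with no dominant sign among the 30 differences is labeled "constant".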
Figure 5 Variation of theme proportion

4.3 Topic Content Evolution

As analyzed above, high-profile events lead not only to fluctuations of topic prevalence, but also to semantic changes of topics. Generally, controversial or significant events attract large numbers of netizens to participate in discussions, and become the focus of public opinions. In this subsection, we employ the change ratio and the distance matrix to quantify topic content evolution.
The change ratio, which is expected to reflect the impact of exogenous events, is used to measure changes of topic content between adjacent epochs. A bigger change ratio means a greater impact of an emergent event. In this study, let $T_t$ and $T_{t-1}$ be the word probability distributions of a topic at epochs $t$ and $t-1$, respectively; the change ratio is computed as $JSD(T_t, T_{t-1})$. The distance matrix is defined by the pairwise JSD between epochs and is visualized as a heat map. Each cell of the matrix represents the dissimilarity of the topic between two epochs, ranging from 0 to 1.
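Both measures follow directly from the JSD. The sketch below computes the change ratio series and the epoch-by-epoch distance matrix for a single topic, assuming the topic is given as a list of word distributions, one per epoch; the function names are illustrative and the JSD helper uses base-2 logarithms so that values lie in [0, 1].

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logarithms."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))


def change_ratios(topic_over_time):
    """Change ratio between each epoch and the previous one."""
    return [js_distance(topic_over_time[t], topic_over_time[t - 1])
            for t in range(1, len(topic_over_time))]


def distance_matrix(topic_over_time):
    """Pairwise JSD between all epochs of one topic."""
    n = len(topic_over_time)
    return [[js_distance(topic_over_time[i], topic_over_time[j])
             for j in range(n)] for i in range(n)]
```

For 60 monthly epochs, `change_ratios` returns 59 values and `distance_matrix` a 60 by 60 symmetric matrix with zeros on the diagonal; a block of large off-diagonal values marks epochs where the topic's vocabulary shifted.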
Problems of product quality and safety are long-term hot subjects on the Tianya Zatan Board. For example, Topic 5 "product safety" includes words such as "production", "food", "milk powder", "product", and "criterion". Figure 6(a) reveals the month-to-month content changes of Topic 5 from 2013 to 2017; darker cells indicate greater dissimilarity. The values of the cells on the diagonal are zero because each epoch is compared with itself. In addition, a few abrupt change points, which indicate great changes of topic content, can be found below the diagonal of the heat map. These content changes are typically caused by emergent incidents, such as the controversies over GMO foods in December 2013 and the illegal vaccine case in March 2016. The peaks of the change ratio coincide with these change points. In DTM, the evolution of topic content is reflected in the variation of the word probability distribution. As illustrated in Figure 6(b), the probabilities of the key words that stand for the emergent incidents change greatly. For example, the probabilities of "transgene", "Cui Yongyuan" and "Fang Zhouzi" stayed high during the controversies over GMO foods in December 2013. Similarly, "vaccine" became the hottest word during the illegal vaccine case in March 2016.
Figure 6 Content evolution of Topic 5 "product safety". (a) Monthly pairwise distance matrix of Topic 5. (b) The representative words in high-profile events. (c) Top 10 words of Topic 5 at each epoch

To gain an overview of topic content evolution, the mean and standard deviation of the content change ratio of each topic on the Tianya Zatan Board are computed. Figure 7(a) shows that, compared with the themes "societal affairs" and "government administration", topics in the theme "family life" have smaller mean content change ratios. Hence, the content of topics in the theme "family life" remains more stable from 2013 to 2017. In addition, in Figure 7(b), the larger standard deviations in both the "societal affairs" and "government administration" themes reflect that these themes are more likely to be influenced by emergent hot incidents.
Figure 7 Box plot of mean and standard deviation of topic content change ratio

5 Conclusions

The occurrence and development of public opinions are separated from online media. Compared with traditional news media, online media platforms facilitate the propagation of news, as anyone can create and propagate news freely. However, the propagation of fake news and rumors may lead to societal unrest and instability. Hence, finding evolution laws of public opinions is crucial for societal governance in contemporary China. In this study, DTM is employed to generate latent topics from the Tianya Zatan corpus from 2013 to 2017, and graphical approaches and statistical methods are used to demonstrate the variation in topics over time. Both topic prevalence and topic content are analyzed to gain insight into the dynamics of public opinions on Tianya Zatan Board.
In DTM, not only the variation in topic prevalence but also the changes of topic content can be obtained by computing the JSD. Based on semantic similarities, spectral clustering algorithm is employed to group topics generated from Tianya Zatan corpus into three clusters: Family life, societal affairs, and government administration. On Tianya Zatan Board, netizens pay more attention to societal events such as urban environment, product safety, education, et al, which suggests that with the development of society, people pursue a better living environment and hope to resolve increasingly severe environment and safety issues.
The fluctuation of topic prevalence is affected by emergent incidents. On Tianya Zatan Board, topics in theme "family life" are more likely to be a chatter topic with higher means of prevalence and lower standard deviations of prevalence. In contrast, theme "societal affairs" and "government administration" contain more topics affected by emergent high-profile events. To uncover the focus of Tianya Zatan Board and the trends of public opinions, Cox-Stuart trend test is used to check whether the topics have statistically significant increasing or decreasing trends with the standard 95% confidence level. Topics belonged to theme "family life" become more and more popular, while the prevalence of theme "government administration" continuously decreases over time. Besides, topics are divided into chatter topics, bursty topics and periodic topics according to the impact of emergent incidents.
For further analysis of the evolution of topic content, a monthly pairwise distance matrix and a change ratio measure are employed to detect the variation over time. The distance matrix is visualized as a heat map, and changes in topic content are represented by the top ten words. Based on these analyses, it is easy to detect the change epochs on the heat map and identify major events using key words. Furthermore, topic contents in the theme "family life" remain more stable, whereas topics in the themes "societal affairs" and "government administration" are more susceptible to emergent hot events.
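A monthly pairwise distance matrix of this kind can be sketched as follows, assuming that each month's topic content is a word distribution and that the distance is the JSD with base-2 logarithms. All names and the toy data are assumptions for illustration only; the change ratio measure is not reproduced here because its definition is not restated in this section.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base-2 logs) of two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_distance_matrix(monthly_dists):
    """Symmetric matrix whose entry (i, j) is the JSD between the
    word distributions of the same topic in months i and j."""
    t = len(monthly_dists)
    matrix = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for j in range(i + 1, t):
            matrix[i][j] = matrix[j][i] = jsd(monthly_dists[i], monthly_dists[j])
    return matrix


# Hypothetical topic over four months and a three-word vocabulary;
# the content shifts between the second and the third month.
toy = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.2, 0.7],
]
matrix = pairwise_distance_matrix(toy)
for row in matrix:
    print([round(value, 3) for value in row])
```

In a heat map of such a matrix, months with similar content form low-distance blocks along the diagonal, and the boundary between two blocks marks a candidate change point (here, between the second and third months).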
In summary, this study provides methods to detect and uncover the laws of topic evolution in large unstructured media corpora. Although DTM is useful and powerful, it takes a significant amount of time to generate topics from extremely large volumes of archived online media data. Hence, it is necessary to explore more efficient algorithms and tools to perform such analysis.

References

1
Dong T, Liang C, He X. Social media and internet public events. Telematics and Informatics, 2017, 34 (3): 726-739.
2
Cheng Y, Huang Y H C, Chan C M. Public relations, media coverage, and public opinion in contemporary China: Testing agenda building theory in a social mediated crisis. Telematics and Informatics, 2017, 34 (3): 765-773.
3
Murphy J, Link M W, Childs J H, et al. Social media in public opinion research: Executive summary of the AAPOR task force on emerging technologies in public opinion research. Public Opinion Quarterly, 2014, 78 (4): 788-794.
4
Rohani V A, Shayaa S, Babanejaddehaki G. Topic modeling for social media content: A practical approach. Proceedings of 3rd International Conference on Computer and Information Sciences, 2016: 397-402.
5
Sobkowicz P, Kaschesky M, Bouchard G. Opinion mining in social media: Modeling, simulating, and forecasting political opinions in the web. Government Information Quarterly, 2012, 29 (4): 470-479.
6
Deerwester S, Dumais S T, Furnas G W, et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41 (6): 391-407.
7
Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York: ACM Press, 1999.
8
Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3: 993-1022.
9
Blei D M, Lafferty J D. Dynamic topic models. Proceedings of the 23rd International Conference on Machine Learning, 2006: 113-120.
10
Ahmed A, Xing E. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: With applications to evolutionary clustering. Proceedings of the SIAM International Conference on Data Mining, 2008: 219-230.
11
Iwata T, Yamada T, Sakurai Y, et al. Online multiscale dynamic topic models. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010: 663-672.
12
Allan J, Carbonell J, Doddington G, et al. Topic detection and tracking pilot study: Final report. Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, 1998: 194-218.
13
Allan J. Introduction to topic detection and tracking. Topic detection and tracking. Boston, MA: Springer, 2002.
14
Chen C, Ibekwe-SanJuan F, Hou J. The structure and dynamics of cocitation clusters: A multiple-perspective cocitation analysis. Journal of the Association for Information Science and Technology, 2010, 61 (7): 1386-1409.
15
Leydesdorff L, Nerghes A. Co-word maps and topic modeling: A comparison using small and medium-sized corpora (N ≤ 1000). Journal of the Association for Information Science and Technology, 2017, 68 (4): 1024-1035.
16
Rule A, Cointet J P, Bearman P S. Lexical shifts, substantive changes, and continuity in State of the Union discourse. Proceedings of the National Academy of Sciences, 2015, 112 (35): 10837-10844.
17
Lu L Y Y, Liu J S. A novel approach to identify the major research themes and development trajectory: The case of patenting research. Technological Forecasting and Social Change, 2016, 103: 71-82.
18
Mauch M, MacCallum R M, Levy M, et al. The evolution of popular music: USA 1960-2010. Royal Society Open Science, 2015, 2 (5): 150081.
19
Ding W, Chen C. Dynamic topic detection and tracking: A comparison of HDP, co-word, and cocitation methods. Journal of the Association for Information Science and Technology, 2014, 65 (10): 2084-2097.
20
Hall D, Jurafsky D, Manning C D. Studying the history of ideas using topic models. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2008: 363-371.
21
Sun L J, Yin Y F. Discovering themes and trends in transportation research using topic modeling. Transportation Research Part C: Emerging Technologies, 2017, 77: 49-66.
22
Greene D, Cross J P. Exploring the political agenda of the European parliament using a dynamic topic modeling approach. Political Analysis, 2017, 25 (1): 77-94.
23
Lau J H, Collier N, Baldwin T. On-line trend analysis with topic models: #twitter trends detection topic model online. Proceedings of COLING 2012, 2012: 1519-1534.
24
Barua A, Thomas S W, Hassan A E. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering, 2014, 19 (3): 619-654.
25
Cao L N, Tang X J. Topics and trends of the on-line public concerns based on Tianya forum. Journal of Systems Science and Systems Engineering, 2014, 23 (2): 212-230.
26
Morimoto T, Kawasaki Y. Forecasting financial market volatility using a dynamic topic model. Asia-Pacific Financial Markets, 2017, 24 (3): 149-167.
27
Cao L N, Tang X J. Prevailing trends detection of public opinions based on Tianya Forum. Proceedings of International Conference on Intelligent Data Engineering and Automated Learning, 2013: 186-193.
28
Sun L, Yin Y. Discovering themes and trends in transportation research using topic modeling. Transportation Research Part C: Emerging Technologies, 2017, 77: 49-66.
29
Hu Y, Tang X J. Using support vector machine for classification of Baidu hot word. Proceedings of International Conference on Knowledge Science, Engineering and Management (KSEM 2013). Springer, 2013: 580-590.
30
Blei D M, Lafferty J D. A correlated topic model of science. The Annals of Applied Statistics, 2007, 1 (1): 17-35.
31
Blei D M, Lafferty J D. Dynamic topic models. Proceedings of the 23rd International Conference on Machine Learning, 2006: 113-120.
32
Griffiths T L, Steyvers M. Finding scientific topics. Proceedings of the National Academy of Sciences, 2004, 101 (suppl 1): 5228-5235.
33
Kullback S, Leibler R A. On information and sufficiency. The Annals of Mathematical Statistics, 1951, 22 (1): 79-86.
34
Österreicher F, Vajda I. A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 2003, 55 (3): 639-653.
35
Endres D M, Schindelin J E. A new metric for probability distributions. IEEE Transactions on Information Theory, 2003, 49 (7): 1858-1860.
36
Mehrotra R, Sanner S, Buntine W, et al. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2013: 889-892.
37
Lau J H, Grieser K, Newman D, et al. Automatic labelling of topic models. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011: 1536-1545.
38
Chuang J, Ramage D, Manning C, et al. Interpretation and trust: Designing model-driven visualizations for text analysis. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2012: 443-452.
39
Sievert C, Shirley K. LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 2014: 63-70.
40
Von Luxburg U. A tutorial on spectral clustering. Statistics and Computing, 2007, 17 (4): 395-416.
41
Aghabozorgi S, Shirkhorshidi A S, Wah T Y. Time-series clustering - A decade review. Information Systems, 2015, 53: 16-38.
42
Yu L L, Asur S, Huberman B A. Trend dynamics and attention in Chinese social media. American Behavioral Scientist, 2015, 59 (9): 1142-1156.

Acknowledgements

The authors gratefully acknowledge the editor and two anonymous referees for their insightful comments and helpful suggestions that led to a marked improvement of the article.

Funding

The National Key Research and Development Program of China (2016YFB1000902)
The National Natural Science Foundation of China (71731002)
The National Natural Science Foundation of China (71971190)