1 Introduction
As an emerging artificial intelligence technology, deep synthesis technology (deep synthesis) can be used to create new digital content or alter existing multimedia content such as video, images, audio, and text through deep learning
[1, 2]. As a cutting-edge technology, it has promising application prospects and has already been applied in film and television entertainment, social communication, e-commerce marketing, education and art, medical research, and many other fields.
Advances in AI algorithms, lower computing costs, and the rapid growth of Internet data have led to exponential growth in low-cost, low-threshold deep synthesized video. Deep synthesis technology itself is not harmful, and its underlying technology has benign uses
[3]. For example, deep learning algorithms are applied to create personalized artworks and improve virtual interactive experiences. However, with the increasing commercial application of deep synthesis technology, problems of technology abuse enabled by its powerful simulation capability have gradually appeared. At this stage, deep synthesis technology has been alienated into “deep forgery” (deepfakes), as it is now known to the public. The ready availability of deep forgery technology allows cyber criminals to create cheap, realistic forgeries quickly. It reduces the cost of engaging in information warfare at scale and expands the range of actors who can participate. When combined with other technologies and social trends, deep forgery could bring many risks: exacerbating cyber attacks, creating threats to privacy and security, accelerating the spread of fake information, and deepening the decline of public trust in official institutions
[4−6].
The innovation points of this paper are as follows. First, by studying deep synthesized videos, this paper expands the scope of research on the factors influencing video dissemination, supplements the limited literature on the dissemination effect of videos based on deep synthesis technology, and grasps the dissemination law of deep synthesized videos, thereby offering feasible suggestions for regulating the risk of alienation of deep synthesis technology. Second, when defining the explanatory variables of the ordinal regression models, in addition to the general influencing factors of information dissemination, this paper adds technical factors specific to deep synthesized video based on the characteristics of the research object, so that the influencing factors of the dissemination effect are studied from the two dimensions of information dissemination and deep synthesis technology. Third, this paper constructs an ordered regression model of the dissemination effect of ordinary videos and compares the influencing factors of deep synthesized videos and ordinary videos, highlighting the characteristics of the dissemination effect of deep synthesized videos.
2 Literature Review and Theoretical Basis
2.1 Deep Synthesized Video
The most obvious difference between deep synthesized video and ordinary video is the potential risk arising from the deep synthesis technology itself. Through secondary synthesis and editing of videos, deep forgery technology can generate highly deceptive "fake" information. Once such information spreads broadly, it may bring incalculable negative effects on online public opinion, social security and political stability
[7, 8].
From a technical point of view, there are five principal types of deep synthesized video
[9]: The first is facial reimplantation, which manipulates the facial expressions of the target subject in the video based on the input of the source participant. The second is face change, which swaps the face of the source participant onto the target subject; through deep forgery techniques, the subject in the target video can be made to appear entirely as the source subject. This type of deep synthesis does not demand much in terms of training data or related techniques. The third is whole-body fraud, which combines face change with facial reimplantation and extends artificial intelligence techniques to the whole body; because the underlying technology is not yet advanced enough, this type of fraud is more difficult to execute than facial fraud. The fourth is audio deep fraud, which applies artificial intelligence to synthesize the speech of a target speaker. This technology is already mature, and Baidu's DeepVoice3 service can clone a voice from an audio sample within three seconds. The fifth is deep fraud based on virtual characters: using generative adversarial networks, fully synthetic faces can be generated that cannot be traced to a specific source by reverse image search, meaning that it is possible to artificially generate images and videos of people who do not exist in reality. Currently, deep synthesized videos released on new media platforms mainly use technologies such as facial reimplantation, face change and audio fraud.
2.2 Theory of Information Transmission
According to the information dissemination theory
[10], the process of information transmission is composed of the sender, media, receiver, content and feedback: The sender refers to the initiator or source of information dissemination; the media refers to the media or channel of information dissemination, such as newspaper, TV, and Internet; the receiver refers to the receiver or audience of information dissemination; the content refers to the content of the information dissemination or the information itself; the feedback refers to the feedback or response of the receiver to the information dissemination, which can be an oral or action response. Meanwhile, the theory of social network analysis
[11] analyzes the relationship and interaction between people in a society, and reveals the mechanism of social network structure and function. In the process of deep synthetic video propagation, the social network analysis theory can be applied to analyze the relationship, connection, interaction and influence in the network. Social behaviors such as retweets and likes have become ways to connect with nodes in social networks.
2.3 Evaluation of Video Transmission Effect
The effect of video communication refers to the changes in cognition, emotion, attitude, behavior and other aspects after the audience receives information in video form. The content reported by the communication media first affects the audience's cognition and feelings about things, and then affects their behavior. To measure the effect of video communication, it is therefore necessary to comprehensively consider the changes in cognition, attitude and behavior after receiving video information.
At present, there are two main approaches to evaluating video propagation effect. One is to measure the transmission effect directly by counts such as likes and views. For example, Jin, et al.
[11] took the numbers of likes, comments and forwards of videos as indicators to quantify the video transmission effect on the TikTok platform. Zhang, et al.
[12] evaluated the transmission effect from three aspects according to the characteristics of Bilibili: transmission breadth, transmission depth and transmission participation. The other approach is to establish an evaluation index system for video transmission effect. For example, based on the four links of the addiction model, Gao, et al.
[13] established an indicator system covering four aspects: triggers, action factors, rewards and inputs. Zhang, et al.
[14], based on the all-information emotion theory and super IP theory, constructed a short video communication effect evaluation system with demand power, behavior power, experience power, content power, personality power, sharing power and realization power as first-level indicators, identified through key influencing factors. Research that establishes evaluation index systems for video transmission effect is largely qualitative and therefore carries a certain subjectivity; it is necessary to further combine platform data to measure the effect of video propagation comprehensively and objectively.
2.4 Factors Influencing the Network Information Transmission Effect
According to the different service objects of network information dissemination, network information dissemination can be divided into several sub-fields, and the factors affecting the transmission effect generally differ across fields. For instance, in the field of government affairs, Lu, et al.
[15] found that the factors influencing the transmission effect on Bilibili are the submission section, the video publisher, the originality type and the duration of the published video; Chen, et al.
[16] measured the transmission effect in three dimensions (transmission depth, breadth and participation) and found that the influencing factors include video content theme, video category, video editing rate, cover type, screen form, subtitle use, video duration and organizational form. In the field of popular science, Chen, et al.
[17] found that video type, video length, expression form, video content, discourse style and degree of interaction are important factors affecting the propagation effect of popular science videos on TikTok. Chen
[18] conducted a descriptive analysis of 92 hospitals' TikTok accounts and found that the important factors influencing communication effectiveness were content, presentation and video length. Li, et al.
[19] took the push order and content nature of graphic information as the crucial factors affecting its propagation effect. Zhao, et al.
[20] argued that the important factors affecting the average daily reading volume of a single article on an academic journal's WeChat official account are the order of publication, the time of publication and the article type of the published content. In the sphere of product marketing, based on the WeChat operation data of well-known brands, Peng, et al.
[21] found that publish time, topic location and the amount of content information are important factors affecting the effect of information dissemination.
3 Research Design
3.1 Data Collection
YouTube provides global users with video browsing, downloading, forwarding and other services, and the number of deep synthesized videos published on the platform is considerable. This article uses YouTube videos published from January 31, 2018 to November 20, 2022 as the data source. To ensure the forgery quality of the sample, we select 1,500 deep synthesized videos of no more than 4 minutes and collect their information, including video number, video id, publish time, video likes, video views, video duration, publisher subscriptions, publisher's total video views, video clarity, video fraud form and video fidelity. At the same time, in order to compare with the influencing factors of the transmission effect of ordinary videos, we randomly pick 1,500 ordinary videos of no more than 4 minutes and collect corresponding data such as video number, publisher id, publish time, video likes, video views, video duration, publisher subscriptions, publisher's total video views and video clarity.
3.1.1 Manual Labeling of the Data
We filter the preliminary data by manual labeling; 30,000 initial videos were labeled. The subjective evaluation indicators of deep synthesized videos are qualitative variables that help assess the quality of the videos. Table 1 shows the key subjective evaluation indicators used in the initial data screening process:
Table 1 Subjective evaluation indicators of data screening |
Subjective Variables | Illustration |
Visual Quality | Refers to the overall appearance of the deepfake video, including clarity and color balance. A high-quality deepfake should exhibit minimal artifacts, noise, or distortion (labeled high or low). |
The Area of the Publisher | The area where the forgery subject is located, determined from the language and the video background. |
Content Consistency | High-quality deepfake should keep the look, lighting, and style consistent throughout the video. Inconsistencies, such as sudden changes in skin color or facial features, may indicate poor quality. |
Time Coherence | Refers to the fluency and continuity of the movement. A good deepfake should show smooth movement without any lag, flicker, or sudden jumps. |
Audio-Visual Synchronization | Refers to the alignment of the audio and video components in the deepfake. Voice, sound effects, and other audio elements should be accurately synchronized with the corresponding visual prompts (such as mouth movements and facial expressions). |
Context Rationality | Refers to whether the deepfake fits into the context of the surrounding. High-quality deepfake should not only look and sound convincing, but also make sense in the overall narrative or context. |
Fraud form | The form of forgery used in the deep synthesized video released on new media platforms (facial reimplantation, face change, audio fraud, and so on). |
3.1.2 Reliability Evaluation
The labeling reliability may be unstable and inconsistent due to subjective differences among annotators. To remove labels with lower reliability, we use the expectation maximization (EM) algorithm to evaluate the reliability of each annotator and retain high-reliability labels.
Assume that the number of videos is $N$ and that the videos are annotated by $R$ annotators. Firstly, binarize the labels into a 0-1 matrix $L_{ijm}$ over annotation categories $m = 1, \dots, M$, in which $L_{ijm} = 1$ if annotator $j$ labels video $i$ as category $m$, and $L_{ijm} = 0$ otherwise.
Then, estimate the reliability of each annotator by optimizing the label likelihood. The reliability of annotator $j$ is expressed as two $M$-dimensional probability vectors $\alpha_j$ and $\beta_j$, where $\alpha_j^m$ denotes the probability that annotator $j$ correctly marks category $m$ when it is present, while $\beta_j^m$ is the probability that annotator $j$ correctly omits category $m$ when it is absent; $\alpha_j^m$ and $\beta_j^m$ are treated as independent. The latent variable $z_i^m$ indicates whether video $i$ truly carries the label of category $m$; we initialize $z_i^m$ by majority voting.
According to the definitions above, in the E step of the EM algorithm, the reliability probabilities are used to estimate the posterior probability $p_i^m$ that video $i$ is correctly annotated as category $m$:
$$ p_i^m = \frac{q^m a_i^m}{q^m a_i^m + (1 - q^m)\, b_i^m}, $$
where $q^m$ is the expected prior probability of category $m$, initialized from the voting results, and $a_i^m$ and $b_i^m$ are calculated as follows:
$$ a_i^m = \prod_{j=1}^{R} (\alpha_j^m)^{L_{ijm}} (1 - \alpha_j^m)^{1 - L_{ijm}}, \qquad b_i^m = \prod_{j=1}^{R} (1 - \beta_j^m)^{L_{ijm}} (\beta_j^m)^{1 - L_{ijm}}. $$
In the M step of the EM algorithm, firstly, update $q^m$ as follows:
$$ q^m = \frac{1}{N} \sum_{i=1}^{N} p_i^m. $$
Then, $\alpha_j^m$ and $\beta_j^m$ are updated by maximum likelihood estimation:
$$ \alpha_j^m = \frac{\sum_{i=1}^{N} p_i^m L_{ijm}}{\sum_{i=1}^{N} p_i^m}, \qquad \beta_j^m = \frac{\sum_{i=1}^{N} (1 - p_i^m)(1 - L_{ijm})}{\sum_{i=1}^{N} (1 - p_i^m)}. $$
Finally, take the expected log-likelihood $Q$ as the convergence target of the EM algorithm:
$$ Q = \sum_{i=1}^{N} \sum_{m=1}^{M} \left[ p_i^m \ln\!\left(q^m a_i^m\right) + (1 - p_i^m) \ln\!\left((1 - q^m)\, b_i^m\right) \right]. $$
We can further determine whether $Q$ has converged:
$$ \left| Q^{(t)} - Q^{(t-1)} \right| < \varepsilon, $$
where $t$ represents the number of iterations and $\varepsilon$ is the convergence threshold, set empirically to 0.000001. If $Q$ has converged, the reliability of all annotators is obtained; otherwise return to the E step. After several iterations, the annotation reliability of all annotators is not less than 0.83. Therefore, the annotation is considered valid.
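To make the procedure concrete, the following is a minimal sketch of this reliability estimation in Python, implementing the E and M steps described above; it assumes the annotations are stored as a 0-1 NumPy array of shape (videos, annotators, categories), and the function name and initialization values are illustrative rather than the authors' implementation.

```python
import numpy as np

def em_annotator_reliability(L, max_iter=100, eps=1e-6):
    """Estimate annotator reliability from a 0-1 label tensor.

    L has shape (N, R, M): L[i, j, m] = 1 if annotator j labels video i
    with category m.  Returns the reliabilities (alpha, beta) and the
    posterior label probabilities p.
    """
    N, R, M = L.shape

    # Initialize the latent labels by majority voting.
    p = (L.mean(axis=1) >= 0.5).astype(float)              # (N, M)
    q = np.clip(p.mean(axis=0), 1e-6, 1 - 1e-6)            # category priors
    alpha = np.full((R, M), 0.8)                           # P(label=1 | true=1)
    beta = np.full((R, M), 0.8)                            # P(label=0 | true=0)
    Q_old = -np.inf

    for _ in range(max_iter):
        # E step: posterior that video i truly carries category m.
        log_a = (L * np.log(alpha) + (1 - L) * np.log(1 - alpha)).sum(axis=1)
        log_b = (L * np.log(1 - beta) + (1 - L) * np.log(beta)).sum(axis=1)
        num = q * np.exp(log_a)
        den = num + (1 - q) * np.exp(log_b)
        p = num / den

        # M step: update the category priors and the reliabilities.
        q = np.clip(p.mean(axis=0), 1e-6, 1 - 1e-6)
        alpha = (p[:, None, :] * L).sum(axis=0) / p.sum(axis=0)
        beta = ((1 - p)[:, None, :] * (1 - L)).sum(axis=0) / (1 - p).sum(axis=0)
        alpha = np.clip(alpha, 1e-6, 1 - 1e-6)
        beta = np.clip(beta, 1e-6, 1 - 1e-6)

        # Convergence check on the expected log-likelihood Q.
        Q = np.sum(p * (np.log(q) + log_a) + (1 - p) * (np.log(1 - q) + log_b))
        if abs(Q - Q_old) < eps:
            break
        Q_old = Q

    return alpha, beta, p
```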
3.1.3 The Screening of the Representative Video
Based on an integration of subjective and objective scores, we choose the 1,500 most representative videos from the 30,000 initially screened videos. The selected subjective score indicators include visual quality, the area of the publisher, content consistency, time coherence, audio-visual synchronization, context rationality and fraud form; the objective rating indicators include clarity, number of comments, likes, views and publisher subscriptions. The three forms of fraud and the five areas of forgery subjects in the subjective scores were taken as the main screening categories, yielding 15 categories. Within each category, the videos were ranked by the integrated objective and subjective score, and the top 100 videos were taken as representative videos for follow-up analysis.
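As an illustration of this screening step, the sketch below ranks candidate videos within each of the 15 fraud-form by publisher-area categories by a combined score and keeps the top 100 per category; the file name, column names, and the equal weighting of the two scores are assumptions for illustration.

```python
import pandas as pd

# Hypothetical input: 30,000 candidate videos with precomputed
# subjective and objective scores plus the two grouping attributes.
df = pd.read_csv("candidate_videos.csv")

def minmax(s):
    """Min-max normalize a score column to [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())

# Combined score: equal-weight average of the normalized scores.
df["combined_score"] = 0.5 * minmax(df["subjective_score"]) \
                     + 0.5 * minmax(df["objective_score"])

# 3 fraud forms x 5 publisher areas = 15 screening categories;
# keep the 100 highest-scoring videos in each category.
representative = (
    df.sort_values("combined_score", ascending=False)
      .groupby(["fraud_form", "publisher_area"])
      .head(100)
)
representative.to_csv("representative_1500.csv", index=False)
```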
3.2 Research Hypothesis
In this paper, the number of video views and the number of video likes are adopted to represent the video propagation effect. The research hypotheses are as follows:
Previous studies revealed that the number of followers on the WeChat platform significantly affects the dissemination effect of information; in addition, the authority and professionalism of publishers on the Weibo platform are also important factors affecting information dissemination
[23]. On the YouTube platform, the influence of the publisher is an important factor affecting the propagation effect: once a new video is released, the more subscribers the publisher has, the more likely the new video is to be noticed, and the total view count of the publisher's videos reflects the quality of the videos. Based on this, this paper proposes the research hypothesis:
H1: The influence of the publisher significantly affects the transmission effect of the video.
The YouTube platform classifies videos according to their content, and viewers select the videos they want to watch according to the video category. Considering this, the following assumption is made:
H2: The influence of video content category on the video transmission effect differs across categories.
With the explosive growth of information, clear themes and a fast pace have become distinctive characteristics of some highly viewed videos. Therefore, the following assumption is proposed:
H3: Video duration significantly affects the propagation effect of the video.
The technical forms of deep synthesized video mainly include facial reimplantation, face change, whole-body fraud, audio fraud and synthesized humans. Different technologies produce different video effects, so the audience's visual experience also differs. Based on this, the following assumption is made:
H4: The influence of the form of deep synthesized video on the video propagation effect differs across forms.
The technical quality of a deep synthesized video is reflected in the video clarity, the fidelity of the fake character and the fluency of movement. The higher the clarity of the deep synthesized video, the more realistic the fake characters are and the smoother the movements are; viewers then have a better watching experience and are more likely to undergo emotional changes because of the video content. Based on this, the following assumptions are made:
H5: The counterfeiting technology of deep synthesized video has a significant impact on the effect of video transmission.
H5a: The higher the clarity of the deep synthesized video, the better the video transmission effect will be.
H5b: The fidelity of the fake character in the deep synthesized video significantly affects the video propagation effect.
3.3 Variable Definition
3.3.1 Explained Variables
There are two different modes in which individuals process information: heuristic and systematic clues
[24]. The heuristic clue refers to the non-content and situational clue contained in the information itself, and the systematic clue refers to the content characteristics of the information itself. Based on the heuristic-systematic model, this paper measures the cognitive situation and emotional change of deep synthesized video. Video cognitive situation corresponds to the breadth of propagation, and emotional changes correspond to the depth of transmission
[25]. On the YouTube platform, these correspond to the number of views and the number of likes, respectively. Through quantitative and qualitative analysis, these two evaluation indicators measure the effect of deep synthesized video dissemination in two dimensions, the breadth and the depth of dissemination, and are used to study the influencing factors of the transmission effect.
Video likes: According to the number of likes, videos are divided into five levels: low, general, medium, strong and very strong. For deep synthesized videos, fewer than 5 likes is assigned a value of 1, 5 to 20 a value of 2, 20 to 40 a value of 3, 40 to 100 a value of 4, and above 100 a value of 5. For ordinary videos, fewer than 200 likes is assigned 1, 200 to 2,000 is 2, 2,000 to 10,000 is 3, 10,000 to 100,000 is 4, and above 100,000 is 5.
Video views: The number of video views is divided into five levels: low, general, medium, strong and very strong. For deep synthesized videos, fewer than 50 views is assigned 1, 50 to 500 is 2, 500 to 1,000 is 3, 1,000 to 5,000 is 4, and above 5,000 is 5. For ordinary videos, fewer than 1,000 views is assigned 1, 1,000 to 100,000 is 2, 100,000 to 1,000,000 is 3, 1,000,000 to 10,000,000 is 4, and above 10,000,000 is 5.
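A minimal sketch of this ordinal coding in Python, assuming the raw counts sit in a pandas DataFrame; the bin boundaries follow the thresholds above, while the column names and the exact treatment of boundary values are illustrative.

```python
import pandas as pd

# Thresholds from Section 3.3.1 (lower edge -1 so that zero counts fall in level 1).
DEEP_LIKE_BINS = [-1, 5, 20, 40, 100, float("inf")]
DEEP_VIEW_BINS = [-1, 50, 500, 1_000, 5_000, float("inf")]
ORD_LIKE_BINS  = [-1, 200, 2_000, 10_000, 100_000, float("inf")]
ORD_VIEW_BINS  = [-1, 1_000, 100_000, 1_000_000, 10_000_000, float("inf")]

def to_ordinal(series, bins):
    """Map a raw count to the five ordered levels 1 (low) .. 5 (very strong)."""
    return pd.cut(series, bins=bins, labels=[1, 2, 3, 4, 5]).astype(int)

# Example with a few deep synthesized videos (column names are illustrative):
deep = pd.DataFrame({"likes": [3, 17, 250], "views": [40, 800, 9_000]})
deep["like_level"] = to_ordinal(deep["likes"], DEEP_LIKE_BINS)
deep["view_level"] = to_ordinal(deep["views"], DEEP_VIEW_BINS)
print(deep)
```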
3.3.2 Explanatory Variables
Video type: Videos fall into different categories, such as film and television, music, science and technology, education, life, entertainment, and news and information. Through manual annotation, entertainment videos are assigned a value of 1 and non-entertainment videos a value of 0.
Video duration: The natural logarithm of the video duration measured in seconds.
Publisher influence: The influence of a publisher on YouTube is mainly reflected in the attention their published works receive. We use the total view count of the videos published by the publisher to represent publisher influence, and the variable is log-transformed in the empirical analysis.
Technical form: Deep synthesized videos mainly use facial reimplantation, face change, whole-body fraud, audio fraud and synthesized humans. Because face change is the main form of fraud technology in YouTube deep synthesized videos, through manual annotation the face change form is assigned a value of 1 and the other forms a value of 0.
Video clarity: Video clarity mainly includes 360P, 480P, 720P, 1080P, 1440P and 2160P. Videos of 1080P or above are assigned a value of 1, and the others a value of 0.
Video fidelity: Through manual annotation, the fidelity of the fake characters in the deep synthesized videos is ranked and marked; the videos are assigned values from 1 to 5 in order of increasing fidelity.
Table 2 Description of the variables |
Type | Variable Name | Description |
Explained Variable | Video likes | According to the size of the value, the video likes are divided into five degrees: Low, general, medium, strong and very strong: low = 1, general = 2, medium = 3, strong = 4, very strong = 5 |
| Video views | According to the size of the value, the video playback volume is divided into five degrees: Low, general, medium, strong and very strong: low = 1, general = 2, medium = 3, strong = 4, very strong = 5 |
Explanatory Variable | Video type | Entertainment = 1; non-entertainment = 0 |
Video duration | The natural log of the video duration in seconds |
Publisher influence | The natural log of the total playback of the published video |
Forms of fraud | Face change = 1; non-face change = 0 |
Video clarity | 1080P and above = 1; below 1080P = 0 |
Video fidelity | Assigned values 1-5 from low to high fidelity |
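For concreteness, the following sketch constructs the explanatory variables coded in Table 2 from a hypothetical raw data file; all file and column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw columns; this mirrors the codings in Table 2.
raw = pd.read_csv("representative_1500.csv")

features = pd.DataFrame({
    "video_type":          (raw["category"] == "entertainment").astype(int),
    "log_duration":        np.log(raw["duration_seconds"]),
    "log_publisher_views": np.log(raw["publisher_total_views"]),
    "fraud_form":          (raw["fraud_form"] == "face_change").astype(int),
    "video_clarity":       (raw["resolution_p"] >= 1080).astype(int),
    "video_fidelity":      raw["fidelity_rank"],   # already coded 1..5
})
print(features.head())
```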
3.4 Model Construction
In this paper, video type, log video duration, log publisher's total video views, forgery form, video clarity and video fidelity are taken as explanatory variables affecting the propagation effect of deep synthesized videos, while the numbers of views and likes of deep synthesized videos are taken as explained variables; the factors affecting the propagation effect of deep synthesized videos are then analyzed using ordered probit models.
For ordinary videos, after variable screening, the numbers of views and likes of ordinary videos were taken as explained variables, and video type, log video duration, log publisher's total video views and video clarity were taken as explanatory variables affecting the propagation effect of ordinary videos; the influencing factors were again analyzed using the ordered probit model. Let $c_1 < c_2 < c_3 < c_4$ be the cut-off values of the dependent variable $y$, and let $y^*$ be the corresponding latent variable, $y^* = \boldsymbol{x}'\boldsymbol{\beta} + \varepsilon$; the observed level of $y$ is derived from the comparative relationship between $y^*$ and the cut-offs:
$$ y = k \quad \text{if } c_{k-1} < y^* \le c_k, \qquad k = 1, \dots, 5, $$
with $c_0 = -\infty$ and $c_5 = +\infty$.
Based on the above relationship, the response probability for each level $k$ of the dependent variable given the independent variables $\boldsymbol{x}$ can be calculated as
$$ P(y = k \mid \boldsymbol{x}) = F(c_k - \boldsymbol{x}'\boldsymbol{\beta}) - F(c_{k-1} - \boldsymbol{x}'\boldsymbol{\beta}), $$
where $F(\cdot)$ is the distribution function (the standard normal distribution for the ordered probit model).
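The ordered probit model described above can be fitted, for example, with statsmodels' OrderedModel, as in the sketch below; the data file and column names are hypothetical, and distr="probit" selects the standard normal distribution function F.

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical data file with columns matching Table 2.
df = pd.read_csv("deep_synthesized_videos.csv")

X = df[["video_type", "log_duration", "log_publisher_views",
        "fraud_form", "video_clarity", "video_fidelity"]]
# Ordinal response coded 1..5; declare it as an ordered categorical.
y = df["view_level"].astype(
    pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True))

# Ordered probit: the model estimates the coefficients and four cut-off terms.
model = OrderedModel(y, X, distr="probit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```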
4 Empirical Analysis
4.1 Descriptive Analysis
Descriptive statistical analysis is performed on the 1,500 deep synthesized videos and 1,500 ordinary videos lasting no more than 4 minutes, as shown in Table 3. In terms of transmission effect, the numbers of views and likes of deep synthesized videos are lower than those of ordinary videos on the whole, and the gap is large. In terms of title expression, fewer deep synthesized videos use special sentence titles than ordinary videos. In terms of video type, deep synthesized videos are mostly entertainment, while fewer ordinary videos are of the entertainment type. In terms of video duration, the overall duration of ordinary videos is longer than that of deep synthesized videos. In terms of the total number of views of videos posted by the publisher, the average for ordinary videos is much higher. In terms of video clarity, more ordinary videos are above 1080P than deep synthesized videos. In addition, the forgery form of deep synthesized videos is mostly face change, and video fidelity is mostly in the middle range.
Table 3 Descriptive analysis |
Variable Name | Deep Synthesized Video | | Ordinary Video |
Mean | Standard Deviation | Min | Max | | Mean | Standard Deviation | Min | Max |
Views (times) | 128479.000 | 1597507.000 | 1.000 | 39553709.000 | | 12170976.000 | 1697014.370 | 8.000 | 1378705743.000 |
Likes (times) | 1977.409 | 21159.570 | 0.000 | 600353.000 | | 215365.700 | 841377.800 | 0.000 | 12390000.000 |
Description of the title | 0.002 | 0.045 | 0.000 | 1.000 | | 0.009 | 0.093 | 0.000 | 1.000 |
Video type | 0.810 | 0.010 | 0.000 | 1.000 | | 0.156 | 0.363 | 0.000 | 1.000 |
Video duration (seconds) | 37.565 | 45.871 | 2.000 | 238.000 | | 121.899 | 71.071 | 8.000 | 239.000 |
Total view amount of videos posted by publisher (times) | 57854122.740 | 689816045.500 | 37.000 | 17774778161.000 | | 8830516181.000 | 34273523832.000 | 16753.000 | 207316000000.000 |
Forms of fraud | 0.799 | 0.401 | 0.000 | 1.000 | | | | | |
Video clarity | 0.355 | 0.479 | 0.000 | 1.000 | | 0.873 | 0.333 | 0.000 | 1.000 |
Video fidelity | 3.601 | 1.059 | 1.000 | 5.000 | | | | | |
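A brief sketch of how the descriptive statistics in Table 3 could be reproduced with pandas, assuming the two samples are stored in separate files with illustrative column names.

```python
import pandas as pd

# Hypothetical file and column names for the two samples.
deep_df = pd.read_csv("deep_synthesized_videos.csv")
ordinary_df = pd.read_csv("ordinary_videos.csv")

summary = (
    pd.concat([deep_df.assign(group="deep synthesized"),
               ordinary_df.assign(group="ordinary")])
      .groupby("group")[["views", "likes", "duration_seconds",
                         "publisher_total_views"]]
      .agg(["mean", "std", "min", "max"])
      .round(3)
)
print(summary)
```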
4.2 The Stepwise Regression Analysis
In order to investigate in depth the significant influencing factors of the dissemination effect of deep synthesized videos, we establish ordered regression full models with video views and video likes as the explained variables, respectively, and video type, video duration, total views of videos posted by the publisher, video clarity, forgery form and video fidelity as the explanatory variables.
4.2.1 Take the Deep Synthesized Video Views as the Explained Variable
The stepwise regression full model is established with the number of deep synthesized video views as the dependent variable, and the AIC and BIC criteria are adopted to select variables. The results are shown in Table 4.
Table 4 Stepwise regression model with deep synthesized video views as the dependent variable |
Variable Name | Full-model Regression | AIC | BIC |
Cut-off term 1/2 | 1.789*** | 1.729*** | 1.789***
Cut-off term 2/3 | 2.755*** | 2.694*** | 2.755***
Cut-off term 3/4 | 3.304*** | 3.242*** | 3.304***
Cut-off term 4/5 | 3.881*** | 3.819*** | 3.881***
Video type | 0.196** | 0.203** | 0.196** |
Log video duration | 0.191*** | 0.207*** | 0.191*** |
Video clarity | 0.087 | | 0.087 |
Log total view amount of videos posted by publisher | 0.139*** | 0.140*** | 0.139*** |
Forms of fraud | 0.412*** | 0.425*** | 0.412*** |
Video fidelity | 0.042 | | 0.042 |
Global test of the model | The p-values are less than 0.001 | The p-values are less than 0.001 | The p-values are less than 0.001
| Note: ***, ** and * indicate significance at the 1%, 5% and 10% levels, respectively. Since the explained variable takes 5 different values, there are 4 different cut-off terms, which play a role similar to the intercept term in an ordinary regression model and are generally not interpreted.
The AIC criterion selects four explanatory variables: video type, video duration, the publisher's total video views, and forgery form. At the 5% significance level, all four variables chosen by the AIC are significant. The parameter estimate for the entertainment video type is significantly positive, indicating that entertainment deep synthesized videos spread more widely than non-entertainment ones; the estimate for log video duration is significantly positive, indicating that the longer the deep synthesized video, the greater its breadth of dissemination; the estimate for the publisher's total video views is significantly positive, indicating that the stronger the influence of the deep synthesized video publisher, the greater the breadth of dissemination; and the estimate for the face change forgery form is significantly positive, indicating that face change deep synthesized videos spread more widely than deep synthesized videos using other techniques. Compared to the AIC model, the BIC model adds two explanatory variables, video clarity and video fidelity; however, the significance of video clarity and video fidelity is relatively low compared with the other four highly significant explanatory variables (video type, log video duration, log publisher total video views and forgery form).
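Variable selection by AIC or BIC can be sketched as follows; with only six candidate predictors, an exhaustive search over subsets serves the same purpose as stepwise selection. The sketch assumes that statsmodels' OrderedModel results expose aic and bic, and it is an illustration rather than the authors' exact procedure.

```python
from itertools import combinations

import numpy as np
from statsmodels.miscmodels.ordinal_model import OrderedModel

def select_by_criterion(y, X, criterion="aic"):
    """Fit an ordered probit for every non-empty subset of predictors and
    return the column subset minimizing the chosen information criterion."""
    best_cols, best_score = None, np.inf
    cols = list(X.columns)
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):
            res = OrderedModel(y, X[list(subset)],
                               distr="probit").fit(method="bfgs", disp=False)
            score = res.aic if criterion == "aic" else res.bic
            if score < best_score:
                best_cols, best_score = list(subset), score
    return best_cols, best_score

# Usage with y and X from the earlier sketches:
# aic_cols, _ = select_by_criterion(y, X, "aic")
# bic_cols, _ = select_by_criterion(y, X, "bic")
```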
4.2.2 Take the Number of Deep Synthesized Video Likes as the Explained Variable
The stepwise regression full model is established with the number of deep synthesized video likes as the dependent variable, and variables are selected with the AIC and BIC criteria. The results are shown in Table 5.
Table 5 Stepwise regression model with deep synthesized video likes as the dependent variable |
Variable Name | Full-model Regression | AIC | BIC |
Cut-off term 1/2 | 1.603*** | 1.599*** | 1.603***
Cut-off term 2/3 | 2.655*** | 2.651*** | 2.655***
Cut-off term 3/4 | 3.144*** | 3.140*** | 3.144***
Cut-off term 4/5 | 4.190*** | 4.185*** | 4.190***
Video type | 1.070*** | 1.056*** | 1.070*** |
Log video duration | 0.116*** | 0.125*** | 0.116*** |
Video clarity | 0.049 | | 0.049 |
Log total view amount of videos posted by publisher | 0.162*** | 0.163*** | 0.162***
Forms of fraud | 0.300*** | 0.283*** | 0.300*** |
Video fidelity | 0.101*** | 0.102*** | 0.101***
Global test of the model | The p-values are less than 0.001 | The p-values are less than 0.001 | The p-values are less than 0.001
| Note: ***, ** and * indicate significance at the 1%, 5% and 10% levels, respectively. Since the explained variable takes 5 different values, there are 4 different cut-off terms, which play a role similar to the intercept term in an ordinary regression model and are generally not interpreted.
The BIC criterion selects six explanatory variables: video type, log video duration, video clarity, log total video views of the publisher, forgery form, and video fidelity. Five of the six variables selected by the BIC are significant at the 1% level. The parameter estimate for the entertainment video type is significantly positive, indicating that entertainment deep synthesized videos receive more likes than non-entertainment ones. The estimate for log video duration is significantly positive, indicating that longer deep synthesized videos achieve a greater depth of dissemination. The estimate for the log total video views of the publisher is significantly positive, which shows that the higher the total video views of the deep synthesized video publisher, the greater the depth of dissemination. The parameter estimate for the technical form variable is significantly positive, indicating that deep synthesized videos in the form of face change receive more likes than other forms of deep synthesized video. The parameter estimate for video fidelity is significantly negative, which shows that the higher the fidelity, the lower the number of likes of the deep synthesized video. Compared with the BIC model, the AIC model removes the explanatory variable video clarity.
4.2.3 Comparison of Ordinary Video and Deep Synthesized Video
The stepwise regression full model is established with the number of ordinary video views as the dependent variable and compared with the full model of the deep synthesized video stepwise regression. The results are shown in Table 6. At the 5% significance level, three explanatory variables are significant for ordinary videos. The parameter estimate for the entertainment video type is significantly positive, indicating that entertainment videos receive more views than non-entertainment videos. The estimate for log video duration is significantly negative, indicating that the shorter the video, the greater the breadth of dissemination; and the estimate for the log publisher's total video views is significantly positive, indicating that the higher the total video views of the ordinary video publisher, the greater the breadth of dissemination. As with ordinary videos, the parameter estimates for the entertainment video type and the publisher's total video views are significantly positive for deep synthesized videos, but the estimate for log video duration is significantly positive, indicating that the longer the deep synthesized video, the greater the breadth of dissemination.
Table 6 The comparison of ordinary video and deep synthesized video with video views as the dependent variable |
Variable Name | Full-model Regression |
Ordinary Video | Deep Synthesized Video |
Cut-off term 1/2 | 0.764*** | 1.789***
Cut-off term 2/3 | 1.582*** | 2.755***
Cut-off term 3/4 | 2.051*** | 3.304***
Cut-off term 4/5 | 2.822*** | 3.881***
Video type | 0.193** | 0.196** |
Log video duration | 0.085** | 0.191*** |
Video clarity | 0.065 | 0.087 |
Log total view amount of videos posted by publisher | 0.123*** | 0.139***
Forms of fraud | | 0.412*** |
Video fidelity | | 0.042
Global test of the model | The p-values are less than 0.001 | The p-values are less than 0.001
| Note: ***, ** and * indicate significance at the 1%, 5% and 10% levels, respectively. Since the explained variable takes 5 different values, there are 4 different cut-off terms, which play a role similar to the intercept term in an ordinary regression model and are generally not interpreted.
The stepwise regression full model is established with the number of ordinary video likes as the dependent variable and compared with the full model of the deep synthesized video ordered regression. The results are shown in Table 7. At the 5% significance level, two explanatory variables are significant for ordinary videos. The parameter estimate for video clarity of 1080P and above is significantly negative, indicating that ordinary videos of 1080P and above receive fewer likes than lower-definition ordinary videos; the estimate for the log publisher's total video views is significantly positive, indicating that the higher the total video views of the ordinary video publisher, the more likes the video receives. In contrast, for deep synthesized videos the parameter estimates for the entertainment video type, log video duration and the publisher's total video views are all significantly positive, while video clarity is not significant.
Table 7 The comparison of ordinary video and deep synthesized video with video likes as the dependent variable |
Variable Name | Full-model Regression |
Ordinary Video | Deep Synthesized Video |
Cut-off term 1/2 | 1.495*** | 1.603***
Cut-off term 2/3 | 2.386*** | 2.655***
Cut-off term 3/4 | 3.030*** | 3.144***
Cut-off term 4/5 | 3.900*** | 4.190***
Video type | 0.133* | 1.070*** |
Log video duration | 0.005 | 0.116*** |
Video clarity | 0.168** | 0.049 |
Log total view amount of videos posted by publisher | 0.149*** | 0.162***
Forms of fraud | | 0.300*** |
Video fidelity | | 0.101***
Global test of the model | The p-values are less than 0.001 | The p-values are less than 0.001
| Note: ***, ** and * indicate significance at the 1%, 5% and 10% levels, respectively. Since the explained variable takes 5 different values, there are 4 different cut-off terms, which play a role similar to the intercept term in an ordinary regression model and are generally not interpreted.
5 Conclusions and Policy Recommendations
5.1 Conclusions
5.1.1 Factors Affecting the Transmission Effect of Deep Synthesized Video
In this paper, we analyze the factors influencing the propagation effect of deep synthesized videos by using the number of video views and the number of likes as indicators of the video propagation effect, respectively. From the perspective of video views, the transmission effect of entertainment deep synthesized videos is better than that of non-entertainment ones; video duration has a significant positive influence on the transmission effect; publisher influence has a significant positive influence on the transmission effect; and the transmission effect of face change deep synthesized videos is better than that of videos using other techniques. When video likes are used to represent the propagation effect, the transmission effect of deep synthesized videos differs significantly across content types, with entertainment videos performing better; video duration has a significant positive impact on the propagation effect; the publisher's total video views have a significant positive impact on the propagation effect; the transmission effect of face change deep synthesized videos is better than that of non-face-change ones; and video fidelity has a significant negative impact on the propagation effect of deep synthesized videos.
Whether the transmission effect of deep synthesized video is measured by the amount of video likes or the amount of video views, the transmission effect of deep synthesized video of different content types is significantly different, which indicates that YouTube platform users tend to watch entertainment deep synthesized videos for the purpose of relaxation and pleasure when watching videos.
In the era of fast entertainment, viewers prefer videos that provide information and emotional value in the shortest possible time, which places high requirements on video duration. This paper focuses on deep synthesized videos of 4 minutes or less; within this range, the longer the video, the richer the information it contains and the more likely it is to provide emotional value to the viewer and thus be shared. Therefore, among YouTube deep synthesized videos of 4 minutes or less, longer videos spread more effectively.
The publisher influence of YouTube platform also plays an important role in the dissemination effect of videos. The more effective a publisher's previously uploaded videos are, the more likely their videos will be recommended, which in turn will reach more viewers. Therefore, the total views of videos published by the publisher has a significant positive impact on the effect of video transmission.
On YouTube, there are five main forms of deep synthesized video forgery: facial reimplantation, face change, whole-body fraud, audio fraud and synthesized humans. Different forms of forgery have different technical requirements; face change technology is more mature, and the resulting video quality is better. Therefore, the propagation effect of deep synthesized videos in the form of face change is better.
To sum up, the dissemination effect of deep synthesis videos on YouTube platform is mainly influenced by factors such as video type, video length, publisher's influence and the form of forgery.
5.1.2 Comparison of the Factors Influencing the Transmission Effect of Deep Synthesized Video and Ordinary Video
From the descriptive analysis results, the transmission effect of ordinary videos on the YouTube platform is significantly better than that of deep synthesized videos. The number of ordinary videos with special sentence titles and subtitles is significantly higher than that of deep synthesized videos. At the same time, the proportion of entertainment content among deep synthesized videos is significantly higher than among ordinary videos, the overall duration of ordinary videos is longer than that of deep synthesized videos, and the influence of their publishers is also higher. It can be seen that deep synthesized videos on the YouTube platform are still at an early stage of development.
From the stepwise regression results, whether video views or video likes serve as the explained variable, in terms of content type both entertainment deep synthesized videos and entertainment ordinary videos spread better than non-entertainment ones. This shows that video type is an important factor when YouTube users choose what to watch: viewers tend to choose entertainment videos and give positive feedback. In terms of publisher influence, for both deep synthesized videos and ordinary videos, the stronger the publisher's influence, the better the video transmission effect. This is also related to the recommendation mechanism of the video platform: the attention publishers have accumulated improves the chance that their new videos are recommended, making it easier for newly published videos to reach viewers. It is worth noting that when the propagation effect is measured by the number of views, the duration of deep synthesized videos within 4 minutes has a significant positive impact on the propagation effect, while the duration of ordinary videos within 4 minutes has a significant negative impact. This suggests that for ordinary videos, viewers tend to watch videos that provide information and emotional value in the shortest possible time, whereas for deep synthesized videos, viewers prefer longer videos.
5.2 Policy Recommendations
Based on the empirical analysis results and the development status of deep synthesized video, the following suggestions are put forward:
First, further develop and deploy detection technologies. The forms of forgery commonly used in deep synthesized videos include audio forgery, face change, and facial reimplantation, and the dissemination of deep synthesized videos may cause serious damage in important scenarios. Therefore, governments should collaborate with industry to fund research on the further development and deployment of detection technologies, especially for use by government agencies, media organizations, and fact-checkers, and should require digital platforms to deploy detection tools, particularly to identify and mark content generated through deep forgery
[1]. For example, technologies such as watermarks and blockchain can be used to certify the media to ensure that any individual or organization can verify its source, creator, and authenticity
[12]. In addition, more and more methods can be used to identify the main spreaders; for example, Kai and Li adopted a new K-Shell decomposition method to identify influential spreaders on community networks
[25]. Moreover, as more and more deep synthesized videos take the form of audio forgery, relevant technologies can be adopted to analyze audio samples generated by artificial intelligence and compare them with human speech through spectrogram analysis.
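As a rough illustration of such spectrogram-based comparison, the sketch below computes log-mel spectrograms with librosa and correlates the averaged spectral profiles of a suspect clip and a genuine reference recording; the file names and the simple similarity measure are illustrative, not a production detector.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=16000, n_mels=80):
    """Compute a log-mel spectrogram, a common representation for
    comparing synthetic speech with genuine human speech."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# A crude comparison: correlate the time-averaged spectral profiles of a
# suspect clip and a reference recording of the same speaker.
suspect = log_mel_spectrogram("suspect_clip.wav")
reference = log_mel_spectrogram("reference_clip.wav")
similarity = np.corrcoef(suspect.mean(axis=1), reference.mean(axis=1))[0, 1]
print(f"spectral profile correlation: {similarity:.3f}")
```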
Second, strengthen the platform audit standards for deep synthesized videos. Deep synthesized videos do not necessarily deceive individuals, but the uncertainty they create may reduce the public's trust in news on social media. Deep synthesized videos do not cost much to produce, and the cost will continue to decline as the technology becomes public. On some large social platforms, deep synthesized videos that spread false information may be quickly identified and deleted before they spread, but on smaller platforms they still circulate in spaces with less network monitoring. Therefore, all platforms need to strengthen their audit standards for videos to avoid the spread of false information. In addition, platforms also need to strengthen audit standards for video publishers and support cooperation with trusted information providers, such as local and national news media
[1], and encourage platforms to help users identify sources of information so as to better assess whether the information is credible.
Third, popularize knowledge of deep synthesis technology and enhance public education. Deep synthesized videos emerge endlessly on the Internet in various forms, and it is difficult for ordinary people to identify deep forgery with the naked eye. Therefore, understanding the source of a video, the credibility of its disseminator, and its background can help individuals determine whether the video is reliable. Moreover, the cultivation of key video communicators also helps to suppress the spread of disinformation
[26].
In short, with the rapid growth in the number and spread speed of deep synthesized videos, all affected stakeholders need to take action to prevent the spread of disinformation.
5.3 Limitations
To sum up, the transmission effect of deep synthesized videos on the YouTube platform is mainly influenced by comprehensive factors such as video type, video duration, publisher influence and forgery form.
The video transmission effect on the YouTube platform is also related to multiple factors such as video audience, operating subject, platform requirements, and recommendation mechanism, so this study has some limitations. First, the study sample consists of 1,500 videos selected from deep synthesized videos of no more than 4 minutes, so the analysis results have limited generalizability. Second, the manual assignment of video fidelity and the classification of video categories are subjective, which will have some influence on the research results. In sum, this paper still has limitations in terms of research data and methods. In addition, the weights of the influencing factors are not further explored, so further research is needed.