1 Introduction
The information surrounding us is often inaccurate, incomplete, or uncertain, yet research and analysis both depend on processing that information; to draw reliable conclusions, we must therefore be able to handle rough information. Rough set theory was proposed by Pawlak in 1982 [1]. It handles knowledge on the basis of a partition structure, in which every block of a partition is regarded as a concept. The roughness of a concept reflects its classification ability: the rougher the concept, the weaker its classification ability; conversely, the more precise the concept, the stronger its classification ability and the higher its recognition degree. In this way a knowledge base is obtained with which inaccurate or uncertain knowledge can be described.
Rough set theory is a mathematical tool for dealing with inaccurate, inconsistent, or incomplete information systems. After decades of research, it has been applied in a variety of areas, especially the medical field [2-5]. Many classification algorithms based on rough set theory have been developed to analyze information systems. Data reduction is a fundamental step of knowledge discovery and an important means of finding the simplest and most general rules in an information system. Traditional rough set methods have produced many results for computing the minimal reduct set of a static data system [6]. However, when an information system contains a massive amount of data, especially noisy data, traditional methods perform comparatively poorly at predicting previously unseen objects, so finding the most stable reduct set becomes a key problem in reduction analysis [7].
Classical statistical techniques perform remarkably well on ideal static data systems, but because of their sensitivity to noise, heuristic methods are not well suited to chaotic data. Reduction in standard rough sets is based on the indiscernibility relation [8], which can be altered by even a single polluted element, so a completely different reduct set may be obtained. In reality most data is noisy, especially in incomplete data tables, where the accuracy of the attribute relations plays a vital role. When faced with a massive amount of changing data, static algorithms are not always stable: they cannot reflect local changes and waste time on repeated work.
To solve the problems listed above, the concept of dynamic reduction [9] was proposed by Bazan in 1994. With dynamic reduction, the calculation of the most stable reduct set of a given decision table becomes possible. The reduct set obtained in this way generates rules with stronger generalization ability, which predict unseen objects with greater tolerance of noise. Moreover, dynamic reduction performs much better on incomplete and chaotic data [10], and dynamic reducts have been widely applied [11, 12].
The reduct core plays a vital part in rough set theory, because it reveals the underlying meaning of a given decision table, so we have also obtained several results on the dynamic core and proved its properties. Although a number of algorithms work on dynamic reduction, there is little literature on the sampling process, because researchers have assumed that random sampling of the decision table makes the sample set representative. In this paper we first point out the flaws of the earlier sampling methods. We then present a new sampling algorithm that also determines the size of the sample set. Furthermore, to reduce the time consumed in computing dynamic reducts, we propose a quick reduction algorithm with lower computational complexity.
2 Dynamic Core
Dynamic reduction is built on rough set theory. It has an advantage over static methods, especially on massive and chaotic data systems, because the dynamic reduct set contains the reducts that remain most stable under random sampling of the original decision table.
Let $A = (U, C \cup \{d\})$ be a decision table, $U$ be the universe, and $d$ be its decision attribute. By $P(A)$ we denote the set of all subtables of $A$. For $F \subseteq P(A)$, by $DR(A, F)$ we denote the set
$$DR(A, F) = RED(A, d) \cap \bigcap_{B \in F} RED(B, d),$$
where $RED(B, d)$ denotes the set of all relative reducts of a (sub)table $B$. Any element of $DR(A, F)$ is called an $F$-dynamic reduct of $A$.
The definition of the $F$-dynamic reduct can be generalized to the $(F, \varepsilon)$-dynamic reduct, defined as follows. Let $A = (U, C \cup \{d\})$ be a decision table, $F \subseteq P(A)$, and $\varepsilon \in [0, 1)$. By $DR_\varepsilon(A, F)$ we denote the set
$$DR_\varepsilon(A, F) = \left\{ R \in RED(A, d) : \frac{|\{B \in F : R \in RED(B, d)\}|}{|F|} \geq 1 - \varepsilon \right\}.$$
Any element of $DR_\varepsilon(A, F)$ is called an $(F, \varepsilon)$-dynamic reduct of the decision table $A$. Because this notion is used frequently in dynamic reduct theory, it is also referred to simply as a generalized reduct. Similarly, the core can be defined in dynamic reduction theory just as in static rough set theory.
If $R \in DR_\varepsilon(A, F)$ (for any $\varepsilon \in [0, 1)$), then the number
$$s_F(R) = \frac{|\{B \in F : R \in RED(B, d)\}|}{|F|}$$
is called the stability coefficient of the generalized dynamic reduct $R$ (relative to $F$).
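To make the stability coefficient concrete, here is a minimal Python sketch; it assumes the reduct sets of the original table and of every subtable in $F$ have already been computed by some standard reduct algorithm, and all names are illustrative rather than taken from the paper.

```python
def stability_coefficient(candidate, subtable_reducts):
    """Fraction of subtables in F for which `candidate` is a reduct."""
    hits = sum(1 for reds in subtable_reducts if candidate in reds)
    return hits / len(subtable_reducts)

def dynamic_reducts(reducts_of_A, subtable_reducts, epsilon):
    """(F, epsilon)-dynamic reducts: reducts of A whose stability coefficient
    over the family F is at least 1 - epsilon."""
    return {r for r in reducts_of_A
            if stability_coefficient(r, subtable_reducts) >= 1 - epsilon}

# Example: each reduct is a frozenset of attribute names.
reducts_of_A = {frozenset({"a", "b"}), frozenset({"a", "c"})}
F_reducts = [{frozenset({"a", "b"})},
             {frozenset({"a", "b"}), frozenset({"c"})},
             {frozenset({"a", "c"})}]
print(dynamic_reducts(reducts_of_A, F_reducts, epsilon=0.5))  # only {'a', 'b'} qualifies
```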
Attribute reduction is one of the key topics in rough set theory, and the computation of the attribute core is of considerable importance for computing the reduct set. The elements of the attribute core appear in every attribute reduct; otherwise the indiscernibility relation of the information system cannot be preserved. A number of heuristic algorithms exploit this property to construct the reduct set starting from the attribute core [13], which saves time and improves efficiency. We take the precision coefficient into account to elaborate the features of the dynamic core, which completes the theoretical system and lays a foundation for a deeper discussion of dynamic reducts and the dynamic core.
A great deal of literature gives a thorough discussion of the static attribute core of the traditional static reduct set of a decision table, showing that the core consists of the most stable attributes in the information system: if any element of the core is removed from the attribute set, the indiscernibility relation changes [14].
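As a concrete illustration of the static core, the following Python sketch uses the usual positive-region criterion (an attribute belongs to the core when dropping it shrinks the positive region of the decision); the data layout, a list of attribute-value dictionaries with decision attribute "d", and the function names are our own assumptions, not notation from the paper.

```python
def indiscernibility_classes(table, attrs):
    """Partition object indices by their values on `attrs`."""
    classes = {}
    for i, obj in enumerate(table):
        classes.setdefault(tuple(obj[a] for a in attrs), set()).add(i)
    return classes.values()

def positive_region(table, attrs, decision="d"):
    """Indices of objects whose `attrs`-class is consistent on the decision."""
    pos = set()
    for cls in indiscernibility_classes(table, attrs):
        if len({table[i][decision] for i in cls}) == 1:
            pos |= cls
    return pos

def static_core(table, cond_attrs, decision="d"):
    """Attributes whose removal changes the positive region."""
    full_pos = positive_region(table, list(cond_attrs), decision)
    return {a for a in cond_attrs
            if positive_region(table, [b for b in cond_attrs if b != a], decision) != full_pos}
```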
The dynamic core is still worth discussing. Owing to the limitations of the sample, the attribute core of an information system can be unstable and sensitive to random changes of the objects. The dynamic reduct set consists, to some extent, of the most stable reducts, so it is necessary to seek the dynamic core in order to obtain reducts of high reliability with better generalization and robustness.
Let $A = (U, C \cup \{d\})$ be a decision table. We denote the set
$$CORE(A, d) = \bigcap_{R \in RED(A, d)} R$$
as the reduct core set of $A$. The reduct core set of any subtable $B \in P(A)$ is denoted $CORE(B, d)$. For $F \subseteq P(A)$, by $CORE(A, F)$ we denote the set
$$CORE(A, F) = CORE(A, d) \cap \bigcap_{B \in F} CORE(B, d),$$
called the $F$-dynamic core of $A$ (relative to $F$).
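Under the intersection-of-cores definition given above, the $F$-dynamic core can be sketched by intersecting the static core of $A$ with the cores of randomly drawn subtables; this reuses `static_core` from the earlier sketch, and the sampling parameters are illustrative assumptions.

```python
import random

def f_dynamic_core(table, cond_attrs, n_subtables=20, frac=0.7, decision="d"):
    """F-dynamic core: attributes lying in the core of A and in the core of
    every subtable B in the sampled family F."""
    core = static_core(table, cond_attrs, decision)
    for _ in range(n_subtables):
        subtable = random.sample(table, int(frac * len(table)))
        core &= static_core(subtable, cond_attrs, decision)
    return core
```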
Theorem 1 Let $A = (U, C \cup \{d\})$ be a decision table, $U$ be the universe, and $d$ be its decision attribute. For $F \subseteq P(A)$, the union of the $F$-dynamic reducts of $A$ includes its $F$-dynamic core, i.e. $CORE(A, F) \subseteq \bigcup_{R \in DR(A, F)} R$.
Proof. From Formula (1), we know that $DR(A, F) \subseteq RED(A, d)$; thus every $R \in DR(A, F)$ is a reduct of $A$. For any $R \in RED(A, d)$ we have $CORE(A, d) \subseteq R$, and since $CORE(A, F) \subseteq CORE(A, d)$, this can also be written as $CORE(A, F) \subseteq R$ for every $R \in DR(A, F)$. Then
$$CORE(A, F) \subseteq \bigcup_{R \in DR(A, F)} R.$$
From Theorem 1 we can see that the $F$-dynamic core is contained in every $F$-dynamic reduct ($F \subseteq P(A)$). As the most stable core of a decision table, the dynamic core possesses the features of the static core and can be used as the basis for computing the reduct set heuristically. The definition of the $F$-dynamic core can sometimes be too strict, so in order to apply it to noisy data we need a more general notion of generalized dynamic core, called the $(F, \varepsilon)$-dynamic core. The definition is as follows.
Let $A = (U, C \cup \{d\})$ be a decision table, $U$ be the universe, and $d$ be its decision attribute. For $F \subseteq P(A)$ and $\varepsilon \in [0, 1)$, by $CORE_\varepsilon(A, F)$ we denote the set
$$CORE_\varepsilon(A, F) = \left\{ a \in CORE(A, d) : \frac{|\{B \in F : a \in CORE(B, d)\}|}{|F|} \geq 1 - \varepsilon \right\},$$
called the $(F, \varepsilon)$-dynamic core of $A$. $RED(A, d)$ denotes the set that contains all the reducts of the decision table $A$; $CORE(A, F)$ and $CORE_\varepsilon(A, F)$ denote the $F$-dynamic core and the $(F, \varepsilon)$-dynamic core respectively. Apparently, the following properties hold.
1) If , then ;
2) ;
3) If , then .
Theorem 2 Let $A = (U, C \cup \{d\})$ be a decision table, $U$ be the universe, and $d$ be its decision attribute, and let $F \subseteq P(A)$, $\varepsilon \in [0, 1)$. For a given decision table, the union of the $(F, \varepsilon)$-dynamic reducts includes its $(F, \varepsilon)$-dynamic core.
Proof. From Formula (8), we can know that
And from Formula (2), we can notice that
Apparently, , then,
So,
3 Generalized Dynamic Core
The definition of the $F$-dynamic core can sometimes be too strict, for it requires that the $F$-dynamic core be included in the core of the decision table $A$, so we also introduce a generalization of the $F$-dynamic core, called the $F$-generalized dynamic core, denoted $GCORE(A, F) = \bigcap_{B \in F} CORE(B, d)$.
Similarly, we propose the $(F, \varepsilon)$-generalized dynamic core, denoted $GCORE_\varepsilon(A, F)$:
$$GCORE_\varepsilon(A, F) = \left\{ a \in C : \frac{|\{B \in F : a \in CORE(B, d)\}|}{|F|} \geq 1 - \varepsilon \right\}.$$
For brevity, it is also called the generalized reduct core.
From Theorem 1, Theorem 2 and Formula (13), the following properties can easily be proved. Let $A = (U, C \cup \{d\})$ be a decision table, $U$ be the universe, and $d$ be its decision attribute, $F \subseteq P(A)$, $\varepsilon \in [0, 1)$:
1) ;
2) ;
3) If , then .
Theorem 3 Let $A = (U, C \cup \{d\})$ be a decision table, $U$ be the universe, and $d$ be its decision attribute, $F \subseteq P(A)$, $\varepsilon \in [0, 1)$. Then:
1) The union of the $F$-generalized dynamic reducts of $A$ includes its $F$-generalized dynamic core;
2) The union of the $(F, \varepsilon)$-generalized dynamic reducts of $A$ includes its $(F, \varepsilon)$-generalized dynamic core.
Proof. From Formula (3), we can know that
And for any ,
Then
2) Based on the definition, we can know that for any attribute , , they all fulfill the condition: ; suppose such an attribute exists in subtables, that is , then, , , so,
From the definition of -generalized dynamic reduct, we can know that
Since , for any ,
According to Formula (1), so
Let $A = (U, C \cup \{d\})$ be a decision table and $F \subseteq P(A)$, $\varepsilon \in [0, 1)$; the properties below can be concluded.
1) ;
2) ;
3) ;
4) ;
5) .
By comparing the dynamic core with the static core, we can observe that the dynamic core is included in the static core, which indicates that the dynamic core is more stable and better characterizes the essential features of the information system.
4 Statistical Sampling Analysis
Two methods were proposed previously to calculate the size of the family $F$: one based on the trace of the subtables and the other based on the probability of each subtable, both of which are too subjective. To overcome this drawback, Bazan proposed an algorithm to calculate the required number $n$ of subtables when the probability of the reduct occurring in each subtable is unknown. First, the family of all subtables of the original decision table is denoted by $P(A)$, and the probability of the reduct occurring in a subtable is denoted by $p$. Since $p$ cannot actually be obtained, the stability coefficient of the generalized dynamic reduct of the decision table is taken as the maximum likelihood estimator $\hat{p}$ of the probability $p$ [15].
From the de Moivre-Laplace theorem, we know that the standardized estimator $\dfrac{\hat{p} - p}{\sqrt{p(1-p)/n}}$ has approximately a standard normal distribution. Hence
$$P\!\left(|\hat{p} - p| \le u_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right) \approx 1 - \alpha.$$
If $\delta$ is the maximal acceptable estimation error of $p$, then
$$n \ge \frac{u_{\alpha/2}^{2}\, p(1-p)}{\delta^{2}}.$$
It is apparent that when $p = 0.5$, $p(1-p)$ takes its maximum value of $0.25$, so
$$n \ge \frac{u_{\alpha/2}^{2}}{4\delta^{2}}.$$
Looking at this formula, however, we can observe that the resulting value of $n$ is the same regardless of the value of $p$, which is not reasonable. The practical meaning of $p$ is the probability of the reduct existing in a subtable, which means we want those reducts whose $p$ is as large as possible; when the value of $p$ changes, the value of $n$ should change accordingly instead of remaining constant.
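As a quick numerical check of the worst-case bound above (the helper name and parameter defaults are ours):

```python
from math import ceil
from statistics import NormalDist

def worst_case_sample_size(confidence=0.95, delta=0.05):
    """Smallest n with n >= u_{alpha/2}^2 / (4 * delta^2), the p = 0.5 worst case."""
    u = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # u_{alpha/2}
    return ceil(u * u / (4 * delta * delta))

print(worst_case_sample_size())            # 385 subtables for 95% confidence, delta = 0.05
print(worst_case_sample_size(delta=0.10))  # 97 -- the bound ignores p entirely
```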
We propose a new method for computing the size of the subtable family $F$. First, a new measure of the similarity between a subtable and the original decision table is given, on the basis of which the stability of the family $F$ is defined. Then the new algorithm for determining the size of $F$ is described, and several parameters are given to evaluate the quality of the family.
The stability of the dynamic reduct set shows how effective the sample set is and whether it has the same representativeness as the original decision table. From rough set theory we know that the relative positive region represents the classification ability of a given decision table, so we can compare the positive regions of a subtable and of the original decision table to see whether they are similar, and in this way evaluate the quality of the whole sample set.
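A minimal sketch of that comparison, reusing `positive_region` from the earlier sketch; the similarity shown here (the share of a subtable's positive-region objects that also lie in the full table's positive region) is only an illustrative stand-in for the parameter defined below.

```python
def positive_region_similarity(full_table, sub_indices, cond_attrs, decision="d"):
    """Illustrative similarity between a subtable (object indices into the full
    table) and the full table; not the paper's exact parameter."""
    subtable = [full_table[i] for i in sub_indices]
    sub_pos = {sub_indices[j] for j in positive_region(subtable, cond_attrs, decision)}
    full_pos = positive_region(full_table, cond_attrs, decision)
    return len(sub_pos & full_pos) / len(sub_indices) if sub_indices else 0.0
```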
Let be a decision table, the division of (relative and respectively) be and , ,
Let be a decision table, , is the full sample set where the set is defined as , then the parameter which evaluates the similarity of the subtable to the decision table is defined as follows
Considering that the cardinal numbers of and are different, the size of the universe has to be taken into account when computing the parameter of sets with different sizes. So the formula can be transformed into: When
It can be concluded that the decision table and the subtable have the same representativeness. When , it can be transformed into
Let be a decision table, , the relative stability parameter of sample family is
It can easily be observed that the following properties hold:
1) ;
2) Apparently, when , the present sample set is not proper, so we need to apply a sample set which meets the requirement .
Based on the concepts presented above, the algorithm for determining the size of the family $F$ is listed as follows (a code sketch is given after the listing):
Algorithm 1: Quick sampling algorithm in dynamic reduction
Input
1) The original decision table ;
2) The threshold of , denoted , e.g., .
Output
1) The value of ;
2) All the subtables in family.
Step 1 Randomly extract a sample set of subtables from the original decision table, denoted $F$;
Step 2 Calculate ;
Step 3 Compare with , if , go to Step 7, otherwise, go to Step 4;
Step 4 Randomly choose an integer from 1 to as the number of new subtable samples to be extracted from the original decision table; the new subtables must be different from those already in $F$;
Step 5 Add the new subtables extracted from Step 4 to ;
Step 6 Execute Step 2;
Step 7 Output and all the subtables.
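A rough Python sketch of Algorithm 1; the stability measure is abstracted behind a `family_stability` callback (for example, an aggregate of the positive-region similarity above), and the subtable fraction and the upper bound on the batch size in Step 4 are assumptions.

```python
import random

def quick_sampling(table, family_stability, threshold, init_size=10, max_batch=10, frac=0.7):
    """Grow the family F of random subtables until family_stability(F) >= threshold."""
    def draw():
        return tuple(sorted(random.sample(range(len(table)), int(frac * len(table)))))

    family = {draw() for _ in range(init_size)}                             # Step 1
    while family_stability([list(s) for s in family], table) < threshold:  # Steps 2-3
        batch = random.randint(1, max_batch)                                # Step 4
        while batch > 0:                                                    # Steps 4-5: only new subtables count
            s = draw()
            if s not in family:
                family.add(s)
                batch -= 1
    return len(family), [list(s) for s in family]                           # Step 7
```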
Generally speaking, the more subtables the family contains, the more stable the resulting reduct set will be. But by the Moivre-Laplace theorem, once the size of the family reaches a threshold, the effect of extracting further subtables becomes negligible.
However, there is another aspect we must emphasize: the quality of the sample set. The distribution of the sample set has a great effect on the result of the reduction. We introduce several rates to evaluate the quality of a computed sample set; an illustrative sketch follows the list.
Let be a decision table, , is the Full Sample set where the set is defined as . The Central Sample decision table is defined as , where , namely the central sample set.
1) Entire Sampling Rate .
The value of intuitively reflects how many objects have been taken into account in the final decision. When it is too low, the given sample set is relatively concentrated and not appropriate to represent the given decision table. Clearly, .
2) Entire equilibrium rate .
The value of represents the overlap coefficient of the sample set. If it is too large, too many repeated objects have been extracted during the sampling process, which reduces the efficiency of the result. Clearly,
3) Local sampling rate .
The value of describes how much each subtable contributes to the whole sampling set. When some subtables have a relatively high value of it, the sample set is not proper, owing to the high proportion of local repetition, even though the entire sampling rate is acceptable. Clearly,
4) Local equilibrium rate .
is the parameter that shows the degree to which is contained in . When a certain number of subtables have an excessively high value of it, the sampling set is not efficient even though the entire equilibrium rate meets the requirement. Clearly, .
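The exact formulas of these four rates are not reproduced here, so the following sketch only computes illustrative stand-ins for the first two (coverage of the universe and the share of repeated draws), which is enough to flag a sample set that is too concentrated or too repetitive.

```python
from collections import Counter

def sampling_diagnostics(sub_indices_family, universe_size):
    """Assumed stand-ins for the entire sampling rate and the entire equilibrium rate."""
    counts = Counter(i for sub in sub_indices_family for i in sub)
    covered = len(counts)                 # distinct objects that were ever sampled
    drawn = sum(counts.values())          # draws counted with multiplicity
    entire_sampling_rate = covered / universe_size
    entire_equilibrium_rate = (drawn - covered) / drawn if drawn else 0.0
    return entire_sampling_rate, entire_equilibrium_rate
```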
5 The Quick Generalized Dynamic Reduction Algorithm
Traditional reduction algorithms require calculating all the reducts of the subtables in the family $F$ and then comparing the stability coefficient of each reduct with the threshold to see whether it qualifies as a dynamic reduct. Calculating in this way is both time and space consuming. Based on our research, it is not necessary to calculate all the reducts; the dynamic reducts can be obtained by computing the reducts of only part of the subtables in $F$. The definition of the generalized dynamic reduct is as follows:
Let $A = (U, C \cup \{d\})$ be a decision table, $U$ be the universe, and $d$ be its decision attribute. For $F \subseteq P(A)$ and $\varepsilon \in [0, 1)$, by $GDR_\varepsilon(A, F)$ we denote the set
$$GDR_\varepsilon(A, F) = \left\{ R \subseteq C : \frac{|\{B \in F : R \in RED(B, d)\}|}{|F|} \geq 1 - \varepsilon \right\},$$
whose elements are the $(F, \varepsilon)$-generalized dynamic reducts of $A$.
From the definition presented above, we can see that a reduct of some subtable is a dynamic reduct if its rate of occurrence as a reduct of the subtables in the family $F$ is no smaller than the threshold $1 - \varepsilon$. On the other hand, it is much easier to judge whether a given attribute set is a reduct of a decision table than to compute the reducts of that table; we only need to count the frequency and compare it with the threshold.
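Checking reduct-hood only requires positive-region comparisons, which is why it is so much cheaper than enumerating reducts; a minimal sketch, reusing `positive_region` from the earlier sketch:

```python
def is_reduct(table, candidate, cond_attrs, decision="d"):
    """True if `candidate` preserves the positive region of the full condition
    set and no attribute of it can be dropped."""
    full_pos = positive_region(table, list(cond_attrs), decision)
    if positive_region(table, list(candidate), decision) != full_pos:
        return False
    return all(positive_region(table, [a for a in candidate if a != b], decision) != full_pos
               for b in candidate)
```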
Let $A = (U, C \cup \{d\})$ be a decision table, $U$ be the universe, and $d$ be its decision attribute. Let the family $F$ and the sets $F_1$ and $F_2$ satisfy $F_1 \subseteq F_2 \subseteq F$: $F_1$ contains the subtables whose reducts have to be calculated, and $F_2$ consists of the union of $F_1$ and the subtables in $F$ that have already been judged, the initial status being $F_2 = F_1$. Given that the size of the family $F$ is already known, the key problem is to determine the cardinality of $F_1$ and $F_2$.
It is obvious that an attribute set that is not a reduct of any subtable in $F_1$ cannot be a dynamic reduct once $|F_1| > \varepsilon |F|$, because its stability coefficient is then at most $1 - |F_1|/|F| < 1 - \varepsilon$. So we only need to calculate the reducts of a smaller part of the subtables to obtain the dynamic reducts.
It is easy to prove that if $R$ is an $(F, \varepsilon)$-generalized dynamic reduct of the decision table $A$, then $R$ is also a reduct of at least one decision table in the family $F$.
Proof. Let $k$ denote the number of members of $F$ among whose reducts $R$ occurs, and denote by $GDR_\varepsilon(A, F)$ the set of $(F, \varepsilon)$-generalized dynamic reducts. If some $R \in GDR_\varepsilon(A, F)$ were a reduct of no member of the family, then $k = 0$, and its stability coefficient $k/|F| = 0$ would be smaller than the threshold $1 - \varepsilon$, which is impossible.
To determine whether a given attribute set is a generalized dynamic reduct, we need a way to judge whether the candidate can still be effective. To improve the efficiency of this judgement, we can exploit the information already gathered from the subtables in $F_2$ to estimate the largest frequency with which the candidate can still occur in $F$, and compare this rate with the stability threshold. If the rate is smaller than the threshold, the candidate is definitely not a generalized dynamic reduct; otherwise, further checks are needed to make the final decision.
Let $A = (U, C \cup \{d\})$ be a decision table, $U$ be the universe, and $d$ be its decision attribute, and let $F$, $F_1$ and $F_2$ satisfy the requirements above. Three ratios are used: the present ratio, the optimal ratio, and the real ratio, defined respectively below:
Obviously, if the optimal ratio of a candidate reduct is smaller than the given stability threshold, it cannot be a generalized dynamic reduct. Based on the analysis above, we propose a quick algorithm for calculating a set of generalized dynamic reducts; a code sketch is given after the listing.
Algorithm 2: Quick generalized dynamic reduction algorithm
Input , , .
Output -generalized dynamic reduct set of .
Step 1 Calculate the cardinality of and , namely and , and extract subtables from to form the original family;
Step 2 Take objects from family to form set, and the rest form set;
Step 3 Calculate the reducts of the tables in set, and put these reducts into candidate set. The cardinality of the candidate set is , ;
Step 4 Take the reducts from the candidate set one by one. , ;
Step 5 Take a table from , , ;
Step 6 If : , calculate and . If , go to Step 5; otherwise, go to Step 8. If , go to Step 7;
Step 7 If : go to Step 4; otherwise, go to Step 5;
Step 8 . If : add into the generalized dynamic reduct set. If , go to Step 3;
Step 9 Output the generalized dynamic reduct set.
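A rough Python sketch of the pruning idea behind Algorithm 2, under assumed readings of the ratios (present ratio = confirmations so far over |F|; optimal ratio = present ratio plus the share of subtables not yet judged; real ratio = the final count over |F|). It reuses `is_reduct` from the sketch above and assumes a routine `all_reducts` that enumerates the reducts of a subtable.

```python
def quick_generalized_dynamic_reducts(subtables, cond_attrs, epsilon,
                                      all_reducts, n_seed=3, decision="d"):
    """Compute reducts for a few seed subtables only, then confirm or prune each
    candidate against the remaining subtables, stopping early once its optimal
    ratio falls below 1 - epsilon."""
    threshold = 1 - epsilon
    seed, rest = subtables[:n_seed], subtables[n_seed:]
    seed_reducts = [all_reducts(b, cond_attrs, decision) for b in seed]      # Step 3
    candidates = set().union(*seed_reducts)
    accepted = set()
    for cand in candidates:                                                  # Step 4
        hits = sum(cand in reds for reds in seed_reducts)
        judged = len(seed)
        for b in rest:                                                       # Steps 5-6
            optimal = (hits + len(subtables) - judged) / len(subtables)
            if optimal < threshold:                                          # cannot qualify any more
                break
            hits += is_reduct(b, cand, cond_attrs, decision)
            judged += 1
        if hits / len(subtables) >= threshold:                               # Step 8: real ratio
            accepted.add(cand)
    return accepted
```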
Based on our research, for a given decision table , the time complexity of calculating all the reducts is , and the time complexity of judging whether a given attribute set is a reduct of this decision table is . So for traditional algorithms, the time complexity of calculating the dynamic reducts of the family is , while the corresponding time complexity of the quick algorithm is ( is the cardinality of the reduct set of the family).
6 Experimental Analyses
We selected several information tables from the UCI database and compared the results of the former algorithm implemented in the ROSETTA software, denoted R, with our method, denoted A. The results are presented in Table 1.
Table 1 Comparison of experimental results between Algorithm A and ROSETTA (R)

| Database | Objects | Attributes | Subtables (R) | Subtables (A) | Reducts (R) | Reducts (A) | Avg. reduct length (R) | Avg. reduct length (A) | Rules (R) | Rules (A) | Avg. rule length (R) | Avg. rule length (A) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Auto original | 405 | 9 | 50 | 29 | 26 | 25 | 3.385 | 3.331 | 10436 | 9775 | 3385 | 3211 |
| Breast cancer | 197 | 11 | 50 | 24 | 141 | 130 | 3.262 | 3.002 | 15003 | 12332 | 3249 | 3012 |
| CPU performance machine | 208 | 10 | 50 | 27 | 136 | 112 | 4.404 | 4.102 | 27127 | 18004 | 4399 | 4013 |
| Echo cardiogram | 63 | 13 | 50 | 12 | 74 | 52 | 2.338 | 2.043 | 3787 | 3214 | 2381 | 2011 |
| Flag | 24 | 11 | 50 | 12 | 181 | 152 | 3.945 | 3.233 | 3153 | 2756 | 4083 | 3356 |
| Glass | 213 | 11 | 50 | 26 | 35 | 20 | 2.286 | 2.280 | 7222 | 5801 | 2288 | 2286 |
| Servo | 166 | 5 | 50 | 23 | 1 | 1 | 4.000 | 4.000 | 166 | 166 | 4000 | 4000 |
| Wine | 177 | 14 | 50 | 20 | 178 | 89 | 2.646 | 2.444 | 31282 | 28046 | 2648 | 2454 |
| Zoo | 100 | 18 | 50 | 18 | 1065 | 813 | 6.255 | 6.101 | 25625 | 24610 | 6424 | 6121 |

*We applied the trace method for sampling in R. Objects: the number of objects; Attributes: the number of attributes; Avg. reduct length: the average length of reducts; Rules: the number of rules; Avg. rule length: the average length of rules.
It is clear that the sample set of our method is considerably smaller than that of the ROSETTA method. Moreover, comparing all the other parameters in the table, our method performs slightly better than the ROSETTA method. In conclusion, our method obtains a better result with a smaller sample set, which shows that it is efficient.
7 Conclusion
The dynamic core is the most stable attribute set of a given information system and cannot be eliminated from any reduct set; it reflects the essential characteristics of dynamic reduction. We have proved several theorems showing that the dynamic core is contained in the union of the dynamic reducts, as well as in the static core, which means that the dynamic core plays a vital role in dynamic reduct computation. A new systematic method of dynamic reduction has been presented, from the sampling process through to the reduct computation stage. The experiments show that the new sampling method and the quick reduction algorithm outperform the traditional rough set reduction methods, avoiding an overly subjective sample set and decreasing the computing time. The results prove to be efficient.