Rough Sets in KDD
Tutorial Notes
Andrzej Skowron
Warsaw University
skowron@mimuw.edu.pl
Ning Zhong
Maebashi Institute of Technology
zhong@maebashi-it.ac.jp
Copyright 2000 by A. Skowron & N. Zhong
About the Speakers
Andrzej Skowron received his Ph.D. from Warsaw University. He is a professor in the Faculty of Mathematics, Computer Science and Mechanics, Warsaw University, Poland. His research interests include soft computing methods and applications, in particular reasoning with incomplete information, approximate reasoning, rough sets, rough mereology, granular computing, synthesis and analysis of complex objects, intelligent agents, and knowledge discovery and data mining, with over 200 journal and conference publications. He is an editor of several international journals and book series, including Fundamenta Informaticae (editor-in-chief) and Data Mining and Knowledge Discovery. He is president of the International Rough Set Society. He has been an invited speaker at many international conferences, and has served or is currently serving on the program committees of over 40 international conferences and workshops, including ISMIS’97-99 (program chair), RSCTC’98-00 (program chair), and RSFDGrC’99 (program chair).
About the Speakers (2)
Ning Zhong received his Ph.D. from the University of Tokyo. He is head of the Knowledge Information Systems Laboratory and an associate professor in the Department of Information Engineering, Maebashi Institute of Technology, Japan. His research interests include knowledge discovery and data mining, rough sets and granular-soft computing, intelligent agents and databases, Web intelligence (WI), and knowledge information systems, with over 80 journal and conference publications. He is an editor of Knowledge and Information Systems: An International Journal (Springer). He is a member of the advisory board of the International Rough Set Society, the Steering Committee of the IEEE International Conferences on Data Mining and of the PAKDD conferences, and the advisory board of BISC/SIGGrC. He has served or is currently serving on the program committees of over 30 international conferences and workshops, including PAKDD’99 (program chair), IAT’99 and 2001 (program chair), RSFDGrC’99 (program chair), and WI’2001 (program chair).
Contents
Introduction
Basic Concepts of Rough Sets
A Rough Set Based KDD process
Rough Sets in ILP and GrC
Concluding Remarks (Summary, Advanced Topics, References and Further Readings).
Introduction
Rough set theory was developed by Zdzislaw Pawlak in the early 1980s.
Representative Publications:
Z. Pawlak, “Rough Sets”, International Journal of Computer and Information Sciences, Vol. 11, 341-356 (1982).
Z. Pawlak, Rough Sets – Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers (1991).
Introduction (2)
The main goal of rough set analysis is the induction of approximations of concepts.
Rough sets constitute a sound basis for KDD, offering mathematical tools to discover patterns hidden in data.
They can be used for feature selection, feature extraction, data reduction, decision rule generation, and pattern extraction (templates, association rules), etc.
The approach identifies partial or total dependencies in data, eliminates redundant data, and provides ways to handle null values, missing data, dynamic data, and more.
Introduction (3)
Recent extensions of rough set theory (rough mereology) have developed new methods for decomposition of large data sets, data mining in distributed and multi-agent systems, and granular computing.
This presentation shows how several aspects of the above problems are solved by the (classic) rough set approach, discusses some advanced topics, and gives further research directions.
Basic Concepts of Rough Sets
Information/Decision Systems (Tables)
Indiscernibility
Set Approximation
Reducts and Core
Rough Membership
Dependency of Attributes
Information Systems/Tables
An information system IS is a pair (U, A), where
U is a non-empty finite set of objects, and
A is a non-empty finite set of attributes such that a : U → V_a for every a ∈ A.
The set V_a is called the value set of a.
Age LEMS
x1 16-30 50
x2 16-30 0
x3 31-45 1-25
x4 31-45 1-25
x5 46-60 26-49
x6 16-30 26-49
x7 46-60 26-49
Decision Systems/Tables
DS: T = (U, A ∪ {d}), where d ∉ A is the decision attribute (instead of one decision attribute we can also consider several).
The elements of A are called the condition attributes.
Age LEMS Walk
x1 16-30 50 yes
x2 16-30 0 no
x3 31-45 1-25 no
x4 31-45 1-25 yes
x5 46-60 26-49 no
x6 16-30 26-49 yes
x7 46-60 26-49 no
Issues in the Decision Table
The same or indiscernible objects may be represented several times.
Some of the attributes may be superfluous.
Indiscernibility
The equivalence relation
A binary relation which is reflexive (xRx for any object x) ,
symmetric (if xRy then yRx), and
transitive (if xRy and yRz then xRz).
The equivalence class of an element x ∈ U
consists of all objects y ∈ U such that xRy.
Indiscernibility (2)
Let IS = (U, A) be an information system; then with any B ⊆ A there is an associated equivalence relation:
IND(B) = {(x, x′) ∈ U × U | a(x) = a(x′) for every a ∈ B},
where IND(B) is called the B-indiscernibility relation.
If (x, x′) ∈ IND(B), then objects x and x′ are indiscernible from each other by attributes from B.
The equivalence classes of the B-indiscernibility relation are denoted [x]_B.
An Example of Indiscernibility
The non-empty subsets of the condition attributes are {Age}, {LEMS}, and {Age, LEMS}.
IND({Age}) = {{x1,x2,x6}, {x3,x4}, {x5,x7}}
IND({LEMS}) = {{x1}, {x2}, {x3,x4}, {x5,x6,x7}}
IND({Age,LEMS}) = {{x1}, {x2}, {x3,x4}, {x5,x7}, {x6}}.
Age LEMS Walk
x1 16-30 50 yes
x2 16-30 0 no
x3 31-45 1-25 no
x4 31-45 1-25 yes
x5 46-60 26-49 no
x6 16-30 26-49 yes
x7 46-60 26-49 no
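The IND computations above can be sketched in a few lines of Python (a minimal illustration, not part of the original tutorial; the dictionary TABLE and helper name ind are ours):

```python
# A minimal sketch of IND(B) on the Age/LEMS table above; attribute values
# are kept as the interval strings shown on the slide.
TABLE = {
    "x1": {"Age": "16-30", "LEMS": "50"},
    "x2": {"Age": "16-30", "LEMS": "0"},
    "x3": {"Age": "31-45", "LEMS": "1-25"},
    "x4": {"Age": "31-45", "LEMS": "1-25"},
    "x5": {"Age": "46-60", "LEMS": "26-49"},
    "x6": {"Age": "16-30", "LEMS": "26-49"},
    "x7": {"Age": "46-60", "LEMS": "26-49"},
}

def ind(table, B):
    """Partition the objects into B-indiscernibility classes."""
    classes = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in B)      # the value vector on B
        classes.setdefault(key, set()).add(obj)
    return sorted(classes.values(), key=min)

# three classes for B = {Age}: {x1,x2,x6}, {x3,x4}, {x5,x7}
print(ind(TABLE, ["Age"]))
```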
Observations
An equivalence relation induces a partitioning of the universe.
The partitions can be used to build new subsets of the universe.
Subsets that are most often of interest have the same value of the decision attribute.
It may happen, however, that a concept such as “Walk” cannot be defined in a crisp manner.
Set Approximation
Let T = (U, A), B ⊆ A, and X ⊆ U. We can approximate X using only the information contained in B by constructing the B-lower and B-upper approximations of X, denoted B_*(X) and B^*(X) respectively, where
B_*(X) = {x | [x]_B ⊆ X},
B^*(X) = {x | [x]_B ∩ X ≠ ∅}.
Set Approximation (2)
B-boundary region of X: BN_B(X) = B^*(X) − B_*(X)
consists of those objects that we cannot decisively classify into X on the basis of B.
B-outside region of X: U − B^*(X)
consists of those objects that can with certainty be classified as not belonging to X.
A set is said to be rough if its boundary region is non-empty; otherwise the set is crisp.
An Example of Set Approximation
Let W = {x | Walk(x) = yes}. Then A_*(W) = {x1, x6}, A^*(W) = {x1, x3, x4, x6}, and BN_A(W) = {x3, x4}.
The decision class Walk is rough since the boundary region is not empty.
Age LEMS Walk
x1 16-30 50 yes
x2 16-30 0 no
x3 31-45 1-25 no
x4 31-45 1-25 yes
x5 46-60 26-49 no
x6 16-30 26-49 yes
x7 46-60 26-49 no
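A minimal sketch of the lower/upper approximations for the Walk example, with B the full set of condition attributes (plain Python; the names ROWS, WALK_YES, lower, and upper are ours):

```python
# B-lower and B-upper approximations of W = {x | Walk(x) = yes},
# using the Age/LEMS table above.
ROWS = {"x1": ("16-30", "50"), "x2": ("16-30", "0"), "x3": ("31-45", "1-25"),
        "x4": ("31-45", "1-25"), "x5": ("46-60", "26-49"),
        "x6": ("16-30", "26-49"), "x7": ("46-60", "26-49")}
WALK_YES = {"x1", "x4", "x6"}   # the concept W

def eq_classes():
    classes = {}
    for obj, vals in ROWS.items():
        classes.setdefault(vals, set()).add(obj)
    return list(classes.values())

def lower(X):
    # union of equivalence classes entirely inside X
    return {o for c in eq_classes() if c <= X for o in c}

def upper(X):
    # union of equivalence classes that intersect X
    return {o for c in eq_classes() if c & X for o in c}

print(sorted(lower(WALK_YES)))                    # ['x1', 'x6']
print(sorted(upper(WALK_YES) - lower(WALK_YES)))  # boundary: ['x3', 'x4']
```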
An Example of Set Approximation (2)
(Diagram: with a subset of attributes R and the partition U/R, the set X = W splits into the lower approximation {{x1}, {x6}} (yes), the boundary region {{x3, x4}} (yes/no), and the outside region {{x2}, {x5, x7}} (no).)
attributes
Lower & Upper Approximations
Lower & Upper Approximations (2)
Lower Approximation: R_*(X) = ∪ {Y ∈ U/R : Y ⊆ X}
Upper Approximation: R^*(X) = ∪ {Y ∈ U/R : Y ∩ X ≠ ∅}
Lower & Upper Approximations
(3)
X1 = {u | Flu(u) = yes} = {u2, u3, u6, u7}
R_*(X1) = {u2, u3}
R^*(X1) = {u2, u3, u6, u7, u8, u5}
X2 = {u | Flu(u) = no} = {u1, u4, u5, u8}
R_*(X2) = {u1, u4}
R^*(X2) = {u1, u4, u5, u8, u7, u6}
The indiscernibility classes defined by R = {Headache, Temp.} are {u1}, {u2}, {u3}, {u4}, {u5, u7}, {u6, u8}.
Lower & Upper Approximations (4)
R = {Headache, Temp.}
U/R = { {u1}, {u2}, {u3}, {u4}, {u5, u7}, {u6, u8}}
X1 = {u | Flu(u) = yes} = {u2,u3,u6,u7}
X2 = {u | Flu(u) = no} = {u1,u4,u5,u8}
R_*(X1) = {u2, u3}
R^*(X1) = {u2, u3, u6, u7, u8, u5}
R_*(X2) = {u1, u4}
R^*(X2) = {u1, u4, u5, u8, u7, u6}
(Diagram: the classes {u2}, {u3} lie inside X1 and {u1}, {u4} inside X2, while {u5, u7} and {u6, u8} straddle the boundary between X1 and X2.)
Properties of Approximations
B_*(X) ⊆ X ⊆ B^*(X)
B_*(∅) = B^*(∅) = ∅, B_*(U) = B^*(U) = U
B^*(X ∪ Y) = B^*(X) ∪ B^*(Y)
B_*(X ∩ Y) = B_*(X) ∩ B_*(Y)
X ⊆ Y implies B_*(X) ⊆ B_*(Y) and B^*(X) ⊆ B^*(Y)
Properties of Approximations (2)
B_*(X ∪ Y) ⊇ B_*(X) ∪ B_*(Y)
B^*(X ∩ Y) ⊆ B^*(X) ∩ B^*(Y)
B_*(−X) = −B^*(X)
B^*(−X) = −B_*(X)
where −X denotes U − X.
Four Basic Classes of Rough Sets
X is roughly B-definable, iff B_*(X) ≠ ∅ and B^*(X) ≠ U.
X is internally B-undefinable, iff B_*(X) = ∅ and B^*(X) ≠ U.
X is externally B-undefinable, iff B_*(X) ≠ ∅ and B^*(X) = U.
X is totally B-undefinable, iff B_*(X) = ∅ and B^*(X) = U.
Accuracy of Approximation
α_B(X) = |B_*(X)| / |B^*(X)|, where |X| denotes the cardinality of X ≠ ∅.
Obviously 0 ≤ α_B(X) ≤ 1.
If α_B(X) = 1, X is crisp with respect to B.
If α_B(X) < 1, X is rough with respect to B.
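For the Walk example, the accuracy comes out to 1/2. A minimal sketch (plain Python; the table ROWS and helper approx are ours, and the accuracy formula |B_*(X)| / |B^*(X)| is the one stated above):

```python
# Accuracy of approximation alpha_B(X) = |B_*(X)| / |B^*(X)| for the
# Walk table, with B = {Age, LEMS}.
from fractions import Fraction

ROWS = {"x1": ("16-30", "50"), "x2": ("16-30", "0"), "x3": ("31-45", "1-25"),
        "x4": ("31-45", "1-25"), "x5": ("46-60", "26-49"),
        "x6": ("16-30", "26-49"), "x7": ("46-60", "26-49")}
WALK_YES = {"x1", "x4", "x6"}

def approx(X):
    classes = {}
    for o, v in ROWS.items():
        classes.setdefault(v, set()).add(o)
    low = {o for c in classes.values() if c <= X for o in c}
    up = {o for c in classes.values() if c & X for o in c}
    return low, up

low, up = approx(WALK_YES)
alpha = Fraction(len(low), len(up))
print(alpha)   # 1/2: |{x1, x6}| / |{x1, x3, x4, x6}|
```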
Issues in the Decision Table
The same or indiscernible objects may be represented several times.
Some of the attributes may be superfluous (redundant).
That is, their removal cannot worsen the classification.
Reducts
Keep only those attributes that preserve the indiscernibility relation and, consequently, set approximation.
There are usually several such subsets of attributes and those which are minimal are called reducts.
Dispensable & Indispensable Attributes
Let c ∈ C.
Attribute c is dispensable in T
if POS_C(D) = POS_(C−{c})(D); otherwise
attribute c is indispensable in T.
The C-positive region of D: POS_C(D) = ∪_{X ∈ U/D} C_*(X).
Independent
T = (U, C, D) is independent
if all c ∈ C are indispensable in T.
Reduct & Core
The set of attributes R ⊆ C is called a reduct of C, if T’ = (U, R, D) is independent and POS_R(D) = POS_C(D).
The set of all the condition attributes indispensable in T is denoted by CORE(C):
CORE(C) = ∩ RED(C), where RED(C) is the set of all reducts of C.
An Example of Reducts & Core
Reduct1 = {Muscle-pain, Temp.}
Reduct2 = {Headache, Temp.}
CORE = {Headache, Temp.} ∩ {Muscle-pain, Temp.} = {Temp.}
Discernibility Matrix (relative to positive region)
Let T = (U, C, D) be a decision table, with U = {u1, u2, …, un}.
By the discernibility matrix of T, denoted M(T), we mean the n × n matrix with entries
c_ij = {a ∈ C : a(u_i) ≠ a(u_j)} if d(u_i) ≠ d(u_j), and c_ij = ∅ otherwise,
for i, j = 1, 2, …, n such that u_i or u_j belongs to the C-positive region of D.
c_ij is the set of all the condition attributes that classify objects u_i and u_j into different classes.
Discernibility Matrix (relative to positive region) (2)
The discernibility function f(T) is the conjunction of the disjunctions ∨ c_ij, taken over all non-empty entries of M(T) corresponding to the indices i, j such that u_i
or u_j belongs to the C-positive region of D.
An empty entry c_ij = ∅ denotes that this case does not need to be considered; hence it is interpreted as logical truth.
All disjuncts of the minimal disjunctive form of this function define the reducts of T (relative to the positive region).
Discernibility Function (relative to objects)
For any u_i ∈ U:
f(u_i) = ∧_j (∨ c_ij), where
(1) ∨ c_ij is the disjunction of all variables a such that a ∈ c_ij, if c_ij ≠ ∅;
(2) ∨ c_ij = false, if c_ij = ∅ and d(u_i) ≠ d(u_j);
(3) ∨ c_ij = true, if c_ij = ∅ and d(u_i) = d(u_j).
Each logical product in the minimal disjunctive normal form (DNF) of f(u_i) defines a reduct relative to the instance u_i.
Examples of Discernibility Matrix
No a b c d
u1 a0 b1 c1 y
u2 a1 b1 c0 n
u3 a0 b2 c1 n
u4 a1 b1 c1 y
C = {a, b, c}
D = {d}
In order to discern the equivalence classes of the decision attribute d, we must preserve the conditions described by the discernibility matrix of this table:
      u1     u2    u3
u2    a,c
u3    b      —
u4    —      c     a,b
The discernibility function is f = (a ∨ c) ∧ b ∧ c ∧ (a ∨ b) = b ∧ c.
Reduct = {b, c}
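The matrix-and-reduct computation for this four-object table can be sketched directly (plain Python; the brute-force reducts via minimal hitting sets is our illustration, practical only for tiny attribute sets):

```python
# Discernibility matrix and reducts for the 4-object example above.
from itertools import combinations

ROWS = {"u1": ("a0", "b1", "c1", "y"), "u2": ("a1", "b1", "c0", "n"),
        "u3": ("a0", "b2", "c1", "n"), "u4": ("a1", "b1", "c1", "y")}
ATTRS = ["a", "b", "c"]

def discernibility_entries(rows):
    """Attribute sets discerning each pair from different decision classes."""
    entries = []
    for (i, ri), (j, rj) in combinations(rows.items(), 2):
        if ri[-1] != rj[-1]:   # decisions differ
            entries.append({ATTRS[k] for k in range(3) if ri[k] != rj[k]})
    return entries

def reducts(rows):
    """Minimal attribute subsets hitting every non-empty matrix entry."""
    entries = discernibility_entries(rows)
    hitting = [set(s) for r in range(1, 4) for s in combinations(ATTRS, r)
               if all(set(s) & e for e in entries)]
    return [h for h in hitting if not any(g < h for g in hitting)]

print(reducts(ROWS))   # the only reduct is {b, c}
```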
Examples of Discernibility Matrix (2)
(Discernibility matrix with columns u1–u6 and rows u2–u7; the non-empty entries, read row by row, are:)
b,c,d   b,c
b   b,d   c,d
a,b,c,d   a,b,c   a,b,c,d
a,b,c,d   a,b,c   a,b,c,d
a,b,c,d   a,b   c,d   c,d
Core = {b}
Reduct1 = {b,c}
Reduct2 = {b,d}
Rough Membership
The rough membership function quantifies the degree of relative overlap between the set X and the equivalence class [x]_B to which x belongs:
μ_X^B(x) = |X ∩ [x]_B| / |[x]_B|, with μ_X^B : U → [0, 1].
The rough membership function can be interpreted as a frequency-based estimate of Pr(x ∈ X | u), where u = [x]_B is the equivalence class of IND(B).
Rough Membership (2)
The formulae for the lower and upper approximations can be generalized to an arbitrary level of precision π ∈ (0.5, 1] by means of the rough membership function:
B_*^π(X) = {x | μ_X^B(x) ≥ π}
B^{*,π}(X) = {x | μ_X^B(x) > 1 − π}
Note: the lower and upper approximations as originally formulated are obtained as a special case with π = 1.
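A minimal sketch of the rough membership function on the Walk table (plain Python; the table ROWS and helper mu are ours, and the formula is the overlap ratio |X ∩ [x]_B| / |[x]_B| described above):

```python
# Rough membership mu(x) = |X ∩ [x]_B| / |[x]_B| with B = {Age, LEMS}.
from fractions import Fraction

ROWS = {"x1": ("16-30", "50"), "x2": ("16-30", "0"), "x3": ("31-45", "1-25"),
        "x4": ("31-45", "1-25"), "x5": ("46-60", "26-49"),
        "x6": ("16-30", "26-49"), "x7": ("46-60", "26-49")}
WALK_YES = {"x1", "x4", "x6"}

def mu(x, X):
    eq_class = {y for y, v in ROWS.items() if v == ROWS[x]}   # [x]_B
    return Fraction(len(X & eq_class), len(eq_class))

print(mu("x3", WALK_YES))   # 1/2: [x3]_B = {x3, x4}, only x4 is in X
print(mu("x1", WALK_YES))   # 1
```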
Dependency of Attributes
Discovering dependencies between attributes is an important issue in KDD.
A set of attributes D depends totally on a set of attributes C, denoted C ⇒ D, if all values of attributes from D are uniquely determined by values of attributes from C.
Dependency of Attributes (2)
Let D and C be subsets of A. We will say that D depends on C in a degree k (0 ≤ k ≤ 1),
denoted by C ⇒_k D, if
k = γ(C, D) = |POS_C(D)| / |U|,
where POS_C(D) = ∪_{X ∈ U/D} C_*(X) is called the C-positive region of D.
Dependency of Attributes (3)
Obviously 0 ≤ k ≤ 1.
If k = 1 we say that D depends totally on C.
If k < 1 we say that D depends partially (in a degree k) on C.
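On the Walk table the dependency degree of Walk on {Age, LEMS} is 5/7: only the class {x3, x4} is contradictory. A minimal sketch (plain Python; table and helper names are ours):

```python
# Dependency degree k = gamma(C, D) = |POS_C(D)| / |U| for the Walk table.
from fractions import Fraction

ROWS = {"x1": ("16-30", "50"), "x2": ("16-30", "0"), "x3": ("31-45", "1-25"),
        "x4": ("31-45", "1-25"), "x5": ("46-60", "26-49"),
        "x6": ("16-30", "26-49"), "x7": ("46-60", "26-49")}
DECISION = {"x1": "yes", "x2": "no", "x3": "no", "x4": "yes",
            "x5": "no", "x6": "yes", "x7": "no"}

def gamma():
    classes = {}
    for o, v in ROWS.items():
        classes.setdefault(v, set()).add(o)
    pos = set()
    for c in classes.values():
        if len({DECISION[o] for o in c}) == 1:   # consistent class -> POS
            pos |= c
    return Fraction(len(pos), len(ROWS))

print(gamma())   # 5/7 — only {x3, x4} is contradictory
```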
A Rough Set Based KDD Process
Discretization based on RS and Boolean Reasoning (RSBR).
Attribute selection based on RS with Heuristics (RSH).
Rule discovery by GDT-RS.
What Are the Issues of the Real World?
Very large data sets
Mixed types of data (continuous valued, symbolic data)
Uncertainty (noisy data)
Incompleteness (missing, incomplete data)
Data change
Use of background knowledge
(Diagram: the real-world issues above — very large data sets, mixed types of data, noisy data, incomplete instances, data change, and use of background knowledge — set against representative methods such as ID3 (C4.5), Prism, Version Space, BP, and Dblearn, grouped by their underlying formalism: probability, logic, and set theory.)
Soft Techniques for KDD
(Diagram: soft techniques arranged by formalism — stochastic processes, belief networks, connectionist networks, and GDT on the probabilistic side; deduction, induction, and abduction on the logical side; rough sets and fuzzy sets on the set-theoretic side.)
Soft Techniques for KDD (2)
(Diagram: the hybrid model combines GDT, RS, TM, GrC, and RS&ILP with deduction, induction, and abduction.)
A Hybrid Model
GDT : Generalization Distribution Table
RS : Rough Sets
TM: Transition Matrix
ILP : Inductive Logic Programming
GrC : Granular Computing
A Rough Set Based KDD Process
Discretization based on RS and Boolean Reasoning (RSBR).
Attribute selection based on RS with Heuristics (RSH).
Rule discovery by GDT-RS.
Observations
A real-world data set always contains mixed types of data, such as continuous-valued and symbolic data.
When analyzing attributes with real values, they must undergo a process called discretization, which divides an attribute’s value range into intervals.
So far there is no unified approach to discretization, and the choice of method depends heavily on the data considered.
Discretization based on RSBR
In the discretization of a decision table T = (U, A ∪ {d}), where V_a = [l_a, r_a) is an interval of real values, we search for a partition P_a of V_a for any a ∈ A.
Any partition of V_a is defined by a sequence of so-called cuts v_1 < v_2 < … < v_k from V_a.
Any family of partitions {P_a}_{a ∈ A} can be identified with a set of cuts.
Discretization Based on RSBR (2)
In the discretization process, we search for a set of cuts satisfying some natural conditions.
U a b d
x1 0.8 2 1
x2 1 0.5 0
x3 1.3 3 0
x4 1.4 1 1
x5 1.4 2 0
x6 1.6 3 1
x7 1.3 1 1
U a b d
x1 0 2 1
x2 1 0 0
x3 1 2 0
x4 1 1 1
x5 1 2 0
x6 2 2 1
x7 1 1 1
The set of cuts: P = {(a, 0.9), (a, 1.5), (b, 0.75), (b, 1.5)}.
A Geometrical Representation of Data
(Plot: the objects x1–x7 placed in the (a, b) plane, with a ∈ {0.8, 1, 1.3, 1.4, 1.6} on the horizontal axis and b ∈ {0.5, 1, 2, 3} on the vertical axis.)
A Geometrical Representation of Data and Cuts
(Plot: the same objects together with the candidate cuts on attributes a and b.)
Discretization Based on RSBR (3)
The sets of possible values of a and b are defined by V_a = [0, 2) and V_b = [0, 4).
The sets of values of a and b on objects from U are given by
a(U) = {0.8, 1, 1.3, 1.4, 1.6};
b(U) = {0.5, 1, 2, 3}.
Discretization Based on RSBR (4)
The discretization process returns a partition of the value sets of condition attributes into intervals.
A Discretization Process
Step 1: define a set of Boolean variables BV = {p1^a, p2^a, p3^a, p4^a, p1^b, p2^b, p3^b}, where
p1^a corresponds to the interval [0.8, 1) of a,
p2^a corresponds to the interval [1, 1.3) of a,
p3^a corresponds to the interval [1.3, 1.4) of a,
p4^a corresponds to the interval [1.4, 1.6) of a,
p1^b corresponds to the interval [0.5, 1) of b,
p2^b corresponds to the interval [1, 2) of b,
p3^b corresponds to the interval [2, 3) of b.
The Set of Cuts on Attribute a
A Discretization Process (2)
Step 2: create a new decision table by using the set of Boolean variables defined in Step 1.
Let T = (U, A ∪ {d}) be a decision table, and let p_k^a be a propositional variable corresponding to the interval [v_k^a, v_{k+1}^a) for any k ∈ {1, …, n_a − 1} and a ∈ A.
A Sample Defined in Step 2
U*        p1^a p2^a p3^a p4^a p1^b p2^b p3^b
(x1,x2)    1    0    0    0    1    1    0
(x1,x3)    1    1    0    0    0    0    1
(x1,x5)    1    1    1    0    0    0    0
(x4,x2)    0    1    1    0    1    0    0
(x4,x3)    0    0    1    0    0    1    1
(x4,x5)    0    0    0    0    0    1    0
(x6,x2)    0    1    1    1    1    1    1
(x6,x3)    0    0    1    1    0    0    0
(x6,x5)    0    0    0    1    0    0    1
(x7,x2)    0    1    0    0    1    0    0
(x7,x3)    0    0    0    0    0    1    0
(x7,x5)    0    0    1    0    0    1    0
The Discernibility Formula
The discernibility formula for the pair (x1, x2),
ψ(x1, x2) = p1^a ∨ p1^b ∨ p2^b,
means that in order to discern objects x1 and x2, at least one of the following cuts must be set:
a cut between a(0.8) and a(1),
a cut between b(0.5) and b(1),
a cut between b(1) and b(2).
The Discernibility Formulae for All Different Pairs
ψ(x1,x2) = p1^a ∨ p1^b ∨ p2^b
ψ(x1,x3) = p1^a ∨ p2^a ∨ p3^b
ψ(x1,x5) = p1^a ∨ p2^a ∨ p3^a
ψ(x4,x2) = p2^a ∨ p3^a ∨ p1^b
ψ(x4,x3) = p3^a ∨ p2^b ∨ p3^b
ψ(x4,x5) = p2^b
ψ(x6,x2) = p2^a ∨ p3^a ∨ p4^a ∨ p1^b ∨ p2^b ∨ p3^b
ψ(x6,x3) = p3^a ∨ p4^a
ψ(x6,x5) = p4^a ∨ p3^b
ψ(x7,x2) = p2^a ∨ p1^b
ψ(x7,x3) = p2^b
ψ(x7,x5) = p3^a ∨ p2^b
A Discretization Process (3)
Step 3: find a minimal subset of P that discerns all objects in different decision classes.
The discernibility Boolean propositional formula is defined as
Φ = ∧ {ψ(xi, xj) : d(xi) ≠ d(xj)}.
The Discernibility Formula in CNF Form
Φ = p2^b ∧ (p1^a ∨ p1^b ∨ p2^b) ∧ (p1^a ∨ p2^a ∨ p3^b) ∧ (p1^a ∨ p2^a ∨ p3^a) ∧ (p2^a ∨ p3^a ∨ p1^b) ∧ (p3^a ∨ p2^b ∨ p3^b) ∧ (p2^a ∨ p3^a ∨ p4^a ∨ p1^b ∨ p2^b ∨ p3^b) ∧ (p3^a ∨ p4^a) ∧ (p4^a ∨ p3^b) ∧ (p2^a ∨ p1^b) ∧ (p3^a ∨ p2^b)
The Discernibility Formula in DNF Form
Transforming the formula into DNF, we obtain four prime implicants;
{p2^a, p4^a, p2^b} is the optimal result, because it is the minimal subset of P.
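As a sketch (plain Python; the names CUTS, DATA, and code are ours), one can check that the cut set from the result slide, P = {(a, 1.2), (a, 1.5), (b, 1.5)}, yields a consistent discretized table:

```python
# Verify that the minimal cut set makes the discretized table consistent:
# no two objects with equal coded values may carry different decisions.
from bisect import bisect_left

CUTS = {"a": [1.2, 1.5], "b": [1.5]}
DATA = {"x1": (0.8, 2, 1), "x2": (1.0, 0.5, 0), "x3": (1.3, 3, 0),
        "x4": (1.4, 1, 1), "x5": (1.4, 2, 0), "x6": (1.6, 3, 1),
        "x7": (1.3, 1, 1)}

def code(a, b):
    """Replace real values by the indices of their cut intervals."""
    return (bisect_left(CUTS["a"], a), bisect_left(CUTS["b"], b))

coded = {x: code(a, b) for x, (a, b, _) in DATA.items()}
decisions = {}
for x, (a, b, d) in DATA.items():
    decisions.setdefault(coded[x], set()).add(d)
consistent = all(len(ds) == 1 for ds in decisions.values())
print(coded["x1"], consistent)   # (0, 1) True
```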
The Minimal Set Cuts
for the Sample DB
(Plot: the objects x1–x7 in the (a, b) plane with the minimal set of cuts a = 1.2, a = 1.5, and b = 1.5.)
A Result
U a b d
x1 0.8 2 1
x2 1 0.5 0
x3 1.3 3 0
x4 1.4 1 1
x5 1.4 2 0
x6 1.6 3 1
x7 1.3 1 1
U a b d
x1 0 1 1
x2 0 0 0
x3 1 1 0
x4 1 0 1
x5 1 1 0
x6 2 1 1
x7 1 0 1
The resulting set of cuts: P = {(a, 1.2), (a, 1.5), (b, 1.5)}.
A Rough Set Based KDD Process
Discretization based on RS and Boolean Reasoning (RSBR).
Attribute selection based on RS with Heuristics (RSH).
Rule discovery by GDT-RS.
Observations
A database always contains a lot of attributes that are redundant and not necessary for rule discovery.
If these redundant attributes are not removed, not only does the time complexity of rule discovery increase, but the quality of the discovered rules may also be significantly degraded.
The Goal of Attribute Selection
Finding an optimal subset of attributes in a database according to some criterion, so that a classifier with the highest possible accuracy can be induced by a learning algorithm using information only from this subset of attributes.
Attribute Selection
The Filter Approach
Preprocessing
The main strategies of attribute selection:
The minimal subset of attributes
Selection of the attributes with a higher rank
Advantage
Fast
Disadvantage
Ignoring the performance effects of the induction algorithm
The Wrapper Approach
Using the induction algorithm as a part of the search evaluation function
Possible attribute subsets: 2^N (N is the number of attributes)
The main search methods:
Exhaustive/Complete search
Heuristic search
Non-deterministic search
Advantage
Taking into account the performance of the induction algorithm
Disadvantage
The time complexity is high
Basic Ideas: Attribute Selection using RSH
Take the attributes in CORE as the initial subset.
Select one attribute each time using the rule evaluation criterion in our rule discovery system, GDT-RS.
Stop when the subset of selected attributes is a reduct.
Why Heuristics ?
The number of possible reducts can be exponential in N,
where N is the number of attributes.
Selecting the optimal reduct from all possible reducts is computationally hard, so heuristics must be used.
The Rule Selection Criteria
in GDT-RS
Selecting the rules that cover as many instances as possible.
Selecting the rules that contain as few attributes as possible, if they cover the same number of instances.
Selecting the rules with larger strengths, if they have the same number of condition attributes and cover the same number of instances.
Attribute Evaluation Criteria
Selecting the attributes that cause the number of consistent instances to increase fastest,
to obtain a subset of attributes that is as small as possible.
Selecting an attribute that has a smaller number of different values,
to guarantee that the number of instances covered by rules is as large as possible.
Main Features of RSH
It can select a good subset of attributes quickly and effectively from a large DB.
The selected attributes do not noticeably damage the performance of induction.
An Example of Attribute Selection
Condition Attributes:
a: Va = {1, 2}
b: Vb = {0, 1, 2}
c: Vc = {0, 1, 2}
d: Vd = {0, 1}
Decision Attribute:
e: Ve = {0, 1, 2}
Searching for CORE
Removing attribute a
Removing attribute a does not cause inconsistency.
Hence, a is not included in the CORE.
Searching for CORE (2)
Removing attribute b
Removing attribute b causes inconsistency.
Hence, b is included in the CORE.
Searching for CORE (3)
Removing attribute c
Removing attribute c does not cause inconsistency.
Hence, c is not included in the CORE.
Searching for CORE (4)
Removing attribute d
Removing attribute d does not cause inconsistency.
Hence, d is not included in the CORE.
Searching for CORE (5)
CORE(C)={b}
Initial subset R = {b}
Attribute b is the only indispensable attribute.
R={b}
The instances containing b0 will not be considered.
Attribute Evaluation Criteria
Selecting the attributes that cause the number of consistent instances to increase fastest,
to obtain a subset of attributes that is as small as possible.
Selecting the attribute that has a smaller number of different values,
to guarantee that the number of instances covered by a rule is as large as possible.
Selecting Attribute from {a,c,d}
1. Selecting {a}
R = {a,b}
U/{e} = {{u3, u5, u6}, {u4}, {u7}}
U/{a, b} = {{u3}, {u4}, {u7}, {u5}, {u6}}
Selecting Attribute from {a,c,d} (2)
2. Selecting {c}
R = {b,c}
U/{e} = {{u3, u5, u6}, {u4}, {u7}}
Selecting Attribute from {a,c,d} (3)
3. Selecting {d}
R = {b,d}
U/{e} = {{u3, u5, u6}, {u4}, {u7}}
Selecting Attribute from {a,c,d} (4)
3. Selecting {d}
R = {b,d}
Result: Subset of attributes= {b, d}
A Heuristic Algorithm
for Attribute Selection
Let R be the set of selected attributes, P the set of unselected condition attributes, U the set of all instances, X the set of contradictory instances, and EXPECT the threshold of accuracy.
In the initial state, R = CORE(C), P = C − R,
and k = 0.
A Heuristic Algorithm
for Attribute Selection (2)
Step 1. If k >= EXPECT, finish; otherwise calculate the dependency degree k = γ(R, D) = |POS_R(D)| / |U|.
Step 2. For each p in P, calculate its evaluation value v_p,
where max_size denotes the cardinality of the maximal subset.
A Heuristic Algorithm
for Attribute Selection (3)
Step 3. Choose the best attribute p, i.e. the one with the largest v_p, and let R = R ∪ {p}, P = P − {p}.
Step 4. Remove all consistent instances u in POS_R(D)
from X.
Step 5. Go back to Step 1.
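The loop above can be sketched in Python on the Walk table (a hypothetical simplification: the score of a candidate attribute is just the dependency-degree gain, not the max_size criterion from the slides; all names are ours):

```python
# Greedy RSH-style attribute selection: add the attribute that grows
# |POS_R(D)| the most, until no attribute improves it.
ROWS = {"x1": ("16-30", "50", "yes"), "x2": ("16-30", "0", "no"),
        "x3": ("31-45", "1-25", "no"), "x4": ("31-45", "1-25", "yes"),
        "x5": ("46-60", "26-49", "no"), "x6": ("16-30", "26-49", "yes"),
        "x7": ("46-60", "26-49", "no")}
ATTRS = {"Age": 0, "LEMS": 1}

def pos_size(attrs):
    """|POS_R(D)|: objects in decision-consistent R-indiscernibility classes."""
    groups = {}
    for o, v in ROWS.items():
        groups.setdefault(tuple(v[ATTRS[a]] for a in attrs), set()).add(o)
    return sum(len(g) for g in groups.values()
               if len({ROWS[o][-1] for o in g}) == 1)

def greedy(core=()):
    R, P = list(core), [a for a in ATTRS if a not in core]
    while P and pos_size(R) < max(pos_size(R + [p]) for p in P):
        best = max(P, key=lambda p: pos_size(R + [p]))
        R.append(best)
        P.remove(best)
    return R

print(greedy())   # both attributes get picked: POS grows 0 -> 2 -> 5
```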
Experimental Results
A Rough Set Based KDD Process
Discretization based on RS and Boolean Reasoning (RSBR).
Attribute selection based on RS with Heuristics (RSH).
Rule discovery by GDT-RS.
Main Features of GDT-RS
Unseen instances are considered in the discovery process, and the uncertainty of a rule, including its ability to predict possible instances, can be explicitly represented in the strength of the rule.
Biases can be flexibly selected for search control, and background knowledge can be used as a bias to control the creation of a GDT and the discovery process.
A Sample DB
Condition attributes: a, b, c
Va = {a0, a1} Vb = {b0, b1, b2} Vc = {c0, c1}
Decision attribute: d, Vd = {y,n}
U a b c d
A Sample GDT
(GDT sketch: the columns F(x) are the possible instances a0b0c0, a0b0c1, …, a1b2c1; the rows G(x) are the possible generalizations *b0c0, *b0c1, *b1c0, *b1c1, *b2c0, *b2c1, a0*c0, …, a1b1*, a1b2*, **c0, …, a0**, a1**. Each entry is the probability that the generalization produces the instance: 1/2 for a one-wildcard generalization over the two-valued attributes a or c, 1/3 over the three-valued attribute b, and 1/6 for two-wildcard generalizations such as **c0 or a0**.)
Explanation for GDT
F(x): the possible instances (PI).
G(x): the possible generalizations (PG).
The entries of the GDT give the probability relationships
between PI and PG.
Probabilistic Relationship Between PIs and PGs
The generalization a0*c0 covers the possible instances a0b0c0, a0b1c0, and a0b2c0, each with probability p = 1/3.
In general, p_ij = 1/N_i, where N_i is the number of PI
satisfying the ith PG.
Unseen Instances
Possible Instances:
yes,no,normal
yes, no, high
yes, no, very-high
no, yes, high
no, no, normal
no, no, very-high
(Diagram contrasts the closed world with the open world, where unseen instances must be considered.)
Rule Representation
X → Y with S
X denotes the conjunction of the conditions that a concept must satisfy.
Y denotes the concept that the rule describes.
S is a “measure of strength” with which the rule holds.
Rule Strength (1)
The strength of the generalization X
(BK is not used):
s(X) = n_obs(X) / N_X,
where n_obs(X) is the number of observed instances satisfying the generalization X, and N_X is the number of possible instances covered by X.
Rule Strength (2)
The strength of the generalization X
(BK is used): s(X) is computed as above, but each possible instance covered by X is weighted by the probability the background knowledge assigns to it.
Rule Strength (3)
The rate of noises:
r(X → Y) = 1 − n(X, Y) / n(X),
where n(X, Y) is the number of instances belonging to the class Y within the n(X) instances satisfying the generalization X.
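A minimal sketch of the strength computation, assuming s(X) = observed/possible as suggested by the worked values s({a0b1}) = 0.5, s({b1c1}) = 1, and s({c0}) = 0.167 on the following slides; the observed-instance set here (u2 = a0b1c1, u7 = a1b1c1, u4 = a1b1c0) is taken from those slides, and all helper names are ours:

```python
# Strength of a generalization over the sample DB's attribute space.
from itertools import product

ATTRS = ["a", "b", "c"]
VALUES = {"a": ["a0", "a1"], "b": ["b0", "b1", "b2"], "c": ["c0", "c1"]}
OBSERVED = {("a0", "b1", "c1"), ("a1", "b1", "c1"), ("a1", "b1", "c0")}

def strength(pattern):
    """pattern maps attributes to fixed values; missing ones are wildcards."""
    def matches(inst):
        return all(inst[ATTRS.index(k)] == v for k, v in pattern.items())
    possible = [inst for inst in product(*(VALUES[a] for a in ATTRS))
                if matches(inst)]
    observed = [inst for inst in possible if inst in OBSERVED]
    return len(observed) / len(possible)

print(strength({"a": "a0", "b": "b1"}))   # 0.5  (matches s({a0b1}))
print(strength({"b": "b1", "c": "c1"}))   # 1.0  (matches s({b1c1}))
```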
Rule Discovery by GDT-RS
Condition Attrs.: a, b, c
a: Va = {a0, a1}
b: Vb = {b0, b1, b2}
c: Vc = {c0, c1}
Class: d:
d: Vd = {y,n}
Regarding the Instances
(Noise Rate = 0)
Generating Discernibility Vector for u2
Obtaining Reducts for u2
Generating Rules from u2
The reducts for u2 are {a0, b1} and {b1, c1}.
{a0b1} covers a0b1c0 and a0b1c1 (u2); s({a0b1}) = 0.5, giving the rule {a0b1} → y.
{b1c1} covers a0b1c1 (u2) and a1b1c1 (u7); s({b1c1}) = 1, giving the rule {b1c1} → y.
Generating Rules from u2 (2)
Generating Discernibility Vector for u4
Obtaining Reducts for u4
Generating Rules from u4
The reduct for u4 is {c0}.
{c0} covers a0b0c0, a1b1c0 (u4), and a1b2c0, giving the rule {c0} → n.
Generating Rules from u4 (2)
Generating Rules from All Instances
u2: {a0b1} → y, S = 0.5
    {b1c1} → y, S = 1
u4: {c0} → n, S = 0.167
u6: {b2} → n, S = 0.25
u7: {a1c1} → y, S = 0.5
    {b1c1} → y, S = 1
The Rule Selection Criteria
in GDT-RS
Selecting the rules that cover as many instances as possible.
Selecting the rules that contain as few attributes as possible, if they cover the same number of instances.
Selecting the rules with larger strengths, if they have the same number of condition attributes and cover the same number of instances.
Generalization Belonging to Class y
{b1c1} → y with S = 1 (covers u2, u7)
{a1c1} → y with S = 1/2 (covers u7)
{a0b1} → y with S = 1/2 (covers u2)
Generalization Belonging to
Class n
{c0} → n with S = 1/6 (covers u4)
{b2} → n with S = 1/4 (covers u6)
Results from the Sample DB
(Noise Rate = 0)
Certain Rules (with the instances they cover):
{c0} → n with S = 1/6 (u4)
{b2} → n with S = 1/4 (u6)
{b1c1} → y with S = 1 (u2, u7)
Possible Rules:
{b0} → y with S = (1/4)(1/2)
{a0 ∧ b0} → y with S = (1/2)(2/3)
{a0 ∧ c1} → y with S = (1/3)(2/3)
{b0 ∧ c1} → y with S = (1/2)(2/3)
Instances covered: u1, u3, u5
Results from the Sample DB (2)
(Noise Rate > 0)
Regarding Instances
(Noise Rate > 0)
Rules Obtained from All Instances
u2: {a0b1} → y, S = 0.5
    {b1c1} → y, S = 1
u4: {c0} → n, S = 0.167
u6: {b2} → n, S = 0.25
u7: {a1c1} → y, S = 0.5
    {b1c1} → y, S = 1
u1’: {b0} → y, S = 1/4 × 2/3 = 0.167
Example of Using BK
BK: a0 => c1, 100%
Changing Strength of Generalization by BK
Without BK, the generalization {a0b1} covers a0b1c0 and a0b1c1 (u2) with probability 1/2 each, so s({a0b1}) = 0.5.
With the BK “a0 => c1, 100%”, a0b1c0 receives probability 0% and a0b1c1 (u2) receives 100%, so s({a0b1}) = 1.
Algorithm 1
Optimal Set of Rules
Step 1. Consider the instances with the same condition attribute values as one instance, called a compound instance.
Step 2. Calculate the rate of noises r for each compound instance.
Step 3. Select one instance u from U and create a discernibility vector for u.
Step 4. Calculate all reducts for the instance u by using the discernibility function.
Algorithm 1
Optimal Set of Rules (2)
Step 5. Acquire the rules from the reducts for the instance u, and revise the strength of generalization of each rule.
Step 6. Select better rules from the rules (for u) acquired in Step 5, by using the heuristics for rule selection.
Step 7. If any instance remains to be processed, go back to Step 3; otherwise go to Step 8.
Algorithm 1
Optimal Set of Rules (3)
Step 8. Finish if the number of rules selected in Step 6 for each instance is 1. Otherwise find a minimal set of rules, which contains all of the instances in the decision table.
The Issue of Algorithm 1
It is not suitable for databases with a large number of attributes.
Methods to Solve the Issue:
Finding a reduct (subset) of condition attributes in a pre-processing.
Finding a sub-optimal solution using some efficient heuristics.
Algorithm 2
Sub-Optimal Solution
Step1: Set R = {}, COVERED = {}, and SS = {all instance IDs}. For each class D_k, divide the decision table T into two parts: the current class T+ and the other classes T-.
Step2: From the attribute values v_ij of the instances (where v_ij denotes the jth value of attribute i),
Algorithm 2
Sub-Optimal Solution (2)
choose a value v with the maximal number of occurrences within the instances contained in T+, and the minimal number of occurrences within the instances contained in T-.
Step3: Insert v into R.
Step4: Delete the instance ID from SS if the instance does not contain v.
Algorithm 2
Sub-Optimal Solution (3)
Step5: Go back to Step2 until the noise rate is less than the threshold value.
Step6: Find a minimal subset R’ of R according to the strengths. Insert the resulting rule
into RS. Set R = {}, copy the instance IDs
in SS to COVERED, and
set SS = {all instance IDs} - COVERED.
Algorithm 2
Sub-Optimal Solution (4)
Step8: Go back to Step2 until all instances of T+ are in COVERED.
Step9: Go back to Step1 until all classes are handled.
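The greedy value-selection core of Steps 2–5 can be sketched as follows (a rough, hypothetical simplification: ties are broken arbitrarily, the noise-rate threshold is taken as 0, and the Step-6 strength-based minimization is omitted; the helper grow_rule and its toy tables are ours):

```python
# Grow one rule body: repeatedly pick the attribute-value pair occurring
# most in T+ and least in T-, until no negative instance is covered.
def grow_rule(t_plus, t_minus):
    rule = {}
    cov_p, cov_m = list(t_plus), list(t_minus)
    while cov_m:
        cands = {(a, v) for inst in cov_p for a, v in inst.items()
                 if a not in rule}
        # maximal occurrences within T+, minimal within T-
        best = max(cands, key=lambda av: (
            sum(i[av[0]] == av[1] for i in cov_p),
            -sum(i[av[0]] == av[1] for i in cov_m)))
        rule[best[0]] = best[1]
        cov_p = [i for i in cov_p if i[best[0]] == best[1]]
        cov_m = [i for i in cov_m if i[best[0]] == best[1]]
    return rule

# toy data: two positive instances share a = a0, one negative has a = a1
print(grow_rule([{"a": "a0", "b": "b1"}, {"a": "a0", "b": "b2"}],
                [{"a": "a1", "b": "b1"}]))   # {'a': 'a0'}
```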
Time Complexity of Alg.1&2
Time Complexity of Algorithm 1:
Time Complexity of Algorithm 2:
Let n be the number of instances in a DB,
m the number of attributes,
and N_G the number of generalizations;
N_G is less than the number of possible attribute-value combinations.
Experiments
DBs that have been tested:
meningitis, bacterial examination, cancer, mushroom,
slope-in-collapse, earthquake, contents-sell, …
Experimental methods:
Comparing GDT-RS with C4.5
Using background knowledge or not
Selecting different allowed noise rates as the threshold values
Auto-discretization or BK-based discretization.
Experiment 1
(meningitis data)
C4.5:
(from a meningitis DB with 140 records, and 38 attributes)
Experiment 1
(meningitis data) (2)
GDT-RS (auto-discretization):
Experiment 1
(meningitis data) (3)
GDT-RS (auto-discretization):
Using Background Knowledge
(meningitis data)
Never occurring together:
EEGwave(normal) ∧ EEGfocus(+)
CSFcell(low) ∧ Cell_Poly(high)
CSFcell(low) ∧ Cell_Mono(high)
Occurring with lower possibility:
WBC(low) ∧ CRP(high)
WBC(low) ∧ ESR(high)
WBC(low) ∧ CSFcell(high)
Using Background Knowledge
(meningitis data) (2)
Occurring with higher possibility:
WBC(high) ∧ CRP(high)
WBC(high) ∧ ESR(high)
WBC(high) ∧ CSF_CELL(high)
EEGfocus(+) ∧ FOCAL(+)
EEGwave(+) ∧ EEGfocus(+)
CRP(high) ∧ CSF_GLU(low)
CRP(high) ∧ CSF_PRO(low)
Explanation of BK
If the brain wave (EEGwave) is normal, the focus of brain wave (EEGfocus) is never abnormal.
If the number of white blood cells (WBC) is high, the inflammation protein (CRP) is also high.
Using Background Knowledge
(meningitis data) (3)
rule1 is generated by BK
rule1:
Using Background Knowledge
(meningitis data) (4)
rule2 is replaced by rule2’
rule2:
rule2’:
Experiment 2
(bacterial examination data)
Number of instances: 20,000
Number of condition attributes: 60
Goals:
analyzing the relationship between the bacterium-detected attribute and other attributes
analyzing what attribute-values are related to the sensitivity of antibiotics when the value of bacterium-detected is (+).
Attribute Selection
(bacterial examination data)
Class-1: bacterium-detected (+, −)
condition attributes: 11
Class-2: antibiotic sensitivity
(resistant (R), sensitive (S))
condition attributes: 21
Some Results
(bacterial examination data)
Some of the rules discovered by GDT-RS are the same as those found by C4.5, e.g.,
Other rules can only be discovered by GDT-RS, e.g.,
bacterium-detected(-)
bacterium-detected(-).
Experiment 3
(gastric cancer data)
Number of instances: 7,520
Condition Attributes: 38
Classes:
cause of death (specially, the direct death)
post-operative complication
Goals:
analyzing the relationship between the direct death and other attributes
analyzing the relationship between the post-operative complication and other attributes.
Result of Attribute Selection
(gastric cancer data)
Class: the direct death
sex, location_lon1, location_lon2, location_cir1,
location_cir2, serosal_inva, peritoneal_meta,
lymphnode_diss, reconstruction, pre_oper_comp1,
post_oper_comp1, histological, structural_atyp,
growth_pattern, depth, lymphatic_inva,
vascular_inva, ln_metastasis, chemotherapypos
(19 attributes are selected)
Result of Attribute Selection (2)
(gastric cancer data)
Class: post-operative complication
multi-lesions, sex, location_lon1, location_cir1,
location_cir2, lymphnode_diss, maximal_diam,
reconstruction, pre_oper_comp1, histological,
stromal_type, cellular_atyp, structural_atyp,
growth_pattern, depth, lymphatic_inva,
chemotherapypos
(17 attributes are selected)
Experiment 4
(slope-collapse data)
Number of instances: 3,436
(430 places were collapsed, and 3006 were not)
Condition attributes: 32
Continuous attributes among the condition attributes: 6
(extension of collapsed steep slope, gradient, altitude, thickness of surface of soil, number of active faults, and distance between slope and active fault).
Goal: find out what causes a slope to collapse.
Result of Attribute Selection
(slope-collapse data)
9 attributes are selected from 32 condition attributes:
altitude, slope azimuthal, slope shape, direction of high rank topography, shape of transverse section, position of transition line, thickness of surface of soil, kind of plant, distance between slope and active fault.
(3 of the selected attributes are continuous: altitude, thickness of surface of soil, and distance between slope and active fault)
The Discovered Rules
(slope-collapse data)
s_azimuthal(2) ∧ s_shape(5) ∧ direction_high(8) ∧ plant_kind(3) S = (4860/E)
altitude[21,25) ∧ s_azimuthal(3) ∧ soil_thick(>=45) S = (486/E)
s_azimuthal(4) ∧ direction_high(4) ∧ t_shape(1) ∧ tl_position(2) ∧ s_f_distance(>=9) S = (6750/E)
altitude[16,17) ∧ s_azimuthal(3) ∧ soil_thick(>=45) ∧ s_f_distance(>=9) S = (1458/E)
altitude[20,21) ∧ t_shape(3) ∧ tl_position(2) ∧ plant_kind(6) ∧ s_f_distance(>=9) S = (12150/E)
altitude[11,12) ∧ s_azimuthal(2) ∧ tl_position(1) S = (1215/E)
altitude[12,13) ∧ direction_high(9) ∧ tl_position(4) ∧ s_f_distance[8,9) S = (4050/E)
altitude[12,13) ∧ s_azimuthal(5) ∧ t_shape(5) ∧ s_f_distance[8,9) S = (3645/E)
…...
Other Methods for Attribute Selection
(downloadable from http://www.iscs.nus.edu.sg/liuh/)
LVW: A stochastic wrapper feature selection algorithm
LVI: An incremental multivariate feature selection algorithm
WSBG/C4.5: Wrapper of sequential backward generation
WSFG/C4.5: Wrapper of sequential forward generation
Rule induction system: C4.5
Number of runs: 10
Class: direct death
Number of selected attributes for each time:
20, 19, 21, 26, 22, 31, 21, 19, 31, 28
Result-1 (19 attributes are selected):
multilesions, sex, location_lon3, location_cir4,
liver_meta, lymphnode_diss, proximal_surg, resection_meth,
combined_rese2, reconstruction, pre_oper_comp1,
post_oper_com2, post_oper_com3, spec_histologi, cellular_atyp,
depth, eval_of_treat, ln_metastasis, othertherapypre
Results of LVW
Result-2 (19 attributes are selected):
age, typeofcancer, location_cir3, location_cir4,
liver_meta, lymphnode_diss, maximal_diam,
distal_surg, combined_rese1, combined_rese2,
pre_oper_comp2, post_oper_com1, histological,
spec_histologi, structural_atyp, depth, lymphatic_inva,
vascular_inva, ln_metastasis
(only some of these attributes are also selected by our method)
Result of LVW (2)
Result of WSFG
Rule induction system: C4.5
Results
the most relevant attribute first
Result of WSFG (2)
(class: direct death)
eval_of_treat, liver_meta, peritoneal_meta, typeofcancer,
chemotherapypos, combined_rese1, ln_metastasis, location_lon2,
depth, pre_oper_comp1, histological, growth_pattern,vascular_inva,
location_cir1,location_lon3, cellular_atyp, maximal_diam,
pre_oper_comp2, location_lon1, location_cir3, sex, post_oper_com3,
age, serosal_inva, spec_histologi, proximal_surg, location_lon4,
chemotherapypre, lymphatic_inva, lymphnode_diss, structural_atyp,
distal_surg,resection_meth, combined_rese3, chemotherapyin,
location_cir4, post_oper_comp1, stromal_type, combined_rese2,
othertherapypre, othertherapyin, othertherapypos, reconstruction,
multilesions, location_cir2, pre_oper_comp3
(the most relevant attribute first)
Result of WSBG
Rule induction system: C4.5
Result
the least relevant attribute first
Result of WSBG (2)
(class: direct death)
peritoneal_meta, liver_meta, eval_of_treat, lymphnode_diss,
reconstruction, chemotherapypos, structural_atyp, typeofcancer,
pre_oper_comp1, maximal_diam, location_lon2, combined_rese3,
othertherapypos, post_oper_com3, stromal_type, cellular_atyp,
resection_meth, location_cir3, multilesions, location_cir4,
proximal_surg, location_cir1, sex, lymphatic_inva, location_lon4,
location_lon1, location_cir2, distal_surg, post_oper_com2,
location_lon3, vascular_inva, combined_rese2, age, pre_oper_comp2,
ln_metastasis, serosal_inva, depth, growth_pattern, combined_rese1,
chemotherapyin, spec_histologi, post_oper_com1, chemotherapypre,
pre_oper_comp3, histological, othertherapypre
Result of LVI
(gastric cancer data)

Number of allowed inconsistent instances: 80
  Run:                       1   2   3   4   5
  Inconsistent instances:   79  68  49  61  66
  Selected attributes:       7  19  19  20  18

Number of allowed inconsistent instances: 20
  Run:                       1   2   3   4   5
  Inconsistent instances:   19  16  20  18  20
  Selected attributes:      49  26  28  23  26
Some Rules
Related to Direct Death
peritoneal_meta(2) ∧ pre_oper_comp1(.) ∧ post_oper_com1(L) ∧ chemotherapypos(.) S= 3*(7200/E)
location_lon1(M) ∧ post_oper_com1(L) ∧ ln_metastasis(3) ∧ chemotherapypos(.) S= 3*(2880/E)
sex(F) ∧ location_cir2(.) ∧ post_oper_com1(L) ∧ growth_pattern(2) ∧ chemotherapypos(.) S= 3*(7200/E)
location_cir1(L) ∧ location_cir2(.) ∧ post_oper_com1(L) ∧ ln_metastasis(2) ∧ chemotherapypos(.) S= 3*(25920/E)
pre_oper_comp1(.) ∧ post_oper_com1(L) ∧ histological(MUC) ∧ growth_pattern(3) ∧ chemotherapypos(.) S= 3*(64800/E)
sex(M) ∧ location_lon1(M) ∧ reconstruction(B2) ∧ pre_oper_comp1(.) ∧ structural_atyp(3) ∧ lymphatic_inva(3) ∧ vascular_inva(0) ∧ ln_metastasis(2) S=3*(345600/E)
sex(F) ∧ location_lon2(M) ∧ location_cir2(.) ∧ pre_oper_comp1(A) ∧ depth(S2) ∧ chemotherapypos(.) S= 3*(46080/E)
GDT-RS vs. Discriminant Analysis
GDT-RS:
- if-then rules
- multi-class, high-dimensional, large-scale data can be processed
- BK can be used easily
- the stability and uncertainty of a rule can be expressed explicitly
- continuous data must be discretized.
Discriminant analysis:
- algebraic expressions
- difficult to deal with multi-class data
- difficult to use BK
- the stability and uncertainty of a rule cannot be explained clearly
- symbolic data must be quantized.
GDT-RS vs. ID3 (C4.5)
GDT-RS:
- BK can be used easily
- the stability and uncertainty of a rule can be expressed explicitly
- unseen instances are considered
- the minimal set of rules covering all instances can be discovered
ID3 (C4.5):
- difficult to use BK
- the stability and uncertainty of a rule cannot be explained clearly
- unseen instances are not considered
- does not consider whether the discovered rules form a minimal set covering all instances
Rough Sets in ILP and GrC
-- An Advanced Topic --
Background and goal
The normal problem setting for ILP
Issues, observations, and solutions
Rough problem settings
Future work on RS (GrC) in ILP
ILP: Inductive Logic Programming
GrC: Granular Computing
Advantages of ILP
(Compared with Attribute-Value Learning)
It can learn knowledge which is more expressive because it is in predicate logic
It can utilize background knowledge more naturally and effectively because in ILP the examples, the background knowledge, as well as the learned knowledge are all expressed within the same logic framework.
Weak Points of ILP
(Compared with Attribute-Value Learning)
It is more difficult to handle numbers (especially continuous values) prevailing in real-world databases.
The theory and techniques of ILP are much less mature for dealing with imperfect data (uncertainty, incompleteness, vagueness, imprecision, etc. in examples, background knowledge, and the learned rules).
Goal
Applying Granular Computing (GrC), and in particular its special form, rough sets, to ILP to deal with the kinds of imperfect data that occur in large real-world applications.
Normal Problem Setting for ILP
Given:
The target predicate p
The positive examples E+ and the negative examples E- (two sets of ground atoms of p)
Background knowledge B (a finite set of definite clauses)
Normal Problem Setting for ILP (2)
To find:
Hypothesis H (the defining clauses of p) which is correct with respect to E+ and E-, i.e.
1. B ∧ H is complete with respect to E+
(i.e. B ∧ H ⊨ E+).
We also say that H covers all positive examples.
2. B ∧ H is consistent with respect to E-
(i.e. B ∧ H ⊭ E-).
We also say that H rejects all negative examples.
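The two correctness conditions can be sketched as a toy executable check; the family facts, the examples, and the hypothesis below are all hypothetical, invented only to make the check runnable (a minimal sketch, not an ILP system):

```python
# Toy rendering of the ILP correctness conditions: B together with H
# must cover every positive example (completeness) and reject every
# negative one (consistency). All facts here are hypothetical.
B = {("parent", "ann", "bob"), ("parent", "bob", "cal")}
E_pos = {("grandparent", "ann", "cal")}
E_neg = {("grandparent", "bob", "ann")}

def H(x, z):
    # Hypothesis: grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    people = {t[1] for t in B} | {t[2] for t in B}
    return any(("parent", x, y) in B and ("parent", y, z) in B
               for y in people)

complete = all(H(x, z) for (_, x, z) in E_pos)        # B ∧ H covers E+
consistent = not any(H(x, z) for (_, x, z) in E_neg)  # B ∧ H rejects E-
print(complete, consistent)  # True True
```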
Normal Problem Setting for ILP (3)
Prior conditions:
1’. B is not complete with respect to E+
(otherwise there would be no learning task at all)
2’. B ∧ E+ is consistent with respect to E-
(otherwise there would be no solution)
Everything is assumed correct and perfect.
Issues
In large, real-world empirical learning, uncertainty, incompleteness, vagueness, impreciseness, etc. are frequently observed in training examples, in background knowledge, as well as in the induced hypothesis.
Too strong a bias may miss some useful solutions, or yield no solution at all.
Imperfect Data in ILP
Imperfect output
Even if the input (examples and BK) is “perfect”, there are usually several hypotheses H that can be induced.
If the input is imperfect, we have imperfect hypotheses.
Noisy data
Erroneous argument values in examples.
Erroneous classification of examples as belonging to E+ or E-.
Imperfect Data in ILP (2)
Too sparse data
The training examples are too sparse to induce reliable H.
Missing data
Missing values: some arguments of some examples have unknown values.
Missing predicates: BK lacks essential predicates (or essential clauses of some predicates) so that no non-trivial H can be induced.
Imperfect Data in ILP (3)
Indiscernible data
Some examples belong to both E+ and E-.
Contents
Introduction
Basic Concepts of Rough Sets
A Rough Set Based KDD process
Rough Sets in ILP and GrC
Concluding Remarks (Summary, Advanced Topics, References and Further Readings).
Introduction
Rough set theory was developed by Zdzislaw Pawlak in the early 1980s.
Representative Publications:
Z. Pawlak, “Rough Sets”, International Journal of Computer and Information Sciences, Vol.11, 341-356 (1982).
Z. Pawlak, Rough Sets - Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers (1991).
Introduction (2)
The main goal of the rough set analysis is induction of approximations of concepts.
Rough set theory constitutes a sound basis for KDD. It offers mathematical tools to discover patterns hidden in data.
It can be used for feature selection, feature extraction, data reduction, decision rule generation, and pattern extraction (templates, association rules) etc.
It identifies partial or total dependencies in data, eliminates redundant data, and offers approaches to null values, missing data, dynamic data, and more.
Introduction (3)
Recent extensions of rough set theory (rough mereology) have developed new methods for decomposition of large data sets, data mining in distributed and multi-agent systems, and granular computing.
This presentation shows how several aspects of the above problems are solved by the (classic) rough set approach, discusses some advanced topics, and gives further research directions.
Basic Concepts of Rough Sets
Information/Decision Systems (Tables)
Indiscernibility
Set Approximation
Reducts and Core
Rough Membership
Dependency of Attributes
Information Systems/Tables
An information system (IS) is a pair (U, A), where
U is a non-empty finite set of objects, and
A is a non-empty finite set of attributes such that a: U → Va for every a ∈ A.
Va is called the value set of a.
Age LEMS
x1 16-30 50
x2 16-30 0
x3 31-45 1-25
x4 31-45 1-25
x5 46-60 26-49
x6 16-30 26-49
x7 46-60 26-49
Decision Systems/Tables
DS: T = (U, A ∪ {d}), where
d ∉ A is the decision attribute (instead of one we can consider more decision attributes).
The elements of A are called the condition attributes.
Age LEMS Walk
x1 16-30 50 yes
x2 16-30 0 no
x3 31-45 1-25 no
x4 31-45 1-25 yes
x5 46-60 26-49 no
x6 16-30 26-49 yes
x7 46-60 26-49 no
Issues in the Decision Table
The same or indiscernible objects may be represented several times.
Some of the attributes may be superfluous.
Indiscernibility
The equivalence relation
A binary relation R ⊆ X × X which is reflexive (xRx for any object x),
symmetric (if xRy then yRx), and
transitive (if xRy and yRz then xRz).
The equivalence class [x]_R of an element x ∈ X
consists of all objects y ∈ X such that xRy.
Indiscernibility (2)
Let IS = (U, A) be an information system; then with any B ⊆ A there is an associated equivalence relation:
IND_IS(B) = {(x, x') ∈ U² | ∀a ∈ B, a(x) = a(x')},
where IND_IS(B) is called the B-indiscernibility relation.
If (x, x') ∈ IND_IS(B), then objects x and x' are indiscernible from each other by attributes from B.
The equivalence classes of the B-indiscernibility relation are denoted [x]_B.
An Example of Indiscernibility
The non-empty subsets of the condition attributes are {Age}, {LEMS}, and {Age, LEMS}.
IND({Age}) = {{x1,x2,x6}, {x3,x4}, {x5,x7}}
IND({LEMS}) = {{x1}, {x2}, {x3,x4}, {x5,x6,x7}}
IND({Age,LEMS}) = {{x1}, {x2}, {x3,x4}, {x5,x7}, {x6}}.
Age LEMS Walk
x1 16-30 50 yes
x2 16-30 0 no
x3 31-45 1-25 no
x4 31-45 1-25 yes
x5 46-60 26-49 no
x6 16-30 26-49 yes
x7 46-60 26-49 no
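These partitions can be recomputed directly from the definition of IND; a minimal sketch over the Age/LEMS table above:

```python
from collections import defaultdict

# The condition attributes of the seven objects from the slides.
table = {
    "x1": {"Age": "16-30", "LEMS": "50"},
    "x2": {"Age": "16-30", "LEMS": "0"},
    "x3": {"Age": "31-45", "LEMS": "1-25"},
    "x4": {"Age": "31-45", "LEMS": "1-25"},
    "x5": {"Age": "46-60", "LEMS": "26-49"},
    "x6": {"Age": "16-30", "LEMS": "26-49"},
    "x7": {"Age": "46-60", "LEMS": "26-49"},
}

def ind(attrs):
    """Partition U into B-indiscernibility classes for B = attrs."""
    classes = defaultdict(list)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].append(obj)
    return sorted(sorted(c) for c in classes.values())

print(ind(["Age"]))          # [['x1', 'x2', 'x6'], ['x3', 'x4'], ['x5', 'x7']]
print(ind(["Age", "LEMS"]))  # [['x1'], ['x2'], ['x3', 'x4'], ['x5', 'x7'], ['x6']]
```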
Observations
An equivalence relation induces a partitioning of the universe.
The partitions can be used to build new subsets of the universe.
Subsets that are most often of interest have the same value of the decision attribute.
It may happen, however, that a concept such as “Walk” cannot be defined in a crisp manner.
Set Approximation
Let T = (U, A), B ⊆ A and X ⊆ U. We can approximate X using only the information contained in B by constructing the B-lower and B-upper approximations of X, denoted B(X) and B̄(X) respectively, where
B(X) = {x | [x]_B ⊆ X},
B̄(X) = {x | [x]_B ∩ X ≠ ∅}.
Set Approximation (2)
B-boundary region of X, BN_B(X) = B̄(X) − B(X),
consists of those objects that we cannot decisively classify into X on the basis of B.
B-outside region of X, U − B̄(X),
consists of those objects that can with certainty be classified as not belonging to X.
A set is said to be rough if its boundary region is non-empty; otherwise the set is crisp.
An Example of Set Approximation
Let W = {x | Walk(x) = yes}. Then
A(W) = {x1, x6}, Ā(W) = {x1, x3, x4, x6}, BN_A(W) = {x3, x4}.
The decision class Walk is rough since the boundary region is not empty.
Age LEMS Walk
x1 16-30 50 yes
x2 16-30 0 no
x3 31-45 1-25 no
x4 31-45 1-25 yes
x5 46-60 26-49 no
x6 16-30 26-49 yes
x7 46-60 26-49 no
An Example of
Set Approximation (2)
(Figure: the partition U/R induced by R = {Age, LEMS} laid over the set X = W; {x1} and {x6} are certainly “yes”, {x3, x4} is “yes/no” (boundary), and {x2} and {x5, x7} are certainly “no”.)
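A direct transcription of the two definitions for the Walk table (the full attribute set A = {Age, LEMS} plays the role of B):

```python
# Objects x1..x7 as (Age, LEMS, Walk); W is the concept "Walk = yes".
U = {"x1": ("16-30", "50", "yes"),    "x2": ("16-30", "0", "no"),
     "x3": ("31-45", "1-25", "no"),   "x4": ("31-45", "1-25", "yes"),
     "x5": ("46-60", "26-49", "no"),  "x6": ("16-30", "26-49", "yes"),
     "x7": ("46-60", "26-49", "no")}
W = {x for x, row in U.items() if row[2] == "yes"}

def eq_class(x):
    """[x]_A under the condition attributes {Age, LEMS}."""
    return {y for y in U if U[y][:2] == U[x][:2]}

lower = {x for x in U if eq_class(x) <= W}   # classes contained in W
upper = {x for x in U if eq_class(x) & W}    # classes intersecting W
print(sorted(lower))          # ['x1', 'x6']
print(sorted(upper))          # ['x1', 'x3', 'x4', 'x6']
print(sorted(upper - lower))  # boundary: ['x3', 'x4']
```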
Lower & Upper Approximations
Lower & Upper Approximations (2)
Lower approximation: R(X) = ∪{Y ∈ U/R : Y ⊆ X}
Upper approximation: R̄(X) = ∪{Y ∈ U/R : Y ∩ X ≠ ∅}
Lower & Upper Approximations
(3)
X1 = {u | Flu(u) = yes} = {u2, u3, u6, u7}
R(X1) = {u2, u3}
R̄(X1) = {u2, u3, u5, u6, u7, u8}
X2 = {u | Flu(u) = no} = {u1, u4, u5, u8}
R(X2) = {u1, u4}
R̄(X2) = {u1, u4, u5, u6, u7, u8}
The indiscernibility classes defined by R = {Headache, Temp.} are {u1}, {u2}, {u3}, {u4}, {u5, u7}, {u6, u8}.
Lower & Upper Approximations (4)
R = {Headache, Temp.}
U/R = { {u1}, {u2}, {u3}, {u4}, {u5, u7}, {u6, u8}}
X1 = {u | Flu(u) = yes} = {u2,u3,u6,u7}
X2 = {u | Flu(u) = no} = {u1,u4,u5,u8}
R(X1) = {u2, u3}
R̄(X1) = {u2, u3, u5, u6, u7, u8}
R(X2) = {u1, u4}
R̄(X2) = {u1, u4, u5, u6, u7, u8}
(Figure: Venn-style diagram showing X1, X2, and the equivalence classes {u1}, {u2}, {u3}, {u4}, {u5, u7}, {u6, u8}.)
Properties of Approximations
B(X) ⊆ X ⊆ B̄(X)
B(∅) = B̄(∅) = ∅;  B(U) = B̄(U) = U
B̄(X ∪ Y) = B̄(X) ∪ B̄(Y);  B(X ∩ Y) = B(X) ∩ B(Y)
X ⊆ Y implies B(X) ⊆ B(Y) and B̄(X) ⊆ B̄(Y)
Properties of Approximations (2)
B(X ∪ Y) ⊇ B(X) ∪ B(Y);  B̄(X ∩ Y) ⊆ B̄(X) ∩ B̄(Y)
B(−X) = −B̄(X);  B̄(−X) = −B(X)
B(B(X)) = B̄(B(X)) = B(X);  B̄(B̄(X)) = B(B̄(X)) = B̄(X)
where −X denotes U − X.
Four Basic Classes of Rough Sets
X is roughly B-definable, iff B(X) ≠ ∅ and B̄(X) ≠ U
X is internally B-undefinable, iff B(X) = ∅ and B̄(X) ≠ U
X is externally B-undefinable, iff B(X) ≠ ∅ and B̄(X) = U
X is totally B-undefinable, iff B(X) = ∅ and B̄(X) = U
Accuracy of Approximation
α_B(X) = |B(X)| / |B̄(X)|, where |X| denotes the cardinality of X ≠ ∅.
Obviously 0 ≤ α_B(X) ≤ 1.
If α_B(X) = 1, X is crisp with respect to B.
If α_B(X) < 1, X is rough with respect to B.
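For the Walk concept from the running example this gives:

```python
# Accuracy of approximation for the "Walk" concept:
# |lower| = |{x1, x6}| = 2, |upper| = |{x1, x3, x4, x6}| = 4.
lower = {"x1", "x6"}
upper = {"x1", "x3", "x4", "x6"}
alpha = len(lower) / len(upper)
print(alpha)  # 0.5 -> Walk is rough (alpha < 1)
```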
Issues in the Decision Table
The same or indiscernible objects may be represented several times.
Some of the attributes may be superfluous (redundant).
That is, their removal cannot worsen the classification.
Reducts
Keep only those attributes that preserve the indiscernibility relation and, consequently, set approximation.
There are usually several such subsets of attributes and those which are minimal are called reducts.
Dispensable & Indispensable Attributes
Let T = (U, C, D) be a decision table and c ∈ C.
Attribute c is dispensable in T
if POS_C(D) = POS_(C−{c})(D); otherwise
attribute c is indispensable in T.
The C-positive region of D: POS_C(D) = ∪_{X ∈ U/D} C(X).
Independent
T = (U, C, D) is independent
if all are indispensable in T.
Reduct & Core
The set of attributes R ⊆ C is called a reduct of C, if T’ = (U, R, D) is independent and POS_R(D) = POS_C(D).
The set of all the condition attributes indispensable in T is denoted by CORE(C).
CORE(C) = ∩ RED(C), where RED(C) is the set of all reducts of C.
An Example of Reducts & Core
Reduct1 = {Muscle-pain, Temp.}
Reduct2 = {Headache, Temp.}
CORE = {Muscle-pain, Temp.} ∩ {Headache, Temp.} = {Temp.}
Discernibility Matrix (relative to positive region)
Let T = (U, C, D) be a decision table, with U = {u1, u2, …, un}.
By a discernibility matrix of T, denoted M(T), we mean an n × n matrix with entries defined as:
c_ij = {a ∈ C : a(ui) ≠ a(uj)}
for i, j = 1, 2, …, n such that ui or uj belongs to the C-positive region of D.
c_ij is the set of all the condition attributes that classify objects ui and uj into different classes.
Discernibility Matrix (relative to positive region) (2)
A discernibility function f_T is built similarly, but the conjunction is taken over all non-empty entries of M(T) corresponding to the indices i, j such that ui
or uj belongs to the C-positive region of D:  f_T = ∧{∨ c_ij : c_ij ≠ ∅}.
An entry “*” denotes a case that does not need to be considered, and hence is interpreted as logical truth.
All disjuncts of the minimal disjunctive form of this function define the reducts of T (relative to the positive region).
Discernibility Function (relative to objects)
For any ui ∈ U:
f_T(ui) = ∧{∨ c_ij : 1 ≤ j ≤ n},
where (1) ∨ c_ij is the disjunction of all variables a
such that a ∈ c_ij, if c_ij ≠ ∅;
(2) ∨ c_ij = false (contradiction), if c_ij = ∅ and ui, uj belong to different decision classes;
(3) ∨ c_ij = true, if c_ij = ∅ and ui, uj belong to the same decision class.
Each logical product in the minimal disjunctive normal form (DNF) defines a reduct of instance ui.
Examples of Discernibility Matrix
No a b c d
u1 a0 b1 c1 y
u2 a1 b1 c0 n
u3 a0 b2 c1 n
u4 a1 b1 c1 y
C = {a, b, c}
D = {d}
In order to discern the equivalence classes of the decision attribute d, we have to preserve the conditions described by the discernibility matrix of this table:
      u1    u2    u3
u2    a,c
u3    b
u4          c     a,b
Reduct = {b, c}
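The matrix and the reduct {b, c} can be checked mechanically; this brute-force sketch enumerates attribute subsets and keeps the minimal hitting sets of the matrix entries (fine for three attributes, exponential in general):

```python
from itertools import combinations

# The 4-object decision table from the slides (conditions a, b, c,
# decision d) and its discernibility matrix; reducts are the minimal
# attribute sets hitting every non-empty matrix entry.
rows = {"u1": {"a": "a0", "b": "b1", "c": "c1", "d": "y"},
        "u2": {"a": "a1", "b": "b1", "c": "c0", "d": "n"},
        "u3": {"a": "a0", "b": "b2", "c": "c1", "d": "n"},
        "u4": {"a": "a1", "b": "b1", "c": "c1", "d": "y"}}
C = ["a", "b", "c"]

matrix = {(i, j): {a for a in C if rows[i][a] != rows[j][a]}
          for i, j in combinations(rows, 2)
          if rows[i]["d"] != rows[j]["d"]}

def is_reduct(R):
    hits = lambda S: all(S & entry for entry in matrix.values())
    return hits(set(R)) and not any(hits(set(R) - {x}) for x in R)

reducts = [set(S) for k in range(1, len(C) + 1)
           for S in combinations(C, k) if is_reduct(S)]
print(reducts)  # the single reduct {'b', 'c'}
```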
Examples of Discernibility Matrix (2)
u1 u2 u3 u4 u5 u6
u2
u3
u4
u5
u6
u7
b,c,d b,c
b b,d c,d
a,b,c,d a,b,c a,b,c,d
a,b,c,d a,b,c a,b,c,d
a,b,c,d a,b c,d c,d
Core = {b}
Reduct1 = {b,c}
Reduct2 = {b,d}
Rough Membership
The rough membership function quantifies the degree of relative overlap between the set X and the equivalence class [x]_B to which x belongs:
μ_X^B(x) = |X ∩ [x]_B| / |[x]_B|.
It can be interpreted as a frequency-based estimate of Pr(x ∈ X | u),
where u is the equivalence class of IND(B) containing x.
Rough Membership (2)
The formulae for the lower and upper approximations can be generalized to an arbitrary level of precision π ∈ (0.5, 1] by means of the rough membership function:
B_π(X) = {x | μ_X^B(x) ≥ π},
B̄_π(X) = {x | μ_X^B(x) > 1 − π}.
Note: the lower and upper approximations as originally formulated are obtained as a special case with π = 1.
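For the Walk example, the rough membership values follow directly from the equivalence classes; the partition below is the one computed earlier for {Age, LEMS}:

```python
# Rough membership mu_X^B(x) = |X ∩ [x]_B| / |[x]_B| for the Walk table.
classes = {"x1": {"x1"}, "x2": {"x2"}, "x3": {"x3", "x4"},
           "x4": {"x3", "x4"}, "x5": {"x5", "x7"},
           "x6": {"x6"}, "x7": {"x5", "x7"}}
X = {"x1", "x4", "x6"}  # Walk = yes

def mu(x):
    return len(X & classes[x]) / len(classes[x])

print(mu("x3"))  # 0.5 -> x3 is in the boundary region
print(mu("x1"))  # 1.0 -> certainly in X
print(mu("x2"))  # 0.0 -> certainly outside X
```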
Dependency of Attributes
Discovering dependencies between attributes is an important issue in KDD.
A set of attributes D depends totally on a set of attributes C, denoted C ⇒ D, if all values of attributes from D are uniquely determined by values of attributes from C.
Dependency of Attributes (2)
Let D and C be subsets of A. We say that D depends on C in a degree k (0 ≤ k ≤ 1),
denoted C ⇒_k D, if
k = γ(C, D) = |POS_C(D)| / |U|,
where POS_C(D) = ∪_{X ∈ U/D} C(X) is called the C-positive region of D.
Dependency of Attributes (3)
Obviously
If k = 1 we say that D depends totally on C.
If k < 1 we say that D depends partially (in a degree k) on C.
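For the Walk table, the degree of dependency of Walk on {Age, LEMS} is 5/7, since only the class {x3, x4} is inconsistent; a minimal sketch:

```python
# Degree of dependency k = |POS_C(D)| / |U|: an object lies in the
# positive region iff its indiscernibility class has a unique decision.
U = {"x1": (("16-30", "50"), "yes"),    "x2": (("16-30", "0"), "no"),
     "x3": (("31-45", "1-25"), "no"),   "x4": (("31-45", "1-25"), "yes"),
     "x5": (("46-60", "26-49"), "no"),  "x6": (("16-30", "26-49"), "yes"),
     "x7": (("46-60", "26-49"), "no")}

pos = {x for x in U
       if len({U[y][1] for y in U if U[y][0] == U[x][0]}) == 1}
k = len(pos) / len(U)
print(sorted(pos))   # ['x1', 'x2', 'x5', 'x6', 'x7']
print(round(k, 3))   # 0.714 -> Walk depends partially on {Age, LEMS}
```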
A Rough Set Based KDD Process
Discretization based on RS and Boolean Reasoning (RSBR).
Attribute selection based RS with Heuristics (RSH).
Rule discovery by GDT-RS.
What Are Issues of Real World ?
Very large data sets
Mixed types of data (continuous valued, symbolic data)
Uncertainty (noisy data)
Incompleteness (missing, incomplete data)
Data change
Use of background knowledge
(Figure: the real-world issues above — very large data sets, mixed types of data, noisy data, incomplete instances, data change, use of background knowledge — mapped against methods such as ID3 (C4.5), Prism, Version Space, BP, and Dblearn.)
Soft Techniques for KDD
(Figure: soft techniques — stochastic processes, belief networks, connectionist networks, GDT, rough sets, fuzzy sets — positioned among deduction, induction, and abduction, and among probability, logic, and set theory.)
Soft Techniques for KDD (2)
(Figure: the hybrid model combines GDT, GrC, RS & ILP, RS, and TM within the deduction–induction–abduction triangle.)
A Hybrid Model
GDT : Generalization Distribution Table
RS : Rough Sets
TM: Transition Matrix
ILP : Inductive Logic Programming
GrC : Granular Computing
A Rough Set Based KDD Process
Discretization based on RS and Boolean Reasoning (RSBR).
Attribute selection based RS with Heuristics (RSH).
Rule discovery by GDT-RS.
Observations
A real-world data set always contains mixed types of data, such as continuous-valued and symbolic attributes.
Attributes with real values must undergo a process called discretization, which divides the attribute's value range into intervals.
There is no unified approach to discretization so far, and the choice of method depends heavily on the data considered.
Discretization based on RSBR
In the discretization of a decision table T = (U, A ∪ {d}), where V_a = [l_a, r_a) is an interval of real values, we search for a partition P_a of V_a for any a ∈ A.
Any partition of V_a is defined by a sequence of so-called cuts l_a = v_1 < v_2 < … < v_k = r_a from V_a.
Any family of partitions {P_a}_{a ∈ A} can be identified with a set of cuts.
Discretization Based on RSBR (2)
In the discretization process, we search for a set of cuts satisfying some natural conditions.
U a b d
x1 0.8 2 1
x2 1 0.5 0
x3 1.3 3 0
x4 1.4 1 1
x5 1.4 2 0
x6 1.6 3 1
x7 1.3 1 1
U a b d
x1 0 2 1
x2 1 0 0
x3 1 2 0
x4 1 1 1
x5 1 2 0
x6 2 2 1
x7 1 1 1
P = {(a, 0.9),
(a, 1.5),
(b, 0.75),
(b, 1.5)}
A Geometrical Representation of Data
(Figure: objects x1–x7 plotted in the (a, b) plane, with a ∈ {0.8, 1, 1.3, 1.4, 1.6} and b ∈ {0.5, 1, 2, 3}.)
A Geometrical Representation of Data and Cuts
(Figure: the same plot of x1–x7 with the candidate cuts drawn between consecutive values of a and b.)
Discretization Based on RSBR (3)
The sets of possible values of a and b are defined by V_a = [0, 2) and V_b = [0, 4).
The sets of values of a and b on objects from U are given by
a(U) = {0.8, 1, 1.3, 1.4, 1.6};
b(U) = {0.5, 1, 2, 3}.
Discretization Based on RSBR (4)
The discretization process returns a partition of the value sets of condition attributes into intervals.
A Discretization Process
Step 1: define a set of Boolean variables BV(T) = {p1a, p2a, p3a, p4a, p1b, p2b, p3b}, where
p1a corresponds to the interval [0.8, 1) of a
p2a corresponds to the interval [1, 1.3) of a
p3a corresponds to the interval [1.3, 1.4) of a
p4a corresponds to the interval [1.4, 1.6) of a
p1b corresponds to the interval [0.5, 1) of b
p2b corresponds to the interval [1, 2) of b
p3b corresponds to the interval [2, 3) of b
The Set of Cuts on Attribute a
A Discretization Process (2)
Step 2: create a new decision table by using the set of Boolean variables defined in Step 1.
Let T = (U, A ∪ {d}) be a decision table, and p_k^a the propositional variable corresponding to the interval [v_k^a, v_{k+1}^a)
for any k ∈ {1, …, n_a − 1} and a ∈ A.
A Sample Defined in Step 2
U*        p1a p2a p3a p4a p1b p2b p3b
(x1,x2)    1   0   0   0   1   1   0
(x1,x3)    1   1   0   0   0   0   1
(x1,x5)    1   1   1   0   0   0   0
(x4,x2)    0   1   1   0   1   0   0
(x4,x3)    0   0   1   0   0   1   1
(x4,x5)    0   0   0   0   0   1   0
(x6,x2)    0   1   1   1   1   1   1
(x6,x3)    0   0   1   1   0   0   0
(x6,x5)    0   0   0   1   0   0   1
(x7,x2)    0   1   0   0   1   0   0
(x7,x3)    0   0   0   0   0   1   0
(x7,x5)    0   0   1   0   0   1   0
The Discernibility Formula
The discernibility formula ψ(x1, x2) = p1a ∨ p1b ∨ p2b
means that in order to discern objects x1 and x2, at least one of the following cuts must be set:
a cut between a(0.8) and a(1)
a cut between b(0.5) and b(1)
a cut between b(1) and b(2).
The Discernibility Formulae for All Different Pairs
The Discernibility Formulae for All Different Pairs (2)
A Discretization Process (3)
Step 3: find the minimal subset of P that discerns all objects in different decision classes.
The discernibility Boolean propositional formula is defined as
Φ_U = ∧{ψ(xi, xj) : d(xi) ≠ d(xj)}.
The Discernibility Formula
in CNF Form
The Discernibility Formula
in DNF Form
We obtain four prime implicants;
{p2a, p4a, p2b} is the optimal result, because
it is the minimal subset of P, corresponding to the cuts (a, 1.2), (a, 1.5), and (b, 1.5).
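Step 3 can be carried out by Boolean reasoning: treat each candidate cut as a Boolean variable and search for a minimal set of variables that hits the discernibility formula of every object pair from different classes. The sketch below uses the 12 rows of the Step-2 table; the variable names p1a…p3b follow the interval numbering of Step 1 (an exhaustive search, feasible only because P is tiny):

```python
from itertools import combinations

# Each row records which candidate cuts discern one pair of objects
# from different decision classes (the Step-2 table, row by row).
cuts = ["p1a", "p2a", "p3a", "p4a", "p1b", "p2b", "p3b"]
rows = ["1000110", "1100001", "1110000", "0110100", "0010011",
        "0000010", "0111111", "0011000", "0001001", "0100100",
        "0000010", "0010010"]
pairs = [{c for c, bit in zip(cuts, r) if bit == "1"} for r in rows]

# Smallest set of cuts intersecting every pair's formula.
solution = next(set(S)
                for k in range(1, len(cuts) + 1)
                for S in combinations(cuts, k)
                if all(set(S) & p for p in pairs))
print(sorted(solution))  # ['p2a', 'p2b', 'p4a']
```

The result matches the prime implicant above: the cuts (a, 1.2), (a, 1.5), and (b, 1.5).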
The Minimal Set Cuts
for the Sample DB
(Figure: the (a, b) plane with x1–x7 and only the minimal set of cuts drawn — a = 1.2, a = 1.5, and b = 1.5 — which still separates objects of different decision classes.)
A Result
U a b d
x1 0.8 2 1
x2 1 0.5 0
x3 1.3 3 0
x4 1.4 1 1
x5 1.4 2 0
x6 1.6 3 1
x7 1.3 1 1
U a b d
x1 0 1 1
x2 0 0 0
x3 1 1 0
x4 1 0 1
x5 1 1 0
x6 2 1 1
x7 1 0 1
P = {(a, 1.2),
(a, 1.5),
(b, 1.5)}
A Rough Set Based KDD Process
Discretization based on RS and Boolean Reasoning (RSBR).
Attribute selection based RS with Heuristics (RSH).
Rule discovery by GDT-RS.
Observations
A database always contains many attributes that are redundant and not necessary for rule discovery.
If these redundant attributes are not removed, not only does the time complexity of rule discovery increase, but the quality of the discovered rules may also be significantly degraded.
The Goal of Attribute Selection
Finding an optimal subset of attributes in a database according to some criterion, so that a classifier with the highest possible accuracy can be induced by a learning algorithm using information from that subset of attributes only.
Attribute Selection
The Filter Approach
Preprocessing
The main strategies of attribute selection:
The minimal subset of attributes
Selection of the attributes with a higher rank
Advantage
Fast
Disadvantage
Ignoring the performance effects of the induction algorithm
The Wrapper Approach
Using the induction algorithm as a part of the search evaluation function
Number of possible attribute subsets: 2^N − 1 (N is the number of attributes)
The main search methods:
Exhaustive/Complete search
Heuristic search
Non-deterministic search
Advantage
Taking into account the performance of the induction algorithm
Disadvantage
The time complexity is high
Basic Ideas: Attribute Selection using RSH
Take the attributes in CORE as the initial subset.
Select one attribute each time using the rule evaluation criterion in our rule discovery system, GDT-RS.
Stop when the subset of selected attributes is a reduct.
Why Heuristics ?
The number of possible reducts can be as large as C(N, ⌈N/2⌉),
where N is the number of attributes.
Selecting the optimal reduct from all possible reducts is computationally hard, so heuristics must be used.
The Rule Selection Criteria
in GDT-RS
Selecting the rules that cover as many instances as possible.
Selecting the rules that contain as few attributes as possible, if they cover the same number of instances.
Selecting the rules with larger strengths, if they have the same number of condition attributes and cover the same number of instances.
Attribute Evaluation Criteria
Selecting the attributes that cause the number of consistent instances to increase faster
To obtain the subset of attributes as small as possible
Selecting an attribute that has a smaller number of different values
To guarantee that the number of instances covered by rules is as large as possible.
Main Features of RSH
It can select a better subset of attributes quickly and effectively from a large DB.
The selected attributes do not damage the performance of induction so much.
An Example of Attribute Selection
Condition Attributes:
a: Va = {1, 2}
b: Vb = {0, 1, 2}
c: Vc = {0, 1, 2}
d: Vd = {0, 1}
Decision Attribute:
e: Ve = {0, 1, 2}
Searching for CORE
Removing attribute a
Removing attribute a does not cause inconsistency.
Hence, a does not belong to the CORE.
Searching for CORE (2)
Removing attribute b
Removing attribute b causes inconsistency.
Hence, b belongs to the CORE.
Searching for CORE (3)
Removing attribute c
Removing attribute c does not cause inconsistency.
Hence, c does not belong to the CORE.
Searching for CORE (4)
Removing attribute d
Removing attribute d does not cause inconsistency.
Hence, d does not belong to the CORE.
Searching for CORE (5)
CORE(C)={b}
Initial subset R = {b}
Attribute b is the unique indispensable attribute.
R={b}
The instances containing b0 will not be considered.
Attribute Evaluation Criteria
Selecting the attributes that cause the number of consistent instances to increase faster
To obtain the subset of attributes as small as possible
Selecting the attribute that has smaller number of different values
To guarantee that the number of instances covered by a rule is as large as possible.
Selecting Attribute from {a,c,d}
1. Selecting {a}
R = {a,b}
u3,u5,u6
u4
u7
U/{e}
u3
u4
u7
U/{a,b}
u5
u6
Selecting Attribute from {a,c,d} (2)
2. Selecting {c}
R = {b,c}
u3,u5,u6
u4
u7
U/{e}
Selecting Attribute from {a,c,d} (3)
3. Selecting {d}
R = {b,d}
u3,u5,u6
u4
u7
U/{e}
Selecting Attribute from {a,c,d} (4)
3. Selecting {d}
R = {b,d}
Result: Subset of attributes= {b, d}
A Heuristic Algorithm
for Attribute Selection
Let R be a set of the selected attributes, P be the set of unselected condition attributes, U be the set of all instances, X be the set of contradictory instances, and EXPECT be the threshold of accuracy.
In the initial state, R = CORE(C),
k = 0.
A Heuristic Algorithm
for Attribute Selection (2)
Step 1. If k >= EXPECT, finish; otherwise calculate the dependency degree k = |POS_R(D)| / |U|.
Step 2. For each p in P, calculate
where max_size denotes the cardinality of the maximal subset.
A Heuristic Algorithm
for Attribute Selection (3)
Step 3. Choose the attribute p with the largest value computed in Step 2, and let R = R ∪ {p}, P = P − {p}.
Step 4. Remove all consistent instances u in POS_R(D)
from X.
Step 5. Go back to Step 1.
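The five steps above can be sketched as a greedy loop. The table, the score function, and the EXPECT threshold below are all hypothetical stand-ins (in GDT-RS the score also involves the attribute's value counts); the sketch only shows the select-best-then-stop structure:

```python
# Greedy attribute selection in the spirit of RSH: start from the CORE,
# repeatedly add the attribute whose addition enlarges the positive
# region most, and stop when the dependency degree reaches EXPECT.
# The toy table is hypothetical; its decision d happens to equal b.
rows = [  # (a, b, c, d)
    (0, 0, 1, 0), (1, 0, 0, 0), (0, 1, 1, 1),
    (1, 1, 0, 1), (0, 1, 0, 1), (1, 0, 1, 0),
]
ATTRS = {"a": 0, "b": 1, "c": 2}

def pos_size(selected):
    """|POS_R(D)|: number of objects whose R-class is decision-pure."""
    if not selected:
        return 0
    count = 0
    for row in rows:
        cls = [r for r in rows
               if all(r[ATTRS[x]] == row[ATTRS[x]] for x in selected)]
        if len({r[3] for r in cls}) == 1:
            count += 1
    return count

EXPECT = 1.0
R = set()  # the CORE is taken to be empty in this toy example
while pos_size(R) / len(rows) < EXPECT:
    R.add(max(ATTRS.keys() - R, key=lambda p: pos_size(R | {p})))
print(sorted(R))  # ['b']
```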
Experimental Results
A Rough Set Based KDD Process
Discretization based on RS and Boolean Reasoning (RSBR).
Attribute selection based RS with Heuristics (RSH).
Rule discovery by GDT-RS.
Main Features of GDT-RS
Unseen instances are considered in the discovery process, and the uncertainty of a rule, including its ability to predict possible instances, can be explicitly represented in the strength of the rule.
Biases can be flexibly selected for search control, and background knowledge can be used as a bias to control the creation of a GDT and the discovery process.
A Sample DB
Condition attributes: a, b, c
Va = {a0, a1} Vb = {b0, b1, b2} Vc = {c0, c1}
Decision attribute: d, Vd = {y,n}
U a b c d
A Sample GDT
a0b0c0 a0b0c1 … … a1b0c0 …... a1b2c1
*b0c0
*b0c1
*b1c0
*b1c1
*b2c0
*b2c1
a0*c0
…...
a1b1*
a1b2*
**c0
…...
a0**
a1**
(Table entries: each possible generalization (row, G(x)) assigns equal probability to every possible instance (column, F(x)) it covers — e.g. *b0c0 gives 1/2 each to a0b0c0 and a1b0c0, a0*c0 gives 1/3 to each of its three instances, and **c0 gives 1/6 to each of the six instances with c0; all other entries are 0.)
Explanation for GDT
F(x): the possible instances (PI)
G(x): the possible generalizations (PG)
The entries of the GDT give the probability relationships between PI and PG.
Probabilistic Relationship Between PIs and PGs
a0*c0
a0b0c0
a0b1c0
a0b2c0
p = 1/3 for each of the three instances, i.e. p(PI_j | PG_i) = 1/N_i,
where N_i is the number of PIs
satisfying the ith PG.
Unseen Instances
Possible Instances:
yes,no,normal
yes, no, high
yes, no, very-high
no, yes, high
no, no, normal
no, no, very-high
Closed world Open world
Rule Representation
X → Y with S
X denotes the conjunction of the conditions that a concept must satisfy,
Y denotes the concept that the rule describes, and
S is a “measure of strength” with which the rule holds.
Rule Strength (1)
The strength of the generalization X
(no BK is used) is
s(X) = n_obs(X) / n_PI(X),
where n_obs(X) is the number of the observed instances satisfying the generalization X, and n_PI(X) is the number of possible instances satisfying it.
Rule Strength (2)
The strength of the generalization X
(BK is used),
Rule Strength (3)
The rate of noises:
r(X → Y) = 1 − n(X, Y) / n(X),
where n(X, Y) is the number of instances belonging to the class Y among the n(X) instances satisfying the generalization X.
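With no BK, the strength examples from the slides (s({a0b1}) = 0.5, s({b1c1}) = 1) can be reproduced by counting covered possible instances; the observed set below contains just u2 and u7:

```python
from itertools import product

# Strength of a generalization in GDT-RS when no background knowledge
# is used: the fraction of possible instances matching the
# generalization that are actually observed in the database.
Va, Vb, Vc = ["a0", "a1"], ["b0", "b1", "b2"], ["c0", "c1"]
observed = {("a0", "b1", "c1"), ("a1", "b1", "c1")}  # u2 and u7

def strength(g):
    """g is a triple like ("a0", "b1", None); None is a wildcard."""
    covered = [i for i in product(Va, Vb, Vc)
               if all(v is None or v == x for v, x in zip(g, i))]
    return sum(i in observed for i in covered) / len(covered)

print(strength(("a0", "b1", None)))   # 0.5
print(strength((None, "b1", "c1")))   # 1.0
```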
Rule Discovery by GDT-RS
Condition Attrs.: a, b, c
a: Va = {a0, a1}
b: Vb = {b0, b1, b2}
c: Vc = {c0, c1}
Class: d:
d: Vd = {y,n}
Regarding the Instances
(Noise Rate = 0)
Generating Discernibility Vector for u2
Obtaining Reducts for u2
Generating Rules from u2
Reducts for u2: {a0, b1} and {b1, c1}.
The generalization a0b1* covers a0b1c0 and a0b1c1(u2); s({a0b1}) = 0.5, giving the rule {a0b1} → y.
The generalization *b1c1 covers a0b1c1(u2) and a1b1c1(u7); s({b1c1}) = 1, giving the rule {b1c1} → y.
Generating Rules from u2 (2)
Generating Discernibility Vector for u4
Obtaining Reducts for u4
Generating Rules from u4
Reduct for u4: {c0}.
The generalization **c0 covers a0b0c0, a1b1c0(u4), and a1b2c0, giving the rule {c0} → n.
Generating Rules from u4 (2)
Generating Rules from All Instances
u2: {a0b1} → y, S = 0.5
    {b1c1} → y, S = 1
u4: {c0} → n, S = 0.167
u6: {b2} → n, S = 0.25
u7: {a1c1} → y, S = 0.5
    {b1c1} → y, S = 1
The Rule Selection Criteria
in GDT-RS
Selecting the rules that cover as many instances as possible.
Selecting the rules that contain as few condition attributes as possible, if they cover the same number of instances.
Selecting the rules with larger strengths, if they have the same number of condition attributes and cover the same number of instances.
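The three criteria above form a lexicographic order, which is easy to express as a sort key (a minimal Python sketch; the rule dictionaries and field names are my own illustrative encoding):

```python
def select_best(rules):
    """Order candidate rules by the GDT-RS criteria:
    1. more instances covered,
    2. fewer condition attributes (tie-break),
    3. larger strength (second tie-break)."""
    return sorted(rules,
                  key=lambda r: (-len(r["covers"]), len(r["conds"]), -r["strength"]))

# Candidate rules for class y from the sample DB.
rules = [
    {"conds": {"b1", "c1"}, "strength": 1.0, "covers": {"u2", "u7"}},
    {"conds": {"a0", "b1"}, "strength": 0.5, "covers": {"u2"}},
    {"conds": {"a1", "c1"}, "strength": 0.5, "covers": {"u7"}},
]
best = select_best(rules)[0]   # {b1, c1} -> y wins: it covers both u2 and u7
```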
Generalization Belonging to Class y
{b1c1} → y with S = 1      covered instances: u2, u7
{a1c1} → y with S = 1/2    covered instance: u7
{a0b1} → y with S = 1/2    covered instance: u2
Generalization Belonging to Class n
{c0} → n with S = 1/6      covered instance: u4
{b2} → n with S = 1/4      covered instance: u6
Results from the Sample DB
(Noise Rate = 0)
Certain Rules: Instances Covered
{c0} → n with S = 1/6 (covers u4)
{b2} → n with S = 1/4 (covers u6)
{b1c1} → y with S = 1 (covers u2, u7)
Possible Rules:
b0 → y with S = (1/4)(1/2)
a0 & b0 → y with S = (1/2)(2/3)
a0 & c1 → y with S = (1/3)(2/3)
b0 & c1 → y with S = (1/2)(2/3)
Instances Covered: u1, u3, u5
Results from the Sample DB (2)
(Noise Rate > 0)
Regarding Instances
(Noise Rate > 0)
Rules Obtained from All Instances
u2: {a0b1} → y, S = 0.5
    {b1c1} → y, S = 1
u4: {c0} → n, S = 0.167
u6: {b2} → n, S = 0.25
u7: {a1c1} → y, S = 0.5
    {b1c1} → y, S = 1
u1’: {b0} → y, S = 1/4 * 2/3 = 0.167
Example of Using BK
BK: a0 => c1, 100%
Changing Strength of Generalization by BK
Reducts for u2: {a0, b1}, {b1, c1}

Without BK:
{a0b1}: a0b1c0 (p = 1/2), a0b1c1(u2) (p = 1/2); s({a0b1}) = 0.5

With BK (a0 => c1, 100%):
{a0b1}: a0b1c0 (p = 0%), a0b1c1(u2) (p = 100%); s({a0b1}) = 1
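The effect of a 100% BK rule is to zero out the probability of every PI that violates it and redistribute the mass over the survivors. A minimal Python sketch of this idea (the function `gdt_probs`, the `(premise, consequent)` encoding of a BK rule, and the tuple encoding of PIs are all my own illustrative choices):

```python
from itertools import product

PIS = [tuple(x) for x in product(["a0", "a1"], ["b0", "b1", "b2"], ["c0", "c1"])]

def gdt_probs(pg, pis, bk=None):
    """Distribute probability uniformly over the PIs matching a PG.
    A BK rule like ('a0', 'c1'), read as a0 => c1 (100%), removes every
    PI that contains the premise value but not the consequent value.
    (Membership tests on value strings work here only because the
    attribute values are pairwise distinct -- a simplification.)"""
    def matches(pg, pi):
        return all(g == "*" or g == v for g, v in zip(pg, pi))
    cand = [pi for pi in pis if matches(pg, pi)]
    if bk:
        premise, consequent = bk
        cand = [pi for pi in cand if premise not in pi or consequent in pi]
    return {pi: 1.0 / len(cand) for pi in cand}

no_bk = gdt_probs(("a0", "b1", "*"), PIS)                      # 1/2 and 1/2
with_bk = gdt_probs(("a0", "b1", "*"), PIS, bk=("a0", "c1"))   # all mass on a0b1c1
```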
Algorithm 1
Optimal Set of Rules
Step 1. Consider the instances with the same condition attribute values as one instance, called a compound instance.
Step 2. Calculate the rate of noises r for each compound instance.
Step 3. Select one instance u from U and create a discernibility vector for u.
Step 4. Calculate all reducts for the instance u by using the discernibility function.
Algorithm 1
Optimal Set of Rules (2)
Step 5. Acquire the rules from the reducts for the instance u, and revise the strength of generalization of each rule.
Step 6. Select better rules from the rules (for u) acquired in Step 5, by using the heuristics for rule selection.
Step 7. If any instance remains unprocessed, go back to Step 3. Otherwise go to Step 8.
Algorithm 1
Optimal Set of Rules (3)
Step 8. Finish if the number of rules selected in Step 6 for each instance is 1. Otherwise find a minimal set of rules, which covers all of the instances in the decision table.
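Finding a minimal covering set of rules is a set-cover problem; a common way to approximate it is the greedy heuristic below (an illustrative Python sketch, not the tutorial's exact procedure — greedy set cover is generally sub-optimal):

```python
def minimal_rule_set(rules, instances):
    """Greedy set cover: repeatedly pick the rule covering the most
    still-uncovered instances until every instance is covered."""
    uncovered, chosen = set(instances), []
    while uncovered:
        rule = max(rules, key=lambda r: len(r["covers"] & uncovered))
        gain = rule["covers"] & uncovered
        if not gain:
            break  # remaining instances cannot be covered by any rule
        chosen.append(rule)
        uncovered -= gain
    return chosen

# Class-y candidates from the sample DB: one rule suffices.
rules_y = [
    {"name": "{b1c1}->y", "covers": {"u2", "u7"}},
    {"name": "{a0b1}->y", "covers": {"u2"}},
    {"name": "{a1c1}->y", "covers": {"u7"}},
]
picked = minimal_rule_set(rules_y, {"u2", "u7"})
```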
The Issue of Algorithm 1
It is not suitable for databases with a large number of attributes.
Methods to Solve the Issue:
Finding a reduct (subset) of condition attributes in a pre-processing step.
Finding a sub-optimal solution using some efficient heuristics.
Algorithm 2
Sub-Optimal Solution
Step1: Set R = {}, COVERED = {}, and SS = {all instance IDs}. For each class c, divide the decision table T into two parts: the current class (T+) and the other classes (T-).
Step2: From the attribute values vij of the instances in SS (where vij means the jth value of attribute i),
Algorithm 2
Sub-Optimal Solution (2)
choose a value v with the maximal number of occurrences within the instances contained in T+, and the minimal number of occurrences within the instances contained in T-.
Step3: Insert v into R.
Step4: Delete the instance ID from SS if the instance does not contain v.
Algorithm 2
Sub-Optimal Solution (3)
Step5: Go back to Step2 until the noise rate is less than the threshold value.
Step6: Find a minimal subset R’ of R according to the rule strengths. Insert
the resulting rule into RS. Set R = {}, copy the instance IDs
in SS to COVERED, and
set SS = {all instance IDs} - COVERED.
Algorithm 2
Sub-Optimal Solution (4)
Step8: Go back to Step2 until all instances of T+ are in COVERED.
Step9: Go back to Step1 until all classes are handled.
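The core of Steps 2–5 — greedily growing a conjunction by the value that best separates T+ from T- — can be sketched as follows (an illustrative Python fragment; the dict encoding of instances and the scoring by occurrence difference are my own reading of the slides):

```python
def greedy_rule(T_pos, T_neg, max_noise=0.0):
    """Grow a conjunction R of (attribute, value) pairs: at each step pick
    the value occurring most in the remaining T+ instances and least in the
    remaining T- instances, until the noise rate reaches the threshold."""
    def sat(inst, conds):
        return all(inst[a] == v for a, v in conds)
    R = []
    while True:
        p = [x for x in T_pos if sat(x, R)]
        n = [x for x in T_neg if sat(x, R)]
        if not p or len(n) / (len(p) + len(n)) <= max_noise:
            return R
        # Score each unused attribute-value: (count in T+) - (count in T-).
        cands = {}
        for x in p:
            for a, v in x.items():
                if (a, v) not in R:
                    cands[(a, v)] = cands.get((a, v), 0) + 1
        for x in n:
            for a, v in x.items():
                if (a, v) in cands:
                    cands[(a, v)] -= 1
        if not cands:
            return R
        R.append(max(cands, key=cands.get))

# Tiny example: b = b1 separates the classes in one step.
T_pos = [{"a": "a0", "b": "b1"}, {"a": "a1", "b": "b1"}]
T_neg = [{"a": "a1", "b": "b2"}]
rule = greedy_rule(T_pos, T_neg)
```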
Time Complexity of Alg.1&2
Time Complexity of Algorithm 1:
Time Complexity of Algorithm 2:
Let n be the number of instances in a DB,
m the number of attributes,
and N the number of generalizations,
where N is less than ……
Experiments
DBs that have been tested:
meningitis, bacterial examination, cancer, mushroom,
slope-in-collapse, earthquake, contents-sell, …...
Experimental methods:
Comparing GDT-RS with C4.5
Using background knowledge or not
Selecting different allowed noise rates as the threshold values
Auto-discretization or BK-based discretization.
Experiment 1
(meningitis data)
C4.5:
(from a meningitis DB with 140 records, and 38 attributes)
Experiment 1
(meningitis data) (2)
GDT-RS (auto-discretization):
Experiment 1
(meningitis data) (3)
GDT-RS (auto-discretization):
Using Background Knowledge
(meningitis data)
Never occurring together:
EEGwave(normal) ∧ EEGfocus(+)
CSFcell(low) ∧ Cell_Poly(high)
CSFcell(low) ∧ Cell_Mono(high)
Occurring with lower possibility:
WBC(low) ∧ CRP(high)
WBC(low) ∧ ESR(high)
WBC(low) ∧ CSFcell(high)
Using Background Knowledge
(meningitis data) (2)
Occurring with higher possibility:
WBC(high) ∧ CRP(high)
WBC(high) ∧ ESR(high)
WBC(high) ∧ CSF_CELL(high)
EEGfocus(+) ∧ FOCAL(+)
EEGwave(+) ∧ EEGfocus(+)
CRP(high) ∧ CSF_GLU(low)
CRP(high) ∧ CSF_PRO(low)
Explanation of BK
If the brain wave (EEGwave) is normal, the focus of brain wave (EEGfocus) is never abnormal.
If the number of white blood cells (WBC) is high, the inflammation protein (CRP) also tends to be high.
Using Background Knowledge
(meningitis data) (3)
rule1 is generated by BK
rule1:
Using Background Knowledge
(meningitis data) (4)
rule2 is replaced by rule2’
rule2:
rule2’:
Experiment 2
(bacterial examination data)
Number of instances: 20,000
Number of condition attributes: 60
Goals:
analyzing the relationship between the bacterium-detected attribute and other attributes
analyzing what attribute-values are related to the sensitivity of antibiotics when the value of bacterium-detected is (+).
Attribute Selection
(bacterial examination data)
Class-1: bacterium-detected (+, -)
condition attributes: 11
Class-2: antibiotic sensitivity
(resistant (R), sensitive (S))
condition attributes: 21
Some Results
(bacterial examination data)
Some of the rules discovered by GDT-RS are the same as those found by C4.5, e.g.,
Some rules can be discovered only by GDT-RS, e.g.,
bacterium-detected(-)
bacterium-detected(-).
Experiment 3
(gastric cancer data)
Number of instances: 7,520
Number of condition attributes: 38
Classes:
cause of death (specifically, the direct death)
post-operative complication
Goals:
analyzing the relationship between the direct death and other attributes
analyzing the relationship between the post-operative complication and other attributes.
Result of Attribute Selection
(gastric cancer data)
Class: the direct death
sex, location_lon1, location_lon2, location_cir1,
location_cir2, serosal_inva, peritoneal_meta,
lymphnode_diss, reconstruction, pre_oper_comp1,
post_oper_comp1, histological, structural_atyp,
growth_pattern, depth, lymphatic_inva,
vascular_inva, ln_metastasis, chemotherapypos
(19 attributes are selected)
Result of Attribute Selection (2)
(gastric cancer data)
Class: post-operative complication
multi-lesions, sex, location_lon1, location_cir1,
location_cir2, lymphnode_diss, maximal_diam,
reconstruction, pre_oper_comp1, histological,
stromal_type, cellular_atyp, structural_atyp,
growth_pattern, depth, lymphatic_inva,
chemotherapypos
(17 attributes are selected)
Experiment 4
(slope-collapse data)
Number of instances: 3,436
(430 places were collapsed, and 3,006 were not)
Condition attributes: 32
Continuous attributes in condition attributes: 6
extension of collapsed steep slope, gradient, altitude, thickness of surface of soil, No. of active fault, distance between slope and active fault.
Goal: find out what causes a slope to collapse.
Result of Attribute Selection
(slope-collapse data)
9 attributes are selected from 32 condition attributes:
altitude, slope azimuthal, slope shape, direction of high rank topography, shape of transverse section, position of transition line, thickness of surface of soil, kind of plant, distance between slope and active fault.
(3 continuous attributes in red color)
The Discovered Rules
(slope-collapse data)
s_azimuthal(2) ∧ s_shape(5) ∧ direction_high(8) ∧ plant_kind(3) S = (4860/E)
altitude[21,25) ∧ s_azimuthal(3) ∧ soil_thick(>=45) S = (486/E)
s_azimuthal(4) ∧ direction_high(4) ∧ t_shape(1) ∧ tl_position(2) ∧ s_f_distance(>=9) S = (6750/E)
altitude[16,17) ∧ s_azimuthal(3) ∧ soil_thick(>=45) ∧ s_f_distance(>=9) S = (1458/E)
altitude[20,21) ∧ t_shape(3) ∧ tl_position(2) ∧ plant_kind(6) ∧ s_f_distance(>=9) S = (12150/E)
altitude[11,12) ∧ s_azimuthal(2) ∧ tl_position(1) S = (1215/E)
altitude[12,13) ∧ direction_high(9) ∧ tl_position(4) ∧ s_f_distance[8,9) S = (4050/E)
altitude[12,13) ∧ s_azimuthal(5) ∧ t_shape(5) ∧ s_f_distance[8,9) S = (3645/E)
…...
Other Methods for Attribute Selection
(download from http://www.iscs/nus.edu.sg/liuh/)
LVW: A stochastic wrapper feature selection algorithm
LVI: An incremental multivariate feature selection algorithm
WSBG/C4.5: Wrapper of sequential backward generation
WSFG/C4.5: Wrapper of sequential forward generation
Rule induction system: C4.5
Executing times: 10
Class: direct death
Number of selected attributes for each time:
20, 19, 21, 26, 22, 31, 21, 19, 31, 28
Result-1 (19 attributes are selected):
multilesions, sex, location_lon3, location_cir4,
liver_meta, lymphnode_diss, proximal_surg, resection_meth,
combined_rese2, reconstruction, pre_oper_comp1,
post_oper_com2, post_oper_com3, spec_histologi, cellular_atyp,
depth, eval_of_treat, ln_metastasis, othertherapypre
Results of LVW
Result-2 (19 attributes are selected):
age, typeofcancer, location_cir3, location_cir4,
liver_meta, lymphnode_diss, maximal_diam,
distal_surg, combined_rese1, combined_rese2,
pre_oper_comp2, post_oper_com1, histological,
spec_histologi, structural_atyp, depth, lymphatic_inva,
vascular_inva, ln_metastasis
(only the attributes in red color are selected by our method)
Result of LVW (2)
Result of WSFG
Rule induction system:
C4.5
Results
the most relevant attribute first
Result of WSFG (2)
(class: direct death)
eval_of_treat, liver_meta, peritoneal_meta, typeofcancer,
chemotherapypos, combined_rese1, ln_metastasis, location_lon2,
depth, pre_oper_comp1, histological, growth_pattern,vascular_inva,
location_cir1,location_lon3, cellular_atyp, maximal_diam,
pre_oper_comp2, location_lon1, location_cir3, sex, post_oper_com3,
age, serosal_inva, spec_histologi, proximal_surg, location_lon4,
chemotherapypre, lymphatic_inva, lymphnode_diss, structural_atyp,
distal_surg,resection_meth, combined_rese3, chemotherapyin,
location_cir4, post_oper_comp1, stromal_type, combined_rese2,
othertherapypre, othertherapyin, othertherapypos, reconstruction,
multilesions, location_cir2, pre_oper_comp3
(the most relevant attribute first)
Result of WSBG
Rule induction system:
C4.5
Result
the least relevant attribute first
Result of WSBG (2)
(class: direct death)
peritoneal_meta, liver_meta, eval_of_treat, lymphnode_diss,
reconstruction, chemotherapypos, structural_atyp, typeofcancer,
pre_oper_comp1, maximal_diam, location_lon2, combined_rese3,
othertherapypos, post_oper_com3, stromal_type, cellular_atyp,
resection_meth, location_cir3, multilesions, location_cir4,
proximal_surg, location_cir1, sex, lymphatic_inva, location_lon4,
location_lon1, location_cir2, distal_surg, post_oper_com2,
location_lon3, vascular_inva, combined_rese2, age, pre_oper_comp2,
ln_metastasis, serosal_inva, depth, growth_pattern, combined_rese1,
chemotherapyin, spec_histologi, post_oper_com1, chemotherapypre,
pre_oper_comp3, histological, othertherapypre
Result of LVI
(gastric cancer data)
Number of allowed        Executing  Number of      Number of
inconsistent instances   times      inconsistent   selected
                                    instances      attributes
80                       1          79             19
80                       2          68             16
80                       3          49             20
80                       4          61             18
80                       5          66             20
20                       1          7              49
20                       2          19             26
20                       3          19             28
20                       4          20             23
20                       5          18             26
Some Rules
Related to Direct Death
peritoneal_meta(2) ∧ pre_oper_comp1(.) ∧ post_oper_com1(L) ∧ chemotherapypos(.) S= 3*(7200/E)
location_lon1(M) ∧ post_oper_com1(L) ∧ ln_metastasis(3) ∧ chemotherapypos(.) S= 3*(2880/E)
sex(F) ∧ location_cir2(.) ∧ post_oper_com1(L) ∧ growth_pattern(2) ∧ chemotherapypos(.) S= 3*(7200/E)
location_cir1(L) ∧ location_cir2(.) ∧ post_oper_com1(L) ∧ ln_metastasis(2) ∧ chemotherapypos(.) S= 3*(25920/E)
pre_oper_comp1(.) ∧ post_oper_com1(L) ∧ histological(MUC) ∧ growth_pattern(3) ∧ chemotherapypos(.) S= 3*(64800/E)
sex(M) ∧ location_lon1(M) ∧ reconstruction(B2) ∧ pre_oper_comp1(.) ∧ structural_atyp(3) ∧ lymphatic_inva(3) ∧ vascular_inva(0) ∧ ln_metastasis(2) S=3*(345600/E)
sex(F) ∧ location_lon2(M) ∧ location_cir2(.) ∧ pre_oper_comp1(A) ∧ depth(S2) ∧ chemotherapypos(.) S= 3*(46080/E)
GDT-RS vs. Discriminant Analysis
GDT-RS:
if-then rules
multi-class, high-dimension, large-scale data can be processed
BK can be used easily
the stability and uncertainty of a rule can be expressed explicitly
continuous data must be discretized
Discriminant Analysis:
algebraic expressions
difficult to deal with multi-class data
difficult to use BK
the stability and uncertainty of a rule cannot be explained clearly
symbolic data must be quantized
GDT-RS vs. ID3 (C4.5)
GDT-RS:
BK can be used easily
the stability and uncertainty of a rule can be expressed explicitly
unseen instances are considered
the minimal set of rules covering all instances can be discovered
ID3 (C4.5):
difficult to use BK
the stability and uncertainty of a rule cannot be explained clearly
unseen instances are not considered
does not consider whether the discovered rules form a minimal set covering all instances
Rough Sets in ILP and GrC
-- An Advanced Topic --
Background and goal
The normal problem setting for ILP
Issues, observations, and solutions
Rough problem settings
Future work on RS (GrC) in ILP
ILP: Inductive Logic Programming
GrC: Granular Computing
Advantages of ILP
(Compared with Attribute-Value Learning)
It can learn knowledge which is more expressive because it is in predicate logic
It can utilize background knowledge more naturally and effectively because in ILP the examples, the background knowledge, as well as the learned knowledge are all expressed within the same logic framework.
Weak Points of ILP
(Compared with Attribute-Value Learning)
It is more difficult to handle numbers (especially continuous values) prevailing in real-world databases.
The theory and techniques for dealing with imperfect data (uncertainty, incompleteness, vagueness, impreciseness, etc. in examples, background knowledge, as well as in the learned rules) are much less mature in ILP.
Goal
Applying Granular Computing (GrC) and a special form of GrC: Rough Sets to ILP to deal with some kinds of imperfect data which occur in large real-world applications.
Normal Problem Setting for ILP
Given:
The target predicate p
The positive examples E+ and the negative examples E- (two sets of ground atoms of p)
Background knowledge B (a finite set of definite clauses)
Normal Problem Setting for ILP (2)
To find:
Hypothesis H (the defining clauses of p) which is correct with respect to E+ and E-, i.e.
1. B ∧ H is complete with respect to E+
(i.e. B ∧ H ⊨ E+)
We also say that H covers all positive examples.
2. B ∧ H is consistent with respect to E-
(i.e. B ∧ H ⊭ e for every e in E-)
We also say that H rejects all negative examples.
Normal Problem Setting for ILP (3)
Prior conditions:
1’. B is not complete with respect to E+
(otherwise there would be no learning task at all)
2’. B is consistent with respect to E-
(otherwise there would be no solution)
Everything is assumed correct and perfect.
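When entailment has already been computed as a set of ground atoms, the completeness and consistency conditions above reduce to two set checks. A minimal Python sketch (the function name `correct` and the set encoding are my own; real ILP systems compute entailment with a theorem prover rather than a set lookup):

```python
def correct(covers, e_pos, e_neg):
    """Normal ILP problem setting: H (together with B) must cover every
    positive example and reject every negative one.
    `covers` is the set of ground atoms entailed by B and H."""
    complete = e_pos <= covers          # B ∧ H ⊨ E+
    consistent = not (e_neg & covers)   # B ∧ H ⊭ e for every e in E-
    return complete and consistent
```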
Issues
In large, real-world empirical learning, uncertainty, incompleteness, vagueness, impreciseness, etc. are frequently observed in training examples, in background knowledge, as well as in the induced hypothesis.
Too strong bias may miss some useful solutions or have no solution at all.
Imperfect Data in ILP
Imperfect output
Even if the input (examples and BK) is “perfect”, there are usually several hypotheses H that can be induced.
If the input is imperfect, we have imperfect hypotheses.
Noisy data
Erroneous argument values in examples.
Erroneous classification of examples as belonging to E+ or E-.
Imperfect Data in ILP (2)
Too sparse data
The training examples are too sparse to induce reliable H.
Missing data
Missing values: some arguments of some examples have unknown values.
Missing predicates: BK lacks essential predicates (or essential clauses of some predicates) so that no non-trivial H can be induced.
Imperfect Data in ILP (3)
Indiscernible data
Some examples belong to both E+ and E-.
This presentation wi