R

All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.

Description

International Journal of Database Management Systems ( IJDMS )

Tags

Transcript

International Journal of Database Management Systems ( IJDMS ) Vol.4, No.5, October 2012DOI: 10.5121/ijdms.2012.4505 61
R
EDUCTION
OF
N
UMBER
OF
A
SSOCIATION
R
ULES
W
ITH
I
NTER
I
TEMSET
D
ISTANCE IN
T
RANSACTION
D
ATABASES
Pankaj Kumar Deva Sarma
1
and Anjana Kakati Mahanta
2
1
Department of Computer Science, Assam University, Silchar, Assam, IndiaPIN-788011
pankajkdsarma@yahoo.com, pankajgr@rediffmail.com
2
Department of Computer Science, Gauhati University, Guwahati, Assam, IndiaPIN - 781014
anjanagu@yahoo.com
ABSTRACT
Association Rule discovery has been an important problem of investigation in knowledge discovery and data mining. An association rule describes associations among the sets of items which occur together intransactions of databases.The Association Rule mining task consists of finding the frequent itemsets and therules in the form of conditional implications with respect to some prespecified threshold values of support and confidence.The interestingness of Association Rules are determined by these two measures. However,other measures of interestingness like lift and conviction are also used. But, there occurs an explosivegrowth of discovered association rules and many of such rules are insignificant. In this paper we introducea new measure of interestingness called Inter Itemset Distance or Spread and implemented this notionbased on the approaches of the apriori algorithm with a view to reduce the number of discovered Association Rules in a meaningful manner. An analysis of the working of the new algorithm is done and theresults are presented and compared with the results of conventional apriori algorithm.
KEYWORDS
Association rules, frequent itemsets, support, confidence, Inter itemset distance, spread, data mining.
1. I
NTRODUCTION
Data Mining or Knowledge Discovery in Databases came to being as a field of research sincethe very early 1990s and since it has grown tremendously with wide ranging applications fromBusiness, Science and Engineering Research, Medical Diagnosis, DNA and Genome dataanalysis. Data Mining concerns with design and development of computational algorithms andtechniques for discovering hidden patterns and rules which are nontrivial, interesting, previouslyunknown and potentially useful from data in databases [1]. These discovered rules are immenselyuseful in decision making.
The problem of Association Rule Mining is to discover a set of items shared among a largenumber of transactions in a database. In association rule discovery, it is found how the presenceof a set of items in a transaction influences the presence of another set of items in the sametransaction. For example, let us consider the database of daily issue and return of books of a
International Journal of Database Management Systems ( IJDMS ) Vol.4, No.5, October 201262
particular library, where the transactions represent the subscribers of the Library and the attributesor items represent the books. Here while the subscribers get books issued in their names andreturns, these are recorded as transactions in the database of issue and return. The volume of transactions thus generated can become large over a period of time and stores the subscribers’usage pattern of the library. However, these patterns cannot be discovered by conventional meansand needs relevant data mining technique to discover the same. One of the discovered patternscould be the set of books which are most frequently issued or returned together by the subscribers.For example, 50% of the subscribers who issue Ulman’s
Theory of Automata
also issue
Compilers
by the same author. The library can utilize this knowledge and many such for betterservice to the subscribers. The problem is to discover all such association rules in large databases.Efficient algorithms are required to be designed with appropriate measure of interestingness.Further, there are issues related to measure of interestingness of association rules. Supportand confidence are the two most widely used mearures. Other parameters include correlation orlift or interest and conviction. The concept of correlation rule is introduced in [5]. A correlationrule is defined as a set of itemsets that are correlated. The motivation for developing such a rule isthat negative correlations may be useful [6] [56]
.
Correlation satisfies upward closure in theitemset lattice. Thus if a set is correlated then every superset of it is also correlated. For
anassociation rule of the form A
=> Bin a transaction database, where A and B are the itemsets,
support
of an itemset is defined as the frequency of its occurrence in the database. The
support
(s)(expressed as percentage) of the rule A
=> Bis the probability of occurrence of the itemset (A UB) and is given bys (A
=> B) = P(A U B) (1)The
confidence
(c) of the rule A
=> B is defined asc (A
=> B) = support(A U B) / support (A) = P(A U B) / P(A) (2)The value of support and confidence measures the strength of a rule. Chi squared test forindependence is another measure for significance of rules [6] [56]
.
The Chi squared significancetest takes into account both presence or absence of items in sets. On the other hand support andconfidence are the measures which are based only on the presence of items in the sets. The task of discovering interesting Association Rules from large databases computationally intensive. Thesearch space is exponential in terms of the number of distinct database items present in thetransaction database. Further, there are billions of database transactions for which there occursshortage of main memory to accommodate the transactions of the large databases. This increasesthe data transfer activity. Most of the approaches of Association Rule Mining require multipledatabase scans, which is expensive[2]. Alternatively, Parallel Algorithms [3] and algorithmsbased on sampling [4] are also designed and implemented. The method of sampling was used tocontain the explosive growth of the search space. However, the methods based on sampling donot always get the true representation of the data to work upon can be sensitive to the data - skewwhich may adversely affect the performance [2]. The parallel algorithms have the additionaloverhead of data transfer and message passing which may consume longer time than desired.Research issues related to the association rule mining algorithms include scalability, exponentialgrowth of the search space and the increase in discovered rules with the increase in the number of items in the database, multiple database scans and I/O reduction, reduction of the database scansand so on.
International Journal of Database Management Systems ( IJDMS ) Vol.4, No.5, October 201263
The rest of the paper is organized as follows. In section 2, we introduce the concept of Average Inter Itemset Distance or Spread as new measure of interesingness for reducing the rules.The association rule discovery problem is described in section 3. Related works are analysed insection 4. Mining association rules with average inter itemset distance or spread is discussed insection 5. In section 6, the algorithmic steps and its analysis for the mining association rulesbased on average inter itemset distance along with support and confidence is given. In section 7,the working of the algorithm is demonstrated with an example. In section 8, implementation andresults of the proposed algorithm is presented. In section 9, conclusion and future scope of worksis discussed.
2. I
NTER
I
TEMSET
D
ISTANCE
(SPREAD)
AS A
M
EASURE
OF
I
NTERESTINGNESS
Support and confidence are by far the most widely used and popular measures of interestingness. Confidence is similar to the conditional probability of occurrence of a rule. Tofind the support of an itemset, the database is scanned and only the frequency of occurrence of theitemset is calculated irrespective of the position of occurrence of the itemset in the transactions.The position of occurrence of an itemset is according to the Transaction IDs (TID) of thetransactions in which the itemset is found to be present and there are number of transactions inwhich the itemset may not have been occurred. These transactions of nonoccurrence of theitemset are of importance for the relative patterns of occurrences of the itemsets while miningassociation rules. Such transactions in which an itemset has not occurred are calculated and thediscovered association rules are reinforced with additional meaning with the existing measuressupport and confidence. It is observed that the measures support, confidence, correlation etc. forassociation rules as mentioned above do not take into account the number of intervenningtransactions between two consecutive occurrences of an itemset from which the rules aregenerated in the database of transactions. When the support of an itemset is counted to determinewhether the itemset is frequent or not, only the overall occurrences of each itemset is counted inthe whole transaction database. From the values of support count of the itemsets, it is not possibleto know how these itemsets are distributed with respct to one another in the whole transactiondatabase. Therefore, it is required to find out after one occurrence of an itemset in the transactiondatabase, how many transactions are there in the middle in which the same itemset has notoccurred before its next occurrence and so on. In this manner, if continued then there are severalintervenning transactions for the occurrences of an itemset till its last occurrence. All theseintervenning number of transactions separating the consecutive occurrences of an itemset, whenadded together and divided by the number of the gaps of nonoccurrences, then we get the averageseparation in terms of the number of transactions in which the itemset has not occurred betweenthe first and the last occurrences of the itemset. This, we call
Average Inter Itemset Distance(IID) or Spread
for an itemset.The value of
Average Inter Itemset Distance or Spread
gives an indication how closely orsparsely the itemsets in the transaction database are separated from each other within its lifespan.The
lifespan
of an itemset is the number of transactions including the first and the lasttransactions of occurrence of the itemset which need not necessarily always be the first and thelast transactions of the database. Together with the support, average inter itemset distance orspread is used as another measure for an association rule generated from a frequent itemset. Byusing threshold values for average inter itemset distance in addition to the threshold values forsupport and confidence, association rules can be discovered. The smaller the value of thethreshold for average inter itemset distance the closure will be the spacing between the successiveoccurrences of an itemset. Thus with the help of a threshold for average inter itemset distance the
International Journal of Database Management Systems ( IJDMS ) Vol.4, No.5, October 201264
number of frequent itemsets and hence the number of association rules genetrated can be reduced.The value of the avarage inter itemset distance implies that on an average, after how manytransactions the same itemset repeats itself in the transactions of the database. Therefore theoccurrence patterns of an itemset which can be obtained by using the avarage inter itemsetdistance or spread cannot be obtained with support. The following is an example in this context.Let us suppose in a database of 50 transactions containing items {A, B, C, D, E} the itemset{A, B, C} occurs in the transactions 1, 5, 6, 25, 33 and 42. Thus this itemset has support count 6.Another itemset {B, C, E} occurs in the transactions 5, 7, 12, 15, 17 and 19. It also has supportcount 6. In this case the occurrences of the itemset {B, C, E} is closure to each other as comparedto the itemset {A, B, C}. Though the support of both the itemsets is the same, there is a differencein their pattern of occurrences in the whole database. This cannot be identified based on thesupport of an itemset. But with the introduction of average inter itemset distance or spread asanother measure for a frequent itemset and the corresponding association rules, this kind of pattern of occurrences can be discovered. The parameter average inter itemset distance can beused along with support and confidence for a rule.The itemsets which satisfy the input thresold value for average inter itemset distance orspread is called
closely spaced itemsets
. An algorithm for discovering association rules withaverage inter itemset distance or spread, support and confidence based on prespecified thresholdvalues is proposed based on the apriori algorithm. The apriori algorithm has been discussed in [3][7] [8] [9]. The apriori algorithm and the proposed algorithm are implemented and resultsobtained from standard datasets are presented. It is observed that there is a reduction in thenumber of association rules discovered based on this approach as compared to the apriorialgorithm.
3. P
ROBLEM
S
TATEMENT
: M
INING
OF
A
SSOCIATION
R
ULES
The problem of mining Association Rules was first introduced in [7]. It is described asfollows: Let I = {i
1
, i
2
, i
3
, … …. …. i
m
} be a set of literals called items and D be a database of transactions, where each transaction T is a set of items such that T I. Given an itemset X I, atransaction T contains X if and only if XT. In other words, I = {i
1
, i
2
, i
3
, … …. …. i
m
} is a setof attributes over the binary domain {0,1}. Each transaction T is a tuple of the database D and isrepresented by identifying the attributes with value 1. A unique identifier called TID is associatedwith each transaction. A set of items X I is called an itemset.An association rule is an implication of the form X=> Y such thatX I, YI and X
∩
Y =
Φ
. The rule X=>Y holds in the transaction database D with confidence c if c% of the transactionsin D that contain X also contain Y. The rule X=>Y has support s in the transaction database D if s% of transactions in D contains X U Y. Confidence denotes the strength of implication andsupport indicates the frequencies of the occurring patterns in the rule. Rules with high confidenceand strong support are referred to as strong rules in large databases [7].Given a set of transactions D, the problem of mining association rules is to generate allassociation rules that have certain user specified minimum support (called
minsup
) andconfidence (called
minconf
). Such an association rule is also called strong rule to distinguish itfrom weak ones i. e. those rules which do not meet the thresholds [11]. A set of items is called an
itemset
. An itemset with k-items is called a
k-itemset
. The support count of an itemset X denotedby
σ
(X) is the number of transactions in which it occurs as a subset. Support of an itemset is alsoexpressed as percentage. That is for an itemset X, its percentage support is defined as thepercentage of transactions in the database D which contains X. Support count denotes thefrequency of occurrence of an itemset in the database D. Given a minimum support threshold

Related Search

We Need Your Support

Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks