Data Mining – Definition, Techniques and Challenges

The concept of data analysis has always existed. The evolution of this analysis has progressed as follows:

  • 1960’s: What was my total income of last year?
  • 1980’s: What was my total income of electronics department last year?
  • 1990’s: What was my monthly net revenue of electronics department within the last year?
  • Today: What is likely to be my monthly net income this coming year and why?

This means we are now able to make predictions based on data analysis. One estimate put the size of the digital universe in December 2012 at 2,837 exabytes (EB), with a forecast to grow to 40,000 EB by 2020.

1 EB (exabyte) = 1000 PB (petabyte) = 1,000,000 TB (terabyte) 

With millions of transactions occurring every day and with the above forecast, we can only imagine how much data we are talking about. However, this leaves a big opportunity for us to sift through that data and turn it into something much more useful. In other words, it is a great opportunity to create knowledge from raw data, and that is what is called "Data Mining".

So, what exactly happens during the data mining process? The following illustration shows the overall process:

[Figure: the overall data mining process flow]

As shown above, we first select the data set we want to work on. Let's say we want to do data mining on a grocery store's financial transactions. We would first select all the transactional records. Secondly, we narrow down to the specific data sets we are interested in by discarding unnecessary ones. After that, we apply preprocessing and transformation techniques to get the data ready for mining. Then, we mine the data, perform analysis, detect patterns (if desired and available), and conclude by creating the knowledge.

You might wonder what "preprocessing" and "transformation" entail. Data preprocessing is necessary because today's real-world data is of huge size, and therefore often noisy, incomplete, and inconsistent. If we work on low-quality data, our results will be of equally low quality or worse. "Transformation", a.k.a. normalization, is one of several data preprocessing techniques: the data is scaled to fall within a smaller range, such as 0.0 to 1.0. The other techniques include data cleaning (removes noise and corrects inconsistencies), data integration (merges data from multiple sources into a coherent data store such as a data warehouse), and data reduction (reduces data size by aggregating, eliminating redundant features, or clustering).
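As a concrete illustration of the transformation step, here is a minimal min-max normalization sketch; the income figures are made up for the example:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant data: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [12000, 73600, 98000]
print(min_max_scale(incomes))  # smallest value maps to 0.0, largest to 1.0
```

The same idea underlies z-score normalization and decimal scaling; min-max is simply the easiest to show in a few lines.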

As for data mining functionalities, we categorize them into two types: Predictive and Descriptive. The former performs induction on the current data in order to make predictions, while the latter characterizes general properties of the data in a target data set. The following table shows the two types and the functionalities that belong to them.

 

Predictive            Descriptive
-------------------   ----------------------------
Classification        Clustering
Regression            Association Rule Discovery
Deviation Detection   Sequential Pattern Discovery

CLASSIFICATION

Let's say we want to know whether a person is likely to understand this blog post, based on the person's previous experience with data mining, reading, studying, or other technical backgrounds similar to data mining. Assuming that we have all this data in a database, we would define a class, "likely to understand the post?", with YES/NO values. In other words, we are classifying the blog reader; hence, we are using the classification technique and creating a class for it. Some applications based on this technique include direct marketing, fraud detection, customer attrition/churn, and sky survey cataloging.
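As a bare-bones sketch of the idea, here is a tiny 1-nearest-neighbour classifier for the "likely to understand the post?" class. The features (years of technical experience, hours of prior reading) and the training records are made up for illustration; real classifiers such as decision trees or naive Bayes work on the same principle of learning labels from labeled examples:

```python
# Toy training set: (years_of_experience, hours_of_prior_reading) -> class label
train = [
    ((5, 10), "YES"),
    ((4, 8),  "YES"),
    ((0, 1),  "NO"),
    ((1, 0),  "NO"),
]

def classify(point):
    """1-nearest-neighbour: assign the label of the closest training example."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda rec: dist2(rec[0], point))[1]

print(classify((4, 9)))  # close to the experienced readers, prints "YES"
```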

CLUSTERING

We use this technique when we need to group data into different clusters based on similarity: data points within one cluster are more similar to each other than to data points in other clusters. A typical implementation proceeds as follows:

  • Define a stopping criterion (e.g., divide students into groups of 3)
  • Define a measure of affinity (e.g., ask the students to pick the nearest students)
  • Define the centroid/medoid (e.g., ask each group to select a leader)

Some applications based on this technique include market segmentation and document clustering.
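The grouping steps above can be sketched as a bare-bones k-means loop. This is a minimal illustration on made-up 2-D points, not a production implementation; initialising the centroids from the first k points is my own simplification (real implementations pick them more carefully):

```python
def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    centroids = list(points[:k])  # naive deterministic initialisation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster's members
        centroids = [
            tuple(sum(c) / len(grp) for c in zip(*grp)) if grp else centroids[i]
            for i, grp in enumerate(clusters)
        ]
    return clusters

points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
clusters = kmeans(points, k=2)
print(clusters)  # the three low points end up together, the three high points together
```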

ASSOCIATION RULE DISCOVERY

We produce dependency rules that predict the occurrence of an item based on the occurrences of other items. For example, male customers who buy diapers on Thursdays also buy beer most of the time. We are establishing an association rule between the items "diapers" and "beer". Using this rule, we can predict that a male customer who buys diapers will most likely also buy beer. Once we have this knowledge, we can do wonders to boost our sales!
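The diapers-and-beer rule can be quantified with two standard measures, support and confidence. Here is a minimal sketch over made-up shopping baskets:

```python
# One basket per transaction (toy data for the classic diapers/beer example)
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"diapers", "beer", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the rule antecedent -> consequent holds when it applies."""
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"diapers"}, {"beer"}))  # 3 of the 4 diaper baskets also contain beer
```

Algorithms such as Apriori automate exactly this: they search for itemsets whose support and confidence clear chosen thresholds.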

Some applications: marketing and sales promotion, supermarket shelf management, and inventory management.

SEQUENTIAL PATTERN DISCOVERY

We find rules that predict strong sequential dependencies among different events. "Sequence" is the key here. For example: people who go to BestBuy on Black Friday go to Walmart right after. This means that if I were a RadioShack manager, I would see whether I could offer deals better than Walmart's to draw customers from BestBuy directly to my store.
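A bare-bones way to test such a sequential dependency is to count how often one event immediately follows another across recorded sequences. The shopper visit sequences below are made up for illustration:

```python
# Each record is one shopper's ordered sequence of store visits (toy data)
sequences = [
    ["BestBuy", "Walmart", "Target"],
    ["BestBuy", "Walmart"],
    ["Target", "BestBuy", "Walmart"],
    ["BestBuy", "Target"],
]

def follows_immediately(first, second):
    """Fraction of sequences where `second` comes right after `first`."""
    hits = sum(
        any(a == first and b == second for a, b in zip(seq, seq[1:]))
        for seq in sequences
    )
    return hits / len(sequences)

print(follows_immediately("BestBuy", "Walmart"))  # → 0.75 (3 of 4 shoppers)
```

Real sequential pattern mining algorithms (e.g., GSP or PrefixSpan) generalize this to longer patterns with gaps, but the ordering constraint is the same.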

REGRESSION

We predict the value of a given continuous variable based on the values of other variables, assuming a linear or non-linear model of dependency. For example, we can predict the sales of a new product based on advertising expenditure, or predict wind velocity as a function of temperature, humidity, and air pressure.
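As a sketch of the linear case, here is simple least-squares regression in closed form, predicting sales from advertising spend. The numbers are made up and chosen to lie on an exact line so the fit is easy to check:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

# Advertising spend (in $1000s) vs. sales (toy numbers on the line y = 2x + 1)
spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
slope, intercept = fit_line(spend, sales)
print(slope * 5 + intercept)  # predicted sales at a spend of 5 → 11.0
```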

Challenges of Data Mining

Data mining is a great process with great results that can make wonders happen. Through data mining, a struggling business can boost its revenue and avoid bankruptcy, a government security agency can prevent a crime from happening, and a credit card company can protect its customers from fraud. These are just a few of many examples. However, data miners also face challenges in this process, including scalability, high dimensionality, complex and heterogeneous data, data quality, data ownership and distribution, privacy preservation, and streaming data.

This is where I will end this post. As I learn more in class, I will be posting and sharing the knowledge – so stay tuned. : )

Also, please feel free to add\contribute to this post and let me know if there is something I should modify for a better understanding.

 
