Data Mining using SQL Server Analysis Services
In today’s world, software applications are moving from traditional information systems to business intelligence systems. The growing volume of data, and the knowledge hidden within it, has created the need for a new kind of application for organizations. To address this, data mining solutions have become an integral part of many software solutions.
Data mining is the process of discovering actionable information from large sets of data. It helps organizations analyze enormous amounts of data in order to detect common patterns or learn new things, using mathematical analysis to derive patterns and trends from existing data. Mathematical analysis is needed because these patterns typically cannot be discovered by traditional data exploration methods: the relationships are too complex, or there is too much data. However, existing data usually needs to be organized via an ETL (Extract, Transform, Load) process before mining techniques are applied.
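To make the ETL idea concrete, here is a minimal sketch in plain Python: extract raw rows, transform them into a clean shape, and load them into a store. The column names and sample values are invented for illustration; real ETL pipelines would use tools such as Integration Services.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with a header row and one row with a missing value.
RAW = """customer,amount
alice, 120.50
bob,
carol, 80
"""

def extract(text):
    # Extract: parse the raw source into records.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: trim whitespace, drop rows with missing values, convert types.
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        clean.append((row["customer"].strip(), float(amount)))
    return clean

def load(rows):
    # Load: insert the cleaned rows into a relational store.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return con

con = load(transform(extract(RAW)))
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]   # 200.5
```

The bad row for "bob" is filtered out during the transform step, so only clean, typed data reaches the store that the mining step would later read from.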
There are many data mining techniques available for analyzing data and deriving useful knowledge and patterns from it. These techniques range from basic to extremely complex, and each serves a slightly different purpose or goal. Here are a few example approaches to data mining.
Cluster Detection
Cluster detection is a type of pattern recognition used to detect patterns within large data sets. It is a bit like arranging a large amount of information into categories, using patterns that emerge during data analysis.
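The general idea can be sketched with a tiny k-means loop in plain Python: points are grouped around centroids, and the centroids move toward the groups that emerge. This is only an illustration of the technique; Analysis Services ships its own clustering algorithm.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy one-dimensional k-means: alternate assignment and update steps."""
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [sum(v) / len(v) if v else centroids[i]
                     for i, v in clusters.items()]
    return centroids, clusters

# Invented sample data with two obvious groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])
# centroids converge to roughly [1.0, 9.0], one per emergent category
```

The two categories were never labelled in advance; they emerge purely from the distances between the data points, which is what distinguishes clustering from classification.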
Classification Analysis
Classification analysis is a systematic process for obtaining important and relevant information about data and metadata (data about data). It helps identify which set of categories different types of data belong to. Classification analysis is closely linked to cluster analysis, as classification can be used to cluster data.
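A minimal sketch of the idea, using a toy nearest-centroid classifier in plain Python: unlike clustering, the categories are known in advance, and new items are assigned to the category their training examples most resemble. The labels and numbers are invented for illustration.

```python
def train(examples):
    """Learn one centroid (mean feature value) per label from (value, label) pairs."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(model, x):
    # Pick the label whose centroid is closest to the new value.
    return min(model, key=lambda label: abs(x - model[label]))

model = train([(1.0, "small"), (2.0, "small"), (10.0, "large"), (12.0, "large")])
label = classify(model, 9.0)   # 9.0 is nearer the "large" centroid (11.0)
```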
Regression
Regression is a technique that aims to predict future outcomes from large sets of existing variables. It is used to predict future user engagement, customer retention and even property prices.
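The simplest form, fitting a straight line by ordinary least squares, can be written in a few lines of plain Python. The property-price numbers below are made up purely to illustrate the prediction step.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data: property size (sqm) vs price (arbitrary units).
sizes  = [50.0, 70.0, 90.0, 110.0]
prices = [100.0, 140.0, 180.0, 220.0]
slope, intercept = fit_line(sizes, prices)

# Predict the price of an unseen 130 sqm property from the fitted line.
predicted = slope * 130.0 + intercept   # 260.0
```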
Anomaly or Outlier Detection
Anomaly detection refers to the search for data items in a dataset that do not match a projected pattern or expected behaviour. Anomalies are also called outliers, exceptions, surprises or contaminants and they often provide critical and actionable information.
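A basic version of this search can be sketched with a z-score test in plain Python: flag any value more than a few standard deviations from the mean as not matching the expected pattern. The sensor readings below are invented for illustration.

```python
import statistics

def outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

readings = [10.1, 9.8, 10.0, 10.2, 9.9, 25.0]
print(outliers(readings))   # [25.0] — the reading that does not fit the pattern
```

The flagged value is exactly the kind of "surprise" the text describes: in a fraud or fault-detection setting it would be the record worth investigating.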
Association Rule Learning
Association rule learning enables the discovery of interesting relations between different variables in large databases. It uncovers hidden patterns in the data that can be used to identify variables, and co-occurrences of variables, that appear with the greatest frequency.
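The core statistic behind association rule learning is how often items occur together, which can be sketched in a few lines of plain Python. The shopping baskets below are the classic invented example, not real data.

```python
from itertools import combinations
from collections import Counter

# Invented transactions: each basket is the set of items bought together.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "chips"},
]

# Count the co-occurrences of every pair of items across all baskets.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, support = pair_counts.most_common(1)[0]
# ("bread", "butter") and ("bread", "milk") each appear together in 2 baskets
```

A full association algorithm builds on these counts to produce rules such as "customers who buy bread also tend to buy butter", filtered by support and confidence thresholds.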
Microsoft SQL Server Data Tools
There are many data mining tools available for analyzing data and making predictions. Here we are going to talk about Microsoft SQL Server Data Tools and how its features support building and deploying a mining model.
Microsoft SQL Server Data Tools supports SQL Server relational databases, Azure SQL databases, Integration Services packages, Analysis Services data models and Reporting Services. Analysis Services contains data mining algorithms and query tools that make it easy to build a comprehensive solution for a variety of projects. SQL Server Management Studio contains tools for browsing models and managing data mining objects.
Developing a mining application generally involves the following steps:
- Defining the problem
- Preparing Data
- Exploring Data
- Building Models
- Exploring and Validating Models
- Deploying and updating Models
The following diagram explains the cyclical flow of creating a data mining model. Each step in the process may need to be repeated many times in order to create a good model.
Image source: https://i-msdn.sec.s-msft.com/dynimg/IC125015.jpeg
Defining the Problem
The first step is to determine the scope of the problem and analyze the business requirements in order to define specific goals for the data mining project. Here we might also need to conduct a data availability study.
Preparing Data
In this step we are typically working with a very large data set and cannot examine every transaction for data quality; therefore, we might need data profiling and automated data cleansing and filtering tools, such as Microsoft SQL Server Master Data Services or SQL Server Data Quality Services, to explore the data and find inconsistencies.
Exploring Data
In this step we can use tools such as Master Data Services to canvass available sources of data and determine their availability for data mining. We can use tools such as SQL Server Data Quality Services, or the Data Profiler in Integration Services, to analyze the distribution of the data and repair issues such as incorrect or missing values.
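The kind of distribution check these profiling tools automate can be illustrated with a tiny sketch in plain Python: count the missing values per column before deciding how to clean them. The rows and column names are invented for illustration.

```python
# Hypothetical extracted rows; None marks a missing value.
rows = [
    {"customer": "alice", "age": 34,   "city": "Colombo"},
    {"customer": "bob",   "age": None, "city": "Kandy"},
    {"customer": "carol", "age": 29,   "city": None},
    {"customer": None,    "age": 41,   "city": "Galle"},
]

# Profile: how many values are missing in each column?
missing = {col: sum(1 for row in rows if row[col] is None)
           for col in rows[0]}
print(missing)   # {'customer': 1, 'age': 1, 'city': 1}
```

A real profiler reports much more (value distributions, candidate keys, pattern violations), but the output drives the same decision: which columns need repair before the mining structure is built.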
Building Models
In this step we specify the columns of data we want to use by creating a mining structure. When the mining structure is processed, Analysis Services generates aggregates and other statistical information that can be used for analysis. This information can be used by any mining model that is based on the structure.
Processing a model is known as training: applying a specific mathematical algorithm to the data in the structure in order to extract patterns. The patterns found during training depend on three things:
- The selection of training data
- The algorithm we choose
- How we have configured the algorithm
We can also use parameters to adjust each algorithm, and apply filters to the training data to use just a subset of it. After the data is passed through the model, the mining model object contains summaries and patterns that can be queried or used for prediction.
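A small sketch in plain Python of how such choices change what a model learns: the same trivial "mean value" summary trained on all the data versus on a filtered subset produces different patterns. The segments and numbers are invented for illustration.

```python
# Invented rows: (customer segment, order value).
data = [("retail", 10.0), ("retail", 14.0),
        ("wholesale", 100.0), ("wholesale", 140.0)]

def train_mean(rows, segment=None):
    """A trivial 'model': the mean order value, optionally over a filtered subset."""
    values = [v for seg, v in rows if segment is None or seg == segment]
    return sum(values) / len(values)

model_all = train_mean(data)                       # 66.0 — dominated by wholesale
model_retail = train_mean(data, segment="retail")  # 12.0 — the retail-only pattern
```

The filtered model summarizes a very different population, which is exactly why filter and parameter choices must match the business question defined in the first step.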
Exploring and Validating Models
By using Analysis Services we can separate the data into training and testing datasets, so that we can accurately assess the performance of all models on the same data. We use the training dataset to build the model and the testing dataset to test the accuracy of the model by creating prediction queries.
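The holdout idea can be sketched in plain Python: fit a simple threshold rule on the training rows, then measure accuracy only on rows the model never saw. The data and the rule are invented stand-ins for a real mining model.

```python
# Invented labelled rows: (value, class), where values above 5 are "high".
data = [(1, "low"), (8, "high"), (2, "low"), (9, "high"), (3, "low"),
        (7, "high"), (4, "low"), (6, "high"), (0, "low"), (9, "high")]

# Holdout split: first 6 rows for training, last 4 reserved for testing.
train, test = data[:6], data[6:]

# "Train": put the decision threshold midway between the two class means.
low_vals = [x for x, y in train if y == "low"]
high_vals = [x for x, y in train if y == "high"]
threshold = (sum(low_vals) / len(low_vals) + sum(high_vals) / len(high_vals)) / 2

# "Test": score the model only on the held-out rows.
accuracy = sum(("high" if x > threshold else "low") == y for x, y in test) / len(test)
```

Because the accuracy is computed on unseen rows, it estimates how the model will behave on new data, rather than how well it memorized the training set.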
We can explore the trends and patterns that the algorithms discover by using the viewers in Data Mining Designer in SQL Server Data Tools. We can also test how well the models create predictions by using tools in the designer such as the lift chart and classification matrix. To verify whether the model is specific to our data or might be used to make inferences on the general population, we can use the statistical technique called cross-validation to automatically create subsets of the data and test the model against each subset.
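The subset mechanics behind cross-validation can be sketched in plain Python: the data is cut into k folds, and each fold in turn is held out for testing while the rest train the model, so every row is tested exactly once.

```python
def k_fold_splits(n_rows, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    fold_size = n_rows // k
    for i in range(k):
        # Fold i is held out for testing; everything else trains the model.
        test_idx = list(range(i * fold_size, (i + 1) * fold_size))
        train_idx = [j for j in range(n_rows) if j not in test_idx]
        yield train_idx, test_idx

splits = list(k_fold_splits(9, 3))   # 3 folds of 3 rows each
```

If the model scores well on every fold, it is likely capturing patterns of the general population rather than quirks of one particular subset, which is the inference the text describes.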
Deploying and Updating Models
Here we can use the models to create predictions, which we can then use to make business decisions. SQL Server provides the DMX language for creating prediction queries, and Prediction Query Builder to help build them.
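A DMX prediction query essentially applies a deployed model to new rows and returns scores that drive a decision. The shape of that step can be sketched in plain Python; the model coefficients, inputs and decision rule below are invented stand-ins, not SSAS output.

```python
# Pretend these coefficients were learned during an earlier training step.
model = {"slope": 2.0, "intercept": 5.0}

def predict(model, rows):
    """Score a batch of new input rows with the trained model."""
    return [model["slope"] * x + model["intercept"] for x in rows]

new_rows = [10.0, 20.0]
scores = predict(model, new_rows)                       # [25.0, 45.0]

# Turn scores into business decisions, e.g. which customers to contact.
decisions = ["contact" if s > 30 else "skip" for s in scores]
```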
- Shamaran Satkunanathan
- Amila Basnayaka