Information Technologies Course, Unit 4 Summary
Data Management
- Summary
Introduction
Data can be described as raw, unorganized facts and concepts that need to be processed. Information is data that has been organized, categorized and put into context; data becomes meaningful when it is turned into information. Information needs to be analyzed, especially for making strategic decisions. Information does not need to be useful; knowledge is simply defined as useful information. Wisdom is concerned with judging which knowledge is true and applying it to real life.
Data can be grouped into qualitative data and quantitative data. Quantitative data is data in numerical form, whereas qualitative data is data in a non-numerical form. Quantitative (numeric) data can be continuous or discrete: continuous data can take any value in a specified range, while discrete data can take only one of a set of specified numerical values. Qualitative (categorical) data can be grouped into nominal data and ordinal data. Unlike nominal data, the categories of ordinal data have a definite order. Data can be collected explicitly or implicitly.
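A minimal illustration of this taxonomy, using hypothetical example variables:

```python
# Each hypothetical variable is tagged with its data type from the taxonomy above.
examples = {
    "temperature_celsius": ("quantitative", "continuous"),  # any value within a range
    "number_of_children":  ("quantitative", "discrete"),    # one of a set of integer values
    "eye_color":           ("qualitative",  "nominal"),     # categories with no order
    "t_shirt_size":        ("qualitative",  "ordinal"),     # ordered categories: S < M < L
}
for variable, (kind, subtype) in examples.items():
    print(f"{variable}: {kind} / {subtype}")
```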
Data management covers topics such as data definition, data architecture management, data quality management, master data management, metadata management, database management, data warehousing, reporting and so on.
Extracting knowledge from data is extremely valuable, since raw data is converted into a useful, interesting form. Data mining is the major step of the knowledge discovery process. Collected data needs to be preprocessed. Data repositories, such as relational databases, data warehouses and flat files, are used for storing data. Data mining functions are applied to the preprocessed data to discover patterns in it. Then, interestingness measures are applied to the extracted information to identify knowledge.
What is Data Management?
In order to extract useful information from data, data management is required. Managing data is important and provides several advantages:
- It provides mechanisms to make data accessible, reliable, authentic and complete.
- It provides mechanisms to securely share data between information technology systems, applications, departments or employees of an enterprise.
- It helps decision makers make strategic decisions by enabling data mining and analysis.
- It allows creating, organizing, tracking, deleting and reporting data.
- It prevents data redundancy and data loss.
- It minimizes the damage caused by data-related errors.
Managing data can be regarded as a set of tasks on data covering data quality, data security, database management, data warehousing, data architecture, reporting and document management, data definition, data integration, and metadata and master data management.
Database Management System
A database (DB) is a collection of related data. Data stored in a DB can be accessed, managed and manipulated through a special set of software called a database management system (DBMS). A data model describes the high-level structure of the data to be stored in a DB and hides its low-level representation. A data model defines how data is stored and processed in the DB by specifying constraints that the DB should obey. Data models can be grouped into conceptual (high-level) data models, physical (low-level) data models and implementation (representational) data models. Conceptual data models represent data at a conceptual level, whereas implementation data models represent the logical structure of the data. Physical data models describe how data is stored in the DB, including the constraints the DB should obey. The main steps of designing a DB are, in order, requirement collection and analysis, conceptual design, logical design and physical design.
Requirement Collection and Analysis: Requirements on data represent the requests and expectations of DB users and the rules or constraints required for the DB. DB designers interview DB users to understand data requirements, find out the required constraints on data and obtain functional requirements. Requirements should be detailed, relevant, descriptive, unambiguous, complete and consistent.
Conceptual Design: Conceptual design produces the conceptual schema of the DB. It represents entities and the relationships between them in high-level view models. The Entity-Relationship (ER) model is one of the popular data models used for conceptual design. It represents data as entities, attributes and relationships. An entity is a real-world object, whereas an attribute is a property describing the related entity. An entity can be weak or strong: a strong entity exists independently in the real world, but a weak entity is dependent on another entity. Entities that share similar attributes are grouped into the same entity type, and an entity set is the collection of entities belonging to the same entity type.

An attribute can be simple, composite, multi-valued or derived. A simple attribute is an attribute that is not divisible. A composite attribute is a collection of simple attributes and can therefore be divided into sub-attributes. A multi-valued attribute contains more than one entry for a specific entity. An attribute is a derived attribute if it can be computed from other attributes. A key attribute has a distinct value for each entity. An entity type can have more than one key attribute; each is called a candidate key, and one of them is chosen as the primary key.

A relationship represents a relation between two or more entities. Relationships that share common attributes are grouped into the same relationship type. The number of participating entity types of a relationship is the degree of that relationship. A role name represents the role of a participating entity type in the relationship. A relationship is recursive if participants belonging to the same entity type participate in the relationship with different roles. Some constraints are defined on relationships. Mapping cardinality constrains the number of participating entities for an entity type in a relationship. For a binary relationship, the cardinality ratio can be one-to-one (1:1), one-to-many (1:N), many-to-one (N:1) or many-to-many (N:M). There are two types of participation constraints: if every entity in an entity set participates in the relationship type, the participation of that entity type is total; if some entities of an entity type do not participate in the relationship type, that entity type participates partially.

A database schema is represented with ER diagrams in the ER model. Basic ER diagram notations for some concepts of the ER model are given in Figure 4.4 (Page 66). The enhanced entity-relationship (EER) model was derived to define more precise constraints and relationships between entities. In the enhanced ER model, semantic data concepts such as superclass and subclass, specialization and generalization, and category (union) are introduced in addition to the basic data concepts.
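A minimal sketch of these attribute concepts in Python (the entity and attribute names below are hypothetical, not from the source): a strong entity type with a key attribute, two simple sub-attributes of a composite "name" attribute, a multi-valued attribute and a derived attribute:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Student:                       # a strong entity type
    student_id: str                  # key attribute: a distinct value for each entity
    first_name: str                  # simple sub-attribute of the composite attribute "name"
    last_name: str                   # simple sub-attribute of the composite attribute "name"
    birth_date: date                 # simple attribute
    phone_numbers: list[str] = field(default_factory=list)  # multi-valued attribute

    @property
    def age(self) -> int:            # derived attribute: computed from birth_date
        today = date.today()
        return today.year - self.birth_date.year - (
            (today.month, today.day) < (self.birth_date.month, self.birth_date.day)
        )
```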
Logical Design: Logical design maps ER or enhanced ER concepts into relational schemas. It consists of identifying primary keys and foreign keys, mapping the conceptual schema into a relational schema, and normalizing the relational schema. Mapping a conceptual design into a relational schema results in tables of single-valued attributes. A table represents an entity type or relationship type, where its columns represent the attributes of the corresponding entity or relationship type and its rows are entities. A table, a column and a row are also referred to as a relation, an attribute and a tuple, respectively. An attribute can take a NULL value if it is undefined, unknown or unavailable.

Relational schemas have some restrictions, such as domain constraints, key constraints, entity integrity constraints and referential integrity constraints. According to domain constraints, an attribute must take a value from its own domain, and that value must be indivisible. According to key constraints, a relation is a set of distinct tuples; thus, two tuples cannot have the same combination of values for all attributes in any relation state. A superkey is a set of one or more attributes that uniquely identifies each tuple in a relation; for each tuple, the superkey attributes have unique values. A relation may have several superkeys; a minimal superkey is called a candidate key, and one of the candidate keys (generally the one containing the minimum number of attributes) is chosen as the primary key. Primary keys are represented with underlines in the logical design. According to entity integrity constraints, a primary key cannot take a NULL value. Referential integrity constraints are defined on relationships in order to keep the participants consistent. A foreign key in a relation R1 is the primary key set of a referenced relation R2; the foreign key of R1 refers to R2. R1 is called the referencing relation and R2 the referenced relation. According to referential integrity constraints, any value of a foreign key in R1 must also appear as a primary key value in R2.

Insert, update and delete are the three basic operations for manipulating data. When the attributes of different entities or relationships are collected into one relation, anomalies occur when performing these three basic operations. Bad designs also cause errors in join operations, so a database design should satisfy the lossless join condition. A good database is designed using functional dependencies (FDs). FDs represent the meaning of and inter-relationships between attributes. A set of attributes is functionally dependent on an attribute A if A uniquely identifies that set of attributes. Utilizing FDs, normalization rules such as First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF) and Fifth Normal Form (5NF) are defined for designing good databases. 1NF, 2NF, 3NF and BCNF are based on keys and FDs; 4NF is based on keys and multi-valued dependencies, whereas 5NF is based on keys and join dependencies. 1NF forces a relation to be non-nested and disallows composite, multi-valued and non-atomic attributes. If a relation is in 1NF and all of its non-prime attributes are fully functionally dependent on the primary key, it is in 2NF. If a relation is in 2NF and no non-prime attribute is transitively dependent on the primary key, it is in 3NF.
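The key and integrity constraints above can be demonstrated with Python's built-in sqlite3 module; a minimal sketch, with hypothetical table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,   -- primary key: unique; entity integrity forbids NULL
    dept_name TEXT NOT NULL
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    dept_id INTEGER REFERENCES department(dept_id)  -- foreign key: referential integrity
);
""")

conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (10, 'Ada', 1)")       # OK: department 1 exists
try:
    conn.execute("INSERT INTO employee VALUES (11, 'Bob', 99)")  # department 99 does not exist
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # referential integrity violation
```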
A relation in a given normal form also satisfies all earlier normal forms. The steps of mapping ER models into a relational design are given as follows (a sketch of the N:M case appears after the list):
- Mapping of strong entity types: A relation R is created for each strong entity type E, containing E's simple attributes. Instead of a composite attribute A, A's simple sub-attributes are added to R. One of the candidate keys is chosen as the primary key. For each multi-valued attribute of E, a separate relation is generated, and the primary key of E is added to it as a foreign key. No column is added to R for derived attributes.
- Mapping of weak entity types: For each weak entity type W, a relation R is created as for strong entity types. Additionally, since a weak entity type is dependent on another (owner) strong entity type, the owner entity type's primary key attributes are inserted as columns of R. The combination of the partial key of W and the primary key of the owner entity type forms the primary key of R.
- Mapping of 1:1 binary relationship types: There are three possible ways of mapping a 1:1 binary relationship type R relating two entity types to a relational design. One choice is the foreign key approach, where the primary key of one of the entity types is inserted into the other entity type's relation as a foreign key. The merged relation approach is valid when both entity types participate totally in R; in this case, the two entity types are merged into one relation. As a third choice, a cross-reference relation is created in which the primary keys of the two entity types are used as foreign keys, and one of those foreign keys is chosen as the primary key.
- Mapping of 1:N or N:1 binary relationship types: For a 1:N or N:1 binary relationship type, the participating entity type E on the N-side of the relationship is chosen as the identifier. The primary key of the participating entity type on the 1-side of the relationship type is added to E's relation as a foreign key.
- Mapping of N:M binary relationship types: For each N:M binary relationship type, a new relation R is generated. The primary keys of both participating entity types are added to R as foreign keys, and together they form R's primary key.
- Mapping of n-ary relationship types: For each n-ary relationship type, a new relation R is generated. The primary keys of all participating entity types are added to R as foreign keys. The primary key of R is the combination of the primary keys of the participants on the N-sides of the relationship type.
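As announced above, a sketch of the N:M case with hypothetical STUDENT and COURSE entity types, again using Python's sqlite3 module: the relationship becomes its own relation, whose primary key combines the foreign keys of both participants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);

-- N:M relationship type mapped to its own relation:
-- both participants' primary keys become foreign keys,
-- and their combination is the primary key of the new relation.
CREATE TABLE enrolled_in (
    student_id INTEGER REFERENCES student(student_id),
    course_id  INTEGER REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
""")
```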
Physical Design: Physical design specifies the database in terms of the physical configuration of the chosen DBMS, such as indexes and physical file storage. The logical design of the database is mapped into a physical database utilizing the specifications of the chosen DBMS.
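As one small piece of physical design, a secondary index can be declared on a frequently searched column; a minimal sketch in SQLite, with hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, last_name TEXT)")
# A secondary index: a physical-design decision that speeds up lookups by last_name
conn.execute("CREATE INDEX idx_employee_last_name ON employee(last_name)")
```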
Knowledge Discovery
Knowledge discovery in databases (KDD) is the process of extracting interesting and useful information from huge amounts of data, which can come from structured and unstructured sources. KDD is a sequence of seven steps, performed in order: data collection, data preprocessing, data warehousing, tasking relevant data, data mining, pattern evaluation and visualization.
Data Collection: In the first step, data is collected so that knowledge can be extracted from it. Data can be collected explicitly or implicitly.
Data Preprocessing: Techniques such as data cleaning, data integration, data transformation, data reduction, discretization and generating concept hierarchies are used to preprocess the data and improve its quality.
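A minimal preprocessing sketch, assuming the pandas library and a hypothetical data set, combining data cleaning (duplicates and missing values) with a simple min-max transformation:

```python
import pandas as pd

# Hypothetical raw data with a missing value and a duplicate row
df = pd.DataFrame({
    "age":    [25, 32, None, 32, 41],
    "salary": [3000, 4500, 5200, 4500, 7000],
})

df = df.drop_duplicates()                        # data cleaning: remove duplicate records
df["age"] = df["age"].fillna(df["age"].mean())   # data cleaning: fill missing values

# data transformation: min-max normalization of salary into [0, 1]
df["salary"] = (df["salary"] - df["salary"].min()) / (df["salary"].max() - df["salary"].min())
print(df)
```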
Data Warehousing: KDD extracts knowledge from large amounts of data stored in databases, data warehouses or other data repositories such as flat files, data streams and so on. A data warehouse is a data store for data from multiple data sources. It allows applications to analyze historical data to support decision making and analytical reporting. A data warehouse is subject-oriented, time-variant, nonvolatile and integrated.

Operational database systems are online transaction processing (OLTP) systems: they perform the daily transactions of an organization, and the data they use is highly detailed and up-to-date. Data warehouses are online analytical processing (OLAP) systems, utilized by knowledge workers such as managers or analysts; historical, summarized data is processed for analytical purposes and decision making. Operational databases are designed for processing daily transactions such as indexing, searching, inserting, modifying and deleting records, whereas OLAP systems require complex queries over summarized data. Since performing analytical operations directly on operational databases would degrade their performance, data warehouses are needed. Data warehouses utilize the multidimensional data model to represent data; the star schema, snowflake schema and fact constellation schema are types of multidimensional data model.
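A minimal star-schema sketch, again with sqlite3 and hypothetical table names: a central fact table references the dimension tables, and an OLAP-style query summarizes the historical measures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the context of each measurement
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);

-- The fact table holds measures plus foreign keys to the dimensions (star schema)
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);

INSERT INTO dim_date    VALUES (1, 2023, 5);
INSERT INTO dim_product VALUES (1, 'books');
INSERT INTO fact_sales  VALUES (1, 1, 19.90), (1, 1, 5.00);
""")

# A typical OLAP-style query: summarize sales per year and product category
query = """
SELECT d.year, p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_date d    ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.year, p.category
"""
for row in conn.execute(query):
    print(row)
```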
Tasking Relevant Data: The step of tasking relevant data is data selection for analytical and processing purposes. In other words, data selection is retrieving the relevant data that will be analyzed or processed from databases, data warehouses or other data repositories.
Data Mining: Data mining is the extraction of data patterns from huge amounts of data, and it is the major step in the KDD process. Data mining is used for purposes such as classification, clustering, mining frequent patterns and associations, prediction and outlier analysis.
Data mining functionalities specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be grouped as descriptive and predictive. Predictive tasks utilize existing data to make predictions, whereas descriptive tasks characterize properties of the mined data. The predictive data mining tasks are given as follows (a classification sketch appears after the list):
- Classification: Classification is the process of understanding data classes or concepts. Class labels of the data are given for the purpose of classification. After finding a model that describes the data classes, the class labels of data not yet assigned to a class are predicted; the predicted labels are categorical. Data classification is a two-step process of training and prediction. In the training step, a classifier model is created by analyzing data objects and their class labels. In the prediction step, the unknown class labels of data objects are predicted using the generated classifier model. Classification is an example of supervised learning, which is learning the structure of data whose class labels are predefined.
- Prediction: In the prediction process, unavailable numerical values are predicted. Prediction is similar to classification, but it works on continuous or ordered values: prediction maps its input into a continuous or ordered value.
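A minimal classification sketch, assuming the scikit-learn library and its bundled iris data set: a decision-tree classifier is built in the training step and then assigns categorical class labels to unseen objects in the prediction step.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # data objects and their known class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)  # training step: build the classifier model
predictions = clf.predict(X_test)                     # prediction step: assign class labels
print("accuracy:", clf.score(X_test, y_test))
```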
Descriptive data mining tasks are given as follows:
- Class/Concept description: Data can be described with classes or concepts. These descriptions are inferred with data characterization and data discrimination.
- Mining of frequent patterns: This is the process of mining patterns that frequently occur in data. There are three types of frequent patterns: frequent item sets, frequent subsequences and frequent substructures. Mining frequent item sets is the process of mining items that frequently appear together. A frequent subsequence is a sequence of patterns that frequently occur one after another. A frequent substructure is a structural form associated with item sets or subsequences.
- Mining Associations: Associations are patterns such as items frequently purchased together. Mining associations is the process of discovering relations among data. Mining association rules has an important role in data mining, especially for understanding customers' purchase behaviors.
- Mining Correlations: Correlations represent relations between items or their attributes. Two items can be positively or negatively correlated, or mining may reveal no correlation between them at all.
- Mining Clusters: Clustering brings together similar objects that do not have class labels, in order to extract significant features representing either strong intra-cluster relations or inter-cluster differences. Clustering is an example of unsupervised learning, which is learning the structure of data whose class labels are not predefined. Clustering methods can be classified into several groups, such as partitioning-based, hierarchical-based, density-based and model-based approaches. Partitioning-based methods divide n objects into k partitions, where k is smaller than or equal to n; each partition contains at least one object, and each object belongs to only one partition. In hierarchical-based clustering, objects are divided hierarchically; hierarchical methods can be classified into agglomerative and divisive approaches. Density-based clustering methods group objects into clusters based on cluster density, which lets them represent clusters of arbitrary shape; an object in a cluster must have at least a minimum number of neighbors within a given radius. In model-based methods, objects are assigned to clusters based on defined cluster models, and the fit between an object and the clusters' models is optimized.
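A minimal partitioning-based clustering sketch, assuming the scikit-learn library and hypothetical unlabeled points: k-means divides the n objects into k clusters without any class labels, and each object belongs to exactly one partition.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled 2-D points forming two loose groups
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", kmeans.labels_)          # each object lands in exactly one partition
print("cluster centers:", kmeans.cluster_centers_)
```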
Pattern Evaluation: In the pattern evaluation step, interesting patterns are identified with interestingness measures, since the patterns extracted by data mining may be either interesting or uninteresting. Interestingness measures are categorized as objective and subjective. Objective interestingness measures are based on the structure of the extracted patterns and utilize statistical techniques; support and confidence are objective measures used in association rule mining. Subjective interestingness measures consider users' needs, interests and understanding of the domain. Unexpectedness, actionability and novelty are the main subjective interestingness measures: a pattern is unexpected for a user if it lies outside the user's understanding of the domain and surprises the user; it is actionable if it represents strategic information the user can act on; and it is novel if it contributes new knowledge.
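A small worked sketch of the two objective measures, with hypothetical transactions: support is the fraction of transactions containing an item set, and the confidence of a rule A => B is the fraction of transactions containing A that also contain B.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction also containing the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 2/4 = 0.5
print(confidence({"bread"}, {"milk"}))  # 2/3 ≈ 0.67
```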
Visualization: Extracted knowledge is visualized in the last step of the KDD process. Discovered knowledge can be visualized with tables, graphs, charts, cubes, decision trees, diagrams, maps and so on.
Metadata Management
Metadata is generally defined as data about other data, or a catalog for data. It describes, identifies and explains a data resource and supports managing and using it. Metadata can be grouped into descriptive, structural and administrative metadata. Descriptive metadata identifies a data resource. Administrative metadata defines the ways of managing a data resource. Structural metadata describes the relationships between the parts of a composite object. Metadata provides information about a data resource, so it is valuable for everyone in every domain. Especially in business life, metadata helps with understanding and analyzing data for future plans, and it helps decision makers make meaningful, strategic decisions. Metadata enables discovering resources, preserving and archiving digital information, and organizing and managing resources. Metadata must be available, trustworthy in terms of quality and permanent for these functions to be realized. Thus, metadata needs to be managed in order to understand and manage complex data resources and to support strategic decisions for future plans.