DEPEND 2009 1
Marco Vieira, University of Coimbra, Portugal
Using the AMBER Data Repository to Analyze, Share and Cross-exploit Dependability Data
Marco Vieira [emailprotected]
University of Coimbra, Portugal The Second International Conference on Dependability (DEPEND 2009)
Athens/Glyfada, Greece, June 18,2009
Tutorial The AMBER Project
• Assessing, Measuring and Benchmarking Resilience in computer systems and components (AMBER)
• Coordination Action supported by the European Commission in the 7th FP
• Coordinating and advancing research in resilience measurement and benchmarking in computer systems and infrastructures
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 2
Current challenges
• Quality of measurements
• Integration of the human and technical components of the analysis
• Dynamic and adaptive systems and networks
• Integration with the development processes
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 3
AMBER objectives
• State-of-the art survey
• Research agenda
• Data repository
• Others: – Dissemination events (workshops, panels, etc) – Benchmarking tools – Training material
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 4
5 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
This Tutorial…
Learn how to use the AMBER Data Repository to analyze and share data
from dependability evaluation experiments
6 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Problems
• How to analyze the usually large amount of raw data produced in dependability evaluation experiments?
• How to compare results from different experiments or results of similar experiments across different systems? – Different and incompatible tools, data formats, and
setup details…
• How to share raw experimental results among research teams?
DEPEND 2009 2
Marco Vieira, University of Coimbra, Portugal
7 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Current situation
• The situation today is not good!!! • Spreadsheets and other specific tools to
analyze results – Not standard and difficult to build
• Difficult to compare data and generalize conclusions
• Researchers share final results and conclusions – Papers, mainly – Raw data is not shared
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
ADR Vision and objectives
• Vision – Become a worldwide repository for dependability related data
• Key objectives: – Provide state-of-the-art data analysis – Allow data comparison and cross-exploitation – Facilitate worldwide data sharing and
dissemination • Potential tool to increase the impact of
research 8
Data analysis approach
• Repository to analyze, compare, and share results
• Use a business intelligence approach: – Data warehouse to store data – On-Line Analytical Processing (OLAP) to analyze
data – Data mining algorithms to identify (unknown)
phenomena in the data – Information retrieval for data in textual formats
• Adopt the same life cycle of BI data 9 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 10 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Outline
1. Business Intelligence
2. Data Warehousing & OLAP
3. Using DW to analyze dependability related data
4. The AMBER Data Repository
1. Business Intelligence
What is Business Intelligence?
• Business Intelligence (BI): – Getting the right information, to the right
decision makers, at the right time • BI is an enterprise-wide platform that
supports, data gathering, reporting, analysis and decision making
• BI is meant to: – Fact-based decision making – “Single version of the truth”
• BI includes reporting and analytics 12 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
DEPEND 2009 3
Marco Vieira, University of Coimbra, Portugal
Five classic BI questions
13 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
• What happened? • What is happening? • Why did it happen? • What will happen? • What do I want to happen?
Past
Present
Future
Typical BI technologies
• ETL Tools (Extract, Transform, and Load)
• Repositories – Data Warehouse
• Analytical tools – Reporting and querying – OLAP – Data mining
• Information retrieval
14 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Many proprietary products
ACE*COMM Ab Initio Actuate ComArch CyberQuery Dimensional Insight IBM
Applix Cognos
InetSoft Informatica Information Builders LogiXML LucidEra MicroStrategy
15 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Microsoft Microsoft Analysis Services PerformancePoint Server 2007 Proclarity
Oracle Corporation Hyperion Solutions Corporation
Panorama Software Pentaho Pervasive Pilot Software, Inc. PRELYTIS Prospero Business Suite Qliktech SAP Business Inf. Warehouse Business Objects
OutlookSoft
SAS Institute
Siebel Systems
Spotfire (now Tibco)
StatSoft
SPSS
Telerik Reporting
Teradata
Thomson Data Analyzer
Some open source/free producs
• Eclipse BIRT Project:. • Freereporting.com: • JasperSoft: • OpenI: • Palo (OLAP database): • Pentaho: • RapidMiner • SpagoBI: • Weka
• Some products from big companies can be used freely
16 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
2. Data Warehousing & OLAP
18 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
What is a Data Warehouse?
• Big database that stores data for decision support
• Built from the operational data collected from transactional DB and other operational systems
Operational DB & other systems Data Warehouse
Users
Users
DEPEND 2009 4
Marco Vieira, University of Coimbra, Portugal
19 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Basic DW components
Operational DB
Legacy systems
Spreadsheets, files, ...
External sources
Data Staging
Area
Data warehouse (presentation servers)
Users
Ad hoc queries
Reports
Specific apps
Models and other tools
20 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Data volume
• Less than 20 GBytes – Small dimension; runs in a PC
• From 20 to 100 GBytes – Medium dimension; needs a powerful workstation
• From 100 Gbytes to 1 TBytes – Large dimension; needs a powerful server,
normally with parallel processing • More than 1 TBytes
– Very large dimension; massive parallel processing
21 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Some characteristics
• Temporal dependency
• Non volatile
• Target oriented
• Data integration and consistency
• Designed for queries
22 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Temporal dependency
• The data is collected over time – Do not represent a specific moment – Represents the history
• A temporal reference must be associated to all data in the database
23 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Non volatile
• The data in the DW is never updated
• The DW stores historic data (historic memory) collected from the operational databases
• After being load (from the operational databases) there is only one operation: – Queries
24 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Target oriented
• The data warehouse must only store data relevant for decision support
• Many operational data (needed for everyday management) is not relevant for the DW
DEPEND 2009 5
Marco Vieira, University of Coimbra, Portugal
25 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Data integration and consistency
• In a operational environment the information may be stored in different locations using different representations
• That data must be integrated and made consistent before being load in the DW
26 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Designed for queries
• After being load the data never changes: – Only queries are
allowed
• DW stores a large amount of data
Multidimensional view Partial denormalization
The data must be stored in such a way that improves performance
27 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Dimensional model
• Typical model in operational databases: E/R
• The dimensional model follows a different approach – Stores the same data – Data organization is user oriented
• Easy to understand • Very good performance for queries
• Data Warehouses built over complex E/R models never succeed
28 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
The multidimensional model
• Facts stored in a multidimensional array
• The dimensions are used to index the array
• Usually built using data from operational databases
Jan Feb Mar Apr
Milk
Oil
Sugar Coffee
Coimbra Lisbon
Sales
Product
Date
3 5
2 Store
29 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Star model
ID_dim 1 ID_dim 2 ID_dim 3 ID_dim 4
Fact 1 Fact 2
. . . Fact n
Dimension 1
Facts Table ID_dim 1
Attributes . . .
Dimension 2 ID_dim 2
Attributes . . .
Dimension 3 ID_dim 3
Attributes . . .
Dimension 4 ID_dim 4
Attributes . . .
• The typical dimensional model is a star structure with: – A central table with facts – Several dimensions tables describing the facts
30 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Facts
• Represent the business measures
• The most useful facts are: – Numbers – Additives
DEPEND 2009 6
Marco Vieira, University of Coimbra, Portugal
31 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Facts table
• Comprises several numeric attributes (facts) and foreign keys to the dimensions
• Normalized table
• Relationships M:1 with the business dimensions
• Contains normally a large number of records • Represents typically 95% of the space used
by the DW
32 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Dimensions
• Each dimension represents a business parameter – Time, clients, products, etc
• Represent a entry point for the analysis of the facts
• Represent different point-of-views for the analysis of the facts
33 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Dimension tables
• Strongly denormalized – For performance
• Dimensions have hierarchies – Day Month Year … Contain a large set of
attributes
• Typically comprise a small number of records (when compared to the facts table)
34 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Star schema example
Time
Sale
ID_time Day Day_of_week Week_of_year Month Trimester Year
Product ID_product Name Type Brand Category Pack Description
Store ID_store Name Local District Area Num_tellers
Stores
ID_time ID_product ID_store Units_sold Purchase_cost Sale_value Num_Clients
35 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Low level queries
Time
Sales
ID_time Day Day_of_week Week_of_year Month Trimester Year
Product ID_product Name Type Brand Category Pack Description
Store ID_store Name Local District Area Num_tellers
ID_time ID_product ID_store Units_sold Purchase_cost Sale_value Num_Clients
select avg (sale_value x units_sold) from sale, time, product where JOIN_TABLES group by brand, month
36 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
User interfaces
• Explore data in Data Warehouses – Typical OLAP tools
• Access the relational engine using SQL • Data presentation using tables, graphics, reports, etc • Targeted for ad-hoc queries
– Other tools • Data mining • Modeling
DEPEND 2009 7
Marco Vieira, University of Coimbra, Portugal
37 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Queries - Slice and Dice
Sales by time and product Sales by store and
brand
38 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Drill-Down & Roll-Up
Most generic category
Intermediate category
Most detailed category
Full Detail
Drill-Down Roll-up
39 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Time: Drill-Down & Roll-Up
Year
Trimester
Month
Week
Drill-Down Roll-up
Day
ALL
40 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Steps for the design of the star model
1. Identify the business process/activity
2. Identify the facts
3. Identify the dimensions
4. Define the data granularity • Day, Week, Month, … • Product, Category, … • Store, City, …
Do not
forget
that t
he mod
el
depend
s on t
he dat
a avai
lable
(opera
tional
datab
ases, f
iles, e
tc)
41 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Example – Retail sales
• Set of stores belonging to the same enterprise
• Goal: Analysis of sales
• Each store has several departments (food, hygiene and cleaning, etc)
• Sells thousands of products
• Products are identified using a unique number
42 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Retail sales – Business data
• Where to collect the data? – POS - point of sales – Operational database
• What to measure? – Sales
• Goals? – Maximize the profit – Maximum sales price possible – Lower costs – More clients
DEPEND 2009 8
Marco Vieira, University of Coimbra, Portugal
43 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Retail sales – Facts
• Examples of relevant decision support facts: – Number of units sold – Acquisition costs – Sale value – Number of clients that bought the product
• Question: is it possible to obtain base data (from the operational system) for these facts?
44 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Retail sales – Dimensions
• Main dimensions: – Product x Store x Time
• Are there other relevant dimensions? – Supplier? – Promotions? – Client? – Employee responsible for the store on that day?
• It is normally possible to add extra dimensions
• All the dimensions have a 1:M relationship with the facts
45 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Retail sales
ID_store name store_number store_street_address city store_county store_state store_zip sales_district sales_region store_manager store_phone store_FAX floor_plan_type photo_processing_type finance_services_type first_opened_date last_remodel_date store_sqft grocery_sqft frozen_sqft meat_sqft ……...
ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients
ID_product description full_description SKU_number package_size brand subcategory category department package_type diet_type weight weight_unit_of_measure units_per_retail_case units_per_shipping_case cases_per_pallet shelf_width_cm shelf_height_cm shelf_depth_cm ……...
ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….
ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...
46 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Granularity
• Example: record the daily sales for all products – Analyze in detail (price, quantity, etc) the products
sold every day, in each store, …
• Retail sales granularity: – Products x Store x Promotion x Day
• The granularity defines the detail of the DW and has a strong impact in the size
• The granularity must be adjusted to the analysis requirements
47 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Retail sales – Details
ID_store name store_number store_street_address city store_county store_state store_zip sales_district sales_region store_manager store_phone store_FAX floor_plan_type photo_processing_type finance_services_type first_opened_date last_remodel_date store_sqft grocery_sqft frozen_sqft meat_sqft ……...
ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients
ID_product description full_description SKU_number package_size brand subcategory category department package_type diet_type weight weight_unit_of_measure units_per_retail_case units_per_shipping_case cases_per_pallet shelf_width_cm shelf_height_cm shelf_depth_cm ……...
ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….
ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...
• Mandatory dimension that represents the DW temporal dependency
• Must describe time as seen by the business management
• Is typically generated in a synthetic manner
• It is not generated from the operational databases
• Includes all the records representing the time period considered in the DW
48 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Retail sales – Details
ID_store name store_number store_street_address city store_county store_state store_zip sales_district sales_region store_manager store_phone store_FAX floor_plan_type photo_processing_type finance_services_type first_opened_date last_remodel_date store_sqft grocery_sqft frozen_sqft meat_sqft ……...
ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients
ID_product description full_description SKU_number package_size brand subcategory category department package_type diet_type weight weight_unit_of_measure units_per_retail_case units_per_shipping_case cases_per_pallet shelf_width_cm shelf_height_cm shelf_depth_cm ……...
ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….
ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...
• Must characterize the products as seen by the business management
• Must contain the attributes that are relevant for posterior queries
• It is a strongly denormalized table (which is also typical in other dimensions)
DEPEND 2009 9
Marco Vieira, University of Coimbra, Portugal
49 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Retail sales – Details
ID_store name store_number store_street_address city store_county store_state store_zip sales_district sales_region store_manager store_phone store_FAX floor_plan_type photo_processing_type finance_services_type first_opened_date last_remodel_date store_sqft grocery_sqft frozen_sqft meat_sqft ……...
ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients
ID_product description full_description SKU_number package_size brand subcategory category department package_type diet_type weight weight_unit_of_measure units_per_retail_case units_per_shipping_case cases_per_pallet shelf_width_cm shelf_height_cm shelf_depth_cm ……...
ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….
ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...
• Must characterize the stores as seen by the business management
• Must contain the attributes that are relevant for posterior queries
• Includes geographical attributes (localization)
• Includes time attributes (opening date,…).
50 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Retail sales – Details
ID_store name store_number store_street_address city store_county store_state store_zip sales_district sales_region store_manager store_phone store_FAX floor_plan_type photo_processing_type finance_services_type first_opened_date last_remodel_date store_sqft grocery_sqft frozen_sqft meat_sqft ……...
ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients
ID_product description full_description SKU_number package_size brand subcategory category department package_type diet_type weight weight_unit_of_measure units_per_retail_case units_per_shipping_case cases_per_pallet shelf_width_cm shelf_height_cm shelf_depth_cm ……...
ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….
ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...
• Characterizes the existing promotions • In this example there is only one dimension related to promotions • Represents a very important dimension
• Managers want to know the impact of promotions in the sales in order to target new promotions to specific products, stores and time
51 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Warehouse
More than one star
• Two or more starts can be connected using one or more dimensions
• Shared dimensions must be conform – Contain consistent data when considering each star
• Drill across: query that crosses more than one start
Stock ID_time ID_product ID_warehouse quant_available quant_out purchase_cost last_sell_price
Time
Product
Sales ID_time ID_product ID_store units_sold purchase_cost sale_value num_clients
Store
52 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Several stars
Orders Dimension: Time Dimension: Component Dimension: Supplier Dimension: Contract
Sales Dimension: Time Dimension: Component Dimension: Client Dimension: Contract
Stocks Dimension: Time Dimension: Component Dimension: Warehouse
53 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Questions
? 3. Using DW to analyze dependability data
DEPEND 2009 10
Marco Vieira, University of Coimbra, Portugal
55 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Multidimensional server
OLAP application (result analysis)
Data Warehouse
Ad hoc queries
StatisticalReporting
Net
Analysis Operational DB
Legacy Systems
Spread sheets, files ...
External sources
Operations
Basic elements of a DW
56 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Exp. System A
Multidimensional server
OLAP application (result analysis)
Data Warehouse
Ad hoc queries
StatisticalReporting Exp. System B
Exp. System N
LAN/ Internet
Fault injection tools
Robustness testing tools
Dependability benchmarking
experiments
Any other experimental environment
Field data
Result analysis Experiments
A DW for experimental data
57 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
• Exp. Setup A • Multidimensional
database
Data Warehouse
• Ad hoc queries
• Statistical • Reporting
• Net • Exp. Setup B
Exp. Setup N
• Experiments • OLAP tool
• General approach to store results from dependability evaluation experiments
• Data from different experiments can be compared/cross-exploit (only if it makes sense to compare)
• Raw data is available (not only the final results) • Results can be analyzed and shared world wide by using
web-enabled versions of OLAP tools
?
What’s inside?
Key points of the proposed approach
58 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Two types of data in experimental dependability evaluation
• Measures collected from the target system (FACTS) – For example, raw data representing error detection efficiency, recovery
time, failure modes, etc • Features of the target system and experimental setup
that have impact on the measures (DIMENSIONS) – For example, attributes describing the target systems, the different
configurations, the workload, the faultload, etc
Network
Experiment Management System
Target System
Readouts (impact of faults)
Exp. control data Faults definition
Two types of data:
59 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Targe
t syst
em
System A System B
Faultload
Workload
The multidimensional model
• Facts are stored in a multidimensional array
• Dimensions are used to access the array according to any possible criteria
60 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
The star schema
DEPEND 2009 11
Marco Vieira, University of Coimbra, Portugal
61 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
• Exp. Setup A • Multidimensional
database
Data Warehouse
• Ad hoc queries
• Statistical • Reporting
• Net • Exp. Setup B
Exp. Setup N
• Experiments • Analysis
The experimental setups are used as they are. You can use your favorite dependability evaluation tool and do the experiments in the usual way. It’s necessary…
• To know the format of the raw results • To have access to the results
Basic elements of the proposed approach
62 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
• Exp. Setup A • Multidimensional
database
Data Warehouse
• Ad hoc queries
• Statistical • Reporting
• Net • Exp. Setup B
Exp. Setup N
• Experiments • Analysis
Loading applications • General purpose loading applications • Some transformations in the data are normally necessary for
consistency
• Loading applications
Basic elements of the proposed approach
63 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
• Exp. Setup A • Multidimensional
database
Data Warehouse
• Ad hoc queries
• Statistical • Reporting
• Net • Exp. Setup B
Exp. Setup N
• Experiments • Analysis
Data warehouse • Raw data is available in a standard star schema (facts + dimensions) • Results from different experiments are compatible and can be compared/
analyzed together, then they are stored in the same star schema (or in scheme that share at least one dimension)
• If results are from different unrelated experiments then they are stored in a separated schema
• Loading applications
Basic elements of the proposed approach
64 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
• Exp. Setup A • Multidimensional
database
Data Warehouse
• Ad hoc queries
• Statistical • Reporting
• Net • Exp. Setup B
Exp. Setup N
• Experiments • Analysis
Analysis • Commercial OLAP tools are used to analyze the raw data and
compute the measures. These tools are designed to be used by managers: very easy to use :-)
• Just need an internet browser to analyze the data
• Loading applications
Basic elements of the proposed approach
65 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Steps needed to put our approach into practice
1. Definition of the adequate star schema to store the data. Create the tables in the data warehouse
2. Use general-purpose loading application to define the loading plans for each table in the star schema
3. Run the loading plans to load the star tables with the raw data collected from the experiments
4. Every time a new experiment is done corresponding loading plans are run again to add the new data to the data warehouse
5. Analyze the data: calculate measures, find unexpected results, analyze trends, etc
66 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Example: Recovery and Performance Evaluation in DBMS
• Tuning of a large DBMS is very complex
• Administrators tend to focus on performance tuning and disregard the recovery features
• Administrators seldom have feedback on how good a given configuration is
• A technique to characterize the performance and the recoverability in DBMS is needed
DEPEND 2009 12
Marco Vieira, University of Coimbra, Portugal
67 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
The Approach
• Extending existing performance benchmarks to evaluate recoverability features in DBMS
• Include a faultload and new measures
• Faultload based on operator faults • Measures related to recovery:
– Recovery time – Data integrity violations – Lost transactions
68 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Operator faults injection and recovery
69 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Experimental setup
Test
70 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
The data storage model
71 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Steps towards data analyzes
1. Definition of the adequate star schema a. Identify the process/activity b. Identify the facts c. Identify the dimensions d. Define the data granularity
2. Load the data
3. Analyze the data
72 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Definition of the adequate star schema: Identify the process/activity
• Experiments to characterize the performance and the recoverability in DBMS
• Includes a faultload and new measures
• Faultload based on operator faults
• Measures related to recovery
DEPEND 2009 13
Marco Vieira, University of Coimbra, Portugal
73 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Definition of the adequate star schema: Identify the facts
74 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Definition of the adequate star schema: Identify the dimensions
75 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Definition of the adequate star schema: Define the data granularity
• Performance and recovery results – Per experiment – Per SUT – Per workload – Per fault type
76 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
The star schema
77 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Load the data
ETL
78 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Analyze the data: Example of query construction
DEPEND 2009 14
Marco Vieira, University of Coimbra, Portugal
79 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Analyze the data: Example of query answer
80 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Questions
?
4. The AMBER Data Repository
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 82
AMBER Repository vision and objectives
• Vision − Become a worldwide repository for
dependability related data
• Key objectives: − Provide state-of-the-art data analysis − Allow data comparison and cross-exploitation − Facilitate worldwide data sharing and dissemination
• Potential tool to increase the impact of research
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 83
Potential use
• Research team level − Perform the analysis of data in an efficient way − Efficient dissemination of the results of the team
• Project level − Sharing and cross-exploitation of results from different
project teams
• World wide − Common repository to store and share data − Many teams are performing dependability evaluation but
there are no results available at the web
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 84
Data analysis approach
• Repository to analyze, compare, and share results
• Use a business intelligence approach: − Data warehouse to store data − On-Line Analytical Processing (OLAP) to analyze data − Data mining algorithms to identify (unknown) phenomena in
the data − Information retrieval to access data in textual formats
• Adopt the same life cycle of BI data
• Use technology already available for DW, DM & IR
DEPEND 2009 15
Marco Vieira, University of Coimbra, Portugal
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 85
Steps
1. User registration
2. Multidimensional analysis
3. Definition of the loading plans
7. Load the data
8. Definition of data ownership policies
9. Analysis of the data
• Analyze DBench-OLTP results using OLAP
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 86
User registration
• ADR users must undergo a registration procedure
• Provide identification information that is verified by the ADR support team − To filter malicious users
• Contact information is used to get in touch with the potential repository user
• To access the repository users must authenticate
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 87
Multidimensional analysis
• Design an adequate multidimensional data model
• User has the required expertise to design the data model − Send to the ADR support team the SQL scripts needed to
create the database tables
• The ADR team helps the user defining the model − The user only needs to explain us the experimental setup
and the format of the data collected
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 88
The DBench-OLTP benchmark
Format of the raw data
• Raw data collected by DBench-OLTP is composed of tens of CSV files (one from each run)
• Each row contains data from an injection slot − Identification, duration, number of transactions executed,
data integrity errors discovered, type of fault injected, moment of fault injection, workload used, etc)
• A text file describes the experiment and the characteristics of the SUB
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 89
Data model (1)
• Key steps: − Identification of the facts that characterize the problem
under analysis − Identification of the dimensions that may influence the facts − Definition of the granularity of the data stored in the star
schema
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 90
DEPEND 2009 16
Marco Vieira, University of Coimbra, Portugal
Data model (2)
91 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Definition of the loading plans
• Data extraction − SQL scripts to extract data from the CSV files to a temporary
database schema (data staging area)
• Data transformation − SQL scripts transform the data into an adequate format
• Data load − SQL scripts to load the transformed data into the data
warehouse
• Loading plans documented and stored in the ADR DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 92
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 93
Load the data
• Executing the loading plans created before
• If new data becomes available we just need to rerun the plans − e.g., if the benchmark is executed in other systems
• The documentation of the DBench-OLTP includes papers and technical reports − This is considered as part of the DBench-OLTP data − It is loaded to the repository and made available to the
potential readers of the data
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 94
Data ownership policy
• Data ownership policies of ADR are divided in two main groups − Private data − Proprietary data − Collaborative data
• For the DBench-OLTP data we have decided to use a collaborative approach − Allows other potential users of the benchmark to compare
their results with the ones available in the ADR
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 95
Analysis of the data
• On-line Analytical Processing (OLAP) tools − Support the analysis in a very flexible way − Provide high query performance and easy, intuitive data
navigation
• Oracle Business Intelligence Discoverer Plus (ODP) − Commercial tool included in Oracle Business Intelligence
package − Widely used by industry Used freely for research purposes
under an Oracle Academy Agreement
OLAP Wizard
• Selection of query type (crosstab or table) and characteristics (title, graph, text area, etc)
• Selection of measures and dimensional attributes
• Setting the query layout
• Selection of the fields to be used to sort the results
• Creation of parameters used to filter data
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 96
DEPEND 2009 17
Marco Vieira, University of Coimbra, Portugal
Some results
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 97 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 98
Quick demo…
• Murphy's law…
http://www.amber-project.eu
Do you have data?
Share Them!
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 99
Questions
?
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 100
101 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
Generic bibliography
• Ralph Kimbal, Margy Ross, “The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling” (Second Edition), Ed. J. Wiley & Sons, Inc, 2002.
• Ralph Kimbal, “The Data Warehouse Lifecycle Toolkit”, Ed. J. Wiley & Sons, Inc, 2001.
102 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009
ADR bibliography
• Madeira, H., Costa, J., Vieira, M. , "The OLAP and Data Warehousing Approaches for Analysis and Sharing of Results from Dependability Evaluation Experiments", International Conference on Dependable Systems and Networks, DSN-DCC 2003, San Francisco, CA, USA, June 2003
• Pintér, G., Madeira, H., Vieira, M., Pataricza, A., Majzik, I. , "A Data Mining Approach to Identify Key Factors in Dependability Experiments", Fifth European Dependable Computing Conference (EDCC-5), Budapest, Hungary, April 2005
DEPEND 2009 18
Marco Vieira, University of Coimbra, Portugal
103
ADR bibliography
• Pintér, G., Madeira, H., Vieira, M., Majzik, I., Pataricza, A. , "Integration of OLAP and Data Mining for Analysis of Results from Dependability Evaluation Experiments", International Journal of Knowledge Management Studies (IJKMS), Volume 2 – Issue 4 – 2008, Inderscience Publishers, July 2008
• Vieira, M., Mendes, N., Durães, J., Madeira, H. , "The AMBER Data Repository", DSN 2008 Workshop on Resilience Assessment and Dependability Benchmarking (DSN-RADB08), Anchorage, Alaska, June 2008
• Vieira, M., Mendes, N., Durães, J. , "A Case Study on Using the AMBER Data Repository for Experimental Data Analysis", SRDS 2008 Workshop on Sharing Field Data and Experiment Measurements on Resilience of Distributed Computing Systems, Naples, Italy, October 2008
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009