Marco Vieira, University of Coimbra, Portugaliaria.org/conferences2009/filesDEPEND09/ADR_Tutorial... · 2016-09-11 · Marco Vieira, University of Coimbra, Portugal Using the AMBER

DEPEND 2009 1

Marco Vieira, University of Coimbra, Portugal

Using the AMBER Data Repository to Analyze, Share and Cross-exploit Dependability Data

Marco Vieira [emailprotected]

University of Coimbra, Portugal The Second International Conference on Dependability (DEPEND 2009)

Athens/Glyfada, Greece, June 18,2009

Tutorial The AMBER Project

•  Assessing, Measuring and Benchmarking Resilience in computer systems and components (AMBER)

•  Coordination Action supported by the European Commission in the 7th FP

•  Coordinating and advancing research in resilience measurement and benchmarking in computer systems and infrastructures

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 2

Current challenges

•  Quality of measurements

•  Integration of the human and technical components of the analysis

•  Dynamic and adaptive systems and networks

•  Integration with the development processes

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 3

AMBER objectives

•  State-of-the art survey

•  Research agenda

•  Data repository

•  Others: – Dissemination events (workshops, panels, etc) – Benchmarking tools – Training material

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 4

5 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

This Tutorial…

Learn how to use the AMBER Data Repository to analyze and share data

from dependability evaluation experiments

6 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Problems

•  How to analyze the usually large amount of raw data produced in dependability evaluation experiments?

•  How to compare results from different experiments or results of similar experiments across different systems? – Different and incompatible tools, data formats, and

setup details…

•  How to share raw experimental results among research teams?

DEPEND 2009 2

Marco Vieira, University of Coimbra, Portugal

7 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Current situation

•  The situation today is not good!!! •  Spreadsheets and other specific tools to

analyze results – Not standard and difficult to build

•  Difficult to compare data and generalize conclusions

•  Researchers share final results and conclusions – Papers, mainly – Raw data is not shared

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

ADR Vision and objectives

•  Vision – Become a worldwide repository for dependability related data

•  Key objectives: – Provide state-of-the-art data analysis – Allow data comparison and cross-exploitation – Facilitate worldwide data sharing and

dissemination •  Potential tool to increase the impact of

research 8

Data analysis approach

•  Repository to analyze, compare, and share results

•  Use a business intelligence approach: – Data warehouse to store data – On-Line Analytical Processing (OLAP) to analyze

data – Data mining algorithms to identify (unknown)

phenomena in the data –  Information retrieval for data in textual formats

•  Adopt the same life cycle of BI data 9 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 10 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Outline

1.  Business Intelligence

2.  Data Warehousing & OLAP

3.  Using DW to analyze dependability related data

4.  The AMBER Data Repository

1. Business Intelligence

What is Business Intelligence?

•  Business Intelligence (BI): – Getting the right information, to the right

decision makers, at the right time •  BI is an enterprise-wide platform that

supports, data gathering, reporting, analysis and decision making

•  BI is meant to: –  Fact-based decision making –  “Single version of the truth”

•  BI includes reporting and analytics 12 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

DEPEND 2009 3

Marco Vieira, University of Coimbra, Portugal

Five classic BI questions

13 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

•  What happened? •  What is happening? •  Why did it happen? •  What will happen? •  What do I want to happen?

Past

Present

Future

Typical BI technologies

•  ETL Tools (Extract, Transform, and Load)

•  Repositories – Data Warehouse

•  Analytical tools – Reporting and querying – OLAP – Data mining

•  Information retrieval

14 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Many proprietary products

ACE*COMM Ab Initio Actuate ComArch CyberQuery Dimensional Insight IBM

Applix Cognos

InetSoft Informatica Information Builders LogiXML LucidEra MicroStrategy

15 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Microsoft Microsoft Analysis Services PerformancePoint Server 2007 Proclarity

Oracle Corporation Hyperion Solutions Corporation

Panorama Software Pentaho Pervasive Pilot Software, Inc. PRELYTIS Prospero Business Suite Qliktech SAP Business Inf. Warehouse Business Objects

OutlookSoft

SAS Institute

Siebel Systems

Spotfire (now Tibco)

StatSoft

SPSS

Telerik Reporting

Teradata

Thomson Data Analyzer

Some open source/free producs

•  Eclipse BIRT Project:. •  Freereporting.com: •  JasperSoft: •  OpenI: •  Palo (OLAP database): •  Pentaho: •  RapidMiner •  SpagoBI: •  Weka

•  Some products from big companies can be used freely

16 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

2. Data Warehousing & OLAP

18 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

What is a Data Warehouse?

•  Big database that stores data for decision support

•  Built from the operational data collected from transactional DB and other operational systems

Operational DB & other systems Data Warehouse

Users

DEPEND 2009 4

Marco Vieira, University of Coimbra, Portugal

19 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Basic DW components

Operational DB

Legacy systems

Spreadsheets, files, ...

External sources

Data Staging

Area

Data warehouse (presentation servers)

Users

Ad hoc queries

Reports

Specific apps

Models and other tools

20 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Data volume

•  Less than 20 GBytes – Small dimension; runs in a PC

•  From 20 to 100 GBytes – Medium dimension; needs a powerful workstation

•  From 100 Gbytes to 1 TBytes – Large dimension; needs a powerful server,

normally with parallel processing •  More than 1 TBytes

– Very large dimension; massive parallel processing

21 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Some characteristics

•  Temporal dependency

•  Non volatile

•  Target oriented

•  Data integration and consistency

•  Designed for queries

22 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Temporal dependency

•  The data is collected over time – Do not represent a specific moment – Represents the history

•  A temporal reference must be associated to all data in the database

23 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Non volatile

•  The data in the DW is never updated

•  The DW stores historic data (historic memory) collected from the operational databases

•  After being load (from the operational databases) there is only one operation: – Queries

24 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Target oriented

•  The data warehouse must only store data relevant for decision support

•  Many operational data (needed for everyday management) is not relevant for the DW

DEPEND 2009 5

Marco Vieira, University of Coimbra, Portugal

25 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Data integration and consistency

•  In a operational environment the information may be stored in different locations using different representations

•  That data must be integrated and made consistent before being load in the DW

26 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Designed for queries

•  After being load the data never changes: – Only queries are

allowed

•  DW stores a large amount of data

Multidimensional view Partial denormalization

The data must be stored in such a way that improves performance

27 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Dimensional model

•  Typical model in operational databases: E/R

•  The dimensional model follows a different approach – Stores the same data – Data organization is user oriented

•  Easy to understand •  Very good performance for queries

•  Data Warehouses built over complex E/R models never succeed

28 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

The multidimensional model

•  Facts stored in a multidimensional array

•  The dimensions are used to index the array

•  Usually built using data from operational databases

Jan Feb Mar Apr

Milk

Oil

Sugar Coffee

Coimbra Lisbon

Sales

Product

Date

3 5

2 Store

29 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Star model

ID_dim 1 ID_dim 2 ID_dim 3 ID_dim 4

Fact 1 Fact 2

. . . Fact n

Dimension 1

Facts Table ID_dim 1

Attributes . . .

Dimension 2 ID_dim 2

Attributes . . .

Dimension 3 ID_dim 3

Attributes . . .

Dimension 4 ID_dim 4

Attributes . . .

•  The typical dimensional model is a star structure with: – A central table with facts – Several dimensions tables describing the facts

30 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Facts

•  Represent the business measures

•  The most useful facts are: – Numbers – Additives

DEPEND 2009 6

Marco Vieira, University of Coimbra, Portugal

31 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Facts table

•  Comprises several numeric attributes (facts) and foreign keys to the dimensions

•  Normalized table

•  Relationships M:1 with the business dimensions

•  Contains normally a large number of records •  Represents typically 95% of the space used

by the DW

32 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Dimensions

•  Each dimension represents a business parameter – Time, clients, products, etc

•  Represent a entry point for the analysis of the facts

•  Represent different point-of-views for the analysis of the facts

33 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Dimension tables

•  Strongly denormalized – For performance

•  Dimensions have hierarchies – Day Month Year … Contain a large set of

attributes

•  Typically comprise a small number of records (when compared to the facts table)

34 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Star schema example

Time

Sale

ID_time Day Day_of_week Week_of_year Month Trimester Year

Product ID_product Name Type Brand Category Pack Description

Store ID_store Name Local District Area Num_tellers

Stores

ID_time ID_product ID_store Units_sold Purchase_cost Sale_value Num_Clients

35 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Low level queries

Time

Sales

ID_time Day Day_of_week Week_of_year Month Trimester Year

Product ID_product Name Type Brand Category Pack Description

Store ID_store Name Local District Area Num_tellers

ID_time ID_product ID_store Units_sold Purchase_cost Sale_value Num_Clients

select avg (sale_value x units_sold) from sale, time, product where JOIN_TABLES group by brand, month

36 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

User interfaces

•  Explore data in Data Warehouses – Typical OLAP tools

•  Access the relational engine using SQL •  Data presentation using tables, graphics, reports, etc •  Targeted for ad-hoc queries

– Other tools •  Data mining •  Modeling

DEPEND 2009 7

Marco Vieira, University of Coimbra, Portugal

37 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Queries - Slice and Dice

Sales by time and product Sales by store and

brand

38 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Drill-Down & Roll-Up

Most generic category

Intermediate category

Most detailed category

Full Detail

Drill-Down Roll-up

39 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Time: Drill-Down & Roll-Up

Year

Trimester

Month

Week

Drill-Down Roll-up

Day

ALL

40 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Steps for the design of the star model

1.  Identify the business process/activity

2.  Identify the facts

3.  Identify the dimensions

4.  Define the data granularity •  Day, Week, Month, … •  Product, Category, … •  Store, City, …

Do not

forget

that t

he mod

depend

s on t

he dat

a avai

lable

(opera

tional

datab

ases, f

iles, e

tc)

41 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Example – Retail sales

•  Set of stores belonging to the same enterprise

•  Goal: Analysis of sales

•  Each store has several departments (food, hygiene and cleaning, etc)

•  Sells thousands of products

•  Products are identified using a unique number

42 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Retail sales – Business data

•  Where to collect the data? – POS - point of sales – Operational database

•  What to measure? – Sales

•  Goals? – Maximize the profit – Maximum sales price possible – Lower costs – More clients

DEPEND 2009 8

Marco Vieira, University of Coimbra, Portugal

43 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Retail sales – Facts

•  Examples of relevant decision support facts: – Number of units sold – Acquisition costs – Sale value – Number of clients that bought the product

•  Question: is it possible to obtain base data (from the operational system) for these facts?

44 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Retail sales – Dimensions

•  Main dimensions: – Product x Store x Time

•  Are there other relevant dimensions? – Supplier? – Promotions? – Client? – Employee responsible for the store on that day?

•  It is normally possible to add extra dimensions

•  All the dimensions have a 1:M relationship with the facts

45 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Retail sales

ID_store name store_number store_street_address city store_county store_state store_zip sales_district sales_region store_manager store_phone store_FAX floor_plan_type photo_processing_type finance_services_type first_opened_date last_remodel_date store_sqft grocery_sqft frozen_sqft meat_sqft ……...

ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients

ID_product description full_description SKU_number package_size brand subcategory category department package_type diet_type weight weight_unit_of_measure units_per_retail_case units_per_shipping_case cases_per_pallet shelf_width_cm shelf_height_cm shelf_depth_cm ……...

ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….

ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...

46 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Granularity

•  Example: record the daily sales for all products – Analyze in detail (price, quantity, etc) the products

sold every day, in each store, …

•  Retail sales granularity: – Products x Store x Promotion x Day

•  The granularity defines the detail of the DW and has a strong impact in the size

•  The granularity must be adjusted to the analysis requirements

47 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Retail sales – Details

ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients

ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….

ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...

• Mandatory dimension that represents the DW temporal dependency

• Must describe time as seen by the business management

•  Is typically generated in a synthetic manner

•  It is not generated from the operational databases

•  Includes all the records representing the time period considered in the DW

48 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Retail sales – Details

ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients

ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….

ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...

• Must characterize the products as seen by the business management

• Must contain the attributes that are relevant for posterior queries

•  It is a strongly denormalized table (which is also typical in other dimensions)

DEPEND 2009 9

Marco Vieira, University of Coimbra, Portugal

49 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Retail sales – Details

ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients

ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….

ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...

• Must characterize the stores as seen by the business management

• Must contain the attributes that are relevant for posterior queries

• Includes geographical attributes (localization)

• Includes time attributes (opening date,…).

50 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Retail sales – Details

ID_product ID_time ID_store ID_promotion units_sold purchase_cost sale_value num_Clients

ID_time date day_of_week day_number_in_month day_number_overall week_number_in_year week_number_overall Month quarter fiscal_period year holiday_flag ……….

ID_promotion number name type_price_red type_advertisem*nt type_poster Type_coupons promotion_cost start_date end_date ……...

• Characterizes the existing promotions •  In this example there is only one dimension related to promotions • Represents a very important dimension

• Managers want to know the impact of promotions in the sales in order to target new promotions to specific products, stores and time

51 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Warehouse

More than one star

•  Two or more starts can be connected using one or more dimensions

•  Shared dimensions must be conform – Contain consistent data when considering each star

•  Drill across: query that crosses more than one start

Stock ID_time ID_product ID_warehouse quant_available quant_out purchase_cost last_sell_price

Time

Product

Sales ID_time ID_product ID_store units_sold purchase_cost sale_value num_clients

Store

52 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Several stars

Orders Dimension: Time Dimension: Component Dimension: Supplier Dimension: Contract

Sales Dimension: Time Dimension: Component Dimension: Client Dimension: Contract

Stocks Dimension: Time Dimension: Component Dimension: Warehouse

53 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Questions

? 3. Using DW to analyze dependability data

DEPEND 2009 10

Marco Vieira, University of Coimbra, Portugal

55 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Multidimensional server

OLAP application (result analysis)

Data Warehouse

Ad hoc queries

StatisticalReporting

Net

Analysis Operational DB

Legacy Systems

Spread sheets, files ...

External sources

Operations

Basic elements of a DW

56 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Exp. System A

Multidimensional server

OLAP application (result analysis)

Data Warehouse

Ad hoc queries

StatisticalReporting Exp. System B

Exp. System N

LAN/ Internet

Fault injection tools

Robustness testing tools

Dependability benchmarking

experiments

Any other experimental environment

Field data

Result analysis Experiments

A DW for experimental data

57 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

• Exp. Setup A • Multidimensional

database

Data Warehouse

• Ad hoc queries

• Statistical • Reporting

• Net • Exp. Setup B

Exp. Setup N

• Experiments • OLAP tool

•  General approach to store results from dependability evaluation experiments

•  Data from different experiments can be compared/cross-exploit (only if it makes sense to compare)

•  Raw data is available (not only the final results) •  Results can be analyzed and shared world wide by using

web-enabled versions of OLAP tools

What’s inside?

Key points of the proposed approach

58 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Two types of data in experimental dependability evaluation

•  Measures collected from the target system (FACTS) –  For example, raw data representing error detection efficiency, recovery

time, failure modes, etc •  Features of the target system and experimental setup

that have impact on the measures (DIMENSIONS) –  For example, attributes describing the target systems, the different

configurations, the workload, the faultload, etc

Network

Experiment Management System

Target System

Readouts (impact of faults)

Exp. control data Faults definition

Two types of data:

59 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Targe

t syst

System A System B

Faultload

Workload

The multidimensional model

•  Facts are stored in a multidimensional array

•  Dimensions are used to access the array according to any possible criteria

60 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

The star schema

DEPEND 2009 11

Marco Vieira, University of Coimbra, Portugal

61 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

• Exp. Setup A • Multidimensional

database

Data Warehouse

• Ad hoc queries

• Statistical • Reporting

• Net • Exp. Setup B

Exp. Setup N

• Experiments • Analysis

The experimental setups are used as they are. You can use your favorite dependability evaluation tool and do the experiments in the usual way. It’s necessary…

•  To know the format of the raw results •  To have access to the results

Basic elements of the proposed approach

62 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

• Exp. Setup A • Multidimensional

database

Data Warehouse

• Ad hoc queries

• Statistical • Reporting

• Net • Exp. Setup B

Exp. Setup N

• Experiments • Analysis

Loading applications • General purpose loading applications • Some transformations in the data are normally necessary for

consistency

• Loading applications

Basic elements of the proposed approach

63 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

• Exp. Setup A • Multidimensional

database

Data Warehouse

• Ad hoc queries

• Statistical • Reporting

• Net • Exp. Setup B

Exp. Setup N

• Experiments • Analysis

Data warehouse •  Raw data is available in a standard star schema (facts + dimensions) •  Results from different experiments are compatible and can be compared/

analyzed together, then they are stored in the same star schema (or in scheme that share at least one dimension)

•  If results are from different unrelated experiments then they are stored in a separated schema

• Loading applications

Basic elements of the proposed approach

64 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

• Exp. Setup A • Multidimensional

database

Data Warehouse

• Ad hoc queries

• Statistical • Reporting

• Net • Exp. Setup B

Exp. Setup N

• Experiments • Analysis

Analysis • Commercial OLAP tools are used to analyze the raw data and

compute the measures. These tools are designed to be used by managers: very easy to use :-)

• Just need an internet browser to analyze the data

• Loading applications

Basic elements of the proposed approach

65 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Steps needed to put our approach into practice

1.  Definition of the adequate star schema to store the data. Create the tables in the data warehouse

2.  Use general-purpose loading application to define the loading plans for each table in the star schema

3.  Run the loading plans to load the star tables with the raw data collected from the experiments

4.  Every time a new experiment is done corresponding loading plans are run again to add the new data to the data warehouse

5.  Analyze the data: calculate measures, find unexpected results, analyze trends, etc

66 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Example: Recovery and Performance Evaluation in DBMS

•  Tuning of a large DBMS is very complex

•  Administrators tend to focus on performance tuning and disregard the recovery features

•  Administrators seldom have feedback on how good a given configuration is

•  A technique to characterize the performance and the recoverability in DBMS is needed

DEPEND 2009 12

Marco Vieira, University of Coimbra, Portugal

67 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

The Approach

•  Extending existing performance benchmarks to evaluate recoverability features in DBMS

•  Include a faultload and new measures

•  Faultload based on operator faults •  Measures related to recovery:

– Recovery time – Data integrity violations – Lost transactions

68 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Operator faults injection and recovery

69 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Experimental setup

Test

70 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

The data storage model

71 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Steps towards data analyzes

1.  Definition of the adequate star schema a.  Identify the process/activity b.  Identify the facts c.  Identify the dimensions d.  Define the data granularity

2.  Load the data

3.  Analyze the data

72 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Definition of the adequate star schema: Identify the process/activity

•  Experiments to characterize the performance and the recoverability in DBMS

•  Includes a faultload and new measures

•  Faultload based on operator faults

•  Measures related to recovery

DEPEND 2009 13

Marco Vieira, University of Coimbra, Portugal

73 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Definition of the adequate star schema: Identify the facts

74 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Definition of the adequate star schema: Identify the dimensions

75 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Definition of the adequate star schema: Define the data granularity

•  Performance and recovery results – Per experiment – Per SUT – Per workload – Per fault type

76 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

The star schema

77 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Load the data

ETL

78 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Analyze the data: Example of query construction

DEPEND 2009 14

Marco Vieira, University of Coimbra, Portugal

79 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Analyze the data: Example of query answer

80 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Questions

4. The AMBER Data Repository

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 82

AMBER Repository vision and objectives

•  Vision −  Become a worldwide repository for

dependability related data

•  Key objectives: −  Provide state-of-the-art data analysis −  Allow data comparison and cross-exploitation −  Facilitate worldwide data sharing and dissemination

•  Potential tool to increase the impact of research

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 83

Potential use

•  Research team level −  Perform the analysis of data in an efficient way −  Efficient dissemination of the results of the team

•  Project level −  Sharing and cross-exploitation of results from different

project teams

•  World wide −  Common repository to store and share data −  Many teams are performing dependability evaluation but

there are no results available at the web

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 84

Data analysis approach

•  Repository to analyze, compare, and share results

•  Use a business intelligence approach: −  Data warehouse to store data −  On-Line Analytical Processing (OLAP) to analyze data −  Data mining algorithms to identify (unknown) phenomena in

the data −  Information retrieval to access data in textual formats

•  Adopt the same life cycle of BI data

•  Use technology already available for DW, DM & IR

DEPEND 2009 15

Marco Vieira, University of Coimbra, Portugal

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 85

Steps

1.  User registration

2.  Multidimensional analysis

3.  Definition of the loading plans

7.  Load the data

8.  Definition of data ownership policies

9.  Analysis of the data

•  Analyze DBench-OLTP results using OLAP

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 86

User registration

•  ADR users must undergo a registration procedure

•  Provide identification information that is verified by the ADR support team −  To filter malicious users

•  Contact information is used to get in touch with the potential repository user

•  To access the repository users must authenticate

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 87

Multidimensional analysis

•  Design an adequate multidimensional data model

•  User has the required expertise to design the data model −  Send to the ADR support team the SQL scripts needed to

create the database tables

•  The ADR team helps the user defining the model −  The user only needs to explain us the experimental setup

and the format of the data collected

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 88

The DBench-OLTP benchmark

Format of the raw data

•  Raw data collected by DBench-OLTP is composed of tens of CSV files (one from each run)

•  Each row contains data from an injection slot −  Identification, duration, number of transactions executed,

data integrity errors discovered, type of fault injected, moment of fault injection, workload used, etc)

•  A text file describes the experiment and the characteristics of the SUB

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 89

Data model (1)

•  Key steps: −  Identification of the facts that characterize the problem

under analysis −  Identification of the dimensions that may influence the facts −  Definition of the granularity of the data stored in the star

schema

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 90

DEPEND 2009 16

Marco Vieira, University of Coimbra, Portugal

Data model (2)

91 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Definition of the loading plans

•  Data extraction −  SQL scripts to extract data from the CSV files to a temporary

database schema (data staging area)

•  Data transformation −  SQL scripts transform the data into an adequate format

•  Data load −  SQL scripts to load the transformed data into the data

warehouse

•  Loading plans documented and stored in the ADR DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 92

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 93

Load the data

•  Executing the loading plans created before

•  If new data becomes available we just need to rerun the plans −  e.g., if the benchmark is executed in other systems

•  The documentation of the DBench-OLTP includes papers and technical reports −  This is considered as part of the DBench-OLTP data −  It is loaded to the repository and made available to the

potential readers of the data

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 94

Data ownership policy

•  Data ownership policies of ADR are divided in two main groups −  Private data −  Proprietary data −  Collaborative data

•  For the DBench-OLTP data we have decided to use a collaborative approach −  Allows other potential users of the benchmark to compare

their results with the ones available in the ADR

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 95

Analysis of the data

•  On-line Analytical Processing (OLAP) tools −  Support the analysis in a very flexible way −  Provide high query performance and easy, intuitive data

navigation

•  Oracle Business Intelligence Discoverer Plus (ODP) −  Commercial tool included in Oracle Business Intelligence

package −  Widely used by industry Used freely for research purposes

under an Oracle Academy Agreement

OLAP Wizard

•  Selection of query type (crosstab or table) and characteristics (title, graph, text area, etc)

•  Selection of measures and dimensional attributes

•  Setting the query layout

•  Selection of the fields to be used to sort the results

•  Creation of parameters used to filter data

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 96

DEPEND 2009 17

Marco Vieira, University of Coimbra, Portugal

Some results

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 97 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 98

Quick demo…

•  Murphy's law…

http://www.amber-project.eu

Do you have data?

Share Them!

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 99

Questions

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 100

101 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Generic bibliography

•  Ralph Kimbal, Margy Ross, “The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling” (Second Edition), Ed. J. Wiley & Sons, Inc, 2002.

•  Ralph Kimbal, “The Data Warehouse Lifecycle Toolkit”, Ed. J. Wiley & Sons, Inc, 2001.

102 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

ADR bibliography

•  Madeira, H., Costa, J., Vieira, M. , "The OLAP and Data Warehousing Approaches for Analysis and Sharing of Results from Dependability Evaluation Experiments", International Conference on Dependable Systems and Networks, DSN-DCC 2003, San Francisco, CA, USA, June 2003

•  Pintér, G., Madeira, H., Vieira, M., Pataricza, A., Majzik, I. , "A Data Mining Approach to Identify Key Factors in Dependability Experiments", Fifth European Dependable Computing Conference (EDCC-5), Budapest, Hungary, April 2005

DEPEND 2009 18

Marco Vieira, University of Coimbra, Portugal

103

ADR bibliography

•  Pintér, G., Madeira, H., Vieira, M., Majzik, I., Pataricza, A. , "Integration of OLAP and Data Mining for Analysis of Results from Dependability Evaluation Experiments", International Journal of Knowledge Management Studies (IJKMS), Volume 2 – Issue 4 – 2008, Inderscience Publishers, July 2008

•  Vieira, M., Mendes, N., Durães, J., Madeira, H. , "The AMBER Data Repository", DSN 2008 Workshop on Resilience Assessment and Dependability Benchmarking (DSN-RADB08), Anchorage, Alaska, June 2008

•  Vieira, M., Mendes, N., Durães, J. , "A Case Study on Using the AMBER Data Repository for Experimental Data Analysis", SRDS 2008 Workshop on Sharing Field Data and Experiment Measurements on Resilience of Distributed Computing Systems, Naples, Italy, October 2008

DEPEND 2009, Athens/Glyfada, Greece, June 18,2009

Marco Vieira, University of Coimbra, Portugaliaria.org/conferences2009/filesDEPEND09/ADR_Tutorial... · 2016-09-11 · Marco Vieira, University of Coimbra, Portugal Using the AMBER - [PDF Document] (2024)

References