Model-based Data Mining And Visualization – A Study Of .

Transcription

Syed, Adil Bari
Model-based Data mining and visualization – A study of data mining techniques to analyze construction project progress datasets
Modellbasiertes Data Mining und Datenvisualisierung – Eine Untersuchung von Data Mining Techniken zur Analyse von Projektfortschrittsdaten

Model-based Data mining and visualization – A study of data mining techniques to analyze construction project progress datasets
Modellbasiertes Data Mining und Datenvisualisierung – Eine Untersuchung von Data Mining Techniken zur Analyse von Projektfortschrittsdaten

Master thesis
Approved by the Faculty of Civil Engineering of the Technische Universität Dresden

Written by
Syed Adil Bari

Supervisors:
Prof. Dr.-Ing. Raimar J. Scherer

Coordinators:
M. Sc. Martin Just (TU Dresden)
B.E. Christian Boog (Max Bögl)

Date of Submission: 15.01.2015
Date of Presentation:

Confidentiality clause

This Master's Thesis entitled

Model-based Data mining and visualization – A study of data mining techniques to analyze construction project progress datasets
Modellbasiertes Data Mining und Datenvisualisierung – Eine Untersuchung von Data Mining Techniken zur Analyse von Projektfortschrittsdaten

contains confidential and internal information and data of the company Max Bögl Bauservice GmbH & Co. KG, Neumarkt – Germany.

Insight into the thesis in any way is forbidden for unauthorized third parties. This work may only be made available to the first and second reviewers and authorized members of the committee.

Distribution, duplication, or publication of the content of the thesis, including in particular but not limited to the enclosed drawings and any part of the data, also in digital form, is not allowed.

An exception to these regulations requires the explicit permission of both the author and the company Max Bögl Bauservice GmbH & Co. KG, Neumarkt – Germany.

I declare that the Master thesis with the title “Model-based Data mining and visualization – A study of data mining techniques to analyze construction project progress datasets” is entirely my own work. I have not sought or used inadmissible help from third parties to produce this work, and I have clearly referenced all sources used in it. No resources or means other than those specified were used. In each case, the sources of direct quotations and references from other works were given and indicated.

This work has not yet been submitted to another examination institution, neither in Germany nor outside Germany, neither in the same nor in a similar form, and has not yet been published.

Date: 11.01.2015
Syed Adil Bari

TABLE OF CONTENTS

TITLE PAGE
CERTIFICATE
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF DEDICATION

CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Motivation
1.3 Construction Industry needs BIM and Data Mining
1.4 A brief about Statistics and Data Mining
1.5 Description of Research Problem
1.5.1 Objectives
1.5.2 Research Approach
1.5.3 Research Scope
1.6 Thesis Structure

CHAPTER 2: Literature Review
Background
2.1 What is Data Mining?
2.2 Data Mining in Construction Industry?
2.3 Why Data Mining?
2.4 What kind of Data can be mined?
2.5 Stages of Data Mining
2.5.1 Classification
2.5.2 Regression
2.5.3 Clustering
2.5.4 Dependency Modeling or Association Rule Learning
2.5.5 Deviation Detection or Outlier Detection
2.5.6 Summarization
2.6 Which technologies are used?
2.6.1 Statistics
2.6.2 Database System
2.7 The Data Mining Team
2.8 Documentation along with Data Mining
2.9 Tools for Data Mining: Why R?
2.10 How R and Rattle work

CHAPTER 3: Implementation of Knowledge Discovery Techniques in Construction Industry
Background
3.1 Defining System
3.1.1 Data Quality
3.1.2 Data Type
3.1.3 Data Collection for Quality Test
3.2 Data Extraction
3.3 Data Cleaning
3.3.1 Binning
3.3.2 Correlation Analysis
3.3.3 Results from Linear Regression
3.3.4 Outlier Detection
3.4 Data Integration
3.5 Data Selection
3.6 Data Transformation
3.7 Data Mining
3.7.1 Description of Attributes
3.7.2 Data Mining Tool
3.7.3 Classification
3.7.4 Clustering
3.7.5 Association Rule Learning
3.7.6 Data Modelling
3.8 Evaluation
3.8.1 Shopping Complex Project
3.8.2 Parking Plaza Project
3.8.3 Infrastructure Project
3.9 Knowledge Representation

CHAPTER 4: Implementation of Visualization Technique
Background
4.1 Data Visualization
4.1.1 Graph Visualization Technique
4.1.2 Pixel-Oriented Visualization Technique
4.1.3 Geometric Projection Visualization Technique
4.1.4 Icon- Method
4.1.5 Periodic Slice Visualization Technique
4.2 Implementation of Visualization Technique
4.2.1 Basic Graph
4.2.2 Bi-directional Interactive Graph
4.2.3 Advanced Graph
4.3 Concluding Dashboard
4.3.1 Project-wise Data Representation

CHAPTER 5: Conclusion and Recommendations
Background
5.1 Research Summary
5.2 Research Contributions
5.2.1 Identification of Measures for Enhanced Quality Data
5.2.2 Designing and Development of a Web-form
5.2.3 Implementation of Data Mining Technique
5.3 Limitations
5.3.1 Limited Type of Attributes
5.3.2 Limited Case Study
5.3.3 Manual handling for Predictive Analysis with R
5.4 Future Research
5.4.1 Dynamic Selection Mode for an Interactive Graph
5.4.2 Development of an integrated GUI
5.4.3 Development of Success Evaluation Schema
5.4.4 Anticipated Schedule Notifier (Model Driven Activity Anticipating Notifier)
5.5 Concluding Remarks

REFERENCES

APPENDICES
APPENDIX A: Developed and Utilized Data-sets
APPENDIX B: Survey Questions and Results
APPENDIX C: Evaluation of Data mining models of a shopping complex project
APPENDIX D: 3D Model of the investigated projects

LIST OF FIGURES

Figure. 1.1. Golden circle: Uniqueness and importance of this research work
Figure. 1.2. How BIM and Data Mining are helping the Construction Industry
Figure. 2.1. The world is data rich but information poor [20].
Figure. 2.2. Data mining as a step in the process of knowledge discovery [20].
Figure. 2.3. Growth of the global data acquisition and the growth of the BIM in North America, sources: [13, 26].
Figure. 2.4. Snapshot of the CSV file exported from the model containing program (desite MD).
Figure. 2.5. Hunt’s algorithm for inducing decision trees
Figure. 2.6. Scatter plot along with regression line for planned days versus number of days delayed for each activity.
Figure. 2.7. Clustering of data for delay statuses on the basis of planned days
Figure. 2.8. Possible Association rule that algorithm can learn during data mining, Sources: 20workers.jpg tion-equipment2.jpg .
Figure. 2.9. Number of days delayed for several activities in a construction project.
Figure. 2.10. Various Techniques that are applied to Data Mining
Figure. 2.11. Snapshot of the R-console, version 3.1.1, released on 10th July 2014.
Figure. 2.12. Snapshot of the R-Studio, version 0.98.1062, released on 18th Sept 2014
Figure. 2.13. Representing the Rattle, version 3.3.0, released on 10th Sept 2014.
Figure. 3.1. Complex construction site equipped with technology to create, manage and utilize project data
Figure. 3.2. Conceptualization of data flow in desite MD to develop a graph: representing missing data of a construction project.
Figure. 3.3. Conceptualization of data flow in desite MD and importance of data quality: representing recovery of data of a construction project.
Figure. 3.4. Showing a snippet of code used to generate a basic csv file from desite MD.
Figure. 3.5. Showing the R-code and summary of the observation for testShoppingMall basic-00.csv.
Figure. 3.6. Showing a use of API embedded in the JavaScript snippet to create a new attribute named DM actualStartDate.
Figure. 3.7. Showing the R-code and summary of the observation for testShoppingMall basic-01.csv.
Figure. 3.8. Binning methods for data smoothing.
Figure. 3.9. Scatter plot between Days.finishDelay and Days.startDelay from M basic.csv.
Figure. 3.10. Linear regression line between Days.finishDelay and Days.startDelay from M basic.csv.
Figure. 3.11. R-code to generate the linear regression line between Days.finishDelay and Days.startDelay from M basic.csv.
Figure. 3.12. Showing a box plot for the Days.startDelay and Days.finishDelay.
Figure. 3.13. Showing a box plot for the sub Days.startDelay and sub Days.finishDelay.
Figure. 3.14. Linear regression line between sub Days.finishDelay and sub Days.startDelay.
Figure. 3.15. Linear regression line between sub Days.finishDelay1 and sub Days.startDelay1.
Figure. 3.16. Linear regression line between bins of finish Delay and start Delay.
Figure. 3.17. Summary of an imported dataset of a yearly weather from the wunderground.com [54] in R.
Figure. 3.18. Representing a frequency of observations in different attributes.
Figure. 3.19. Loading dataset-1 in Rattle representing attribute’s unique and missing values.
Figure. 3.20. Loading dataset-1 along with the changes applied as a result of learning from above practices within Rattle.
Figure. 3.21. Classification of attributes using Boosting technique, provided by Ada package.
Figure. 3.22. 2D scatter plot between various attribute elements using hierarchical cluster.
Figure. 3.23. 2D scatter between selected attribute using KMean cluster.
Figure. 3.24. Evaluation of sensitivity of four (4) datasets using Rattle for various predictive models.
Figure. 3.25. Evaluation of risk of four (4) datasets using Rattle for various predictive models.
Figure. 3.26. Evaluate a predicted Overall Accuracy for four (4) prepared datasets of a Shopping Complex (Building) project.
Figure. 3.27. Evaluate a predicted Overall Accuracy for two (2) prepared datasets of a Parking Plaza (Building) project.
Figure. 3.28. Evaluate a predicted Overall Accuracy for two (2) prepared datasets of a Concrete Bridge (Infrastructure) project.
Figure. 3.29. Explanation of anticipated finished dates for an exemplary project.
Figure. 4.1. Data Flow and Logical Visualization Plan.
Figure. 4.2. Radial tree diagram, source: R. M. Tarawneh et al., A general Introduction to Graph Visualization Techniques, pg 155 [68].
Figure. 4.3. Pixel-oriented visualization of four attributes by sorting all tasks in Duration ascending order.
Figure. 4.4. Visualization of a 2-D data set using a scatter plot. Source: matica06.pdf .
Figure. 4.5. Visualization of a 3-D data set using a scatter plot. Source: /Scatterplot.jpg .
Figure. 4.6. Chernoff faces. Each face represents an n-dimensional data point (n = 18). Source: J. Han et al., Data mining concepts and techniques, 3rd edition, pg. 62 [20, 69].
Figure. 4.7. Time series for the planned activities for a construction of a shopping complex.
Figure. 4.8. Conceptualizing functionality of a BIM managing tool (desite MD).
Figure. 4.9. Tools and libraries used for visualization technique in a BIM managing tool (desite MD).
Figure. 4.10. Existing scheme of desite MD platform representing connection between various components.
Figure. 4.11. Proposed scheme for integrating desite MD platform with R console.
Figure. 4.12. Basic graph schema for a tree diagram to show a flow of data stream from model to a graphical representation.
Figure. 4.13. Tree graph representing the quality of a shopping complex model.
Figure. 4.14. Bidirectional interactive graph schema for a timeline diagram to show a flow of the data stream from a model to a graphical representation.
Figure. 4.15. Bidirectional interactive graph schema for a stack diagram to represent the flow of the data stream from a model to a graphical representation.
Figure. 4.16. Bidirectional interactive graph for showing the timeline for a shopping complex.
Figure. 4.17. ed completion statuses for a shopping complex.
Figure. 4.18. Advance graph schema for a line graph.
Figure. 4.19. Advance graph showing a snapshot of line graph.
Figure. 4.20. Advance graph showing a snapshot of stack graph.
Figure. 4.21. Main Dashboard view for a shopping complex project.
Figure. 4.22. Exhaustive view of Project Dashboard with closed collapsible frames.
Figure. 4.23. First instance view of Dashboard, showing frame of Bidirectional Interactive Project Timeline.
Figure. 5.1. Proposed research phases for development of model driven activity anticipating notifier (MDAAN).

LIST OF TABLES

Table. 2.1. Training data set for predicting delay statuses.
Table. 2.2. Data summarization for predicting delay statuses.
Table. 3.1. General Description of several types of datasets attached to a 3D model of a building project.
Table. 3.2. PICOTS data extraction for considering valuable variable for the analysis.
Table. 3.3. Example of a summary table for the progress datasets at basic level.
Table. 3.4. Type of incompetency of a dataset along with this research related examples.
Table. 3.5. Analysis of linear regression model with and without outliers.
Table. 3.6. Comparison between linear regression model of original and bin median dataset.
Table. 3.7. Total number of attributes in the database and the selected attributes for data analysis and data mining.
Table. 3.8. Outcomes of pre and post scaling process on 15% of validating data for the datasets M basic.csv.
Table. 3.9. Outcomes of pre and post scaling process on 15% of validating data for the datasets M basic.csv.
Table. 3.10. Enlist the attribute in testShoppingMall basic-all.csv, from here forward termed as a dataset-1.
Table. 3.11. Enlist the attribute in testShoppingMall basic-all-ref.csv, from here forward termed as a dataset-2.
Table. 3.12. Subset of dataset-1 and it will be termed as dataset-3.
Table. 3.13. Subset of dataset-2 and it will be termed as dataset-4.
Table. 3.14. Showing results obtained for the dataset-1 using Rattle’s model tab.
Table. 3.15. Evaluation of the
