Attribute Dependency Data Analysis for Massive Datasets by Fuzzy Transforms (Springer)


Soft Computing (2021), Data Analytics and Machine Learning

Attribute dependency data analysis for massive datasets by fuzzy transforms

Ferdinando Di Martino (1), Salvatore Sessa (2)

1 Dipartimento di Architettura, Università degli Studi di Napoli Federico II, Via Toledo 402, 80134 Napoli, Italy (fdimarti@unina.it)
2 Centro Interdipartimentale di Ricerca "A. Calza Bini", Università degli Studi di Napoli Federico II, Via Toledo 402, Napoli, Italy (sessa@unina.it)

Accepted: 19 March 2021 / Published online: 15 April 2021. © The Author(s) 2021

Abstract
We present a numerical attribute dependency method for massive datasets based on the concepts of direct and inverse fuzzy transform. In a previous work, we used these concepts for numerical attribute dependency in data analysis: there, the multi-dimensional inverse fuzzy transform was useful for approximating a regression function. Here we extend that method to massive datasets, to which the previous method could not be applied because of its memory requirements. Our method is tested on a large dataset formed by 402,678 census sections of the Italian regions, provided by the Italian National Statistical Institute (ISTAT) in 2011. Comparative tests with two well-known regression methods, support vector regression and multilayer perceptron, show that the proposed algorithm has comparable performance with both, while requiring fewer parameters than either of them.

Keywords: Attribute dependency, Data mining, Fuzzy transform

1 Introduction

Data analysis and data mining knowledge discovery processes represent powerful functionalities that can be combined in knowledge-based expert and intelligent systems in order to extract and build knowledge starting from data. In particular, attribute dependency data analysis is an activity necessary to reduce the dimensionality of the data and to detect hidden relations between features. Nowadays, in many application fields, data sources are massive (for example, web social data, sensor data, etc.), and it is necessary to implement knowledge extraction methods that can operate on massive data.

Massive (Very Large (VL) and Large (L)) datasets (Chen and Zhang 2014) are continuously produced and updated, and they cannot be managed by traditional databases. Today, access via the Web to these datasets has led to the development of technologies for managing them (cf., e.g., Dean 2014; Leskovec et al. 2014; Singh et al. 2015).

We recall regression analysis (cf., e.g., Draper and Smith 1988; Han et al. 2012; Johnson and Wichern 1992; Jun et al. 2015; Piatetsky-Shapiro and Frawley 1991) for estimating relationships among variables in datasets (cf., e.g., Lee and Yen 2004; Mitra et al. 2002; Tanaka 1987; Wood et al. 2015) and fuzzy tools for attribute dependency (Vucetic et al. 2013; Yen and Lee 2011). Machine learning soft computing models have been proposed in the literature to perform nonlinear regression on high-dimensional data; two well-known machine learning nonlinear regression algorithms are support vector regression (SVR) (Drucker et al. 1996) and the multilayer perceptron (MLP) (cf., e.g., Collobert and Bengio 2004; Cybenko 1989; Hastie et al. 2009; Haykin 1999, 2009; Murtagh 1991; Schmidhuber 2014). The main problems of these algorithms are the complexity of the model, due to the many parameters to be set by the user, and overfitting, the phenomenon in which the regression function fits the training set data optimally but

fails in predictions on new data. K-fold cross-validation techniques have been proposed in the literature to avoid overfitting (Anguita et al. 2005). In Thomas and Suhner (2015), a pruning method based on variance sensitivity analysis is proposed to find the optimal structure of a multilayer perceptron in order to mitigate overfitting problems. In Han and Jian (2019), a novel sparse-coding kernel algorithm is proposed to overcome overfitting in disease diagnosis.

Some authors have proposed variations of nonlinear machine learning regression models to manage massive data. In Cheng et al. (2010) and Segata and Blanzieri (2009), fast local support vector machine (SVM) methods for large datasets are presented, in which a set of multiple local SVMs for low-dimensional data is constructed. In Zheng et al. (2013), the authors propose an incremental version of the support vector machine regression model to manage large-scale data. In Peng et al. (2013), the authors propose a parallel architecture of a logistic regression model for massive data management. Recently, variations of the extreme learning machine (ELM) regression method for massive data, based on the MapReduce model, have been presented (Chen et al. 2017; Yao and Ge 2019).

The presence of a high number of parameters makes the SVR and MLP methods too complex to be integrated as components into an intelligent or expert system. In this research, we propose a model of attribute dependency in massive datasets based on the use of the multi-dimensional fuzzy transform. We extend the attribute dependency method presented in Martino et al. (2010a) to massive datasets, in which the inverse multi-dimensional fuzzy transform is used as a regression function. Our goal is to guarantee high performance of the proposed method in the analysis of massive data while maintaining the usability of the previous multi-dimensional fuzzy transform attribute dependency method. As in Jun et al.
(2015), we use a random sampling algorithm for subdividing the dataset into subsets of equal cardinality.

The fuzzy transform (F-transform) method (Perfilieva 2006) is a technique which approximates a given function by means of another function, up to an arbitrary constant. This approach is particularly flexible in applications such as image processing (cf., e.g., Martino et al. 2008, 2010b, 2011b; Martino and Sessa 2007, 2012) and data analysis (cf., e.g., Martino et al. 2010a, 2011a; Perfilieva et al. 2008). In Martino et al. (2010a), an algorithm called FAD (F-transform Attribute Dependency) evaluates an attribute X_z depending on k attributes X_1, ..., X_k (predictors), with z ∉ {1, 2, ..., k}, i.e., X_z = H(X_1, ..., X_k), where the (unknown) function H is approximated with the inverse multi-dimensional F-transform via a procedure presented in Perfilieva et al. (2008). The error of this approximation is measured in Martino et al. (2010a) by a statistical index of determinacy (Draper and Smith 1988; Johnson and Wichern 1992). If it exceeds a prefixed threshold, then the functional dependency is found. Each attribute X_i has an interval [a_i, b_i], i = 1, ..., k, as domain of knowledge. Then a uniform fuzzy partition (whose definition is given in Sect. 2) of fuzzy sets {A_{i1}, A_{i2}, ..., A_{in_i}} defined on [a_i, b_i] is created, assuming n_i ≥ 3.

The main problem in the use of the inverse F-transform for approximating the function H is that the data may not be sufficiently dense with respect to the fuzzy partitions. The FAD algorithm handles this with an iterative process, shown in Sect. 3. If the data are not sufficiently dense with respect to the fuzzy partitions, the process stops; otherwise, an index of determinacy is calculated. If this index is greater than a threshold α, the functional dependency is found and the inverse F-transform is taken as the approximation of the function H; otherwise, a finer fuzzy partition is set with n := n + 1. The FAD algorithm is schematized in Fig.
1.

In this paper, we propose an extension of the FAD algorithm, called MFAD (massive F-transform attribute dependency), for finding dependencies between numerical attributes in massive datasets. In other words, by using a uniform sampling method, we can apply the algorithm of Martino et al. (2010a) to several sample subsets of the data and then extend the results obtained to the overall dataset with suitable mathematical artifices. Indeed, the dataset is partitioned randomly into s subsets of equal cardinality, to each of which we apply the F-transform method.

Let D_l = [a_{1l}, b_{1l}] × ... × [a_{kl}, b_{kl}], l = 1, ..., s, be the Cartesian product of the domains of the attributes X_1, X_2, ..., X_k, where a_{il} and b_{il} are the minimum and maximum values of X_i in the l-th subset. Hence, the multi-dimensional inverse F-transform H^F_{n_{1l} n_{2l} ... n_{kl}} is calculated for approximating the function H in the domain D_l, and an index of determinacy r²_{cl} is calculated for evaluating the error of the approximation of H with H^F_{n_{1l} n_{2l} ... n_{kl}} in D_l. For simplicity, we put n_{1l} = n_{2l} = ... = n_{kl} = n_l, and thus H^F_{n_{1l} n_{2l} ... n_{kl}} = H^F_{n_l}. In order to obtain the final approximation of H, we introduce weights accounting for the contribution of each inverse F-transform H^F_{n_l}: we calculate the weighted mean of H^F_{n_1}, ..., H^F_{n_s}, taking as weights the indices of determinacy r²_{c1}, ..., r²_{cs}. The approximated value of H^F in (x_1, ..., x_k) ∈ ∪_{l=1}^{s} D_l is given by

H^F(x_1, x_2, ..., x_k) = (Σ_{l=1}^{s} w_l(x_1, x_2, ..., x_k) · H^F_{n_l}(x_1, x_2, ..., x_k)) / (Σ_{l=1}^{s} w_l(x_1, x_2, ..., x_k))    (1)

where

w_l(x_1, x_2, ..., x_k) = r²_{cl} if (x_1, x_2, ..., x_k) ∈ D_l, and 0 otherwise.    (2)

Fig. 1 Flux diagram of the FAD algorithm

For example, we consider two attributes, X_1 and X_2, as inputs and suppose, for simplicity, that the dataset is partitioned into two subsets. Figure 2 shows two rectangles, D_1 (red) and D_2 (green). The zone labeled A of the input space is covered only by the domain D_2: in this zone the weight w_1 is null and H^F = H^F_2. Conversely, in zone C the contribution of H^F_2 is null and H^F = H^F_1. In the zone labeled B, the inverse F-transforms calculated for both subsets contribute to the final evaluation of H, each with a weight corresponding to its index of determinacy.

Fig. 2 Example of union of domains of the subsets in which the dataset is partitioned

Figure 3 shows the schema of MFAD. We apply our method to an L dataset loadable in memory, so that the method of Martino et al. (2010a) is also applicable and we can compare the results obtained by the two methods. As test dataset, we consider the last Italian census data, acquired during 2011 by ISTAT (Italian National Statistical Institute). Section 2 recalls the F-transform in one and more variables (Perfilieva et al. 2008). In Sect. 3, the F-transform attribute dependency method is presented; Sect. 4 contains the results of our tests; conclusions are given in Sect. 5.

2 F-transforms in one and k variables

We follow the definitions of Perfilieva (2006) and recall the main notions to make this paper self-contained. Let n ≥ 2 and let x_1, x_2, ..., x_n be points (nodes) of [a,b], with a = x_1 < x_2 < ... < x_n = b. The fuzzy sets A_1, ..., A_n: [a,b] → [0,1] (basic functions) constitute a fuzzy partition of [a,b] if:
- A_i(x_i) = 1 for i = 1, 2, ..., n;
- A_i(x) = 0 if x ∉ (x_{i-1}, x_{i+1}), with the conventions x_0 = a and x_{n+1} = b;
- A_i(x) is continuous on [a,b];
- A_i(x) strictly increases on [x_{i-1}, x_i] for i = 2, ..., n and strictly decreases on [x_i, x_{i+1}] for i = 1, ..., n-1;
- Σ_{i=1}^{n} A_i(x) = 1 for every x ∈ [a,b].

The partition {A_1(x), ..., A_n(x)} is said to be uniform if n ≥ 3; x_i = a + h·(i-1), where h = (b-a)/(n-1) and i = 1, 2, ..., n (equidistant nodes); A_i(x_i - x) = A_i(x_i + x) for x ∈ [0,h] and i = 2, ..., n-1; and A_{i+1}(x) = A_i(x - h) for x ∈ [x_i, x_{i+1}] and i = 1, 2, ..., n-1.

Suppose the function f assumes given values in the points p_1, ..., p_m of [a,b]. If the set P = {p_1, ..., p_m} is sufficiently dense with respect to {A_1, A_2, ..., A_n}, that is, for every i ∈ {1, ..., n} there exists an index j ∈ {1, ..., m} such that A_i(p_j) > 0, then the n-tuple [F_1, F_2, ..., F_n] is the discrete direct F-transform of f with respect to {A_1, A_2, ..., A_n}, where each F_i is given by

F_i = (Σ_{j=1}^{m} f(p_j) · A_i(p_j)) / (Σ_{j=1}^{m} A_i(p_j))    (3)

for i = 1, ..., n. Then we define the discrete inverse F-transform of f with respect to the basic functions {A_1, A_2, ..., A_n} by setting

f_{F,n}(p_j) = Σ_{i=1}^{n} F_i · A_i(p_j)    (4)

for every j ∈ {1, ..., m}.

Now we recall concepts from Perfilieva et al. (2008). The F-transforms can be extended to k (≥ 2) variables by considering the Cartesian product of intervals [a_1,b_1] × [a_2,b_2] × ... × [a_k,b_k]. Let x_{11}, x_{12}, ..., x_{1n_1} ∈ [a_1,b_1], ..., x_{k1}, x_{k2}, ..., x_{kn_k} ∈ [a_k,b_k] be n_1 + ... + n_k assigned points (nodes) such that a_i = x_{i1} < x_{i2} < ... < x_{in_i} = b_i, and let {A_{i1}, A_{i2}, ..., A_{in_i}} be a fuzzy partition of [a_i,b_i] for i = 1, ..., k. Let the function f(x_1, x_2, ..., x_k) assume values in m points p_j = (p_{j1}, p_{j2}, ..., p_{jk}) ∈ [a_1,b_1] × [a_2,b_2] × ... × [a_k,b_k], j = 1, ..., m. The set P = {(p_{11}, p_{12}, ..., p_{1k}), (p_{21}, p_{22}, ..., p_{2k}), ..., (p_{m1}, p_{m2}, ..., p_{mk})} is said to be sufficiently dense with respect to {A_{11}, A_{12}, ..., A_{1n_1}}, ..., {A_{k1}, A_{k2}, ..., A_{kn_k}} if for every (h_1, ..., h_k) ∈ {1, ..., n_1} × ... × {1, ..., n_k} there exists p_j = (p_{j1}, p_{j2}, ..., p_{jk}) ∈ P, j ∈ {1, ..., m}, with A_{1h_1}(p_{j1}) · A_{2h_2}(p_{j2}) · ... · A_{kh_k}(p_{jk}) > 0. Then we define the (h_1, h_2, ..., h_k)-th component F_{h_1 h_2 ... h_k} of the discrete direct F-transform of f with respect to {A_{11}, ..., A_{1n_1}}, ..., {A_{k1}, ..., A_{kn_k}} as

F_{h_1 h_2 ... h_k} = (Σ_{j=1}^{m} f(p_{j1}, p_{j2}, ..., p_{jk}) · A_{1h_1}(p_{j1}) · A_{2h_2}(p_{j2}) · ... · A_{kh_k}(p_{jk})) / (Σ_{j=1}^{m} A_{1h_1}(p_{j1}) · A_{2h_2}(p_{j2}) · ... · A_{kh_k}(p_{jk}))    (5)
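For readers who want to experiment, the one-dimensional direct and inverse F-transforms of Eqs. (3)-(4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the raised-cosine basic functions anticipate Eq. (10) of Sect. 3, and all function names are our own.

```python
import numpy as np

def uniform_cosine_partition(a, b, n):
    """Nodes and basic functions of a uniform fuzzy partition of [a, b]
    made of n >= 3 raised-cosine shapes (cf. Eq. (10))."""
    assert n >= 3
    h = (b - a) / (n - 1)
    nodes = a + h * np.arange(n)

    def A(i, x):
        # A_i(x): raised cosine centered at nodes[i], zero for |x - nodes[i]| > h
        d = np.abs(np.asarray(x, dtype=float) - nodes[i])
        return np.where(d <= h, 0.5 * (1.0 + np.cos(np.pi * d / h)), 0.0)

    return nodes, A

def direct_ftransform(f_vals, points, a, b, n):
    """Components F_1, ..., F_n of the discrete direct F-transform, Eq. (3)."""
    _, A = uniform_cosine_partition(a, b, n)
    F = np.empty(n)
    for i in range(n):
        w = A(i, points)
        if w.sum() == 0.0:  # the data are not sufficiently dense w.r.t. the partition
            raise ValueError("data not sufficiently dense for n = %d" % n)
        F[i] = (f_vals * w).sum() / w.sum()
    return F

def inverse_ftransform(F, points, a, b):
    """Discrete inverse F-transform of Eq. (4), evaluated at 'points'."""
    _, A = uniform_cosine_partition(a, b, len(F))
    return sum(F[i] * A(i, points) for i in range(len(F)))
```

For instance, with f(x) = x² sampled at 101 equally spaced points of [0, 1] and n = 11, the inverse F-transform closely reproduces f, with the largest deviation near the ends of the interval, where the basic functions average f one-sidedly.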

Thus we define the discrete inverse F-transform of f with respect to {A_{11}, ..., A_{1n_1}}, ..., {A_{k1}, ..., A_{kn_k}} by setting, for p_j = (p_{j1}, p_{j2}, ..., p_{jk}) ∈ [a_1,b_1] × ... × [a_k,b_k]:

f^F_{n_1 n_2 ... n_k}(p_{j1}, p_{j2}, ..., p_{jk}) = Σ_{h_1=1}^{n_1} Σ_{h_2=1}^{n_2} ... Σ_{h_k=1}^{n_k} F_{h_1 h_2 ... h_k} · A_{1h_1}(p_{j1}) · ... · A_{kh_k}(p_{jk})    (6)

for j = 1, ..., m. The following theorem holds (Perfilieva 2006):

Theorem 1. Let f(x_1, x_2, ..., x_k) be a function assigned on the set of points P = {(p_{11}, p_{12}, ..., p_{1k}), (p_{21}, p_{22}, ..., p_{2k}), ..., (p_{m1}, p_{m2}, ..., p_{mk})} ⊆ [a_1,b_1] × [a_2,b_2] × ... × [a_k,b_k]. Then for every ε > 0 there exist k integers n_1(ε), ..., n_k(ε) and related fuzzy partitions

{A_{11}, A_{12}, ..., A_{1n_1(ε)}}, ..., {A_{k1}, A_{k2}, ..., A_{kn_k(ε)}}    (7)

such that the set P is sufficiently dense with respect to the fuzzy partitions (7), and for every p_j = (p_{j1}, p_{j2}, ..., p_{jk}) ∈ P, j = 1, ..., m, the following inequality holds:

|f(p_{j1}, p_{j2}, ..., p_{jk}) − f^F_{n_1(ε) n_2(ε) ... n_k(ε)}(p_{j1}, p_{j2}, ..., p_{jk})| < ε    (8)

3 Multi-dimensional algorithm for massive datasets

3.1 FAD algorithm

We schematize a dataset in tabular form as

       X_1    ...  X_i    ...  X_r
O_1    p_11   ...  p_1i   ...  p_1r
...    ...    ...  ...    ...  ...
O_j    p_j1   ...  p_ji   ...  p_jr
...    ...    ...  ...    ...  ...
O_m    p_m1   ...  p_mi   ...  p_mr

Here X_1, ..., X_i, ..., X_r are the involved attributes, O_1, ..., O_j, ..., O_m (m > r) are the instances, and p_ji is the value of the attribute X_i for the instance O_j. Each attribute X_i can be considered as a numerical variable assuming values in the domain [a_i, b_i], where a_i = min{p_1i, ..., p_mi} and b_i = max{p_1i, ..., p_mi}. We analyze functional dependencies between attributes in the form

X_z = H(X_1, ..., X_k)    (9)

where z ∈ {1, ..., r}, k ≤ r < m, X_z ∉ {X_1, X_2, ..., X_k}, and H: [a_1,b_1] × [a_2,b_2] × ... × [a_k,b_k] → [a_z,b_z] is continuous. On [a_i,b_i], i = 1, 2, ..., k, a uniform partition {A_{i1}, ..., A_{ij}, ..., A_{in}} is defined by

A_{i1}(x) = 0.5 · (1 + cos((π/h_i)·(x − x_{i1}))) if x ∈ [x_{i1}, x_{i2}], and 0 otherwise;
A_{ij}(x) = 0.5 · (1 + cos((π/h_i)·(x − x_{ij}))) if x ∈ [x_{i(j−1)}, x_{i(j+1)}], and 0 otherwise, for j = 2, ..., n−1;
A_{in}(x) = 0.5 · (1 + cos((π/h_i)·(x − x_{in}))) if x ∈ [x_{i(n−1)}, x_{in}], and 0 otherwise;    (10)

where h_i = (b_i − a_i)/(n − 1) and x_{ij} = a_i + h_i·(j − 1).

By setting H(p_{j1}, p_{j2}, ..., p_{jk}) = p_{jz} for j = 1, 2, ..., m, the components of the direct F-transform of H are given by

F_{h_1 h_2 ... h_k} = (Σ_{j=1}^{m} p_{jz} · A_{1h_1}(p_{j1}) · ... · A_{kh_k}(p_{jk})) / (Σ_{j=1}^{m} A_{1h_1}(p_{j1}) · ... · A_{kh_k}(p_{jk}))    (11)
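To make the k-variable formulas concrete, here is a hedged sketch for the case k = 2: the direct components of Eq. (11) over the grid of basic-function pairs, and the inverse F-transform of Eq. (6), using the cosine shapes of Eq. (10). The function names are our own, not part of the paper.

```python
import numpy as np

def cosine_basis(a, b, n, x):
    """Matrix A[i, j] = A_{i+1}(x[j]) of the n basic functions of Eq. (10)."""
    h = (b - a) / (n - 1)
    nodes = a + h * np.arange(n)
    d = np.abs(x[None, :] - nodes[:, None])
    return np.where(d <= h, 0.5 * (1.0 + np.cos(np.pi * d / h)), 0.0)

def direct_F2(z, x1, x2, dom1, dom2, n):
    """Direct F-transform components F_{h1 h2} of Eq. (11) for k = 2 predictors."""
    A1 = cosine_basis(dom1[0], dom1[1], n, x1)      # shape (n, m)
    A2 = cosine_basis(dom2[0], dom2[1], n, x2)      # shape (n, m)
    W = A1[:, None, :] * A2[None, :, :]             # products A_{1h1}(p_j1)*A_{2h2}(p_j2)
    den = W.sum(axis=2)
    if np.any(den == 0.0):                          # sufficient-density check
        raise ValueError("data not sufficiently dense for n = %d" % n)
    return (W * z).sum(axis=2) / den

def inverse_F2(F, x1, x2, dom1, dom2):
    """Inverse F-transform of Eq. (6) for k = 2, at the points (x1[j], x2[j])."""
    n = F.shape[0]
    A1 = cosine_basis(dom1[0], dom1[1], n, x1)
    A2 = cosine_basis(dom2[0], dom2[1], n, x2)
    return np.einsum('gh,gm,hm->m', F, A1, A2)
```

On a dense sample of a smooth surface (e.g. z = 2·x1 + 3·x2 on [0,1]²), the reconstruction is accurate in the interior and degrades mildly near the domain boundary, mirroring the one-dimensional behavior.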

The inverse F-transform H^F_{n_1 n_2 ... n_k} is defined as

H^F_n(p_{j1}, p_{j2}, ..., p_{jk}) = Σ_{h_1=1}^{n} Σ_{h_2=1}^{n} ... Σ_{h_k=1}^{n} F_{h_1 h_2 ... h_k} · A_{1h_1}(p_{j1}) · ... · A_{kh_k}(p_{jk})    (12)

The error of the approximation is evaluated at (p_{j1}, p_{j2}, ..., p_{jk}) by using the following statistical index of determinacy (Draper and Smith 1988; Johnson and Wichern 1992):

r_c² = Σ_{j=1}^{m} (H^F_{n_1 n_2 ... n_k}(p_{j1}, p_{j2}, ..., p_{jk}) − p̄_z)² / Σ_{j=1}^{m} (p_{jz} − p̄_z)²    (13)

where p̄_z is the mean of the values of the attribute X_z. A value r_c² = 0 (resp., r_c² = 1) means that (12) does not fit the data at all (resp., fits them perfectly). However, we use a variation of (13) that takes into account both the number of independent variables and the size of the sample used (Martino et al. 2010a), given by

r'_c² = 1 − (1 − r_c²) · (m − 1)/(m − k − 1)    (14)

The pseudocode of the FAD algorithm is schematized below. The function DirectFuzzyTransform() calculates each direct F-transform component; the function BasicFunction() calculates the value A_{ih_i}(x) of the h_i-th basic function of the i-th fuzzy partition for an assigned x; IndexOfDeterminacy() calculates the index of determinacy.
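A minimal single-predictor (k = 1) sketch of this iterative scheme is given below. It illustrates the loop of Fig. 1 (refine n until the adjusted index of determinacy exceeds α, or stop when the data are no longer sufficiently dense); it is our own illustration under the raised-cosine partition of Eq. (10), not the authors' pseudocode.

```python
import numpy as np

def basic_function(a, h, i, x):
    # A_i(x): raised-cosine basic function centered at node a + i*h (cf. Eq. (10))
    d = abs(x - (a + i * h))
    return 0.5 * (1.0 + np.cos(np.pi * d / h)) if d <= h else 0.0

def fad_1d(x, z, alpha=0.7, n_max=50):
    """Sketch of FAD for one predictor: refine n until r'^2_c > alpha, or give up
    when the data stop being sufficiently dense. Returns (n, r2_adj, predictions),
    or None if no dependency is found."""
    a, b = x.min(), x.max()
    m = len(x)
    for n in range(3, n_max):
        h = (b - a) / (n - 1)
        A = np.array([[basic_function(a, h, i, xj) for xj in x] for i in range(n)])
        dens = A.sum(axis=1)
        if np.any(dens == 0.0):            # not sufficiently dense: stop
            return None
        F = (A @ z) / dens                 # direct F-transform components, Eq. (3)
        pred = F @ A                       # inverse F-transform at the data points, Eq. (4)
        r2 = ((pred - z.mean()) ** 2).sum() / ((z - z.mean()) ** 2).sum()  # Eq. (13)
        r2_adj = 1.0 - (1.0 - r2) * (m - 1) / (m - 2)                      # Eq. (14), k = 1
        if r2_adj > alpha:                 # dependency found
            return n, r2_adj, pred
    return None
```

With densely sampled, smooth data the loop terminates after a few refinements; with scattered data the density check may fail before the threshold is reached, which is exactly the stopping case of Fig. 1.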



3.2 MFAD algorithm

We consider a massive dataset DT composed of r attributes X_1, ..., X_i, ..., X_r and m instances O_1, ..., O_j, ..., O_m (m > r). We partition DT into s subsets DT_1, ..., DT_s of the same cardinality by uniform random sampling, in such a way that each subset is loadable in memory. We apply the FAD algorithm to each subset, calculating the direct F-transform components, the inverse F-transforms H^F_{n_1}, ..., H^F_{n_s}, the indices of determinacy r'²_{c1}, ..., r'²_{cs} and the domains D_1, ..., D_s, where D_l = [a_{1l}, b_{1l}] × ... × [a_{kl}, b_{kl}], l = 1, ..., s. All these quantities are saved in memory. If a dependency is not found for the l-th subset, the corresponding value of r'²_{cl} is set to 0. The pseudocode of MFAD is given below.
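The initial partitioning step, a uniform random split into s subsets of (nearly) equal cardinality, each small enough to process in memory, can be sketched as follows. The function name and the seed are our own choices for illustration.

```python
import numpy as np

def random_partition(data, s, seed=0):
    """Split the rows of 'data' into s random subsets of (nearly) equal
    cardinality, as in the uniform random sampling step of MFAD."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))                 # shuffle the row indices
    return [data[chunk] for chunk in np.array_split(idx, s)]
```

Each returned subset is then handed to the FAD procedure independently, so the s runs could also be executed in parallel or sequentially under a fixed memory budget.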

Now we consider a point (x_1, x_2, ..., x_k) ∈ ∪_{l=1}^{s} D_l. In order to approximate the function H(x_1, x_2, ..., x_k), we calculate the weights as

w'_l(x_1, x_2, ..., x_k) = r'²_{cl} if (x_1, x_2, ..., x_k) ∈ D_l, and 0 otherwise, for l = 1, ..., s.    (15)

If the functional dependency is not found for any subset, then w'_l = 0 for each l = 1, ..., s. Otherwise, the approximated value of H(x_1, x_2, ..., x_k), which is also the estimated value of X_z, is given by

H^F(x_1, x_2, ..., x_k) = (Σ_{l=1}^{s} w'_l(x_1, x_2, ..., x_k) · H^F_{n_l}(x_1, x_2, ..., x_k)) / (Σ_{l=1}^{s} w'_l(x_1, x_2, ..., x_k))    (16)

To analyze the performance of the MFAD algorithm, we execute a set of experiments on a large dataset formed by the 402,678 census tracts of the Italian regions provided by the Italian National Statistical Institute (ISTAT) in 2011. Each census tract carries 140 numerical attributes, belonging to the following categories: Inhabitants; Foreigner and stateless inhabitants; Families; Buildings; Dwellings.

The FAD method is applied on the overall dataset; the MFAD method is applied by partitioning the dataset into s subsets. We perform the tests varying the value of the parameter s and setting the threshold α = 0.7. In addition, we compare the MFAD algorithm with the support vector regression (SVR) and multilayer perceptron (MLP) algorithms.

4 Experiments

Table 1 shows the 402,678 census tracts of Italy divided by region. Table 2 shows the approximate number of census tracts in each subset for each partition of the dataset into s subsets. In every experiment, we apply the MFAD algorithm to analyze the dependency of an output attribute X_z on a set of input attributes X_1, X_2, ..., X_r. In all the experiments, we set α = 0.7 and partition the dataset randomly into s subsets. We now show the results obtained in three experiments.

4.1 Experiment A

In this experiment, we explore the relation between the density of resident population holding a university (laurea) degree and the density of resident population employed. Generally speaking, a higher density of population with a laurea degree should correspond to a greater density of employed population. The attribute dependency explored is X_z = H(X_1), where:
- Input attribute: X_1 = Resident population with laurea degree
- Output attribute: X_z = Resident population over 15 employed

Fig. 3 Schema of the MFAD method

We apply the FAD algorithm on different random subsets of the dataset, and then we calculate the index of determinacy (14). In Table 3, we show the values of the index of determinacy r'²_{cl} obtained for different values of s. For s = 1, we have the overall dataset.

The results in Table 3 show that the dependency has been found. We obtain r'²_{cl} = 0.760 by using the FAD algorithm on the entire dataset, while the best value of r'²_{cl} reached by using MFAD is 0.758, for s = 16. Hence the smallest difference between the two algorithms is 0.002. Figure 4 shows on the abscissa the input X_1 and on the ordinate the output H^F(x_1) for s = 1, 10, 16, 40.

4.2 Experiment B

In this experiment, we explore the relation between the density of residents with job or capital income and the density of families in owned residences. We expect that the greater the density of residents with job or capital income, the greater the density of resident families in owned homes. The attribute dependency explored is X_z = H(X_1), where:
- Input attribute: X_1 = Resident population with job or capital income
- Output attribute: X_z = Families in owned residences

After some tests, we put α = 0.8. Table 4 shows r'²_{cl} obtained for different values of s: r'²_{cl} = 0.881 in the FAD algorithm on the entire dataset, and r'²_{cl} = 0.878 at best in MFAD, obtained for s = 13, 16. The smallest difference between the two indices of determinacy is 0.003. Figure 5 shows on the abscissa the input X_1 and on the ordinate the output H^F(x_1) for s = 1, 10, 16, 40.

4.3 Experiment C

In this experiment, the attribute dependency explored is X_z = H(X_1, X_2), where:
- Input attributes: X_1 = Density of residential buildings built with reinforced concrete; X_2 = Density of residential buildings built after 2005
- Output attribute:

X_z = Density of residential buildings in a good state of conservation

After some tests, we set α = 0.75 in this experiment. In Table 5, we show r'²_{cl} obtained for different values of s: r'²_{cl} = 0.785 in the FAD algorithm on the entire dataset, and r'²_{cl} = 0.781 in the MFAD algorithm, obtained for s = 13, 16. The smallest difference between the two indices of determinacy is 0.004.

Table 1 Number of census tracts for each Italian region (first rows)

ID region  Description            Number of census tracts
001        Piemonte               35,672
002        Valle d'Aosta          1902
003        Lombardia              53,173
004        Trentino Alto Adige    11,712
005        Veneto                 33,883
006        Friuli Venezia Giulia  8278
007        Liguria                11,054
008        Emilia Romagna         13,981
...

Table 2 Approximate number of census tracts in each subset, by varying s

s    Number of census tracts
8    5.0·10^4
9    4.5·10^4
10   4.0·10^4
11   3.7·10^4
13   3.1·10^4
16   2.5·10^4
20   2.0·10^4
26   1.5·10^4
40   1.0·10^4

Table 3 Index of determinacy for values of s in experiment A (s = 1: FAD on the entire dataset)

s    Index of determinacy
1    0.760
8    0.745
9    0.748
10   0.750
11   0.752
13   0.754
16   0.758
20   0.752
...

Fig. 4 Trend of H_z for dataset partitions in experiment A

Now we present the results obtained by considering all the experiments performed on the entire dataset in which the dependency was found (r'²_{cl} > 0.7). We consider the index of determinacy in the FAD algorithm (s = 1) and the minimum and maximum values of the index of determinacy obtained by using the MFAD algorithm for s = 9, 10, 11, 13, 16, 20, 26, 40. A functional dependency was found in 43 experiments. Figure 6 (resp., Fig. 7) shows the trend of the difference between the maximum (resp., minimum) value calculated for r'²_{cl} in MFAD and the value in FAD for the same experiment. On the abscissa we have r'²_{cl} in the FAD method, on the ordinate the difference between the two indices. For all the experiments, this difference is always below 0.005 (resp., 0.0015). These results show that the MFAD algorithm is comparable with the FAD algorithm, independently of the choice of the number of subsets partitioning the entire dataset (Fig. 7).

Figure 8 shows the mean CPU time gain obtained by the MFAD algorithm with different partitions, with respect to the CPU time obtained by using the FAD algorithm (s = 1). The CPU time gain is given by the difference between the CPU time measured for s = 1 and the CPU time measured for a partition into s subsets, divided by the CPU time measured for s = 1. The CPU time gain is
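The gain just defined reduces to a one-line formula; the timing values below are placeholders, not measurements from the paper.

```python
def cpu_time_gain(t_full, t_partitioned):
    """Relative CPU time gain of MFAD over FAD: (T(s=1) - T(s)) / T(s=1)."""
    return (t_full - t_partitioned) / t_full
```

For example, if a run on the whole dataset took 100 s and the partitioned run took 60 s, the gain would be 0.4; a positive gain means the partitioned run was faster.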

Table 4 Index of determinacy for values of s in experiment B (s = 1: FAD on the entire dataset)

s    Index of determinacy
1    0.881
8    0.872
...
26   0.875
40   0.872

Table 5 Index of determinacy for values of s in experiment C (s = 1: FAD on the entire dataset)

s    Index of determinacy
1    0.785
...
13   0.781
16   0.781
...
26   0.779
40   0.777

Fig. 5 Trend of H_z for dataset partitions in experiment B

Fig. 6 Trend of the difference between the maximum value of r'²_{cl} in MFAD and the value in FAD

always positive, and the greatest values are obtained for s = 16. These considerations allow applying the MFAD algorithm to a VL dataset not entirely loadable in memory, to which the FAD algorithm is not applicable.

Now we compare the results obtained by using the MFAD method with the ones obtained by applying the SVR and MLP algorithms. For the comparison tests, we used the machine learning tool Weka 3.8. In order to perform the tests with the SVR algorithm, we repeat each experiment using the following kernel functions: linear, polynomial, Pearson VII universal kernel, and radial basis function kernel, varying the complexity parameter C in a range between 0 and 10. To compare the performances of the SVR and MFAD algorithms, we measure and store the index of determinacy in every experiment. In Fig. 9 we show the trend of the difference between the maximum values of r'²_{cl} in SVR and MFAD: this difference is always under 0.02.

In the comparison tests performed with the MLP algorithm, we vary the learning rate and the momentum parameter in [0.1, 1]. We use a single hidden layer, varying the number of nodes between 2 and 8. Furthermore, we set the number of epochs to 500 and the percentage size of the validation set to 0. In Fig.
10 we show the trend of the difference between the maximum value of r'²_{cl} in MLP and MFAD: the difference between the maximum values of the index of determinacy in MLP and in MFAD is under 0.016.

These results show that the MFAD algorithm for attribute dependency in massive datasets has performances comparable with those of the SVR and MLP nonlinear regression algorithms. Moreover, it has the advantage of a smaller number of parameters compared to the other two algorithms; therefore, it has greater usability and can be more easily integrated into expert systems and intelligent systems for the analysis of dependencies between attributes in massive datasets. Indeed, the only two parameters needed for the execution of the MFAD algorithm are the number of subsets and the threshold value of the index of determinacy.

Fig. 7 Trend of the difference between the minimum value of r'²_{cl} in MFAD and the value in FAD

Fig. 8 Trend of CPU time gain with respect to the FAD method (s = 1)

Fig. 9 Trend of the difference between the maximum value of r'²_{cl} obtained in SVR and MFAD

Fig. 10 Trend of the difference between the maximum value of r'²_{cl} in MLP and MFAD

5 Conclusions

The FAD method presented in Martino et al. (2010a) can be used as a regression model for finding attribute dependencies in datasets: the inverse multi-dimensional F-transform can approximate the regression function. However, this method can be expensive for massive datasets, and it is not applicable to VL datasets that cannot be loaded in memory. We therefore propose a variation of the FAD method for massive datasets, called MFAD: the dataset is partitioned into s equally sized subsets, the FAD method is applied to each subset by calculating the inverse F-transform, and the function H is then approximated by a weighted mean in which the weights are given by the index of determinacy assigned to each subset. For testing the performance of the MFAD method, we ran comparison tests with respect to the FAD method on an L dataset of ISTAT 2011 census data. The results show that the performances obtained by MFAD are well comparable with those of FAD. The comparison tests also show that the MFAD algorithm has performances comparable with the SVR and MLP algorithms; moreover, it has greater usability due to the lower number of parameters to be selected.

These results allow us to conclude that MFAD provides acceptable performance in the detection of attribute dependencies in the presence of massive datasets. Therefore, unlike FAD, MFAD can be applied to massive data and can represent a trade-off between usability and high performance in detecting attribute dependencies in massive datasets.

The critical point of the algorithm is the choice of the number of subsets and of the threshold value of the index of determinacy. Further studies on massive datasets are necessary to analyze whether the optimal values of these two parameters depend on the type of dataset analyzed. Furthermore, we intend to experiment with the MFAD algorithm in robust frameworks such as expert systems and decision support systems.

Author contributions All authors contributed to the study conception and design. All authors contributed to material preparation, data collection and analysis. All authors wrote the first draft of the manuscript and commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Attribute dependency data analysis for massive datasets by fuzzy transformsFunding Open access funding provided by Università degli Studi diNapoli Federico II within the CRUI-CARE Agreement. This researchreceived no external funding.DeclarationsConflict of interest The authors declare no conflict of interest.Ethical approval This research does not contain any studies involvinghuman participants performed by any of the authors.Informed consent Informed consent was obtained from all individualparticipants included in the study.Open Access This article is licensed under a Creative CommonsAttribution 4.0 International License, which permits use, sharing,ada
