Efficient Shift-Invariant Dictionary Learning

Transcription

Efficient Shift-Invariant Dictionary Learning

Guoqing Zheng, Yiming Yang, Jaime Carbonell
School of Computer Science, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213, USA
gzheng@cs.cmu.edu, yiming@cs.cmu.edu, jgc@cs.cmu.edu

KDD '16, August 13-17, 2016, San Francisco, CA, USA. ISBN 978-1-4503-4232-2/16/08. DOI: http://dx.doi.org/10.1145/2939672.2939824

ABSTRACT
Shift-invariant dictionary learning (SIDL) refers to the problem of discovering a set of latent basis vectors (the dictionary) that captures informative local patterns at different locations of the input sequences, and a sparse coding for each sequence as a linear combination of the latent basis elements. It differs from conventional dictionary learning and sparse coding, where the latent basis has the same dimension as the input vectors and the focus is on global patterns instead of shift-invariant local patterns. Unsupervised discovery of a shift-invariant dictionary and the corresponding sparse coding has been an open challenge, as the number of candidate local patterns is extremely large, and the number of possible linear combinations of such local patterns is even more so. In this paper we propose a new framework for unsupervised discovery of both the shift-invariant basis and the sparse coding of input data, with efficient algorithms for tractable optimization. Empirical evaluations on multiple time series data sets demonstrate the effectiveness and efficiency of the proposed method.

CCS Concepts
Computing methodologies -> Factor analysis; Learning latent representations

Keywords
dictionary learning; sparse coding; time series

1. INTRODUCTION
Sparse representation models, such as those in dictionary learning, have been proposed for obtaining both a succinct set of vectors as the new basis (also called the dictionary) from unlabeled input sequences, and for representing each sequence as a sparse linear combination of the basis elements, i.e., the sparse coding of the data.
Successful applications include those in dimensionality reduction [6], image restoration [13, 2, 18], signal compression [15, 17] and compressed sensing [4, 9].

Conventional dictionary learning models assume that the induced basis has the same dimensionality as the input data, so the basis elements (vectors) and the input sequences (also vectors) are strictly aligned coordinate to coordinate. This simplistic assumption enables efficient algorithms for finding the optimal basis from input data; however, it also significantly limits the scope of potential applications of those models. Consider the three-time-series example from [24], reproduced in Figure 1, for instance. Time series No. 1 and No. 3 share the bumps (plotted in red), which are similar in shape but appear at different locations. We would consider this pair (1 and 3) more similar than the other pairs, but such similarity obviously cannot be captured using a global, non-shift-invariant pattern (corresponding to a basis element) spanning the full time series. On the other hand, such similarity could be captured if we allow the basis elements to have a shorter span and to occur at different locations in the time series. Such a desirable kind of basis is called a shift-invariant basis, and finding such a basis from data is referred to as solving the shift-invariant dictionary learning (SIDL) problem.

[Figure 1: Importance of local patterns (plotted in red) versus that of the entire time series.]

As a challenging problem with a potentially broad range of applications, SIDL has received increasing attention in recent machine learning research [8, 23, 14, 24]. Existing work and its strengths and limitations can be grouped and outlined as follows. First, methods such as [24], which focus on exhaustive or heuristic search over all possible local pattern candidates, are either highly inefficient or sub-optimal, or both. Second, methods such as [7], which rely on the availability of sufficient labeled data in order to learn discriminant local patterns, are not applicable when labeled data are lacking or extremely sparse. Third, methods such as [8], which use convolution to model invariant shifts of local patterns, have the drawback of not offering any method for sparse coding of the input data. In other words, those methods are capable of learning shift-invariant local patterns as a new basis, but lack the ability to combine the shift-invariant patterns into a sparse coding of the original data.

Recall that sparse coding is highly desirable for scalable data analysis tasks (including classification and clustering of time series), as well as for the interpretability of the analysis and the robustness of system predictions.

In this paper, we propose a new approach to SIDL which combines the strengths and addresses the aforementioned limitations of the previous methods. Specifically, our model allows the basis elements to be much shorter than the length of the input series, to occur at different locations in the input series, and to be optimally combined into a sparse coding of the observed data, thus representing each input series as a sparse linear combination of the short basis elements. More importantly, both the shift-invariant basis and the sparse coding are jointly optimized in an unsupervised learning framework, with new algorithms for highly efficient computation. Empirical results on multiple time series data sets demonstrate the effectiveness and efficiency of the proposed approach.

2. PRELIMINARIES
Suppose we have a set of n input data points of dimension p, {x_i in R^p}_{i=1}^n. We want to learn a set of K basis elements, i.e., a dictionary D = [d_1, ..., d_K] in R^{q x K}, and a sparse coding alpha_i in R^K for each input data point x_i. In classic sparse coding and dictionary learning, the dimension of the basis is the same as that of the data point, i.e., q = p.

2.1 Sparse Coding
Given a fixed dictionary D, sparse coding aims to find, for an input data point, a sparse linear combination of the basis vectors as its representation. Formally, sparse coding with l1 regularization amounts to solving

    \arg\min_{\alpha \in \mathbb{R}^K} \; \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1    (1)

where lambda controls the balance between reconstruction error and coding sparsity. It is well known that the l1 penalty yields a sparse solution, and the above LASSO problem can be efficiently solved with coordinate descent [19, 20]. Other sparsity penalties such as l0 regularization can be used as well; however, sparse coding with an l0 penalty is often intractable. In this paper, we focus on the l1-regularized setting.
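To make Eq. (1) concrete, here is a minimal coordinate-descent sketch for l1 sparse coding with a fixed dictionary (standard LASSO updates; the function names and the stopping tolerance are our own choices, not from the paper, and the dictionary columns are assumed to be non-zero):

    import numpy as np

    def soft_threshold(z, a):
        # Shrinkage operator S_a(z): shrink z toward zero by a, clamping at zero.
        return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

    def sparse_code(x, D, lam, n_iters=100, tol=1e-6):
        """Coordinate descent for (1/2)||x - D a||^2 + lam * ||a||_1 with D fixed."""
        p, K = D.shape
        alpha = np.zeros(K)
        col_norm2 = (D ** 2).sum(axis=0)              # ||d_k||^2 for each column
        for _ in range(n_iters):
            alpha_old = alpha.copy()
            for k in range(K):
                # Residual with the k-th contribution removed.
                r = x - D @ alpha + alpha[k] * D[:, k]
                alpha[k] = soft_threshold(D[:, k] @ r, lam) / col_norm2[k]
            if np.max(np.abs(alpha - alpha_old)) < tol:
                break
        return alpha

Each inner update is exactly the scalar soft-thresholding step that Section 4.1 later generalizes to shifted basis elements.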
2.2 Dictionary Learning
When we lack a pre-existing dictionary, or we wish the dictionary to directly reflect the properties of the data, dictionary learning can induce both the sparse codings of the input data points and the dictionary itself by solving the following optimization problem:

    \arg\min_{D \in \mathbb{R}^{p \times K},\ \alpha_i \in \mathbb{R}^K} \; \frac{1}{2}\sum_{i=1}^{n}\|x_i - D\alpha_i\|_2^2 + \lambda\sum_{i=1}^{n}\|\alpha_i\|_1
    \quad \text{s.t. } \|d_k\|^2 \le c, \ k = 1, \dots, K    (2)

where c is a constant and the norm constraint on d_k is necessary to avoid degenerate solutions. The above problem is not convex in (D, alpha) jointly, but it is convex if either variable is fixed. Hence alternating optimization between the two sets of variables is a common approach to dictionary learning. When D is fixed, the problem is essentially sparse coding as described in Section 2.1. When all the alpha_i are fixed, the problem is a quadratic program with quadratic constraints (QCQP), and there exist many algorithms to solve it efficiently.
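As an illustration of this alternating scheme (not the paper's exact procedure), a minimal dictionary-learning loop can reuse the sparse_code sketch above and replace the QCQP dictionary step with a simple least-squares update projected back onto the norm ball; the projection is an assumption of this sketch rather than the authors' solver:

    import numpy as np

    def learn_dictionary(X, K, lam, c=1.0, n_outer=50):
        """Classic DL sketch: X is (n, p); returns D (p, K) and codes A (n, K)."""
        n, p = X.shape
        rng = np.random.default_rng(0)
        D = rng.standard_normal((p, K))
        D /= np.linalg.norm(D, axis=0, keepdims=True)       # start on the norm ball
        A = np.zeros((n, K))
        for _ in range(n_outer):
            # Sparse coding step: solve Eq. (1) for each x_i with D fixed.
            for i in range(n):
                A[i] = sparse_code(X[i], D, lam)
            # Dictionary step: unconstrained least squares for D, then project
            # each column back onto the constraint ||d_k||^2 <= c.
            D = np.linalg.lstsq(A, X, rcond=None)[0].T       # shape (p, K)
            norms = np.linalg.norm(D, axis=0)
            D *= np.minimum(1.0, np.sqrt(c) / np.maximum(norms, 1e-12))
        return D, A

Alternating the two subproblems is the same pattern SIDL follows in Section 4; with exact constrained updates, each step cannot increase the joint objective.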

3. SHIFT-INVARIANT DICTIONARY LEARNING
In this section, we present our shift-invariant dictionary learning (SIDL) model, which captures the locality of representative patterns while preserving a sparse representation of the data points.

Unlike classic dictionary learning, which enforces the same dimensionality for the basis and the data point (q = p), a shift-invariant dictionary relaxes this constraint to q <= p, allowing the basis to slide along the support of the data point. For real problems we may often set q to be much smaller than p; we emphasize that classic DL is the special case of SIDL with q = p.

For a data point x_i in R^p and a shift-invariant basis element d_k in R^q, we introduce a variable t_ik to denote the location[1] where d_k is matched to x_i, with t_ik = 0 indicating that d_k is aligned to the beginning of x_i and t_ik = p - q indicating the largest shift with which d_k can be aligned to x_i without running beyond the boundary of x_i, as shown in Figure 2. Hence the possible values of t_ik are all integers[2] in [0, p - q].

[Figure 2: The same basis with different shifts. (a) Basis d is aligned to the beginning of x, i.e. the shift variable t = 0. (b) Basis d is aligned to the end of x, i.e. the shift variable t = p - q.]

[1] The idea of a shift-invariant basis can also be applied to data with more than one "direction", such as images. For images, the basis becomes a rectangular patch and the shift is represented by two variables, one per direction; the idea proposed in this paper extends easily to this case. For brevity, we focus on data with only one "direction" and leave the extensions as future work.
[2] As we only consider "discretized" data points, the shift can only take integer values.

Obviously, shifting a basis element d by an amount t is equivalent to defining a new vector

    T_p(d, t) \triangleq v \in \mathbb{R}^p    (3)

where

    v_i = \begin{cases} d_{i-t} & \text{if } 1 + t \le i \le t + q \\ 0 & \text{otherwise} \end{cases}    (4)

Given a set of input data points {x_i}_{i=1}^n, shift-invariant dictionary learning aims to learn the dictionary D = [d_1, ..., d_K], the sparse coefficients {alpha_ik} and the corresponding shifts {t_ik} entirely from the data in an unsupervised way:

    \text{SIDL:}\quad \arg\min_{D \in \mathbb{R}^{q \times K},\ \alpha_i \in \mathbb{R}^K,\ t_{ik} \in [0, p-q]} \; \frac{1}{2}\sum_{i=1}^{n}\Big\|x_i - \sum_{k=1}^{K}\alpha_{ik}\,T(d_k, t_{ik})\Big\|^2 + \lambda\sum_{i=1}^{n}\|\alpha_i\|_1
    \quad \text{s.t. } \|d_k\|^2 \le c, \ k = 1, \dots, K    (5)

4. MODEL LEARNING FOR SIDL
In this section, we present an efficient algorithm to solve SIDL. Similar to classic DL, the SIDL problem is non-convex; hence, we employ an alternating optimization scheme.

4.1 Shift-invariant sparse coding
With the dictionary D fixed, Problem (5) reduces to solving for the basis matching locations t_ik and the coefficients alpha. Also note that with D fixed, Problem (5) decomposes into learning the coefficients and the basis shifts for every x_i independently. Hence in this subsection we drop the subscript i and simply use x to denote any single input data point, for clarity.

To learn alpha and {t_k} for input x, we adopt coordinate descent to estimate the coefficients alpha and a greedy approach to estimate {t_k}. Minimizing over alpha_k and t_k with {alpha_j}_{j != k} and {t_j}_{j != k} fixed:

    \alpha_k^*, t_k^* = \arg\min_{\alpha_k, t_k} \; \frac{1}{2}\Big\|\alpha_k T(d_k, t_k) + \sum_{j \ne k}\alpha_j T(d_j, t_j) - x\Big\|^2 + \lambda|\alpha_k|    (6)

For this one-dimensional optimization problem, the solution for alpha_k (as a function of the optimal basis shift t_k^*) is

    \alpha_k^* = \frac{S_\lambda\big(T(d_k, t_k^*)^\top \hat{x}\big)}{T(d_k, t_k^*)^\top T(d_k, t_k^*)} = \frac{S_\lambda\big(T(d_k, t_k^*)^\top \hat{x}\big)}{\|d_k\|^2}    (7)

where \hat{x} \triangleq x - \sum_{j \ne k} \alpha_j T(d_j, t_j) is the residue of fitting x with all basis elements except d_k, and S(.) is the shrinkage operator defined as

    S_a(z) = \begin{cases} z - a & \text{if } z \ge a \\ 0 & \text{if } z \in (-a, a) \\ z + a & \text{if } z \le -a \end{cases}    (8)

The above solution for alpha_k depends on the shift t_k of the k-th basis element, and since t_k can only take integer values in [0, p - q], a naive approach is to enumerate all possible t_k, compute the corresponding alpha_k, and pick the pair (t_k^*, alpha_k^*) that yields the minimum objective in (6). However, by plugging Eq. (7) back into Problem (6), a simple manipulation leads to the following proposition, which is much more efficient to compute.

Proposition 4.1. The optimal solution of Problem (6) is

    t_k^* = \arg\max_{t_k} \; T(d_k, t_k)^\top \hat{x}    (9)

and

    \alpha_k^* = \begin{cases} \dfrac{S_\lambda\big(T(d_k, t_k^*)^\top \hat{x}\big)}{\|d_k\|^2} & \text{if } T(d_k, t_k^*)^\top \hat{x} > \lambda \\ 0 & \text{otherwise} \end{cases}    (10)

Proof. See Appendix A.

Proposition 4.1 says that we only need to compute dot products between the basis element and segments of the input data point; all updates are exact and involve no parameter tuning. Algorithm 1 outlines the resulting procedure for shift-invariant sparse coding.

Algorithm 1: Shift-invariant sparse coding
  Input: data point x in R^p, dictionary D = [d_1, ..., d_K]
  Output: sparse coding alpha*, matching offsets {t_k^*}_{k=1}^K
  Initialize alpha randomly
  repeat
    for k = 1 to K do
      x_hat <- x - sum_{j != k} alpha_j T(d_j, t_j)
      t_k <- Eq. (9)
      alpha_k <- Eq. (10)
    end for
  until convergence
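The inner loop of Algorithm 1 is easy to express directly. Below is a minimal sketch (our own illustration, with hypothetical helper names) of the shift operator T(d, t) of Eqs. (3)-(4) and of one (t_k, alpha_k) update from Proposition 4.1; since T(d, t) is just a zero-padded copy of d, the dot products of Eq. (9) are taken against length-q slices of the residue instead of materializing padded vectors.

    import numpy as np

    def shift(d, t, p):
        # T_p(d, t): place d at integer offset t inside a length-p zero vector (Eqs. 3-4).
        q = len(d)
        v = np.zeros(p)
        v[t:t + q] = d
        return v

    def update_one_basis(x_hat, d, lam):
        """One (t_k, alpha_k) update from Proposition 4.1, given the residue x_hat."""
        p, q = len(x_hat), len(d)
        # Correlation of d with every length-q window of the residue (Eq. 9).
        scores = np.array([d @ x_hat[t:t + q] for t in range(p - q + 1)])
        t_star = int(np.argmax(scores))
        best = scores[t_star]
        # Eq. (10): keep the basis element only if its best correlation exceeds lambda.
        alpha_star = (best - lam) / (d @ d) if best > lam else 0.0
        return t_star, alpha_star

Sweeping k = 1, ..., K with the residue recomputed before each update, as in Algorithm 1, completes one pass of shift-invariant sparse coding.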
4.2 Shift-invariant dictionary update
When the coefficients alpha and the basis shifts {t_ik} are fixed, updating the dictionary requires solving the following optimization problem:

    \arg\min_{d_1, \dots, d_K \in \mathbb{R}^q} \; \frac{1}{2}\sum_{i=1}^{n}\Big\|x_i - \sum_{k=1}^{K}\alpha_{ik}\,T(d_k, t_{ik})\Big\|^2
    \quad \text{s.t. } \|d_k\|^2 \le c, \ k \in [1, K]    (11)

Coordinate descent is used to solve for the d_k as well; minimizing over d_k with {d_j}_{j != k} fixed gives

    \arg\min_{d_k \in \mathbb{R}^q} \; \frac{1}{2}\sum_{i=1}^{n}\Big\|x_i - \alpha_{ik}\,T(d_k, t_{ik}) - \sum_{j \ne k}\alpha_{ij}\,T(d_j, t_{ij})\Big\|^2
    \quad \text{s.t. } \|d_k\|^2 \le c    (12)

This is a least-squares problem with a quadratic constraint, which we can solve via its dual problem. The Lagrangian is

    L(d_k, u) = \frac{1}{2}\sum_{i=1}^{n}\|\hat{x}_i - \alpha_{ik}\,T(d_k, t_{ik})\|^2 + u\,(\|d_k\|^2 - c)    (13)

where \hat{x}_i \triangleq x_i - \sum_{j \ne k}\alpha_{ij}\,T(d_j, t_{ij}) is the residue for x_i and u >= 0 is the Lagrange multiplier. Minimizing L(d_k, u) over d_k, we get an analytic form for d_k^*:

    d_k^* = \frac{\sum_{i=1}^{n}\alpha_{ik}\,\hat{x}_i^{[1+t_{ik},\, q+t_{ik}]}}{2u + \sum_{i=1}^{n}\alpha_{ik}^2}    (14)

where \hat{x}_i^{[1+t_{ik},\, q+t_{ik}]} denotes the slice of \hat{x}_i from index 1 + t_ik through q + t_ik. Eq. (14) shows that the optimal d_k is a weighted average of the corresponding matching segments of all residues {\hat{x}_i}.
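For completeness, the stationarity step behind Eq. (14), which the transcription does not spell out, is a one-line gradient computation (our own sketch; only the length-q window of \hat{x}_i touched by T(d_k, t_ik) contributes to the gradient, which is why the slice appears):

    \frac{\partial L}{\partial d_k}
      = -\sum_{i=1}^{n} \alpha_{ik}\left(\hat{x}_i^{[1+t_{ik},\, q+t_{ik}]} - \alpha_{ik}\, d_k\right) + 2u\, d_k = 0
    \quad\Longrightarrow\quad
    \left(2u + \sum_{i=1}^{n} \alpha_{ik}^2\right) d_k = \sum_{i=1}^{n} \alpha_{ik}\, \hat{x}_i^{[1+t_{ik},\, q+t_{ik}]}

which rearranges to Eq. (14).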

Plugging this back into the Lagrangian and maximizing the dual problem, we finally get

    d_k^* = \begin{cases} \dfrac{\sqrt{c}}{\|\tilde{x}\|}\,\tilde{x} & \text{if } \|\tilde{x}\| \ge \sqrt{c}\,\sum_{i=1}^{n}\alpha_{ik}^2 \\ \dfrac{\tilde{x}}{\sum_{i=1}^{n}\alpha_{ik}^2} & \text{otherwise} \end{cases}    (15)

where \tilde{x} \triangleq \sum_{i=1}^{n}\alpha_{ik}\,\hat{x}_i^{[1+t_{ik},\, q+t_{ik}]}. Refer to Appendix B for a detailed proof of Eq. (15). In words, the unconstrained least-squares solution is used whenever it already satisfies the norm constraint, and otherwise \tilde{x} is rescaled onto the boundary \|d_k\|^2 = c. Algorithm 2 outlines the learning procedure for the shift-invariant dictionary update.

Algorithm 2: Shift-invariant dictionary update
  Input: data points {x_i}_{i=1}^n, sparse codings alpha, matching offsets {t_ik}
  Output: dictionary D = [d_1, ..., d_K]
  Initialize D randomly
  repeat
    for k = 1 to K do
      x_tilde <- sum_{i=1}^n alpha_ik x_hat_i^[1+t_ik, q+t_ik]
      d_k <- Eq. (15)
    end for
  until convergence
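A minimal sketch of the inner update of Algorithm 2 in the notation above (our own illustration; x_hats is assumed to hold the residues x_hat_i computed with the current codes and shifts for the basis element being updated):

    import numpy as np

    def update_basis(x_hats, alphas_k, shifts_k, q, c):
        """Closed-form update of one basis element d_k via Eqs. (14)-(15).

        x_hats   : (n, p) residues x_hat_i for the current k
        alphas_k : (n,) coefficients alpha_ik
        shifts_k : (n,) integer offsets t_ik
        """
        x_tilde = np.zeros(q)
        for x_hat, a, t in zip(x_hats, alphas_k, shifts_k):
            x_tilde += a * x_hat[t:t + q]            # weighted matching segments
        denom = float(np.sum(alphas_k ** 2))
        if denom == 0.0:
            return x_tilde                           # degenerate case: basis k unused in this pass
        norm = np.linalg.norm(x_tilde)
        if norm >= np.sqrt(c) * denom:               # constraint active: rescale onto the ball
            return (np.sqrt(c) / norm) * x_tilde
        return x_tilde / denom                       # unconstrained least-squares solution

Looping this update over k = 1, ..., K gives one dictionary pass; Algorithm 3 below then simply alternates Algorithm 1 and Algorithm 2.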
4.3 Complete algorithm
Given unlabeled input data points, SIDL alternates between optimizing the sparse codings and updating the dictionary. Algorithm 3 outlines the complete learning procedure for shift-invariant dictionary learning.

Algorithm 3: SIDL
  Input: data points {x_i}_{i=1}^n, desired basis dimension q, desired number of basis elements K
  Output: dictionary D = [d_1, ..., d_K], sparse codings {alpha_i}_{i=1}^n, basis shift locations {t_ik}
  Initialize D, {alpha_i}_{i=1}^n and {t_ik} randomly
  repeat
    Sparse coding: call Algorithm 1
    Dictionary update: call Algorithm 2
  until convergence

4.3.1 Algorithm analysis
The time complexity of SIDL consists of two parts, one for shift-invariant sparse coding and one for the shift-invariant dictionary update. For shift-invariant sparse coding (Algorithm 1), the cost of one outer iteration, i.e., finding the optimal shift locations and coefficients for all K basis elements, is O(Kq(p - q + 1)), where O(q) comes from computing the inner product of two vectors of length q and p - q + 1 is the number of candidate integer shifts for t_k by Proposition 4.1. Hence the total complexity of one call to Algorithm 1 is O(M_1 Kq(p - q)), where M_1 is the maximum number of iterations allowed in Algorithm 1.

The core of the shift-invariant dictionary update (Algorithm 2) is the computation of d_k, which takes O(nq) operations to form x_tilde and scale it according to Eq. (15). Thus the total complexity of one call to Algorithm 2 is O(M_2 Knq), where M_2 is the maximum number of iterations allowed in Algorithm 2.

In fact, we do not require Algorithms 1 and 2 to converge in every call from Algorithm 3: as long as the objective decreases in both the shift-invariant sparse coding and the shift-invariant dictionary update, the overall algorithm is still guaranteed to converge. Hence M_1 and M_2 can be set quite small; 10 to 20 iterations in both Algorithm 1 and Algorithm 2 suffice. Therefore, the total time complexity of SIDL is O(M Kq(n + p - q)), where M is the total number of iterations allowed or needed for Algorithm 3 to converge.

4.3.2 Comparison to classical dictionary learning
As mentioned, classical dictionary learning is the special case of the proposed SIDL framework with q = p, hence the time complexity of classical dictionary learning is O(M Kpn). Comparing the complexity of the proposed SIDL framework to that of classical DL,

    M K q (n + p - q) \le M K n p    (16)

holds for any q <= min(n, p), hence in terms of asymptotic time complexity the proposed SIDL is more efficient than standard classical DL. If q is close to min(n, p), then SIDL has about the same computational complexity. Empirical timing of SIDL can also be found in Section 5.3.
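As a quick sanity check of Eq. (16) (our own arithmetic, not from the paper), the ratio of the two per-sweep costs is

    \frac{K q (n + p - q)}{K n p} = \frac{q}{p} + \frac{q}{n}\left(1 - \frac{q}{p}\right),
    \qquad \text{e.g. } q = \tfrac{p}{10} \;\Rightarrow\; \frac{1}{10} + \frac{9}{100}\,\frac{p}{n},

so when the series length p is comparable to or smaller than the number of series n (equivalently, q <= n), SIDL does roughly an order of magnitude less work per sweep than classical DL at q = 0.1p.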

5. EXPERIMENTS
In this section, we evaluate the proposed Shift-Invariant Dictionary Learning (SIDL) from two aspects. One is to investigate its performance as a data reconstruction method, reconstructing unseen data with the dictionary learned from training data; the other is to evaluate the quality of the learned sparse representations as features for downstream tasks, specifically classification.

5.1 Data sets
Before presenting our experimental results, we first list the data sets on which SIDL is evaluated. We use several real-world time-series data sets from the UCR time series archive[3] [5], listed below with brief descriptions.

- Trace: Synthetic data set designed to simulate failures in nuclear power plants;
- Synthetic Control: Synthetically generated control charts produced by the process in [1];
- ECGFiveDays: Measurements of the abnormality of heartbeats recorded by an electrode;
- GunPoint: Video tracking data of a human test subject's right hand, recording whether or not he/she is pulling out a gun;
- MedicalImages: Histograms of pixel intensity of medical images of different human body regions;
- ChlorineConcentration: Chlorine concentration in drinking water versus time, collected from eight households.

[3] http://www.cs.ucr.edu/~eamonn/time_series_data/

Table 1 lists related statistics about the data sets mentioned above.

[Table 1: Dataset information. The per-data-set statistics (such as number of classes, training/test sizes and series lengths) are garbled in this transcription and are not reproduced here.]

5.2 SIDL on data reconstruction
We first investigate how SIDL can be used for data reconstruction by learning the shift-invariant basis on the training set and using the learned dictionary to encode the signals in the test set. We report the reconstruction error on all data sets, as well as how the performance changes with the sparsity regularization parameter lambda, the basis vector length q and the dictionary size K.

We train SIDL with different parameters on the training portion of each data set and use the learned dictionary to reconstruct the test data. Specifically, on the training data we use SIDL to learn the shift-invariant dictionary D; then, given the test data points, we solve for their sparse codings with D fixed. The mean squared error (MSE) of reconstruction is measured as

    \mathrm{MSE} = \frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}\Big\|x_i - \sum_{k=1}^{K}\alpha_{ik}^*\,T(d_k, t_{ik}^*)\Big\|^2    (17)

where alpha*_ik and t*_ik are obtained by solving the shift-invariant sparse coding problem of Section 4.1 with the d_k's trained on the training data points.

Figure 3 presents the complete reconstruction performance of SIDL on all time series data sets under different parameter configurations. It is worth noting that when q/p equals 1, our proposed model reduces to classical dictionary learning. The dictionary sizes in Figure 3 are all set to K = 50; we examined other values of K and observed similar patterns, so we omit the detailed graphs.

On five of the six data sets, SIDL achieves significantly lower reconstruction error than classical dictionary learning (i.e., q/p = 1), given the same number of basis elements and the same sparsity regularization strength. This suggests not only that SIDL produces a dictionary that generalizes better to unseen data, but also that it yields a smaller dictionary (since our basis elements are shorter than those of classical dictionary learning). The results in Figure 3 demonstrate that SIDL is an effective alternative to classical dictionary learning for producing unsupervised sparse representations.

The above results show the reconstruction performance of SIDL over a grid search of the parameters. In practice, the optimal choice of q/p and the sparsity regularization strength lambda can be obtained via cross-validation on splits of the training set, as we show in the evaluation of SIDL codings as classification features in Section 5.4.

5.3 Computational efficiency of SIDL
We plot the running time of SIDL with different parameter settings in Figure 4. As discussed in Section 4.3.2, when the basis length is much smaller than the length of the input signal, SIDL is much faster than classical DL, e.g., for q = 0.1p and q = 0.25p in Figure 4. When q gets closer to p, as we can see from Figure 4, the running time of SIDL approaches that of DL, especially for larger K.

We also plot how the training objectives of SIDL and DL decrease over iterations. We set the convergence threshold, defined as the relative change of the objective, to 10^-5, i.e., convergence is declared when

    \frac{|F^{(i+1)} - F^{(i)}|}{F^{(i)}} \le 10^{-5}    (18)

where F^(i) is the training objective after the i-th iteration.
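Both Eq. (17) and Eq. (18) are direct to compute from the outputs of Algorithms 1-3; a brief sketch under the array conventions used in the earlier snippets (shapes and names are our own):

    import numpy as np

    def reconstruction_mse(X_test, D, alphas, shifts):
        """Held-out reconstruction error of Eq. (17).

        X_test : (n_test, p) series;  D : (q, K) dictionary;
        alphas, shifts : (n_test, K) codes and integer offsets from Algorithm 1.
        """
        n_test, p = X_test.shape
        q, K = D.shape
        total = 0.0
        for x, a, t in zip(X_test, alphas, shifts):
            recon = np.zeros(p)
            for k in range(K):
                if a[k] != 0.0:
                    recon[t[k]:t[k] + q] += a[k] * D[:, k]   # alpha_k * T(d_k, t_k)
            total += np.sum((x - recon) ** 2)
        return total / n_test

    def converged(f_prev, f_curr, tol=1e-5):
        # Relative-change stopping rule of Eq. (18).
        return abs(f_curr - f_prev) / f_prev <= tol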
When lambda = 1, all the SIDL runs converge in fewer iterations than DL, except for q = 0.75p, which takes about 200 more iterations to converge. When lambda = 10, which encourages sparser solutions, all the SIDL runs converge in fewer iterations than DL. This implies that the proposed SIDL method can be at least as efficient as classical DL, with the added flexibility to model shift-invariant basis elements present in the data.

It is also worth emphasizing that, although Figure 4 shows the minimum training loss reached by DL to be slightly better than those of the SIDL runs, this does not mean that the basis found by DL is better than that found by SIDL. In fact, as shown in the data reconstruction experiments and in the classification tasks below, the basis found by SIDL is actually better than that found by DL.

5.4 Sparse coding from SIDL as features for classification
In this section, we evaluate the encodings output by SIDL as feature representations for time series classification. For all data sets, we train the dictionary on the training set and apply it to the test data points to obtain their encodings. The dictionary size K, the basis element length q and the sparsity regularization parameter lambda are chosen via 3-fold cross-validation on the training set. We compare classification results using the sparse codings output by SIDL versus using the raw input as features. Below is the list of classification algorithms used in the evaluation; a pipeline sketch follows the list.

- 1-Nearest Neighbor with Euclidean distance (1NN-Euclidean). It is widely accepted in the time series mining community that 1NN with Euclidean distance [22] is a strong though naive baseline. We run this algorithm on the raw representation of the time series.
- 1-Nearest Neighbor with Dynamic Time Warping (1NN-DTW). This is by far a strong benchmark method for time series classification. The distance between two time series is computed by performing DTW on the pair.
- Support Vector Machine with raw representation (SVM-Raw). This method trains an SVM classifier on the raw representation of the time series, without any transformation or processing of the input signals. We use the libsvm package to implement this method [3, 21].
- Support Vector Machine with classical dictionary learning features (SVM-DL). This method trains an SVM classifier on the encodings from classical dictionary learning and then makes predictions on the encodings of the test data.
- Support Vector Machine with SIDL (SVM-SIDL). We train an SVM on the shift-invariant sparse codings given by SIDL, and report the performance on the encodings of the test data points obtained with the learned shift-invariant dictionary.
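The SVM-SIDL pipeline referenced above can be sketched as follows (scikit-learn's LinearSVC is used here for brevity, whereas the paper reports using the libsvm package; encode is a hypothetical wrapper that runs Algorithm 1 on each series with the learned dictionary):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    def svm_sidl(X_train, y_train, X_test, D, encode):
        """Train a linear SVM on SIDL codes and predict on encoded test series."""
        A_train = encode(X_train, D)                  # (n_train, K) sparse codes
        A_test = encode(X_test, D)                    # (n_test, K) sparse codes
        # C chosen by 3-fold cross-validation, mirroring the grid reported with Table 2.
        grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
        clf = GridSearchCV(LinearSVC(), grid, cv=3).fit(A_train, y_train)
        return clf.predict(A_test)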

[Figure 3: Reconstruction performance (MSE) of SIDL with respect to the sparsity regularization parameter lambda (0.1, 1.0, 10.0) and the relative basis length q/p, with a fixed dictionary size of K = 50, on Trace, Synthetic Control, ECGFiveDays, GunPoint, MedicalImages and ChlorineConcentration. Note that q/p = 1 denotes the classical dictionary learning setting.]

[Figure 4: Running time comparisons on the Trace data set for SIDL (q = 0.10p, 0.25p, 0.50p, 0.75p) and DL. (a) Time taken to converge (convergence threshold 10^-5) for different dictionary sizes K, with sparsity regularization parameter lambda = 1. (b) Average training loss versus iteration with lambda = 1. (c) Average training loss versus iteration with lambda = 10.]

Table 2 presents the classification accuracies of the various classification methods on all data sets, in comparison with the baseline methods. One interesting observation is that, without sparse encodings (such as those of DL and SIDL), classification with the raw time series representation is always worse than the naive 1NN method with Euclidean distance, except on the ECGFiveDays data set. This suggests that for data with temporal dependencies, such as time series, using the raw representation as SVM features is not a good idea. Sparse codings of the sequential data, in contrast, do help improve classification compared to SVM-Raw.

SVM-SIDL achieves either the best or the second-best result on five of the six data sets, demonstrating the benefit of modeling shift-invariant local patterns in the data: although classical dictionary learning alleviates the problem suffered by SVM-Raw, it still lacks the flexibility to model local patterns. This further validates our motivation that a method modeling local patterns in sequential data will better represent and explain the data.

As also validated by previous literature, 1NN-DTW is indeed a strong baseline for time series classification, achieving the best accuracy three times and the second-best once across all data sets, while the proposed SVM-SIDL is best twice and second-best three times.

[Table 2: Classification accuracies. Best results are printed in bold italic; second-best results are printed in bold only. All parameters of SIDL are selected via 3-fold cross-validation on splits of the training data over the ranges K in {10, 20, 50, 100}, lambda in {0.1, 1, 10, 100}, q in {0.1p, 0.25p, 0.5p, 0.75p}; the same ranges are used for tuning DL. The parameter C for all linear SVMs is chosen by cross-validation from {0.001, 0.01, 0.1, 1, 10, 100, 1000}. Columns: Dataset, 1NN-Euclidean, 1NN-DTW, SVM-Raw, SVM-DL, SVM-SIDL. The individual accuracy values are garbled in this transcription and are not reproduced here.]

It is worth mentioning that the proposed method is formulated to reduce reconstruction error, not classification error. It is also worth pointing out that nearest-neighbor-based methods incur an expensive computational burden at prediction time, from the pairwise DTW between the test instance and every training example.

Also worth noting is that SVM-SIDL performs best on the "toughest" data set in the collection, ECGFiveDays, by a large margin over the baseline methods, including 1NN-DTW. This again reinforces the benefit of modeling shift-invariant local patterns for sequential data mining.

5.5 Learned Dictionary
In this section, we showcase the basis elements learned by SIDL without any human supervision. Figure 5 presents sample time series from each of the four classes of the Trace data set and plots the learned basis elements together with the time series. The two largest basis elements of each time series (in terms of the absolute value of their coefficients) are plotted at their shift locations in Figure 5. The degree of sparsity of the resulting codings (proportion of zero coefficients) for each time series is shown in the figure titles. It can be observed from the plots that the learned basis elements do capture local and discriminative patterns of the time series from different classes. This further validates the idea of representing time series with local basis elements, and the proposed method can be used to learn them efficiently, even in the absence of class labels. All the basis elements are learned with r...
