Detection Of Phishing Websites Using Data Mining Techniques


International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 2 Issue 12, December 2013 (IJERTV2IS121245, www.ijert.org)

Anindita Khade and Dr. Subhash K Shinde
Dept. of CE, Lokmanya Tilak College of Engineering, Koparkhairane, Navi Mumbai 421302

Abstract: Detecting a phishing website is a complex and dynamic problem involving many factors and criteria. Because of the ambiguities involved in phishing detection, fuzzy data mining techniques can be an effective tool for detecting phishy websites. In this paper we propose a method that combines fuzzy logic with data mining algorithms for detecting phishy websites. We define 3 phishing types and 6 criteria for detecting phishy websites, arranged in a layered structure, and use the RIPPER data mining algorithm for classification. After an email has been assessed and classified as a phishing email, the system retrieves the location, IP address, and contact information of the host server. It then proactively works to get rid of the phishing site or page by notifying the System Administrator of the host server that it is hosting a phishing site, which may result in the removal of the site.

Keywords: Phishing, RIPPER algorithm, fuzzy logic

1. INTRODUCTION TO PHISHING

Phishing websites are forged websites created by malicious people to look like real websites. Phishing is the act of sending an e-mail to a user falsely claiming to be a legitimate business establishment in an attempt to scam or trick the user into surrendering private information that will be used for identity theft. The impact is a breach of information security through the compromise of confidential data, and the victims may ultimately suffer financial or other losses. At least 67,677 phishing attacks were reported by the Anti-Phishing Working Group (APWG) in the last six months of 2010. The latest reports showed that most phishing attacks are "spear phishing" attacks that target the financial, business, and payment sectors.

E-banking phishing is a complex issue to understand and analyze, since it joins technical and social problems for which there is no single silver bullet. The motivation behind this study is to create a resilient and effective method that uses fuzzy data mining algorithms and tools to detect phishing websites in an automated manner. A proactive approach to minimizing phishing is adopted, in which the system works to have a phishing page removed from the host server rather than just filtering email and flagging suspected messages as spam. DM approaches such as neural networks, rule induction, and decision trees can be a useful addition to the fuzzy logic model.

2. RELATED WORK

Existing anti-phishing and anti-spam techniques suffer from one or more limitations, and none is 100% effective at stopping all spam and phishing attacks. Phishing websites are a recent problem. Nevertheless, because of their large impact on the financial and online retailing sectors, and because preventing such attacks is an important step in defending against e-banking phishing, there are several promising approaches to this problem and a comprehensive body of related work. In this section we briefly survey existing anti-phishing solutions.

One approach is to stop phishing at the email level, since most current phishing attacks use broadcast email (spam) to lure victims to a phishing website. A second approach is to use security toolbars. The phishing filter in IE7 is a toolbar approach with additional features, such as blocking the user's activity on a detected phishing site. A third approach is to visually differentiate phishing sites from the legitimate sites they spoof. Dynamic Security Skins proposes using a randomly generated visual hash to customize the browser window or web form elements to indicate successfully authenticated sites. A fourth approach is two-factor authentication, which ensures that the user not only knows a secret but also presents a security token. However, this approach is a server-side solution: phishing can still happen at sites that do not support two-factor authentication, and sensitive information that is not tied to a specific site, e.g., credit card information and SSN, cannot be protected by it either.

Many industrial anti-phishing products use toolbars in Web browsers, but some researchers have shown that security toolbars do not effectively prevent phishing attacks. One proposed scheme uses a cryptographic identity verification method that lets remote Web servers prove their identities. However, this proposal requires changes to the entire Web infrastructure (both servers and clients), so it can succeed only if the entire industry supports it. Another proposed tool models and describes phishing by visualizing and quantifying a given site's threat, but this method still would not provide an anti-phishing solution. Another approach is to [...] privacy/spam).

A recent and particularly promising solution combines standard certificates with a visual indication of correct certification: a site-dependent logo indicating that the certificate is valid is displayed in a trusted credentials area of the browser. A variant of web credentials is to use a database or list published by a trusted party in which known phishing websites are blacklisted. For example, the Netcraft anti-phishing toolbar (http://toolbar.netcraft.com/) prevents phishing attacks by using a centralized blacklist of current phishing URLs. Other examples include Websense, McAfee's anti-phishing filter, the Netcraft anti-phishing system, Cloudmark SafetyBar, and Microsoft Phishing Filter. The weaknesses of this approach are its poor scalability and its timeliness: phishing sites are cheap and easy to build, and their average lifetime is only a few days. The APWG provides a solution directory that lists most of the major anti-phishing companies in the world. However, automatic anti-phishing methods are seldom reported.

Typical anti-phishing technologies from the user-interface aspect require Web page creators to follow certain rules when creating Web pages, either by adding a dynamic skin to the pages or by adding sensitive-information location attributes to the HTML code. However, it is difficult to convince all Web page creators to follow the rules. A DOM-based visual similarity measure for Web pages introduced the concept of a visual approach to phishing detection. With this approach, a phishing Web page can be detected and reported automatically rather than requiring extensive human effort. The method first decomposes the Web pages (in HTML) into salient (visually distinguishable) block regions.

3. FUZZY LOGIC AND DATA MINING

DM is the process of searching through large amounts of data and picking out relevant information. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from large data sets." It is a powerful technology with great potential to help researchers focus on the most important information in their data archives. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.

Many characteristics and factors, such as spelling errors, can distinguish an original legitimate website from a forged e-banking phishing website. The approach is to apply fuzzy logic and the RIPPER data mining algorithm to assess phishing emails based on the identified characteristics or components. The essential advantage offered by fuzzy logic techniques is the use of linguistic variables to represent key phishing characteristics or indicators when estimating the probability that an email is phishing.

4. DETECTING AND CLASSIFYING PHISHING EMAILS

Fig 1: Overall system approach

The proposed methodology applies fuzzy logic and data mining algorithms to classify phishing emails based on two classification approaches: a content-based approach and a non-content-based approach. Specific categories or criteria are selected for each approach, and the components or selected features are then identified for each category. The classification approaches, with their criteria and specific features, are listed in the table below. The list is used as the basis for the simulation and the determination of phishing emails.

Table 1: Characteristics of phishing emails

Layer  Classification approach  Criterion      Components
1      Non-content based        URL            Non-matching URL, Crawler URL, Long URL address, URL prefix/suffix
2      Content based            Email message  Spelling errors, Keywords, Embedded links
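Some of the URL characteristics named in Table 1 can be computed directly from a link. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names, the 75-character threshold for a "long" URL, and the IP-address check (suggested by the "IP URL" column of the layer-1 rule base) are all hypothetical choices, since the paper does not define how each feature is measured.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical threshold: the paper does not say what counts as a "long" URL.
LONG_URL_THRESHOLD = 75


def host_of(url: str) -> str:
    """Return the lowercase host of a URL, or '' if no host can be parsed."""
    return (urlparse(url).hostname or "").lower()


def uses_ip_address(url: str) -> bool:
    """True when the host is a literal IP address rather than a domain name."""
    try:
        ipaddress.ip_address(host_of(url))
        return True
    except ValueError:
        return False


def is_long_url(url: str, threshold: int = LONG_URL_THRESHOLD) -> bool:
    """True when the URL has more characters than the (assumed) threshold."""
    return len(url) > threshold


def is_nonmatching(displayed_text: str, href: str) -> bool:
    """True when the visible link text names a different host than the target."""
    text = displayed_text.strip()
    if " " in text or "." not in text:
        return False  # plain wording such as "Click here" names no host
    shown = host_of(text if "://" in text else "http://" + text)
    return bool(shown) and shown != host_of(href)


def url_features(displayed_text: str, href: str) -> dict:
    """Collect the three illustrative URL features for one embedded link."""
    return {
        "ip_url": uses_ip_address(href),
        "long_url": is_long_url(href),
        "nonmatching_url": is_nonmatching(displayed_text, href),
    }
```

In a fuller system these boolean features would be mapped to linguistic values before entering the fuzzy rule base; that mapping is not specified in the paper and is left out here.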

5. MINING USING RIPPER ALGORITHM

The approach is to apply fuzzy logic and the RIPPER data mining algorithm to assess phishing emails based on the 9 identified characteristics or components. The essential advantage offered by fuzzy logic techniques is the use of linguistic variables to represent key phishing characteristics or indicators when estimating the probability that an email is phishing. Classification is done using WEKA.

5.1 Algorithm:

Initialize RS = {}, and for each class from the less prevalent one to the more frequent one, DO:

1. Building stage:
Repeat 1.1 and 1.2 until the description length (DL) of the ruleset and examples is 64 bits greater than the smallest DL met so far, or there are no positive examples, or the error rate >= 50%.

1.1. Grow rule:
Grow one rule by greedily adding antecedents (or conditions) to the rule until the rule is perfect (i.e. 100% accurate). The procedure tries every possible value of each attribute and selects the condition with the highest information gain: $p(\log(p/t) - \log(P/T))$.

1.2. Prune phase:
Incrementally prune each rule and allow the pruning of any final sequences of the antecedents. The pruning metric is $(p - n)/(p + n)$, but since this equals $2p/(p + n) - 1$, this implementation simply uses $p/(p + n)$ (actually $(p + 1)/(p + n + 2)$, so that if $p + n$ is 0, the value is 0.5).

2. Optimization stage:
After generating the initial ruleset {Ri}, generate and prune two variants of each rule Ri from randomized data using procedures 1.1 and 1.2. One variant is generated from an empty rule, while the other is generated by greedily adding antecedents to the original rule. The pruning metric used here is $(TP + TN)/(P + N)$. Then the smallest possible DL for each variant and the original rule is computed. The variant with the minimal DL is selected as the final representative of Ri in the ruleset. After all the rules in {Ri} have been examined, if there are still residual positives, more rules are generated from the residual positives using the Building stage again.

3. Delete the rules from the ruleset that would increase the DL of the whole ruleset if they were in it, and add the resultant ruleset to RS.

ENDDO

5.2 Rule Base for Layer 1

[Rule table not recoverable from the source. Surviving column headings: Rule, IP URL, Redirect URL, NonMatching. Surviving output labels: Moderate, Fraud.]

5.3 Rule Base for Layer 2

[Rule table not recoverable from the source.]

5.4 Locating the Host Server of the Phishing Page

WHOIS is a query/response protocol used to find information about networks, domains, and hosts, and it is widely used for querying an official database. The WHOIS database contains IP addresses, autonomous system numbers, the organizations or customers associated with these resources, and related Points of Contact on the Internet. A WHOIS search provides information about a domain name, such as example.com. It may include domain ownership, where and when the domain was registered, its expiration date, and the name servers assigned to it. The system runs the WHOIS query on the URL contained in the phishing email to locate the host server of the phishing page.
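The WHOIS lookup described in Section 5.4 can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function names, the default server `whois.iana.org`, and the simple "key: value" parsing are assumptions, and real WHOIS responses differ in layout from one registry to another. The only protocol facts relied on are that WHOIS runs over TCP port 43 and that a query is a single line terminated by CRLF.

```python
import socket
from urllib.parse import urlparse


def domain_from_url(url: str) -> str:
    """Extract the host name that the WHOIS query will be run on."""
    return (urlparse(url).hostname or "").lower()


def whois_query(domain: str, server: str = "whois.iana.org",
                timeout: float = 10.0) -> str:
    """Send one WHOIS query and return the raw text response.

    The default server is an assumption made for this sketch.
    """
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_whois(text: str) -> dict:
    """Collect 'key: value' lines; repeated keys keep every value in order."""
    fields = {}
    for line in text.splitlines():
        if line.startswith(("%", "#")) or ":" not in line:
            continue  # skip comment lines and free text
        key, value = line.split(":", 1)
        if value.strip():
            fields.setdefault(key.strip().lower(), []).append(value.strip())
    return fields
```

A caller would typically chain the three steps, for example `parse_whois(whois_query(domain_from_url(link)))`, and then look for contact or name-server fields to decide whom to notify.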

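Returning to the rule-growing and pruning metrics of Section 5.1, the short sketch below evaluates them numerically. It rests on assumptions that the paper does not state: p and t are read as the positive and total examples covered by the rule after a candidate condition is added, P and T as the same counts before it is added (the usual FOIL-gain reading), and the logarithm is taken to base 2. The function names are hypothetical.

```python
import math


def information_gain(p: float, t: float, P: float, T: float) -> float:
    """Gain p * (log(p/t) - log(P/T)) for adding one condition to a rule.

    Assumed meaning: p/t = positives/total covered after the condition,
    P/T = positives/total covered before it. Base-2 logs are an assumption.
    """
    if p == 0:
        return 0.0  # a condition that covers no positives gains nothing
    return p * (math.log2(p / t) - math.log2(P / T))


def prune_metric_raw(p: float, n: float) -> float:
    """The textbook pruning metric (p - n) / (p + n)."""
    return (p - n) / (p + n)


def prune_metric_smoothed(p: float, n: float) -> float:
    """The smoothed form (p + 1) / (p + n + 2); equals 0.5 when p + n is 0."""
    return (p + 1) / (p + n + 2)
```

Because (p - n)/(p + n) = 2p/(p + n) - 1 is an increasing function of p/(p + n), ranking candidate prunings by either quantity gives the same order, which is why the simpler ratio can stand in for the original metric.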
5.5 Removal of Phishing Page:

Upon receiving the notification of the phishing page's existence on the host server, the hosting administrator tests the legitimacy and validity of the phishing link. Once the administrator confirms the phishing page, the infected or hacked website is shut down immediately to protect Internet users from further phishing. The host administrator then notifies the website owner about the existence of the phishing page within their website. If no notification is received that the phishing page has been removed, the proposed system periodically checks for evidence of its removal. This technique assumes that the website owner and host administrator are unaware of the presence of the phishing page within their website or server until our technique notifies them. This means phishers have taken control of a legitimate website to upload their phishing page.

6. RESULTS

100 websites from Phishtank.com were considered for testing. For rule base 1, there are 6 identified phishing email characteristics based on the non-content-based approach; the assigned weight is 0.5. For rule base 2, there are 3 identified characteristics of phishing emails based on the content-based approach; the assigned weight is 0.5. The email rating is computed as:

Email rating = 0.5 * URL and Domain Entity crisp (rule base 1) + 0.5 * Email Content Domain crisp (rule base 2)

Table 4: Results from WEKA

Validation mode         10-fold cross validation
Attributes              URL Domain, Email Content
Number of rules         12
Correctly classified    85.4%
Incorrectly classified  14.6%
No. of samples          100

The initial results showed that the URL and Entity Domain and the Email Content Domain are important criteria for identifying and detecting phishing emails. If one of them is "Valid or Genuine", it will likely follow that the email is legitimate. The same is true if both criteria are "Valid or Genuine". Likewise, if the criteria are "Fraud", the email is considered a phishing email.

7. CONCLUSIONS AND FUTURE WORK

The URL and Entity Domain and the Email Content Domain are two significant phishing criteria. If one of the criteria is "Valid or Genuine", it will likely follow that the email is legitimate. The same is true if both criteria are "Valid or Genuine". Likewise, if the criteria are "Fraud", the email is considered a phishing email. It should be noted, however, that the presence of some phishing email characteristics does not automatically mean that the email is a phishing email.

The initial objective was to assess the risk of the emails in the archive data using fuzzy logic and the RIPPER classification algorithm. Several characteristics were identified, and the major rules determined during the study were used in the fuzzy rule engine. The results showed that the RIPPER algorithm correctly classified 85.4% of phishing emails and wrongly classified 14.6%. The phishing page removal success rate is [...]
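The weighted email rating given in Section 6 can be written out as a short function. This is a sketch under assumptions: the function and constant names are hypothetical, and the two crisp inputs are taken to be defuzzified outputs of rule base 1 and rule base 2 on a common numeric scale, which the paper does not specify. Only the two weights of 0.5 come from the text.

```python
# Weights stated in Section 6; the names themselves are hypothetical.
URL_DOMAIN_WEIGHT = 0.5      # rule base 1: URL and Domain Entity
EMAIL_CONTENT_WEIGHT = 0.5   # rule base 2: Email Content Domain


def email_rating(url_domain_crisp: float, email_content_crisp: float) -> float:
    """Weighted sum of the two crisp rule-base outputs.

    Both inputs are assumed to lie on the same numeric scale.
    """
    return (URL_DOMAIN_WEIGHT * url_domain_crisp
            + EMAIL_CONTENT_WEIGHT * email_content_crisp)
```

With equal weights the rating is simply the mean of the two layer outputs, so a high score from one layer can be offset by a low score from the other.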
