Top Data Analytics Interview Questions And Answers


"Data is a precious thing and will last longer than the systems themselves." – Tim Berners-Lee, inventor of the World Wide Web.

We live in an information-driven age where data plays an integral role in the functioning of any organization. Thus, organizations are always on the lookout for skilled data analysts who can turn their data into valuable information. This helps organizations achieve better business growth, as they can better understand the market, consumers, their products or services, and much more. Technology professionals with data analytics skills are in high demand as businesses look to harness the power of data. If you are planning to be a part of this high-potential industry and are preparing for your next data analytics interview, you are in the right place!

Here are the top 55 data analytics questions and answers that will help you clear your next data analytics interview. These questions cover all the essential topics, ranging from data cleaning and data validation to SAS.

Let's begin!

Top 55 Data Analytics Interview Questions & Answers

Here are some of the top data analytics interview questions and answers:

Q1. What are the best practices for data cleaning?

Ans. There are five basic best practices for data cleaning:

- Make a data cleaning plan by understanding where the common errors take place, and keep communications open.
- Standardize the data at the point of entry. This way it is less chaotic and you will be able to ensure that all information is standardized, leading to fewer errors on entry.
- Focus on the accuracy of the data. Maintain the value types of the data, provide mandatory constraints, and set cross-field validation.

- Identify and remove duplicates before working with the data. This will lead to an effective data analysis process.
- Create a set of utility tools/functions/scripts to handle common data cleaning tasks.

Q2. What are the challenges faced as a data analyst?

Ans. There are various ways you can answer this question. Common challenges include very badly formatted data, data that is not sufficient to work with, data that clients have supposedly cleaned but have actually made worse, not receiving updated data, and factual or data entry errors.

Q3. What are the data validation methods used in data analytics?

Ans. The various types of data validation methods used are:

- Field Level Validation – validation is done in each field as the user enters the data, to avoid errors caused by human interaction.
- Form Level Validation – validation is done once the user completes the form, before the information is saved.
- Data Saving Validation – this type of validation is performed during the saving process of the actual file or database record. This is usually done when there are multiple data entry forms.
- Search Criteria Validation – this type of validation ensures that the results returned actually match what the user is looking for to a certain degree.

Q4. What is an outlier?

Ans. Any observation that lies at an abnormal distance from other observations is known as an outlier. It indicates either variability in the measurement or an experimental error.

Q5. What is the difference between data mining and data profiling?

Ans. Data profiling is usually done to assess a dataset for its uniqueness, consistency, and logic. It cannot identify incorrect or inaccurate data values.

Data mining is the process of finding relevant information that has not been found before. It is the way in which raw data is turned into valuable information.

Q6. How often should a data model be retrained?

Ans. A good data analyst understands changing market dynamics and retrains the model accordingly, so that it adjusts to the new environment.

Q7. What is the KNN imputation method?

Ans. KNN (k-nearest neighbor) is an algorithm used for matching a point with its closest k neighbors in a multi-dimensional space.

Q8. Why is KNN used to determine missing numbers?

Ans. KNN is used for missing values under the assumption that a point's value can be approximated by the values of the points that are closest to it, based on other variables.

Q9. Explain what you do with suspicious or missing data.

Ans. When there is doubt about the data or there is missing data:

- Make a validation report to provide information on the suspected data.
- Have experienced personnel look at it so that its acceptability can be determined.
- Update invalid data with a validation code.
- Use the best analysis strategy to work on the missing data, such as simple imputation, the deletion method, or casewise imputation.

Q10. What is the k-means algorithm?

Ans. The k-means algorithm partitions a data set into clusters such that each cluster formed is homogeneous and the points in each cluster are close to each other. The algorithm tries to maintain enough separation between these clusters. Because it is unsupervised, the clusters have no labels.
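The KNN imputation idea from Q7 and Q8 can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the function name `knn_impute` and the toy data are made up for this example:

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill missing values (None) in column `target_idx` by averaging
    that column over the k nearest rows, where distance is computed
    on the remaining (observed) columns."""
    complete = [r for r in rows if r[target_idx] is not None]
    filled = []
    for r in rows:
        if r[target_idx] is not None:
            filled.append(list(r))
            continue
        # Euclidean distance over every column except the missing one
        def dist(other):
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(r, other))
                                 if i != target_idx))
        neighbors = sorted(complete, key=dist)[:k]
        estimate = sum(n[target_idx] for n in neighbors) / k
        new_row = list(r)
        new_row[target_idx] = estimate
        filled.append(new_row)
    return filled

data = [[1.0, 10.0], [1.1, 11.0], [5.0, 50.0], [1.05, None]]
print(knn_impute(data, target_idx=1, k=2))
# The missing value is estimated as (10.0 + 11.0) / 2 = 10.5,
# from the two rows closest in the first column.
```

In practice you would use a tested library implementation (for example, scikit-learn's `KNNImputer`), but the logic is the same: the missing value is borrowed from the rows that look most similar on the observed variables.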
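The k-means loop described in Q10 (assign points to the nearest centroid, then move each centroid to the mean of its cluster) can be sketched as follows. This is a bare-bones teaching sketch with made-up toy data, not a production implementation:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: repeat (assign points to nearest centroid,
    recompute each centroid as the mean of its cluster)."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean (keep it if the cluster is empty)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))
# With this toy data, the points separate into two clusters of three.
```

Note that, as the answer says, the output clusters carry no labels; which blob is "cluster 0" depends only on the random initialization.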

Also Read: Top Data Analytics Courses from Coursera, Edx, WileyNXT, and Jigsaw

Q11. What is the difference between the true positive rate and recall?

Ans. There is no difference – they are the same, with the formula:

(true positives) / (true positives + false negatives)

Q12. What is the difference between linear regression and logistic regression?

Ans. The differences between linear regression and logistic regression are:

Linear Regression | Logistic Regression
It requires the dependent variable to be continuous | It can have dependent variables with more than two categories
Based on least-squares estimation | Based on maximum likelihood estimation
Requires 5 cases per independent variable | Requires at least 10 events per independent variable
Aimed at finding the best-fitting straight line, where the distances between the points and the regression line are the errors | As it is used to predict a binary outcome, the resultant graph is an S-curve

Q13. What is a good data model?

Ans. The criteria that define a good data model are:

- It is intuitive.
- Its data can be easily consumed.
- The data changes in it are scalable.
- It can evolve and support new business cases.
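The identity from Q11 can be checked with a tiny computation. The confusion-matrix counts below are made up purely for illustration:

```python
# Toy confusion-matrix counts (illustrative numbers only)
tp, fn, fp, tn = 40, 10, 5, 45

recall = tp / (tp + fn)              # fraction of actual positives found
true_positive_rate = tp / (tp + fn)  # same formula, different name

print(recall, true_positive_rate)  # 0.8 0.8
```

Both names describe the same quantity: of all the cases that are actually positive (tp + fn), how many did the model catch (tp)?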

Q14. Estimate the number of weddings that take place in a year in India.

Ans. To answer this type of guesstimate question, one should always follow four steps:

Step 1: Start with the right proxy – here, the right proxy is the total population. You know that India has more than 1 billion people; to be a bit more precise, around 1.2 billion.

Step 2: Segment and filter – the next step is to find the right segments and filter out the ones that are not relevant. You will have a tree-like structure, with branches for each segment and sub-branches that filter each segment further. In this question, we will filter out the population above 35 years of age, and below 15 for rural/below 20 for urban.

Step 3: Always round off the proxy to one or zero decimal points so that your calculation is easy. Instead of doing a calculation like 1488/5, go for 1500/5.

Step 4: Validate each number using your common sense to check that it is reasonable, then add up all the numbers you have arrived at after filtering. You will get the required guesstimate. For example, here we will adjust the guesstimate at the end to include first-time marriages only.

Let's do it:

Total population – 1.2 billion

Two main population segments – rural (70%) and urban (30%)

Now, filtering by age group and sex ratio:

The average marriage age in rural areas – 15 to 35 years

The average marriage age in urban areas – 20 to 35 years

Assuming 65% of the total population is within 0–35 years:

Percentage of the population with a probability of getting married in rural areas = (35 − 15)/35 × 65% ≈ 40%

Percentage of the population with a probability of getting married in urban areas = (35 − 20)/35 × 65% ≈ 30%

Assuming the sex ratio to be 50% male and 50% female:

Number of people of marriageable age in rural areas = 0.70 × 0.40 × 1.2 billion / 2 ≈ 170 million

Considering only first-time marriages in rural areas = 170 million / 20 ≈ 8.5 million

Number of people of marriageable age in urban areas = 0.30 × 0.30 × 1.2 billion / 2 ≈ 50 million

Considering only first-time marriages in urban areas = 50 million / 15 ≈ 3 million

Thus, the total number of marriages in India in a year ≈ 8.5 + 3 ≈ 11–12 million.

Q15. What is the condition for using a t-test or a z-test?

Ans. A t-test is usually used when the sample size is less than 30, and a z-test when the sample size is greater than 30.

Q16. What are the two main methods to detect outliers?

Ans. Box plot method: if a value is more than 1.5 × IQR (interquartile range) above the upper quartile (Q3) or more than 1.5 × IQR below the lower quartile (Q1), it is considered an outlier.

Standard deviation method: if a value is higher or lower than the mean by more than 3 × standard deviation, it is considered an outlier.

Q17. Why is 'naïve Bayes' naïve?

Ans. It is naïve because it assumes that all features are equally important and independent of one another, which is rarely the case in a real-world scenario.

Q18. What is the difference between standardized and unstandardized coefficients?

Ans. A standardized coefficient is interpreted in terms of standard deviations, while an unstandardized coefficient is measured in actual values.

Q19. What is the difference between R-squared and adjusted R-squared?

Ans. R-squared measures the proportion of variation in the dependent variable explained by the independent variables. Adjusted R-squared gives the percentage of variation explained by only those independent variables that actually affect the dependent variable.

Q20. What is the difference between factor analysis and principal component analysis?

Ans. The aim of principal component analysis is to explain the covariance between variables, while the aim of factor analysis is to explain the variance between variables.

Q21. What are the steps involved in a data analytics project?

Ans. The fundamental steps involved in a data analysis project are:

- Understand the business
- Get the data
- Explore and clean the data
- Validate the data
- Implement and track the data sets
- Make predictions
- Iterate

Q22. What do you do for data preparation?

Ans. Since data preparation is a critical approach to data analytics, the interviewer might be interested in knowing what path you will take to clean and transform raw data before processing and analysis. As an answer to this data analytics interview question, you should

discuss the model you will be using, along with logical reasoning for it. In addition, you should also discuss how your steps would help ensure superior scalability and accelerated data usage.

Also Read: How to become a Data Analyst

Q23. What are some of the most popular tools used in data analytics?

Ans. The most popular tools used in data analytics are:

- Tableau
- Google Fusion Tables
- Google Search Operators
- Konstanz Information Miner (KNIME)
- SQL Server Reporting Services (SSRS)
- Microsoft data management stack

Q24. What are the most popular statistical methods used when analyzing data?

Ans. The most popular statistical methods used in data analytics are:

- Linear regression
- Classification
- Resampling methods
- Subset selection
- Shrinkage
- Dimension reduction
- Nonlinear models
- Tree-based methods
- Support vector machines
- Unsupervised learning

Q25. What are the benefits of using version control?

Ans. The primary benefits of version control are:

- Enables comparing files, identifying differences, and merging changes
- Allows keeping track of application builds by identifying which version is under development, QA, and production
- Helps to improve the collaborative work culture
- Keeps different versions and variants of code files secure
- Allows seeing the changes made to a file's content
- Keeps a complete history of the project files in case of a central server breakdown

Q26. What is Collaborative Filtering?

Ans. Collaborative filtering is a technique used by recommender systems to make automatic predictions about, or filter, a user's interests. This is achieved by collecting information from many users.

Q27. Do you have any idea about the job profile of a data analyst?

Ans. Yes, I have a fair idea of the job responsibilities of a data analyst. Their primary responsibilities are to:

- Work in collaboration with IT, management, and/or data scientist teams to determine organizational goals
- Dig data from primary and secondary sources
- Clean the data and discard irrelevant information
- Perform data analysis and interpret results using standard statistical methodologies
- Highlight changing trends, correlations, and patterns in complicated data sets
- Strategize process improvements
- Ensure clear data visualizations for management

Q28. What is a Pivot Table?

Ans. A Pivot Table is a Microsoft Excel feature used to summarize huge datasets quickly. It sorts, reorganizes, counts, or groups data stored in a database. This data summarization includes sums, averages, or other statistics.

Q29. Name the different sections of a Pivot Table.

Ans. A Pivot Table has four different sections, which include:

- Values Area
- Rows Area
- Column Area
- Filter Area

Q30. What is Standard Deviation?

Ans. Standard deviation is a very popular method to measure the degree of variation in a data set. It measures the average spread of data around the mean.

Q31. What is a data collection plan?

Ans. A data collection plan is used to collect all the critical data in a system. It covers:

- The type of data that needs to be collected or gathered
- The different data sources for analyzing a data set

Q32. What is an Affinity Diagram?

Ans. An Affinity Diagram is an analytical tool used to cluster or organize data into subgroups based on their relationships. These data or ideas are mostly generated from discussions or brainstorming sessions, and are used in analyzing complex issues.

Q33. What is imputation?

Ans. Missing data may lead to some critical issues; imputation is a methodology that helps to avoid these pitfalls. It is the process of replacing missing data with substituted values.
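The simplest imputation strategy mentioned in Q9 and Q33 – replacing each missing value with the mean of the observed values – can be sketched as follows (a minimal illustration with made-up data; the function name `mean_impute` is hypothetical):

```python
def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(mean_impute([10, None, 14, 12, None]))
# Both missing entries are replaced by the mean of 10, 14, and 12, i.e. 12.0
```

Mean imputation is easy but shrinks the variance of the column; KNN imputation (Q7–Q8) is one of the more careful alternatives.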
