Ask Ghassem - Recent activity in Data Science Interview Questions

Do you usually collect your own data, or is there always a resource available for you? Or does it depend on the company?

Sun, 09 Jan 2022 22:13:34 +0000

Data manipulation problem study resources

Wed, 28 Aug 2019 12:53:53 +0000

A colleague of mine is studying for tech roles, and they're asked to solve a consistent type of problem during phone screenings: manipulating data (sets, hash tables/dictionaries, arrays/lists, strings). These questions aren’t necessarily difficult and tend to require very little logic; they are more about having a good understanding of the data types listed above. I've provided some examples in this link: https://imgur.com/a/ITVeVnr

So I'm wondering if there are resources to study these questions. They aren't really Leetcode questions or the kind of thing found on Reddit daily programmer, which is where I'm generally directed to most often in the time I've been asking all over. Even if it's a textbook, it would be incredibly handy. And to be clear, I'm not looking for like a hack or golden secret, just resources for studying. Thank you for any help!

Answered: Do you have a cheatsheet for Data Science?!

Wed, 27 Feb 2019 05:59:34 +0000

A great cheatsheet is available at this link, and can be downloaded directly from here. "The cheatsheet is loosely based off of The Data Science Design Manual by Steven S. Skiena and An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani."

Some screenshots:

https://raw.githubusercontent.com/ml874/Data-Science-Cheatsheet/master/Screenshots/screenshot1.png

https://raw.githubusercontent.com/ml874/Data-Science-Cheatsheet/master/Screenshots/screenshot2.png

Answered: What is summary statistics?

Thu, 01 Nov 2018 19:45:32 +0000

Summary statistics are information that gives a quick and simple description of the data. They include the mean, median, mode, minimum value, maximum value, range, standard deviation, etc.
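As a rough illustration, the measures listed above can be computed with Python's standard statistics module. This is only a sketch: the helper name summarize and the sample numbers are made up for the example.

```python
import statistics

def summarize(values):
    """Return common summary statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

data = [2, 4, 4, 5, 7, 9, 11]  # made-up sample, for illustration only
summary = summarize(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that statistics.stdev is the sample standard deviation; statistics.pstdev would be the population version.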

Answered: What is the purpose of randomization in statistics?

Thu, 01 Nov 2018 19:26:53 +0000

The main purpose of using randomization in an experiment is to control for lurking variables.

Using randomization is the most reliable method of creating homogeneous treatment groups, without involving any potential biases or judgments.
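A minimal sketch of what random assignment can look like in practice, assuming a simple two-group design. The function name randomize_groups, the subject labels, and the seed are all hypothetical; the seed is only there to make the example reproducible.

```python
import random

def randomize_groups(subjects, seed=None):
    """Shuffle subjects and split them into treatment and control groups."""
    rng = random.Random(seed)      # local generator, so global state is untouched
    shuffled = list(subjects)      # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

subjects = [f"subject_{i}" for i in range(1, 11)]  # made-up subject labels
treatment, control = randomize_groups(subjects, seed=42)
print("treatment:", treatment)
print("control:  ", control)
```

Because the assignment depends only on the shuffle, no judgment by the experimenter decides who ends up in which group.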

Answered: What is the difference between univariate and multivariate analysis?

Tue, 30 Oct 2018 11:54:20 +0000

Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so in other words, your data has only one variable. It doesn't deal with causes or relationships (unlike regression) and its major purpose is to describe.

For example, the distribution of the educational background of students involves only one variable, so the analysis can be referred to as univariate analysis.

To know more: https://www.statisticshowto.datasciencecentral.com/univariate/

Multivariate analysis (MVA) involves observation and analysis of more than one statistical outcome variable at a time. The technique is used across multiple dimensions while taking into account the effects of all variables on the responses of interest, and the techniques are especially valuable when working with correlated variables. One example mentioned in class is Factor Analysis.

Specifically, analyzing the relationship between two variables at a time is called bivariate analysis.

To know more: https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/multivariate-analysis/

https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/multivariate-analysis

Answered: How will you create a classification to identify key customer trends in unstructured data?

Sun, 28 Oct 2018 11:50:59 +0000

A model does not hold any value if it cannot produce actionable results. An experienced data analyst will vary the strategy based on the type of data being analysed. For example, if a customer complaint was retweeted, should that data be included or not? Also, any sensitive customer data needs to be protected, so it is advisable to consult with the stakeholder to ensure that you are following all the compliance regulations of the organization and disclosure laws, if any.

You can answer this question by stating that you would first consult with the stakeholder of the business to understand the objective of classifying this data. Then, you would use an iterative process by pulling new data samples and modifying the model accordingly and evaluating it for accuracy. You can mention that you would follow a basic process of mapping the data, creating an algorithm, mining the data, visualizing it and so on. However, you would accomplish this in multiple segments by considering the feedback from stakeholders to ensure that you develop an enriching model that can produce actionable results.

Answered: Mention some common problems that data analysts encounter during analysis.

Sun, 28 Oct 2018 11:50:17 +0000

Having a poorly formatted data file. For instance, having CSV data with un-escaped newlines and commas in columns.
Having inconsistent and incomplete data.
Misspellings and duplicate entries, which are a common data quality problem that most data analysts face.
Having different value representations and misclassified data.
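To illustrate the first problem in the list above, here is a small sketch using Python's standard csv module. The rows are made up. It shows that fields containing commas or newlines survive a round trip when they are quoted properly, while naive line splitting miscounts the records.

```python
import csv
import io

# Made-up rows: one field contains a comma, another contains a newline.
rows = [
    ["id", "comment"],
    ["1", "Great product, would buy again"],
    ["2", "Arrived late.\nBox was damaged"],
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)  # default quoting wraps such fields in quotes

# Naive splitting treats the embedded newline as a record boundary.
naive_lines = buffer.getvalue().splitlines()

# A real CSV parser respects the quotes and recovers the original rows.
buffer.seek(0)
parsed = list(csv.reader(buffer))

print(len(naive_lines), "lines from naive splitting")
print(len(parsed), "records from csv.reader")
```

If a file was written without quoting in the first place, a parser cannot recover the structure reliably, which is why this shows up as a data quality problem.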

Answered: Explain the typical data analysis process.

Sun, 28 Oct 2018 11:49:51 +0000

Data analysis deals with collecting, inspecting, cleaning, transforming and modeling data to glean valuable insights and support better decision making in an organization. The various steps involved in the data analysis process include:

Data Exploration

Having identified the business problem, a data analyst has to go through the data provided by the client to analyse the root cause of the problem.

Data Preparation

This is the most crucial step of the data analysis process, wherein any data anomalies (such as missing values or outliers) have to be detected and handled appropriately.

Data Modelling

The modelling step begins once the data has been prepared. Modelling is an iterative process wherein the model is run repeatedly for improvements. Data modelling ensures that the best possible result is found for a given business problem.

Validation

In this step, the model provided by the client and the model developed by the data analyst are validated against each other to find out if the developed model will meet the business requirements.

Implementation of the Model and Tracking

This is the final step of the data analysis process wherein the model is implemented in production and is tested for accuracy and efficiency.

Answered: What is the difference between Data Mining and Data Analysis?

Sun, 28 Oct 2018 11:49:13 +0000

Data Mining vs Data Analysis
Data Mining	Data Analysis
Data mining usually does not require any hypothesis.	Data analysis begins with a question or an assumption.
Data Mining depends on clean and well-documented data.	Data analysis involves data cleaning.
Results of data mining are not always easy to interpret.	Data analysts interpret the results and convey them to the stakeholders.
Data mining algorithms automatically develop equations.	Data analysts have to develop their own equations based on the hypothesis.

Answered: What is TF-IDF algorithm?

Sun, 28 Oct 2018 11:26:52 +0000

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining.

This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Tf-idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.

How to Compute:

Typically, the tf-idf weight is composed of two terms:

the first computes the normalized Term Frequency (TF), i.e., the number of times a word appears in a document, divided by the total number of words in that document;
the second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

See below for a simple example.

Example:

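Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated, using a base-10 logarithm here, as log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm used in the IDF formula above, the idf would be about 9.21 and the weight about 0.28; the choice of base only rescales all weights by a constant factor.)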
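The TF and IDF formulas above can be sketched in a few lines of Python. This is an illustrative implementation, not a reference one: the function names and the toy corpus are assumptions made for the example, and real libraries typically add smoothing and other variations. The last lines reproduce the cat example, whose idf value of 4 corresponds to a base-10 logarithm.

```python
import math

def term_frequency(term, document):
    """TF(t): occurrences of `term` divided by the total number of terms."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus, log=math.log):
    """IDF(t): log of (number of documents / documents containing `term`)."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        raise ValueError(f"{term!r} does not appear in the corpus")
    return log(len(corpus) / containing)

def tf_idf(term, document, corpus, log=math.log):
    """Product of TF and IDF; natural log by default, as in the IDF formula."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus, log)

# Hypothetical toy corpus; each document is a list of lowercase tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "the dog barked".split(),
]

cat_weight = tf_idf("cat", corpus[0], corpus)  # appears in 2 of 4 documents
the_weight = tf_idf("the", corpus[0], corpus)  # appears in every document

# The worked cat example, computed with a base-10 logarithm.
example_weight = (3 / 100) * math.log10(10_000_000 / 1_000)

print(cat_weight, the_weight, example_weight)
```

A word like "the" that occurs in every document gets an idf of log(1) = 0, so its weight vanishes no matter how often it appears, which is the "weigh down the frequent terms" effect described above.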

Answered: What are Natural Language Processing (NLP) and its applications?

Sun, 28 Oct 2018 11:04:40 +0000

The majority of activities performed by humans are done through language, whether communicated directly or reported using natural language. As technology is increasingly making the methods and platforms on which we communicate ever more accessible, there is an even greater need to understand the languages we use to communicate. By combining the power of artificial intelligence, computational linguistics, and computer science, Natural Language Processing (NLP) helps machines “read” text by simulating the human ability to understand language.

NLP is everywhere even if we don’t realize it. Does your email application automatically correct you when you try to send an email without the attachment that you referenced in the text of the email? This is a natural language processing application at work.

Some examples of the most widely used NLP applications:

Machine translation

As the amount of information available online is growing, the need to access it becomes increasingly important and the value of natural language processing applications becomes clear. Machine translation helps us conquer language barriers that we often encounter by translating technical manuals, support content or catalogs at a significantly reduced cost. The challenge with machine translation technologies is not in translating words, but in understanding the meaning of sentences to provide a true translation.

Automatic summarization

Information overload is a real problem when we need to access a specific, important piece of information from a huge knowledge base. Automatic summarization is relevant not only for summarizing the meaning of documents and information but also for understanding the emotional meanings inside the information, such as in collecting data from social media. Automatic summarization is especially relevant when used to provide an overview of a news item or blog posts while avoiding redundancy from multiple sources and maximizing the diversity of content obtained.

Sentiment analysis

The goal of sentiment analysis is to identify sentiment among several posts or even in the same post where emotion is not always explicitly expressed. Companies use natural language processing applications, such as sentiment analysis, to identify opinions and sentiment online to help them understand what customers think about their products and services (e.g., “I love the new iPhone” and, a few lines later, “But sometimes it doesn’t work well”, where the person is still talking about the iPhone) and overall indicators of their reputation. Beyond determining simple polarity, sentiment analysis understands the sentiment in context to help you better understand what’s behind an expressed opinion, which can be extremely relevant in understanding and driving purchasing decisions.

Text classification

Text classification makes it possible to assign predefined categories to a document and organize it to help you find the information you need or simplify some activities. For example, one application of text categorization is spam filtering in email.

Question Answering

As speech-understanding technology and voice-input applications improve, the need for NLP will only increase. Question-Answering (QA) is becoming more and more popular thanks to applications such as Siri, OK Google, chatbots and virtual assistants. A QA application is a system capable of coherently answering a human request. It may be used as a text-only interface or as a spoken dialog system. While such systems offer great promise, they still have a long way to go. This remains a relevant challenge especially for search engines and is one of the main applications of natural language processing research.

Using natural language processing to create a seamless and interactive interface between humans and machines will continue to be a top priority for today’s and tomorrow’s increasingly cognitive applications.

Commented: Which scenarios among the following are a valid reason to use regularization?

Sat, 27 Oct 2018 17:45:22 +0000

Please provide the links to the sources as well.

Answered: How to transform categorical variable into a matrix binary feature?

Sat, 27 Oct 2018 17:24:33 +0000

Answer: Letter A: One hot encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction.

Source: Hackernoon
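As a hedged illustration of one-hot encoding, here is a dependency-free sketch. The function name one_hot_encode and the color values are made up for the example; in practice a library routine would usually be used instead.

```python
def one_hot_encode(values):
    """Map each categorical value to a binary row with a single 1."""
    categories = sorted(set(values))  # fixed column order
    matrix = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, matrix

colors = ["red", "green", "blue", "green"]  # made-up categorical column
categories, matrix = one_hot_encode(colors)

print(categories)
for value, row in zip(colors, matrix):
    print(value, row)
```

Each row contains exactly one 1, in the column of its category, so the categorical column becomes a binary matrix with one column per distinct value.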

Answered: When should we use Mean, Median or Mode as Measures of Central Tendency?

Wed, 24 Oct 2018 20:57:42 +0000

All three measures are used to give us a good representative for "average" in our data samples. However, based on the type and properties of each we have to use them in different situations. Based on the types of variables, we can use the following table to see what measure we should use:

Type of Variable	Best measure of central tendency
Categorical (Nominal)	Mode
Ordinal	Median
Interval/Ratio (not skewed)	Mean
Interval/Ratio (skewed)	Median

Consider the effect of Outliers

In addition, when we have a ratio variable (such as a numeric value) that contains outliers, we should use the median instead of the mean. An example is a salary column that may contain very large or very small values, which distort the mean; the median gives a better representative of the "average". That is why many websites show the median salary for a job position instead of the mean. For more information, you can take a look at this page.
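A small sketch of the outlier effect, using made-up salary figures and Python's standard statistics module. The mode line covers the categorical row of the table above, again with invented values.

```python
import statistics

# Made-up salaries: six typical values and one extreme outlier.
salaries = [42_000, 45_000, 48_000, 50_000, 52_000, 55_000, 1_200_000]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # stays at the middle value

# For a categorical (nominal) variable, the mode is the usable measure.
degrees = ["BSc", "MSc", "BSc", "PhD"]       # made-up categories
most_common_degree = statistics.mode(degrees)

print(f"mean:   {mean_salary:,.0f}")
print(f"median: {median_salary:,.0f}")
print(f"mode:   {most_common_degree}")
```

Here a single extreme value moves the mean far above what any typical employee earns, while the median stays at the middle salary.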

What are the main steps in making a decision tree?

Fri, 12 Oct 2018 02:17:19 +0000

Please explain Linear Regression with an example?

Fri, 12 Oct 2018 02:14:00 +0000

How to return the outliers by having a list of numbers?

Mon, 08 Oct 2018 12:19:22 +0000

Can you explain the percentiles and quartiles and their applications?

Mon, 08 Oct 2018 11:52:30 +0000