Ask Ghassem - Recent activity in Statistics

How do I compare the count of a value in each year while having a different sample size each year?

Wed, 08 Jun 2022 10:32:33 +0000

How do I accurately compare the number of something a survey measures from my employees each year, given varying survey engagement and a varying number of employees?

Suppose I measure my employees' satisfaction over the years by surveying them each year, asking whether they are satisfied or not, and then comparing the number of yeses across years. However, the number of employees who answer is not the same each year, and the number of employees increases every year. How do I correctly compare the results from year to year?

In other words, how do I remove the effect of the survey engagement rate when calculating the results?

is it possible to derive a new 95% CI from two separate 95% CIs?

Mon, 23 Nov 2020 14:45:19 +0000

Individual and group relative strength in a fixed pool of players: How to approach the problem?

Tue, 29 Oct 2019 20:00:28 +0000

I apologize in advance if my question sounds too basic to be worthy of anyone's time, but statistics are not part of my curriculum.

I am developing a proof of concept of a web application modeling the contribution of individual soccer players with respect to the different teams they've played with throughout their career. In particular, I am looking into a way of ranking both individuals and groups of players as follows:

teammates relative strength: the best/worst combinations of players when playing in the same team in the same matches;
opponents relative strength: the best/worst combinations of players when playing in opposite teams in the same matches, i.e. which tuples of teammates are the best/worst against which;

I must admit I don't quite know how to approach the problem (as I said I have no formal education in statistics or data science). I would be very grateful if anyone could give me some directions. How should I frame this particular problem and what resources in statistics or machine learning (if indeed this is a task fit for machine learning, perhaps I am mistaken on this) would be appropriate to tackle it?

I am eager to learn, so both practical examples or theoretical references (book chapters, online articles, etc) would be very welcome.

Thanks in advance!

Using aggregate data to generate observation-level data statistically sound?

Tue, 11 Jun 2019 22:04:01 +0000

Context: In the realm of Paid Search Marketing. Current reporting does not provide event-level data, only aggregate totals with different segments. I want to compare distributions and test the statistical significance of A/B test results. I did not want to assume that the data followed a normal distribution, and I do not know the standard deviation of the data, so I came up with this approach.

My Question: I am going to use the average "CPA" or "CTR" for a date range, and generate an observation for each conversion based on the average for that time range. Is this a statistically sound way to generate raw data? Would I have wonky distributions because of the multiple averages? I just want a gut check on whether I'm completely off base.

My Aggregate data looks like below:

Day	Cost	Acquisition	CPA or CTR
1	40	2	$20
2	75	3	$25

Observation data I generate looks like below:

Day	Acquisition
1	$20
1	$20
2	$25
2	$25
2	$25

I really appreciate your help with this question! An important project to me at work.

Answered: What is degree of Freedom while calculating confidence interval?

Fri, 28 Dec 2018 16:06:55 +0000

Degree of Freedom is the number of values that are free to vary in the computation of a statistic. For more information please take a look at this article.

Answered: How to find the strength of a P-value against a null hypothesis?

Fri, 28 Dec 2018 16:01:36 +0000

P > 0.10 --------------- Weak or None

0.05 < P <= 0.10 --------------- Moderate

0.01 < P <= 0.05 ---------------- Strong

P <= 0.01 ---------------------- Very Strong

Answered: What are the Gaussian equation parameters?

Tue, 18 Dec 2018 04:43:54 +0000

the parameters ( σ and μ ) of the Gaussian equation

$g(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$

σ is the standard deviation
μ is the mean
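
As a rough illustration of how the two parameters enter the formula, the sketch below evaluates g(x) directly and compares it with Python's built-in `statistics.NormalDist`. The function name `gaussian` and the chosen values of μ and σ are assumptions made for this example only.

```python
import math
from statistics import NormalDist

def gaussian(x, mu, sigma):
    """Evaluate the Gaussian density g(x) for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

# Standard normal (mu = 0, sigma = 1), evaluated at its peak x = mu
peak = gaussian(0.0, mu=0.0, sigma=1.0)
reference = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)
print(peak, reference)
```

Increasing σ lowers and widens the curve, while changing μ only shifts it left or right.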

Answered: What is the use of following equation?

Tue, 18 Dec 2018 04:42:20 +0000

It is used to find the hypotenuse of a right triangle. It is also called the Pythagorean Theorem.

If c denotes the length of the hypotenuse and a and b denote the lengths of the other two sides, the Pythagorean theorem can be expressed as the Pythagorean equation:

$a^{2}+b^{2}=c^{2}$

If the length of both a and b are known, then c can be calculated as

$c=\sqrt{a^{2}+b^{2}}$

If the length of the hypotenuse c and of one side (a or b) are known, then the length of the other side can be calculated as

$a=\sqrt{c^{2}-b^{2}}$

$b=\sqrt{c^{2}-a^{2}}$

Answered: nonpooled independent samples t-interval method

Fri, 07 Dec 2018 23:10:28 +0000

Because the method is nonpooled, the standard deviations are not combined into $s_p$. The test statistic is: $t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$

The degrees of freedom are found using a complicated approximation formula. You won't have to do that calculation "by hand"; software computes it as: $DF=\frac{(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2})^2}{\frac{1}{n_1-1} (\frac{s_1^2}{n_1})^2 + \frac{1}{n_2-1} (\frac{s_2^2}{n_2})^2}$
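
To make the two formulas concrete, here is a minimal sketch that computes the nonpooled t statistic and the approximate degrees of freedom using only the Python standard library. The function name `welch_t_and_df` and both samples are hypothetical, made up purely for illustration.

```python
import math
import statistics

def welch_t_and_df(sample1, sample2):
    """Return the nonpooled t statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance uses the n - 1 denominator (sample variance)
    v1 = statistics.variance(sample1) / n1  # s1^2 / n1
    v2 = statistics.variance(sample2) / n2  # s2^2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical measurements, not taken from the original question
group_a = [20.1, 22.3, 19.8, 21.5, 20.9]
group_b = [18.2, 19.9, 17.5, 18.8, 19.1, 18.4]
t_stat, dof = welch_t_and_df(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The resulting degrees of freedom are usually not a whole number and always fall between the smaller of n₁-1 and n₂-1 and the pooled value n₁+n₂-2.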

Answered: Computing the Confidence Interval for a Difference Between Two Means

Fri, 07 Dec 2018 23:04:51 +0000

If the sample sizes are larger, that is both n₁ and n₂ are greater than 30, then one uses the z-table.

If either sample size is less than 30, then the t-table is used.

If n₁ > 30 and n₂ > 30, we can use the z-table:

Use Z table for standard normal distribution

If n₁ < 30 or n₂ < 30, use the t-table:

Use the t-table with degrees of freedom = n₁+n₂-2

For both large and small samples, $s_p$ is the pooled estimate of the common standard deviation (assuming that the variances in the populations are similar), computed as the square root of the weighted average of the sample variances: $s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}$
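
The pooled interval described above can be sketched as follows. The function name `pooled_ci` and the sample data are hypothetical. The critical value is passed in by the caller because it comes from the z-table or t-table; 2.262 is the t-table value for a 95% interval with n₁+n₂-2 = 9 degrees of freedom.

```python
import math
import statistics

def pooled_ci(sample1, sample2, critical_value):
    """Confidence interval for mean1 - mean2 using the pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq = statistics.variance(sample1)  # sample variance of group 1
    s2_sq = statistics.variance(sample2)  # sample variance of group 2
    # Pooled standard deviation: weighted average of the variances, then square root
    sp = math.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    margin = critical_value * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    return diff - margin, diff + margin

# Hypothetical small samples (n1 = 5, n2 = 6), so the t-table is used with df = 9
group_a = [20.1, 22.3, 19.8, 21.5, 20.9]
group_b = [18.2, 19.9, 17.5, 18.8, 19.1, 18.4]
low, high = pooled_ci(group_a, group_b, critical_value=2.262)
print(f"95% CI for the difference in means: ({low:.2f}, {high:.2f})")
```

If the interval does not contain 0, the data suggest a difference between the two population means at that confidence level.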

Answered: How do you know when to use T-Distribution instead of regular Confidence Interval?

Thu, 06 Dec 2018 00:52:34 +0000

What do you mean by "regular confidence interval"? If your question is when to use the z-table and when to use the t-table, it depends on the metrics we have. If we know the population metrics, which we call parameters (in particular the population standard deviation), we should use the z-table. If we are estimating population metrics such as the mean and standard deviation from the sample, then we should use the t-table.

Answered: What is the T Distribution?

Fri, 30 Nov 2018 21:22:29 +0000

The t distribution is a family of distributions that look almost identical to the normal distribution curve, only a bit shorter and fatter. The t distribution is used instead of the normal distribution when you have small samples (for more on this, see: t-score vs. z-score). The larger the sample size, the more the t distribution looks like the normal distribution. In fact, for sample sizes larger than 20 (i.e. more degrees of freedom), the distribution is almost exactly like the normal distribution.

Answered: How to calculate normal distribution?

Fri, 30 Nov 2018 21:20:23 +0000

The standardized value of a normally distributed random variable is called a Z score and is calculated using the following formula.

$z = \frac{x - m}{s}$

x = the value that is being standardized
m = the mean of the distribution
s = the standard deviation of the distribution

Answered: Probability for the data items within mean +- 1, +-2, +-3 standard deviation?

Fri, 30 Nov 2018 21:18:08 +0000

The probability of falling within mean +-1 standard deviation is 0.68.
The probability of falling within mean +-2 standard deviations is 0.95.
The probability of falling within mean +-3 standard deviations is 0.997.
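
These three probabilities can be checked numerically. The following is a small sketch that uses `statistics.NormalDist` from the Python standard library; the helper name `prob_within` is an assumption for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within +-{k} standard deviations: {prob_within(k):.4f}")
# within +-1 standard deviations: 0.6827
# within +-2 standard deviations: 0.9545
# within +-3 standard deviations: 0.9973
```

The same probabilities hold for any normal distribution, because standardizing removes the mean and the scale.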

Answered: What will be probability for the data items within mean +- 1, +-2, +-3 standard deviation?

Fri, 30 Nov 2018 21:17:20 +0000

The probability of falling within mean +-1 standard deviation is 0.68.
The probability of falling within mean +-2 standard deviations is 0.95.
The probability of falling within mean +-3 standard deviations is 0.997.

Answered: How do I calculate a and z a/2?

Thu, 29 Nov 2018 18:26:08 +0000

Given a confidence level of 95%:

$\alpha = 1 - 0.95 = 0.05$
$z_{\alpha/2} = z_{0.05 / 2} = z_{0.025} = 1.96$

Given a confidence level of 99%:

$\alpha = 1 - 0.99 = 0.01$
$z_{\alpha/2} = z_{0.01 / 2} = z_{0.005} = 2.576$
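
Instead of reading the z-table, the critical value can be computed as the point that leaves an area of α/2 in the upper tail. This is a minimal sketch using the standard library; the function name `z_critical` is an assumption for this example.

```python
from statistics import NormalDist

def z_critical(confidence_level):
    """Return z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence_level
    # inv_cdf gives the z value with area 1 - alpha/2 to its left
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_critical(0.95), 3))  # 1.96
print(round(z_critical(0.99), 3))  # 2.576
```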

Answered: What is the rule that determines percentages with the mean +- 1,2 or 3 standard deviations?

Sat, 17 Nov 2018 19:04:26 +0000

The Empirical Rule, which applies to normal distributions. The notes about the rule can be found in the slides.

Answered: How much percentage of data items are consumed within mean +-1 , +-2, +-3 standard deviation?

Wed, 14 Nov 2018 15:05:54 +0000

For mean +-1 standard deviation, 68% of data items.

For mean +-2 standard deviation, 95% of data items.

For mean +-3 standard deviation, 99.7% of data items.

What is a test statistic in hypothesis testing and how does it relate to the p-value?

Tue, 13 Nov 2018 00:51:11 +0000

Answered: What is sampling error?

Tue, 13 Nov 2018 00:48:25 +0000

Sampling error occurs when using a sample mean to estimate a population mean. Usually the sample mean is quite close to the population mean, but it is important to understand that there will be some level of sampling error.

What is P-value in Statistical Testing?

Mon, 12 Nov 2018 13:41:39 +0000

What is P-value in Statistical testing?

Answered: What is a sampling distribution of sample means?

Mon, 12 Nov 2018 13:38:44 +0000

A sampling distribution of sample means is a frequency distribution of all the possible means of samples of a given size n.

Answered: In the first week of November 2003.... What is the expected number of homicides per week in the GTA?

Fri, 02 Nov 2018 19:36:25 +0000

If there are 78 per year then the expected value or mean for a week is 78 homicides/52.1429 weeks which is 1.496 per week.

Answered: In statistics, whats the difference between contingency and frequency tables?

Tue, 30 Oct 2018 11:38:10 +0000

A contingency table shows the joint frequencies of two categorical variables, whereas a frequency table shows the counts of a single categorical variable.

Answered: How to discover outliers of a data frame?

Sun, 28 Oct 2018 11:21:10 +0000

There are two types of analysis we will follow to find the outliers: univariate (one-variable outlier analysis) and multivariate (two-or-more-variable outlier analysis).

Discover outliers with visualization tools

Box plot

If there is an outlier, it will be plotted as a point in the box plot, while the rest of the population is grouped together and displayed as a box.

The plot above shows three points between 10 and 12; these are outliers because they are not included in the box of the other observations, i.e. they are nowhere near the quartiles.

Here we analyzed a univariate outlier, i.e. we used only the DIS column to check for outliers.

multivariate outlier analysis

Scatter plot

The scatter plot is the collection of points that show values for two variables.

Looking at the plot above, we can see that most of the data points lie in the bottom left, but there are points that are far from the population, such as those in the top right corner.

Discover outliers with a mathematical function

Z-Score

The intuition behind the Z-score is to describe any data point by its relationship to the mean and standard deviation of the group of data points. Computing Z-scores rescales the data so that the mean is 0 and the standard deviation is 1.

While calculating the Z-score we re-scale and center the data and look for data points that are too far from zero. These data points are treated as outliers. In most cases a threshold of 3 is used on the absolute value, i.e. if the Z-score is greater than 3 or less than -3, that data point is identified as an outlier.

We will use Z-score function defined in scipy library to detect the outliers.

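from scipy import stats
import numpy as np

# boston_df is assumed to be a numeric pandas DataFrame loaded earlier
z = np.abs(stats.zscore(boston_df))
print(z)

threshold = 3
# Row and column indices of the entries whose |z| exceeds the threshold
print(np.where(z > threshold))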

IQR score

Box plot uses the IQR method to display data and outliers(shape of the data) but in order to get a list of the identified outlier, we will need to use the mathematical formula and retrieve the outlier data.

IQR is somewhat similar to Z-score in terms of finding the distribution of data and then keeping some threshold to identify the outlier.

Q1 = boston_df_o1.quantile(0.25)
Q3 = boston_df_o1.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Now that we have the IQR scores, it's time to get hold of the outliers. The code below gives an output with True and False values. False means the value is valid, whereas True indicates the presence of an outlier.

print((boston_df_o1 < (Q1 - 1.5 * IQR)) | (boston_df_o1 > (Q3 + 1.5 * IQR)))
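
Because `boston_df` and `boston_df_o1` are not defined in this answer, the following self-contained sketch applies the same two rules to a small made-up list using only the Python standard library. The data, the variable names, and the choice of the "inclusive" quantile method are all assumptions made for illustration.

```python
import statistics

# Made-up measurements with one obviously extreme value (11.0)
values = [2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 2.6, 2.7, 2.0, 2.9, 2.4, 2.5, 11.0]

# Z-score method: center on the mean, scale by the population standard deviation
mean = statistics.mean(values)
std = statistics.pstdev(values)
threshold = 3
z_outliers = [v for v in values if abs((v - mean) / std) > threshold]

# IQR method: flag values more than 1.5 * IQR outside the quartiles
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower or v > upper]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

On this toy data both rules flag only the value 11.0. On real data the two methods can disagree, since the Z-score depends on the mean and standard deviation, which are themselves pulled by extreme values.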

Answered: Define measures of center (Median and Mode) ?

Sat, 27 Oct 2018 14:06:42 +0000

Median

The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

We first need to rearrange that data into order of magnitude (smallest first):

Our median mark is the middle mark - in this case, 56 (highlighted in bold). It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result. So, if we look at the example below:

We again rearrange that data into order of magnitude (smallest first):

Only now we have to take the 5th and 6th score in our data set and average them to get a median of 55.5.

Mode:

The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest bar. You can, therefore, sometimes consider the mode as being the most popular option. An example of a mode is presented below:

Normally, the mode is used for categorical data where we wish to know which is the most common category, as illustrated below:

We can see above that the most common form of transport, in this particular data set, is the bus. However, one of the problems with the mode is that it is not unique, so it leaves us with problems when we have two or more values that share the highest frequency, such as below:
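
The data tables for this answer were not preserved, so the sketch below uses illustrative values. The scores were chosen so that the results agree with the medians quoted above (56 for 11 scores, 55.5 for 10 scores), and the transport list is entirely hypothetical.

```python
import statistics

# Illustrative scores: 11 values (odd count), then the same list without the last one
scores_odd = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
scores_even = scores_odd[:-1]

print(statistics.median(scores_odd))   # middle value of the sorted list: 56
print(statistics.median(scores_even))  # average of the 5th and 6th values: 55.5

# Mode for categorical data; multimode returns every value that shares the top frequency
transport = ["bus", "car", "bus", "train", "car", "bus"]
tied = ["bus", "car", "bus", "train", "car"]
print(statistics.multimode(transport))  # ['bus']
print(statistics.multimode(tied))       # ['bus', 'car']
```

`statistics.multimode` makes the non-uniqueness problem explicit by returning a list, which has more than one entry when several values tie for the highest frequency.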

Answer reshown: How many unique ways are there to arrange the letters in the word PRIOR?

Thu, 18 Oct 2018 11:47:51 +0000

The word PRIOR has 5 letters and the letter R repeats twice. So the formula to calculate the answer is:

5!/2! = (5*4*3*2*1) / (2*1) = 120/2 = 60.

The answer is 60.
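
The formula can be cross-checked by brute force. In this sketch (the function name `distinct_arrangements` is an assumption for this example), the factorial formula is compared with the number of distinct orderings produced by `itertools.permutations`.

```python
import math
from collections import Counter
from itertools import permutations

def distinct_arrangements(word):
    """n! divided by the factorial of each letter's repeat count."""
    total = math.factorial(len(word))
    for count in Counter(word).values():
        total //= math.factorial(count)
    return total

formula_count = distinct_arrangements("PRIOR")
# Brute force: generate all 5! orderings and keep only the distinct ones
brute_force_count = len(set(permutations("PRIOR")))
print(formula_count, brute_force_count)  # 60 60
```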

Retagged: Calculate IQR(Inter Quartile Range) of {18, 24, 19, 16, 21}?

Tue, 16 Oct 2018 11:59:59 +0000

Answered: How do you tell a permutation problem from a combination problem?

Mon, 15 Oct 2018 17:15:22 +0000

The key is finding out whether the order of the items is important. If the order matters, it is a permutation; otherwise it is a combination.

In these two examples, we should think about the context of the words "Groups" and "Lists". For groups, the order is NOT important, but lists come with an index, so the order does matter. So, the first question is a combination and the second one is a permutation.

Combination:

1. How many different groups of 4 students can be made from a class of 40?

Answer: $C(40,4) = \binom{40}{4} = \frac{40!}{4!(40-4)!} $

Permutation:

2. How many different lists of 4 students can be made from a class of 40?

Answer: $P(40,4) = \frac{40!}{(40-4)!} $
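
For the numeric values of both answers, Python 3.8+ provides `math.comb` and `math.perm`. The short sketch below evaluates the two formulas and shows how they are related.

```python
import math

groups = math.comb(40, 4)  # order does not matter: 40! / (4! * 36!)
lists = math.perm(40, 4)   # order matters: 40! / 36!

print(groups)  # 91390
print(lists)   # 2193360

# Each group of 4 students can be ordered in 4! = 24 ways
print(lists == groups * math.factorial(4))  # True
```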

Reshown: What is the probability of getting a King of Hearts?

Mon, 15 Oct 2018 02:53:32 +0000

What is the probability of picking a King of Hearts in a standard 52 card deck?

Answer selected: What is the use of the Poisson Distribution?

Sun, 14 Oct 2018 13:03:36 +0000

Generally speaking, the Poisson distribution can be used for probability problems where only the expected number of occurrences is known. Specifically, it gives us the probability of a given number of events happening in a fixed interval of time.

For example, one would use the Poisson distribution for problems like: counts per unit time, defects per unit area, events per unit length, etc.
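
As a minimal sketch of how the expected number of occurrences turns into probabilities, the function below implements the Poisson probability mass function P(X = k) = e^(-λ) λ^k / k!. The function name `poisson_pmf` and the rate of 1.5 events per interval are assumptions chosen only for illustration.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events in an interval when `rate` events are expected."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

rate = 1.5  # hypothetical expected number of events per interval
for k in range(4):
    print(f"P(X = {k}) = {poisson_pmf(k, rate):.4f}")

# Probability of at least one event in the interval
print(f"P(X >= 1) = {1 - poisson_pmf(0, rate):.4f}")
```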

Answered: If x is the number that comes up when you roll a 20 sided die. What is the expected value of x?

Sun, 14 Oct 2018 04:22:54 +0000

Assign a number to each of those events (having 1 as the outcome, having 2 as the outcome, ..., having 20 as the outcome), which is the job of a random variable. For this specific example, we can assign the same numbers. The expected value equation is $\mu = \sum x.p(x)$. Now you have $x \in \{1,2,...,20\}$, and if you have $p(x)$, you can easily calculate the expected value.

Since we have a fair die with 20 sides, the probability of each side is $\frac{1}{20}$. Now, we can calculate it in this way:

$\mu = \sum x.p(x) = \sum_{x=1}^{20} x.\frac{1}{20} = \frac{1}{20} \sum_{x=1}^{20} x = \frac{1}{20} \frac{20\times21}{2} = 10.5 $

(note: $\sum_{x=1}^{n} x = \frac {n\times(n+1)}{2}$)
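
The same sum can be evaluated directly. This sketch uses exact fractions so that no rounding is involved; the variable names are assumptions for this example.

```python
from fractions import Fraction

sides = 20
p = Fraction(1, sides)  # a fair die: every side has the same probability

# mu = sum of x * p(x) over x = 1, ..., 20
expected_value = sum(x * p for x in range(1, sides + 1))

print(expected_value)         # 21/2
print(float(expected_value))  # 10.5
```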

Answered: How do we determine if two events are independent

Sat, 13 Oct 2018 15:32:02 +0000

We can call two events independent if the outcome of one of the events doesn't affect the probability of the other event. Formally, A and B are independent when $P(A \cap B) = P(A)P(B)$.

For instance, we could throw 2 dice and consider the probability that both are 6's. So, we have thrown our dice. The outcome of the first die throw does not impact the probability of the second die throw. Regardless of what the first outcome was, the second die still has a 1/6 chance of rolling a 6.

Answered: In the context of probability, what is the "event space"

Sat, 13 Oct 2018 15:10:46 +0000

An event is a subset of outcomes from the sample space, and the event space is the collection of all such events. The sample space is the set of all possible outcomes of an experiment.

Answered: What is the "Three-Standard-Deviations Rule"

Sat, 13 Oct 2018 15:04:07 +0000

This rule is used to remember the percentage of values that lie around the mean in a normal distribution.

It is a helpful rule to quickly analyze a normal distribution.

To reiterate, 68% of the data is within 1 standard deviation, 95% is within 2 standard deviations, and 99.7% is within 3 standard deviations.

Answered: How do histograms help us understand a data set?

Sat, 13 Oct 2018 14:46:18 +0000

Histograms are very effective for visualizing continuous data. When we plot histograms, we gain more insight into the underlying distribution. This allows us to identify key traits like outliers and skewness.

Answered: What is a discrete random variable?

Sat, 13 Oct 2018 01:23:29 +0000

https://revisionmaths.com/advanced-level-maths-revision/statistics/discrete-random-variables

A discrete variable is a variable that can take only a finite or countable number of distinct values. A discrete random variable assigns a probability to each of these values, and the probabilities sum to 1.

Answered: What is the difference between permutation and combination?

Sat, 13 Oct 2018 01:20:38 +0000

https://towardsdatascience.com/difference-between-permutation-and-combination-9e12b6763ee1

A permutation is a selection of objects in which the order matters, whereas a combination is a selection of objects from a set of n objects in which the order does not matter.

Answered: What is the difference between Qualitative and Quantitative data?

Sat, 13 Oct 2018 01:16:53 +0000

Qualitative data is not a number, and is typically a text value, whereas Quantitative data is a number (which is then divided into sub-groups: Discrete and Continuous).

Answered: What is binomial experiment and when do we use it?

Fri, 12 Oct 2018 22:30:50 +0000

A binomial experiment is a probability experiment in which the same process (trial) is repeated a fixed number of times.

We use a binomial experiment when the trials are independent of each other, each trial has only 2 possible outcomes, and the probability of success is the same on every trial. For example, tossing a coin has 2 outcomes, head or tail. The probability of success is denoted p, while the probability of failure is 1 - p.
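
To illustrate, the probability of exactly k successes in n independent trials is C(n, k) p^k (1 - p)^(n - k). The sketch below applies this to the coin example; the function name `binomial_pmf` and the choice of 10 tosses are assumptions for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Fair coin tossed 10 times: probability of exactly 5 heads
prob_five_heads = binomial_pmf(5, n=10, p=0.5)
print(prob_five_heads)  # 0.24609375

# The probabilities over all possible numbers of heads add up to 1
total = sum(binomial_pmf(k, n=10, p=0.5) for k in range(11))
print(total)
```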

What is "Random Sampling"?

Fri, 12 Oct 2018 22:26:06 +0000

Define random sampling

Answered: When should we use permutation and combination?

Fri, 12 Oct 2018 22:22:57 +0000

Permutations are used when we need to arrange things in a specific order. For example, let's assume there are 10 people and we have to assign them medals. Order is important here, as the 1st person gets gold while the 2nd gets silver. In this case, we use permutations.

Combinations are used when we need to make groups where order doesn't matter. For example, if we need to give 3 tin cans to 8 people, order doesn't matter here for the way we pick people. In this case, we use combinations.

Answered: What do population, parameter, census and sample mean in statistics?

Fri, 12 Oct 2018 21:57:50 +0000

Population is a set of similar items or events. It can be a group of any existing objects.

A parameter is a numerical quantity that describes a population or some aspect of it.

A census is a survey that collects data from every member of a population.

Sample generally refers to a set of observations. Here, it refers to a set of observations drawn from a population.

Answered: Question for Discrete random variable

Fri, 12 Oct 2018 21:07:12 +0000

The mean, μ, of a discrete random variable x is $\mu = \sum x.P(x)$, summed over all values of x.

Here we subtract the second term because it represents a loss.

$\mu = (0.60)(20000) - (0.40)(25000)$

$\mu = 2000$

Answered: Sample question of permutation and combination.

Fri, 12 Oct 2018 20:52:42 +0000

Here the order of the r elements does not matter, so we will use combination formula:

= $\frac{n!}{r!\left(n-r\right)!}$

= $\frac{5!}{2!\left(5-2\right)!}$

= 10

Answered: {1.75, 1.63, 1.55, 1.92, 1.81, 1.79, 1.81}. Determine the mean, the median, and the 20th percentile?

Fri, 12 Oct 2018 20:37:02 +0000

Mean = (1.75 + 1.63 + 1.55 + 1.92 + 1.81 + 1.79 + 1.81) / 7 = 1.751

Median: arrange the values in ascending order. Because the number of values in the data set (7) is odd, the median is the middle (4th) value.

1.55, 1.63, 1.75, 1.79, 1.81, 1.81, 1.92

Median = 1.79

20th percentile: using the same ordered values, compute the position A for k = 20 and n = 7.

A = (nk)/100

= (7 x 20)/100 = 1.4

Rounding 1.4 up to the next whole number gives the 2nd value in the ordered data set, which is 1.63.
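
The three results can be reproduced with a few lines of Python. The percentile helper follows the round-up rule used in this answer (often called the nearest-rank method); its name `nearest_rank_percentile` is an assumption for this example, and other percentile conventions may give slightly different values.

```python
import math
import statistics

values = [1.75, 1.63, 1.55, 1.92, 1.81, 1.79, 1.81]

def nearest_rank_percentile(data, k):
    """k-th percentile: round (n * k) / 100 up and take that position in the sorted data."""
    ordered = sorted(data)
    rank = math.ceil(len(ordered) * k / 100)
    return ordered[max(rank, 1) - 1]  # positions are 1-based, list indices are 0-based

print(round(statistics.mean(values), 3))    # 1.751
print(statistics.median(values))            # 1.79
print(nearest_rank_percentile(values, 20))  # 1.63
```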

Answered: What is difference between discrete numerical variable and continuous numerical variable?

Fri, 12 Oct 2018 19:28:25 +0000

A discrete numerical variable is obtained by counting and can take only a finite or countable number of distinct values.

A continuous numerical variable is obtained by measuring and can take any value within an interval, so it has infinitely many possible values.

Answered: What is the difference between Descriptive Statistics and Inferential Statistics?

Fri, 12 Oct 2018 02:09:38 +0000

Answer:

Descriptive Statistics: Collecting, summarizing, and presenting sample data using numerical and graphical methods.

Inferential Statistics: Making estimates, decisions, predictions, or other generalizations about a larger set of data based on sampling.

Answered: Find the expected value of X from the probability table.

Fri, 12 Oct 2018 02:08:21 +0000

The expected value of X, E(X) = ∑X*P(X).

E(X) = 1 * 0.3 + 2 * 0.8 + 3 × 0.4

= 0.3 + 1.6 + 1.2 = 3.1

Answer: 3.1

Answered: If a die is rolled, find the probability of rolling an even number

Fri, 12 Oct 2018 02:07:45 +0000

First write the sample space S of the scenario. A die has 6 possible numbers.
S = {1,2,3,4,5,6}
The event E in this scenario contains all the possible even numbers.
E = {2,4,6}
Then write the classical probability.
P(E) = n(E) / n(S) = 3 / 6 = 1 / 2
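
The steps above can be sketched in code by listing the sample space, filtering the event, and dividing the two counts. The variable names are assumptions for this example.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                   # S: all outcomes of one roll
event = {x for x in sample_space if x % 2 == 0}     # E: the even outcomes

# Classical probability: n(E) / n(S), valid because all outcomes are equally likely
probability = Fraction(len(event), len(sample_space))
print(sorted(event), probability)  # [2, 4, 6] 1/2
```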