<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Ask Ghassem - Recent questions tagged data-analysis</title>
<link>https://ask.ghassem.com/tag/data-analysis</link>
<description>Powered by Question2Answer</description>
<item>
<title>How to analyse an imbalanced categorical column in a dataset</title>
<link>https://ask.ghassem.com/1042/how-to-analyse-imbalanced-categorical-colum-in-dataset</link>
<description>Hello,&lt;br /&gt;
&lt;br /&gt;
I have a dataset with a categorical column that contains three categories. One of the categories represents 98% of the data, while the remaining 2% are distributed between the other two categories, with a few (maybe around 50) in each. It is worth mentioning that the output for these 50 rows is the same, which suggests that these data points may be important.&lt;br /&gt;
&lt;br /&gt;
However, the data is obviously imbalanced, and I am unable to perform any analysis. Should I drop the entire column, or perform a chi-square test on the data as-is?</description>
<category>Data Science</category>
<guid isPermaLink="true">https://ask.ghassem.com/1042/how-to-analyse-imbalanced-categorical-colum-in-dataset</guid>
<pubDate>Sat, 24 Jun 2023 17:55:23 +0000</pubDate>
</item>
<item>
<title>When dealing with categorical values, should the &#039;year&#039; column be encoded using OHE or OrdinalEncoder?</title>
<link>https://ask.ghassem.com/1012/dealing-categorical-values-should-encoded-ordinalencoder</link>
<description>It&amp;#039;s a car prices dataset, so I&amp;#039;m assuming that the more recent the car, the higher its value should be. The values in the &amp;#039;year&amp;#039; column are simply years from 1995 to 2020.&lt;br /&gt;
I am trying to predict the selling price of the car.&lt;br /&gt;
&lt;br /&gt;
I&amp;#039;m a bit new to ML and still doing my undergraduate degree, so any help or tips are appreciated. Thank you.</description>
<category>Machine Learning</category>
<guid isPermaLink="true">https://ask.ghassem.com/1012/dealing-categorical-values-should-encoded-ordinalencoder</guid>
<pubDate>Sat, 18 Dec 2021 18:46:07 +0000</pubDate>
</item>
<item>
<title>How do I know which encoder to use to convert from categorical variables to numerical?</title>
<link>https://ask.ghassem.com/1006/know-which-encoder-convert-categorical-variables-numerical</link>
<description>Say I have a column with categorical data such as different temperature levels: &amp;#039;Lukewarm&amp;#039;, &amp;#039;Hot&amp;#039;, &amp;#039;Scalding&amp;#039;, &amp;#039;Cold&amp;#039;, &amp;#039;Frostbite&amp;#039;, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I know that we can use pd.get_dummies to convert the column to numerical data within the dataframe, but I also know that there are other &amp;#039;converters&amp;#039; (not sure if that&amp;#039;s the correct terminology) that we can use, e.g. OneHotEncoder from scikit-learn. For example, I could use the pipeline module to build a pipeline and feed my dataframe through it to get my categorical data encoded as numerical.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
How do I know which to use? Does it matter? If it does, when does it matter the most: for what types of problems, and when there are many categorical variables or only a few? If anyone can give me any pointers on this, I&amp;#039;d greatly appreciate it.</description>
<category>Exploratory Data Analysis</category>
<guid isPermaLink="true">https://ask.ghassem.com/1006/know-which-encoder-convert-categorical-variables-numerical</guid>
<pubDate>Mon, 29 Nov 2021 04:09:06 +0000</pubDate>
</item>
<item>
<title>How to calculate average with deviating sensors?</title>
<link>https://ask.ghassem.com/983/how-to-calculate-average-with-deviating-sensors</link>
<description>I have 3 sensors that each report a large number of values, and one sensor might be off. The average of the 2 trustworthy sensors should be reported, and the third, which needs recalibration, should be ignored. I need an (Excel) formula that looks at three columns and, row by row, detects a significant deviation of one value from the others and calculates the average of the most trustworthy values.&lt;br /&gt;
Example:&lt;br /&gt;
48.1 ; 45.2 ; 45.4 =&amp;gt; 45.3, as sensor 1 is way off.&lt;br /&gt;
36.0 ; 37.0 ; 45.0 =&amp;gt; 36.5, as sensor 3 is way off.&lt;br /&gt;
36.0 ; 36.5 ; 37.0 =&amp;gt; 36.5, as the deviation is too small to be considered an anomaly, so all values are valid for the average.&lt;br /&gt;
&lt;br /&gt;
Over long periods of time, a sensor&amp;#039;s readings might be trustworthy for a few weeks but faulty from moment X up until now, so simply ruling out one sensor permanently is not really an option either. What is the best way forward?&lt;br /&gt;
Please help. Highly appreciated.</description>
<category>Data Science</category>
<guid isPermaLink="true">https://ask.ghassem.com/983/how-to-calculate-average-with-deviating-sensors</guid>
<pubDate>Tue, 04 May 2021 14:39:14 +0000</pubDate>
</item>
<item>
<title>Mention some common problems that data analysts encounter during analysis.</title>
<link>https://ask.ghassem.com/460/mention-common-problems-analysts-encounter-during-analysis</link>
<description></description>
<category>Data Science Interview Questions</category>
<guid isPermaLink="true">https://ask.ghassem.com/460/mention-common-problems-analysts-encounter-during-analysis</guid>
<pubDate>Sun, 28 Oct 2018 11:44:59 +0000</pubDate>
</item>
<item>
<title>Explain the typical data analysis process.</title>
<link>https://ask.ghassem.com/459/explain-the-typical-data-analysis-process</link>
<description></description>
<category>Data Science Interview Questions</category>
<guid isPermaLink="true">https://ask.ghassem.com/459/explain-the-typical-data-analysis-process</guid>
<pubDate>Sun, 28 Oct 2018 11:43:46 +0000</pubDate>
</item>
<item>
<title>What is the difference between Data Mining and Data Analysis?</title>
<link>https://ask.ghassem.com/458/what-is-the-difference-between-data-mining-and-data-analysis</link>
<description></description>
<category>Data Science Interview Questions</category>
<guid isPermaLink="true">https://ask.ghassem.com/458/what-is-the-difference-between-data-mining-and-data-analysis</guid>
<pubDate>Sun, 28 Oct 2018 11:42:45 +0000</pubDate>
</item>
</channel>
</rss>