<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Ask Ghassem - Recent questions tagged data-cleaning</title>
<link>https://ask.ghassem.com/tag/data-cleaning</link>
<description>Powered by Question2Answer</description>
<item>
<title>How to analyse an imbalanced categorical column in a dataset</title>
<link>https://ask.ghassem.com/1042/how-to-analyse-imbalanced-categorical-colum-in-dataset</link>
<description>Hello,&lt;br /&gt;
&lt;br /&gt;
I have a dataset with a categorical column that contains three categories. One of the categories represents 98% of the data, while the remaining 2% are distributed between the other two categories, with a few (maybe around 50) in each. It is worth mentioning that the output for these 50 rows is the same, which suggests that these data points may be important.&lt;br /&gt;
&lt;br /&gt;
However, the data is obviously imbalanced, and I am unable to perform any analysis. Should I drop the entire column, or perform a chi-square test on the data as-is?</description>
<category>Data Science</category>
<guid isPermaLink="true">https://ask.ghassem.com/1042/how-to-analyse-imbalanced-categorical-colum-in-dataset</guid>
<pubDate>Sat, 24 Jun 2023 17:55:23 +0000</pubDate>
</item>
<item>
<title>How do I know which encoder to use to convert from categorical variables to numerical?</title>
<link>https://ask.ghassem.com/1006/know-which-encoder-convert-categorical-variables-numerical</link>
<description>Say I have a column with categorical data, such as different temperature levels: &amp;#039;Lukewarm&amp;#039;, &amp;#039;Hot&amp;#039;, &amp;#039;Scalding&amp;#039;, &amp;#039;Cold&amp;#039;, &amp;#039;Frostbite&amp;#039;, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I know that we can use pd.get_dummies to convert the column to numerical data within the dataframe, but I also know that there are other &amp;#039;converters&amp;#039; (not sure if that&amp;#039;s the correct terminology) that we can use, e.g. OneHotEncoder from scikit-learn (for instance, I could use the pipeline module to build a pipeline and feed my dataframe through it to also get my categorical data encoded as numerical).&lt;br /&gt;
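&lt;br /&gt;
For illustration, here is a minimal sketch of how the two options can behave differently on data seen after training. The column name temp and both small frames are hypothetical, made up only for this example; this is one way to compare them, not a recommendation for either.&lt;br /&gt;

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and new data; the column name "temp" is an assumption.
train = pd.DataFrame({"temp": ["Cold", "Hot", "Lukewarm", "Hot"]})
new = pd.DataFrame({"temp": ["Hot", "Scalding"]})

# pd.get_dummies builds columns from whatever categories are present,
# so the two frames end up with different column sets.
train_dummies = pd.get_dummies(train["temp"])
new_dummies = pd.get_dummies(new["temp"])
print(list(train_dummies.columns))
print(list(new_dummies.columns))

# OneHotEncoder learns the categories once in fit() and reuses them in transform().
# handle_unknown="ignore" encodes a category not seen during fit as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["temp"]])
new_encoded = encoder.transform(new[["temp"]]).toarray()
print(new_encoded)
```

In this sketch the fitted encoder keeps the same three output columns for the new data, while get_dummies produces a different set of columns for each frame.&lt;br /&gt;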
&lt;br /&gt;
&lt;br /&gt;
How do I know which to use? Does it matter? If it does matter, when does it matter the most (i.e. what types of problems? When there are lots of categorical variables, or few?) If anyone can give me any pointers on this type of stuff I&amp;#039;d greatly appreciate it.</description>
<category>Exploratory Data Analysis</category>
<guid isPermaLink="true">https://ask.ghassem.com/1006/know-which-encoder-convert-categorical-variables-numerical</guid>
<pubDate>Mon, 29 Nov 2021 04:09:06 +0000</pubDate>
</item>
<item>
<title>How to filter a dataframe?</title>
<link>https://ask.ghassem.com/775/how-to-filter-a-dataframe</link>
<description>&lt;p&gt;Consider the Pandas DataFrame&amp;nbsp;&lt;code&gt;df&lt;/code&gt;&amp;nbsp;below. Filter it appropriately so that it outputs the shown results.&lt;/p&gt;

&lt;pre class=&quot;prettyprint lang-python&quot; data-pbcklang=&quot;python&quot; data-pbcktabsize=&quot;4&quot;&gt;
     gh owner language      repo  stars
0  pandas-dev   python    pandas  17800
1   tidyverse        R     dplyr   2800
2   tidyverse        R   ggplot2   3500
3      has2k1   python  plotnine   1450&lt;/pre&gt;
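
&lt;p&gt;As a rough sketch of one possible approach, the frame can be rebuilt and filtered with a boolean mask. It is assumed here that gh owner is a single column name, and the condition on repo is only one of several filters that would leave the same single row.&lt;/p&gt;

```python
import pandas as pd

# Rebuild the frame shown above; "gh owner" is assumed to be one column name.
df = pd.DataFrame({
    "gh owner": ["pandas-dev", "tidyverse", "tidyverse", "has2k1"],
    "language": ["python", "R", "R", "python"],
    "repo": ["pandas", "dplyr", "ggplot2", "plotnine"],
    "stars": [17800, 2800, 3500, 1450],
})

# Boolean-mask filtering: keep only the rows where repo equals "pandas".
result = df[df["repo"] == "pandas"]
print(result)
```

&lt;p&gt;A mask on stars or on gh owner could select the same row; which condition is intended is not stated.&lt;/p&gt;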

&lt;h2&gt;Expected Output&lt;/h2&gt;

&lt;pre class=&quot;prettyprint lang-&quot; data-pbcklang=&quot;&quot; data-pbcktabsize=&quot;&quot;&gt;
     gh owner language    repo  stars
0  pandas-dev   python  pandas  17800&lt;/pre&gt;</description>
<category>Python Interview Questions</category>
<guid isPermaLink="true">https://ask.ghassem.com/775/how-to-filter-a-dataframe</guid>
<pubDate>Wed, 25 Dec 2019 05:56:14 +0000</pubDate>
</item>
<item>
<title>What are basic steps for treating missing values?</title>
<link>https://ask.ghassem.com/430/what-are-basic-steps-for-treating-missing-values</link>
<description></description>
<category>Exploratory Data Analysis</category>
<guid isPermaLink="true">https://ask.ghassem.com/430/what-are-basic-steps-for-treating-missing-values</guid>
<pubDate>Fri, 19 Oct 2018 04:08:48 +0000</pubDate>
</item>
<item>
<title>What are the general steps in data cleaning?</title>
<link>https://ask.ghassem.com/427/what-are-the-general-steps-in-data-cleaning</link>
<description></description>
<category>General</category>
<guid isPermaLink="true">https://ask.ghassem.com/427/what-are-the-general-steps-in-data-cleaning</guid>
<pubDate>Fri, 19 Oct 2018 03:54:38 +0000</pubDate>
</item>
</channel>
</rss>