Ask Ghassem - Recent questions tagged dataframe

How do I know which encoder to use to convert from categorical variables to numerical?

Mon, 29 Nov 2021 04:09:06 +0000

So say I have a column with categorical data, such as temperature labels: 'Lukewarm', 'Hot', 'Scalding', 'Cold', 'Frostbite', etc.

I know that we can use pd.get_dummies to convert the column to numerical data within the dataframe, but I also know that there are other 'converters' (not sure if that's the correct terminology) that we can use, e.g. OneHotEncoder from scikit-learn (for instance, I could use the pipeline module to build a pipeline and feed my dataframe through it so that my categorical data also gets encoded as numerical).

How do I know which to use? Does it matter? If it does matter, when does it matter the most (i.e. what types of problems? When there are lots of categorical variables, or few?) If anyone can give me any pointers on this type of stuff I'd greatly appreciate it.
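One way to see where the choice matters is to encode data that contains a category missing from the training set. The sketch below is only an illustration: the two small frames and the ordering of the temperature labels are assumptions, not data from the question. It shows that pd.get_dummies builds its columns from whatever values are present in the frame it receives, while a fitted OneHotEncoder reuses the categories it learned. OrdinalEncoder is included as a third option for labels that have a natural order.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical training data and new data (labels taken from the question).
train = pd.DataFrame({"temp": ["Cold", "Hot", "Lukewarm", "Hot"]})
new = pd.DataFrame({"temp": ["Hot", "Scalding"]})  # "Scalding" never appears in train

# pd.get_dummies derives its columns from the values present in each frame,
# so the two results end up with different columns.
train_dummies = pd.get_dummies(train["temp"])
new_dummies = pd.get_dummies(new["temp"])

# OneHotEncoder learns the categories during fit and reuses them in transform.
# With handle_unknown="ignore", an unseen label becomes an all-zero row.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["temp"]])
new_encoded = ohe.transform(new[["temp"]]).toarray()

# For ordered labels, OrdinalEncoder maps each label to its position
# in an explicit ordering (the ordering below is an assumption).
order = ["Frostbite", "Cold", "Lukewarm", "Hot", "Scalding"]
ordinal = OrdinalEncoder(categories=[order])
codes = ordinal.fit_transform(train[["temp"]])

print(list(train_dummies.columns))
print(list(new_dummies.columns))
print(new_encoded)
print(codes.ravel())
```

In this sketch the fitted encoder keeps the same output width for training and new data, which is usually what a model pipeline needs. pd.get_dummies is convenient for one-off exploration of a single dataframe.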

Terminology clarification in Spark

Sat, 06 Feb 2021 18:03:32 +0000

I have a hard time distinguishing the terminology of SparkSQL. While SparkSQL is quite flexible in terms of abstraction layers, it's really difficult for a beginner to navigate those options.

1. When we say "using SparkSQL to perform ...", does it mean that we can use any API/abstraction layer, such as Scala, Python, or HiveQL, to query? As long as the core dataframe is in Spark, we should be fine?

2. Can we manipulate data in both PySpark and Scala sequentially?

For example, may I clean up the data in Scala, then perform follow-up manipulation in PySpark, then go back to Scala?

3. As demonstrated in the tutorial, we can query with a SQL command by using the API spark.sql("My SQL command"). Does it count as SQL or Spark?

How to filter a dataframe?

Wed, 25 Dec 2019 05:56:14 +0000

Consider the pandas DataFrame df below. Filter it so that it outputs the result shown.

     gh owner language      repo  stars
0  pandas-dev   python    pandas  17800
1   tidyverse        R     dplyr   2800
2   tidyverse        R   ggplot2   3500
3      has2k1   python  plotnine   1450

Expected Output

     gh owner language    repo  stars
0  pandas-dev   python  pandas  17800
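Several filters produce the expected row. The sketch below rebuilds the frame from the table above and shows three of them. The thresholds (10000 and 5000 stars) are assumptions chosen only because they isolate the pandas row, since the question does not say which condition is intended.

```python
import pandas as pd

# Rebuild the dataframe shown in the question.
df = pd.DataFrame({
    "gh owner": ["pandas-dev", "tidyverse", "tidyverse", "has2k1"],
    "language": ["python", "R", "R", "python"],
    "repo": ["pandas", "dplyr", "ggplot2", "plotnine"],
    "stars": [17800, 2800, 3500, 1450],
})

# Option 1: a single boolean mask on a numeric column (threshold is an assumption).
by_stars = df[df["stars"] > 10000]

# Option 2: combine conditions with & and wrap each condition in parentheses.
by_lang_and_stars = df[(df["language"] == "python") & (df["stars"] > 5000)]

# Option 3: the same idea written as a query string.
by_query = df.query("repo == 'pandas'")

print(by_stars)
```

All three keep the original index label 0, matching the expected output.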

How to reshape a pandas dataframe?

Fri, 05 Apr 2019 13:41:30 +0000

The dataframe looks like the one below.

I have a dataframe like the one above, and I want to reshape columns a~t into a single column, i.e. reshape(a~t, 1).

I want to reshape the dataframe as shown below (the values of columns b~t go under column a):

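날짜 역번호 역명 구분 a
2018-01-01 150 서울역 승차 379
2018-01-01 150 서울역 승차 287
2018-01-01 150 서울역 승차 371
2018-01-01 150 서울역 승차 876
2018-01-01 150 서울역 승차 965
....
2008-01-01 152 종각 승차 2920
2008-01-01 152 종각 승차 2290
2008-01-01 152 종각 승차 802
2008-01-01 152 종각 승차 1559

(Column names: 날짜 = date, 역번호 = station number, 역명 = station name, 구분 = category. Values: 서울역 = Seoul Station, 종각 = Jonggak, 승차 = boarding.)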

Something like df = df.reshape(len(data2) * (number of columns a~t), 1).

I tried pd.melt, but it does not work well.

df2 = pd.melt(df, id_vars=["날짜", "역번호", "역명", "구분"], value_name="t")

This removes columns b~t, but I want the values of b~t placed under column a.

The dataset is at https://drive.google.com/file/d/1Upb5PgymkPB5TXuta_sg6SijwzUuEkfl/view?usp=sharing
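A possible approach, sketched on a small stand-in frame rather than the linked dataset: melt the value columns, then sort by the original row position so that each row's values stay together in a, b, c order. The frame below has only three value columns (a, b, c), and the assignment of the sample numbers to those columns is an assumption. The names "source" and "value" are also illustrative. Note that recent pandas versions reject a value_name that matches an existing column, so a neutral name is used here.

```python
import pandas as pd

# Small hypothetical stand-in for the dataset: value columns a, b, c only.
df = pd.DataFrame({
    "날짜": ["2018-01-01", "2008-01-01"],
    "역번호": [150, 152],
    "역명": ["서울역", "종각"],
    "구분": ["승차", "승차"],
    "a": [379, 2920],
    "b": [287, 2290],
    "c": [371, 802],
})

id_cols = ["날짜", "역번호", "역명", "구분"]

long_df = (
    df.reset_index()  # keep the original row position in a column named "index"
      .melt(id_vars=["index"] + id_cols, var_name="source", value_name="value")
      .sort_values("index", kind="stable")  # stable sort keeps a, b, c order within each row
      .drop(columns="index")
      .reset_index(drop=True)
)

print(long_df)
```

The "source" column records which original column each value came from. It can be dropped, and "value" renamed to "a", if only the stacked column is needed.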