Show distinct column values in pyspark dataframe [message #1860697]
Sat, 26 August 2023 01:14
Eclipse User
With a PySpark DataFrame, how do I do the equivalent of Pandas' df['col'].unique()?
I want to list all the unique values in a PySpark DataFrame column.
Not the SQL way (registerTempTable, then a SQL query for the distinct values).
Also, I don't need groupBy followed by countDistinct; instead I want to check the distinct VALUES in that column.
Re: Show distinct column values in pyspark dataframe [message #1860708 is a reply to message #1860697]
Mon, 28 August 2023 07:29
Eclipse User
To list all the unique values in a PySpark DataFrame column, use df.select('col').distinct(). Note that distinct() is a DataFrame method, not a Column method, so df['col'].distinct() raises an error; you first select the column, then call distinct() on the resulting single-column DataFrame.
For example, the following code lists all the unique values in the col column of the df DataFrame:
df = spark.createDataFrame([('a', 1), ('b', 2), ('c', 3)], ['col', 'val'])
unique_values = df.select('col').distinct()
unique_values.show()
This code first creates a DataFrame called df with two columns, col and val. The col column contains the values a, b, and c; the val column contains the values 1, 2, and 3.
The code then calls df.select('col').distinct() to get a new DataFrame containing only the unique values of the col column, and finally displays them with show(). (Plain print(unique_values) would only print the DataFrame's schema, not its rows.) The output will look like the following, though row order is not guaranteed:
+---+
|col|
+---+
|  a|
|  b|
|  c|
+---+
This means that the col column of the df DataFrame has three unique values: a, b, and c.