pyspark.pandas.get_dummies

pyspark.pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) → DataFrame
Convert a categorical variable into dummy/indicator variables, also known as one-hot encoding.
Parameters
data : array-like, Series, or DataFrame
    Data of which to get dummy indicators.
prefix : str, list of str, or dict of str, default None
    String to append to DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes (see the sketch under Examples below).
prefix_sep : str, default '_'
    If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
dummy_na : bool, default False
    Add a column to indicate NaNs; if False, NaNs are ignored (see the sketch at the end of Examples).
columns : list-like, default None
    Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
sparse : bool, default False
    Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False). In pandas-on-Spark, this value must be False.
drop_first : bool, default False
    Whether to get k-1 dummies out of k categorical levels by removing the first level.
dtype : dtype, optional
    Data type for new columns. Only a single dtype is allowed.

Returns
dummies : DataFrame
See also
Series.str.get_dummies
Examples
>>> s = ps.Series(list('abca'))
>>> ps.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0

>>> df = ps.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
...                    'C': [1, 2, 3]},
...                   columns=['A', 'B', 'C'])

>>> ps.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1
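The dictionary form of prefix (and a custom prefix_sep), described under Parameters, is not exercised by the examples on this page. The following sketch is illustrative only: the column names 'A', 'B' and the prefixes 'colA', 'colB' are made up, and the comments describe the columns expected under pandas-style get_dummies semantics rather than verbatim output.

>>> import pyspark.pandas as ps
>>> df2 = ps.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c']})
>>> # Map each encoded column to its own prefix and join with '-'
>>> # instead of the default '_'; the result should carry the columns
>>> # colA-a, colA-b, colB-a, colB-b, colB-c.
>>> ps.get_dummies(df2, prefix={'A': 'colA', 'B': 'colB'}, prefix_sep='-')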
>>> ps.get_dummies(ps.Series(list('abcaa')))
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  1  0  0

>>> ps.get_dummies(ps.Series(list('abcaa')), drop_first=True)
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0

>>> ps.get_dummies(ps.Series(list('abc')), dtype=float)
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
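dummy_na and columns are likewise not shown above. A minimal sketch, assuming pandas-style behaviour for the NaN indicator column and for pass-through of unencoded columns; the data is made up and the comments describe the expected result rather than verbatim output.

>>> import pyspark.pandas as ps
>>> # dummy_na=True adds an extra indicator column for missing values,
>>> # so a Series containing None yields the columns a, b and NaN.
>>> ps.get_dummies(ps.Series(['a', 'b', None, 'a']), dummy_na=True)
>>> # columns limits encoding to the listed DataFrame columns:
>>> # 'B' and 'C' pass through unchanged while 'A' becomes A_a and A_b.
>>> df3 = ps.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
>>> ps.get_dummies(df3, columns=['A'])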