pyspark.pandas.DataFrame.kde

DataFrame.kde(bw_method=None, ind=None, **kwds)

Generate Kernel Density Estimate plot using Gaussian kernels.

Parameters

bw_method (scalar): The method used to calculate the estimator bandwidth. See KernelDensity in PySpark for more information.
ind (NumPy array or integer, optional): Evaluation points for the estimated PDF. If None (default), 1000 equally spaced points are used. If ind is a NumPy array, the KDE is evaluated at the points passed. If ind is an integer, that number of equally spaced points is used.
**kwds (optional): Keyword arguments to pass on to pyspark.pandas.Series.plot().

Returns

plotly.graph_objs.Figure: Returns a custom object when backend != plotly. Returns an ndarray when subplots=True (matplotlib only).

Examples

A scalar bandwidth should be specified. Using a small bandwidth value can lead to over-fitting, while using a large bandwidth value may result in under-fitting:

>>> import pyspark.pandas as ps
>>> s = ps.Series([1, 2, 2.5, 3, 3.5, 4, 5])
>>> s.plot.kde(bw_method=0.3)

>>> s = ps.Series([1, 2, 2.5, 3, 3.5, 4, 5])
>>> s.plot.kde(bw_method=3)  
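
To make the bandwidth trade-off concrete, the following is a minimal pure-Python sketch of a Gaussian kernel density estimate. It is an illustration only, not the pandas-on-Spark implementation: the helper name gaussian_kde_at is hypothetical, and the sketch assumes the scalar bandwidth is used directly as the standard deviation of each Gaussian kernel.

```python
import math


def gaussian_kde_at(data, points, bandwidth):
    """Evaluate a Gaussian KDE of `data` at each value in `points`.

    Assumption: `bandwidth` is the standard deviation of every kernel.
    """
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm
        for x in points
    ]


data = [1, 2, 2.5, 3, 3.5, 4, 5]
points = [1.5, 3.0]  # a point between two samples, and the centre of the sample

narrow = gaussian_kde_at(data, points, bandwidth=0.3)
wide = gaussian_kde_at(data, points, bandwidth=3)

# Ratio of the density at the centre to the density at x = 1.5.
# A large ratio means a bumpy curve; a ratio near 1 means a flat curve.
narrow_ratio = narrow[1] / narrow[0]
wide_ratio = wide[1] / wide[0]
print(narrow_ratio, wide_ratio)
```

With the small bandwidth the estimate varies sharply between nearby points and follows individual samples closely. With the large bandwidth the two densities are almost equal, so most of the structure in the data is smoothed away.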

The ind parameter determines the evaluation points for the plot of the estimated PDF:

>>> s = ps.Series([1, 2, 2.5, 3, 3.5, 4, 5])
>>> s.plot.kde(ind=[1, 2, 3, 4, 5], bw_method=0.3)  
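
The three accepted forms of ind can be sketched as a small helper. This is a hedged illustration, not library code: resolve_ind is a hypothetical name, and the interval [low, high] over which equally spaced points are generated is supplied by the caller here, because this page does not state which range the library uses.

```python
def resolve_ind(ind, low, high):
    """Return the evaluation points implied by `ind` over [low, high].

    Assumption: `low` and `high` are chosen by the caller; the range
    the library itself uses is not specified on this page.
    """
    if ind is None:
        ind = 1000  # documented default number of points
    if isinstance(ind, int):
        if ind == 1:
            return [float(low)]
        step = (high - low) / (ind - 1)
        return [low + i * step for i in range(ind)]
    # Otherwise treat `ind` as an explicit sequence of evaluation points.
    return list(ind)


default_points = resolve_ind(None, 1, 5)      # 1000 equally spaced points
five_points = resolve_ind(5, 1, 5)            # 5 equally spaced points
explicit_points = resolve_ind([1, 2, 3], 1, 5)  # used as given

print(len(default_points))  # 1000
print(five_points)          # [1.0, 2.0, 3.0, 4.0, 5.0]
print(explicit_points)      # [1, 2, 3]
```

Passing explicit points, as in the example above, gives full control over where the density is evaluated, at the cost of a coarser curve when only a few points are supplied.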

For DataFrame, it works in the same way as Series:

>>> df = ps.DataFrame({
...     'x': [1, 2, 2.5, 3, 3.5, 4, 5],
...     'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> df.plot.kde(bw_method=0.3)  

>>> df = ps.DataFrame({
...     'x': [1, 2, 2.5, 3, 3.5, 4, 5],
...     'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> df.plot.kde(bw_method=3)  

>>> df = ps.DataFrame({
...     'x': [1, 2, 2.5, 3, 3.5, 4, 5],
...     'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> df.plot.kde(ind=[1, 2, 3, 4, 5, 6], bw_method=0.3)  
