DataFrame.dot
Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the values of another Series.
It can also be called using self @ other in Python >= 3.5.
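As a minimal sketch of how the operator form relates to the method form, the block below uses a hypothetical `TinyFrame` class (not part of pandas-on-Spark) that routes `@` to `dot` through Python's `__matmul__` hook. It only illustrates the language mechanism; the real DataFrame works on distributed data.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame: a list of numeric rows."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def dot(self, other):
        # Matrix-vector product: one output value per row.
        return [sum(a * b for a, b in zip(row, other)) for row in self.rows]

    def __matmul__(self, other):
        # Python evaluates `self @ other` by calling this method.
        return self.dot(other)


frame = TinyFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
vector = [1, 1, 2, 1]

via_method = frame.dot(vector)
via_operator = frame @ vector
```

Both calls go through the same code path, so `via_method` and `via_operator` hold the same values.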
Note
This method is based on an expensive operation due to the nature of big data. Internally, it needs to generate each row for each value and then group twice, which is a huge operation. To prevent misuse, this method applies the 'compute.max_rows' option as the default limit on input length and raises a ValueError when the input exceeds it.
>>> from pyspark.pandas.config import option_context
>>> with option_context(
...     'compute.max_rows', 1000, "compute.ops_on_diff_frames", True
... ):
...     psdf = ps.DataFrame({'a': range(1001)})
...     psser = ps.Series([2], index=['a'])
...     psdf.dot(psser)
Traceback (most recent call last):
  ...
ValueError: Current DataFrame has more then the given limit 1000 rows. Please set 'compute.max_rows' by using 'pyspark.pandas.config.set_option' to retrieve to retrieve more than 1000 rows. Note that, before changing the 'compute.max_rows', this operation is considerably expensive.
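To give a rough sense of why the operation is costly, the sketch below expands every row into one record per value and then aggregates per row. It is plain Python, and the function name `dot_via_long_format` and its data layout are assumptions made for illustration. It does not reproduce the actual pandas-on-Spark implementation.

```python
from collections import defaultdict


def dot_via_long_format(rows, columns, weights):
    """Illustrative only: rows is a list of row lists, columns the column
    labels, and weights a dict mapping each column label to a number."""
    # Step 1: explode each row into (row_id, column_label, value) records.
    # The intermediate data holds one record per cell of the input.
    long_records = [
        (row_id, label, value)
        for row_id, row in enumerate(rows)
        for label, value in zip(columns, row)
    ]
    # Step 2: multiply each record by the matching weight and sum per row.
    totals = defaultdict(int)
    for row_id, label, value in long_records:
        totals[row_id] += value * weights[label]
    return [totals[i] for i in range(len(rows))], len(long_records)


result, record_count = dot_via_long_format(
    [[0, 1, -2, -1], [1, 1, 1, 1]],
    [0, 1, 2, 3],
    {0: 1, 1: 1, 2: 2, 3: 1},
)
```

Here two rows of four values yield eight intermediate records, so the intermediate size grows with rows times columns, which is one way to see why a row limit is a sensible guard.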
Parameters
other : Series
The other object to compute the matrix product with.
Returns
Series
The matrix product between self and other, as a Series.
See also
Series.dot
Similar method for Series.
Notes
The dimensions of DataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.
The dot method for Series computes the inner product, rather than the matrix product computed here.
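The two notes above can be sketched in plain Python. The helpers `matrix_dot` and `inner_product` are hypothetical names, and raising ValueError on mismatched labels is an assumption of this sketch, not a statement about the library's exact behavior. The point is that values are matched by label, so the order of `other` does not matter, and that the inner product of two labeled vectors is a single scalar.

```python
def matrix_dot(rows, columns, other):
    """rows: list of row lists; columns: column labels;
    other: dict mapping index label -> value."""
    if set(columns) != set(other):
        raise ValueError("column labels and index labels must match")
    # Each weight is looked up by label, so the order of `other` is irrelevant.
    return [sum(v * other[c] for c, v in zip(columns, row)) for row in rows]


def inner_product(left, right):
    """left, right: dicts mapping label -> value; returns one scalar."""
    if set(left) != set(right):
        raise ValueError("index labels must match")
    return sum(left[k] * right[k] for k in left)


rows = [[0, 1, -2, -1], [1, 1, 1, 1]]
columns = [0, 1, 2, 3]
weights = {0: 1, 1: 1, 2: 2, 3: 1}
shuffled = {1: 1, 0: 1, 2: 2, 3: 1}  # same labels, different order

matrix_result = matrix_dot(rows, columns, weights)
shuffled_result = matrix_dot(rows, columns, shuffled)
scalar_result = inner_product(weights, shuffled)
```

The matrix product returns one value per row, while the inner product collapses two aligned vectors into a single number.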
Examples
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psdf = ps.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> psser = ps.Series([1, 1, 2, 1])
>>> psdf.dot(psser)
0   -4
1    5
dtype: int64
Note how shuffling of the objects does not change the result.
>>> psser2 = psser.reindex([1, 0, 2, 3])
>>> psdf.dot(psser2)
0   -4
1    5
dtype: int64
>>> psdf @ psser2
0   -4
1    5
dtype: int64
>>> reset_option("compute.ops_on_diff_frames")