Python implementation of cost function in logistic regression: why dot multiplication in one expression but element-wise multiplication in another

Data Science: Asked by GhostRider on July 27, 2021

I have a very basic question which relates to Python, numpy and multiplication of matrices in the setting of logistic regression.

First, let me apologise for not using math notation.

I am confused about the use of matrix dot multiplication versus element-wise multiplication. The cost function is given by:

$J = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log(a^{(i)}) + (1 - y^{(i)})\log(1-a^{(i)}) \right]$

And in python I have written this as

    cost = -1/m * np.sum(Y * np.log(A) + (1-Y) * (np.log(1-A)))

But, for example, this expression (the first one, the derivative of J with respect to w)

$\frac{\partial J}{\partial w} = \frac{1}{m} X(A-Y)^T$

$\frac{\partial J}{\partial b} = \frac{1}{m} \sum\limits_{i=1}^m (a^{(i)} - y^{(i)})$

is

    dw = 1/m * np.dot(X, dz.T)

I don't understand why it is correct to use dot multiplication in the above, but element-wise multiplication in the cost function, i.e. why not:

    cost = -1/m * np.sum(np.dot(Y, np.log(A)) + np.dot(1-Y, np.log(1-A)))

I fully get that this is not elaborately explained, but I am guessing the question is so simple that anyone with even basic logistic regression experience will understand my problem.

3 Answers

In this case, the two math formulae show you the correct type of multiplication:

  • $y_i$ and $\log(a_i)$ in the cost function are scalar values. Composing the scalar values into a given sum over each example does not change this, and you never combine one example's values with another in this sum. So each element of $y$ only interacts with its matching element in $a$, which is basically the definition of element-wise.

  • The terms in the gradient calculation are matrices, and if you see two matrices $A$ and $B$ multiplied using notation like $C = AB$, then you can write this out as a more complex sum: $C_{ik} = \sum_j A_{ij}B_{jk}$. It is this inner sum across multiple terms that np.dot is performing; the short numpy sketch below illustrates the difference.
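
To see the two operations side by side, here is a minimal numpy sketch (the array values are made up purely for illustration, using the (1, m) row-vector convention that appears later in this answer):

    import numpy as np

    # Three training examples, row-vector convention: shape (1, 3).
    Y = np.array([[1.0, 0.0, 1.0]])
    A = np.array([[0.9, 0.2, 0.7]])

    # Element-wise: each y_i only meets its matching log(a_i).
    elementwise = Y * np.log(A)             # shape (1, 3)

    # Dot product: numpy performs the inner sum C_ik = sum_j A_ij * B_jk for you.
    inner_sum = np.dot(np.log(A), Y.T)      # shape (1, 1)

    # Summing the element-wise product gives the same scalar as the dot product here.
    print(np.sum(elementwise), inner_sum[0, 0])   # both approximately -0.462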

In part your confusion stems from the vectorisation that has been applied to equations in the course materials, which are looking forward to more complex scenarios. You could in fact use

    cost = -1/m * np.sum(np.multiply(np.log(A), Y) + np.multiply(np.log(1-A), (1-Y)))

or

    cost = -1/m * np.sum(np.dot(np.log(A), Y.T) + np.dot(np.log(1-A), (1-Y).T))

provided Y and A have shape (1,m), and it should give the same result. NB the np.sum in the second version is just flattening a (1,1) result to a single value, so you could drop it and index [0,0] on the end instead. However, this does not generalize to other output shapes (m, n_outputs), so the course does not use it.
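
A quick sanity check of that equivalence, with made-up values and the (1, m) row-vector shapes assumed above:

    import numpy as np

    m = 4
    Y = np.array([[1.0, 0.0, 1.0, 0.0]])   # labels, shape (1, m)
    A = np.array([[0.8, 0.3, 0.6, 0.1]])   # predicted probabilities, shape (1, m)

    # Element-wise version, as written in the question.
    cost_elementwise = -1/m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))

    # np.multiply version from this answer.
    cost_multiply = -1/m * np.sum(np.multiply(np.log(A), Y) + np.multiply(np.log(1 - A), 1 - Y))

    # np.dot version: each dot yields a (1, 1) array, so index [0, 0] instead of summing.
    cost_dot = -1/m * (np.dot(np.log(A), Y.T) + np.dot(np.log(1 - A), (1 - Y).T))[0, 0]

    print(cost_elementwise, cost_multiply, cost_dot)   # all three should match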

Correct answer by Neil Slater on July 27, 2021

Are you asking what the difference is between a dot product of two vectors and summing their element-wise product? They are the same: np.sum(X * Y) is np.dot(X, Y). The dot version would generally be more efficient and easier to understand.

But in the cost function, $Y$ is a matrix, not a vector. np.dot actually computes a matrix product, and the sum of those elements is not the same as the sum of the elements of the pairwise product. (The multiplication isn't even going to be defined for the same cases.)

So I guess the answer is that they are different operations doing different things, these situations are different, and the main difference is whether you are dealing with vectors or matrices.
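
A small sketch of that difference (toy arrays, invented purely for the example):

    import numpy as np

    # 1-D vectors: the dot product and the summed element-wise product are the same thing.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([0.5, 0.1, 0.2])
    print(np.sum(x * y), np.dot(x, y))      # 1.3 and 1.3

    # 2-D matrices: np.dot is a matrix product, and summing it is something else entirely.
    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    Y = np.array([[0.5, 0.1], [0.2, 0.3]])
    print(np.sum(X * Y))                     # sum of pairwise products: 2.5
    print(np.sum(np.dot(X, Y)))              # sum of the matrix product: 5.4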

Answered by Sean Owen on July 27, 2021

With regards to "In the OP's case np.sum(a * y) is not going to be same as np.dot(a, y) because a and y are column vectors shape (m,1), so the dot function will raise an error. "...

(I don't have enough kudos to comment using the comment button, but I thought I would add...)

If the vectors are row vectors with shape (1,m), a common pattern is that the second operand of the dot function is postfixed with a ".T" operator to transpose it to shape (m,1), and then the dot product works out as a (1,m).(m,1). e.g.

np.dot(np.log(1-A), (1-Y).T)

The common value for m enables the dot product (matrix multiplication) to be applied.

Similarly, for column vectors one would see the transpose applied to the first operand, e.g. np.dot(w.T, X), to put the dimension that is >1 in the 'middle'.

The pattern to get a scalar from np.dot is to arrange the two vectors' shapes so that the '1' dimension is on the 'outside' and the common >1 dimension is on the 'inside':

(1,X).(X,1), i.e. np.dot(V1, V2) where V1 has shape (1,X) and V2 has shape (X,1).

So the result is a (1,1) matrix, i.e. a scalar.
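
As a rough sketch of that shape pattern (array contents are arbitrary, and the shapes just mirror the pattern described above):

    import numpy as np

    m = 3
    A = np.random.rand(1, m)                                 # row vector, shape (1, m)
    Y = np.random.randint(0, 2, size=(1, m)).astype(float)   # row vector, shape (1, m)

    # (1, m) . (m, 1) -> (1, 1): the shared dimension m sits on the 'inside'.
    term = np.dot(np.log(1 - A), (1 - Y).T)
    print(term.shape)        # (1, 1)
    print(term[0, 0])        # the scalar value inside

    # Same idea for np.dot(w.T, X): the transpose puts the >1 dimension in the 'middle'.
    w = np.random.rand(4, 1)     # column vector of weights, shape (n_features, 1)
    X = np.random.rand(4, m)     # shape (n_features, m)
    print(np.dot(w.T, X).shape)  # (1, m)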

Answered by Gordon Hutchison on July 27, 2021
