Comments:
The dot product is divided by the square root to normalize the variance, since the sum of products in a dot product accumulates variance with each term.
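A minimal numerical sketch of the point this comment makes, assuming the entries of the query and key vectors are independent with mean 0 and variance 1; the dimension 512 and the NumPy code are just for illustration, not from the video:

```python
import numpy as np

# If q and k have independent entries with mean 0 and variance 1,
# their dot product is a sum of d products, so its variance grows to ~d.
# Dividing by sqrt(d) brings the variance back to ~1 before the softmax.
d = 512
rng = np.random.default_rng(0)
q = rng.standard_normal((10_000, d))
k = rng.standard_normal((10_000, d))

raw = np.einsum("nd,nd->n", q, k)   # 10,000 plain dot products
scaled = raw / np.sqrt(d)           # scaled dot products

print(raw.var())     # close to d, i.e. ~512
print(scaled.var())  # close to 1
```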
Great teaching videos. I am lost at 28:40 here because it is not explained why and what the value matrix does. It is sort of popped in magically, whereas up to then we were taught that adding the vectors of the same line in the attention array already „moved“ the embedded vector to a better place. IMHO, there is a kind of jump or void I can't bridge.
This is for math-dimwits like myself! 😃
Excellent video. Kudos to you sir
@SerranoAcademy Amazing video, I just have one question that I cannot wrap my head around. Why do we need a Value matrix exactly? I mean, why is the transformed matrix created by the Key and Query matrices not sufficient to perform the attention on? What is the problem with it? I don't get it when you say it is optimized for finding the next word; how is it optimized and better than the product of K*Q?
I am still not sure how, on the first read of a sentence, it will know similarity; it has not been trained yet?
Why do you multiply orange by the key matrix and phone by the query matrix? Why don't you do the opposite, I mean multiply phone by the key matrix and orange by the query matrix?
I actually think the gravity picture is quite misleading since, in some cases, it might be that a word that is the furthest away from something has the most substantial influence.
Amazing! Thank you!
Many many thanks! This is the best explanation I found!
Why divide it by sqrt(2)? Should it not be divided by the norm of the vector?
Respectfully, and fully out of context: that is not a drawing of an orange, that's a drawing of a pomegranate.
Is there an English version of your podcast with Omar Florez?
That part where you say you have been lying is really painful, I must say.
The way you explain is great.
Nice
Best explanation I have seen
Very nice and clear explanation, thanks a lot.
Wonderful explanation. You should write a textbook about the subject. It will be very successful.
Thank you so much, sir. It is really helpful. 😊
This video uncovers so many nuanced concepts it's worth multiple viewings. What a superb segue into what the softmax function is and "why" we may need it. What a gift. Thank you ❤
It would have been easier to go into the intuition behind the softmax attention equations and how the attention weights change the embeddings before diving into the example math, but nonetheless a great video on the math.
pika pika
I am a tech student, it is really amazing.
If I don't understand something in deep learning I go to Josh Starmer or Luis Serrano. Josh also has a great intro on this topic, however this round goes to Luis.
This is the best lead-up video to attention mechanisms, full stop. I'm still a little hazy on the QKV manipulation towards the end, but I hope repeated viewings will make things clearer.
Cheers
Really good way to explain the attention mechanism. Making all the pictures must have taken some time. Thanks for the effort.
Maybe I am wrong, but it seems to me that you motivate the linear transformations (the key and query matrices) by saying that the word apple gets a new embedding that will have a higher cosine similarity with the word phone when trained on some corpus of text? I suppose you can train the weights of the model on a corpus where you have not only sentences with apple and phone but also sentences with apple and orange. In that case I suspect that the matrices are just projections from, say, 512 dimensions to, say, 64 dimensions, and are motivated only by a reduction of the computational cost when calculating the weighted sum of the elements of the value projection?
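A small sketch of the shapes this comment speculates about, using the 512-to-64 projection it mentions; the names W_q, W_k, W_v and the softmax details are assumptions for illustration, not taken from the video:

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_head, n_tokens = 512, 64, 10      # illustrative sizes from the comment
rng = np.random.default_rng(0)

X = rng.standard_normal((n_tokens, d_model))                      # token embeddings
W_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)   # query projection
W_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)   # key projection
W_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)   # value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each has shape (n_tokens, 64)
scores = Q @ K.T / np.sqrt(d_head)           # (n_tokens, n_tokens) similarities
weights = softmax(scores)                    # attention weights, rows sum to 1
output = weights @ V                         # weighted sum of the value vectors

print(Q.shape, scores.shape, output.shape)   # (10, 64) (10, 10) (10, 64)
```

Whether the smaller dimension is mainly about compute or also about learning a useful similarity space is exactly the question the comment raises; the sketch only shows that the shapes work out either way.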
Thank you Serrano Academy for teaching this bullshit!
Still not clear on how the Query, Key and Value matrices are populated. Where do the numbers come from? Why are they called Query, Key and Value?
Very good explanation, thank you.
Best in class, wonderful and useful video, THANK YOUUU 💯💯👏👏🤩🤩. Only a tip and observation: I noticed that when you say "the length of the vector" at 11:13, it should be the dimension of the vector, because the length would be the square root of the sum of the squared components... the Euclidean norm, right?
Thank you so much, very well explained.
This is an amazing video, probably the best video explaining the intuition behind the attention mechanism. Thank you for being awesome :D
This has a lot of clarity.
Great video
Top notch explanations, very clear with visuals 👍 🎉
Very pedagogical 👍
Thanks!! 🙏
Thank you so much!!! I was struggling with the concept of attention and this video came in like a savior.
This 3b1b guy gets much more attention than he deserves; he makes nice animations, but in reality explains very little, and not very useful things. This is the best video I have seen that I can recommend to beginners, thank you!
As usual, thanks a lot for clarifying all this stuff in simple words.
This explanation series is more concrete than the 3Blue1Brown ones.
Hey, aren't the key, query and value matrices supposed to be of d*1 dimension? Assuming each input embedding is a d*1 vector, then the weight matrix is of d*d dimensions. Why is it that your key, query and value matrices for a single word embedding are of dimension d*x? I dunno what x is here.
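A short note on the shapes this question is about, under the usual convention (the sizes d = 512 and d_k = 64 are illustrative; d_k is the unknown "x" in the question): the key, query and value matrices map a d-dimensional embedding to a d_k-dimensional vector, so they have shape d*d_k rather than d*1 or d*d.

```python
import numpy as np

d, d_k = 512, 64                     # embedding size and projection size (illustrative)
rng = np.random.default_rng(0)

x = rng.standard_normal(d)           # one word embedding, shape (d,)
W_k = rng.standard_normal((d, d_k))  # key matrix, shape (d, d_k)

key = x @ W_k                        # that word's key vector, shape (d_k,)
print(key.shape)                     # (64,)
```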
The BEST and MOST intuitive explanation on the attention mechanism. Period.
ОтветитьHello all! In the video I made a comment about how the Key and Query matrices capture low and high level properties of the text. After reading some of your comments, I've realized that this is not true (or at least there's no clear reason for it to be true), and probably something I misunderstood while reading in different places in the literature and threads.
Apologies for the error, and thank you to all who pointed it out! I've removed that part of the video.