The math behind Attention: Keys, Queries, and Values matrices

Serrano.Academy

1 year ago

316,720 views

Comments:

@agnelomascarenhas8990 - 07.12.2024 21:23

The dot product is divided by the square root (of the dimension) to normalize the variance: summing the products in a dot product accumulates variance.

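To make the point above concrete, here is a minimal NumPy sketch (toy dimensions, not from the video) showing that the variance of a raw dot product grows with the dimension, and that dividing by the square root of the dimension brings it back to roughly 1:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 512                                   # illustrative embedding dimension
    q = rng.standard_normal((10_000, d))      # 10,000 random "query" vectors
    k = rng.standard_normal((10_000, d))      # 10,000 random "key" vectors

    raw = (q * k).sum(axis=1)                 # raw dot products: variance is about d
    scaled = raw / np.sqrt(d)                 # scaled dot products: variance is about 1

    print(raw.var(), scaled.var())            # roughly 512 vs roughly 1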
@yvesbernas1772 - 09.12.2024 17:57

Great teaching videos. I am lost at 28:40 here because it is not explained why the value matrix is there and what it does. It sort of pops in magically, whereas up to then we were taught that adding the vectors of the same line in the attention array already "moved" the embedded vector to a better place. IMHO, there is a kind of jump or void I can't bridge.

@me5920-i3f - 15.12.2024 14:32

This is for math-dimwits like myself! 😃

@Democracy_Manifest - 17.12.2024 14:55

Excellent video. Kudos to you sir

@k.i.a7240 - 26.12.2024 20:27

@SerranoAcademy Amazing video, I just have one question that I cannot wrap my head around. Why do we need a Value matrix exactly? I mean, why is the transformed matrix created by the Key and Query matrices not sufficient to perform the attention on? What is the problem with it? I don't get it when you say it is optimized for finding the next word; how is it optimized and better than the product of K*Q?

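For what it's worth, one way to see it in code: the Keys and Queries only produce the attention weights (how much each word should listen to each other word), while the Values are the vectors that actually get averaged together. The sketch below is a generic single-head attention in NumPy with made-up names and shapes, not the video's exact notation; without a separate Value projection the output would have to live in the same space that was tuned for measuring similarity, whereas W_V lets the model learn one space for matching and another for what gets passed along.

    import numpy as np

    def attention(X, W_Q, W_K, W_V):
        # X: (n_words, d_model) embeddings; W_*: learned projection matrices
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(K.shape[1])         # similarity of every word pair
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
        return weights @ V                             # weighted average of value vectors

    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 4))                    # 3 words, 4-dimensional embeddings (toy)
    W_Q = rng.standard_normal((4, 4))
    W_K = rng.standard_normal((4, 4))
    W_V = rng.standard_normal((4, 4))
    print(attention(X, W_Q, W_K, W_V).shape)           # (3, 4): one updated vector per word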
@chaseg8888 - 06.01.2025 05:14

I am still not sure how, on the first read of a sentence, it will know similarity; it has not been trained yet?

@NillaNewyork - 09.01.2025 03:31

Why do you multiply orange by the key matrix and phone by the query matrix? Why don't you do the opposite, i.e. multiply phone by the key matrix and orange by the query matrix?

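If it helps: in self-attention every word is multiplied by both matrices, so nothing is decided by which word "goes to" which matrix; both orderings appear in the score table. A toy NumPy sketch (made-up 4-dimensional embeddings, not the video's numbers):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((2, 4))       # rows: toy embeddings for "orange" and "phone"
    W_Q = rng.standard_normal((4, 4))
    W_K = rng.standard_normal((4, 4))

    Q, K = X @ W_Q, X @ W_K               # every word gets BOTH a query row and a key row
    scores = Q @ K.T                      # scores[i, j] = (word i as query) . (word j as key)
    print(scores)                         # scores[0, 1] and scores[1, 0] both exist, and they
                                          # differ because W_Q and W_K are different matrices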
@henrik3141 - 09.01.2025 06:47

I actually think the gravity picture is quite misleading since, in some cases, it might be that a word that is the furthest away from something has the most substantial influence.

@FlippinFunFlips - 09.01.2025 07:06

Amazing! Thank you!

@amineazaiez2285 - 10.01.2025 09:11

Many, many thanks! This is the best explanation I have found!

@jvdp9660 - 11.01.2025 12:28

Why divide it by sqrt(2)? Should it not be divided by the norm of the vector?

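A possible way to see the difference (toy numbers, not from the video): dividing by the vectors' norms would give cosine similarity, which rescales every pair differently, whereas the attention formula divides all scores by the same constant sqrt(d_k), which is sqrt(2) here presumably because the example embeddings are 2-dimensional. The constant keeps the typical size of the scores roughly independent of the dimension rather than normalizing any particular vector.

    import numpy as np

    a = np.array([1.0, 3.0])                                  # made-up 2-dimensional vectors
    b = np.array([2.0, 1.0])

    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # divide by the norms
    scaled = a @ b / np.sqrt(len(a))                          # divide by sqrt(d); d = 2 here

    print(cosine, scaled)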
@anonymous_a - 29.01.2025 20:55

Respectfully, and fully out of context: that is not a drawing of an orange, that's a drawing of a pomegranate.

@joelsabiti4828 - 05.02.2025 10:43

Is there an English version of your podcast with Omar Florez?

@joelsabiti4828 - 05.02.2025 12:35

That part where you say you have been lying is really painful, I must say.

@Matteo-o1m4m - 15.02.2025 12:45

The way you explain is great.

@sunksun - 16.02.2025 07:15

nice

@jingqianli6392 - 23.02.2025 03:22

best explanation I have seen

@shaycorvo4290 - 25.02.2025 23:46

Very nice and clear explanation. Thanks a lot.

@hanytadros3333 - 26.02.2025 01:19

Wonderful explanation. You should write a textbook about the subject. It will be very successful.

@its-itish - 01.03.2025 16:18

Thank you so much, Sir. It is really helpful. 😊

@behrampatel3563 - 01.03.2025 16:49

This video uncovers so many nuanced concepts that it's worth multiple viewings. What a superb segue to what the softmax function is and "why" we may need it. What a gift. Thank you ❤

@fa7234 - 02.03.2025 23:27

It would have been easier to go over the intuition behind the softmax attention equations, and how the attention weights change the embeddings, before diving into the example math, but nonetheless a great video on the math.

@43SunSon - 04.03.2025 22:23

pika pika

@abhijitsarkar3589 - 06.03.2025 06:38

I am a tech student; it is really amazing.

@behrampatel4872 - 06.03.2025 18:06

If I don't understand something in deep learning I go to Josh Starmer or Luis Serrano. Josh also has a great intro on this topic; however, this round goes to Luis.
This is the best lead-up video to attention mechanisms, full stop. I'm still a little hazy on the QKV manipulation towards the end, but I hope repeated viewings make things clearer.
Cheers

@harshavr - 07.03.2025 23:23

Really good way to explain the attention mechanism. Making all the pictures must have taken some time. Thanks for the effort.

@jakobmller7465 - 08.03.2025 08:01

Maybe I am wrong, but it seems to me that you motivate the linear transformations (the keys and queries matrices) by saying that the word apple gets a new embedding that will have a higher cosine similarity with the word phone when trained on some corpus of text? I suppose you can train the weights of the model on a corpus that has not only sentences with apple and phone but also sentences with apple and orange. In that case I suspect that the matrices are just projections from, say, 512 dimensions to, say, 64 dimensions, and are only motivated by a reduction of the computational cost when calculating the weighted sum of the elements of the value projection?

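On the dimensionality point: in the original Transformer paper the projections do map d_model = 512 down to d_k = 64 per head, but they are also trained, so they are not only a cost reduction; they learn which directions of the embedding space should count as "similar" for a particular attention head. A shape-only sketch (random matrices standing in for trained ones):

    import numpy as np

    rng = np.random.default_rng(2)
    n, d_model, d_k = 6, 512, 64                       # 6 words; 512 -> 64 as in one Transformer head

    X   = rng.standard_normal((n, d_model))
    W_Q = rng.standard_normal((d_model, d_k)) * 0.02   # trained in a real model; random here
    W_K = rng.standard_normal((d_model, d_k)) * 0.02
    W_V = rng.standard_normal((d_model, d_k)) * 0.02

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    print(Q.shape, K.shape, V.shape)                   # (6, 64) each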
@rajaramanathan8384 - 13.03.2025 03:43

thank you serrano academy for teaching this bullshit!

@yutuver9327 - 14.03.2025 12:34

Still not clear on how the Query, Key and Value matrices are populated. Where do the numbers come from? Why are they called Query, Key and Value?

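Briefly, and this is the usual framing rather than anything specific to this video: the three matrices start out as random numbers and are learned by backpropagation along with the rest of the model; the names come from a lookup analogy, where a query is what a word is asking about, a key is what each word offers to be matched against, and a value is what gets passed along once a match is scored. A minimal PyTorch sketch of that setup (names and sizes are illustrative):

    import torch
    import torch.nn as nn

    d_model, d_k = 512, 64
    W_Q = nn.Linear(d_model, d_k, bias=False)   # starts as random numbers,
    W_K = nn.Linear(d_model, d_k, bias=False)   # then gets updated by backpropagation
    W_V = nn.Linear(d_model, d_k, bias=False)   # together with the rest of the model

    x = torch.randn(6, d_model)                 # 6 word embeddings
    Q, K, V = W_Q(x), W_K(x), W_V(x)            # queries, keys and values are just these products
    print(Q.shape, K.shape, V.shape)            # torch.Size([6, 64]) each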
@ultimateblue4568 - 14.03.2025 22:17

very good explanation, thank you

@MrVdennis - 18.03.2025 13:56

Best in class, wonderful and useful video, THANK YOUUU 💯💯👏👏🤩🤩. Only a tip and observation: I noticed that at 11:13 you say "the length of the vector"; it should be the dimension of the vector, because the length would be the square root of the sum of the squared components, i.e. the Euclidean norm, right?

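For readers following the comment above, the distinction in one toy example (vector chosen arbitrarily):

    import numpy as np

    v = np.array([3.0, 4.0])
    print(len(v))               # 2   -> the dimension: how many components the vector has
    print(np.linalg.norm(v))    # 5.0 -> the length: the Euclidean norm, sqrt(3**2 + 4**2)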
@chafikboularak2894 - 04.04.2025 17:25

Thank you so much, very well explained.

@gkcs - 08.04.2025 06:35

This is an amazing video, probably the best video explaining the intuition behind the attention mechanism. Thank you for being awesome :D

@louies89 - 09.04.2025 22:55

This has a lot of clarity.

@adarshsunil - 16.04.2025 21:55

Great video

@percevalzinzin5983 - 17.04.2025 13:38

Top notch explanations, very clear with visuals 👍 🎉

@percevalzinzin5983 - 19.04.2025 13:58

Very pedagogical 👍

@sahil_shrma - 25.04.2025 13:59

Thanks !!🙏

@patil_gaurav - 29.04.2025 12:32

Thank you so much!!! I was struggling with the concept of attention and this video came in like a savior.

@Merkw - 04.05.2025 21:54

This 3b1b guy gets much more attention than he deserves; he makes nice animations but in reality explains very little, and not very useful things. This is the best video I have seen that I can recommend to beginners, thank you!

@ВечныйСтудент-у9е - 11.05.2025 17:20

As usual, thanks a lot for clarifying all this stuff in simple words.

@Drteslacoiler - 13.05.2025 21:28

This explanation series is more concrete than the 3Blue1Brown ones.

@RajatAggarwal-re6ef - 28.05.2025 07:21

Hey, aren't the key, query and value matrices supposed to be of dimension d*1? Assuming each input embedding is a d*1 vector, then the weight matrix is of d*d dimensions. Why is it that your key, query and value for a single word embedding are of dimension d*x? I don't know what x is here.

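One common convention that may resolve this (sizes made up, and the video may lay its matrices out differently): each word's embedding is a row of X, the three learned matrices have shape d_model x d_k, and the "x" in the question is that d_k, which in multi-head attention is typically d_model divided by the number of heads.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d_model, d_k = 4, 8, 2                  # 4 words; toy sizes

    X   = rng.standard_normal((n, d_model))    # one d_model-dimensional embedding per word (rows)
    W_Q = rng.standard_normal((d_model, d_k))  # the learned matrices are (d_model x d_k), not (d x 1)
    W_K = rng.standard_normal((d_model, d_k))
    W_V = rng.standard_normal((d_model, d_k))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # (4, 2) each: one query/key/value vector per word
    print((Q @ K.T).shape)                     # (4, 4): one attention score per pair of words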
@supersnowva6717 - 13.06.2025 05:52

The BEST and MOST intuitive explanation of the attention mechanism. Period.

@SerranoAcademy - 07.09.2023 20:22

Hello all! In the video I made a comment about how the Key and Query matrices capture low- and high-level properties of the text. After reading some of your comments, I've realized that this is not true (or at least there's no clear reason for it to be true), and it is probably something I misunderstood from various places in the literature and discussion threads.
Apologies for the error, and thank you to all who pointed it out! I've removed that part of the video.
