๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
math4ai/Machine Learning and Deep Learning

M4AI) That's a Wrap! — MLDL

by ์žผ๋ฏผai 2024. 8. 2.

์ถœ์ฒ˜: https://jrc-park.tistory.com/259

  • ๐Ÿง Frequentist ์™€ Bayesian์˜ ์ฐจ์ด๋Š” ๋ฌด์—‡์ธ๊ฐ€? ๋นˆ๋„์ฃผ์˜์  ๊ด€์ ์€ ๋ฐ์ดํ„ฐ-์˜์กด์ ์ธ ๋ฐฉ์‹์œผ๋กœ ์‚ฌ๊ฑด์˜ ๋ฐœ์ƒ ๋นˆ๋„๊ฐ€ ๊ณง ํ™•๋ฅ ์ด ๋œ๋‹ค๋Š” ์ž…์žฅ์ด๊ณ , Bayesian์€ belief๋ผ๋Š” ๊ฐœ๋…์„ ๋„์ž…ํ•ด ์šฐ๋ฆฌ๊ฐ€ ๊ฐ€์„ค์„ ์„ธ์šฐ๊ณ  ๋ฐ์ดํ„ฐ๋กœ์จ ์ž…์ฆํ•˜์—ฌ belief๋ฅผ ๊ฐฑ์‹ ํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋Š” ์ž…์žฅ์ž„. ์˜ˆ๋ฅผ ๋“ค์–ด, ํ˜ผ์ž ์žˆ๋Š” ๊ฑธ ์ข‹์•„ํ•˜๊ณ  ์ง„์ง€ํ•˜๊ณ  ๋…ผ๋ฆฌ์ ์ธ ์‚ฌ๋žŒ์ด ๊ฐœ๋ฐœ์ž์ผ ํ™•๋ฅ ์— ๋Œ€ํ•ด์„œ๋Š” frequentist๋ผ๋ฉด ์ง„์ง€ํ•˜๊ณ  ๋…ผ๋ฆฌ์ ์ธ ์‚ฌ๋žŒ ์ค‘ ๊ฐœ๋ฐœ์ž์˜ ๋น„์œจ์„ ์ง์ ‘ ๋ฐ์ดํ„ฐ๋กœ ์ธก์ •ํ•ด ๊ทธ๊ฒŒ ๊ณง ํ™•๋ฅ ์ด๋ผ๊ณ  ๋Œ€๋‹ตํ•  ๊ฒƒ์ž„! ๊ทผ๋ฐ ๋ฒ ์ด์ง€์•ˆ์€ prior(์ „์ฒด ์ธ๊ตฌ ์ค‘ ๊ฐœ๋ฐœ์ž์˜ ๋น„์œจ), likelihood(์ € ์„ฑ๊ฒฉ์˜ ์‚ฌ๋žŒ์ด ๊ฐœ๋ฐœ์ž์ผ ํ™•๋ฅ )๋ฅผ ์ด์šฉํ•ด, "์ €๋Ÿฐ ์„ฑ๊ฒฉ์˜ ์‚ฌ๋žŒ์ด๋ผ๋ฉด ๊ฐœ๋ฐœ์ž์ผ ๊ฒƒ์ด๋‹ค"๋ผ๋Š” ์šฐ๋ฆฌ์˜ ๊ฐ€์„ค์„ posterior(์ €๋Ÿฐ ์ฆ๊ฑฐ๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ์šฐ๋ฆฌ์˜ ๊ฐ€์„ค์ด ๋งž์„ ํ™•๋ฅ )๋ผ ํ•˜๊ณ  ์ด๋ฅผ ๋ฐ์ดํ„ฐ๋กœ์จ ์—…๋ฐ์ดํŠธํ•œ๋‹ค๋Š” ์‹์œผ๋กœ ์ƒ๊ฐํ•  ๊ฒƒ์ž„.
  • ๐Ÿง Frequentist ์™€ Bayesian์˜ ์žฅ์ ์€ ๋ฌด์—‡์ธ๊ฐ€? ์ „์ž์˜ ๊ฒฝ์šฐ ๋‹จ์ˆœํ•œ ์ ‘๊ทผ์„ ์ทจํ•˜์ง€๋งŒ ๋ฐ์ดํ„ฐ์— ์ง€๋‚˜์น˜๊ฒŒ ์˜์กด์ ์ด๋ผ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถ€์กฑํ•œ ์ƒํ™ฉ์—์„œ๋Š” ๋ณ„๋กœ ์ข‹์€ ๋ฐฉ๋ฒ•์ด ์•„๋‹ˆ๊ณ , ํ›„์ž์˜ ๊ฒฝ์šฐ bayes rule์— ๊ทผ๊ฑฐํ•ด prior์˜ ํ™•๋ฅ  ๋ชจ๋ธ์ด ์ž˜ ์„ค์ •๋˜์–ด ์žˆ๋‹ค๋ฉด hypothesis๊ฐ€ ํƒ€๋‹นํ•˜๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ๊ทธ๊ฒŒ ๊ณง ๋‹จ์ ์ด ๋จ..
  • ๐Ÿง ์ฐจ์›์˜ ์ €์ฃผ๋ž€? ๋ชจ๋ธ์˜ complexity๊ฐ€ ์ฆ๊ฐ€ํ• ์ˆ˜๋ก training data์— ๋Œ€ํ•œ ์—๋Ÿฌ๋Š” ์ž‘์•„์งˆ์ง€์–ธ์ • ๋ชจ๋ธ์˜ variance๊ฐ€ ์ฆ๊ฐ€ํ•ด์„œ test data์— ๋Œ€ํ•œ error๋Š” ์ปค์งˆ ์ˆ˜๋„ ์žˆ๋‹ค๋Š” ๊ฒƒ.
  • ๐Ÿง Train, Valid, Test๋ฅผ ๋‚˜๋ˆ„๋Š” ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€? real-world ๋ฐ์ดํ„ฐ์™€ ๊ทธ Label์€ $y=f(x)+\epsilon$์œผ๋กœ ํ‘œํ˜„๋˜์–ด, determinisiticํ•œ ๋ถ€๋ถ„ $f(x)$์™€ ๋ถˆ๊ฐ€ํ”ผํ•œ noise $\epsilon$์œผ๋กœ ํ‘œํ˜„์ด ๋˜๋Š”๋ฐ,(1) ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ์™„๋ฒฝํ•œ hypothesis๋ฅผ ๋„์ถœํ•˜๋ ค๋ฉด ๊ทธ๋งŒํผ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฌดํ•œํžˆ ๋งŽ์•„์•ผ ํ•˜์ง€๋งŒ ํ˜„์‹ค์ ์œผ๋กœ ๊ทธ๋ ‡์ง€ ์•Š๊ณ , (2) ๋ถˆ๊ฐ€ํ”ผํ•œ noise๊ฐ€ ํ•ญ์ƒ ์กด์žฌํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋ธ์ด overfitting์ด๋‚˜ underfitting์ด ๋˜์ง€ ์•Š๋„๋ก ๊ฒ€์ฆํ•ด์ฃผ๋Š” ๊ณผ์ •๋„ ํ•„์š”ํ•˜๋‹ค. ๊ทธ๋ž˜์„œ ๊ทธ๋Ÿฐ๋“ฏ.. valid data๋Š” hyperparameter๋ฅผ ๊ณ ๋ฅด๋Š” ๋“ฑ์˜ ํ•™์Šตํ•˜๊ธฐ ์–ด๋ ค์šด ๊ฒƒ๋“ค์— ๋Œ€ํ•ด์„œ ์ ์ ˆํžˆ ์„ ํƒํ•ด์ค˜์•ผ ํ•  ๋•Œ ์ฃผ๋กœ ์‚ฌ์šฉ๋œ๋‹ค.
  • ๐Ÿง Cross Validation์ด๋ž€? ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๊ฒ€์ฆํ•˜๋Š” ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ์˜ subset์œผ๋กœ ์ชผ๊ฐ  ํ›„ ์ด์ค‘ ์ผ๋ถ€์˜ subset์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ค๊ณ  ๋‹ค๋ฅธ subset์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ ๊ฒ€์ฆํ•˜๋Š” ๋ฐฉ์‹์„ ์˜๋ฏธํ•จ. ๊ทธ๋ž˜์„œ unseen data์— ๋Œ€ํ•ด์„œ๋„ ๋ชจ๋ธ์ด ์ž˜ ์ผ๋ฐ˜ํ™”๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ฒƒ์ด๋ž˜์š”. K-fold (k๊ฐœ๋กœ ๋‚˜๋ˆ ์„œ k-1๊ฐœ์— ๋Œ€ํ•ด ํ•™์Šต์‹œํ‚ค๊ณ  1๊ฐœ๋กœ ๊ฒ€์ฆ), LOOCV (๋ฐ์ดํ„ฐ ํฌ์ธํŠธ ํ•œ ๊ฐœ๋กœ ๊ฒ€์ฆ, ๋ฐ์ดํ„ฐ๊ฐ€ ์ž‘์„ ๋•Œ ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ• ๊ฐ™์Œ) ๋“ฑ์ด ์žˆ์–ด์š”.
  • ๐Ÿง (Super-, Unsuper-, Semi-Super) vised learning์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€?
    1) Supervised Learning: ์ •๋‹ต์ด ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šตํ•  ๊ฒฝ์šฐ. ๋ถ„๋ฅ˜๊ณผ์ œ(scoring ๊ฐ™์€..), regression ๋“ฑ์ด ์—ฌ๊ธฐ์— ํ•ด๋‹น๋ผ์š”.
    2) Unsupervised Learning: ์ •๋‹ต์ด ์—†๋Š” ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šตํ•  ๊ฒฝ์šฐ. clustering ๋“ฑ์ด ์—ฌ๊ธฐ์— ํ•ด๋‹น๋  ์ˆ˜ ์žˆ์–ด์š”.
    3)Semi-Supervised Learning: ๋‘ ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋‘ ํ™œ์šฉํ•ด์„œ ํ•™์Šตํ•  ๊ฒฝ์šฐ. 
  • ๐Ÿง Decision Theory๋ž€? .. ํ™•๋ฅ ๋ก , ์ตœ์ ํ™”์ด๋ก  ๋“ฑ์„ ํ™œ์šฉํ•ด ์–ด๋–ค ๋ถˆํ™•์‹ค์„ฑ์ด ๋‚ด์žฌํ•˜๋Š” ์ƒํ™ฉ์—์„œ (์˜ˆ์ธก ๊ณผ์ œ๋ผ๋“ ์ง€) ๊ฐ€์„ค์„ ๋งŒ๋“ค๊ณ  ์˜ˆ์ธก/๊ฒฐ์ •์„ ํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์„ ๋‹ค๋ฃจ๋Š” ์ด๋ก ์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ Abstractํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“ค ์ˆ˜๋„ ์žˆ๊ณ , k-NN์ฒ˜๋Ÿผ ๋ฐ์ดํ„ฐ๋ฅผ storeํ•ด์„œ online-learning์„ ํ•˜๋Š” ๊ฒฝ์šฐ๋กœ ๋‚˜๋ˆŒ ์ˆ˜ ์žˆ์„ ๊ฑฐ ๊ฐ™์•„์š”.
  • ๐Ÿง Receiver Operating Characteristic Curve๋ž€ ๋ฌด์—‡์ธ๊ฐ€? ์•„ ROC.. false positive rate์™€ True positive rate๋ฅผ ๊ฐ๊ฐ x, y์ถ•์œผ๋กœ ํ•˜๋Š” ๊ทธ๋ž˜ํ”„์ธ๋ฐ, binary classifier์˜ ์„ฑ๋Šฅ์„ ํ™•์ธํ•˜๋Š” ๋ฐ ์ฃผ๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ž˜ํ”„๊ฐ€ $y=x$๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ขŒ์ธก ์ƒ๋‹จ์„ ํ–ฅํ•ด ๋ฉ€์–ด์งˆ์ˆ˜๋ก, ์ฆ‰ ์ง„์–‘์„ฑ๋ฅ ์ด 1์— ๊ฐ€๊น๊ณ  ์œ„์–‘์„ฑ๋ฅ ์ด 0์— ๊ฐ€๊น๋„๋ก ๊ทธ๋ž˜ํ”„๊ฐ€ ๊ทธ๋ ค์งˆ ๊ฒฝ์šฐ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ์–ด์š”. 
  • ๐Ÿง Precision Recall์— ๋Œ€ํ•ด์„œ ์„ค๋ช…ํ•ด๋ณด๋ผ Binary classification์˜ ๊ฒฝ์šฐ, 
  • ๐Ÿง Precision Recall Curve๋ž€ ๋ฌด์—‡์ธ๊ฐ€?
  • ๐Ÿง Type 1 Error ์™€ Type 2 Error๋Š”?
  • ๐Ÿง Entropy๋ž€ ๋ฌด์—‡์ธ๊ฐ€? ์ •๋ณด์ด๋ก ์—์„œ ๋ฐ์ดํ„ฐ๊ฐ€ ์–ผ๋งˆ๋‚˜ impureํ•œ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์ง€ํ‘œ์ธ๋ฐ, ํ‰๊ท ์ •๋ณด๋Ÿ‰์ด๋ผ๊ณ ๋„ ๋งํ•œ๋‹ค. log probability์˜ ๊ธฐ๋Œ“๊ฐ’์ด๋ผ๊ณ  ๋งํ•  ์ˆ˜ ์žˆ์„๋“ฏ
  • ๐Ÿง KL-Divergence๋ž€ ๋ฌด์—‡์ธ๊ฐ€? symmetricํ•˜์ง€ ์•Š์€ semi- distance metric (??)์ด๋ผ๊ณ  ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์€๋ฐ, $KL(A\Vert B)$๋Š” A์˜ entropy์™€ A์˜ B์— ๋Œ€ํ•œ cross entropy์˜ ์ฐจ์ด๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ์ˆœ์„œ๋Š” cross entropy - entropy..
  • ๐Ÿง Mutual Information์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€?
  • ๐Ÿง Cross-Entropy๋ž€ ๋ฌด์—‡์ธ๊ฐ€? "the dissimilarity between two probability distributions", ์ฆ‰ ๋‘ ํ™•๋ฅ  ๋ถ„ํฌ๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋‹ค๋ฅธ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์ง€ํ‘œ. $H(A, B)$๋Š” A์˜ ๊ธฐ์ค€์—์„œ B์˜ ๋ถ„ํฌ์™€ ์–ผ๋งˆ๋‚˜ ์ž˜ ์ผ์น˜ํ•˜๋Š”์ง€๋ฅผ ์˜๋ฏธํ•˜๊ณ , ํ•ญ์ƒ ์—”ํŠธ๋กœํ”ผ๋ณด๋‹ค ํฌ๊ฑฐ๋‚˜ ๊ฐ™๋‹ค๋Š” ํŠน์ง•์ด ์žˆ๋‹ค. ๊ฐ™์„ ๋•Œ๋Š” ๋‘ ํ™•๋ฅ  ๋ถ„ํฌ๊ฐ€ ์ผ์น˜ํ•  ๋•Œ, i.e. A=B์ผ๋•Œ!
  • ๐Ÿง Cross-Entropy loss ๋ž€ ๋ฌด์—‡์ธ๊ฐ€? DL์—์„œ cross entropy loss๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค ํ•จ์€, gold label $y$์™€ predicted label $\hat{y}$ ์‚ฌ์ด์˜ ๋ถˆ์ผ์น˜๋ฅผ cross entropy๋กœ์จ ํ™•์ธํ•˜๊ฒ ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•จ.
  • ๐Ÿง Generative Model์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€? ๋ฐ์ดํ„ฐ์˜ joint probability P(X, Y)๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ชจ๋ธ๋กœ, ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ํ•™์Šตํ•จ์œผ๋กœ์จ ์ƒˆ๋กœ์šด ์ƒ˜ํ”Œ์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด๋Ÿฐ ์ด๋ฆ„์ด ๋ถ™์Œ.
  • ๐Ÿง Discriminative Model์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€? 
  • ๐Ÿง Discrinator function์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€?
  • ๐Ÿง Overfitting ์ด๋ž€? ๊ณผ์ ํ•ฉ, training data์˜ ์—๋Ÿฌ๋Š” ๋งค์šฐ ์ž‘์€ ๋ฐ ๋ฐ˜ํ•ด ๋ชจ๋ธ์˜ variance๊ฐ€ ์ปค์„œ test data์— ๋Œ€ํ•ด์„œ๋Š” low performance๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒฝ์šฐ.
  • ๐Ÿง Underfitting์ด๋ž€? ํ•™์Šต์ด ๋œ ๋œ ๊ฒฝ์šฐ.. ๋ชจ๋ธ์˜ bias๊ฐ€ ์ปค์„œ true label๊ณผ predicted label์˜ ์ฐจ์ด๊ฐ€ ์ปค training data์— ๋Œ€ํ•ด์„œ๋„ ์—๋Ÿฌ๊ฐ€ ํฌ๋‹ค..
  • ๐Ÿง Overfitting๊ณผ Underfitting์€ ์–ด๋–ค ๋ฌธ์ œ๊ฐ€ ์žˆ๋Š”๊ฐ€? 
  • ๐Ÿง Overfitting๊ณผ Underfitting์„ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉ๋ฒ•์€? ๋ฐ์ดํ„ฐ๋ฅผ ๋Š˜๋ฆฌ๊ฑฐ๋‚˜, regularization ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์—ฌ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ ˆํ•˜๊ฑฐ๋‚˜ ...
  • ๐Ÿง Regularization์ด๋ž€? ๊ฐ’์ด ์ง€๋‚˜์น˜๊ฒŒ ํฐ ํŒŒ๋ผ๋ฏธํ„ฐ์— ์ผ์ข…์˜ ํŽ˜๋„ํ‹ฐ๋ฅผ ์ค˜์„œ ๊ทธ ๊ฐ’์„ ์กฐ์ ˆํ•˜๋Š” ๊ธฐ๋ฒ•. Ridge์˜ ๊ฒฝ์šฐ objective function์— $w_i^2$๋ฅผ, Lasso์˜ ๊ฒฝ์šฐ $|w_i^2|$ ํ•ญ์„ ์ถ”๊ฐ€ํ•œ๋‹ค. ํ›„์ž๋Š” closed form ์†”๋ฃจ์…˜๋„ ์—†๊ณ  differentiableํ•˜์ง€๋„ ์•Š๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๊ธด ํ•˜๋‹ค..
    • Ridge / Lasso
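A small scikit-learn sketch contrasting the two penalties on toy data; `alpha` plays the role of the penalty weight.

```python
# Contrasting Ridge (L2) and Lasso (L1) with scikit-learn; toy data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # adds alpha * sum(w_i^2) to the objective
lasso = Lasso(alpha=1.0).fit(X, y)   # adds alpha * sum(|w_i|) to the objective

# Lasso tends to zero out weights entirely (sparse solution); Ridge only shrinks
print("ridge zero weights:", (ridge.coef_ == 0).sum())
print("lasso zero weights:", (lasso.coef_ == 0).sum())
```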
  • ๐Ÿง Activation function์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€?3๊ฐ€์ง€ Activation function type์ด ์žˆ๋‹ค.
    • Ridge activation Function / Radial activation Function  / Folding activation Function
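One toy example per type, assuming the usual taxonomy (ridge = acts on a linear combination, radial = acts on a distance, folding = aggregates several inputs); exact membership varies by source.

```python
# One toy example per activation type; grouping follows the common taxonomy.
import numpy as np

x = np.array([-2.0, -0.5, 0.5, 2.0])

relu = np.maximum(0, x)                 # ridge type, e.g. ReLU applied to w·x + b
rbf = np.exp(-(x - 0.0) ** 2)           # radial type, e.g. Gaussian RBF around a center
softmax = np.exp(x) / np.exp(x).sum()   # folding type: folds all inputs into one distribution
```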
  • ๐Ÿง CNN์— ๋Œ€ํ•ด์„œ ์„ค๋ช…ํ•ด๋ณด๋ผ ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง์€ ์ฃผ๋กœ ์ด๋ฏธ์ง€๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ ํ™œ์šฉ๋˜๋Š” ์‹ ๊ฒฝ๋ง์œผ๋กœ, convolution layer, pooling layer, fully connected layer๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค. Convolution layer์—์„œ๋Š” kernel๊ณผ input์˜ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ์ด, pooling layer์—์„œ๋Š” input size๋ฅผ maxpool, average ๋“ฑ์˜ ๊ธฐ๋ฒ•์œผ๋กœ ์ค„์ด๋Š” ์—ฐ์‚ฐ์ด, fc layer์—์„œ๋Š” ์•ž์—์„œ ๊ฑฐ์นœ ์—ฐ์‚ฐ์œผ๋กœ ๋‚˜์˜จ output์„ ๋‹ค์‹œ input์œผ๋กœ ๋ฐ›์•„ feed forward ์—ฐ์‚ฐ์„ ๊ฑฐ์ณ ์ตœ์ข… output์„ ๋‚ด๋Š”.. ๊ทธ๋Ÿฐ flow๋ฅผ ๊ฑฐ์นฉ๋‹ˆ๋‹ค. ์ด๋•Œ ํ•™์Šต๋˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ์ฒ˜์Œ์— ๋žœ๋คํ•˜๊ฒŒ initialize๋๋˜ kernel, ๊ทธ๋ฆฌ๊ณ  FC layer์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์ž…๋‹ˆ๋‹ค.
  • ๐Ÿง RNN์— ๋Œ€ํ•ด์„œ ์„ค๋ช…ํ•ด๋ณด๋ผ sequentialํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์ฃผ๋Š” ์‹ ๊ฒฝ๋ง์œผ๋กœ, time step t์—์„œ์˜ input $x_t$์˜ hidden layer ์—ฐ์‚ฐ๊ฐ’ $h_t$๊ฐ€ ๋‹ค์Œ time step์˜ input $x_{t+1}$๊ณผ ํ•จ๊ป˜ ์—ฐ์‚ฐ์œผ๋กœ 
  • ๐Ÿง Netwon's method๋ž€ ๋ฌด์—‡์ธ๊ฐ€?
  • ๐Ÿง Gradient Descent๋ž€ ๋ฌด์—‡์ธ๊ฐ€? ์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ์˜ closed form solution์„ ๊ตฌํ•˜๋Š” ๋Œ€์‹ , ๊ณ„์‚ฐ์˜ ํšจ์œจ์„ฑ์„ ์œ„ํ•ด ํŒŒ๋ผ๋ฏธํ„ฐ์˜ search space์—์„œ ํ˜„์žฌ step์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ gradient๋ฅผ ๊ณ„์‚ฐํ•ด ๊ทธ ๋ฐฉํ–ฅ์œผ๋กœ ์ด๋™ํ•ด๊ฐ€๋ฉด์„œ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธํ•˜์—ฌ local/global optimum์„ ์ฐพ์•„๊ฐ€๋Š” ๋ฐฉ์‹.
  • ๐Ÿง Stochastic Gradient Descent๋ž€ ๋ฌด์—‡์ธ๊ฐ€? Gradient descent๋Š” ๋ชจ๋“  training sample์— ๋Œ€ํ•ด ์ ์šฉํ•ด์ฃผ๋Š” ๋ฐ ๋ฐ˜ํ•ด, SGD๋Š” batch ๋‹จ์œ„์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ๋ฅผ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. GD์— ๋น„ํ•ด unstableํ•˜์ง€๋งŒ, local optimum์— ๊ฐ‡ํ˜€๋„ ๋น ์ ธ๋‚˜์™€ global optimum์— ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ๊ณผ ์—ฐ์‚ฐ์ด ๋น ๋ฅด๋‹ค๋Š” ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๐Ÿง Local optimum์œผ๋กœ ๋น ์ง€๋Š”๋ฐ ์„ฑ๋Šฅ์ด ์ข‹์€ ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€?
  • ๐Ÿง Internal Covariance Shift ๋ž€ ๋ฌด์—‡์ธ๊ฐ€? ๋ณดํ†ต training data์™€ test data์˜ ๋ถ„ํฌ๊ฐ€ ๋‹ค๋ฅธ ๊ฒƒ์„ covariate shift๋ผ๊ณ  ๋ถ€๋ฅด๋Š”๋ฐ, ์ด๊ฒƒ์ด NN ์•ˆ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ํ˜„์ƒ์„ ์ผ์ปซ๋Š” ์šฉ์–ด์ž…๋‹ˆ๋‹ค. ์ฆ‰, ํ•œ hidden layer์—์„œ ๋‹ค๋ฅธ hidden layer๋กœ ๋„˜์–ด๊ฐ€๋ฉด์„œ ๊ณ„์† ํ™•๋ฅ  ๋ถ„ํฌ๊ฐ€ ์กฐ๊ธˆ์”ฉ ๋‹ฌ๋ผ์ง€๋Š” ํ˜„์ƒ์„ ์˜๋ฏธํ•ด์š”. 
  • ๐Ÿง Batch Normalization์€ ๋ฌด์—‡์ด๊ณ  ์™œ ํ•˜๋Š”๊ฐ€? ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ICS๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, mini batch ์•ˆ์—์„œ์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์œผ๋กœ batch๋ฅผ normalize๋ฅผ ํ•ด์ฃผ๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, batch ๋‹จ์œ„์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ ๊ฐ’์œผ๋กœ ์ •๊ทœํ™”๋ฅผ ํ•ด์ค˜์„œ ํ‰๊ท ์ด 0, ํ‘œ์ค€ํŽธ์ฐจ๊ฐ€ 1์ด ๋˜๋„๋ก scale-and-shift๋ฅผ ํ•ด์ฃผ๋Š” ๊ฒƒ์ด์—์š”. training ์‹œ์—๋Š” batch ๋‹จ์œ„๋กœ ์ •๊ทœํ™”๋ฅผ ์ง„ํ–‰ํ•˜์ง€๋งŒ, inference ์‹œ์—๋Š” training set ์ „์ฒด์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•ด normalize๋ฅผ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿง Backpropagation์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€? feed forward ๊ณผ์ •์œผ๋กœ output๊ณผ loss๋ฅผ ๊ณ„์‚ฐํ–ˆ์„ ๋•Œ, ์ด loss์˜ gradient๋ฅผ ๊ณ„์‚ฐํ•ด ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์„ ์—…๋ฐ์ดํŠธํ•ด์ฃผ๋Š” ๊ณผ์ •์„ ๋งํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿง Optimizer์˜ ์ข…๋ฅ˜์™€ ์ฐจ์ด์— ๋Œ€ํ•ด์„œ ์•„๋Š”๊ฐ€? Momentum, AdaGrad, RMSProp, AdaM optimizer ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค. momentum์€ ์ด์ „ time-step์˜ gradient, ์ฆ‰ ๊ด€์„ฑ์„ ์ด์šฉํ•ด step size๋ฅผ ์กฐ์ ˆํ•˜๋Š” ๊ฒƒ์ด๊ณ , AdaGrad๋Š” gradient์˜ second moment ๋ˆ„์ ๊ฐ’์„, RMSProp์€ gradient์˜ exponential moving average๋ฅผ, ๊ทธ๋ฆฌ๊ณ  Adam์€ momentum๊ณผ rmsprop์„ ํ˜ผํ•ฉํ•œ ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค. 
  • ๐Ÿง Ensemble์ด๋ž€?
  • ๐Ÿง Stacking Ensemble์ด๋ž€?
  • ๐Ÿง Bagging์ด๋ž€?
  • ๐Ÿง Bootstrapping์ด๋ž€?
  • ๐Ÿง Boosting์ด๋ž€?
  • ๐Ÿง Bagging ๊ณผ Boosting์˜ ์ฐจ์ด๋Š”?
  • ๐Ÿง AdaBoost / Logit Boost / Gradient Boost
  • ๐Ÿง Support Vector Machine์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€? linear decision boundary๋ฅผ ์ฐพ์„ ๋•Œ, decison boundary์™€ datapoint์˜ margin์„ ์ตœ๋Œ€ํ™”ํ•˜๊ฒ ๋‹ค๋Š” ์•„์ด๋””์–ด๋กœ ๋‚˜์˜จ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. constrained QP ๋ฌธ์ œ์— ํ•ด๋‹น๋˜๊ณ , KKT condition์„ ๋งŒ์กฑํ•˜๊ธฐ ๋•Œ๋ฌธ์— dual problem์œผ๋กœ ๋ฐ”๊พธ์–ด์„œ ์ง์ ‘ linear decision boundary์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ฐพ๋Š” ๊ฒƒ๋ณด๋‹ค support vector, ์ฆ‰ canonical line ์œ„์— ์žˆ๋Š” datapoint๋ฅผ ์ฐพ๋Š” ๋ฌธ์ œ๋กœ ์ „ํ™˜ํ•˜์—ฌ ๋ฌธ์ œ๋ฅผ ํ’€ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, linearํ•œ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ๋”๋ผ๋„ kernel trick์„ ์ด์šฉํ•œ๋‹ค๋ฉด Non-linearํ•œ decision boundary๋ฅผ ์ฐพ๋Š” ๋ฌธ์ œ์—๋„ ์ ์šฉ๋  ์ˆ˜ ์žˆ์–ด์š”. 
  • ๐Ÿง Margin์„ ์ตœ๋Œ€ํ™”ํ•˜๋ฉด ์–ด๋–ค ์žฅ์ ์ด ์žˆ๋Š”๊ฐ€? ๋งŒ์กฑํ•˜๋Š” decision boundary๊ฐ€ ํ•˜๋‚˜๋กœ ๋–จ์–ด์ง‘๋‹ˆ๋‹ค. perceptron์€ ์•„๋‹ˆ์—ˆ๊ฑฐ๋“ ..

์•ˆ ์“ด ๋ถ€๋ถ„์€ ใ…Žใ…Ž ๋Œ€๋‹ต ์Šค๋ฌด์Šคํ•˜๊ฒŒ ํ•  ์ˆ˜ ์žˆ์„ ๊ฑฐ ๊ฐ€ํ‹ˆ
