  • [Machine Learning] 5. Nearest Neighbor Method
    🐳Dev/Machine Learning 2021. 12. 20. 16:16
    These notes are based on Professor Kim Dong-il's Machine Learning course at Chungnam National University.

    1. k-Nearest Neighbors Classifier (k-NN)

    k-NN is the easiest and most intuitive algorithm, and it illustrates a basic concept of machine learning: "data similar to me describes me." The k in k-NN is a count: it specifies how many nearest neighbors are consulted when classifying a point by similarity. Similarity in k-NN is measured by distance, and the closer two points are, the more similar they are assumed to be.

    1) Distance Metric

    Let's look at the distance functions used to measure the distance between each pair of elements in a set.

    1. Minkowski distance

    The Minkowski distance is the general formula in which the order k is left unspecified (note that this k is the order of the norm, not the k in k-NN). Depending on the value of k, the formula takes on different names, as shown below.

    2. k = 1, Manhattan distance

    3. k = 2, Euclidean distance

    The most commonly used formula is the Euclidean distance; if no distance metric is specified, you can assume this one was used. Looking at the formula, it is the familiar straight-line distance between two points.
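
    In standard form, for two points x = (x_1, ..., x_d) and y = (y_1, ..., y_d), these distances are:

    d_k(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^k \right)^{1/k}

    d_1(x, y) = \sum_{i=1}^{d} |x_i - y_i| \quad \text{(Manhattan)}

    d_2(x, y) = \sqrt{ \sum_{i=1}^{d} (x_i - y_i)^2 } \quad \text{(Euclidean)}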

    4. Mahalanobis distance

    The Euclidean distance has the problem of being strongly affected by the scale of the values, but this can be corrected by dividing by the covariance, which normalizes the data; the result is the Mahalanobis distance. In the simplified case where all variables are assumed mutually independent, each dimension's difference is divided by that dimension's variance, so a difference along a widely spread (high-variance) axis contributes less to the distance than the same difference along a tightly clustered (low-variance) axis. In other words, think of it as the processing needed to treat a score of 70 on an exam whose average is 50 and a score of 70 on an exam whose average is 80 on a single common scale.
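
    In symbols, with Σ denoting the covariance matrix of the data, the standard form is

    d_M(x, y) = \sqrt{ (x - y)^{\top} \Sigma^{-1} (x - y) }

    and when Σ is diagonal (the independence assumption above), it reduces to dividing each squared coordinate difference by that coordinate's variance:

    d_M(x, y) = \sqrt{ \sum_{i=1}^{d} \frac{(x_i - y_i)^2}{\sigma_i^2} }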

    5. Distance of categorical features

    • One hot encoding (One-of-C coding)

    Suppose we want to encode a value that is not a number, such as a categorical feature. Rather than simply assigning a number to each value, One Hot Encoding is generally used, as described below.

    A variable is created for each category, and the variable corresponding to the feature's value is set to 1 while the rest are set to 0. In the post's example table, only the ID whose occupation is Engineer gets a 1 in the Engineer column; every other ID gets a 0 there. Because a single 1 is assigned to exactly one variable and the rest stay 0, the scheme is called One Hot. With this encoding, the distance between any two different categories is always the same value, √2, but the number of variables grows considerably. If you want to avoid this, the feature has to be converted and treated as an Ordinal or Nominal variable, which makes the distance computation quite complicated!
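
    A minimal sketch of one-hot encoding in plain Python/NumPy, using a made-up occupation column (the names and values here are illustrative, not from the post):

    import numpy as np

    # Hypothetical categorical column (illustrative values).
    occupations = ["Engineer", "Doctor", "Teacher", "Engineer"]

    # Fix an ordering of the categories.
    categories = sorted(set(occupations))          # ['Doctor', 'Engineer', 'Teacher']

    # One variable per category: 1 where the value matches, 0 elsewhere.
    one_hot = np.array([[1.0 if value == c else 0.0 for c in categories]
                        for value in occupations])

    print(categories)
    print(one_hot)
    # Any two rows with different categories are exactly sqrt(2) apart:
    print(np.linalg.norm(one_hot[0] - one_hot[1]))   # 1.4142...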

    2) Normalization

    Variables often have different scales, so distortion can occur at any time; normalization is therefore applied. When working with distances, never forget to normalize first. The two common methods are listed below, with a sketch after the list.

    • z-normalization
    • min-max normalization or scaling
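
    A minimal sketch of both normalizations applied column-wise to a NumPy feature matrix (the numbers are just an illustrative example):

    import numpy as np

    X = np.array([[170.0, 60.0],     # e.g. height (cm), weight (kg) -- illustrative values
                  [160.0, 50.0],
                  [180.0, 80.0]])

    # z-normalization: zero mean, unit standard deviation per column.
    z_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

    # min-max scaling: rescale each column into [0, 1].
    min_max_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    print(z_scaled)
    print(min_max_scaled)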

    3) k-NN

    1. Algorithm

    1. Prepare training data that has class labels.
    2. Add test data that has no labels.
    3. For every training point, compute its distance to the test data.
    4. For each test point, select the k nearest neighbors in order of distance.
    5. Check the neighbors' labels (vote) and assign the most frequent label to the test point.
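
    A minimal sketch of these steps using Euclidean distance (a from-scratch illustration, not the course's reference code; the tiny dataset is made up):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k=3):
        """Classify one test point by majority vote among its k nearest training points."""
        # Step 3: Euclidean distance from the test point to every training point.
        distances = np.linalg.norm(X_train - x_test, axis=1)
        # Step 4: indices of the k closest training points.
        nearest = np.argsort(distances)[:k]
        # Step 5: majority vote over the neighbors' labels.
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]

    # Illustrative 2-D data (already assumed to be normalized).
    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
    y_train = np.array(["A", "A", "B", "B"])

    print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"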

    2. Choosing k

    So how do we decide the size of k?
    k is preferably odd (so that votes cannot tie), and an appropriate value is chosen by a person through several trials.

    If k is large: smoothing, but not sensitive.
    If k is small: complex, but too sensitive.
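
    One common way to run those "several trials" is cross-validation over a few odd values of k; a sketch using scikit-learn (the library and dataset are my choice here, not something the post prescribes):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)   # example dataset, purely for illustration

    # Try a few odd values of k and keep the one with the best cross-validated accuracy.
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in [1, 3, 5, 7, 9]}
    best_k = max(scores, key=scores.get)
    print(scores, "best k:", best_k)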

    3. Lazy Learner

    k-NN is called a lazy learner for the reasons below; the opposite is an eager learner.

    • no training process : nothing is computed until the data actually arrives
    • reference vectors : the stored training data is called this because it is awkward to call it a trained model
    • memory-based reasoning
    • time consuming for test process : evaluation takes a long time, so real-time use is impractical

    4. k-NN Regression

    Instead of voting, the label is determined by averaging the neighbors' label values.
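
    A minimal sketch of the regression variant, reusing the same distance-and-neighbors logic but averaging instead of voting (made-up numbers, purely illustrative):

    import numpy as np

    def knn_regress(X_train, y_train, x_test, k=3):
        """Predict a numeric value as the mean of the k nearest neighbors' labels."""
        distances = np.linalg.norm(X_train - x_test, axis=1)
        nearest = np.argsort(distances)[:k]
        return y_train[nearest].mean()

    X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
    y_train = np.array([1.5, 2.5, 3.5, 11.0])
    print(knn_regress(X_train, y_train, np.array([2.2]), k=3))  # mean of 1.5, 2.5, 3.5 = 2.5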

    4) Summary

    • Prediction is based on the nearest neighbors
    • It has been proven that the worst-case error of kNN is at most twice the optimal (Bayes) classification error, so kNN serves as a baseline: if a method performs worse than kNN, don't use it
    • Despite being very simple, it handles nonlinear problems
    • Lazy Learning