|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Mean Normalization\n", |
| 8 | + "\n", |
| 9 | + "In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.\n", |
| 10 | + "\n", |
| 11 | + "In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization also rescales the data, but instead of mapping the values to the range 0 to 1, it centers them in a small interval around zero. For example, if we have a dataset that has values between 0 and 5,000, after mean normalization the values will lie in a small range around 0, for example between -3 and 3. Because the average of each column is subtracted from that column, the average (mean) of all elements is guaranteed to be zero, up to floating-point rounding. Therefore, when you perform *mean normalization* your data will not only be scaled but will also have an average of zero. \n", |
| 12 | + "\n", |
| 13 | + "# To Do:\n", |
| 14 | + "\n", |
| 15 | + "You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below." |
| 16 | + ] |
| 17 | + }, |
| 18 | + { |
| 19 | + "cell_type": "code", |
| 20 | + "execution_count": 1, |
| 21 | + "metadata": {}, |
| 22 | + "outputs": [ |
| 23 | + { |
| 24 | + "name": "stdout", |
| 25 | + "output_type": "stream", |
| 26 | + "text": [ |
| 27 | + "(1000, 20)\n" |
| 28 | + ] |
| 29 | + } |
| 30 | + ], |
| 31 | + "source": [ |
| 32 | + "# import NumPy into Python\n", |
| 33 | + "import numpy as np\n", |
| 34 | + "\n", |
| 35 | + "# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).\n", |
| 36 | + "X = np.random.randint(0,5001, size = (1000, 20))\n", |
| 37 | + "\n", |
| 38 | + "# print the shape of X\n", |
| 39 | + "print(X.shape)" |
| 40 | + ] |
| 41 | + }, |
| 42 | + { |
| 43 | + "cell_type": "markdown", |
| 44 | + "metadata": {}, |
| 45 | + "source": [ |
| 46 | + "Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:\n", |
| 47 | + "\n", |
| 48 | + "$\\text{Norm\\_Col}_i = \\frac{\\text{Col}_i - M_i}{\\sigma_i}$\n", |
| 49 | + "\n", |
| 50 | + "where $\\text{Col}_i$ is the $i$th column of $X$, $M_i$ is the average of the values in the $i$th column of $X$, and $\\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$. " |
| 51 | + ] |
| 52 | + }, |
| 53 | + { |
| 54 | + "cell_type": "code", |
| 55 | + "execution_count": 3, |
| 56 | + "metadata": {}, |
| 57 | + "outputs": [ |
| 58 | + { |
| 59 | + "name": "stdout", |
| 60 | + "output_type": "stream", |
| 61 | + "text": [ |
| 62 | + "avg [2586.193 2431.678 2472.509 2425.142 2549.725 2512.455 2467.976 2513.422\n", |
| 63 | + " 2544.432 2507.975 2578.101 2547.339 2627.567 2478.674 2482.932 2474.226\n", |
| 64 | + " 2430.581 2536.959 2543.466 2538.073] Std [1423.80099795 1424.04427611 1431.6269402 1436.90375176 1457.08430895\n", |
| 65 | + " 1422.28549032 1413.17238914 1458.06015717 1444.06997039 1439.56461556\n", |
| 66 | + " 1433.30276104 1430.12214726 1428.97612209 1424.50018523 1456.94740515\n", |
| 67 | + " 1435.42107652 1414.82414789 1433.96176704 1460.43488347 1440.94841118]\n" |
| 68 | + ] |
| 69 | + } |
| 70 | + ], |
| 71 | + "source": [ |
| 72 | + "# Average of the values in each column of X\n", |
| 73 | + "ave_cols = np.mean(X, axis = 0)\n", |
| 74 | + "\n", |
| 75 | + "# Standard Deviation of the values in each column of X\n", |
| 76 | + "std_cols = np.std(X, axis = 0)\n", |
| 77 | + "\n", |
| 78 | + "print(\"avg \", ave_cols, \"Std \", std_cols)" |
| 79 | + ] |
| 80 | + }, |
| 81 | + { |
| 82 | + "cell_type": "markdown", |
| 83 | + "metadata": {}, |
| 84 | + "source": [ |
| 85 | + "If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:" |
| 86 | + ] |
| 87 | + }, |
| 88 | + { |
| 89 | + "cell_type": "code", |
| 90 | + "execution_count": 4, |
| 91 | + "metadata": {}, |
| 92 | + "outputs": [ |
| 93 | + { |
| 94 | + "name": "stdout", |
| 95 | + "output_type": "stream", |
| 96 | + "text": [ |
| 97 | + "ave_cols shape (20,)\n", |
| 98 | + "std_cols shape (20,)\n" |
| 99 | + ] |
| 100 | + } |
| 101 | + ], |
| 102 | + "source": [ |
| 103 | + "# Print the shape of ave_cols\n", |
| 104 | + "print(\"ave_cols shape \",ave_cols.shape)\n", |
| 105 | + "\n", |
| 106 | + "# Print the shape of std_cols\n", |
| 107 | + "print(\"std_cols shape \",std_cols.shape)" |
| 108 | + ] |
| 109 | + }, |
| 110 | + { |
| 111 | + "cell_type": "markdown", |
| 112 | + "metadata": {}, |
| 113 | + "source": [ |
| 114 | + "You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below." |
| 115 | + ] |
| 116 | + }, |
| 117 | + { |
| 118 | + "cell_type": "code", |
| 119 | + "execution_count": 6, |
| 120 | + "metadata": {}, |
| 121 | + "outputs": [ |
| 122 | + { |
| 123 | + "name": "stdout", |
| 124 | + "output_type": "stream", |
| 125 | + "text": [ |
| 126 | + "[[ 0.96348226 0.03463516 -0.92657449 ... 1.2483185 0.50295567\n", |
| 127 | + " 0.91809463]\n", |
| 128 | + " [ 1.57101098 0.42507246 -1.21016792 ... -1.67016936 -0.48921455\n", |
| 129 | + " 0.77443925]\n", |
| 130 | + " [ 1.33923702 -1.63174561 1.40643553 ... -0.35214258 -0.67477572\n", |
| 131 | + " 0.95348799]\n", |
| 132 | + " ...\n", |
| 133 | + " [-0.19257818 -0.6261589 -0.0541405 ... 0.18134444 -0.88567181\n", |
| 134 | + " 1.21165129]\n", |
| 135 | + " [-0.83803355 1.39765458 1.54753374 ... -1.51256404 -0.87540089\n", |
| 136 | + " 1.21442724]\n", |
| 137 | + " [ 1.18612573 -1.36981555 0.26856927 ... -0.71686639 -1.61696083\n", |
| 138 | + " -1.410233 ]]\n" |
| 139 | + ] |
| 140 | + } |
| 141 | + ], |
| 142 | + "source": [ |
| 143 | + "# Mean normalize X\n", |
| 144 | + "X_norm = (X - ave_cols) / std_cols\n", |
| 145 | + "print(X_norm)" |
| 146 | + ] |
| 147 | + }, |
| 148 | + { |
| 149 | + "cell_type": "markdown", |
| 150 | + "metadata": {}, |
| 151 | + "source": [ |
| 152 | + "If you have performed the mean normalization correctly, then the average of all the elements in `X_norm` should be close to zero, and the values should lie in some small interval around zero. You can verify this by filling in the code below:" |
| 153 | + ] |
| 154 | + }, |
| 155 | + { |
| 156 | + "cell_type": "code", |
| 157 | + "execution_count": 7, |
| 158 | + "metadata": {}, |
| 159 | + "outputs": [ |
| 160 | + { |
| 161 | + "name": "stdout", |
| 162 | + "output_type": "stream", |
| 163 | + "text": [ |
| 164 | + "The average of all the values of X_norm is: \n", |
| 165 | + "2.4513724383723456e-17\n", |
| 166 | + "2.4513724383723456e-17\n", |
| 167 | + "The average of the minimum value in each column of X_norm is: \n", |
| 168 | + "-1.7476323670189688\n", |
| 169 | + "-1.7476323670189688\n", |
| 170 | + "The average of the maximum value in each column of X_norm is: \n", |
| 171 | + "1.7300298956249751\n", |
| 172 | + "1.7300298956249751\n" |
| 173 | + ] |
| 174 | + } |
| 175 | + ], |
| 176 | + "source": [ |
| 177 | + "# Print the average of all the values of X_norm\n", |
| 178 | + "# You can use either the function or a method. So, there are multiple ways to solve. \n", |
| 179 | + "print(\"The average of all the values of X_norm is: \")\n", |
| 180 | + "print(np.mean(X_norm))\n", |
| 181 | + "print(X_norm.mean())\n", |
| 182 | + "\n", |
| 183 | + "# Print the average of the minimum value in each column of X_norm\n", |
| 184 | + "print(\"The average of the minimum value in each column of X_norm is: \")\n", |
| 185 | + "print(X_norm.min(axis = 0).mean())\n", |
| 186 | + "print(np.mean(np.sort(X_norm, axis=0)[0]))\n", |
| 187 | + "\n", |
| 188 | + "# Print the average of the maximum value in each column of X_norm\n", |
| 189 | + "print(\"The average of the maximum value in each column of X_norm is: \")\n", |
| 190 | + "print(np.mean(np.sort(X_norm, axis=0)[-1]))\n", |
| 191 | + "print(X_norm.max(axis = 0).mean())" |
| 192 | + ] |
| 193 | + }, |
| 194 | + { |
| 195 | + "cell_type": "markdown", |
| 196 | + "metadata": {}, |
| 197 | + "source": [ |
| 198 | + "You should note that since $X$ was created using random integers, the above values will vary. \n", |
| 199 | + "\n", |
| 200 | + "# Data Separation\n", |
| 201 | + "\n", |
| 202 | + "After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:\n", |
| 203 | + "\n", |
| 204 | + "1. A Training Set\n", |
| 205 | + "2. A Cross Validation Set\n", |
| 206 | + "3. A Test Set\n", |
| 207 | + "\n", |
| 208 | + "The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. \n", |
| 209 | + "\n", |
| 210 | + "In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.\n", |
| 211 | + "\n", |
| 212 | + "You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:" |
| 213 | + ] |
| 214 | + }, |
| 215 | + { |
| 216 | + "cell_type": "code", |
| 217 | + "execution_count": null, |
| 218 | + "metadata": {}, |
| 219 | + "outputs": [], |
| 220 | + "source": [ |
| 221 | + "# We create a random permutation of integers 0 to 4\n", |
| 222 | + "np.random.permutation(5)" |
| 223 | + ] |
| 224 | + }, |
| 225 | + { |
| 226 | + "cell_type": "markdown", |
| 227 | + "metadata": {}, |
| 228 | + "source": [ |
| 229 | + "# To Do\n", |
| 230 | + "\n", |
| 231 | + "In the space below create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember the `shape` attribute returns a tuple with two numbers in the form `(rows,columns)`." |
| 232 | + ] |
| 233 | + }, |
| 234 | + { |
| 235 | + "cell_type": "code", |
| 236 | + "execution_count": 8, |
| 237 | + "metadata": {}, |
| 238 | + "outputs": [ |
| 239 | + { |
| 240 | + "name": "stdout", |
| 241 | + "output_type": "stream", |
| 242 | + "text": [ |
| 243 | + "[242 761 956 391 263 916 312 131 445 366 403 795 605 152 906 140 310 782\n", |
| 244 | + " 550 661 545 251 260 84 808 884 983 52 134 58 528 270 878 852 785 460\n", |
| 245 | + " 180 563 384 555 394 532 237 111 120 932 400 606 585 342 787 734 883 684\n", |
| 246 | + " 498 847 598 945 526 335 538 796 845 806 520 202 339 733 625 174 102 749\n", |
| 247 | + " 101 697 172 399 377 970 860 899 378 558 420 979 337 126 716 830 763 219\n", |
| 248 | + " 651 564 657 631 879 307 390 115 478 412 167 104 912 546 811 646 881 145\n", |
| 249 | + " 301 187 214 740 989 622 117 893 433 888 273 815 287 981 930 21 313 866\n", |
| 250 | + " 774 159 470 64 918 936 687 944 204 333 41 946 12 818 11 660 254 980\n", |
| 251 | + " 10 533 178 508 15 427 754 321 952 453 352 76 396 608 968 580 463 7\n", |
| 252 | + " 959 856 836 623 701 386 900 437 978 314 340 371 34 746 951 693 609 566\n", |
| 253 | + " 150 770 431 296 539 729 348 972 109 973 775 100 171 90 107 144 379 29\n", |
| 254 | + " 425 721 894 519 725 53 341 596 250 402 814 303 148 169 189 4 976 490\n", |
| 255 | + " 722 531 792 812 939 680 516 293 849 74 789 767 559 820 850 327 764 786\n", |
| 256 | + " 855 966 653 40 907 49 543 639 469 426 176 875 147 261 839 163 33 589\n", |
| 257 | + " 277 987 891 328 604 514 27 199 567 486 389 71 658 428 370 772 895 297\n", |
| 258 | + " 146 813 840 422 873 735 79 500 819 476 385 940 650 527 556 155 941 347\n", |
| 259 | + " 168 423 957 802 581 38 193 397 913 507 411 61 85 512 37 165 450 264\n", |
| 260 | + " 410 311 50 568 728 345 755 121 965 988 656 136 224 3 483 659 984 624\n", |
| 261 | + " 809 741 292 439 843 51 994 216 481 619 380 183 867 803 505 780 626 93\n", |
| 262 | + " 603 196 573 383 467 45 977 637 986 497 679 272 673 227 9 707 614 357\n", |
| 263 | + " 434 181 960 142 139 304 294 990 801 82 288 592 824 382 493 413 97 23\n", |
| 264 | + " 757 221 504 534 827 736 599 457 278 641 151 188 324 613 766 283 991 440\n", |
| 265 | + " 336 583 663 525 696 574 465 231 179 375 213 91 617 669 235 835 810 919\n", |
| 266 | + " 523 760 280 119 129 271 138 323 648 477 924 248 702 322 635 320 584 854\n", |
| 267 | + " 628 904 106 768 649 831 73 112 299 5 258 992 554 737 877 141 909 334\n", |
| 268 | + " 128 928 206 765 805 943 561 590 302 417 709 791 87 80 198 16 170 530\n", |
| 269 | + " 156 513 665 130 190 226 560 275 499 110 571 1 223 621 392 529 602 720\n", |
| 270 | + " 32 56 689 289 748 842 36 672 591 30 285 94 354 203 78 75 455 96\n", |
| 271 | + " 713 197 759 369 868 212 846 703 670 205 962 211 184 442 536 634 892 816\n", |
| 272 | + " 826 47 964 718 48 643 676 68 238 632 998 429 317 769 706 898 547 885\n", |
| 273 | + " 502 929 372 887 290 331 62 518 971 654 166 967 157 489 834 535 612 870\n", |
| 274 | + " 291 175 524 511 777 616 35 704 448 25 154 432 99 475 522 861 901 776\n", |
| 275 | + " 471 468 122 344 794 682 376 705 925 496 418 43 88 864 999 857 947 712\n", |
| 276 | + " 217 191 917 668 253 63 484 451 208 472 678 281 201 346 149 282 416 910\n", |
| 277 | + " 667 620 927 630 229 249 724 325 256 719 807 305 480 266 948 447 60 421\n", |
| 278 | + " 241 548 683 553 495 982 593 162 934 276 542 690 711 790 642 640 662 926\n", |
| 279 | + " 268 125 127 996 865 83 326 89 494 752 788 611 961 832 373 949 192 675\n", |
| 280 | + " 804 600 401 674 13 914 173 750 225 633 108 482 309 404 691 783 158 975\n", |
| 281 | + " 958 681 744 284 692 862 742 349 903 233 316 452 459 747 133 353 540 31\n", |
| 282 | + " 698 597 793 267 262 338 974 438 969 358 863 671 607 182 501 890 441 116\n", |
| 283 | + " 565 743 185 329 935 880 265 255 279 195 55 745 473 259 841 837 295 351\n", |
| 284 | + " 798 458 103 695 446 700 462 717 28 931 738 569 186 234 244 409 915 503\n", |
| 285 | + " 953 521 908 430 723 997 143 576 710 985 615 544 444 645 506 610 118 889\n", |
| 286 | + " 487 274 950 753 466 694 586 319 762 367 552 24 933 464 364 638 46 594\n", |
| 287 | + " 363 26 537 779 575 137 485 332 474 6 124 228 921 405 823 731 510 414\n", |
| 288 | + " 778 876 230 209 300 449 509 123 627 938 454 222 955 874 245 39 86 897\n", |
| 289 | + " 269 359 114 236 72 8 461 954 773 368 618 781 714 655 306 194 922 132\n", |
| 290 | + " 92 963 66 817 22 577 872 911 387 644 161 636 859 20 308 436 647 298\n", |
| 291 | + " 844 587 578 17 869 406 239 993 456 851 886 398 825 164 77 286 98 362\n", |
| 292 | + " 937 330 315 549 677 920 727 18 365 715 252 0 664 797 435 67 557 492\n", |
| 293 | + " 739 232 360 828 69 177 572 407 479 44 726 160 19 105 758 756 415 350\n", |
| 294 | + " 882 95 424 488 833 848 243 905 708 858 829 343 135 355 800 942 579 70\n", |
| 295 | + " 220 923 210 570 54 871 799 388 685 699 732 821 666 517 853 443 374 2\n", |
| 296 | + " 65 395 686 381 153 515 652 257 838 688 240 491 200 318 246 822 113 218\n", |
| 297 | + " 207 582 215 562 551 356 896 247 771 408 995 629 81 57 595 784 361 59\n", |
| 298 | + " 730 419 42 588 541 751 14 601 902 393]\n" |
| 299 | + ] |
| 300 | + } |
| 301 | + ], |
| 302 | + "source": [ |
| 303 | + "# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`\n", |
| 304 | + "row_indices = np.random.permutation(X_norm.shape[0])\n", |
| 305 | + "print(row_indices)" |
| 306 | + ] |
| 307 | + }, |
| 308 | + { |
| 309 | + "cell_type": "markdown", |
| 310 | + "metadata": {}, |
| 311 | + "source": [ |
| 312 | + "Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below." |
| 313 | + ] |
| 314 | + }, |
| 315 | + { |
| 316 | + "cell_type": "code", |
| 317 | + "execution_count": 9, |
| 318 | + "metadata": {}, |
| 319 | + "outputs": [], |
| 320 | + "source": [ |
| 321 | + "# Make any necessary calculations.\n", |
| 322 | + "# You can save your calculations into variables to use later.\n", |
| 323 | + "\n", |
| 324 | + "sixty = int(len(X_norm) * 0.6)\n", |
| 325 | + "eighty = int(len(X_norm) * 0.8)\n", |
| 326 | + "\n", |
| 327 | + "# Create a Training Set\n", |
| 328 | + "X_train = X_norm[row_indices[:sixty], :]\n", |
| 329 | + "\n", |
| 330 | + "# Create a Cross Validation Set\n", |
| 331 | + "X_crossVal = X_norm[row_indices[sixty: eighty], :]\n", |
| 332 | + "\n", |
| 333 | + "# Create a Test Set\n", |
| 334 | + "X_test = X_norm[row_indices[eighty: ], :]" |
| 335 | + ] |
| 336 | + }, |
| 337 | + { |
| 338 | + "cell_type": "markdown", |
| 339 | + "metadata": {}, |
| 340 | + "source": [ |
| 341 | + "If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:" |
| 342 | + ] |
| 343 | + }, |
| 344 | + { |
| 345 | + "cell_type": "code", |
| 346 | + "execution_count": 10, |
| 347 | + "metadata": {}, |
| 348 | + "outputs": [ |
| 349 | + { |
| 350 | + "name": "stdout", |
| 351 | + "output_type": "stream", |
| 352 | + "text": [ |
| 353 | + "(600, 20)\n", |
| 354 | + "(200, 20)\n", |
| 355 | + "(200, 20)\n" |
| 356 | + ] |
| 357 | + } |
| 358 | + ], |
| 359 | + "source": [ |
| 360 | + "# Print the shape of X_train\n", |
| 361 | + "print(X_train.shape)\n", |
| 362 | + "\n", |
| 363 | + "# Print the shape of X_crossVal\n", |
| 364 | + "print(X_crossVal.shape)\n", |
| 365 | + "\n", |
| 366 | + "# Print the shape of X_test\n", |
| 367 | + "print(X_test.shape)" |
| 368 | + ] |
| 369 | + } |
| 370 | + ], |
| 371 | + "metadata": { |
| 372 | + "kernelspec": { |
| 373 | + "display_name": "Python 3", |
| 374 | + "language": "python", |
| 375 | + "name": "python3" |
| 376 | + }, |
| 377 | + "language_info": { |
| 378 | + "codemirror_mode": { |
| 379 | + "name": "ipython", |
| 380 | + "version": 3 |
| 381 | + }, |
| 382 | + "file_extension": ".py", |
| 383 | + "mimetype": "text/x-python", |
| 384 | + "name": "python", |
| 385 | + "nbconvert_exporter": "python", |
| 386 | + "pygments_lexer": "ipython3", |
| 387 | + "version": "3.11.0" |
| 388 | + } |
| 389 | + }, |
| 390 | + "nbformat": 4, |
| 391 | + "nbformat_minor": 2 |
| 392 | +} |
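
The notebook's steps (column statistics, broadcasting, and the 60/20/20 split) can be condensed into the sketch below. It is not the notebook's own code: the seeded `np.random.default_rng(0)` generator, the `endpoint=True` argument, the `n_train` and `n_val` names, and the use of `np.split` are assumptions made here so that the run is repeatable and the split fits on one line.

```python
import numpy as np

# Assumption: a seeded Generator replaces the notebook's unseeded
# np.random calls so that repeated runs give the same arrays.
rng = np.random.default_rng(0)

# 1000 x 20 random integers; endpoint=True makes 5000 an allowed value.
X = rng.integers(0, 5000, size=(1000, 20), endpoint=True)

# Column-wise average and standard deviation, each with shape (20,).
ave_cols = X.mean(axis=0)
std_cols = X.std(axis=0)

# Broadcasting stretches the (20,) vectors across all 1000 rows.
X_norm = (X - ave_cols) / std_cols

# Shuffle the row indices once, then cut at the 60% and 80% marks.
row_indices = rng.permutation(X_norm.shape[0])
n_train = int(0.6 * len(row_indices))
n_val = int(0.8 * len(row_indices))
X_train, X_crossVal, X_test = np.split(X_norm[row_indices], [n_train, n_val])

print(X_train.shape, X_crossVal.shape, X_test.shape)
```

Because every row index appears exactly once in `row_indices`, the three sets are disjoint and together cover all rows of `X_norm`.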