diff --git a/notebooks/machine-learning/answers/0.bias-and-common-errors.ipynb b/notebooks/machine-learning/answers/0.bias-and-common-errors.ipynb index d02f22faef70e5a95e4f1939c0cbfca05a01c4f8..bafde20c133a916d453932927ab18586f48c7478 100644 --- a/notebooks/machine-learning/answers/0.bias-and-common-errors.ipynb +++ b/notebooks/machine-learning/answers/0.bias-and-common-errors.ipynb @@ -32,8 +32,153 @@ }, { "cell_type": "markdown", - "id": "61c8d84f-a791-425e-ae70-306f0da93a55", + "id": "220410a9-d71d-4d16-b724-1f31539ed987", + "metadata": {}, + "source": [ + "## A gender study" + ] + }, + { + "cell_type": "markdown", + "id": "cfc33885-ca65-4f89-8eac-04d519b8c6ab", + "metadata": {}, + "source": [ + "The [*Self-Reports of Height and Weight*](../0.about-datasets.ipynb#Self-Reports-of-Height-and-Weight) survey (Davis, 1990) compares the heights and weights self-reported by individuals enrolled in an exercise program with the measurements taken by the supervising team.\n", + "\n", + "Suppose that, given the recorded values, we want to infer the *M* or *F* label associated with each individual. Let us first load the data and display a summary:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f7609ab-f6d7-459a-bdea-cfab3f255332", + "metadata": {}, + "outputs": [], + "source": [ + "# load data\n", + "df = pd.read_csv(\"../files/davis.csv\", sep=\"\\t\")\n", + "\n", + "# select variables\n", + "target = \"sex\"\n", + "features = [\"weight\", \"height\", \"repwt\", \"repht\"]\n", + "\n", + "# a copy of the data frame\n", + "data = df.copy()\n", + "data = data[[target] + features]\n", + "\n", + "data.info()" + ] + }, + { + "cell_type": "markdown", + "id": "e3479ed2-ec29-4a05-9554-1691a59f3e4d", "metadata": {}, + "source": [ + "The dataset contains 200 observations, but because some of them have empty fields, we first need to deal with the missing values. 
Our strategy is to fill the missing values with the column mean:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "638eaa6f-d30a-45f3-b888-d727eb00ef53", + "metadata": {}, + "outputs": [], + "source": [ + "# mean value\n", + "repwt_mean = int(data.repwt.mean())\n", + "repht_mean = int(data.repht.mean())\n", + "\n", + "# fill NA (assign back: an in-place fillna on an attribute-accessed column may not update the frame)\n", + "data[\"repwt\"] = data[\"repwt\"].fillna(repwt_mean)\n", + "data[\"repht\"] = data[\"repht\"].fillna(repht_mean)\n", + "\n", + "data.info()" + ] + }, + { + "cell_type": "markdown", + "id": "ee52c74a-9e9d-4998-99f6-f370419a7926", + "metadata": {}, + "source": [ + "The second step is to split the *dataset* into two unequal parts: the training set, made up of 80% of the data, and the test set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cece0234-c72c-4f7f-a0db-4805e0f98f0f", + "metadata": {}, + "outputs": [], + "source": [ + "limit = int(len(data) * 0.2)\n", + "\n", + "# split\n", + "train = data[limit:]\n", + "test = data[:limit]" + ] + }, + { + "cell_type": "markdown", + "id": "f3d0c802-2b5f-48bf-a34c-65ad0b30520b", + "metadata": {}, + "source": [ + "Let us now look at the relationship between weight and height. Intuitively, we would expect these two characteristics to be positively correlated overall: an increase in one goes with an increase in the other. 
If we plot a regression line over the full dataset, we do observe the expected pattern:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "312afd57-af0f-4e38-9154-a05c0402715e", + "metadata": {}, + "outputs": [], + "source": [ + "_ = sns.regplot(data=data, x=\"weight\", y=\"height\")" + ] + }, + { + "cell_type": "markdown", + "id": "84c8a544-351b-4b47-89ba-abfb9f1f031e", + "metadata": {}, + "source": [ + "However, the same does not hold for the training and test sets:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cdc0d499-1ee3-4740-9c9d-2d3b4b0b5f90", + "metadata": {}, + "outputs": [], + "source": [ + "figure, (col_1, col_2) = plt.subplots(1, 2, figsize=(12, 4))\n", + "\n", + "sns.regplot(data=train, x=\"weight\", y=\"height\", ax=col_1)\n", + "sns.regplot(data=test, x=\"weight\", y=\"height\", ax=col_2)\n", + "\n", + "figure.suptitle(\"Relationship between the weight and height of individuals\", y=1.05)\n", + "\n", + "col_1.set(title=\"Training set\")\n", + "col_2.set(title=\"Test set\")\n", + "\n", + "sns.despine()\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "c500a2fa-07c5-45f4-a8c7-548abd3d0c9e", + "metadata": {}, + "source": [ + "In your opinion, which errors may have skewed our interpretation?" + ] + }, + { + "cell_type": "markdown", + "id": "61c8d84f-a791-425e-ae70-306f0da93a55", + "metadata": { + "tags": [] + }, "source": [ "## Long-distance relationships" ] @@ -97,7 +242,7 @@ "id": "fbb20849-4a22-4870-940b-8067fd06e548", "metadata": {}, "source": [ - "Nothing very conclusive at first glance, is there? To check visually whether distance and recession velocity are linearly related, plot a regression line:" + "Um… nothing very conclusive at first glance, is there? 
To check visually whether distance and recession velocity are linearly related, plot a regression line:" ] }, { diff --git a/notebooks/machine-learning/files/davis.csv b/notebooks/machine-learning/files/davis.csv index 92effcd007b3a3c5e7c501e72d3a87d4c9476b93..a1a4dc96f029014b6a8802695ab11b1bacb6ff7e 100644 --- a/notebooks/machine-learning/files/davis.csv +++ b/notebooks/machine-learning/files/davis.csv @@ -1,201 +1,201 @@ -"","sex","weight","height","repwt","repht" -"1","M",77,182,77,180 -"2","F",58,161,51,159 -"3","F",53,161,54,158 -"4","M",68,177,70,175 -"5","F",59,157,59,155 -"6","M",76,170,76,165 -"7","M",76,167,77,165 -"8","M",69,186,73,180 -"9","M",71,178,71,175 -"10","M",65,171,64,170 -"11","M",70,175,75,174 -"12","F",166,57,56,163 -"13","F",51,161,52,158 -"14","F",64,168,64,165 -"15","F",52,163,57,160 -"16","F",65,166,66,165 -"17","M",92,187,101,185 -"18","F",62,168,62,165 -"19","M",76,197,75,200 -"20","F",61,175,61,171 -"21","M",119,180,124,178 -"22","F",61,170,61,170 -"23","M",65,175,66,173 -"24","M",66,173,70,170 -"25","F",54,171,59,168 -"26","F",50,166,50,165 -"27","F",63,169,61,168 -"28","F",58,166,60,160 -"29","F",39,157,41,153 -"30","M",101,183,100,180 -"31","F",71,166,71,165 -"32","M",75,178,73,175 -"33","M",79,173,76,173 -"34","F",52,164,52,161 -"35","F",68,169,63,170 -"36","M",64,176,65,175 -"37","F",56,166,54,165 -"38","M",69,174,69,171 -"39","M",88,178,86,175 -"40","M",65,187,67,188 -"41","F",54,164,53,160 -"42","M",80,178,80,178 -"43","F",63,163,59,159 -"44","M",78,183,80,180 -"45","M",85,179,82,175 -"46","F",54,160,55,158 -"47","M",73,180,NA,NA -"48","F",49,161,NA,NA -"49","F",54,174,56,173 -"50","F",75,162,75,158 -"51","M",82,182,85,183 -"52","F",56,165,57,163 -"53","M",74,169,73,170 -"54","M",102,185,107,185 -"55","M",64,177,NA,NA -"56","M",65,176,64,172 -"57","F",66,170,65,NA -"58","M",73,183,74,180 -"59","M",75,172,70,169 -"60","M",57,173,58,170 -"61","M",68,165,69,165 
-"62","M",71,177,71,170 -"63","M",71,180,76,175 -"64","F",78,173,75,169 -"65","M",97,189,98,185 -"66","F",60,162,59,160 -"67","F",64,165,63,163 -"68","F",64,164,62,161 -"69","F",52,158,51,155 -"70","M",80,178,76,175 -"71","F",62,175,61,171 -"72","M",66,173,66,175 -"73","F",55,165,54,163 -"74","F",56,163,57,159 -"75","F",50,166,50,161 -"76","F",50,171,NA,NA -"77","F",50,160,55,150 -"78","F",63,160,64,158 -"79","M",69,182,70,180 -"80","M",69,183,70,183 -"81","F",61,165,60,163 -"82","M",55,168,56,170 -"83","F",53,169,52,175 -"84","F",60,167,55,163 -"85","F",56,170,56,170 -"86","M",59,182,61,183 -"87","M",62,178,66,175 -"88","F",53,165,53,165 -"89","F",57,163,59,160 -"90","F",57,162,56,160 -"91","M",70,173,68,170 -"92","F",56,161,56,161 -"93","M",84,184,86,183 -"94","M",69,180,71,180 -"95","M",88,189,87,185 -"96","F",56,165,57,160 -"97","M",103,185,101,182 -"98","F",50,169,50,165 -"99","F",52,159,52,153 -"100","F",55,155,NA,154 -"101","F",55,164,55,163 -"102","M",63,178,63,175 -"103","F",47,163,47,160 -"104","F",45,163,45,160 -"105","F",62,175,63,173 -"106","F",53,164,51,160 -"107","F",52,152,51,150 -"108","F",57,167,55,164 -"109","F",64,166,64,165 -"110","F",59,166,55,163 -"111","M",84,183,90,183 -"112","M",79,179,79,171 -"113","F",55,174,57,171 -"114","M",67,179,67,179 -"115","F",76,167,77,165 -"116","F",62,168,62,163 -"117","M",83,184,83,181 -"118","M",96,184,94,183 -"119","M",75,169,76,165 -"120","M",65,178,66,178 -"121","M",78,178,77,175 -"122","M",69,167,73,165 -"123","F",68,178,68,175 -"124","F",55,165,55,163 -"125","M",67,179,NA,NA -"126","F",52,169,56,NA -"127","F",47,153,NA,154 -"128","F",45,157,45,153 -"129","F",68,171,68,169 -"130","F",44,157,44,155 -"131","F",62,166,61,163 -"132","M",87,185,89,185 -"133","F",56,160,53,158 -"134","F",50,148,47,148 -"135","M",83,177,84,175 -"136","F",53,162,53,160 -"137","F",64,172,62,168 -"138","F",62,167,NA,NA -"139","M",90,188,91,185 -"140","M",85,191,83,188 -"141","M",66,175,68,175 -"142","F",52,163,53,160 
-"143","F",53,165,55,163 -"144","F",54,176,55,176 -"145","F",64,171,66,171 -"146","F",55,160,55,155 -"147","F",55,165,55,165 -"148","F",59,157,55,158 -"149","F",70,173,67,170 -"150","M",88,184,86,183 -"151","F",57,168,58,165 -"152","F",47,162,47,160 -"153","F",47,150,45,152 -"154","F",55,162,NA,NA -"155","F",48,163,44,160 -"156","M",54,169,58,165 -"157","M",69,172,68,174 -"158","F",59,170,NA,NA -"159","F",58,169,NA,NA -"160","F",57,167,56,165 -"161","F",51,163,50,160 -"162","F",54,161,54,160 -"163","F",53,162,52,158 -"164","F",59,172,58,171 -"165","M",56,163,58,161 -"166","F",59,159,59,155 -"167","F",63,170,62,168 -"168","F",66,166,66,165 -"169","M",96,191,95,188 -"170","F",53,158,50,155 -"171","M",76,169,75,165 -"172","F",54,163,NA,NA -"173","M",61,170,61,170 -"174","M",82,176,NA,NA -"175","M",62,168,64,168 -"176","M",71,178,68,178 -"177","F",60,174,NA,NA -"178","M",66,170,67,165 -"179","M",81,178,82,175 -"180","M",68,174,68,173 -"181","M",80,176,78,175 -"182","F",43,154,NA,NA -"183","M",82,181,NA,NA -"184","F",63,165,59,160 -"185","M",70,173,70,173 -"186","F",56,162,56,160 -"187","F",60,172,55,168 -"188","F",58,169,54,166 -"189","M",76,183,75,180 -"190","F",50,158,49,155 -"191","M",88,185,93,188 -"192","M",89,173,86,173 -"193","F",59,164,59,165 -"194","F",51,156,51,158 -"195","F",62,164,61,161 -"196","M",74,175,71,175 -"197","M",83,180,80,180 -"198","M",81,175,NA,NA -"199","M",90,181,91,178 -"200","M",79,177,81,178 + sex weight height repwt repht +1 F 166 57 56 163 +2 F 50 148 47 148 +3 F 47 150 45 152 +4 F 52 152 51 150 +5 F 47 153 NA 154 +6 F 43 154 NA NA +7 F 55 155 NA 154 +8 F 51 156 51 158 +9 F 59 157 59 155 +10 F 39 157 41 153 +11 F 45 157 45 153 +12 F 44 157 44 155 +13 F 59 157 55 158 +14 F 52 158 51 155 +15 F 53 158 50 155 +16 F 50 158 49 155 +17 F 52 159 52 153 +18 F 59 159 59 155 +19 F 54 160 55 158 +20 F 50 160 55 150 +21 F 63 160 64 158 +22 F 56 160 53 158 +23 F 55 160 55 155 +24 F 58 161 51 159 +25 F 53 161 54 158 +26 F 51 161 52 158 +27 F 49 161 NA 
NA +28 F 56 161 56 161 +29 F 54 161 54 160 +30 F 75 162 75 158 +31 F 60 162 59 160 +32 F 57 162 56 160 +33 F 53 162 53 160 +34 F 47 162 47 160 +35 F 55 162 NA NA +36 F 53 162 52 158 +37 F 56 162 56 160 +38 F 52 163 57 160 +39 F 63 163 59 159 +40 F 56 163 57 159 +41 F 57 163 59 160 +42 F 47 163 47 160 +43 F 45 163 45 160 +44 F 52 163 53 160 +45 F 48 163 44 160 +46 F 51 163 50 160 +47 F 54 163 NA NA +48 M 56 163 58 161 +49 F 52 164 52 161 +50 F 54 164 53 160 +51 F 64 164 62 161 +52 F 55 164 55 163 +53 F 53 164 51 160 +54 F 59 164 59 165 +55 F 62 164 61 161 +56 F 56 165 57 163 +57 F 64 165 63 163 +58 F 55 165 54 163 +59 F 61 165 60 163 +60 F 53 165 53 165 +61 F 56 165 57 160 +62 F 55 165 55 163 +63 F 53 165 55 163 +64 F 55 165 55 165 +65 F 63 165 59 160 +66 M 68 165 69 165 +67 F 65 166 66 165 +68 F 50 166 50 165 +69 F 58 166 60 160 +70 F 71 166 71 165 +71 F 56 166 54 165 +72 F 50 166 50 161 +73 F 64 166 64 165 +74 F 59 166 55 163 +75 F 62 166 61 163 +76 F 66 166 66 165 +77 F 60 167 55 163 +78 F 57 167 55 164 +79 F 76 167 77 165 +80 F 62 167 NA NA +81 F 57 167 56 165 +82 M 76 167 77 165 +83 M 69 167 73 165 +84 F 64 168 64 165 +85 F 62 168 62 165 +86 F 62 168 62 163 +87 F 57 168 58 165 +88 M 55 168 56 170 +89 M 62 168 64 168 +90 F 63 169 61 168 +91 F 68 169 63 170 +92 F 53 169 52 175 +93 F 50 169 50 165 +94 F 52 169 56 NA +95 F 58 169 NA NA +96 F 58 169 54 166 +97 M 74 169 73 170 +98 M 75 169 76 165 +99 M 54 169 58 165 +100 M 76 169 75 165 +101 F 61 170 61 170 +102 F 66 170 65 NA +103 F 56 170 56 170 +104 F 59 170 NA NA +105 F 63 170 62 168 +106 M 76 170 76 165 +107 M 61 170 61 170 +108 M 66 170 67 165 +109 F 54 171 59 168 +110 F 50 171 NA NA +111 F 68 171 68 169 +112 F 64 171 66 171 +113 M 65 171 64 170 +114 F 64 172 62 168 +115 F 59 172 58 171 +116 F 60 172 55 168 +117 M 75 172 70 169 +118 M 69 172 68 174 +119 F 78 173 75 169 +120 F 70 173 67 170 +121 M 66 173 70 170 +122 M 79 173 76 173 +123 M 57 173 58 170 +124 M 66 173 66 175 +125 M 70 173 68 170 +126 M 70 173 70 
173 +127 M 89 173 86 173 +128 F 54 174 56 173 +129 F 55 174 57 171 +130 F 60 174 NA NA +131 M 69 174 69 171 +132 M 68 174 68 173 +133 F 61 175 61 171 +134 F 62 175 61 171 +135 F 62 175 63 173 +136 M 70 175 75 174 +137 M 65 175 66 173 +138 M 66 175 68 175 +139 M 74 175 71 175 +140 M 81 175 NA NA +141 F 54 176 55 176 +142 M 64 176 65 175 +143 M 65 176 64 172 +144 M 82 176 NA NA +145 M 80 176 78 175 +146 M 68 177 70 175 +147 M 64 177 NA NA +148 M 71 177 71 170 +149 M 83 177 84 175 +150 M 79 177 81 178 +151 F 68 178 68 175 +152 M 71 178 71 175 +153 M 75 178 73 175 +154 M 88 178 86 175 +155 M 80 178 80 178 +156 M 80 178 76 175 +157 M 62 178 66 175 +158 M 63 178 63 175 +159 M 65 178 66 178 +160 M 78 178 77 175 +161 M 71 178 68 178 +162 M 81 178 82 175 +163 M 85 179 82 175 +164 M 79 179 79 171 +165 M 67 179 67 179 +166 M 67 179 NA NA +167 M 119 180 124 178 +168 M 73 180 NA NA +169 M 71 180 76 175 +170 M 69 180 71 180 +171 M 83 180 80 180 +172 M 82 181 NA NA +173 M 90 181 91 178 +174 M 77 182 77 180 +175 M 82 182 85 183 +176 M 69 182 70 180 +177 M 59 182 61 183 +178 M 101 183 100 180 +179 M 78 183 80 180 +180 M 73 183 74 180 +181 M 69 183 70 183 +182 M 84 183 90 183 +183 M 76 183 75 180 +184 M 84 184 86 183 +185 M 83 184 83 181 +186 M 96 184 94 183 +187 M 88 184 86 183 +188 M 102 185 107 185 +189 M 103 185 101 182 +190 M 87 185 89 185 +191 M 88 185 93 188 +192 M 69 186 73 180 +193 M 92 187 101 185 +194 M 65 187 67 188 +195 M 90 188 91 185 +196 M 97 189 98 185 +197 M 88 189 87 185 +198 M 85 191 83 188 +199 M 96 191 95 188 +200 M 76 197 75 200