{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 3: Hello Vectors\n",
"\n",
"Welcome to this week's programming assignment on exploring word vectors.\n",
"In natural language processing, we represent each word as a vector consisting of numbers.\n",
"The vector encodes the meaning of the word. These numbers (or weights) for each word are learned using various machine\n",
"learning models, which we will explore in more detail later in this specialization. Rather than make you code the\n",
"machine learning models from scratch, we will show you how to use them. In the real world, you can always load the\n",
"trained word vectors, and you will almost never have to train them from scratch. In this assignment, you will:\n",
"\n",
"- Predict analogies between words.\n",
"- Use PCA to reduce the dimensionality of the word embeddings and plot them in two dimensions.\n",
"- Compare word embeddings by using a similarity measure (the cosine similarity).\n",
"- Understand how these vector space models work.\n",
"\n",
"\n",
"\n",
"## 1.0 Predict the Countries from Capitals\n",
"\n",
"In the lectures, we have illustrated the word analogies\n",
"by finding the capital of a country from the country. \n",
"We have changed the problem a bit in this part of the assignment. You are asked to predict the **countries** \n",
"that corresponds to some **capitals**.\n",
"You are playing trivia against some second grader who just took their geography test and knows all the capitals by heart.\n",
"Thanks to NLP, you will be able to answer the questions properly. In other words, you will write a program that can give\n",
"you the country by its capital. That way you are pretty sure you will win the trivia game. We will start by exploring the data set.\n",
"\n",
"\n",
"\n",
"### 1.1 Importing the data\n",
"\n",
"As usual, you start by importing some essential Python libraries and then load the dataset.\n",
"The dataset will be loaded as a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html),\n",
"which is very a common method in data science.\n",
"This may take a few minutes because of the large size of the data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Run this cell to import packages.\n",
"import pickle\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from utils import get_vectors"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | city1 | \n", "country1 | \n", "city2 | \n", "country2 | \n", "
---|---|---|---|---|
0 | \n", "Athens | \n", "Greece | \n", "Bangkok | \n", "Thailand | \n", "
1 | \n", "Athens | \n", "Greece | \n", "Beijing | \n", "China | \n", "
2 | \n", "Athens | \n", "Greece | \n", "Berlin | \n", "Germany | \n", "
3 | \n", "Athens | \n", "Greece | \n", "Bern | \n", "Switzerland | \n", "
4 | \n", "Athens | \n", "Greece | \n", "Cairo | \n", "Egypt | \n", "
\n", "
\n", "
\n", "
\n", "
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Hints**\n",
    "\n",
    "- Use `numpy.mean(a, axis=None)`: if you set `axis = 0`, you take the mean of each column. If you set `axis = 1`, you take the mean of each row. Remember that each row is a word vector, and the number of columns is the number of dimensions in a word vector.\n",
    "- Use `numpy.cov(m, rowvar=True)` to compute the covariance matrix. By default, `rowvar` is `True`. From the documentation: \"If rowvar is True (default), then each row represents a variable, with observations in the columns.\" In our case, each row is a word vector observation, and each column is a feature (variable).\n",
    "- `numpy.argsort` sorts the values in an array from smallest to largest and returns the indices of that sort. To reverse the order of an array, you can use `x[::-1]`.\n",
    "- To apply the sorted indices to the eigenvalues, use `x[indices_sorted]`. When applying the sorted indices to the eigenvectors, note that each column represents an eigenvector; to preserve the rows but reorder the columns, use `x[:, indices_sorted]`.\n",
    "- To transform the data using a subset of the most relevant principal components, take the matrix multiplication of the eigenvectors and the data:\n",
    "  - The data is of shape `(n_observations, n_features)`.\n",
    "  - The subset of eigenvectors is of shape `(n_features, n_components)`.\n",
    "  - Multiply the transpose of the eigenvectors `(n_components, n_features)` by the transpose of the data `(n_features, n_observations)`. The product has shape `(n_components, n_observations)`; take its transpose to get the shape `(n_observations, n_components)`."
   ]
  },
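  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Putting the hints together, here is one possible sketch of a `compute_pca` function. It is a sketch, not the graded solution; the seed and matrix shape in the quick check are assumptions chosen to match the expected output shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal PCA sketch assembled from the hints above (not the graded solution).\n",
    "def compute_pca(X, n_components=2):\n",
    "    # Center the data: mean of each column (feature), hence axis=0.\n",
    "    X_demeaned = X - np.mean(X, axis=0)\n",
    "\n",
    "    # Covariance of the features: columns are the variables, so rowvar=False.\n",
    "    covariance_matrix = np.cov(X_demeaned, rowvar=False)\n",
    "\n",
    "    # Eigen-decomposition of the symmetric covariance matrix.\n",
    "    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix)\n",
    "\n",
    "    # argsort gives ascending order; reverse with [::-1] for descending.\n",
    "    idx_sorted = np.argsort(eigen_vals)[::-1]\n",
    "\n",
    "    # Each column is an eigenvector, so reorder the columns, then keep\n",
    "    # the first n_components: shape (n_features, n_components).\n",
    "    eigen_vecs_subset = eigen_vecs[:, idx_sorted][:, :n_components]\n",
    "\n",
    "    # Project: (n_components, n_features) @ (n_features, n_observations),\n",
    "    # then transpose to get (n_observations, n_components).\n",
    "    return (eigen_vecs_subset.T @ X_demeaned.T).T\n",
    "\n",
    "# Quick check on a small random matrix (seed and shape are assumptions).\n",
    "np.random.seed(1)\n",
    "X = np.random.rand(3, 10)\n",
    "print(compute_pca(X, n_components=2))"
   ]
  },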
.\n", " 0.43437323\n", " | \n", "\n", " 0.49820384\n", " | \n", "
\n", " 0.42077249\n", " | \n", "\n", " -0.50351448\n", " | \n", "
\n", " -0.85514571\n", " | \n", "\n", " 0.00531064\n", " | \n", "