Homework 1: COVID-19 Cases Prediction (Regression)
Deep Learning
朱世林
Published on 2024-04-11
Recommended image: Basic Image:bohrium-notebook:2023-04-07
Recommended machine type: c2_m4_cpu

Author: Heng-Jui Chang

Slides: https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.pdf
Video: TBA

Objectives:

  • Solve a regression problem with deep neural networks (DNN).
  • Understand basic DNN training tips.
  • Get familiar with PyTorch.

If you have any questions, please contact the TAs via TA hours, NTU COOL, or email.

Download Data

If the Google Drive links are dead, you can download the data from Kaggle and upload it manually to the workspace.

import os
os.environ['HTTP_PROXY'] = 'http://ga.dp.tech:8118'
os.environ['HTTPS_PROXY'] = 'http://ga.dp.tech:8118'

tr_path = 'covid.train.csv' # path to training data
tt_path = 'covid.test.csv' # path to testing data

!gdown --id '19CCyCgJrUxtvgZF53vnctJiOJ23T5mqF' --output covid.train.csv
!gdown --id '1CE240jLm2npU-tdz81-oVKEF3T2yfT1O' --output covid.test.csv
/opt/conda/lib/python3.8/site-packages/gdown/cli.py:127: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
  warnings.warn(
Downloading...
From: https://drive.google.com/uc?id=19CCyCgJrUxtvgZF53vnctJiOJ23T5mqF
To: /covid.train.csv
100%|██████████████████████████████████████| 2.00M/2.00M [00:00<00:00, 6.39MB/s]
/opt/conda/lib/python3.8/site-packages/gdown/cli.py:127: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
  warnings.warn(
Downloading...
From: https://drive.google.com/uc?id=1CE240jLm2npU-tdz81-oVKEF3T2yfT1O
To: /covid.test.csv
100%|████████████████████████████████████████| 651k/651k [00:00<00:00, 9.27MB/s]

Import Some Packages

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# For data preprocess
import numpy as np
import csv
import os

# For plotting
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

myseed = 42069 # set a random seed for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(myseed)
torch.manual_seed(myseed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(myseed)

Some Utilities

You do not need to modify this part.

def get_device():
    ''' Get device (if GPU is available, use GPU) '''
    return 'cuda' if torch.cuda.is_available() else 'cpu'

def plot_learning_curve(loss_record, title=''):
    ''' Plot learning curve of your DNN (train & dev loss) '''
    total_steps = len(loss_record['train'])
    x_1 = range(total_steps)
    x_2 = x_1[::len(loss_record['train']) // len(loss_record['dev'])]
    figure(figsize=(6, 4))
    plt.plot(x_1, loss_record['train'], c='tab:red', label='train')
    plt.plot(x_2, loss_record['dev'], c='tab:cyan', label='dev')
    plt.ylim(0.0, 5.)
    plt.xlabel('Training steps')
    plt.ylabel('MSE loss')
    plt.title('Learning curve of {}'.format(title))
    plt.legend()
    plt.show()


def plot_pred(dv_set, model, device, lim=35., preds=None, targets=None):
    ''' Plot prediction of your DNN '''
    if preds is None or targets is None:
        model.eval()
        preds, targets = [], []
        for x, y in dv_set:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                preds.append(pred.detach().cpu())
                targets.append(y.detach().cpu())
        preds = torch.cat(preds, dim=0).numpy()
        targets = torch.cat(targets, dim=0).numpy()

    figure(figsize=(5, 5))
    plt.scatter(targets, preds, c='r', alpha=0.5)
    plt.plot([-0.2, lim], [-0.2, lim], c='b')
    plt.xlim(-0.2, lim)
    plt.ylim(-0.2, lim)
    plt.xlabel('ground truth value')
    plt.ylabel('predicted value')
    plt.title('Ground Truth v.s. Prediction')
    plt.show()

Preprocess

We have three kinds of datasets:

  • train: for training
  • dev: for validation
  • test: for testing (w/o target value)

Dataset

The COVID19Dataset below does:

  • read .csv files
  • extract features
  • split covid.train.csv into train/dev sets
  • normalize features

Completing the TODO below may be enough to pass the medium baseline.
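
The split and normalization steps listed above can be sketched on a toy array. This is only an illustration under stated assumptions: the array `toy`, its shape, and the variable names are made up for the example and do not come from the assignment data. It keeps every tenth row for dev, as COVID19Dataset does, and z-scores each feature column using the statistics of the rows it is given.

```python
import numpy as np

# Hypothetical toy data: 20 samples, 3 features (not the real COVID-19 CSV).
rng = np.random.default_rng(0)
toy = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Same rule as in COVID19Dataset: every 10th row goes to dev, the rest to train.
train_idx = [i for i in range(len(toy)) if i % 10 != 0]
dev_idx = [i for i in range(len(toy)) if i % 10 == 0]
train, dev = toy[train_idx], toy[dev_idx]

# Z-score normalization per feature column.
# ddof=1 matches the unbiased default of torch.Tensor.std.
mean = train.mean(axis=0, keepdims=True)
std = train.std(axis=0, ddof=1, keepdims=True)
train_norm = (train - mean) / std

print(train.shape, dev.shape)  # (18, 3) (2, 3)
```

After this step each column of `train_norm` has mean close to 0 and standard deviation close to 1, which keeps features with large raw scales from dominating the gradient.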

class COVID19Dataset(Dataset):
    ''' Dataset for loading and preprocessing the COVID19 dataset '''
    def __init__(self,
                 path,
                 mode='train',
                 target_only=False):
        self.mode = mode

        # Read data into numpy arrays
        with open(path, 'r') as fp:
            data = list(csv.reader(fp))
            data = np.array(data[1:])[:, 1:].astype(float)
        if not target_only:
            feats = list(range(93))
        else:
            # TODO: Using 40 states & 2 tested_positive features (indices = 57 & 75)
            pass

        if mode == 'test':
            # Testing data
            # data: 893 x 93 (40 states + day 1 (18) + day 2 (18) + day 3 (17))
            data = data[:, feats]
            self.data = torch.FloatTensor(data)
        else:
            # Training data (train/dev sets)
            # data: 2700 x 94 (40 states + day 1 (18) + day 2 (18) + day 3 (18))
            target = data[:, -1]
            data = data[:, feats]
            # Splitting training data into train & dev sets
            if mode == 'train':
                indices = [i for i in range(len(data)) if i % 10 != 0]
            elif mode == 'dev':
                indices = [i for i in range(len(data)) if i % 10 == 0]
            # Convert data into PyTorch tensors
            self.data = torch.FloatTensor(data[indices])
            self.target = torch.FloatTensor(target[indices])

        # Normalize features (you may remove this part to see what will happen)
        self.data[:, 40:] = \
            (self.data[:, 40:] - self.data[:, 40:].mean(dim=0, keepdim=True)) \
            / self.data[:, 40:].std(dim=0, keepdim=True)

        self.dim = self.data.shape[1]

        print('Finished reading the {} set of COVID19 Dataset ({} samples found, each dim = {})'
              .format(mode, len(self.data), self.dim))

    def __getitem__(self, index):
        # Returns one sample at a time
        if self.mode in ['train', 'dev']:
            # For training
            return self.data[index], self.target[index]
        else:
            # For testing (no target)
            return self.data[index]

    def __len__(self):
        # Returns the size of the dataset
        return len(self.data)

DataLoader

A DataLoader loads data from a given Dataset into batches.
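
To make the batching behaviour concrete, here is a small sketch with a toy `TensorDataset` standing in for COVID19Dataset. The tensors `features` and `targets` and the batch size of 4 are assumptions chosen for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 samples with 4 features and one target each.
features = torch.arange(40, dtype=torch.float32).reshape(10, 4)
targets = torch.arange(10, dtype=torch.float32)
toy_set = TensorDataset(features, targets)

# batch_size=4 with drop_last=False yields batches of 4, 4 and 2 samples.
loader = DataLoader(toy_set, batch_size=4, shuffle=False, drop_last=False)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(batch_shapes)  # [((4, 4), (4,)), ((4, 4), (4,)), ((2, 4), (2,))]
```

Because `drop_last=False`, the final, smaller batch is kept, so every sample is seen once per pass over the loader.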

def prep_dataloader(path, mode, batch_size, n_jobs=0, target_only=False):
    ''' Generates a dataset, then wraps it in a dataloader. '''
    dataset = COVID19Dataset(path, mode=mode, target_only=target_only)  # Construct dataset
    dataloader = DataLoader(
        dataset, batch_size,
        shuffle=(mode == 'train'), drop_last=False,
        num_workers=n_jobs, pin_memory=True)                            # Construct dataloader
    return dataloader

Deep Neural Network

NeuralNet is an nn.Module designed for regression. The DNN consists of 2 fully-connected layers with a ReLU activation between them. The module also includes a method, cal_loss, for calculating the loss.
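
As a rough sketch of what this architecture does to tensor shapes, the snippet below rebuilds the same layer layout outside the class and pushes a random batch through it. The batch size of 8 and the random inputs are assumptions for illustration; `input_dim = 93` is the feature count reported when the datasets are loaded.

```python
import torch
import torch.nn as nn

input_dim = 93  # feature count reported when the datasets are loaded

# Same layer layout as NeuralNet.net: Linear -> ReLU -> Linear.
net = nn.Sequential(
    nn.Linear(input_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
criterion = nn.MSELoss(reduction='mean')

x = torch.randn(8, input_dim)   # hypothetical batch of 8 samples
y = torch.randn(8)              # hypothetical targets, one per sample

raw = net(x)                    # shape (8, 1)
pred = raw.squeeze(1)           # shape (8,), same shape as the targets
loss = criterion(pred, y)       # scalar tensor

# (93 * 64 + 64) weights and biases in the first layer, (64 + 1) in the second.
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 6081
```

The `squeeze(1)` matters: comparing a `(batch, 1)` prediction with a `(batch,)` target would make the loss broadcast the two against each other instead of pairing each prediction with its own target.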

class NeuralNet(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self, input_dim):
        super(NeuralNet, self).__init__()

        # Define your neural network here
        # TODO: How to modify this model to achieve better performance?
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

        # Mean squared error loss
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        ''' Given input of size (batch_size x input_dim), compute output of the network '''
        return self.net(x).squeeze(1)

    def cal_loss(self, pred, target):
        ''' Calculate loss '''
        # TODO: you may implement L2 regularization here
        return self.criterion(pred, target)

Train/Dev/Test

Training

def train(tr_set, dv_set, model, config, device):
    ''' DNN training '''

    n_epochs = config['n_epochs']  # Maximum number of epochs

    # Setup optimizer
    optimizer = getattr(torch.optim, config['optimizer'])(
        model.parameters(), **config['optim_hparas'])

    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}      # for recording training loss
    early_stop_cnt = 0
    epoch = 0
    while epoch < n_epochs:
        model.train()                           # set model to training mode
        for x, y in tr_set:                     # iterate through the dataloader
            optimizer.zero_grad()               # set gradient to zero
            x, y = x.to(device), y.to(device)   # move data to device (cpu/cuda)
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            mse_loss.backward()                 # compute gradient (backpropagation)
            optimizer.step()                    # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())

        # After each epoch, test your model on the validation (development) set.
        dev_mse = dev(dv_set, model, device)
        if dev_mse < min_mse:
            # Save model if your model improved
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, loss = {:.4f})'
                  .format(epoch + 1, min_mse))
            torch.save(model.state_dict(), config['save_path'])  # Save model to specified path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1

        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # Stop training if your model stops improving for "config['early_stop']" epochs.
            break

    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record

Validation

def dev(dv_set, model, device):
    model.eval()                                # set model to evaluation mode
    total_loss = 0
    for x, y in dv_set:                         # iterate through the dataloader
        x, y = x.to(device), y.to(device)       # move data to device (cpu/cuda)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
        total_loss += mse_loss.detach().cpu().item() * len(x)  # accumulate loss
    total_loss = total_loss / len(dv_set.dataset)              # compute averaged loss

    return total_loss
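
The reason dev multiplies each batch loss by `len(x)` before dividing by the dataset size can be seen with a small arithmetic sketch. The batch losses and sizes below are made-up numbers, not results from this notebook.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller).
batch_losses = [1.0, 1.0, 4.0]
batch_sizes = [4, 4, 2]

# Naive: average of the batch means, which over-weights the small last batch.
naive = sum(batch_losses) / len(batch_losses)

# Weighted: the per-sample mean over the whole set, as dev() computes it.
weighted = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)

print(naive, weighted)  # 2.0 1.6
```

The two values agree only when all batches have the same size, so the weighted form is the safer choice whenever `drop_last=False` can leave a short final batch.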

Testing

def test(tt_set, model, device):
    model.eval()                                # set model to evaluation mode
    preds = []
    for x in tt_set:                            # iterate through the dataloader
        x = x.to(device)                        # move data to device (cpu/cuda)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            preds.append(pred.detach().cpu())   # collect prediction
    preds = torch.cat(preds, dim=0).numpy()     # concatenate all predictions and convert to a numpy array
    return preds

Setup Hyper-parameters

config contains hyper-parameters for training and the path to save your model.
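
The train function turns the 'optimizer' and 'optim_hparas' entries into an optimizer object by looking up the class name in torch.optim. A minimal sketch of that lookup follows; the stand-in model `nn.Linear(3, 1)` is an assumption for the example, while the config values are the ones used in this notebook.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; any nn.Module with parameters works.
model = nn.Linear(3, 1)

config = {
    'optimizer': 'SGD',
    'optim_hparas': {'lr': 0.001, 'momentum': 0.9},
}

# Look up the optimizer class by name, then pass the
# hyper-parameters through as keyword arguments.
optimizer_cls = getattr(torch.optim, config['optimizer'])
optimizer = optimizer_cls(model.parameters(), **config['optim_hparas'])

print(type(optimizer).__name__)         # SGD
print(optimizer.param_groups[0]['lr'])  # 0.001
```

Since the hyper-parameters are forwarded as keyword arguments, they have to match the chosen optimizer. Switching 'optimizer' to 'Adam', for example, also requires dropping 'momentum', which torch.optim.Adam does not accept.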

device = get_device()                 # get the current available device ('cpu' or 'cuda')
os.makedirs('models', exist_ok=True)  # The trained model will be saved to ./models/
target_only = False                   # TODO: Using 40 states & 2 tested_positive features

# TODO: How to tune these hyper-parameters to improve your model's performance?
config = {
    'n_epochs': 3000,                # maximum number of epochs
    'batch_size': 270,               # mini-batch size for dataloader
    'optimizer': 'SGD',              # optimization algorithm (optimizer in torch.optim)
    'optim_hparas': {                # hyper-parameters for the optimizer (depends on which optimizer you are using)
        'lr': 0.001,                 # learning rate of SGD
        'momentum': 0.9              # momentum for SGD
    },
    'early_stop': 200,               # early stopping epochs (the number of epochs since your model's last improvement)
    'save_path': 'models/model.pth'  # your model will be saved here
}

Load data and model

tr_set = prep_dataloader(tr_path, 'train', config['batch_size'], target_only=target_only)
dv_set = prep_dataloader(tr_path, 'dev', config['batch_size'], target_only=target_only)
tt_set = prep_dataloader(tt_path, 'test', config['batch_size'], target_only=target_only)
Finished reading the train set of COVID19 Dataset (2430 samples found, each dim = 93)
Finished reading the dev set of COVID19 Dataset (270 samples found, each dim = 93)
Finished reading the test set of COVID19 Dataset (893 samples found, each dim = 93)
model = NeuralNet(tr_set.dataset.dim).to(device) # Construct model and move to device

Start Training!

model_loss, model_loss_record = train(tr_set, dv_set, model, config, device)
Saving model (epoch =    1, loss = 78.8524)
Saving model (epoch =    2, loss = 37.6170)
Saving model (epoch =    3, loss = 26.1203)
Saving model (epoch =    4, loss = 16.1862)
Saving model (epoch =    5, loss = 9.7153)
Saving model (epoch =    6, loss = 6.3701)
Saving model (epoch =    7, loss = 5.1802)
Saving model (epoch =    8, loss = 4.4255)
Saving model (epoch =    9, loss = 3.8009)
Saving model (epoch =   10, loss = 3.3691)
Saving model (epoch =   11, loss = 3.0943)
Saving model (epoch =   12, loss = 2.8176)
Saving model (epoch =   13, loss = 2.6274)
Saving model (epoch =   14, loss = 2.4542)
Saving model (epoch =   15, loss = 2.3012)
Saving model (epoch =   16, loss = 2.1766)
Saving model (epoch =   17, loss = 2.0641)
Saving model (epoch =   18, loss = 1.9399)
Saving model (epoch =   19, loss = 1.8978)
Saving model (epoch =   20, loss = 1.7950)
Saving model (epoch =   21, loss = 1.7164)
Saving model (epoch =   22, loss = 1.6455)
Saving model (epoch =   23, loss = 1.5912)
Saving model (epoch =   24, loss = 1.5599)
Saving model (epoch =   25, loss = 1.5197)
Saving model (epoch =   26, loss = 1.4698)
Saving model (epoch =   27, loss = 1.4189)
Saving model (epoch =   28, loss = 1.3992)
Saving model (epoch =   29, loss = 1.3696)
Saving model (epoch =   30, loss = 1.3442)
Saving model (epoch =   31, loss = 1.3231)
Saving model (epoch =   32, loss = 1.2834)
Saving model (epoch =   33, loss = 1.2804)
Saving model (epoch =   34, loss = 1.2471)
Saving model (epoch =   36, loss = 1.2414)
Saving model (epoch =   37, loss = 1.2138)
Saving model (epoch =   38, loss = 1.2083)
Saving model (epoch =   41, loss = 1.1591)
Saving model (epoch =   42, loss = 1.1484)
Saving model (epoch =   44, loss = 1.1209)
Saving model (epoch =   47, loss = 1.1122)
Saving model (epoch =   48, loss = 1.0937)
Saving model (epoch =   50, loss = 1.0842)
Saving model (epoch =   53, loss = 1.0654)
Saving model (epoch =   54, loss = 1.0613)
Saving model (epoch =   57, loss = 1.0525)
Saving model (epoch =   58, loss = 1.0395)
Saving model (epoch =   60, loss = 1.0265)
Saving model (epoch =   63, loss = 1.0248)
Saving model (epoch =   66, loss = 1.0098)
Saving model (epoch =   70, loss = 0.9828)
Saving model (epoch =   72, loss = 0.9813)
Saving model (epoch =   73, loss = 0.9740)
Saving model (epoch =   75, loss = 0.9672)
Saving model (epoch =   78, loss = 0.9642)
Saving model (epoch =   79, loss = 0.9594)
Saving model (epoch =   85, loss = 0.9544)
Saving model (epoch =   86, loss = 0.9528)
Saving model (epoch =   90, loss = 0.9464)
Saving model (epoch =   92, loss = 0.9432)
Saving model (epoch =   93, loss = 0.9230)
Saving model (epoch =   95, loss = 0.9126)
Saving model (epoch =  104, loss = 0.9117)
Saving model (epoch =  107, loss = 0.8998)
Saving model (epoch =  110, loss = 0.8940)
Saving model (epoch =  116, loss = 0.8890)
Saving model (epoch =  124, loss = 0.8874)
Saving model (epoch =  128, loss = 0.8729)
Saving model (epoch =  134, loss = 0.8728)
Saving model (epoch =  139, loss = 0.8680)
Saving model (epoch =  146, loss = 0.8656)
Saving model (epoch =  156, loss = 0.8642)
Saving model (epoch =  159, loss = 0.8532)
Saving model (epoch =  167, loss = 0.8502)
Saving model (epoch =  173, loss = 0.8490)
Saving model (epoch =  176, loss = 0.8462)
Saving model (epoch =  178, loss = 0.8411)
Saving model (epoch =  182, loss = 0.8376)
Saving model (epoch =  199, loss = 0.8302)
Saving model (epoch =  202, loss = 0.8300)
Saving model (epoch =  212, loss = 0.8278)
Saving model (epoch =  235, loss = 0.8254)
Saving model (epoch =  238, loss = 0.8238)
Saving model (epoch =  251, loss = 0.8207)
Saving model (epoch =  253, loss = 0.8202)
Saving model (epoch =  258, loss = 0.8177)
Saving model (epoch =  284, loss = 0.8142)
Saving model (epoch =  308, loss = 0.8138)
Saving model (epoch =  312, loss = 0.8080)
Saving model (epoch =  324, loss = 0.8046)
Saving model (epoch =  400, loss = 0.8040)
Saving model (epoch =  404, loss = 0.8011)
Saving model (epoch =  466, loss = 0.8002)
Saving model (epoch =  472, loss = 0.8002)
Saving model (epoch =  525, loss = 0.7995)
Saving model (epoch =  561, loss = 0.7954)
Saving model (epoch =  584, loss = 0.7905)
Saving model (epoch =  667, loss = 0.7890)
Saving model (epoch =  717, loss = 0.7817)
Saving model (epoch =  776, loss = 0.7812)
Saving model (epoch =  835, loss = 0.7806)
Saving model (epoch =  866, loss = 0.7775)
Saving model (epoch =  919, loss = 0.7770)
Saving model (epoch =  933, loss = 0.7748)
Saving model (epoch =  965, loss = 0.7704)
Saving model (epoch = 1027, loss = 0.7671)
Saving model (epoch = 1119, loss = 0.7659)
Saving model (epoch = 1140, loss = 0.7654)
Saving model (epoch = 1196, loss = 0.7622)
Saving model (epoch = 1234, loss = 0.7611)
Saving model (epoch = 1243, loss = 0.7580)
Saving model (epoch = 1323, loss = 0.7571)
Finished training after 1524 epochs
plot_learning_curve(model_loss_record, title='deep model')
del model
model = NeuralNet(tr_set.dataset.dim).to(device)
ckpt = torch.load(config['save_path'], map_location='cpu') # Load your best model
model.load_state_dict(ckpt)
plot_pred(dv_set, model, device) # Show prediction on the validation set

Testing

The predictions of your model on the testing set will be stored in pred.csv.

def save_pred(preds, file):
    ''' Save predictions to specified file '''
    print('Saving results to {}'.format(file))
    # newline='' is what the csv module recommends for files passed to csv.writer
    with open(file, 'w', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])

preds = test(tt_set, model, device)  # predict COVID-19 cases with your model
save_pred(preds, 'pred.csv')         # save prediction file to pred.csv
Saving results to pred.csv

Hints

Simple Baseline

  • Run sample code

Medium Baseline

  • Feature selection: 40 states + 2 tested_positive (TODO in dataset)

Strong Baseline

  • Feature selection (what other features are useful?)
  • DNN architecture (layers? dimension? activation function?)
  • Training (mini-batch? optimizer? learning rate?)
  • L2 regularization
  • There are some mistakes in the sample code. Can you find them?
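
For the L2 regularization hint, one possible approach is to add a squared-norm penalty to the MSE inside a cal_loss-style function. The sketch below is an assumption-laden illustration, not the reference solution: the helper `loss_with_l2`, the small stand-in model, and the value of `l2_lambda` are all hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical small model and data, only to exercise the loss function.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.MSELoss(reduction='mean')
l2_lambda = 1e-4  # hypothetical regularization strength, needs tuning

def loss_with_l2(pred, target):
    ''' MSE plus an L2 penalty on all parameters (weights and biases alike) '''
    mse = criterion(pred, target)
    l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
    return mse + l2_lambda * l2_penalty

torch.manual_seed(0)
x, y = torch.randn(16, 4), torch.randn(16)
pred = model(x).squeeze(1)
plain = criterion(pred, y)
regularized = loss_with_l2(pred, y)
print(regularized.item() >= plain.item())  # True, the penalty is non-negative
```

A related option is the `weight_decay` argument of the torch.optim optimizers, which applies a comparable shrinkage during the parameter update without changing the loss function.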

Reference

This code was written entirely by Heng-Jui Chang @ NTUEE.
If you copy or reuse this code, you must credit the original author.

E.g.
Source: Heng-Jui Chang @ NTUEE (https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.ipynb)
