Example Code
In previous chapters, various features and techniques of Nytorch have been discussed. This chapter provides a practical example to demonstrate its application.
Purpose
This chapter serves two main purposes:
Introduction of ParticleModule: Earlier tutorials omitted ParticleModule for brevity, but practical training needs it, so this chapter shows how to integrate it.
Combining Evolutionary Algorithm and Gradient Descent: Nytorch makes it straightforward to combine the two methods. Here, both are used in training: Gradient Descent optimizes the parameters in most iterations, while periodically (every SWARM_INTERVAL iterations, which is 16 in the code below) the Evolutionary Algorithm adjusts a subset of the swarm to search for better solutions.
For the Evolutionary Algorithm phase, we adopt an approach similar to Accelerated Particle Swarm Optimization, updating models based on:
\[W_{i,t} = (1-\alpha)\,W_{i,t-1} + \alpha\,W_{g,t-1}\]
where:
\(W_{i,t}\) is particle i at time t.
\(W_{i,t-1}\) is particle i at time t-1.
\(W_{g,t-1}\) is the best-known particle in the swarm at time t-1.
\(\alpha\) is a scalar with \(0 < \alpha < 1\) that sets how strongly a particle is pulled toward the best-known one.
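As a minimal sketch of this kind of update, the snippet below blends a particle with the best-known particle using plain Python lists as stand-ins for flattened model weights. The function name `blend_toward_best` and the list representation are illustrative assumptions, not part of Nytorch; the chapter's own code performs the blend on PMProduct instances in swarm_algorithm.

```python
def blend_toward_best(w_i, w_g, alpha):
    """Return (1 - alpha) * w_i + alpha * w_g, element by element."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if len(w_i) != len(w_g):
        raise ValueError("particles must have the same number of weights")
    return [(1 - alpha) * a + alpha * b for a, b in zip(w_i, w_g)]


particle = [0.0, 2.0, -4.0]  # stands in for W_{i,t-1}
best = [1.0, 2.0, 0.0]       # stands in for W_{g,t-1}
updated = blend_toward_best(particle, best, alpha=0.5)
print(updated)  # [0.5, 2.0, -2.0]
```

With alpha = 0.5 each weight moves halfway toward the best-known particle; a weight that already equals the best one (the middle entry) is unchanged.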
To optimize distributed models across nodes with high communication costs, adjustments include:
Reducing the communication frequency.
Optimizing only a subset of the swarm per iteration.
Reducing the communication frequency means running the Evolutionary Algorithm only periodically, for example every 5 iterations, while Gradient Descent handles the other iterations. Optimizing a subset of the swarm means selecting particles at random for each Evolutionary Algorithm step, which limits how many particles take part in that step.
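The two adjustments can be sketched independently of Nytorch. The helper names `is_swarm_step` and `pick_subset` are hypothetical, and this sketch draws distinct indices with `random.sample`, whereas the chapter's swarm_algorithm uses `random.choices`, which samples with replacement.

```python
import random


def is_swarm_step(batch_idx, interval):
    """True on every `interval`-th iteration, counting iterations from 1."""
    return (batch_idx + 1) % interval == 0


def pick_subset(pool_size, swarm_size, rng):
    """Choose `swarm_size` distinct particle indices at random."""
    return rng.sample(range(pool_size), swarm_size)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Iterations (0-based) on which the swarm step would run with interval 5.
swarm_steps = [i for i in range(20) if is_swarm_step(i, interval=5)]
print(swarm_steps)  # [4, 9, 14, 19]

# Half of a 12-particle pool takes part in one swarm step.
subset = pick_subset(pool_size=12, swarm_size=6, rng=rng)
print(len(subset), len(set(subset)))  # 6 6
```

All other iterations would run Gradient Descent only, so the particles need to be compared just once per interval.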
Example Content
Let’s begin by configuring the training parameters:
from nytorch import NytoModule, ParticleModule
from nytorch.particle_module import PMProduct
from random import choices, random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Subset, DataLoader, random_split
from torchvision import datasets, transforms
BATCH_SIZE = 64
TRAIN_BATCH_NUM = 256
TEST_BATCH_NUM = 16
POOL_SIZE = 12
SWARM_SIZE = 6
LR = 0.01
ALPHA = 0.5
SWARM_INTERVAL = 16
PRINT_INTERVAL = 16
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
We use the MNIST dataset for demonstration, selecting only a subset for the example:
full_train_dataset = datasets.MNIST('mnist',
                                    train=True,
                                    download=True,
                                    transform=transforms.Compose([transforms.ToTensor(),
                                                                  transforms.Normalize((0.1307,), (0.3081,))]))
full_test_dataset = datasets.MNIST('mnist',
                                   train=False,
                                   transform=transforms.Compose([transforms.ToTensor(),
                                                                 transforms.Normalize((0.1307,), (0.3081,))]))
train_size = TRAIN_BATCH_NUM * BATCH_SIZE
test_size = TEST_BATCH_NUM * BATCH_SIZE
train_dataset, _ = random_split(full_train_dataset, [train_size, len(full_train_dataset)-train_size])
test_dataset, _ = random_split(full_test_dataset, [test_size, len(full_test_dataset)-test_size])
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
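As a quick check of the subset sizes, the sketch below reproduces the arithmetic behind the two random_split calls without loading any data. It assumes the standard MNIST split sizes of 60,000 training and 10,000 test images; the helper `split_lengths` is a hypothetical name used only here.

```python
BATCH_SIZE = 64
TRAIN_BATCH_NUM = 256
TEST_BATCH_NUM = 16

MNIST_TRAIN_LEN = 60_000  # assumed size of the standard MNIST training set
MNIST_TEST_LEN = 10_000   # assumed size of the standard MNIST test set


def split_lengths(total, keep):
    """Lengths for a two-way split: the part we keep and the remainder."""
    if keep > total:
        raise ValueError("cannot keep more samples than the dataset holds")
    return [keep, total - keep]


train_lengths = split_lengths(MNIST_TRAIN_LEN, TRAIN_BATCH_NUM * BATCH_SIZE)
test_lengths = split_lengths(MNIST_TEST_LEN, TEST_BATCH_NUM * BATCH_SIZE)
print(train_lengths)  # [16384, 43616]
print(test_lengths)   # [1024, 8976]
```

The kept portions are exact multiples of BATCH_SIZE, so every batch produced by the loaders is full.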
Next, we define the model:
class ConvNet(NytoModule):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, 5)
        self.conv2 = nn.Conv2d(10, 20, 3)
        self.fc = nn.Linear(20*10*10, 10)

    def forward(self, x):
        in_size = x.size(0)
        out = self.conv1(x)
        out = F.relu(out)
        out = F.max_pool2d(out, 2, 2)
        out = self.conv2(out)
        out = F.relu(out)
        out = out.view(in_size, -1)  # flatten to (batch, features)
        out = self.fc(out)
        out = F.log_softmax(out, dim=1)
        return out

class ConvModel:
    @classmethod
    def from_product(cls, product, device):
        assert isinstance(product, PMProduct)
        return cls(product.module(), device)

    def __init__(self, particle, device):
        assert isinstance(particle, ParticleModule)
        self.device = device
        self.particle = particle
        self.optimizer = optim.SGD(self.particle.parameters(), lr=LR)
        self.particle.to(self.device)

    def product(self):
        return self.particle.product()

    def train(self, data, target):
        # One Gradient Descent step on a single batch; returns the batch loss.
        data, target = data.to(self.device), target.to(self.device)
        self.particle.train()
        self.optimizer.zero_grad()
        loss = F.nll_loss(self.particle(data), target)
        loss.backward()
        self.optimizer.step()
        return loss.item()

    def test(self, data, target):
        # Returns the summed loss and the number of correct predictions.
        data, target = data.to(self.device), target.to(self.device)
        self.particle.eval()
        with torch.no_grad():
            output = self.particle(data)
            loss = F.nll_loss(output, target, reduction='sum').item()
            pred = output.max(1, keepdim=True)[1]  # class with the highest log-probability
            correct = pred.eq(target.view_as(pred)).sum().item()
        return loss, correct
We also create a wrapper class called ConvModel, which holds a ParticleModule together with its optimizer and the training/testing methods. The product method returns a PMProduct instance for particle operations, and the from_product class method converts a PMProduct instance back into a ConvModel after those operations.
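The input size of the fc layer in ConvNet, 20*10*10, follows from the layer shapes. The sketch below traces a 28x28 MNIST image through the two convolutions and the pooling step using the standard output-size formula for unpadded, undilated layers; `conv_out` and `pool_out` are hypothetical helper names.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of max pooling without padding."""
    return (size - kernel) // stride + 1


side = 28                    # MNIST images are 28x28
side = conv_out(side, 5)     # conv1, 5x5 kernel -> 24
side = pool_out(side, 2, 2)  # 2x2 max pooling   -> 12
side = conv_out(side, 3)     # conv2, 3x3 kernel -> 10
flat_features = 20 * side * side  # conv2 has 20 output channels
print(side, flat_features)  # 10 2000
```

If the kernel sizes or the input resolution change, the first argument of nn.Linear has to be recomputed the same way.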
Since we are using a swarm-based algorithm, we need some swarm operations during training, which we wrap into functions:
def create_pool(size, device):
    assert size >= 2
    # Wrap each ConvNet in a ParticleModule.
    pool = [ParticleModule(ConvNet()) for _ in range(size)]
    p0 = pool[0]
    return [ConvModel(p0.clone_from(p), device) for p in pool[1:]] + [ConvModel(p0, device)]

def test_model(model, test_loader):
    test_loss = 0
    total_correct = 0
    for data, target in test_loader:
        loss, correct = model.test(data, target)
        test_loss += loss
        total_correct += correct
    test_loss /= len(test_loader.dataset)
    test_acc = total_correct / len(test_loader.dataset)
    return test_loss, test_acc

def swarm_algorithm(pool, swarm_size, loss_list, alpha):
    assert 0 < swarm_size <= len(pool) == len(loss_list)
    assert 1 > alpha > 0
    # Note: choices() samples with replacement, so an index may appear more than once.
    idx_list = choices(list(range(len(pool))), k=swarm_size)
    idx_loss_list = [(idx, loss_list[idx]) for idx in idx_list]
    # Sort by loss so the first entry is the best particle in the subset.
    idx_loss_list = sorted(idx_loss_list, key=lambda idx_loss: idx_loss[1])
    best_seed_idx, _ = idx_loss_list[0]
    for i, (idx, loss) in enumerate(idx_loss_list):
        if idx == best_seed_idx:
            continue
        seed0 = pool[best_seed_idx].product()
        seed1 = pool[idx].product()
        # Blend the particle with the best particle of the subset.
        new_product = alpha*seed0 + (1-alpha)*seed1
        pool[idx] = ConvModel.from_product(new_product, pool[idx].device)

def train_pool(pool, train_loader, test_loader, swarm_size, swarm_interval=4, alpha=0.5, print_interval=8):
    assert len(pool) >= swarm_size >= 2
    assert swarm_interval > 0
    assert 1 > alpha > 0
    assert print_interval > 0
    for batch_idx, (data, target) in enumerate(train_loader):
        # Gradient Descent step for every model on the same batch.
        loss_list = [model.train(data, target) for model in pool]
        # Periodic Evolutionary Algorithm step.
        if (batch_idx+1)%swarm_interval == 0:
            swarm_algorithm(pool, swarm_size, loss_list, alpha)
        if batch_idx==0 or (batch_idx+1)%print_interval == 0:
            print(f"batch: {batch_idx:>3} Accuracy: ", end='')
            for idx, model in enumerate(pool):
                _, acc = test_model(model, test_loader)
                print(f"[{idx}]{acc:.2f}", end=' ')
            print()
Two techniques in create_pool and swarm_algorithm deserve special attention.
In create_pool, we wrap each NytoModule in a ParticleModule to eliminate circular references and reduce memory pressure.
In swarm_algorithm, we call the product method to obtain PMProduct instances, perform the particle operations on them, and only then convert the result back to a ParticleModule, which avoids unnecessary conversions.
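To illustrate why an expression such as alpha*seed0 + (1-alpha)*seed1 can be written directly on product objects, here is a toy stand-in that supports the same arithmetic. `ToyProduct` is purely hypothetical: it stores a dictionary of scalar weights and is not how Nytorch implements PMProduct.

```python
class ToyProduct:
    """Hypothetical stand-in for a PMProduct: a bag of named scalar weights."""

    def __init__(self, weights):
        self.weights = dict(weights)

    def __add__(self, other):
        # Add two products weight by weight.
        return ToyProduct({k: v + other.weights[k] for k, v in self.weights.items()})

    def __mul__(self, scalar):
        # Scale every weight by the same scalar.
        return ToyProduct({k: v * scalar for k, v in self.weights.items()})

    __rmul__ = __mul__  # lets `alpha * product` work as well as `product * alpha`


alpha = 0.5
best = ToyProduct({"w": 1.0, "b": 0.0})
other = ToyProduct({"w": 3.0, "b": -2.0})

# All arithmetic stays in "product space"; only the final result would be
# converted back to a trainable module.
blended = alpha * best + (1 - alpha) * other
print(blended.weights)  # {'w': 2.0, 'b': -1.0}
```

Keeping the intermediate values as products means only one conversion back to a module is needed per updated particle.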
Finally, we start training:
if __name__ == '__main__':
    pool = create_pool(POOL_SIZE, DEVICE)
    train_pool(pool,
               train_loader,
               test_loader,
               SWARM_SIZE,
               SWARM_INTERVAL,
               ALPHA,
               PRINT_INTERVAL)
    print("End")
    print("Accuracy: ", end='')
    for idx, model in enumerate(pool):
        _, acc = test_model(model, test_loader)
        print(f"[{idx}]{acc:.2f}", end=' ')
Below is the output of the program:
batch: 0 Accuracy: [0]0.11 [1]0.06 [2]0.09 [3]0.07 [4]0.08 [5]0.07 [6]0.11 [7]0.17 [8]0.15 [9]0.13 [10]0.13 [11]0.20
batch: 15 Accuracy: [0]0.37 [1]0.43 [2]0.51 [3]0.60 [4]0.63 [5]0.36 [6]0.22 [7]0.44 [8]0.62 [9]0.62 [10]0.58 [11]0.63
batch: 31 Accuracy: [0]0.47 [1]0.47 [2]0.61 [3]0.69 [4]0.73 [5]0.57 [6]0.55 [7]0.56 [8]0.70 [9]0.55 [10]0.69 [11]0.63
batch: 47 Accuracy: [0]0.70 [1]0.67 [2]0.72 [3]0.83 [4]0.84 [5]0.69 [6]0.77 [7]0.67 [8]0.81 [9]0.77 [10]0.80 [11]0.75
batch: 63 Accuracy: [0]0.79 [1]0.75 [2]0.80 [3]0.80 [4]0.84 [5]0.80 [6]0.78 [7]0.78 [8]0.81 [9]0.78 [10]0.73 [11]0.82
batch: 79 Accuracy: [0]0.81 [1]0.84 [2]0.84 [3]0.84 [4]0.87 [5]0.82 [6]0.84 [7]0.82 [8]0.84 [9]0.82 [10]0.87 [11]0.87
batch: 95 Accuracy: [0]0.85 [1]0.83 [2]0.85 [3]0.85 [4]0.88 [5]0.83 [6]0.84 [7]0.80 [8]0.85 [9]0.85 [10]0.86 [11]0.86
batch: 111 Accuracy: [0]0.87 [1]0.88 [2]0.87 [3]0.88 [4]0.89 [5]0.84 [6]0.87 [7]0.85 [8]0.82 [9]0.87 [10]0.89 [11]0.89
batch: 127 Accuracy: [0]0.87 [1]0.86 [2]0.86 [3]0.87 [4]0.87 [5]0.85 [6]0.87 [7]0.87 [8]0.86 [9]0.85 [10]0.87 [11]0.87
batch: 143 Accuracy: [0]0.87 [1]0.86 [2]0.85 [3]0.84 [4]0.87 [5]0.83 [6]0.87 [7]0.86 [8]0.87 [9]0.87 [10]0.87 [11]0.87
batch: 159 Accuracy: [0]0.86 [1]0.83 [2]0.82 [3]0.84 [4]0.88 [5]0.87 [6]0.87 [7]0.83 [8]0.87 [9]0.83 [10]0.88 [11]0.88
batch: 175 Accuracy: [0]0.89 [1]0.89 [2]0.89 [3]0.88 [4]0.90 [5]0.90 [6]0.90 [7]0.90 [8]0.90 [9]0.90 [10]0.90 [11]0.90
batch: 191 Accuracy: [0]0.89 [1]0.88 [2]0.89 [3]0.88 [4]0.89 [5]0.89 [6]0.89 [7]0.89 [8]0.89 [9]0.88 [10]0.89 [11]0.89
batch: 207 Accuracy: [0]0.90 [1]0.90 [2]0.90 [3]0.81 [4]0.90 [5]0.90 [6]0.90 [7]0.90 [8]0.90 [9]0.90 [10]0.90 [11]0.90
batch: 223 Accuracy: [0]0.90 [1]0.90 [2]0.90 [3]0.88 [4]0.90 [5]0.90 [6]0.90 [7]0.90 [8]0.90 [9]0.90 [10]0.90 [11]0.90
batch: 239 Accuracy: [0]0.90 [1]0.90 [2]0.90 [3]0.88 [4]0.90 [5]0.90 [6]0.90 [7]0.90 [8]0.89 [9]0.89 [10]0.90 [11]0.90
batch: 255 Accuracy: [0]0.90 [1]0.90 [2]0.90 [3]0.87 [4]0.90 [5]0.90 [6]0.90 [7]0.90 [8]0.90 [9]0.90 [10]0.90 [11]0.90
End
Accuracy: [0]0.90 [1]0.90 [2]0.90 [3]0.87 [4]0.90 [5]0.90 [6]0.90 [7]0.90 [8]0.90 [9]0.90 [10]0.90 [11]0.90
As training progresses, the particles' accuracies converge toward one another, demonstrating the Evolutionary Algorithm's efficacy. Its effect is strongest early in training and diminishes as the parameters converge.
Slowing the Evolutionary Algorithm's convergence gives particles more room to explore better solutions, at the cost of more computation.
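The effect of the blend factor on convergence speed can be sketched with a simplified model. Assuming the best-known particle stays fixed and ignoring the Gradient Descent steps in between (neither holds in the real training loop), each blend with factor alpha shrinks a particle's distance to the best one by (1 - alpha). The function `gap_after` is a hypothetical helper.

```python
def gap_after(steps, alpha, initial_gap=1.0):
    """Distance to a fixed best particle after `steps` blends with factor alpha."""
    gap = initial_gap
    for _ in range(steps):
        gap *= (1 - alpha)
    return gap


for alpha in (0.5, 0.25):
    print(alpha, [gap_after(k, alpha) for k in range(4)])
# 0.5 [1.0, 0.5, 0.25, 0.125]
# 0.25 [1.0, 0.75, 0.5625, 0.421875]
```

Under these assumptions a smaller alpha leaves the particles spread out for more swarm steps, which is one way to slow convergence.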
Summary
This chapter showed how to use Nytorch for model training by combining Gradient Descent with an Evolutionary Algorithm. The techniques covered were wrapping a NytoModule in a ParticleModule and using PMProduct for particle operations.