Train a model on a CosmicAC GPU
In this tutorial, you will train a model on a CosmicAC GPU container. By the end, you will have run a training job on a GPU, seen live loss output in your terminal, and saved a model checkpoint inside the container.
This tutorial uses a simple training script so the focus stays on the CosmicAC workflow. Once you have completed it, you can swap in your own script and follow the same steps.
Before you begin
You will need:
- A CosmicAC account
- The CosmicAC CLI installed and authenticated. See Installation.
Create a GPU container job
You can create the job from the CLI or the web UI. This tutorial uses the CLI.
Run the interactive setup:
cosmicac jobs init
Follow the prompts. Select the GPU_CONTAINER type, the RTX H100 GPU, and the ubuntu:24.04 container image. For a walkthrough of every prompt, see Create a GPU Container Job.
PyTorch and CUDA are not pre-installed. You will install them yourself inside the container.
Review job.config.json and confirm the settings, then submit the job:
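Before submitting, it can help to confirm that the file still parses as valid JSON after any manual edits. The sketch below is one way to do that on your local machine with Python's standard library. It assumes job.config.json sits in the current directory and makes no assumption about which keys the CLI writes. The helper name format_config is hypothetical.

```python
import json
from pathlib import Path


def format_config(path="job.config.json"):
    """Parse the config file and return it as indented JSON text.

    Raises json.JSONDecodeError if the file is not valid JSON.
    """
    # Assumption: the file lives in the current working directory.
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return json.dumps(data, indent=2, sort_keys=True)


if __name__ == "__main__":
    config_path = Path("job.config.json")
    if config_path.exists():
        print(format_config(config_path))
    else:
        print(f"{config_path} not found in the current directory")
```

A parse error here points to a stray comma or quote introduced while editing, which is easier to fix before the job is submitted.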
cosmicac jobs create
Check that your container is running:
cosmicac jobs list
Copy the job ID and container ID from the output. Your container is ready when the status shows Running.
Open a shell session
Connect to the container:
cosmicac jobs shell <jobId> <containerId>
Verify the GPU is available:
nvidia-smi
You should see your GPU listed with its VRAM. You are now inside the container.
Install pip and PyTorch
Install pip:
apt-get update && apt-get install -y python3-pip
Install PyTorch with CUDA support:
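pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121
Ubuntu 24.04 marks the system Python as externally managed, so pip may stop with an externally-managed-environment error. If it does, rerun the command with the --break-system-packages flag, or install into a virtual environment instead.
Verify PyTorch can see the GPU: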
python3 -c "import torch; print(torch.cuda.is_available())"
This should print True. If it prints False, check that nvidia-smi shows a GPU before continuing.
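If the one-line check prints False, a fuller report can help narrow down the cause. The following sketch uses only standard PyTorch attributes (torch.__version__, torch.version.cuda, and the torch.cuda helpers). The function name gpu_report and the dictionary keys are illustrative and are not part of CosmicAC or PyTorch.

```python
import importlib.util


def gpu_report():
    """Collect basic facts about the PyTorch install and visible GPUs."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch  # imported lazily so the report still works without PyTorch

    report["torch_version"] = torch.__version__
    # None here means a CPU-only build of PyTorch was installed.
    report["cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["devices"] = [
        torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ]
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

A cuda_build of None suggests a CPU-only wheel was installed. A populated cuda_build alongside cuda_available: False points instead toward the driver or GPU not being visible inside the container.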
Create the training script
Create a new file called train.py:
nano train.py
Paste the following script into the file. It trains a single-layer linear model to fit a straight line. The task is simple enough to finish in under a minute while still producing real loss output and a checkpoint:
import torch
import torch.nn as nn
import os
# Seed for reproducibility
torch.manual_seed(42)
# Generate synthetic data: y = 3x + 1 with a small amount of noise
X = torch.linspace(0, 1, 100).unsqueeze(1)
y = 3 * X + 1 + 0.1 * torch.randn(100, 1)
# Select device — use the GPU if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training on: {device}")
X = X.to(device)
y = y.to(device)
# Define the model, optimizer, and loss function
model = nn.Linear(1, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
# Training loop
for epoch in range(100):
    predictions = model(X)
    loss = loss_fn(predictions, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch + 1}/100 — loss: {loss.item():.4f}")
# Save the trained model weights
os.makedirs("./checkpoints", exist_ok=True)
torch.save(model.state_dict(), "./checkpoints/model.pt")
print("Checkpoint saved to ./checkpoints/model.pt")
Save and exit with Ctrl+X → Y → Enter.
Run the training job
python3 train.py
You should see output like this. Your exact loss values may differ:
Training on: cuda
Epoch 10/100 — loss: 0.3241
Epoch 20/100 — loss: 0.1823
Epoch 30/100 — loss: 0.1204
Epoch 40/100 — loss: 0.0961
Epoch 50/100 — loss: 0.0880
Epoch 60/100 — loss: 0.0852
Epoch 70/100 — loss: 0.0843
Epoch 80/100 — loss: 0.0840
Epoch 90/100 — loss: 0.0839
Epoch 100/100 — loss: 0.0839
Checkpoint saved to ./checkpoints/model.pt
Training on: cuda confirms the GPU is being used. If you see Training on: cpu, run nvidia-smi to confirm the GPU is visible inside the container. If it is, the PyTorch install is the likely cause. Reinstall PyTorch from the wheel index that matches your CUDA version.
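With the checkpoint saved, you can confirm it is usable by loading it back into a fresh model of the same shape. This is a minimal sketch and not part of train.py: the helper name load_line_model is hypothetical, and it assumes the checkpoint was written to ./checkpoints/model.pt as above. Because the data follows y = 3x + 1, the learned weight and bias should move toward 3 and 1, though 100 epochs may not reach them exactly.

```python
import os

import torch
import torch.nn as nn


def load_line_model(path="./checkpoints/model.pt"):
    """Rebuild the 1-in, 1-out linear model and load saved weights into it."""
    model = nn.Linear(1, 1)
    # map_location="cpu" lets the file load even where no GPU is present.
    # weights_only=True restricts unpickling to tensors and plain containers.
    state = torch.load(path, map_location="cpu", weights_only=True)
    model.load_state_dict(state)
    model.eval()
    return model


if __name__ == "__main__":
    checkpoint = "./checkpoints/model.pt"
    if os.path.exists(checkpoint):
        model = load_line_model(checkpoint)
        print(f"weight: {model.weight.item():.3f}, bias: {model.bias.item():.3f}")
        with torch.no_grad():
            # Predict y for x = 0.5; the noise-free value would be 2.5.
            y_hat = model(torch.tensor([[0.5]])).item()
        print(f"prediction at x=0.5: {y_hat:.3f}")
    else:
        print(f"{checkpoint} not found; run train.py first")
```

Save it as a separate file in the same directory as train.py and run it with python3 before you stop the job, since the checkpoint lives inside the container.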
Stop the job
Exit the shell session:
exit
Stop the container to avoid further charges:
cosmicac jobs stop <jobId>
What you have done
You created a CosmicAC GPU container job, installed PyTorch inside the container, ran a training script on the GPU, saved a model checkpoint, and stopped the job.
This tutorial used PyTorch, but you can install any ML framework the same way, such as TensorFlow, JAX, or Hugging Face Transformers. The CosmicAC workflow is the same for any training script. Swap in your own code and follow the same steps.
Next steps
- GPU Container Job — understand how containers, shell access, and job lifecycle work on CosmicAC
- Job Management CLI reference — full reference for all job commands
- GPU Types — available GPU hardware and VRAM configurations