Trying to Learn to Train a GPT-2 Dialogue Model - 郑瀚Andrew
2023-4-15 20:14:0 Author: www.cnblogs.com

# Train on the Shakespeare dataset
python3 train.py config/train_shakespeare_char.py

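# Measured on a V100 GPU: after about 100 minutes of training, the per-iteration train loss falls to roughly 0.16, but the val loss does not fall with it; it bottoms out at 1.4794 at step 500 and then climbs to about 3.73 by step 5000 as the model overfits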
Overriding config with config/train_shakespeare_char.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-shakespeare-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'

dataset = 'shakespeare_char'
batch_size = 64
block_size = 256 # context of up to 256 previous characters
dtype = 'bfloat16'


# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
# device = 'cpu'  # run on cpu only
compile = False # do not torch compile the model

# on GPU server
device = 'cuda'

total number of tokens per iteration: 655360
Traceback (most recent call last):
  File "train.py", line 106, in <module>
    ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
  File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 234, in __init__
    raise RuntimeError('Current CUDA Device does not support bfloat16. Please switch dtype to float16.')
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.
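
The error above says the GPU cannot run bfloat16 autocast and asks for float16 instead. A minimal sketch of deriving the dtype name from the hardware rather than hard-coding it in the config. `pick_dtype` and `detect_dtype` are hypothetical helpers, not part of train.py, and the probe assumes the installed PyTorch provides `torch.cuda.is_bf16_supported()`, whose answer may differ between PyTorch versions.

```python
def pick_dtype(device_type: str, bf16_supported: bool) -> str:
    """Return the dtype name to use for a given device (hypothetical helper)."""
    if device_type != 'cuda':
        return 'float32'  # this sketch does not use reduced precision off-GPU
    return 'bfloat16' if bf16_supported else 'float16'


def detect_dtype() -> str:
    """Probe the local machine; falls back to float32 if torch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return 'float32'
    if not torch.cuda.is_available():
        return pick_dtype('cpu', False)
    return pick_dtype('cuda', torch.cuda.is_bf16_supported())
```

On a card without bfloat16 support, `pick_dtype('cuda', False)` yields `'float16'`, which is the value the error message recommends.
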
[email protected]:~/nanoGPT# python3 train.py config/train_shakespeare_char.py
Overriding config with config/train_shakespeare_char.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-shakespeare-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'

dataset = 'shakespeare_char'
batch_size = 64
block_size = 256 # context of up to 256 previous characters


# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
# device = 'cpu'  # run on cpu only
# compile = False # do not torch compile the model

# on GPU server
device = 'cuda'

total number of tokens per iteration: 655360
Traceback (most recent call last):
  File "train.py", line 106, in <module>
    ptdtype = {'float32': torch.float32, 'float16': torch.float16}[dtype]
KeyError: 'bfloat16'
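
The `KeyError` above means `dtype` was still `'bfloat16'` when train.py looked it up in a table that no longer had that key. The config file printed in this run no longer sets `dtype`, so the value presumably came from a default elsewhere. One way to change such a setting without editing files is the command-line override that the config comment "override via command line if you like" alludes to. A minimal sketch of how `--key=value` overrides can be applied to a dictionary of defaults; `apply_overrides` and the `defaults` dictionary are hypothetical illustrations, not the repository's actual code.

```python
from ast import literal_eval


def apply_overrides(defaults: dict, argv: list) -> dict:
    """Return a copy of defaults with --key=value arguments applied."""
    config = dict(defaults)
    for arg in argv:
        if not arg.startswith('--') or '=' not in arg:
            continue  # not an override, e.g. a config file path
        key, raw = arg[2:].split('=', 1)
        if key not in config:
            raise ValueError(f'unknown config key: {key}')
        try:
            value = literal_eval(raw)  # numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            value = raw  # bare words such as float16 stay strings
        if type(value) is not type(config[key]):
            raise TypeError(f'{key} expects {type(config[key]).__name__}')
        config[key] = value
    return config


# Hypothetical defaults, chosen to mirror the settings discussed above
defaults = {'dtype': 'bfloat16', 'compile': True, 'batch_size': 64}
config = apply_overrides(defaults, ['config/train_shakespeare_char.py',
                                    '--dtype=float16', '--compile=False'])
```

After the call, `config['dtype']` is `'float16'` and `config['compile']` is `False`, while `defaults` is left unchanged.
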
[email protected]:~/nanoGPT# python3 train.py config/train_shakespeare_char.py
Overriding config with config/train_shakespeare_char.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-shakespeare-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'

dataset = 'shakespeare_char'
batch_size = 64
block_size = 256 # context of up to 256 previous characters


# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
# device = 'cpu'  # run on cpu only
# compile = False # do not torch compile the model

# on GPU server
device = 'cuda'

total number of tokens per iteration: 655360
found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)
Initializing a new model from scratch
number of parameters: 10.65M
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 4.2874, val loss 4.2823
[2023-04-14 09:55:30,333] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
iter 0: loss 4.2586, time 35048.17ms, mfu -100.00%
iter 10: loss 3.2202, time 1389.25ms, mfu 10.73%
iter 20: loss 2.7714, time 1392.25ms, mfu 10.73%
iter 30: loss 2.6154, time 1392.92ms, mfu 10.72%
iter 40: loss 2.5368, time 1394.59ms, mfu 10.72%
iter 50: loss 2.5093, time 1394.69ms, mfu 10.72%
iter 60: loss 2.4757, time 1394.85ms, mfu 10.71%
iter 70: loss 2.5073, time 1394.47ms, mfu 10.71%
iter 80: loss 2.4474, time 1394.96ms, mfu 10.71%
iter 90: loss 2.4334, time 1395.45ms, mfu 10.71%
iter 100: loss 2.4050, time 1395.06ms, mfu 10.70%
iter 110: loss 2.3856, time 1396.25ms, mfu 10.70%
iter 120: loss 2.3631, time 1394.67ms, mfu 10.70%
iter 130: loss 2.3024, time 1394.34ms, mfu 10.70%
iter 140: loss 2.2330, time 1394.40ms, mfu 10.70%
iter 150: loss 2.1229, time 1396.18ms, mfu 10.70%
iter 160: loss 2.0596, time 1396.76ms, mfu 10.69%
iter 170: loss 2.0247, time 1396.03ms, mfu 10.69%
iter 180: loss 1.9253, time 1394.91ms, mfu 10.69%
iter 190: loss 1.8770, time 1395.84ms, mfu 10.69%
iter 200: loss 1.8505, time 1396.60ms, mfu 10.69%
iter 210: loss 1.8220, time 1396.59ms, mfu 10.69%
iter 220: loss 1.7351, time 1397.92ms, mfu 10.68%
iter 230: loss 1.7186, time 1396.52ms, mfu 10.68%
iter 240: loss 1.6742, time 1395.36ms, mfu 10.68%
step 250: train loss 1.5482, val loss 1.7322
saving checkpoint to out-shakespeare-char
iter 250: loss 1.6170, time 6002.98ms, mfu 9.86%
iter 260: loss 1.6227, time 1396.29ms, mfu 9.94%
iter 270: loss 1.6086, time 1395.38ms, mfu 10.02%
iter 280: loss 1.5508, time 1396.46ms, mfu 10.08%
iter 290: loss 1.5237, time 1395.96ms, mfu 10.14%
iter 300: loss 1.5497, time 1395.28ms, mfu 10.20%
iter 310: loss 1.5187, time 1397.89ms, mfu 10.24%
iter 320: loss 1.5137, time 1396.06ms, mfu 10.29%
iter 330: loss 1.5041, time 1395.99ms, mfu 10.33%
iter 340: loss 1.4562, time 1394.80ms, mfu 10.36%
iter 350: loss 1.4466, time 1396.15ms, mfu 10.39%
iter 360: loss 1.3967, time 1399.17ms, mfu 10.42%
iter 370: loss 1.3867, time 1396.58ms, mfu 10.44%
iter 380: loss 1.3648, time 1395.66ms, mfu 10.47%
iter 390: loss 1.3446, time 1395.32ms, mfu 10.49%
iter 400: loss 1.3223, time 1396.27ms, mfu 10.51%
iter 410: loss 1.3614, time 1395.41ms, mfu 10.53%
iter 420: loss 1.3121, time 1396.52ms, mfu 10.54%
iter 430: loss 1.2831, time 1396.91ms, mfu 10.55%
iter 440: loss 1.3500, time 1395.62ms, mfu 10.57%
iter 450: loss 1.3271, time 1395.62ms, mfu 10.58%
iter 460: loss 1.2502, time 1396.25ms, mfu 10.59%
iter 470: loss 1.3077, time 1397.06ms, mfu 10.60%
iter 480: loss 1.2766, time 1396.11ms, mfu 10.60%
iter 490: loss 1.2447, time 1395.38ms, mfu 10.61%
step 500: train loss 1.1257, val loss 1.4794
saving checkpoint to out-shakespeare-char
iter 500: loss 1.2409, time 6310.81ms, mfu 9.79%
iter 510: loss 1.2128, time 1395.79ms, mfu 9.88%
iter 520: loss 1.1950, time 1396.49ms, mfu 9.96%
iter 530: loss 1.2109, time 1397.05ms, mfu 10.03%
iter 540: loss 1.1947, time 1396.67ms, mfu 10.09%
iter 550: loss 1.1853, time 1395.17ms, mfu 10.15%
iter 560: loss 1.2016, time 1396.62ms, mfu 10.20%
iter 570: loss 1.1693, time 1397.37ms, mfu 10.25%
iter 580: loss 1.1706, time 1395.91ms, mfu 10.29%
iter 590: loss 1.1353, time 1396.05ms, mfu 10.33%
iter 600: loss 1.1314, time 1395.56ms, mfu 10.37%
iter 610: loss 1.1187, time 1395.38ms, mfu 10.40%
iter 620: loss 1.1109, time 1396.23ms, mfu 10.42%
iter 630: loss 1.0877, time 1397.28ms, mfu 10.45%
iter 640: loss 1.1222, time 1397.20ms, mfu 10.47%
iter 650: loss 1.0938, time 1396.87ms, mfu 10.49%
iter 660: loss 1.0652, time 1396.59ms, mfu 10.51%
iter 670: loss 1.0469, time 1397.01ms, mfu 10.52%
iter 680: loss 1.0372, time 1397.16ms, mfu 10.54%
iter 690: loss 1.0529, time 1397.81ms, mfu 10.55%
iter 700: loss 1.0402, time 1396.57ms, mfu 10.56%
iter 710: loss 1.0225, time 1396.72ms, mfu 10.57%
iter 720: loss 0.9876, time 1396.02ms, mfu 10.58%
iter 730: loss 1.0127, time 1396.87ms, mfu 10.59%
iter 740: loss 0.9794, time 1397.62ms, mfu 10.60%
step 750: train loss 0.7875, val loss 1.5834
iter 750: loss 0.9941, time 5848.39ms, mfu 9.80%
iter 760: loss 0.9972, time 1394.78ms, mfu 9.88%
iter 770: loss 0.9471, time 1397.23ms, mfu 9.96%
iter 780: loss 0.9479, time 1397.72ms, mfu 10.03%
iter 790: loss 0.9377, time 1396.61ms, mfu 10.10%
iter 800: loss 0.8917, time 1397.22ms, mfu 10.15%
iter 810: loss 0.8710, time 1396.42ms, mfu 10.21%
iter 820: loss 0.8780, time 1395.73ms, mfu 10.25%
iter 830: loss 0.8634, time 1396.84ms, mfu 10.29%
iter 840: loss 0.8529, time 1397.88ms, mfu 10.33%
iter 850: loss 0.8546, time 1396.87ms, mfu 10.37%
iter 860: loss 0.8158, time 1396.09ms, mfu 10.40%
iter 870: loss 0.8265, time 1395.68ms, mfu 10.42%
iter 880: loss 0.8065, time 1396.95ms, mfu 10.45%
iter 890: loss 0.8108, time 1397.06ms, mfu 10.47%
iter 900: loss 0.7922, time 1395.59ms, mfu 10.49%
iter 910: loss 0.8111, time 1396.20ms, mfu 10.51%
iter 920: loss 0.7672, time 1396.92ms, mfu 10.53%
iter 930: loss 0.7691, time 1397.41ms, mfu 10.54%
iter 940: loss 0.7607, time 1397.17ms, mfu 10.55%
iter 950: loss 0.7706, time 1396.58ms, mfu 10.57%
iter 960: loss 0.7467, time 1396.98ms, mfu 10.58%
iter 970: loss 0.7432, time 1395.20ms, mfu 10.59%
iter 980: loss 0.7039, time 1396.55ms, mfu 10.59%
iter 990: loss 0.7100, time 1397.82ms, mfu 10.60%
step 1000: train loss 0.3959, val loss 1.9050
iter 1000: loss 0.6856, time 5838.01ms, mfu 9.80%
iter 1010: loss 0.6781, time 1396.71ms, mfu 9.88%
iter 1020: loss 0.6765, time 1395.95ms, mfu 9.96%
iter 1030: loss 0.6651, time 1395.96ms, mfu 10.03%
iter 1040: loss 0.6758, time 1397.37ms, mfu 10.10%
iter 1050: loss 0.6483, time 1397.06ms, mfu 10.16%
iter 1060: loss 0.6382, time 1397.28ms, mfu 10.21%
iter 1070: loss 0.5898, time 1397.25ms, mfu 10.25%
iter 1080: loss 0.6376, time 1396.11ms, mfu 10.29%
iter 1090: loss 0.6204, time 1396.74ms, mfu 10.33%
iter 1100: loss 0.5924, time 1397.62ms, mfu 10.37%
iter 1110: loss 0.5955, time 1395.80ms, mfu 10.40%
iter 1120: loss 0.5758, time 1395.05ms, mfu 10.43%
iter 1130: loss 0.5956, time 1396.69ms, mfu 10.45%
iter 1140: loss 0.5833, time 1395.38ms, mfu 10.47%
iter 1150: loss 0.5774, time 1397.76ms, mfu 10.49%
iter 1160: loss 0.5521, time 1396.44ms, mfu 10.51%
iter 1170: loss 0.5472, time 1394.76ms, mfu 10.53%
iter 1180: loss 0.5513, time 1396.83ms, mfu 10.54%
iter 1190: loss 0.5299, time 1395.86ms, mfu 10.56%
iter 1200: loss 0.5342, time 1398.13ms, mfu 10.57%
iter 1210: loss 0.5397, time 1396.97ms, mfu 10.58%
iter 1220: loss 0.5248, time 1396.42ms, mfu 10.59%
iter 1230: loss 0.5127, time 1395.75ms, mfu 10.60%
iter 1240: loss 0.5328, time 1395.91ms, mfu 10.60%
step 1250: train loss 0.1908, val loss 2.2568
iter 1250: loss 0.5135, time 5839.61ms, mfu 9.80%
iter 1260: loss 0.5065, time 1397.27ms, mfu 9.89%
iter 1270: loss 0.5214, time 1397.27ms, mfu 9.96%
iter 1280: loss 0.4986, time 1395.61ms, mfu 10.04%
iter 1290: loss 0.4790, time 1397.00ms, mfu 10.10%
iter 1300: loss 0.4788, time 1396.65ms, mfu 10.16%
iter 1310: loss 0.4886, time 1397.53ms, mfu 10.21%
iter 1320: loss 0.4646, time 1395.61ms, mfu 10.25%
iter 1330: loss 0.4611, time 1395.60ms, mfu 10.30%
iter 1340: loss 0.4612, time 1396.26ms, mfu 10.33%
iter 1350: loss 0.4525, time 1396.28ms, mfu 10.37%
iter 1360: loss 0.4236, time 1397.25ms, mfu 10.40%
iter 1370: loss 0.4528, time 1395.44ms, mfu 10.43%
iter 1380: loss 0.4495, time 1395.44ms, mfu 10.45%
iter 1390: loss 0.4413, time 1394.99ms, mfu 10.48%
iter 1400: loss 0.4362, time 1397.86ms, mfu 10.49%
iter 1410: loss 0.4302, time 1397.51ms, mfu 10.51%
iter 1420: loss 0.4267, time 1396.74ms, mfu 10.53%
iter 1430: loss 0.4190, time 1396.87ms, mfu 10.54%
iter 1440: loss 0.4370, time 1396.64ms, mfu 10.55%
iter 1450: loss 0.4101, time 1397.86ms, mfu 10.57%
iter 1460: loss 0.4200, time 1396.51ms, mfu 10.58%
iter 1470: loss 0.4043, time 1395.97ms, mfu 10.59%
iter 1480: loss 0.4027, time 1396.46ms, mfu 10.59%
iter 1490: loss 0.4051, time 1395.87ms, mfu 10.60%
step 1500: train loss 0.1302, val loss 2.4975
iter 1500: loss 0.4120, time 5848.71ms, mfu 9.80%
iter 1510: loss 0.3907, time 1396.52ms, mfu 9.89%
iter 1520: loss 0.3884, time 1396.87ms, mfu 9.96%
iter 1530: loss 0.3842, time 1395.13ms, mfu 10.04%
iter 1540: loss 0.3896, time 1396.81ms, mfu 10.10%
iter 1550: loss 0.3729, time 1396.97ms, mfu 10.16%
iter 1560: loss 0.3719, time 1396.46ms, mfu 10.21%
iter 1570: loss 0.3951, time 1397.06ms, mfu 10.25%
iter 1580: loss 0.3723, time 1395.96ms, mfu 10.30%
iter 1590: loss 0.3719, time 1396.24ms, mfu 10.33%
iter 1600: loss 0.3787, time 1395.88ms, mfu 10.37%
iter 1610: loss 0.3628, time 1395.75ms, mfu 10.40%
iter 1620: loss 0.3713, time 1397.60ms, mfu 10.43%
iter 1630: loss 0.3550, time 1396.80ms, mfu 10.45%
iter 1640: loss 0.3717, time 1397.31ms, mfu 10.47%
iter 1650: loss 0.3657, time 1395.51ms, mfu 10.49%
iter 1660: loss 0.3553, time 1395.29ms, mfu 10.51%
iter 1670: loss 0.3558, time 1396.13ms, mfu 10.53%
iter 1680: loss 0.3377, time 1397.38ms, mfu 10.54%
iter 1690: loss 0.3515, time 1396.99ms, mfu 10.55%
iter 1700: loss 0.3486, time 1395.50ms, mfu 10.57%
iter 1710: loss 0.3422, time 1395.65ms, mfu 10.58%
iter 1720: loss 0.3527, time 1395.82ms, mfu 10.59%
iter 1730: loss 0.3397, time 1397.71ms, mfu 10.60%
iter 1740: loss 0.3379, time 1396.31ms, mfu 10.60%
step 1750: train loss 0.1023, val loss 2.7085
iter 1750: loss 0.3402, time 5830.73ms, mfu 9.80%
iter 1760: loss 0.3718, time 1396.05ms, mfu 9.89%
iter 1770: loss 0.3542, time 1395.77ms, mfu 9.97%
iter 1780: loss 0.3278, time 1396.88ms, mfu 10.04%
iter 1790: loss 0.3237, time 1395.49ms, mfu 10.10%
iter 1800: loss 0.3190, time 1396.11ms, mfu 10.16%
iter 1810: loss 0.3168, time 1395.17ms, mfu 10.21%
iter 1820: loss 0.3173, time 1397.21ms, mfu 10.26%
iter 1830: loss 0.3182, time 1398.33ms, mfu 10.30%
iter 1840: loss 0.3205, time 1396.30ms, mfu 10.33%
iter 1850: loss 0.3148, time 1395.38ms, mfu 10.37%
iter 1860: loss 0.3084, time 1396.11ms, mfu 10.40%
iter 1870: loss 0.3156, time 1395.58ms, mfu 10.43%
iter 1880: loss 0.3139, time 1396.58ms, mfu 10.45%
iter 1890: loss 0.3217, time 1396.58ms, mfu 10.47%
iter 1900: loss 0.3148, time 1397.03ms, mfu 10.49%
iter 1910: loss 0.3084, time 1395.46ms, mfu 10.51%
iter 1920: loss 0.3127, time 1395.68ms, mfu 10.53%
iter 1930: loss 0.3201, time 1396.21ms, mfu 10.54%
iter 1940: loss 0.3035, time 1397.30ms, mfu 10.56%
iter 1950: loss 0.3101, time 1396.34ms, mfu 10.57%
iter 1960: loss 0.2990, time 1396.22ms, mfu 10.58%
iter 1970: loss 0.3049, time 1395.96ms, mfu 10.59%
iter 1980: loss 0.2934, time 1395.03ms, mfu 10.60%
iter 1990: loss 0.2874, time 1397.65ms, mfu 10.60%
step 2000: train loss 0.0942, val loss 2.8577
iter 2000: loss 0.2923, time 5852.53ms, mfu 9.80%
iter 2010: loss 0.2912, time 1395.97ms, mfu 9.89%
iter 2020: loss 0.2946, time 1395.69ms, mfu 9.97%
iter 2030: loss 0.3042, time 1396.45ms, mfu 10.04%
iter 2040: loss 0.2845, time 1397.80ms, mfu 10.10%
iter 2050: loss 0.2835, time 1395.66ms, mfu 10.16%
iter 2060: loss 0.2955, time 1395.43ms, mfu 10.21%
iter 2070: loss 0.2914, time 1396.30ms, mfu 10.26%
iter 2080: loss 0.2805, time 1395.78ms, mfu 10.30%
iter 2090: loss 0.2995, time 1396.00ms, mfu 10.34%
iter 2100: loss 0.2913, time 1396.46ms, mfu 10.37%
iter 2110: loss 0.2899, time 1396.60ms, mfu 10.40%
iter 2120: loss 0.2925, time 1396.35ms, mfu 10.43%
iter 2130: loss 0.2807, time 1396.85ms, mfu 10.45%
iter 2140: loss 0.2756, time 1396.39ms, mfu 10.47%
iter 2150: loss 0.2790, time 1395.13ms, mfu 10.50%
iter 2160: loss 0.2801, time 1396.35ms, mfu 10.51%
iter 2170: loss 0.2680, time 1396.02ms, mfu 10.53%
iter 2180: loss 0.2809, time 1396.18ms, mfu 10.54%
iter 2190: loss 0.2725, time 1396.69ms, mfu 10.56%
iter 2200: loss 0.2723, time 1395.61ms, mfu 10.57%
iter 2210: loss 0.2750, time 1395.76ms, mfu 10.58%
iter 2220: loss 0.2665, time 1396.04ms, mfu 10.59%
iter 2230: loss 0.2632, time 1397.37ms, mfu 10.60%
iter 2240: loss 0.2750, time 1396.50ms, mfu 10.60%
step 2250: train loss 0.0883, val loss 2.9841
iter 2250: loss 0.2809, time 5827.53ms, mfu 9.80%
iter 2260: loss 0.2735, time 1395.44ms, mfu 9.89%
iter 2270: loss 0.2649, time 1395.60ms, mfu 9.97%
iter 2280: loss 0.2677, time 1396.26ms, mfu 10.04%
iter 2290: loss 0.2708, time 1397.06ms, mfu 10.10%
iter 2300: loss 0.2592, time 1395.89ms, mfu 10.16%
iter 2310: loss 0.2555, time 1395.71ms, mfu 10.21%
iter 2320: loss 0.2637, time 1395.79ms, mfu 10.26%
iter 2330: loss 0.2607, time 1396.06ms, mfu 10.30%
iter 2340: loss 0.2667, time 1396.14ms, mfu 10.34%
iter 2350: loss 0.2542, time 1396.13ms, mfu 10.37%
iter 2360: loss 0.2603, time 1394.59ms, mfu 10.40%
iter 2370: loss 0.2569, time 1395.76ms, mfu 10.43%
iter 2380: loss 0.2542, time 1395.66ms, mfu 10.46%
iter 2390: loss 0.2636, time 1396.60ms, mfu 10.48%
iter 2400: loss 0.2527, time 1396.18ms, mfu 10.50%
iter 2410: loss 0.2454, time 1395.95ms, mfu 10.51%
iter 2420: loss 0.2493, time 1395.37ms, mfu 10.53%
iter 2430: loss 0.2559, time 1396.10ms, mfu 10.55%
iter 2440: loss 0.2569, time 1396.71ms, mfu 10.56%
iter 2450: loss 0.2573, time 1396.07ms, mfu 10.57%
iter 2460: loss 0.2479, time 1395.60ms, mfu 10.58%
iter 2470: loss 0.2514, time 1395.71ms, mfu 10.59%
iter 2480: loss 0.2505, time 1396.36ms, mfu 10.60%
iter 2490: loss 0.2551, time 1397.24ms, mfu 10.61%
step 2500: train loss 0.0846, val loss 3.1065
iter 2500: loss 0.2564, time 5855.72ms, mfu 9.80%
iter 2510: loss 0.2534, time 1395.32ms, mfu 9.89%
iter 2520: loss 0.2538, time 1396.35ms, mfu 9.97%
iter 2530: loss 0.2599, time 1397.59ms, mfu 10.04%
iter 2540: loss 0.2439, time 1396.39ms, mfu 10.10%
iter 2550: loss 0.2446, time 1396.15ms, mfu 10.16%
iter 2560: loss 0.2497, time 1395.30ms, mfu 10.21%
iter 2570: loss 0.2503, time 1395.12ms, mfu 10.26%
iter 2580: loss 0.2413, time 1395.61ms, mfu 10.30%
iter 2590: loss 0.2550, time 1397.11ms, mfu 10.34%
iter 2600: loss 0.2450, time 1396.70ms, mfu 10.37%
iter 2610: loss 0.2449, time 1396.05ms, mfu 10.40%
iter 2620: loss 0.2401, time 1395.63ms, mfu 10.43%
iter 2630: loss 0.2367, time 1395.24ms, mfu 10.45%
iter 2640: loss 0.2387, time 1396.21ms, mfu 10.48%
iter 2650: loss 0.2481, time 1395.63ms, mfu 10.50%
iter 2660: loss 0.2281, time 1395.85ms, mfu 10.51%
iter 2670: loss 0.2364, time 1396.07ms, mfu 10.53%
iter 2680: loss 0.2368, time 1395.36ms, mfu 10.55%
iter 2690: loss 0.2381, time 1396.54ms, mfu 10.56%
iter 2700: loss 0.2320, time 1395.94ms, mfu 10.57%
iter 2710: loss 0.2345, time 1395.72ms, mfu 10.58%
iter 2720: loss 0.2361, time 1394.89ms, mfu 10.59%
iter 2730: loss 0.2322, time 1396.44ms, mfu 10.60%
iter 2740: loss 0.2180, time 1396.10ms, mfu 10.61%
step 2750: train loss 0.0821, val loss 3.2077
iter 2750: loss 0.2246, time 5845.95ms, mfu 9.80%
iter 2760: loss 0.2218, time 1395.24ms, mfu 9.89%
iter 2770: loss 0.2278, time 1396.33ms, mfu 9.97%
iter 2780: loss 0.2252, time 1396.55ms, mfu 10.04%
iter 2790: loss 0.2253, time 1395.96ms, mfu 10.10%
iter 2800: loss 0.2243, time 1395.94ms, mfu 10.16%
iter 2810: loss 0.2170, time 1395.45ms, mfu 10.21%
iter 2820: loss 0.2194, time 1395.59ms, mfu 10.26%
iter 2830: loss 0.2282, time 1395.67ms, mfu 10.30%
iter 2840: loss 0.2205, time 1396.07ms, mfu 10.34%
iter 2850: loss 0.2295, time 1396.02ms, mfu 10.37%
iter 2860: loss 0.2269, time 1395.82ms, mfu 10.40%
iter 2870: loss 0.2227, time 1395.21ms, mfu 10.43%
iter 2880: loss 0.2214, time 1396.80ms, mfu 10.45%
iter 2890: loss 0.2117, time 1397.77ms, mfu 10.48%
iter 2900: loss 0.2126, time 1396.02ms, mfu 10.50%
iter 2910: loss 0.2238, time 1395.95ms, mfu 10.51%
iter 2920: loss 0.2170, time 1396.77ms, mfu 10.53%
iter 2930: loss 0.2303, time 1395.38ms, mfu 10.54%
iter 2940: loss 0.2177, time 1396.25ms, mfu 10.56%
iter 2950: loss 0.2164, time 1396.23ms, mfu 10.57%
iter 2960: loss 0.2261, time 1394.96ms, mfu 10.58%
iter 2970: loss 0.2162, time 1395.43ms, mfu 10.59%
iter 2980: loss 0.2164, time 1395.56ms, mfu 10.60%
iter 2990: loss 0.2181, time 1396.75ms, mfu 10.61%
step 3000: train loss 0.0795, val loss 3.3033
iter 3000: loss 0.2120, time 5831.66ms, mfu 9.80%
iter 3010: loss 0.2117, time 1394.72ms, mfu 9.89%
iter 3020: loss 0.2109, time 1396.49ms, mfu 9.97%
iter 3030: loss 0.2288, time 1395.74ms, mfu 10.04%
iter 3040: loss 0.2185, time 1396.67ms, mfu 10.10%
iter 3050: loss 0.2146, time 1396.54ms, mfu 10.16%
iter 3060: loss 0.2063, time 1396.53ms, mfu 10.21%
iter 3070: loss 0.2139, time 1395.24ms, mfu 10.26%
iter 3080: loss 0.2122, time 1395.87ms, mfu 10.30%
iter 3090: loss 0.2027, time 1397.17ms, mfu 10.34%
iter 3100: loss 0.2144, time 1395.34ms, mfu 10.37%
iter 3110: loss 0.2257, time 1396.23ms, mfu 10.40%
iter 3120: loss 0.2102, time 1395.60ms, mfu 10.43%
iter 3130: loss 0.2072, time 1396.10ms, mfu 10.45%
iter 3140: loss 0.2082, time 1395.68ms, mfu 10.48%
iter 3150: loss 0.2121, time 1396.81ms, mfu 10.50%
iter 3160: loss 0.2061, time 1396.68ms, mfu 10.51%
iter 3170: loss 0.1955, time 1395.18ms, mfu 10.53%
iter 3180: loss 0.2053, time 1395.35ms, mfu 10.55%
iter 3190: loss 0.2104, time 1395.92ms, mfu 10.56%
iter 3200: loss 0.2140, time 1395.34ms, mfu 10.57%
iter 3210: loss 0.1993, time 1395.98ms, mfu 10.58%
iter 3220: loss 0.2012, time 1394.71ms, mfu 10.59%
iter 3230: loss 0.2028, time 1395.98ms, mfu 10.60%
iter 3240: loss 0.2138, time 1395.71ms, mfu 10.61%
step 3250: train loss 0.0780, val loss 3.3859
iter 3250: loss 0.2091, time 5841.23ms, mfu 9.80%
iter 3260: loss 0.2058, time 1396.07ms, mfu 9.89%
iter 3270: loss 0.2043, time 1397.21ms, mfu 9.97%
iter 3280: loss 0.2045, time 1396.75ms, mfu 10.04%
iter 3290: loss 0.1999, time 1396.46ms, mfu 10.10%
iter 3300: loss 0.2028, time 1396.42ms, mfu 10.16%
iter 3310: loss 0.2022, time 1394.55ms, mfu 10.21%
iter 3320: loss 0.1993, time 1395.66ms, mfu 10.26%
iter 3330: loss 0.1987, time 1395.88ms, mfu 10.30%
iter 3340: loss 0.2015, time 1396.69ms, mfu 10.34%
iter 3350: loss 0.2003, time 1395.98ms, mfu 10.37%
iter 3360: loss 0.2053, time 1396.19ms, mfu 10.40%
iter 3370: loss 0.2030, time 1396.23ms, mfu 10.43%
iter 3380: loss 0.1946, time 1395.42ms, mfu 10.45%
iter 3390: loss 0.1991, time 1396.85ms, mfu 10.48%
iter 3400: loss 0.1966, time 1396.09ms, mfu 10.50%
iter 3410: loss 0.2060, time 1396.34ms, mfu 10.51%
iter 3420: loss 0.2016, time 1396.14ms, mfu 10.53%
iter 3430: loss 0.2013, time 1395.82ms, mfu 10.54%
iter 3440: loss 0.2015, time 1397.46ms, mfu 10.56%
iter 3450: loss 0.1937, time 1395.00ms, mfu 10.57%
iter 3460: loss 0.1895, time 1395.58ms, mfu 10.58%
iter 3470: loss 0.1941, time 1393.97ms, mfu 10.59%
iter 3480: loss 0.2000, time 1395.54ms, mfu 10.60%
iter 3490: loss 0.1968, time 1396.20ms, mfu 10.61%
step 3500: train loss 0.0765, val loss 3.4490
iter 3500: loss 0.2007, time 5843.20ms, mfu 9.80%
iter 3510: loss 0.1947, time 1396.31ms, mfu 9.89%
iter 3520: loss 0.1970, time 1395.62ms, mfu 9.97%
iter 3530: loss 0.2012, time 1395.89ms, mfu 10.04%
iter 3540: loss 0.1977, time 1396.59ms, mfu 10.10%
iter 3550: loss 0.2031, time 1395.86ms, mfu 10.16%
iter 3560: loss 0.1864, time 1395.95ms, mfu 10.21%
iter 3570: loss 0.1994, time 1395.73ms, mfu 10.26%
iter 3580: loss 0.1943, time 1395.84ms, mfu 10.30%
iter 3590: loss 0.1883, time 1396.66ms, mfu 10.34%
iter 3600: loss 0.1949, time 1395.57ms, mfu 10.37%
iter 3610: loss 0.1937, time 1394.28ms, mfu 10.40%
iter 3620: loss 0.1857, time 1395.96ms, mfu 10.43%
iter 3630: loss 0.1880, time 1398.29ms, mfu 10.45%
iter 3640: loss 0.1928, time 1395.48ms, mfu 10.48%
iter 3650: loss 0.1925, time 1396.16ms, mfu 10.50%
iter 3660: loss 0.1888, time 1394.57ms, mfu 10.52%
iter 3670: loss 0.1942, time 1394.98ms, mfu 10.53%
iter 3680: loss 0.1876, time 1395.96ms, mfu 10.55%
iter 3690: loss 0.1879, time 1395.39ms, mfu 10.56%
iter 3700: loss 0.1776, time 1395.65ms, mfu 10.57%
iter 3710: loss 0.1937, time 1394.41ms, mfu 10.58%
iter 3720: loss 0.1820, time 1396.65ms, mfu 10.59%
iter 3730: loss 0.1953, time 1396.12ms, mfu 10.60%
iter 3740: loss 0.1856, time 1395.92ms, mfu 10.61%
step 3750: train loss 0.0755, val loss 3.5307
iter 3750: loss 0.1828, time 5845.61ms, mfu 9.80%
iter 3760: loss 0.1798, time 1396.08ms, mfu 9.89%
iter 3770: loss 0.1882, time 1395.16ms, mfu 9.97%
iter 3780: loss 0.1831, time 1395.07ms, mfu 10.04%
iter 3790: loss 0.1847, time 1395.98ms, mfu 10.10%
iter 3800: loss 0.1837, time 1394.44ms, mfu 10.16%
iter 3810: loss 0.1864, time 1394.99ms, mfu 10.22%
iter 3820: loss 0.1850, time 1394.58ms, mfu 10.26%
iter 3830: loss 0.1831, time 1395.50ms, mfu 10.30%
iter 3840: loss 0.1845, time 1395.93ms, mfu 10.34%
iter 3850: loss 0.1837, time 1395.45ms, mfu 10.38%
iter 3860: loss 0.1850, time 1396.44ms, mfu 10.41%
iter 3870: loss 0.1727, time 1395.43ms, mfu 10.43%
iter 3880: loss 0.1832, time 1395.11ms, mfu 10.46%
iter 3890: loss 0.1860, time 1396.42ms, mfu 10.48%
iter 3900: loss 0.1835, time 1396.00ms, mfu 10.50%
iter 3910: loss 0.1960, time 1395.40ms, mfu 10.52%
iter 3920: loss 0.1815, time 1395.38ms, mfu 10.53%
iter 3930: loss 0.1906, time 1395.15ms, mfu 10.55%
iter 3940: loss 0.1807, time 1395.60ms, mfu 10.56%
iter 3950: loss 0.1817, time 1398.31ms, mfu 10.57%
iter 3960: loss 0.1764, time 1396.48ms, mfu 10.58%
iter 3970: loss 0.1787, time 1395.17ms, mfu 10.59%
iter 3980: loss 0.1727, time 1395.32ms, mfu 10.60%
iter 3990: loss 0.1772, time 1395.31ms, mfu 10.61%
step 4000: train loss 0.0742, val loss 3.5892
iter 4000: loss 0.1825, time 5853.00ms, mfu 9.80%
iter 4010: loss 0.1832, time 1395.68ms, mfu 9.89%
iter 4020: loss 0.1800, time 1395.25ms, mfu 9.97%
iter 4030: loss 0.1753, time 1394.93ms, mfu 10.04%
iter 4040: loss 0.1822, time 1396.01ms, mfu 10.10%
iter 4050: loss 0.1792, time 1396.04ms, mfu 10.16%
iter 4060: loss 0.1805, time 1397.10ms, mfu 10.21%
iter 4070: loss 0.1791, time 1396.38ms, mfu 10.26%
iter 4080: loss 0.1727, time 1395.72ms, mfu 10.30%
iter 4090: loss 0.1771, time 1395.96ms, mfu 10.34%
iter 4100: loss 0.1730, time 1395.62ms, mfu 10.37%
iter 4110: loss 0.1744, time 1396.09ms, mfu 10.40%
iter 4120: loss 0.1790, time 1396.16ms, mfu 10.43%
iter 4130: loss 0.1748, time 1395.76ms, mfu 10.46%
iter 4140: loss 0.1809, time 1395.48ms, mfu 10.48%
iter 4150: loss 0.1730, time 1396.41ms, mfu 10.50%
iter 4160: loss 0.1768, time 1396.96ms, mfu 10.51%
iter 4170: loss 0.1772, time 1396.36ms, mfu 10.53%
iter 4180: loss 0.1701, time 1395.68ms, mfu 10.55%
iter 4190: loss 0.1759, time 1394.81ms, mfu 10.56%
iter 4200: loss 0.1776, time 1397.39ms, mfu 10.57%
iter 4210: loss 0.1722, time 1397.47ms, mfu 10.58%
iter 4220: loss 0.1730, time 1396.22ms, mfu 10.59%
iter 4230: loss 0.1715, time 1394.98ms, mfu 10.60%
iter 4240: loss 0.1782, time 1395.52ms, mfu 10.61%
step 4250: train loss 0.0737, val loss 3.6301
iter 4250: loss 0.1742, time 5829.30ms, mfu 9.80%
iter 4260: loss 0.1719, time 1395.96ms, mfu 9.89%
iter 4270: loss 0.1737, time 1397.36ms, mfu 9.97%
iter 4280: loss 0.1750, time 1396.12ms, mfu 10.04%
iter 4290: loss 0.1716, time 1395.20ms, mfu 10.10%
iter 4300: loss 0.1742, time 1395.36ms, mfu 10.16%
iter 4310: loss 0.1698, time 1395.47ms, mfu 10.21%
iter 4320: loss 0.1679, time 1396.83ms, mfu 10.26%
iter 4330: loss 0.1758, time 1395.56ms, mfu 10.30%
iter 4340: loss 0.1737, time 1395.96ms, mfu 10.34%
iter 4350: loss 0.1728, time 1395.92ms, mfu 10.37%
iter 4360: loss 0.1638, time 1395.93ms, mfu 10.40%
iter 4370: loss 0.1704, time 1396.43ms, mfu 10.43%
iter 4380: loss 0.1731, time 1395.75ms, mfu 10.45%
iter 4390: loss 0.1734, time 1395.29ms, mfu 10.48%
iter 4400: loss 0.1755, time 1396.20ms, mfu 10.50%
iter 4410: loss 0.1734, time 1396.99ms, mfu 10.51%
iter 4420: loss 0.1671, time 1396.79ms, mfu 10.53%
iter 4430: loss 0.1746, time 1395.35ms, mfu 10.55%
iter 4440: loss 0.1698, time 1394.89ms, mfu 10.56%
iter 4450: loss 0.1709, time 1395.87ms, mfu 10.57%
iter 4460: loss 0.1732, time 1396.20ms, mfu 10.58%
iter 4470: loss 0.1709, time 1396.58ms, mfu 10.59%
iter 4480: loss 0.1744, time 1395.89ms, mfu 10.60%
iter 4490: loss 0.1680, time 1395.58ms, mfu 10.61%
step 4500: train loss 0.0729, val loss 3.6449
iter 4500: loss 0.1739, time 5832.06ms, mfu 9.80%
iter 4510: loss 0.1693, time 1396.05ms, mfu 9.89%
iter 4520: loss 0.1708, time 1396.85ms, mfu 9.97%
iter 4530: loss 0.1594, time 1395.56ms, mfu 10.04%
iter 4540: loss 0.1661, time 1395.39ms, mfu 10.10%
iter 4550: loss 0.1665, time 1395.12ms, mfu 10.16%
iter 4560: loss 0.1690, time 1397.28ms, mfu 10.21%
iter 4570: loss 0.1664, time 1395.59ms, mfu 10.26%
iter 4580: loss 0.1691, time 1395.47ms, mfu 10.30%
iter 4590: loss 0.1700, time 1394.83ms, mfu 10.34%
iter 4600: loss 0.1639, time 1394.86ms, mfu 10.37%
iter 4610: loss 0.1618, time 1395.88ms, mfu 10.40%
iter 4620: loss 0.1678, time 1395.99ms, mfu 10.43%
iter 4630: loss 0.1694, time 1396.96ms, mfu 10.46%
iter 4640: loss 0.1697, time 1394.56ms, mfu 10.48%
iter 4650: loss 0.1699, time 1393.97ms, mfu 10.50%
iter 4660: loss 0.1691, time 1395.08ms, mfu 10.52%
iter 4670: loss 0.1747, time 1395.97ms, mfu 10.53%
iter 4680: loss 0.1670, time 1396.05ms, mfu 10.55%
iter 4690: loss 0.1677, time 1395.60ms, mfu 10.56%
iter 4700: loss 0.1668, time 1396.53ms, mfu 10.57%
iter 4710: loss 0.1686, time 1396.15ms, mfu 10.58%
iter 4720: loss 0.1749, time 1397.71ms, mfu 10.59%
iter 4730: loss 0.1677, time 1396.46ms, mfu 10.60%
iter 4740: loss 0.1651, time 1395.50ms, mfu 10.61%
step 4750: train loss 0.0726, val loss 3.6949
iter 4750: loss 0.1612, time 5828.90ms, mfu 9.80%
iter 4760: loss 0.1647, time 1396.15ms, mfu 9.89%
iter 4770: loss 0.1631, time 1397.44ms, mfu 9.97%
iter 4780: loss 0.1584, time 1397.02ms, mfu 10.04%
iter 4790: loss 0.1677, time 1395.88ms, mfu 10.10%
iter 4800: loss 0.1676, time 1395.00ms, mfu 10.16%
iter 4810: loss 0.1651, time 1394.33ms, mfu 10.21%
iter 4820: loss 0.1628, time 1395.26ms, mfu 10.26%
iter 4830: loss 0.1674, time 1396.76ms, mfu 10.30%
iter 4840: loss 0.1605, time 1395.77ms, mfu 10.34%
iter 4850: loss 0.1639, time 1395.68ms, mfu 10.37%
iter 4860: loss 0.1762, time 1395.44ms, mfu 10.40%
iter 4870: loss 0.1628, time 1396.52ms, mfu 10.43%
iter 4880: loss 0.1628, time 1396.79ms, mfu 10.45%
iter 4890: loss 0.1591, time 1395.89ms, mfu 10.48%
iter 4900: loss 0.1672, time 1395.60ms, mfu 10.50%
iter 4910: loss 0.1634, time 1396.36ms, mfu 10.51%
iter 4920: loss 0.1596, time 1396.20ms, mfu 10.53%
iter 4930: loss 0.1680, time 1396.82ms, mfu 10.54%
iter 4940: loss 0.1590, time 1396.95ms, mfu 10.56%
iter 4950: loss 0.1608, time 1395.59ms, mfu 10.57%
iter 4960: loss 0.1650, time 1394.79ms, mfu 10.58%
iter 4970: loss 0.1675, time 1395.83ms, mfu 10.59%
iter 4980: loss 0.1636, time 1397.06ms, mfu 10.60%
iter 4990: loss 0.1673, time 1396.25ms, mfu 10.61%
step 5000: train loss 0.0720, val loss 3.7275
iter 5000: loss 0.1584, time 5840.43ms, mfu 9.80%
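
The evaluation lines in the log above show the val loss bottoming out early and then rising while the train loss keeps falling, which is the overfitting the config comments anticipate. A minimal sketch that extracts the `step N: train loss X, val loss Y` lines from such a log and reports the step with the lowest val loss. `parse_eval_lines` and `best_step` are hypothetical helper names, and the sample lines are copied from the log above.

```python
import re

EVAL_LINE = re.compile(r'^step (\d+): train loss ([\d.]+), val loss ([\d.]+)')


def parse_eval_lines(log_text: str) -> list:
    """Return (step, train_loss, val_loss) tuples found in a training log."""
    rows = []
    for line in log_text.splitlines():
        match = EVAL_LINE.match(line.strip())
        if match:
            rows.append((int(match.group(1)),
                         float(match.group(2)),
                         float(match.group(3))))
    return rows


def best_step(rows: list) -> tuple:
    """Return the row with the lowest val loss."""
    return min(rows, key=lambda row: row[2])


sample_log = """\
step 0: train loss 4.2874, val loss 4.2823
iter 240: loss 1.6742, time 1395.36ms, mfu 10.68%
step 250: train loss 1.5482, val loss 1.7322
step 500: train loss 1.1257, val loss 1.4794
step 750: train loss 0.7875, val loss 1.5834
step 5000: train loss 0.0720, val loss 3.7275
"""

rows = parse_eval_lines(sample_log)
print(best_step(rows))  # (500, 1.1257, 1.4794)
```

Because `always_save_checkpoint = False`, the log prints "saving checkpoint" only at steps 250 and 500, so the checkpoint left in out-shakespeare-char should be the step-500 one.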

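The `mfu` column in the log can be approximately reproduced from numbers the log itself prints. A rough worked example under several assumptions: a single GPU, tokens per iteration equal to gradient accumulation steps times `batch_size` times `block_size`, the common estimate of 6N + 12·L·H·Q·T FLOPs per token, and a 312 TFLOPS reference peak (the A100 bfloat16 figure, which nanoGPT appears to use; on a V100 the percentage is therefore relative to a different card's peak). `estimate_mfu` is an illustrative re-derivation, not the repository's code.

```python
def estimate_mfu(n_params, n_layer, n_head, n_embd, block_size,
                 seqs_per_iter, seconds_per_iter, peak_flops=312e12):
    """Estimate model FLOPs utilization as a fraction of peak_flops."""
    head_dim = n_embd // n_head
    # about 6 FLOPs per parameter per token, plus the attention term
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_iter = flops_per_token * block_size * seqs_per_iter
    return flops_per_iter / seconds_per_iter / peak_flops


batch_size, block_size = 64, 256
tokens_per_iter = 655360                  # "total number of tokens per iteration"
# Inferred, not printed in the log: 655360 / (64 * 256) = 40
grad_accum = tokens_per_iter // (batch_size * block_size)

mfu = estimate_mfu(
    n_params=10.65e6,                     # "number of parameters: 10.65M" (rounded)
    n_layer=6, n_head=6, n_embd=384,
    block_size=block_size,
    seqs_per_iter=batch_size * grad_accum,
    seconds_per_iter=1.38925,             # "iter 10: ... time 1389.25ms"
)
print(f'{mfu * 100:.2f}%')
```

Under these assumptions the estimate comes out at about 10.7%, in line with the `mfu 10.73%` the log shows at iter 10.
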
Source: https://www.cnblogs.com/LittleHann/p/17316908.html