Following the run steps in Getting Started, the first three steps complete successfully, but step 4 fails: running the provided "Create pod sharing one GPU" example reports the error below.
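For reference, this is roughly the manifest I applied, reconstructed from the pod names, image, and resource requests visible in the output below. The Deployment layout and the sleep command are my assumptions; elasticgpu.io/gpu-memory and the value 256 match what the agent registers and what the scheduler logs show.

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-gpu-test                   # matches the cuda-gpu-test-* pod names below
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-gpu-test
  template:
    metadata:
      labels:
        app: cuda-gpu-test
    spec:
      containers:
      - name: cuda
        image: nvidia/cuda:10.0-base    # image reported in the pod events
        command: ["sleep", "infinity"]  # assumed; just keeps the container alive
        resources:
          limits:
            elasticgpu.io/gpu-memory: 256   # resource name from the agent logs, value from the scheduler logs
EOF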
Events:
Type Reason Age From Message
Normal Scheduled 30m default-scheduler Successfully assigned default/cuda-gpu-test-968d697b9-rnsrq to 10.10.0.15
Warning FailedScheduling 30m default-scheduler pod e4947b6d-c27d-424b-a7ac-0c04ed232ce0 is in the cache, so can't be assumed
Warning Failed 28m (x12 over 30m) kubelet Error: device plugin PreStartContainer rpc failed with err: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/pod-resources/kubelet.sock: connect: no such file or directory"
Normal Pulled 41s (x141 over 30m) kubelet Container image "nvidia/cuda:10.0-base" already present on machine
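The recurring kubelet error is the device plugin PreStartContainer call failing because /var/lib/kubelet/pod-resources/kubelet.sock does not exist on the node. A quick check on the GPU node (suggested diagnostic only; both paths are the standard kubelet locations, the first taken directly from the error message):

# on node 10.10.0.15: does the pod-resources socket the plugin dials actually exist?
ls -l /var/lib/kubelet/pod-resources/
# and the device plugin registration sockets
ls -l /var/lib/kubelet/device-plugins/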
K8s version: v1.21.3
elastic-gpu-scheduler logs:
I0721 06:53:58.076290 1 main.go:44] priority algorithm: binpack
I0721 06:53:58.096595 1 controller.go:57] Creating event broadcaster
I0721 06:53:58.096769 1 controller.go:104] begin to wait for cache
I0721 06:53:58.197332 1 controller.go:109] init the node cache successfully
I0721 06:53:58.197374 1 controller.go:115] init the pod cache successfully
I0721 06:53:58.197379 1 controller.go:118] end to wait for cache
I0721 06:53:58.197407 1 main.go:97] server starting on the port: 39999
I0721 06:53:58.197440 1 controller.go:128] Starting GPU Sharing Controller.
I0721 06:53:58.197451 1 controller.go:129] Waiting for informer caches to sync
I0721 06:53:58.197454 1 controller.go:131] Starting 1 workers.
I0721 06:53:58.197461 1 controller.go:136] Started workers
I0721 07:10:32.095025 1 routes.go:66] start filter for pod default/cuda-gpu-test-968d697b9-2cc2z
I0721 07:10:32.110343 1 gpu.go:66] Trade: (core: 0, memory: 256, gpu count: 0)
I0721 07:10:32.110373 1 gpu.go:95] start to allocate request: scheduler.GPUUnit{Core:0, Memory:256, GPUCount:0}, gpus: scheduler.GPUs{(*scheduler.GPU)(0xc000116de0)}
I0721 07:10:32.110412 1 gpu.go:116] allocate request: scheduler.GPUUnit{Core:0, Memory:256, GPUCount:0}, gpu: &scheduler.GPU{CoreAvailable:100, MemoryAvailable:15109, CoreTotal:100, MemoryTotal:15109}
I0721 07:10:32.110422 1 scheduler.go:146] assume: 10.10.0.15 [[0]], err:
I0721 07:10:32.110523 1 scheduler.go:170] node allcated: {"54ab24a2":{"Request":[{"Core":0,"Memory":256,"GPUCount":0}],"Allocated":[[0]],"Score":0}}
I0721 07:10:32.110569 1 routes.go:79] ElasticGPUPredicate extenderFilterResult = {"Nodes":null,"NodeNames":["10.10.0.15"],"FailedNodes":{},"Error":""}
I0721 07:10:32.111581 1 routes.go:138] start bind pod default/cuda-gpu-test-968d697b9-2cc2z to node 10.10.0.15
I0721 07:10:32.114008 1 node.go:95] allocated option: &{Request:(core: 0, memory: 256, gpu count: 0) Allocated:[[0]] Score:0}
I0721 07:10:32.123467 1 routes.go:66] start filter for pod default/cuda-gpu-test-968d697b9-2cc2z
I0721 07:10:32.128699 1 routes.go:155] extenderBindingResult = {"Error":""}
I0721 07:10:32.128888 1 gpu.go:66] Trade: (core: 0, memory: 256, gpu count: 0)
I0721 07:10:32.128926 1 gpu.go:95] start to allocate request: scheduler.GPUUnit{Core:0, Memory:256, GPUCount:0}, gpus: scheduler.GPUs{(*scheduler.GPU)(0xc000116de0)}
I0721 07:10:32.128966 1 gpu.go:116] allocate request: scheduler.GPUUnit{Core:0, Memory:256, GPUCount:0}, gpu: &scheduler.GPU{CoreAvailable:100, MemoryAvailable:14853, CoreTotal:100, MemoryTotal:15109}
I0721 07:10:32.128975 1 scheduler.go:146] assume: 10.10.0.15 [[0]], err:
I0721 07:10:32.129012 1 scheduler.go:170] node allcated: {"54ab24a2":{"Request":[{"Core":0,"Memory":256,"GPUCount":0}],"Allocated":[[0]],"Score":0}}
I0721 07:10:32.129035 1 routes.go:79] ElasticGPUPredicate extenderFilterResult = {"Nodes":null,"NodeNames":["10.10.0.15"],"FailedNodes":{},"Error":""}
I0721 07:16:32.058377 1 routes.go:66] start filter for pod default/cuda-gpu-test-968d697b9-5h7k4
I0721 07:16:32.058425 1 scheduler.go:146] assume: 10.10.0.15 [[0]], err:
I0721 07:16:32.058461 1 scheduler.go:170] node allcated: {"54ab24a2":{"Request":[{"Core":0,"Memory":256,"GPUCount":0}],"Allocated":[[0]],"Score":0}}
I0721 07:16:32.058497 1 routes.go:79] ElasticGPUPredicate extenderFilterResult = {"Nodes":null,"NodeNames":["10.10.0.15"],"FailedNodes":{},"Error":""}
I0721 07:16:32.059431 1 routes.go:138] start bind pod default/cuda-gpu-test-968d697b9-5h7k4 to node 10.10.0.15
I0721 07:16:32.067092 1 node.go:95] allocated option: &{Request:(core: 0, memory: 256, gpu count: 0) Allocated:[[0]] Score:0}
I0721 07:16:32.075019 1 routes.go:66] start filter for pod default/cuda-gpu-test-968d697b9-5h7k4
I0721 07:16:32.078795 1 routes.go:155] extenderBindingResult = {"Error":""}
I0721 07:16:32.078864 1 gpu.go:66] Trade: (core: 0, memory: 256, gpu count: 0)
I0721 07:16:32.078932 1 gpu.go:95] start to allocate request: scheduler.GPUUnit{Core:0, Memory:256, GPUCount:0}, gpus: scheduler.GPUs{(*scheduler.GPU)(0xc000116de0)}
I0721 07:16:32.078950 1 gpu.go:116] allocate request: scheduler.GPUUnit{Core:0, Memory:256, GPUCount:0}, gpu: &scheduler.GPU{CoreAvailable:100, MemoryAvailable:14853, CoreTotal:100, MemoryTotal:15109}
I0721 07:16:32.078957 1 scheduler.go:146] assume: 10.10.0.15 [[0]], err:
I0721 07:16:32.078970 1 scheduler.go:170] node allcated: {"54ab24a2":{"Request":[{"Core":0,"Memory":256,"GPUCount":0}],"Allocated":[[0]],"Score":0}}
I0721 07:16:32.078976 1 routes.go:79] ElasticGPUPredicate extenderFilterResult = {"Nodes":null,"NodeNames":["10.10.0.15"],"FailedNodes":{},"Error":""}
I0721 07:16:39.620552 1 controller.go:296] delete pod default/cuda-gpu-test-968d697b9-2cc2z
I0721 07:17:04.615647 1 controller.go:296] delete pod default/cuda-gpu-test-968d697b9-5h7k4
I0721 07:17:27.528301 1 routes.go:66] start filter for pod default/cuda-gpu-test-968d697b9-rnsrq
I0721 07:17:27.528349 1 scheduler.go:146] assume: 10.10.0.15 [[0]], err:
I0721 07:17:27.528393 1 scheduler.go:170] node allcated: {"54ab24a2":{"Request":[{"Core":0,"Memory":256,"GPUCount":0}],"Allocated":[[0]],"Score":0}}
I0721 07:17:27.528407 1 routes.go:79] ElasticGPUPredicate extenderFilterResult = {"Nodes":null,"NodeNames":["10.10.0.15"],"FailedNodes":{},"Error":""}
I0721 07:17:27.530168 1 routes.go:138] start bind pod default/cuda-gpu-test-968d697b9-rnsrq to node 10.10.0.15
I0721 07:17:27.548958 1 node.go:95] allocated option: &{Request:(core: 0, memory: 256, gpu count: 0) Allocated:[[0]] Score:0}
I0721 07:17:27.559443 1 routes.go:66] start filter for pod default/cuda-gpu-test-968d697b9-rnsrq
I0721 07:17:27.563396 1 routes.go:155] extenderBindingResult = {"Error":""}
I0721 07:17:27.563491 1 gpu.go:66] Trade: (core: 0, memory: 256, gpu count: 0)
I0721 07:17:27.563507 1 gpu.go:95] start to allocate request: scheduler.GPUUnit{Core:0, Memory:256, GPUCount:0}, gpus: scheduler.GPUs{(*scheduler.GPU)(0xc000116de0)}
I0721 07:17:27.563533 1 gpu.go:116] allocate request: scheduler.GPUUnit{Core:0, Memory:256, GPUCount:0}, gpu: &scheduler.GPU{CoreAvailable:100, MemoryAvailable:14853, CoreTotal:100, MemoryTotal:15109}
I0721 07:17:27.563543 1 scheduler.go:146] assume: 10.10.0.15 [[0]], err:
I0721 07:17:27.563561 1 scheduler.go:170] node allcated: {"54ab24a2":{"Request":[{"Core":0,"Memory":256,"GPUCount":0}],"Allocated":[[0]],"Score":0}}
I0721 07:17:27.563574 1 routes.go:79] ElasticGPUPredicate extenderFilterResult = {"Nodes":null,"NodeNames":["10.10.0.15"],"FailedNodes":{},"Error":""}
elastic-gpu-agent logs:
I0721 06:53:53.073287 1 main.go:31] start to run elastic gpu agent
I0721 06:53:53.073314 1 manager.go:146] start to run gpu manager
I0721 06:53:53.073345 1 manager.go:150] polling if the sitter has done listing pods:false
I0721 06:53:53.174001 1 manager.go:150] polling if the sitter has done listing pods:true
I0721 06:53:53.174045 1 base.go:237] start plugin elasticgpu.io/gpu-memory
I0721 06:53:53.174053 1 base.go:237] start plugin elasticgpu.io/gpu-core
I0721 06:54:53.174171 1 base.go:250] gpushare plugin starts to GC
I0721 06:55:53.174616 1 base.go:250] gpushare plugin starts to GC
I0721 06:56:53.175347 1 base.go:250] gpushare plugin starts to GC
I0721 06:57:53.176272 1 base.go:250] gpushare plugin starts to GC
I0721 06:58:53.177349 1 base.go:250] gpushare plugin starts to GC
I0721 06:59:53.178092 1 base.go:250] gpushare plugin starts to GC
I0721 07:00:53.178252 1 base.go:250] gpushare plugin starts to GC
I0721 07:01:53.179227 1 base.go:250] gpushare plugin starts to GC
I0721 07:02:53.180218 1 base.go:250] gpushare plugin starts to GC
I0721 07:03:53.180939 1 base.go:250] gpushare plugin starts to GC
I0721 07:04:53.181332 1 base.go:250] gpushare plugin starts to GC
I0721 07:05:53.181719 1 base.go:250] gpushare plugin starts to GC
I0721 07:06:53.182780 1 base.go:250] gpushare plugin starts to GC
I0721 07:07:53.183888 1 base.go:250] gpushare plugin starts to GC
I0721 07:08:53.184118 1 base.go:250] gpushare plugin starts to GC
I0721 07:09:53.185022 1 base.go:250] gpushare plugin starts to GC
I0721 07:10:53.185990 1 base.go:250] gpushare plugin starts to GC
E0721 07:10:55.889130 1 gpushare.go:220] no pod with such device list: 0-3343:0-6581:0-5234:0-12894:0-7636:0-6888:0-4835:0-9011:0-8006:0-6861:0-1042:0-6959:0-14053:0-509:0-10899:0-7266:0-14452:0-1227:0-4281:0-13003:0-13958:0-1698:0-2474:0-4914:0-1056:0-4461:0-5743:0-8145:0-10397:0-4574:0-4712:0-9063:0-14279:0-7690:0-2665:0-3927:0-2140:0-8928:0-13664:0-4660:0-12006:0-349:0-3692:0-13498:0-6416:0-2820:0-6215:0-4206:0-12306:0-13773:0-8747:0-1643:0-12232:0-11414:0-8718:0-9506:0-10449:0-1593:0-5686:0-14147:0-12616:0-12921:0-3725:0-10957:0-9107:0-6061:0-1675:0-10198:0-1969:0-2793:0-6456:0-11085:0-10066:0-11882:0-13693:0-10912:0-13679:0-1604:0-13171:0-10028:0-10307:0-2865:0-7240:0-6591:0-7183:0-8774:0-1445:0-6064:0-13261:0-700:0-8486:0-7533:0-3108:0-6295:0-1228:0-14363:0-14691:0-2755:0-9188:0-11138:0-13751:0-4396:0-4673:0-7527:0-8713:0-12582:0-9839:0-9814:0-2558:0-9850:0-14212:0-7968:0-7261:0-7206:0-9499:0-6289:0-11172:0-14683:0-8812:0-1316:0-15076:0-8205:0-1083:0-3454:0-11144:0-3161:0-3389:0-12705:0-7300:0-11754:0-2775:0-12189:0-11084:0-2654:0-1246:0-13387:0-9103:0-8207:0-3903:0-11021:0-548:0-1886:0-10441:0-1312:0-11527:0-4247:0-12398:0-7862:0-4874:0-8739:0-9199:0-11251:0-6430:0-10498:0-12020:0-7648:0-2390:0-10994:0-7306:0-5316:0-9179:0-7841:0-9633:0-103:0-12639:0-1546:0-12134:0-11651:0-5150:0-6772:0-1729:0-9793:0-11632:0-8060:0-767:0-5867:0-14830:0-12005:0-2264:0-1181:0-8836:0-4173:0-5627:0-2954:0-9415:0-13840:0-14879:0-9787:0-5261:0-7048:0-11162:0-7094:0-7665:0-11361:0-11129:0-904:0-4225:0-6400:0-12790:0-12920:0-12244:0-13686:0-8915:0-5882:0-4138:0-3202:0-566:0-10562:0-9305:0-2550:0-14799:0-10298:0-13103:0-4904:0-115:0-13839:0-1210:0-13409:0-10036:0-9355:0-9686:0-13114:0-14686:0-12524:0-14628:0-14752:0-13436:0-3273:0-6540:0-11628:0-8079:0-6230:0-4218:0-8831:0-593:0-15067:0-13974:0-6399:0-10514:0-4308:0-11764:0-11163:0-1793:0-7385:0-8848:0-10001:0-9543:0-3005:0-11847:0-14245:0-3496:0-13000:0-2435:0-3297:0-5697:0-5872
E0721 07:10:56.382019 1 gpushare.go:220] no pod with such device list: 0-3297:0-8915:0-8836:0-6289:0-4173:0-4461:0-6581:0-14879:0-11414:0-8774:0-8486:0-11163:0-2558:0-4225:0-6591:0-10397:0-6959:0-4308:0-3692:0-12189:0-12398:0-14799:0-5697:0-9793:0-8713:0-9199:0-1246:0-3725:0-7183:0-14628:0-4904:0-13409:0-5882:0-3161:0-12006:0-13839:0-8831:0-548:0-12921:0-3927:0-3005:0-8739:0-1604:0-11361:0-11754:0-1793:0-7690:0-8205:0-3343:0-1969:0-8718:0-1643:0-9506:0-13387:0-8006:0-9787:0-2550:0-8928:0-14686:0-1083:0-9011:0-767:0-12134:0-8747:0-11847:0-11632:0-12524:0-5150:0-1546:0-8207:0-1228:0-6064:0-11129:0-12894:0-13686:0-7266:0-6230:0-4712:0-9543:0-12639:0-1675:0-10066:0-14212:0-4218:0-14053:0-5234:0-1210:0-6861:0-14147:0-8060:0-12790:0-4396:0-5261:0-2390:0-7841:0-13693:0-4574:0-566:0-10562:0-9188:0-6295:0-13840:0-7648:0-2793:0-11085:0-13773:0-5867:0-9103:0-1886:0-509:0-4281:0-14691:0-7206:0-6540:0-9107:0-6888:0-6400:0-10994:0-13664:0-10198:0-10498:0-1312:0-115:0-5627:0-10298:0-13103:0-2435:0-13498:0-4874:0-1181:0-7968:0-4914:0-4247:0-6430:0-11084:0-5743:0-349:0-13000:0-7261:0-8848:0-12306:0-8145:0-6061:0-13261:0-2665:0-8812:0-9499:0-5316:0-6215:0-593:0-1593:0-10001:0-3454:0-9415:0-2474:0-5686:0-6456:0-7527:0-13958:0-9850:0-1042:0-2865:0-10912:0-12616:0-7240:0-4673:0-12705:0-11138:0-10441:0-14279:0-14363:0-1227:0-2654:0-5872:0-11651:0-7665:0-11021:0-8079:0-3108:0-12020:0-1729:0-9839:0-2264:0-10449:0-7636:0-12244:0-13436:0-4660:0-9179:0-7048:0-2140:0-11162:0-700:0-9814:0-15076:0-7385:0-9633:0-3903:0-15067:0-10899:0-2954:0-103:0-6772:0-11764:0-10514:0-7862:0-9063:0-1698:0-14752:0-13003:0-3273:0-14452:0-1445:0-13751:0-2775:0-7533:0-7306:0-4138:0-10028:0-14683:0-13114:0-10957:0-13171:0-12232:0-11144:0-10307:0-904:0-9686:0-11882:0-3496:0-9305:0-7300:0-11172:0-11251:0-11628:0-1056:0-11527:0-6399:0-9355:0-7094:0-13679:0-14245:0-1316:0-13974:0-4206:0-2820:0-10036:0-12005:0-14830:0-2755:0-4835:0-12920:0-12582:0-3202:0-3389:0-6416
Everything after this is just the same "no pod with such device list:" error repeated.