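Preface:
Graduation season is here. I wanted to practice reading English technical literature before starting work, and I'm interested in distributed systems, so I started digging into the topic. There isn't much material on distributed systems in Chinese, so I followed MIT's course: http://nil.csail.mit.edu/6.824/2015/schedule.html
The course has no videos, but it provides papers on distributed systems and sets 5 labs (in Go). Today I finally finished lab 1. Anyone who needs it is welcome to take it.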
Lab1
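The first lab is split into three parts:
part 1: implement the "Hello World" of the MapReduce model: counting how often each word occurs in a large body of text
part 2: in part 1 the map and reduce tasks are not dispatched in parallel, so in this part you rewrite the dispatch so that map and reduce tasks run in parallel
part 3: part 2 assumes that workers never fail; now you have to handle worker failures
The first two parts are effectively subsets of part 3, so only part 3 is explained below.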
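For readers who have not seen the word-count example from part 1, here is a minimal sequential sketch. It is not the lab's code: the `KeyValue` type and the function names `mapWords`, `reduceCounts` and `wordCount` are assumptions made for this illustration, and the lab's own Map and Reduce signatures may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is a hypothetical pair type for this sketch.
type KeyValue struct {
	Key   string
	Value string
}

// mapWords splits the input on non-letters and emits (word, "1") per word.
func mapWords(contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	out := make([]KeyValue, 0, len(words))
	for _, w := range words {
		out = append(out, KeyValue{Key: w, Value: "1"})
	}
	return out
}

// reduceCounts sums the counts emitted for a single word.
func reduceCounts(key string, values []string) string {
	total := 0
	for _, v := range values {
		n, err := strconv.Atoi(v)
		if err != nil {
			continue // skip malformed counts in this sketch
		}
		total += n
	}
	return strconv.Itoa(total)
}

// wordCount runs the map step, groups values by key, then runs the
// reduce step once per key, all sequentially.
func wordCount(contents string) map[string]string {
	grouped := make(map[string][]string)
	for _, kv := range mapWords(contents) {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}
	result := make(map[string]string, len(grouped))
	for k, vs := range grouped {
		result[k] = reduceCounts(k, vs)
	}
	return result
}

func main() {
	counts := wordCount("the quick fox and the lazy dog")
	fmt.Println(counts["the"], counts["fox"]) // prints "2 1"
}
```

Parts 2 and 3 keep these two functions and change only how the jobs that call them are handed out to workers.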
Code:
Link: https://pan.baidu.com/s/1z38zbg1hORT3383miQYhxA
Extraction code: 3rh2
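Part 3 explained:
What do we need to do?
1. New workers can register at any time, so we need a goroutine that keeps processing newly registered worker addresses
2. Existing workers can be killed at any time, so we need to keep each worker's status up to date
3. An assigned task may never finish because its worker was killed, so we need to keep checking whether any task still has to be assigned again
Workflow:
The rough flow is as above, but new workers can register and old workers can fail at any time, so these modules run in separate goroutines (concurrently).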
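Implementation:
Lab 1 ships with a MapReduce struct that manages the master's data.
In its additional-state section I keep a slice holding the state of each task to be assigned, plus an int counting how many tasks have finished (one pair for map and one for reduce).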
// Packages used by the snippets in this post.
import (
	"container/list"
	"fmt"
	"net"
	"strconv"
	"sync"
	"time"
)

type JobState struct {
	JobNum int
	// 0 not done
	// 1 assigned
	// 2 done
	State int
}

type MapReduce struct {
	nMap            int    // Number of Map jobs
	nReduce         int    // Number of Reduce jobs
	file            string // Name of input file
	MasterAddress   string
	registerChannel chan string
	DoneChannel     chan bool
	alive           bool
	l               net.Listener
	stats           *list.List
	// Map of registered workers that you need to keep up to date
	Workers map[string]*WorkerInfo
	// add any additional state here
	MapJobList      []JobState
	MapJobFinish    int
	ReduceJobList   []JobState
	ReduceJobFinish int
}
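The bare 0/1/2 state codes are easy to mix up. As a sketch of one alternative, the states could be given names with `iota`, and the job list could be built by a small helper. The names `JobStatus`, `JobPending`, `JobAssigned`, `JobDone` and `newJobList` are hypothetical; they are not part of the lab skeleton or of my solution.

```go
package main

import "fmt"

// JobStatus is a hypothetical named type replacing the bare ints 0, 1, 2.
type JobStatus int

const (
	JobPending  JobStatus = iota // 0: not done
	JobAssigned                  // 1: assigned to a worker
	JobDone                      // 2: done
)

// String makes the status readable in log output.
func (s JobStatus) String() string {
	switch s {
	case JobPending:
		return "pending"
	case JobAssigned:
		return "assigned"
	case JobDone:
		return "done"
	default:
		return "unknown"
	}
}

// JobState mirrors the struct in the post, with a typed State field.
type JobState struct {
	JobNum int
	State  JobStatus
}

// newJobList builds n pending jobs numbered 0..n-1.
func newJobList(n int) []JobState {
	jobs := make([]JobState, n)
	for i := range jobs {
		jobs[i] = JobState{JobNum: i, State: JobPending}
	}
	return jobs
}

func main() {
	jobs := newJobList(3)
	jobs[1].State = JobAssigned
	fmt.Println(jobs[0].State, jobs[1].State) // prints "pending assigned"
}
```

Because `JobNum` equals the slice index here, a job can be looked up directly as `jobs[n]` instead of being searched for in a loop.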
Registration module
It keeps receiving new registrants from registerChannel; when nobody is registering, it blocks on the channel receive.
// HandleRegister stores every worker address that arrives on
// mr.registerChannel under a generated name ("worker0", "worker1", ...).
func HandleRegister(mr *MapReduce, mutex *sync.Mutex) {
	for i := 0; mr.alive; i++ {
		// Blocks here until a worker registers.
		address := <-mr.registerChannel
		name := "worker" + strconv.Itoa(i)
		info := &WorkerInfo{
			address: address,
			idle:    true,
			alive:   true,
			jobNum:  -1, // -1 means no job assigned
		}
		mutex.Lock()
		mr.Workers[name] = info
		mutex.Unlock()
		fmt.Println("Register workerName:" + name)
	}
}
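One thing the loop above cannot do is stop while it is blocked on the receive. A self-contained sketch of the same pattern that also shuts down cleanly is shown below: it ranges over the channel, so closing the channel ends the loop. The `registry` and `workerInfo` types and the `done` channel are assumptions made for this illustration, not the lab's types.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// workerInfo is a hypothetical stand-in for the lab's WorkerInfo.
type workerInfo struct {
	address string
	idle    bool
	alive   bool
	jobNum  int
}

// registry guards the worker map with its own mutex.
type registry struct {
	mu      sync.Mutex
	workers map[string]*workerInfo
}

func newRegistry() *registry {
	return &registry{workers: make(map[string]*workerInfo)}
}

// handleRegister records each address received on ch. The range loop
// blocks while no one registers and exits once ch is closed.
func (r *registry) handleRegister(ch <-chan string, done chan<- struct{}) {
	i := 0
	for addr := range ch {
		name := "worker" + strconv.Itoa(i)
		r.mu.Lock()
		r.workers[name] = &workerInfo{address: addr, idle: true, alive: true, jobNum: -1}
		r.mu.Unlock()
		i++
	}
	close(done) // signal that no more registrations will be processed
}

// count returns the number of registered workers.
func (r *registry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.workers)
}

func main() {
	r := newRegistry()
	ch := make(chan string)
	done := make(chan struct{})
	go r.handleRegister(ch, done)
	ch <- "localhost:7001"
	ch <- "localhost:7002"
	close(ch)
	<-done
	fmt.Println("registered workers:", r.count()) // prints "registered workers: 2"
}
```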
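Map task assignment module
The map task assignment module has two pieces: a function that handles the logic of assigning one specific task, and a loop that periodically checks for unprocessed tasks and hands each one to a worker.
Below is the function that handles the assignment logic.
It takes a task number (MapJobNum) and sends that task to an idle worker over RPC. If the task completes, its state is set to done; otherwise the worker has probably been killed, so the worker is marked unreachable and the task is put back for reassignment.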
// assignMapJob sends map job MapJobNum to the worker WorkerName over RPC.
// On success the job is marked done (2) and the worker becomes idle again;
// on failure the worker is marked dead and the job is reset to not done (0).
func assignMapJob(mr *MapReduce, MapJobNum int,
	WorkerName string, m_lock *sync.Mutex) {
	JobArg := &DoJobArgs{
		File:          mr.file,
		Operation:     Map,
		JobNumber:     MapJobNum,
		NumOtherPhase: mr.nReduce,
	}
	var reply DoJobReply

	m_lock.Lock()
	address := mr.Workers[WorkerName].address
	m_lock.Unlock()

	fmt.Println("wait map job:", MapJobNum)
	ok := call(address, "Worker.DoJob", JobArg, &reply)
	fmt.Println(MapJobNum, "MAP JOB RETURN")

	m_lock.Lock()
	defer m_lock.Unlock()
	worker := mr.Workers[WorkerName]
	if !ok {
		// The worker is unreachable: mark it dead and reset the job
		// to "not done" so that the manager assigns it again.
		fmt.Println(WorkerName + " do map job failed")
		worker.alive = false
		for index := range mr.MapJobList {
			if mr.MapJobList[index].JobNum == worker.jobNum {
				mr.MapJobList[index].State = 0
				break
			}
		}
		//delete(mr.Workers, WorkerName)
		return
	}
	// The job is done: mark it done, then set the worker idle again
	// and clear its job record.
	fmt.Println("Map:", MapJobNum, "finished", "nMap: ",
		mr.nMap, " Finish: ", mr.MapJobFinish)
	mr.MapJobFinish++
	for index := range mr.MapJobList {
		if mr.MapJobList[index].JobNum == worker.jobNum {
			mr.MapJobList[index].State = 2
			break
		}
	}
	worker.jobNum = -1
	worker.idle = true
}
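Assigning map tasks periodically until all of them are done
FindIdleWorker looks for a currently idle worker and returns one that is available.
The manager then marks the task as assigned (1), marks the worker as busy, and starts a goroutine that calls the assignMapJob function above to handle the actual dispatch.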
func ManagerAssignMapJob(mr *MapReduce, mutex *sync.Mutex) {
	// NOTE: the loop conditions read mr.MapJobFinish and State without
	// holding the lock, which is technically a data race.
	for mr.nMap > mr.MapJobFinish {
		time.Sleep(20 * time.Microsecond)
		for index := 0; index < len(mr.MapJobList); index++ {
			if mr.MapJobList[index].State != 0 {
				continue
			}
			workerName := FindIdleWorker(mr, mutex)
			jobNum := mr.MapJobList[index].JobNum
			mutex.Lock()
			fmt.Println("mapManager get lock and assign map job: ", index)
			mr.MapJobList[index].State = 1
			// Mark the worker busy here, not inside the go func below:
			// the goroutine starts with some delay, so the manager could
			// otherwise see stale state and pick the same worker again.
			mr.Workers[workerName].idle = false
			mr.Workers[workerName].jobNum = jobNum
			mutex.Unlock()
			fmt.Println("mapManager release lock")
			//fmt.Println("assign Map Job", index, " MAX: ", mr.nMap,
			//	" worker:", workerName, "len(MapJob)", len(mr.MapJobList))
			go func(jobNum int, wkName string) {
				assignMapJob(mr, jobNum, wkName, mutex)
			}(jobNum, workerName)
		}
	}
}
FindIdleWorker
// Poll periodically, holding the lock while scanning the worker table.
func FindIdleWorker(mr *MapReduce, mutex *sync.Mutex) string {
	for {
		mutex.Lock()
		for key, value := range mr.Workers {
			if value.idle && value.alive {
				mutex.Unlock()
				return key
			}
		}
		mutex.Unlock()
		//fmt.Println("Not Found idle worker, find again later")
		time.Sleep(20 * time.Microsecond)
	}
}
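Polling with a sleep works, but a common Go alternative is to keep idle workers in a channel, so that waiting for a free worker is just a blocking receive. The sketch below is not what my solution does; `runJobs` and the `doJob` callback (a stand-in for the RPC to the worker) are hypothetical names, and the sketch assumes the `idle` channel is buffered with room for every worker.

```go
package main

import (
	"fmt"
	"sync"
)

// runJobs gives each job to an idle worker taken from the idle channel.
// doJob stands in for the RPC call and returns false if the worker failed.
// idle must be buffered with capacity >= the number of workers, so that
// handing a worker back never blocks.
func runJobs(nJobs int, idle chan string, doJob func(worker string, job int) bool) {
	var wg sync.WaitGroup
	for job := 0; job < nJobs; job++ {
		wg.Add(1)
		go func(job int) {
			defer wg.Done()
			for {
				worker := <-idle // blocks until some worker is free
				if doJob(worker, job) {
					idle <- worker // worker is idle again
					return
				}
				// The worker failed: do not put it back, and retry the
				// job with the next idle worker. If every worker fails,
				// this loop blocks forever (not handled in this sketch).
			}
		}(job)
	}
	wg.Wait() // returns only after every job has succeeded
}

func main() {
	idle := make(chan string, 2)
	idle <- "worker0"
	idle <- "worker1"

	var mu sync.Mutex
	finished := 0
	runJobs(4, idle, func(worker string, job int) bool {
		mu.Lock()
		finished++
		mu.Unlock()
		return true
	})
	fmt.Println("finished jobs:", finished) // prints "finished jobs: 4"
}
```

With this shape there is no worker table to scan and no idle flag to keep in sync: a worker is either in the channel (idle) or held by exactly one job goroutine (busy).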
The reduce module is similar to the map module, so it is not covered separately.
RunMaster
The driver function
func (mr *MapReduce) RunMaster() *list.List {
	// Your code here
	var _lock sync.Mutex

	// Handle registrations for the whole run.
	go HandleRegister(mr, &_lock)

	// Map phase: block until every map job has finished.
	wait := make(chan int)
	go func() {
		ManagerAssignMapJob(mr, &_lock)
		wait <- 1
	}()
	<-wait
	fmt.Println("map job has all finished")

	// Reduce phase: block until every reduce job has finished.
	waitReduce := make(chan int)
	go func() {
		ManagerAssignReduceJob(mr, &_lock)
		waitReduce <- 1
	}()
	<-waitReduce

	close(wait)
	close(waitReduce)
	return mr.KillWorkers()
}
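RunMaster is essentially a barrier between two phases: all map jobs must finish before any reduce job starts. A minimal sketch of that idea with `sync.WaitGroup` follows. The helpers `runPhase` and `runMaster` are hypothetical and the jobs only record an event, so this shows the ordering and nothing about the lab's RPCs.

```go
package main

import (
	"fmt"
	"sync"
)

// runPhase starts n jobs concurrently and returns only after all of
// them have finished.
func runPhase(n int, job func(i int)) {
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func(i int) {
			defer wg.Done()
			job(i)
		}(i)
	}
	wg.Wait()
}

// runMaster runs the whole map phase, then the whole reduce phase, and
// returns the order in which the jobs reported completion.
func runMaster(nMap, nReduce int) []string {
	var mu sync.Mutex
	var events []string
	record := func(s string) {
		mu.Lock()
		events = append(events, s)
		mu.Unlock()
	}
	runPhase(nMap, func(i int) { record("map") })
	runPhase(nReduce, func(i int) { record("reduce") })
	return events
}

func main() {
	fmt.Println(runMaster(3, 2)) // prints "[map map map reduce reduce]"
}
```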
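Notes:
1. Go maps are not safe for concurrent use when any goroutine writes to them, so take a lock before accessing the map.
2. Remember to sleep for a while after each round of task assignment or idle-worker lookup. When I left the sleep out, the program kept hanging on the last task and I don't know exactly why. My guess is a scheduling problem: a loop that spins without ever blocking can starve the goroutines that need to run, or to take the lock, in order to finish that task. I don't understand it well myself, so if anyone knows the cause, let's discuss.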
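One way to avoid both the sleep and the busy loop when waiting for the finished-job counter is a condition variable: the waiter sleeps inside `Wait` and is woken only when the counter changes. This is a sketch under my own assumptions, not a confirmed fix for the hang described in note 2; the `progress` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// progress counts finished jobs and lets a goroutine wait for a target
// count without polling.
type progress struct {
	mu       sync.Mutex
	cond     *sync.Cond
	finished int
}

func newProgress() *progress {
	p := &progress{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// markDone records one finished job and wakes any waiters.
func (p *progress) markDone() {
	p.mu.Lock()
	p.finished++
	p.mu.Unlock()
	p.cond.Broadcast()
}

// waitFor blocks until at least n jobs have finished, then returns the
// current count. Wait releases the lock while the goroutine sleeps.
func (p *progress) waitFor(n int) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	for p.finished < n {
		p.cond.Wait()
	}
	return p.finished
}

func main() {
	p := newProgress()
	for i := 0; i < 4; i++ {
		go p.markDone()
	}
	fmt.Println("finished:", p.waitFor(4)) // prints "finished: 4"
}
```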