Implement News on Singers
- Webpages - (Information collection) -> Database - (Information retrieval) -> Rank, Search, Recommend
- Information collection with a crawler
How to crawl a page?
import urllib2

# request the source file
url = '...(URL)...'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
page = response.read()

# save the source file to disk
webFile = open('xxx', 'wb')
webFile.write(page)
webFile.close()
What is the network process when you are crawling a webpage?
- Network layers: the HTTP request goes down the protocol stack (application, transport, network, link) on the crawler's machine, across the network, and back up the stack on the web server
How to crawl the contents?
- Crawl all the news from a website
- Identify the list page
- Identify the links to the news articles (a sketch follows this list)
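As an illustration of the "identify the links to the news articles" step, the snippet below downloads a list page and pulls the article links out with a regular expression. The URL, the link pattern, and the site layout are made-up placeholders; a real site would need its own pattern (or an HTML parser).

import re
import urllib2

# Hypothetical list page and link pattern -- adjust for the real website.
LIST_URL = 'http://example.com/singer/news/list'
NEWS_LINK_PATTERN = re.compile(r'href="(/news/\d+\.html)"')

def crawl_list_page(list_url):
    """Download a list page and return the news links found on it."""
    page = urllib2.urlopen(urllib2.Request(list_url)).read()
    links = NEWS_LINK_PATTERN.findall(page)
    # Turn relative links into absolute URLs.
    return ['http://example.com' + link for link in links]

if __name__ == '__main__':
    for url in crawl_list_page(LIST_URL):
        print(url)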
How about more websites?
- allocate one crawler for each website
- use a scheduler to schedule the list crawlers' tasks and the news crawlers' tasks
In fact, we can use fewer crawlers
- task table + scheduler
- fields: taskID, priority, type, state, link, availableTime, endTime (a sample row is sketched below)
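As an illustration of the task table fields above, here is one possible row written as a Python dict. The field names come from the notes; the concrete values and the comments are only assumptions about how a scheduler might use them.

import time

# One row of the task table. "type" distinguishes list-page tasks from
# news-page tasks, and "state" is what the scheduler polls and updates.
task = {
    'taskID':        1,
    'priority':      5,                # higher = crawl sooner (assumed convention)
    'type':          'list',           # 'list' or 'news'
    'state':         'new',            # 'new' -> 'working' -> 'done'
    'link':          'http://example.com/singer/news/list',
    'availableTime': time.time(),      # earliest time this task may run
    'endTime':       None,             # set when the task finishes
}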
How to design a scheduler?
"Sleep" solution
/*
 * Maintain two tables: taskTable and pageTable.
 * Don't forget to change the state of the task at the end of each loop!
 */
while (true) {
    lock(taskTable);
    // find a non-working (new) task
    if (taskTable.find(state == "new") == NULL) {
        unlock(taskTable);
        sleep(1000);
        continue;                       // no task yet: go back and check again
    } else {
        task = taskTable.findOne(state == "new");
        task.state = "working";
        unlock(taskTable);
    }

    page = crawl(task.url);

    if (task.type == "list") {
        lock(taskTable);
        for (newTask : page) {
            taskTable.append(newTask);
        }
        task.state = "done";
        unlock(taskTable);
    } else {
        lock(pageTable);
        pageTable.append(page);
        unlock(pageTable);
        lock(taskTable);
        task.state = "done";
        unlock(taskTable);
    }
}
Considering the sleep solution: what if new tasks arrive within the 1-second sleep?
- The thread cannot be awakened immediately!
"Conditional Variable" solution
conditional_variable_wait(cond, mutex) {
    lock(cond.threadWaitList);
    cond.threadWaitList.append(this.thread);
    unlock(cond.threadWaitList);
    unlock(mutex);
    block(this.thread);    // after receiving the signal, unblock (wake up) this.thread
    lock(mutex);
}

conditional_variable_signal(cond) {
    lock(cond.threadWaitList);
    if (cond.threadWaitList.size() > 0) {
        thread = cond.threadWaitList.pop();
        wakeup(thread);
    }
    unlock(cond.threadWaitList);
}

while (true) {
    lock(taskTable);
    // find a non-working (new) task
    while (taskTable.find(state == "new") == NULL) {
        conditional_variable_wait(cond, taskTable);
    }
    task = taskTable.findOne(state == "new");
    task.state = "working";
    unlock(taskTable);

    page = crawl(task.url);

    if (task.type == "list") {
        lock(taskTable);
        for (newTask : page) {
            taskTable.append(newTask);
            conditional_variable_signal(cond);   // a new task is available
        }
        task.state = "done";
        unlock(taskTable);
    } else {
        lock(pageTable);
        pageTable.append(page);
        unlock(pageTable);
        lock(taskTable);
        task.state = "done";
        unlock(taskTable);
    }
}
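The pseudocode above maps closely onto real condition-variable APIs. Below is a minimal Python sketch of the same worker loop using threading.Condition; the task_table layout and the crawl callback are placeholders, not part of the original notes.

import threading

task_table = []                      # list of task dicts, protected by cond's lock
cond = threading.Condition()         # condition variable plus its mutex

def add_task(task):
    with cond:
        task_table.append(task)
        cond.notify()                # a new task is available

def worker(crawl):
    while True:
        with cond:
            # wait until some task is in the "new" state
            while not any(t['state'] == 'new' for t in task_table):
                cond.wait()
            task = next(t for t in task_table if t['state'] == 'new')
            task['state'] = 'working'
        page = crawl(task['link'])   # crawl outside the lock
        with cond:
            task['state'] = 'done'   # the page itself would go into pageTable here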
Simplified, counted ("quantified") conditional variable: the semaphore (see the sketch below)
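One way to read the semaphore remark: the semaphore keeps the count of available tasks itself, so the worker no longer needs an explicit "wait until a new task exists" loop. A minimal Python sketch, with an assumed task_table list protected by a separate lock:

import threading

task_table = []
table_lock = threading.Lock()
tasks_available = threading.Semaphore(0)   # counts tasks in the "new" state

def add_task(task):
    with table_lock:
        task_table.append(task)
    tasks_available.release()              # count += 1, wakes up one waiting worker

def get_task():
    tasks_available.acquire()              # blocks until the count is > 0, then count -= 1
    with table_lock:
        task = next(t for t in task_table if t['state'] == 'new')
        task['state'] = 'working'
        return task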
Distribute crawlers across multiple machines
One-to-one
- Too many connections: every task-sending machine needs a link to every crawler machine
One-to-many
- sender & receiver
- sender and receiver share taskTable and crawlerTable
- still complicated: the number of connections is not reduced!
Using a database
- crawlers only talk to the database, through update/add/findOne operations (a sketch follows)
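A minimal sketch of the database approach, using Python's built-in sqlite3 as a stand-in for the shared task table. The function names and schema are illustrative; a real multi-machine deployment would use a networked database, and the "find one new task and mark it working" step would have to be made atomic, which this sketch glosses over.

import sqlite3
import time

db = sqlite3.connect('tasks.db')
db.execute("""CREATE TABLE IF NOT EXISTS taskTable (
                  taskID        INTEGER PRIMARY KEY AUTOINCREMENT,
                  priority      INTEGER,
                  type          TEXT,
                  state         TEXT,
                  link          TEXT,
                  availableTime REAL,
                  endTime       REAL)""")

def add(link, task_type, priority=0):
    # add: insert a new task
    db.execute("INSERT INTO taskTable (priority, type, state, link, availableTime) "
               "VALUES (?, ?, 'new', ?, ?)", (priority, task_type, link, time.time()))
    db.commit()

def find_one_new():
    # findOne: pick a runnable 'new' task and mark it as working
    row = db.execute("SELECT taskID, type, link FROM taskTable "
                     "WHERE state = 'new' AND availableTime <= ? "
                     "ORDER BY priority DESC LIMIT 1", (time.time(),)).fetchone()
    if row is None:
        return None
    db.execute("UPDATE taskTable SET state = 'working' WHERE taskID = ?", (row[0],))
    db.commit()
    return row

def update_done(task_id):
    # update: mark a finished task as done and record its end time
    db.execute("UPDATE taskTable SET state = 'done', endTime = ? WHERE taskID = ?",
               (time.time(), task_id))
    db.commit()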