Implement News on Singers

  • Webpages - (Information collection) -> Database - (Information retrieval) -> Rank, Search, Recommend
  • Information collection is done with a crawler.

How to crawl a page?

import urllib2

# request the source file
url = '...(URL)...'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
page = response.read()

# save the source file
webFile = open('xxx', 'wb')
webFile.write(page)
webFile.close()
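The snippet above uses Python 2's urllib2, which was merged into urllib.request in Python 3. A minimal Python 3 sketch of the same steps (the URL and header value are placeholders, not from the original notes):

```python
import urllib.request

# Python 3 equivalent of the urllib2 snippet above.
url = 'http://example.com/'  # placeholder standing in for '...(URL)...'
request = urllib.request.Request(url, headers={'User-Agent': 'news-crawler/0.1'})

# Fetching and saving would look like this (network call left commented out):
# response = urllib.request.urlopen(request)
# page = response.read()
# with open('xxx', 'wb') as webFile:
#     webFile.write(page)
```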

What is the network process when you are crawling a webpage?

Layers: DNS first resolves the hostname to an IP address; the HTTP request then travels down the protocol stack, application (HTTP) over transport (TCP) over network (IP) over the link layer.
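The layering can be made concrete: an HTTP request is just text carried over a TCP byte stream. A small sketch (the host name is a placeholder):

```python
# What the application layer actually puts on the wire: an HTTP request is
# plain text; TCP below it only sees a stream of bytes.
host = 'example.com'  # placeholder host
request_bytes = (
    'GET / HTTP/1.1\r\n'
    'Host: ' + host + '\r\n'
    'Connection: close\r\n'
    '\r\n'
).encode('ascii')

# A TCP socket would carry these bytes (network call left commented out):
# import socket
# with socket.create_connection((host, 80)) as s:
#     s.sendall(request_bytes)
```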

How to crawl the contents?

  • Crawl all the news from a website
    • Identify the list page
    • Identify the links of news
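Identifying the news links on a list page can be sketched with the standard-library HTML parser; the markup fed in below is made up for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags on a crawled list page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

extractor = LinkExtractor()
# Hypothetical list-page fragment:
extractor.feed('<ul><li><a href="/news/1">A</a></li>'
               '<li><a href="/news/2">B</a></li></ul>')
# extractor.links now holds ['/news/1', '/news/2']
```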

  • How about more websites?

    • allocate one crawler for each website
    • use a scheduler to schedule the list crawlers' tasks and the news crawlers' tasks
  • In fact, we can use fewer crawlers

    • task table + scheduler
    • taskID, priority, type, state, link, availableTime, endTime
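One row of the task table could be modeled directly from the fields listed above; this is only a sketch, and the example values are invented:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One row of the task table; field names follow the notes above."""
    taskID: int
    priority: int
    type: str            # 'list' or 'news'
    state: str           # 'new' -> 'working' -> 'done'
    link: str
    availableTime: float # earliest time the task may run
    endTime: float       # time the task finished

# Illustrative row: a list-page task waiting to be crawled.
t = Task(1, 0, 'list', 'new', 'http://example.com/list', 0.0, 0.0)
```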

  • How to design a scheduler?

    • "Sleep" solution

    /*
        - Maintaining two tables: taskTable and pageTable
        - Don't forget to change the state of task at the end of each loop!
    */
    
    while(true){
        lock(taskTable);
    
        // find a non-working (new) task
        if((task = taskTable.findOne(state == "new")) == NULL){
            unlock(taskTable);
            sleep(1000);
            continue; // otherwise we would fall through to crawl() with no task
        }
        task.state = "working";
        unlock(taskTable);
    
        page = crawl(task.url);
    
        if(task.type == "list"){
            lock(taskTable);
            for(newTask:page){
                taskTable.append(newTask);
            }
            task.state = "done";
            unlock(taskTable);
        } else {
            lock(pageTable);
            pageTable.append(page);
            unlock(pageTable);
            lock(taskTable);
            task.state = "done";
            unlock(taskTable);
        }
    }
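The "sleep" scheduler above can be sketched as runnable Python; the task contents and crawl() are made up, and a single worker exits instead of sleeping and retrying once no new task remains:

```python
import threading

task_lock = threading.Lock()
task_table = [{'link': 'list:0', 'type': 'list', 'state': 'new'}]
page_table = []

def crawl(link):
    # Stand-in: a list page yields two news links; a news page yields content.
    if link.startswith('list'):
        return ['news:1', 'news:2']
    return 'content of ' + link

def worker():
    while True:
        with task_lock:
            task = next((t for t in task_table if t['state'] == 'new'), None)
            if task is None:
                break                  # real version: sleep(1000) and retry
            task['state'] = 'working'
        page = crawl(task['link'])     # crawl outside the lock
        with task_lock:
            if task['type'] == 'list':
                for link in page:
                    task_table.append({'link': link, 'type': 'news',
                                       'state': 'new'})
            else:
                page_table.append(page)
            task['state'] = 'done'     # don't forget the state change

worker()
```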
    
  • Considering the sleep solution: what if new tasks arrive within the 1 s sleep?

    • The thread cannot be awakened immediately!
    • "Conditional Variable" solution

    conditional_variable_wait(cond, mutex){
        lock(cond.threadWaitList);
        cond.threadWaitList.append(this.thread);
        unlock(cond.threadWaitList);
    
        unlock(mutex);
        block(this.thread); // note: real implementations make unlock + block atomic, or a signal sent in between is lost
        // after receiving the signal, unblock (wake up) this.thread
        lock(mutex);
    }
    
    conditional_variable_signal(cond){
        lock(cond.threadWaitList);
        if(cond.threadWaitList.size() > 0){
            thread = cond.threadWaitList.pop();
            wakeup(thread);
        }
        unlock(cond.threadWaitList);
    }
    
    while(true){
        lock(taskTable);
    
        // find a non-working(new) task
        while((taskTable.find(state == "new")) == NULL){
            conditional_variable_wait(cond, taskTable);
        }
        task = taskTable.findOne(state == "new");
        task.state = "working";
        unlock(taskTable);
    
        page = crawl(task.url);
    
        if(task.type == "list"){
            lock(taskTable);
            for(newTask:page){
                taskTable.append(newTask);
                // new task is available
                conditional_variable_signal(cond);
            }
            task.state = "done";
            unlock(taskTable);
        } else {
            lock(pageTable);
            pageTable.append(page);
            unlock(pageTable);
            lock(taskTable);
            task.state = "done";
            unlock(taskTable);
        }
    
    }
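Python's threading.Condition provides the same wait/signal primitives, so the loop above can be sketched as runnable code; the task contents and crawl() are invented for illustration:

```python
import threading

cond = threading.Condition()
task_table = [{'link': 'list:0', 'type': 'list', 'state': 'new'}]
page_table = []
done = threading.Event()

def crawl(link):
    # Stand-in: a list page yields two news links; a news page yields content.
    if link.startswith('list'):
        return ['news:1', 'news:2']
    return 'content of ' + link

def find_new():
    return next((t for t in task_table if t['state'] == 'new'), None)

def worker():
    while True:
        with cond:
            task = find_new()
            while task is None and not done.is_set():
                cond.wait()          # releases the lock while blocked
                task = find_new()
            if task is None:         # everything finished: exit
                return
            task['state'] = 'working'
        page = crawl(task['link'])   # crawl outside the lock
        with cond:
            if task['type'] == 'list':
                for link in page:
                    task_table.append({'link': link, 'type': 'news',
                                       'state': 'new'})
                    cond.notify()    # a new task is available
            else:
                page_table.append(page)
            task['state'] = 'done'
            if all(t['state'] == 'done' for t in task_table):
                done.set()
                cond.notify_all()    # let idle workers exit

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```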
    
  • A semaphore is a simplified, counted ("quantified") conditional variable: each signal increments a counter, so signals are not lost when no thread is waiting.
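A sketch of the semaphore idea applied to the task queue, using Python's threading.Semaphore (task values are placeholders):

```python
import threading

# The semaphore counts available tasks: release() whenever a task is
# appended, acquire() before taking one, so idle workers block without
# polling and no signal is lost.
tasks = []
task_lock = threading.Lock()
available = threading.Semaphore(0)   # starts at 0: nothing to do yet

def add_task(task):
    with task_lock:
        tasks.append(task)
    available.release()              # count += 1, may wake a worker

def take_task():
    available.acquire()              # blocks until count > 0, then count -= 1
    with task_lock:
        return tasks.pop(0)

add_task('news:1')
add_task('news:2')
a = take_task()
b = take_task()
```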


Distribute crawlers in multiple machines

  • One-to-one

    • Too many connections
  • One-to-many

    • sender & receiver
    • sender and receiver share taskTable and crawlerTable
    • still complicated - # of connections is not reduced!

Using Database

  • update/add/findOne..
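Keeping the task table in a database lets every crawler machine share it through add/findOne/update operations. A sketch with an in-memory SQLite table, using the fields listed earlier (the link value is a placeholder):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute("""CREATE TABLE task (
    taskID        INTEGER PRIMARY KEY,
    priority      INTEGER,
    type          TEXT,
    state         TEXT,
    link          TEXT,
    availableTime TEXT,
    endTime       TEXT)""")

# add: enqueue a new list-page task
db.execute("INSERT INTO task (priority, type, state, link) "
           "VALUES (1, 'list', 'new', 'http://example.com/list')")

# findOne: claim one new task
row = db.execute("SELECT taskID, link FROM task "
                 "WHERE state = 'new' LIMIT 1").fetchone()

# update: mark it as being worked on
db.execute("UPDATE task SET state = 'working' WHERE taskID = ?", (row[0],))
state = db.execute("SELECT state FROM task WHERE taskID = ?",
                   (row[0],)).fetchone()[0]
```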
