Implement News on Singers

  • Webpages - (Information collection) -> Database - (Information retrieval) -> Rank, Search, Recommend
  • Information collection is done with a crawler.

How to crawl a page?

import urllib2

# request the source file
url = '...(URL)...'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
page = response.read()

# save the source file
webFile = open('xxx', 'wb')
webFile.write(page)
webFile.close()
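The snippet above uses Python 2's urllib2, which was merged into urllib.request in Python 3. A minimal Python 3 sketch of the same steps (the URL and header value are placeholders, not from the original notes):

```python
import urllib.request

# Python 3 equivalent of the urllib2 snippet above.
url = 'http://example.com/'  # placeholder standing in for '...(URL)...'
request = urllib.request.Request(url, headers={'User-Agent': 'news-crawler/0.1'})

# Fetching and saving would look like this (network call left commented out):
# response = urllib.request.urlopen(request)
# page = response.read()
# with open('xxx', 'wb') as webFile:
#     webFile.write(page)
```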

What is the network process when you are crawling a webpage?

Layers: DNS first resolves the hostname to an IP address; the HTTP request then travels down the protocol stack, application (HTTP) over transport (TCP) over network (IP) over the link layer.
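The layering can be made concrete: an HTTP request is just text carried over a TCP byte stream. A small sketch (the host name is a placeholder):

```python
# What the application layer actually puts on the wire: an HTTP request is
# plain text; TCP below it only sees a stream of bytes.
host = 'example.com'  # placeholder host
request_bytes = (
    'GET / HTTP/1.1\r\n'
    'Host: ' + host + '\r\n'
    'Connection: close\r\n'
    '\r\n'
).encode('ascii')

# A TCP socket would carry these bytes (network call left commented out):
# import socket
# with socket.create_connection((host, 80)) as s:
#     s.sendall(request_bytes)
```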

How to crawl the contents?

  • Crawl all the news from a website
    • Identify the list page
    • Identify the links of news
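Identifying the news links on a list page can be sketched with the standard-library HTML parser; the markup fed in below is made up for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags on a crawled list page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

extractor = LinkExtractor()
# Hypothetical list-page fragment:
extractor.feed('<ul><li><a href="/news/1">A</a></li>'
               '<li><a href="/news/2">B</a></li></ul>')
# extractor.links now holds ['/news/1', '/news/2']
```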

  • How about more websites?

    • allocate one crawler for each website
    • use a scheduler to schedule the list crawlers' tasks and the news crawlers' tasks
  • In fact, we can use fewer crawlers

    • task table + scheduler
    • taskID, priority, type, state, link, availableTime, endTime
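One row of the task table could be modeled directly from the fields listed above; this is only a sketch, and the example values are invented:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One row of the task table; field names follow the notes above."""
    taskID: int
    priority: int
    type: str            # 'list' or 'news'
    state: str           # 'new' -> 'working' -> 'done'
    link: str
    availableTime: float # earliest time the task may run
    endTime: float       # time the task finished

# Illustrative row: a list-page task waiting to be crawled.
t = Task(1, 0, 'list', 'new', 'http://example.com/list', 0.0, 0.0)
```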

  • How to design a scheduler?

    • "Sleep" solution

    /*
        - Maintaining two tables: taskTable and pageTable
        - Don't forget to change the state of task at the end of each loop!
    */
    
    while(true){
        lock(taskTable);
    
        // find a non-working (new) task
        if((task = taskTable.findOne(state == "new")) == NULL){
            unlock(taskTable);
            sleep(1000);
            continue; // otherwise we would fall through to crawl() with no task
        }
        task.state = "working";
        unlock(taskTable);
    
        page = crawl(task.url);
    
        if(task.type == "list"){
            lock(taskTable);
            for(newTask:page){
                taskTable.append(newTask);
            }
            task.state = "done";
            unlock(taskTable);
        } else {
            lock(pageTable);
            pageTable.append(page);
            unlock(pageTable);
            lock(taskTable);
            task.state = "done";
            unlock(taskTable);
        }
    }
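The "sleep" scheduler above can be sketched as runnable Python; the task contents and crawl() are made up, and a single worker exits instead of sleeping and retrying once no new task remains:

```python
import threading

task_lock = threading.Lock()
task_table = [{'link': 'list:0', 'type': 'list', 'state': 'new'}]
page_table = []

def crawl(link):
    # Stand-in: a list page yields two news links; a news page yields content.
    if link.startswith('list'):
        return ['news:1', 'news:2']
    return 'content of ' + link

def worker():
    while True:
        with task_lock:
            task = next((t for t in task_table if t['state'] == 'new'), None)
            if task is None:
                break                  # real version: sleep(1000) and retry
            task['state'] = 'working'
        page = crawl(task['link'])     # crawl outside the lock
        with task_lock:
            if task['type'] == 'list':
                for link in page:
                    task_table.append({'link': link, 'type': 'news',
                                       'state': 'new'})
            else:
                page_table.append(page)
            task['state'] = 'done'     # don't forget the state change

worker()
```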
    
  • Considering the sleep solution: what if new tasks arrive within the 1 s sleep?

    • The thread cannot be awakened immediately!
    • "Conditional Variable" solution

    conditional_variable_wait(cond, mutex){
        lock(cond.threadWaitList);
        cond.threadWaitList.append(this.thread);
        unlock(cond.threadWaitList);
    
        unlock(mutex);
        block(this.thread); // note: real implementations make unlock + block atomic, or a signal sent in between is lost
        // after receiving the signal, unblock (wake up) this.thread
        lock(mutex);
    }
    
    conditional_variable_signal(cond){
        lock(cond.threadWaitList);
        if(cond.threadWaitList.size() > 0){
            thread = cond.threadWaitList.pop();
            wakeup(thread);
        }
        unlock(cond.threadWaitList);
    }
    
    while(true){
        lock(taskTable);
    
        // find a non-working(new) task
        while((taskTable.find(state == "new")) == NULL){
            conditional_variable_wait(cond, taskTable);
        }
        task = taskTable.findOne(state == "new");
        task.state = "working";
        unlock(taskTable);
    
        page = crawl(task.url);
    
        if(task.type == "list"){
            lock(taskTable);
            for(newTask:page){
                taskTable.append(newTask);
                // new task is available
                conditional_variable_signal(cond);
            }
            task.state = "done";
            unlock(taskTable);
        } else {
            lock(pageTable);
            pageTable.append(page);
            unlock(pageTable);
            lock(taskTable);
            task.state = "done";
            unlock(taskTable);
        }
    
    }
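Python's threading.Condition provides the same wait/signal primitives, so the loop above can be sketched as runnable code; the task contents and crawl() are invented for illustration:

```python
import threading

cond = threading.Condition()
task_table = [{'link': 'list:0', 'type': 'list', 'state': 'new'}]
page_table = []
done = threading.Event()

def crawl(link):
    # Stand-in: a list page yields two news links; a news page yields content.
    if link.startswith('list'):
        return ['news:1', 'news:2']
    return 'content of ' + link

def find_new():
    return next((t for t in task_table if t['state'] == 'new'), None)

def worker():
    while True:
        with cond:
            task = find_new()
            while task is None and not done.is_set():
                cond.wait()          # releases the lock while blocked
                task = find_new()
            if task is None:         # everything finished: exit
                return
            task['state'] = 'working'
        page = crawl(task['link'])   # crawl outside the lock
        with cond:
            if task['type'] == 'list':
                for link in page:
                    task_table.append({'link': link, 'type': 'news',
                                       'state': 'new'})
                    cond.notify()    # a new task is available
            else:
                page_table.append(page)
            task['state'] = 'done'
            if all(t['state'] == 'done' for t in task_table):
                done.set()
                cond.notify_all()    # let idle workers exit

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```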
    
  • A semaphore is a simplified, counted ("quantified") conditional variable: each signal increments a counter, so signals are not lost when no thread is waiting.
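A sketch of the semaphore idea applied to the task queue, using Python's threading.Semaphore (task values are placeholders):

```python
import threading

# The semaphore counts available tasks: release() whenever a task is
# appended, acquire() before taking one, so idle workers block without
# polling and no signal is lost.
tasks = []
task_lock = threading.Lock()
available = threading.Semaphore(0)   # starts at 0: nothing to do yet

def add_task(task):
    with task_lock:
        tasks.append(task)
    available.release()              # count += 1, may wake a worker

def take_task():
    available.acquire()              # blocks until count > 0, then count -= 1
    with task_lock:
        return tasks.pop(0)

add_task('news:1')
add_task('news:2')
a = take_task()
b = take_task()
```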


Distribute crawlers in multiple machines

  • One-to-one

    • Too many connections
  • One-to-many

    • sender & receiver
    • sender and receiver share taskTable and crawlerTable
    • still complicated - # of connections is not reduced!

Using Database

  • update/add/findOne..
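Keeping the task table in a database lets every crawler machine share it through add/findOne/update operations. A sketch with an in-memory SQLite table, using the fields listed earlier (the link value is a placeholder):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute("""CREATE TABLE task (
    taskID        INTEGER PRIMARY KEY,
    priority      INTEGER,
    type          TEXT,
    state         TEXT,
    link          TEXT,
    availableTime TEXT,
    endTime       TEXT)""")

# add: enqueue a new list-page task
db.execute("INSERT INTO task (priority, type, state, link) "
           "VALUES (1, 'list', 'new', 'http://example.com/list')")

# findOne: claim one new task
row = db.execute("SELECT taskID, link FROM task "
                 "WHERE state = 'new' LIMIT 1").fetchone()

# update: mark it as being worked on
db.execute("UPDATE task SET state = 'working' WHERE taskID = ?", (row[0],))
state = db.execute("SELECT state FROM task WHERE taskID = ?",
                   (row[0],)).fetchone()[0]
```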
