Introduction
I took Machine Learning and Data Mining in the first half of this semester and was responsible for the web-scraping part of our group project. This post records my process and walks through how to write a practical Python scraper.
Project overview
Our group project needed to scrape various features of GitHub repositories. This article is split into two parts: the first uses BeautifulSoup to scrape the raw HTML pages, and the second uses the GitHub API, including the monthly changes (2020.03 to 2021.03) of each repository's commits, pull requests and so on.
P.S. When I first started writing the scraper I felt the GitHub API returned rather few features, so I scraped the raw GitHub HTML pages directly and skipped the monthly-change data; the group leader then asked me to rewrite the scraper to collect the monthly changes as well.
Preparation
Installing the libraries
We will use libraries such as BeautifulSoup and requests, so some of them have to be installed first:
```
pip install requests beautifulsoup4
```

The imports used throughout this post:

```python
import requests
import time
import csv
import json
from bs4 import BeautifulSoup
```

GitHub personal token
Because we will be scraping GitHub pages and calling the GitHub API, we need to apply for a personal access token so that our requests are less likely to hit the rate limit; GitHub restricts how many requests can be made within a given time window.
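If you want to check how much of your quota is left, the rate_limit endpoint reports it and does not itself count against the limit. A minimal sketch (the token value is a placeholder; replace it with your own):

```python
import requests

# Minimal sketch: check the remaining quota for a personal access token.
token = "<your personal access token>"   # placeholder
headers = {
    'Accept': 'application/vnd.github.v3+json',
    'Authorization': 'token {token}'.format(token=token),
}

resp = requests.get("https://api.github.com/rate_limit", headers=headers, timeout=15)
limits = resp.json()["resources"]
print("core remaining:", limits["core"]["remaining"])      # ordinary REST calls
print("search remaining:", limits["search"]["remaining"])  # search API calls
```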

Request settings
When requesting pages we need to imitate the way a browser makes requests.
If you do not have a proxy of your own, just remove that entry.
```python
# request settings

# token
token = ''

# proxy
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}

# through Github api
headers = {
    'Accept': 'application/vnd.github.v3+json',
    'Authorization': 'token {token}'.format(token = token)
}

# mimic a browser when requesting raw HTML pages
headers_raw_page = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Authorization': 'token {token}'.format(token = token)
}

timeoutSec = 15  # setup your timeout spec(sec)
```

Part 1 - Raw HTML pages
Working out the scraping rules
First we decide what to scrape: we collect the URLs of the repositories under GitHub's deep-learning topic as the dataset for our analysis, and look at how the page requests and pagination work.

Pressing F12, we can find the page request in the Network tab, and in its query string there is a page parameter; the page and per_page parameters give us something to loop over when making requests.
The per_page parameter decides how many repositories are listed per page; I only learned about it after reading through the GitHub API docs.
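As a quick sanity check before writing the full loop, the sketch below (reusing the headers_raw_page, proxies and timeoutSec settings above, and the a.text-bold selector used in the next step) fetches one page and counts the repository links on it:

```python
# Sketch: confirm that the page parameter really paginates the topic listing.
# Reuses headers_raw_page, proxies and timeoutSec from the request settings above.
import requests
from bs4 import BeautifulSoup

test_url = "https://github.com/topics/deep-learning?page=2"
html = requests.get(test_url, headers=headers_raw_page, proxies=proxies, timeout=timeoutSec)

soup = BeautifulSoup(html.text, "html.parser")
links = soup.find_all("a", {"class": "text-bold"})
print("status:", html.status_code, "- repository links on this page:", len(links))
```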

Scraping all the repositories we want
The code below loops over the page parameter and then parses the page elements with BS4; the scraped URLs are appended to start_urls.
```python
start_urls = []  # save urls
base_url = "https://github.com/topics/deep-learning?page="
allowed_domains = "https://github.com"

print("\n------------start grabbing all urls--------------------------\n")

for i in range(1, 35):
    try:
        html = requests.get(base_url + str(i), headers=headers_raw_page,
                            proxies=proxies, timeout=timeoutSec)
    except Exception as e:
        print(e)
        print('fail to get request from ip via proxy')
        continue  # skip this page if the request failed

    soup = BeautifulSoup(html.text, "html.parser")
    # print(soup.prettify())
    # urls = soup.find_all("a", {'class': 'text-bold'})

    for j in soup.find_all("a", {'class': 'text-bold'}):
        start_urls.append(allowed_domains + j['href'])
        print(allowed_domains + j['href'])

    time.sleep(1.5)
```

Scraping detailed features of each repository
The features we scraped at first were not enough, so we need to dig deeper into each repository for more data; using the URLs stored in start_urls, we then scrape the details we need from every repository.

```python
print(start_urls)
print("The data length: ", len(start_urls), "\n")

result = []
labels = ['name', 'star', 'commits', 'fork', 'issues', 'pull_requests', 'branches', 'tags']

print("\n------------start grabbing data--------------------------\n")
time.sleep(1.5)

i = 1

for url in start_urls:
    try:
        html = requests.get(url, headers=headers, proxies=proxies, timeout=timeoutSec)
    except Exception as e:
        print(e)
        print('fail to get request from ip via proxy')
        continue  # skip this repository if the request failed
    # print(html)
    soup = BeautifulSoup(html.text, "html.parser")
    # star = soup.findall("a", text="starred")
    # print(soup.prettify())
    item = {}
    item['name'] = url
    print("all: ", len(start_urls), "index ", i, ", start: ", item['name'])
    i += 1

    num = soup.find_all("a", {'class': 'social-count'})
    # print(num)
    # print('\n')
    item['star'] = num[0]
    item['fork'] = num[1]

    num = soup.find_all("span", {'class': 'd-none d-sm-inline'})
    # print('\n')
    # print(num)
    if(len(num) == 2):
        item['commits'] = num[1]
    else:
        item['commits'] = num[0]

    num = soup.find_all("span", {'class': 'Counter'})
    # print('\n')
    # print(num)

    item['issues'] = num[1]
    item['pull_requests'] = num[2]
    # item['contributors'] = num[7]
    # item['projects'] = num[4]
    # item['security'] = num[5]

    num = soup.find_all("a", {'class': 'Link--primary no-underline'})
    # print('\n')
    # print(num)
    item['branches'] = num[0]
    # item['release'] = num[1]
    # item['used_by'] = num[3]

    # num = soup.find_all("span", {'class': 'Counter'})
    # item['contributors'] = num[4]

    num = soup.find_all("a", {'class': 'ml-3 Link--primary no-underline'})
    # print('\n')
    # print(num)
    item['tags'] = num[0]

    print("end", item['name'], "\n")
    # print("\n", item['commits'])
    result.append(item)
    time.sleep(1.5)
```

Saving to CSV
Next we save the results as a CSV file using the labels defined above.
print("\n------------start saving data as csv--------------------------\n")
try: with open('csv_dct.csv', 'w') as f: writer = csv.DictWriter(f, fieldnames=labels) writer.writeheader() for elem in result: writer.writerow(elem) print("save success")except IOError: print("I/O error")第二部份 - Github API
Working out the scraping requirements
What we need are the popular repositories of each language: for every repository its name, language, stargazers, forks, open_issues, watchers and readme, plus the monthly changes (2020.03 to 2021.03) of commits, pull requests, forks and issues events.
For convenience we use the API that GitHub provides; only the README part still has to be scraped from the raw HTML page.
The scraped data is split into three tables: the basic data in github_basic.csv, and the monthly-change data in github_commits.csv and github_pull_requests.csv. (We later dropped the forks and issues monthly data: there was far too much of it to fetch, which dragged out the processing of each repo to roughly 8 minutes, so those two monthly-change tables were abandoned.)
Checking the GitHub API first
We start by looking at the API documentation GitHub provides.

The first API we need is the search endpoint, https://api.github.com/search/repositories.

Let's test the API first:
You can also test it with Postman, because the response is sometimes so long that the terminal simply cuts off the earlier part.
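If you would rather stay in Python, the same quick test can be done with requests (a sketch that reuses the headers from the request settings). The search endpoint returns a JSON object with a total_count field and an items list:

```python
# Sketch: quick test of the search API, reusing `headers` from the request settings.
import requests

url = "https://api.github.com/search/repositories?q=language:python&sort=stars"
resp = requests.get(url, headers=headers, timeout=15)
data = resp.json()

print("total_count:", data["total_count"])
top = data["items"][0]
print(top["full_name"], top["stargazers_count"], top["forks_count"], top["language"])
```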

Deciding on the features
The data returned by the GitHub API already includes the following features:
- name
- language
- stargazers
- forks
- open_issues
- watchers
as well as the various request URLs (used for the later scraping). There are fewer features than the HTML scraping in Part 1 gives us, but they are enough for the data analysis.
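For reference, each element of items carries these features directly, so extracting them is just a matter of picking the right keys. A sketch of the mapping (the same keys are used later in deal_with_repositories):

```python
# Sketch: map one search-result item onto the basic features listed above.
def extract_basic_features(item):
    return {
        'name': item['full_name'],
        'language': item['language'],
        'stargazers': item['stargazers_count'],
        'forks': item['forks_count'],
        'open_issues': item['open_issues'],
        'watchers': item['watchers_count'],
    }
```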
Writing the scraper class
We define the scraper class's member variables and its constructor, and call the GitHub search-repositories API once per language; the languages are python, java, c, golang and js.
The members are:
- repositories: the repos that were scraped
- basic_table: the basic features of each repo
- commits_table: the monthly commits data for each repo
- pull_requests_table: the monthly pull requests data for each repo
- forks_table: the monthly forks data for each repo
- issues_table: the monthly issues data for each repo
- base_url: the search API URLs for each language's popular repos
- basic_labels: the column labels used when saving basic_table as a CSV file
- template_labels: the column labels used when saving the monthly-change tables as CSV files
- template_table: the template used to initialise each monthly-change record
```python
class github_grab(object):
    def __init__(self):
        self.repositories = []
        self.basic_table = []
        self.commits_table = []
        self.pull_requests_table = []
        self.forks_table = []
        self.issues_table = []
        self.base_url = [
            "https://api.github.com/search/repositories?q=language:python&sort=stars",
            "https://api.github.com/search/repositories?q=language:java&sort=stars",
            "https://api.github.com/search/repositories?q=language:c&sort=stars",
            "https://api.github.com/search/repositories?q=language:golang&sort=stars",
            "https://api.github.com/search/repositories?q=language:js&sort=stars"
        ]
        self.basic_labels = [
            'name', 'language', 'stargazers', 'forks',
            'open_issues', 'watchers', 'readme'
        ]
        # commits pull_requests forks issues_events
        self.template_labels = [
            'name',
            '2019_01', '2019_02', '2019_03', '2019_04', '2019_05', '2019_06',
            '2019_07', '2019_08', '2019_09', '2019_10', '2019_11', '2019_12',
            '2020_01', '2020_02', '2020_03', '2020_04', '2020_05', '2020_06',
            '2020_07', '2020_08', '2020_09', '2020_10', '2020_11', '2020_12',
            '2021_01', '2021_02', '2021_03'
        ]
        # note: the 2019_* / early-2020 columns in template_labels are not in this
        # template, so DictWriter writes them out as empty values
        self.template_table = {
            'name': '',
            '2020_03': 0, '2020_04': 0, '2020_05': 0, '2020_06': 0,
            '2020_07': 0, '2020_08': 0, '2020_09': 0, '2020_10': 0,
            '2020_11': 0, '2020_12': 0, '2021_01': 0, '2021_02': 0,
            '2021_03': 0
        }
```

Scraping all the popular repositories we need
We iterate over the scraper class's base_url list. To keep the data volume and scraping time down, we only take 40 popular repositories per language, 200 repositories in total. On every request, if the returned status code is 403 we wait one minute and retry; a 404 or 204 means the resource does not exist.
```python
class github_grab(object):
    # ...
    def get_all_repositories(self):
        print("\n------------start grabbing all repositories--------------------------\n")
        index = 1
        for url in self.base_url:
            print("\nbase url : " + url + "\n")
            for i in range(1, 5):
                try:
                    req = requests.get(url + "&page=" + str(i) + "&per_page=100",
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                    # 403 usually means the rate limit was hit: wait 60 sec and retry this page
                    while req.status_code == 403:
                        print("Rate limit, sleep 60 sec")
                        time.sleep(60)
                        req = requests.get(url + "&page=" + str(i) + "&per_page=100",
                                           headers=headers, proxies=proxies, timeout=timeoutSec)
                    if(req.status_code == 404 or req.status_code == 204):
                        print('The source is not found')
                        continue
                    items = req.json()['items']
                    # temp = json.loads(req)
                    # print(type(req))
                    print("req len " + str(len(items)))
                    self.repositories += items
                    print('grab page ' + str(index) + ', current repository quantity : '
                          + str(len(self.repositories)))
                    index += 1
                    # print(self.repositories)
                except Exception as e:
                    print(e)
                    print('fail to get request from ip via proxy')

                # sleep
                # time.sleep(2)
```

Writing the README scraping method
Every repository names its README slightly differently, so I only accept files named readme or README with the extension *.md or *.rst; any repository using some other naming convention is considered non-standard and is ignored. This part is scraped with BS4.
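As an aside, the API also has a dedicated readme endpoint that resolves the file name for you and returns the content base64-encoded. A sketch of that alternative, which is not what this project uses:

```python
import base64
import requests

# Sketch: fetch a repository README through the API instead of scraping HTML.
# `full_name` is "owner/repo"; reuses `headers` from the request settings.
def get_readme_via_api(full_name):
    resp = requests.get("https://api.github.com/repos/" + full_name + "/readme",
                        headers=headers, timeout=15)
    if resp.status_code != 200:
        return ""
    payload = resp.json()
    # the content field is base64-encoded text
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")
```

The project itself scrapes the rendered README out of the HTML page, which keeps the same BS4 workflow as Part 1: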
```python
class github_grab(object):
    # ...
    def get_repository_readme(self, url):
        print("start to get " + url + " readme")
        try:
            req = requests.get("https://github.com/" + url + "/blob/master/README.md",
                               headers=headers_raw_page, proxies=proxies, timeout=timeoutSec)
            if(req.status_code == 403):
                print("Rate limit, sleep 60 sec")
                time.sleep(60)
                req = requests.get("https://github.com/" + url + "/blob/master/README.md",
                                   headers=headers_raw_page, proxies=proxies, timeout=timeoutSec)
            elif(req.status_code == 404 or req.status_code == 204):
                print('The source is not found')
            # print(type(req))
            soup = BeautifulSoup(req.text.replace('\n', ''), "html.parser")
            num = soup.find_all("div", {'id': 'readme'})
            # print(req)
            # print("req len " + str(len(req)))
            # time.sleep(2)
            return num[0] if len(num) >= 1 else ""
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')
```

Writing the method that processes every repo
Since the get_all_repositories method has already pulled down most of the basic data, the basic table needs no further requests; the monthly-change data is handled by separate class methods.
```python
class github_grab(object):
    # ...
    def deal_with_repositories(self):
        print("-------------------start to deal with repo\'s data----------------------\n")
        i = 1
        for repo in self.repositories:
            print('deal with the ' + str(i) + ' / ' + str(len(self.repositories))
                  + " " + repo['full_name'])
            i += 1

            # basic_table
            basic_temp = {}
            basic_temp['name'] = repo['full_name']
            basic_temp['stargazers'] = repo['stargazers_count']
            basic_temp['watchers'] = repo['watchers_count']
            basic_temp['language'] = repo['language']
            basic_temp['forks'] = repo['forks_count']
            basic_temp['open_issues'] = repo['open_issues']
            basic_temp['readme'] = self.get_repository_readme(repo['full_name'])
            self.basic_table.append(basic_temp)

            # commits_table
            # init use copy
            temp = self.template_table.copy()
            # print('init temp')
            # print(temp)
            temp['name'] = repo['full_name']
            temp = self.get_repository_commits(temp)
            # print(temp)
            self.commits_table.append(temp)
            # print("-----")
            # print(self.commits_table)
            # print("----")

            # pull request
            temp = self.template_table.copy()
            temp['name'] = repo['full_name']
            temp = self.get_repository_pull_requests(temp)
            self.pull_requests_table.append(temp)

            # # forks
            # temp = self.template_table.copy()
            # temp['name'] = repo['full_name']
            # temp = self.get_repository_forks(temp)
            # self.forks_table.append(temp)
            # # issues_events
            # temp = self.template_table.copy()
            # temp['name'] = repo['full_name']
            # temp = self.get_repository_issues(temp)
            # self.issues_table.append(temp)
```

Writing the month-classification method
Because the months are hard-coded and I could not think of a clean way to encapsulate this at the time, the method turned out long and repetitive.
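One way it could be shortened (a sketch, not the code actually used in the project) is to notice that the month key is just the first seven characters of the ISO timestamp with the dash replaced by an underscore:

```python
# Sketch: a compact month classifier. ISO 8601 timestamps compare correctly as
# strings, and "2020-07-15T09:30:00Z"[:7] -> "2020-07" -> key "2020_07".
def date_classify_compact(temp, date_time):
    if "2020-03-01T00:00:00Z" <= date_time < "2021-04-01T00:00:00Z":
        key = date_time[:7].replace('-', '_')
        temp[key] = temp.get(key, 0) + 1
    return temp
```

The hand-written version used in the project is below.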
```python
class github_grab(object):
    # ...
    def date_classify(self, temp, date_time):
        # print(date_time)
        if("2021-03-01T00:00:00Z" <= date_time and date_time < "2021-04-01T00:00:00Z"):
            temp['2021_03'] += 1
        elif("2021-02-01T00:00:00Z" <= date_time and date_time < "2021-03-01T00:00:00Z"):
            temp['2021_02'] += 1
        elif("2021-01-01T00:00:00Z" <= date_time and date_time < "2021-02-01T00:00:00Z"):
            temp['2021_01'] += 1
        elif("2020-12-01T00:00:00Z" <= date_time and date_time < "2021-01-01T00:00:00Z"):
            temp['2020_12'] += 1
        elif("2020-11-01T00:00:00Z" <= date_time and date_time < "2020-12-01T00:00:00Z"):
            temp['2020_11'] += 1
        elif("2020-10-01T00:00:00Z" <= date_time and date_time < "2020-11-01T00:00:00Z"):
            temp['2020_10'] += 1
        elif("2020-09-01T00:00:00Z" <= date_time and date_time < "2020-10-01T00:00:00Z"):
            temp['2020_09'] += 1
        elif("2020-08-01T00:00:00Z" <= date_time and date_time < "2020-09-01T00:00:00Z"):
            temp['2020_08'] += 1
        elif("2020-07-01T00:00:00Z" <= date_time and date_time < "2020-08-01T00:00:00Z"):
            temp['2020_07'] += 1
        elif("2020-06-01T00:00:00Z" <= date_time and date_time < "2020-07-01T00:00:00Z"):
            temp['2020_06'] += 1
        elif("2020-05-01T00:00:00Z" <= date_time and date_time < "2020-06-01T00:00:00Z"):
            temp['2020_05'] += 1
        elif("2020-04-01T00:00:00Z" <= date_time and date_time < "2020-05-01T00:00:00Z"):
            temp['2020_04'] += 1
        elif("2020-03-01T00:00:00Z" <= date_time and date_time < "2020-04-01T00:00:00Z"):
            temp['2020_03'] += 1
        return temp
```

Writing the monthly commits method for a single repo
The main thing here is to first try the request yourself with Postman or on the command line and check the structure of the response, and to get the date comparisons right; the monthly-change scrapers that follow all work much the same way.
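For example, a quick probe of the commits endpoint (a sketch reusing headers; the repository name is just an illustrative example) shows that the timestamp sits under commit.author.date, which is exactly what the checks in the method below compare against:

```python
# Sketch: inspect a page of the commits endpoint to see where the date lives.
import requests

resp = requests.get("https://api.github.com/repos/tensorflow/tensorflow/commits?per_page=2",
                    headers=headers, timeout=15)
for item in resp.json():
    print(item["sha"][:7], item["commit"]["author"]["date"])
```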
```python
class github_grab(object):
    # ...
    def get_repository_commits(self, commits_temp):
        print("start to get " + commits_temp['name'] + " commits' data")
        try:
            for i in range(1, 100000):
                req = requests.get("https://api.github.com/repos/" + commits_temp['name']
                                   + "/commits?per_page=100&page=" + str(i),
                                   headers=headers, proxies=proxies, timeout=timeoutSec)
                # 403 usually means the rate limit was hit: wait 60 sec and retry this page
                while req.status_code == 403:
                    print("Rate limit, sleep 60 sec")
                    time.sleep(60)
                    req = requests.get("https://api.github.com/repos/" + commits_temp['name']
                                       + "/commits?per_page=100&page=" + str(i),
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 404 or req.status_code == 204):
                    print('The source is not found')
                    continue
                # print("commits times" + str(i))
                items = req.json()
                # if no data
                if len(items) == 0:
                    break
                # if the time is too old
                if(items[0]['commit']['author']['date'] < "2020-03-01T00:00:00Z"):
                    break
                if(items[-1]['commit']['author']['date'] > "2021-04-01T00:00:00Z"):
                    continue
                for date_time in items:
                    if(date_time['commit']['author']['date'] < "2020-03-01T00:00:00Z"):
                        break
                    commits_temp = self.date_classify(
                        commits_temp, date_time['commit']['author']['date'])
                # time.sleep(2)
            return commits_temp
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')
```

Writing the monthly pull requests method for a single repo
```python
class github_grab(object):
    # ...
    def get_repository_pull_requests(self, pr_temp):
        print("start to get " + pr_temp['name'] + " pull requests' data")
        try:
            for i in range(1, 100000):
                req = requests.get("https://api.github.com/repos/" + pr_temp['name']
                                   + "/pulls?per_page=100&page=" + str(i),
                                   headers=headers, proxies=proxies, timeout=timeoutSec)
                # 403 usually means the rate limit was hit: wait 60 sec and retry this page
                while req.status_code == 403:
                    print("Rate limit, sleep 60 sec")
                    time.sleep(60)
                    req = requests.get("https://api.github.com/repos/" + pr_temp['name']
                                       + "/pulls?per_page=100&page=" + str(i),
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 404 or req.status_code == 204):
                    print('The source is not found')
                    continue
                items = req.json()
                # if no data
                if len(items) == 0:
                    break
                # if the time is too old
                if(items[0]['created_at'] < "2020-03-01T00:00:00Z"):
                    break
                if(items[-1]['created_at'] > "2021-04-01T00:00:00Z"):
                    continue
                for date_time in items:
                    # print(date_time['created_at'])
                    if(date_time['created_at'] < "2020-03-01T00:00:00Z"):
                        break
                    pr_temp = self.date_classify(
                        pr_temp, date_time['created_at'])
                # time.sleep(2)
            return pr_temp
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')
```

Writing the monthly forks method for a single repo
We did not use this method in the end, but the code is included here anyway.
```python
class github_grab(object):
    # ...
    def get_repository_forks(self, forks_temp):
        print("start to get " + forks_temp['name'] + " forks' data")
        try:
            for i in range(1, 100000):
                req = requests.get("https://api.github.com/repos/" + forks_temp['name']
                                   + "/forks?per_page=100&page=" + str(i),
                                   headers=headers, proxies=proxies, timeout=timeoutSec)
                # 403 usually means the rate limit was hit: wait 60 sec and retry this page
                while req.status_code == 403:
                    print("Rate limit, sleep 60 sec")
                    time.sleep(60)
                    req = requests.get("https://api.github.com/repos/" + forks_temp['name']
                                       + "/forks?per_page=100&page=" + str(i),
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 404 or req.status_code == 204):
                    print('The source is not found')
                    continue
                items = req.json()
                # print("forks times" + str(i))
                # if no data
                if len(items) == 0:
                    break
                # if the time is too old
                if(items[0]['created_at'] < "2020-03-01T00:00:00Z"):
                    break
                if(items[-1]['created_at'] > "2021-04-01T00:00:00Z"):
                    continue
                for date_time in items:
                    # print(date_time['created_at'])
                    if(date_time['created_at'] < "2020-03-01T00:00:00Z"):
                        break
                    forks_temp = self.date_classify(
                        forks_temp, date_time['created_at'])
                # time.sleep(2)
            return forks_temp
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')
```

Writing the monthly issues method for a single repo
```python
class github_grab(object):
    # ...
    def get_repository_issues(self, issues_temp):
        print("start to get " + issues_temp['name'] + " issues' data")
        try:
            for i in range(1, 100000):
                req = requests.get("https://api.github.com/repos/" + issues_temp['name']
                                   + "/issues/events?per_page=100&page=" + str(i),
                                   headers=headers, proxies=proxies, timeout=timeoutSec)
                # 403 usually means the rate limit was hit: wait 60 sec and retry this page
                while req.status_code == 403:
                    print("Rate limit, sleep 60 sec")
                    time.sleep(60)
                    req = requests.get("https://api.github.com/repos/" + issues_temp['name']
                                       + "/issues/events?per_page=100&page=" + str(i),
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 404 or req.status_code == 204):
                    print('The source is not found')
                    continue
                items = req.json()
                # if no data
                if len(items) == 0:
                    break
                # if the time is too old
                if(items[0]['created_at'] < "2020-03-01T00:00:00Z"):
                    break
                if(items[-1]['created_at'] > "2021-04-01T00:00:00Z"):
                    continue
                for date_time in items:
                    # print(date_time['created_at'])
                    if(date_time['created_at'] < "2020-03-01T00:00:00Z"):
                        break
                    issues_temp = self.date_classify(
                        issues_temp, date_time['created_at'])
                # time.sleep(2)
            return issues_temp
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')
```

Saving all the data as CSV files
There are three *.csv files: github_basic.csv holds the basic repo data, and the other two hold the monthly-change data. Note that the files must be written with utf-8 encoding, because the READMEs contain a lot of Chinese characters; without utf-8 the save raises an encoding error.
```python
class github_grab(object):
    # ...
    def save_all_to_csv(self):
        self.save_as_csv("github_basic.csv", self.basic_labels, self.basic_table)
        self.save_as_csv("github_commits.csv", self.template_labels, self.commits_table)
        self.save_as_csv("github_pull_requests.csv", self.template_labels, self.pull_requests_table)
        # self.save_as_csv("github_forks.csv", self.template_labels, self.forks_table)
        # self.save_as_csv("github_issues.csv", self.template_labels, self.issues_table)

    def save_as_csv(self, fileName, labels, table):
        print("\n------------start saving data as csv--------------------------\n")

        # save csv (utf-8 so the README text does not break the default encoding)
        try:
            with open(fileName, 'w', encoding='utf-8') as f:
                writer = csv.DictWriter(f, fieldnames=labels)
                writer.writeheader()
                for elem in table:
                    writer.writerow(elem)
            print("save " + fileName + " success")
        except IOError:
            print("I/O error")
```

Writing the main function
if __name__ == "__main__": github = github_grab() github.get_all_repositories() print("repo len " + str(len(github.repositories))) github.deal_with_repositories() # print(github.basic_table) # print(github.commits_table) # print(github.pull_requests_table) # print(github.forks_table) # print(github.issues_table) github.save_all_to_csv()確認結果
Run the script and check the output.

Additional notes
The dict reference problem
When assigning a dict object I found that a plain = assignment only copies the reference, so both names end up pointing at the same dict; you have to use the dict's copy() method instead.
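A tiny demonstration of the problem:

```python
# Plain assignment copies the reference, not the dict.
template = {'2020_03': 0}

alias = template             # both names point at the same dict
alias['2020_03'] += 1
print(template['2020_03'])   # 1 -- the template was mutated too

fresh = template.copy()      # shallow copy: a brand-new dict object
fresh['2020_03'] += 1
print(template['2020_03'])   # still 1 -- the template is untouched
```

This is why each monthly template in the scraper is initialised with: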
```python
temp = self.template_table.copy()
```

Conclusion
In this Machine Learning and Data Mining project I was responsible for the scraping. Because I did not scrape the monthly-change features at first, the whole group ended up waiting for me to re-scrape the data, which showed me my own shortcomings in the collaboration. The work taught me a lot of data-scraping techniques; thanks to everyone for the teamwork that got the group project finished.