Hands-on Python Web Scraping Tutorial - Scraping GitHub Repository Features

Preface#

In the first half of this semester I took Machine Learning and Data Mining, and I was responsible for the web-scraping part of our group project. In this post I record my process and walk you through a hands-on Python scraping exercise.

Project Description#

Our group project requires scraping various features of GitHub repositories. This post is split into two parts: the first uses BeautifulSoup to scrape the raw HTML pages, and the second uses the GitHub API, including the monthly changes (2020.03 ~ 2021.03) in each repository's commits, pull requests, and so on.

P.S. When I first started writing the scraper, I felt the GitHub API returned rather few data features, so I scraped the raw GitHub HTML pages directly and skipped the monthly-change data. The group leader then asked me to rewrite the scraper to collect the monthly-change data as well.

Preparation#

Installing the Libraries#

The libraries we need include BeautifulSoup and requests, some of which have to be installed manually. The imports used in this post are:

import requests
import time
import csv
import json
from bs4 import BeautifulSoup
Install the missing packages from the terminal (the BeautifulSoup 4 package on PyPI is named beautifulsoup4):

pip install requests beautifulsoup4

GitHub Personal Token#

Because we will be scraping GitHub web pages and calling the GitHub API, we should apply for a personal access token to avoid hitting the rate limit; GitHub restricts the number of requests allowed within a given time window.

github_personal_token.png
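
If you want to confirm that the token works and see how much quota remains, a quick sanity check against the rate_limit endpoint looks roughly like the sketch below (the exact fields in the JSON response may vary slightly between API versions):

import requests

token = ''  # paste your personal access token here
resp = requests.get(
    "https://api.github.com/rate_limit",
    headers={
        'Accept': 'application/vnd.github.v3+json',
        'Authorization': 'token {token}'.format(token=token),
    },
    timeout=15,
)
limits = resp.json()['resources']
print(limits['core'])    # overall API quota, e.g. {'limit': 5000, 'remaining': ..., ...}
print(limits['search'])  # the search API has its own, smaller quota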

Request Settings#

When requesting the raw pages, we need to mimic the way a browser sends its requests.

If you do not have a proxy of your own, you can remove the proxies entry.

# request settings
# token
token = ''
# proxy
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}
# headers for requests that go through the GitHub API
headers = {
    'Accept': 'application/vnd.github.v3+json',
    'Authorization': 'token {token}'.format(token=token)
}
# headers that mimic a browser when requesting the raw HTML pages
headers_raw_page = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Authorization': 'token {token}'.format(token=token)
}
timeoutSec = 15  # request timeout in seconds

Part 1 - Raw HTML Pages#

Determining the Scraping Rules#

First we decide what data to scrape: we start by collecting the URLs of the repositories under the deep-learning topic on GitHub as the dataset for our analysis, and we observe how the page issues its requests and handles pagination.

deep_learning_topic.png

By pressing F12 we can find the page request in the network panel, and the page parameter in its query string. The page and per_page parameters can serve as the basis for our request loop.

The per_page parameter controls how many repositories are listed on each page; I only learned about it after reading the GitHub API documentation.

topic_request_page.png

Scraping All the Repositories We Need#

Below, a loop varies the page parameter and BS4 parses the page elements. The scraped URLs are stored in start_urls.

start_urls = []  # save urls
base_url = "https://github.com/topics/deep-learning?page="
allowed_domains = "https://github.com"
print("\n------------start grabbing all urls--------------------------\n")
for i in range(1, 35):
    try:
        html = requests.get(base_url + str(i), headers=headers_raw_page, proxies=proxies, timeout=timeoutSec)
    except Exception as e:
        print(e)
        print('fail to get request from ip via proxy')
        continue  # skip this page if the request failed
    soup = BeautifulSoup(html.text, "html.parser")
    # print(soup.prettify())
    # urls = soup.find_all("a", {'class': 'text-bold'})
    for j in soup.find_all("a", {'class': 'text-bold'}):
        start_urls.append(allowed_domains + j['href'])
        print(allowed_domains + j['href'])
    time.sleep(1.5)

Scraping Each Repository's Features in Detail#

Since the repository features we scraped at first were not enough, we need more features for each repository. We now follow the URLs stored in start_urls and scrape the repository details we need.

repository_html.png

print(start_urls)
print("The data length: ", len(start_urls), "\n")
result = []
labels = ['name', 'star', 'commits', 'fork', 'issues', 'pull_requests',
          'branches', 'tags']
print("\n------------start grabbing data--------------------------\n")
time.sleep(1.5)
i = 1
for url in start_urls:
    try:
        html = requests.get(url, headers=headers, proxies=proxies, timeout=timeoutSec)
    except Exception as e:
        print(e)
        print('fail to get request from ip via proxy')
        continue  # skip this repository if the request failed
    # print(html)
    soup = BeautifulSoup(html.text, "html.parser")
    # star = soup.findall("a", text="starred")
    # print(soup.prettify())
    item = {}
    item['name'] = url
    print("all: ", len(start_urls), "index ", i, ", start: ", item['name'])
    i += 1
    num = soup.find_all("a", {'class': 'social-count'})
    # print(num)
    # print('\n')
    item['star'] = num[0]
    item['fork'] = num[1]
    num = soup.find_all("span", {'class': 'd-none d-sm-inline'})
    # print('\n')
    # print(num)
    if(len(num) == 2):
        item['commits'] = num[1]
    else:
        item['commits'] = num[0]
    num = soup.find_all("span", {'class': 'Counter'})
    # print('\n')
    # print(num)
    item['issues'] = num[1]
    item['pull_requests'] = num[2]
    # item['contributors'] = num[7]
    # item['projects'] = num[4]
    # item['security'] = num[5]
    num = soup.find_all("a", {'class': 'Link--primary no-underline'})
    # print('\n')
    # print(num)
    item['branches'] = num[0]
    # item['release'] = num[1]
    # item['used_by'] = num[3]
    # num = soup.find_all("span", {'class': 'Counter'})
    # item['contributors'] = num[4]
    num = soup.find_all("a", {'class': 'ml-3 Link--primary no-underline'})
    # print('\n')
    # print(num)
    item['tags'] = num[0]
    print("end", item['name'], "\n")
    # print("\n", item['commits'])
    result.append(item)
    time.sleep(1.5)

Saving as CSV#

Next, we save the results as a CSV file, using labels as the field names.

print("\n------------start saving data as csv--------------------------\n")
try:
with open('csv_dct.csv', 'w') as f:
writer = csv.DictWriter(f, fieldnames=labels)
writer.writeheader()
for elem in result:
writer.writerow(elem)
print("save success")
except IOError:
print("I/O error")

Part 2 - GitHub API#

Determining the Scraping Requirements#

What we need are the popular repositories of each language, and for every repository its name, language, stargazers, forks, open_issues, watchers and readme, along with the monthly changes (2020.03 ~ 2021.03) in commits, pull requests, forks and issues events.

For convenience we use the API that GitHub provides, but the readme part still has to be scraped from the raw HTML page.

After scraping, the data is split into three tables: the basic data github_basic.csv, and the monthly-change data github_commits.csv and github_pull_requests.csv. (We later dropped the monthly-change data for forks and issues: scraping them dragged out the processing so much that each repository took roughly 8 minutes, so those two tables were abandoned.)

Checking the GitHub API First#

We first look at the API documentation GitHub provides.

github_rest_api_docs.png

The first API we need is the Search endpoint https://api.github.com/search/repositories

github_api_list.png

Let's test the API first:

You can also test it with Postman, because the response is sometimes so long that the terminal cuts off the earlier part of the output.

github_api_test.png
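
If you prefer to stay in Python, a minimal sketch of the same test using requests (reusing the headers, proxies and timeoutSec defined earlier) might look like this:

resp = requests.get(
    "https://api.github.com/search/repositories",
    params={'q': 'language:python', 'sort': 'stars', 'per_page': 2},
    headers=headers, proxies=proxies, timeout=timeoutSec,
)
data = resp.json()
print(data['total_count'])  # total number of matching repositories
for repo in data['items']:
    print(repo['full_name'], repo['stargazers_count'], repo['language'])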

Determining the Features#

The data returned by the GitHub API already contains the following features:

  • name
  • language
  • stargazers
  • forks
  • open_issues
  • watchers

as well as the various request URLs (which serve as the basis for the later scraping steps). Although there are fewer features than when scraping the HTML pages in part 1, they are enough for the data analysis.
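
As a rough sketch, reusing the data from the test request above, the features listed here map onto keys of each item in the search response (the same keys the class below reads):

item = data['items'][0]  # one repository from the search response
features = {
    'name': item['full_name'],
    'language': item['language'],
    'stargazers': item['stargazers_count'],
    'forks': item['forks_count'],
    'open_issues': item['open_issues'],
    'watchers': item['watchers_count'],
}
print(features)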

Writing the Scraper Class#

We define the scraper class's member variables and its constructor, and build a search repositories API URL for each language; the languages are python, java, c, golang and js.

The members are:

  • repositories: the scraped repos
  • basic_table: the basic features of each repo
  • commits_table: each repo's monthly commit counts
  • pull_requests_table: each repo's monthly pull request counts
  • forks_table: each repo's monthly fork counts
  • issues_table: each repo's monthly issue counts
  • base_url: the search API URLs for the popular repos of each language
  • basic_labels: the field names used when saving basic_table as a CSV file
  • template_labels: the field names used when saving the monthly-change data as CSV files
  • template_table: the initialization template for the monthly-change data
class github_grab(object):
    def __init__(self):
        self.repositories = []
        self.basic_table = []
        self.commits_table = []
        self.pull_requests_table = []
        self.forks_table = []
        self.issues_table = []
        self.base_url = [
            "https://api.github.com/search/repositories?q=language:python&sort=stars",
            "https://api.github.com/search/repositories?q=language:java&sort=stars",
            "https://api.github.com/search/repositories?q=language:c&sort=stars",
            "https://api.github.com/search/repositories?q=language:golang&sort=stars",
            "https://api.github.com/search/repositories?q=language:js&sort=stars"
        ]
        self.basic_labels = [
            'name', 'language', 'stargazers', 'forks', 'open_issues', 'watchers', 'readme'
        ]
        # CSV field names for the monthly-change tables
        # (commits, pull_requests, forks, issues_events), window 2020.03 ~ 2021.03
        self.template_labels = [
            'name',
            '2020_03', '2020_04', '2020_05', '2020_06',
            '2020_07', '2020_08', '2020_09', '2020_10',
            '2020_11', '2020_12', '2021_01', '2021_02',
            '2021_03'
        ]
        # initialization template for one row of monthly-change data
        self.template_table = {
            'name': '',
            '2020_03': 0, '2020_04': 0, '2020_05': 0, '2020_06': 0,
            '2020_07': 0, '2020_08': 0, '2020_09': 0, '2020_10': 0,
            '2020_11': 0, '2020_12': 0, '2021_01': 0, '2021_02': 0,
            '2021_03': 0
        }

Scraping All the Popular Repositories We Need#

We iterate over the class's base_url list and request each one. To keep the amount of data and the scraping time manageable, we only take 40 popular repositories per language, 200 repositories in total. Every request has to handle the status code: if it is 403, wait one minute and then keep scraping; if it is 404 or 204, the resource does not exist.

class github_grab(object):
    # ...
    def get_all_repositories(self):
        print("\n------------start grabbing all repositories--------------------------\n")
        index = 1
        for url in self.base_url:
            print("\nbase url : " + url + "\n")
            for i in range(1, 5):
                try:
                    req = requests.get(url + "&page=" + str(i) + "&per_page=100",
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                    if(req.status_code == 403):
                        print("Rate limit, sleep 60 sec")
                        time.sleep(60)
                        # retry the same page after waiting
                        req = requests.get(url + "&page=" + str(i) + "&per_page=100",
                                           headers=headers, proxies=proxies, timeout=timeoutSec)
                    if(req.status_code == 404 or req.status_code == 204):
                        print('The source is not found')
                        continue
                    items = req.json()['items']
                    # temp = json.loads(req)
                    # print(type(req))
                    print("req len " + str(len(items)))
                    self.repositories += items
                    print('grab page ' + str(index) +
                          ', current repository quantity : ' + str(len(self.repositories)))
                    index += 1
                    # print(self.repositories)
                except Exception as e:
                    print(e)
                    print('fail to get request from ip via proxy')
                # sleep
                # time.sleep(2)

Writing the Readme Scraping Method#

Since the readme file name varies between repositories, I only accept files named readme or README with the extension .md or .rst; any other naming is treated as non-standard and not taken into account. This part is scraped with BS4.

class github_grab(object):
    # ...
    def get_repository_readme(self, url):
        print("start to get " + url + " readme")
        try:
            req = requests.get("https://github.com/" + url + "/blob/master/README.md",
                               headers=headers_raw_page, proxies=proxies, timeout=timeoutSec)
            if(req.status_code == 403):
                print("Rate limit, sleep 60 sec")
                time.sleep(60)
                req = requests.get("https://github.com/" + url + "/blob/master/README.md",
                                   headers=headers_raw_page, proxies=proxies, timeout=timeoutSec)
            elif(req.status_code == 404 or req.status_code == 204):
                print('The source is not found')
            # print(type(req))
            soup = BeautifulSoup(req.text.replace('\n', ''), "html.parser")
            num = soup.find_all("div", {'id': 'readme'})
            # print(req)
            # print("req len " + str(len(req)))
            # time.sleep(2)
            return num[0] if len(num) >= 1 else ""
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')

Writing the Method that Processes All Repos#

Since get_all_repositories has already pulled down most of the basic data, the basic part needs no further scraping; the monthly-change data, however, is handled by separate class methods.

class github_grab(object):
    # ...
    def deal_with_repositories(self):
        print("-------------------start to deal with repo\'s data----------------------\n")
        i = 1
        for repo in self.repositories:
            print('deal with the ' + str(i) + ' / ' +
                  str(len(self.repositories)) + " " + repo['full_name'])
            i += 1
            # basic_table
            basic_temp = {}
            basic_temp['name'] = repo['full_name']
            basic_temp['stargazers'] = repo['stargazers_count']
            basic_temp['watchers'] = repo['watchers_count']
            basic_temp['language'] = repo['language']
            basic_temp['forks'] = repo['forks_count']
            basic_temp['open_issues'] = repo['open_issues']
            basic_temp['readme'] = self.get_repository_readme(
                repo['full_name'])
            self.basic_table.append(basic_temp)
            # commits_table
            # init use copy
            temp = self.template_table.copy()
            # print('init temp')
            # print(temp)
            temp['name'] = repo['full_name']
            temp = self.get_repository_commits(temp)
            # print(temp)
            self.commits_table.append(temp)
            # print("-----")
            # print(self.commits_table)
            # print("----")
            # pull request
            temp = self.template_table.copy()
            temp['name'] = repo['full_name']
            temp = self.get_repository_pull_requests(temp)
            self.pull_requests_table.append(temp)
            # # forks
            # temp = self.template_table.copy()
            # temp['name'] = repo['full_name']
            # temp = self.get_repository_forks(temp)
            # self.forks_table.append(temp)
            # # issues_events
            # temp = self.template_table.copy()
            # temp['name'] = repo['full_name']
            # temp = self.get_repository_issues(temp)
            # self.issues_table.append(temp)

Writing the Month Classification Method#

Because the months are hard-coded and I could not think of a good way to encapsulate this at the time, the method turned out long and repetitive. A more compact alternative is sketched after the code.

class github_grab(object):
    # ...
    def date_classify(self, temp, date_time):
        # print(date_time)
        if("2021-03-01T00:00:00Z" <= date_time and date_time < "2021-04-01T00:00:00Z"):
            temp['2021_03'] += 1
        elif("2021-02-01T00:00:00Z" <= date_time and date_time < "2021-03-01T00:00:00Z"):
            temp['2021_02'] += 1
        elif("2021-01-01T00:00:00Z" <= date_time and date_time < "2021-02-01T00:00:00Z"):
            temp['2021_01'] += 1
        elif("2020-12-01T00:00:00Z" <= date_time and date_time < "2021-01-01T00:00:00Z"):
            temp['2020_12'] += 1
        elif("2020-11-01T00:00:00Z" <= date_time and date_time < "2020-12-01T00:00:00Z"):
            temp['2020_11'] += 1
        elif("2020-10-01T00:00:00Z" <= date_time and date_time < "2020-11-01T00:00:00Z"):
            temp['2020_10'] += 1
        elif("2020-09-01T00:00:00Z" <= date_time and date_time < "2020-10-01T00:00:00Z"):
            temp['2020_09'] += 1
        elif("2020-08-01T00:00:00Z" <= date_time and date_time < "2020-09-01T00:00:00Z"):
            temp['2020_08'] += 1
        elif("2020-07-01T00:00:00Z" <= date_time and date_time < "2020-08-01T00:00:00Z"):
            temp['2020_07'] += 1
        elif("2020-06-01T00:00:00Z" <= date_time and date_time < "2020-07-01T00:00:00Z"):
            temp['2020_06'] += 1
        elif("2020-05-01T00:00:00Z" <= date_time and date_time < "2020-06-01T00:00:00Z"):
            temp['2020_05'] += 1
        elif("2020-04-01T00:00:00Z" <= date_time and date_time < "2020-05-01T00:00:00Z"):
            temp['2020_04'] += 1
        elif("2020-03-01T00:00:00Z" <= date_time and date_time < "2020-04-01T00:00:00Z"):
            temp['2020_03'] += 1
        return temp
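
For reference, here is a sketch of a more compact version that behaves the same way for timestamps inside the window, deriving the "YYYY_MM" key directly from the ISO-8601 timestamp. It is not the method actually used in the project, just an alternative:

class github_grab(object):
    # ...
    def date_classify(self, temp, date_time):
        # e.g. "2020-07-15T08:30:00Z" -> "2020_07"
        key = date_time[:7].replace('-', '_')
        if key in temp:  # only count months that exist in the template window
            temp[key] += 1
        return temp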

Writing the Monthly Commit Changes for a Single Repo#

The main point here is to first try the request yourself with Postman or on the command line and inspect the structure of the returned results, and to be careful with the date comparisons. The monthly-change scrapers that follow work in much the same way.

class github_grab(object):
    # ...
    def get_repository_commits(self, commits_temp):
        print("start to get " + commits_temp['name'] + " commits\' data")
        try:
            for i in range(1, 100000):
                req = requests.get("https://api.github.com/repos/" + commits_temp['name'] + "/commits?per_page=100&page=" + str(i),
                                   headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 403):
                    print("Rate limit, sleep 60 sec")
                    time.sleep(60)
                    # retry the same page after waiting
                    req = requests.get("https://api.github.com/repos/" + commits_temp['name'] + "/commits?per_page=100&page=" + str(i),
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 404 or req.status_code == 204):
                    print('The source is not found')
                    continue
                # print("commits times" + str(i))
                items = req.json()
                # if no data
                if len(items) == 0:
                    break
                # if the newest entry on this page is already older than the window
                if(items[0]['commit']['author']['date'] < "2020-03-01T00:00:00Z"):
                    break
                # if the whole page is newer than the window, move on to the next page
                if(items[-1]['commit']['author']['date'] > "2021-04-01T00:00:00Z"):
                    continue
                for date_time in items:
                    if(date_time['commit']['author']['date'] < "2020-03-01T00:00:00Z"):
                        break
                    commits_temp = self.date_classify(
                        commits_temp, date_time['commit']['author']['date'])
                # time.sleep(2)
            return commits_temp
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')

Writing the Monthly Pull Request Changes for a Single Repo#

class github_grab(object):
    # ...
    def get_repository_pull_requests(self, pr_temp):
        print("start to get " + pr_temp['name'] + " pull requests\' data")
        try:
            for i in range(1, 100000):
                req = requests.get("https://api.github.com/repos/" + pr_temp['name'] + "/pulls?per_page=100&page=" + str(i),
                                   headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 403):
                    print("Rate limit, sleep 60 sec")
                    time.sleep(60)
                    # retry the same page after waiting
                    req = requests.get("https://api.github.com/repos/" + pr_temp['name'] + "/pulls?per_page=100&page=" + str(i),
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 404 or req.status_code == 204):
                    print('The source is not found')
                    continue
                items = req.json()
                # if no data
                if len(items) == 0:
                    break
                # if the newest entry on this page is already older than the window
                if(items[0]['created_at'] < "2020-03-01T00:00:00Z"):
                    break
                # if the whole page is newer than the window, move on to the next page
                if(items[-1]['created_at'] > "2021-04-01T00:00:00Z"):
                    continue
                for date_time in items:
                    # print(date_time['created_at'])
                    if(date_time['created_at'] < "2020-03-01T00:00:00Z"):
                        break
                    pr_temp = self.date_classify(
                        pr_temp, date_time['created_at'])
                # time.sleep(2)
            return pr_temp
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')

Writing the Monthly Fork Changes for a Single Repo#

Although this class method was not used in the end, the code is still included here.

class github_grab(object):
    # ...
    def get_repository_forks(self, forks_temp):
        print("start to get " + forks_temp['name'] + " forks\' data")
        try:
            for i in range(1, 100000):
                req = requests.get("https://api.github.com/repos/" + forks_temp['name'] + "/forks?per_page=100&page=" + str(i),
                                   headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 403):
                    print("Rate limit, sleep 60 sec")
                    time.sleep(60)
                    # retry the same page after waiting
                    req = requests.get("https://api.github.com/repos/" + forks_temp['name'] + "/forks?per_page=100&page=" + str(i),
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 404 or req.status_code == 204):
                    print('The source is not found')
                    continue
                items = req.json()
                # print("forks times" + str(i))
                # if no data
                if len(items) == 0:
                    break
                # if the newest entry on this page is already older than the window
                if(items[0]['created_at'] < "2020-03-01T00:00:00Z"):
                    break
                # if the whole page is newer than the window, move on to the next page
                if(items[-1]['created_at'] > "2021-04-01T00:00:00Z"):
                    continue
                for date_time in items:
                    # print(date_time['created_at'])
                    if(date_time['created_at'] < "2020-03-01T00:00:00Z"):
                        break
                    forks_temp = self.date_classify(
                        forks_temp, date_time['created_at'])
                # time.sleep(2)
            return forks_temp
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')

Writing the Monthly Issue Changes for a Single Repo#

class github_grab(object):
    # ...
    def get_repository_issues(self, issues_temp):
        print("start to get " + issues_temp['name'] + " issues\' data")
        try:
            for i in range(1, 100000):
                req = requests.get("https://api.github.com/repos/" + issues_temp['name'] + "/issues/events?per_page=100&page=" + str(i),
                                   headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 403):
                    print("Rate limit, sleep 60 sec")
                    time.sleep(60)
                    # retry the same page after waiting
                    req = requests.get("https://api.github.com/repos/" + issues_temp['name'] + "/issues/events?per_page=100&page=" + str(i),
                                       headers=headers, proxies=proxies, timeout=timeoutSec)
                if(req.status_code == 404 or req.status_code == 204):
                    print('The source is not found')
                    continue
                items = req.json()
                # if no data
                if len(items) == 0:
                    break
                # if the newest entry on this page is already older than the window
                if(items[0]['created_at'] < "2020-03-01T00:00:00Z"):
                    break
                # if the whole page is newer than the window, move on to the next page
                if(items[-1]['created_at'] > "2021-04-01T00:00:00Z"):
                    continue
                for date_time in items:
                    # print(date_time['created_at'])
                    if(date_time['created_at'] < "2020-03-01T00:00:00Z"):
                        break
                    issues_temp = self.date_classify(
                        issues_temp, date_time['created_at'])
                # time.sleep(2)
            return issues_temp
        except Exception as e:
            print(e)
            print('fail to get request from ip via proxy')

Saving All the Data as CSV Files#

There are three *.csv files: github_basic.csv contains the basic repo data, and the other two contain the monthly-change data. Note that the files must be written as utf-8, because the readme fields contain a lot of Chinese characters; without utf-8 the save fails with an encoding error.

class github_grab(object):
    # ...
    def save_all_to_csv(self):
        self.save_as_csv("github_basic.csv", self.basic_labels, self.basic_table)
        self.save_as_csv("github_commits.csv", self.template_labels, self.commits_table)
        self.save_as_csv("github_pull_requests.csv", self.template_labels, self.pull_requests_table)
        # self.save_as_csv("github_forks.csv", self.template_labels, self.forks_table)
        # self.save_as_csv("github_issues.csv", self.template_labels, self.issues_table)

    def save_as_csv(self, fileName, labels, table):
        print("\n------------start saving data as csv--------------------------\n")
        # save csv as utf-8, since the readme fields contain non-ASCII characters;
        # newline='' is the csv-module idiom that avoids blank lines on Windows
        try:
            with open(fileName, 'w', encoding='utf-8', newline='') as f:
                writer = csv.DictWriter(f, fieldnames=labels)
                writer.writeheader()
                for elem in table:
                    writer.writerow(elem)
            print("save " + fileName + " success")
        except IOError:
            print("I/O error")

Writing the main Function#

if __name__ == "__main__":
github = github_grab()
github.get_all_repositories()
print("repo len " + str(len(github.repositories)))
github.deal_with_repositories()
# print(github.basic_table)
# print(github.commits_table)
# print(github.pull_requests_table)
# print(github.forks_table)
# print(github.issues_table)
github.save_all_to_csv()

Checking the Results#

Run the script and check the results.

github_part2_result.png

Addendum#

The dict Reference Problem in Python#

When assigning a dict object, I found that a plain = assignment only copies the reference, so both names end up pointing to the same object; the dict's copy() method has to be used instead.

temp = self.template_table.copy()
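
A minimal demonstration of the difference, using a made-up template dict:

template = {'2020_03': 0, '2020_04': 0}

alias = template             # "=" only copies the reference: both names point to the same dict
alias['2020_03'] += 1
print(template['2020_03'])   # 1 -- the template was modified through the alias

template = {'2020_03': 0, '2020_04': 0}
independent = template.copy()  # shallow copy: a new dict object
independent['2020_03'] += 1
print(template['2020_03'])     # 0 -- the original template is untouched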

Conclusion#

Source code

In this Machine Learning and Data Mining assignment I was responsible for the scraping part. Because I did not scrape the monthly-change features at first, the whole group ended up waiting for me to re-scrape the data, and the collaboration also showed me my own shortcomings. This division of work taught me a lot of data-scraping techniques; thanks to everyone's teamwork, we finished the group project in the end.

Author: HuangNO1
Published: 2021-05-04
License: CC BY-NC-SA 4.0
https://huangno1.github.io/posts/python_spider_experience_instruct_github_repos/