update analytics 2

2026-05-23 17:18:23 +00:00 · 2015-10-23 22:27:02 +08:00 · 2015-10-23 22:27:02 +08:00 · 779e9652b6
commit 779e9652b6
parent e203c23929
3 changed files with 239 additions and 179 deletions
--- a/chapters/04-analytics-02.md
+++ b/chapters/04-analytics-02.md
@ -3,25 +3,34 @@
 Let's profile the earlier program and then look for ways to optimize it. An article I found online, [http://www.huyng.com/posts/python-performance-analysis/](http://www.huyng.com/posts/python-performance-analysis/), covers exactly this kind of analysis.
-#Profiling python with time#
+##Profiling python with time
 Measure the program's running time
-    $time python handle.py
+```bash     
 $time python handle.py
 ```
 The result is below, but it tells us nothing useful for our analysis
-     real	0m43.411s
+```
-     user	0m39.226s
+    real	0m43.411s
-     sys	0m0.618s
+    user	0m39.226s
    sys	0m0.618s
 ```
 ##line_profiler python
 #line_profiler python#
 This is
 ##Mac OS X 10.9 line_profiler Install##
-     sudo ARCHFLAGS="-Wno-error=unused-command-line-argument-hard-error-in-future" easy_install line_profiler
+```bash
 sudo ARCHFLAGS="-Wno-error=unused-command-line-argument-hard-error-in-future" easy_install line_profiler
 ```
 Then add ``@profile`` in front of ``handle_json`` in our ``parse_data.py``
-<pre><code class="python">
+
 ```python
@profile
 def handle_json(jsonfile):
    f = open(jsonfile, "r")
@ -37,107 +46,127 @@ def handle_json(jsonfile):
    f.close()
    return datacount, dataarray
-</pre></code>
+```
 line_profiler ships with a profiling script, ``kernprof.py``, so
-      kernprof.py -l -v handle.py
+```bash
 kernprof.py -l -v handle.py
 ```
 and we get the following result
 ```
 Wrote profile results to handle.py.lprof
 Timer unit: 1e-06 s
-	Wrote profile results to handle.py.lprof
+File: parse_data.py
-	Timer unit: 1e-06 s
+Function: handle_json at line 15
 Total time: 127.332 s
-	File: parse_data.py
+Line #      Hits         Time  Per Hit   % Time  Line Contents
-	Function: handle_json at line 15
+==============================================================
-	Total time: 127.332 s
+    15                                           @profile
-
+    16                                           def handle_json(jsonfile):
-	Line #      Hits         Time  Per Hit   % Time  Line Contents
+    17        19          636     33.5      0.0      f = open(jsonfile, "r")
-	==============================================================
+    18        19           21      1.1      0.0      dataarray = []
-	    15                                           @profile
+    19        19           16      0.8      0.0      datacount = 0
-	    16                                           def handle_json(jsonfile):
+    20
-	    17        19          636     33.5      0.0      f = open(jsonfile, "r")
+    21    212373       730344      3.4      0.6      for line in open(jsonfile):
-	    18        19           21      1.1      0.0      dataarray = []
+    22    212354      2826826     13.3      2.2          line = f.readline()
-	    19        19           16      0.8      0.0      datacount = 0
+    23    212354     13848171     65.2     10.9          lin = json.loads(line)
-	    20
+    24    212354    109427317    515.3     85.9          date = dateutil.parser.parse(lin["created_at"])
-	    21    212373       730344      3.4      0.6      for line in open(jsonfile):
+    25    212354       238112      1.1      0.2          datacount += 1
-	    22    212354      2826826     13.3      2.2          line = f.readline()
+    26    212354       260227      1.2      0.2          dataarray.append(date.minute)
-	    23    212354     13848171     65.2     10.9          lin = json.loads(line)
+    27
-	    24    212354    109427317    515.3     85.9          date = dateutil.parser.parse(lin["created_at"])
+    28        19          349     18.4      0.0      f.close()
-	    25    212354       238112      1.1      0.2          datacount += 1
+    29        19           20      1.1      0.0      return datacount, dataarray
-	    26    212354       260227      1.2      0.2          dataarray.append(date.minute)
+```
 	    27
 	    28        19          349     18.4      0.0      f.close()
 	    29        19           20      1.1      0.0      return datacount, dataarray
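 So the bottleneck turns out to be parsing ``created_at`` (the creation time), which takes 85.9% of the run time, followed by JSON parsing at 10.9%, rather than the I/O we were worried about; ``readline`` really is fast.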
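
The profile above also hints at two cheap fixes. The loop iterates over a second `open(jsonfile)` handle while also calling `f.readline()` on the first one, so the file is read twice, and `dateutil.parser.parse` is a generic parser that has to work out the timestamp layout on every call. Below is a minimal sketch of both fixes, assuming Python 3 and assuming `created_at` is always an ISO 8601 string such as `2014-04-12T14:59:00Z`. The exact timestamp format is not shown in this chapter, so `CREATED_AT_FORMAT` is an assumption, and `handle_json_lines` is a hypothetical helper that takes any iterable of lines rather than a file name.

```python
import json
from datetime import datetime

# Assumed timestamp layout; change this if the archive uses another format.
CREATED_AT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def handle_json_lines(lines):
    """Count events and collect the minute of each `created_at` value."""
    dataarray = []
    datacount = 0
    for line in lines:  # iterate once, with no extra readline() call
        record = json.loads(line)
        # A fixed format avoids the format detection done by dateutil.
        date = datetime.strptime(record["created_at"], CREATED_AT_FORMAT)
        datacount += 1
        dataarray.append(date.minute)
    return datacount, dataarray


# Two made-up records standing in for lines of the real file.
sample = [
    '{"created_at": "2014-04-12T14:59:00Z"}',
    '{"created_at": "2014-04-12T15:03:27Z"}',
]
count, minutes = handle_json_lines(sample)
print(count, minutes)
```

An open file object can be passed in the same way, since iterating over a file yields its lines. Whether this actually removes the bottleneck should be checked by running the profiler again.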
-#memory_profiler python#
+##memory_profiler python
 ##memory_profiler install##
-    $ pip install -U memory_profiler
+###memory_profiler install
-    $ pip install psutil
+
 ```bash
 $ pip install -U memory_profiler
 $ pip install psutil
 ```
 ###memory_profiler python
 ##memory_profiler python##
 As above, we only need to add ``@profile`` in front of ``handle_json``
-     python -m memory_profiler handle.py
+```bash
 python -m memory_profiler handle.py
 ```
 which gives
 ```
 Filename: parse_data.py
-	Filename: parse_data.py
+Line #    Mem usage    Increment   Line Contents
 ================================================
    13   39.930 MiB    0.000 MiB   @profile
    14                             def handle_json(jsonfile):
    15   39.930 MiB    0.000 MiB       f = open(jsonfile, "r")
    16   39.930 MiB    0.000 MiB       dataarray = []
    17   39.930 MiB    0.000 MiB       datacount = 0
    18
    19   40.055 MiB    0.125 MiB       for line in open(jsonfile):
    20   40.055 MiB    0.000 MiB           line = f.readline()
    21   40.066 MiB    0.012 MiB           lin = json.loads(line)
    22   40.055 MiB   -0.012 MiB           date = dateutil.parser.parse(lin["created_at"])
    23   40.055 MiB    0.000 MiB           datacount += 1
    24   40.055 MiB    0.000 MiB           dataarray.append(date.minute)
    25
    26                                 f.close()
    27                                 return datacount, dataarray
 ```
-	Line #    Mem usage    Increment   Line Contents
+##objgraph python
 	================================================
 	    13   39.930 MiB    0.000 MiB   @profile
 	    14                             def handle_json(jsonfile):
 	    15   39.930 MiB    0.000 MiB       f = open(jsonfile, "r")
 	    16   39.930 MiB    0.000 MiB       dataarray = []
 	    17   39.930 MiB    0.000 MiB       datacount = 0
 	    18
 	    19   40.055 MiB    0.125 MiB       for line in open(jsonfile):
 	    20   40.055 MiB    0.000 MiB           line = f.readline()
 	    21   40.066 MiB    0.012 MiB           lin = json.loads(line)
 	    22   40.055 MiB   -0.012 MiB           date = dateutil.parser.parse(lin["created_at"])
 	    23   40.055 MiB    0.000 MiB           datacount += 1
 	    24   40.055 MiB    0.000 MiB           dataarray.append(date.minute)
 	    25
 	    26                                 f.close()
 	    27                                 return datacount, dataarray
 ###objgraph install
-#objgraph python#
+```bash
-
+pip install objgraph
-##objgraph install##
+```
     pip install objgraph
 We need to call it
-      import pdb;
+```python
 import pdb;
 ```
 and add the following wherever we want to break into the debugger
-     pdb.set_trace()
+```python
 pdb.set_trace()
 ```
 This drops us into ``command`` mode
-    (pdb) import objgraph
+```python
-    (pdb) objgraph.show_most_common_types()
+(pdb) import objgraph
 (pdb) objgraph.show_most_common_types()
 ```
 Then we can find the following
-	function                   8259
+```
-	dict                       2137
+function                   8259
-	tuple                      1949
+dict                       2137
-	wrapper_descriptor         1625
+tuple                      1949
-	list                       1586
+wrapper_descriptor         1625
-	weakref                    1145
+list                       1586
-	builtin_function_or_method 1117
+weakref                    1145
-	method_descriptor          948
+builtin_function_or_method 1117
-	getset_descriptor          708
+method_descriptor          948
-	type                       705
+getset_descriptor          708
 type                       705
 ```
 It can also generate graphs; these appear to be rendered with ``dot``, together with ``python-xdot``
@ -145,17 +174,20 @@ line_profiler ships with a profiling script, ``kernprof.py``, so
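 If we have to spend the same amount of time scanning that data every single time, it is a great way to kill time.
-##Querying data with python SQLite3##
+##Querying data with python SQLite3
 We created a database file named ``userdata.db`` and then a table with the columns owner, language, eventtype, name and url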
-    def init_db():
+```python
-        conn = sqlite3.connect('userdata.db')
+def init_db():
-        c = conn.cursor()
+    conn = sqlite3.connect('userdata.db')
-        c.execute('''CREATE TABLE userinfo (owner text, language text, eventtype text, name text, url text)''')
+    c = conn.cursor()
    c.execute('''CREATE TABLE userinfo (owner text, language text, eventtype text, name text, url text)''')
 ```
 Now we can query the data; let's start from the results.
-<pre><code class="python">
+```python
 def get_count(username):
    count = 0
    userinfo = []
@ -165,11 +197,11 @@ def get_count(username):
        userinfo.append(zero)
    return count, userinfo
-
+```
 </code></pre>
 When I query ``gmszone``, that is, myself, I get the following result
-<pre><code class="bash">
+
 ```bash
 (u'gmszone', u'ForkEvent', u'RESUME', u'TeX', u'https://github.com/gmszone/RESUME')
 (u'gmszone', u'WatchEvent', u'iot-dashboard', u'JavaScript', u'https://github.com/gmszone/iot-dashboard')
 (u'gmszone', u'PushEvent', u'wechat-wordpress', u'Ruby', u'https://github.com/gmszone/wechat-wordpress')
@ -180,43 +212,53 @@ def get_count(username):
 (u'gmszone', u'PushEvent', u'iot-doc', u'TeX', u'https://github.com/gmszone/iot-doc')
 (u'gmszone', u'PushEvent', u'iot-doc', u'TeX', u'https://github.com/gmszone/iot-doc')
 109
-</pre></code>
+````
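 There are 109 events in total, including ``Watch``, ``Create``, ``Push``, ``Fork`` and others;
 the main projects are ``iot``, ``RESUME``, ``iot-dashboard`` and ``wechat-wordpress``;
 then come the languages, ``TeX``, ``JavaScript`` and ``Ruby``, followed by each project's url.
 It is worth noting the following.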
-<pre><code class="bash">
+
 ```bash
 -rw-r--r--   1 fdhuang staff 905M Apr 12 14:59 userdata.db
-</code></pre>
+```
 The database file is **905M**, but query performance is quite satisfying, at least compared with the earlier results.
-##Python SQLite3##
+##Python SQLite3
 Python ships with SQLite3 support, but we still need to install SQLite3 itself
-      brew install sqlite3
+```bash
 brew install sqlite3
 ```
 or
-     sudo port install sqlite3
+```bash   
 sudo port install sqlite3
 ```
 or, on Ubuntu,
-     sudo apt-get install sqlite3
+```bash
 sudo apt-get install sqlite3
 ```
 and on openSUSE, naturally,
-     sudo zypper install sqlite3
+```bash
 sudo zypper install sqlite3
 ```
 Then again, yast2 works nicely too, doesn't it?
-##Python Github Sqlite3 data import##
+##Python Github Sqlite3 data import
 Note that this requires python2.7, because ``gzip.GzipFile`` only gained context manager support in that version
-<pre><code class="python">
+```python
 def handle_gzip_file(filename):
    userinfo = []
    with gzip.GzipFile(filename) as f:
@ -264,7 +306,7 @@ def build_db_with_gzip():
    conn.commit()
    c.close()
-</code></pre>
+```
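 ``executemany`` inserts multiple rows in one call. In our data, an hour's file contains roughly five to six thousand records that meet the condition above: only records that have both an ``actor`` and a ``type`` need to be stored, because we only count the events that belong to users, not every event.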
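
The filtering and bulk insert described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python 3 and an in-memory database instead of ``userdata.db``, and the event dictionaries and their keys (other than ``actor`` and ``type``) are made up for the example, not taken from the real archive format.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "CREATE TABLE userinfo "
    "(owner text, language text, eventtype text, name text, url text)"
)

# Made-up events; the second one has no actor and is skipped.
events = [
    {"actor": "alice", "type": "PushEvent", "language": "Python",
     "name": "demo", "url": "https://example.com/alice/demo"},
    {"type": "WatchEvent", "language": "Ruby",
     "name": "other", "url": "https://example.com/bob/other"},
    {"actor": "alice", "type": "ForkEvent", "language": "TeX",
     "name": "paper", "url": "https://example.com/alice/paper"},
]

# Keep only records that have both an actor and a type.
rows = [
    (e["actor"], e["language"], e["type"], e["name"], e["url"])
    for e in events
    if "actor" in e and "type" in e
]

# One call inserts every row; "?" placeholders keep values out of the SQL text.
c.executemany("INSERT INTO userinfo VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

c.execute("SELECT COUNT(*) FROM userinfo WHERE owner = ?", ("alice",))
alice_count = c.fetchone()[0]
print(alice_count)
conn.close()
```

Building the full list of rows first and calling ``executemany`` once per file keeps the insert inside a single transaction, which is usually much cheaper than committing row by row.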
@ -276,7 +318,9 @@ def build_db_with_gzip():
 First comes the regular expression match
-     date_re = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})-([0-9]+)\.json.gz")
+```python
 date_re = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})-([0-9]+)\.json.gz")
 ```
 but the main work is done by ``glob.glob``
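
To see what ``date_re`` extracts, here is a small sketch, assuming Python 3. The file names are hypothetical stand-ins for whatever ``glob.glob`` returns; the actual glob pattern is not shown in this chapter.

```python
import re

date_re = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})-([0-9]+)\.json.gz")

# Hypothetical file names standing in for the output of glob.glob().
filenames = ["2014-04-12-15.json.gz", "2014-04-12-9.json.gz", "notes.txt"]

parsed = []
for name in filenames:
    match = date_re.match(name)
    if match is None:
        continue  # not an hourly archive file
    year, month, day, hour = (int(g) for g in match.groups())
    parsed.append((year, month, day, hour))

print(parsed)
```

Note that ``match`` only matches at the start of the string, so if the glob results include a directory prefix, apply ``os.path.basename`` first or use ``date_re.search`` instead.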
@ -290,7 +334,7 @@ def build_db_with_gzip():
 A better solution?
-###redis###
+###redis
 Combining the previous two posts, we can finally read and process the user data, and then go on to find similar users.
@ -298,30 +342,36 @@ def build_db_with_gzip():
 Query a user's total event count
-     import redis
+```python
-     r = redis.StrictRedis(host='localhost', port=6379, db=0)
+import redis
-     pipe = pipe = r.pipeline()
+r = redis.StrictRedis(host='localhost', port=6379, db=0)
-     pipe.zscore('osrc:user',"gmszone")
+pipe = pipe = r.pipeline()
-     pipe.execute()
+pipe.zscore('osrc:user',"gmszone")
 pipe.execute()
 ```
 The system returned ``227.0``; let's try someone else.
-    >>> pipe.zscore('osrc:user',"dfm")
+```bash
-    <redis.client.StrictPipeline object at 0x104fa7f50>
+>>> pipe.zscore('osrc:user',"dfm")
-    >>> pipe.execute()
+<redis.client.StrictPipeline object at 0x104fa7f50>
-    [425.0]
+>>> pipe.execute()
-    >>>
+[425.0]
 >>>
 ```
 Let's see which days of the week most of the activity falls on
-    >>> pipe.hgetall('osrc:user:gmszone:day')
+```python
-    <redis.client.StrictPipeline object at 0x104fa7f50>
+>>> pipe.hgetall('osrc:user:gmszone:day')
-    >>> pipe.execute()
+<redis.client.StrictPipeline object at 0x104fa7f50>
-    [{'1': '51', '0': '41', '3': '17', '2': '34', '5': '28', '4': '22', '6': '34'}]
+>>> pipe.execute()
 [{'1': '51', '0': '41', '3': '17', '2': '34', '5': '28', '4': '22', '6': '34'}]
 ```
 The result looks roughly like the figure below:
-![SMTWTFS][1]
+![SMTWTFS](./img/smtwtfs.png)
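
The hash returned by ``hgetall`` is unordered and stores the counts as strings, so it helps to turn it into an ordered week before plotting. A minimal sketch, assuming Python 3 and assuming that index ``0`` means Sunday (inferred from the "SMTWTFS" caption, not confirmed by the source):

```python
# The hash returned by hgetall above: day index -> event count (as strings).
day_counts = {'1': '51', '0': '41', '3': '17', '2': '34',
              '5': '28', '4': '22', '6': '34'}

# Assumed mapping: index 0 is Sunday, matching the "SMTWTFS" caption.
DAY_NAMES = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

# Order the counts by day index, treating a missing day as zero events.
week = [int(day_counts.get(str(i), 0)) for i in range(7)]
total = sum(week)
shares = [round(n / total, 3) for n in week]

for name, n, share in zip(DAY_NAMES, week, shares):
    print(name, n, share)
```

The seven counts add up to 227, the same total that ``zscore`` returned for ``gmszone``.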
 What are the main event types?
@ -331,17 +381,17 @@ def build_db_with_gzip():
    [[('PushEvent', 154.0), ('CreateEvent', 41.0), ('WatchEvent', 18.0), ('GollumEvent', 8.0), ('MemberEvent', 3.0), ('ForkEvent', 2.0), ('ReleaseEvent', 1.0)]]
    >>>
-![Main Event][2]
+![Main Event](./img/main-events.png)
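 Blue is the push events, yellow is create, and so on.
 At this point we more or less know how the database part of OSRC works.
-##Python redis queries
+###Python redis queries
 The main code is shown below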
-<pre><code class="python">
+```python
 def get_vector(user, pipe=None):
    r = redis.StrictRedis(host='localhost', port=6379, db=0)
@ -364,18 +414,19 @@ def get_vector(user, pipe=None):
    if no_pipe:
        return pipe.execute()
-</code></pre>
+```
 The result was shown in the previous post, namely
-      [227.0, {'1': '51', '0': '41', '3': '17', '2': '34', '5': '28', '4': '22', '6': '34'}, [('PushEvent', 154.0), ('CreateEvent', 41.0), ('WatchEvent', 18.0), ('GollumEvent', 8.0), ('MemberEvent', 3.0), ('ForkEvent', 2.0), ('ReleaseEvent', 1.0)], 0, 0, 0, 11, [('CSS', 74.0), ('JavaScript', 60.0), ('Ruby', 12.0), ('TeX', 6.0), ('Python', 6.0), ('Java', 5.0), ('C++', 5.0), ('Assembly', 5.0), ('C', 3.0), ('Emacs Lisp', 2.0), ('Arduino', 2.0)]]
+```
 [227.0, {'1': '51', '0': '41', '3': '17', '2': '34', '5': '28', '4': '22', '6': '34'}, [('PushEvent', 154.0), ('CreateEvent', 41.0), ('WatchEvent', 18.0), ('GollumEvent', 8.0), ('MemberEvent', 3.0), ('ForkEvent', 2.0), ('ReleaseEvent', 1.0)], 0, 0, 0, 11, [('CSS', 74.0), ('JavaScript', 60.0), ('Ruby', 12.0), ('TeX', 6.0), ('Python', 6.0), ('Java', 5.0), ('C++', 5.0), ('Assembly', 5.0), ('C', 3.0), ('Emacs Lisp', 2.0), ('Arduino', 2.0)]]
 ```
 Interestingly, this also produces the users most similar to me
-     ['alesdokshanin', 'hjiawei', 'andrewreedy', 'christj6', '1995eaton']
+```
-
+['alesdokshanin', 'hjiawei', 'andrewreedy', 'christj6', '1995eaton']
-  [1]: https://www.phodal.com/static/media/uploads/screen_shot_2014-04-15_at_8.11.14_pm.png
+```
  [2]: https://www.phodal.com/static/media/uploads/screen_shot_2014-04-15_at_8.14.52_pm.png
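 The most interesting part of osrc is surely flann, which is also a key and interesting piece of the back-end design.
@ -386,20 +437,24 @@ The most interesting part of osrc is surely flann, which is also the back-end's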
 In other words, we need some samples to serve as our analysis data, and what gets used here is the result from before.
-     [227.0, {'1': '51', '0': '41', '3': '17', '2': '34', '5': '28', '4': '22', '6': '34'}, [('PushEvent', 154.0), ('CreateEvent', 41.0), ('WatchEvent', 18.0), ('GollumEvent', 8.0), ('MemberEvent', 3.0), ('ForkEvent', 2.0), ('ReleaseEvent', 1.0)], 0, 0, 0, 11, [('CSS', 74.0), ('JavaScript', 60.0), ('Ruby', 12.0), ('TeX', 6.0), ('Python', 6.0), ('Java', 5.0), ('C++', 5.0), ('Assembly', 5.0), ('C', 3.0), ('Emacs Lisp', 2.0), ('Arduino', 2.0)]]
+```
 [227.0, {'1': '51', '0': '41', '3': '17', '2': '34', '5': '28', '4': '22', '6': '34'}, [('PushEvent', 154.0), ('CreateEvent', 41.0), ('WatchEvent', 18.0), ('GollumEvent', 8.0), ('MemberEvent', 3.0), ('ForkEvent', 2.0), ('ReleaseEvent', 1.0)], 0, 0, 0, 11, [('CSS', 74.0), ('JavaScript', 60.0), ('Ruby', 12.0), ('TeX', 6.0), ('Python', 6.0), ('Java', 5.0), ('C++', 5.0), ('Assembly', 5.0), ('C', 3.0), ('Emacs Lisp', 2.0), ('Arduino', 2.0)]]
 ```
 The code builds a points.h5 file: it computes each user's points and then records them in the hdf5 file.
-    [ 0.00438596  0.18061674  0.2246696   0.14977974  0.07488987  0.0969163
+```
-      0.12334802  0.14977974  0.          0.18061674  0.          0.          0.
+[ 0.00438596  0.18061674  0.2246696   0.14977974  0.07488987  0.0969163
-      0.00881057  0.          0.          0.03524229  0.          0.
+    0.12334802  0.14977974  0.          0.18061674  0.          0.          0.
-      0.01321586  0.          0.          0.          0.6784141   0.
+    0.00881057  0.          0.          0.03524229  0.          0.
-      0.07929515  0.00440529  1.          1.          1.          0.08333333
+    0.01321586  0.          0.          0.          0.6784141   0.
-      0.26431718  0.02202643  0.05286344  0.02643172  0.          0.01321586
+    0.07929515  0.00440529  1.          1.          1.          0.08333333
-      0.02202643  0.          0.          0.          0.          0.          0.
+    0.26431718  0.02202643  0.05286344  0.02643172  0.          0.01321586
-      0.          0.          0.00881057  0.          0.          0.          0.
+    0.02202643  0.          0.          0.          0.          0.          0.
-      0.          0.          0.          0.          0.          0.          0.
+    0.          0.          0.00881057  0.          0.          0.          0.
-      0.          0.          0.          0.          0.00881057]
+    0.          0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.          0.00881057]
 ```
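 This captures most of a user's behaviour and then finds users who behave similarly; the main behaviours are the following:
@ -410,62 +465,67 @@ The most interesting part of osrc is surely flann, which is also the back-end's
 The code osrc uses for parsing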
 ```python
 def parse_vector(results):
    points = np.zeros(nvector)
    total = int(results[0])
-    def parse_vector(results):
+    points[0] = 1.0 / (total + 1)
        points = np.zeros(nvector)
        total = int(results[0])
-        points[0] = 1.0 / (total + 1)
+    # Week means.
    for k, v in results[1].iteritems():
        points[1 + int(k)] = float(v) / total
-        # Week means.
+    # Event types.
-        for k, v in results[1].iteritems():
+    n = 8
-            points[1 + int(k)] = float(v) / total
+    for k, v in results[2]:
        points[n + evttypes.index(k)] = float(v) / total
-        # Event types.
+    # Number of contributions, connections and languages.
-        n = 8
+    n += nevts
-        for k, v in results[2]:
+    points[n] = 1.0 / (float(results[3]) + 1)
-            points[n + evttypes.index(k)] = float(v) / total
+    points[n + 1] = 1.0 / (float(results[4]) + 1)
    points[n + 2] = 1.0 / (float(results[5]) + 1)
    points[n + 3] = 1.0 / (float(results[6]) + 1)
-        # Number of contributions, connections and languages.
+    # Top languages.
-        n += nevts
+    n += 4
-        points[n] = 1.0 / (float(results[3]) + 1)
+    for k, v in results[7]:
-        points[n + 1] = 1.0 / (float(results[4]) + 1)
+        if k in langs:
-        points[n + 2] = 1.0 / (float(results[5]) + 1)
+            points[n + langs.index(k)] = float(v) / total
-        points[n + 3] = 1.0 / (float(results[6]) + 1)
+        else:
            # Unknown language.
            points[-1] = float(v) / total
-        # Top languages.
+    return points
-        n += 4
+```
        for k, v in results[7]:
            if k in langs:
                points[n + langs.index(k)] = float(v) / total
            else:
                # Unknown language.
                points[-1] = float(v) / total
        return points
 This returns the points we need, and we can then use ``get_points`` to fetch them
-    def get_points(usernames):
+```python
-        r = redis.StrictRedis(host='localhost', port=6379, db=0)
+def get_points(usernames):
-        pipe = r.pipeline()
+    r = redis.StrictRedis(host='localhost', port=6379, db=0)
    pipe = r.pipeline()
-        results = get_vector(usernames)
+    results = get_vector(usernames)
-        points = np.zeros([len(usernames), nvector])
+    points = np.zeros([len(usernames), nvector])
-        points = parse_vector(results)
+    points = parse_vector(results)
-        return points
+    return points
 ```
 That gives us the corresponding data; next, let's look up a nearby user and see the result.
-    [ 0.01298701  0.19736842  0.          0.30263158  0.21052632  0.19736842
+```
-      0.          0.09210526  0.          0.22368421  0.01315789  0.          0.
+[ 0.01298701  0.19736842  0.          0.30263158  0.21052632  0.19736842
-      0.          0.          0.          0.01315789  0.          0.
+    0.          0.09210526  0.          0.22368421  0.01315789  0.          0.
-      0.01315789  0.          0.          0.          0.73684211  0.          0.
+    0.          0.          0.          0.01315789  0.          0.
-      0.          1.          1.          1.          0.2         0.42105263
+    0.01315789  0.          0.          0.          0.73684211  0.          0.
-      0.09210526  0.          0.          0.          0.          0.23684211
+    0.          1.          1.          1.          0.2         0.42105263
-      0.          0.          0.03947368  0.          0.          0.          0.
+    0.09210526  0.          0.          0.          0.          0.23684211
-      0.          0.          0.          0.          0.          0.          0.
+    0.          0.          0.03947368  0.          0.          0.          0.
-      0.          0.          0.          0.          0.          0.          0.
+    0.          0.          0.          0.          0.          0.          0.
-      0.          0.          0.          0.        ]
+    0.          0.          0.          0.          0.          0.          0.
    0.          0.          0.          0.        ]
 ```
 It is really hard to see what the two have in common...
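
Eyeballing two long vectors is hard; a distance makes the comparison concrete. The sketch below is not how osrc does it (osrc relies on flann for the neighbour search). It is a brute-force illustration, assuming Python 3 and assuming plain Euclidean distance, applied only to the first eight components of the two vectors printed above (``points[0]`` plus the seven weekday shares). The helper name ``euclidean`` is hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# First eight components of the gmszone vector shown earlier.
mine = [0.00438596, 0.18061674, 0.2246696, 0.14977974,
        0.07488987, 0.0969163, 0.12334802, 0.14977974]

# First eight components of the neighbouring user's vector.
neighbour = [0.01298701, 0.19736842, 0.0, 0.30263158,
             0.21052632, 0.19736842, 0.0, 0.09210526]

distance = euclidean(mine, neighbour)
print(distance)
```

A smaller value means more similar behaviour. Comparing this number against the distance to a few other users would show whether the suggested neighbour really is closer than average.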
--- a/img/main-events.png
+++ b/img/main-events.png
--- a/img/smtwtfs.png
+++ b/img/smtwtfs.png