Highlight Job Description Keywords with WordCloud

1 minute read

Visual images are always more appealing than plain numbers or statistics, which is why word clouds are so popular. I am considering looking for a new job and have gathered some vacancies along with their job descriptions. I want to know which parts they emphasize the most, so creating a WordCloud seems perfect for this purpose.

The following Python packages are required for creating a WordCloud:

pip3 install wordcloud
pip3 install imageio
pip3 install jieba

imageio is used for reading the mask image, and jieba for Chinese word segmentation.
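
If you want to confirm the installation before moving on, a quick import check like this sketch (my own addition, not part of the original walkthrough) will do:

# sanity check: confirm that the three packages installed above can be imported
import wordcloud
import imageio
import jieba

print("wordcloud, imageio and jieba imported successfully")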

You may want to use a mirror site to save download time; this can be done either by specifying it on the command line:

pip3 install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

or by configuring it in a dot file:

cat << EOT >> $HOME/.pip/pip.conf
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
EOT

The following script is based on an example from the official wordcloud repository with slight adaptations. You need to download SourceHanSerifK-Light.otf and put the font file into fonts/SourceHanSerif/ under the working directory to make this work.
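
Besides the font, the script also expects the mask image, the job description text, and a stop-word list in the working directory. A short sketch like the following (my own addition, using the same relative paths as the script below) can verify that everything is in place before running it:

#!/usr/bin/env python
# check that the files the wordcloud script expects are present
# relative to the current working directory
import os

required = [
    'fonts/SourceHanSerif/SourceHanSerifK-Light.otf',  # Chinese font
    'mask.jpg',         # mask / color image that shapes the word cloud
    'jd.txt',           # the gathered job descriptions
    'filterwords.txt',  # words to filter out, one per line
]

for name in required:
    status = 'ok' if os.path.exists(name) else 'MISSING'
    print(f'{status:8} {name}')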

#!/usr/bin/env python

import jieba
import os
from os import path
from imageio import imread
from wordcloud import WordCloud, ImageColorGenerator

# enable jieba parallel segmentation with 4 worker processes
jieba.enable_parallel(4)
d = os.getcwd()

filterwords_path = d + '/filterwords.txt'
# a font that supports Chinese must be set
font_path = d + '/fonts/SourceHanSerif/SourceHanSerifK-Light.otf'

# the path to save the wordcloud image
imgname = d + '/jobs-masked.jpg'

# read the mask / color image
back_coloring = imread(path.join(d, 'mask.jpg'))

# read the whole text of the job descriptions
text = open(path.join(d, 'jd.txt'), encoding='utf-8').read()

# The function for processing text with Jieba
def jieba_processing_txt(text):
    mywordlist = []
    seg_list = jieba.cut(text, cut_all=False)
    liststr = "/ ".join(seg_list)

    with open(filterwords_path, encoding='utf-8') as f_filter:
        f_filter_text = f_filter.read()
        f_filter_seg_list = f_filter_text.splitlines()

    for myword in liststr.split('/'):
        if not (myword.strip() in f_filter_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    return ' '.join(mywordlist)

wc = WordCloud(font_path=font_path, background_color="white", max_words=2000, mask=back_coloring,
               max_font_size=100, random_state=42, width=1000, height=860, margin=2,)
wc.generate(jieba_processing_txt(text))

# create coloring from image
image_colors_byImg = ImageColorGenerator(back_coloring)
wc.recolor(color_func=image_colors_byImg)

# save wordcloud
wc.to_file(imgname)
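
Since the point is to see which keywords the job descriptions emphasize most, it can also help to print the raw counts. Here is a small follow-up sketch (my own addition) that reuses jieba_processing_txt and text from the script above:

# print the 20 most frequent keywords from the same processed text
from collections import Counter

counts = Counter(jieba_processing_txt(text).split())
for word, count in counts.most_common(20):
    print(f'{count:5d}  {word}')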

WALL.E Keywords