[Homework] load_userdict()

  1. Select 10 new poems and let Jieba segment them.
  2. Discuss with your team members which phrases you think Jieba did not segment correctly.
  3. Suppose you have many rules you want to teach Jieba; writing them one by one with add_word() would be tedious.
  4. Instead, write them in a UTF-8 text file and load it with jieba.load_userdict('mydict.txt').
  5. Each line in 'mydict.txt' holds one word and consists of up to 3 fields, in this order: word, frequency, tag. The frequency and the tag may be omitted.
  6. The fields are separated by a space.
  7. You only need to submit your 'mydict.txt' file, because the Python script is trivial.
  8. You may test your 'mydict.txt' with the following program:
    from google.colab import drive
    drive.mount('/content/gdrive')
    
    def readPoemBody(fn):
      # Assumes the poem files are UTF-8 text; 'with' closes the file for us
      with open(fn, "r", encoding="utf-8") as infile:
        # Skip the first 3 lines (title, author, separator)
        title     = infile.readline().rstrip('\n')
        author    = infile.readline().rstrip('\n')
        separator = infile.readline().rstrip('\n')
        if separator != '':
          print("[Warning] 3rd line of {} not empty\n  {}".format(
                fn, separator))
        else:
          print("Reading {}({})".format(title, author))
        # Everything after the separator line is the poem body
        body = infile.read()
      return body
    
    # Word segmentation
    import jieba
    jieba.load_userdict('gdrive/MyDrive/KGHS/mydict.txt')
    
    import glob
    filenames = glob.glob('gdrive/MyDrive/KGHS/war*.txt')
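    for fn in filenames:
      body = readPoemBody(fn)
      # Drop one trailing newline so the last split() piece is not empty;
      # endswith() is also safe when the body is empty
      if body.endswith('\n'):
        body = body[:-1]
      # Segment and print the poem line by line
      for line in body.split('\n'):
        tokens = jieba.lcut(line)
        print(line, '->', '/'.join(tokens))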
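
Before submitting, it may help to check that every line of 'mydict.txt' follows the word, frequency, tag layout described in steps 5 and 6. The sketch below is one possible check, not Jieba's own parser: check_userdict_lines() is a hypothetical helper written for this homework, the rules it applies are a simplified reading of those steps, and the sample entries are made up for illustration.

```python
import re

def check_userdict_lines(lines):
    """Return (line_number, message) pairs for lines that do not look like
    'word', 'word frequency', 'word tag', or 'word frequency tag'.
    Hypothetical helper: a simplified check, not Jieba's own parser."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        fields = line.split(' ')
        if '\t' in line or '' in fields:
            # a tab, or two spaces in a row, between fields
            problems.append((lineno, 'separate fields with exactly one space'))
        elif len(fields) > 3:
            problems.append((lineno, 'more than 3 fields'))
        elif len(fields) == 3 and not re.fullmatch(r'[0-9]+', fields[1]):
            problems.append((lineno, 'frequency must be a whole number'))
    return problems

# Made-up sample entries, for illustration only
sample = [
    '烽火 10 n\n',     # word, frequency, tag
    '家書\n',          # word only
    '白頭 many n\n',   # frequency is not a number
    '渾欲\t5 v\n',     # tab used instead of a space
]
for lineno, message in check_userdict_lines(sample):
    print('line {}: {}'.format(lineno, message))
```

To check a real dictionary, pass an open file instead of the sample list, for example check_userdict_lines(open('mydict.txt', encoding='utf-8')). An empty result means no line broke these simplified rules; it does not guarantee that Jieba segments the poems the way you intended.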