MongoDB InvalidDocument: невозможно закодировать объект

Я использую scrapy для регистрации блогов, а затем сохраняю данные в mongodb. Сначала я получил InvalidDocument Exception. Для меня очевидно, что данные не в правильной кодировке. Поэтому, прежде чем продолжать объект, в моей MongoPipeline я проверяю, находится ли документ в "utf-8 strict", и только затем я пытаюсь сохранить объект в mongodb. НО все же я получаю InvalidDocument Exceptions, теперь это раздражает.

Это мой код моего объекта MongoPipeline, который сохраняет объекты в mongodb

# -*- coding: utf-8 -*-

# Define your item pipelines here
#

import pymongo
import sys, traceback
from scrapy.exceptions import DropItem
from crawler.items import BlogItem, CommentItem


class MongoPipeline(object):
    collection_name = 'master'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'posts')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):

        if type(item) is BlogItem:
            try:
                if 'url' in item:
                    item['url'] = item['url'].encode('utf-8', 'strict')
                if 'domain' in item:
                    item['domain'] = item['domain'].encode('utf-8', 'strict')
                if 'title' in item:
                    item['title'] = item['title'].encode('utf-8', 'strict')
                if 'date' in item:
                    item['date'] = item['date'].encode('utf-8', 'strict')
                if 'content' in item:
                    item['content'] = item['content'].encode('utf-8', 'strict')
                if 'author' in item:
                    item['author'] = item['author'].encode('utf-8', 'strict')

            except:  # catch *all* exceptions
                e = sys.exc_info()[0]
                spider.logger.critical("ERROR ENCODING %s", e)
                traceback.print_exc(file=sys.stdout)
                raise DropItem("Error encoding BLOG %s" % item['url'])

            if 'comments' in item:
                comments = item['comments']
                item['comments'] = []

                try:
                    for comment in comments:
                        if 'date' in comment:
                            comment['date'] = comment['date'].encode('utf-8', 'strict')
                        if 'author' in comment:
                            comment['author'] = comment['author'].encode('utf-8', 'strict')
                        if 'content' in comment:
                            comment['content'] = comment['content'].encode('utf-8', 'strict')

                        item['comments'].append(comment)

                except:  # catch *all* exceptions
                    e = sys.exc_info()[0]
                    spider.logger.critical("ERROR ENCODING COMMENT %s", e)
                    traceback.print_exc(file=sys.stdout)

        self.db[self.collection_name].insert(dict(item))

        return item

И все же я получаю следующее исключение:

au coeur de l\u2019explosion de la bulle Internet n\u2019est probablement pas \xe9tranger au succ\xe8s qui a suivi. Mais franchement, c\u2019est un peu court comme argument !Ce que je sais dire, compte tenu de ce qui pr\xe9c\xe8de, c\u2019est quelles sont les conditions pour r\xe9ussir si l\u2019on est vraiment contraint de rester en France. Ce sont des sujets que je d\xe9velopperai dans un autre article.',
     'date': u'2012-06-27T23:21:25+00:00',
     'domain': 'reussir-sa-boite.fr',
     'title': u'Peut-on encore entreprendre en France ?\t\t\t ',
     'url': 'http://www.reussir-sa-boite.fr/peut-on-encore-entreprendre-en-france/'}
    Traceback (most recent call last):
      File "h:\program files\anaconda\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "H:\PDS\BNP\crawler\crawler\pipelines.py", line 76, in process_item
        self.db[self.collection_name].insert(dict(item))
      File "h:\program files\anaconda\lib\site-packages\pymongo\collection.py", line 409, in insert
        gen(), check_keys, self.uuid_subtype, client)
    InvalidDocument: Cannot encode object: {'author': 'Arnaud Lemasson',
     'content': 'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me co\xc3\xbbterait bien trop cher. Bref, 100% d\xe2\x80\x99accord avec vous. Le probl\xc3\xa8me, je ne vois pas comment cela pourrait changer avec le gouvernement actuel\xe2\x80\xa6 A moins que si, j\xe2\x80\x99ai pu lire il me semble qu\xe2\x80\x99ils avaient en t\xc3\xaate de r\xc3\xa9duire l\xe2\x80\x99IS pour les petites entreprises et de l\xe2\x80\x99augmenter pour les grandes\xe2\x80\xa6 A voir',
     'date': '2012-06-27T23:21:25+00:00'}
    2015-11-04 15:29:15 [scrapy] INFO: Closing spider (finished)
    2015-11-04 15:29:15 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 259,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 252396,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 11, 4, 14, 29, 15, 701000),
     'log_count/DEBUG': 2,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start)
    time': datetime.datetime(2015, 11, 4, 14, 29, 13, 191000)}

Еще одна забавная вещь из комментария @eLRuLL я сделал следующее:

>>> s = "Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me"
>>> s
'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me'
>>> se = s.encode("utf8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> se = s.encode("utf-8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)

Тогда мой вопрос. Если этот текст не может быть закодирован. Тогда почему моя МонгоПипелина попытается поймать не поймать этот ИСКЛЮЧИТЕЛЬ? Потому что к объекту ['comments'] следует добавить только объекты, которые не вызывают каких-либо исключений?

Ответ 1

Наконец я понял это. Проблема заключалась не в кодировании. Это было со структурой документов.

Потому что я отправился на стандартный пример MongoPipeline, который не касается вложенных предметов.

Что я делаю: BlogItem: "URL" ... comments = [CommentItem]

Итак, у моего BlogItem есть список CommentItems. Теперь проблема пришла сюда, для сохранения объекта в базе данных я делаю:

self.db[self.collection_name].insert(dict(item))

Итак, я разбираю BlogItem с dict. Но я не разбираюсь в списке CommentItems. И поскольку traceback отображает тип CommentItem как dict, мне не приходило в голову, что проблематичный объект не является типом!

Итак, наконец, способ исправить эту проблему - изменить строку при добавлении комментария в список комментариев как таковой:

item['comments'].append(dict(comment))

Теперь MongoDB считает его действительным документом.

Наконец, для последней части, где я спрашиваю, почему я получаю исключение на консоли python, а не в script.

Причина в том, что я работал над консолью python, которая поддерживает только ascii. И, следовательно, ошибка.

Ответ 2

Во-первых, когда вы выполняете "somestring".encode(...), не меняете "somestring", но он возвращает новую закодированную строку, поэтому вы должны использовать что-то вроде:

 item['author'] = item['author'].encode('utf-8', 'strict')

и то же самое для других полей.

Ответ 3

Вы можете хранить "response.text".

Ответ 4

Я получил эту ошибку при выполнении запроса

db.collection.find({'attr': {'$gte': 20}})

и некоторые записи в collection имели нечисловое значение для attr.