Получение координат строки с использованием ITextExtractionStrategy и LocationTextExtractionStrategy в Itextsharp

У меня есть файл PDF, который я читаю в строке, используя ITextExtractionStrategy.Now из строки я беру подстроку, например My name is XYZ, и мне нужно получить прямоугольные координаты подстроки из файла PDF, но не в состоянии это сделать. В googling я узнал, что LocationTextExtractionStrategy, но не получаю, как использовать это, чтобы получить координаты.

Вот код.

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);

string getcoordinate="My name is XYZ";

Как я могу получить прямоугольную координату этой подстроки с помощью ITEXTSHARP..

Пожалуйста, помогите.

Ответ 1

Вот очень простая версия реализации.

Перед реализацией важно очень знать, что PDF файлы имеют нулевую концепцию "слов", "абзацев", "предложений" и т.д. Кроме того, текст в PDF не обязательно выкладывается влево справа и сверху вниз, и это не имеет ничего общего с языками, отличными от LTR. Фраза "Hello World" может быть записана в PDF как:

Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)

Он также может быть записан как

Draw Hello World at (10,10)

Интерфейс ITextExtractionStrategy, который нужно реализовать, имеет метод под названием RenderText, который вызывается один раз для каждого фрагмента текста в PDF файле. Заметьте, я сказал "кусок", а не "слово". В первом примере выше метод будет называться четыре раза для этих двух слов. Во втором примере он будет называться один раз для этих двух слов. Это очень важная часть для понимания. PDF файлы не имеют слов, и из-за этого iTextSharp также не имеет слов. "Слово" на 100% зависит от вас.

Также в этих строках, как я сказал выше, в файлах PDF нет абзацев. Причина того, что это известно, заключается в том, что PDF файлы не могут переносить текст в новую строку. Каждый раз, когда вы видите что-то, похожее на возврат абзаца, вы видите новую команду рисования текста, которая имеет другую координату y в качестве предыдущей строки. См. для дальнейшего обсуждения.

Код ниже - очень простая реализация. Для этого я подклассифицирую LocationTextExtractionStrategy, который уже реализует ITextExtractionStrategy. При каждом вызове RenderText() я найду прямоугольник текущего фрагмента (используя Mark code here) и сохраняя его позже. Я использую этот простой вспомогательный класс для хранения этих кусков и прямоугольников:

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

И здесь подкласс:

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

И, наконец, реализация вышеперечисленного:

//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();

            doc.Add(new Paragraph("This is my sample file"));

            doc.Close();
        }
    }
}

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

Я не могу подчеркнуть, что в приведенном выше не принимают слова "слова", это зависит от вас. Объект TextRenderInfo, который передается в RenderText, имеет метод под названием GetCharacterRenderInfos(), который вы могли бы использовать для получения дополнительной информации. Вы также можете использовать GetBaseline() instead of GetDescentLine() `, если вы не заботитесь о descenders в шрифте.

ИЗМЕНИТЬ

(У меня был отличный обед, поэтому я чувствую себя немного более полезным.)

Здесь обновленная версия MyLocationTextExtractionStrategy, которая делает то, что говорят мои комментарии ниже, а именно: требуется строка для поиска и поиска каждого фрагмента для этой строки. По всем перечисленным причинам это не будет работать в некоторых/многих/самых/всех случаях. Если подстрока существует несколько раз в одном блоке, она также вернет только первый экземпляр. Лигатуры и диакритики могут также испортиться.

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //The string that we're searching for
    public String TextToSearchFor { get; set; }

    //How to compare strings
    public System.Globalization.CompareOptions CompareOptions { get; set; }

    public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
        this.TextToSearchFor = textToSearchFor;
        this.CompareOptions = compareOptions;
    }

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //See if the current chunk contains the text
        var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

        //If not found bail
        if (startPosition < 0) {
            return;
        }

        //Grab the individual characters
        var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();

        //Grab the first and last character
        var firstChar = chars.First();
        var lastChar = chars.Last();


        //Get the bounding box for the chunk of text
        var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
        var topRight = lastChar.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }

Вы использовали бы это так же, как и раньше, но теперь у конструктора есть один требуемый параметр:

var t = new MyLocationTextExtractionStrategy("sample");

Ответ 2

Это старый вопрос, но я оставляю здесь свой ответ, поскольку я не мог найти правильный ответ в Интернете.

Как показал Крис Хаас, дело не просто в том, чтобы говорить, как iText занимается кусками. Код, по которому сообщение Криса потерпело неудачу в большинстве моих тестов, потому что слово обычно разбивается на разные куски (он предупреждает об этом в сообщении).

Чтобы решить эту проблему, это стратегия, которую я использовал:

Разделить куски в символах (на самом деле объекты textrenderinfo за каждый char)
Группировка строк по строке. Это не так прямо, как вам приходится иметь дело с выравниванием куска.
Найдите слово, которое нужно найти для каждой строки

Я оставляю здесь код. Я тестирую его с несколькими документами, и он работает очень хорошо, но в некоторых сценариях это может быть неудачным, потому что это немного сложнее в этом преобразовании chunk → words.

Надеюсь, что это кому-то поможет.

  class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
    private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
    private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
    public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
    private String m_SearchText;
    public const float PDF_PX_TO_MM = 0.3528f;
    public float m_PageSizeY;


    public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
        : base()
    {
        this.m_SearchText = sSearchText;
        this.m_PageSizeY = fPageSizeY;
    }

    private void searchText()
    {
        foreach (LineInfo aLineInfo in m_LinesTextInfo)
        {
            int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
            if (iIndex != -1)
            {
                TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
                SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
                this.m_SearchResultsList.Add(aSearchResult);
            }
        }
    }

    private void groupChunksbyLine()
    {                     
        LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
        LocationTextExtractionStrategyEx.LineInfo textInfo = null;
        foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
        {
            if (textChunk1 == null)
            {                    
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            else if (textChunk2.sameLine(textChunk1))
            {                      
                textInfo.appendText(textChunk2);
            }
            else
            {                                        
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            textChunk1 = textChunk2;
        }
    }

    public override string GetResultantText()
    {
        groupChunksbyLine();
        searchText();
        //In this case the return value is not useful
        return "";
    }

    public override void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        //Create ExtendedChunk
        ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
        this.m_DocChunks.Add(aExtendedChunk);
    }

    public class ExtendedTextChunk
    {
        public string m_text;
        private Vector m_startLocation;
        private Vector m_endLocation;
        private Vector m_orientationVector;
        private int m_orientationMagnitude;
        private int m_distPerpendicular;           
        private float m_charSpaceWidth;           
        public List<TextRenderInfo> m_ChunkChars;


        public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars)
        {
            this.m_text = txt;
            this.m_startLocation = startLoc;
            this.m_endLocation = endLoc;
            this.m_charSpaceWidth = charSpaceWidth;                
            this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
            this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
            this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];                
            this.m_ChunkChars = chunkChars;

        }


        public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
        {
            return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
        }


    }

    public class SearchResult
    {
        public int iPosX;
        public int iPosY;

        public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
        {
            //Get position of upperLeft coordinate
            Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
            //PosX
            float fPosX = vTopLeft[Vector.I1]; 
            //PosY
            float fPosY = vTopLeft[Vector.I2];
            //Transform to mm and get y from top of page
            iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
            iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
        }
    }

    public class LineInfo
    {            
        public string m_Text;
        public List<TextRenderInfo> m_LineCharsList;

        public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
        {                
            this.m_Text = initialTextChunk.m_text;
            this.m_LineCharsList = initialTextChunk.m_ChunkChars;
        }

        public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
        {
            m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
            this.m_Text += additionalTextChunk.m_text;
        }
    }
}

Ответ 3

Я знаю, что это действительно старый вопрос, но ниже - это то, что я закончил. Просто разместите его здесь, надеясь, что это будет полезно для кого-то другого.

В следующем коде будут указаны начальные координаты строк, содержащих текст поиска. Не должно быть трудно изменить его, чтобы дать позиции слов. Заметка. Я протестировал это на itextsharp 5.5.11.0 и не буду работать над некоторыми более старыми версиями

Как упоминалось выше, в pdf файлах нет понятия слов/строк или абзацев. Но я обнаружил, что LocationTextExtractionStrategy отлично справляется с разделением строк и слов. Поэтому мое решение основано на этом.

ОТКАЗ:

Это решение основано на https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs, и в этом файле есть комментарий, в котором говорится, что это предварительный просмотр. Так что это может не сработать в будущем.

В любом случае здесь код.

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

namespace Logic
{
    public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
    {
        private readonly List<TextChunk> locationalResult = new List<TextChunk>();

        private readonly ITextChunkLocationStrategy tclStrat;

        public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp()) {
        }

        /**
         * Creates a new text extraction renderer, with a custom strategy for
         * creating new TextChunkLocation objects based on the input of the
         * TextRenderInfo.
         * @param strat the custom strategy
         */
        public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
        {
            tclStrat = strat;
        }


        private bool StartsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[0] == ' ';
        }


        private bool EndsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[str.Length - 1] == ' ';
        }

        /**
         * Filters the provided list with the provided filter
         * @param textChunks a list of all TextChunks that this strategy found during processing
         * @param filter the filter to apply.  If null, filtering will be skipped.
         * @return the filtered list
         * @since 5.3.3
         */

        private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
        {
            if (filter == null)
            {
                return textChunks;
            }

            var filtered = new List<TextChunk>();

            foreach (var textChunk in textChunks)
            {
                if (filter.Accept(textChunk))
                {
                    filtered.Add(textChunk);
                }
            }

            return filtered;
        }

        public override void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment segment = renderInfo.GetBaseline();
            if (renderInfo.GetRise() != 0)
            { // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to 
                Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
                segment = segment.TransformBy(riseOffsetTransform);
            }
            TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
            locationalResult.Add(tc);
        }


        public IList<TextLocation> GetLocations()
        {

            var filteredTextChunks = filterTextChunks(locationalResult, null);
            filteredTextChunks.Sort();

            TextChunk lastChunk = null;

             var textLocations = new List<TextLocation>();

            foreach (var chunk in filteredTextChunks)
            {

                if (lastChunk == null)
                {
                    //initial
                    textLocations.Add(new TextLocation
                    {
                        Text = chunk.Text,
                        X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                        Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                    });

                }
                else
                {
                    if (chunk.SameLine(lastChunk))
                    {
                        var text = "";
                        // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                        if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
                            text += ' ';

                        text += chunk.Text;

                        textLocations[textLocations.Count - 1].Text += text;

                    }
                    else
                    {

                        textLocations.Add(new TextLocation
                        {
                            Text = chunk.Text,
                            X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                            Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                        });
                    }
                }
                lastChunk = chunk;
            }

            //now find the location(s) with the given texts
            return textLocations;

        }

    }

    public class TextLocation
    {
        public float X { get; set; }
        public float Y { get; set; }

        public string Text { get; set; }
    }
}

Как вызвать метод:

        using (var reader = new PdfReader(inputPdf))
            {

                var parser = new PdfReaderContentParser(reader);

                var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());

                var res = strategy.GetLocations();

                reader.Close();
             }
                var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => p.Y).Reverse().ToList();




inputPdf is a byte[] that has the pdf data

pageNumber is the page where you want to search in