Skip to main content


Eclipse Community Forums
Forum Search:

Search      Help    Register    Login    Home
Home » Eclipse Projects » Gendoc » XLSXParser::layoutCells() performance issue
XLSXParser::layoutCells() performance issue [message #1848210] Thu, 25 November 2021 06:20 Go to next message
Denis Nikiforov is currently offline Denis NikiforovFriend
Messages: 343
Registered: August 2013
Senior Member
Hi

We generate a big enough xlsx file. The method XLSXParser::layoutCells() works very slow:

index.php/fa/41365/0/

The image shows just a snapshot. The process is not even finished.

It seems that the performance issue is caused by rows.getLength() in the following method:

	public void layoutCells() {
		int minRow = Integer.MAX_VALUE;
		int minCol = Integer.MAX_VALUE;
		int maxCol = 0;
		
		int curRow = 0;
                NodeList rows = getDocument().getElementsByTagName("row");
		for (int i=0; i<rows.getLength(); i++) {
                        // ...


Could you please cache the length here?

Is it possible to define and use my custom XLSXParser as a workaround? How can I replace the default parser by a custom one?
Re: XLSXParser::layoutCells() performance issue [message #1848218 is a reply to message #1848210] Thu, 25 November 2021 10:38 Go to previous messageGo to next message
Carsten Pitz is currently offline Carsten PitzFriend
Messages: 471
Registered: May 2015
Location: Germany
Senior Member
I would rather identify the XML Parser Xerces as the most time consuming part. Which exactly is what I expect, because Xerces is doing a damn lot of checks. You may optimize it by disabling all checks in Xerces.
/Carsten
Re: XLSXParser::layoutCells() performance issue [message #1848219 is a reply to message #1848218] Thu, 25 November 2021 10:47 Go to previous messageGo to next message
Tristan Faure is currently offline Tristan FaureFriend
Messages: 460
Registered: July 2009
Senior Member
Yes we could investigate on xerces check operations. Nevertheless caching getLentgth() is also relevant

@Denis about overriding XLSparser you can but you have to override the document Factory which can instantiate your XLS Document which will return your xls parser. Doable but not so quick, please ask if you need more details.
About changing the code base I could check or will see with other committers sorry for the potential delay :(

EDIT : BTW thank you very much for the performed analysis @Denis




[Updated on: Thu, 25 November 2021 10:48]

Report message to a moderator

Re: XLSXParser::layoutCells() performance issue [message #1848220 is a reply to message #1848219] Thu, 25 November 2021 11:53 Go to previous messageGo to next message
Denis Nikiforov is currently offline Denis NikiforovFriend
Messages: 343
Registered: August 2013
Senior Member
Thanks for answers!

I'm not experienced with Xerces and don't understand how to change its options...

On custom XLSXParser... If I understand you right I should do the following:

1) Define my XLSXFactory
2) It should create my XLSXDocument
3) It shoud create my XLSXParserProvider in the XLSXDocument::getXmlParsers()
4) It shoud create my XLSXParser
5) Register my XLSXFactory using org.eclipse.gendoc.document.parser.factory extension point

Is it right? The 5th step seems to be the most problematic because the extension point already has registration for xlsx files in the org.eclipse.gendoc.document.parser.xlsx plugin. Should I create a custom feature excluding this plugin and including my custom one?
Re: XLSXParser::layoutCells() performance issue [message #1848224 is a reply to message #1848220] Thu, 25 November 2021 12:52 Go to previous messageGo to next message
Tristan Faure is currently offline Tristan FaureFriend
Messages: 460
Registered: July 2009
Senior Member
Hello yes for the described steps
I made a mistake I thought it was a service and a service extension has a priority and could override another one.
In your case for document factory indeed once an extension is registered you can't override it. Maybe the solution you describe using features works but I'm not sure
A workaround could be to create a factory for a specific extension for instances "xlsx2" ,match this extension in your factory and change the extension of your doc ...




[Updated on: Thu, 25 November 2021 13:13]

Report message to a moderator

Re: XLSXParser::layoutCells() performance issue [message #1848248 is a reply to message #1848224] Fri, 26 November 2021 03:44 Go to previous messageGo to next message
Denis Nikiforov is currently offline Denis NikiforovFriend
Messages: 343
Registered: August 2013
Senior Member
I added a cache for getLength(). It works significantly faster now. At least the generation process is finished. But now the next method with a performance issue is item():

index.php/fa/41375/0/

I'll try to implement the iteration differently.
Re: XLSXParser::layoutCells() performance issue [message #1848249 is a reply to message #1848248] Fri, 26 November 2021 04:27 Go to previous messageGo to next message
Denis Nikiforov is currently offline Denis NikiforovFriend
Messages: 343
Registered: August 2013
Senior Member
The following code works fast enough:

        Node sheetData = getDocument().getElementsByTagName("sheetData").item(0);
        for (Node node = sheetData.getFirstChild(); node != null; node = node.getNextSibling()) {
            if (!(node instanceof Element)) {
                continue;
            }
            Element row = (Element) node;
            // ...


index.php/fa/41376/0/

It seems that cleanup() and getAppliedMarks() methods can be improved too. They evaluate the followin XPath expression:

XPathXlsxUtils.evaluateNodes(xlsxParser.getDocument(),"//:row/:c/gendoc:mark[@id='"+this.mark+"']")


Maybe I'm wrong but it seems that addition of :sheetData to the expression:

XPathXlsxUtils.evaluateNodes(xlsxParser.getDocument(),"/:sheetData/:row/:c/gendoc:mark[@id='"+this.mark+"']")

reduces time taken by layoutCells() from 27 seconds to 17 seconds. Maybe the XPath could be replaced by a loop over all rows...
Re: XLSXParser::layoutCells() performance issue [message #1848250 is a reply to message #1848249] Fri, 26 November 2021 05:30 Go to previous messageGo to next message
Denis Nikiforov is currently offline Denis NikiforovFriend
Messages: 343
Registered: August 2013
Senior Member
It seems that all performance issues are fixed now. layoutCells() takes 1,5 seconds on my test model.

Here are all changes I made:

1) org.eclipse.gendoc.document.parser.xlsx.XLSXParser::layoutCells()

The loop looks as follows:
    public void layoutCells() {
        int minRow = Integer.MAX_VALUE;
        int minCol = Integer.MAX_VALUE;
        int maxCol = 0;

        int curRow = 0;
        Node sheetData = getDocument().getElementsByTagName("sheetData").item(0);
        for (Node node = sheetData.getFirstChild(); node != null; node = node.getNextSibling()) {
            if (!(node instanceof Element)) {
                continue;
            }
            Element row = (Element) node;
            // ...


2) The following methods are changed in org.eclipse.gendoc.document.parser.xlsx.cellmarkers.AbstractCellMarker:

    public List<CellMark> getAppliedMarks(XLSXParser xlsxParser) {
        Set<CellMark> res = new HashSet<CellMark>();
        try {
            Node sheetData = xlsxParser.getDocument().getElementsByTagName("sheetData").item(0);
            for (Node row = sheetData.getFirstChild(); row != null; row = row.getNextSibling()) {
                for (Node cell = row.getFirstChild(); cell != null; cell = cell.getNextSibling()) {
                    for (Node node = cell.getFirstChild(); node != null; node = node.getNextSibling()) {
                        if ("gendoc:mark".equals(node.getNodeName())) {
                            if (mark.equals(((Element) node).getAttribute("id"))) {
                                res.add(getCellMark((Element) node));
                            }
                        }
                    }
                }
            }

//            NodeList refs = XPathXlsxUtils.evaluateNodes(xlsxParser.getDocument(),
//                    "//:row/:c/gendoc:mark[@id='" + this.mark + "']");
//            for (int i = 0; i < refs.getLength(); i++) {
//                Element markEl = (Element) refs.item(i);
//                CellMark m = getCellMark(markEl);
//                res.add(m);
//            }
        } catch (Exception e) {
            throw new IllegalArgumentException(e.getMessage(), e);
        }
        List<CellMark> l = new ArrayList<CellMark>(res);
        Collections.sort(l);
        return l;
    }

    protected void unmarkCells(XLSXParser xlsxParser) throws XPathExpressionException {
        Node sheetData = xlsxParser.getDocument().getElementsByTagName("sheetData").item(0);
        for (Node row = sheetData.getFirstChild(); row != null; row = row.getNextSibling()) {
            for (Node cell = row.getFirstChild(); cell != null; cell = cell.getNextSibling()) {
                for (Node node = cell.getFirstChild(); node != null; node = node.getNextSibling()) {
                    if ("gendoc:mark".equals(node.getNodeName())) {
                        if (mark.equals(((Element) node).getAttribute("id"))) {
                            node.getParentNode().removeChild(node);
                        }
                    }
                }
            }
        }

//        NodeList nl = XPathXlsxUtils.evaluateNodes(xlsxParser.getDocument(),
//                "//:row/:c/gendoc:mark[@id='" + mark + "']");
//        for (int i = 0; i < nl.getLength(); i++) {
//            Node n = nl.item(i);
//            n.getParentNode().removeChild(n);
//        }
    }


The interesting thing is that this methods do nothing. It seems that there are no marks in my document and nothing to process. If I replace original code with my custom one or remove this code at all nothing is changed. The generated file size stays the same. So I can't check whether my code for getAppliedMarks() and unmarkCells() is correct. At least it works faster.
Re: XLSXParser::layoutCells() performance issue [message #1848275 is a reply to message #1848250] Fri, 26 November 2021 10:33 Go to previous messageGo to next message
Tristan Faure is currently offline Tristan FaureFriend
Messages: 460
Registered: July 2009
Senior Member
Thank you for the investigation !
Are you familiar with gerrit usage, it could be nice to contribute your code : https://wiki.eclipse.org/Gendoc/developerResources/contribute
Can I ask you in which context you are using gendoc ? If it is a professional usage you could consider asking for professional services from Atos (Anne Haugommard and Stephane Duprat), I think that would speed up the release of a new version.




Re: XLSXParser::layoutCells() performance issue [message #1848395 is a reply to message #1848275] Wed, 01 December 2021 03:45 Go to previous messageGo to next message
Denis Nikiforov is currently offline Denis NikiforovFriend
Messages: 343
Registered: August 2013
Senior Member
Sorry for delay. I've re-tested changes to AbstractCellMarker and fixed some things.

I tried to create a gerrit commit. But it seems that refs/for/master is out of date and I can't push to it. And also I can't push to master. Could you please update refs/for/master branch? Or maybe I don't understand something, it's my first gerrit commit. Could you please give some instructions?

Yes, we develop a product similar to Eclipse Capella. It's on a very early stage, and has no public release yet. We're actually interested in improving some of the things in Gendoc. For example, internalization, support of data tables in Excel, number formatting in Excel. I asked my chiefs on the support topic.
Re: XLSXParser::layoutCells() performance issue [message #1848445 is a reply to message #1848395] Thu, 02 December 2021 09:19 Go to previous messageGo to next message
Tristan Faure is currently offline Tristan FaureFriend
Messages: 460
Registered: July 2009
Senior Member
Hello
i will check what can be wrong with Gerrit. I have integrated the previous changes, run tests and push them on master, so please pull master if you have more changes




Re: XLSXParser::layoutCells() performance issue [message #1848448 is a reply to message #1848445] Thu, 02 December 2021 11:36 Go to previous messageGo to next message
Denis Nikiforov is currently offline Denis NikiforovFriend
Messages: 343
Registered: August 2013
Senior Member
I get the following error when trying to push to master:

git.exe push -v --progress "origin" master:master
Pushing to ssh://git.eclipse.org:29418/gendoc/org.eclipse.gendoc
Enumerating objects: 27, done.
Counting objects: 100% (27/27), done.
Delta compression using up to 16 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (14/14), 1.51 KiB | 1.51 MiB/s, done.
Total 14 (delta 7), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (7/7)
remote: error: branch refs/heads/master:
remote: To push into this reference you need 'Push' rights.
remote: User: dnikiforov
remote: Contact an administrator to fix the permissions
remote: Processing changes: refs: 1, done
To ssh://git.eclipse.org:29418/gendoc/org.eclipse.gendoc
! [remote rejected] master -> master (prohibited by Gerrit: not permitted: update)
error: failed to push some refs to 'ssh://git.eclipse.org:29418/gendoc/org.eclipse.gendoc'


I don't have enough rights.

If I understand everithing right I should push to refs/for/master instead. But I get the following error:

git.exe push -v --progress "origin" master:refs/for/master
Pushing to ssh://git.eclipse.org:29418/gendoc/org.eclipse.gendoc
To ssh://git.eclipse.org:29418/gendoc/org.eclipse.gendoc
! [rejected]        master -> refs/for/master (non-fast-forward)
error: failed to push some refs to 'ssh://git.eclipse.org:29418/gendoc/org.eclipse.gendoc'
hint: Updates were rejected because a pushed branch tip is behind its remote
hint: counterpart. Check out this branch and integrate the remote changes
hint: (e.g. 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.


refs/for/master points to version 0.6.0. I guess it should point to HEAD or 0.7.2.

index.php/fa/41410/0/

I think that refs/for/master is the only branch for which I have push rights.
Re: XLSXParser::layoutCells() performance issue [message #1848449 is a reply to message #1848448] Thu, 02 December 2021 13:26 Go to previous messageGo to next message
Tristan Faure is currently offline Tristan FaureFriend
Messages: 460
Registered: July 2009
Senior Member
yes you are right i don't understand why refs/for/master is so far from the branch I have to check or ask the eclipse team



Re: XLSXParser::layoutCells() performance issue [message #1848450 is a reply to message #1848449] Thu, 02 December 2021 14:19 Go to previous messageGo to next message
Tristan Faure is currently offline Tristan FaureFriend
Messages: 460
Registered: July 2009
Senior Member
please try with this command : git push origin HEAD:refs/for/refs/heads/master
(pull from master to get latest commits)




Re: XLSXParser::layoutCells() performance issue [message #1848454 is a reply to message #1848450] Thu, 02 December 2021 14:53 Go to previous messageGo to next message
Denis Nikiforov is currently offline Denis NikiforovFriend
Messages: 343
Registered: August 2013
Senior Member
Thanks! It works: https://git.eclipse.org/r/c/gendoc/org.eclipse.gendoc/+/188477
Re: XLSXParser::layoutCells() performance issue [message #1848600 is a reply to message #1848454] Thu, 09 December 2021 10:46 Go to previous message
Tristan Faure is currently offline Tristan FaureFriend
Messages: 460
Registered: July 2009
Senior Member
just validated your patch, thank you for your contribution Denis!



Previous Topic:Issue with GenDoc on Papyrus 5.2 Windows
Goto Forum:
  


Current Time: Sun Feb 25 09:51:19 GMT 2024

Powered by FUDForum. Page generated in 0.25138 seconds
.:: Contact :: Home ::.

Powered by: FUDforum 3.0.2.
Copyright ©2001-2010 FUDforum Bulletin Board Software

Back to the top