File Contents of a Microsoft Office Document

Posted: Mon Jun 20, 2011 7:41 pm
by jacek
Dylan wrote:Hey!

I have been working on an upload system in my spare time. Ideally, I'd love to let people upload their article / file / document or whatever and store its contents in the database. I know this is feasible and I have the system down, or at least the logic for it.

My problem comes when someone uploads a file made in Microsoft Word (and perhaps other full-scale word processors, I haven't tested): the contents of the file, as returned by "file_get_contents", are a whole bunch of symbol gibberish.

Do you have a workaround for that? Perhaps a header, or something I am missing about the function.

Re: File Contents of a Microsoft Office Document

Posted: Mon Jun 20, 2011 7:42 pm
by jacek
To make sure I understand correctly: you are trying to get the text from a Word document, not the full formatted document?

Re: File Contents of a Microsoft Office Document

Posted: Mon Jun 20, 2011 7:53 pm
by Dylan
That's correct.

Just the content. It works fine with documents created outside of Microsoft Word (for example in Notepad, saved in any format), and with documents saved from Microsoft Word as .txt.

Re: File Contents of a Microsoft Office Document

Posted: Mon Jun 20, 2011 7:58 pm
by Torniquet
Can you not save it as .txt when it's uploaded?

Re: File Contents of a Microsoft Office Document

Posted: Mon Jun 20, 2011 8:00 pm
by jacek
Well, a .docx file is essentially a zip archive, so I would start by unzipping it (http://uk3.php.net/manual/en/ref.zip.php). Inside the zip there should be a word/document.xml file, which should contain the text (content.xml is the equivalent in OpenDocument files such as .odt).

Googling things like ".docx to plain text" or searching on Stack Overflow may get you some code to get the logic from.

Not something I have done before, so without actually trying it I can't say "this is how to do it".
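
A rough sketch of the unzip step, written in Python purely for illustration (PHP's ZipArchive class would play the same role as zipfile here). It assumes the upload is a .docx whose main body lives at word/document.xml, which is the usual location. docx_raw_text is a made-up helper name, not part of any library. All tags are dropped, so the text comes back as one long string with no paragraph breaks.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def docx_raw_text(data: bytes) -> str:
    """Return every text node of the main document part, concatenated.

    Assumption: `data` is the raw bytes of a .docx upload and the body
    is stored at word/document.xml inside the zip archive.
    """
    # A .docx is a zip archive, so open the uploaded bytes as one.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")

    # Parse the XML and join all text nodes; markup (and with it all
    # paragraph structure) is discarded.
    root = ET.fromstring(xml_bytes)
    return "".join(root.itertext())
```

Calling docx_raw_text(open("upload.docx", "rb").read()) on a real file ("upload.docx" is a placeholder name) should give the document's words run together, with entities such as &amp;amp; already decoded by the XML parser.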
Torniquet wrote:Can you not save it as .txt when it's uploaded?
That would not work even slightly.

Re: File Contents of a Microsoft Office Document

Posted: Mon Jun 20, 2011 8:02 pm
by Torniquet
jacek wrote:
Torniquet wrote:Can you not save it as .txt when it's uploaded?
That would not work even slightly.
hell ya learn summit new every day lol.

Re: File Contents of a Microsoft Office Document

Posted: Tue Jun 21, 2011 1:23 am
by Dylan
Okay, so approaching it the way you suggested is working, however, I may have been slightly misleading in saying I just want the content with no formatting.

I should have stated that I want the content and the white space (namely new lines and spaces), with no regard for other special formatting. Is there any way you can think of to do that? I have it working in the sense that I can get all the content, just in one long string.

Re: File Contents of a Microsoft Office Document

Posted: Tue Jun 21, 2011 12:13 pm
by jacek
Hmm, I assumed whitespace would be there as text. From the looks of this http://dev.plutext.org/forums/docx-java ... -t509.html it seems that the XML elements indicate where the spaces are.
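
To make that concrete, here is a minimal sketch (in Python, for illustration only) of rebuilding the whitespace from the markup. It assumes the WordprocessingML conventions that each w:p element is a paragraph, w:t holds literal text, w:tab is a tab, and w:br / w:cr are line breaks inside a run. docx_text_with_whitespace is a hypothetical helper name. Nested paragraphs (for example inside text boxes), headers, footers and footnotes are not handled.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML elements (w:p, w:r, w:t, ...).
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text_with_whitespace(data: bytes) -> str:
    """Extract text from .docx bytes, keeping new lines, tabs and spaces.

    Assumption: the body is stored at word/document.xml.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(W + "p"):
        parts = []
        # Text lives in runs (w:r); look only at each run's direct children.
        for run in paragraph.iter(W + "r"):
            for child in run:
                if child.tag == W + "t":
                    parts.append(child.text or "")  # literal text, spaces included
                elif child.tag == W + "tab":
                    parts.append("\t")              # tab character
                elif child.tag in (W + "br", W + "cr"):
                    parts.append("\n")              # break inside a paragraph
        paragraphs.append("".join(parts))

    # One output line per paragraph.
    return "\n".join(paragraphs)
```

Only the direct children of each run are inspected because tab-stop definitions in the paragraph properties are also named w:tab, and walking the whole paragraph would turn those into stray tab characters.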

Re: File Contents of a Microsoft Office Document

Posted: Tue Jun 21, 2011 8:02 pm
by Dylan
Thanks!

With that and some searching I was able to find just about all the characters and such that I need.