File Contents of a Microsoft Office Document

Post by jacek »

Dylan wrote: Hey!

I have been dabbling in my spare time on an upload system. Ideally, I'd love to allow people to upload their article, file, or document and store its contents in the database. I know this is feasible, and I have the system down, or at least the logic for it.

My problem comes when someone uploads a file that was made in Microsoft Word (and perhaps other full-scale word processors; I haven't tested): the contents of the file, according to file_get_contents, are a whole bunch of symbol gibberish.

Do you have a workaround for that? Perhaps a header, or something I am missing about the function?

Post by jacek »

To make sure I understand correctly: you are trying to get the text from a Word document, and not the full formatted document?

Post by Dylan »

That's correct.

Just the content. It works fine with documents created outside of Microsoft Word (for example in Notepad, saved in any format), as well as with documents from Microsoft Word saved as .txt.

Post by Torniquet »

Can you not save it as .txt when it's uploaded?

Post by jacek »

Well, a .docx file is essentially a zip archive, so I would start by unzipping it (http://uk3.php.net/manual/en/ref.zip.php). Inside the zip there should be a word/document.xml file, which should contain the text (content.xml is the equivalent in OpenDocument files such as .odt).

Googling things like ".docx to plain text" or searching on Stack Overflow may get you some code to get the logic from.

It is not something I have done before, so without actually trying it I can't say "this is how to do it".
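
A minimal sketch of the unzip step, assuming the ZipArchive class from PHP's zip extension is available and that the main text lives in word/document.xml, which is where Word normally puts it. The function name is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): returns the raw XML of the
// main document part of a .docx file, or null if it cannot be read.
function docx_read_document_xml(string $path): ?string
{
    $zip = new ZipArchive();

    // open() returns true on success, or an integer error code otherwise.
    if ($zip->open($path) !== true) {
        return null;
    }

    // Assumption: the body text is stored in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    // getFromName() returns false when the entry does not exist.
    return $xml === false ? null : $xml;
}
```

For an upload you would pass the temporary file path, e.g. $_FILES['upload']['tmp_name'], where 'upload' is whatever your form field happens to be called. The result is still XML, so the text has to be pulled out of it afterwards.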

Torniquet wrote: can you not save it as .txt when it's uploaded?

That would not work even slightly.
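
To illustrate why renaming does not help: the extension is only a label, and the bytes stay the same. A rough sketch, assuming you just want to check whether an uploaded file is really a zip container (as a .docx is) regardless of its name. The helper name is invented for this example.

```php
<?php
// Hypothetical helper: true if the file starts with the ZIP local file
// header signature "PK\x03\x04", whatever its extension says.
function looks_like_zip(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    // Only the first four bytes are needed for the signature check.
    $magic = fread($handle, 4);
    fclose($handle);

    return $magic === "PK\x03\x04";
}
```

A .docx renamed to .txt still passes this check, which is why file_get_contents keeps returning compressed binary data instead of readable text.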

Post by Torniquet »

jacek wrote:
Torniquet wrote: can you not save it as .txt when it's uploaded?

That would not work even slightly.

Hell, ya learn something new every day, lol.

Post by Dylan »

Okay, so approaching it the way you suggested is working. However, I may have been slightly misleading in saying I just want the content with no formatting.

I should have stated that I want the content and the whitespace (namely new lines and spaces), with no regard for other special formatting. Is there any way you can think of to do that? I have it working in the sense that I can get all the content, just in one long string.

Post by jacek »

Hmm, I assumed whitespace would be there as text. From the looks of this http://dev.plutext.org/forums/docx-java ... -t509.html it seems that the XML elements indicate where the spaces are.
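
One way this could look in PHP, as a sketch rather than a tested solution: walk the XML with DOMDocument and turn the structural elements back into whitespace. It assumes the usual WordprocessingML namespace, that each w:p element is a paragraph, and that w:t, w:tab and w:br inside a run (w:r) carry text, tabs and line breaks. The function name is made up for the example.

```php
<?php
// Hypothetical helper: converts the XML from word/document.xml into plain
// text, keeping paragraph breaks, manual line breaks and tabs.
function docx_xml_to_text(string $xml): string
{
    // Assumption: the standard (transitional) WordprocessingML namespace.
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return '';
    }

    $paragraphs = [];
    foreach ($doc->getElementsByTagNameNS($ns, 'p') as $p) {
        $line = '';
        // Only look inside runs, so tab stop definitions in the
        // paragraph properties are not mistaken for real tabs.
        foreach ($p->getElementsByTagNameNS($ns, 'r') as $run) {
            foreach ($run->childNodes as $child) {
                if ($child->namespaceURI !== $ns) {
                    continue;
                }
                switch ($child->localName) {
                    case 't':   // literal text, spaces included
                        $line .= $child->textContent;
                        break;
                    case 'tab': // tab character
                        $line .= "\t";
                        break;
                    case 'br':  // manual line break
                    case 'cr':
                        $line .= "\n";
                        break;
                }
            }
        }
        $paragraphs[] = $line;
    }

    // One output line per paragraph.
    return implode("\n", $paragraphs);
}
```

This ignores everything else (styles, tables as layout, footnotes), and a paragraph nested inside another one, such as a text box, would be picked up twice, so treat it as a starting point.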

Post by Dylan »

Thanks!

With that and some searching I was able to find just about all the characters and such that I need.