Coder Social home page Coder Social logo

Comments (48)

fideloper avatar fideloper commented on July 20, 2024 6

It's just part of the out_charset parameter. For example:

// $result = iconv($in_charset, $out_charset, $str);
$convertedString = iconv('ISO-8859-1', 'UTF-8//IGNORE', $stringToConvert);

from fetch.

fideloper avatar fideloper commented on July 20, 2024 2

You can use UTF-8//IGNORE to ignore illegal characters and remove them. This avoids this error.

One caveat, however. Note the comments here. Especially the 2nd one, which says that //IGNORE doesn't work. We've found that in php 5.3, sometimes iconv misbehaves. Illegal characters can lead to an empty string being returned instead of simply dropping the illegal characters. Later versions of PHP fixes that.

from fetch.

fideloper avatar fideloper commented on July 20, 2024 1

Ignore and Translit are mutually exclusive (You can't use them together).

Your options are to either

  1. Ignore (drop them) characters not recognized (via //IGNORE)
  2. Attempt to replace them with approximates (via //TRANSLIT).
  3. You can use neither, altho the behavior you get when using neither isn't usually desired behavior - it just cuts the string off at the first unrecognized character.

This is all in the docs!

from fetch.

fideloper avatar fideloper commented on July 20, 2024 1

I think you need to PR it (clone the repo, commit/push your changes in, the make the pr). Based on the branches/commit history, it looks like a PR to master is most appropriate.

(Also might be something that's worth a unit test)

from fetch.

amansilla avatar amansilla commented on July 20, 2024

I have the same problem.

from fetch.

cleartext avatar cleartext commented on July 20, 2024

Same here, will come back here and comment if I fix it!

from fetch.

red-crown avatar red-crown commented on July 20, 2024

same here

from fetch.

iampersistent avatar iampersistent commented on July 20, 2024

I ran into this and the problem in my case was actually with strings that shouldn't be decoded where getting decoded. The decoded strings would sometimes have information that wasn't understood and would replace that data with �. I resolved this by fixing the decode method in #56

from fetch.

 avatar commented on July 20, 2024

How can we go about passing UTF-8//IGNORE?

from fetch.

 avatar commented on July 20, 2024

Hmmm...I changed the output value to 'UTF-8//IGNORE//TRANSLIT', however I'm still getting issues...not sure why...

from fetch.

 avatar commented on July 20, 2024

Yes I figured it out. It was on a users end. My bad.

from fetch.

fideloper avatar fideloper commented on July 20, 2024

👍 🍺

from fetch.

 avatar commented on July 20, 2024

Hmmm...still having issues though on what appears to be a UTF-8 encoded email.

I'm getting error...

iconv(): Detected an illegal character in input string

Which is line 441

if (!empty($parameters['charset']) && $parameters['charset'] !== self::$charset)
                $messageBody = iconv($parameters['charset'], self::$charset, $messageBody);

I'm using 'UTF-8//TRANSLIT'

Here is the email in question that's causing it:

http://pastie.org/private/4b7pdeox56crvmfdk31dw

from fetch.

 avatar commented on July 20, 2024

This needs a lot of work, encoding seems to be very goofy. I currently have 'UTF-8//IGNORE' set and am still getting iconv(): Detected an illegal character in input string errors on some messages.

from fetch.

fideloper avatar fideloper commented on July 20, 2024
  • What's the value of self::$charset exactly? Did you set public static $charset = 'UTF-8//IGNORE'; explicitly here?
  • What version of PHP are you using?
  • Do you have xDebug with "SCREAM" set to on? (This may show notices despite the //IGNORE, altho it shouldn't to my knowledge)

I'm not sure why you'd get a notice pop up when using //IGNORE.

A note on your email example from pastie.org: This is why character encoding sucks - This is what happens when you copied/pasted that raw email into pastie.org:

  • We don't know if the app you used to read the file with changed its encoding
  • We don't know if the form pastie uses sets a character encoding
  • We don't know how pastie's code handles character encoding
  • We don't know if pastie sends its data to its database using UTF8, nor if its database is set to store UTF-8 or not

In other words, anyone here would need your raw file to test with, as sending it over via pastie.org has a ton of points of possible character set conversion. ARE WE HAVING FUN YET!??! :D

This needs a lot of work, encoding seems to be very goofy.

Sure as hell is, but that's not necessarily the fault of this library. PHP is bad at this.

from fetch.

 avatar commented on July 20, 2024
  • What's the value of self::$charset exactly? Did you set public static $charset = 'UTF-8//IGNORE'; explicitly here?
  • What version of PHP are you using?
  • Do you have xDebug with "SCREAM" set to on? (This may show notices despite the //IGNORE, altho it shouldn't to my knowledge)
  1. yes
  2. PHP 5.4.4-14+deb7u10
  3. No XDebug isn't installed

I'm willing to send the email file to anyone who needs it.

A few more questions about this I guess. Should I need to install php5-intl? I'm willing to go however far down the rabbit hole we need to solve this problem.

Sure as hell is, but that's not necessarily the fault of this library. PHP is bad at this.

I've gathered that from my research on this topic. From my understanding, //TRANSLIT and //IGNORE should both work fine. I switched it to //IGNORE to help ease the issue, incase there was something wrong with //TRANSLIT. Perhaps I need to upgrade my PHP. Perhaps I need to install another module, regardless I appreciate your help @fideloper. In it for the long haul! :)

from fetch.

fideloper avatar fideloper commented on July 20, 2024

I don't think php-intl should be needed. We use iconv a lot on our app without issue (without installing php-intl), BUT we also suppress NOTICE errors from appearing, so it's possible I'm just missing them.

But, I think I found something:

Check out this SO question - That's probably what's happening in your case?

In the code:

if (!empty($parameters['charset']) && $parameters['charset'] !== self::$charset)
                $messageBody = iconv($parameters['charset'], self::$charset, $messageBody);

... the portion $parameters['charset'] !== self::$charset will likely always "pass" as self::$charset has "//TRANSLIT" or "//IGNORE" in it, and the $parameters['charset'] likely does not. That probably means the library should do some extra checking to see if the character set is already UTF8 or UTF-8.

I think a tweak to fix that bit of logic might help. (That make sense?)

from fetch.

 avatar commented on July 20, 2024

I've seen that question before and this was also a "hypothesis" of mine. However I couldn't be totally sure due to lack of other evidence from both my customers and the other possible peoples using this library and not reporting back. I'm curious as to what @tedivm thinks.

In the mean time I guess I'll work on a fix for what we maybe think is happening. sigh

from fetch.

tedivm avatar tedivm commented on July 20, 2024

I would be happy if someone wanted to tackle this (ecstatic even). I'm in Paris on vacation for a couple of weeks and probably will not have time myself.

One thing to note is that this was an attempt to normalize the charset of emails in order to make it easier for people using the library, so they could just skip worrying about that all together. This may have been a mistake, and it may be better to expose the encoding (getEncoding()) and allow the developers themselves to do what they want with it, or perhaps add a changeEncoding($charset) to the message object.

from fetch.

 avatar commented on July 20, 2024

Okay I've implemented a simple fix for now. Is it possible to submit a patch in this issue or will I have to make a separate request?

from fetch.

 avatar commented on July 20, 2024

I don't know if that submitted patch is the best route, but it solves problems with the current state of affairs. Perhaps mb_convert_encoding can do a better job than iconv? Usage of mb_convert_encoding might allow for better optional encoding as well. Thoughts?

from fetch.

tedivm avatar tedivm commented on July 20, 2024

iconv is installed by default but the mb extensions are not. It also can't be added via pecl, it has to be compiled in directly. For this reason I've been avoiding using those.

from fetch.

fideloper avatar fideloper commented on July 20, 2024

I've actually had the opposite experience on some linux distros. I think for the most part both are usually there, however.

from fetch.

tedivm avatar tedivm commented on July 20, 2024

According to the php site iconv is a default installed extension.

from fetch.

 avatar commented on July 20, 2024

well for me at least, mb_ is working wonders better than iconv, maybe it should be added as a requirement for this library...

from fetch.

tedivm avatar tedivm commented on July 20, 2024

We can always do a "if(function_exists()) else blarg" type thing- basically, look for the mb functions and if they aren't there fall back on iconv.

from fetch.

 avatar commented on July 20, 2024

That's fair, but then we return to the original issue if mb_ functions don't exist. Maybe add a note to the readme about hardcoding in UTF-8 if mb_ functions aren't found?

from fetch.

tedivm avatar tedivm commented on July 20, 2024

The issue seems more related to the inclusion of //TRANSLIT or //IGNORE, as @fideloper brought up. I think that can be solved by doing the comparison of charset to the message's actual character set after removing the //FLAG from the charset. Does that make sense?

from fetch.

 avatar commented on July 20, 2024

Yes, but rather than writing code to parse through flags correctly, why not just use a set of functions that is a dependency of most PHP5 installations (mb_ is included in php5-common a requirement for php installation on Debian and Ubuntu). To me it comes down to use case. You're going to have far more people using Ubuntu and Debian base systems rather than something else. Furthermore do you really believe that users of this library want to mess around with encoding? If they do, mb_ provides a simple list of supported encoding types. The "encoding from" charset isn't even required. So much simpler than messing around with parsing an encoding type, and supporting issues with exact verbiage of a variable.

from fetch.

tedivm avatar tedivm commented on July 20, 2024

We already would use those functions, this would simply provide a fallback for when those functions aren't there. It's not like this is particularly complicated code, and it keeps the library from breaking backwards compatibility by changing requirements. In the future we can probably deprecate use of iconv is we want to, but by doing it this way we won't be forcing an immediate update to PHP for people who depend on this library.

from fetch.

fideloper avatar fideloper commented on July 20, 2024

Speaking from experience -

Developers likely will need to mess around with encoding - especially when handling emails. UTF8 is super important for handling multiple languages in addition to special characters such as certain mathematical notations (all sorts of crazy stuff is sent through emails!)

I definitely don't agree that assuming "it works on ubuntu/debian is good enough" - RHEL, for example, is especially popular for professional use. That being said, they likely all come with iconv and mb_string out of the box (I haven't tested myself, but I'll bet its a pretty safe assumption).

I agree (mentioned-ish quickly above) that encoding could be added on an abstraction layer on top of encoding to handle the subject, body and other text portions of emails could give developers more options when dealing with their environment. Perhaps some "hooks" to be able to do anything with the text fields (subject, body) would be useful - this library could skip encoding altogether and just provide a way for developers to do it themselves to meet their needs.

I think that offers a simpler approach overall. The main point of this library probably isn't to go into the depths of issues/gotchas of character encoding (altho I don't want to speak for @tedvim), so providing hooks to aid in potentially adding filtering or other functionality a dev might need that might be the best bet.

Not sure on your opinions on that, just my 2 cents for now!

from fetch.

 avatar commented on July 20, 2024

I agree with you both, @fideloper and @tedivm. Backwards compatibility is super essential. This library hasn't reached a 1.0 version yet, so I see that as room to play around with exactly how it works (even if it means breaking something. Though I'm not saying we do need to break something). I'm leaning toward removing encoding conversion all together and let the developer deal with it as the best course for the future. However as stated, it is @tedivm 's project and I'm sure he'll deal with best interests. That being said, I think the currently the best route would be the earlier proposed solution of,

"if(function_exists()) else blarg"

However this will lead to adding a second variable:

public static $flags = '//TRANSLIT';

This can be added to the iconv code for conversion. Thoughts?

from fetch.

tedivm avatar tedivm commented on July 20, 2024

My 2:28am on-the-way-to-the-airport brain thinks that looks pretty good.

I'm not completely opposed to breaking BC, I just like to find ways around doing it when possible :-)

from fetch.

 avatar commented on July 20, 2024

With #66 committed, I guess the next step is to decide whether to write an Encode.php or to devise a plan to removing encoding. Either way, a benefit of either choice would be that encoding can be handled not only for the $messageBody, but also the Sender name and other information that may need to be encoded.

from fetch.

fideloper avatar fideloper commented on July 20, 2024

Would it be simple if those items had getters and setters, and just leaving
any encoding up to a developer to implement ok their own?

The important idea being the ability to set those attributes after encoding
them within our own code.

This gives devs an avenue to do whatever they need to, without having to
think too hard in a library that doesn't need the complexities if encoding
etc!

On Friday, June 13, 2014, Alex Kavon [email protected] wrote:

So I guess the next step is to decide whether to write an Encoding.php or
to devise a plan to removing encoding. Either way, a benefit of either
choice would be that encoding can be handled not only for the $messageBody,
but also the Sender name and other information that may need to be encoded.


Reply to this email directly or view it on GitHub
#13 (comment).

from fetch.

 avatar commented on July 20, 2024

something like:

$message->setEncoding($charset);
$message->encode($string);

This would allow for something like:

public function encode($string) {
    if (function_exists('mb_convert_encoding') { bleh } else { exception about installing mb_convert_encoding }
}

Allowing the deprecation of iconv and the conversion of any string and separating out the encoding functions so they are not required. Thoughts?

EDIT: I guess we don't need to deprecate iconv with the method of encoding. It is a good method of fallback. I'll get off the iconv hate train now. There should definitely be a note appended to the readme about the superiority of mb functions.

from fetch.

tedivm avatar tedivm commented on July 20, 2024

I do like the idea of getEncoding and setEncoding as functions here.

from fetch.

 avatar commented on July 20, 2024

I guess deciding between the two current options, to offer encoding or not, it comes down to two ideas. Is Fetch a library that offers simplified IMAP functions and presentation of email messages, or is Fetch a library that just offers easy access to IMAP? I was continuing to think about this, if Encode.php gets implemented, it could be a bigger rewrite than planned. This could lead to entire functions being rewritten to automatically encode values to UTF-8 (by default...if setEncoding() isn't called). And not require the user to ever need to call $message->encode(); The alternative being of course leaving it up to the user to decide what to encode and what not too, do they want to encode it or not?

from fetch.

tedivm avatar tedivm commented on July 20, 2024

I don't really like the Encode.php idea, to be honest. The reason I like the getEncoding/setEncoding option is that it basically punts it to the user while also offering a quick shortcut to change things.

That being said, this library was built awhile ago (2009) and hasn't had it's API change much since then. One of the reasons I rewrote the test suite recently was to make it easier to start making API changes and to expand the library based off of what I've learned from putting it together and people's feedback of it's use. In other words I'm not opposed to API changes that will make this library easier to use.

from fetch.

fideloper avatar fideloper commented on July 20, 2024

My preference (and I certainly don't mind y'all have another opinion!) would be go go simpler and simply leave an avenue for developers to do some "post processing" - probably easily done by providing some setters for the already available getter methods.

This leaves behind a lot of complexity and lets the library stay more on message ("Easy access to IMAP").

Again, I'm voicing my preference, do with it what you will!

from fetch.

tedivm avatar tedivm commented on July 20, 2024

Isn't the setter/getter method what I'm proposing?

from fetch.

fideloper avatar fideloper commented on July 20, 2024

Ah, I see - I explained that in a confusing way.

My idea was getters/setters for major portions of the email message:


class Message {

    public function getMessageBody() { ... }
    public function setMessageBody( $body ) { ... }

    public function getSubject() { ... }
    public function setSubject( $subject ) { ... }


    /* And so on, where ever it might makes sense to have getters/setters */
}

Then you can leave out all encoding and other "filtering" type concerns not directly related to email. That leaves it up to a developer to handle things like encoding - you can ignore the need for any encoding.

get_class($message) //  ->  \Fetch\Message

$encoder = new MyOwnEncoder;

$subject = $message->getSubject();
$body = $message->getBody();

$message->setSubject(
    $encoder->encode($subject, $message->getEmailCharset(), 'UTF8')
);
$message->setBody(
    $encoder->encode($body, $message->getEmailCharset(), 'UTF8')
);

Something like that! My point of view on this is that the library becomes simpler by deferring potentially opinionated and "off-topic" things such as character set encoding or other possible filtering someone might want to do when processing emails.

from fetch.

 avatar commented on July 20, 2024

@fideloper In the example above. The MyOwnEncoder class is something separate a user has built correct?

from fetch.

fideloper avatar fideloper commented on July 20, 2024

Absolutely correct!

On Saturday, June 14, 2014, Alex Kavon [email protected] wrote:

@fideloper https://github.com/fideloper In the example above. The
MyOwnEncoder class is something separate a user has built correct?


Reply to this email directly or view it on GitHub
#13 (comment).

from fetch.

 avatar commented on July 20, 2024

I'm all for entirely removing encoding.

from fetch.

chrisjraby avatar chrisjraby commented on July 20, 2024

Is there any update on solving the encoding issue? I'm using the "dev-master" branch for now, but obviously that isn't ideal in the long term.

from fetch.

orrd avatar orrd commented on July 20, 2024

I'm also wondering if there is any update to the encoding issue yet. I've only skimmed through the lengthy discussion here, but basically I'm getting the "iconv(): Detected an illegal character in input string" error when calling getMessages(), and just want to know what the quick and simple solution is to fix the error and tell Fetch to let me read the emails in UTF-8.

Is there a simple way to get Fetch to do whatever encoding/decoding is necessary to just simply return the messages in usable UTF-8 text?

from fetch.

tedivm avatar tedivm commented on July 20, 2024

Character encoding has been updated with the current release. I'm asking that people submit new tickets for any bugs that they find.

from fetch.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.