How to force a specific encoding in the GitLab UI?

Sometimes some characters in a file are not displayed in the GitLab UI. This can happen when GitLab does not assume the correct encoding for the file. How can we force GitLab to show these characters?


GitLab uses CharlockHolmes for encoding detection. However, GitLab accepts the encoding suggested by CharlockHolmes only if the returned confidence is higher than 50%; otherwise it assumes UTF-8.
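The rule above can be sketched in a few lines of Ruby. This is only an illustration of the threshold check, not GitLab's actual code: the helper name `encoding_to_use` and the constant `CONFIDENCE_THRESHOLD` are made up for this example, and the input is a plain hash shaped like the result of `CharlockHolmes::EncodingDetector.detect`.

```ruby
# Hypothetical helper illustrating the "more than 50" rule; not GitLab's real code.
CONFIDENCE_THRESHOLD = 50

# detection: a hash shaped like the result of CharlockHolmes::EncodingDetector.detect
def encoding_to_use(detection)
  if detection && detection[:confidence].to_i > CONFIDENCE_THRESHOLD
    detection[:encoding]   # confident enough: trust the detector
  else
    'UTF-8'                # otherwise fall back to UTF-8
  end
end

low  = { type: :text, encoding: 'ISO-8859-1', confidence: 50 }
high = { type: :text, encoding: 'ISO-8859-1', confidence: 72 }

puts encoding_to_use(low)   # => UTF-8
puts encoding_to_use(high)  # => ISO-8859-1
```

Note that a confidence of exactly 50 is not enough, because the comparison is strict.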

For example, this file has iso-8859-1 encoding:

$ file -I non-utf8 
non-utf8: text/plain; charset=iso-8859-1

But after we push it to GitLab.com:

$ git add .
$ git commit -m "add non-utf8 file"
[master f70d62a] add non-utf8 file
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git push
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Writing objects: 100% (3/3), 258 bytes | 258.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To gitlab.com:mhuseinbasic/encoding-test.git
   aea6216..f70d62a  master -> master

After the push, the special characters are not displayed in the UI.

The reason is that the confidence returned by CharlockHolmes is not higher than 50%:

$ irb
2.4.4 :001 > require 'charlock_holmes'
 => true 
2.4.4 :002 > content = File.read('non-utf8')
 => "Characters: \xA6\xA6\xA6\n" 
2.4.4 :003 > CharlockHolmes::EncodingDetector.detect(content)
 => {:type=>:text, :encoding=>"ISO-8859-1", :ruby_encoding=>"ISO-8859-1", :confidence=>50, :language=>"en"}
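A likely reason the characters disappear when UTF-8 is assumed is that the byte `\xA6` on its own is not a valid UTF-8 sequence. The sketch below uses only Ruby's built-in `String` methods (no CharlockHolmes needed) and the same bytes as the file above; how GitLab actually renders invalid bytes is not shown here.

```ruby
# Same bytes as the first line of the non-utf8 file, as a binary string.
raw = "Characters: \xA6\xA6\xA6\n".b

# Interpreted as UTF-8, the string is invalid: \xA6 is a lone continuation byte.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?    # => false

# Interpreted as ISO-8859-1, every byte is a valid character.
as_latin1 = raw.dup.force_encoding('ISO-8859-1')
puts as_latin1.valid_encoding?  # => true

# Decoding from ISO-8859-1 yields the broken-bar character U+00A6.
decoded = as_latin1.encode('UTF-8')
puts decoded == "Characters: \u00A6\u00A6\u00A6\n"  # => true
```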

Now we add some more content to the file to increase the returned confidence, and push the change:

$ echo "#Boost the confidence." >> non-utf8 
$ file -I non-utf8 
non-utf8: text/plain; charset=iso-8859-1
$ irb
2.4.4 :001 > require 'charlock_holmes'
 => true 
2.4.4 :002 > content = File.read('non-utf8')
 => "Characters: \xA6\xA6\xA6\n#Boost the confidence.\n" 
2.4.4 :003 > CharlockHolmes::EncodingDetector.detect(content)
 => {:type=>:text, :encoding=>"ISO-8859-1", :ruby_encoding=>"ISO-8859-1", :confidence=>72, :language=>"en"}
2.4.4 :004 > 
$ git add .
$ git commit -m "Increasing confidence"
[master d79e001] Increasing confidence
 1 file changed, 1 insertion(+)
$ git push
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Writing objects: 100% (3/3), 286 bytes | 286.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To gitlab.com:mhuseinbasic/encoding-test.git
   f70d62a..d79e001  master -> master

Even though the file still has iso-8859-1 encoding, the returned confidence is now higher than 50%, so the special characters are visible in the UI. In this way we can force GitLab to use the correct encoding when displaying our content.

If we do not want to modify the file as described above, the only other option is to convert the file to UTF-8:

$ head -1 non-utf8 > temp
$ file -I temp 
temp: text/plain; charset=iso-8859-1
$ iconv -f ISO-8859-1 -t UTF-8 temp > utf8
$ file -I utf8 
utf8: text/plain; charset=utf-8
$ git add utf8 
$ git commit -m "change encoding to utf8"
[master a4e23f3] change encoding to utf8
 1 file changed, 1 insertion(+)
 create mode 100644 utf8
$ git push
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 286 bytes | 286.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To gitlab.com:mhuseinbasic/encoding-test.git
   d79e001..a4e23f3  master -> master

After the conversion, the special characters are still visible in the UI, even though the added line is gone, because the file is now UTF-8.
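If Ruby is more convenient than `iconv`, the same ISO-8859-1 to UTF-8 conversion can be sketched with `String#encode`. The helper name `latin1_to_utf8` is made up for this example, and a temporary file stands in for the real `non-utf8` file.

```ruby
require 'tempfile'

# Hypothetical helper: reinterpret raw bytes as ISO-8859-1 and transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

Tempfile.create('non-utf8') do |f|
  # Write the same bytes as the first line of the example file.
  f.binmode
  f.write("Characters: \xA6\xA6\xA6\n".b)
  f.flush

  # Read the file as raw bytes, then convert.
  converted = latin1_to_utf8(File.binread(f.path))
  puts converted.encoding         # => UTF-8
  puts converted.valid_encoding?  # => true
end
```

Reading with `File.binread` avoids Ruby guessing an encoding for the file before the conversion takes place.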
