Sometimes certain characters are not displayed in the GitLab UI. This can happen when GitLab does not assume the correct encoding for the file. How can we force GitLab to show specific characters?
GitLab uses CharlockHolmes for encoding detection. However, for GitLab to accept the encoding suggested by CharlockHolmes, the returned confidence has to be more than 50% (otherwise GitLab assumes UTF-8).
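To make the threshold rule concrete, here is a minimal sketch of the decision as described in this post. The helper name `encoding_to_assume` and the constant are hypothetical and this is not GitLab's actual code; it only takes a hash shaped like the one CharlockHolmes returns, so it runs without the gem installed.

```ruby
# Hypothetical helper illustrating the rule described in this post.
# NOT GitLab's actual implementation; the names are assumptions.
CONFIDENCE_THRESHOLD = 50

# detection: a hash shaped like the CharlockHolmes result, e.g.
#   { encoding: "ISO-8859-1", confidence: 50 }
def encoding_to_assume(detection)
  # No usable detection result: fall back to UTF-8.
  return "UTF-8" if detection.nil? || detection[:encoding].nil?

  # Strictly greater than: a confidence of exactly 50 is not enough.
  detection[:confidence].to_i > CONFIDENCE_THRESHOLD ? detection[:encoding] : "UTF-8"
end

puts encoding_to_assume({ encoding: "ISO-8859-1", confidence: 50 }) # prints UTF-8
puts encoding_to_assume({ encoding: "ISO-8859-1", confidence: 72 }) # prints ISO-8859-1
```

The two calls use the confidence values from the examples in this post: 50 falls back to UTF-8, 72 keeps the detected encoding.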
For example, this file has iso-8859-1 encoding:
$ file -I non-utf8
non-utf8: text/plain; charset=iso-8859-1
But after we push it to GitLab.com:
$ git add .
$ git commit -m "add non-utf8 file"
[master f70d62a] add non-utf8 file
1 file changed, 1 insertion(+), 1 deletion(-)
$ git push
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Writing objects: 100% (3/3), 258 bytes | 258.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To gitlab.com:mhuseinbasic/encoding-test.git
aea6216..f70d62a master -> master
We do not see special characters in the UI.
The reason is that the confidence returned by CharlockHolmes is not higher than 50%:
$ irb
2.4.4 :001 > require 'charlock_holmes'
=> true
2.4.4 :002 > content = File.read('non-utf8')
=> "Characters: \xA6\xA6\xA6\n"
2.4.4 :003 > CharlockHolmes::EncodingDetector.detect(content)
=> {:type=>:text, :encoding=>"ISO-8859-1", :ruby_encoding=>"ISO-8859-1", :confidence=>50, :language=>"en"}
Now let's add some more content to the file to increase the returned confidence, and push the change:
$ echo "#Boost the confidence." >> non-utf8
$ file -I non-utf8
non-utf8: text/plain; charset=iso-8859-1
$ irb
2.4.4 :001 > require 'charlock_holmes'
=> true
2.4.4 :002 > content = File.read('non-utf8')
=> "Characters: \xA6\xA6\xA6\n#Boost the confidence.\n"
2.4.4 :003 > CharlockHolmes::EncodingDetector.detect(content)
=> {:type=>:text, :encoding=>"ISO-8859-1", :ruby_encoding=>"ISO-8859-1", :confidence=>72, :language=>"en"}
2.4.4 :004 >
$ git add .
$ git commit -m "Increasing confidence"
[master d79e001] Increasing confidence
1 file changed, 1 insertion(+)
$ git push
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Writing objects: 100% (3/3), 286 bytes | 286.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To gitlab.com:mhuseinbasic/encoding-test.git
f70d62a..d79e001 master -> master
Even though the file still has iso-8859-1 encoding, the returned confidence is now higher than 50% and the special characters are visible in the UI. This way we can force the correct encoding to be used for displaying our content.
If we do not want to modify the file as described above, our only other option is to change the encoding of the file to UTF-8:
$ head -1 non-utf8 > temp
$ file -I temp
temp: text/plain; charset=iso-8859-1
$ iconv -f ISO-8859-1 -t UTF-8 temp > utf8
$ file -I utf8
utf8: text/plain; charset=utf-8
$ git add utf8
$ git commit -m "change encoding to utf8"
[master a4e23f3] change encoding to utf8
1 file changed, 1 insertion(+)
create mode 100644 utf8
$ git push
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 286 bytes | 286.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To gitlab.com:mhuseinbasic/encoding-test.git
d79e001..a4e23f3 master -> master
After this, special characters are still visible.
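For anyone who prefers to do the conversion in Ruby rather than with `iconv`, a minimal sketch follows. The helper name `latin1_to_utf8` is hypothetical, and the sample bytes are the first line of the file shown in the irb output above.

```ruby
# Hypothetical helper: reinterpret raw bytes as ISO-8859-1 and transcode to UTF-8.
# Roughly what `iconv -f ISO-8859-1 -t UTF-8` does for this file.
def latin1_to_utf8(bytes)
  # dup so the caller's string is not relabelled in place.
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Same bytes as the first line of non-utf8 (0xA6 is the broken bar in ISO-8859-1).
sample = "Characters: \xA6\xA6\xA6\n".b
converted = latin1_to_utf8(sample)

puts converted.encoding        # prints UTF-8
puts converted.valid_encoding? # prints true
```

To convert a real file, read it with `File.binread` and write the result back with `File.write`.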
Have you found a way to make GitLab use a particular charset encoding instead of trying to detect it with CharlockHolmes? Probably like you, I need the Windows-1250 charset to be forced in GitLab’s UI.
Hello. Same problem here. We are maintaining Windows-1252 source code in a GitLab EE private repository, and converting our files to UTF-8 is not an option as we need to keep them in their legacy encoding.
CharlockHolmes and the ICU library behind it are doing a terrible job, as they do not detect the correct encoding for most of our files.
Please consider a settings option using .gitattributes so that we can force the repository to work with Windows-1252 encoding.
We use Visual Studio Code and it handles this perfectly in its settings. All files are opened as Windows-1252, as configured, unless the file has a BOM (e.g. UTF-8 with BOM), in which case VS Code knows it should open it as UTF-8. It should be as simple as this.
Please do something, as this is not acceptable in such a popular tool as GitLab, especially when paying for it.
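The editor behaviour described above (a configured default encoding, overridden only by a UTF-8 BOM) can be sketched in a few lines. This is only an illustration of the rule; the helper name `decode_with_default` is hypothetical and it is neither VS Code's nor GitLab's code.

```ruby
# Hypothetical sketch of "configured default unless a BOM says otherwise".
UTF8_BOM = "\xEF\xBB\xBF".b

def decode_with_default(bytes, default = Encoding::Windows_1252)
  raw = bytes.b # binary copy, so the caller's string is untouched
  if raw.start_with?(UTF8_BOM)
    # UTF-8 BOM present: strip the 3 BOM bytes and treat the rest as UTF-8.
    raw.byteslice(3, raw.bytesize - 3).force_encoding(Encoding::UTF_8)
  else
    # No BOM: trust the configured default and transcode to UTF-8.
    # A few byte values are undefined in Windows-1252; replace them instead of raising.
    raw.force_encoding(default).encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
end

puts decode_with_default("caf\xE9".b)                  # Windows-1252 input
puts decode_with_default("\xEF\xBB\xBFcaf\xC3\xA9".b)  # UTF-8 input with BOM
```

No detection heuristics are involved here, which is the point of the request: the repository owner states the encoding once and only an explicit BOM overrides it.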
@remduv if you do have a paid version, you might have more success getting support by opening a ticket with GitLab directly, since this is a community forum: https://support.gitlab.com/hc/en-us/requests/new
Alternatively, if you wish for features to be added to GitLab, or for this to be looked into and fixed, then you may also wish to consider opening an issue here: Issues · GitLab.org / GitLab · GitLab
Thanks for answering
I agree, but looking at this forum and the GitLab support, this has been a known problem for 6 years now and it’s still in the same state.
There was a conversation in some old support tickets about offering a charset setting in .gitattributes.
It’s possible for newline characters, so I don’t understand why encoding cannot be supported in the same way.
Devs don’t really tend to read forums as they are busy coding, and generally they only work on issues raised on the project that have been allocated to them. Issues that have been voted up, with more and more people experiencing the problem or wanting the feature added, are likely to get looked at first, since the votes show exactly how many people want it fixed. Unfortunately, if nobody opens the issue and makes sure it is highlighted and people are aware of it, then it’s not really going to get looked at, because basically they don’t know about it.
This post only has three people who have encountered such an issue, and as such we don’t really know if there have been others or not. Had one of the posters opened an issue 6 years ago or whenever, chances are it may have been looked at, to improve it or change the way the encoding recognition works.
OK. Thanks. I will try to raise it through my company and hope other people will jump on board.
What I also don’t get is that Notepad++’s automatic encoding recognition works fine most of the time, but CharlockHolmes / ICU is too often wrong on the same files.