How To Deal With Unix Files LF And BULK INSERT

Preface: I was trying to import the GeoNames data into SQL Server and was going crazy trying to do it.  The most frustrating part was BULK INSERT and the end-of-line character.  I found two relatively good how-to posts, but neither of them got me all the way.  When you download the countries file from GeoNames, it contains Unicode data in Unix file format, which uses a line feed (LF or \n) as its end-of-line character.  In order to import it into SQL Server, you first need to convert it to UTF-16 encoding.  A good post by Johannes Kebeck uses NAnt to do this conversion and works very well.  The only issue with his post is that I couldn’t get the import wizard to work at all.  Meanwhile I found another good post by Ed Katibah using a different approach, taking advantage of EditPad (similar to Notepad++) and BULK INSERT.  He used EditPad to convert the file, but every time I tried, the application would fail 75% of the way through with memory issues; I had a little better luck with Notepad++, but instead of raising errors it simply never processed the whole file.  I found that using NAnt to convert the file was the fastest and most systematic approach.  This leads us to the issue of BULK INSERT, the topic of this post.
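If you don’t have NAnt handy, the same conversion can be sketched in a few lines of Python.  This is not Kebeck’s script, just a minimal stand-in that does the equivalent job; the file names in the example call are placeholders:

```python
# Streaming UTF-8 -> UTF-16 conversion for a large Unix-format data file.
def convert_utf8_to_utf16(src_path, dst_path, chunk_chars=1 << 20):
    # newline="" keeps the Unix LF line endings exactly as they are;
    # Python's "utf-16" codec writes a BOM plus UTF-16 LE text, which is
    # what BULK INSERT's 'widechar' mode expects.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-16", newline="") as dst:
        # Read fixed-size chunks so a multi-gigabyte allCountries.txt never
        # has to fit in memory (where the editor-based conversions failed).
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)

# Example (placeholder paths):
# convert_utf8_to_utf16("allCountries.txt", "allCountries-utf16.txt")
```

Because it streams rather than loading the whole file, it sidesteps the memory failures described above.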

Problem: I have a Unix data file where the fields are separated by tabs (\t) and each row ends with a line feed (LF).  BULK INSERT can be a finicky operation.  For this file, I attempted the following command.

BULK INSERT dbo.GeoNames
FROM 'F:\db\Import Data\Converted\allCountries.txt'
WITH
(
    DATAFILETYPE = 'widechar',  -- for Unicode
    FIELDTERMINATOR = '\t',     -- tab
    ROWTERMINATOR = '\n'        -- LF
)

When I ran this I got the error shown below.  At this point I started searching for answers, because it doesn’t get any simpler than this.  One post said I had to convert the file to Windows format, which uses CR + LF (carriage return + line feed, or \r\n) as the line ending.  I sort of got this working, but I couldn’t systematically convert the whole file.

Msg 4864, Level 16, State 1, Line 1
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 19 (modification_date).
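Before fighting the terminator settings, it helps to confirm what the file actually contains.  A small diagnostic sketch (my own helper, not part of any of the referenced posts) reports the byte-order mark and counts CRLF versus bare-LF endings in the first chunk of the file:

```python
# Inspect a sample of the file: report any Unicode BOM and count
# CRLF vs. bare-LF line endings, so you know which ROWTERMINATOR
# (and DATAFILETYPE) the file really needs.
def inspect_line_endings(path, sample_bytes=1 << 20):
    with open(path, "rb") as f:
        data = f.read(sample_bytes)
    if data.startswith(b"\xff\xfe"):
        # UTF-16 LE: '\n' is the byte pair 0A 00, '\r\n' is 0D 00 0A 00
        bom, lf, crlf = "UTF-16 LE", b"\n\x00", b"\r\x00\n\x00"
    elif data.startswith(b"\xef\xbb\xbf"):
        bom, lf, crlf = "UTF-8", b"\n", b"\r\n"
    else:
        bom, lf, crlf = "none", b"\n", b"\r\n"
    crlf_count = data.count(crlf)
    lf_only_count = data.count(lf) - crlf_count  # every CRLF contains one LF
    return bom, crlf_count, lf_only_count

# Example (path from the statement above):
# inspect_line_endings(r"F:\db\Import Data\Converted\allCountries.txt")
```

A converted GeoNames file should report a UTF-16 LE BOM, zero CRLFs, and one bare LF per row.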

Solution:  The online documentation for BULK INSERT has a snippet of code near the bottom of the page that covers importing Unix-format files.  Below is essentially that snippet.  The row terminator is set to CHAR(10), which is LF (\n), seemingly the same terminator the statement above specifies.  The difference is in what actually reaches SQL Server: per the documentation, a literal ROWTERMINATOR = '\n' is treated as CRLF (\r\n), while concatenating CHAR(10) into a dynamic SQL command embeds a real line-feed character, which matches the bare-LF rows of a Unix file.

DECLARE @bulk_cmd varchar(1000)
SET @bulk_cmd = 'BULK INSERT dbo.GeoNames
FROM ''F:\db\Import Data\Converted\allCountries.txt''
WITH (DATAFILETYPE = ''widechar'',
      FIELDTERMINATOR = ''\t'',
      ROWTERMINATOR = ''' + CHAR(10) + ''')'
EXEC(@bulk_cmd)
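The distinction is easy to see outside of T-SQL.  In this Python sketch the two variables stand in for what each form of the statement hands to SQL Server (the strings are illustrative, not T-SQL itself):

```python
# ROWTERMINATOR = '\n' passes the two characters backslash + n, an escape
# sequence that SQL Server's bulk-load code maps to CRLF.  The dynamic-SQL
# form concatenates CHAR(10), splicing an actual line-feed character into
# the command text, so the terminator is a single real LF.
literal_terminator = r"\n"    # two characters: backslash, n
dynamic_terminator = chr(10)  # one character: a real line feed

# The assembled dynamic command therefore contains a genuine LF:
bulk_cmd = ("BULK INSERT dbo.GeoNames "
            "WITH (ROWTERMINATOR = '" + chr(10) + "')")
```

The inline statement never gives SQL Server a bare LF to match, which is why it fails against a Unix file while the dynamic version succeeds.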

1 comment:

Anonymous said...

I have tried the same with "+CHAR(10)" and "+NCHAR(10)" but still it is skipping the 1st Row.