How To Deal With Unix Files LF And BULK INSERT

Preface: I was trying to import the GeoNames data into SQL Server and was going crazy trying to do it.  The most frustrating part was BULK INSERT and the end-of-line (EOL) character.  I found 2 relatively good posts on how-to’s, but neither of them got me all the way.  When you download the countries file from GeoNames, it contains Unicode data in the Unix file format, which uses a line feed (LF or \n) as the end-of-line character.  In order to import it into SQL Server, you first need to convert it to UTF-16 encoding.  A good post by Johannes Kebeck uses NAnt to do this conversion and works very well.  The only issue with his post is that I couldn’t get the import wizard to work at all.  Meanwhile I found another good post by Ed Katibah using a different approach, taking advantage of EditPad (like Notepad++) and BULK INSERT.  He used EditPad to convert the file, but every time I tried, the application would fail 75% of the way through with memory issues; I had a little better luck with Notepad++, but instead of raising errors it simply never processed the whole file.  I found that using NAnt to convert the file was the fastest and most systematic approach.  This leads us to the issue of BULK INSERT, the topic of this post.
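If you don’t want to set up NAnt, the same conversion can be sketched in a few lines of C#.  This is just a sketch: the paths are placeholders for your own GeoNames download, and it assumes the source file is UTF-8.

```csharp
using System.IO;
using System.Text;

class ConvertToUtf16
{
    static void Main()
    {
        // hypothetical paths -- point these at your own files
        string source = @"F:\db\Import Data\allCountries.txt";
        string target = @"F:\db\Import Data\Converted\allCountries.txt";

        using (StreamReader reader = new StreamReader(source, Encoding.UTF8))
        using (StreamWriter writer = new StreamWriter(target, false, Encoding.Unicode))
        {
            // stream line by line so the whole file never sits in memory,
            // which is where EditPad and Notepad++ fell over
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.Write(line);
                writer.Write('\n'); // keep the Unix LF row terminator
            }
        }
    }
}
```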

Problem: I have this Unix data file where the data fields are separated by tabs (\t) and the end-of-line character is a line feed (LF).  BULK INSERT can be a very finicky operation.  So for this file, I attempted the following command.

BULK INSERT dbo.GeoNames
FROM 'F:\db\Import Data\Converted\allCountries.txt'
WITH
(
    DATAFILETYPE = 'widechar',  -- for Unicode
    FIELDTERMINATOR = '\t',     -- tab
    ROWTERMINATOR = '\n'        -- LF
)

When I ran this I got the error shown below.  At this point I started searching for answers, because it doesn’t get any simpler than this.  One post said I had to convert to the Windows file format, which uses CR + LF (carriage return + line feed, or \r\n).  I sort of got this working, but I couldn’t systematically convert my whole file.

Msg 4864, Level 16, State 1, Line 1
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 19 (modification_date).

Solution:  The online documentation for BULK INSERT has a snippet of code near the bottom of the page that talks about importing Unix-format files.  Below is basically that snippet.  The row terminator is set to CHAR(10), which is LF (\n), the exact same row terminator as in the statement above.  As it turns out, when you specify ROWTERMINATOR = '\n' as a literal, SQL Server implicitly treats it as \r\n; concatenating CHAR(10) into a dynamic command passes a true line feed instead, which is why this version works and the one above doesn’t.

DECLARE @bulk_cmd varchar(1000)
SET @bulk_cmd = 'BULK INSERT dbo.GeoNames
FROM ''F:\db\Import Data\Converted\allCountries.txt''
WITH (ROWTERMINATOR = ''' + CHAR(10) + ''')'
EXEC(@bulk_cmd)

Autocomplete jQuery and JSON Is Now Here

Autocompletion text boxes are one of those nifty little tools that, when used in the right setting, really add a great user experience.  Just recently I was trying to figure out how to return JSON from an autocomplete call using jQuery.  The old jQuery autocomplete plug-in only supported newline-delimited text responses, limiting what you could return.  So, hoping to find something better, I Googled it and, lo and behold, jQuery UI now incorporates the autocomplete feature using JSON results by default!  I believe this is supported as of jQuery UI 1.8.  There is a good blog post by Jörn Zaefferer talking about the work involved in migrating from the old jQuery autocomplete plug-in to the new method.
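As a sketch of what this looks like, here is the new widget wired to a JSON endpoint.  The /api/cities URL and the city field names are made up for illustration; the mapping helper is my own, not part of jQuery UI.

```javascript
// jQuery UI's autocomplete accepts items shaped like { label, value },
// so map whatever the server's JSON looks like into that shape first.
function toAutocompleteItems(cities) {
  return cities.map(function (city) {
    return { label: city.name, value: city.name };
  });
}

// Browser wiring (assumes jQuery and jQuery UI are loaded):
// $("#city").autocomplete({
//   minLength: 2,
//   source: function (request, response) {
//     $.getJSON("/api/cities", { term: request.term }, function (data) {
//       response(toAutocompleteItems(data));
//     });
//   }
// });
```

The callback form of the source option is what lets you hit any JSON endpoint and reshape the result before handing it to the widget.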

You can either download jQuery UI or reference it from the Google CDN.

DataContractJsonSerializer Versus JavaScriptSerializer : Changing Field Names

Oh my, where do I start: simplicity or contract inclusivity?  I generally like simplicity, which leans me towards using the JavaScriptSerializer, but I had some doubts when I kept reading that it was deprecated.  I finally came across Scott Gu’s response in this post by Aaron Lerch, which indicates the JavaScriptSerializer is un-deprecated.  So which to use?  Here are some of the pros and cons:

DataContractJsonSerializer

Pros:

  • You can control the property names when serializing.  This can be nice if you have JSON coming in from a third party and the fields are like “a” or “exp”.
  • WCF uses this for serialization/deserialization

Cons:

  • Does not map enumerations.  You get an exception when trying to deserialize into an enum.
  • From what I’ve seen, it’s slower than the JavaScriptSerializer.
  • You have to write more code to do a simple serialization.

JavaScriptSerializer

Pros:

  • Much easier to deal with.
  • Handles enumerations.

Cons:

  • Cannot control property names when serializing.

Looking at the cons for DataContractJsonSerializer, it would seem like you wouldn’t want to use it unless you were doing some WCF work.  It just so happened the other day that I was working on a couple of projects where I decided to use one serializer in one project and the other serializer in the second.  My thinking was that if you’re working with JSON, and the class at hand is somewhat internal and isolated to some component, use the simple JavaScriptSerializer.  You lose property name management, but so what, keep it simple.  On the other hand, the second project was one I thought I might expose publicly for other people to use.  In that case I wouldn’t want property names like “a” because it has no meaning when looking at the class definition.  So, to abide by the 3rd party JSON contract yet give others meaningful property names, I decided to use the DataContractJsonSerializer.  Below I give a simple example of pretend response JSON from some 3rd party vendor JSON API (let’s pretend a = auth token value, exp = expiration in seconds, and sid = social security ID).

Basic JSON structure from the 3rd party response:

{
    "token": {
        "a": "…",
        "exp": "123456789",
        "sid": "…"
    }
}
Using JavaScriptSerializer
public class Token
{
    public string a;
    public int exp;
    public string sid;
}

public class TokenResponse
{
    public Token token;
}

//
// code to receive and deserialize the token class from some web response
//
string json = ...; // our JSON data from the 3rd party

// create the serializer (System.Web.Script.Serialization)
JavaScriptSerializer serializer = new JavaScriptSerializer();
// return the deserialized object
return serializer.Deserialize<TokenResponse>(json);
This is very simplified pseudo code, but the key points are the simplicity of deserializing the response and the field names.  It’s only 2 lines of code to deserialize the object!  However, the JavaScriptSerializer does not allow us to modify the field names.  If this class were confined to a single class or method, I’d go for the simplicity of this approach.  On the other hand, if this is some response class that needs to be used throughout your application or used by other people, it would be really hard to understand the field/property names as they are.  For instance, what does the 3rd party’s “sid” mean: user security ID, user social security ID, server ID?  Who knows, unless you have the API document right there where you’re looking at the code.  Let’s take a different approach with the DataContractJsonSerializer, and pretend that our token classes are exposed to a broader audience.
Using DataContractJsonSerializer
[DataContract(Name = "token")]
public class Token
{
    [DataMember(Name = "a", IsRequired = true)]
    public string Value;
    [DataMember(Name = "exp", IsRequired = true)]
    public int SecondsBeforeExpiration;
    [DataMember(Name = "sid", IsRequired = true)]
    public string SocialSecurityNumber;
}

[DataContract]
public class TokenResponse
{
    [DataMember(Name = "token", IsRequired = true)]
    public Token Token;
}

//
// code to receive and deserialize the token class from some web response
//

string json = ...; // our JSON data from the 3rd party
// wrap the JSON in a stream for the serializer
using (MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(json)))
{
    // create the serializer (System.Runtime.Serialization.Json)
    DataContractJsonSerializer serializer = new DataContractJsonSerializer(typeof(TokenResponse));
    // return the deserialized object
    return serializer.ReadObject(ms) as TokenResponse;
}
Notice that there is a lot more annotating going on with our classes, and we had to create a MemoryStream to pass to the serializer?  This isn’t that big of a deal, but it’s a bit heavy if you just need a quick JSON serialization/deserialization.  I think that’s why the ASP.NET MVC framework uses the JavaScriptSerializer.  One thing to note, though, is that all the fields now have names that can be quickly understood, so that when you’re reading code or using IntelliSense, you don’t have to question what a field/property is if a developer forgot to add comments to it.

Implicit Operator in C#

It’s fun learning something new that gives you the feeling of “wow, that’s pretty cool”.  The other day I was searching for some code to use the Google Translator API.  I found some .NET project someone had written, but it didn’t work at all, and the documentation, ironically, had malformed sentences making it hard to understand.  The one thing I did see was a line like this: Language lang = "en-GB";.  At first glance I thought, that’s odd, how do they do that?  Then I looked at the metadata for the assembly class and noticed the class was using something called the implicit operator, a user-defined conversion operator.  There’s a really nice article on its usage on Code Project, Understanding Implicit Operator Overloading in C#.  Using this cool user-defined conversion, I wrote my own version of the Google translator API for .NET, which can be found on SourceForge, DotNetGoogleTranslator.
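A minimal sketch of how such a Language class might pull this off (the class shape and member names here are my guesses for illustration, not the actual library’s):

```csharp
public class Language
{
    public string Code { get; private set; }

    public Language(string code)
    {
        Code = code;
    }

    // the user-defined implicit conversion: lets a plain string
    // be assigned anywhere a Language is expected
    public static implicit operator Language(string code)
    {
        return new Language(code);
    }
}

// which is what makes this line legal:
// Language lang = "en-GB";
```

The compiler sees a string on the right, finds the implicit operator on Language, and inserts the conversion for you.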

Web Site Menus, What To Use?

I have to say that I appreciate all the many people out there who aid all of us non Web UI layout experts with tips, tricks, and how-to posts.  One of the things I have difficulty with, when laying out the menu structure for a site that has more than just a main menu, is what type of menu to use.  Some of the options may be a vertical or horizontal menu, a tree with +/- nodes, a drop-down menu, etc.  I had this project where, for some reason, they just wanted a list of hyperlinks on the page for the sub menu, but they had to shorten all the names to ~5-10 characters because there were so many; it was hideous.  My friend Aaron Schnieder recommended using a mega menu.  The first thing that came out of my mouth was “what the heck is a MEGA menu?”.  It sounds like some combo move from the 90s video game Killer Instinct.  He gave me a link to a demo of the mega menu.  I was amazed; it’s so clean and flexible.  Now, the question is: should I always use the MEGA MENU?  I think it depends.  In the case of the MSDN library, which has that side navigator with a gazillion possibilities, the mega menu wouldn’t fit, but if you have a decent sub menu that you want your users to see when hovering over the main menu, the mega menu fits like a glove.  The tutorial on Soh Tanaka’s blog is amazing; basically there are 4 steps:

  1. HTML layout for the main menu and sub menu - ul/li structures
  2. Sprinkle of CSS
  3. Dash of jQuery – uses the hoverIntent jQuery plug-in (very slick), but it can optionally be excluded
  4. Very minimal on-page-load initialization.
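The steps above boil down to something like the following sketch.  The markup, class names, and selectors are my own invention, not Soh Tanaka’s, and it assumes jQuery is loaded (plain hover is used here in place of the optional hoverIntent plug-in):

```html
<!-- 1. HTML: each main menu item wraps its own sub menu panel -->
<ul id="main-menu">
  <li>
    <a href="#">Products</a>
    <div class="mega"><!-- sub menu ul/li columns go here --></div>
  </li>
</ul>

<!-- 2. CSS: hide the panels until hover -->
<style>
  #main-menu li { position: relative; }
  #main-menu .mega { display: none; position: absolute; top: 100%; left: 0; }
</style>

<!-- 3/4. jQuery: show/hide the panel on hover -->
<script>
  $(function () {
    $("#main-menu > li").hover(
      function () { $(this).children(".mega").show(); },
      function () { $(this).children(".mega").hide(); }
    );
  });
</script>
```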