Scanning; OCR packages with non-Western character support (FineReader, Recognita, Cuneiform, etc.); SGML/XML formatting; TEI standard.
HW #4 (due end of Week 6): Scan one page from a dictionary in a non-Western script and use an OCR software package to convert it into text. Code the first ten entries in SGML using the TEI standard. E-mail both the plain text and the SGML-coded text to me.
Ample OCR (Optical Character Recognition) software is available on the Internet, and leading companies offer modules for Slavic languages. Take a look at the following pieces of software:
Take a look at Anic's dictionary below as an example of SGML coding.
XML file - an example:
Click here to see an XML document
Here is what is behind it. You need to have the following two files.
Save this as serbepic.xml:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE POEM SYSTEM "serbepic.dtd">
<POEM>
<TITLE>Marko Kraljevic i Vila Ravijojla</TITLE>
<AUTHOR><FIRSTNAME>Unknown</FIRSTNAME>
<LASTNAME>Unknown</LASTNAME></AUTHOR>
<LINE N="1">
<FOUR>Vino pije</FOUR><SIX>Kraljevicu Marko</SIX></LINE>
<LINE N="2">
<FOUR>A u Skadru,</FOUR><SIX>gradu bijelome</SIX></LINE>
</POEM>
Save this as serbepic.dtd (Document Type Definition):
<!ELEMENT POEM (TITLE, AUTHOR, LINE*)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (FIRSTNAME, LASTNAME)>
<!ELEMENT FIRSTNAME (#PCDATA)>
<!ELEMENT LASTNAME (#PCDATA)>
<!ELEMENT LINE (FOUR, SIX)>
<!ELEMENT FOUR (#PCDATA)>
<!ELEMENT SIX (#PCDATA)>
<!ATTLIST LINE N CDATA #REQUIRED>
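The relationship between the two files can be checked by hand in a few lines of Python. This is only a sketch: Python's standard xml.etree.ElementTree parser does not validate against a DTD, so the hypothetical helper check_poem below simply re-states the content model of serbepic.dtd as ordinary tests. The poem markup is copied from serbepic.xml with the XML declaration and DOCTYPE line left out.

```python
import xml.etree.ElementTree as ET

# Poem markup copied from serbepic.xml; the XML declaration and DOCTYPE
# line are left out because ElementTree does not validate against a DTD.
POEM_XML = """<POEM>
<TITLE>Marko Kraljevic i Vila Ravijojla</TITLE>
<AUTHOR><FIRSTNAME>Unknown</FIRSTNAME>
<LASTNAME>Unknown</LASTNAME></AUTHOR>
<LINE N="1">
<FOUR>Vino pije</FOUR><SIX>Kraljevicu Marko</SIX></LINE>
<LINE N="2">
<FOUR>A u Skadru,</FOUR><SIX>gradu bijelome</SIX></LINE>
</POEM>"""


def check_poem(root):
    """Return a list of departures from the content model in serbepic.dtd."""
    problems = []
    tags = [child.tag for child in root]
    # <!ELEMENT POEM (TITLE, AUTHOR, LINE*)>
    if root.tag != "POEM" or tags[:2] != ["TITLE", "AUTHOR"]:
        problems.append("POEM must start with TITLE, AUTHOR")
    if any(tag != "LINE" for tag in tags[2:]):
        problems.append("only LINE may follow AUTHOR")
    # <!ELEMENT AUTHOR (FIRSTNAME, LASTNAME)>
    author = root.find("AUTHOR")
    if author is not None and [c.tag for c in author] != ["FIRSTNAME", "LASTNAME"]:
        problems.append("AUTHOR must contain FIRSTNAME, LASTNAME")
    for line in root.iter("LINE"):
        # <!ELEMENT LINE (FOUR, SIX)> and <!ATTLIST LINE N CDATA #REQUIRED>
        if [c.tag for c in line] != ["FOUR", "SIX"]:
            problems.append("LINE must contain FOUR, SIX")
        if "N" not in line.attrib:
            problems.append("LINE is missing the required N attribute")
    return problems


root = ET.fromstring(POEM_XML)
problems = check_poem(root)
# Collect (line number, first half-line, second half-line) for each verse.
verses = [(l.get("N"), l.findtext("FOUR"), l.findtext("SIX"))
          for l in root.iter("LINE")]
print(problems)
for n, four, six in verses:
    print(n, four, "|", six)
```

A validating parser would do this checking automatically from the DTD. The manual version is shown only to make the meaning of each declaration concrete.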
Also, click here to see a formatted XML document. For that one, you need:
serbepic2.xml
<?xml version="1.0" encoding="windows-1250" standalone="no"?>
<?xml-stylesheet type="text/css" href="serbepic.css"?>
<!DOCTYPE POEM SYSTEM "serbepic.dtd">
<POEM>
<TITLE>Marko Kraljević i Vila Ravijojla</TITLE>
<AUTHOR><FIRSTNAME>Unknown</FIRSTNAME> <LASTNAME>Unknown</LASTNAME></AUTHOR>
<LINE N="1">
<FOUR>Vino pije</FOUR> <SIX>Kraljeviću Marko</SIX></LINE>
<LINE N="2">
<FOUR>A u Skadru,</FOUR> <SIX>gradu bijelome</SIX></LINE>
</POEM>
serbepic.dtd
<!ELEMENT POEM (TITLE, AUTHOR, LINE*)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (FIRSTNAME, LASTNAME)>
<!ELEMENT FIRSTNAME (#PCDATA)>
<!ELEMENT LASTNAME (#PCDATA)>
<!ELEMENT LINE (FOUR, SIX)>
<!ELEMENT FOUR (#PCDATA)>
<!ELEMENT SIX (#PCDATA)>
<!ATTLIST LINE N CDATA #REQUIRED>
and serbepic.css
POEM
{
background-color: gainsboro;
width: 100%;
}
LINE
{
display: block;
margin-bottom: 3pt;
margin-left: 0;
}
FOUR
{
margin-left: 3pt;
color: blue;
}
SIX
{
margin-left: 3pt;
}
AUTHOR
{
display: block;
color:white;
margin-bottom: 10pt;
margin-left: 0;
}
TITLE
{
color: red;
font-size: 20pt;
}
| Printed text | TEI SGML coded text |
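The step from printed text to TEI-coded text can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the sample line is invented (it is not taken from Anic's dictionary), the input layout "headword, part of speech, definition" is hypothetical, and entry_to_tei is a made-up helper name. The element names (entry, form, orth, gramGrp, pos, sense, def) come from the TEI dictionary tag set; real entries usually need more structure than this.

```python
from xml.sax.saxutils import escape


def entry_to_tei(line):
    """Wrap one OCR'd line of the (hypothetical) form
    'headword pos definition' in TEI dictionary elements."""
    # Split off the first two whitespace-separated fields; the rest
    # of the line is treated as the definition.
    headword, pos, definition = line.split(maxsplit=2)
    return (
        "<entry>"
        f"<form><orth>{escape(headword)}</orth></form>"
        f"<gramGrp><pos>{escape(pos)}</pos></gramGrp>"
        f"<sense><def>{escape(definition)}</def></sense>"
        "</entry>"
    )


# Invented sample line, for illustration only.
print(entry_to_tei("kuća f. house, home"))
```

OCR output is rarely this regular, so in practice each entry still has to be checked by hand after any automatic tagging.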
Regular expressions, in the form of wildcard searches, can be used even in Microsoft Word. Press Ctrl-H, click "More", and check "Use wildcards". For example, if you have the following sequence:
first second
and you use the following:
Find what: (<*>) (<*>)
Replace with: <second>\2</second> <first>\1</first>
you will get:
<second>second</second> <first>first</first>
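The same swap-and-tag replacement can be written with an ordinary regular expression. The following Python sketch is an approximation rather than an exact translation: Word's (<*>) is rendered here as \b(\w+)\b, which matches one whole word, and swap_and_tag is a hypothetical helper name.

```python
import re


def swap_and_tag(text):
    # \b(\w+)\b plays the role of Word's (<*>): one whole word per group.
    # \1 and \2 in the replacement refer to the first and second group.
    return re.sub(
        r"\b(\w+)\b \b(\w+)\b",
        r"<second>\2</second> <first>\1</first>",
        text,
    )


print(swap_and_tag("first second"))
```

As in Word, the groups are numbered from left to right in the search pattern, so \2 in the replacement is the second word found.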
More about regular expressions in various languages here