|
Post by Auburn on Jan 21, 2015 5:36:39 GMT
Sorry, I just keep posting away... Um... there are lots of little kinks to refine for this project. The programmers are ready to start importing dictionaries, and they let me know that the file format would be a CSV file separated by # signs. You can open the CSV file with Notepad and it'd look like this:

Word #IPA #Part of Speech #Meaning #Source
torech #____ #noun #lair, hole #PE17:89
ceven #_____ #noun #earth #VT/44:21,27
aeglos #____ #noun #1. snowthorn, a plant like furze (gorse)
2. icicle (aeg+loss) #VT??

How to make one using Excel
First you make a table like this, but better of course (this is just a mockup). Then:
1) Go into your Control Panel (if you have Windows)
2) Select Clock, Region and Language
3) Click on "Region and Language" and a popup box will show up
4) In the "Formats" tab, click on "Additional Settings" down below
5) In this screen, find "List Separator" and you will see a comma. Replace the comma with the # symbol.
6) Save & save again.
7) Now back in Excel, go to "Save As" and in the file type selection, select "CSV (Comma Delimited)"
8) Done!

There is only one problem with this, though: I haven't figured out how to make it save IPA symbols properly. When I save and open a file back up in Excel, the special characters turn into question marks. The information is lost, so the web browser would just display question marks too. Looking into this atm, but I think we may need to encode the dictionary using Unicode characters for all the IPA fields. Any help would be appreciated ^^;
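For what it's worth, the question-mark problem is likely an encoding issue: Excel's legacy "CSV (Comma Delimited)" export saves in the system codepage rather than UTF-8. A minimal sketch of writing and reading the #-delimited format with UTF-8 forced (the file name and the IPA transcriptions here are made-up placeholders, not attested values):

```python
import csv

# Hypothetical rows; the IPA transcriptions are invented placeholders.
rows = [
    ["Word", "IPA", "Part of Speech", "Meaning", "Source"],
    ["torech", "tɔrɛx", "noun", "lair, hole", "PE17:89"],
    ["ceven", "kɛvɛn", "noun", "earth", "VT/44:21,27"],
]

# Write with an explicit UTF-8 encoding so the IPA characters survive
# the round trip; Excel's default CSV export uses the system codepage.
with open("dictionary.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="#").writerows(rows)

with open("dictionary.csv", newline="", encoding="utf-8") as f:
    loaded = list(csv.reader(f, delimiter="#"))

print(loaded[1])  # ['torech', 'tɔrɛx', 'noun', 'lair, hole', 'PE17:89']
```

If the web side reads the file back as UTF-8 too, no characters should be lost, and the Region-and-Language workaround isn't needed at all for files produced this way.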
|
|
|
Post by Auburn on Jan 21, 2015 5:41:30 GMT
|
|
|
Post by Auburn on Jan 21, 2015 5:54:03 GMT
I may've found one! Let me know if this works for you guys:

Generate symbols with: ipa.typeit.org/ - First copy the characters from this link, or anywhere else you get them from.

Then convert them to Unicode with: rishida.net/tools/conversion/ - Paste them in the "Mixed Input" field there and click "convert". A little further down, where it says "Hexadecimal NCRs", I think that should be the proper HTML-ready encoding. That can be pasted into the Excel document for that word's IPA and saved out as a CSV the same way as before. We'd go back into Region and Languages and swap the # sign for the @ sign, since the NCRs themselves contain # signs. I think that's a solution....
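If it helps, the conversion that tool performs can be sketched in a few lines, assuming we only need to escape non-ASCII characters (the function name here is my own):

```python
def to_hex_ncrs(text: str) -> str:
    """Replace every non-ASCII character with an HTML hexadecimal
    numeric character reference (NCR), e.g. ɔ -> &#x254;."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text
    )

print(to_hex_ncrs("ɔrx"))  # &#x254;rx
```

A browser renders `&#x254;` back as ɔ, so the stored CSV stays plain ASCII and can't be mangled by Excel's save.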
|
|
|
Post by xandarien on Jan 21, 2015 10:06:36 GMT
Bleh... in all seriousness, I don't have the time to convert my dictionary from Word into that format if it will honestly involve doing what it looks like (copying and pasting every single entry, or at least converting the entire file from .txt) - it would take far too much work. I'll ask a friend if there's a way around it...
Edit - the aforementioned IT friend (software) has unfortunately told me, no there is no easier way off the cuff, and that what I thought it would involve is how I would have to do it. He's going to possibly make me a script that could automate it, but there's no guarantees.
|
|
|
Post by Auburn on Jan 21, 2015 21:00:57 GMT
|
|
|
Post by dreamingfifi on Jan 21, 2015 22:09:37 GMT
I made a Verb Dictionary, an updated version of Helge's "Suggested Conjugation of all known or inferred Sindarin Verbs", but with Neo-Sindarin verbs included. Currently it's in the form of an Access database, but I can export it into a TXT file pretty easily. In fact, I believe I put a copy of it into our Dropbox folder.
BTW, ask Didier Willis, the maker of the Dragonflame dictionary. That was built with a database, I believe. He might let you import it if you ask nicely.
Xandarien - Oh gosh, I know... I'm going through that with my name lists. I have to enter each one individually into the Database. Trying to automate it causes weird problems. You can see that happening in the Elfdict website. Because the entries were imported, not entered by hand, the script was fooled or confused by all kinds of things.
Though... I wonder... Maybe this isn't the best way to go about this. Dictionaries are pretty specialized according to their uses. A Sindarin->English dictionary is for translating into English. An English->Sindarin dictionary is for translating into Sindarin. A Verb-Conjugation or a Noun-Case&Number dictionary is for helping beginners with the hard part of learning a language - memorizing wordforms. Etymological dictionaries show the history of the words.
Maybe a better way would be to have dictionaries be organized individually to streamline their usage, and that should be handled by dictionary makers themselves. Also, I think that the dictionaries should be regular translations made through the software. So, instead of translating a passage, you'd have:
[orch] = [orc](split)[goblin]
menus: [Part of Speech --> Noun]
IPA version: [ɔrx]
annotation: [Attested forms:
Orch (sources)
Orc - This appears to be a Gondorian dialect's version of the word. (sources)
Yrch - The plural form used in Lothlórien. (sources)
Erch - From Late Noldorin, an alternate plural form given in the Etymologies. (sources)]
Then, by each word, you'd have "Search This document for this word" and "Search this wing of the library for this word" options, then you'd be able to see the words used in context, or all of the other words that could mean the same thing.
Dictionaries are difficult to maintain. Languages are constantly changing. The "Search This Wing of the Library for this word" option is a much more organic way to build dictionaries, and it isn't prescriptive (meaning that you have an authority commanding that words be used in certain ways). By letting their usage in context define the terms, it's descriptive, which is a more scientifically sound way to gather linguistic data. This could be a valuable tool for people trying to understand words' meanings, uses, and their changes over time.
Question - Are we wanting this tool to do the translations for people, or do we want people to do the translations then explain what they mean in another language in detail for the readers?
|
|
|
Post by Auburn on Jan 21, 2015 22:46:06 GMT
We'd want fluent people to do the translations themselves, so the software is more of a "composing" software, not a translator like Google Translate. It'd be awesome if it could also translate, but that's a far more complex algorithm. Long answer: the only way I can see us building an actual custom translator is by building a sort of "idiom engine", so admins can add exceptions as they'd like, according to how idioms and exclusions actually translate. And on top of that, the translator would have to know all of the sentence-types of the language and be able to identify the sentence-type being inputted.
To do this, it would have to identify the part of speech of each word (in both languages) and link them up. So if it identifies a sentence structure as [article] [noun] [verb] [adverb] (i.e. "A dog ran fast"), then it could decipher how to arrange those four words in the other language and apply the appropriate grammatical elements, according to the morphological influences of that language (which, for Sindarin, would also mean registering exactly what type of verb/noun it is and deciding what appropriate mutation it needs to undergo due to context).
This means a dictionary would be imperative, because you'd have to know what part of speech each word belongs to, in as much detail as the dictionary needs to do its job. And we'd need dictionaries for both languages, to know what POS one word is and select the right effect to apply in the other language. And then you'd have to account for words that are sometimes verbs and sometimes nouns, etc. Context-based sentence algorithms. o.0 ....very quickly becomes a headache. >.<
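The pattern-matching step described above can be sketched as a toy. Everything here - the POS lookup, the template table, and the target-language word order - is invented purely for illustration; a real Sindarin engine would also need the mutation logic:

```python
# Toy sketch: look up each word's part of speech, match the sequence
# against a known sentence template, and reorder the words.
POS = {"a": "article", "dog": "noun", "ran": "verb", "fast": "adverb"}

# Hypothetical target language that fronts the verb: the list gives
# the source-word indices in target order.
TEMPLATES = {
    ("article", "noun", "verb", "adverb"): [2, 0, 1, 3],
}

def reorder(sentence: list[str]) -> list[str]:
    pattern = tuple(POS[w] for w in sentence)
    order = TEMPLATES[pattern]  # KeyError => unrecognized sentence-type
    return [sentence[i] for i in order]

print(reorder(["a", "dog", "ran", "fast"]))  # ['ran', 'a', 'dog', 'fast']
```

Even this toy shows where the headache comes from: every sentence-type needs its own template, and a word with two possible parts of speech would make `pattern` ambiguous.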
I do agree with you (like we said in Skype) that the dictionary isn't necessarily needed, and people can use the search function to pull up words in their context and decipher their meaning that way. But it doesn't hurt to have the dictionary feature there too - I still think it's worthwhile to add support for a dictionary because other people may have different preferences.

Y'know... what we could do, though, to simplify this process, is simplify the dictionary fields. For example, just have two fields, "Root" and "Description". Then in the description field we can dump everything from the IPA to the many sources, or how the sources relate to the meaning, and so forth. Come to think of it, other languages may need more fields than WORD/IPA/POS/MEANING/SOURCE, so maybe the PHP function can create as many fields as the CSV document itself has. So if we wanted to create a dictionary for Sindarin, it could look like:

@word @all other stuff here, sources, and so forth
@testing @this is just a test, related to PE17 pg.44, but reconstructed from VT/etc
@moretest @and so on and so on and so on and so on and so forth

We could do it using Xandarien's dictionary pretty easily, since a lot of the fields are meshed together this way. But it'd still give the functionality for other people to use more precise fields. Giving more control/options is always better, I think, for a tool that will be used by such a broad audience. What do you think?
|
|
|
Post by dreamingfifi on Jan 21, 2015 23:28:15 GMT
I think that it's worthwhile to give people a way to easily make, edit, download, and upload wordlists. That way we have a continually expanding and improving and updating database of dictionaries that are descriptive rather than prescriptive.
Instead of having set fields for the dictionaries (as there are many types of them with specialized uses), I think we should let people decide their own columns. That'd mean a feature that lets you add extra columns, and a feature that lets you rename them.
|
|
|
Post by Auburn on Jan 21, 2015 23:58:06 GMT
-nods- agreed. I'll let the programmers know to add that feature. And then, taking "import" and "export" into consideration, the software will read how many "@" signs are in each row and build the appropriate number of columns from the imported file. The first row can be deliberately reserved for the names of the columns. Likewise, exporting will be done with the "@" sign separating each column, for however many columns we have. And there we have it! A streamlined way to edit/make dictionaries of all sorts and sizes. The only constant parameter needed is the "root" (or some equivalent) so that it links/calls the right wordstring from the Entry page when you're selecting the root. Any other fields can be variable.

BTW - Xandarien, I compressed all the rows of your dictionary into 2, and this is the result: www.dropbox.com/s/5nwstgx98dhdjvt/Sindarin-English%20Dictionary%20-%20SemiComplete%201_1x2.xlsx?dl=0

I think it should be mostly accurate, though some extra spaces or extra quote marks may be here and there. Let me know how it looks to you.
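The import rule described above (first row reserved for column names, "@" separating the columns, however many there are) could be sketched like this - the function name and sample data are mine, not the actual PHP the programmers will write:

```python
def import_wordlist(text: str) -> list[dict[str, str]]:
    """Parse an @-delimited wordlist where the first row names the
    columns; rows may have as many columns as the file defines."""
    rows = [
        [field.strip() for field in line.lstrip("@").split("@")]
        for line in text.strip().splitlines()
    ]
    header, *body = rows
    return [dict(zip(header, entry)) for entry in body]

# Hypothetical two-column wordlist in the format proposed above.
sample = """@word @description
@testing @this is just a test
@moretest @and so on and so forth"""

entries = import_wordlist(sample)
print(entries[0])  # {'word': 'testing', 'description': 'this is just a test'}
```

Export is just the reverse: join each entry's fields with "@", writing the header row first, so round-tripping preserves whatever columns a dictionary maker chose.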
|
|
|
Post by xandarien on Jan 22, 2015 10:01:13 GMT
It's got a few errors where there are homonyms: the I, II, III markers that should stay with the word have often been lost into the second column.
If I find the time after I finish drafting the research paper I'm currently working on I will do it properly into the six columns. (The aforementioned script could only do what you just did, so never mind!)
You mentioned simplifying it into just the two, but then you've got the part of speech lumped into the English translation (or the Sindarin translation... oh, wait, you didn't do the Sindarin-English dictionary, I assumed it was the entire document at first. Nay bother!) and that doesn't look right to me.
Agreed with giving everyone the means to edit/update wordlists. I've tried to put every 'well used' reconstruction by the community as a whole in mine, but then I come across a text which will use a variant - eh, what's an example... oh aye! 'Tomorrow' is one I've seen several different versions of. So I have two forms, but then someone else will use a third, and I'll still understand it.
|
|
|
Post by dreamingfifi on Apr 1, 2015 18:09:50 GMT
I've been going through the dictionary, trying to figure out why words like "tîn" or "edhellen" don't appear to be in it... and it looks like the script got confused A LOT. Especially if there were any special characters involved, in the word or its description: it cuts the entry off at the point where the special character appears. Thus you get entries like "t" and descriptions like "(N. i". This will take someone's full-time attention for several weeks, possibly months, to go through and fix all of these. This is why I really don't like the idea of including a dictionary. It's far too much of a hassle.
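Rather than eyeballing every entry, a quick heuristic pass could at least flag the likely-truncated ones for review. The heuristics and the sample data here are my own guesses at what the damage looks like, based on the "t" / "(N. i" example above:

```python
def looks_truncated(word: str, description: str) -> bool:
    """Heuristic guess at entries the importer cut off at a special
    character: one-letter headwords, or descriptions whose
    parentheses never close."""
    return len(word) <= 1 or description.count("(") != description.count(")")

# Sample entries: the first mimics the broken "t" / "(N. i" case.
entries = [("t", "(N. i"), ("torech", "lair, hole (PE17:89)")]
flagged = [e for e in entries if looks_truncated(*e)]
print(flagged)  # [('t', '(N. i')]
```

It wouldn't fix anything by itself, but a flagged shortlist turns a months-long full read-through into a targeted cleanup.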
|
|
|
Post by Auburn on Apr 2, 2015 15:19:40 GMT
Yeah.. You could always just build it as you go along too. Adding missing roots. Or not have one at all.
|
|