[00:00] http://bettermotherfuckingwebsite.com/
[00:01] browsing via a cell phone connection is a pain, even if it is much faster than it was back in that time
[00:01] if only people still care enough to make sites lightweight
[00:06] *** MrRadar has joined #archiveteam-bs
[00:19] *** DoomTay has joined #archiveteam-bs
[00:40] *** BlueMaxim has joined #archiveteam-bs
[01:12] *** schbirid2 has joined #archiveteam-bs
[01:14] *** schbirid has quit IRC (Read error: Operation timed out)
[01:34] *** dashcloud has quit IRC (Remote host closed the connection)
[01:36] *** dashcloud has joined #archiveteam-bs
[01:45] *** Stiletto has quit IRC (Ping timeout: 246 seconds)
[01:47] *** kristian_ has quit IRC (Leaving)
[02:56] I go "Why is it so hard for me to go through the FTP uploads to ingest into archive?"
[02:56] And the answers are:
[02:56] - No context of what I'm looking at
[02:56] - Crap that is obviously going to be taken down within milliseconds
[02:56] - Zero metadata
[02:56] - Drag and Drop wonderment of "well, I'm done working on this, give it to jason"
[02:57] So there we go.
[02:58] yep, makes sense
[03:00] Some stuff has been in there for north of a year.
[03:00] Time to get mean.
[03:01] And since this channel is apparently able to sustain the profound retardation of DoomTay, it can handle me open-calling the stuff I'm seeing on the FTP page, and going from there.
[03:02] First up, 3D Lemmings CD-ROM I had. Easy to do.
[03:04] Next, "Bally Alley", another of mine.
[03:04] I see. These are a bunch of one-page letters between Bally developers.
[03:04] And other sets.
[03:06] https://archive.org/details/ballyalley?and[]=bally%20alley
[03:06] I'm going to combine all the letters into one object.
[03:07] archive.org/details/
[03:07] Various_Bally_Developer_Related_Letters
[03:11] OK, the rest are letters that I'm more than happy to deal with in this fashion.
[03:11] So they're getting uploaded now, and will be in ballyalley
[03:15] b-alley
[03:17] *** Stiletto has joined #archiveteam-bs
[03:20] OK, they're all in.
[03:21] https://archive.org/details/ballyalley?sort=-publicdate will show them populating as they go in.
[03:21] NEXT
[03:21] "Cinemageddon"
[03:22] So, basically, someone is robin-hooding me movies from a tracker.
[03:23] Oh, and magazines.
[03:23] OK, well, magazines first.
[03:23] SCREEN# for each in *.pdf
[03:23] > do
[03:23] > DELETE=1 /0/SCRIPTCITY/appleway "$each"
[03:23] > done
[03:27] SketchCow: "someone"?
[03:27] I thought the rule was "one thing=one item" or something like that
[03:30] generally.
[03:38] This might not actually become relevant until the fall, but let's say I find a scanner with OCR and I want to use it to scan the magazine
[03:39] With the intention of uploading that scan to archive.org
[03:39] Would it be better to use the scanner's OCR, or let archive.org do the OCRing
[03:59] Do we know what kind of software archive.org uses for OCR?
[04:04] Screen magazine uploads still going.
[04:04] (Lot of issues)
[04:05] "abbyy finereader 8.0" https://archive.org/post/386344/how-to-use-ocr-on-this-site
[04:06] Oh. I've used ABBYY before, from what I've seen it's pretty good.
[04:06] https://archive.org/details/magazine_rack?sort=-publicdate
[04:06] Screen's showing up
[04:07] Meanwhile, the scanner at university I will be using also has OCR, but I don't know what software it uses or how it compares
[04:10] Do you know what kind of scanner it is?
[04:11] Only that it's a KIC book scanner
[04:16] abbyy is some of the best available ocr software
[04:17] In that case, my choice is clear
[04:27] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:32] *** Stiletto has quit IRC (Ping timeout: 246 seconds)
[04:33] *** Sk1d has joined #archiveteam-bs
[04:46] *** Stiletto has joined #archiveteam-bs
[04:49] If you're scanning something with Chinese characters you might be better off doing the OCR yourself.
[04:52] one does not simply type chinese
[04:53] Ha, I won't have to worry about that. These are all English language magazines
[04:53] I guess you could use one of those applications to draw them
[04:53] it'd be super tedious :p
[04:56] of course you can type in Chinese or Japanese; that's what pinyin/romaji and IMEs are for
[04:56] there's also plenty of OCR software that recognizes hanzi/kanji
[04:58] yipdw: yeah, I know about the phonetic typing. But it wouldn't work if you can't read the characters
[04:58] in order to type them phonetically, that is
[04:58] always possible to learn
[04:58] also OCR
[04:58] there's also the SKIP method which can be handy
[04:59] and http://tsukurimashou.osdn.jp/idsgrep.php.en
[05:01] OK, screen is REALLY showing up
[05:01] I'll do the covers later.
[05:01] Back to more cinemageddon
[05:01] I'm going to kill this FTP
[05:02] Cinemageddon# for each in *
[05:02] > do
[05:02] > /0/SCRIPTCITY/cdway "$each"
[05:02] > done
[05:02] Pumping in three movies from Cinemageddon
[05:02] Probably doomed, down within a week
[05:02] 3gb of movies
[05:04] SketchCow: how are you finding these FTP servers?
[05:04] :P
[05:04] there's a project
[05:04] No, this is not "FTP servers"
[05:05] This is "I have an FTP site that people can upload to to have me ingest them into the archive" combined with "Some people have done a good job uploading items in a clear and concise fashion and others have basically dropped stuff into a disorganized shitpile"
[05:05] With a twist of "Fuck it, I will not sleep tonight until I murder this collection"
[05:06] nice :p
[05:06] Would you rather have a "disorganized shitpile" or less stuff?
[05:07] How about people put a bit of effort and label things properly
[05:07] that too is an option isn't it
[05:08] i guess
[05:08] do people ever upload copyrighted stuff into the FTP server?
[05:09] Oh you are adorable
[05:09] huh?
[05:09] < hook54321> Would you rather have a "disorganized shitpile" or less stuff?
[05:09] This is how people get abusive partners, by the way
[05:10] OK, Cinemageddon stuff uploaded.
[05:10] NEXT
[05:10] SketchCow: people get abusive partners by uploading shit into an FTP server? :P
[05:11] whoosh
[05:11] Yeah
[05:11] bedtime for me
[05:11] It's OK, I don't need validation.
[05:12] ytho.jpg
[05:14] NEXT
[05:14] "DPRK Stuff"
[05:15] is there some stuff on the Kwangmyong in that heap
[05:15] that'd be badass
[05:16] What it APPEARS to be is a disorganized pile of shit.
[05:16] It's been sitting in this directory since January
[05:16] yipdw: I've got videos of dancing North Korean soldiers somewhere
[05:16] -rw------- 1 wacko wacko 6469990 Dec 28 2015 cds.zip
[05:16] -rw------- 1 wacko wacko 7676 Dec 28 2015 certs.zip
[05:16] -rw------- 1 wacko wacko 62790054 Sep 21 2014 daesong_towel.rar
[05:16] -rw------- 1 wacko wacko 487424 Sep 21 2014 ftp.doc
[05:16] -rw------- 1 wacko wacko 467984104 Dec 28 2015 gol.zip
[05:16] -rw------- 1 wacko wacko 226936053 Sep 21 2014 item_2.zip
[05:16] -rw-r--r-- 1 root root 21 May 26 2015 item_2.zip.txt
[05:16] -rw------- 1 wacko wacko 21037552 Sep 21 2014 kiyctc.zip
[05:16] -rw-r--r-- 1 root root 42466 May 26 2015 kiyctc.zip.txt
[05:16] -rw------- 1 wacko wacko 190081047 Sep 21 2014 korfilm.zip
[05:17] -rw-r--r-- 1 root root 22 May 26 2015 korfilm.zip.txt
[05:17] -rw------- 1 wacko wacko 1161564 Sep 21 2014 naenara_usertable.rar
[05:17] -rw------- 1 wacko wacko 16839853 Sep 21 2014 rodong.zip
[05:17] -rw-r--r-- 1 root root 53681 May 26 2015 rodong.zip.txt
[05:17] -rw------- 1 wacko wacko 24090843 Sep 21 2014 vok_and_gnu.zip
[05:17] It's one gig of material.
[05:17] oh great
[05:17] Fun
[05:17] DPRK_stuff# for each in *.zip; do unzip -l $each >${each}.txt; done
[05:17] End-of-central-directory signature not found. Either this file is not
[05:17] a zipfile, or it constitutes one disk of a multi-part archive. In the
[05:17] latter case the central directory and zipfile comment will be found on
[05:17] the last disk(s) of this archive.
[05:17] unzip: cannot find zipfile directory in one of item_2.zip or
[05:17] item_2.zip.zip, and cannot find item_2.zip.ZIP, period.
[05:17] End-of-central-directory signature not found. Either this file is not
[05:17] a zipfile, or it constitutes one disk of a multi-part archive. In the
[05:17] latter case the central directory and zipfile comment will be found on
[05:17] the last disk(s) of this archive.
[05:17] unzip: cannot find zipfile directory in one of korfilm.zip or
[05:18] korfilm.zip.zip, and cannot find korfilm.zip.ZIP, period.
[05:18] So three of the zips are bad
[05:18] This is what has slowed me up before.
[05:18] So fuck it. They die.
[05:18] DDOS them to death
[05:19] root@teamarchive0:/0/CDROMS/DPRK_stuff# grep -i Kwangmyong *.txt
[05:19] Nothin
[05:19] hm
[05:19] oh well
[05:19] isn't there a way to sometimes read bad zip files?
[05:19] Probably.
[05:19] Not going to do it
[05:20] This isn't some mysterious found .zip file on the bottom of a trunk
[05:20] This is somebody who uploaded this shit to me and did it wrong.
[05:20] You_had_one_job.gif
[05:21] /0/SCRIPTCITY/cdway "DPRK_Stuff_-_Web_Material_From_North_Korean_Sites"
[05:21] OK, doing it the hard way with DPRK_Stuff_-_Web_Material_From_North_Korean_Sites.
[05:21] What collection does this dump into?
[05:21] web
[05:22] What type of item is this? (texts, software, movies, audio...) (texts is default)
[05:22] data
[05:22] We're doing this the hard way.
[05:22] We're putting this into web.
[05:22] Woot.
[05:22] Going to use the filename DPRK_Stuff_-_Web_Material_From_North_Korean_Sites...
[05:22] There we go.
[05:22] http://memedad.com/memes/951513.jpg
[05:26] http://memedad.com/memes/951519.jpg
[05:29] DPRK done
[05:31] Now "Floppy Images"
[05:31] Includes notes, references an e-mail. Can't find e-mail
[05:33] Uploading it as is.
[05:33] Not really useful.
[05:34] Floppy_Disks_Collection_Various_Batch_One:
[05:34] uploading Floppy_Disks_Collection_Various_Batch_One/ERM - Goofy's Express (Copy).img: [################################] 2/2 - 00:00:00
[05:34] uploading Floppy_Disks_Collection_Various_Batch_One/MOD Sound Files #2.img: [################################] 2/2 - 00:00:00
[05:34] uploading Floppy_Disks_Collection_Various_Batch_One/AU Format Sound 1.img: [################################] 2/2 - 00:00:00
[05:34] uploading Floppy_Disks_Collection_Various_Batch_One/Batch One Notes.txt: [################################] 1/1 - 00:00:00
[05:34] uploading Floppy_Disks_Collection_Various_Batch_One/17th Annual Triad Competition Logo and Schedule.img: [################################] 2/2 - 00:00:00
[05:34] uploading Floppy_Disks_Collection_Various_Batch_One/MOD Sound Files #1.img: [################################] 2/2 - 00:00:00
[05:34] uploading Floppy_Disks_Collection_Various_Batch_One/ERM - Corrupt Unbranded White Floppy (Corrupt #1).img: [################################] 1/1 - 00:00:00
[05:34] uploading Floppy_Disks_Collection_Various_Batch_One/ERM - ICLOGO WMF Picture File.img: [################################] 2/2 - 00:00:00
[05:36] *** SmileyG has quit IRC (Read error: Operation timed out)
[05:41] *** Smiley has joined #archiveteam-bs
[05:46] *** DoomTay has quit IRC (Quit: Page closed)
[05:51] NEXT
[05:51] Podcasts
[05:51] Just figured out how I uploaded these, doing it so if it's already done it, it won't upload it, fixing it.
[05:52] https://archive.org/details/2005_podcastcoresample
[05:52] These are all items underneath
[05:58] https://archive.org/details/2005_podcastcoresample?sort=-publicdate
[06:08] Oh, it's going to be that for a while (two screens working on it)
[06:08] Going to open third screen
[06:09] NEWTON_Argonne# ls -l
[06:09] total 80152
[06:09] -rw------- 1 wacko wacko 82072984 Feb 26 2015 www.newton.dep.anl.gov.tar.gz
[06:09] No idea what this is.
[06:10] Figured it out.
[06:11] This is all in the wayback, but someone grabbed it.
[06:14] In it goes
[06:17] NOT-FULLY-UPLOADED-beos_haiku_stuff
[06:17] Someone trying to be helpful
[06:17] But
[06:21] -rw-r--r-- 1 wacko wacko 1638111662 Dec 12 2015 SWG_Media.zip
[06:21] I see this is nothing but Star Wars Galaxies Material.
[06:25] 28G ftp.hp.com_2012_softpaq_archive.tar
[06:25] Going up
[06:37] Now... now we are getting somewhere.
[06:41] Dents, dents are being made
[07:11] I am a dent. O_O
[07:36] *** BlueMaxim has quit IRC (Quit: Leaving)
[07:51] *** Honno has joined #archiveteam-bs
[08:14] Uploading continues.
[08:15] I now have 8 windows uploading from this inbox into the Archive.
[08:15] EVR Radio dregs, Podcasts, Minecraft Sets, Youtube grabs
[08:18] *** dashcloud has quit IRC (Read error: Operation timed out)
[08:22] *** dashcloud has joined #archiveteam-bs
[09:01] Now, I am going to bed. I have uploaded many gigabytes. The inbox is beginning to make sense (but there's still a huge pile left to go.)
[09:21] SketchCow: The Mark Levin Show didn't need to be moved to the podcast collection right away
[09:21] also the Podcast Collection causes alot of the my items to be downloadable in less your login
[09:32] even the kpfa podcast items are affected
[09:44] *** BlueMaxim has joined #archiveteam-bs
[09:47] SketchCow: btw the Electrical Workers will have to be getting its own collection: https://archive.org/search.php?query=subject%3A%22Electrical%20Workers%22%20uploader%3A%22slaxemulator%40gmail.com%22
[09:49] whats funny is the pdfs you moved don't have the same problem that i'm having with the podcasts collection
[10:26] *** kristian_ has joined #archiveteam-bs
[12:14] *** davidar has joined #archiveteam-bs
[12:48] *** BlueMaxim has quit IRC (Quit: Leaving)
[14:56] I have successfully flooded the incoming s3 queue for my accounts and have to wait for it to settle down!
[14:57] godane: Over time, the podcasts and magazine_rack collections will be gone through with a script and will give me or my scripts the option to find them and make collections for them.
[14:58] But moving it from "godane inbox" to "podcasts" or "magazine_rack" is at least a step in the right direction.
[15:23] My inbound queue is always full
[15:24] ok
[15:31] SketchCow: i also hope we can fix the items in the podcasts collections items to be downloadable
[15:32] cause otherwise no one else can download them
[15:36] so i'm up to 808k items
[15:36] btw my inbound queue is always full too
[16:05] *** DoomTay has joined #archiveteam-bs
[16:48] *** Simpbrain has quit IRC (Read error: Operation timed out)
[16:55] *** Simpbrain has joined #archiveteam-bs
[16:59] We are monsters
[17:27] *** VADemon has joined #archiveteam-bs
[17:54] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:57] *** dashcloud has joined #archiveteam-bs
[18:11] *** dashcloud has quit IRC (Read error: Operation timed out)
[18:14] *** dashcloud has joined #archiveteam-bs
[18:22] great to hear that someone is feeding cinemageddon into IA.
it is a treasure trove of movies and should be archived if possible
[18:30] It's OK to hear
[18:30] Someone piling endless directories into my FTP site with no metadata effort is not great
[18:32] totally
[18:32] they should have included the descriptions and imdb ids and everything else that the tracker offers
[18:33] i wonder if someone at that site would support a project like that
[18:54] *** Simpbrain has quit IRC (Remote host closed the connection)
[19:02] schbirid2: that was only a 8gb folder of Cinemageddon
[19:02] also i think the Screen pdfs are from Cinemageddon
[19:05] https://archive.org/details/Pirated_Copy_Man_yan_2004
[19:06] https://archive.org/details/scifibuzz_uk_sci-fi_special_1996.mkv
[19:15] *** tomwsmf has joined #archiveteam-bs
[19:26] i'm starting to upload my Google Books grab of PC Mag
[19:31] https://archive.org/details/PC-Mag-1982-02
[19:33] SketchCow: my idea is to release the google books version of PC Magazine so it could at some point get 'fixes' for bad scan pages or missing pages to be scanned for a proper release
[19:34] if anything else we have the google version and maybe a proper re-release from it later
[20:14] *** kristian_ has quit IRC (Leaving)
[20:21] *** fie_ has quit IRC (Read error: Connection reset by peer)
[20:47] *** RichardG_ has joined #archiveteam-bs
[20:47] *** RichardG has quit IRC (Read error: Connection reset by peer)
[21:41] *** jk[SVP] has quit IRC (Ping timeout: 244 seconds)
[21:41] *** jk[SVP] has joined #archiveteam-bs
[21:52] *** RichardG_ is now known as RichardG
[22:47] *** DoomTay has quit IRC (Quit: Page closed)
[23:05] *** fie has joined #archiveteam-bs
[23:31] *** Honno has quit IRC (Read error: Operation timed out)