[00:21] *** kristian_ has quit IRC (Leaving)
[01:13] Anyone have experience archiving Disqus fora? I count 55 for NPR after dropping those with "dev" or "stage" in the name.
[01:24] aschmitz: I was actually looking into this a bit already and I'll write up a GitHub gist about it in a minute. One note to preface all this: the comments will stay on Disqus for quite a while longer after NPR removes the embeds from their site. We probably don't need to rush on this.
[01:26] Yeah, it looked like that when I was digging into it a bit.
[01:34] aschmitz: Quickly threw this together, it has all the info I was able to gather: https://gist.github.com/r3c0d3x/ff33ff59bd2432a5a81a32669eb5a390
[01:47] *** HCross has quit IRC (Ping timeout: 246 seconds)
[01:47] *** HCross has joined #archiveteam-bs
[01:59] r3c0d3x: Cool, thanks. Added a bit in my fork: https://gist.github.com/aschmitz/19dfb67be5d0d71c74431074191062dc
[02:10] *** tomwsmf has quit IRC (Read error: Operation timed out)
[02:26] *** mr-b has left
[03:14] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[03:15] *** BartoCH has joined #archiveteam-bs
[04:09] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[04:12] *** BartoCH has joined #archiveteam-bs
[04:17] *** JesseW has joined #archiveteam-bs
[04:17] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:24] *** Sk1d has joined #archiveteam-bs
[04:35] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[04:35] *** HCross has quit IRC (Ping timeout: 246 seconds)
[04:35] *** HCross has joined #archiveteam-bs
[04:43] *** DFJustin has quit IRC (Ping timeout: 260 seconds)
[04:43] *** Meroje has quit IRC (Quit: bye!)
[04:44] *** Meroje has joined #archiveteam-bs
[04:53] *** DFJustin has joined #archiveteam-bs
[04:53] *** swebb sets mode: +o DFJustin
[05:05] *** DFJustin has quit IRC (Remote host closed the connection)
[05:10] *** DFJustin has joined #archiveteam-bs
[05:15] *** HCross has quit IRC (Read error: Operation timed out)
[05:15] *** HCross has joined #archiveteam-bs
[05:57] *** phuzion has quit IRC (Read error: Operation timed out)
[05:58] *** phuzion has joined #archiveteam-bs
[05:59] Intense Floppy Grabs continue
[05:59] That sounds like some kind of sex toy
[06:00] Buy "Intense Floppy Grabs" today for deep, sensual pleasure!
[06:05] *** phuzion has quit IRC (Read error: Operation timed out)
[06:05] *** sep332 has quit IRC (Read error: Operation timed out)
[06:05] *** midas1 has quit IRC (Read error: Operation timed out)
[06:07] just know there are sex toys that are sending data back to the company
[06:07] also this: http://www.dailydot.com/layer8/hackers-and-vibrators-oh-my/
[06:07] *** midas1 has joined #archiveteam-bs
[06:07] *** sep332 has joined #archiveteam-bs
[06:09] thank you for that
[06:10] you're welcome
[06:11] i remember reading something about that and couldn't find the exact article
[06:11] but that was close enough to it
[06:13] *** BlueMaxim has joined #archiveteam-bs
[06:13] *** phuzion has joined #archiveteam-bs
[06:23] turns out the sploid.gizmodo.com sitemaps were big
[06:23] i think about 10gb for all of it
[06:24] maybe it's 9gb
[06:24] but it's still big
[07:03] *** Honno has joined #archiveteam-bs
[07:11] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[07:31] *** REiN^ has quit IRC (Read error: Connection reset by peer)
[07:33] *** phuzion has quit IRC (Read error: Operation timed out)
[07:36] *** phuzion has joined #archiveteam-bs
[07:52] *** schbirid has joined #archiveteam-bs
[07:53] SketchCow, floppy grabs what?
[08:03] ...floppy disks?
[08:22] Apple II floppies
[08:27] Work on http://fos.textfiles.com/pipeline.html began
[08:27] Lots to do
[08:33] SketchCow: turns out we don't have all of gawker.com
[08:33] ?
[08:33] or kotaku or lifehacker
[08:33] Really.
[08:33] Why?
[08:33] dump sitemap
[08:33] Well, have they deleted it all now?
[08:33] http://kotaku.com/sitemap_bydate.xml?startTime=2008-11-01T00:00:00&endTime=2008-11-30T23:59:59
[08:34] no they have not deleted it yet
[08:35] i have noticed that sitemap by date acts weird
[08:35] but because i tested on maybe 2005 or 2006 sitemaps it looked like it had everything
[08:37] *** REiN^ has joined #archiveteam-bs
[08:41] ok the sitemap urls are funking with us
[08:41] kotaku.com for 2008-11 (one above) has 3034 urls
[08:42] but if you use gawker.com in its place you get 1971
[08:43] so when i say the sitemap acts weird it does act weird
[08:45] SketchCow: also i think archivebot went after gawker.com and other sites owned by gawker back in 2014 or 2015
[08:45] so my sitemap grab may just be incomplete
[08:48] even my sitemap grab of sploid.gizmodo.com is incomplete
[08:48] :'(
[08:48] Dust yourself off and go for it again
[08:49] i'm doing that now
[08:52] I'm watching classic movies and ripping Apple II disks, and both are going swimmingly.
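A rough sketch of how the month-by-month sitemap comparison above could be scripted, so that per-host counts (such as 3034 vs 1971 for 2008-11) can be checked for every month. The URL pattern is copied from the log; the helper names `last_day`, `monthly_sitemap_urls` and `count_locs` are made up for illustration, and counting `<loc>` elements assumes the sitemaps follow the usual sitemap XML layout.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; the URL pattern is from the log.

# Print the last day of a month, given a 4-digit year and 2-digit month.
last_day() {
    year=$1; month=$2
    case "$month" in
        04|06|09|11) echo 30 ;;
        02)
            # Gregorian leap-year rule.
            if [ $((year % 4)) -eq 0 ] && { [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; }; then
                echo 29
            else
                echo 28
            fi ;;
        *) echo 31 ;;
    esac
}

# Print one sitemap_bydate URL per month of the given year for the given host.
monthly_sitemap_urls() {
    host=$1; year=$2
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        printf 'http://%s/sitemap_bydate.xml?startTime=%s-%s-01T00:00:00&endTime=%s-%s-%sT23:59:59\n' \
            "$host" "$year" "$month" "$year" "$month" "$(last_day "$year" "$month")"
    done
}

# Count <loc> entries in sitemap XML read from stdin (assumes <loc> marks each URL).
count_locs() {
    grep -o '<loc>' | wc -l | tr -d ' '
}

# Possible use (needs network access, so it is left commented out):
# monthly_sitemap_urls kotaku.com 2008 | while read -r url; do
#     printf '%s %s\n' "$(curl -s "$url" | count_locs)" "$url"
# done
```

Running the same loop against two hosts and diffing the counts would show which months disagree, which is one way to tell whether a grab is incomplete or the sitemap itself is inconsistent.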
[08:52] curl -s 'http://gawker.com/sitemap_bydate.xml?startTime=2008-11-01T00:00:00&endTime=2008-11-30T23:59:59' | sed 's|><|>\n<|g' | grep 'http' | sed 's|.*http://|http://|g' | sed 's|.*https://|http://|g' | sed "s|||g" | sed 's|]]>||g'
[08:52] that's my code for grabbing the urls
[08:54] after 2006-01 is done i will try to set up my script to attack each month of 2006 for gawker
[09:02] *** BartoCH has joined #archiveteam-bs
[09:17] *** HCross has quit IRC (Ping timeout: 246 seconds)
[09:17] *** HCross has joined #archiveteam-bs
[09:34] *** GE has joined #archiveteam-bs
[09:34] *** HCross2 has quit IRC (Quit: Connection closed for inactivity)
[09:52] *** Selavi has quit IRC (Ping timeout: 260 seconds)
[09:53] *** Kksmkrn has joined #archiveteam-bs
[09:53] *** Kksmkrn has quit IRC (Connection closed)
[09:53] *** Kksmkrn has joined #archiveteam-bs
[10:00] *** Selavi has joined #archiveteam-bs
[10:09] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[10:14] someone may want to go after this https://twitter.com/antisec_ita/status/767856654486503424
[10:16] *** BartoCH has joined #archiveteam-bs
[10:23] *** divingk has quit IRC (ChatZilla 0.9.92 [Firefox 47.0/20160604131506])
[10:31] *** Kksmkrn has quit IRC (Quit: leaving)
[11:35] *** HCross has quit IRC (Ping timeout: 246 seconds)
[11:35] *** HCross has joined #archiveteam-bs
[11:44] https://www.reddit.com/r/Minecraft/comments/4z36un/mojangs_official_youtube_channel_was_suspended/
[11:44] stay classy, youtube
[11:47] lol.
[12:35] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[12:37] *** BartoCH has joined #archiveteam-bs
[12:57] *** GE_ has joined #archiveteam-bs
[12:59] *** GE has quit IRC (Ping timeout: 255 seconds)
[12:59] *** GE_ is now known as GE
[13:03] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:04] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:04] *** BartoCH has joined #archiveteam-bs
[13:27] *** beardicus has quit IRC (bye)
[13:28] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:31] *** beardicus has joined #archiveteam-bs
[13:35] *** beardicus has quit IRC (Client Quit)
[13:37] *** beardicus has joined #archiveteam-bs
[13:45] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:45] *** BartoCH has joined #archiveteam-bs
[13:46] *** wp494 has quit IRC (Read error: Operation timed out)
[13:47] *** dashcloud has joined #archiveteam-bs
[14:16] *** GE has quit IRC (Remote host closed the connection)
[14:42] *** tomwsmf has joined #archiveteam-bs
[14:47] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[15:15] *** wp494 has joined #archiveteam-bs
[15:18] *** JesseW has joined #archiveteam-bs
[15:25] *** JesseW has quit IRC (Read error: Operation timed out)
[15:34] *** BartoCH has joined #archiveteam-bs
[15:56] *** VADemon has joined #archiveteam-bs
[16:03] *** GE has joined #archiveteam-bs
[16:14] http://fos.textfiles.com/pipeline.html is OK but needs another run! Which it will get shortly.
[16:32] Does archive.org have a policy about which YouTube pages get saved?
[16:32] "Gangnam Style" was getting saved like 6x per day https://web.archive.org/web/*/https://www.youtube.com/watch?v=9bZkp7q19f0
[16:33] (it doesn't seem to have video data or any comments though)
[16:33] That's being worked on internally
[16:33] by "policy" I mean for auto-crawling
[16:33] ok
[16:42] *** HCross2 has joined #archiveteam-bs
[16:42] The goal is in the future it will deduplicate these.
[16:42] SketchCow: awesome!!
[16:43] hmm
[16:43] Is the video data being saved and just not rendered correctly, or plain not collected?
[16:43] SketchCow: you can remove extratorrent from there
[16:44] I'll do additional work after I finish the script.
[16:44] A little time to go
[16:44] ok
[16:51] *** irl has joined #archiveteam-bs
[16:51] SketchCow: hi
[16:51] Hiiiiiiiiii
[16:51] hiiiiiiiiiiiiiiiiiii
[16:51] hi.
[16:52] SketchCow: i hear you like manuals
[16:52] I do.
[16:52] cool
[16:52] I heard you like scanning them
[16:52] i have manuals
[16:52] and a scanner coming
[16:52] in the ebay post
[16:52] Try not to damage the originals too much and have a fantastic time.
[16:53] Scan at 600dpi TIFF files, put into either .ZIPs or into directories.
[16:53] X.25 interface cards, network simulation software, and other things relevant to internet engineering
[16:53] I can give you an FTP drop
[16:53] ok awesome
[16:53] so i don't go directly to IA?
[16:53] you'll help out with metadata maybe?
[16:53] you do your own metadata
[16:53] don't make SketchCow do it
[16:53] hehe
[16:53] it's not that hard
[16:54] got a link for how to do metadata in a nice format?
[16:54] also, i have some reel-to-reel tapes and 8" floppies
[16:54] I can give general information.
[16:54] like, how to type in the title and date and author?
[16:54] In a best case, it's:
[16:55] Title, date of creation, creator (company or individual), and then a capsule description.
[16:55] that seems reasonable
[16:55] so i'm not understanding FTP drop then, because that sounds like i'm creating an IA collection
[17:00] *** tomaspark has quit IRC (Ping timeout: 255 seconds)
[17:03] SketchCow: who should I contact if I need to change the payment method for an archive.org donation?
[17:12] mail info@archive.org
[17:12] irl: So there's two ways to upload
[17:12] You can upload yourself, or you can build up a pile of directories and I can give you an FTP drop and I shove them in.
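For the pile-of-directories route, one possible way to keep the metadata fields named above (title, date of creation, creator, description) next to the scans is a CSV skeleton with one row per directory, filled in by hand afterwards. This is only a sketch: the function name `metadata_skeleton` and the column layout, including the leading `identifier` column, are assumptions rather than anything agreed in the channel.

```shell
#!/bin/sh
# Sketch only: function name and column layout are assumptions, not from the log.

# Print a CSV skeleton with one row per subdirectory of the given directory.
# Directory names go into the first column unchanged, so this assumes they
# contain no commas, quotes, or newlines. The other fields are left empty.
metadata_skeleton() {
    root=$1
    echo 'identifier,title,date,creator,description'
    for dir in "$root"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        printf '%s,,,,\n' "$name"
    done
}

# Possible use:
# metadata_skeleton ./scans > metadata.csv
```

Plain files sitting next to the directories are ignored, so loose notes or checksums in the top-level folder do not end up as rows.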
[17:12] A collection can be made and you can work on it, but I can do that initial upload process using scripts. I find that helps for bulk uploaders.
[17:13] *** VerifiedJ has joined #archiveteam-bs
[17:21] SketchCow: ah ok cool
[17:21] so how should the metadata be done within the pile of directories?
[17:22] is there some json or yaml or something format?
[17:24] Whatever you're comfortable with, I can work with
[17:28] SketchCow: i could do a csv like http://internetarchive.readthedocs.io/en/latest/cli.html#modifying-metadata-in-bulk
[17:29] Entirely up to you. However you want.
[17:29] ok, just trying to find the easiest way for you
[17:55] SketchCow: daily sitemaps of gawker.com are happening
[17:55] Great
[17:56] i just hope the sitemap doesn't go crazy like before with the monthly ones
[17:59] Any amount of Gawker functioning right now is a gift.
[17:59] Or any of the properties.
[17:59] And when Univision steps in, it's going to be a bloodbath
[18:00] The Univision buy is so insane I'm assuming it's some corrupt reason we don't understand
[18:00] Or Denton invented some snow-job that Univision bought
[18:59] *** bzc6p has joined #archiveteam-bs
[18:59] *** swebb sets mode: +o bzc6p
[19:00] ErkDog: you reported that yahoo answers items get stuck. Do you use the new wget-lua?
[19:01] --version
[19:01] there is a new one from 20160530
[19:19] *** VerifiedJ has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[19:52] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:58] *** BartoCH has joined #archiveteam-bs
[21:04] *** HCross2 has quit IRC (Quit: Connection closed for inactivity)
[21:12] SketchCow: Is there a minimum education background requirement (other than experience) for jobs at the internet archive?
[21:13] you should apply
[21:13] http://archive.org/about/jobs.php
[21:19] I'm not located in California unfortunately. I would consider applying for many of them if I had more programming experience.
[21:19] then why are you asking?
[21:29] Wondering for possible jobs in the future, and not all of say that on-site presence is required.
[21:31] *all of them
[21:40] *** Honno has quit IRC (Read error: Operation timed out)
[22:07] *** schbirid2 has joined #archiveteam-bs
[22:10] *** schbirid has quit IRC (Read error: Operation timed out)
[22:13] *** whydomain has joined #archiveteam-bs
[22:15] PurpleSym: what design DIY book scanner did you make? (I'm considering https://linearbookscanner.org/ )
[22:25] * FalconK looks around
[22:25] hey, look! https://archive.org/details/cbcnews201607-201608
[22:26] :)
[22:26] that's a lot of hourly news
[22:33] *** RichardG has joined #archiveteam-bs
[22:40] I've got a cronjob pulling it down every hour
[22:42] *** schbirid2 has quit IRC (Read error: Operation timed out)
[22:45] *** schbirid2 has joined #archiveteam-bs
[22:46] I started doing it mostly because I thought it provided an interesting perspective on Trump, and I noticed CBC didn't keep a public archive of them
[22:46] huh
[22:46] they do *have* an archive of them
[22:47] not sure how to access it. probably in person.
[22:47] :|
[23:09] *** whydomain has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[23:16] *** JW_work1 has joined #archiveteam-bs
[23:18] *** JW_work has quit IRC (Read error: Operation timed out)
[23:23] *** RichardG has quit IRC (Read error: Operation timed out)
[23:38] *** rchrch has joined #archiveteam-bs
[23:45] *** kristian_ has joined #archiveteam-bs
[23:48] *** RichardG has joined #archiveteam-bs
[23:56] *** Stiletto has quit IRC (Ping timeout: 246 seconds)