[00:00] I certainly haven't ever seen it :P [00:00] it's enough of an edge case to not care about a misnaming in that case [00:00] *** serapeum has joined #archiveteam-bs [00:00] if it's accurate in 99% of the cases, that's better than confusing in 100% of the cases.. :) [00:01] fair enough, exactly :) [00:08] dan_: https://gist.github.com/joepie91/09aed84c45dc44967699 [00:09] a lot more consistent than the RFC [00:09] :P [00:10] aha yep, RFC had to deal with implementation-specific things tacked on over years though, so I sorta forgive it~ [00:13] dan_: heh, this is 14x, it has no excuse [00:13] :) [00:14] *** c_b2 has joined #archiveteam-bs [00:14] *** c_b has quit IRC (Ping timeout: 260 seconds) [00:16] *** c_b2 is now known as c_b [00:21] dan_: hm. is there an equivalent of HTTP 400/500 in IRC? [00:21] "some error that I don't have an error code for" [00:22] oh [00:22] 400 [00:22] heh [00:24] *** mistym_ has quit IRC (Remote host closed the connection) [00:24] *** mistym has joined #archiveteam-bs [00:25] *** primus has quit IRC (Read error: Operation timed out) [00:36] https://www.alien.net.au/irc/irc2numerics.html [00:36] all those conflicts :) [00:37] yep [00:37] that has been my goto numeric guide for a long time [00:37] lol [00:37] haha, (Last updated: Tue, 11 Jan 2005 22:30:30 GMT) [00:38] gotta love irc [00:38] and it's still accurate! heh [00:49] *** BlueMaxim has quit IRC (Quit: Leaving) [00:51] *** BlueMaxim has joined #archiveteam-bs [00:52] *** primus104 has quit IRC (Leaving.)
[01:27] *** schbirid2 has quit IRC (Read error: Operation timed out) [01:39] *** schbirid2 has joined #archiveteam-bs [01:45] *** wp494 has quit IRC (Read error: No route to host) [01:48] what do you guys think of BetaArchive [01:49] * kyan thinks they're jackasses, because they won't let other sites mirror their collection — a single point of failure for a valuable chunk of history, with a bureaucratic attitude [01:51] logchfoo: off [01:51] *** logchfoo has left [01:52] *** logchfoo starts logging #archiveteam-bs at Sun Mar 29 01:52:11 2015 [01:52] *** logchfoo has joined #archiveteam-bs [01:52] (sorry to interrupt, i wanted to remove ops from the log bot) [01:59] lol, wow: https://en.wikipedia.org/wiki/FoundationDB [01:59] On March 25, 2015 it was reported that Apple has acquired the company.[6] A notice on the FoundationDB web site indicated that the company has "evolved" its mission and would no longer offer downloads of the software.[7] [01:59] "ha ha fuck you now you can't download our software anymore that you've built your infra on" [02:00] looks like Apple may soon be joining Yahoo in the list of douchebag-acquisition companies [02:03] lol [02:03] looks like they evolved from "extend" to "extinguish" [02:04] Gotta acquire 'em all! 
[02:31] *** wp494 has joined #archiveteam-bs [02:57] *** vitzli has joined #archiveteam-bs [03:27] *** necenzura has joined #archiveteam-bs [03:53] *** necenzura has quit IRC (Quit: Page closed) [04:00] *** mistym has quit IRC (Remote host closed the connection) [04:04] *** dashcloud has quit IRC (Read error: Operation timed out) [04:11] *** dashcloud has joined #archiveteam-bs [04:12] *** aaaaaaaaa has quit IRC (Leaving) [04:24] *** mistym has joined #archiveteam-bs [04:30] *** vitzli has quit IRC (Quit: Leaving) [04:31] *** vitzli has joined #archiveteam-bs [04:59] *** Start_ has joined #archiveteam-bs [04:59] *** Start has quit IRC (Read error: Connection reset by peer) [05:06] *** brayden has joined #archiveteam-bs [05:11] *** c_b has quit IRC (Quit: c_b) [05:43] *** mistym has quit IRC (Remote host closed the connection) [05:49] https://www.youtube.com/watch?v=aOOE7KrrCpE [06:25] *** primus104 has joined #archiveteam-bs [07:13] *** edsu has joined #archiveteam-bs [07:20] *** john has joined #archiveteam-bs [07:21] Does wpull not support --no-clobber, despite --help listing it? [07:24] it's implemented [07:25] if you're writing WARCs you won't need to worry about it [07:25] Really? Because for me it downloads everything again. [07:25] And when I append --no-clobber it prints the usage and exits. [07:25] I built it from git master today. [07:26] use a stable version [07:26] master is generally good enough for use but I haven't been tracking it [07:27] Okay. [07:27] I thought it'd be one of those projects where git master is always the recommended version. [07:27] what gave you that impression [07:28] chfoo is generally pretty good about releases [07:28] http://wpull.readthedocs.org/en/master/changelog.html [07:28] Who doesn't like bleeding edge?
It should cut you, else it ain't good [07:28] and new [07:29] FWIW, we don't use no-clobber anywhere in archivebot [07:29] I don't know what options you're passing, but download twice is not the default behavior [07:29] Still doesn't work. [07:30] the list of options we pass is as follows: https://github.com/ArchiveTeam/ArchiveBot/blob/master/pipeline/archivebot/seesaw/wpull.py#L22-L57 [07:30] http_proxy="127.0.0.1:4444" wpull http://echelon.i2p/ --warc-file echelon.i2p --page-requisites --recursive --level inf --warc-max-size 5000000000 --no-clobber [07:30] That's what I'm trying. [07:30] clobber doesn't occur with WARC writing [07:31] so you don't need to specify it [07:31] the data goes right into the WARC [07:32] All right. [07:32] But it still downloads everything again. [07:32] are you seeing duplicate HTTP requests or files along with WARC records [07:33] Yes. [07:33] what the hell does that mean [07:34] It means, it requests files that are already in the warc archive. [07:34] if the request comes from a redirect, that'll happen [07:35] wpull operates on URLs, not files [07:35] at least when doing websitse [07:35] es [07:36] It's not just that, it will again fetch the robots.txt and index file too. [07:36] post the logs [07:36] All right. [07:37] http://sprunge.us/ZebB [07:38] that log looks normal, there's no duplicate fetches in there [07:40] are you resuming a stopped grab? [07:40] if so you need to record the results to a database with --database [07:41] otherwise wpull will use an in-memory database that goes away once the process exits [07:41] Oh… [07:41] So that's what that's for. All right. [07:41] http://wpull.readthedocs.org/en/master/usage.html#stopping-resuming [08:27] I must say, I'm very happy with the web archive's new design. ^_^ [09:10] *** schbirid2 has quit IRC (Leaving) [09:33] *** schbirid has joined #archiveteam-bs [09:33] anyone know a twitter bot that one can simply feed any text corpus to for funny markov chain tweets? 
all i found so far are based on your own tweet archive [09:36] *** vitzli has quit IRC (Quit: Leaving) [09:36] nvm, can't find the corpus i wanted to use as text anyways :( [11:49] i'm uploading Computer Power User 2014 pdfs [12:04] btw i'm also uploading Archival Outlook [12:04] from Society of American Archivists [12:05] i'm only doing that cause there is no collection of it on IA [12:06] and 2014 pdf are being put on bluetoad.org [12:08] https://archive.org/details/Archival_Outlook-2004-07 [12:27] *** primus104 has quit IRC (Leaving.) [12:39] john: it was originally (afaik) blood for the blood god [12:40] schbirid: you wanted to use the sweary one? [12:40] All right. [12:42] Smiley? [12:42] sweary corpus [12:45] *** lysobit has quit IRC (Quit: quit) [12:46] nah [12:49] aww [12:55] *** lysobit has joined #archiveteam-bs [14:29] so i found something called flightglobal.com [14:29] it has tons of Flight pdf [14:30] *** primus104 has joined #archiveteam-bs [14:32] i may have to convert pdf pages into one pdf [14:32] cause they put every page as its own pdf [14:35] Hmm… that's weird. [14:35] I thought the .com file extension was usually used for flat binary files. [14:39] *** primus104 has quit IRC (Leaving.) [14:49] john: a file extension is nothing but bytes [14:49] it doesn't define a file [14:49] it's just the name [14:49] I know. [14:50] Usually the header gives you a good idea, but even that can be deceiving.
[14:50] so it's probably an archive of a site named flightglobal.com :P [14:55] Oh… [15:19] *** underscor has quit IRC (Ping timeout: 370 seconds) [15:28] *** underscor has joined #archiveteam-bs [15:28] *** swebb sets mode: +o underscor [15:28] *** primus104 has joined #archiveteam-bs [15:36] so i finally figured out why kbs korea culture news stopped at the end of Jan 2003 [15:42] it looks like they just had high bit rate wmv between june 2002 to jan 2003 [15:42] btw i'm getting something called Classic Odyssey [15:45] anyone know of any very lenient regexes for matching URLs? [15:45] ie. not requiring the protocol [15:46] maybe even using a valid tld list.. [15:47] *** underscor has quit IRC (Ping timeout: 370 seconds) [15:47] *** brayden has quit IRC (Ping timeout: 606 seconds) [15:54] johtso: "valid TLD list" became infeasible since ICANN went overboard with gTLDs [15:55] technically speaking, 'hi' is a valid URL if you want to ignore the protocol [15:55] joepie91_, https://www.publicsuffix.org/list/effective_tld_names.dat [15:56] just grab that and compile it into your regex :) [15:57] yeah, no [15:57] there's a number of issues with that list and you probably don't want a regex that large [15:57] joepie91_, by URL I really mean publicly accessible web address [15:57] not to mention that this is NOT a complete list [15:57] yes [15:57] johtso: try ctrl+Fing that list for .onion [15:57] publicly accessible, just on a different network [15:57] not on the list [15:57] mm, okay [15:58] well, .onion wouldn't really be something I'd be looking for anyway ;) [15:58] really I'm trying to extract file locker urls, but for my first pass I want to make sure I don't miss anything [15:58] extract from [15:59] ? [15:59] the html content of blogger posts/comments [15:59] and can't rely on the links being in html markup [16:00] and why without the protocol?
[16:00] just guessing that there must be *some* links out there that are missing the protocol [16:01] I'd rather not miss them [16:05] just grab anything [\x21-\x7E-]+\.[\x21-\x7E-]+\/[\x21-\x7E-]+ [16:06] charscharschars [16:06] *** Start_ is now known as Start [16:06] you'll get a bunch of false positives I'm sure [16:06] but that's one HEAD away [16:06] sounds like a great idea, seeing as I'm not interested in bare urls [16:06] :P [16:06] *** Start has quit IRC (Disconnected.) [16:06] *** Start has joined #archiveteam-bs [16:06] *** Start has quit IRC (Remote host closed the connection) [16:06] *** Start has joined #archiveteam-bs [16:06] one HEAD away? :) [16:07] the problem is you might be too greedy, and grab a period at the end -> 404 [16:07] or you might NOT grab the period -> 404 [16:07] ideally, you'd get both variations, but I don't think you can do that with a regex [16:08] johtso: HEAD request [16:08] requests headers, not body [16:08] ah right! [16:08] so you get the status code [16:08] if it's 200, it's probably valid [16:08] yeah, see if they're alive [16:08] Sanqui: that's postprocessing :) [16:10] anyone have a good macro recording program so I can record me clicking a button, pressing a different button for a screenshot, and then closing any windows opened by my first button press? [16:11] OS? [16:14] either windows or linux [16:14] under linux, I'd be running the program under wine [16:16] simplescreenrecorder maybe? [16:16] i just use ffmpeg x11grab if i need to record something [16:16] ffmpeg -f x11grab -s 1280x800 -r 30 -i :0.0 -qscale 0 /tmp/x11grab4.mpg [16:16] oh ignore me [16:16] haha [16:17] dashcloud: windows, autohotkey [16:17] linux, nfi. I never do GUI automation other than some wmctrl hacks to make XBMC play nice with multiple monitors [16:17] :p [16:19] dashcloud, I haven't used it, but you might want to check out http://www.sikuli.org/ [16:21] thanks!
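For anyone following along later, the lenient URL matching discussed between 16:05 and 16:08 might be sketched roughly like this in Python. This is only an illustration under assumptions, not what anyone in the channel actually ran: the names `candidates` and `is_alive` are made up here, the character class is the printable-ASCII range from the 16:05 message, and protocol-less candidates are assumed to be reachable over plain HTTP. The "both variations" problem is handled outside the regex, by also keeping a copy of each match with trailing punctuation stripped.

```python
import http.client
import re
import urllib.error
import urllib.request

# Lenient pattern from the chat: a run of printable ASCII, a dot,
# another run, a slash, another run. No protocol is required.
CANDIDATE = re.compile(r"[\x21-\x7E]+\.[\x21-\x7E]+/[\x21-\x7E]+")


def candidates(text):
    """Return candidate URLs, each both with and without trailing punctuation."""
    found = []
    for match in CANDIDATE.findall(text):
        # Keep the greedy match and a copy with trailing punctuation stripped,
        # since either one could be the real address.
        for variant in (match, match.rstrip(".,;:!?)")):
            if variant and variant not in found:
                found.append(variant)
    return found


def is_alive(url, timeout=10):
    """Send a HEAD request and report whether the final status is 200."""
    if "://" not in url:
        # Assumption: candidates without a protocol are tried over plain HTTP.
        url = "http://" + url
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        # Any failure (404, DNS error, malformed candidate) counts as not alive.
        return False
```

Running `candidates` over the post text and then filtering with `is_alive` would be the postprocessing step mentioned at 16:08. On raw HTML the pattern will also swallow surrounding markup such as `href="`, so expect plenty of false positives before the HEAD pass.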
[16:27] *** brayden has joined #archiveteam-bs [16:48] have we grabbed the videos from joystiq yet? [16:49] it now redirects to engadget [16:52] I think godane did a lot of them [16:52] ok [16:55] *** underscor has joined #archiveteam-bs [16:55] *** swebb sets mode: +o underscor [17:03] i uploaded the tuaw videos to Jason's ftp [17:04] but joystiq videos i didn't grab all yet [17:05] oh [17:05] how much did you get? [17:07] i really don't remember how much i got [17:07] but i want to say 400 to 500 videos [17:07] also joystiq youtube channel still has all of the videos [17:08] that's a relief [17:09] Facebook is killing their XMPP API on April 30: https://developers.facebook.com/docs/chat [17:12] oh really, nice. [17:20] *** mistym has joined #archiveteam-bs [17:23] *** aaaaaaaaa has joined #archiveteam-bs [17:37] *** schbirid has quit IRC (Leaving) [17:37] *** schbirid has joined #archiveteam-bs [18:39] *** dashcloud has quit IRC (Read error: Connection reset by peer) [18:42] *** dashcloud has joined #archiveteam-bs [18:44] *** xtr-201 has quit IRC (Read error: Connection reset by peer) [18:49] *** aaaaaaaaa has quit IRC (Read error: Operation timed out) [19:14] https://www.youtube.com/watch?v=uPVQMZ4ikvM [19:17] ffs, git/github privacy leaking is ridiculous [19:18] if you have more than one account, you are bound to accidentally post with random ones every now and then [19:20] *** underscor has quit IRC (Ping timeout: 370 seconds) [19:24] *** underscor has joined #archiveteam-bs [19:24] *** swebb sets mode: +o underscor [19:30] *** BlueMaxim has quit IRC (Ping timeout: 512 seconds) [19:31] *** BlueMaxim has joined #archiveteam-bs [19:48] *** SN4T14__ has joined #archiveteam-bs [19:50] *** aaaaaaaaa has joined #archiveteam-bs [19:51] *** lytv has quit IRC (Read error: Operation timed out) [19:51] *** lytv has joined #archiveteam-bs [19:55] *** SN4T14_ has quit IRC (Ping timeout: 512 seconds) [20:18] *** schbirid has quit IRC (Leaving) [20:41] SketchCow: have they
received their nobel prizes? [20:42] They should! [20:43] http://www.engineering.com/DesignerEdge/DesignerEdgeArticles/ArticleID/9848/VIDEO-Introducing-a-Fire-Extinguisher-Fuelled-by-Sound.aspx [20:43] In fact, the Defense Advanced Research Agency (DARPA) developed a system back in 2012 that utilized sound to put out flames. [20:44] However, this marks the first time engineers have created an actual extinguisher using sound. [20:44] "Engineering seniors Viet Tran and Seth Robertson now hold a preliminary patent application for their potentially revolutionizing device. " [20:44] well, was nice while it lasted [20:45] Dude, inventors patent shit [20:46] yep, patents are killing innovation [20:46] *** SketchCow changes topic to: Archive Team: https://i.imgur.com/d9dPE6s.gif [20:47] *** dashcloud has quit IRC (Read error: Operation timed out) [20:50] I don't agree, but the latest rustling in the fuck drawer came up entry [20:50] empty [20:50] Also, roughly $2000 went out the door yesterday into bills and debt and I am not happy [20:56] *** dashcloud has joined #archiveteam-bs [21:06] *** mistym has quit IRC (Remote host closed the connection) [21:23] *** mistym has joined #archiveteam-bs [22:16] *** dashcloud has quit IRC (Read error: Operation timed out) [22:19] *** dashcloud has joined #archiveteam-bs [23:24] *** mistym has quit IRC (Remote host closed the connection) [23:24] *** mistym has joined #archiveteam-bs [23:24] *** mistym has quit IRC (Remote host closed the connection) [23:37] *** mistym has joined #archiveteam-bs [23:38] *** primus104 has quit IRC (Leaving.) [23:49] still getting 503 trying to upload to IA :( [23:50] hopefully they'll sort it out tomorrow