[00:00] *** serapeum has joined #archiveteam
[00:12] *** caber has joined #archiveteam
[00:13] *** cadbury_ has joined #archiveteam
[00:14] *** c_b2 has joined #archiveteam
[00:14] *** c_b has quit IRC (Ping timeout: 260 seconds)
[00:16] *** c_b2 is now known as c_b
[00:17] *** aNthraXx has joined #archiveteam
[00:17] https://www.youtube.com/watch?v=1L1EIUIjBfk
[00:20] *** NovaKing_ has joined #archiveteam
[00:24] *** mistym_ has quit IRC (Remote host closed the connection)
[00:24] *** mistym has joined #archiveteam
[00:25] *** primus has quit IRC (Read error: Operation timed out)
[00:25] [18:52] Is there a process to request adding a WARC to Wayback Machine?
[00:25] the process consists of: talk to SketchCow
[00:25] :P
[00:25] ref https://archive.org/details/pdp10nocrew https://archive.org/details/dmoz-rdf-20150327
[00:26] since pir^2 is apparently gone
[00:26] also cc Kazzy aaaaaaaaa garyrh
[00:26] in case somebody else asks
[00:42] *** NovaKing_ has quit IRC (Read error: Operation timed out)
[00:43] *** cadbury_ has quit IRC (Read error: Operation timed out)
[00:43] *** aNthraXx has quit IRC (Read error: Operation timed out)
[00:46] *** caber has quit IRC (Read error: Operation timed out)
[00:49] *** BlueMaxim has quit IRC (Quit: Leaving)
[00:51] *** BlueMaxim has joined #archiveteam
[00:51] *** caber has joined #archiveteam
[00:51] *** NovaKing_ has joined #archiveteam
[00:52] *** primus104 has quit IRC (Leaving.)
[00:53] *** cadbury_ has joined #archiveteam
[00:56] *** aNthraXx has joined #archiveteam
[01:17] *** SimpBrain has quit IRC (Ping timeout: 512 seconds)
[01:25] *** pir^2 has joined #archiveteam
[01:25] Another one for Wayback Machine - https://archive.org/details/debatesoireachtasie-XML
[01:26] pir^2: [00:25:40] [18:52] Is there a process to request adding a WARC to Wayback Machine?
[01:26] [00:25:48] the process consists of: talk to SketchCow
[01:26] I read that on the logs :P
[01:26] ah right, didn't know you were reading them :p
[01:27] *** schbirid2 has quit IRC (Read error: Operation timed out)
[01:33] *** pir^2 has quit IRC (Ping timeout: 370 seconds)
[01:38] *** pir^2 has joined #archiveteam
[01:39] *** schbirid2 has joined #archiveteam
[01:39] why has the dmoz one been deriving for hours?
[01:40] Kazzy, can you see this? https://archive.org/catalog.php?history=1&identifier=dmoz-rdf-20150327
[01:41] i see a task waiting for an admin
[01:41] "FATAL ERROR: no file-level CDXs found in this item; rerun to clear redrow and update files.xml"
[01:42] The derive hasn't started yet; see https://archive.org/catalog.php?whereami=1 to see where you are in the queue
[01:45] *** wp494 has quit IRC (Read error: No route to host)
[01:47] So I need to wait for an admin? How long does that take?
[01:55] *** pir^2 has quit IRC (Ping timeout: 370 seconds)
[02:31] *** wp494 has joined #archiveteam
[02:33] What up.
[02:37] i'm uploading more funny or die videos
[02:40] Excellent.
[02:42] also westworth military academy yearbooks are uploaded: https://archive.org/search.php?query=creator%3A%22Westworth+Military+Academy%22
[02:43] Where did those come from, anyway?
[02:45] wma.edu
[02:46] i was searching for pri the world and one of their pdfs came up
[02:47] Neat
[03:00] SketchCow: i would like some help getting the pri the world podcast
[03:00] i can't find anything before 2010-06-17
[03:01] i'm trying to find the full shows so we can have a collection on IA
[03:05] *** SoJa has quit IRC (Quit: Page closed)
[03:14] *** Ymgve has quit IRC ()
[03:20] *** necenzura has joined #archiveteam
[03:21] *** necenzura has quit IRC (Client Quit)
[03:21] *** necenzura has joined #archiveteam
[03:25] for real, rapidshare archiving?
[03:38] godane, this has some http://web.archive.org/web/*/http://media.theworld.org/*
[03:38] ex.
http://web.archive.org/web/20120912192455/http://media.theworld.org/audio/010209full.mp3
[03:43] i got all of those
[03:53] *** necenzura has quit IRC (Quit: Page closed)
[03:55] I'm adding so many weird cheats to this thing
[04:00] *** mistym has quit IRC (Remote host closed the connection)
[04:04] *** dashcloud has quit IRC (Read error: Operation timed out)
[04:07] https://ia902502.us.archive.org/22/items/archiveteam_archivebot_go_068/00_coverimage.png
[04:09] some great fonts on those yearbooks https://archive.org/stream/Westworth_Military_Academy_Yearbook_1882/WMA-1882#page/n17/mode/2up
[04:11] *** dashcloud has joined #archiveteam
[04:12] *** aaaaaaaaa has quit IRC (Leaving)
[04:15] *** john has joined #archiveteam
[04:16] I have a silly question.
[04:17] How can you mirror the whole of telecomix.org (it's only 12 pages or so) with wget, and download the hotlinked images too without having wget hop onto other sites, and without using -D?
[04:24] *** mistym has joined #archiveteam
[04:37] john: you can use wpull and --span-hosts-allow=page-requisites
[04:41] Does that support the warc format?
[04:41] john: yeah
[04:41] Good. Thanks.
[04:42] john: fwiw wpull is also ArchiveBot's crawler, which has been injecting WARCs into IA's Wayback Machine for about a year now
[04:42] Neat.
[04:43] Out of curiosity, is archiveteam the only group allowed to add directly to the wayback machine, or are there others?
[04:43] we are the only one who does
[04:44] All right.
[04:44] well, there's internal IA groups
[04:48] Can third parties have warc archives integrated after some review?
[04:49] I'm not quite sure how that could work if the archived site is down, since warc files seem like they could be hand-made.
[04:55] Archive Team is the only one
[04:58] webarchiviewer can't find the html files create in warc files created by wpull for some reason. What do you usually use to view or "replay" these?
[04:58] *created
[04:58] *webarchiveplayer
[04:59] *** Start_ has joined #archiveteam
[04:59] *** Start has quit IRC (Read error: Connection reset by peer)
[05:04] john: works fine for me
[05:04] what WARC are you using?
[05:04] yipdw: The one I just created with wpull.
[05:05] I need the WARC to verify, otherwise I'm afraid I can't offer anything more substantial than "it works for me"
[05:06] http://a.pomf.se/fvjica.warc
[05:06] I generated it with `wpull telecomix.org --warc-file telecomix.org --span-hosts-allow=page-requisites --page-requisites'.
[05:06] *** brayden has joined #archiveteam
[05:07] john: try http://localhost:8090/replay/20150329050054/http://telecomix.org
[05:08] Well, that works.
[05:09] It's because it doesn't end with html I guess.
[05:09] I don't know how pywb/webarchiveplayer's listing works, I haven't looked too closely at it
[05:09] Well, it looks like it's extension based.
[05:09] but the detection heuristics might just need some work
[05:10] May I ask what you use? In the future I'd like to see individual files in the archive as well.
[05:10] john: at least one strategy in webarchiveplayer is MIME-type based, not extension-based
[05:10] Huh. Weird.
[05:10] so yeah I don't know why it's not showing up
[05:10] anyway
[05:11] So maybe it's the webmaster's fault?
[05:11] *** c_b has quit IRC (Quit: c_b)
[05:11] no, the response comes back as text/html
[05:11] Okay, then I have no idea.
[05:12] Is there a more capable program you can recommend though?
[05:12] i was just thinking of setting up a web archive player for mesh network
[05:12] john: no
[05:12] webarchiveplayer is the best one I know of
[05:12] All right.
[05:12] in terms of fidelity, it exceeds Wayback
[05:12] I'll go looking.
[05:12] that way people can download archiveteam stuff and use it on a mesh network
[05:12] if you want a listing, try https://github.com/ArchiveTeam/warc-proxy
[05:13] however, keep in mind that that is an HTTP proxy server and its use is more involved
[05:13] Okay.
[05:14] I can create a firefox profile. No big deal for me.
[05:14] oh, I think I know what happened
[05:14] File "build/bdist.linux-x86_64/egg/pywb/utils/canonicalize.py", line 40, in canonicalize
[05:14] raise UrlCanonicalizeException('Invalid Url: ' + url)
[05:14] UrlCanonicalizeException: Invalid Url: urn:X-wpull:log
[05:15] that's an error we've seen before
[05:15] current pywb builds don't like WARC records with URNs
[05:15] Okay.
[05:18] anyway, the author of pywb is aware of wpull and its custom records, so this is likely to be fixed soon
[05:21] john: you can try the wpull option --no-warc-keep-log to not keep the wpull log in the WARC, which may help
[05:21] That work-proxy is kind of nice.
[05:21] *warc-proxy
[05:21] yipdw: Thanks.
[05:22] john: another possibility is --warc-move, which might be overkill, but as a side effect it places all wpull metadata records in a -meta.warc.gz file
[05:23] er
[05:23] --warc-max-size, not --warc-move
[05:23] Thanks.
[05:23] I can see that being used, since I offer to distribute warc archives often.
[05:36] yipdw: --warc-max-size expects an argument.
[05:36] yeah, it's the max size
[05:37] Oh, I see. So just set it to any large number?
[05:37] whatever number makes sense for your environment
[05:37] in archivebot we use 5 GiB
[05:38] What if a site was really larger than 5 GiB? You just make sure it's doing the right thing?
[05:39] --warc-max-size is a threshold size for each WARC, not the maximum fetch size
[05:39] see e.g. http://wpull.readthedocs.org/en/master/options.html
[05:39] Oh, all right.
[05:42] Out of curiosity, might I be able to get wpull to use my browser cookies?
[05:43] if you can export them in Mozilla's cookies.txt format, yeah
[05:43] *** mistym has quit IRC (Remote host closed the connection)
[05:43] see e.g. --load-cookies
[05:45] Thanks.
[06:25] *** primus104 has joined #archiveteam
[06:30] *** JMC has quit IRC (Read error: Operation timed out)
[06:34] Question for the code nerds
[06:34] Well, first, john - use archiveteam-bs for something this long.
[06:36] As long as what?
[06:40] As long as this discussion went
[06:40] Have the silly questions in #archiveteam-bs
[06:44] A, okay.
[06:44] *Ah
[06:55] Code for the Code Gods.
[06:57] Okay, now I must know, where is that from? Originally I thought it was an 8chan thing, but apparently it's everywhere now.
[07:15] Talk about it in -bs
[07:23] *** signius has quit IRC (Read error: Operation timed out)
[07:36] *** signius has joined #archiveteam
[09:10] *** schbirid2 has quit IRC (Leaving)
[09:33] *** schbirid has joined #archiveteam
[10:22] *** scyther has joined #archiveteam
[10:32] *** SimpBrain has joined #archiveteam
[10:38] *** Ymgve has joined #archiveteam
[10:56] *** svchfoo1 has quit IRC (Read error: Connection reset by peer)
[10:59] *** svchfoo1 has joined #archiveteam
[10:59] *** svchfoo2 sets mode: +o svchfoo1
[11:31] https://github.com/cloudfs/ftp-cloudfs
[11:43] *** will has left Textual IRC Client: www.textualapp.com
[11:44] *** will has joined #archiveteam
[12:00] *** scyther has quit IRC (Leaving)
[12:27] *** primus104 has quit IRC (Leaving.)
[12:40] *** [Beta] has joined #archiveteam
[12:45] *** lysobit has quit IRC (Quit: quit)
[12:49] *** habi has joined #archiveteam
[12:49] *** habi has left
[12:55] *** lysobit has joined #archiveteam
[13:07] *** habi has joined #archiveteam
[14:01] *** SimpBrain has quit IRC (Ping timeout: 512 seconds)
[14:30] *** primus104 has joined #archiveteam
[14:38] *** philpem has joined #archiveteam
[14:39] *** primus104 has quit IRC (Leaving.)
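The wpull options named in the exchange from 04:37 to 05:43 can be collected into one invocation. The following is only a sketch: it builds the argument list rather than running wpull, the function and parameter names are made up for illustration, and it assumes --warc-max-size takes a plain byte count (the log only says it "expects an argument").

```python
import shlex

GIB = 1024 ** 3  # bytes in one GiB


def build_wpull_command(url, warc_name, drop_log=False,
                        max_warc_bytes=None, cookies_file=None):
    """Assemble a wpull argv list from the options mentioned in the log.

    The helper and its parameter names are hypothetical; only the wpull
    flags themselves come from the conversation.
    """
    argv = [
        "wpull", url,
        "--warc-file", warc_name,
        "--page-requisites",
        # Leave the start host only to fetch page requisites (e.g. hotlinked images).
        "--span-hosts-allow=page-requisites",
    ]
    if drop_log:
        # Suggested workaround for the pywb error on the urn:X-wpull:log record.
        argv.append("--no-warc-keep-log")
    if max_warc_bytes is not None:
        # Threshold size per WARC file, not a cap on the whole fetch.
        argv += ["--warc-max-size", str(max_warc_bytes)]
    if cookies_file is not None:
        # Cookies exported in Mozilla's cookies.txt format.
        argv += ["--load-cookies", cookies_file]
    return argv


print(shlex.join(build_wpull_command(
    "telecomix.org", "telecomix.org",
    drop_log=True, max_warc_bytes=5 * GIB, cookies_file="cookies.txt")))
```

The resulting list can be passed to subprocess.run as-is; the 5 GiB figure is the one quoted for archivebot, and cookies.txt is a placeholder file name.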
[15:19] *** underscor has quit IRC (Ping timeout: 370 seconds)
[15:24] *** habi has left
[15:28] *** underscor has joined #archiveteam
[15:28] *** swebb sets mode: +o underscor
[15:28] *** primus104 has joined #archiveteam
[15:47] *** underscor has quit IRC (Ping timeout: 370 seconds)
[15:47] *** brayden has quit IRC (Ping timeout: 606 seconds)
[16:06] *** Start_ is now known as Start
[16:06] *** Start has quit IRC (Disconnected.)
[16:06] *** Start has joined #archiveteam
[16:06] *** Start has quit IRC (Remote host closed the connection)
[16:06] *** Start has joined #archiveteam
[16:26] As part of my return to normal life this week, I'm working very hard to kill out all sorts of waiting piles of data on FOS (yes, again). I'll be asking about some of the upload jobs to see which are done.
[16:27] *** brayden has joined #archiveteam
[16:44] Testflight is now getting uploaded.
[16:44] https://archive.org/details/archiveteam_testflight&tab=collection
[16:53] I've written a script that goes to an archivebot collection and says "get the most complicated webpage grab and make that the cover image".
[16:54] It's going to make that collection look sweeeeeeeet
[16:55] *** underscor has joined #archiveteam
[16:55] *** swebb sets mode: +o underscor
[16:56] same metric as usual: screenshot every page and look for the largest-by-kb image?
[16:57] nice
[16:57] Yes
[16:57] i love how simple that is
[16:58] For the purposes of beauty, which is arbitrary, it works well.
[16:58] http://teamarchive1.fnf.archive.org/SCREENCHECK/SHOWBOAT/
[16:58] beauty ~ entropy
[16:58] So, that is what it shows me when I say "show me all the screenshots you generated for this item"
[16:58] I then have it asking me, in a loop, which need to die.
[16:59] So, I'm killing the ones that are blank or awful
[17:06] Definitely going to take the machine at LEAST a week to go through all these.
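The "largest-by-kb image" metric discussed between 16:53 and 16:59 is simple enough to sketch. This is not the script mentioned in the log, which is not shown; it is a minimal illustration that assumes the screenshots for an item sit in one local directory as PNG files, with a hypothetical `rejected` set standing in for the manual "which need to die" pass.

```python
from pathlib import Path
from typing import Iterable, Optional


def pick_cover(screenshot_dir: str, pattern: str = "*.png",
               rejected: Iterable[str] = ()) -> Optional[Path]:
    """Return the largest screenshot by file size, or None if there is none.

    File size is used as a cheap proxy for visual complexity: a blank page
    compresses to a small file, a busy page to a large one.
    """
    skip = set(rejected)  # file names a human has already ruled out
    candidates = [
        p for p in Path(screenshot_dir).glob(pattern)
        if p.is_file() and p.name not in skip
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_size)


# Example call (directory and file name are placeholders):
# cover = pick_cover("screenshots/some_item", rejected={"blank_page.png"})
```

The directory layout, the PNG assumption, and the `rejected` parameter are all illustrative choices, not details taken from the conversation.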
[17:08] *** khaoohs has joined #archiveteam
[17:08] * Sanqui is gonna make a webpage that consists of random pixels and ArchiveBot it
[17:08] beauty confirmed
[17:08] Nice, the thumbnailer for collections is running even faster these days. Some of the items already have the improved cover.
[17:09] *** khaoohs_ has quit IRC (Read error: Operation timed out)
[17:09] but seriously, awesome work!
[17:09] *** khaoohs_ has joined #archiveteam
[17:09] *** khaoohs has quit IRC (Read error: Connection reset by peer)
[17:10] A lot of it is understanding what Brewster wants.
[17:10] *** khaoohs_ has quit IRC (Read error: No route to host)
[17:10] *** khaoohs_ has joined #archiveteam
[17:10] With the highly visual aspect of v2 coming in, he wants this nice collection to be not just useful but pretty, maybe even exquisite.
[17:10] He wouldn't leave 60% of the usable space of a building as an insane church of petabytes if he didn't.
[17:11] He'd gut it like a fuckin' fish
[17:12] Then again: https://archive.org/details/archiveteam_archivebot_go_20141016100002
[17:12] That'll give any sane man a heart attack
[17:14] good thing i'm not
[17:20] *** mistym has joined #archiveteam
[17:23] *** aaaaaaaaa has joined #archiveteam
[17:29] *** khaoohs has joined #archiveteam
[17:31] *** khaoohs_ has quit IRC (Ping timeout: 370 seconds)
[17:31] *** khaoohs_ has joined #archiveteam
[17:32] *** khaoohs has quit IRC (Read error: Connection reset by peer)
[17:32] *** khaoohs_ has quit IRC (Read error: Connection reset by peer)
[17:35] It's (almost) to the point where I will write an automatic "make the nicest one the cover" for the whole collection.
[17:37] *** schbirid has quit IRC (Leaving)
[17:37] *** schbirid has joined #archiveteam
[17:38] I wrote it.
[17:40] *** khaoohs has joined #archiveteam
[17:42] Oh yeah, look at it go.
[17:42] (It found a bunch I thought I'd upgraded, and I had not.)
[17:42] *** khaoohs has quit IRC (Read error: Connection reset by peer)
[17:42] *** khaoohs has joined #archiveteam
[17:48] *** khaoohs has quit IRC (Ping timeout: 370 seconds)
[17:54] *** hive-mind has quit IRC (Ping timeout: 260 seconds)
[17:55] *** hive-mind has joined #archiveteam
[18:39] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[18:42] *** dashcloud has joined #archiveteam
[18:44] *** xtr-201 has quit IRC (Read error: Connection reset by peer)
[18:49] *** aaaaaaaaa has quit IRC (Read error: Operation timed out)
[18:49] *** scyther has joined #archiveteam
[19:12] ohhhyeah
[19:20] *** underscor has quit IRC (Ping timeout: 370 seconds)
[19:24] *** underscor has joined #archiveteam
[19:24] *** swebb sets mode: +o underscor
[19:30] *** BlueMaxim has quit IRC (Ping timeout: 512 seconds)
[19:31] *** BlueMaxim has joined #archiveteam
[19:47] https://archive.org/details/archivebot is now at 99% sexy thumbnails (until the next set renders, of course)
[19:48] *** SN4T14__ has joined #archiveteam
[19:50] *** aaaaaaaaa has joined #archiveteam
[19:51] *** lytv has quit IRC (Read error: Operation timed out)
[19:51] *** lytv has joined #archiveteam
[19:55] *** SN4T14_ has quit IRC (Ping timeout: 512 seconds)
[20:18] *** schbirid has quit IRC (Leaving)
[20:29] the "add new sexy thumbnails" wasn't QUITE working, now it is.
[20:33] Yep, now it's working fine.
[20:47] *** dashcloud has quit IRC (Read error: Operation timed out)
[20:56] *** dashcloud has joined #archiveteam
[20:58] *** khaoohs has joined #archiveteam
[21:06] *** mistym has quit IRC (Remote host closed the connection)
[21:08] *** SimpBrain has joined #archiveteam
[21:09] *** khaoohs has quit IRC (Read error: Connection reset by peer)
[21:09] *** khaoohs has joined #archiveteam
[21:10] *** khaoohs_ has joined #archiveteam
[21:10] *** khaoohs has quit IRC (Read error: Connection reset by peer)
[21:11] *** khaoohs__ has joined #archiveteam
[21:11] *** khaoohs_ has quit IRC (Read error: Connection reset by peer)
[21:23] *** mistym has joined #archiveteam
[21:33] *** khaoohs_ has joined #archiveteam
[21:33] *** khaoohs__ has quit IRC (Read error: Connection reset by peer)
[21:33] *** khaoohs_ has quit IRC (Read error: Connection reset by peer)
[21:33] *** khaoohs_ has joined #archiveteam
[21:34] *** khaoohs__ has joined #archiveteam
[21:34] *** Wolfie has joined #archiveteam
[21:34] *** khaoohs_ has quit IRC (Read error: Connection reset by peer)
[21:35] *** khaoohs_ has joined #archiveteam
[21:35] You dicks, you made me log in to IRC to ask for a secret word to write something on a wiki so that I can ensure that furry porn persists for the future.
[21:35] Yes indeed
[21:36] *** khaoohs__ has quit IRC (Read error: Connection reset by peer)
[21:36] *** khaoohs_ has quit IRC (Read error: Connection reset by peer)
[21:36] Should I copy paste the ALL CAPS request for secret word in here or is that just a trap to make people laugh at me?
[21:36] http://i.huffpost.com/gen/1194885/thumbs/o-CHEERS-LEONARDO-DICAPRIO-570.jpg?5
[21:36] Wolfie, please do
[21:36] Yes, seconded
[21:36] *sigh*
[21:36] FINE.
[21:36] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[21:36] Wolfie, yahoosucks
[21:36] THY WORD IS "YAHOOSUCKS"
[21:37] GO FORTH AND SAVE THAT WHICH NEEDS SAVING
[21:37] (hurr hurr hurr everybody laughs)
[21:37] loltiming
[21:37] Spoiler, furry porn is going to persist.
[21:37] We have.... I think 4 people who specialize on it here
[21:39] I am well aware that furry porn is gonna persist, I just want to make sure that nobody gets their IP banhammered by some dude who likes inflation porn when they try to mirror it.
[21:39] You're our kind of savior.
[21:39] Cranky as fuck, but still doing it.
[21:39] Welcome.
[21:39] <--- jason scott
[21:39] Hi Jason.
[21:44] I'm gonna go out on a limb here and ping chfoo to ask him about what he wants to do with Furaffinity.
[21:49] We've been bonking Furaffinity for.... a while
[21:50] I bonked furaffinity personally when they disabled uploads for a full week and left a bizarre orc-creature in the nude on the front page. I got a few hundred thousand submissions before their hiatus ended and they IP banned me.
[21:51] Hence... wolfie@:~$ curl www.furaffinity.net
[21:51] Your IP address has been banned.
[21:51] Reason: Mass scripted downloading of submissions.
[21:51] So for last.fm
[21:52] from what SketchCow wrote I understand that we need to save:
[21:52] - the forums
[21:52] - journals/user profiles
[21:52] - blog
[21:53] is that right? or do we have more?
[21:59] arkiver: fwiw, there's a list of things on Last.fm at the wiki page: http://archiveteam.org/index.php?title=Last.fm
[22:00] *** scyther has quit IRC (Read error: Connection reset by peer)
[22:16] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:16] *** philpem has quit IRC (Ping timeout: 260 seconds)
[22:19] *** dashcloud has joined #archiveteam
[22:29] arkiver: we should start a discovery project for friendfeed
[22:29] we could use the search like we're doing with rapidshare
[22:30] http://friendfeed.com/search?q=QUERY
[22:30] the archivebot run didn't get it all?
[22:31] it's currently blocked.
besides, a warrior project will be way faster
[22:31] for groups: http://friendfeed.com/groups/search?q=QUERY
[22:31] yeah now you can get multiple IPs banned instead of one
[22:31] much faster
[22:32] archivebot really needs some way to resolve bans and move jobs to a different pipeline
[22:32] though that is a lot of work
[22:32] I'm thinking !move , and !requeue
[22:32] I'm thinking people should just deal with it for now
[22:33] no, I mean, I am dealing with it, archivebot is awesome already
[22:33] just it would be nice to have that one day :p
[22:37] a more serious response is that you can take the collected URLs so far and load those into a warrior project
[22:37] there's ~4 million or so URLs in the wpull database for that job
[22:37] Start: #lastchance.fm
[22:37] I was sent a note from a last.fm insider
[22:37] that should keep things busy until April 9
[22:37] We should really be moving on this.
[22:38] chfoo: can you please add lastfm-grab to projects.json?
[22:38] SketchCow: we'll get it
[23:11] whoa
[23:11] last.fm, what happen?
[23:13] new codebase, possible data loss
[23:24] *** mistym has quit IRC (Remote host closed the connection)
[23:24] *** mistym has joined #archiveteam
[23:24] *** mistym has quit IRC (Remote host closed the connection)
[23:27] when are twitpic and halo resuming?
[23:33] there's more than that on last.fm, every artist and track page has user comments
[23:37] *** mistym has joined #archiveteam
[23:38] *** primus104 has quit IRC (Leaving.)
[23:55] *** Wolfie has quit IRC (Quit: Leaving.)
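The search-based discovery idea floated for friendfeed between 22:29 and 22:31 amounts to substituting query terms into the two URL templates quoted there. A rough sketch of the URL generation step only, assuming the QUERY placeholder is an ordinary form-encoded `q` parameter; the function name and the sample terms are invented, and nothing here fetches anything.

```python
from urllib.parse import urlencode

# The two search endpoints quoted in the log.
SEARCH_ENDPOINTS = {
    "posts": "http://friendfeed.com/search",
    "groups": "http://friendfeed.com/groups/search",
}


def discovery_urls(queries):
    """Yield one search URL per endpoint for every query term."""
    for query in queries:
        for endpoint in SEARCH_ENDPOINTS.values():
            # urlencode handles spaces and other characters unsafe in URLs.
            yield f"{endpoint}?{urlencode({'q': query})}"


for url in discovery_urls(["archive team", "music"]):
    print(url)
```

How the query terms are chosen (word lists, usernames, and so on) is left open in the conversation, so it is left open here as well.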