[09:48] heya!
[09:48] could anyone help me with a last minute archive project?
[09:49] a local pittsburgh message board is going down, and I am trying to wget it all before it disappears.
[09:50] step a: link the linky link
[09:50] nevertellmetheodds.org
[09:50] I don't have a warrior project, just looping over each thread
[09:51] supposedly he was going to take it down today, so I am not sure what the countdown clock actually looks like.
[09:52] if the threads are enough, for i in {1..135000}; do wget -nv http://nevertellmetheodds.org/t.php?id=$i; done :D
[09:53] cool.
[09:53] thanks much!
[09:54] that will not include the inline images though
[09:54] ah I am not worried about that, it is all offsite links.
[09:54] those were always impermanent
[09:57] for i in {102960..135000}; do wget -e robots=off -nv --page-requisites --span-hosts --reject-regex="(/favicon.ico|/style.css)" --exclude-domains=ajax.googleapis.com,www.businesscasualarchnemesis.com --convert-links --adjust-extension http://nevertellmetheodds.org/t.php?id=$i; done
[09:57] small test, should grab inline images
[09:58] so i'm grabbing the sitemap of funnyordie.com
[09:58] there are like 230 of them
[09:59] from there i can grab the video pages
[09:59] but i really don't have to
[09:59] hmmmm...
[10:00] the videos are hosted like this: http://vo.fod4.com/v/befceb53c6/v600.mp4
[10:00] i would recommend logging, add a "-a logfile_date.log". wget will print all messages there then
[10:00] the url is this: http://www.funnyordie.com/videos/befceb53c6/dani-weirdo-music-for-weirdos
[10:01] ok, will do :D got this running on two computers and an ec2 instance now :D
[10:02] different number ranges, right? :)
[10:02] yep
[10:02] hopefully will finish before the owner wakes up and pulls the plug
[10:03] you need just that id part of the url to get the video paths
[10:03] :D
[10:03] also change the video to v1200.mp4 so you get the high-bitrate version
[10:05] i started from 135k and 100k going down
[10:23] jonbro_: if you use wget --warc-file your archive could be put into wayback
[10:24] if there is no WARC, it's usually a lost cause
[10:25] oh really :(
[10:25] and starting one wget per URL is super dumb especially when you're grabbing page requisites
[10:25] can I warc it after the fact?
[10:25] but also because it doesn't reuse connections
[10:25] no
[10:25] oh rly????
[10:25] oh shit.
[10:25] I mean, there is only one page requisite, just a style.css
[10:26] oh that's not so bad then
[10:26] ok, cool, thank god :D
[10:26] ivan`: would you rather make an input file with urls? thought about that too
[10:27] that, or a terrible hack like this is what I'm using for one site
[10:27] import os
[10:27] for i in range(0, 107):
[10:27]     start = i*10000 + 1
[10:27]     end = (i+1)*10000
[10:27]     print start, end
[10:27]     os.system(r'wget --output-document=tmp --warc-file=bugzilla.redhat.com-%s-%s --warc-cdx -e robots=off https://bugzilla.redhat.com/show_bug.cgi\?id={%s..%s}' % (str(start).zfill(8), str(end).zfill(8), start, end))
[10:27] ah interesting.
[10:28] is that faster?
[10:28] yes
[10:28] oh wow yes
[10:28] ok, once one of these chunks ends I will switch over to that.
[10:29] for i in {0..135000}; do echo "http://nevertellmetheodds.org/t.php?id=$i" >> urls; done
[10:29] wget -nv --convert-links --adjust-extension --warc-file=nevertellmetheodds.org_20140301 --warc-cdx -i urls
[10:29] several times as fast
[10:29] * Schbirid hides in a corner
[10:29] yes
[10:30] Just stick & after the wget call. Parallelise things :-)
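A gentler way to parallelise than sticking & after every wget call (which, as the next few messages point out, forks 135k processes at once) is to cap the number of workers with xargs. A minimal sketch, assuming the "urls" file built in the loop above; the batch size of 100 and the 4 workers are arbitrary choices, not anything from the log:

    # Run at most 4 wget processes at a time, 100 URLs per invocation,
    # so connections are reused within each batch instead of one wget per URL.
    xargs -n 100 -P 4 wget -nv -a fetch.log < urls

Note this does not combine safely with a single shared --warc-file; parallel workers would need distinct WARC names.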
[10:30] heh
[10:30] "how to annoy a webserver" :P
[10:30] how to get a precious IPv4 IP banned for life
[10:30] :D
[10:31] is this #superhackysaturday?
[10:31] Pretty sure your system would fall over first, trying to start 135k processes at once :P
[10:31] nah, i'll be fine
[10:31] you'll OOM pretty quickly tho
[10:31] the voice of experience ^^^
[10:32] Schbirid: I have a small problem
[10:32] another unsatisfied customer :((
[10:33] just found that stealth project alpha is corrupt and I tried to download it but it failed shortly before finishing. Same goes for Wonderful Life
[10:33] give me the urls and i will re-grab them, might have been my fault
[10:34] can I make multiple warc files? or is that a no-no if I want to chuck it on archive.org
[10:34] you can make multiple and then merge them, I believe
[10:35] jonbro_, https://github.com/alard/megawarc
[10:36] cool!
[10:38] what was that text upload service called?
[10:39] pastee.org is nice
[10:39] hmm ah here ok I got it up on pastebin http://pastebin.ca/2648419
[10:40] ok
[11:08] unbeholde: https://www.quaddicted.com/files/temp/fp/
[11:08] 7z says they are intact
[11:33] ah thank you, Wonderful Life is working, just waiting for the stealth project to finish! By the way I'm having trouble uploading Ons2.0.zip and ut3domfinal_winsetup.zip at the moment, it failed on me twice. Do you happen to have a direct link for them?
[11:36] * unbeholde slaps Schbirid around a bit with a large fishbot
[11:36] yay
[11:36] uploading?
[11:37] mmm to the almighty modDB
[11:37] nothing more direct than the tarview links
[11:40] Meh, why does a search for "rare chunks" only bring up academic papers? BitTorrent can't be more stupid than eMule, can it? I hope peers send out rare chunks first. https://encrypted.google.com/search?q=bittorrent+"rare+chunks"
[11:41] it says tarview.php isn't a recognised file format
[11:42] Nemo_bis: there is super seeding, but apart from that iirc clients request what they want
[11:47] Really? But clients don't even know the other leechers, they can only ask for the wrong chunks...
[11:49] https://wiki.theory.org/BitTorrentSpecification#Piece_downloading_strategy
[11:52] Hm. https://trac.transmissionbt.com/ticket/3767 is marked fixed but I'm rather sure the crystal ball feature has not been implemented yet.
[11:55] Dunno how much http://www.cs.sfu.ca/~mhefeeda/Papers/ism09.pdf applies
[16:30] Need ops in ArchiveBot plz. Want to feed in Crimean sites, including media, while there's still time.
[23:30] given that myopera, bebo and canvas are already at the max rate and push is (practically) finished, is there any other project needing bandwidth? :)
[23:55] hello. I know you folks are not archive.org, but I've been wondering if anybody can tell me if I'm fine uploading something there
[23:56] or if there's a better place for both short-term availability and long-term preservation
[23:57] basically, I don't know if you've heard of twitch plays pokémon, but it's been quite a phenomenon: https://en.wikipedia.org/wiki/Twitch_Plays_Pok%C3%A9mon
[23:57] I've got nearly 3gb of logs, and many people have asked me to provide them for academic study
[23:58] it's just under 400mb compressed, but I still don't want to host it myself
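Following up on the [10:34] question about merging multiple WARCs: megawarc (linked at [10:35]) is the Archive Team tool for packing grabs into megawarc items, but for plain .warc.gz files there is a simpler route. Each WARC record in a .warc.gz is its own gzip member, so concatenating the files yields a valid combined WARC. A minimal sketch, with illustrative file names:

    # Combine per-chunk WARCs into one; works because .warc.gz files
    # are concatenable per the WARC gzip convention.
    cat nevertellmetheodds.org_20140301-*.warc.gz > nevertellmetheodds.org_20140301-combined.warc.gz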
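For the [23:55] question about uploading to archive.org, one common route is the internetarchive command-line tool. A rough sketch, assuming the tool is installed and configured ("pip install internetarchive", then "ia configure" once for credentials); the item identifier, file name, and metadata values below are made up for illustration:

    # Upload a compressed log archive as a new archive.org item.
    ia upload twitch-plays-pokemon-chat-logs tpp-logs.tar.gz \
        --metadata="title:Twitch Plays Pokemon chat logs" \
        --metadata="mediatype:data"

This gives both the short-term availability and the long-term preservation asked about, without self-hosting.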