#archiveteam 2014-03-01,Sat


Time Nickname Message
09:48 🔗 jonbro_ heya!
09:48 🔗 jonbro_ could anyone help me with a last minute archive project?
09:49 🔗 jonbro_ a local pittsburgh message board is going down, and I am trying to wget it all before it disappears.
09:50 🔗 Schbirid step a: link the linky link
09:50 🔗 jonbro_ nevertellmetheodds.org
09:50 🔗 jonbro_ I don't have a warrior project, just looping over each thread
09:51 🔗 jonbro_ supposedly he was going to take it down today, so I am not sure what the countdown clock actually looks like.
09:52 🔗 Schbirid if the threads are enough, for i in {1..135000}; do wget -nv http://nevertellmetheodds.org/t.php?id=$i; done :D
09:53 🔗 jonbro_ cool.
09:53 🔗 jonbro_ thanks much!
09:54 🔗 Schbirid that will not include the inline images though
09:54 🔗 jonbro_ ah I am not worried about that, it is all offsite links.
09:54 🔗 jonbro_ those were always impermanent
09:57 🔗 Schbirid for i in {102960..135000}; do wget -e robots=off -nv --page-requisites --span-hosts --reject-regex="(/favicon.ico|/style.css)" --exclude-domains=ajax.googleapis.com,www.businesscasualarchnemesis.com --convert-links --adjust-extension http://nevertellmetheodds.org/t.php?id=$i; done
09:57 🔗 Schbirid small test, should grab inline images
09:58 🔗 godane so i'm grabbing the sitemaps of funnyordie.com
09:58 🔗 godane there are like 230 of them
09:59 🔗 godane from there i can grab the video pages
09:59 🔗 godane but i really don't have to
09:59 🔗 jonbro_ hmmmm...
10:00 🔗 godane the videos are hosted like this: http://vo.fod4.com/v/befceb53c6/v600.mp4
10:00 🔗 Schbirid i would recommend logging, add a "-a logfile_date.log". wget will print all messages there then
10:00 🔗 godane the url is this: http://www.funnyordie.com/videos/befceb53c6/dani-weirdo-music-for-weirdos
10:01 🔗 jonbro_ ok, will do :D got this running on two computers and an ec2 instance now :D
10:02 🔗 Schbirid different number ranges, right? :)
10:02 🔗 jonbro_ yep
10:02 🔗 jonbro_ hopefully it will finish before the owner wakes up and pulls the plug
10:03 🔗 godane you just need that part of the url to get the video paths
10:03 🔗 Schbirid :D
10:03 🔗 godane also change the video to v1200.mp4 so you get the high bitrate version
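The mapping godane describes above is plain string manipulation: the hash segment of a funnyordie.com page URL is reused in the vo.fod4.com video path. A minimal sketch, assuming the 2014 URL layout from the log holds; the helper name and the `v1200` quality default are illustrative, not a stable API:

```python
# Sketch of the funnyordie URL mapping described above. Assumes page URLs of
# the form .../videos/<hash>/<slug> as shown in the log; "v1200" is the
# high-bitrate variant godane mentions.
def video_url(page_url, quality="v1200"):
    # e.g. http://www.funnyordie.com/videos/befceb53c6/dani-... -> befceb53c6
    video_hash = page_url.rstrip("/").split("/videos/")[1].split("/")[0]
    return "http://vo.fod4.com/v/%s/%s.mp4" % (video_hash, quality)

print(video_url("http://www.funnyordie.com/videos/befceb53c6/dani-weirdo-music-for-weirdos"))
# -> http://vo.fod4.com/v/befceb53c6/v1200.mp4
```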
10:05 🔗 Schbirid i started from 135k and 100k going down
10:23 🔗 ivan` jonbro_: if you use wget --warc-file your archive could be put into wayback
10:24 🔗 ivan` if there is no WARC, it's usually a lost cause
10:25 🔗 jonbro_ oh really :(
10:25 🔗 ivan` and starting one wget per URL is super dumb especially when you're grabbing page requisites
10:25 🔗 jonbro_ can I warc it after the fact?
10:25 🔗 ivan` but also because it doesn't reuse connections
10:25 🔗 ivan` no
10:25 🔗 jonbro_ oh rly????
10:25 🔗 jonbro_ oh shit.
10:25 🔗 jonbro_ I mean, there is only one page requisite, just a style.css
10:26 🔗 ivan` oh that's not so bad then
10:26 🔗 jonbro_ ok, cool, thank god :D
10:26 🔗 Schbirid ivan`: would you rather make an input file with urls? thought about that too
10:27 🔗 ivan` that, or a terrible hack like this is what I'm using for one site
10:27 🔗 ivan` for i in range(0, 107):
10:27 🔗 ivan`     start = i*10000 + 1
10:27 🔗 ivan`     end = (i+1)*10000
10:27 🔗 ivan`     print start, end
10:27 🔗 ivan`     os.system(r'wget --output-document=tmp --warc-file=bugzilla.redhat.com-%s-%s --warc-cdx -e robots=off https://bugzilla.redhat.com/show_bug.cgi\?id={%s..%s}' % (str(start).zfill(8), str(end).zfill(8), start, end))
10:27 🔗 jonbro_ ah interesting.
10:28 🔗 jonbro_ is that faster?
10:28 🔗 ivan` yes
10:28 🔗 Schbirid oh wow yes
10:28 🔗 jonbro_ ok, once one of these chunks ends I will switch over to that.
10:29 🔗 Schbirid for i in {0..135000}; do echo "http://nevertellmetheodds.org/t.php?id=$i" >> urls; done
10:29 🔗 Schbirid wget -nv --convert-links --adjust-extension --warc-file=nevertellmetheodds.org_20140301 --warc-cdx -i urls
10:29 🔗 Schbirid several times as fast
10:29 🔗 * Schbirid hides in a corner
10:29 🔗 ivan` yes
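The input-file approach above (one URL per line, fed to `wget -i`) combines naturally with the earlier idea of giving each machine its own number range. A small sketch, assuming the same `t.php?id=` pattern from the log; the `urls_0`, `urls_1`, ... file names are made up for illustration:

```python
# Write one URL per line and split the list into equal chunks, so each
# machine/instance can be pointed at its own file with wget -i urls_N.
def write_url_chunks(first, last, n_chunks,
                     base="http://nevertellmetheodds.org/t.php?id=%d"):
    ids = range(first, last + 1)
    size = (len(ids) + n_chunks - 1) // n_chunks  # ceiling division
    names = []
    for c in range(n_chunks):
        name = "urls_%d" % c
        with open(name, "w") as f:
            for i in ids[c * size:(c + 1) * size]:
                f.write(base % i + "\n")
        names.append(name)
    return names

names = write_url_chunks(0, 135000, 3)
# then on each machine: wget -nv --warc-file=... --warc-cdx -i urls_N
```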
10:30 🔗 Cameron_D Just stick & after the wget call. Parallelise things :-)
10:30 🔗 ivan` heh
10:30 🔗 Schbirid "how to annoy a webserver" :P
10:30 🔗 ivan` how to get a precious IPv4 IP banned for life
10:30 🔗 ersi :D
10:31 🔗 ersi is this #superhackysaturday?
10:31 🔗 Cameron_D Pretty sure your system would fall over first, trying to start 135k processes at once :P
10:31 🔗 Smiley nah, it'll be fine
10:31 🔗 Smiley you'll OOM pretty quickly tho
10:31 🔗 Smiley the voice of experience ^^^
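The objection above is real: a trailing `&` forks one process per URL, so 135k of them at once. The usual fix is a bounded worker pool, which gets concurrency without the fork storm. A sketch with a placeholder `fetch()` (a real run would shell out to wget or use an HTTP library, and should still be gentle with the server):

```python
# Bounded parallelism instead of one backgrounded process per URL.
# fetch() is a stand-in here so the sketch runs without touching the network.
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # placeholder: in reality, download `url`
    return len(url)

urls = ["http://nevertellmetheodds.org/t.php?id=%d" % i for i in range(100)]
with ThreadPoolExecutor(max_workers=4) as pool:  # at most 4 in flight
    results = list(pool.map(fetch, urls))
print(len(results))  # -> 100
```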
10:32 🔗 unbeholde Schbirid I have a small problem
10:32 🔗 Schbirid another unsatisfied customer :((
10:33 🔗 unbeholde just found that stealth project alpha is corrupt and I tried to download it but it failed shortly before finishing. Same goes for Wonderful Life
10:33 🔗 Schbirid give me the urls and i will re-grab them, might have been my fault
10:34 🔗 jonbro_ can I make multiple warc files? or is that a no no if I want to chuck it on archive.org
10:34 🔗 Cameron_D you can make multiple and then merge them, I believe
10:35 🔗 garyrh jonbro_, https://github.com/alard/megawarc
10:36 🔗 jonbro_ cool!
10:38 🔗 unbeholde what was that text upload service called?
10:39 🔗 Schbirid pastee.org is nice
10:39 🔗 unbeholde hmm ah here ok I got it up on pastebin http://pastebin.ca/2648419
10:40 🔗 Schbirid ok
11:08 🔗 Schbirid unbeholde: https://www.quaddicted.com/files/temp/fp/
11:08 🔗 Schbirid 7z says they are intact
11:33 🔗 unbeholde ah thank you Wonderful Life is working, just waiting for the stealth project to finish! By the way I'm having trouble uploading Ons2.0.zip and ut3domfinal_winsetup.zip at the moment, it failed on me twice. Do you happen to have a direct link for them?
11:36 🔗 * unbeholde slaps Schbirid around a bit with a large fishbot
11:36 🔗 Schbirid yay
11:36 🔗 Schbirid uploading?
11:37 🔗 unbeholde mmm to the almighty modDB
11:37 🔗 Schbirid nothing more direct than the tarview links
11:40 🔗 Nemo_bis Meh, why does a search for "rare chunks" only bring up academic papers? BitTorrent can't be more stupid than eMule, can it? I hope peers send out rare chunks first. https://encrypted.google.com/search?q=bittorrent+"rare+chunks"
11:41 🔗 unbeholde it says tarview.php isn't a recognised file format
11:42 🔗 Schbirid Nemo_bis: there is super seeding but apart from that iirc clients request what they want
11:47 🔗 Nemo_bis Really? But clients don't even know the other leechers, they can only ask for the wrong chunks...
11:49 🔗 Schbirid https://wiki.theory.org/BitTorrentSpecification#Piece_downloading_strategy
11:52 🔗 Nemo_bis Hm. https://trac.transmissionbt.com/ticket/3767 is marked fixed but I'm rather sure the crystal ball feature has not been implemented yet.
11:55 🔗 Nemo_bis Dunno how much http://www.cs.sfu.ca/~mhefeeda/Papers/ism09.pdf applies
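The spec page linked above describes "rarest first" piece selection: each downloader counts how many of its connected peers have each piece it still needs, and requests the least-replicated one, so rare pieces spread before their holders leave. A minimal sketch with made-up data, no real BitTorrent wire protocol involved:

```python
# Rarest-first piece selection, as in the BitTorrent piece downloading
# strategy linked above. peer_bitfields is a list of sets of piece indices
# each connected peer claims to have.
from collections import Counter

def rarest_first(needed, peer_bitfields):
    """Pick the needed piece owned by the fewest peers (ties: lowest index)."""
    availability = Counter()
    for bitfield in peer_bitfields:
        for piece in bitfield:
            availability[piece] += 1
    candidates = [p for p in needed if availability[p] > 0]
    if not candidates:
        return None  # nobody we know has anything we need
    return min(candidates, key=lambda p: (availability[p], p))

# pieces 0-3 needed; piece 2 is held by only one peer, so it is requested first
print(rarest_first({0, 1, 2, 3}, [{0, 1, 2}, {0, 1}, {0, 3}]))  # -> 2
```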
16:30 🔗 Asparagir Need ops in ArchiveBot plz. Want to feed in Crimean sites, including media, while there's still time.
23:30 🔗 gui77 given that myopera, bebo and canvas are already at the max rate and push is (practically) finished, is there any other project needing bandwidth? :)
23:55 🔗 sanqui hello. I know you folks are not archive.org, but I've been wondering if anybody can tell me if I'm fine uploading something there
23:56 🔗 sanqui or if there's a better place for both short-term availability and long-term preservation
23:57 🔗 sanqui basically, I don't know if you've heard about twitch plays pokémon, but it's been quite a phenomenon: https://en.wikipedia.org/wiki/Twitch_Plays_Pok%C3%A9mon
23:57 🔗 sanqui I've got nearly 3gb of logs, and many people have asked me to provide them for academic study
23:58 🔗 sanqui it's just under 400mb compressed, but I still don't want to host it myself
