#archiveteam 2013-07-08,Mon

Time Nickname Message
02:00 🔗 GLaDOS By the way, we still need to figure out a way of grabbing images from Snapjoy. If you want to help, come visit us in #snapshut
02:48 🔗 wp494 a small snippet of what's going on in #pushharder: http://pastebin.com/ZuEskZww
03:21 🔗 wp494 just updated the IRC channel list
03:21 🔗 wp494 http://archiveteam.org/index.php?title=IRC
03:48 🔗 winr4r wp494: your claculations for the total possible number of puu.sh images is out, i make 62^5 = 916132832
03:48 🔗 winr4r and in reality, much less than that, since the first digit is only up to 3xxxx
03:49 🔗 winr4r calculations* sorry it's pre-caffeine
03:50 🔗 wp494 hence "calculations that probably aren't worth shit"
03:50 🔗 winr4r wp494: yeah mine probably aren't any better either
03:51 🔗 wp494 and yet here's mister 84% in mathematics
03:51 🔗 wp494 I should be able to do such simple things :c
03:51 🔗 winr4r that would assume they weren't like "start at 800 million to make it look like we are bigger than we are"
03:53 🔗 winr4r and if they're only at 3xxxx and they go 0-9 A-Z a-z, then wouldn't that be 62^4 * 3 = 44329008?
03:54 🔗 winr4r who knows, i still need caffeine
03:54 🔗 winr4r brb
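
A quick check of the arithmetic being debated above (assuming, per the conversation, that puu.sh IDs are 5 characters drawn from [0-9A-Za-z]):

```python
# Keyspace math from the exchange above. Assumes puu.sh IDs are
# 5 characters over [0-9A-Za-z], i.e. 62 possible symbols each.
symbols = 62

print(symbols ** 5)      # 916132832 -- winr4r's upper bound on all IDs
print(3 * symbols ** 4)  # 44329008  -- if the first character spans only 3 values
```
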
04:20 🔗 winr4r in related news, bing still has a free search API tier in azure marketplace
04:20 🔗 winr4r guess what i am learning this morning!
04:22 🔗 xmc wheee
04:26 🔗 winr4r figure this is going to be the best way of finding *.webtv.net pages
04:26 🔗 winr4r and maybe snapjoy ones too!
04:35 🔗 Acebulf hey guys, I'm new here. Is there a way I could help archiving stuff?
04:38 🔗 ivan` there are a lot of ways to help
04:38 🔗 ivan` the xanga grab is ongoing and you can run xanga-grab; see #jenga
04:39 🔗 ivan` you can help set up new grabs e.g. puu.sh
04:39 🔗 ivan` you can also make WARCs of everything you like with wget
04:40 🔗 Acebulf nice, thanks
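
As a concrete illustration of ivan`'s last suggestion: wget 1.14 and later can write a WARC file natively while it downloads. A minimal sketch of driving it from Python; the URL and the output name are placeholders:

```python
# Minimal sketch of making a WARC with wget, as ivan` suggests.
# Assumes wget >= 1.14 (the first release with WARC support) is on PATH.
import subprocess

subprocess.call([
    "wget",
    "--mirror",             # recursive download, honoring timestamps
    "--page-requisites",    # also fetch the images/CSS/JS pages need
    "--warc-file=example",  # writes example.warc.gz as it downloads
    "http://example.com/",  # placeholder target
])
```
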
04:40 🔗 S[h]O[r]T winr4r ill run against passive dns for *.webtv.net and snapjoy. im limited to 10k per atm but most dont even have that many
04:45 🔗 winr4r S[h]O[r]T: if you're looking at *.webtv.net, there's only going to be a few results, because webtv.net addresses are community-X.webtv.net/username
04:45 🔗 winr4r S[h]O[r]T: snapjoy on the other hand, uses username.snapjoy.com, so that would be very helpful
04:51 🔗 Acebulf quick question: if I run the warrior, how much disk space will it use, and for how long?
04:52 🔗 S[h]O[r]T i dont even see a community*
04:52 🔗 S[h]O[r]T and yeah few results but here ya go. http://privatepaste.com/4c7f5c2edb
04:53 🔗 S[h]O[r]T dont see many *.snapjoy.com either
04:53 🔗 winr4r Acebulf: good question, i seem to recall reading that it'll be at most a few gigabytes, and that'll presumably be for only as long as that takes to upload
04:53 🔗 S[h]O[r]T like 300
04:53 🔗 winr4r S[h]O[r]T: 300 is better than 0!
04:54 🔗 winr4r S[h]O[r]T: interesting results there
04:54 🔗 S[h]O[r]T ^^ a few gb for sure. and Acebulf you can gracefully stop it whenever you want.
04:54 🔗 Acebulf winr4r -> will it start uploading automatically or is there an upload round in a couple months
04:54 🔗 winr4r Acebulf: starts uploading automatically as soon as the job is done
04:54 🔗 S[h]O[r]T if you choose a specific project it will go for as long as that project has items to grab and then finish. if you choose ArchiveTeam's Choice it will work on whatever project is assigned by default at the time
04:55 🔗 S[h]O[r]T some of those hosts may be expired/old but they were at once in use
04:56 🔗 winr4r S[h]O[r]T: mind if i shove that in the AT pastebin? not sure if you used privatepaste for a reason
04:56 🔗 winr4r S[h]O[r]T: what did you get from snapjoy btw?
04:56 🔗 S[h]O[r]T the at pastebin is always super slow for me :p i just prefer privatepaste. shove it wherever you want
04:57 🔗 S[h]O[r]T formatting the list for snapjoy atm
04:57 🔗 S[h]O[r]T some copy/paste work
04:57 🔗 winr4r thanks :)
05:02 🔗 S[h]O[r]T snapjoy http://privatepaste.com/6da8e5582d
05:02 🔗 Acebulf nice, I got the xanga grabber running
05:03 🔗 S[h]O[r]T the webtv ones are all A records btw
05:07 🔗 winr4r S[h]O[r]T: thank you!
05:09 🔗 Acebulf on the dashboard for the tracker, here : http://tracker.archiveteam.org/xanga/#show-all
05:09 🔗 Acebulf what's the little icons next to the names on the righT?
05:10 🔗 S[h]O[r]T its the warrior icon, aka a guy running out of a burning building with things. it indicates those users are running the warrior and if you hover it shows you the version.
05:10 🔗 S[h]O[r]T other users without that icon are running the scripts standalone
05:11 🔗 Acebulf cool, thanks
05:16 🔗 Acebulf yay it worked! I got my first item completed
05:16 🔗 winr4r it certainly does!
05:21 🔗 Acebulf anyway ima head to bed and leave the warrior on overnight
05:21 🔗 Acebulf boom! second item completed!
05:22 🔗 winr4r good plan!
05:22 🔗 winr4r apparently bing API maxes out after like 1k results, and randomly returns duplicated ones for site:snapjoy.com
05:23 🔗 winr4r so plan B!
05:25 🔗 winr4r on the upside, that's about 1k things on webtv.net which we didn't know about before
05:39 🔗 winr4r oh hm, of course i can refine the query by doing 'search term site:webtv.net'
05:40 🔗 winr4r what queries should i run? just did genealogy, family history, family tree
05:46 🔗 winr4r haha this is neat, i'm getting thousands more unique pages with this
05:47 🔗 winr4r 5610 so far!
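
A rough sketch of the harvesting winr4r describes: the Azure Marketplace Bing API of the time paged results 50 at a time, capped out around 1,000 results per query, and authenticated with the account key as the Basic-auth password, so varying search terms alongside the site: filter was how you collected more unique URLs. The account key and query terms below are placeholders, and the endpoint is the 2013-era one (since retired):

```python
# Rough sketch of paging the Azure Marketplace Bing Search API for
# site:webtv.net results. ACCOUNT_KEY and the terms are placeholders.
import requests

ACCOUNT_KEY = "..."
ENDPOINT = "https://api.datamarket.azure.com/Bing/Search/v1/Web"

urls = set()
for term in ["genealogy", "family history", "family tree"]:
    for skip in range(0, 1000, 50):   # the API stopped returning past ~1k
        r = requests.get(
            ENDPOINT,
            params={"Query": "'%s site:webtv.net'" % term,
                    "$format": "json", "$top": 50, "$skip": skip},
            auth=("", ACCOUNT_KEY),   # account key as the Basic-auth password
        )
        results = r.json()["d"]["results"]
        if not results:
            break
        urls.update(item["Url"] for item in results)  # dedupe across queries

print(len(urls), "unique URLs found")
```
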
08:18 🔗 PepsiMax Hmm. My Warrior is uploading at 50 kB/s.
08:18 🔗 PepsiMax That's 10 times slower than it could be.
08:19 🔗 PepsiMax 30 kB/s now. 11 hours before the task is uploaded.
08:23 🔗 SmileyG PepsiMax: we don't have much b/w spare for uploads :D
08:28 🔗 ersi PepsiMax: Don't worry.
08:39 🔗 PepsiMax eek
09:45 🔗 Nemo_bis is there an on demand service to get books uploaded from Google Books to archive.org?
09:46 🔗 Nemo_bis or a bookmarklet or whatever
09:47 🔗 omf_ not sure
09:47 🔗 omf_ I know IA has the want this book API
09:48 🔗 omf_ they might have other cool tools as well
09:50 🔗 omf_ Nemo_bis, got an example url of a google book
09:50 🔗 Nemo_bis omf_: AFAIK they officially have no relation to tpb
09:50 🔗 Nemo_bis hm?
09:51 🔗 ersi What does TPB have to do with anything?
09:51 🔗 omf_ an example url of a book on google books you would like to see on archive.org
09:53 🔗 Nemo_bis I don't have one now, I was asked about it
11:58 🔗 SketchCow archive.org is constantly grabbing google books, by the way.
12:00 🔗 godane thats good to know
12:04 🔗 godane SketchCow: for some reason this item can't be searched: https://archive.org/details/HD_Nation_124
12:47 🔗 SmileyG godane: to me it looks like the 'ben larden' might be breaking something
13:31 🔗 tef ola, btw https://github.com/internetarchive/warctools/ has the latest warctools code now
13:34 🔗 omf_ tef, is that github going to replace http://code.hanzoarchives.com/warc-tools or just be a mirror of it?
13:37 🔗 ersi tef: cool
13:38 🔗 ersi Who's "Stephen Jones"? :o
13:48 🔗 tef ersi: my coworker at hanzo (which i am now leaving)
13:48 🔗 tef i'm making sure the code gets pushed out before I disappear
13:48 🔗 tef then I can start writing crawlers again without worrying
13:48 🔗 tef because it's been a fight against management to maintain hanzowarctools in the open, and it isn't even that good a library
13:50 🔗 ersi shrug
13:50 🔗 ersi Thanks for keeping it open :)
13:51 🔗 omf_ I updated the wiki
14:32 🔗 SketchCow 1054965.3 / 1365798.3 MB Rate: 373.4 / 8786.8 KB Uploaded: 130760.0 MB [77%] 0d 10:03 [ R: 0.12]
14:32 🔗 SketchCow That's a huge-ass torrent.
14:33 🔗 ersi Indeedily
14:42 🔗 SmileyG \o/
14:42 🔗 SmileyG Pouet just finished :)
14:43 🔗 SmileyG Downloaded: 3918573 files, 250G in 3d 0h 20m 50s (1006 KB/s)
14:43 🔗 * SmileyG uploads
14:43 🔗 omf_ you have to break that apart, remember there is a 50gb hard limit on IA items
14:43 🔗 SmileyG Gah, ok how?
14:44 🔗 omf_ megawarc should be able to do it
14:44 🔗 SmileyG o_O
14:44 🔗 SmileyG I thought that'd just bundle it into a single warc.
14:44 🔗 ersi thought everyone knew this
14:45 🔗 SmileyG ersi: you'd be surprised what everyone doesn't know.
14:45 🔗 ersi but yes, be kind to IA's servers - it'll also be easier to upload
14:45 🔗 ersi AFAIK it isn't a hard limit on 50GB and you can go further.. but you'll probably have a pain in the ass experience uploading the item
14:45 🔗 omf_ I mean the megawarc factory is designed to output 50gb warcs
14:46 🔗 omf_ I get an error every time I hit 50gb on an item and then an email from them about it
14:47 🔗 ersi well, it certainly doesn't hurt to break it up
14:49 🔗 SmileyG just figuring out how to break it up
14:49 🔗 SmileyG or can I just compress it down?
14:53 🔗 ersi It will probably not shrink to under 50GB from 250GB
14:59 🔗 winr4r ^
15:02 🔗 SmileyG shame :(
15:03 🔗 SmileyG ok so I'll try and split it up tomorrow.
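
One plausible way to do the split SmileyG needs: a gzipped WARC is a series of independently-gzipped records, so it can be cut at record boundaries and each piece is itself a valid .warc.gz. A sketch using the warctools library tef linked earlier; the file names, the 50 GB threshold, and the exact read_records interface should be checked against the library before relying on this:

```python
# Sketch: split a large .warc.gz into <50 GB pieces at record boundaries,
# using warctools (https://github.com/internetarchive/warctools).
import os
from hanzo.warctools import WarcRecord

SRC = "pouet.warc.gz"      # placeholder input name
LIMIT = 50 * 10**9         # stay under the ~50 GB item guideline

# Pass 1: collect the byte offset of every record.
offsets = []
reader = WarcRecord.open_archive(SRC, gzip="auto")
for offset, record, errors in reader.read_records(limit=None):
    if record is not None and offset is not None:
        offsets.append(offset)
reader.close()

# Pass 2: choose cut points that keep each piece under LIMIT.
cuts, prev = [0], 0
for off in offsets:
    if off - cuts[-1] > LIMIT and prev > cuts[-1]:
        cuts.append(prev)  # cut at the last record that still fit
    prev = off
cuts.append(os.path.getsize(SRC))

# Pass 3: copy each byte range out; every piece starts on a record
# boundary, so it is a well-formed .warc.gz in its own right.
with open(SRC, "rb") as src:
    for i in range(len(cuts) - 1):
        src.seek(cuts[i])
        remaining = cuts[i + 1] - cuts[i]
        with open("pouet.%02d.warc.gz" % i, "wb") as out:
            while remaining > 0:
                chunk = src.read(min(remaining, 1 << 20))
                out.write(chunk)
                remaining -= len(chunk)
```
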
15:03 🔗 tef ersi: if you or godane want to hack it i'll give you commit bit or take pull requests
15:03 🔗 tef my plan is to actually do some work on it, but now i'm working at a non-profit teaching kids to code and my life may become insane
15:21 🔗 DFJustin the hard limit is actually 100gb or more but 50gb is nicer on them
15:43 🔗 SketchCow (Technically, the hard limit is currently 2tb)
15:44 🔗 SketchCow But interesting shit snaps at 10gb, 50gb, 200gb
15:46 🔗 SketchCow root@teamarchive0:/0/PLEASUREDOME/MESS 0.149 Software List CHDs# ls
15:46 🔗 SketchCow 3do_m2 cd32 cdtv megacd neocd pippin saturn vsmile_cd
15:46 🔗 SketchCow _ReadMe_.txt cdi mac_hdd megacdj pcecd psx segacd
15:46 🔗 SketchCow That's a nice collection.
15:50 🔗 DFJustin btw https://archive.org/details/MESS-0.149.BIOS.ROMs should go in messmame
15:55 🔗 SketchCow Aware. There was a system weirdness and that item is in limbo.
15:56 🔗 DFJustin I really need to learn to check history
15:57 🔗 SketchCow I had to redo the TOSEC main page after implementing your changes.
15:57 🔗 SketchCow I ran into the upper limit of an entry's description!
17:29 🔗 Acebulf is the warrior slower than directly running the python files?
17:34 🔗 winr4r Acebulf: good question!
17:34 🔗 winr4r you should benchmark them and find out
17:34 🔗 winr4r run each for a day and see what happens
17:35 🔗 winr4r if i was making shit up on the spot, i'd say that yes, being in a VM imposes some degree of overhead, but that overhead gets swamped by real-world stuff, i.e. network latency/throughput and the like
17:36 🔗 winr4r but that would be making shit up!
17:36 🔗 winr4r go and find out for sure
17:49 🔗 Acebulf cool
18:12 🔗 Acebulf i checked out the xanga im downloading, and lol'd at "this gets old! sorry xanga, Myspace is so much better wayy better "
18:32 🔗 winr4r Acebulf: haha
18:33 🔗 winr4r HOW'S THAT WORKING OUT FOR YA http://archiveteam.org/index.php?title=Myspace#Datapocalypse_.232:_Deleting_all_your_shit
18:39 🔗 rexxar What happens if I run out of disk space while downloading with the scripts?
18:39 🔗 rexxar Will it automatically dump everything it's got and keep going, or will it just die?
18:41 🔗 winr4r rexxar: paging alard
18:42 🔗 DFJustin probably just die
18:43 🔗 * winr4r is looking at the code
18:43 🔗 winr4r pssst the correct answer is actually "don't do that"
18:48 🔗 Acebulf rexxar: likely python will raise an IOError and the entire thing will crash, unless it's been specifically programmed not to do that
18:49 🔗 rexxar Okay. I'm messing about with Amazon's free AWS tier. One of the free options is an Ubuntu server with 8GB disk space.
18:49 🔗 winr4r no, it won't
18:49 🔗 rexxar Guess the answer is just "don't run too many concurrent downloads"
18:49 🔗 winr4r the python code invokes wget
18:49 🔗 winr4r so an exception won't be raised
18:51 🔗 winr4r it does explicitly check for a failure exit code, though, i'm looking to see what exactly it does
18:51 🔗 Acebulf ah i see
19:07 🔗 winr4r okay, not gospel but looking at it, it will just remove the item from your list of shit to download, and won't report a success to the tracker
19:08 🔗 winr4r which probably means that your items will go into the "out" items that never return
19:09 🔗 winr4r not sure if that requires manual intervention on the part of the tracker admin or if it's automatically handed over to another warrior if the job is out too long
19:15 🔗 winr4r anyway, what you actually want to know: if you run out of disk space, it doesn't get handled any differently from any other kind of error
19:15 🔗 antomatic At one point jobs were automatically reissued if they'd been out for a certain amount of time (8 hours?) but that also had a side-effect that it would therefore tend to re-issue the very biggest and longest-downloading jobs
19:15 🔗 winr4r i.e. a 404 or some shit
19:16 🔗 winr4r pretty sure it doesn't crash on every 404 error!
19:16 🔗 winr4r antomatic: oh, thanks for the clarification
19:16 🔗 antomatic it's perhaps a bit of a blind spot that the tracker can't tell if something is still being actively downloaded, or tried and failed, etc. it only hears about the successes.
19:17 🔗 rexxar Would it be very difficult to have the warriors report that they're still downloading, and then re-issue tasks that have died?
19:18 🔗 antomatic It doesn't seem like it should be - if my linux-fu improves maybe one day I can help with that. but I'm still coming up to speed there.
19:21 🔗 antomatic (and no disrespect, obviously - the whole warrior/tracker setup is amazing)
19:21 🔗 winr4r it is
19:24 🔗 winr4r rexxar: possibly not too difficult at all for someone that understands the code
19:24 🔗 DFJustin that would mean a lot more load on the tracker server which may not be in the cards
19:26 🔗 antomatic the queue of items that are 'out, not returned' can be reissued by the admin, either (I believe) at an individual user level or for all outstanding items.
19:30 🔗 winr4r DFJustin: the alternative is for the warriors to report to the tracker when they get exit code 3 or 4 (I/O error or network error) to hand the job back
19:32 🔗 winr4r (wget exit code, that is)
19:34 🔗 winr4r that also means you can distinguish "really huge-ass job" from "someone didn't allocate enough disk space", because as it is now, you can't tell the difference
19:41 🔗 winr4r on the other hand, you could just not run out of disk space!
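
For reference, wget's documented exit codes make winr4r's proposal concrete: 3 is a file I/O error (what a full disk produces) and 4 is a network failure. A sketch of the hand-it-back logic; the return values and tracker interaction are placeholders, not the actual seesaw pipeline protocol:

```python
# Sketch of the "hand the item back" idea, keyed off wget's documented
# exit codes (3 = file I/O error, e.g. disk full; 4 = network failure).
import subprocess

def run_item(url, warc_name):
    rc = subprocess.call(["wget", "--warc-file=" + warc_name, url])
    if rc == 0:
        return "success"   # report the item done to the tracker
    if rc in (3, 4):
        return "requeue"   # transient failure: give the item back
    return "failed"        # anything else (a 404 alone exits with 8)
```
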
22:55 🔗 SketchCow http://i.imgur.com/Dqq7wx1.gif
22:55 🔗 SketchCow How archive team gets bandwidth
22:56 🔗 S[h]O[r]T hahaha
23:06 🔗 ivan` http://imgnook.com/645o.gif when the local pipeline expert has to stop uploading error pages
