02:00 <GLaDOS> By the way, we're still in need of figuring out a way of grabbing images from Snapjoy. If you want to help, come visit us in #snapshut
02:48 <wp494> a small snippet of what's going on in #pushharder: http://pastebin.com/ZuEskZww
03:21 <wp494> just updated the IRC channel list
03:21 <wp494> http://archiveteam.org/index.php?title=IRC
03:48 <winr4r> wp494: your claculations for the total possible number of puu.sh images is out, i make 62^5 = 916132832
03:48 <winr4r> and in reality, much less than that, since the first digit is only up to 3xxxx
03:49 <winr4r> calculations* sorry it's pre-caffeine
03:50 <wp494> hence "calculations that probably aren't worth shit"
03:50 <winr4r> wp494: yeah mine probably aren't any better either
03:51 <wp494> and yet here's mister 84% in mathematics
03:51 <wp494> I should be able to do such simple things :c
03:51 <winr4r> that would assume they weren't like "start at 800 million to make it look like we are bigger than we are"
03:53 <winr4r> and if they're only at 3xxxx and they go 0-9 A-Z a-z, then wouldn't that be 62^4 * 3 = 44329008?
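The arithmetic above checks out: 62^5 = 916,132,832 and 3 × 62^4 = 44,329,008. A minimal sketch of the same back-of-the-envelope calculation, treating the "first character only up to 3xxxx" observation as an assumption that roughly 3 of the 62 possible leading characters are in use:

# Back-of-the-envelope keyspace math for puu.sh short codes, assuming
# 5-character IDs over 0-9, A-Z, a-z (62 symbols), as discussed above.
ALPHABET = 62

full_space = ALPHABET ** 5
print(full_space)            # 916132832 -- the total keyspace figure

# Assumption: only ~3 of the 62 possible leading characters are in use yet.
in_use_estimate = 3 * ALPHABET ** 4
print(in_use_estimate)       # 44329008 -- the "in reality, much less" figure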
03:54 <winr4r> who knows, i still need caffeine
03:54 <winr4r> brb
04:20 <winr4r> in related news, bing still has a free search API tier in azure marketplace
04:20 <winr4r> guess what i am learning this morning!
04:22 <xmc> wheee
04:26 <winr4r> figure this is going to be the best way of finding *.webtv.net pages
04:26 <winr4r> and maybe snapjoy ones too!
04:35 <Acebulf> hey guys, I'm new here. Is there a way I could help archiving stuff?
04:38 <ivan`> there are a lot of ways to help
04:38 <ivan`> the xanga grab is ongoing and you can run xanga-grab; see #jenga
04:39 <ivan`> you can help set up new grabs e.g. puu.sh
04:39 <ivan`> you can also make WARCs of everything you like with wget
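For anyone who wants to follow that last suggestion, a minimal sketch of driving wget's WARC output from Python; it assumes a wget build with WARC support (1.14 or later), and the URL and filename prefix are placeholders:

# Minimal sketch: wrap wget's WARC output in a little Python. Assumes wget
# 1.14+ for the WARC options; the URL and prefix below are placeholders.
import subprocess

url = "http://example.com/"           # hypothetical site you want to preserve
warc_prefix = "example.com-grab"      # wget writes <prefix>.warc.gz

cmd = [
    "wget",
    "--mirror",                       # recurse through the site
    "--page-requisites",              # also grab images/CSS/JS the pages need
    "--wait=1",                       # be polite to the server
    "--warc-file=" + warc_prefix,     # write everything into a WARC as we go
    url,
]
exit_code = subprocess.call(cmd)
print("wget exited with", exit_code)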
04:40 <Acebulf> nice, thanks
04:40 <S[h]O[r]T> winr4r ill run against passive dns for *.webtv.net and snapjoy. im limited to 10k per atm but most dont even have that many
04:45 <winr4r> S[h]O[r]T: if you're looking at *.webtv.net, there's only going to be a few results, because webtv.net addresses are community-X.webtv.net/username
04:45 <winr4r> S[h]O[r]T: snapjoy on the other hand, uses username.snapjoy.com, so that would be very helpful
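Since snapjoy hands out username.snapjoy.com subdomains, a crude complement to the passive-DNS lookups is to test candidate usernames directly. A rough sketch with a made-up username list; note that if snapjoy serves a wildcard DNS record this only confirms the wildcard, so a follow-up HTTP request per profile would still be needed:

# Rough sketch: probe candidate username.snapjoy.com subdomains via DNS.
# The username list is a made-up placeholder, not real data.
import socket

candidates = ["alice", "bob", "photos"]

for name in candidates:
    host = "%s.snapjoy.com" % name
    try:
        socket.gethostbyname(host)
        print("resolves:", host)
    except socket.gaierror:
        print("no DNS:  ", host)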
04:51 <Acebulf> quick question: if I run the warrior, how much disk space will it use, and for how long?
04:52 <S[h]O[r]T> i dont even see a community*
04:52 <S[h]O[r]T> and yeah few results but here ya go. http://privatepaste.com/4c7f5c2edb
04:53 <S[h]O[r]T> dont see many *.snapjoy.com either
04:53 <winr4r> Acebulf: good question, i seem to recall reading that it'll be at most a few gigabytes, and that'll presumably be for only as long as that takes to upload
04:53 <S[h]O[r]T> like 300
04:53 <winr4r> S[h]O[r]T: 300 is better than 0!
04:54 <winr4r> S[h]O[r]T: interesting results there
04:54 <S[h]O[r]T> ^^ a few gb for sure. and Acebulf you can gracefully stop it whenever you want.
04:54 <Acebulf> winr4r -> will it start uploading automatically or is there an upload round in a couple months
04:54 <winr4r> Acebulf: starts uploading automatically as soon as the job is done
04:55 <S[h]O[r]T> if you choose a specific project it will go for as long as that project has items to grab and then finish. if you choose archiveteam's choice it will work on whatever project is assigned by default at the time
04:56 <S[h]O[r]T> some of those hosts may be expired/old but they were at once in use
04:56 <winr4r> S[h]O[r]T: mind if i shove that in the AT pastebin? not sure if you used privatepaste for a reason
04:56 <winr4r> S[h]O[r]T: what did you get from snapjoy btw?
04:57 <S[h]O[r]T> the at pastebin is always super slow for me :p i just prefer privatepaste. shove it wherever you want
04:57 <S[h]O[r]T> formatting the list for snapjoy atm
04:57 <S[h]O[r]T> some copy/paste work
04:57 <winr4r> thanks :)
05:02 <S[h]O[r]T> snapjoy http://privatepaste.com/6da8e5582d
05:02 <Acebulf> nice, I got the xanga grabber running
05:03 <S[h]O[r]T> the webtv are all A records btw
05:07 <winr4r> S[h]O[r]T: thank you!
05:09 <Acebulf> on the dashboard for the tracker, here : http://tracker.archiveteam.org/xanga/#show-all
05:09 <Acebulf> what's the little icons next to the names on the right?
05:10 <S[h]O[r]T> its the warrior icon, aka a guy running out of a burning building with things. it indicates those users are running the warrior and if you hover it shows you the version.
05:10 <S[h]O[r]T> other users without that icon are running the scripts standalone
05:11 <Acebulf> cool, thanks
05:16 <Acebulf> yay it worked! I got my first item completed
05:16 <winr4r> it certainly does!
05:21 <Acebulf> anyway ima head to bed and leave the warrior on overnight
05:21 <Acebulf> boom! second item completed!
05:22 <winr4r> good plan!
05:22 <winr4r> apparently bing API maxes out after like 1k results, and randomly returns duplicated ones for site:snapjoy.com
05:23 <winr4r> so plan B!
05:25 <winr4r> on the upside, that's about 1k things on webtv.net which we didn't know about before
05:39 <winr4r> oh hm, of course i can refine the query by doing 'search term site:webtv.net'
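The approach being described here, working around the roughly 1,000-result cap by running many refined queries and deduplicating the returned URLs, looks roughly like the sketch below. The endpoint and response shape are hypothetical stand-ins, not the actual Azure Marketplace Bing API:

# Sketch of "many refined queries, dedupe the URLs". SEARCH_URL and the JSON
# shape are hypothetical stand-ins for whatever search backend is available.
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://search.example.org/api"          # placeholder endpoint
terms = ["genealogy", "family history", "family tree"]

seen = set()
for term in terms:
    query = urllib.parse.urlencode({"q": "%s site:webtv.net" % term})
    with urllib.request.urlopen("%s?%s" % (SEARCH_URL, query)) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    for hit in data.get("results", []):                 # assumed response format
        seen.add(hit["url"])                            # dedupe across queries

print(len(seen), "unique webtv.net URLs so far")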
05:40 <winr4r> what queries should i run? just did genealogy, family history, family tree
05:46 <winr4r> haha this is neat, i'm getting thousands more unique pages with this
05:47 <winr4r> 5610 so far!
08:18 <PepsiMax> Hmm. My Warrior is uploading at 50 kB/s.
08:18 <PepsiMax> That's 10 times as slow as it could be.
08:19 <PepsiMax> 30 kB/s now. 11 hours before the task is uploaded.
08:23 <SmileyG> PepsiMax: we don't have much b/w spare for uploads :D
08:28 <ersi> PepsiMax: Don't worry.
08:39 <PepsiMax> eek
09:45 <Nemo_bis> is there an on demand service to get books uploaded from Google Books to archive.org?
09:46 <Nemo_bis> or a bookmarklet or whatever
09:47 <omf_> not sure
09:47 <omf_> I know IA has the want this book API
09:48 <omf_> they might have other cool tools as well
09:50 <omf_> Nemo_bis, got an example url of a google book
09:50 <Nemo_bis> omf_: AFAIK they officially have no relation to tpb
09:51 <Nemo_bis> hm?
09:51 <ersi> What does TPB have to do with anything?
09:51 <omf_> an example url of a book on google books you would like to see on archive.org
09:53 <Nemo_bis> I don't have one now, I was asked about it
11:58 <SketchCow> archive.org is constantly grabbing google books, by the way.
12:00 <godane> thats good to know
12:04 <godane> SketchCow: for some reason this item can't be searched: https://archive.org/details/HD_Nation_124
12:47 <SmileyG> godane: to me it looks like the 'ben larden' might be breaking something
13:31 <tef> ola, btw https://github.com/internetarchive/warctools/ has the latest warctools code now
13:34 <omf_> tef, is that github going to replace http://code.hanzoarchives.com/warc-tools or just be a mirror of it?
13:37 <ersi> tef: cool
13:38 <ersi> Who's "Stephen Jones"? :o
13:48 <tef> ersi: my coworker at hanzo (which i am now leaving)
13:48 <tef> i'm making sure the code gets pushed out before I disappear
13:48 <tef> then I can start writing crawlers again without worrying
13:48 <tef> because it's been a fight against management to maintain hanzowarctools in the open, and it isn't even that good a library
13:50 <ersi> shrug
13:50 <ersi> Thanks for keeping it open :)
13:51 <omf_> I updated the wiki
14:32 <SketchCow> 1054965.3 / 1365798.3 MB Rate: 373.4 / 8786.8 KB Uploaded: 130760.0 MB [77%] 0d 10:03 [ R: 0.12]
14:32 <SketchCow> That's a huge-ass torrent.
14:33 <ersi> Indeedily
14:42 <SmileyG> \o/
14:42 <SmileyG> Pouet just finished :)
14:43 <SmileyG> Downloaded: 3918573 files, 250G in 3d 0h 20m 50s (1006 KB/s)
14:43 * SmileyG uploads
14:43 <omf_> you have to break that apart, remember there is a 50gb hard limit on IA items
14:43 <SmileyG> Gah, ok how?
14:44 <omf_> megawarc should be able to do it
14:44 <SmileyG> o_O
14:44 <SmileyG> I thought that'd just bundle it into a single warc.
14:44 <ersi> thought everyone knew this
14:45 <SmileyG> ersi: you'd be surprised what everyone doesn't know.
14:45 <ersi> but yes, be kind to IA's servers - it'll also be easier to upload
14:45 <ersi> AFAIK it isn't a hard limit on 50GB and you can go further.. but you'll probably have a pain in the ass experience uploading the item
14:46 <omf_> I mean the megawarc factory is designed to output 50gb warcs
14:47 <omf_> I get an error every time I hit 50gb on an item and then an email from them about it
14:47 <ersi> well, it certainly doesn't hurt to break it up
14:49 <SmileyG> just figuring out how to break it up
14:49 <SmileyG> or can I just compress it down?
14:53 <ersi> It will probably not shrink to under 50GB from 250GB
14:59 <winr4r> ^
15:02 <SmileyG> shame :(
15:03 <SmileyG> ok so I'll try and split it up tomorrow.
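One straightforward way to split a grab like this is to batch the individual WARC files into groups that stay under the comfortable 50GB item size and upload each group as its own item. A generic sketch of just the batching step, not the megawarc tool itself; the glob path is a placeholder:

# Generic sketch: group .warc.gz files into ~50GB batches, one IA item each.
# This is not megawarc, just the batching idea; the path is a placeholder.
import glob
import os

LIMIT = 50 * 1024**3                                  # ~50GB per item
files = sorted(glob.glob("pouet-grab/*.warc.gz"))

batches, current, current_size = [], [], 0
for path in files:
    size = os.path.getsize(path)
    if current and current_size + size > LIMIT:
        batches.append(current)                       # close the full batch
        current, current_size = [], 0
    current.append(path)
    current_size += size
if current:
    batches.append(current)

for i, batch in enumerate(batches):
    gb = sum(os.path.getsize(p) for p in batch) / 1024**3
    print("item %02d: %d files, %.1f GB" % (i, len(batch), gb))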
15:03 <tef> ersi: if you or godane want to hack it i'll give you commit bit or take pull requests
15:03 <tef> my plan is to actually do some work on it, but now i'm working in a non profit for teaching kids to code and my life may become insane
15:21 <DFJustin> the hard limit is actually 100gb or more but 50gb is nicer on them
15:43 <SketchCow> (Technically, the hard limit is currently 2tb)
15:44 <SketchCow> But interesting shit snaps at 10gb, 50gb, 200gb
15:46 <SketchCow> 3do_m2 cd32 cdtv megacd neocd pippin saturn vsmile_cd
15:46 <SketchCow> _ReadMe_.txt cdi mac_hdd megacdj pcecd psx segacd
15:46 <SketchCow> root@teamarchive0:/0/PLEASUREDOME/MESS 0.149 Software List CHDs# ls
15:46 <SketchCow> That's a nice collection.
15:50 <DFJustin> btw https://archive.org/details/MESS-0.149.BIOS.ROMs should go in messmame
15:55 <SketchCow> Aware. There was a system weirdness and that item is in limbo.
15:56 <DFJustin> I really need to learn to check history
15:57 <SketchCow> I had to redo the TOSEC main page after implementing your changes.
15:57 <SketchCow> I ran into the upper limit of an entry's description!
17:29 <Acebulf> is the warrior slower than directly running the python files?
17:34 <winr4r> Acebulf: good question!
17:34 <winr4r> you should benchmark them and find out
17:35 <winr4r> run each for a day and see what happens
17:36 <winr4r> if i was making shit up on the spot, i'd say that yes, being in a VM imposes some degree of overhead, but that overhead gets swamped by real-world stuff, i.e. network latency/throughput and the like
17:36 <winr4r> but that would be making shit up!
17:49 <winr4r> go and find out for sure
18:12 <Acebulf> cool
18:32 <Acebulf> i checked out the xanga im downloading, and lol'd at "this gets old! sorry xanga, Myspace is so much better wayy better "
18:33 <winr4r> Acebulf: haha
18:39 <winr4r> HOW'S THAT WORKING OUT FOR YA http://archiveteam.org/index.php?title=Myspace#Datapocalypse_.232:_Deleting_all_your_shit
18:39 <rexxar> What happens if I run out of disk space while downloading with the scripts?
18:39 <rexxar> Will it automatically dump everything it's got and keep going, or will it just die?
18:41 <winr4r> rexxar: paging alard
18:42 <DFJustin> probably just die
18:43 * winr4r is looking at the code
18:43 <winr4r> pssst the correct answer is actually "don't do that"
18:48 <Acebulf> rexxar: likely python will raise an IOError and the entire thing would crash, unless it's been specifically programmed not to do that
18:49 <rexxar> Okay. I'm messing about with Amazon's free AWS tier. One of the free options is an Ubuntu server with 8GB disk space.
18:49 <winr4r> no, it won't
18:49 <rexxar> Guess the answer is just "don't run too many concurrent downloads"
18:49 <winr4r> the python code invokes wget
18:49 <winr4r> so an exception won't be raised
18:51 <winr4r> it does explicitly check for a failure exit code, though, i'm looking to see what exactly it does
18:51 <Acebulf> ah i see
19:07 <winr4r> okay, not gospel but looking at it, it will just remove the item from your list of shit to download, and won't report a success to the tracker
19:08 <winr4r> which probably means that your items will go into the "out" items that never return
19:09 <winr4r> not sure if that requires manual intervention on the part of the tracker admin or if it's automatically handed over to another warrior if the job is out too long
19:15 <winr4r> anyway, what you actually want to know: if you run out of disk space, it doesn't get handled any differently from any other kind of error
19:15 <antomatic> At one point jobs were automatically reissued if they'd been out for a certain amount of time (8 hours?) but that also had a side-effect that it would therefore tend to re-issue the very biggest and longest-downloading jobs
19:15 <winr4r> i.e. a 404 or some shit
19:16 <winr4r> pretty sure it doesn't crash on every 404 error!
19:16 <winr4r> antomatic: oh, thanks for the clarification
19:17 <antomatic> it's perhaps a bit of a blind spot that the tracker can't tell if something is still being actively downloaded, or tried and failed, etc. it only hears about the successes.
19:18 <rexxar> Would it be very difficult to have the warriors report that they're still downloading, and then re-issue tasks that have died?
19:21 <antomatic> It doesn't seem like it should be - if my linux-fu improves maybe one day I can help with that. but I'm still coming up to speed there.
19:21 <antomatic> (and no disrespect, obviously - the whole warrior/tracker setup is amazing)
19:24 <winr4r> it is
19:24 <winr4r> rexxar: possibly not too difficult at all for someone that understands the code
19:26 <DFJustin> that would mean a lot more load on the tracker server which may not be in the cards
19:30 <antomatic> the queue of items that are 'out, not returned' can be reissued by the admin, either (I believe) at an individual user level or for all outstanding items.
19:32 <winr4r> DFJustin: the alternative is for the warriors to report to the tracker when they get exit code 3 or 4 (I/O error or network error) to hand the job back
19:34 <winr4r> (wget exit code, that is)
19:41 <winr4r> that also means you distinguish "really huge-ass job" from "someone didn't allocate enough disk space", because as it is now, you can't tell the difference
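The idea sketched here relies on wget's documented exit codes: 3 is a file I/O error and 4 is a network failure. A simplified illustration of how a grab script could use them to hand an item back rather than silently dropping it; this is not the actual warrior/pipeline code, and report() is a placeholder:

# Simplified illustration of the exit-code idea above. Not the real warrior
# code; report() stands in for talking to the tracker.
import subprocess

def report(status, item_url):
    print(status, item_url)               # placeholder tracker call

def grab(item_url, warc_prefix):
    code = subprocess.call(["wget", "--warc-file=" + warc_prefix, item_url])
    if code == 0:
        report("done", item_url)
    elif code in (3, 4):
        # 3 = file I/O error (e.g. disk full), 4 = network failure: hand the
        # item back so another warrior can retry it.
        report("give back", item_url)
    else:
        report("failed", item_url)        # e.g. a 404 or server error
    return code

if __name__ == "__main__":
    grab("http://example.com/", "example-item")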
19:41
🔗
|
winr4r |
on the other hand, you could just not run out of disk space! |
22:55
🔗
|
SketchCow |
http://i.imgur.com/Dqq7wx1.gif |
22:55
🔗
|
SketchCow |
How archive team gets bandwidth |
22:56
🔗
|
S[h]O[r]T |
hahaha |
23:06
🔗
|
ivan` |
http://imgnook.com/645o.gif when the local pipeline expert has to stop uploading error pages |