Time |
Nickname |
Message |
00:05
🔗
|
odie5533_ |
Do warriors receive a list of urls to download, or do they hunt for urls themselves? |
00:18
🔗
|
phillipsj |
pretty sure they get a list from the tracker. That way, everybody is trying different URLs |
00:41
🔗
|
odie5533_ |
but then the tracker needed to already have crawled the site, right? |
00:41
🔗
|
odie5533_ |
it seems like the site would be crawled twice then. how does the warrior help? |
00:43
🔗
|
drfsite |
Would anyone happen to have an archived copy of the media files here? |
00:43
🔗
|
drfsite |
https://web.archive.org/web/20040209025641/http://www.skycycleonline.com/media.html |
01:15
🔗
|
xmc |
odie5533_: usually we do a quick surface crawl to get valid id numbers and url formats, then fill in the tracker with things we've seen and things we've extrapolated |
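A minimal sketch of the idea xmc describes: combine ids actually seen during the surface crawl with an extrapolated id range, and expand both through a URL format. The function name `build_work_items`, the `{id}` placeholder, and the example URL are all hypothetical; the real tracker tooling is not shown in this log and may work differently.

```python
def build_work_items(url_format, seen_ids, first_id, last_id):
    """Return one URL per id, covering both seen ids and the guessed range.

    url_format is a string with an {id} placeholder (an assumption for
    this sketch). Duplicates between the two sources are collapsed.
    """
    # Union of ids observed in the crawl and ids extrapolated from the range.
    ids = set(seen_ids) | set(range(first_id, last_id + 1))
    return [url_format.format(id=i) for i in sorted(ids)]


items = build_work_items("http://example.invalid/item/{id}", [2, 10], 1, 3)
for item in items:
    print(item)
```

Each resulting URL would then be handed out by the tracker so that no two warriors work on the same one.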
03:32
🔗
|
Lord_Nigh |
did anyone archive the video of that dude knocking over the boulder? theres lots of dmca takedowns going around |
03:38
🔗
|
drfsite |
what video? |
03:38
🔗
|
odie5533_ |
this boy scout decided to knock over some million year old boulder to save children |
03:38
🔗
|
JRWR |
I know of the one |
03:40
🔗
|
odie5533_ |
Lord_Nigh: http://www.liveleak.com/view?i=727_1382054402 |
03:40
🔗
|
odie5533_ |
I'm surprised he didn't somehow manage to crush himself. |
03:41
🔗
|
odie5533_ |
yay glenn! |
03:43
🔗
|
JRWR |
Lord_Nigh: magnet:?xt=urn:btih:C49EFD4BE3FBFA7FEB8C4ABF18FAE5A5ADEAB61D&dn=jackass%20topples%20200-million-year%20rock%20formation.mp4.mp4&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.publicbt.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.ccc.de%3a80%2fannounce |
03:46
🔗
|
drfsite |
wow |
03:48
🔗
|
DFJustin |
I archived it |
04:07
🔗
|
DFJustin |
hmm who had that handy script to reupload youtube-dl output to ia |
04:08
🔗
|
joepie91 |
sounds like something I'd write, but it isn't |
04:09
🔗
|
JRWR |
I kinda wish I had a question-and-answer script to upload files to IA |
04:11
🔗
|
BlueMax |
does there even need to be question and answer? |
04:11
🔗
|
DFJustin |
found it http://code.google.com/p/emijrp/source/browse/trunk/scrapers/youtube2internetarchive.py |
04:38
🔗
|
odie5533_ |
Why isn't that on github!? |
04:40
🔗
|
odie5533_ |
Does emijrp ever come in here? |
04:40
🔗
|
yipdw |
yeah |
04:40
🔗
|
yipdw |
and probably just didn't decide to use github |
04:41
🔗
|
odie5533_ |
How often does he come in here? |
04:41
🔗
|
yipdw |
not sure |
04:52
🔗
|
odie5533_ |
DFJustin: Did you use that script? And if so, to upload what? |
05:01
🔗
|
godane |
so cause i'm nuts i found another tech podcast |
05:02
🔗
|
godane |
called The Tech Report Podcast |
05:02
🔗
|
godane |
good news is the rss feed looks like it has all the mp3s |
05:02
🔗
|
godane |
makes downloading and pushing it easier |
05:04
🔗
|
JRWR |
A New WikiDump has been made for the following Projects: https://archive.org/details/wiki-ftlwikicom https://archive.org/details/wiki-letsplaywikicom https://archive.org/details/wiki-lptwikicom and the big stuff https://archive.org/details/wiki-pcgamingwikicom |
05:30
🔗
|
godane |
i also just found a podcast called hacker public radio |
05:42
🔗
|
JRWR |
godane and its not on ia |
05:42
🔗
|
JRWR |
Sounds like a project! |
05:45
🔗
|
DFJustin |
I haven't used it yet |
05:45
🔗
|
DFJustin |
would need to adapt it to upload already-downloaded things rather than pulling fresh |
05:46
🔗
|
odie5533_ |
Do you upload every podcast you find? |
05:49
🔗
|
JRWR |
Why not? |
05:54
🔗
|
godane |
i will work on tech report podcast for the moment |
05:55
🔗
|
godane |
the hacker public radio is released in mp3, spx and ogg |
05:55
🔗
|
godane |
i'm grabbing the mp3 version since archive.org will make a ogg of that |
05:55
🔗
|
odie5533_ |
JRWR: Sounds like a lot of work for stuff that's usually pretty low quality... but if you want to, I wouldn't stop you |
05:58
🔗
|
JRWR |
well this is odd |
05:58
🔗
|
JRWR |
Why does the wiki teams batch downloader do POST on images |
05:59
🔗
|
JRWR |
that breaks NGINX |
05:59
🔗
|
odie5533_ |
What do you mean? |
05:59
🔗
|
JRWR |
2607:5300:60:ad1::1 - - [24/Oct/2013:01:52:53 -0400] [pcgamingwiki.com] "POST /images/2/2e/Zen_Puzzle_Garden_cover.png HTTP/1.0" 405 166 "-" "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0" |
06:00
🔗
|
odie5533_ |
That's bad. |
06:00
🔗
|
odie5533_ |
What are you using to dump the wiki? |
06:00
🔗
|
JRWR |
https://code.google.com/p/wikiteam/source/browse/trunk/dumpgenerator.py |
06:00
🔗
|
odie5533_ |
Who wrote it? |
06:00
🔗
|
odie5533_ |
wow that's a long script. |
06:01
🔗
|
JRWR |
look at the revisions |
06:01
🔗
|
odie5533_ |
nemo and emijrp |
06:02
🔗
|
odie5533_ |
line 671 |
06:02
🔗
|
JRWR |
line 671/1195 |
06:02
🔗
|
odie5533_ |
perhaps |
06:02
🔗
|
odie5533_ |
JRWR: What command did you use? |
06:04
🔗
|
JRWR |
launcher.py wiki.txt |
06:04
🔗
|
godane |
whats funny is that episode 1364 of hacker public radio talks about vintage tech icon pay phone coin box |
06:04
🔗
|
JRWR |
https://code.google.com/p/wikiteam/source/browse/trunk/batchdownload/launcher.py |
06:05
🔗
|
godane |
i will go after the website for stuff thats not ogg, spx, and mp3 just so we have the other stuff |
06:07
🔗
|
odie5533 |
so the launcher.py calls the dumpgenerator.py? |
06:07
🔗
|
odie5533 |
crazy. |
06:07
🔗
|
JRWR |
Yep |
06:07
🔗
|
JRWR |
its meant for a big ol list of wikis |
06:08
🔗
|
odie5533 |
JRWR: well |
06:09
🔗
|
odie5533 |
for a quick fix, just delete the ", data=...") stuff |
06:09
🔗
|
odie5533 |
so that the line reads: urllib.urlretrieve(url=url, filename='%s/%s' % (imagepath, filename2)) |
06:09
🔗
|
odie5533 |
might break other stuff though! :D |
06:10
🔗
|
odie5533 |
but that code is hackish since he's overriding urllib internals. bad bad bad! But I've done similar stuff before heh |
06:11
🔗
|
Lord_Nigh |
http://bap.ece.cmu.edu/download/bap-0.8/ was released on oct 17 and taken down on oct 22; unsure why; it was also stored at a git repo at https://github.com/cmubap/bap which was taken down simultaneously; i'm in communications with someone who has a checkout of that git |
06:11
🔗
|
Lord_Nigh |
there is some lawyer related crap why it was taken down |
06:11
🔗
|
JRWR |
oh my |
06:11
🔗
|
JRWR |
sounds like a bittorrent mirror is in order |
06:11
🔗
|
Lord_Nigh |
exactly |
06:12
🔗
|
JRWR |
ill be happy to seed it for some time :) |
06:12
🔗
|
odie5533 |
don't these lawyers know that code wants to be free? :) |
06:13
🔗
|
Lord_Nigh |
especially since the 0.7 code is still up at http://bap.ece.cmu.edu/download/bap-0.7/ though it looks like it may have been modified when everything else was taken down |
06:13
🔗
|
odie5533 |
listing: http://webcache.googleusercontent.com/search?q=cache:http://bap.ece.cmu.edu/download/bap-0.8/&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&gws_rd=cr&ei=TLpoUuKaC8v5kQfvpoBY |
06:14
🔗
|
JRWR |
damn, who thought it was a good idea to do POSTs to get data |
06:14
🔗
|
Lord_Nigh |
the code WAS released as gplv2... so once i get a copy i'm pretty sure i'm allowed to further distribute it... |
06:14
🔗
|
odie5533 |
JRWR: looks like it was done to fix the GET not working oddly enough |
06:14
🔗
|
odie5533 |
Lord_Nigh: sort of. |
06:14
🔗
|
JRWR |
lol |
06:14
🔗
|
odie5533 |
Not if it's illegal code |
06:14
🔗
|
JRWR |
ill watch the logs |
06:15
🔗
|
Lord_Nigh |
afaik its not illegal |
06:15
🔗
|
odie5533 |
If it's illegal to begin with, and they had no right to release it, then you have no right either |
06:15
🔗
|
Lord_Nigh |
true |
06:15
🔗
|
JRWR |
dat user agent |
06:16
🔗
|
odie5533 |
JRWR: yeah, I'm not sure why they didn't just use URLopenerUserAgent().retriever(...) |
06:16
🔗
|
odie5533 |
*retrieve |
06:16
🔗
|
JRWR |
sounds like a rewrite is in order |
06:17
🔗
|
odie5533 |
Perhaps just a fix. If it were rewritten, I'd say change from urllib to Twisted. |
06:17
🔗
|
JRWR |
also, I noticed its borderline a DoS |
06:18
🔗
|
JRWR |
it spams the fuck out of the webserver |
06:18
🔗
|
odie5533 |
that's not good. |
06:18
🔗
|
odie5533 |
also, it looks like the _urlopener, while looking a bit hackish, is actually recommended by the API docs. |
06:18
🔗
|
JRWR |
same network, im getting 40req/s |
06:19
🔗
|
odie5533 |
with Twisted I always use delays and set a max number of requests |
06:19
🔗
|
JRWR |
I dont mind, but adding random requests and maybe some better user agents would work |
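A rough sketch of the throttling idea raised here: pause a random interval between requests so the target server is not hit at 40 req/s. This is not how dumpgenerator.py or Twisted does it; `polite_fetch_all`, its parameters, and the delay values are assumptions made for illustration, and `fetch` stands in for whatever function actually downloads a URL.

```python
import random
import time


def polite_fetch_all(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch urls one at a time, sleeping a random interval between requests.

    fetch is any callable taking a URL; sleep is injectable so the loop
    can be exercised without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomised pause before every request except the first.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With the defaults above the loop averages about one request every two seconds; the right values depend on what the server operator considers acceptable.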
06:19
🔗
|
odie5533 |
JRWR: What is it doing, exactly? You give it a list of image urls? |
06:19
🔗
|
JRWR |
no |
06:20
🔗
|
JRWR |
it dumps the ENTIRE contents of a wiki |
06:20
🔗
|
JRWR |
XML + Images |
06:21
🔗
|
odie5533 |
First does XML right? |
06:21
🔗
|
JRWR |
Yes |
06:21
🔗
|
JRWR |
uses the API to pull it all |
06:22
🔗
|
odie5533 |
Do you do a lot of wiki archiving? |
06:22
🔗
|
JRWR |
I own a Very LARGE wiki farm |
06:22
🔗
|
odie5533 |
What does that mean? |
06:22
🔗
|
JRWR |
and I hate messing with the database, my caches love me :) |
06:23
🔗
|
JRWR |
and well, I broke my own dump scripts that they include with mediawiki |
06:23
🔗
|
JRWR |
even now im dumping a 2G XML file |
06:25
🔗
|
JRWR |
Yay! its working |
06:25
🔗
|
JRWR |
all 2200 images |
06:30
🔗
|
JRWR |
oh god FTLWiki is Huge |
06:38
🔗
|
JRWR |
ah, thats more like it https://archive.org/details/wiki-pcgamingwikicom |
06:39
🔗
|
Nemo_bis |
odie5533: no, only emijrp; I just do some small changes |
06:40
🔗
|
Nemo_bis |
JRWR: what do you mean that it breaks nginx? |
06:40
🔗
|
JRWR |
it 405s on "true" files |
06:40
🔗
|
JRWR |
if you try and do a POST on them |
06:40
🔗
|
Nemo_bis |
perhaps we should try both then |
06:41
🔗
|
JRWR |
I would try GETs first, then POSTs |
06:41
🔗
|
Nemo_bis |
apparently POST was used because in some cases GET requests didn't work, according to the comment |
06:41
🔗
|
Nemo_bis |
yeah, sure; wanna submit a patch? :) |
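A sketch of the GET-first, POST-fallback behaviour suggested above. It is not a patch for dumpgenerator.py (which targets Python 2 and `urllib.urlretrieve`); it uses Python 3's `urllib.request` instead, and `fetch_image`, `USER_AGENT`, and the `opener` parameter are hypothetical names introduced for this example.

```python
import urllib.error
import urllib.request

USER_AGENT = "example-dumper/0.1"  # hypothetical user agent string


def fetch_image(url, post_data=b"", opener=urllib.request.urlopen):
    """Return the response body, trying GET first and POST only if GET fails.

    Only HTTP error statuses trigger the fallback; connection-level
    failures (urllib.error.URLError) are left to propagate.
    """
    headers = {"User-Agent": USER_AGENT}
    try:
        # A Request without a data argument is sent as GET.
        request = urllib.request.Request(url, headers=headers)
        with opener(request) as response:
            return response.read()
    except urllib.error.HTTPError:
        # Supplying data (even empty bytes) makes urllib send a POST.
        request = urllib.request.Request(url, data=post_data, headers=headers)
        with opener(request) as response:
            return response.read()
```

With this ordering, a server like the nginx setup in the log, which answers 405 to POST on static files, would be served by the initial GET and never see the POST.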
06:41
🔗
|
JRWR |
uhhh..... me + python = bwhahah |
06:54
🔗
|
godane |
this sucks |
06:54
🔗
|
godane |
looks like there is already a collection |
06:55
🔗
|
godane |
but it was done badly and out of date |
06:55
🔗
|
godane |
this is about hacker public radio |
06:57
🔗
|
godane |
i may have to redo the first two items i have uploaded |
06:57
🔗
|
godane |
add a _mp3 to item names just so they will upload |
06:58
🔗
|
godane |
some of the way this collection was done is sort of half ass |
06:58
🔗
|
godane |
like this item: https://archive.org/details/hpr1282 |
06:58
🔗
|
odie5533 |
Nemo_bis: just do GET. Leave POST as a fallback that someone can add for when the GET doesn't work |
06:58
🔗
|
godane |
it should only be hpr1282 in it |
06:58
🔗
|
odie5533 |
I don't think POST should ever be the default behavior. |
06:58
🔗
|
godane |
but hpr1284 is also in it |
07:00
🔗
|
odie5533 |
Nemo_bis: have you read through all the code of dumpgenerator.py? Or have you only made tiny fixes to it? |
07:01
🔗
|
Nemo_bis |
odie5533: yes, I guess I read it all at some point in time |
07:02
🔗
|
odie5533 |
aren't there other scripts to generate backups of mediawiki sites? |
07:03
🔗
|
JRWR |
there are, but this set is very nice as it does all the heavy lifting for you when it works |
07:04
🔗
|
JRWR |
I just submitted three bugs |
07:08
🔗
|
odie5533 |
JRWR: It would probably help your issues if you gave the specific commands you used to reproduce the problem |
07:08
🔗
|
odie5533 |
"1. Do a normal API based Full XML+Image Dump using SVN Trunk " |
07:11
🔗
|
JRWR |
odie5533 added a comment |
07:12
🔗
|
odie5533 |
looks better |
07:12
🔗
|
odie5533 |
Is dumping wikis popular? |
07:12
🔗
|
odie5533 |
Or is dumping other stuff more popular? |
07:12
🔗
|
JRWR |
somewhat |
07:13
🔗
|
JRWR |
its more common to find a wiki |
07:13
🔗
|
JRWR |
since mediawikis are easy to setup and allow for content to be stored |
07:13
🔗
|
JRWR |
I run PCGamingWiki.com (Their servers) and well 47k a day in visits is nice |
07:28
🔗
|
odie5533 |
JRWR: What do you use to view warc files? |
07:36
🔗
|
odie5533 |
http://www.magicthegatheringtactics.com/ is already down. I assume no one got a grab of it? |
07:40
🔗
|
godane |
odie5533: its not down for me |
07:41
🔗
|
odie5533 |
oh. won't load for me. someone should probably grab it since the game is shutting down |
09:39
🔗
|
odie5533 |
Does WARC support HTTP1.1? |
09:41
🔗
|
odie5533 |
I guess it does by splitting up the request/responses. |
09:41
🔗
|
odie5533 |
HTTP1.1 makes things more complicated... |
12:05
🔗
|
yipdw |
odie5533: so long as there's one or more responses to a given request, WARC/1.0 should be able to handle any such version of HTTP |
12:06
🔗
|
yipdw |
correction, zero or more responses per one request |
12:06
🔗
|
yipdw |
WARC will correctly capture a "no response received" situation |
16:14
🔗
|
DFJustin |
paging sketchcow / undersco2 - rsync to fos failing for lack of space on device |
18:10
🔗
|
undersco2 |
please bang on this and make sure you don't see any breakage or errors |
18:10
🔗
|
undersco2 |
http://archive.org/details/historicalsoftware |
18:37
🔗
|
phillipsj |
in-browser emulators? lynx won't touch it :P |
18:39
🔗
|
ats |
undersco2: it's a bit weird pointing at the Spectrum version of Elite -- isn't the BBC version (the original) in the archive? |
18:39
🔗
|
ats |
(Ian Bell actually recommends the NES version as the best 8-bit one...) |
18:40
🔗
|
undersco2 |
unsure, would be a SketchCow question |
18:40
🔗
|
undersco2 |
he picked the things |
18:41
🔗
|
DFJustin |
it's kind of pot luck currently as to what computer systems are working |
18:42
🔗
|
DFJustin |
bbc is in mess so it ought to work, but there may be some silly issue with the compile |
18:44
🔗
|
* |
ats launches his Z80-equipped Cobra MkIII and goes for a spin |
18:47
🔗
|
touya |
new elite coming 2014, can't wait |
18:51
🔗
|
ats |
Spectrum, Apple ][ and Osborne I all seem to work OK for me, and the text looks good |
18:53
🔗
|
* |
ats idly ponders a "focus on British games" page along similar lines to point his students at... |
19:42
🔗
|
SketchCow |
Any weirdness, let me know |
19:44
🔗
|
SketchCow |
https://docs.google.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDgtQmxhQS1ibEJua1JRYlJScWt2dWc&usp=sharing |
19:57
🔗
|
SketchCow |
So look. |
19:57
🔗
|
SketchCow |
I shifted data off the filling partition |
20:02
🔗
|
yipdw |
SketchCow: my coworkers saw that software archive, they love it |
20:07
🔗
|
SketchCow |
Great |
20:46
🔗
|
JRWR |
I might have a new project to do |
20:46
🔗
|
JRWR |
http://community.eveonline.com/news/news-channels/eve-online-news/old-portrait-services-temporarily-re-enabled/ |
20:47
🔗
|
JRWR |
eve has re-enabled their old portrait server, Im already running a script right now that is brute forcing it, since the id for the avatar can be 1 to 9000000 |
20:47
🔗
|
JRWR |
the old docs are here for it http://oldportraits.eveonline.com/ |
20:47
🔗
|
JRWR |
WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD |
20:48
🔗
|
SketchCow |
yahoosucks, good sir |
20:48
🔗
|
SketchCow |
I have the greatest question ever. |
20:48
🔗
|
SketchCow |
https://archive.org/details/VisiCalc_1979_SoftwareArts |
20:48
🔗
|
SketchCow |
I can't get it to do a second row of data |
20:48
🔗
|
SketchCow |
Any ideas? |
20:53
🔗
|
DFJustin |
there's probably an easier way but you can type >A2 |
20:53
🔗
|
DFJustin |
source https://archive.org/stream/atariusersguide00fyls#page/16/mode/2up |
20:56
🔗
|
deathy |
Nemo_bis: on hp ftp.. did a compression test on "hpdesignjet.zip" to see what's possible.. nothing much came out of it. "Compression Ratio: 1.010.", couple of hundred meg savings. Not useful at all to upload it I guess.. |
21:04
🔗
|
mistym |
SketchCow: Ha, I was *just* wondering the same thing |
21:05
🔗
|
mistym |
Oh huh, I entered something that made left/right do vertical scrolling instead |
21:12
🔗
|
Nemo_bis |
deathy: with what settings? |
21:13
🔗
|
Nemo_bis |
unless you have over 20 GB RAM, you'd need -U for that one :) |
21:18
🔗
|
deathy |
Nemo_bis: ran with "-lU" since that's what you mentioned yesterday. Just got a server with 48 GB of ram today :) |
21:20
🔗
|
JRWR |
I wonder if I should submit this project to the warriors |
21:20
🔗
|
JRWR |
this is taking forever, Ive got 9 million IDs to find |
21:25
🔗
|
Nemo_bis |
deathy: wow, so you don't even need to use -U :D how long did it take? maybe you can remove even -l |
21:26
🔗
|
Nemo_bis |
I suspect the piping done by lrztar has worse effects than lrzip directly on a tar on disk |
21:28
🔗
|
deathy |
Nemo_bis: 21 minutes for the lrzip. I actually unarchived, created a tar and then ran lrzip. Well..sleep now. Let me know if you want me to try it on any other big archives |
21:29
🔗
|
Jacek |
JRWR, I'd imagine their servers can handle a nice number of connections. Got threading? |
21:33
🔗
|
Nemo_bis |
deathy: impressive :) a test without -lU would be fun |
21:33
🔗
|
Nemo_bis |
maybe that's the wrong testcase, it's possible there isn't as much duplication as in others |
21:37
🔗
|
TSwift |
what if archive.org goes down |
21:37
🔗
|
TSwift |
do we archive archive.org |
21:46
🔗
|
Nemo_bis |
TSwift: yes, for instance I ask people to mirror my https://archive.org/details/wikimediacommons collection; I'd also like to know more about the Alexandria mirror |
21:47
🔗
|
Nemo_bis |
I wonder if some researcher is downloading huge datasets; usually the link to Internet2 is much less busy, iirc. https://monitor.archive.org/weathermap/weathermap.html Maybe someone I asked to mirror Commons files :) https://en.wikipedia.org/wiki/Category:Internet_mirror_services |
21:47
🔗
|
Nemo_bis |
Also fun: http://www.internet2.edu/news/pr/2013.04.24.first-100G-transcontinental-transmission-rande-link.html |
21:48
🔗
|
DFJustin |
TSwift: http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/ |
21:48
🔗
|
TSwift |
cool, ty |
21:49
🔗
|
Nemo_bis |
also, while you're at it: http://www.newegg.com/Product/Product.aspx?Item=N82E16840995035 ;) |
21:49
🔗
|
DFJustin |
I've been meaning to write a gui leech tool with the new ia python stuff but someone will probably beat me to it |
21:52
🔗
|
Nemo_bis |
DFJustin: which new stuff? https://pypi.python.org/pypi/internetarchive (which has quite impressive stats btw) |
21:56
🔗
|
DFJustin |
that's the one |
22:04
🔗
|
JRWR |
update of the eve project atm: http://pcgamingwiki.com/eve |
22:21
🔗
|
dzne |
ugh, efnet seriously doesn't even partially mask people's IP address after all these years? |
22:22
🔗
|
touya |
never did, never will |
22:29
🔗
|
joepie91 |
^ likely accurate |
22:35
🔗
|
dzne |
the things that never change are never good things |
22:56
🔗
|
JRWR |
lol |
23:08
🔗
|
JRWR |
I like freenodes system |
23:08
🔗
|
JRWR |
:) |
23:08
🔗
|
JRWR |
why are we not on freenode anyway? |
23:09
🔗
|
SketchCow |
I like EFNet |
23:10
🔗
|
balrog |
freenode is too structured for a band of rogue archivists :) |
23:30
🔗
|
JRWR |
man this is going to take forever, anyone have ideas? the eve online project im working on, I've contacted the devs with no response so far |
23:31
🔗
|
JRWR |
here is the code Im using for the worker right now: http://hastebin.com/tonamaxovu.php |
23:34
🔗
|
dzne |
what problem are you having? |
23:35
🔗
|
JRWR |
its a image every 0.5 |
23:35
🔗
|
JRWR |
the keyspace is 9 million |
23:35
🔗
|
JRWR |
they close on the 28th, the server |
23:35
🔗
|
JRWR |
0.5s |
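The numbers above can be turned into a rough estimate. The keyspace (9 million ids) and the 0.5 s per image come from the log; the four days of remaining time is an assumption based on the server closing on the 28th, and the worker count assumes throughput scales perfectly with parallel workers, which a slow backend may not allow.

```python
import math

SECONDS_PER_DAY = 86400


def crawl_estimate(total_ids, seconds_per_request, seconds_available):
    """Return (days needed with one worker, workers needed to meet the deadline)."""
    total_seconds = total_ids * seconds_per_request
    days_single_worker = total_seconds / SECONDS_PER_DAY
    # Assumes each extra worker adds a full worker's worth of throughput.
    workers_needed = math.ceil(total_seconds / seconds_available)
    return days_single_worker, workers_needed


days, workers = crawl_estimate(9_000_000, 0.5, 4 * SECONDS_PER_DAY)
print(f"{days:.1f} days with one worker; at least {workers} workers to finish in 4 days")
# prints "52.1 days with one worker; at least 14 workers to finish in 4 days"
```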
23:36
🔗
|
dzne |
like they're throttling your connection? |
23:36
🔗
|
JRWR |
na, more like ccp being slow |
23:37
🔗
|
dzne |
when you say "worker" does that mean you have a pool of multiple of those things going at once? |
23:37
🔗
|
JRWR |
its a IIS server with a backend to MSSQL (I think) |
23:37
🔗
|
JRWR |
nope, just one ATM |
23:37
🔗
|
JRWR |
didnt want to kill it, but I didnt expect for it to be this slow |
23:37
🔗
|
dzne |
I'd run about 100 of those at once and see if that improves things :) |
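One way to run many fetches at once from a single process is a thread pool. This is a sketch only and is unrelated to the worker script linked above, which is not reproduced in the log: `fetch_range` and `fetch_one` are hypothetical names, and `fetch_one` is assumed to download and save one id, returning something small such as a status flag.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_range(fetch_one, first_id, last_id, workers=100):
    """Call fetch_one(id) for every id in [first_id, last_id] in parallel.

    Returns a dict mapping each id to fetch_one's result, or to None if
    the call raised, so one bad id does not abort the whole run.
    """
    def safe_fetch(item_id):
        try:
            return item_id, fetch_one(item_id)
        except Exception:
            return item_id, None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order; at most `workers` run at once.
        return dict(pool.map(safe_fetch, range(first_id, last_id + 1)))
```

For a keyspace of millions of ids it would be sensible to process the range in chunks and have `fetch_one` write images to disk rather than return them, so that results do not pile up in memory.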
23:38
🔗
|
JRWR |
ill give that a try, I hope they dont get mad at me |
23:38
🔗
|
dzne |
if they're closing down anyway... |
23:38
🔗
|
dzne |
they probably won't care/notice |
23:39
🔗
|
dzne |
what's "ccp" ? |
23:40
🔗
|
dzne |
oh n/m |
23:41
🔗
|
dzne |
don't know much about the game :) |
23:42
🔗
|
JRWR |
its all good, CCP are redditors and I have already made a post |
23:42
🔗
|
JRWR |
http://www.reddit.com/r/Eve/comments/1p5hrq/in_light_of_the_old_portrait_server_being_nuked/ |