Time |
Nickname |
Message |
05:46
|
SketchCow |
Hey. |
05:47
|
SketchCow |
Back in the US |
05:53
|
omf_ |
SketchCow, Is there a preference to how we keep the links in warc files? |
05:53
|
omf_ |
I couldn't find the answer on the wiki |
05:53
|
SketchCow |
OK, so I am a little worried about this. |
05:53
|
SketchCow |
We're solving this this week. |
05:54
|
SketchCow |
We have people wander in, go "I WANNA SAVE CAMELTOE.ORG" and later they go "I SAVED IT" |
05:54
|
SketchCow |
I want to make sure we're standardized on a WGET-WARC grab. |
05:56
|
omf_ |
this http://www.archiveteam.org/index.php?title=Wget_with_WARC_output |
05:56
|
omf_ |
and this http://www.archiveteam.org/index.php?title=Software |
05:57
|
omf_ |
You just want a single page that lays out the process from beginning to end |
05:57
|
SketchCow |
Yeah, they're there. |
05:57
|
SketchCow |
Right. |
05:57
|
omf_ |
Create a wiki page and I can start editing it |
05:57
|
SketchCow |
I want to call it Kamikazee |
05:57
|
SketchCow |
For a single individual |
05:58
|
omf_ |
I ask because we are working on backing up ugo, ign, 1up and gamespy |
05:58
|
godane |
my code for my scripts may have some use |
05:58
|
SketchCow |
I know. |
05:58
|
SketchCow |
And I want those grabbed semi-intelligently. |
05:59
|
godane |
like how to mirror forums and then grab images after that fact |
05:59
|
omf_ |
SketchCow, that is a tall order based on the test grabs we have made so far |
05:59
|
godane |
*after the fact |
06:00
|
godane |
i think i'm close to having all the g4tv.com videos |
06:00
|
godane |
the hd videos are the ones i don't know if i will have all of them |
06:00
|
godane |
or could get all of them or even storage for them |
06:01
|
SketchCow |
In general, I would HOPE it wasn't simple to grab those sites. |
06:02
|
omf_ |
My concern is the lack of a shutdown date |
06:05
|
SketchCow |
Yeah |
06:05
|
SketchCow |
It'll be 30 days or more |
06:06
|
omf_ |
I have a good solution if we do a multipass archive which can be completed in a shorter amount of time. We haven't seen any bans yet |
06:06
|
omf_ |
but previous projects all appear to be a single pass type approach |
06:08
|
omf_ |
First pass wget everything we can. Scan and map all those warcs for links and the link patterns we know. Do a second pass download using all the links we found and generated. |
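The link-mapping step between the two passes that omf_ describes could be sketched roughly as follows. This is only an illustration, not omf_'s actual tooling: it assumes the first-pass pages have already been pulled out of the WARCs into a dict of URL to HTML text, and the function name `second_pass_urls` and its parameters are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse

# Rough href/src matcher; a real grab would likely use a proper HTML parser.
LINK_RE = re.compile(r'''(?:href|src)\s*=\s*["']([^"'#]+)''', re.IGNORECASE)


def second_pass_urls(pages, fetched, in_scope):
    """Return URLs found in first-pass pages that still need downloading.

    pages    -- dict mapping a fetched URL to its HTML text (assumed input)
    fetched  -- set of URLs the first pass already grabbed
    in_scope -- set of hostnames that belong to the project
    """
    found = set()
    for base_url, html in pages.items():
        for raw in LINK_RE.findall(html):
            url = urljoin(base_url, raw.strip())  # resolve relative links
            parts = urlparse(url)
            if parts.scheme in ("http", "https") and parts.hostname in in_scope:
                found.add(url)
    # Anything already grabbed in the first pass is dropped from the list.
    return sorted(found - set(fetched))
```

The returned list could then be written to a file and handed to the downloader for the second pass. URLs generated from known link patterns would be merged into the same set before subtracting what was already fetched.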
06:10
|
omf_ |
I am also running a link mapper on these sites at present to find more buried content |
06:13
|
SketchCow |
If I had to guess, it's finding every subdomain for those domains. |
06:13
|
omf_ |
Already done |
06:13
|
SketchCow |
Is that on the wiki? |
06:14
|
omf_ |
we got this wiki page http://www.archiveteam.org/index.php?title=Ispygames but I do not have access to upload files or create new pages |
06:14
|
omf_ |
yeah I got a gamespy file, ign file and a 1up file |
06:14
|
omf_ |
and ugo |
06:18
|
omf_ |
For example of the 3,702 subdomains we know of for gamespy.com only 267 of them "work" |
06:18
|
omf_ |
Some of those redirect to other existing sites |
06:39
|
S[h]O[r]T |
omf you should be able to register for the wiki. i can change stuff tho if you need as well. |
06:41
|
omf_ |
I already have an account that I use to edit the wiki I just cannot create pages |
06:41
|
omf_ |
never gave it much thought |
06:41
|
S[h]O[r]T |
ah |
07:20
|
SketchCow |
I FINALLY wrote the stupid script to take a bunch of sets of filenames for one object. |
07:20
|
chronomex |
for creating multi-file items? |
07:21
|
SketchCow |
Example: http://archive.org/details/POWERDRIVE0198 |
07:21
|
SketchCow |
Pumped the CUE, BIN, ISO and JPG in |
07:21
|
chronomex |
ah bitchen |
07:22
|
SketchCow |
I had a whole class of waiting items for this. |
07:22
|
SketchCow |
So I can clear it out and get back into the groove |
07:23
|
chronomex |
today I met a gentleman who's scanned literally an order of magnitude more BSPs than I have |
07:23
|
chronomex |
he's entirely comfortable with putting them into IA |
07:23
|
SketchCow |
Good |
07:23
|
chronomex |
I'll coordinate that; do you think I should put them in the same collection? |
07:24
|
chronomex |
iirc the collection still has that restriction on it for all member items |
07:24
|
SketchCow |
Make a new collection |
07:24
|
chronomex |
k |
07:24
|
SketchCow |
Oh, wait, talking off the top of my head |
07:24
|
SketchCow |
To be honest, no, you should get your ass in gear on the undoing from your gang |
07:25
|
chronomex |
yes, I should |
07:25
|
SketchCow |
But now you have the back pocket "get it up" |
07:25
|
chronomex |
? |
07:25
|
chronomex |
I can't parse that |
07:32
|
SketchCow |
I mean that if you can't get the letter, we just make a new collection and use new guy's stuff |
07:33
|
chronomex |
ah yes |
07:41
|
SketchCow |
There, done, looking good. |
07:44
|
SketchCow |
PowerPlay0196.jpg PowerPlay0297.rar PowerPlay0399.jpg PowerPlay0596.rar PowerPlay0699.jpg PowerPlay0895.rar PowerPlay0998.jpg PowerPlay1195.rar PowerPlay1299.jpg |
07:44
|
SketchCow |
PowerPlay0196.rar PowerPlay0298.jpg PowerPlay0399.rar PowerPlay0597.jpg PowerPlay0699.rar PowerPlay0896.jpg PowerPlay0998.rar PowerPlay1196.jpg PowerPlay1299.rar |
07:44
|
SketchCow |
So these .rar files will be split up and then I can upload all |
07:45
|
SketchCow |
http://archive.org/details/powerdrivecd |
08:58
|
chronomex |
http://pipeline.corante.com/archives/2013/02/22/what_if_the_journal_disappears.php |
09:25
|
Lord_Nigh |
i assume the opensolaris stuff has been dealt with |
09:26
|
Lord_Nigh |
i'm getting 403 forbidden |
09:27
|
omf_ |
http://hub.opensolaris.org/bin/view/Main/ and http://src.opensolaris.org/source/ are both up |
09:29
|
omf_ |
chronomex, you got the oss list? If not I can build one up and start testing the repo pulls against what I got |
10:57
|
chronomex |
omf_: ughhhno |
11:39
|
chronomex |
omf_: fetching now, turns out the simplest way to copy out of this system is with sftp |
11:39
|
chronomex |
:) |
11:39
|
chronomex |
I'll turn it into a bunch of hg bundles once this finishes |
11:39
|
chronomex |
hg is pleasantly slow, I must say |
11:40
|
chronomex |
pity the system doesn't let you rsync, or this would be much faster |
11:41
|
omf_ |
true dat |
11:43
|
chronomex |
actually I think this is about the same speed as rsync |
11:43
|
chronomex |
the only way to win with many small files is tar/cpio, I think |
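The tar approach chronomex mentions, packing many small files into one stream before moving them, might look something like this with Python's standard `tarfile` module. It is a sketch under assumptions: the function name `bundle_tree` and the paths are hypothetical, and it says nothing about whether the remote system in question allows running anything like it.

```python
import tarfile
from pathlib import Path


def bundle_tree(src_dir, archive_path):
    """Pack every regular file under src_dir into one uncompressed tar.

    Returns the number of files added. The archive should live outside
    src_dir so it is not swept up in its own listing.
    """
    src = Path(src_dir)
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir, with forward slashes.
                tar.add(str(path), arcname=path.relative_to(src).as_posix())
                count += 1
    return count
```

Transferring the single archive avoids paying per-file overhead for each tiny file, which is the point being made in the chat.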
11:54
|
omf_ |
are any of these repos large? |
11:55
|
* |
chronomex shrugs |
11:55
|
chronomex |
still sucking em down |
11:55
|
chronomex |
hoping it'll fit on my 60G of free space in this laptop |
11:55
|
chronomex |
else I'll have to fire up the disk array |
12:05
|
chronomex |
there's a lot of fucking tiny ass files here |
12:05
|
chronomex |
this will take a while ... |
12:10
|
chronomex |
omf_: did you get http://defect.opensolaris.org/ ? |
16:39
|
omf_ |
someone else mentioned they had a script or something for bugzilla so I didn't try and grab it |
17:54
|
SketchCow |
Has opensolaris been dealt with? |
17:54
|
SketchCow |
Oh, we better get on this. |
17:54
|
SketchCow |
------------------------------ |
17:54
|
SketchCow |
OPENSOLARIS COORDINATION |
17:54
|
SketchCow |
#closedsolaris |
17:54
|
SketchCow |
------------------------------ |
18:31
|
db48x22 |
you come up with good names very quickly |
19:00
|
savetz |
what do I need to know re: Posterous? |
19:02
|
ersi |
That there's a posterous channel in #preposterus and that there's an AT warrior project either running or coming soon |
19:02
|
ersi |
scraping is done afaik, now it's grabbing dataz |
19:03
|
savetz |
I see the warrior project, it says "they will ban you, check in at IRC before running this" |
19:15
|
dashcloud |
so what happened with opensolaris? |
19:18
|
ersi |
Did you not see the notice, ten lines up? #closedsolaris |
19:23
|
dashcloud |
I did, but since I was pretty sure it was dead already and/or someone had grabbed all of their stuff previously, I was confused |
19:24
|
omf_ |
the site will be up till March 23, a few people seemed to misunderstand that |
19:26
|
ersi |
like, many |
19:54
|
db48x22 |
savetz: in order to run the posterous project correctly you need to be able to change ip addresses every hour |
19:56
|
godane |
how do i change my ip on linux? |
19:56
|
db48x22 |
if your isp can give you a new one via dhcp, then that's fairly easy |
19:56
|
db48x22 |
you'll have to convince your router to make that request though, if you have one |
20:01
|
dashcloud |
so how aggressive is the blocking? will it always block more than one instance that's running continuously? |
20:02
|
db48x22 |
they'll ban your ip no matter how slow you go |
20:02
|
db48x22 |
it's better to run flat out and get as much as you can in the hour |
20:02
|
aggrosk |
I've had some luck today at least. Looks like an instance I'm running elsewhere isn't getting the banhammer. |
20:03
|
db48x22 |
we haven't worked out a good way to hand off from address to address though |
20:04
|
db48x22 |
if you use a tun device to proxy to another machine, then you can probably move your proxy connection from server to server without stopping your downloads |
20:07
|
ersi |
dashcloud: they ban freggin' everything man |
20:08
|
db48x22 |
hrm |
20:08
|
db48x22 |
wget won't recurse |
20:08
|
db48x22 |
I put -r -l inf and it downloads the index.html and then stops |
20:08
|
aggrosk |
Well, the tor network is a good source of IP's. Not sure how you'd torify any download scripts though, or even if it's possible. |
20:08
|
db48x22 |
tor is very very slow |
20:09
|
aggrosk |
^ |
20:09
|
db48x22 |
but yea, you could do that with relatively little work |
20:09
|
db48x22 |
we'd never finish if that's all we did though |
20:11
|
db48x22 |
oooh, those idiots |
20:11
|
db48x22 |
all the links go to www3.whatever, so wget plays dumb |
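The stall db48x22 hit is consistent with wget's default of not following links to other hosts during recursion, so a start page whose links all point at a `www3.` host yields nothing further to fetch. The chat does not say how it was resolved; one common approach, sketched here, is to allow host spanning but restrict it to a list of domains. The helper name `wget_warc_command` and the example hostnames are hypothetical; the wget options themselves (`--recursive`, `--level`, `--span-hosts`, `--domains`, `--warc-file`) are real.

```python
def wget_warc_command(start_url, domains, warc_name):
    """Build a wget argument list for a recursive WARC grab across hosts.

    start_url -- page to begin crawling from
    domains   -- hostnames wget may follow links to
    warc_name -- base name for the WARC output file
    """
    return [
        "wget",
        "--recursive",
        "--level=inf",                      # same as the "-r -l inf" in the chat
        "--span-hosts",                     # allow following links off the start host
        "--domains=" + ",".join(domains),   # but only to these domains
        "--warc-file=" + warc_name,
        start_url,
    ]


# Example (hypothetical site): pass the list to subprocess.run to execute it.
cmd = wget_warc_command(
    "http://example.com/",
    ["example.com", "www3.example.com"],
    "example-grab",
)
```

Without `--domains`, `--span-hosts` would let the crawl wander onto any linked site, so the two are normally used together.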
20:20
|
godane |
so all non hd videos are downloaded |
20:20
|
godane |
computer tech videos is going to become very big soon |
20:37
|
omf_ |
thanks for doing that godane I love me some techie computer videos |
20:56
|
DFJustin |
2000CD GET |
21:01
|
omf_ |
Just a reminder we have a wiki page about IRC channels http://www.archiveteam.org/index.php?title=IRC |
21:02
|
omf_ |
I just updated it with #closedsolaris #ispygames #preposterus |
21:18
|
turnkit |
How hard would it be to take a single day snapshot of eBay? Is that impossible? |
21:20
|
dashcloud |
please add #aohell to the list as well |
21:20
|
DFJustin |
I just did |
21:24
|
omf_ |
turnkit, maybe not all of ebay but a good chunk is possible |
21:25
|
omf_ |
you would premap the url scheme and need more than a few dozen clients downloading pages |
21:25
|
omf_ |
all day |
21:25
|
omf_ |
maybe see if the API can save time |
21:26
|
chronomex |
ebay is full of pictures of unique, interesting, historically relevant items that disappear after a month |
21:26
|
chronomex |
it's infuriating |
21:27
|
omf_ |
same with craigslist |
21:27
|
chronomex |
yes, ebay more so I think |
21:54
|
omf_ |
craigslist gets some stuff that would never be posted on ebay because it is not shippable. Slightly different markets, both very important |
22:31
|
godane |
i got the total number of videos for non-HD part of g4tv.com |
22:31
|
godane |
36466 |
22:32
|
omf_ |
that is what you have downloaded? |
22:32
|
godane |
yes |
22:32
|
godane |
that doesn't include hd videos |
23:24
|
dashcloud |
from this tweet: https://twitter.com/blefurgy/status/304955585172996096 learned about: http://matkelly.com/wail/ |
23:28
|
omf_ |
his previous app WARCreate I thought was more impressive |
23:29
|
omf_ |
I'll try it when there is a linux version |