| Time |
Nickname |
Message |
|
00:14
🔗
|
JesseW |
Restarting to try and make a full backup of my laptop. Wish me luck... |
|
00:15
🔗
|
|
JesseW has left |
|
00:15
🔗
|
hook54321 |
anyone know when this is happening or if they've started a small beta test of it yet? http://gizmodo.com/the-wayback-machine-is-getting-a-search-engine-1739099940 |
|
00:16
🔗
|
yipdw |
it's going to be at least a year |
|
00:18
🔗
|
hook54321 |
what do they have to do to get it working? |
|
00:27
🔗
|
bsmith093 |
anyone have the rest of the geekfu action grip podcast? i got what was in the podcast core sample on fos, but i'm pretty sure that's not all of it |
|
00:43
🔗
|
|
tomwsmf-a has quit IRC (Read error: Operation timed out) |
|
00:52
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
|
00:53
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
01:13
🔗
|
Frogging |
joepie91: for stuff like Python, virtual environments are very helpful when you've got multiple applications. I imagine Ruby is similar |
|
01:14
🔗
|
Frogging |
if you're installing all your dependencies globally to the system you're gonna have a bad time |
|
01:15
🔗
|
dan- |
esp. for py3 stuff since venv comes bundled natively now, makes deployment instructions fairly nice |
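To illustrate the point about `venv` shipping with Python 3, here is a minimal sketch that builds a per-application environment with the standard-library `venv` module. The function name `make_env` and the `app-env` directory are made up for this example, and `with_pip=False` is an assumption chosen only to keep the sketch quick and offline; a real deployment would more likely run `python3 -m venv` and then install dependencies with the environment's own pip.

```python
import sys
import venv
from pathlib import Path


def make_env(target: Path) -> Path:
    """Create an isolated environment at `target` and return its interpreter path.

    `target` is a hypothetical per-application directory, e.g. ./app-env.
    """
    # with_pip=False is an assumption for this sketch; it skips bootstrapping pip.
    venv.create(target, with_pip=False)
    # The interpreter lives in Scripts\ on Windows and bin/ elsewhere.
    if sys.platform == "win32":
        return target / "Scripts" / "python.exe"
    return target / "bin" / "python"


# Example use (hypothetical directory name):
#   python_path = make_env(Path("app-env"))
#   Run the application with that interpreter so it only sees packages
#   installed into app-env, not whatever is installed system-wide.
```

Each application gets its own directory like this, which is what keeps one application's dependencies from clobbering another's.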
|
01:37
🔗
|
|
Snoo26423 has joined #archiveteam-bs |
|
01:55
🔗
|
|
RichardG has joined #archiveteam-bs |
|
03:00
🔗
|
|
Snoo26423 has quit IRC (Read error: Operation timed out) |
|
03:03
🔗
|
|
Snoo26423 has joined #archiveteam-bs |
|
03:26
🔗
|
|
toad2 has joined #archiveteam-bs |
|
03:28
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
|
03:58
🔗
|
|
toad1 has joined #archiveteam-bs |
|
03:59
🔗
|
hook54321 |
is there a way to set files as non-public on archive.org? |
|
04:00
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
|
04:04
🔗
|
|
toad2 has joined #archiveteam-bs |
|
04:07
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
|
04:09
🔗
|
SketchCow |
Greets from Westminster, MD |
|
04:11
🔗
|
|
bwn__ has quit IRC (Read error: Operation timed out) |
|
04:16
🔗
|
|
toad1 has joined #archiveteam-bs |
|
04:18
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
|
04:18
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
|
04:24
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
04:37
🔗
|
|
toad2 has joined #archiveteam-bs |
|
04:39
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
|
05:09
🔗
|
|
toad1 has joined #archiveteam-bs |
|
05:10
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
|
05:18
🔗
|
hook54321 |
is there a way to submit a list of urls to be archived on the wayback machine? |
|
05:48
🔗
|
|
toad2 has joined #archiveteam-bs |
|
05:49
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
|
05:57
🔗
|
|
toad1 has joined #archiveteam-bs |
|
06:00
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
|
06:02
🔗
|
|
toad2 has joined #archiveteam-bs |
|
06:05
🔗
|
Honno |
Hey, I'm clueless how to use megawarc for this https://archive.org/details/archiveteam_gamemaker&tab=collection |
|
06:05
🔗
|
|
toad3 has joined #archiveteam-bs |
|
06:05
🔗
|
Honno |
I see that for IA you need to split your warcs up |
|
06:05
🔗
|
Honno |
But how do you put them back together? I see this https://github.com/alard/megawarc but have no clue how to use it |
|
06:05
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
|
06:06
🔗
|
Honno |
Do I need to get all the json files from all those items in that collection and put it in one file or something to use the above tool and aggghhh |
|
06:08
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
|
06:13
🔗
|
yipdw |
Honno: what are you trying to do |
|
06:14
🔗
|
Honno |
yipdw: make a warc composed of all those warcs |
|
06:15
🔗
|
yipdw |
Honno: use cat |
|
06:15
🔗
|
Honno |
yipdw: whats that sorry? |
|
06:15
🔗
|
yipdw |
if your only goal is concatenation it's faster than extract/compress |
|
06:16
🔗
|
yipdw |
cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz |
|
06:16
🔗
|
Honno |
is there like, a linux command where I can literally just write a concat command with all the file names? |
|
06:16
🔗
|
yipdw |
cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz |
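For anyone not on a system with `cat`, the same byte-for-byte concatenation can be sketched in Python. This is only an illustration of what the command above does; the function name `concat_files` and the file names in the usage comment are made up for the example.

```python
import shutil
from pathlib import Path
from typing import Iterable


def concat_files(parts: Iterable[Path], dest: Path) -> int:
    """Append each file in `parts` to `dest` byte for byte; return bytes written."""
    total = 0
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large WARCs never have to fit in memory.
                shutil.copyfileobj(src, out)
            total += part.stat().st_size
    return total


# Example use (hypothetical file names):
#   parts = sorted(Path(".").glob("warc*.warc.gz"))
#   concat_files(parts, Path("big.warc.gz"))
# Make sure the output name does not match the input pattern, or a rerun
# would feed the previous output back in as a part.
```

No decompression happens at any point, which is why this is faster than extracting and recompressing.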
|
06:16
🔗
|
godane |
so 1991-03 of tagesschau is getting uploaded |
|
06:17
🔗
|
hook54321 |
is there a way to submit a list of urls to be archived on the wayback machine? |
|
06:17
🔗
|
yipdw |
not officially; use web.archive.org's save page thing or you can send stuff in via archivebot |
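A rough sketch of driving the save page feature from a list of URLs follows. It assumes the public `https://web.archive.org/save/<url>` form of Save Page Now; this is not an official bulk API, the 10 second delay is a guess rather than a documented limit, and the names `save_requests` and `submit` are made up for the example.

```python
import time
import urllib.request
from typing import Iterable, List

# Assumed public Save Page Now prefix; not an official bulk-submission API.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_requests(urls: Iterable[str]) -> List[str]:
    """Turn a list of page URLs into save-page request URLs, skipping blank lines."""
    return [SAVE_PREFIX + url.strip() for url in urls if url.strip()]


def submit(urls: Iterable[str], delay_seconds: float = 10.0) -> None:
    """Request each save URL in turn, pausing between requests to stay polite."""
    for request_url in save_requests(urls):
        with urllib.request.urlopen(request_url) as response:
            print(response.status, request_url)
        time.sleep(delay_seconds)


# Example use (hypothetical file name):
#   with open("urls.txt", encoding="utf-8") as handle:
#       submit(handle)
```

For anything beyond a handful of pages, sending the list in via archivebot is probably the better route.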
|
06:18
🔗
|
yipdw |
Honno: the JSON file accompanying each megawarc item is there to make it possible to split the megawarc back into its source files |
|
06:18
🔗
|
yipdw |
so if you're splitting, yes, you want that |
|
06:18
🔗
|
yipdw |
but warc.gz files produced by megawarc are individually gzipped WARC records so concatenation is fine |
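The reason this works is that a gzip file may contain several independent members back to back, and decompressors read them as one stream. A small sketch, using short made-up byte strings in place of real WARC records and a hypothetical helper `iter_gzip_members`:

```python
import gzip
import zlib
from typing import Iterator


def iter_gzip_members(data: bytes) -> Iterator[bytes]:
    """Yield the decompressed payload of each gzip member in `data`, in order."""
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = member.decompress(data) + member.flush()
        yield payload
        # Whatever follows the end of this member is the start of the next one.
        data = member.unused_data


# Stand-ins for two individually gzipped records.
record_one = gzip.compress(b"record one\r\n")
record_two = gzip.compress(b"record two\r\n")
combined = record_one + record_two  # same effect as cat on two files

# Read as a whole, the members decompress to the concatenated payloads.
assert gzip.decompress(combined) == b"record one\r\nrecord two\r\n"
# Read member by member, each record is still separately recoverable.
assert list(iter_gzip_members(combined)) == [b"record one\r\n", b"record two\r\n"]
```

Because each record stays its own member, a concatenated file is still a valid gzip stream and individual records remain addressable.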
|
06:19
🔗
|
yipdw |
this applies only to the WARC output; catting warc.gz with tarballs may not do what you want |
|
06:19
🔗
|
yipdw |
fortunately most tarballs created in megawarced warrior output are empty tarball |
|
06:19
🔗
|
yipdw |
s |
|
06:20
🔗
|
Honno |
yipdw: sooo, no need for the JSON files if I'm going to concat right? |
|
06:20
🔗
|
yipdw |
if all you want to do is make a gigantic warc then no you don't need the JSON files |
|
06:20
🔗
|
yipdw |
I'm wondering why you want a gigantic warc, but that's a second question |
|
06:21
🔗
|
Honno |
yipdw: it's because components of one archive rely on things in the other archives, for general browsing |
|
06:21
🔗
|
yipdw |
download all the warcs and load them up into pywb |
|
06:21
🔗
|
yipdw |
it'll find them |
|
06:21
🔗
|
yipdw |
wayback has similar functionality |
|
06:22
🔗
|
Honno |
wayback seems ridiculously hard to set up |
|
06:22
🔗
|
yipdw |
then try pywb, it's easier |
|
06:22
🔗
|
yipdw |
or webarchiveplayer, which is pywb with a nicer interface |
|
06:22
🔗
|
Honno |
I'm a complete noob by the way heh, I don't do programming or anything |
|
06:22
🔗
|
Honno |
yeah I tried webarchiveplayer, that doesn't seem to have the feature of using all things tho |
|
06:22
🔗
|
Honno |
also takes ridiculously long to load |
|
06:23
🔗
|
|
Microguru has joined #archiveteam-bs |
|
06:23
🔗
|
yipdw |
you're throwing hundreds of gigabytes of data |
|
06:23
🔗
|
yipdw |
it's going to take a while no matter what |
|
06:23
🔗
|
Honno |
yeah heh, ugh |
|
06:24
🔗
|
yipdw |
in any case webarchiveplayer should support multiple WARCs fine |
|
06:24
🔗
|
yipdw |
I don't remember if it uses the cdx files |
|
06:24
🔗
|
yipdw |
or if it must reconstruct them |
|
06:24
🔗
|
yipdw |
you may have better luck downloading the WARC and CDX files, and dumping them in the same place |
|
06:25
🔗
|
Honno |
cdx huh, need to check what that is |
|
06:25
🔗
|
yipdw |
WARC index |
|
06:25
🔗
|
yipdw |
if webarchiveplayer can use the indexes you can avoid a costly reindexing |
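To show what an index buys you: once an offset and a compressed length for a record are known, a reader can jump straight to that record instead of scanning the whole file. The sketch below assumes those two numbers come from the index (which columns hold them depends on the CDX header line) and that each record is its own gzip member; `read_record` is a made-up name, not a pywb or webarchiveplayer function.

```python
import gzip


def read_record(warc_path: str, offset: int, length: int) -> bytes:
    """Return one decompressed record from a .warc.gz file.

    `offset` and `length` are assumed to come from an index entry and to
    describe exactly one gzip member inside the file.
    """
    with open(warc_path, "rb") as handle:
        handle.seek(offset)          # jump directly to the record
        blob = handle.read(length)   # read only that record's compressed bytes
    return gzip.decompress(blob)


# Example use (hypothetical values):
#   record = read_record("big.warc.gz", offset=123456, length=7890)
#   print(record[:200])
```

Without an index, the player has to read through every WARC once to build these offsets itself, which is the costly part on hundreds of gigabytes.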
|
06:25
🔗
|
Honno |
oh |
|
06:25
🔗
|
yipdw |
I know pywb uses indexes to speed up retrieval, I just can't remember whether or not it will use the ones generated at IA |
|
06:26
🔗
|
Honno |
well thanks yipdw, the ultimate goal is to web scrape data and extract all the game downloads from the site, but it seems there's a lot I need to learn about first |
|
06:28
🔗
|
yipdw |
you may want to ask ikreymer for more tips |
|
06:28
🔗
|
yipdw |
he pops in here occasionally |
|
06:28
🔗
|
Honno |
heh, another thing yipdw, the game downloads don't show up in the index of webarchiveplayer |
|
06:28
🔗
|
yipdw |
being the author of pywb I suspect he'll know more about it than me |
|
06:28
🔗
|
Honno |
ah right haha |
|
06:28
🔗
|
yipdw |
I don't know what that's from |
|
06:28
🔗
|
Honno |
all the downloads have a weird download link see, it's a query ie games/220702-karoshi-factory-remake-gmk/send_download?code=1ed32eb417091bed7fffe9e99269867ba01b54da |
|
06:29
🔗
|
Honno |
from games/220702/download |
|
06:29
🔗
|
Honno |
the site was pretty weird |
|
06:29
🔗
|
Honno |
I can't easily download the game files then? |
|
06:30
🔗
|
yipdw |
I don't know, I didn't participate in that one |
|
06:30
🔗
|
yipdw |
arkiver probably knows more about the quirks of that site |
|
06:31
🔗
|
Honno |
mhmk I'll see if they know |
|
06:32
🔗
|
Honno |
yipdw, where do I see who organized these crawls sorry? |
|
06:32
🔗
|
Honno |
I see the tracker lists folk, but thats people who contributed their computers right on the warrior |
|
06:32
🔗
|
yipdw |
oh I guess it was chfoo |
|
06:32
🔗
|
yipdw |
https://github.com/ArchiveTeam/gamemaker-sandbox-grab |
|
06:33
🔗
|
Honno |
yeah chfoo made the archive team wiki page about the project |
|
06:33
🔗
|
Honno |
also helped me out earlier so I spose thats the person I want heh |
|
06:34
🔗
|
Honno |
I'll be off, thanks for your help |
|
06:34
🔗
|
Honno |
Really need to learn this stuff, want to make a clean archive of the games from this old site |
|
06:36
🔗
|
yipdw |
np |
|
06:43
🔗
|
|
toad1 has joined #archiveteam-bs |
|
06:44
🔗
|
|
toad3 has quit IRC (Read error: Operation timed out) |
|
07:09
🔗
|
hook54321 |
is there a way to set files as non-public on archive.org? |
|
07:10
🔗
|
|
JesseW has joined #archiveteam-bs |
|
07:11
🔗
|
JesseW |
bsmith093: Finished all but Naruto (which is 18G uncompressed) -- now working on that. |
|
07:12
🔗
|
JesseW |
Currently up to 105G compressed, as opposed to the original's 108G. So it will likely be bigger, but probably not very. |
|
07:12
🔗
|
JesseW |
probably about 2GB bigger. |
|
07:13
🔗
|
JesseW |
hook54321: not as a normal user; IA staffers can do various things, though. |
|
07:36
🔗
|
|
bwn has joined #archiveteam-bs |
|
07:58
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
|
08:01
🔗
|
|
metalcamp has joined #archiveteam-bs |
|
08:12
🔗
|
|
JesseW has left |
|
08:16
🔗
|
joepie91 |
Frogging: "virtual environments" is the recommendation everybody automatically makes for Python and Ruby but 1) they are a hack that really shouldn't be necessary to begin with and 2) they don't actually fully solve the problem |
|
08:16
🔗
|
joepie91 |
they isolate dependencies on a per-application basis |
|
08:16
🔗
|
joepie91 |
but it doesn't magically allow for nested / differently versioned dependencies *within* a project |
|
08:17
🔗
|
joepie91 |
so the dep model remains broken |
|
08:17
🔗
|
joepie91 |
(and frankly, virtual environments are typically an utter mess to integrate with service/daemon managers and such) |
|
08:26
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
|
08:30
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
|
08:34
🔗
|
|
fie has joined #archiveteam-bs |
|
08:36
🔗
|
|
fie__ has quit IRC (Ping timeout: 244 seconds) |
|
08:55
🔗
|
|
lytv has joined #archiveteam-bs |
|
08:59
🔗
|
|
fie_ has joined #archiveteam-bs |
|
09:00
🔗
|
|
vtyl has quit IRC (Read error: Operation timed out) |
|
09:00
🔗
|
|
fie has quit IRC (Read error: Operation timed out) |
|
09:37
🔗
|
|
fie__ has joined #archiveteam-bs |
|
09:38
🔗
|
|
fie_ has quit IRC (Read error: Operation timed out) |
|
09:42
🔗
|
godane |
SketchCow: all of 2012 kpfa is uploaded |
|
09:42
🔗
|
godane |
i'm uploading 2013-01 now |
|
09:44
🔗
|
|
metalcamp has joined #archiveteam-bs |
|
09:45
🔗
|
|
fie_ has joined #archiveteam-bs |
|
09:46
🔗
|
|
fie__ has quit IRC (Read error: Operation timed out) |
|
09:49
🔗
|
|
fie__ has joined #archiveteam-bs |
|
09:49
🔗
|
|
fie__ has quit IRC (Client Quit) |
|
09:53
🔗
|
|
fie_ has quit IRC (Ping timeout: 370 seconds) |
|
09:55
🔗
|
|
metalcamp has quit IRC (Quit: Bye) |
|
10:06
🔗
|
|
metalcamp has joined #archiveteam-bs |
|
10:16
🔗
|
alfie |
morning all |
|
10:33
🔗
|
BnA-Rob1n |
Just read a blog post about 500px.com raising their cut for every sold picture from 30% to 70% ("to help the further growth of 500px"), one of the founders is the same as livejournal. Maybe we should do a sanity grab? |
|
10:48
🔗
|
ersi |
Of 500px? Of LiveJournal? |
|
10:50
🔗
|
BnA-Rob1n |
Well the sanity grab of livejournal is already in the disco phase. So I mean it might be good to check up on 500px as well if it's feasible to do a sanity check |
|
10:51
🔗
|
ersi |
What the fuck is a disco phase |
|
10:53
🔗
|
ersi |
Oh, discovery phase |
|
10:53
🔗
|
HCross |
discovery |
|
11:08
🔗
|
alfie |
BEARS > BEES |
|
12:02
🔗
|
godane |
i'm up to 1991-03-31 of tagesschau evening news |
|
12:02
🔗
|
godane |
NOTE: there is no 1991-03-26 episode on their site |
|
12:27
🔗
|
godane |
i think uploads to IA are getting stuck |
|
12:34
🔗
|
HCross |
godane, ditto. Newsgrabber is getting stuck |
|
12:47
🔗
|
|
acridAxid has quit IRC (marauder) |
|
12:49
🔗
|
|
acridAxid has joined #archiveteam-bs |
|
12:57
🔗
|
|
alfie has quit IRC (Quit: Seeeya! - ZNC 1.6.3+deb1+jessie0) |
|
12:57
🔗
|
|
alfie has joined #archiveteam-bs |
|
13:38
🔗
|
|
schbirid has joined #archiveteam-bs |
|
14:07
🔗
|
|
chazchaz has quit IRC (Read error: Operation timed out) |
|
14:08
🔗
|
|
Honno has quit IRC (Read error: Connection reset by peer) |
|
14:14
🔗
|
|
Coderjoe has quit IRC (Ping timeout: 260 seconds) |
|
14:16
🔗
|
|
hook54321 has quit IRC (Ping timeout: 268 seconds) |
|
14:17
🔗
|
|
chazchaz has joined #archiveteam-bs |
|
14:39
🔗
|
Frogging |
ersi: The most fabulous phase of course :p |
|
14:41
🔗
|
HCross |
it depends, it's either the discovery phase or the "angry person yelling" phase |
|
14:51
🔗
|
|
Coderjoe has joined #archiveteam-bs |
|
15:03
🔗
|
|
Honno has joined #archiveteam-bs |
|
15:11
🔗
|
|
vitzli has joined #archiveteam-bs |
|
16:13
🔗
|
|
closure has quit IRC (ZNC - 1.6.0 - http://znc.in) |
|
17:05
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
|
17:06
🔗
|
|
RichardG has joined #archiveteam-bs |
|
17:16
🔗
|
|
closure has joined #archiveteam-bs |
|
17:31
🔗
|
|
vitzli has quit IRC (Leaving) |
|
17:47
🔗
|
|
dxrt- has quit IRC (Ping timeout: 633 seconds) |
|
17:47
🔗
|
Smiley |
soooooooooo what craziness is Jason upto atm |
|
17:47
🔗
|
Smiley |
i'm watching on twitter |
|
17:51
🔗
|
JW_work |
Smiley: just moving the manuals from one place to another, AFAIK |
|
17:54
🔗
|
phuzion |
Smiley: http://pastebin.com/3meEDnQ5 that is a bit of an overview of what's going on |
|
17:55
🔗
|
phuzion |
tl;dr: SketchCow and friends rescued a shitload of manuals, and now they're just moving the manuals into a consolidated space for money savings sake. |
|
18:01
🔗
|
Smiley |
oh these the one from that shop which closed? |
|
18:04
🔗
|
HCross |
If it wasn't for the other-side-of-the-world problem, I'd be there |
|
18:04
🔗
|
|
bsmith093 has quit IRC (Ping timeout: 258 seconds) |
|
18:05
🔗
|
Smiley |
nod |
|
18:05
🔗
|
Smiley |
money i don't have right now, time,... not really |
|
18:05
🔗
|
Smiley |
but i might have been able to help at least a bit |
|
18:05
🔗
|
Smiley |
hopefully moving on thursday \o/ |
|
18:07
🔗
|
HCross |
Jason needs some stuff to move in the UK :P |
|
18:17
🔗
|
|
DopefishJ has joined #archiveteam-bs |
|
18:17
🔗
|
|
swebb sets mode: +o DopefishJ |
|
18:18
🔗
|
|
bwn has quit IRC (Ping timeout: 246 seconds) |
|
18:19
🔗
|
|
DFJustin has quit IRC (Ping timeout: 274 seconds) |
|
18:48
🔗
|
|
bwn has joined #archiveteam-bs |
|
18:48
🔗
|
|
bsmith093 has joined #archiveteam-bs |
|
18:54
🔗
|
|
Smiley has quit IRC (Remote host closed the connection) |
|
18:56
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
|
19:23
🔗
|
JW_work |
HCross: have you signed up on the archivecorps mailing list? there may be some moving jobs there. :-) |
|
19:25
🔗
|
HCross2 |
I havent |
|
19:34
🔗
|
BnA-Rob1n |
signup is here: http://archive.us7.list-manage.com/subscribe?u=30ffefa96d1767cc661f2e3ce&id=3b19db5cef |
|
19:39
🔗
|
HCross2 |
Done |
|
19:49
🔗
|
|
tomwsmf-a has joined #archiveteam-bs |
|
19:54
🔗
|
|
DopefishJ is now known as DFJustin |
|
20:07
🔗
|
chfoo |
Honno: did you see the wiki page? i updated instructions on how to access it in wayback machine if that helps |
|
20:07
🔗
|
Honno |
chfoo, yeah I did, thanks for that, will do more into explaining how to get the warcs going offline |
|
20:08
🔗
|
Honno |
just got it all downloaded and running myself |
|
20:12
🔗
|
JW_work |
so much confusion in #archiveteam... |
|
20:13
🔗
|
|
tomwsmf-a has quit IRC (Ping timeout: 258 seconds) |
|
20:16
🔗
|
alfie |
JW_work: i was about to say... linebreaks aren't fuckin punctuation :P |
|
20:39
🔗
|
|
luckcolor has joined #archiveteam-bs |
|
20:39
🔗
|
|
luckcolor has left |
|
20:46
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
|
20:51
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
20:51
🔗
|
|
JetBalsa has joined #archiveteam-bs |
|
20:52
🔗
|
|
Tom__ has joined #archiveteam-bs |
|
20:52
🔗
|
xmc |
oi Tom__, so what's your question |
|
20:54
🔗
|
Tom__ |
So the thing is the archive team crawled a social network site. it has 519 collections. I want to find a specific profile, otherwise I need to download 519 collections which is a lot of TB |
|
20:54
🔗
|
xmc |
hm yeah |
|
20:55
🔗
|
xmc |
you could download the .cdx files that go with, those are basically an index of urls |
|
20:55
🔗
|
Tom__ |
Yes, is there software to open it specifically? |
|
20:57
🔗
|
xmc |
not much that you might find useful |
|
20:57
🔗
|
Tom__ |
I mean what is the best software to open the .cdx.idx files? I can open it with notepad, but it's not good with spacing and aligning. |
|
20:57
🔗
|
xmc |
but they're just plain text files so you can just use grep |
|
20:57
🔗
|
xmc |
if you find a url in a cdx then that means it is available in the matching warc file |
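Where `grep` is not available, the same search can be sketched in a few lines of Python. This assumes the index files are plain text (optionally gzipped) with one capture per line, and that a header line, if present, starts with ` CDX`; the name `grep_cdx`, the directory name, and the profile string in the usage comment are made up for the example.

```python
import gzip
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def grep_cdx(paths: Iterable[Path], needle: str) -> Iterator[Tuple[Path, str]]:
    """Yield (index file, matching line) for every index line containing `needle`."""
    for path in paths:
        # Assumption: a .gz suffix means the index itself is gzip-compressed.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if line.startswith(" CDX"):
                    continue  # skip the field-legend header line
                if needle in line:
                    yield path, line.rstrip("\n")


# Example use (hypothetical directory and profile name):
#   for index_file, line in grep_cdx(sorted(Path("indexes").glob("*.cdx")), "some-profile"):
#       print(index_file.name, line)
```

The name of the index file that produces a hit tells you which matching WARC to download, so only that one item needs fetching rather than the whole set.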
|
20:59
🔗
|
Tom__ |
Ok, thank you. I will download the files and start searching. |
|
21:05
🔗
|
|
Tom__ has quit IRC (Quit: Page closed) |
|
21:10
🔗
|
|
luckcolor has joined #archiveteam-bs |
|
21:10
🔗
|
|
luckcolor has left |
|
21:26
🔗
|
BnA-Rob1n |
519 collections, is it hyves? |
|
21:32
🔗
|
BnA-Rob1n |
Tom__: I had a list around, uploaded it here: https://archive.org/details/warcindex-usernames.7z |
|
21:40
🔗
|
BnA-Rob1n |
added this list to the wiki for others searching an archive containing their own or a specific username on hyves |
|
22:17
🔗
|
|
Honno has quit IRC (Ping timeout: 492 seconds) |
|
22:25
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
|
22:52
🔗
|
|
hook54321 has joined #archiveteam-bs |
|
22:57
🔗
|
|
bauruine has quit IRC (Ping timeout: 260 seconds) |
|
23:14
🔗
|
|
bauruine has joined #archiveteam-bs |
|
23:22
🔗
|
|
hook54321 has quit IRC (Ping timeout: 268 seconds) |
|
23:44
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
|
23:49
🔗
|
|
RichardG has joined #archiveteam-bs |