Time |
Nickname |
Message |
00:14
🔗
|
JesseW |
Restarting to try and make a full backup of my laptop. Wish me luck... |
00:15
🔗
|
|
JesseW has left |
00:15
🔗
|
hook54321 |
anyone know when this is happening or if they've started a small beta test of it yet? http://gizmodo.com/the-wayback-machine-is-getting-a-search-engine-1739099940 |
00:16
🔗
|
yipdw |
it's going to be at least a year |
00:18
🔗
|
hook54321 |
what do they have to do to get it working? |
00:27
🔗
|
bsmith093 |
anyone have the rest of the geekfu action grip podcast? i got what was in the podcast core sample on fos, but i'm pretty sure that's not all of it |
00:43
🔗
|
|
tomwsmf-a has quit IRC (Read error: Operation timed out) |
00:52
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
00:53
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
01:13
🔗
|
Frogging |
joepie91: for stuff like Python, virtual environments are very helpful when you've got multiple applications. I imagine Ruby is similar |
01:14
🔗
|
Frogging |
if you're installing all your dependencies globally to the system you're gonna have a bad time |
01:15
🔗
|
dan- |
esp. for py3 stuff since venv comes bundled natively now, makes deployment instructions fairly nice |
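A minimal sketch of the per-application isolation Frogging and dan- describe, using the `venv` module that ships with Python 3. The directory layout and the `exampleapp` name are assumptions for illustration only; `with_pip=False` is chosen here just to keep the sketch quick and offline.

```python
import tempfile
import venv
from pathlib import Path


def make_app_env(base_dir: Path, app_name: str) -> Path:
    """Create an isolated environment for one application and return its path."""
    env_dir = base_dir / f"{app_name}-env"  # hypothetical naming scheme
    # with_pip=False avoids bootstrapping pip; pass True for a real deployment.
    venv.create(env_dir, with_pip=False)
    return env_dir


base = Path(tempfile.mkdtemp())
env_path = make_app_env(base, "exampleapp")

# Every environment gets its own pyvenv.cfg, separate from the system install.
print((env_path / "pyvenv.cfg").exists())
```

Each application then installs its dependencies into its own environment instead of the system-wide site-packages.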
01:37
🔗
|
|
Snoo26423 has joined #archiveteam-bs |
01:55
🔗
|
|
RichardG has joined #archiveteam-bs |
03:00
🔗
|
|
Snoo26423 has quit IRC (Read error: Operation timed out) |
03:03
🔗
|
|
Snoo26423 has joined #archiveteam-bs |
03:26
🔗
|
|
toad2 has joined #archiveteam-bs |
03:28
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
03:58
🔗
|
|
toad1 has joined #archiveteam-bs |
03:59
🔗
|
hook54321 |
is there a way to set files as non-public on archive.org? |
04:00
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
04:04
🔗
|
|
toad2 has joined #archiveteam-bs |
04:07
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
04:09
🔗
|
SketchCow |
Greets from Westminster, MD |
04:11
🔗
|
|
bwn__ has quit IRC (Read error: Operation timed out) |
04:16
🔗
|
|
toad1 has joined #archiveteam-bs |
04:18
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
04:18
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
04:24
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:37
🔗
|
|
toad2 has joined #archiveteam-bs |
04:39
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
05:09
🔗
|
|
toad1 has joined #archiveteam-bs |
05:10
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
05:18
🔗
|
hook54321 |
is there a way to submit a list of urls to be archived on the wayback machine? |
05:48
🔗
|
|
toad2 has joined #archiveteam-bs |
05:49
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
05:57
🔗
|
|
toad1 has joined #archiveteam-bs |
06:00
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
06:02
🔗
|
|
toad2 has joined #archiveteam-bs |
06:05
🔗
|
Honno |
Hey, I'm clueless how to use megawarc for this https://archive.org/details/archiveteam_gamemaker&tab=collection |
06:05
🔗
|
|
toad3 has joined #archiveteam-bs |
06:05
🔗
|
Honno |
I see that for IA you need to split your warcs up |
06:05
🔗
|
Honno |
But how do you put them back together? I see this https://github.com/alard/megawarc but have no clue how to use it |
06:05
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
06:06
🔗
|
Honno |
Do I need to get all the json files from all those items in that collection and put it in one file or something to use the above tool and aggghhh |
06:08
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
06:13
🔗
|
yipdw |
Honno: what are you trying to do |
06:14
🔗
|
Honno |
yipdw: make a warc composed of all those warcs |
06:15
🔗
|
yipdw |
Honno: use cat |
06:15
🔗
|
Honno |
yipdw: what's that, sorry? |
06:15
🔗
|
yipdw |
if your only goal is concatenation it's faster than extract/compress |
06:16
🔗
|
yipdw |
cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz |
06:16
🔗
|
Honno |
is there like, a linux command where I can literally just write a concat command with all the file names? |
06:16
🔗
|
yipdw |
cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz |
06:16
🔗
|
godane |
so 1991-03 of tagesschau is getting uploaded |
06:17
🔗
|
hook54321 |
is there a way to submit a list of urls to be archived on the wayback machine? |
06:17
🔗
|
yipdw |
not officially; use web.archive.org's save page thing or you can send stuff in via archivebot |
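A rough sketch of turning a URL list into save-page requests, since there is no official bulk submission. It assumes the save page is reached by prefixing the target with `https://web.archive.org/save/`; the example URLs are made up, and the sketch only builds the request URLs rather than fetching them, because rate limits and other behaviour of the service are not covered here.

```python
SAVE_PREFIX = "https://web.archive.org/save/"  # assumed form of the save-page URL


def build_save_requests(urls):
    """Return one save-page URL per non-empty input line."""
    cleaned = (u.strip() for u in urls)
    return [SAVE_PREFIX + u for u in cleaned if u]


# Made-up list, e.g. the lines of a text file with one URL per line.
url_list = ["http://example.com/a\n", "\n", "http://example.com/b\n"]

for request_url in build_save_requests(url_list):
    print(request_url)
```

Each printed URL can then be opened in a browser or fetched slowly, one at a time.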
06:18
🔗
|
yipdw |
Honno: the JSON file accompanying each megawarc item is there to make it possible to split the megawarc back into its source files |
06:18
🔗
|
yipdw |
so if you're splitting, yes, you want that |
06:18
🔗
|
yipdw |
but warc.gz files produced by megawarc are individually gzipped WARC records so concatenation is fine |
06:19
🔗
|
yipdw |
this applies only to the WARC output; catting warc.gz with tarballs may not do what you want |
06:19
🔗
|
yipdw |
fortunately most tarballs created in megawarced warrior output are empty tarball |
06:19
🔗
|
yipdw |
s |
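Why plain `cat` works on warc.gz files can be checked with a small sketch: a file made of individually gzipped records is a multi-member gzip stream, and joining two such streams byte for byte gives another valid one. The two records below are stand-ins, not real WARC records.

```python
import gzip

# Stand-ins for WARC records; real ones carry full WARC headers and payloads.
record_a = b"WARC/1.0\r\nfake record A\r\n\r\n"
record_b = b"WARC/1.0\r\nfake record B\r\n\r\n"

# Each record is compressed on its own, as yipdw says megawarc output is.
member_a = gzip.compress(record_a)
member_b = gzip.compress(record_b)

# Byte-level concatenation, the same thing `cat a.warc.gz b.warc.gz` does.
combined = member_a + member_b

# A multi-member gzip stream decompresses to the members' contents in order.
restored = gzip.decompress(combined)
print(restored == record_a + record_b)
```

No extraction or recompression is involved, which is why concatenation is faster than unpacking and repacking.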
06:20
🔗
|
Honno |
yipdw: sooo, no need for the JSON files if I'm going to concat right? |
06:20
🔗
|
yipdw |
if all you want to do is make a gigantic warc then no you don't need the JSON files |
06:20
🔗
|
yipdw |
I'm wondering why you want a gigantic warc, but that's a second question |
06:21
🔗
|
Honno |
yipdw: it's because components of one archive rely on things in the archive archives, for general browsing |
06:21
🔗
|
yipdw |
download all the warcs and load them up into pywb |
06:21
🔗
|
yipdw |
it'll find them |
06:21
🔗
|
yipdw |
wayback has similar functionality |
06:22
🔗
|
Honno |
wayback seems ridiculously hard to set up |
06:22
🔗
|
yipdw |
then try pywb, it's easier |
06:22
🔗
|
yipdw |
or webarchiveplayer, which is pywb with a nicer interface |
06:22
🔗
|
Honno |
I'm a complete noob by the way heh, I don't do programming or anything |
06:22
🔗
|
Honno |
yeah I tried webarchiveplayer, that doesn't seem to have the feature of using all things tho |
06:22
🔗
|
Honno |
also takes ridiculously long to load |
06:23
🔗
|
|
Microguru has joined #archiveteam-bs |
06:23
🔗
|
yipdw |
you're throwing hundreds of gigabytes of data |
06:23
🔗
|
yipdw |
it's going to take a while no matter |
06:23
🔗
|
Honno |
yeah heh, ugh |
06:24
🔗
|
yipdw |
in any case webarchiveplayer should support multiple WARCs fine |
06:24
🔗
|
yipdw |
I don't remember if it uses the cdx files |
06:24
🔗
|
yipdw |
or if it must reconstruct them |
06:24
🔗
|
yipdw |
you may have better luck downloading the WARC and CDX files, and dumping them in the same place |
06:25
🔗
|
Honno |
cdx huh, need to check what that is |
06:25
🔗
|
yipdw |
WARC index |
06:25
🔗
|
yipdw |
if webarchiveplayer can use the indexes you can avoid a costly reindexing |
06:25
🔗
|
Honno |
oh |
06:25
🔗
|
yipdw |
I know pywb uses indexes to speed up retrieval, I just can't remember whether or not it will use the ones generated at IA |
06:26
🔗
|
Honno |
well thanks yipdw, the ultimate goal is to web scrape data and extract all the game downloads from the site, but it seems there's a lot I need to learn about first |
06:28
🔗
|
yipdw |
you may want to ask ikreymer for more tips |
06:28
🔗
|
yipdw |
he pops in here occasionally |
06:28
🔗
|
Honno |
heh, another thing yipdw, the game downloads don't show up in the index of webarchiveplayer |
06:28
🔗
|
yipdw |
being the author of pywb I suspect he'll know more about it than me |
06:28
🔗
|
Honno |
ah right haha |
06:28
🔗
|
yipdw |
I don't know what that's from |
06:28
🔗
|
Honno |
all the downloads have a weird download link see, it's a query ie games/220702-karoshi-factory-remake-gmk/send_download?code=1ed32eb417091bed7fffe9e99269867ba01b54da |
06:29
🔗
|
Honno |
from games/220702/download |
06:29
🔗
|
Honno |
the site was pretty weird |
06:29
🔗
|
Honno |
I can't easily download the game files then? |
06:30
🔗
|
yipdw |
I don't know, I didn't participate in that one |
06:30
🔗
|
yipdw |
arkiver probably knows more about the quirks of that site |
06:31
🔗
|
Honno |
mhmk I'll see if they know |
06:32
🔗
|
Honno |
yipdw, where do I see who organized these crawls sorry? |
06:32
🔗
|
Honno |
I see the tracker lists folk, but that's people who contributed their computers on the warrior, right? |
06:32
🔗
|
yipdw |
oh I guess it was chfoo |
06:32
🔗
|
yipdw |
https://github.com/ArchiveTeam/gamemaker-sandbox-grab |
06:33
🔗
|
Honno |
yeah chfoo made the archive team wiki page about the project |
06:33
🔗
|
Honno |
also helped me out earlier so I spose thats the person I want heh |
06:34
🔗
|
Honno |
I'll be off, thanks for your help |
06:34
🔗
|
Honno |
Really need to learn this stuff, want to make a clean archive of the games from this old site |
06:36
🔗
|
yipdw |
np |
06:43
🔗
|
|
toad1 has joined #archiveteam-bs |
06:44
🔗
|
|
toad3 has quit IRC (Read error: Operation timed out) |
07:09
🔗
|
hook54321 |
is there a way to set files as non-public on archive.org? |
07:10
🔗
|
|
JesseW has joined #archiveteam-bs |
07:11
🔗
|
JesseW |
bsmith093: Finished all but Naruto (which is 18G uncompressed) -- now working on that. |
07:12
🔗
|
JesseW |
Currently up to 105G compressed, as opposed to the originals 108G. So it will likely be bigger, but probably not very. |
07:12
🔗
|
JesseW |
probably about 2GB bigger. |
07:13
🔗
|
JesseW |
hook54321: not as a normal user; IA staffers can do various things, though. |
07:36
🔗
|
|
bwn has joined #archiveteam-bs |
07:58
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
08:01
🔗
|
|
metalcamp has joined #archiveteam-bs |
08:12
🔗
|
|
JesseW has left |
08:16
🔗
|
joepie91 |
Frogging: "virtual environments" is the recommendation everybody automatically makes for Python and Ruby but 1) they are a hack that really shouldn't be necessary to begin with and 2) they don't actually fully solve the problem |
08:16
🔗
|
joepie91 |
they isolate dependencies on a per-application basis |
08:16
🔗
|
joepie91 |
but it doesn't magically allow for nested / differently versioned dependencies *within* a project |
08:17
🔗
|
joepie91 |
so the dep model remains broken |
08:17
🔗
|
joepie91 |
(and frankly, virtual environments are typically an utter mess to integrate with service/daemon managers and such) |
08:26
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
08:30
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
08:34
🔗
|
|
fie has joined #archiveteam-bs |
08:36
🔗
|
|
fie__ has quit IRC (Ping timeout: 244 seconds) |
08:55
🔗
|
|
lytv has joined #archiveteam-bs |
08:59
🔗
|
|
fie_ has joined #archiveteam-bs |
09:00
🔗
|
|
vtyl has quit IRC (Read error: Operation timed out) |
09:00
🔗
|
|
fie has quit IRC (Read error: Operation timed out) |
09:37
🔗
|
|
fie__ has joined #archiveteam-bs |
09:38
🔗
|
|
fie_ has quit IRC (Read error: Operation timed out) |
09:42
🔗
|
godane |
SketchCow: all of 2012 kpfa is uploaded |
09:42
🔗
|
godane |
i'm uploading 2013-01 now |
09:44
🔗
|
|
metalcamp has joined #archiveteam-bs |
09:45
🔗
|
|
fie_ has joined #archiveteam-bs |
09:46
🔗
|
|
fie__ has quit IRC (Read error: Operation timed out) |
09:49
🔗
|
|
fie__ has joined #archiveteam-bs |
09:49
🔗
|
|
fie__ has quit IRC (Client Quit) |
09:53
🔗
|
|
fie_ has quit IRC (Ping timeout: 370 seconds) |
09:55
🔗
|
|
metalcamp has quit IRC (Quit: Bye) |
10:06
🔗
|
|
metalcamp has joined #archiveteam-bs |
10:16
🔗
|
alfie |
morning all |
10:33
🔗
|
BnA-Rob1n |
Just read a blog post about 500px.com raising their cut for every sold picture from 30% to 70% ("to help the further growth of 500px"), one of the founders is the same as livejournal. Maybe we should do a sanity grab? |
10:48
🔗
|
ersi |
Of 500px? Of LiveJournal? |
10:50
🔗
|
BnA-Rob1n |
Well the sanity grab of livejournal is already in the disco phase. So I mean it might be good to check up on 500px as well if it's feasible to do a sanity check |
10:51
🔗
|
ersi |
What the fuck is a disco phase |
10:53
🔗
|
ersi |
Oh, discovery phase |
10:53
🔗
|
HCross |
discovery |
11:08
🔗
|
alfie |
BEARS > BEES |
12:02
🔗
|
godane |
i'm up to 1991-03-31 of tagesschau evening news |
12:02
🔗
|
godane |
NOTE: there is no 1991-03-26 episode on their site |
12:27
🔗
|
godane |
i think uploads to IA are getting stuck |
12:34
🔗
|
HCross |
godane, ditto. Newsgrabber is getting stuck |
12:47
🔗
|
|
acridAxid has quit IRC (marauder) |
12:49
🔗
|
|
acridAxid has joined #archiveteam-bs |
12:57
🔗
|
|
alfie has quit IRC (Quit: Seeeya! - ZNC 1.6.3+deb1+jessie0) |
12:57
🔗
|
|
alfie has joined #archiveteam-bs |
13:38
🔗
|
|
schbirid has joined #archiveteam-bs |
14:07
🔗
|
|
chazchaz has quit IRC (Read error: Operation timed out) |
14:08
🔗
|
|
Honno has quit IRC (Read error: Connection reset by peer) |
14:14
🔗
|
|
Coderjoe has quit IRC (Ping timeout: 260 seconds) |
14:16
🔗
|
|
hook54321 has quit IRC (Ping timeout: 268 seconds) |
14:17
🔗
|
|
chazchaz has joined #archiveteam-bs |
14:39
🔗
|
Frogging |
ersi: The most fabulous phase of course :p |
14:41
🔗
|
HCross |
it depends, it's either the discovery phase or the "angry person yelling" phase |
14:51
🔗
|
|
Coderjoe has joined #archiveteam-bs |
15:03
🔗
|
|
Honno has joined #archiveteam-bs |
15:11
🔗
|
|
vitzli has joined #archiveteam-bs |
16:13
🔗
|
|
closure has quit IRC (ZNC - 1.6.0 - http://znc.in) |
17:05
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
17:06
🔗
|
|
RichardG has joined #archiveteam-bs |
17:16
🔗
|
|
closure has joined #archiveteam-bs |
17:31
🔗
|
|
vitzli has quit IRC (Leaving) |
17:47
🔗
|
|
dxrt- has quit IRC (Ping timeout: 633 seconds) |
17:47
🔗
|
Smiley |
soooooooooo what craziness is Jason up to atm |
17:47
🔗
|
Smiley |
i'm watching on twitter |
17:51
🔗
|
JW_work |
Smiley: just moving the manuals from one place to another, AFAIK |
17:54
🔗
|
phuzion |
Smiley: http://pastebin.com/3meEDnQ5 that is a bit of an overview of what's going on |
17:55
🔗
|
phuzion |
tl;dr: SketchCow and friends rescued a shitload of manuals, and now they're just moving the manuals into a consolidated space for money savings sake. |
18:01
🔗
|
Smiley |
oh these the one from that shop which closed? |
18:04
🔗
|
HCross |
If it wasn't for the other-side-of-the-world problem, I'd be there |
18:04
🔗
|
|
bsmith093 has quit IRC (Ping timeout: 258 seconds) |
18:05
🔗
|
Smiley |
nod |
18:05
🔗
|
Smiley |
money i don't have right now, time,... not really |
18:05
🔗
|
Smiley |
but i might have been able to help at least a bit |
18:05
🔗
|
Smiley |
hopefully moving on thursday \o/ |
18:07
🔗
|
HCross |
Jason needs some stuff to move in the UK :P |
18:17
🔗
|
|
DopefishJ has joined #archiveteam-bs |
18:17
🔗
|
|
swebb sets mode: +o DopefishJ |
18:18
🔗
|
|
bwn has quit IRC (Ping timeout: 246 seconds) |
18:19
🔗
|
|
DFJustin has quit IRC (Ping timeout: 274 seconds) |
18:48
🔗
|
|
bwn has joined #archiveteam-bs |
18:48
🔗
|
|
bsmith093 has joined #archiveteam-bs |
18:54
🔗
|
|
Smiley has quit IRC (Remote host closed the connection) |
18:56
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
19:23
🔗
|
JW_work |
HCross: have you signed up on the archivecorps mailing list? there may be some moving jobs there. :-) |
19:25
🔗
|
HCross2 |
I haven't |
19:34
🔗
|
BnA-Rob1n |
signup is here: http://archive.us7.list-manage.com/subscribe?u=30ffefa96d1767cc661f2e3ce&id=3b19db5cef |
19:39
🔗
|
HCross2 |
Done |
19:49
🔗
|
|
tomwsmf-a has joined #archiveteam-bs |
19:54
🔗
|
|
DopefishJ is now known as DFJustin |
20:07
🔗
|
chfoo |
Honno: did you see the wiki page? i updated instructions on how to access it in wayback machine if that helps |
20:07
🔗
|
Honno |
chfoo, yeah I did, thanks for that, will do more into explaining how to get the warcs going offline |
20:08
🔗
|
Honno |
just got it all downloaded and running myself |
20:12
🔗
|
JW_work |
so much confusion in #archiveteam... |
20:13
🔗
|
|
tomwsmf-a has quit IRC (Ping timeout: 258 seconds) |
20:16
🔗
|
alfie |
JW_work: i was about to say... linebreaks aren't fuckin punctuation :P |
20:39
🔗
|
|
luckcolor has joined #archiveteam-bs |
20:39
🔗
|
|
luckcolor has left |
20:46
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
20:51
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
20:51
🔗
|
|
JetBalsa has joined #archiveteam-bs |
20:52
🔗
|
|
Tom__ has joined #archiveteam-bs |
20:52
🔗
|
xmc |
oi Tom__, so what's your question |
20:54
🔗
|
Tom__ |
So the thing is the archive team crawled a social network site. it has 519 collections. I want to find a specific profile, otherwise I need to download 519 collections which is a lot of TB |
20:54
🔗
|
xmc |
hm yeah |
20:55
🔗
|
xmc |
you could download the .cdx files that go with, those are basically an index of urls |
20:55
🔗
|
Tom__ |
Yes, is there software to open it specifically? |
20:57
🔗
|
xmc |
not much that you might find useful |
20:57
🔗
|
Tom__ |
I mean what is the best software to open the .cdx.idx files? I can open it with notepad, but it's not good with spacing and aligning. |
20:57
🔗
|
xmc |
but they're just plain text files so you can just use grep |
20:57
🔗
|
xmc |
if you find a url in a cdx then that means it is available in the matching warc file |
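For anyone without grep to hand, the same search can be sketched in a few lines. This assumes a space-separated CDX layout whose last field is the WARC filename; the header line, URLs, digests and filenames below are made-up sample data, and `find_in_cdx` is a hypothetical helper, not part of any tool mentioned here.

```python
def find_in_cdx(lines, needle):
    """Return CDX lines containing needle, skipping the ' CDX ...' header line."""
    return [
        line.rstrip("\n")
        for line in lines
        if needle in line and not line.startswith(" CDX")
    ]


# Made-up sample; with a real file, pass open("some-item.cdx") instead.
sample = [
    " CDX N b a m s k r M S V g\n",
    "com,example)/profile/alice 20131101000000 http://example.com/profile/alice "
    "text/html 200 AAAA - - 1234 100 example-0001.warc.gz\n",
    "com,example)/profile/bob 20131101000001 http://example.com/profile/bob "
    "text/html 200 BBBB - - 2345 1334 example-0002.warc.gz\n",
]

hits = find_in_cdx(sample, "profile/alice")
for hit in hits:
    # In this assumed layout the last field names the WARC holding the capture.
    print(hit.split()[-1])
```

Only the WARC files named by the matching lines then need to be downloaded, not the whole set of collections.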
20:59
🔗
|
Tom__ |
Ok, thank you. I will download the files and start searching. |
21:05
🔗
|
|
Tom__ has quit IRC (Quit: Page closed) |
21:10
🔗
|
|
luckcolor has joined #archiveteam-bs |
21:10
🔗
|
|
luckcolor has left |
21:26
🔗
|
BnA-Rob1n |
519 collections, is it hyves? |
21:32
🔗
|
BnA-Rob1n |
Tom__: I had a list around, uploaded it here: https://archive.org/details/warcindex-usernames.7z |
21:40
🔗
|
BnA-Rob1n |
added this list to the wiki for others searching an archive containing their own or a specific username on hyves |
22:17
🔗
|
|
Honno has quit IRC (Ping timeout: 492 seconds) |
22:25
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
22:52
🔗
|
|
hook54321 has joined #archiveteam-bs |
22:57
🔗
|
|
bauruine has quit IRC (Ping timeout: 260 seconds) |
23:14
🔗
|
|
bauruine has joined #archiveteam-bs |
23:22
🔗
|
|
hook54321 has quit IRC (Ping timeout: 268 seconds) |
23:44
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
23:49
🔗
|
|
RichardG has joined #archiveteam-bs |