Time |
Nickname |
Message |
00:21
๐
|
arkiver |
Let's get googlecode running again |
00:21
๐
|
arkiver |
it will go to FOS |
00:21
๐
|
arkiver |
I'm sorting and requeueing some items |
00:45
๐
|
arkiver |
We have restarted Google Code!! |
00:46
๐
|
Frogging |
I thought the contents of Google Code are staying where they are, just read-only |
00:46
๐
|
Frogging |
or is Google probably going to delete them? |
00:47
๐
|
arkiver |
We're grabbing the original content and URLs |
00:47
๐
|
arkiver |
not the Google Code archive by Google |
00:47
๐
|
Frogging |
oh |
00:47
๐
|
arkiver |
The Google Code Archive is a great thing, but it does miss some information |
01:04
๐
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
01:07
๐
|
|
dashcloud has joined #archiveteam |
01:22
๐
|
|
Start has joined #archiveteam |
01:26
๐
|
davidar |
is this the right place to talk about emularity? |
01:27
๐
|
|
RedType has left |
01:47
๐
|
Guest45 |
i don't know if this is the correct place to talk about this or not but we should probably save the Google Drive-hosted homepages before they're deleted |
01:47
๐
|
Frogging |
What are you referring to? |
01:48
๐
|
davidar |
arkiver: also, I have the URL list we were talking about the other day, but the details are a little different to the information I was initially given |
01:48
๐
|
arkiver |
great! |
01:49
๐
|
arkiver |
It's quite late here and I'm almost going to bed though |
01:49
๐
|
arkiver |
Would you have some time in around 10 hours? |
01:49
๐
|
Guest45 |
@Frogging http://googleappsdeveloper.blogspot.com/2015/08/deprecating-web-hosting-support-in.html |
01:49
๐
|
davidar |
arkiver: sure |
01:50
๐
|
Frogging |
Oh. Yeah. That should probably be saved Guest45 |
01:50
๐
|
arkiver |
davidar: thanks! have a good day! |
01:51
๐
|
davidar |
:) |
01:55
๐
|
|
JesseW has joined #archiveteam |
02:06
๐
|
|
philpem has quit IRC (Ping timeout: 260 seconds) |
02:07
๐
|
Guest45 |
we should also probably save the windstream cable ISP domains too |
02:07
๐
|
Guest45 |
err, homepages |
02:07
๐
|
Guest45 |
path: home.windstream.net/USERNAME |
02:11
๐
|
davidar |
can I throw *.customer.netspace.net.au into the list too? |
02:13
๐
|
|
tomwsmf-a has quit IRC (Read error: Operation timed out) |
02:15
๐
|
|
ndiddy has joined #archiveteam |
02:15
๐
|
dxrt |
Haha.. anything acquired by TPG/iiNet needs saving ASAP :) |
02:28
๐
|
|
xXx_ndidd has quit IRC (Read error: Operation timed out) |
02:32
๐
|
|
kcaj has quit IRC (Quit: ZNC - 1.6.0 - http://znc.in) |
02:45
๐
|
|
nwf has quit IRC (Read error: Operation timed out) |
02:45
๐
|
|
closure has quit IRC (Read error: Operation timed out) |
02:46
๐
|
|
kcaj has joined #archiveteam |
02:46
๐
|
|
aMunster has quit IRC (Read error: Operation timed out) |
02:46
๐
|
|
pgoetz has quit IRC (Read error: Operation timed out) |
02:46
๐
|
|
MMovie has quit IRC (Read error: Operation timed out) |
02:46
๐
|
|
pgoetz has joined #archiveteam |
02:47
๐
|
|
vegbrasil has quit IRC (Read error: Operation timed out) |
02:48
๐
|
|
beardicus has quit IRC (Read error: Operation timed out) |
02:49
๐
|
|
kcaj has quit IRC (Client Quit) |
02:53
๐
|
|
kcaj has joined #archiveteam |
03:06
๐
|
davidar |
dxrt: so, basically almost all of the au ISPs? :p |
03:06
๐
|
dxrt |
yeah sadly.. |
03:07
๐
|
|
closure has joined #archiveteam |
03:07
๐
|
davidar |
dxrt: is there anyone left that isn't part of the big 3 by now? |
03:08
๐
|
dxrt |
I'm honestly struggling to think of any. iiNet slowly but surely swallowed most of the little guys over ten years or so.. and then TPG came along and it was all over. |
03:09
๐
|
davidar |
:( |
03:09
๐
|
dxrt |
But lots and lots of valuable stuff that needs saving! |
03:13
๐
|
|
vegbrasil has joined #archiveteam |
03:25
๐
|
|
acridAxid has quit IRC (marauder) |
03:26
๐
|
|
acridAxid has joined #archiveteam |
03:34
๐
|
|
nwf has joined #archiveteam |
03:37
๐
|
|
JesseW has quit IRC (Quit: Leaving.) |
03:42
๐
|
|
closure has quit IRC (Read error: Operation timed out) |
03:42
๐
|
|
nwf has quit IRC (Read error: Operation timed out) |
03:45
๐
|
|
vegbrasil has quit IRC (Read error: Operation timed out) |
03:47
๐
|
|
rossdylan has quit IRC (Read error: Operation timed out) |
03:48
๐
|
|
mhazinsk has quit IRC (Ping timeout: 633 seconds) |
03:52
๐
|
|
wyatt8740 has joined #archiveteam |
03:53
๐
|
|
Jonimus has quit IRC (Ping timeout: 633 seconds) |
03:56
๐
|
|
wyatt8740 has quit IRC (Ping timeout: 246 seconds) |
04:14
๐
|
|
aMunster has joined #archiveteam |
04:14
๐
|
|
vegbrasil has joined #archiveteam |
04:17
๐
|
|
MMovie has joined #archiveteam |
04:17
๐
|
|
closure has joined #archiveteam |
04:23
๐
|
|
beardicus has joined #archiveteam |
04:24
๐
|
|
mhazinsk has joined #archiveteam |
04:26
๐
|
|
bwn has quit IRC (Ping timeout: 492 seconds) |
04:27
๐
|
|
wyatt8750 has joined #archiveteam |
04:31
๐
|
|
wyatt8750 has quit IRC (Client Quit) |
04:32
๐
|
|
nwf has joined #archiveteam |
04:34
๐
|
|
wyatt8750 has joined #archiveteam |
04:45
๐
|
roninski1 |
Hey, does anybody know if it's possible to get a specific file from a warc archive with warcat without extracting the whole thing? |
04:53
๐
|
roninski1 |
or some other way to pull a specific cile out? |
04:53
๐
|
roninski1 |
file* |
04:57
๐
|
yipdw |
https://pypi.python.org/pypi/Warcat/ |
04:57
๐
|
yipdw |
extract ought to do it |
04:57
๐
|
yipdw |
oh wait, without extracting the whole thing |
04:57
๐
|
yipdw |
sorry |
04:57
๐
|
roninski1 |
yeah XD |
04:58
๐
|
yipdw |
you could iterate through the records, look for the appropriate request/response pairs, and extract that |
05:00
๐
|
|
Jonimus has joined #archiveteam |
05:00
๐
|
roninski1 |
hmmmm |
05:00
๐
|
|
vitzli has joined #archiveteam |
05:01
๐
|
roninski1 |
how exactly would i go about that? |
05:02
๐
|
|
wvdp has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
05:05
๐
|
yipdw |
the only way that's immediately clear to me is warcat's library interface, so you'll be using Python if you choose that route |
05:05
๐
|
roninski1 |
i'll check out the API |
05:06
๐
|
yipdw |
you'd be checking out each record in warc.records and matching the headers in the WARC header or content header blocks for what you want |
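A rough sketch of the record-by-record approach yipdw describes, for anyone following along. It uses only the Python standard library rather than warcat's own interface (which is not shown in this log), so `iter_warc_records` and `find_record` are hypothetical helper names, and the parser assumes simple, unfolded WARC headers with a `Content-Length` field.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Optional, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record block is exactly Content-Length bytes long.
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def find_record(path: str, target_uri: str,
                record_type: str = "response") -> Optional[Tuple[Dict[str, str], bytes]]:
    """Return the first record of the given type for target_uri, or None."""
    # gzip.open reads per-record gzipped WARCs as one continuous stream.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        for headers, block in iter_warc_records(stream):
            # Some writers wrap the URI in angle brackets, so strip them.
            uri = headers.get("warc-target-uri", "").strip("<>")
            if headers.get("warc-type") == record_type and uri == target_uri:
                return headers, block
    return None
```

This still reads the archive from the start until it hits the match, so it avoids extracting everything to disk but not the scan itself; an index or a known offset is what avoids the scan.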
05:07
๐
|
yipdw |
another possibility is using indexes on WARCs to speed things up |
05:07
๐
|
roninski1 |
well i actually have the exact offset and stuff i want |
05:07
๐
|
roninski1 |
{"target":{"container":"warc","offset":48081856711,"size":654183},"src_offsets":{"entry":48808889856,"data":48808890368,"next_entry":48809544704} |
05:07
๐
|
yipdw |
in that case you might not need warcat |
05:07
๐
|
|
atlogbot has joined #archiveteam |
05:07
๐
|
roninski1 |
yeah? |
05:08
๐
|
yipdw |
if you have the offset and size of the record, you can extract it directly |
05:08
๐
|
roninski1 |
how would i go about that? |
05:08
๐
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
05:08
๐
|
yipdw |
typically gzipped WARCs are zipped per-record, so if that offset refers to an offset in the file you can just seek there and copy size bytes out |
05:09
๐
|
yipdw |
er, refers to an offset where a gzipped record starts |
05:09
๐
|
yipdw |
I don't know what WARC this is |
05:09
๐
|
roninski1 |
it's a megawarc |
05:09
๐
|
roninski1 |
that's the offset/etc for the warc within that that i'm after |
05:09
๐
|
roninski1 |
if that's how warc's work |
05:10
๐
|
yipdw |
I forget if the offset refers to a byte offset in the compressed WARC |
05:10
๐
|
yipdw |
but you might as well assume it does and try |
05:10
๐
|
roninski1 |
sure |
05:11
๐
|
yipdw |
in fact |
05:11
๐
|
yipdw |
no need to assume |
05:11
๐
|
yipdw |
https://github.com/ArchiveTeam/megawarc |
05:11
๐
|
yipdw |
the JSON spec is there, with annotations |
05:11
๐
|
roninski1 |
awesomesauce |
05:11
๐
|
roninski1 |
okay cool so i can probs pull it out with dd then yeah? |
05:12
๐
|
yipdw |
I guess |
05:13
๐
|
roninski1 |
guess i'll try it |
05:15
๐
|
|
Sk1d has joined #archiveteam |
05:17
๐
|
|
wyatt8750 has quit IRC (Ping timeout: 246 seconds) |
05:27
๐
|
roninski1 |
ayy it worked |
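For reference, the same seek-and-copy idea that roninski1 tried with dd can be sketched in Python. This assumes, as the discussion above suggests, that `offset` and `size` under `target` are byte positions in the compressed megawarc file and that the span starts on a gzip record boundary; `read_span`, `read_record`, and `extract_warc` are hypothetical helper names, not part of the megawarc tool.

```python
import gzip


def read_span(path: str, offset: int, size: int) -> bytes:
    """Return exactly `size` bytes starting at byte `offset` of the file."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    if len(data) != size:
        raise ValueError("file is shorter than offset + size")
    return data


def read_record(path: str, offset: int, size: int) -> bytes:
    """Read a span and gunzip it; only valid if it starts at a gzip member."""
    return gzip.decompress(read_span(path, offset, size))


def extract_warc(megawarc_path: str, entry: dict, out_path: str) -> None:
    """Copy the span described by one megawarc JSON entry into its own file."""
    target = entry["target"]
    if target["container"] != "warc":
        raise ValueError("entry is not stored in the WARC container")
    # The whole span is held in memory, which is fine for small entries
    # such as the 654183-byte one quoted above.
    with open(out_path, "wb") as out:
        out.write(read_span(megawarc_path, target["offset"], target["size"]))
```

With the entry quoted above parsed via `json.loads`, a call like `extract_warc("some.megawarc.warc.gz", entry, "one.warc.gz")` would write the 654183 bytes starting at offset 48081856711 to a standalone file; both file names here are placeholders.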
05:30
๐
|
|
JesseW has joined #archiveteam |
05:39
๐
|
|
wyatt8750 has joined #archiveteam |
05:58
๐
|
|
roninski has joined #archiveteam |
06:00
๐
|
|
roninski1 has quit IRC (Ping timeout: 258 seconds) |
06:33
๐
|
|
Stiletto is now known as Stilett0 |
06:43
๐
|
|
dxrt- has quit IRC (Remote host closed the connection) |
06:45
๐
|
|
dxrt- has joined #archiveteam |
06:46
๐
|
|
dxrt sets mode: +o dxrt- |
06:47
๐
|
|
JesseW has quit IRC (Quit: Leaving.) |
06:49
๐
|
|
JesseW has joined #archiveteam |
06:54
๐
|
|
JesseW has quit IRC (Client Quit) |
06:58
๐
|
|
JesseW has joined #archiveteam |
07:51
๐
|
JesseW |
OK, I'm going to do a URLteam search for all the ISP webspace URLs listed on http://archiveteam.org/index.php?title=ISP_Hosting |
08:39
๐
|
|
Tomcat_ has joined #archiveteam |
08:41
๐
|
JesseW |
There are **117** URL patterns on that page (excluding offline ones) |
08:43
๐
|
JesseW |
In 5 old files, it's already found over 22,000 URLs. |
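A minimal sketch of what such a pattern search might look like. JesseW's actual script is not shown in this log, so the function names are hypothetical, and the pattern list holds only the two hosts mentioned earlier in the channel rather than the 117 patterns on the wiki page. Host and path are matched separately so that a wildcard host pattern cannot accidentally match text in another site's path.

```python
import fnmatch
from typing import Iterable, Iterator, List, Tuple
from urllib.parse import urlsplit

# (host glob, required path prefix); an illustrative subset only.
PATTERNS: List[Tuple[str, str]] = [
    ("home.windstream.net", "/"),
    ("*.customer.netspace.net.au", "/"),
]


def matches_any(url: str, patterns: List[Tuple[str, str]] = PATTERNS) -> bool:
    """True if the URL's host and path match at least one pattern."""
    # Accept scheme-less entries such as "home.windstream.net/USERNAME".
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.hostname or ""  # urlsplit already lower-cases the host
    path = parts.path or "/"
    return any(
        fnmatch.fnmatchcase(host, host_glob) and path.startswith(prefix)
        for host_glob, prefix in patterns
    )


def distinct_matches(urls: Iterable[str]) -> Iterator[str]:
    """Yield each matching URL once, in first-seen order."""
    seen = set()
    for url in urls:
        url = url.strip()
        if url and url not in seen and matches_any(url):
            seen.add(url)
            yield url
```

Feeding the lines of a URL dump through `distinct_matches` gives the de-duplicated list of ISP-hosted URLs; for millions of URLs the `seen` set may need replacing with an on-disk sort and uniq step.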
08:45
๐
|
|
JesseW has quit IRC (Quit: Leaving.) |
09:15
๐
|
|
philpem has joined #archiveteam |
09:19
๐
|
|
bwn has joined #archiveteam |
09:23
๐
|
|
ersi has quit IRC (Ping timeout: 260 seconds) |
09:34
๐
|
|
schbirid has joined #archiveteam |
09:36
๐
|
|
wvdp has joined #archiveteam |
09:39
๐
|
|
bzc6p has joined #archiveteam |
09:39
๐
|
|
bzc6p has left |
09:43
๐
|
|
ersi has joined #archiveteam |
09:51
๐
|
Nemo_bis |
And recursively archive each found subsite? |
09:51
๐
|
Nemo_bis |
Do we have an idea of the percentage of URL shorteners targets which are already archived by wayback? |
10:19
๐
|
|
paul has joined #archiveteam |
10:19
๐
|
paul |
hello |
10:20
๐
|
PurpleSym |
Hi. |
10:20
๐
|
paul |
So I was looking at the wiki page |
10:21
๐
|
paul |
and the loveisover archive seems to be down |
10:21
๐
|
paul |
but the wiki doesn't reflect that |
10:22
๐
|
PurpleSym |
Go ahead and edit it. |
10:23
๐
|
paul |
Cool, was just wondering if anyone had some info on what happened. |
10:23
๐
|
|
wvdp has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
10:25
๐
|
paul |
WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD |
10:27
๐
|
schbirid |
yahoosucks |
10:27
๐
|
ersi |
^ |
10:30
๐
|
paul |
Just read the article, that's pretty poor business practices. |
11:05
๐
|
|
RichardG_ has joined #archiveteam |
11:06
๐
|
|
RichardG has quit IRC (Ping timeout: 272 seconds) |
11:23
๐
|
|
vitzli has quit IRC (Leaving) |
11:35
๐
|
paul |
added a new archiver I found to the 4chan page, if anyone wants to check it out. I'm off to sleep |
11:42
๐
|
|
paul has quit IRC (Ping timeout: 268 seconds) |
13:06
๐
|
|
Stilett0 is now known as Stiletto |
13:41
๐
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
13:44
๐
|
|
dashcloud has joined #archiveteam |
13:59
๐
|
|
Tomcat_ has quit IRC (Remote host closed the connection) |
14:07
๐
|
|
roninski has left |
14:16
๐
|
arkiver |
davidar: do you have some time now? |
14:16
๐
|
|
vitzli has joined #archiveteam |
15:01
๐
|
|
Guest45 has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
15:08
๐
|
|
RichardG_ has quit IRC (Ping timeout: 272 seconds) |
15:14
๐
|
|
RichardG has joined #archiveteam |
15:34
๐
|
|
atomotic has joined #archiveteam |
15:45
๐
|
|
wvdp has joined #archiveteam |
16:00
๐
|
|
schbirid has quit IRC (Remote host closed the connection) |
16:07
๐
|
|
Tomcat_ has joined #archiveteam |
16:11
๐
|
|
nwf has quit IRC (Read error: Operation timed out) |
16:12
๐
|
|
MMovie1 has joined #archiveteam |
16:12
๐
|
|
mhazinsk has quit IRC (Read error: Operation timed out) |
16:13
๐
|
|
MMovie has quit IRC (Read error: Operation timed out) |
16:13
๐
|
|
aMunster has quit IRC (Read error: Operation timed out) |
16:14
๐
|
|
vegbrasil has quit IRC (Read error: Operation timed out) |
16:14
๐
|
|
closure has quit IRC (Read error: Operation timed out) |
16:15
๐
|
|
bwn_ has joined #archiveteam |
16:16
๐
|
|
Jonimus has quit IRC (Read error: Operation timed out) |
16:16
๐
|
|
MMovie1 has quit IRC (Read error: Operation timed out) |
16:19
๐
|
|
bwn has quit IRC (Read error: Operation timed out) |
16:21
๐
|
|
beardicus has quit IRC (Read error: Operation timed out) |
16:31
๐
|
|
RichardG has quit IRC (Ping timeout: 633 seconds) |
16:33
๐
|
|
Zei-Pii has quit IRC (Read error: Operation timed out) |
16:39
๐
|
|
RichardG has joined #archiveteam |
16:48
๐
|
|
beardicus has joined #archiveteam |
16:49
๐
|
|
atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) |
16:50
๐
|
|
trs81 is now known as trs80 |
16:52
๐
|
|
Emcy_ has quit IRC (Quit: Leaving) |
16:53
๐
|
|
vegbrasil has joined #archiveteam |
16:53
๐
|
|
Emcy has joined #archiveteam |
17:05
๐
|
|
rossdylan has joined #archiveteam |
17:14
๐
|
|
aMunster has joined #archiveteam |
17:22
๐
|
|
beardicus has quit IRC (Read error: Operation timed out) |
17:24
๐
|
|
signius has quit IRC (Read error: Operation timed out) |
17:28
๐
|
|
JesseW has joined #archiveteam |
17:29
๐
|
JesseW |
2,556,335 distinct URLs found in the URLteam data for the ISP hosting sites. |
17:30
๐
|
arkiver |
That's a lot, nice! |
17:35
๐
|
JesseW |
It's a 230M file. I'm dumping it on FOS, under /0/CDROMS/urlteam_isp_hosting_search_results_20160312.txt -- it should be there in fifteen minutes or so. |
17:37
๐
|
|
signius has joined #archiveteam |
17:38
๐
|
SimpBrai1 |
isp hosting is the mini version of geocities |
17:39
๐
|
SimpBrai1 |
in their own unique block, anyone semi in the know 10-15+ years ago, slapped a site up and helped a lot of people eventually go forward with websites |
17:39
๐
|
|
closure has joined #archiveteam |
17:39
๐
|
JesseW |
yep |
17:42
๐
|
|
beardicus has joined #archiveteam |
17:45
๐
|
|
Jonimus has joined #archiveteam |
17:48
๐
|
JesseW |
arkiver: OK, it's up on FOS. Let me know if further processing would be helpful. |
17:50
๐
|
|
MMovie has joined #archiveteam |
17:51
๐
|
|
mhazinsk has joined #archiveteam |
17:54
๐
|
|
wvdp has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
17:55
๐
|
|
JesseW has quit IRC (Quit: Leaving.) |
18:01
๐
|
|
SN4T14 has quit IRC (Remote host closed the connection) |
18:03
๐
|
|
nwf has joined #archiveteam |
18:29
๐
|
|
db48x has joined #archiveteam |
18:39
๐
|
|
bzc6p has joined #archiveteam |
18:43
๐
|
dashcloud |
so is this web archiving project going to be a warrior project, or scripts only? |
18:45
๐
|
arkiver |
warrior probably |
18:46
๐
|
arkiver |
We need a bit of rsync space for the profiles discovery of LiveJournal! |
18:46
๐
|
arkiver |
Not more than 1 GB I think |
18:46
๐
|
arkiver |
If you have some space available, please let me know |
18:48
๐
|
|
atomotic has joined #archiveteam |
18:48
๐
|
|
vitzli has quit IRC (Leaving) |
18:52
๐
|
arkiver |
LiveJournal discovery project is ready. When we have a target I'll start the grab |
18:55
๐
|
HCross |
give me a couple of mins |
18:55
๐
|
arkiver |
awesome! |
18:59
๐
|
|
wvdp has joined #archiveteam |
19:02
๐
|
HCross |
sent |
19:04
๐
|
bzc6p |
arkiver: we don't have channel yet, do we? |
19:04
๐
|
arkiver |
I don't think so |
19:10
๐
|
|
paul has joined #archiveteam |
19:12
๐
|
Asparagir |
arkiver: I have nothing useful to add to the LiveJournal project, but just wanted to say thank you for working on that. |
19:12
๐
|
arkiver |
:D |
19:19
๐
|
|
bzc6p has left |
19:31
๐
|
arkiver |
----------------------------------------------------------------- |
19:31
๐
|
arkiver |
We have started the LiveJournal discovery! |
19:31
๐
|
arkiver |
----------------------------------------------------------------- |
19:31
๐
|
arkiver |
I'm not sure how they are with banning |
19:31
๐
|
arkiver |
So be careful |
19:32
๐
|
|
remsen has quit IRC (ZNC 1.6.2 - http://znc.in) |
19:33
๐
|
|
remsen has joined #archiveteam |
19:35
๐
|
|
bwn_ has quit IRC (Ping timeout: 250 seconds) |
19:35
๐
|
|
JesseW has joined #archiveteam |
19:38
๐
|
Frogging |
arkiver: let me know when a channel is setup |
19:39
๐
|
arkiver |
Anyone has any idea for a channel for livejournal? |
19:39
๐
|
xmc |
deadjournal, but that's already a website that exists |
19:39
๐
|
xmc |
also it's not dead |
19:49
๐
|
ersi |
deardiary |
20:04
๐
|
|
bwn has joined #archiveteam |
20:09
๐
|
BnA-Rob1n |
spinning up some small discovery nodes to get some ip's |
20:24
๐
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
20:32
๐
|
Asparagir |
ZombieJournal? |
20:35
๐
|
SimpBrai1 |
wont be surprised that lj will eventually go down within a few years, better get a copy before it happens |
20:54
๐
|
Nemo_bis |
Alexa rank for LiveJournal is better than I expected https://en.wikipedia.org/w/index.php?title=LiveJournal&type=revision&diff=709745870&oldid=707364858 |
20:54
๐
|
|
paul has quit IRC (Ping timeout: 268 seconds) |
20:56
๐
|
xmc |
it's kind of a big thing in russia |
20:56
๐
|
Nemo_bis |
Yeah, 24th there |
20:56
๐
|
SimpBrai1 |
russian owner |
20:58
๐
|
yipdw |
SUP YALL |
21:00
๐
|
|
Tomcat_ has quit IRC (Read error: Operation timed out) |
21:00
๐
|
ersi |
A lot of dying web pages |
21:01
๐
|
yipdw |
no I mean that's the owner of LiveJournal |
21:02
๐
|
xmc |
^ |
21:02
๐
|
xmc |
SUP (Russian: СУП, which means 'soup') |
21:02
๐
|
|
SN4T14 has joined #archiveteam |
21:02
๐
|
xmc |
https://en.wikipedia.org/wiki/SUP_Media |
21:03
๐
|
yipdw |
I have been waiting seven years to make that joke |
21:03
๐
|
yipdw |
I KNEW THIS DAY WOULD COME |
21:05
๐
|
SimpBrai1 |
lol |
21:07
๐
|
ersi |
Clever. |
21:07
๐
|
ersi |
I approve. |
21:08
๐
|
ersi |
yipdw is like a humour heron. Waiting to strike. |
21:14
๐
|
yipdw |
it's the only reason I use a bouncer |
21:22
๐
|
|
zino__ has quit IRC (Remote host closed the connection) |
21:36
๐
|
Asparagir |
Now do the one about updog |
21:38
๐
|
* |
ersi stares |
21:40
๐
|
|
zino has joined #archiveteam |
21:41
๐
|
Asparagir |
...i'll show myself out. |
21:47
๐
|
|
tephra has quit IRC (Ping timeout: 260 seconds) |
21:48
๐
|
|
tephra has joined #archiveteam |
21:56
๐
|
|
RichardG has quit IRC (Ping timeout: 272 seconds) |
22:20
๐
|
|
RichardG has joined #archiveteam |
22:27
๐
|
|
JesseW has quit IRC (Read error: Operation timed out) |
22:33
๐
|
|
Ravenloft has joined #archiveteam |
22:33
๐
|
Ravenloft |
hello there |
22:33
๐
|
Asparagir |
Howdy. |
22:34
๐
|
Ravenloft |
asking the pros for help, how to get the video from this site http://globoplay.globo.com/v/4869652/ |
22:35
๐
|
Asparagir |
Try a program called youtube-dl -- https://rg3.github.io/youtube-dl/ |
22:36
๐
|
Asparagir |
No guarantees, but it works on a lot of stuff. |
22:36
๐
|
dxrt |
I just tried youtube-dl and it seems to handle it fine ^^ |
22:36
๐
|
Asparagir |
\o/ |
22:39
๐
|
|
bzc6p has joined #archiveteam |
22:42
๐
|
Ravenloft |
dxrt which arguments did you use? |
22:43
๐
|
Ravenloft |
tried with only the URL and it didnt work |
22:43
๐
|
dxrt |
hmm, I just tried with the URL.. do you get any error? I'll try again. |
22:47
๐
|
|
bzc6p has left |
22:48
๐
|
Ravenloft |
yes, but I cant copy it, getting an alternative to cmd.exe right now |
22:51
๐
|
|
LastNinja has quit IRC (Ping timeout: 260 seconds) |
22:57
๐
|
|
zino has quit IRC (Remote host closed the connection) |
23:20
๐
|
|
zino has joined #archiveteam |