Time |
Nickname |
Message |
00:05
🔗
|
|
primus104 has quit IRC (Leaving.) |
01:15
🔗
|
|
Ymgve has quit IRC (Ping timeout: 506 seconds) |
01:16
🔗
|
|
Ymgve has joined #archiveteam |
01:17
🔗
|
|
d6e has left :3 |
01:30
🔗
|
|
Jonimus has joined #archiveteam |
01:42
🔗
|
|
schbirid2 has joined #archiveteam |
01:45
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
01:46
🔗
|
|
Emcy has quit IRC (Ping timeout: 306 seconds) |
01:54
🔗
|
|
aaaaaaaaa has joined #archiveteam |
02:03
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
02:04
🔗
|
|
mistym has joined #archiveteam |
02:04
🔗
|
|
Ymgve has quit IRC () |
02:16
🔗
|
|
Coderjoe has quit IRC (Read error: Connection reset by peer) |
02:16
🔗
|
|
Coderjoe has joined #archiveteam |
02:24
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
02:25
🔗
|
|
aaaaaaaaa has joined #archiveteam |
02:28
🔗
|
|
bzc6p__ has joined #archiveteam |
02:32
🔗
|
|
bzc6p_ has quit IRC (Read error: Operation timed out) |
02:35
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
02:45
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
02:46
🔗
|
|
aaaaaaaaa has joined #archiveteam |
02:47
🔗
|
|
BlueMaxim has joined #archiveteam |
03:00
🔗
|
|
JesseW has joined #archiveteam |
03:20
🔗
|
|
mistym has joined #archiveteam |
04:03
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
04:04
🔗
|
|
aaaaaaaaa has joined #archiveteam |
04:17
🔗
|
|
mr_rippit has joined #archiveteam |
04:17
🔗
|
|
ripvanwin has quit IRC (Read error: Connection reset by peer) |
04:22
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
04:26
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
04:40
🔗
|
|
Emcy has joined #archiveteam |
04:59
🔗
|
|
mistym has joined #archiveteam |
05:43
🔗
|
|
JesseW has quit IRC (Quit: Leaving.) |
05:45
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
05:47
🔗
|
|
JesseW has joined #archiveteam |
06:08
🔗
|
|
godane has joined #archiveteam |
06:51
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
07:11
🔗
|
|
JesseW has quit IRC (Quit: Leaving.) |
07:20
🔗
|
|
bzc6p__ is now known as bzc6p |
07:29
🔗
|
|
primus104 has joined #archiveteam |
07:31
🔗
|
|
rolf has joined #archiveteam |
07:33
🔗
|
|
Laverne has quit IRC (Read error: Operation timed out) |
07:51
🔗
|
|
mistym has joined #archiveteam |
08:03
🔗
|
|
mistym has quit IRC (Read error: Operation timed out) |
08:53
🔗
|
|
mistym has joined #archiveteam |
09:01
🔗
|
|
mistym has quit IRC (Ping timeout: 483 seconds) |
09:25
🔗
|
|
rolf has quit IRC (Leaving...) |
09:28
🔗
|
|
rolf has joined #archiveteam |
09:32
🔗
|
|
primus104 has quit IRC (Leaving.) |
09:33
🔗
|
|
rolf has quit IRC (Leaving...) |
10:05
🔗
|
|
SadDM has quit IRC (Ping timeout: 370 seconds) |
10:17
🔗
|
|
rolf has joined #archiveteam |
10:17
🔗
|
|
rolf has quit IRC (Client Quit) |
10:51
🔗
|
|
sirdancea has quit IRC (Remote host closed the connection) |
10:55
🔗
|
|
Ymgve has joined #archiveteam |
11:00
🔗
|
|
sirdancea has joined #archiveteam |
12:18
🔗
|
|
primus104 has joined #archiveteam |
12:19
🔗
|
|
Morbus has quit IRC (Quit: http://www.disobey.com/) |
12:24
🔗
|
|
Morbus has joined #archiveteam |
12:24
🔗
|
|
p9ne has joined #archiveteam |
12:45
🔗
|
|
p9ne has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) |
12:55
🔗
|
|
primus104 has quit IRC (Leaving.) |
13:00
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
13:30
🔗
|
|
McGEE has quit IRC (Quit: Connection closed for inactivity) |
13:31
🔗
|
|
sankin has joined #archiveteam |
13:37
🔗
|
|
p9ne has joined #archiveteam |
13:45
🔗
|
|
vOYtEC has quit IRC (Ping timeout: 362 seconds) |
13:57
🔗
|
|
primus104 has joined #archiveteam |
14:32
🔗
|
|
mistym has joined #archiveteam |
14:36
🔗
|
|
bzc6p_ has joined #archiveteam |
14:40
🔗
|
|
bzc6p has quit IRC (Read error: Operation timed out) |
14:40
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
14:58
🔗
|
|
JesseW has joined #archiveteam |
15:09
🔗
|
|
mistym has joined #archiveteam |
15:27
🔗
|
|
JesseW has quit IRC (Quit: Leaving.) |
15:40
🔗
|
|
sankin has quit IRC (Leaving.) |
15:46
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
16:03
🔗
|
|
mistym has joined #archiveteam |
16:11
🔗
|
|
sirdancea has quit IRC (Read error: Operation timed out) |
16:21
🔗
|
|
lytv has quit IRC (Read error: Connection reset by peer) |
16:30
🔗
|
|
nox has quit IRC () |
16:32
🔗
|
|
lexicon has joined #archiveteam |
16:43
🔗
|
|
lytv has joined #archiveteam |
16:47
🔗
|
|
aaaaaaaaa has joined #archiveteam |
16:47
🔗
|
|
philpem has joined #archiveteam |
16:48
🔗
|
|
p9ne has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) |
16:57
🔗
|
|
Mayonaise has quit IRC (Read error: Operation timed out) |
16:58
🔗
|
|
bzc6p_ is now known as bzc6p |
16:58
🔗
|
|
joepie91_ has quit IRC (Read error: Operation timed out) |
16:58
🔗
|
|
tephra has quit IRC (Read error: Operation timed out) |
16:58
🔗
|
|
closure has quit IRC (Read error: Operation timed out) |
16:59
🔗
|
|
joepie91 has joined #archiveteam |
17:03
🔗
|
|
closure has joined #archiveteam |
17:03
🔗
|
|
Mayonaise has joined #archiveteam |
17:06
🔗
|
|
nox has joined #archiveteam |
17:07
🔗
|
|
aaaaaaaa_ has joined #archiveteam |
17:09
🔗
|
|
tephra has joined #archiveteam |
17:13
🔗
|
|
aaaaaaaaa has quit IRC (Ping timeout: 370 seconds) |
17:14
🔗
|
|
aaaaaaaa_ is now known as aaaaaaaaa |
17:24
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
17:27
🔗
|
|
mistym has joined #archiveteam |
17:30
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
17:46
🔗
|
|
primus104 has quit IRC (Leaving.) |
17:46
🔗
|
|
McGEE has joined #archiveteam |
17:50
🔗
|
|
mistym has joined #archiveteam |
17:51
🔗
|
|
Jonimus has quit IRC (Ping timeout: 370 seconds) |
18:22
🔗
|
|
sirdancea has joined #archiveteam |
18:56
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
19:01
🔗
|
|
mistym has joined #archiveteam |
19:02
🔗
|
|
db48x has joined #archiveteam |
19:21
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
19:22
🔗
|
|
mistym has joined #archiveteam |
19:31
🔗
|
|
SimpBrain has joined #archiveteam |
19:37
🔗
|
|
neku has joined #archiveteam |
19:37
🔗
|
neku |
Hi |
19:38
🔗
|
Kazzy |
hey |
19:38
🔗
|
aaaaaaaaa |
hello |
19:39
🔗
|
neku |
Was told by a friend (wub) to go here, I run Pomf.se and would need to archive it. |
19:40
🔗
|
|
deafnet has joined #archiveteam |
19:41
🔗
|
bzc6p |
neku: it's basically a file sharing service, right? |
19:41
🔗
|
neku |
bzc6p, correct |
19:42
🔗
|
bzc6p |
At first glance, I guess it doesn't have an index of uploaded files. |
19:42
🔗
|
Kazzy |
possible to get a list of all URLs/files? |
19:42
🔗
|
bzc6p |
Does it? |
19:42
🔗
|
neku |
It does not have an index, however files are public. |
19:43
🔗
|
bzc6p |
Hey. You said you run it. |
19:43
🔗
|
neku |
I do.. |
19:44
🔗
|
bzc6p |
Erm... so you know the structure the most. |
19:45
🔗
|
neku |
As said above, it does not have an index list, however files are not private, as anyone can view them once they get hold of the link. |
19:45
🔗
|
Kazzy |
judging by the fact that each file seems to keep its original extension, bruteforcing is out of the question |
19:45
🔗
|
aaaaaaaaa |
I think what he was asking is whether you have or can generate a list of all the file links. |
19:45
🔗
|
Kazzy |
archival will require some sort of 'inside knowledge' and a list of files |
19:46
🔗
|
bzc6p |
neku: I'd be surprised if a site admin couldn't provide a list of stored files. |
19:47
🔗
|
neku |
Well of course I can, it's all in a database, however my question is how would I archive this in the best way. |
19:47
🔗
|
|
WubTheCap has joined #archiveteam |
19:47
🔗
|
aaaaaaaaa |
neku: the basic way to archive a site is in a warc file. This records both the request to the website and the response. |
19:47
🔗
|
deafnet |
wubbois |
19:48
🔗
|
WubTheCap |
So in case it hasn't become clear yet, it seems Pomf.se has to shut down soon due to lack of money and CDN issues |
19:48
🔗
|
WubTheCap |
It's like what happened to MediaCrush, which was archived |
19:48
🔗
|
WubTheCap |
http://archiveteam.org/index.php?title=MediaCrush |
19:48
🔗
|
deafnet |
neku: split up the 4tb into zips or something |
19:48
🔗
|
bzc6p |
How big is this, approximately? |
19:48
🔗
|
WubTheCap |
1.6 million files, 4 TB |
19:48
🔗
|
aaaaaaaaa |
there are tools that can do that, like wpull and wget. The most basic method is to prepare a list of urls and feed it to a warc aware tool. |
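A minimal sketch of the "prepare a list of URLs and feed it to a WARC-aware tool" approach described above. The base URL, the sample filenames, and the helper names `make_urls` and `build_wget_command` are assumptions for illustration only; the wget options used (`--input-file`, `--warc-file`, `--warc-max-size`, `--delete-after`) are standard GNU wget flags, but the exact invocation a real project would use may differ.

```python
import shlex
from urllib.parse import quote

# Assumption: files are served directly under this host, as the a.pomf.se
# URLs mentioned in the channel suggest.
BASE_URL = "http://a.pomf.se/"


def make_urls(filenames, base_url=BASE_URL):
    """Turn stored filenames (e.g. exported from the database) into full URLs."""
    return [base_url + quote(name) for name in filenames]


def build_wget_command(url_list_path, warc_prefix, max_size="50G"):
    """Build an argv list asking GNU wget to fetch every listed URL into WARC files."""
    return [
        "wget",
        f"--input-file={url_list_path}",   # one URL per line
        f"--warc-file={warc_prefix}",      # write request/response records to WARC
        f"--warc-max-size={max_size}",     # start a new WARC file at this size
        "--delete-after",                  # keep only the WARC, not loose copies
    ]


if __name__ == "__main__":
    # Hypothetical filenames; a real list would come from the site's database.
    for url in make_urls(["abcdef.png", "ghijkl.webm"]):
        print(url)
    print(shlex.join(build_wget_command("urls.txt", "pomf")))
```

Because the WARC records the server's response whatever its status, URLs that now return 404 would still be captured as 404 records rather than silently skipped.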
19:48
🔗
|
Kazzy |
neku: what's the total size? 4TB (assuming raid1) |
19:49
🔗
|
Kazzy |
oh, there we go |
19:49
🔗
|
Kazzy |
too big for archivebot, possibly a warrior project? (cc arkiver) |
19:49
🔗
|
WubTheCap |
I guess neku can make an index file from the database's filenames |
19:49
🔗
|
deafnet |
it took us 8 mins to get to this point |
19:50
🔗
|
aaaaaaaaa |
apparently, I'm a autist neckbeard |
19:50
🔗
|
neku |
WubTheCap, could work I guess |
19:50
🔗
|
|
McGEE has quit IRC (Quit: Connection closed for inactivity) |
19:51
🔗
|
WubTheCap |
There's some 404s in that database though, because of removed malware files and keeping a database record to prevent uploading them again |
19:51
🔗
|
aaaaaaaaa |
those can be recorded in the warc too. |
19:51
🔗
|
WubTheCap |
But, would it still be a better idea to contact info@archive.org for this? |
19:52
🔗
|
bzc6p |
aaaaaaaaa: +1 |
19:52
🔗
|
bzc6p |
(autist neckbeard) |
19:52
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
19:53
🔗
|
aaaaaaaaa |
WubTheCap: are you concerned about the space? |
19:53
🔗
|
WubTheCap |
aaaaaaaaa: It was SketchCow's recommendation yesterday |
19:53
🔗
|
bzc6p |
Well, content should be warced because original URLs should be available through wayback machine |
19:53
🔗
|
WubTheCap |
Yeah that's true |
19:54
🔗
|
WubTheCap |
Also I don't know why but Wayback Machine doesn't like manually inputted a.pomf.se URLs |
19:54
🔗
|
bzc6p |
ArchiveTeam uses 50GB warc pieces, that would result in 80 files. Neither size nor num of files is extremely large I think. However, SketchCow was who suggested contacting IA. |
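The arithmetic behind the "80 files" figure can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and uses the 4 TB and 1.6 million file figures quoted earlier in the channel; the function name `pieces_needed` is hypothetical.

```python
TB = 10**12  # assumption: decimal terabyte
GB = 10**9   # assumption: decimal gigabyte


def pieces_needed(total_bytes, piece_bytes):
    """Number of fixed-size pieces needed to hold total_bytes (integer ceiling)."""
    return -(-total_bytes // piece_bytes)


total = 4 * TB
print(pieces_needed(total, 50 * GB))   # 80 pieces of 50 GB each
print(total / 1_600_000 / 10**6)       # average file size in MB: 2.5
```

With binary units (TiB and GiB) the same calculation gives 82 pieces, so the figure is an estimate either way.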
19:54
🔗
|
WubTheCap |
WARCs are a different thing though |
19:54
🔗
|
aaaaaaaaa |
I think the concern may be the cost. I think 4 TB is $8000 in costs for them |
19:54
🔗
|
Kazzy |
WubTheCap: http://a.pomf.se/robots.txt |
19:54
🔗
|
Kazzy |
that's why |
19:54
🔗
|
WubTheCap |
I have no idea about file consensus on Pomf.se, but I assume most of them are under 2 MB screenshots or maybe under 10 MB WebM videos |
19:55
🔗
|
WubTheCap |
Kazzy: ia_archiver though, also it used to allow everything |
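To illustrate how a robots.txt rule can block a particular crawler such as `ia_archiver`, here is a small sketch using Python's standard `urllib.robotparser`. Both robots.txt bodies below are hypothetical; the actual content of http://a.pomf.se/robots.txt is not shown in this log, and the sample file URL is made up.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Parse a robots.txt body and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rule sets, for illustration only.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

url = "http://a.pomf.se/abcdef.png"  # hypothetical file URL
print(is_allowed(BLOCK_ALL, "ia_archiver", url))  # False
print(is_allowed(ALLOW_ALL, "ia_archiver", url))  # True
```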
19:55
🔗
|
bzc6p |
aaaaaaaaa: This is archiveteam. I don't remember IA ever rejecting 4 TB. |
19:56
🔗
|
bzc6p |
I don't even think IA counts the terabytes we upload. It's "nothing" on this scale. |
19:56
🔗
|
xmc |
4tb is awkwardly large to put in a single item |
19:56
🔗
|
WubTheCap |
MediaCrush was split into some 60 chunks I think |
19:56
🔗
|
xmc |
an IA item, afaik, lives as one thing on a computer |
19:56
🔗
|
bzc6p |
I thought it should be done like AT does: 50GB warcs per item |
19:56
🔗
|
bzc6p |
then go to a collection |
19:56
🔗
|
xmc |
so 4TB would kind of monopolize a disk |
19:56
🔗
|
WubTheCap |
e.g. https://archive.org/details/mediacrush_coldstorage_part_1 |
19:57
🔗
|
xmc |
so split it up for making IA's life easier |
19:57
🔗
|
arkiver |
I think we should make this a warrior project |
19:57
🔗
|
arkiver |
4TB is fine |
19:58
🔗
|
bzc6p |
If WubTheCap, neku don't have the space to do the scraping alone, it's obviously a Warrior project |
19:58
🔗
|
WubTheCap |
bzc6p: There's 8 TB storage on Pomf's colocated server |
19:59
🔗
|
WubTheCap |
4 TB in use |
19:59
🔗
|
arkiver |
I'd rather it be a warrior project than a .tar packup project |
19:59
🔗
|
bzc6p |
if they do, we can give clear instructions on how to do it |
19:59
🔗
|
bzc6p |
ok |
19:59
🔗
|
WubTheCap |
4x4 TB actually |
19:59
🔗
|
bzc6p |
how? |
19:59
🔗
|
WubTheCap |
2x4 TB RAID1 | 2x4 TB RAID1 |
19:59
🔗
|
arkiver |
ok |
20:00
🔗
|
arkiver |
neku: are you able to provide us with a list of all files? |
20:00
🔗
|
arkiver |
we'll take care of the rest then |
20:00
🔗
|
bzc6p |
Sorry for interrupting gentlemen, but I think it's time to go to a new channel |
20:00
🔗
|
WubTheCap |
arkiver: Uploads are still enabled though, do you want to wait for those to be disabled? |
20:00
🔗
|
deafnet |
suggestion #pomfret |
20:00
🔗
|
Kazzy |
a moving target is harder to hit, WubTheCap |
20:00
🔗
|
arkiver |
Yes, we should do that |
20:01
🔗
|
neku |
waiting until I disable uploading would probably be good |
20:01
🔗
|
arkiver |
neku: ok, we'll do that |
20:01
🔗
|
arkiver |
can you then provide us with a list of the files? |
20:02
🔗
|
arkiver |
(after uploading is disabled) |
20:02
🔗
|
neku |
At that point I will provide a list of all files |
20:03
🔗
|
Kazzy |
suggesting movement to #pomfret, to keep this channel clean ^^ |
20:03
🔗
|
arkiver |
neku: ok, thank you! |
20:04
🔗
|
bzc6p |
Right |
20:13
🔗
|
|
mistym has joined #archiveteam |
20:13
🔗
|
|
deafnet has left |
20:26
🔗
|
|
Jonimus has joined #archiveteam |
20:37
🔗
|
|
philpem has quit IRC (Remote host closed the connection) |
20:38
🔗
|
|
rolfb has joined #archiveteam |
20:48
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
20:48
🔗
|
|
rolfb has quit IRC (Leaving...) |
20:56
🔗
|
|
rolfb has joined #archiveteam |
20:57
🔗
|
|
rolfb has quit IRC (Linkinus - http://linkinus.com) |
21:06
🔗
|
|
mistym has joined #archiveteam |
21:25
🔗
|
schbirid2 |
http://archiveteam.org/index.php?title=MediaCrush could use a link to the archive |
22:07
🔗
|
|
sirdancea has quit IRC (Read error: Operation timed out) |
22:12
🔗
|
|
SimpBrain has quit IRC (Quit: Leaving) |
23:24
🔗
|
|
neku has quit IRC (Quit: Leaving) |
23:25
🔗
|
Start |
so now that apple music's been introduced, looks like beats music doesn't have much longer to live |
23:25
🔗
|
|
Ymgve has quit IRC () |
23:26
🔗
|
Start |
http://www.beatsmusic.com/robots.txt |
23:26
🔗
|
Start |
heh |
23:27
🔗
|
Start |
most of the content appears to be on on.beatsmusic.com |
23:27
🔗
|
Start |
https://encrypted.google.com/search?q=site%3Aon.beatsmusic.com |
23:48
🔗
|
|
primus104 has joined #archiveteam |