Time | Nickname | Message
00:07  girst has quit IRC (Remote host closed the connection)
01:07  bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
01:41  synm0nger has quit IRC (Quit: Wait, what?)
01:57  SynMonger has joined #archiveteam-ot
02:11  DogsRNice has quit IRC (Read error: Connection reset by peer)
02:12  cerca has joined #archiveteam-ot
03:19  icedice has joined #archiveteam-ot
03:21  jamiew has joined #archiveteam-ot
03:30  SoraUta has joined #archiveteam-ot
03:37  jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
03:42  cerca has quit IRC (Remote host closed the connection)
04:01  icedice has quit IRC (Read error: Operation timed out)
04:51  qw3rty has joined #archiveteam-ot
05:00  qw3rty2 has quit IRC (Ping timeout: 745 seconds)
05:14  markedL has joined #archiveteam-ot
05:15  <markedL> do people burn in new hard drives before putting data on them?
05:17  <kpcyrd> I don't, now I'm wondering if I should
05:27  <Frogging> I run badblocks
05:28  <Frogging> It writes to every byte on the device and then reads the whole device
05:29  godane has quit IRC (Ping timeout: 745 seconds)
05:31  <Frogging> ("writing to every byte" is an oversimplification because of sectors and all that, but you get the idea)
05:32  <Frogging> I don't know if there's any point to doing so, but I don't see any reason not to. If the drive fails after a few passes of that, you've saved yourself some trouble later on.
05:33  <Frogging> it'll also find any bad sectors; that's actually what badblocks is designed to do
05:34  <Frogging> I'm sure they do that at the factory already, but still, why not do it again? Consider it a pre-formatting scrub
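[Editor's note: Frogging's badblocks burn-in can be sketched as a single command. This is a sketch, not from the log: `/dev/sdX` is a placeholder, and because `-w` mode destroys everything on the drive, the command is built and printed here rather than executed.]

```shell
# Sketch of a destructive badblocks burn-in (assumption: /dev/sdX is the
# new, EMPTY drive -- substitute the real device before running as root).
dev="/dev/sdX"                    # placeholder device path
# -w: write four test patterns and read each back (DESTROYS ALL DATA)
# -s: show progress; -v: verbose; -b 4096: use 4 KiB blocks
cmd="badblocks -wsv -b 4096 $dev"
echo "$cmd"                       # printed, not run, for safety
```

[badblocks reports any blocks that fail verification; a non-empty list on a brand-new drive is grounds for an RMA.]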
05:40  godane has joined #archiveteam-ot
05:51  tuluu has quit IRC (Remote host closed the connection)
05:52  tuluu has joined #archiveteam-ot
06:26  Flashfire has quit IRC (Read error: Connection reset by peer)
06:27  Flashfire has joined #archiveteam-ot
07:08  deevious has joined #archiveteam-ot
07:12  dhyan_nat has joined #archiveteam-ot
08:11  deevious has quit IRC (Quit: deevious)
08:15  deevious has joined #archiveteam-ot
08:34  ShellyRol has quit IRC (Read error: Connection reset by peer)
08:37  bluefoo has quit IRC (Read error: Operation timed out)
08:51  ShellyRol has joined #archiveteam-ot
09:04  SoraUta has quit IRC (Ping timeout: 610 seconds)
09:39  dhyan_nat has quit IRC (Read error: Operation timed out)
09:54  Laverne has quit IRC (Ping timeout: 258 seconds)
09:54  mls has quit IRC (Ping timeout: 258 seconds)
09:54  VoynichCr has quit IRC (Ping timeout: 258 seconds)
09:54  sHATNER has quit IRC (Ping timeout: 258 seconds)
09:55  eythian has quit IRC (Ping timeout: 258 seconds)
09:55  luckcolor has quit IRC (Ping timeout: 258 seconds)
09:55  luckcolor has joined #archiveteam-ot
09:57  eythian has joined #archiveteam-ot
09:59  mls has joined #archiveteam-ot
10:00  sHATNER has joined #archiveteam-ot
10:26  deevious has quit IRC (Quit: deevious)
10:31  BlueMaxim has joined #archiveteam-ot
10:43  BlueMax has quit IRC (Ping timeout: 745 seconds)
10:49  deevious has joined #archiveteam-ot
10:59  VoynichCr has joined #archiveteam-ot
11:00  Laverne has joined #archiveteam-ot
12:23  BlueMaxim has quit IRC (Read error: Connection reset by peer)
12:47  bluefoo has joined #archiveteam-ot
13:20  jamiew has joined #archiveteam-ot
13:21  SoraUta has joined #archiveteam-ot
13:24  jamiew has quit IRC (Client Quit)
13:43  <JAA> Yeah, I also do essentially that (with SMART long test if available), and I also use fio to stress-test the mechanics for a few hours.
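[Editor's note: JAA's routine (a SMART extended self-test plus an fio stress run) might look roughly like the following. The device path and four-hour duration are assumptions, and the commands are only printed since they need root and a real drive.]

```shell
# Rough sketch of a SMART long test plus an fio mechanical stress run
# (assumptions: device is /dev/sdX; smartmontools and fio installed).
dev="/dev/sdX"                              # placeholder
# Extended self-test runs inside the drive firmware; poll results
# later with: smartctl -a /dev/sdX
smart_cmd="smartctl -t long $dev"
# Random 4 KiB direct reads keep the heads seeking for 4 hours
fio_cmd="fio --name=seek-stress --filename=$dev --rw=randread --bs=4k --direct=1 --time_based --runtime=14400"
echo "$smart_cmd"
echo "$fio_cmd"
```

[The random-read pattern is what exercises the mechanics: sequential I/O barely moves the actuator, while random seeks sweep it across the whole platter for the entire runtime.]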
13:55  tuluu has quit IRC (Read error: Connection refused)
13:55  tuluu has joined #archiveteam-ot
14:45  bluefoo has quit IRC (Ping timeout: 744 seconds)
14:50  bluefoo has joined #archiveteam-ot
15:07  girst has joined #archiveteam-ot
15:15  deevious has quit IRC (Quit: deevious)
15:38  mc2 has quit IRC (Read error: Operation timed out)
15:58  dhyan_nat has joined #archiveteam-ot
16:41  jamiew has joined #archiveteam-ot
17:01  jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
17:34  jamiew has joined #archiveteam-ot
17:56  <ivan> markedL: I do a write-read test using my http://github.com/ludios/drive-checker which has never actually caught a problem on a new drive for me (they tend to be tested before they get shipped out)
17:56  <ivan> but it does catch memory problems on the computer I use it on :-)
17:58  <markedL> I guess I'm equally concerned about damage from shipping or improper storage. it likely worked when it left the factory.
18:00  <ivan> that will show up as 'very DOA'
18:01  <ivan> noises, fail to spin up
18:09  qw3rty has quit IRC (Ping timeout: 745 seconds)
18:13  qw3rty has joined #archiveteam-ot
18:20  VerifiedJ has joined #archiveteam-ot
18:29  X-Scale` has joined #archiveteam-ot
18:29  LowLevelM has quit IRC (Read error: Operation timed out)
18:30  LowLevelM has joined #archiveteam-ot
18:34  X-Scale has quit IRC (Ping timeout: 610 seconds)
18:34  X-Scale` is now known as X-Scale
20:19  jamiew_ has joined #archiveteam-ot
20:20  MilkGames has joined #archiveteam-ot
20:20  jamiew_ has quit IRC (Client Quit)
20:21  <MilkGames> Hey there, how would I go about getting a web archive moved to the ArchiveTeam collection?
20:33  <MilkGames> Sorry, just realised this is the wrong channel to ask that in. I'll ask in another.
20:33  MilkGames has left
20:43  DogsRNice has joined #archiveteam-ot
21:06  jamiew_ has joined #archiveteam-ot
21:10  oxguy3 has joined #archiveteam-ot
21:16  <oxguy3> uh hey, so i've got 204GB of gzipped WARC files from an FTP site... is there anything i should know before i attempt to upload this to archive.org with the ia command line tool? i've never uploaded anything remotely this big
21:25  <astrid> you might want to split it into a handful of distinct ia items, depending. up to you though.
21:25  <astrid> how big is each warc?
21:27  <oxguy3> i set wget to target 1GB file size, but most are a bit bigger, and some are huge -- got a 12gb and an 11gb
21:27  <astrid> oh that's reasonable
21:28  <astrid> :)
21:29  <oxguy3> yeah it's not too bad, and i figure it'd be best to keep them together so it's a singular package (i archived the entire FTP server lol)
21:29  <oxguy3> i guess i'll get to uploading... this is gonna take a while lol
21:31  <Kaz> split to about 50gb/item ideally, I think is the common recommendation
21:33  <oxguy3> i thought it was 50 gb/file and 1000 files/item? https://help.archive.org/hc/en-us/articles/360016475032-Uploading-Tips
21:37  <oxguy3> oh wait, looks like items aren't supposed to be bigger than 100gb, shoot https://archive.org/services/docs/api/items.html#item-limitations
21:37  <oxguy3> alright if i'm uploading this into multiple items, does it matter which item i put the meta warc file in?
21:40  <Kaz> yeah, don't make a 50tb item :)
21:40  <Kaz> I'm pretty sure it wouldn't be possible anyway, as I think all files for an item live together on a disk
21:54  <oxguy3> i'm assuming i should set mediatype:web since this is WARC files, even though it's ftp rather than http, right?
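[Editor's note: a split upload like the one discussed might look like this with the ia tool; the item identifier and filename are invented for illustration, and the command is only printed here.]

```shell
# Hypothetical `ia upload` for one of several items (identifier and
# filename are made up; mediatype:web is what oxguy3 proposed above).
item="vikings-flashspot-ftp-part1"   # invented identifier
cmd="ia upload $item vikings-00000.warc.gz --metadata=mediatype:web --retries=10"
echo "$cmd"                          # run for real after `ia configure`
```

[`ia configure` stores the archive.org credentials; each additional item would repeat the upload with its own identifier and file list.]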
22:03  jamiew__ has joined #archiveteam-ot
22:08  <SketchCow> Also, unless you're an approved archive team project, it won't go into wayback
22:09  <oxguy3> yeah i figured, but will it still be browseable on archive.org?
22:09  <SketchCow> The item? Sure
22:09  jamiew_ has quit IRC (Read error: Operation timed out)
22:09  <oxguy3> like as in, would you be able to browse the full contents of the server in some easy way, instead of having to dig through warc.gz files?
22:10  <SketchCow> Nope
22:10  <oxguy3> ah, hmm. would it be better if i just uploaded the actual raw files instead of the WARCs? (i didn't include --delete-after in my wget command so i have them raw as well)
22:11  <SketchCow> I don't know how more or not more better it is.
22:11  <SketchCow> What FTP site is it
22:11  <SketchCow> Dare you to say Intel
22:11  <oxguy3> vikings.flashspot.tv -- the Minnesota Vikings used it to share video and photos with the press for many years
22:12  <SketchCow> Is it still up
22:12  <oxguy3> yes, but hasn't been updated since 2017
22:12  <SketchCow> Just pass this to archivebot to do it
22:13  <SketchCow> Then it's all handled and it goes in wayback
22:14  <oxguy3> it requires a login, i wasn't sure if that would be an issue
22:14  <SketchCow> Upload the raw files.
22:14  <SketchCow> How many files is it.
22:14  <oxguy3> okay cool
22:14  <oxguy3> uhhh, a lot... let me see
22:14  <SketchCow> Either way, it's going to be a nightmare
22:15  <oxguy3> 33462
22:15  <SketchCow> Yeah, raw files, have a ball
22:15  <SketchCow> Easiest if you upload it as a large set of .ZIP files
22:16  <oxguy3> ah yeah, that sounds better than making 34+ items lol
22:16  <SketchCow> Let's put it this way, it's going to be awful no matter what.
22:16  <SketchCow> A few largish .zip files will do
22:17  <astrid> a .zip per toplevel folder maybe
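[Editor's note: astrid's "one .zip per top-level folder" fits in a small shell loop. This sketch only prints the zip commands; drop the echo to run them, assuming `zip` is installed and you are in the mirror root.]

```shell
# Print one `zip -r` command per top-level folder in the current
# directory (sketch; remove the echo to actually create the archives).
zip_per_folder() {
    for d in */ ; do
        # ${d%/} strips the trailing slash: "photos/" -> "photos.zip"
        echo zip -r "${d%/}.zip" "$d"
    done
}
```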
22:17  <SketchCow> Yeah, not too many, and not too large
22:17  <SketchCow> It's an art
22:17  <astrid> there are many right answers and many wrong answers
22:17  <oxguy3> problem with that is there are two top-level folders which surpass 50GB lol
22:18  <oxguy3> i'll figure something out
22:18  <astrid> the main constraint on archive.org items is that each item has to live on a hard drive with all of its files together
22:18  <astrid> so if you exceed commercially shipped hard drives it won't be able to fit anywhere
22:19  <astrid> and if you get close to it then it makes their end of things ... more complicated
22:19  <oxguy3> yeah, i think i'm gonna make one item for a 72GB folder, one item for a 65GB folder, and one item for everything else (which totals 73GB)
22:20  <astrid> oh those can live in one item together i'd say
22:20  <astrid> three largeish zip files in a single item is a solid choice here
22:21  <oxguy3> hmm, i thought the rule was no files over 50gb?
22:21  <oxguy3> i was planning on splitting the two mega folders into zips for each sub item
22:22  <astrid> imo better to keep them together so they don't get lost
22:22  dhyan_nat has quit IRC (Read error: Operation timed out)
22:24  <oxguy3> i'll keep them in the same item, but divide them into multiple zips
22:27  jamiew_ has joined #archiveteam-ot
22:28  jamiew_ has quit IRC (Client Quit)
22:35  jamiew__ has quit IRC (Read error: Operation timed out)
22:35  <oxguy3> alright they're slowly getting zipped -- my home server has a wimpy CPU so it's gonna be a while. ay caramba, what a messy project
22:36  <markedL> the ftp site has the same credentials as the web site?
22:36  <oxguy3> yep!
22:45  <oxguy3> the website seems to just be an FTP client
22:48  <markedL> it pretends to accept anonymous but dunno what email addresses it will approve
23:05  <oxguy3> i'll dm you the login (i'm not too concerned about sharing it now that i have a complete mirror)
23:08  <markedL> what tools do ftp into warc?
23:08  <JAA> wpull can do that.
23:09  <JAA> I don't think there's any standard on how to save FTP to WARC though.
23:10  <oxguy3> i did it with wget
23:12  <oxguy3> wget --user="vPR-Read" --password="removed" ftp://vikings.flashspot.tv/ --mirror --warc-file=vikings --warc-max-size=1G --warc-header="ftp-user: vPR-Read"
23:12  <markedL> since the credentials are on the open web, I feel the ethics are different, but I'm not going to use them since it's done already
23:13  <markedL> maybe someone else would prefer a web-in-warc copy
23:13  <oxguy3> i have the full warc mirror fyi
23:14  <Frogging> Is there really much point to capturing FTP to WARC, though?
23:14  <Frogging> The files are all independent of each other and the headers are irrelevant
23:15  <oxguy3> ¯\_(ツ)_/¯ that's how the archiveteam ftp project does it so i just copied them
23:15  <JAA> I've been asking myself that as well. What it's nice for is keeping the retrieval commands tightly coupled to the data.
23:16  <Frogging> Just make sure to preserve the timestamps. zip or tar will do that for you
23:16  <markedL> you get hashes
23:16  <Frogging> (assuming you downloaded with something that preserves timestamps)
23:16  <Frogging> FTP has hashes?
23:16  <JAA> WARC does.
23:16  <Frogging> ah
23:16  <Frogging> find . -type f -exec md5sum {} + > md5sum.txt
23:16  <oxguy3> wget created .listing files in every directory which included timestamps, so i figure that's probably good enough
23:18  <JAA> FTP per standard doesn't, but there are extensions.
23:18  <Frogging> wget should have applied the timestamps to the downloaded files
23:19  <Frogging> which will be preserved if you zip/tar them
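[Editor's note: Frogging's claim is easy to check: tar records each file's mtime and restores it on extraction. A self-contained demonstration with throwaway files (the 2017 date is invented to echo the site's last update; GNU touch/date assumed):]

```shell
# Demonstrate that tar round-trips modification times (temp files only).
mkdir -p demo
touch -d '2017-06-01 12:00:00' demo/clip.mp4   # fake an old FTP timestamp
tar -czf demo.tar.gz demo                      # tar stores the mtime
rm -rf demo
tar -xzf demo.tar.gz                           # extraction restores it
date -r demo/clip.mp4 '+%Y-%m-%d'              # prints 2017-06-01
```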
23:19  <markedL> there needs to be more .warc support; it's a little repetitive defending archival properties for something that's not friendly to use
23:57  martini has joined #archiveteam-ot