| Time |
Nickname |
Message |
|
00:21
🔗
|
|
kristian_ has quit IRC (Leaving) |
|
01:13
🔗
|
aschmitz |
Anyone have experience archiving Disqus fora? I count 55 for NPR after dropping those with "dev" or "stage" in the name. |
|
01:24
🔗
|
r3c0d3x |
aschmitz: I was actually looking into this a bit already and I'll write up a GitHub gist about it in a minute. One note to preface all this: the comments will stay on Disqus for quite a while longer after NPR removes the embeds from their site. We probably don't need to rush on this. |
|
01:26
🔗
|
aschmitz |
Yeah, it looked like that when I was digging into it a bit. |
|
01:34
🔗
|
r3c0d3x |
aschmitz: Quickly threw this together, it has all the info I was able to gather: https://gist.github.com/r3c0d3x/ff33ff59bd2432a5a81a32669eb5a390 |
|
01:47
🔗
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
|
01:47
🔗
|
|
HCross has joined #archiveteam-bs |
|
01:59
🔗
|
aschmitz |
r3c0d3x: Cool, thanks. Added a bit in my fork: https://gist.github.com/aschmitz/19dfb67be5d0d71c74431074191062dc |
|
02:10
🔗
|
|
tomwsmf has quit IRC (Read error: Operation timed out) |
|
02:26
🔗
|
|
mr-b has left |
|
03:14
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
|
03:15
🔗
|
|
BartoCH has joined #archiveteam-bs |
|
04:09
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
|
04:12
🔗
|
|
BartoCH has joined #archiveteam-bs |
|
04:17
🔗
|
|
JesseW has joined #archiveteam-bs |
|
04:17
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
|
04:24
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
04:35
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
|
04:35
🔗
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
|
04:35
🔗
|
|
HCross has joined #archiveteam-bs |
|
04:43
🔗
|
|
DFJustin has quit IRC (Ping timeout: 260 seconds) |
|
04:43
🔗
|
|
Meroje has quit IRC (Quit: bye!) |
|
04:44
🔗
|
|
Meroje has joined #archiveteam-bs |
|
04:53
🔗
|
|
DFJustin has joined #archiveteam-bs |
|
04:53
🔗
|
|
swebb sets mode: +o DFJustin |
|
05:05
🔗
|
|
DFJustin has quit IRC (Remote host closed the connection) |
|
05:10
🔗
|
|
DFJustin has joined #archiveteam-bs |
|
05:15
🔗
|
|
HCross has quit IRC (Read error: Operation timed out) |
|
05:15
🔗
|
|
HCross has joined #archiveteam-bs |
|
05:57
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
|
05:58
🔗
|
|
phuzion has joined #archiveteam-bs |
|
05:59
🔗
|
SketchCow |
Intense Floppy Grabs continue |
|
05:59
🔗
|
JesseW |
That sounds like some kind of sex toy |
|
06:00
🔗
|
JesseW |
Buy "Intense Floppy Grabs" today for deep, sensual pleasure! |
|
06:05
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
|
06:05
🔗
|
|
sep332 has quit IRC (Read error: Operation timed out) |
|
06:05
🔗
|
|
midas1 has quit IRC (Read error: Operation timed out) |
|
06:07
🔗
|
godane |
just know there are sex toys that are sending data back to the company |
|
06:07
🔗
|
godane |
also this: http://www.dailydot.com/layer8/hackers-and-vibrators-oh-my/ |
|
06:07
🔗
|
|
midas1 has joined #archiveteam-bs |
|
06:07
🔗
|
|
sep332 has joined #archiveteam-bs |
|
06:09
🔗
|
JesseW |
thank you for that |
|
06:10
🔗
|
godane |
you're welcome |
|
06:11
🔗
|
godane |
i remember reading something about that and couldn't find the exact article |
|
06:11
🔗
|
godane |
but that was close enough to it |
|
06:13
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
06:13
🔗
|
|
phuzion has joined #archiveteam-bs |
|
06:23
🔗
|
godane |
turns out sploid.gizmodo.com sitemaps was big |
|
06:23
🔗
|
godane |
i think about 10gb for all of it |
|
06:24
🔗
|
godane |
maybe its 9gb |
|
06:24
🔗
|
godane |
but its still big |
|
07:03
🔗
|
|
Honno has joined #archiveteam-bs |
|
07:11
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
|
07:31
🔗
|
|
REiN^ has quit IRC (Read error: Connection reset by peer) |
|
07:33
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
|
07:36
🔗
|
|
phuzion has joined #archiveteam-bs |
|
07:52
🔗
|
|
schbirid has joined #archiveteam-bs |
|
07:53
🔗
|
fie__ |
SketchCow, floppy grabs what? |
|
08:03
🔗
|
Medowar |
...floppy disks? |
|
08:22
🔗
|
SketchCow |
Apple II floppies |
|
08:27
🔗
|
SketchCow |
Work on http://fos.textfiles.com/pipeline.html began |
|
08:27
🔗
|
SketchCow |
Lots to do |
|
08:33
🔗
|
godane |
SketchCow: turns out we don't have all of gawker.com |
|
08:33
🔗
|
SketchCow |
? |
|
08:33
🔗
|
godane |
or kotaku or lifehacker |
|
08:33
🔗
|
SketchCow |
Really. |
|
08:33
🔗
|
SketchCow |
Why? |
|
08:33
🔗
|
godane |
dump sitemap |
|
08:33
🔗
|
SketchCow |
Well, have they deleted it all now? |
|
08:33
🔗
|
godane |
http://kotaku.com/sitemap_bydate.xml?startTime=2008-11-01T00:00:00&endTime=2008-11-30T23:59:59 |
|
08:34
🔗
|
godane |
no they have not deleted it yet |
|
08:35
🔗
|
godane |
i have noticed that sitemap by date acts weird |
|
08:35
🔗
|
godane |
but cause i tested on maybe 2005 or 2006 sitemaps it looks like it had everything |
|
08:37
🔗
|
|
REiN^ has joined #archiveteam-bs |
|
08:41
🔗
|
godane |
ok the sitemap urls are funking with us |
|
08:41
🔗
|
godane |
kotaku.com for 2008-11 (one above) has 3034 urls |
|
08:42
🔗
|
godane |
but if you use gawker.com in its place you get 1971 |
|
08:43
🔗
|
godane |
so when i say the sitemap acts weird it does act weird |
|
08:45
🔗
|
godane |
SketchCow: also i think archivebot went after gawker.com and other sites owned by gawker back in 2014 or 2015 |
|
08:45
🔗
|
godane |
so my sitemap grab may just be incomplete |
|
08:48
🔗
|
godane |
even my sitemap grab of sploid.gizmodo.com is incomplete |
|
08:48
🔗
|
godane |
:'( |
|
08:48
🔗
|
SketchCow |
Dust yourself off and go for it again |
|
08:49
🔗
|
godane |
i'm doing that now |
|
08:52
🔗
|
SketchCow |
I'm watching classic movies and ripping Apple II disks, and both are going swimmingly. |
|
08:52
🔗
|
godane |
curl -s 'http://gawker.com/sitemap_bydate.xml?startTime=2008-11-01T00:00:00&endTime=2008-11-30T23:59:59' | sed 's|><|>\n<|g' | grep 'http' | sed 's|.*http://|http://|g' | sed 's|.*https://|http://|g' | sed "s|</image:loc>||g" | sed 's|]]>||g' |
|
08:52
🔗
|
godane |
thats my code for grabbing the urls |
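A possible alternative to the sed pipeline above, sketched in Python: instead of splitting on `><` and stripping tags by hand, it parses the sitemap as XML and collects the text of every `<loc>` element, including `<image:loc>`. This is not godane's method, just an illustration. The function name `extract_urls` and the sample document are made up for the example, and unlike the pipeline it leaves `https://` URLs unchanged.

```python
import xml.etree.ElementTree as ET


def extract_urls(xml_text):
    """Return the text of every <loc> element (any namespace), in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for elem in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and elem.text:
            urls.append(elem.text.strip())
    return urls


# Hypothetical sample in the usual sitemap layout (not real Gawker data).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://example.com/post-1</loc>
    <image:image><image:loc><![CDATA[http://example.com/a.jpg]]></image:loc></image:image>
  </url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_urls(SAMPLE):
        print(url)
```

Because the XML parser unwraps CDATA sections, no separate step is needed to strip `]]>`.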
|
08:54
🔗
|
godane |
after 2006-01 is done will try to set up my script to attack each month of 2006 for gawker |
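A minimal sketch of the month-by-month plan, assuming the `sitemap_bydate.xml?startTime=...&endTime=...` URL shape quoted earlier in the log applies to every Gawker property. The helper name `monthly_sitemap_urls` is hypothetical; it only builds the twelve URLs for a year and does no fetching.

```python
import calendar


def monthly_sitemap_urls(host, year):
    """Build one sitemap_bydate URL per calendar month of the given year."""
    urls = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month).
        last_day = calendar.monthrange(year, month)[1]
        urls.append(
            f"http://{host}/sitemap_bydate.xml"
            f"?startTime={year:04d}-{month:02d}-01T00:00:00"
            f"&endTime={year:04d}-{month:02d}-{last_day:02d}T23:59:59"
        )
    return urls


if __name__ == "__main__":
    for url in monthly_sitemap_urls("gawker.com", 2006):
        print(url)
```

Using `calendar.monthrange` avoids hard-coding month lengths, so February and leap years come out right.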
|
09:02
🔗
|
|
BartoCH has joined #archiveteam-bs |
|
09:17
🔗
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
|
09:17
🔗
|
|
HCross has joined #archiveteam-bs |
|
09:34
🔗
|
|
GE has joined #archiveteam-bs |
|
09:34
🔗
|
|
HCross2 has quit IRC (Quit: Connection closed for inactivity) |
|
09:52
🔗
|
|
Selavi has quit IRC (Ping timeout: 260 seconds) |
|
09:53
🔗
|
|
Kksmkrn has joined #archiveteam-bs |
|
09:53
🔗
|
|
Kksmkrn has quit IRC (Connection closed) |
|
09:53
🔗
|
|
Kksmkrn has joined #archiveteam-bs |
|
10:00
🔗
|
|
Selavi has joined #archiveteam-bs |
|
10:09
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
|
10:14
🔗
|
luckcolor |
someone may want to go after this https://twitter.com/antisec_ita/status/767856654486503424 |
|
10:16
🔗
|
|
BartoCH has joined #archiveteam-bs |
|
10:23
🔗
|
|
divingk has quit IRC (ChatZilla 0.9.92 [Firefox 47.0/20160604131506]) |
|
10:31
🔗
|
|
Kksmkrn has quit IRC (Quit: leaving) |
|
11:35
🔗
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
|
11:35
🔗
|
|
HCross has joined #archiveteam-bs |
|
11:44
🔗
|
atrocity |
https://www.reddit.com/r/Minecraft/comments/4z36un/mojangs_official_youtube_channel_was_suspended/ |
|
11:44
🔗
|
atrocity |
stay classy, youtube |
|
11:47
🔗
|
joepie91 |
lol. |
|
12:35
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
|
12:37
🔗
|
|
BartoCH has joined #archiveteam-bs |
|
12:57
🔗
|
|
GE_ has joined #archiveteam-bs |
|
12:59
🔗
|
|
GE has quit IRC (Ping timeout: 255 seconds) |
|
12:59
🔗
|
|
GE_ is now known as GE |
|
13:03
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
|
13:04
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
|
13:04
🔗
|
|
BartoCH has joined #archiveteam-bs |
|
13:27
🔗
|
|
beardicus has quit IRC (bye) |
|
13:28
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
13:31
🔗
|
|
beardicus has joined #archiveteam-bs |
|
13:35
🔗
|
|
beardicus has quit IRC (Client Quit) |
|
13:37
🔗
|
|
beardicus has joined #archiveteam-bs |
|
13:45
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
|
13:45
🔗
|
|
BartoCH has joined #archiveteam-bs |
|
13:46
🔗
|
|
wp494 has quit IRC (Read error: Operation timed out) |
|
13:47
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
14:16
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
|
14:42
🔗
|
|
tomwsmf has joined #archiveteam-bs |
|
14:47
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
|
15:15
🔗
|
|
wp494 has joined #archiveteam-bs |
|
15:18
🔗
|
|
JesseW has joined #archiveteam-bs |
|
15:25
🔗
|
|
JesseW has quit IRC (Read error: Operation timed out) |
|
15:34
🔗
|
|
BartoCH has joined #archiveteam-bs |
|
15:56
🔗
|
|
VADemon has joined #archiveteam-bs |
|
16:03
🔗
|
|
GE has joined #archiveteam-bs |
|
16:14
🔗
|
SketchCow |
http://fos.textfiles.com/pipeline.html is OK but needs another run! Which it will get shortly. |
|
16:32
🔗
|
sep332 |
Does archive.org have a policy about which YouTube pages get saved? |
|
16:32
🔗
|
sep332 |
"Gangnam Style" was getting saved like 6x per day https://web.archive.org/web/*/https://www.youtube.com/watch?v=9bZkp7q19f0 |
|
16:33
🔗
|
sep332 |
(it doesn't seem to have video data or any comments though) |
|
16:33
🔗
|
SketchCow |
That's being worked on internally |
|
16:33
🔗
|
sep332 |
by "policy" I mean for auto-crawling |
|
16:33
🔗
|
sep332 |
ok |
|
16:42
🔗
|
|
HCross2 has joined #archiveteam-bs |
|
16:42
🔗
|
SketchCow |
The goal is in the future it will deduplicate these. |
|
16:42
🔗
|
arkiver |
SketchCow: awesome!! |
|
16:43
🔗
|
arkiver |
hmm |
|
16:43
🔗
|
sep332 |
Is the video data being saved and just not rendered correctly, or plain not collected? |
|
16:43
🔗
|
arkiver |
SketchCow: you can remove extratorrent from there |
|
16:44
🔗
|
SketchCow |
I'll do additional work after I finish the script. |
|
16:44
🔗
|
SketchCow |
A little time to go |
|
16:44
🔗
|
arkiver |
ok |
|
16:51
🔗
|
|
irl has joined #archiveteam-bs |
|
16:51
🔗
|
irl |
SketchCow: hi |
|
16:51
🔗
|
SketchCow |
Hiiiiiiiiii |
|
16:51
🔗
|
irl |
hiiiiiiiiiiiiiiiiiii |
|
16:51
🔗
|
xmc |
hi. |
|
16:52
🔗
|
irl |
SketchCow: i hear you like manuals |
|
16:52
🔗
|
SketchCow |
I do. |
|
16:52
🔗
|
irl |
cool |
|
16:52
🔗
|
SketchCow |
I heard you like scanning them |
|
16:52
🔗
|
irl |
i have manuals |
|
16:52
🔗
|
irl |
and a scanner coming |
|
16:52
🔗
|
irl |
in the ebay post |
|
16:52
🔗
|
SketchCow |
Try not to damage the originals too much and have a fantastic time. |
|
16:53
🔗
|
SketchCow |
Scan at 600dpi TIFF files, put into either .ZIPs or into directories. |
|
16:53
🔗
|
irl |
X.25 interface cards, network simulation software, and other things relevant to internet engineering |
|
16:53
🔗
|
SketchCow |
I can give you an FTP drop |
|
16:53
🔗
|
irl |
ok awesome |
|
16:53
🔗
|
irl |
so i don't go directly to IA? |
|
16:53
🔗
|
irl |
you'll help out with metadata maybe? |
|
16:53
🔗
|
xmc |
you do your own metadata |
|
16:53
🔗
|
xmc |
don't make SketchCow do it |
|
16:53
🔗
|
irl |
hehe |
|
16:53
🔗
|
xmc |
it's not that hard |
|
16:54
🔗
|
irl |
got a link for how to do metadata in a nice format? |
|
16:54
🔗
|
irl |
also, i have some reel-to-reel tapes and 8" floppies |
|
16:54
🔗
|
SketchCow |
I can give general information. |
|
16:54
🔗
|
xmc |
like, how to type in the title and date and author? |
|
16:54
🔗
|
SketchCow |
In a best case, it's: |
|
16:55
🔗
|
SketchCow |
Title, date of creation, creator (company or individual), and then a capsule description. |
|
16:55
🔗
|
irl |
that seems reasonable |
|
16:55
🔗
|
irl |
so i'm not understanding FTP drop then, because that sounds like i'm creating an IA collection |
|
17:00
🔗
|
|
tomaspark has quit IRC (Ping timeout: 255 seconds) |
|
17:03
🔗
|
HCross2 |
SketchCow: who should I contact if I need to change the payment method for an archive.org donation? |
|
17:12
🔗
|
SketchCow |
mail info@archive.org |
|
17:12
🔗
|
SketchCow |
irl: So there's two ways to upload |
|
17:12
🔗
|
SketchCow |
You can upload yourself, or you can build up a pile of directories and I can give you an FTP drop and I shove them in. |
|
17:12
🔗
|
SketchCow |
A collection can be made and you can work on it, but I can do that initial upload process using scripts. I find that helps for bulk uploaders. |
|
17:13
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
|
17:21
🔗
|
irl |
SketchCow: ah ok cool |
|
17:21
🔗
|
irl |
so how should the metadata be done within the pile of directories? |
|
17:22
🔗
|
irl |
is there some json or yaml or something format? |
|
17:24
🔗
|
SketchCow |
Whatever you're comfortable with, I can work with |
|
17:28
🔗
|
irl |
SketchCow: i could do a csv like http://internetarchive.readthedocs.io/en/latest/cli.html#modifying-metadata-in-bulk |
|
17:29
🔗
|
SketchCow |
Entirely up to you. However you want. |
|
17:29
🔗
|
irl |
ok, just trying to find the easiest way for you |
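One way the CSV idea could look, as a sketch only: a small Python script that writes one row per scanned manual with the fields SketchCow listed (title, date of creation, creator, description). The `identifier` column is an assumption based on the bulk-metadata spreadsheet format in the linked internetarchive docs, and the item identifier and values below are invented placeholders.

```python
import csv
import io

# Column order: identifier first, then the fields named in the discussion.
FIELDS = ["identifier", "title", "date", "creator", "description"]


def build_metadata_csv(rows):
    """Serialize a list of metadata dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical example row; replace with real items.
ROWS = [
    {
        "identifier": "example-x25-card-manual",
        "title": "Example X.25 Interface Card Manual",
        "date": "1988",
        "creator": "Example Corp",
        "description": "Installation and configuration guide, scanned at 600 dpi.",
    },
]

if __name__ == "__main__":
    print(build_metadata_csv(ROWS), end="")
```

The `csv` module quotes fields that contain commas, so free-text descriptions survive a round trip.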
|
17:55
🔗
|
godane |
SketchCow: daily sitemaps of gawker.com is happening |
|
17:55
🔗
|
SketchCow |
Great |
|
17:56
🔗
|
godane |
i just hope the sitemap doesn't go crazy like before with the monthly ones |
|
17:59
🔗
|
SketchCow |
Any amount of Gawker functioning right now is a gift. |
|
17:59
🔗
|
SketchCow |
Or any of the properties. |
|
17:59
🔗
|
SketchCow |
And when Univision steps in, it's going to be a bloodbath |
|
18:00
🔗
|
SketchCow |
The Univision buy is so insane I'm assuming it's some corrupt reason we don't understand |
|
18:00
🔗
|
SketchCow |
Or Denton invented some snow-job that Univision bought |
|
18:59
🔗
|
|
bzc6p has joined #archiveteam-bs |
|
18:59
🔗
|
|
swebb sets mode: +o bzc6p |
|
19:00
🔗
|
bzc6p |
ErkDog: you reported that yahoo answers items get stuck. Do you use the new wget-lua? |
|
19:01
🔗
|
bzc6p |
--version |
|
19:01
🔗
|
bzc6p |
there is a new one from 20160530 |
|
19:19
🔗
|
|
VerifiedJ has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
|
19:52
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
|
19:58
🔗
|
|
BartoCH has joined #archiveteam-bs |
|
21:04
🔗
|
|
HCross2 has quit IRC (Quit: Connection closed for inactivity) |
|
21:12
🔗
|
hook54321 |
SketchCow: Is there a minimum education background requirement (other than experience) for jobs at the internet archive? |
|
21:13
🔗
|
xmc |
you should apply |
|
21:13
🔗
|
xmc |
http://archive.org/about/jobs.php |
|
21:19
🔗
|
hook54321 |
I'm not located in California unfortunately. I would consider applying for many of them if I had more programming experience. |
|
21:19
🔗
|
xmc |
then why are you asking? |
|
21:29
🔗
|
hook54321 |
Wondering for possible jobs in the future, and not all of say that on-site presence is required. |
|
21:31
🔗
|
hook54321 |
*all of them |
|
21:40
🔗
|
|
Honno has quit IRC (Read error: Operation timed out) |
|
22:07
🔗
|
|
schbirid2 has joined #archiveteam-bs |
|
22:10
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
|
22:13
🔗
|
|
whydomain has joined #archiveteam-bs |
|
22:15
🔗
|
whydomain |
PurpleSym: what design DIY book scanner did you make? (I'm considering https://linearbookscanner.org/ ) |
|
22:25
🔗
|
* |
FalconK looks around |
|
22:25
🔗
|
FalconK |
hey, look! https://archive.org/details/cbcnews201607-201608 |
|
22:26
🔗
|
xmc |
:) |
|
22:26
🔗
|
xmc |
that's a lot of hourly news |
|
22:33
🔗
|
|
RichardG has joined #archiveteam-bs |
|
22:40
🔗
|
FalconK |
I've got a cronjob pulling it down every hour |
|
22:42
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
|
22:45
🔗
|
|
schbirid2 has joined #archiveteam-bs |
|
22:46
🔗
|
FalconK |
I started doing it mostly because I thought it provided an interesting perspective on Trump, and I noticed CBC didn't keep a public archive of them |
|
22:46
🔗
|
xmc |
huh |
|
22:46
🔗
|
FalconK |
they do *have* an archive of them |
|
22:47
🔗
|
FalconK |
not sure how to access it. probably in person. |
|
22:47
🔗
|
xmc |
:| |
|
23:09
🔗
|
|
whydomain has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
|
23:16
🔗
|
|
JW_work1 has joined #archiveteam-bs |
|
23:18
🔗
|
|
JW_work has quit IRC (Read error: Operation timed out) |
|
23:23
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
|
23:38
🔗
|
|
rchrch has joined #archiveteam-bs |
|
23:45
🔗
|
|
kristian_ has joined #archiveteam-bs |
|
23:48
🔗
|
|
RichardG has joined #archiveteam-bs |
|
23:56
🔗
|
|
Stiletto has quit IRC (Ping timeout: 246 seconds) |