Time |
Nickname |
Message |
00:21
🔗
|
|
kristian_ has quit IRC (Leaving) |
01:13
🔗
|
aschmitz |
Anyone have experience archiving Disqus fora? I count 55 for NPR after dropping those with "dev" or "stage" in the name. |
01:24
🔗
|
r3c0d3x |
aschmitz: I was actually looking into this a bit already and I'll write up a GitHub gist about it in a minute. One note to preface all this: the comments will stay on Disqus for quite a while longer after NPR removes the embeds from their site. We probably don't need to rush on this. |
01:26
🔗
|
aschmitz |
Yeah, it looked like that when I was digging into it a bit. |
01:34
🔗
|
r3c0d3x |
aschmitz: Quickly threw this together, it has all the info I was able to gather: https://gist.github.com/r3c0d3x/ff33ff59bd2432a5a81a32669eb5a390 |
01:47
🔗
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
01:47
🔗
|
|
HCross has joined #archiveteam-bs |
01:59
🔗
|
aschmitz |
r3c0d3x: Cool, thanks. Added a bit in my fork: https://gist.github.com/aschmitz/19dfb67be5d0d71c74431074191062dc |
02:10
🔗
|
|
tomwsmf has quit IRC (Read error: Operation timed out) |
02:26
🔗
|
|
mr-b has left |
03:14
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
03:15
🔗
|
|
BartoCH has joined #archiveteam-bs |
04:09
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
04:12
🔗
|
|
BartoCH has joined #archiveteam-bs |
04:17
🔗
|
|
JesseW has joined #archiveteam-bs |
04:17
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
04:24
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:35
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
04:35
🔗
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
04:35
🔗
|
|
HCross has joined #archiveteam-bs |
04:43
🔗
|
|
DFJustin has quit IRC (Ping timeout: 260 seconds) |
04:43
🔗
|
|
Meroje has quit IRC (Quit: bye!) |
04:44
🔗
|
|
Meroje has joined #archiveteam-bs |
04:53
🔗
|
|
DFJustin has joined #archiveteam-bs |
04:53
🔗
|
|
swebb sets mode: +o DFJustin |
05:05
🔗
|
|
DFJustin has quit IRC (Remote host closed the connection) |
05:10
🔗
|
|
DFJustin has joined #archiveteam-bs |
05:15
🔗
|
|
HCross has quit IRC (Read error: Operation timed out) |
05:15
🔗
|
|
HCross has joined #archiveteam-bs |
05:57
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
05:58
🔗
|
|
phuzion has joined #archiveteam-bs |
05:59
🔗
|
SketchCow |
Intense Floppy Grabs continue |
05:59
🔗
|
JesseW |
That sounds like some kind of sex toy |
06:00
🔗
|
JesseW |
Buy "Intense Floppy Grabs" today for deep, sensual pleasure! |
06:05
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
06:05
🔗
|
|
sep332 has quit IRC (Read error: Operation timed out) |
06:05
🔗
|
|
midas1 has quit IRC (Read error: Operation timed out) |
06:07
🔗
|
godane |
just know there are sex toys that are sending data back to the company |
06:07
🔗
|
godane |
also this: http://www.dailydot.com/layer8/hackers-and-vibrators-oh-my/ |
06:07
🔗
|
|
midas1 has joined #archiveteam-bs |
06:07
🔗
|
|
sep332 has joined #archiveteam-bs |
06:09
🔗
|
JesseW |
thank you for that |
06:10
🔗
|
godane |
you're welcome |
06:11
🔗
|
godane |
i remember reading something about that and couldn't find the exact article |
06:11
🔗
|
godane |
but that was close enough to it |
06:13
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
06:13
🔗
|
|
phuzion has joined #archiveteam-bs |
06:23
🔗
|
godane |
turns out the sploid.gizmodo.com sitemaps were big |
06:23
🔗
|
godane |
i think about 10gb for all of it |
06:24
🔗
|
godane |
maybe it's 9gb |
06:24
🔗
|
godane |
but it's still big |
07:03
🔗
|
|
Honno has joined #archiveteam-bs |
07:11
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
07:31
🔗
|
|
REiN^ has quit IRC (Read error: Connection reset by peer) |
07:33
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
07:36
🔗
|
|
phuzion has joined #archiveteam-bs |
07:52
🔗
|
|
schbirid has joined #archiveteam-bs |
07:53
🔗
|
fie__ |
SketchCow, floppy grabs what? |
08:03
🔗
|
Medowar |
...floppy disks? |
08:22
🔗
|
SketchCow |
Apple II floppies |
08:27
🔗
|
SketchCow |
Work on http://fos.textfiles.com/pipeline.html began |
08:27
🔗
|
SketchCow |
Lots to do |
08:33
🔗
|
godane |
SketchCow: turns out we don't have all of gawker.com |
08:33
🔗
|
SketchCow |
? |
08:33
🔗
|
godane |
or kotaku or lifehacker |
08:33
🔗
|
SketchCow |
Really. |
08:33
🔗
|
SketchCow |
Why? |
08:33
🔗
|
godane |
dump sitemap |
08:33
🔗
|
SketchCow |
Well, have they deleted it all now? |
08:33
🔗
|
godane |
http://kotaku.com/sitemap_bydate.xml?startTime=2008-11-01T00:00:00&endTime=2008-11-30T23:59:59 |
08:34
🔗
|
godane |
no they have not deleted it yet |
08:35
🔗
|
godane |
i have noticed that the sitemap by date acts weird |
08:35
🔗
|
godane |
but because i tested on maybe the 2005 or 2006 sitemaps, it looked like it had everything |
08:37
🔗
|
|
REiN^ has joined #archiveteam-bs |
08:41
🔗
|
godane |
ok the sitemap urls are funking with us |
08:41
🔗
|
godane |
kotaku.com for 2008-11 (one above) has 3034 urls |
08:42
🔗
|
godane |
but if you use gawker.com in its place you get 1971 |
08:43
🔗
|
godane |
so when i say the sitemap acts weird it does act weird |
08:45
🔗
|
godane |
SketchCow: also i think archivebot went after gawker.com and other sites owned by gawker back in 2014 or 2015 |
08:45
🔗
|
godane |
so my sitemap grab may just be incomplete |
08:48
🔗
|
godane |
even my sitemap grab of sploid.gizmodo.com is incomplete |
08:48
🔗
|
godane |
:'( |
08:48
🔗
|
SketchCow |
Dust yourself off and go for it again |
08:49
🔗
|
godane |
i'm doing that now |
08:52
🔗
|
SketchCow |
I'm watching classic movies and ripping Apple II disks, and both are going swimmingly. |
08:52
🔗
|
godane |
curl -s 'http://gawker.com/sitemap_bydate.xml?startTime=2008-11-01T00:00:00&endTime=2008-11-30T23:59:59' | sed 's|><|>\n<|g' | grep 'http' | sed 's|.*http://|http://|g' | sed 's|.*https://|http://|g' | sed "s|</image:loc>||g" | sed 's|]]>||g' |
08:52
🔗
|
godane |
that's my code for grabbing the urls |
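A hedged sketch of the same idea as godane's one-liner above. Rather than splitting on `><` and grepping for `http`, it keeps only the contents of `<loc>` and `<image:loc>` elements, so the `xmlns` namespace URL in the sitemap header is not picked up. The function name `extract_sitemap_urls` is hypothetical, and unlike the original it leaves `https://` URLs as they are. It assumes a `grep` that supports `-o` and a `sed` that supports `-E` (GNU and BSD both do).

```shell
#!/bin/sh
# Hypothetical helper (not godane's script): read sitemap XML on stdin and
# print each distinct URL found in <loc> or <image:loc>, one per line.
extract_sitemap_urls() {
    sed -e 's|<!\[CDATA\[||g' -e 's|]]>||g' |            # drop CDATA wrappers
    grep -oE '<(image:)?loc>[^<]+' |                      # keep "<loc>URL" fragments only
    sed -E -e 's|^<(image:)?loc>||' -e 's|&amp;|\&|g' |   # strip the tag, decode &amp;
    sort -u                                               # de-duplicate
}

# Usage (URL taken from the log):
#   curl -s 'http://gawker.com/sitemap_bydate.xml?startTime=2008-11-01T00:00:00&endTime=2008-11-30T23:59:59' \
#       | extract_sitemap_urls > gawker-2008-11.txt
```

Counting lines in the resulting file (`wc -l`) gives a per-host figure that can be compared in the way godane compares kotaku.com and gawker.com.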
08:54
🔗
|
godane |
after 2006-01 is done i will try to set up my script to attack each month of 2006 for gawker |
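One way the per-month run godane describes could be driven, as a rough sketch: generate one URL per month in the `sitemap_bydate.xml?startTime=...&endTime=...` pattern quoted earlier in the log. The function names `days_in_month` and `monthly_sitemap_urls` are hypothetical, and the sketch assumes the endpoint accepts any calendar-month range in that form, which the log itself suggests is not fully reliable.

```shell
#!/bin/sh
# Hypothetical helper: print the number of days in month $2 of year $1.
days_in_month() {
    y=$1
    m=$2
    case $m in
        4|6|9|11) echo 30 ;;
        2)
            # Gregorian leap-year rule
            if [ $((y % 4)) -eq 0 ] && { [ $((y % 100)) -ne 0 ] || [ $((y % 400)) -eq 0 ]; }; then
                echo 29
            else
                echo 28
            fi
            ;;
        *) echo 31 ;;
    esac
}

# Hypothetical helper: print twelve sitemap_bydate URLs for host $1, year $2,
# one per calendar month, following the URL pattern quoted in the log.
monthly_sitemap_urls() {
    host=$1
    year=$2
    month=1
    while [ "$month" -le 12 ]; do
        last=$(days_in_month "$year" "$month")
        printf 'http://%s/sitemap_bydate.xml?startTime=%s-%02d-01T00:00:00&endTime=%s-%02d-%02dT23:59:59\n' \
            "$host" "$year" "$month" "$year" "$month" "$last"
        month=$((month + 1))
    done
}

# Usage: monthly_sitemap_urls gawker.com 2006 > gawker-2006-sitemaps.txt
```

Each generated URL can then be fetched and passed through whatever URL-extraction step is in use, keeping one output file per month so that suspiciously small months are easy to spot and retry.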
09:02
🔗
|
|
BartoCH has joined #archiveteam-bs |
09:17
🔗
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
09:17
🔗
|
|
HCross has joined #archiveteam-bs |
09:34
🔗
|
|
GE has joined #archiveteam-bs |
09:34
🔗
|
|
HCross2 has quit IRC (Quit: Connection closed for inactivity) |
09:52
🔗
|
|
Selavi has quit IRC (Ping timeout: 260 seconds) |
09:53
🔗
|
|
Kksmkrn has joined #archiveteam-bs |
09:53
🔗
|
|
Kksmkrn has quit IRC (Connection closed) |
09:53
🔗
|
|
Kksmkrn has joined #archiveteam-bs |
10:00
🔗
|
|
Selavi has joined #archiveteam-bs |
10:09
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
10:14
🔗
|
luckcolor |
someone may want to go after this https://twitter.com/antisec_ita/status/767856654486503424 |
10:16
🔗
|
|
BartoCH has joined #archiveteam-bs |
10:23
🔗
|
|
divingk has quit IRC (ChatZilla 0.9.92 [Firefox 47.0/20160604131506]) |
10:31
🔗
|
|
Kksmkrn has quit IRC (Quit: leaving) |
11:35
🔗
|
|
HCross has quit IRC (Ping timeout: 246 seconds) |
11:35
🔗
|
|
HCross has joined #archiveteam-bs |
11:44
🔗
|
atrocity |
https://www.reddit.com/r/Minecraft/comments/4z36un/mojangs_official_youtube_channel_was_suspended/ |
11:44
🔗
|
atrocity |
stay classy, youtube |
11:47
🔗
|
joepie91 |
lol. |
12:35
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
12:37
🔗
|
|
BartoCH has joined #archiveteam-bs |
12:57
🔗
|
|
GE_ has joined #archiveteam-bs |
12:59
🔗
|
|
GE has quit IRC (Ping timeout: 255 seconds) |
12:59
🔗
|
|
GE_ is now known as GE |
13:03
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:04
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
13:04
🔗
|
|
BartoCH has joined #archiveteam-bs |
13:27
🔗
|
|
beardicus has quit IRC (bye) |
13:28
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
13:31
🔗
|
|
beardicus has joined #archiveteam-bs |
13:35
🔗
|
|
beardicus has quit IRC (Client Quit) |
13:37
🔗
|
|
beardicus has joined #archiveteam-bs |
13:45
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:45
🔗
|
|
BartoCH has joined #archiveteam-bs |
13:46
🔗
|
|
wp494 has quit IRC (Read error: Operation timed out) |
13:47
🔗
|
|
dashcloud has joined #archiveteam-bs |
14:16
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
14:42
🔗
|
|
tomwsmf has joined #archiveteam-bs |
14:47
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
15:15
🔗
|
|
wp494 has joined #archiveteam-bs |
15:18
🔗
|
|
JesseW has joined #archiveteam-bs |
15:25
🔗
|
|
JesseW has quit IRC (Read error: Operation timed out) |
15:34
🔗
|
|
BartoCH has joined #archiveteam-bs |
15:56
🔗
|
|
VADemon has joined #archiveteam-bs |
16:03
🔗
|
|
GE has joined #archiveteam-bs |
16:14
🔗
|
SketchCow |
http://fos.textfiles.com/pipeline.html is OK but needs another run! Which it will get shortly. |
16:32
🔗
|
sep332 |
Does archive.org have a policy about which YouTube pages get saved? |
16:32
🔗
|
sep332 |
"Gangnam Style" was getting saved like 6x per day https://web.archive.org/web/*/https://www.youtube.com/watch?v=9bZkp7q19f0 |
16:33
🔗
|
sep332 |
(it doesn't seem to have video data or any comments though) |
16:33
🔗
|
SketchCow |
That's being worked on internally |
16:33
🔗
|
sep332 |
by "policy" I mean for auto-crawling |
16:33
🔗
|
sep332 |
ok |
16:42
🔗
|
|
HCross2 has joined #archiveteam-bs |
16:42
🔗
|
SketchCow |
The goal is in the future it will deduplicate these. |
16:42
🔗
|
arkiver |
SketchCow: awesome!! |
16:43
🔗
|
arkiver |
hmm |
16:43
🔗
|
sep332 |
Is the video data being saved and just not rendered correctly, or plain not collected? |
16:43
🔗
|
arkiver |
SketchCow: you can remove extratorrent from there |
16:44
🔗
|
SketchCow |
I'll do additional work after I finish the script. |
16:44
🔗
|
SketchCow |
A little time to go |
16:44
🔗
|
arkiver |
ok |
16:51
🔗
|
|
irl has joined #archiveteam-bs |
16:51
🔗
|
irl |
SketchCow: hi |
16:51
🔗
|
SketchCow |
Hiiiiiiiiii |
16:51
🔗
|
irl |
hiiiiiiiiiiiiiiiiiii |
16:51
🔗
|
xmc |
hi. |
16:52
🔗
|
irl |
SketchCow: i hear you like manuals |
16:52
🔗
|
SketchCow |
I do. |
16:52
🔗
|
irl |
cool |
16:52
🔗
|
SketchCow |
I heard you like scanning them |
16:52
🔗
|
irl |
i have manuals |
16:52
🔗
|
irl |
and a scanner coming |
16:52
🔗
|
irl |
in the ebay post |
16:52
🔗
|
SketchCow |
Try not to damage the originals too much and have a fantastic time. |
16:53
🔗
|
SketchCow |
Scan at 600dpi TIFF files, put into either .ZIPs or into directories. |
16:53
🔗
|
irl |
X.25 interface cards, network simulation software, and other things relevant to internet engineering |
16:53
🔗
|
SketchCow |
I can give you an FTP drop |
16:53
🔗
|
irl |
ok awesome |
16:53
🔗
|
irl |
so i don't go directly to IA? |
16:53
🔗
|
irl |
you'll help out with metadata maybe? |
16:53
🔗
|
xmc |
you do your own metadata |
16:53
🔗
|
xmc |
don't make SketchCow do it |
16:53
🔗
|
irl |
hehe |
16:53
🔗
|
xmc |
it's not that hard |
16:54
🔗
|
irl |
got a link for how to do metadata in a nice format? |
16:54
🔗
|
irl |
also, i have some reel-to-reel tapes and 8" floppies |
16:54
🔗
|
SketchCow |
I can give general information. |
16:54
🔗
|
xmc |
like, how to type in the title and date and author? |
16:54
🔗
|
SketchCow |
In a best case, it's: |
16:55
🔗
|
SketchCow |
Title, date of creation, creator (company or individual), and then a capsule description. |
16:55
🔗
|
irl |
that seems reasonable |
16:55
🔗
|
irl |
so i'm not understanding FTP drop then, because that sounds like i'm creating an IA collection |
17:00
🔗
|
|
tomaspark has quit IRC (Ping timeout: 255 seconds) |
17:03
🔗
|
HCross2 |
SketchCow: who should I contact if I need to change the payment method for an archive.org donation? |
17:12
🔗
|
SketchCow |
mail info@archive.org |
17:12
🔗
|
SketchCow |
irl: So there's two ways to upload |
17:12
🔗
|
SketchCow |
You can upload yourself, or you can build up a pile of directories and I can give you an FTP drop and I shove them in. |
17:12
🔗
|
SketchCow |
A collection can be made and you can work on it, but I can do that initial upload process using scripts. I find that helps for bulk uploaders. |
17:13
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
17:21
🔗
|
irl |
SketchCow: ah ok cool |
17:21
🔗
|
irl |
so how should the metadata be done within the pile of directories? |
17:22
🔗
|
irl |
is there some json or yaml or something format? |
17:24
🔗
|
SketchCow |
Whatever you're comfortable with, I can work with |
17:28
🔗
|
irl |
SketchCow: i could do a csv like http://internetarchive.readthedocs.io/en/latest/cli.html#modifying-metadata-in-bulk |
17:29
🔗
|
SketchCow |
Entirely up to you. However you want. |
17:29
🔗
|
irl |
ok, just trying to find the easiest way for you |
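A minimal sketch of what a per-item metadata CSV could look like, using the fields SketchCow lists above (title, date of creation, creator, capsule description). The column names and the `identifier` column are assumptions; the documentation irl links describes the layout the `ia` tool actually expects. The helper names `csv_field` and `csv_row` are hypothetical and the example row holds placeholder values, not a real item.

```shell
#!/bin/sh
# Hypothetical helper: quote one CSV field by wrapping it in double quotes
# and doubling any embedded double quotes.
csv_field() {
    printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# Hypothetical helper: print all arguments as one comma-separated CSV row.
csv_row() {
    sep=''
    for field in "$@"; do
        printf '%s' "$sep"
        csv_field "$field"
        sep=','
    done
    printf '\n'
}

# Usage (assumed column names, placeholder values):
#   {
#       csv_row identifier title date creator description
#       csv_row placeholder-identifier 'Placeholder Title' 'YYYY-MM-DD' 'Placeholder Creator' 'Capsule description goes here.'
#   } > metadata.csv
```

Quoting every field keeps commas and quotation marks in titles or descriptions from breaking the row.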
17:55
🔗
|
godane |
SketchCow: daily sitemaps of gawker.com are happening |
17:55
🔗
|
SketchCow |
Great |
17:56
🔗
|
godane |
i just hope the sitemap doesn't go crazy like before with the monthly ones |
17:59
🔗
|
SketchCow |
Any amount of Gawker functioning right now is a gift. |
17:59
🔗
|
SketchCow |
Or any of the properties. |
17:59
🔗
|
SketchCow |
And when Univision steps in, it's going to be a bloodbath |
18:00
🔗
|
SketchCow |
The Univision buy is so insane I'm assuming it's some corrupt reason we don't understand |
18:00
🔗
|
SketchCow |
Or Denton invented some snow-job that Univision bought |
18:59
🔗
|
|
bzc6p has joined #archiveteam-bs |
18:59
🔗
|
|
swebb sets mode: +o bzc6p |
19:00
🔗
|
bzc6p |
ErkDog: you reported that yahoo answers items get stuck. Do you use the new wget-lua? |
19:01
🔗
|
bzc6p |
--version |
19:01
🔗
|
bzc6p |
there is a new one from 20160530 |
19:19
🔗
|
|
VerifiedJ has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
19:52
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
19:58
🔗
|
|
BartoCH has joined #archiveteam-bs |
21:04
🔗
|
|
HCross2 has quit IRC (Quit: Connection closed for inactivity) |
21:12
🔗
|
hook54321 |
SketchCow: Is there a minimum education background requirement (other than experience) for jobs at the internet archive? |
21:13
🔗
|
xmc |
you should apply |
21:13
🔗
|
xmc |
http://archive.org/about/jobs.php |
21:19
🔗
|
hook54321 |
I'm not located in California unfortunately. I would consider applying for many of them if I had more programming experience. |
21:19
🔗
|
xmc |
then why are you asking? |
21:29
🔗
|
hook54321 |
Wondering for possible jobs in the future, and not all of say that on-site presence is required. |
21:31
🔗
|
hook54321 |
*all of them |
21:40
🔗
|
|
Honno has quit IRC (Read error: Operation timed out) |
22:07
🔗
|
|
schbirid2 has joined #archiveteam-bs |
22:10
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
22:13
🔗
|
|
whydomain has joined #archiveteam-bs |
22:15
🔗
|
whydomain |
PurpleSym: what design DIY book scanner did you make? (I'm considering https://linearbookscanner.org/ ) |
22:25
🔗
|
* |
FalconK looks around |
22:25
🔗
|
FalconK |
hey, look! https://archive.org/details/cbcnews201607-201608 |
22:26
🔗
|
xmc |
:) |
22:26
🔗
|
xmc |
that's a lot of hourly news |
22:33
🔗
|
|
RichardG has joined #archiveteam-bs |
22:40
🔗
|
FalconK |
I've got a cronjob pulling it down every hour |
22:42
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
22:45
🔗
|
|
schbirid2 has joined #archiveteam-bs |
22:46
🔗
|
FalconK |
I started doing it mostly because I thought it provided an interesting perspective on Trump, and I noticed CBC didn't keep a public archive of them |
22:46
🔗
|
xmc |
huh |
22:46
🔗
|
FalconK |
they do *have* an archive of them |
22:47
🔗
|
FalconK |
not sure how to access it. probably in person. |
22:47
🔗
|
xmc |
:| |
23:09
🔗
|
|
whydomain has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
23:16
🔗
|
|
JW_work1 has joined #archiveteam-bs |
23:18
🔗
|
|
JW_work has quit IRC (Read error: Operation timed out) |
23:23
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
23:38
🔗
|
|
rchrch has joined #archiveteam-bs |
23:45
🔗
|
|
kristian_ has joined #archiveteam-bs |
23:48
🔗
|
|
RichardG has joined #archiveteam-bs |
23:56
🔗
|
|
Stiletto has quit IRC (Ping timeout: 246 seconds) |