#archiveteam-bs 2019-01-29,Tue

Time Nickname Message
00:56 πŸ”— VerfiedJ has quit IRC (Quit: Leaving)
01:04 πŸ”— benjinsmi has quit IRC (Leaving)
01:22 πŸ”— Hani has quit IRC (Read error: Connection reset by peer)
01:22 πŸ”— Hani has joined #archiveteam-bs
01:50 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
01:53 πŸ”— Sk1d has joined #archiveteam-bs
02:17 πŸ”— icedice has quit IRC (Quit: Leaving)
02:19 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
02:21 πŸ”— Sk1d has joined #archiveteam-bs
02:38 πŸ”— ubahn_ has joined #archiveteam-bs
02:45 πŸ”— ubahn has quit IRC (Ping timeout: 615 seconds)
02:52 πŸ”— Wizzito has joined #archiveteam-bs
02:52 πŸ”— Wizzito https://archive.org/details/WiiShopChannelBackup Y'all might appreciate this
02:52 πŸ”— Wizzito Saw it and thought of archiveteam
03:02 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
03:04 πŸ”— ubahn_ has quit IRC (Quit: ubahn_)
03:06 πŸ”— Sk1d has joined #archiveteam-bs
03:51 πŸ”— newbie81 has joined #archiveteam-bs
03:52 πŸ”— newbie81 Hey all, can anyone recommend a good way of deduplicating warc files? I'm scraping a twitter feed that tends to delete lots of stuff, but my hourly cronjob is producing a lot of duplicate data
03:52 πŸ”— newbie81 is now known as jianaran
04:04 πŸ”— Wizzito has quit IRC (Quit: Leaving)
04:15 πŸ”— benjins has joined #archiveteam-bs
04:20 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
04:23 πŸ”— Sk1d has joined #archiveteam-bs
04:33 πŸ”— qw3rty117 has joined #archiveteam-bs
04:39 πŸ”— odemgi_ has joined #archiveteam-bs
04:39 πŸ”— qw3rty116 has quit IRC (Read error: Operation timed out)
04:42 πŸ”— odemgi has quit IRC (Ping timeout: 252 seconds)
04:42 πŸ”— odemg has quit IRC (Ping timeout: 265 seconds)
04:55 πŸ”— odemg has joined #archiveteam-bs
06:11 πŸ”— newbie45 has joined #archiveteam-bs
06:14 πŸ”— jianaran has quit IRC (Ping timeout: 268 seconds)
06:15 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
06:19 πŸ”— Sk1d has joined #archiveteam-bs
06:39 πŸ”— chimyatta has joined #archiveteam-bs
06:39 πŸ”— newbie45 has quit IRC (Ping timeout: 268 seconds)
06:40 πŸ”— astrid has quit IRC (Read error: Operation timed out)
06:47 πŸ”— astrid has joined #archiveteam-bs
06:48 πŸ”— svchfoo3 sets mode: +o astrid
07:07 πŸ”— wyatt8740 has joined #archiveteam-bs
07:17 πŸ”— wyatt8740 has quit IRC (Ping timeout: 255 seconds)
07:18 πŸ”— wyatt8740 has joined #archiveteam-bs
07:23 πŸ”— wyatt8740 has quit IRC (Read error: Operation timed out)
07:33 πŸ”— Exairnous has quit IRC (Ping timeout: 246 seconds)
07:43 πŸ”— VADemon has quit IRC (Read error: Connection reset by peer)
08:05 πŸ”— jodizzle jianaran: Someone correct me if I'm wrong, but I believe there's usually a digest in the WARC header that can be used to dedupe.
08:06 πŸ”— jodizzle WARC-Payload-Digest
08:07 πŸ”— jodizzle In some setups it's used to write WARC revisit records
08:16 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
08:19 πŸ”— Sk1d has joined #archiveteam-bs
08:36 πŸ”— wp494_ has joined #archiveteam-bs
08:40 πŸ”— wp494 has quit IRC (Read error: Operation timed out)
08:55 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
08:56 πŸ”— BlueMax has joined #archiveteam-bs
09:00 πŸ”— Coderjo_ has joined #archiveteam-bs
09:05 πŸ”— Coderjo has quit IRC (Ping timeout: 615 seconds)
09:51 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
09:55 πŸ”— Sk1d has joined #archiveteam-bs
10:14 πŸ”— BlueMax has quit IRC (Quit: Leaving)
10:29 πŸ”— atomicthu has quit IRC (No Ping reply in 180 seconds.)
10:35 πŸ”— atomicthu has joined #archiveteam-bs
11:10 πŸ”— m007a83_ has joined #archiveteam-bs
11:11 πŸ”— m007a83 has quit IRC (Read error: Operation timed out)
11:12 πŸ”— Hani111 has joined #archiveteam-bs
11:14 πŸ”— Hani has quit IRC (Read error: Operation timed out)
11:14 πŸ”— Hani111 is now known as Hani
11:46 πŸ”— Hani111 has joined #archiveteam-bs
11:48 πŸ”— Hani has quit IRC (Read error: Operation timed out)
11:57 πŸ”— Hani111 has quit IRC (Ping timeout: 615 seconds)
13:04 πŸ”— Mateon1 has quit IRC (Read error: Operation timed out)
13:04 πŸ”— Mateon1 has joined #archiveteam-bs
13:13 πŸ”— VerfiedJ has joined #archiveteam-bs
13:21 πŸ”— JAA Yeah, you'll want to dedupe based on the digest, but I wouldn't recommend implementing that yourself unless you're very familiar with the WARC spec. There should be code for this in some of our projects, but I'm not sure how reusable it is.
13:22 πŸ”— VADemon has joined #archiveteam-bs
13:23 πŸ”— PurpleSym crocoite has a tool (now with tests :)) which does that: https://github.com/PromyLOPh/crocoite/blob/master/crocoite/tools.py#L36
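The digest-based approach suggested above can be sketched roughly as follows. This is a minimal illustration, not the crocoite tool or any Archive Team code: it assumes the `sha1:<base32>` form commonly seen in `WARC-Payload-Digest`, and the function names `payload_digest` and `plan_records` are hypothetical. It only decides which captures would be written in full and which could become revisit records; writing valid WARC records is left to a real WARC library.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Digest in the 'sha1:<base32>' form commonly used for WARC-Payload-Digest."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def plan_records(captures):
    """Decide, per capture, whether to store the payload or only a revisit.

    captures: iterable of (uri, date, payload_bytes), in crawl order.
    Returns a list of dicts; 'refers_to' is None for first-seen payloads and
    the (uri, date) of the first capture for duplicates.
    """
    first_seen = {}  # digest -> (uri, date) of the first capture with that payload
    plan = []
    for uri, date, payload in captures:
        digest = payload_digest(payload)
        original = first_seen.get(digest)
        if original is None:
            first_seen[digest] = (uri, date)
            kind = "response"
        else:
            kind = "revisit"
        plan.append({"kind": kind, "uri": uri, "date": date,
                     "digest": digest, "refers_to": original})
    return plan
```

For an hourly cron job, the `first_seen` mapping would have to be persisted between runs (for example in a small database) for duplicates across runs to be detected; that part is not shown here.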
13:36 πŸ”— ubahn has joined #archiveteam-bs
14:32 πŸ”— astrid oOoOo
14:38 πŸ”— arkiver what is this crocoite?
14:38 πŸ”— arkiver I wouldn't use it for deduplication!
14:39 πŸ”— arkiver PurpleSym: astrid: ^
14:39 πŸ”— PurpleSym Can you elaborate?
14:39 πŸ”— PurpleSym crocoite is chromebot.
14:41 πŸ”— arkiver hmm
14:43 πŸ”— arkiver actually, it might be good enough, just looked at https://github.com/webrecorder/warcio/blob/master/warcio/warcwriter.py#L169-L184
14:43 πŸ”— arkiver initially saw a lot of WARC headers missing
14:43 πŸ”— arkiver Do we know what is happening with the WARC-Block-Digest?
14:43 πŸ”— arkiver or was that never saved in the records initially
14:44 πŸ”— PurpleSym I think warcio takes care of that.
14:44 πŸ”— arkiver hmm
14:44 πŸ”— arkiver WARC records generated by crocoite therefore are an abstract view on the resource they represent and not necessarily the data sent over the wire. A URL fetched with HTTP/2 for example will still result in a HTTP/1.1 request/response pair in the WARC file. This may be undesirable from an archivist’s point of view (“save the data exactly like we received it”). But this level of abstraction is inevitable when dealing with more than one protocol.
14:44 πŸ”— arkiver I think this came up previously
14:45 πŸ”— arkiver does this stuff go into the Wayback Machine?
14:45 πŸ”— PurpleSym I think so, https://archive.org/details/archiveteam_chromebot
14:48 πŸ”— arkiver Do we know if brozzler has the same problem?
14:48 πŸ”— arkiver Brozzler uses chrome too I think, but captures content differently
14:48 πŸ”— PurpleSym Afaik brozzler uses a proxy.
14:49 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
14:49 πŸ”— PurpleSym So, no, it does not have this “problem”.
14:51 πŸ”— arkiver Also regarding televisiontunes, I asked, will know soon
14:51 πŸ”— PurpleSym Thanks.
14:51 πŸ”— arkiver What is the reason we're currently using crocoite and not chromebot?
14:52 πŸ”— PurpleSym brozzler, you mean?
14:52 πŸ”— arkiver err yeah
14:52 πŸ”— arkiver crocoite and not brozzler
14:52 πŸ”— * arkiver was distracted
14:52 πŸ”— Sk1d has joined #archiveteam-bs
14:53 πŸ”— PurpleSym I didn’t know about brozzler when I started this project.
14:54 πŸ”— PurpleSym And since nobody’s spent some time setting up a brozzler IRC bot, well, …
14:54 πŸ”— arkiver I see
14:54 πŸ”— PurpleSym Also I believe the DOM snapshot/page screenshot crocoite creates are important.
14:55 πŸ”— PurpleSym (I haven’t seen any other software doing that yet.)
15:03 πŸ”— arkiver right, ok
15:03 πŸ”— arkiver thanks
15:07 πŸ”— PurpleSym arkiver: Do you have any thoughts on how to push IC datasheets to IA? t3’s been throwing vendor websites into archivebot, but ultimately we need them as items on IA I believe.
15:08 πŸ”— arkiver yes, definitely want them on IA
15:08 πŸ”— arkiver don't we already have a datasheets collection on IA?
15:09 πŸ”— arkiver https://archive.org/details/ic_datasheets
15:10 πŸ”— arkiver we probably want to add them to that collection
15:10 πŸ”— PurpleSym That’s what I was thinking. What about metadata? The description field is not very useful imo.
15:11 πŸ”— PurpleSym Like: Vendor, markings, revision, …
15:14 πŸ”— arkiver the description field on IA?
15:14 πŸ”— arkiver I'd try to add as much as possible as metadata fields
15:15 πŸ”— PurpleSym Are they indexed/searchable?
15:17 πŸ”— arkiver like contributor, creator, language, source, publisher, licenseurl, etc.
15:17 πŸ”— arkiver basically just as much as possible from https://archive.org/services/docs/api/metadata-schema/index.html
15:17 πŸ”— arkiver PurpleSym: I think so, but if not we can always add stuff to the description later
15:17 πŸ”— arkiver source would be the direct URL to the PDF
15:18 πŸ”— arkiver and make sure it's in the Wayback Machine, but most has already gone through archivebot as you said so that should be good
15:19 πŸ”— PurpleSym Thanks for the link, will look into that and see how we can map available metadata onto that schema.
15:19 πŸ”— arkiver awesome
15:19 πŸ”— arkiver and if anything doesn't fit, please let me know
15:19 πŸ”— arkiver to create custom fields
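The metadata mapping discussed above could be sketched like this. The helper `ia_metadata_args` is hypothetical; it only builds `--metadata=key:value` arguments in the same shape as the `--metadata=mediatype:web` flag used with `ia upload` elsewhere in this log. Repeating the flag for multi-valued fields is an assumption, and the example URL and field values are made up for illustration.

```python
def ia_metadata_args(metadata):
    """Turn a metadata dict into a list of --metadata=key:value arguments.

    List or tuple values produce one argument per value (assumed behaviour
    for multi-valued fields such as subject).
    """
    args = []
    for key, value in metadata.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            args.append(f"--metadata={key}:{item}")
    return args


# Hypothetical datasheet item: 'source' is the direct URL to the PDF,
# as suggested in the discussion; all values here are placeholders.
example = {
    "collection": "ic_datasheets",
    "mediatype": "texts",
    "source": "https://example.com/ds/part123.pdf",
    "subject": ["datasheet", "semiconductor"],
}
print(ia_metadata_args(example))
```

Vendor, markings and revision have no obvious standard field, so they would need either a custom field (to be agreed on as mentioned above) or a place in the description.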
15:24 πŸ”— icedice has joined #archiveteam-bs
15:27 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
15:30 πŸ”— Sk1d has joined #archiveteam-bs
15:32 πŸ”— ubahn has quit IRC (Quit: ubahn)
15:37 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
15:40 πŸ”— Sk1d has joined #archiveteam-bs
15:53 πŸ”— Oddly has joined #archiveteam-bs
15:57 πŸ”— t3 Hey.
15:58 πŸ”— t3 arkiver: I made an account, about a month ago, to upload WARCs of the datasheets, but I think I am being throttled because I get "503 Slow Down" messages.
15:58 πŸ”— t3 I'm not exactly sure what to do.
16:00 πŸ”— t3 Currently, the easiest solution is to make WARCs of the websites that have the datasheets and then upload them to IA.
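One common way to cope with "503 Slow Down" responses is to retry with increasing delays rather than failing the whole batch. The sketch below is only an illustration of that idea: `retry_on_slow_down` and its parameters are hypothetical, `attempt_upload` stands in for whatever actually performs the upload and returns an HTTP status code, and nothing here reflects a documented IA policy on throttling.

```python
import time


def retry_on_slow_down(attempt_upload, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call attempt_upload() until it returns something other than 503.

    Waits base_delay seconds after the first 503 and doubles the wait after
    each further 503. Returns the last status code seen.
    """
    delay = base_delay
    status = None
    for attempt in range(max_tries):
        status = attempt_upload()
        if status != 503:
            return status
        if attempt < max_tries - 1:
            sleep(delay)
            delay *= 2
    return status
```

The `sleep` parameter is injectable so the logic can be exercised without actually waiting.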
16:01 πŸ”— arkiver t3: you can use archivebot?
16:01 πŸ”— t3 arkiver: I will take a look at your link.
16:02 πŸ”— t3 arkiver: Well, the issue is that ArchiveBot is usually busy, and I don't want to look bad by over-using it.
16:03 πŸ”— t3 Quite honestly, it would be nice to be an ops on that channel. I'm trying my best not to hoard resources.
16:03 πŸ”— arkiver trc: please ping me if you read this
16:03 πŸ”— arkiver t3: you can everything now too right?
16:04 πŸ”— t3 arkiver: What is trc?
16:04 πŸ”— arkiver so we don't usually add WARCs to the Wayback Machine that are uploaded by any users
16:04 πŸ”— arkiver that way it would be too easy for others to get bad data in the Wayback Machine
16:04 πŸ”— arkiver trc is/was someone here
16:05 πŸ”— t3 arkiver: Yeah, but I'm basically doing bulk uploads of WARCs using grab-site.
16:06 πŸ”— arkiver where
16:06 πŸ”— t3 I'm hoping to be whitelisted so that the WARCs can be ingested into Wayback Machine.
16:06 πŸ”— t3 But I have over 300 GB of pending uploads, because I'm being throttled.
16:07 πŸ”— arkiver what are the items, what is your account
16:07 πŸ”— t3 arkiver: https://archive.org/details/@warchiver
16:07 πŸ”— arkiver I can't promise anything about getting into the Wayback Machine, because of the reason I previously mentioned
16:07 πŸ”— t3 Yes, understood.
16:09 πŸ”— t3 I also have WARCs for websites that are not covered on the Wayback Machine.
16:11 πŸ”— astrid has left ][
16:18 πŸ”— t3 The items? Oh, well I have WARCs, basically. There are a lot of WARCs. More than 10 semiconductor websites, with more to come from the wiki page I've been working on (https://www.archiveteam.org/index.php?title=IC_datasheets). I also have many WARCs of corporate websites.
16:20 πŸ”— t3 I also have WARCs of under-covered blogs and other sites that are still not on Wayback Machine.
16:22 πŸ”— chimyatta has quit IRC (Quit: quitting)
16:24 πŸ”— t3 I currently have 205 WARCs queued for uploading.
16:29 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
16:31 πŸ”— Sk1d has joined #archiveteam-bs
16:36 πŸ”— Hani111 has joined #archiveteam-bs
16:36 πŸ”— Hani111 is now known as Hani
16:50 πŸ”— Oddly has quit IRC (Ping timeout: 255 seconds)
16:54 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
16:57 πŸ”— Sk1d has joined #archiveteam-bs
17:10 πŸ”— chimyatta has joined #archiveteam-bs
17:34 πŸ”— wp494 has joined #archiveteam-bs
17:37 πŸ”— wp494_ has quit IRC (Ping timeout: 268 seconds)
17:40 πŸ”— Despatche has joined #archiveteam-bs
18:01 πŸ”— ubahn has joined #archiveteam-bs
18:10 πŸ”— m007a83_ is now known as m007a83
18:11 πŸ”— m007a83 has quit IRC (Quit: Fuck you Comcast)
18:11 πŸ”— m007a83 has joined #archiveteam-bs
18:24 πŸ”— Fusl is anyone here able and willing to dump the warc from here into the WBM? https://archive.org/details/www.opennicproject.org-2019-01-29-43f55a63
18:26 πŸ”— schbirid has joined #archiveteam-bs
18:34 πŸ”— Hani111 has joined #archiveteam-bs
18:34 πŸ”— Hani has quit IRC (Read error: Connection reset by peer)
18:35 πŸ”— Hani111 is now known as Hani
18:39 πŸ”— t3 Fusl: I'm just curious. How did you set up your upload script?
18:43 πŸ”— ubahn has quit IRC (Quit: ubahn)
18:48 πŸ”— arkiver Fusl: you don't have to upload a CDX file, btw
18:48 πŸ”— Fusl i don't?
18:48 πŸ”— arkiver no
18:48 πŸ”— arkiver IA derives the WARC files and creates its own CDX file
18:48 πŸ”— arkiver uploading only the WARC file is enough
18:48 πŸ”— Fusl what about -meta.warc.gz?
18:49 πŸ”— arkiver yes, that one contains valuable information
18:49 πŸ”— arkiver upload all WARCs, no CDXs, the CDX can be directly derived from the WARC and holds no special information
18:52 πŸ”— JAA arkiver: Curious, what does IA do when you do upload a CDX? Is it moved to a different filename or just deleted entirely?
18:52 πŸ”— arkiver not sure about overwriting
18:53 πŸ”— arkiver in this case a .cdx file was uploaded and IA will just ignore it
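To illustrate why a CDX "holds no special information", here is a rough sketch that walks the records of an uncompressed WARC byte string and builds a simple index from the record headers alone. It is not the real CDX format and not IA's derive process: real WARCs are usually gzip-compressed per record and a CDX carries more fields (HTTP status, MIME type, and so on). The names `index_warc` and `make_record` are hypothetical, and `make_record` exists only to build test input.

```python
def make_record(warc_type, uri, block):
    """Build a minimal uncompressed WARC record (illustration only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "WARC-Date: 2019-01-29T18:48:00Z\r\n"
        f"Content-Length: {len(block)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + block + b"\r\n\r\n"


def index_warc(data):
    """Return one index entry (dict) per WARC record found in data."""
    entries = []
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)  # end of the record header
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        if not lines[0].startswith("WARC/"):
            raise ValueError(f"no WARC record at offset {pos}")
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        entries.append({
            "offset": pos,
            "type": headers.get("warc-type"),
            "uri": headers.get("warc-target-uri"),
            "date": headers.get("warc-date"),
        })
        # Skip the header terminator, the block, and the trailing CRLF CRLF.
        pos = head_end + 4 + length + 4
    return entries
```

Everything in the index comes straight from the WARC itself, which is why uploading only the WARC files is sufficient.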
18:54 πŸ”— Fusl t3:
18:54 πŸ”— Fusl id="${1}"
18:54 πŸ”— Fusl test -d "/data/${id}" || { echo "grab-site folder with ID ${id} doesn't exist"; exit 1; }
18:54 πŸ”— Fusl docker run -ti --rm --log-driver=none -v /data/:/data/:ro ateam/ia:latest upload "${id}" "/data/${id}/${id}-"*".warc.gz" --metadata=mediatype:web
18:55 πŸ”— Fusl with #!/bin/bash in the first line, just didn't copy that in here
18:56 πŸ”— arkiver PurpleSym: green light for televisiontunes. please make sure to get the metadata right :)
18:56 πŸ”— Fusl grab-site also downloads everything into /data so there's that
18:56 πŸ”— t3 Fusl: Thanks. I haven't used Docker to run grab-site before.
18:57 πŸ”— Fusl here's my grab-site Dockerfile: http://xor.meo.ws/puJtxPYKJZxVxLL1PLoNVTF68k2wVLq7.txt
18:57 πŸ”— PurpleSym arkiver: Unfortunately there’s not much metadata available. All I have right now is a title and the original file URL(s).
18:58 πŸ”— arkiver ok
18:58 πŸ”— Fusl run the grab-site web dashboard server with `docker run -p 29000:29000 --entrypoint gs-server ateam/grab-site:latest` and the actual grab-site with `docker logs -f -t $(docker run --rm -d -e GRAB_SITE_HOST=172.17.0.1 --name "grab-site_$(cat /proc/sys/kernel/random/uuid)" -v /data:/data:rw ateam/grab-site:latest --igon --import-ignores /data/ignores --no-offsite-links "${@}" | tee /dev/stderr)` and you're good to go
18:59 πŸ”— arkiver please set the direct MP3 URL as source and add tags if available, for example http://televisiontunes.com/Olympics_-_1964_-_Tokyo_Melody.html has Sports and Olympics
19:00 πŸ”— arkiver PurpleSym: ^
19:00 πŸ”— PurpleSym Oh, I must’ve missed the tags while extracting metadata. I’ll fix this.
19:00 πŸ”— arkiver yeah, looks like some don't have tags
19:00 πŸ”— PurpleSym Can you create a collection before I start uploading?
19:00 πŸ”— t3 Fusl: Thanks! I have been doing everything without Docker.
19:00 πŸ”— arkiver PurpleSym: what is your email?
19:00 πŸ”— arkiver from your account
19:01 πŸ”— t3 Fusl: To prevent cluttering this channel, can I private message you?
19:01 πŸ”— Fusl sure
19:01 πŸ”— PurpleSym arkiver: lars+archive@6xq.net
19:02 πŸ”— arkiver thanks
19:20 πŸ”— ubahn has joined #archiveteam-bs
19:46 πŸ”— Oddly has joined #archiveteam-bs
19:47 πŸ”— arkiver PurpleSym: https://archive.org/details/tvtunes you have access
20:06 πŸ”— Iridium has quit IRC (no.money.no.love)
20:09 πŸ”— Iridium has joined #archiveteam-bs
20:37 πŸ”— Exairnous has joined #archiveteam-bs
21:12 πŸ”— BlueMax has joined #archiveteam-bs
21:18 πŸ”— schbirid has quit IRC (Remote host closed the connection)
21:24 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
21:26 πŸ”— Sk1d has joined #archiveteam-bs
22:18 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
22:21 πŸ”— Sk1d has joined #archiveteam-bs
22:34 πŸ”— gandalf has joined #archiveteam-bs
23:03 πŸ”— MR9K has quit IRC (tilde lounge - https://irc.tilde.team)
23:03 πŸ”— MR9K has joined #archiveteam-bs
23:06 πŸ”— ubahn has quit IRC (Quit: ubahn)
23:14 πŸ”— ndiddy has joined #archiveteam-bs
23:16 πŸ”— tomaspark has quit IRC (Read error: Connection reset by peer)
23:55 πŸ”— Oddly has quit IRC (Ping timeout: 255 seconds)
